8 Reading and Listening Materials Development
9 Syllable Counter for Reading Passages
9.1 Overview
This tutorial walks you through building an automated syllable counter for text data using Python. The pipeline:
- Loads the CMU Pronouncing Dictionary (CMUdict) — a standard lexical resource that maps English words to their phonetic pronunciations in ARPABET notation.
- Reads raw .txt files from a folder, cleans them, and tokenizes them.
- Converts numeric tokens (e.g., 42) into their spoken word equivalents (FORTY-TWO).
- Counts syllables for each passage by combining two strategies:
- CMUdict-based counting — digit-counting on ARPABET strings for dictionary words.
- Rule-based counting — regex pattern matching for out-of-vocabulary (OOV) words.
This approach is particularly useful for readability analysis of SAT reading passages, where syllable count is a key component of formulas like the Flesch-Kincaid Grade Level.
9.2 Prerequisites
- Python 3.7+
- pandas, numpy, num2words (plus openpyxl, which pandas uses to write the final .xlsx file)
- The CMU Pronouncing Dictionary file (named cmudict-07b)
- A folder named input_text/ containing .txt files to analyze
9.3 Step 1 — Import and Build CMUdict
9.3.1 What is CMUdict?
The CMU Pronouncing Dictionary is a free, open-source pronunciation dictionary for North American English, maintained by Carnegie Mellon University. Each entry maps a word to its ARPABET phoneme sequence, where vowel phonemes are tagged with stress markers (digits 0, 1, 2). For example:
PYTHON P AY1 TH AH0 N
This word has two vowel phonemes (AY1, AH0), so it has 2 syllables. By counting the digits in an ARPABET string, we can count syllables reliably.
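The digit-counting rule is easy to check in isolation. A minimal sketch (the helper name arpabet_syllables and the second pronunciation string are illustrative, not part of the tutorial pipeline):

```python
def arpabet_syllables(pronunciation: str) -> int:
    # Every ARPABET vowel phoneme carries a stress digit (0, 1, or 2),
    # so the digit count equals the syllable count.
    return sum(ch.isdigit() for ch in pronunciation)

print(arpabet_syllables('P AY1 TH AH0 N'))              # 2 (PYTHON)
print(arpabet_syllables('D IH1 K SH AH0 N EH2 R IY0'))  # 4 (DICTIONARY)
```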
9.3.2 Loading CMUdict into a Python Dictionary
import pandas as pd
import numpy as np
cmu = pd.read_csv(
    'cmudict-07b',
    sep='  ',            # two spaces separate word and pronunciation
    engine='python',
    comment=';',         # ignore comment lines
    names=['word', 'cmu_pronunciation'],
    encoding='latin-1'
)
# Build dictionary
cmudict = cmu.set_index('word')['cmu_pronunciation'].to_dict()

Key points:
- The CMUdict file uses two spaces as a delimiter between the word and its pronunciation — note sep='  '.
- Lines beginning with ; are comments (metadata); comment=';' skips them automatically.
- encoding='latin-1' is required because the file contains non-ASCII characters in some entries.
- We convert the DataFrame into a plain Python dict (cmudict) for fast O(1) lookup during syllable counting.
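Once built, cmudict is an ordinary Python dict. A toy stand-in shows the lookup and membership tests the later steps rely on (the two entries here are a hand-picked subset, not the full file):

```python
# Toy stand-in for the ~130,000-entry dictionary built above
cmudict = {
    'PYTHON': 'P AY1 TH AH0 N',
    'FOX': 'F AA1 K S',
}

print(cmudict['PYTHON'])   # P AY1 TH AH0 N
print('XYZZY' in cmudict)  # False: such words go to the rule-based counter
```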
9.4 Step 2 — Load Input Text Files
9.4.1 Check Your Working Directory
Before reading files, confirm your current working directory so that relative paths resolve correctly:
import os
print(os.getcwd())

9.4.2 Read All .txt Files from a Folder
import os
import pandas as pd
folder_path = './input_text'
if not os.path.exists(folder_path):
    raise FileNotFoundError(f"Folder not found: {folder_path}")
file_names = []
file_contents = []
for filename in sorted(os.listdir(folder_path)):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except UnicodeDecodeError:
            with open(file_path, 'r', encoding='latin-1') as f:
                content = f.read()
        file_names.append(filename)
        file_contents.append(content)
df_input = pd.DataFrame({
    'File_Name': file_names,
    'Text': file_contents
})

Key points:
- sorted(os.listdir(...)) ensures files are processed in alphabetical order, making results reproducible across runs.
- A try/except block handles encoding issues: we attempt utf-8 first, then fall back to latin-1. This covers the vast majority of plain-text files.
- The result is a DataFrame with two columns: File_Name and Text.
9.5 Step 3 — Data Cleaning
Raw text is messy. Before tokenizing, we standardize the text by removing punctuation, normalizing whitespace, and unifying apostrophe characters.
import re
df_input["Clean"] = df_input["Text"].apply(lambda x: x.upper())
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”]', '', str(x)))
# Replace curly/smart apostrophes with straight apostrophes
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub('[’‘]', "'", str(x)))
df_input['Clean'] = df_input['Clean'].str.replace(r'\s+', " ", regex=True)
df_input["Clean"] = df_input["Clean"].str.strip()
df_input.to_csv("df_input.csv")
df_input.head()

What each step does:
| Step | What It Does | Why |
|---|---|---|
| .upper() | Converts all text to uppercase | CMUdict keys are uppercase (e.g., PYTHON, not python) |
| re.sub(r'[,\.!?()"“”]', '', ...) | Removes common punctuation marks | Punctuation would cause words to not match CMUdict |
| re.sub('[’‘]', "'", ...) | Replaces curly/smart apostrophes with straight ones | Ensures contractions like DON'T match CMUdict |
| str.replace(r'\s+', " ", ...) | Collapses multiple spaces into one | Prevents empty tokens after splitting |
| .strip() | Removes leading/trailing whitespace | Clean boundaries for tokenization |
| .to_csv(...) | Saves the cleaned DataFrame to disk | Checkpoint — useful for inspection and debugging |
9.6 Step 4 — Tokenization
Split each cleaned text string into a list of individual word tokens:
df_input["Tokenized"] = df_input["Clean"].str.split(" ")
df_input.head()

After this step, each row in the Tokenized column contains a Python list of uppercase word strings, e.g., ['THE', 'QUICK', 'BROWN', 'FOX'].
9.7 Step 5 — Convert Numbers to Words
SAT passages sometimes contain numeric expressions like years (1865) or quantities (42). CMUdict has entries for number words (FORTY, TWO) but not for raw numerals. We use the num2words library to convert digits to their spoken equivalents before lookup. Note that num2words reads a numeral as a cardinal by default (1865 → ONE THOUSAND, EIGHT HUNDRED AND SIXTY-FIVE); pass to='year' if you prefer year-style readings such as EIGHTEEN SIXTY-FIVE.
9.7.1 Install the Library
!pip install num2words

9.7.2 Define the Conversion Function
import num2words
def convert_num_to_words(utterance):
    utterance = ' '.join([num2words.num2words(i).upper() if i.isdigit() else i for i in utterance])
    return utterance

How it works:
- Iterates over each token in the list.
- If a token is a pure digit string (e.g., '42'), it calls num2words.num2words('42') → 'forty-two', then .upper() → 'FORTY-TWO'.
- Non-digit tokens are passed through unchanged.
- Finally, all tokens are joined back into a single space-separated string.
9.7.3 Apply and Clean Hyphens
df_input["words"] = df_input['Tokenized'].apply(convert_num_to_words)
df_input['words'] = df_input['words'].map(lambda x: re.sub('-', ' ', str(x))) # numbers that have '-'
df_input

num2words produces hyphenated forms like FORTY-TWO. We replace hyphens with spaces so each part is treated as a separate token during lookup (FORTY and TWO are separate CMUdict entries).
9.8 Step 6 — Syllable Counting
This is the core of the pipeline. We combine two complementary methods:
9.8.1 Prepare the Working DataFrame
df = df_input[['File_Name', 'Text', 'words']].copy()
df.head(2)

df['list'] = df['words'].str.split(" ")
df['list'] = df['list'].fillna('empty')
df.head()

We split the cleaned words string back into a list, which we’ll iterate over for CMUdict lookups.
9.8.2 Identify Out-of-Vocabulary (OOV) Words
CMUdict is large (~130,000 entries) but not exhaustive. Proper nouns, technical terms, and very rare words may be missing. We separate these into a list for rule-based counting:
def extract_nocmu(data):
    no_list = []
    for word in data:
        if word not in cmudict:
            no_list.append(word)
    return no_list

df["No_CMU_words"] = df["list"].apply(extract_nocmu)
df["No_CMU_words_cnt"] = df["list"].apply(extract_nocmu)
df.head()

- No_CMU_words — stores the actual OOV word strings (for inspection).
- No_CMU_words_cnt — a separate copy that will be overwritten with syllable counts in a later step.
9.8.3 Replace In-Vocabulary Words with ARPABET Strings
For every word that is in CMUdict, replace the word with its ARPABET pronunciation string:
for sub in df['list']:
    for i, word in enumerate(sub):
        if word in cmudict:
            sub[i] = cmudict[word]
df.head()

After this step, each token in df['list'] is either:
- An ARPABET string like 'P AY1 TH AH0 N' (for CMUdict words), or
- The original word as-is (for OOV words that weren’t found).
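The in-place replacement can be checked on a toy token list (dictionary entries in CMUdict-style notation; QUICK is deliberately left out to show the OOV case):

```python
cmudict = {'THE': 'DH AH0', 'FOX': 'F AA1 K S'}  # toy subset

tokens = ['THE', 'QUICK', 'FOX']
for i, word in enumerate(tokens):
    if word in cmudict:
        tokens[i] = cmudict[word]  # swap the word for its ARPABET string

print(tokens)  # ['DH AH0', 'QUICK', 'F AA1 K S']
```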
9.8.4 Rule-Based Syllable Counter for OOV Words
For words not in CMUdict, we use a regex-based heuristic. This approach counts vowel clusters and then applies correction rules for common English spelling patterns that are tricky (silent-e, adverb endings, etc.):
# The following code is adapted from:
# https://datascience.stackexchange.com/questions/23376/how-to-get-the-number-of-syllables-in-a-word
import re
VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I) # re.IGNORECASE
EXCEPTIONS = re.compile(
# fixes trailing e issues:
# smite, scared
"[^aeiou]e[sd]?$|"
# fixes adverbs:
# nicely
+ "[^e]ely$",
flags=re.I
)
ADDITIONAL = re.compile(
# fixes incorrect subtractions from exceptions:
# smile, scarred, raises, fated
"[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
# fixes miscellaneous issues:
# flying, piano, video, prism, fire, evaluate
+ ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
flags=re.I
)
def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

How the formula works:
syllables = max(1, vowel_runs − exceptions + additional)
| Variable | What It Counts | Example |
|---|---|---|
| vowel_runs | Groups of consecutive vowels (a, e, i, o, u, y) | "scared" → ['a', 'e'] = 2 |
| exceptions | Patterns that do not form a syllable (e.g., silent-e) | "scared" → 'red$' matches → subtract 1 |
| additional | Patterns that form an extra syllable missed by vowel runs | "video" → 'eo' matches → add 1 |
| max(1, ...) | Every word has at least 1 syllable | Prevents returning 0 for short words |
Apply it to all OOV words:
for sub in df['No_CMU_words_cnt']:
    for i, word in enumerate(sub):
        sub[i] = count_syllables(word)
df.head()

After this, No_CMU_words_cnt contains integer syllable counts for each OOV word.
9.8.5 Count Syllables from ARPABET Strings (CMUdict Words)
In ARPABET notation, each vowel phoneme is followed by a stress digit (0, 1, or 2). Therefore, the number of digits in an ARPABET string equals the number of syllables:
P AY1 TH AH0 N → digits: 1, 0 → 2 syllables
df['ARPAsent'] = df['list'].astype(str).str.replace(r"[\[\]']", '', regex=True)
df.head()

def count_digits(string):
    return sum(item.isdigit() for item in string)

df['ARPAcnt'] = df['ARPAsent'].apply(count_digits)
df.head()

- ARPAsent — the list column converted to a flat string (list brackets and quotes removed; note regex=True, since str.replace treats the pattern literally by default in current pandas).
- count_digits — iterates character by character; item.isdigit() returns True for '0', '1', '2' (ARPABET stress markers).
- ARPAcnt — the total syllable count contributed by CMUdict-matched words.
9.9 Step 7 — Final Aggregation and Export
Combine syllable counts from both sources and save:
df['NUMsum'] = df['No_CMU_words_cnt'].apply(sum)
df['CMU_based_Syllables'] = df['ARPAcnt'] + df['NUMsum']
df.to_excel('CMU_Syllable_Count.xlsx')
df.head()

| Column | Meaning |
|---|---|
| NUMsum | Total syllables from OOV words (rule-based counting) |
| ARPAcnt | Total syllables from CMUdict words (ARPABET digit counting) |
| CMU_based_Syllables | Grand total syllables for the passage = ARPAcnt + NUMsum |
The final output is written to CMU_Syllable_Count.xlsx, with one row per input .txt file.
9.10 Pipeline Summary
Raw .txt Files
│
▼
Load & Read (UTF-8 / latin-1 fallback)
│
▼
Clean Text (uppercase, remove punctuation, normalize whitespace)
│
▼
Tokenize (split on spaces)
│
▼
Convert Numbers → Words (num2words)
│
▼
Split into In-Vocabulary vs OOV
├─── In-Vocabulary → CMUdict lookup → ARPABET string → count digits
└─── OOV → Rule-based regex counter
│
▼
Sum both syllable counts per passage
│
▼
Export to CMU_Syllable_Count.xlsx
9.11 Tips & Common Issues
Q: Some words are counted wrong — what can I do?
A: The rule-based counter works well for common English words but may miscount proper nouns or foreign loanwords. You can manually add entries to cmudict to override:
cmudict['YOURWORD'] = 'Y AO1 R W ER1 D'

Q: My text has em-dashes (—) or ellipses (…) — will these cause problems?
A: Yes. Add them to the cleaning regex:
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”—…]', '', str(x)))

Q: How do I use the syllable count for readability scoring?
A: The Flesch-Kincaid Grade Level formula is:
FK Grade = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59
You can compute words with len(token_list) and add sentence counting with nltk.sent_tokenize().
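As a sketch, the formula as a function (the word, sentence, and syllable numbers below are made-up illustrative values):

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# e.g., a 120-word passage with 6 sentences and 180 syllables
print(round(fk_grade(120, 6, 180), 2))  # 9.91
```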
9.12 Full Code Reference
Below is the complete pipeline in order, ready to copy into a single notebook or .py file:
# === Step 1: Load CMUdict ===
import pandas as pd
import numpy as np
cmu = pd.read_csv(
    'cmudict-07b',
    sep='  ',
    engine='python',
    comment=';',
    names=['word', 'cmu_pronunciation'],
    encoding='latin-1'
)
cmudict = cmu.set_index('word')['cmu_pronunciation'].to_dict()
# === Step 2: Load input texts ===
import os
folder_path = './input_text'
if not os.path.exists(folder_path):
    raise FileNotFoundError(f"Folder not found: {folder_path}")
file_names, file_contents = [], []
for filename in sorted(os.listdir(folder_path)):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except UnicodeDecodeError:
            with open(file_path, 'r', encoding='latin-1') as f:
                content = f.read()
        file_names.append(filename)
        file_contents.append(content)
df_input = pd.DataFrame({'File_Name': file_names, 'Text': file_contents})
# === Step 3: Clean text ===
import re
df_input["Clean"] = df_input["Text"].apply(lambda x: x.upper())
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”]', '', str(x)))
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub('[’‘]', "'", str(x)))
df_input['Clean'] = df_input['Clean'].str.replace(r'\s+', " ", regex=True)
df_input["Clean"] = df_input["Clean"].str.strip()
df_input.to_csv("df_input.csv")
# === Step 4: Tokenize ===
df_input["Tokenized"] = df_input["Clean"].str.split(" ")
# === Step 5: Convert numbers to words ===
import num2words
def convert_num_to_words(utterance):
    utterance = ' '.join([num2words.num2words(i).upper() if i.isdigit() else i for i in utterance])
    return utterance
df_input["words"] = df_input['Tokenized'].apply(convert_num_to_words)
df_input['words'] = df_input['words'].map(lambda x: re.sub('-', ' ', str(x)))
# === Step 6: Count syllables ===
df = df_input[['File_Name', 'Text', 'words']].copy()
df['list'] = df['words'].str.split(" ")
df['list'] = df['list'].fillna('empty')
def extract_nocmu(data):
    return [word for word in data if word not in cmudict]

df["No_CMU_words"] = df["list"].apply(extract_nocmu)
df["No_CMU_words_cnt"] = df["list"].apply(extract_nocmu)

for sub in df['list']:
    for i, word in enumerate(sub):
        if word in cmudict:
            sub[i] = cmudict[word]
VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile("[^aeiou]e[sd]?$|[^e]ely$", flags=re.I)
ADDITIONAL = re.compile(
"[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|.y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
flags=re.I
)
def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

for sub in df['No_CMU_words_cnt']:
    for i, word in enumerate(sub):
        sub[i] = count_syllables(word)

df['ARPAsent'] = df['list'].astype(str).str.replace(r"[\[\]']", '', regex=True)

def count_digits(string):
    return sum(item.isdigit() for item in string)
df['ARPAcnt'] = df['ARPAsent'].apply(count_digits)
# === Step 7: Aggregate and export ===
df['NUMsum'] = df['No_CMU_words_cnt'].apply(sum)
df['CMU_based_Syllables'] = df['ARPAcnt'] + df['NUMsum']
df.to_excel('CMU_Syllable_Count.xlsx')
print(df[['File_Name', 'CMU_based_Syllables']])