8  Reading and Listening Materials Development

9 Syllable Counter for Reading Passages

9.1 Overview

This tutorial walks you through building an automated syllable counter for text data using Python. The pipeline:

  1. Loads the CMU Pronouncing Dictionary (CMUdict) — a standard lexical resource that maps English words to their phonetic pronunciations in ARPABET notation.
  2. Reads raw .txt files from a folder, cleans them, and tokenizes them.
  3. Converts numeric tokens (e.g., 42) into their spoken word equivalents (FORTY-TWO).
  4. Counts syllables for each passage by combining two strategies:
    • CMUdict-based counting — digit-counting on ARPABET strings for dictionary words.
    • Rule-based counting — regex pattern matching for out-of-vocabulary (OOV) words.

This approach is particularly useful for readability analysis of SAT reading passages, where syllable count is a key component of formulas like the Flesch-Kincaid Grade Level.


9.2 Prerequisites

  • Python 3.7+
  • pandas, numpy, num2words, openpyxl (needed by pandas for Excel export)
  • The CMU Pronouncing Dictionary file (named cmudict-07b)
  • A folder named input_text/ containing .txt files to analyze

9.3 Step 1 — Import and Build CMUdict

9.3.1 What is CMUdict?

The CMU Pronouncing Dictionary is a free, open-source pronunciation dictionary for North American English, maintained by Carnegie Mellon University. Each entry maps a word to its ARPABET phoneme sequence, where vowel phonemes are tagged with stress markers (digits 0, 1, 2). For example:

PYTHON  P AY1 TH AH0 N

This word has two vowel phonemes (AY1, AH0), so it has 2 syllables. By counting the digits in an ARPABET string, we can count syllables reliably.
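The digit-counting idea can be verified directly (a minimal sketch; the pronunciation string is the PYTHON example above):

```python
# Each vowel phoneme in ARPABET carries exactly one stress digit (0, 1, or 2),
# so counting digits in the pronunciation string counts syllables.
def arpabet_syllables(pronunciation):
    return sum(ch.isdigit() for ch in pronunciation)

print(arpabet_syllables("P AY1 TH AH0 N"))  # 2
```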

9.3.2 Loading CMUdict into a Python Dictionary

import pandas as pd
import numpy as np

cmu = pd.read_csv(
    'cmudict-07b',
    sep='  ',                  # two spaces
    engine='python',
    comment=';',               # ignore comment lines
    names=['word', 'cmu_pronunciation'],
    encoding='latin-1'
)

# Build dictionary
cmudict = cmu.set_index('word')['cmu_pronunciation'].to_dict()

Key points:

  • The CMUdict file uses two spaces as a delimiter between the word and its pronunciation — hence sep='  '.
  • Lines beginning with ; are comments (metadata); comment=';' skips them automatically.
  • encoding='latin-1' is required because the file contains non-ASCII characters in some entries.
  • We convert the DataFrame into a plain Python dict (cmudict) for fast O(1) lookup during syllable counting.
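If you prefer not to depend on pandas for this step, the same dictionary can be built with plain Python (a sketch assuming the standard CMUdict layout: word, two spaces, pronunciation, with ';'-prefixed comment lines; the function name build_cmudict is my own):

```python
def build_cmudict(lines):
    """Parse CMUdict-format lines into a {WORD: pronunciation} dict."""
    d = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith(';'):
            continue  # skip comment and blank lines
        word, _, pron = line.partition('  ')  # split on the two-space delimiter
        d[word] = pron
    return d

# Usage (with the actual file):
# with open('cmudict-07b', encoding='latin-1') as f:
#     cmudict = build_cmudict(f)
```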


9.4 Step 2 — Load Input Text Files

9.4.1 Check Your Working Directory

Before reading files, confirm your current working directory so that relative paths resolve correctly:

import os
print(os.getcwd())

9.4.2 Read All .txt Files from a Folder

import os
import pandas as pd

folder_path = './input_text'

if not os.path.exists(folder_path):
    raise FileNotFoundError(f"Folder not found: {folder_path}")

file_names = []
file_contents = []

for filename in sorted(os.listdir(folder_path)):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except UnicodeDecodeError:
            with open(file_path, 'r', encoding='latin-1') as f:
                content = f.read()
        
        file_names.append(filename)
        file_contents.append(content)

df_input = pd.DataFrame({
    'File_Name': file_names,
    'Text': file_contents
})

Key points:

  • sorted(os.listdir(...)) ensures files are processed in alphabetical order, making results reproducible across runs.
  • A try/except block handles encoding issues: we attempt utf-8 first, then fall back to latin-1. This covers the vast majority of plain-text files.
  • The result is a DataFrame with two columns: File_Name and Text.
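The encoding fallback boils down to this logic, shown here on raw bytes rather than file handles (a sketch; the function name decode_text is my own):

```python
def decode_text(data: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('latin-1')

# A lone 0xE9 byte is invalid UTF-8 but is 'é' in Latin-1:
print(decode_text(b'caf\xe9'))  # café
```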


9.5 Step 3 — Data Cleaning

Raw text is messy. Before tokenizing, we standardize the text by removing punctuation, normalizing whitespace, and unifying apostrophe characters.

import re

df_input["Clean"] = df_input["Text"].apply(lambda x: x.upper())
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”]', '', str(x)))

# Replace curly/smart apostrophes with straight apostrophes
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[’‘]', "'", str(x)))

df_input['Clean'] = df_input['Clean'].str.replace(r'\s+', " ", regex=True)
df_input["Clean"] = df_input["Clean"].str.strip()
df_input.to_csv("df_input.csv")
df_input.head()

What each step does:

Step What It Does Why
.upper() Converts all text to uppercase CMUdict keys are uppercase (e.g., PYTHON, not python)
re.sub(r'[,\.!?()"“”]', '', ...) Removes common punctuation marks Punctuation would cause words to not match CMUdict
re.sub(r'[’‘]', "'", ...) Replaces curly/smart apostrophes with straight ones Ensures contractions like DON'T match CMUdict
str.replace(r'\s+', " ", ...) Collapses multiple spaces into one Prevents empty tokens after splitting
.strip() Removes leading/trailing whitespace Clean boundaries for tokenization
.to_csv(...) Saves the cleaned DataFrame to disk Checkpoint — useful for inspection and debugging
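The same cleaning steps can be traced on a single sentence (a standalone sketch of the regex logic, using plain strings instead of a DataFrame):

```python
import re

raw = 'She said, "Don’t stop!"  It was   1865.'
text = raw.upper()
text = re.sub(r'[,\.!?()"“”]', '', text)   # strip punctuation
text = re.sub(r'[’‘]', "'", text)          # normalize smart apostrophes
text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
print(text)  # note DON'T survives with a straight apostrophe
```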

9.6 Step 4 — Tokenization

Split each cleaned text string into a list of individual word tokens:

df_input["Tokenized"] = df_input["Clean"].str.split(" ")
df_input.head()

After this step, each row in the Tokenized column contains a Python list of uppercase word strings, e.g., ['THE', 'QUICK', 'BROWN', 'FOX'].


9.7 Step 5 — Convert Numbers to Words

SAT passages sometimes contain numeric expressions like years (1865) or quantities (42). CMUdict has entries for word forms (EIGHTEEN, SIXTY-FIVE) but not for raw numerals. We use the num2words library to convert digits to their spoken equivalents before lookup.

9.7.1 Install the Library

!pip install num2words

9.7.2 Define the Conversion Function

import num2words

def convert_num_to_words(utterance):
    utterance = ' '.join([num2words.num2words(i).upper() if i.isdigit() else i for i in utterance])
    return utterance

How it works:

  • Iterates over each token in the list.
  • If a token is a pure digit string (e.g., '42'), it calls num2words.num2words('42') → 'forty-two', then .upper() → 'FORTY-TWO'.
  • Non-digit tokens are passed through unchanged.
  • Finally, all tokens are joined back into a single space-separated string.

9.7.3 Apply and Clean Hyphens

df_input["words"] = df_input['Tokenized'].apply(convert_num_to_words)
df_input['words'] = df_input['words'].map(lambda x: re.sub('-', ' ', str(x)))  # numbers that have '-'
df_input

num2words produces hyphenated forms like FORTY-TWO. We replace hyphens with spaces so each part is treated as a separate token during lookup (FORTY and TWO are separate CMUdict entries).
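One caveat worth knowing: str.isdigit() only matches unsigned integer tokens, so decimals, negatives, and comma-grouped numbers pass through convert_num_to_words unconverted (a quick check):

```python
# Which tokens would the i.isdigit() test actually flag for conversion?
tokens = ['1865', '3.5', '-7', '1,000', 'FORTY-TWO']
convertible = [t for t in tokens if t.isdigit()]
print(convertible)  # ['1865']
```

If your passages contain such forms, normalize them before this step or extend the check.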


9.8 Step 6 — Syllable Counting

This is the core of the pipeline. We combine two complementary methods:

9.8.1 Prepare the Working DataFrame

df = df_input[['File_Name', 'Text', 'words']].copy()
df.head(2)
df['list'] = df['words'].str.split(" ")
df['list'] = df['list'].fillna('empty')

df.head()

We split the cleaned words string back into a list, which we’ll iterate over for CMUdict lookups.


9.8.2 Identify Out-of-Vocabulary (OOV) Words

CMUdict is large (~130,000 entries) but not exhaustive. Proper nouns, technical terms, and very rare words may be missing. We separate these into a list for rule-based counting:

def extract_nocmu(data):
    no_list = []
    for word in data:
        if word not in cmudict:
            no_list.append(word)
    return no_list

df["No_CMU_words"] = df["list"].apply(extract_nocmu)
df["No_CMU_words_cnt"] = df["list"].apply(extract_nocmu)
df.head()

  • No_CMU_words — stores the actual OOV word strings (for inspection).
  • No_CMU_words_cnt — a separate copy that will be overwritten with syllable counts in a later step.
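With a toy two-entry dictionary, the split looks like this (PYTHON and THE are "in vocabulary", XYZZY is not; the dictionary contents here are purely illustrative):

```python
cmudict = {'PYTHON': 'P AY1 TH AH0 N', 'THE': 'DH AH0'}

def extract_nocmu(data):
    return [word for word in data if word not in cmudict]

oov = extract_nocmu(['THE', 'PYTHON', 'XYZZY'])
print(oov)  # ['XYZZY']
```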

9.8.3 Replace In-Vocabulary Words with ARPABET Strings

For every word that is in CMUdict, replace the word with its ARPABET pronunciation string:

for sub in df['list']:
    for i, word in enumerate(sub):
        if word in cmudict:
            sub[i] = cmudict[word]
df.head()

After this step, each token in df['list'] is either: - An ARPABET string like 'P AY1 TH AH0 N' (for CMUdict words), or - The original word as-is (for OOV words that weren’t found).
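On the same kind of toy data, the in-place replacement leaves OOV words untouched (illustrative dictionary again):

```python
cmudict = {'PYTHON': 'P AY1 TH AH0 N'}
tokens = ['PYTHON', 'XYZZY']
for i, word in enumerate(tokens):
    if word in cmudict:
        tokens[i] = cmudict[word]  # swap word for its pronunciation
print(tokens)  # ['P AY1 TH AH0 N', 'XYZZY']
```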


9.8.4 Rule-Based Syllable Counter for OOV Words

For words not in CMUdict, we use a regex-based heuristic. This approach counts vowel clusters and then applies correction rules for common English spelling patterns that are tricky (silent-e, adverb endings, etc.):

# The following code is adapted from:
# https://datascience.stackexchange.com/questions/23376/how-to-get-the-number-of-syllables-in-a-word

import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)  # re.IGNORECASE
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)

ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

How the formula works:

syllables = max(1, vowel_runs − exceptions + additional)
Variable What It Counts Example
vowel_runs Groups of consecutive vowels (a, e, i, o, u, y) "scared" → ['a', 'e'] = 2
exceptions Patterns that do not form a syllable (e.g., silent-e) "scared" → 'red$' matches → subtract 1
additional Patterns that form an extra syllable missed by vowel runs "video" → 'eo' matches → add 1
max(1, ...) Every word has at least 1 syllable Prevents returning 0 for short words
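Running the counter on a few sample words shows the rules in action (a self-contained copy of the regexes above; the counts were checked by hand against the patterns):

```python
import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile("[^aeiou]e[sd]?$|[^e]ely$", flags=re.I)
ADDITIONAL = re.compile(
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|.y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I,
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

for w in ["scared", "video", "nicely", "flying"]:
    print(w, count_syllables(w))
# scared 1 (silent-e subtracted), video 3 ('eo' added),
# nicely 2 ('ely' subtracted), flying 2 ('yi' is one run, '.y[aeiou]' adds one)
```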

Apply it to all OOV words:

for sub in df['No_CMU_words_cnt']:
    for i, word in enumerate(sub):
        sub[i] = count_syllables(word)
df.head()

After this, No_CMU_words_cnt contains integer syllable counts for each OOV word.


9.8.5 Count Syllables from ARPABET Strings (CMUdict Words)

In ARPABET notation, each vowel phoneme is followed by a stress digit (0, 1, or 2). Therefore, the number of digits in an ARPABET string equals the number of syllables:

P AY1 TH AH0 N  →  digits: 1, 0  →  2 syllables

df['ARPAsent'] = df['list'].astype(str).str.replace(r"\[|\]|'", '', regex=True)
df.head()

def count_digits(string):
    return sum(item.isdigit() for item in string)

df['ARPAcnt'] = df['ARPAsent'].apply(count_digits)
df.head()

  • ARPAsent — the list column converted to a flat string (list brackets and quotes removed). Note regex=True: the pattern uses regex alternation, and recent pandas versions treat str.replace patterns as literal strings by default.
  • count_digits — iterates character by character; item.isdigit() returns True for '0', '1', '2' (ARPABET stress markers).
  • ARPAcnt — the total syllable count contributed by CMUdict-matched words.

9.9 Step 7 — Final Aggregation and Export

Combine syllable counts from both sources and save:

df['NUMsum'] = df['No_CMU_words_cnt'].apply(sum)
df['CMU_based_Syllables'] = df['ARPAcnt'] + df['NUMsum']
df.to_excel('CMU_Syllable_Count.xlsx')
df.head()

Column Meaning
NUMsum Total syllables from OOV words (rule-based counting)
ARPAcnt Total syllables from CMUdict words (ARPABET digit counting)
CMU_based_Syllables Grand total syllables for the passage = ARPAcnt + NUMsum

The final output is written to CMU_Syllable_Count.xlsx, with one row per input .txt file.


9.10 Pipeline Summary

Raw .txt Files
     │
     ▼
Load & Read (UTF-8 / latin-1 fallback)
     │
     ▼
Clean Text (uppercase, remove punctuation, normalize whitespace)
     │
     ▼
Tokenize (split on spaces)
     │
     ▼
Convert Numbers → Words (num2words)
     │
     ▼
Split into In-Vocabulary vs OOV
     ├─── In-Vocabulary → CMUdict lookup → ARPABET string → count digits
     └─── OOV           → Rule-based regex counter
     │
     ▼
Sum both syllable counts per passage
     │
     ▼
Export to CMU_Syllable_Count.xlsx

9.11 Tips & Common Issues

Q: Some words are counted wrong — what can I do?
A: The rule-based counter works well for common English words but may miscount proper nouns or foreign loanwords. You can manually add entries to cmudict to override:

cmudict['YOURWORD'] = 'Y AO1 R W ER1 D'

Q: My text has em-dashes (—) or ellipses (…) — will these cause problems?
A: Yes. Add them to the cleaning regex:

df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”—…]', '', str(x)))

Q: How do I use the syllable count for readability scoring?
A: The Flesch-Kincaid Grade Level formula is:

FK Grade = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59

You can compute words with len(token_list) and add sentence counting with nltk.sent_tokenize().
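As a quick sketch, the formula translates directly into a function (the function name is mine; inputs are raw counts for a passage):

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw word/sentence/syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# e.g., 100 words, 5 sentences, 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```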


9.12 Full Code Reference

Below is the complete pipeline in order, ready to copy into a single notebook or .py file:

# === Step 1: Load CMUdict ===
import pandas as pd
import numpy as np

cmu = pd.read_csv(
    'cmudict-07b',
    sep='  ',
    engine='python',
    comment=';',
    names=['word', 'cmu_pronunciation'],
    encoding='latin-1'
)
cmudict = cmu.set_index('word')['cmu_pronunciation'].to_dict()

# === Step 2: Load input texts ===
import os

folder_path = './input_text'
if not os.path.exists(folder_path):
    raise FileNotFoundError(f"Folder not found: {folder_path}")

file_names, file_contents = [], []
for filename in sorted(os.listdir(folder_path)):
    if filename.endswith('.txt'):
        file_path = os.path.join(folder_path, filename)
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except UnicodeDecodeError:
            with open(file_path, 'r', encoding='latin-1') as f:
                content = f.read()
        file_names.append(filename)
        file_contents.append(content)

df_input = pd.DataFrame({'File_Name': file_names, 'Text': file_contents})

# === Step 3: Clean text ===
import re

df_input["Clean"] = df_input["Text"].apply(lambda x: x.upper())
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[,\.!?()"“”]', '', str(x)))
df_input['Clean'] = df_input['Clean'].map(lambda x: re.sub(r'[’‘]', "'", str(x)))
df_input['Clean'] = df_input['Clean'].str.replace(r'\s+', " ", regex=True)
df_input["Clean"] = df_input["Clean"].str.strip()
df_input.to_csv("df_input.csv")

# === Step 4: Tokenize ===
df_input["Tokenized"] = df_input["Clean"].str.split(" ")

# === Step 5: Convert numbers to words ===
import num2words

def convert_num_to_words(utterance):
    utterance = ' '.join([num2words.num2words(i).upper() if i.isdigit() else i for i in utterance])
    return utterance

df_input["words"] = df_input['Tokenized'].apply(convert_num_to_words)
df_input['words'] = df_input['words'].map(lambda x: re.sub('-', ' ', str(x)))

# === Step 6: Count syllables ===
df = df_input[['File_Name', 'Text', 'words']].copy()
df['list'] = df['words'].str.split(" ")
df['list'] = df['list'].fillna('empty')

def extract_nocmu(data):
    return [word for word in data if word not in cmudict]

df["No_CMU_words"] = df["list"].apply(extract_nocmu)
df["No_CMU_words_cnt"] = df["list"].apply(extract_nocmu)

for sub in df['list']:
    for i, word in enumerate(sub):
        if word in cmudict:
            sub[i] = cmudict[word]

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile("[^aeiou]e[sd]?$|[^e]ely$", flags=re.I)
ADDITIONAL = re.compile(
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|.y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

for sub in df['No_CMU_words_cnt']:
    for i, word in enumerate(sub):
        sub[i] = count_syllables(word)

df['ARPAsent'] = df['list'].astype(str).str.replace(r"\[|\]|'", '', regex=True)

def count_digits(string):
    return sum(item.isdigit() for item in string)

df['ARPAcnt'] = df['ARPAsent'].apply(count_digits)

# === Step 7: Aggregate and export ===
df['NUMsum'] = df['No_CMU_words_cnt'].apply(sum)
df['CMU_based_Syllables'] = df['ARPAcnt'] + df['NUMsum']
df.to_excel('CMU_Syllable_Count.xlsx')
print(df[['File_Name', 'CMU_based_Syllables']])