%%capture
# Install necessary packages
!pip install git+https://github.com/huggingface/transformers gradio openpyxl
!pip install gdown
!pip install torch
9 Automatic Speech Recognition with Whisper
Overview
This tutorial walks you through automatic speech recognition (ASR) using OpenAI’s Whisper models via the HuggingFace transformers library. By the end of this guide, you will be able to:
- Understand what Whisper is and why it matters for language research
- Load Whisper models of different sizes using the HuggingFace pipeline
- Transcribe a batch of audio files stored in Google Drive
- Save the resulting transcriptions into a structured Excel spreadsheet
9.1 Prerequisites
Before starting, make sure you have:
- A Google account with Google Colab and Google Drive access
- Audio files (.mp3 or .wav) uploaded to your Google Drive
- Basic familiarity with Python syntax
9.2 Introduction to Whisper
9.2.1 What is Whisper?
Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in 2022 and continuously improved through subsequent versions. It is trained on a massive dataset of 680,000 hours of multilingual speech from the internet, making it one of the most robust and versatile ASR systems available.
Key characteristics of Whisper:
| Feature | Description |
|---|---|
| Multilingual | Supports 99 languages out of the box |
| Multitask | Can transcribe, translate, and detect language |
| Robust | Handles accents, noise, and code-switching well |
| Open-source | Freely available on GitHub and HuggingFace Hub |
9.2.2 Whisper v3: What’s New?
Whisper large-v3 is the latest generation of the model family. Notable improvements include:
- Better multilingual performance, including mid-sentence language switching (e.g., shifting between English, Spanish, and Chinese within the same utterance)
- Reduced hallucination on silent or non-speech segments
- Higher accuracy across a wider range of accents and recording conditions
- Full integration with the HuggingFace transformers library, making migration from older versions seamless
HuggingFace transformers provides a unified API (pipeline) to load and run state-of-the-art models — including Whisper — with just a few lines of code. It also handles device management (CPU vs. GPU) and batching automatically.
9.2.3 The Whisper Model Family
Whisper comes in multiple sizes, each trading off accuracy for speed and memory:
| Model | Parameters | Relative Speed | Best Use Case |
|---|---|---|---|
| whisper-tiny | 39 M | ~32× | Quick prototyping |
| whisper-base | 74 M | ~16× | Lightweight batch jobs |
| whisper-small | 244 M | ~6× | General use (balanced) |
| whisper-medium | 769 M | ~2× | High quality on limited GPU |
| whisper-large-v3 | 1550 M | 1× (baseline) | Best accuracy, research use |
In this tutorial we demonstrate both whisper-large-v3 (highest accuracy) and whisper-base (fastest inference) so you can compare them directly. Each model writes its transcription results to its own Excel file, allowing you to inspect and compare outputs independently.
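To make the memory side of this trade-off concrete, the sketch below estimates how much space the weights alone occupy at two precisions. It is a back-of-the-envelope calculation, assuming 2 bytes per parameter for float16, 4 bytes for float32, and 1 GB = 10^9 bytes; the helper name `weight_size_gb` is made up for this illustration, and real GPU usage will be higher because activations and framework overhead come on top.

```python
def weight_size_gb(params_millions, bytes_per_param=2):
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


# Parameter counts (in millions) copied from the model family table.
MODEL_PARAMS_M = {
    "whisper-tiny": 39,
    "whisper-base": 74,
    "whisper-small": 244,
    "whisper-medium": 769,
    "whisper-large-v3": 1550,
}

for name, params in MODEL_PARAMS_M.items():
    fp16 = weight_size_gb(params, bytes_per_param=2)  # half precision
    fp32 = weight_size_gb(params, bytes_per_param=4)  # full precision
    print(f"{name:<18} float16 ~ {fp16:.2f} GB   float32 ~ {fp32:.2f} GB")
```

Under these assumptions the large-v3 weights come to roughly 3.1 GB in float16, about twenty times the size of whisper-base.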
9.3 Environment Setup
9.3.1 Step 1 — Install Required Packages
All dependencies are installed via pip. The %%capture magic suppresses verbose installation output in Colab.
9.4 Package Descriptions
| Package | Purpose |
|---|---|
| transformers | HuggingFace library that provides the Whisper pipeline |
| gradio | (Optional) Build quick web demos around your models |
| openpyxl | Read and write .xlsx Excel files from Python |
| gdown | Download files from Google Drive by URL |
We install transformers directly from GitHub to ensure we get the latest version with full Whisper v3 support. Once HuggingFace releases a stable version that includes Whisper v3, you can switch to pip install transformers instead.
9.4.1 Step 2 — Import Libraries
# Import necessary libraries
import os

import gdown
import torch
from openpyxl import Workbook, load_workbook
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
9.5 What Each Import Does
- torch: PyTorch, the deep learning framework Whisper runs on. We use it to specify the computation device (GPU) and data type (float16).
- pipeline: The high-level HuggingFace API. It wraps model loading, preprocessing, inference, and postprocessing into one callable object.
- gdown: A helper for downloading files from Google Drive sharing links.
- os: Standard Python module for file and directory operations (listing files, joining paths, checking existence).
- Workbook, load_workbook: From openpyxl; used to create new Excel files or open existing ones for appending rows.
- AutoModelForSpeechSeq2Seq, AutoProcessor: Lower-level HuggingFace classes. They are imported here for completeness; the pipeline wrapper uses them internally.
9.6 Connecting to Google Drive
In Google Colab, your files live in the cloud. Mounting Google Drive gives the notebook read/write access to your personal Drive storage.
# Authenticate and mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
When you run this cell:
- A pop-up (or link) will appear asking you to sign in with your Google account.
- After authentication, your Drive becomes accessible at the path /content/drive/MyDrive/.
- You must re-mount Drive every time you start a new Colab session (sessions do not persist).
- The path /content/drive/MyDrive/ is equivalent to the root of your “My Drive” folder as seen in the Google Drive web interface.
- Make sure your audio files are already uploaded to Drive before running the transcription cells.
9.6.1 Define File Paths
# Define the path to the folder containing your audio files
input_audio_path = '/content/drive/MyDrive/Teaching/HY_S26_EngTech/audio_files'
Replace the path above with the actual folder on your Google Drive that contains your audio files. For example:
input_audio_path = '/content/drive/MyDrive/my_project/recordings'
The folder should contain .mp3 or .wav files. Other formats (.m4a, .flac, .ogg) can be added by modifying the file-listing filter in the transcription function below.
9.7 Transcription: Whisper-Large-v3
This section runs the highest-accuracy Whisper model and saves all transcriptions to a dedicated Excel file. Each row corresponds to one audio file; the second column holds the model’s transcription.
The output spreadsheet will have the following structure:
| Filename | Whisper-Large-v3 |
|---|---|
| recording_001.wav | large-v3 transcript |
| recording_002.mp3 | large-v3 transcript |
| … | … |
9.7.1 Step 1 — Set Output Path and Load the Model
# Define the folder path in Google Drive
model_version = 'whisper-large-v3'
output_excel_file_path = f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/Output_{model_version}_Phoneme_Transcript_183.xlsx'
# Load Whisper model using the high-level `pipeline` from the `transformers` library
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0"
)
9.8 Key Parameters
| Parameter | Value | Purpose |
|---|---|---|
| torch_dtype | torch.float16 | Half-precision arithmetic; halves VRAM usage |
| device | "cuda:0" | Runs inference on the first available GPU |
model_version is a human-readable label used only to build the output filename. The actual model loaded by pipeline is controlled by the string "openai/whisper-large-v3", which is the official HuggingFace model identifier.
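Because `device="cuda:0"` is hard-coded, the loading cell will fail in a Colab session without a GPU runtime. A possible fallback is sketched below; `choose_device` is a hypothetical helper, and the choice of float32 on CPU is an assumption made here because half precision is often slow or unsupported on CPUs.

```python
def choose_device(cuda_available):
    """Return (device string, torch dtype name) to pass to the pipeline."""
    if cuda_available:
        return "cuda:0", "float16"  # GPU: half precision saves memory
    return "cpu", "float32"         # CPU fallback: full precision (assumption)


# Intended wiring in the notebook (needs torch and transformers installed):
#   device, dtype_name = choose_device(torch.cuda.is_available())
#   pipe = pipeline(
#       "automatic-speech-recognition",
#       "openai/whisper-large-v3",
#       torch_dtype=getattr(torch, dtype_name),
#       device=device,
#   )

print(choose_device(False))
```

The helper returns the dtype as a name rather than a torch object so that the selection logic can be checked without importing torch.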
9.8.1 Step 2 — Define the Transcription Function
def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    audio_files = [
        f for f in os.listdir(folder_path)
        if f.endswith('.mp3') or f.endswith('.wav')
    ]
    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    # Make sure the output folder exists before the first save.
    os.makedirs(os.path.dirname(excel_file_path) or ".", exist_ok=True)
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Large-v3"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]
    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # return_timestamps=True enables sliding-window long-form transcription.
        # chunk_length_s=30 matches Whisper's 30-second context window.
        # stride_length_s=5 adds overlap to avoid cutting words at boundaries.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])
        # Save after every file so an interrupted run keeps the finished rows.
        workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")
9.9 Key Design Decisions
Conditional workbook creation
if not os.path.exists(excel_file_path):
    workbook = Workbook()           # create a brand-new file
    ...
else:
    workbook = load_workbook(...)   # open the existing file
This guard prevents overwriting an existing file if the cell is re-run (e.g., after a kernel restart). New rows are always appended to the existing sheet, so partial results from a previous interrupted run are preserved. Note that re-running the cell on the same folder appends every file again, so the sheet can end up with duplicate rows.
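One way to resume an interrupted run without re-transcribing finished files is to skip any filename that already has a row in the sheet. The sketch below shows only that filtering step, on made-up filenames; `pending_files` is a hypothetical helper and is not part of the notebook.

```python
def pending_files(audio_files, done_files):
    """Return the files that still need transcribing, keeping the original order."""
    done = set(done_files)
    return [name for name in audio_files if name not in done]


# Hypothetical example: two files were finished before the interruption.
audio_files = ["rec_001.wav", "rec_002.mp3", "rec_003.wav"]
already_in_sheet = ["rec_001.wav", "rec_002.mp3"]
print(pending_files(audio_files, already_in_sheet))
```

Inside the transcription function, the list of finished files could be read from the first column of the existing sheet, for example with `[row[0] for row in sheet.iter_rows(min_row=2, values_only=True)]`, and the loop would then iterate over the pending files only.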
return_timestamps=True and chunking parameters
These settings are needed for audio files longer than 30 seconds. Whisper’s native context window is 30 s; without them, the pipeline either truncates longer recordings or stops with an error, depending on the transformers version.
| Parameter | Value | Purpose |
|---|---|---|
| return_timestamps | True | Enables sliding-window long-form transcription |
| chunk_length_s | 30 | Each chunk matches Whisper’s 30 s context window |
| stride_length_s | 5 | 5 s overlap prevents words from being cut at boundaries |
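To visualise how chunking covers a long recording, the sketch below lists the time windows produced for a given duration. It is a simplified illustration and not the pipeline's actual implementation: it assumes, as the pipeline documentation describes for a single stride value, that the stride applies to both the left and the right side of each chunk, so consecutive chunks start `chunk_length_s - 2 * stride_length_s` seconds apart. The function name `chunk_windows` is made up for this example.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    duration_s = float(duration_s)
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the audio is fully covered
            break
        start += step
    return windows


print(chunk_windows(70))
```

For a 70-second recording with the defaults above, the sketch yields three windows starting at 0 s, 20 s, and 40 s, so every chunk boundary falls inside a region that a neighbouring chunk also sees.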
9.9.1 Step 3 — Run the Transcription
# Run the transcription function for Whisper-Large-v3
transcribe_audio_files(input_audio_path, output_excel_file_path)
When this cell finishes, an Excel file will appear at output_excel_file_path in your Google Drive containing the Whisper-Large-v3 transcriptions.
9.10 Transcription: Whisper-Base
This section runs the fastest Whisper model and saves all transcriptions to a separate Excel file. The workflow is identical to the large-v3 section; only the model identifier and output filename change.
The output spreadsheet will have the following structure:
| Filename | Whisper-Base |
|---|---|
| recording_001.wav | base transcript |
| recording_002.mp3 | base transcript |
| … | … |
9.10.1 Step 1 — Set Output Path and Load the Model
# Set the model version label and derive the output file path automatically
model_version = 'whisper-base'
output_excel_file_path = (
    f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/'
    f'Output_{model_version}_Phoneme_Transcript_183.xlsx'
)
# Load Whisper-Base using the HuggingFace high-level pipeline API
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-base",
    torch_dtype=torch.float16,
    device="cuda:0"
)
9.10.2 Step 2 — Define the Transcription Function
def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    audio_files = [
        f for f in os.listdir(folder_path)
        if f.endswith('.mp3') or f.endswith('.wav')
    ]
    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    # Make sure the output folder exists before the first save.
    os.makedirs(os.path.dirname(excel_file_path) or ".", exist_ok=True)
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Base"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]
    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # For long audio, return_timestamps=True and chunking parameters are necessary.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])
        # Save after every file so an interrupted run keeps the finished rows.
        workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")
9.10.3 Step 3 — Run the Transcription
# Run the transcription function for Whisper-Base
transcribe_audio_files(input_audio_path, output_excel_file_path)
# End.
| Dimension | Whisper-Base | Whisper-Large-v3 |
|---|---|---|
| Model size | ~74 M parameters | ~1.55 B parameters |
| Download size | ~150 MB | ~3 GB |
| Inference speed | ~16× faster than large | Baseline |
| Word Error Rate (WER) | Higher (less accurate) | Lower (more accurate) |
| Accent robustness | Moderate | High |
| Recommended for | Piloting, quick iteration | Research-grade output |
For phoneme-level research or publication-quality transcriptions, Whisper-Large-v3 is strongly recommended. Use Whisper-Base for quick sanity-checks or when working under a tight time budget.
9.11 Word Error Rate (WER) Calculation
Once you have transcriptions from multiple ASR systems, the natural next question is: which model is more accurate? This section introduces Word Error Rate (WER) — the standard metric for evaluating ASR output — and walks through a step-by-step Python implementation that computes WER for each ASR system and appends the scores to the original data file.
9.11.1 What is Word Error Rate?
Word Error Rate (WER) measures how many word-level edits are needed to transform a hypothesis (the ASR output) into a reference (the human-verified transcript), expressed as a proportion of the total number of words in the reference.
\[\text{WER} = \frac{S + D + I}{N}\]
where:
| Symbol | Meaning |
|---|---|
| \(S\) | Number of substitutions (wrong word) |
| \(D\) | Number of deletions (word in reference, missing in hypothesis) |
| \(I\) | Number of insertions (extra word in hypothesis not in reference) |
| \(N\) | Total number of words in the reference transcript |
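As a worked example (the sentences are made up for illustration), take the reference "the cat sat on the mat", so \(N = 6\), and the hypothesis "the cat sit on mat". One word is substituted (sat → sit) and one is deleted (the second "the"), with no insertions, so \(S = 1\), \(D = 1\), \(I = 0\):

\[\text{WER} = \frac{1 + 1 + 0}{6} = \frac{2}{6} \approx 0.33\]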
9.12 Interpreting WER
| WER Range | Interpretation |
|---|---|
| 0.00 | Perfect match — hypothesis is identical to the reference |
| 0.05–0.15 | Excellent — suitable for most research and applied tasks |
| 0.15–0.30 | Acceptable — some errors, may need manual review |
| > 0.30 | Poor — substantial errors, caution advised |
WER can exceed 1.0 (i.e., 100%) if the hypothesis contains many extra insertions. A lower WER always indicates better performance.
9.12.1 The Levenshtein Algorithm
WER is computed via the Levenshtein (edit) distance at the word level. The algorithm fills in a dynamic programming matrix \(d\) of size \((|R|+1) \times (|H|+1)\), where \(|R|\) and \(|H|\) are the number of words in the reference and hypothesis respectively.
The recurrence relation is:
\[d[i][j] = \min \begin{cases} d[i-1][j] + 1 & \text{(deletion)} \\ d[i][j-1] + 1 & \text{(insertion)} \\ d[i-1][j-1] + \text{cost}(i,j) & \text{(substitution or match)} \end{cases}\]
where \(\text{cost}(i,j) = 0\) if \(R_i = H_j\) (the words match) and \(1\) otherwise.
The final WER is \(d[|R|][|H|] \,/\, \max(|R|, 1)\).
The jiwer library (installed below) provides a fast, battle-tested WER implementation. The custom implementation in this section is included for pedagogical purposes: seeing the matrix construction step-by-step makes the algorithm transparent and verifiable.
In production, prefer jiwer.wer(reference, hypothesis) for speed and correctness.
9.12.2 Step 1 — Install jiwer
!pip install jiwer
9.13 About jiwer
jiwer is a lightweight Python package for computing common ASR evaluation metrics: WER, Match Error Rate (MER), Word Information Lost (WIL), and Character Error Rate (CER). It also provides optional text-normalisation transforms (such as lowercasing and punctuation removal) and is widely used for ASR benchmarking in Python.
9.13.1 Step 2 — Import Libraries and Load Data
# Environment: Python 3.9.16
# pip install jiwer (already installed above)
import pandas as pd
from jiwer import wer
# Load the transcription data
data = pd.read_csv('(Supplementary)_02_Input_Transcript.csv')
print("Columns in DataFrame:", data.columns.tolist())
9.14 Input File Format
The input CSV ((Supplementary)_02_Input_Transcript.csv) must contain at least the following columns:
| Column | Content |
|---|---|
| STANDARD_TRANSCRIPT | The gold-standard human-verified reference transcript |
| vosk-model-small-en-us-0.15 | ASR output from the Vosk small model |
| Wav2vec2 | ASR output from Facebook’s Wav2Vec 2.0 |
| HuBERT | ASR output from Facebook’s HuBERT model |
| Whisper_Base-En | ASR output from Whisper Base (English) |
| Whisper_Large-v3-EN | ASR output from Whisper Large-v3 (English) |
| Azure | ASR output from Microsoft Azure Speech |
Each row represents one audio recording. The STANDARD_TRANSCRIPT column is the reference against which all other columns are evaluated.
pd.read_csv() loads the file into a pandas DataFrame — a tabular in-memory data structure where each column is a named series and each row is one observation.
9.14.1 Step 3 — Specify ASR Columns to Evaluate
# Define the list of ASR transcription columns to compare
asr_columns = [
    "STANDARD_TRANSCRIPT", "vosk-model-small-en-us-0.15", "Wav2vec2",
    "HuBERT", "Whisper_Base-En", "Whisper_Large-v3-EN", "Azure"
]
9.15 Why include STANDARD_TRANSCRIPT in the list?
The list includes the reference column itself. The for-loop below skips it explicitly (column != 'STANDARD_TRANSCRIPT'), so no WER is computed for the reference against itself. Keeping it in the list provides a single place to inspect all column names and makes it easy to verify that the reference column is present in the DataFrame before the loop starts.
9.15.1 Step 4 — Implement the Custom WER Function
# Custom WER function using the Levenshtein (edit distance) algorithm
def calculate_wer(reference, hypothesis):
    # Return None if either value is missing (NaN or None)
    if pd.isnull(reference) or pd.isnull(hypothesis):
        return None
    ref_words = str(reference).strip().split()   # tokenise reference into words
    hyp_words = str(hypothesis).strip().split()  # tokenise hypothesis into words
    r_len = len(ref_words)
    h_len = len(hyp_words)
    # ── Initialise the (r_len+1) × (h_len+1) edit distance matrix ──────────
    d = [[0] * (h_len + 1) for _ in range(r_len + 1)]
    # Base cases: an empty hypothesis needs i deletions,
    # an empty reference needs j insertions
    for i in range(r_len + 1):
        d[i][0] = i  # delete all reference words
    for j in range(h_len + 1):
        d[0][j] = j  # insert all hypothesis words
    # ── Fill the matrix ─────────────────────────────────────────────────────
    for i in range(1, r_len + 1):
        for j in range(1, h_len + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution (cost=0 if words match)
            )
    # ── Compute WER ─────────────────────────────────────────────────────────
    wer_result = d[r_len][h_len] / max(r_len, 1)  # guard against empty reference
    return wer_result
9.16 Walking Through the Algorithm
Tokenisation
ref_words = str(reference).strip().split()
hyp_words = str(hypothesis).strip().split()
str().strip().split() converts the transcript to a string, removes leading/trailing whitespace, and splits on any whitespace, yielding a list of individual words. This is a naïve tokeniser: it is case-sensitive and does not remove punctuation. For cleaner WER values, pre-process the text (lowercase, strip punctuation) before calling the function.
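A minimal pre-processing step of this kind could look like the sketch below. The function name `normalise` is hypothetical, and the rule set (lowercase, drop ASCII punctuation, collapse whitespace) is one simple choice among many; it is not what jiwer or any benchmark prescribes.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalise(text):
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    cleaned = str(text).lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())


print(normalise("Hello, World!  It's 9 a.m."))  # prints: hello world its 9 am
```

Applying the same function to both the reference and the hypothesis keeps the comparison fair. Note that this simple rule also removes apostrophes and hyphens, so "it's" becomes "its" and "well-known" becomes "wellknown".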
Matrix initialisation
"" w1 w2 … (hypothesis words)
"" [ 0, 1, 2, … ]
r1 [ 1, ?, ?, … ]
r2 [ 2, ?, ?, … ]
…
d[i][0] = i means: turning \(i\) reference words into an empty hypothesis requires \(i\) deletions. d[0][j] = j means: turning an empty reference into \(j\) hypothesis words requires \(j\) insertions.
The three operations at each cell d[i][j]
| Operation | Cost | Meaning |
|---|---|---|
| d[i-1][j] + 1 | +1 deletion | Reference word \(R_i\) was dropped |
| d[i][j-1] + 1 | +1 insertion | Hypothesis word \(H_j\) was added |
| d[i-1][j-1] + cost | +0 (match) or +1 (subst.) | Words are the same or swapped |
The algorithm picks the minimum of these three options, ensuring the final value in d[r_len][h_len] is the smallest number of edits needed.
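The matrix stores only the total number of edits, but the separate counts of substitutions, deletions, and insertions can be recovered by walking back from the bottom-right cell and checking which of the three options produced each value. The sketch below illustrates this; `count_edit_ops` is a hypothetical helper, and when several minimal alignments exist it reports just one of them (it prefers the diagonal step).

```python
def count_edit_ops(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for one minimal alignment."""
    r_len, h_len = len(ref_words), len(hyp_words)
    # Build the edit-distance matrix exactly as in the recurrence.
    d = [[0] * (h_len + 1) for _ in range(r_len + 1)]
    for i in range(r_len + 1):
        d[i][0] = i
    for j in range(h_len + 1):
        d[0][j] = j
    for i in range(1, r_len + 1):
        for j in range(1, h_len + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, recording which option was taken.
    subs = dels = ins = 0
    i, j = r_len, h_len
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                subs += cost  # diagonal step: match (0) or substitution (1)
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1  # step up: reference word missing from the hypothesis
            i -= 1
        else:
            ins += 1   # step left: extra word in the hypothesis
            j -= 1
    return subs, dels, ins


# Made-up sentence pair for illustration.
ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
s, d_count, i_count = count_edit_ops(ref, hyp)
print(s, d_count, i_count, (s + d_count + i_count) / len(ref))
```

For this pair the walk-back finds one substitution and one deletion, which gives a WER of 2/6. Reporting the three counts separately helps distinguish a model that drops words from one that hallucinates extra ones.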
Guard against empty reference
wer_result = d[r_len][h_len] / max(r_len, 1)
If the reference is empty (r_len == 0), dividing by zero would raise a ZeroDivisionError. max(r_len, 1) returns 1 in that edge case, yielding a defined (though uninformative) result.
9.16.1 Step 5 — Apply WER Across All ASR Columns
# Calculate WER for each ASR column and add results as new columns
for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        wer_column_name = f"WER_{column}"
        data[wer_column_name] = data.apply(
            lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
            axis=1
        )
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")
# Save the updated DataFrame to a new CSV file
data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)
# Preview the first 20 rows
print(data.head(20))
# End.
9.17 Code Breakdown
The for-loop
for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        ...
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")
The loop skips STANDARD_TRANSCRIPT (the reference) and also prints a warning if a listed column is absent from the DataFrame, a defensive check that catches typos in column names early.
data.apply() with axis=1
data[wer_column_name] = data.apply(
    lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
    axis=1
)
DataFrame.apply(..., axis=1) calls the given function once per row, passing the entire row as a Series. Here, the lambda extracts the reference and hypothesis values for that row and passes them to calculate_wer. The result is assigned back to a new column named WER_<column>.
Dynamic column naming
wer_column_name = f"WER_{column}"
For example, processing the column Whisper_Large-v3-EN creates a new column WER_Whisper_Large-v3-EN. This naming convention keeps WER columns clearly paired with their source transcription columns.
Output file
data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)
index=False omits the auto-generated row numbers (0, 1, 2, …) from the CSV file, keeping the output clean for subsequent use in Excel or R.
9.17.1 Expected Output Structure
After the loop finishes, the DataFrame gains one new WER column for each ASR system:
| Filename | STANDARD_TRANSCRIPT | Whisper_Large-v3-EN | … | WER_Whisper_Large-v3-EN | WER_Whisper_Base-En | … |
|---|---|---|---|---|---|---|
| rec_001 | the cat sat … | the cat sat … | … | 0.00 | 0.08 | … |
| rec_002 | she sells shells … | she sells shells … | … | 0.05 | 0.20 | … |
Lower WER values indicate closer agreement with the human reference transcript.
To get a single summary WER per model, compute the mean of each WER column:
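wer_columns = [c for c in data.columns if c.startswith("WER_")]
summary = data[wer_columns].mean().sort_values()
print(summary)
This gives a ranked list of models from best (lowest mean WER) to worst. Note that this is an average of per-recording WERs, so a short recording weighs as much as a long one; benchmark results are often reported as a corpus-level WER instead, computed as the total number of errors divided by the total number of reference words.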
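Averaging per-recording WERs is not the only way to summarise a model. A common alternative is a corpus-level WER, where the edit counts of all recordings are summed and divided by the total number of reference words, so long recordings carry more weight. The sketch below contrasts the two on made-up data; the helper names are hypothetical and the edit distance is a compact re-implementation written for this example.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using two rolling rows."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_and_corpus_wer(pairs):
    """Return (mean of per-recording WERs, corpus-level WER) for (reference, hypothesis) pairs."""
    per_recording = []
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words, hyp_words = reference.split(), hypothesis.split()
        errors = edit_distance(ref_words, hyp_words)
        per_recording.append(errors / max(len(ref_words), 1))
        total_errors += errors
        total_ref_words += len(ref_words)
    mean_wer = sum(per_recording) / max(len(per_recording), 1)
    corpus_wer = total_errors / max(total_ref_words, 1)
    return mean_wer, corpus_wer


# Hypothetical data: one long, perfect transcript and one short, wrong one.
pairs = [("a b c d e f g h i", "a b c d e f g h i"), ("yes", "no")]
print(mean_and_corpus_wer(pairs))
```

Here the mean of per-recording WERs is 0.5, while the corpus-level WER is 0.1, because the single wrong word is measured against ten reference words in total. Whichever summary is reported, it is worth stating which one it is.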
9.18 Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| CUDA out of memory | GPU VRAM exceeded | Switch to a smaller model such as whisper-medium or whisper-small; float16 is already set |
| No module named transformers | Package not installed | Re-run the !pip install cell |
| Drive path not found | Wrong path string | Check Drive file browser; confirm folder exists and path matches exactly |
| Empty transcription output | Silent or corrupt audio | Inspect the file manually; Whisper may return empty string for silent audio |
| Very slow inference on CPU | No GPU runtime selected | Runtime → Change runtime type → GPU (T4) |
| KeyError: 'Transcriptions' | Sheet name mismatch | Open the Excel file and confirm the sheet is named “Transcriptions” |
| WER > 1.0 for some rows | Many insertions in hypothesis | Expected behaviour: WER has no upper bound; check for hallucination |
| None values in WER columns | Missing transcript (NaN) | Inspect rows where STANDARD_TRANSCRIPT or the ASR column is empty |
| Column not found warning | Typo in asr_columns list | Check data.columns.tolist() output and correct the column name |
9.19 Summary
In this tutorial you learned how to:
- Install and configure the HuggingFace transformers library in Google Colab
- Mount Google Drive and define input/output paths
- Load Whisper-Large-v3 and Whisper-Base as independent pipelines
- Define a reusable transcribe_audio_files() function that safely creates or appends to an Excel file
- Run each model separately and save results to its own dedicated .xlsx file
- Understand the WER metric and its mathematical foundation (Levenshtein distance)
- Implement a custom WER function using dynamic programming
- Apply WER across multiple ASR columns and export results to a structured CSV
Possible next steps:
- Aggregate WER reporting: Use data[wer_cols].mean() to produce a summary table ranking all models by accuracy.
- Text normalisation: Lowercase transcripts and strip punctuation before WER calculation to avoid penalising capitalisation differences.
- Character Error Rate (CER): Use jiwer.cer() for languages without clear word boundaries (e.g., Mandarin, Japanese).
- Forced alignment: Use whisperx for word-level timestamps aligned to the audio waveform.
- Custom vocabulary / prompting: Bias Whisper toward domain-specific vocabulary with a text prompt. The openai-whisper package exposes this as initial_prompt; with the transformers pipeline, the equivalent is to pass prompt_ids (built with the processor's get_prompt_ids()) in generate_kwargs.
- Gradio demo: Wrap the pipeline in a Gradio interface for interactive transcription without writing code.
9.20 References
- OpenAI Whisper GitHub: https://github.com/openai/whisper
- HuggingFace Whisper model page: https://huggingface.co/openai/whisper-large-v3
- HuggingFace pipeline documentation: https://huggingface.co/docs/transformers/main_classes/pipelines
- openpyxl documentation: https://openpyxl.readthedocs.io
- jiwer documentation: https://jitsi.github.io/jiwer/