9  Automatic Speech Recognition with Whisper

Overview

This tutorial walks you through automatic speech recognition (ASR) using OpenAI’s Whisper models via the HuggingFace transformers library. By the end of this guide, you will be able to:

  • Understand what Whisper is and why it matters for language research
  • Load Whisper models of different sizes using HuggingFace pipeline
  • Transcribe a batch of audio files stored in Google Drive
  • Save the resulting transcriptions into a structured Excel spreadsheet

9.1 Prerequisites

Before starting, make sure you have:

  • A Google account with Google Colab and Google Drive access
  • Audio files (.mp3 or .wav) uploaded to your Google Drive
  • Basic familiarity with Python syntax

9.2 Introduction to Whisper

9.2.1 What is Whisper?

Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in 2022 and continuously improved through subsequent versions. It is trained on a massive dataset of 680,000 hours of multilingual speech from the internet, making it one of the most robust and versatile ASR systems available.

Key characteristics of Whisper:

Feature | Description
Multilingual | Supports 99 languages out of the box
Multitask | Can transcribe, translate, and detect language
Robust | Handles accents, noise, and code-switching well
Open-source | Freely available on GitHub and HuggingFace Hub

9.2.2 Whisper v3: What’s New?

Whisper large-v3 is the latest generation of the model family. Notable improvements include:

  • Better multilingual performance, including mid-sentence language switching (e.g., shifting between English, Spanish, and Chinese within the same utterance)
  • Reduced hallucination on silent or non-speech segments
  • Higher accuracy across a wider range of accents and recording conditions
  • Full integration with the HuggingFace transformers library, making migration from older versions seamless

Why Use HuggingFace?

HuggingFace transformers provides a unified API (pipeline) to load and run state-of-the-art models — including Whisper — with just a few lines of code. It also handles device management (CPU vs. GPU) and batching automatically.

9.2.3 The Whisper Model Family

Whisper comes in multiple sizes, each trading off accuracy for speed and memory:

Model | Parameters | Relative Speed | Best Use Case
whisper-tiny | 39 M | ~32× | Quick prototyping
whisper-base | 74 M | ~16× | Lightweight batch jobs
whisper-small | 244 M | ~6× | General use (balanced)
whisper-medium | 769 M | ~2× | High quality on limited GPU
whisper-large-v3 | 1550 M | 1× (baseline) | Best accuracy, research use

In this tutorial we demonstrate both whisper-large-v3 (highest accuracy) and whisper-base (fastest inference) so you can compare them directly. Each model writes its transcription results to its own Excel file, allowing you to inspect and compare outputs independently.
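
To get a feel for the memory side of this trade-off, the sketch below estimates how much memory the weights alone occupy for each model, using the parameter counts from the table. The function name and the bytes-per-parameter figures (4 for float32, 2 for float16) are illustrative assumptions; real GPU usage is higher because activations and decoding buffers are not counted.

```python
# Parameter counts (in millions) copied from the model table.
PARAMS_MILLIONS = {
    "whisper-tiny": 39,
    "whisper-base": 74,
    "whisper-small": 244,
    "whisper-medium": 769,
    "whisper-large-v3": 1550,
}


def weight_memory_gb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


# Print a float32 vs. float16 estimate for every model size.
for name, millions in PARAMS_MILLIONS.items():
    fp32 = weight_memory_gb(millions, bytes_per_param=4)
    fp16 = weight_memory_gb(millions, bytes_per_param=2)
    print(f"{name:<18} float32 ~{fp32:.2f} GB   float16 ~{fp16:.2f} GB")
```

The estimate is only a lower bound, but it shows why half precision matters most for the largest model.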


9.3 Environment Setup

9.3.1 Step 1 — Install Required Packages

All dependencies are installed via pip. The %%capture magic suppresses verbose installation output in Colab.

%%capture
# Install necessary packages
!pip install git+https://github.com/huggingface/transformers gradio openpyxl
!pip install gdown
!pip install torch

9.4 Package Descriptions

Package | Purpose
transformers | HuggingFace library that provides the Whisper pipeline
gradio | (Optional) Build quick web demos around your models
openpyxl | Read and write .xlsx Excel files from Python
gdown | Download files from Google Drive by URL

We install transformers directly from GitHub to get the latest development version. Recent stable releases also include Whisper large-v3 support, so pip install transformers is normally sufficient and gives a more reproducible environment.
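
Because the installed versions depend on when the cell is run, it can help to record them alongside your results. The following is a minimal sketch using only the standard library; the helper name installed_version is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this tutorial relies on.
for package in ("transformers", "torch", "openpyxl", "gdown"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

A line reading "not installed" means the install cell has not run in the current session.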

9.4.1 Step 2 — Import Libraries

# Import necessary libraries
import os

import gdown
import torch
from openpyxl import Workbook, load_workbook
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

9.5 What Each Import Does

  • torch — PyTorch, the deep learning framework Whisper runs on. We use it to specify the computation device (GPU) and data type (float16).
  • pipeline — The high-level HuggingFace API. It wraps model loading, preprocessing, inference, and postprocessing into one callable object.
  • gdown — A helper for downloading files from Google Drive sharing links.
  • os — Standard Python module for file and directory operations (listing files, joining paths, checking existence).
  • Workbook, load_workbook — From openpyxl; used to create new Excel files or open existing ones for appending rows.
  • AutoModelForSpeechSeq2Seq, AutoProcessor — Lower-level HuggingFace classes. They are imported here for completeness; the pipeline wrapper uses them internally.
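
The device and data type mentioned for torch can be chosen at run time, so that a notebook without a GPU falls back to the CPU rather than failing. The sketch below isolates that decision in a small helper; the name pick_device_and_dtype and the choice of float32 on CPU (where half precision is often slow or unsupported) are assumptions, not part of the original notebook.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple:
    """Return (device, dtype_name): float16 on the first GPU, float32 on CPU."""
    if cuda_available:
        return "cuda:0", "float16"
    return "cpu", "float32"


# Show both outcomes of the decision.
print(pick_device_and_dtype(True))
print(pick_device_and_dtype(False))
```

In the notebook, the flag would come from torch.cuda.is_available(), and the dtype name could be turned into a torch dtype with getattr(torch, dtype_name) before being passed to pipeline.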

9.6 Connecting to Google Drive

In Google Colab, your files live in the cloud. Mounting Google Drive gives the notebook read/write access to your personal Drive storage.

# Authenticate and mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

When you run this cell:

  1. A pop-up (or link) will appear asking you to sign in with your Google account.
  2. After authentication, your Drive becomes accessible at the path /content/drive/MyDrive/.

Drive Mounting in Practice

  • You must re-mount Drive every time you start a new Colab session (sessions do not persist).
  • The path /content/drive/MyDrive/ is equivalent to the root of your “My Drive” folder as seen in the Google Drive web interface.
  • Make sure your audio files are already uploaded to Drive before running the transcription cells.

9.6.1 Define File Paths

# Define the path to the folder containing your audio files
input_audio_path = '/content/drive/MyDrive/Teaching/HY_S26_EngTech/audio_files'

Adapting the Path to Your Setup

Replace the path above with the actual folder on your Google Drive that contains your audio files. For example:

input_audio_path = '/content/drive/MyDrive/my_project/recordings'

The folder should contain .mp3 or .wav files. Other formats (.m4a, .flac, .ogg) can be added by modifying the file-listing filter in the transcription function below.
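
One way to make the extension filter easy to extend is to pass the accepted extensions as a parameter. The helper below is a sketch under that assumption: list_audio_files is a hypothetical name, matching is case-insensitive (so REC.WAV is found), and the result is sorted so that rows appear in a stable order.

```python
from pathlib import Path


def list_audio_files(folder, extensions=(".mp3", ".wav")):
    """Return sorted audio file names in `folder`, matching extensions case-insensitively."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, list_audio_files(input_audio_path, extensions=(".mp3", ".wav", ".m4a", ".flac")) would also pick up .m4a and .flac recordings. The explicit FileNotFoundError surfaces a mistyped Drive path before any model is loaded.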


9.7 Transcription: Whisper-Large-v3

This section runs the highest-accuracy Whisper model and saves all transcriptions to a dedicated Excel file. Each row corresponds to one audio file; the second column holds the model’s transcription.

The output spreadsheet will have the following structure:

Filename | Whisper-Large-v3
recording_001.wav | large-v3 transcript
recording_002.mp3 | large-v3 transcript

9.7.1 Step 1 — Set Output Path and Load the Model

# Define the folder path in Google Drive
model_version = 'whisper-large-v3'
output_excel_file_path = f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/Output_{model_version}_Phoneme_Transcript_183.xlsx'

# Load Whisper model using the high level `pipeline` from the `transformers` library
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0"
)

9.8 Key Parameters

Parameter | Value | Purpose
torch_dtype | torch.float16 | Half-precision arithmetic; roughly halves the memory needed for the model weights
device | "cuda:0" | Runs inference on the first available GPU

Model String vs. Model Version Label

model_version is a human-readable label used only to build the output filename. The actual model loaded by pipeline is controlled by the string "openai/whisper-large-v3", which is the official HuggingFace model identifier.
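
Since the label and the model identifier are set independently, they can drift apart when a cell is edited. One possible safeguard is to derive everything from a single variable, as sketched below. The helper build_run_config, the simplified output filename, and the column header format are all hypothetical and are not the names used in this tutorial.

```python
def build_run_config(model_version: str, output_dir: str) -> dict:
    """Derive the model identifier, output path, and column header from one label."""
    return {
        # Identifier passed to pipeline(), e.g. "openai/whisper-base".
        "model_id": f"openai/{model_version}",
        # Hypothetical, simplified output filename for illustration.
        "output_path": f"{output_dir.rstrip('/')}/Output_{model_version}.xlsx",
        # Header for the transcription column in the spreadsheet.
        "column_header": model_version,
    }


config = build_run_config("whisper-base", "/content/drive/MyDrive/my_project")
print(config["model_id"])
print(config["output_path"])
```

With this arrangement, switching models means changing one string rather than three.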

9.8.1 Step 2 — Define the Transcription Function

def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    audio_files = sorted(
        f for f in os.listdir(folder_path)
        if f.lower().endswith(('.mp3', '.wav'))  # case-insensitive match
    )

    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Large-v3"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]

    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # chunk_length_s=30 splits long audio into chunks matching Whisper's 30-second window.
        # stride_length_s=5 adds overlapping context around chunks to avoid cutting words at boundaries.
        # return_timestamps=True makes Whisper predict timestamps, which long-form decoding relies on.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])

    workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")

9.9 Key Design Decisions

Conditional workbook creation

if not os.path.exists(excel_file_path):
    workbook = Workbook()          # create a brand-new file
    ...
else:
    workbook = load_workbook(...)  # open the existing file

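This guard prevents an existing file from being overwritten when the cell is re-run (e.g., after a kernel restart): new rows are appended to the existing sheet, so results saved by an earlier run are kept. Two caveats follow from the code as written. First, workbook.save() is called only once, after the loop, so a run that is interrupted midway does not keep the files transcribed so far. Second, re-running on the same folder transcribes every file again and appends duplicate rows.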
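
One way to make re-runs resumable is to skip files that already have a row in the sheet. The sketch below shows only the filtering step in plain Python; files_still_to_do is a hypothetical helper, and existing_rows stands for (filename, transcription) tuples read from the spreadsheet.

```python
def files_still_to_do(audio_files, existing_rows):
    """
    Return the audio files that do not yet have a row in the sheet.

    audio_files   : iterable of file names found in the audio folder
    existing_rows : iterable of (filename, transcription) tuples already saved
    """
    done = {row[0] for row in existing_rows if row and row[0] is not None}
    return [name for name in audio_files if name not in done]


existing_rows = [("recording_001.wav", "hello world")]
audio_files = ["recording_001.wav", "recording_002.mp3"]
print(files_still_to_do(audio_files, existing_rows))
```

With openpyxl, existing_rows could come from sheet.iter_rows(min_row=2, values_only=True), which skips the header row. Calling workbook.save(excel_file_path) after each sheet.append(...) would then keep partial progress if the session is interrupted.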


return_timestamps=True and chunking parameters

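These settings matter for audio files longer than 30 seconds. Whisper processes audio in 30 s windows, so a longer recording has to be split up. Setting chunk_length_s makes the pipeline cut the audio into overlapping chunks, transcribe each one, and merge the results. Without chunking or timestamps, the outcome depends on the transformers version: older releases may transcribe only the first 30 s, while newer ones may raise an error asking for return_timestamps=True.

Parameter | Value | Purpose
return_timestamps | True | Makes Whisper predict segment timestamps, which long-form transcription relies on
chunk_length_s | 30 | Splits the audio into chunks matching Whisper’s 30 s window
stride_length_s | 5 | Adds 5 s of overlapping context at chunk edges so words are not cut at boundaries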
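
To visualise what chunking with a stride does, the sketch below lists the time windows that would cover a recording. It is a simplified illustration, not the library's own code: chunk_windows is a hypothetical helper, and it assumes the stride applies on both sides of each chunk (as the transformers documentation describes), so consecutive windows advance by chunk_length_s minus twice the stride.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    duration_s = float(duration_s)
    step = chunk_length_s - 2 * stride_length_s  # how far each new chunk advances
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the last chunk reached the end of the audio
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows.
for start, end in chunk_windows(70):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

Every boundary between chunks falls inside a region that a neighbouring chunk also sees, which is what lets the pipeline stitch the pieces together without losing words.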

9.9.1 Step 3 — Run the Transcription

# Run the transcription function for Whisper-Large-v3
transcribe_audio_files(input_audio_path, output_excel_file_path)

When this cell finishes, an Excel file will appear at output_excel_file_path in your Google Drive containing the Whisper-Large-v3 transcriptions.


9.10 Transcription: Whisper-Base

This section runs the fastest Whisper model and saves all transcriptions to a separate Excel file. The workflow is identical to the large-v3 section; only the model identifier and output filename change.

The output spreadsheet will have the following structure:

Filename | Whisper-Base
recording_001.wav | base transcript
recording_002.mp3 | base transcript

9.10.1 Step 1 — Set Output Path and Load the Model

# Set the model version label and derive the output file path automatically
model_version = 'whisper-base'
output_excel_file_path = (
    f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/'
    f'Output_{model_version}_Phoneme_Transcript_183.xlsx'
)

# Load Whisper-Base using the HuggingFace high-level pipeline API
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-base",
    torch_dtype=torch.float16,
    device="cuda:0"
)

9.10.2 Step 2 — Define the Transcription Function

def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    audio_files = sorted(
        f for f in os.listdir(folder_path)
        if f.lower().endswith(('.mp3', '.wav'))  # case-insensitive match
    )

    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Base"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]

    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # For long audio, return_timestamps=True and chunking parameters are necessary.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])

    workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")

9.10.3 Step 3 — Run the Transcription

# Run the transcription function for Whisper-Base
transcribe_audio_files(input_audio_path, output_excel_file_path)
# End.

Whisper Base vs. Large-v3: Practical Trade-offs

Dimension | Whisper-Base | Whisper-Large-v3
Model size | ~74 M parameters | ~1.55 B parameters
Download size | ~150 MB | ~3 GB
Inference speed | ~16× faster than large | Baseline
Word Error Rate (WER) | Higher (less accurate) | Lower (more accurate)
Accent robustness | Moderate | High
Recommended for | Piloting, quick iteration | Research-grade output

For phoneme-level research or publication-quality transcriptions, Whisper-Large-v3 is strongly recommended. Use Whisper-Base for quick sanity-checks or when working under a tight time budget.
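
When inspecting the two spreadsheets side by side, it can help to see exactly where the two models disagree on a given recording. The following sketch uses the standard library's difflib for a word-level comparison; word_differences is a hypothetical helper, and the two example strings are made up.

```python
from difflib import SequenceMatcher


def word_differences(text_a: str, text_b: str):
    """Return (operation, words_in_a, words_in_b) for every span where the texts differ."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert spans
            differences.append((tag, words_a[a0:a1], words_b[b0:b1]))
    return differences


base_output = "the cat sat on the mat"     # made-up Whisper-Base transcript
large_output = "the cat sit on the mat"    # made-up Whisper-Large-v3 transcript
print(word_differences(base_output, large_output))
```

This only shows where the models differ, not which one is right; judging accuracy requires a human reference transcript.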


9.11 Word Error Rate (WER) Calculation

Once you have transcriptions from multiple ASR systems, the natural next question is: which model is more accurate? This section introduces Word Error Rate (WER) — the standard metric for evaluating ASR output — and walks through a step-by-step Python implementation that computes WER for each ASR system and appends the scores to the original data file.

9.11.1 What is Word Error Rate?

Word Error Rate (WER) measures how many word-level edits are needed to turn the reference (the human-verified transcript) into the hypothesis (the ASR output), expressed as a proportion of the total number of words in the reference.

\[\text{WER} = \frac{S + D + I}{N}\]

where:

Symbol | Meaning
\(S\) | Number of substitutions (wrong word)
\(D\) | Number of deletions (word in reference, missing in hypothesis)
\(I\) | Number of insertions (extra word in hypothesis not in reference)
\(N\) | Total number of words in the reference transcript
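
As a quick check on the formula, the sketch below plugs hand-counted edits into it for two made-up sentence pairs. The function name wer_from_counts is an assumption for illustration; the counts are worked out by hand in the comments.

```python
def wer_from_counts(substitutions, deletions, insertions, reference_words):
    """Apply WER = (S + D + I) / N to already-counted edits."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


# Reference:  "the cat sat on the mat"  (N = 6)
# Hypothesis: "the cat sit on mat"      (1 substitution: sat -> sit; 1 deletion: the)
print(round(wer_from_counts(1, 1, 0, 6), 3))

# Reference:  "hello"                  (N = 1)
# Hypothesis: "hello there my friend"  (3 insertions)
print(wer_from_counts(0, 0, 3, 1))
```

The second pair shows how a hypothesis much longer than its reference pushes WER above 1.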

9.12 Interpreting WER

WER Range | Interpretation
0.00 | Perfect match: hypothesis is identical to the reference
0.05–0.15 | Excellent: suitable for most research and applied tasks
0.15–0.30 | Acceptable: some errors, may need manual review
> 0.30 | Poor: substantial errors, caution advised

WER can exceed 1.0 (i.e., 100%) if the hypothesis contains many extra insertions. A lower WER always indicates better performance.

9.12.1 The Levenshtein Algorithm

WER is computed via the Levenshtein (edit) distance at the word level. The algorithm fills in a dynamic programming matrix \(d\) of size \((|R|+1) \times (|H|+1)\), where \(|R|\) and \(|H|\) are the number of words in the reference and hypothesis respectively.

The recurrence relation is:

\[d[i][j] = \min \begin{cases} d[i-1][j] + 1 & \text{(deletion)} \\ d[i][j-1] + 1 & \text{(insertion)} \\ d[i-1][j-1] + \text{cost}(i,j) & \text{(substitution or match)} \end{cases}\]

where \(\text{cost}(i,j) = 0\) if \(R_i = H_j\) (the words match) and \(1\) otherwise.

The final WER is \(d[|R|][|H|] \,/\, \max(|R|, 1)\).

Why implement it from scratch?

The jiwer library (installed below) provides a fast, battle-tested WER implementation. The custom implementation in this section is included for pedagogical purposes: seeing the matrix construction step-by-step makes the algorithm transparent and verifiable.

In production, prefer jiwer.wer(reference, hypothesis) for speed and correctness.


9.12.2 Step 1 — Install jiwer

!pip install jiwer

9.13 About jiwer

jiwer is a lightweight Python package for computing common ASR evaluation metrics: WER, Match Error Rate (MER), Word Information Lost (WIL), and Character Error Rate (CER). It also provides optional text-normalisation transforms (e.g., lowercasing, punctuation removal) and is widely used for ASR benchmarking in Python.


9.13.1 Step 2 — Import Libraries and Load Data

# Environment: Python 3.9.16
# pip install jiwer (already installed above)
import pandas as pd
from jiwer import wer

# Load the transcription data
data = pd.read_csv('(Supplementary)_02_Input_Transcript.csv')
print("Columns in DataFrame:", data.columns.tolist())

9.14 Input File Format

The input CSV ((Supplementary)_02_Input_Transcript.csv) must contain at least the following columns:

Column | Content
STANDARD_TRANSCRIPT | The gold-standard human-verified reference transcript
vosk-model-small-en-us-0.15 | ASR output from the Vosk small model
Wav2vec2 | ASR output from Facebook’s Wav2Vec 2.0
HuBERT | ASR output from Facebook’s HuBERT model
Whisper_Base-En | ASR output from Whisper Base (English)
Whisper_Large-v3-EN | ASR output from Whisper Large-v3 (English)
Azure | ASR output from Microsoft Azure Speech

Each row represents one audio recording. The STANDARD_TRANSCRIPT column is the reference against which all other columns are evaluated.

pd.read_csv() loads the file into a pandas DataFrame — a tabular in-memory data structure where each column is a named series and each row is one observation.


9.14.1 Step 3 — Specify ASR Columns to Evaluate

# Define the list of ASR transcription columns to compare
asr_columns = [
    "STANDARD_TRANSCRIPT", "vosk-model-small-en-us-0.15", "Wav2vec2",
    "HuBERT", "Whisper_Base-En", "Whisper_Large-v3-EN", "Azure"
]

9.15 Why include STANDARD_TRANSCRIPT in the list?

The list includes the reference column itself. Both branches of the for-loop below skip it explicitly (column != 'STANDARD_TRANSCRIPT'), so it is never scored against itself. Keeping it in the list provides a single place to inspect all column names. Note that the loop does not check that the reference column exists: if STANDARD_TRANSCRIPT is missing from the DataFrame, the row['STANDARD_TRANSCRIPT'] lookup raises a KeyError.
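
A simple way to confirm that every listed column, including the reference, is present before any scoring starts is to compare the list against the available column names. The sketch below does this with plain lists; missing_columns is a hypothetical helper, and the example column names are a shortened, made-up subset.

```python
def missing_columns(required, available):
    """Return the names in `required` that are not among `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


required = ["STANDARD_TRANSCRIPT", "Wav2vec2", "HuBERT"]
available = ["Filename", "STANDARD_TRANSCRIPT", "HuBERT"]  # e.g. data.columns
print(missing_columns(required, available))
```

In the notebook, passing asr_columns and data.columns to this helper and stopping when the result is non-empty would turn a late KeyError into an early, readable message.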


9.15.1 Step 4 — Implement the Custom WER Function

# Custom WER function using the Levenshtein (edit distance) algorithm
def calculate_wer(reference, hypothesis):
    # Return None if either value is missing (NaN or None)
    if pd.isnull(reference) or pd.isnull(hypothesis):
        return None

    ref_words = str(reference).strip().split()   # tokenise reference into words
    hyp_words = str(hypothesis).strip().split()  # tokenise hypothesis into words

    r_len = len(ref_words)
    h_len = len(hyp_words)

    # ── Initialise the (r_len+1) × (h_len+1) edit distance matrix ──────────
    d = [[0] * (h_len + 1) for _ in range(r_len + 1)]

    # Base cases: transforming N words into 0 words requires N deletions
    for i in range(r_len + 1):
        d[i][0] = i  # delete all reference words
    for j in range(h_len + 1):
        d[0][j] = j  # insert all hypothesis words

    # ── Fill the matrix ─────────────────────────────────────────────────────
    for i in range(1, r_len + 1):
        for j in range(1, h_len + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j]     + 1,    # deletion
                d[i][j - 1]     + 1,    # insertion
                d[i - 1][j - 1] + cost  # substitution (cost=0 if words match)
            )

    # ── Compute WER ─────────────────────────────────────────────────────────
    wer_result = d[r_len][h_len] / max(r_len, 1)  # guard against empty reference
    return wer_result

9.16 Walking Through the Algorithm

Tokenisation

ref_words = str(reference).strip().split()
hyp_words = str(hypothesis).strip().split()

str().strip().split() converts the transcript to a string, removes leading/trailing whitespace, and splits on any whitespace — yielding a list of individual words. This is a naïve tokeniser: it is case-sensitive and does not remove punctuation. For cleaner WER values, pre-process the text (lowercase, strip punctuation) before calling the function.
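
A minimal pre-processing step along these lines could look like the sketch below. The function normalise_text is an assumption for illustration, not part of the original notebook. It removes ASCII punctuation only, so curly quotes survive, and it merges contractions ("don't" becomes "dont"); whether that is acceptable depends on your transcription conventions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalise_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCTUATION_TABLE).split())


print(normalise_text("Hello,   World!"))
print(normalise_text("Don't stop."))
```

Applying the same normalisation to both the reference and the hypothesis before scoring keeps capitalisation and punctuation differences from counting as word errors.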


Matrix initialisation

       ""   w1   w2  … (hypothesis words)
  ""  [ 0,   1,   2, … ]
  r1  [ 1,   ?,   ?, … ]
  r2  [ 2,   ?,   ?, … ]
  …

d[i][0] = i means: turning \(i\) reference words into an empty hypothesis requires \(i\) deletions. d[0][j] = j means: turning an empty reference into \(j\) hypothesis words requires \(j\) insertions.


The three operations at each cell d[i][j]

Operation | Cost | Meaning
d[i-1][j] + 1 | +1 deletion | Reference word \(R_i\) was dropped
d[i][j-1] + 1 | +1 insertion | Hypothesis word \(H_j\) was added
d[i-1][j-1] + cost | +0 (match) or +1 (subst.) | Words are the same or swapped

The algorithm picks the minimum of these three options, ensuring the final value in d[r_len][h_len] is the smallest number of edits needed.


Guard against empty reference

wer_result = d[r_len][h_len] / max(r_len, 1)

If the reference is empty (r_len == 0), dividing by zero would raise a ZeroDivisionError. max(r_len, 1) returns 1 in that edge case, yielding a defined (though uninformative) result.
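
The matrix also makes it possible to recover how many of the edits were substitutions, deletions, and insertions, by walking back from the bottom-right cell to the top-left. The sketch below is one way to do this; edit_counts is a hypothetical helper, and when several minimal alignments exist it reports the one found by preferring the diagonal move, then deletion, then insertion.

```python
def edit_counts(reference: str, hypothesis: str) -> dict:
    """Count substitutions, deletions, and insertions along one minimal alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # Build the same (n+1) x (m+1) edit-distance matrix as calculate_wer.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from d[n][m] to d[0][0], classifying each step.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1              # match: no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1                        # substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1                        # deletion of a reference word
            i -= 1
        else:
            ins += 1                         # insertion of a hypothesis word
            j -= 1
    return {"substitutions": subs, "deletions": dels, "insertions": ins,
            "reference_words": n}


print(edit_counts("the cat sat on the mat", "the cat sit on mat"))
```

The three counts always sum to the value in d[r_len][h_len], so dividing their sum by the number of reference words gives the same WER as before.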


9.16.1 Step 5 — Apply WER Across All ASR Columns

# Calculate WER for each ASR column and add results as new columns
for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        wer_column_name = f"WER_{column}"
        data[wer_column_name] = data.apply(
            lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
            axis=1
        )
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")

# Save the updated DataFrame to a new CSV file
data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)

# Preview the first 20 rows
print(data.head(20))
# End.

9.17 Code Breakdown

The for-loop

for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        ...
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")

The loop skips STANDARD_TRANSCRIPT (the reference) and also prints a warning if a listed column is absent from the DataFrame — a defensive check that catches typos in column names early.


data.apply() with axis=1

data[wer_column_name] = data.apply(
    lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
    axis=1
)

DataFrame.apply(..., axis=1) calls the given function once per row, passing the entire row as a Series. Here, the lambda extracts the reference and hypothesis values for that row and passes them to calculate_wer. The result is assigned back to a new column named WER_<column>.


Dynamic column naming

wer_column_name = f"WER_{column}"

For example, processing the column Whisper_Large-v3-EN creates a new column WER_Whisper_Large-v3-EN. This naming convention keeps WER columns clearly paired with their source transcription columns.


Output file

data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)

index=False omits the auto-generated row numbers (0, 1, 2, …) from the CSV file, keeping the output clean for subsequent use in Excel or R.


9.17.1 Expected Output Structure

After the loop finishes, the DataFrame gains one new WER column for each ASR system:

Filename | STANDARD_TRANSCRIPT | Whisper_Large-v3-EN | WER_Whisper_Large-v3-EN | WER_Whisper_Base-En
rec_001 | the cat sat … | the cat sat … | 0.00 | 0.08
rec_002 | she sells shells … | she sells shells … | 0.05 | 0.20

Lower WER values indicate closer agreement with the human reference transcript.

Aggregating WER Across Files

To get a single summary WER per model, compute the mean of each WER column:

wer_columns = [c for c in data.columns if c.startswith("WER_")]
summary = data[wer_columns].mean().sort_values()
print(summary)

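This gives a ranked list of models from best (lowest mean WER) to worst. Note that the mean of per-file WERs weights every file equally regardless of its length; benchmark results are more commonly reported as corpus-level WER, i.e., total edits divided by total reference words across all files.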
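
An alternative to averaging per-file scores is corpus-level WER, which pools the edits and reference words of all files before dividing. The two can differ noticeably when file lengths vary, as the sketch below illustrates on two made-up pairs. Both function names are assumptions, and the edit distance is a compact single-row variant of the matrix approach shown earlier.

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, keeping only one matrix row at a time."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def mean_and_corpus_wer(pairs):
    """Return (mean of per-file WERs, corpus-level WER) for (reference, hypothesis) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    per_file, total_edits, total_words = [], 0, 0
    for reference, hypothesis in pairs:
        ref_words, hyp_words = reference.split(), hypothesis.split()
        edits = word_edit_distance(ref_words, hyp_words)
        per_file.append(edits / max(len(ref_words), 1))
        total_edits += edits
        total_words += len(ref_words)
    return sum(per_file) / len(per_file), total_edits / max(total_words, 1)


pairs = [
    ("a b c d e f g h i j", "a b c d e f g h i j"),  # 10 words, no errors
    ("yes", "no"),                                    # 1 word, 1 substitution
]
mean_wer, corpus_wer = mean_and_corpus_wer(pairs)
print(f"mean of per-file WER: {mean_wer:.3f}")
print(f"corpus-level WER:     {corpus_wer:.3f}")
```

Here one wrong word in a one-word file dominates the mean, while the corpus-level figure reflects that only 1 of 11 reference words was wrong. Which summary to report is a design choice worth stating explicitly.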


9.18 Troubleshooting

Problem | Likely Cause | Solution
CUDA out of memory | GPU VRAM exceeded | Switch to a smaller model such as whisper-medium or whisper-small; float16 is already set
No module named transformers | Package not installed | Re-run the !pip install cell
Drive path not found | Wrong path string | Check Drive file browser; confirm folder exists and path matches exactly
Empty transcription output | Silent or corrupt audio | Inspect the file manually; Whisper may return an empty string for silent audio
Very slow inference on CPU | No GPU runtime selected | Runtime → Change runtime type → GPU (T4)
KeyError: 'Transcriptions' | Sheet name mismatch | Open the Excel file and confirm the sheet is named “Transcriptions”
WER > 1.0 for some rows | Many insertions in hypothesis | Expected behaviour: WER has no upper bound; check for hallucination
None values in WER columns | Missing transcript (NaN) | Inspect rows where STANDARD_TRANSCRIPT or the ASR column is empty
Column not found warning | Typo in asr_columns list | Check data.columns.tolist() output and correct the column name

9.19 Summary

In this tutorial you learned how to:

  1. Install and configure the HuggingFace transformers library in Google Colab
  2. Mount Google Drive and define input/output paths
  3. Load Whisper-Large-v3 and Whisper-Base as independent pipelines
  4. Define a reusable transcribe_audio_files() function that safely creates or appends to an Excel file
  5. Run each model separately and save results to its own dedicated .xlsx file
  6. Understand the WER metric and its mathematical foundation (Levenshtein distance)
  7. Implement a custom WER function using dynamic programming
  8. Apply WER across multiple ASR columns and export results to a structured CSV

Next Steps

  • Aggregate WER reporting: Use data[wer_columns].mean() to produce a summary table ranking all models by accuracy.
  • Text normalisation: Lowercase transcripts and strip punctuation before WER calculation to avoid penalising capitalisation differences.
  • Character Error Rate (CER): Use jiwer.cer() for languages without clear word boundaries (e.g., Mandarin, Japanese).
  • Forced alignment: Use whisperx for word-level timestamps aligned to the audio waveform.
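  • Custom vocabulary / prompting: Bias Whisper toward domain-specific vocabulary with a text prompt. In transformers this means passing prompt_ids (built with processor.get_prompt_ids(...)) in generate_kwargs; initial_prompt is the name of the equivalent option in the openai-whisper package.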
  • Gradio demo: Wrap the pipeline in a Gradio interface for interactive transcription without writing code.

9.20 References