9 Automatic Speech Recognition with Whisper

Overview

This tutorial walks you through automatic speech recognition (ASR) using OpenAI’s Whisper models via the HuggingFace transformers library. By the end of this guide, you will be able to:

Understand what Whisper is and why it matters for language research
Load Whisper models of different sizes using the HuggingFace pipeline API
Transcribe a batch of audio files stored in Google Drive
Save the resulting transcriptions into a structured Excel spreadsheet

9.1 Prerequisites

Before starting, make sure you have:

A Google account with Google Colab and Google Drive access
Audio files (.mp3 or .wav) uploaded to your Google Drive
Basic familiarity with Python syntax

9.2 Introduction to Whisper

9.2.1 What is Whisper?

Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in 2022 and continuously improved through subsequent versions. It is trained on a massive dataset of 680,000 hours of multilingual speech from the internet, making it one of the most robust and versatile ASR systems available.

Key characteristics of Whisper:

Feature	Description
Multilingual	Supports 99 languages out of the box
Multitask	Can transcribe, translate, and detect language
Robust	Handles accents, noise, and code-switching well
Open-source	Freely available on GitHub and HuggingFace Hub

9.2.2 Whisper v3: What’s New?

Whisper large-v3 is the third release of the full-size Whisper model and the version used in this tutorial. Notable improvements include:

Better multilingual performance, including mid-sentence language switching (e.g., shifting between English, Spanish, and Chinese within the same utterance)
Reduced hallucination on silent or non-speech segments
Higher accuracy across a wider range of accents and recording conditions
Full integration with the HuggingFace transformers library, making migration from older versions seamless

Why Use HuggingFace?

HuggingFace transformers provides a unified API (pipeline) to load and run state-of-the-art models, including Whisper, with just a few lines of code. The same call also lets you choose where the model runs (CPU or GPU) through the device argument and process several inputs at once through the batch_size argument.

9.2.3 The Whisper Model Family

Whisper comes in multiple sizes, each trading off accuracy for speed and memory:

Model	Parameters	Relative Speed	Best Use Case
`whisper-tiny`	39 M	~32×	Quick prototyping
`whisper-base`	74 M	~16×	Lightweight batch jobs
`whisper-small`	244 M	~6×	General use (balanced)
`whisper-medium`	769 M	~2×	High quality on limited GPU
`whisper-large-v3`	1550 M	1× (baseline)	Best accuracy, research use

In this tutorial we demonstrate both whisper-large-v3 (highest accuracy) and whisper-base (much faster, but less accurate) so you can compare them directly. Each model writes its transcription results to its own Excel file, allowing you to inspect and compare outputs independently.

9.3 Environment Setup

9.3.1 Step 1 — Install Required Packages

All dependencies are installed via pip. The %%capture magic suppresses verbose installation output in Colab.

%%capture
# Install necessary packages
!pip install git+https://github.com/huggingface/transformers gradio openpyxl
!pip install gdown
!pip install torch

9.4 Package Descriptions

Package	Purpose
`transformers`	HuggingFace library that provides the Whisper pipeline
`gradio`	(Optional) Build quick web demos around your models
`openpyxl`	Read and write `.xlsx` Excel files from Python
`gdown`	Download files from Google Drive by URL
`torch`	PyTorch, the deep learning framework that runs the model

We install transformers directly from GitHub to get the latest development version. Recent stable releases also support Whisper large-v3, so pip install --upgrade transformers is normally sufficient and is less likely to break than the development branch.

9.4.1 Step 2 — Import Libraries

# Import necessary libraries
import os

import gdown
import torch
from openpyxl import Workbook, load_workbook
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

9.5 What Each Import Does

torch — PyTorch, the deep learning framework Whisper runs on. We use it to specify the computation device (GPU) and data type (float16).
pipeline — The high-level HuggingFace API. It wraps model loading, preprocessing, inference, and postprocessing into one callable object.
gdown — A helper for downloading files from Google Drive sharing links.
os — Standard Python module for file and directory operations (listing files, joining paths, checking existence).
Workbook, load_workbook — From openpyxl; used to create new Excel files or open existing ones for appending rows.
AutoModelForSpeechSeq2Seq, AutoProcessor — Lower-level HuggingFace classes. They are imported here for completeness; the pipeline wrapper uses them internally.

9.6 Connecting to Google Drive

In Google Colab, your files live in the cloud. Mounting Google Drive gives the notebook read/write access to your personal Drive storage.

# Authenticate and mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

When you run this cell:

A pop-up (or link) will appear asking you to sign in with your Google account.
After authentication, your Drive becomes accessible at the path /content/drive/MyDrive/.

Drive Mounting in Practice

You must re-mount Drive every time you start a new Colab session (sessions do not persist).
The path /content/drive/MyDrive/ is equivalent to the root of your “My Drive” folder as seen in the Google Drive web interface.
Make sure your audio files are already uploaded to Drive before running the transcription cells.

9.6.1 Define File Paths

# Define the path to the folder containing your audio files
input_audio_path = '/content/drive/MyDrive/Teaching/HY_S26_EngTech/audio_files'

Adapting the Path to Your Setup

Replace the path above with the actual folder on your Google Drive that contains your audio files. For example:

input_audio_path = '/content/drive/MyDrive/my_project/recordings'

The folder should contain .mp3 or .wav files. Other formats (.m4a, .flac, .ogg) can be added by modifying the file-listing filter in the transcription function below.
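The file-listing filter mentioned above can be sketched as a small helper. This is a minimal illustration rather than the tutorial's own function: the name list_audio_files and the AUDIO_EXTENSIONS tuple are assumptions introduced here, and whether a given format can actually be decoded depends on the audio backend (ffmpeg) available in your runtime.

```python
import os

# Hypothetical list of extensions to accept; extend or trim it as needed.
AUDIO_EXTENSIONS = ('.mp3', '.wav', '.m4a', '.flac', '.ogg')

def list_audio_files(folder_path, extensions=AUDIO_EXTENSIONS):
    """Return the sorted audio filenames found directly in folder_path."""
    # Fail early with a readable message if the Drive path is wrong.
    if not os.path.isdir(folder_path):
        raise FileNotFoundError(f"Folder not found: {folder_path!r}")
    # Compare lower-cased extensions so that .WAV and .Mp3 are matched too.
    return sorted(
        name for name in os.listdir(folder_path)
        if os.path.splitext(name)[1].lower() in extensions
    )
```

Calling list_audio_files(input_audio_path) before transcription is a quick way to confirm that Drive is mounted and that the folder contains the files you expect.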

9.7 Transcription: Whisper-Large-v3

This section runs the highest-accuracy Whisper model and saves all transcriptions to a dedicated Excel file. Each row corresponds to one audio file; the second column holds the model’s transcription.

The output spreadsheet will have the following structure:

Filename	Whisper-Large-v3
recording_001.wav	large-v3 transcript
recording_002.mp3	large-v3 transcript
…	…

9.7.1 Step 1 — Set Output Path and Load the Model

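# Label used only to build the output filename (it does not select the model)
model_version = 'whisper-large-v3'
output_excel_file_path = (
    f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/'
    f'Output_{model_version}_Phoneme_Transcript_183.xlsx'
)
# workbook.save() does not create missing folders, so make sure the folder exists
os.makedirs(os.path.dirname(output_excel_file_path), exist_ok=True)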

# Load Whisper model using the high level `pipeline` from the `transformers` library
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0"
)

9.8 Key Parameters

Parameter	Value	Purpose
`torch_dtype`	`torch.float16`	Half-precision arithmetic; roughly halves VRAM usage compared with float32
`device`	`"cuda:0"`	Runs inference on the first CUDA GPU; fails if no GPU runtime is active
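The effect of the data type on memory can be estimated from the parameter counts in the model-family table: float32 stores each parameter in 4 bytes and float16 in 2. The sketch below is a back-of-the-envelope calculation for the weights only (activations and decoding buffers need additional memory); the helper name weight_memory_gb is an assumption made for this illustration.

```python
# Parameter counts (in millions) copied from the model-family table.
PARAMS_MILLIONS = {
    "whisper-tiny": 39,
    "whisper-base": 74,
    "whisper-small": 244,
    "whisper-medium": 769,
    "whisper-large-v3": 1550,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(model_name, dtype="float16"):
    """Approximate size of the model weights in GB (weights only)."""
    n_params = PARAMS_MILLIONS[model_name] * 1_000_000
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name in PARAMS_MILLIONS:
    print(
        f"{name:18s} float32 ~ {weight_memory_gb(name, 'float32'):.2f} GB   "
        f"float16 ~ {weight_memory_gb(name, 'float16'):.2f} GB"
    )
```

Under these assumptions the large-v3 weights need roughly 3.1 GB in float16 and about twice that in float32, which is why half precision matters most for the largest model.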

Model String vs. Model Version Label

model_version is a human-readable label used only to build the output filename. The actual model loaded by pipeline is controlled by the string "openai/whisper-large-v3", which is the official HuggingFace model identifier.
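Because the label and the model identifier are set separately, they can drift apart (for example, a file named after whisper-base that actually contains large-v3 output). One way to avoid this is to derive both from a single label. The helper below is a hypothetical sketch, not part of the tutorial code; it assumes the label matches the model name on the Hub under the openai/ namespace, as it does for the two models used here.

```python
import os

def model_settings(model_version, output_dir):
    """Derive the Hub model identifier and the output path from one label."""
    # e.g. 'whisper-base' -> 'openai/whisper-base'
    model_id = f"openai/{model_version}"
    # Same filename pattern as used in this tutorial.
    output_path = os.path.join(
        output_dir, f"Output_{model_version}_Phoneme_Transcript_183.xlsx"
    )
    return model_id, output_path

model_id, output_path = model_settings("whisper-large-v3", "transcript_files")
print(model_id)
print(output_path)
```

The returned model_id can then be passed to pipeline(...) in place of the hard-coded string.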

9.8.1 Step 2 — Define the Transcription Function

def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    # sorted() gives a stable row order; lower() also matches .MP3 / .WAV
    audio_files = sorted(
        f for f in os.listdir(folder_path)
        if f.lower().endswith(('.mp3', '.wav'))
    )

    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Large-v3"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]

    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # chunk_length_s=30 splits long audio into chunks that fit Whisper's 30 s window.
        # stride_length_s=5 adds 5 s of overlapping context on each side of a chunk.
        # return_timestamps=True makes the model predict timestamp tokens.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])

    workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")

9.9 Key Design Decisions

Conditional workbook creation

if not os.path.exists(excel_file_path):
    workbook = Workbook()          # create a brand-new file
    ...
else:
    workbook = load_workbook(...)  # open the existing file

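This guard prevents overwriting an existing file if the cell is re-run (e.g., after a kernel restart); new rows are appended to the existing sheet. Two caveats follow from the code as written. First, rows are written to disk only by the final workbook.save() call after the loop, so transcriptions from an interrupted run are lost. Second, re-running the cell on the same folder appends a second copy of every row.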
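If you want re-runs to skip files that already have a row in the sheet, the set of pending files can be computed before the loop. The sketch below is an assumption-laden illustration rather than the tutorial's method: pending_files is a hypothetical helper, and existing_rows stands for the data rows of the sheet, which openpyxl can supply via sheet.iter_rows(min_row=2, values_only=True).

```python
def pending_files(audio_files, existing_rows):
    """Return the audio files that do not yet have a row in the sheet.

    existing_rows is an iterable of row tuples whose first cell is the
    filename, e.g. the result of sheet.iter_rows(min_row=2, values_only=True).
    """
    done = {row[0] for row in existing_rows if row and row[0] is not None}
    return [name for name in audio_files if name not in done]

# Example with hand-written rows standing in for an existing spreadsheet.
rows = [
    ("recording_001.wav", "hello world"),
    ("recording_002.mp3", "good morning"),
]
files = ["recording_001.wav", "recording_002.mp3", "recording_003.wav"]
print(pending_files(files, rows))  # → ['recording_003.wav']
```

Combined with a workbook.save(excel_file_path) call inside the loop, this would make the function resumable after an interruption, at the cost of one extra write per file.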

return_timestamps=True and chunking parameters

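These settings matter for audio files longer than 30 seconds, because Whisper processes audio in 30 s windows. Setting chunk_length_s tells the pipeline to split a long recording into overlapping chunks and merge the partial transcripts; return_timestamps=True makes the model predict timestamp tokens, and the pipeline then also returns per-segment timestamps under the "chunks" key. Without either setting, the behaviour depends on the transformers version: older releases may silently transcribe only the first 30 s, while newer ones raise an error asking for return_timestamps=True.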

Parameter	Value	Purpose
`return_timestamps`	`True`	Model predicts timestamp tokens; output also contains per-segment timestamps
`chunk_length_s`	`30`	Enables chunked long-form transcription; each chunk matches Whisper’s 30 s window
`stride_length_s`	`5`	5 s of overlapping context on each side of a chunk, so words at chunk boundaries are not cut
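To get a feel for what these numbers mean, the sketch below computes the chunk windows for a recording of a given length. It is a simplified model written for this tutorial, not the library's code: it assumes the stride is applied on both sides of each chunk (as the pipeline documentation describes for a single stride_length_s value), so consecutive chunks start chunk_length_s - 2 * stride_length_s seconds apart. The function name chunk_windows is an assumption.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of the chunks covering the audio."""
    duration_s = float(duration_s)
    # Each new chunk starts this many seconds after the previous one.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_length_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if start + chunk_length_s >= duration_s:
            break  # this chunk already reaches the end of the recording
        start += step
    return windows

print(chunk_windows(70))  # → [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

With the tutorial's settings, a 70 s recording is covered by three chunks, and neighbouring chunks share 10 s of audio.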

9.9.1 Step 3 — Run the Transcription

# Run the transcription function for Whisper-Large-v3
transcribe_audio_files(input_audio_path, output_excel_file_path)

When this cell finishes, an Excel file will appear at output_excel_file_path in your Google Drive containing the Whisper-Large-v3 transcriptions.

9.10 Transcription: Whisper-Base

This section runs the much smaller and faster Whisper-Base model and saves all transcriptions to a separate Excel file. The workflow is identical to the large-v3 section; only the model identifier, the header of the second column, and the output filename change.

The output spreadsheet will have the following structure:

Filename	Whisper-Base
recording_001.wav	base transcript
recording_002.mp3	base transcript
…	…

9.10.1 Step 1 — Set Output Path and Load the Model

# Set the model version label and derive the output file path automatically
model_version = 'whisper-base'
output_excel_file_path = (
    f'/content/drive/MyDrive/Teaching/HY_S26_EngTech/transcript_files/'
    f'Output_{model_version}_Phoneme_Transcript_183.xlsx'
)

# Load Whisper-Base using the HuggingFace high-level pipeline API
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-base",
    torch_dtype=torch.float16,
    device="cuda:0"
)

9.10.2 Step 2 — Define the Transcription Function

def transcribe_audio_files(folder_path, excel_file_path):
    """
    Transcribe all .mp3 and .wav files in `folder_path` using the
    globally loaded `pipe` object, and write results to `excel_file_path`.

    Parameters
    ----------
    folder_path : str
        Path to the folder containing audio files.
    excel_file_path : str
        Full path (including filename) for the output .xlsx file.
        If the file does not yet exist it is created; otherwise rows
        are appended to the existing 'Transcriptions' sheet.
    """
    # ── 1. Collect audio files ──────────────────────────────────────────────
    # sorted() gives a stable row order; lower() also matches .MP3 / .WAV
    audio_files = sorted(
        f for f in os.listdir(folder_path)
        if f.lower().endswith(('.mp3', '.wav'))
    )

    # ── 2. Initialise or open the Excel workbook ────────────────────────────
    if not os.path.exists(excel_file_path):
        workbook = Workbook()
        sheet = workbook.active
        sheet.title = "Transcriptions"
        sheet.append(["Filename", "Whisper-Base"])  # header row
        workbook.save(excel_file_path)
    else:
        workbook = load_workbook(excel_file_path)
        sheet = workbook["Transcriptions"]

    # ── 3. Transcribe and append each file ──────────────────────────────────
    for audio_file in audio_files:
        audio_path = os.path.join(folder_path, audio_file)
        # chunk_length_s / stride_length_s enable chunked transcription of long audio;
        # return_timestamps=True makes the model predict timestamp tokens.
        transcription = pipe(
            audio_path,
            generate_kwargs={"language": "english"},
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )["text"]
        sheet.append([audio_file, transcription])

    workbook.save(excel_file_path)
    print(f"Transcriptions saved to {excel_file_path}")

9.10.3 Step 3 — Run the Transcription

# Run the transcription function for Whisper-Base
transcribe_audio_files(input_audio_path, output_excel_file_path)

# End.

Whisper Base vs. Large-v3 — Practical Trade-offs

Dimension	`Whisper-Base`	`Whisper-Large-v3`
Model size	~74 M parameters	~1.55 B parameters
Download size	~150 MB	~3 GB
Inference speed	~16× faster than large	Baseline
Word Error Rate (WER)	Higher (less accurate)	Lower (more accurate)
Accent robustness	Moderate	High
Recommended for	Piloting, quick iteration	Research-grade output

For phoneme-level research or publication-quality transcriptions, Whisper-Large-v3 is strongly recommended. Use Whisper-Base for quick sanity-checks or when working under a tight time budget.

9.11 Word Error Rate (WER) Calculation

Once you have transcriptions from multiple ASR systems, the natural next question is: which model is more accurate? This section introduces Word Error Rate (WER) — the standard metric for evaluating ASR output — and walks through a step-by-step Python implementation that computes WER for each ASR system and appends the scores to the original data file.

9.11.1 What is Word Error Rate?

Word Error Rate (WER) measures how many word-level edits are needed to transform a hypothesis (the ASR output) into a reference (the human-verified transcript), expressed as a proportion of the total number of words in the reference.

\[\text{WER} = \frac{S + D + I}{N}\]

where:

Symbol	Meaning
\(S\)	Number of substitutions (wrong word)
\(D\)	Number of deletions (word in reference, missing in hypothesis)
\(I\)	Number of insertions (extra word in hypothesis not in reference)
\(N\)	Total number of words in the reference transcript
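As a worked illustration (the sentences are invented for this example, not taken from the tutorial data), take the reference “the cat sat on the mat” (\(N = 6\)) and the hypothesis “the cat sit on mat”. One word is substituted (sat → sit) and one is deleted (the second “the”), with no insertions, so:

\[\text{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 0}{6} \approx 0.33\]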

9.12 Interpreting WER

WER Range	Interpretation
0.00	Perfect match: hypothesis is identical to the reference
Above 0.00, up to 0.15	Excellent: suitable for most research and applied tasks
0.15–0.30	Acceptable: some errors, may need manual review
> 0.30	Poor: substantial errors, caution advised

WER can exceed 1.0 (i.e., 100%) if the hypothesis contains many extra insertions. A lower WER always indicates better performance.
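For instance (again with invented sentences), if the reference is the single word “yes” (\(N = 1\)) and the hypothesis is “yes it is”, there are two insertions and no other errors:

\[\text{WER} = \frac{S + D + I}{N} = \frac{0 + 0 + 2}{1} = 2.0\]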

9.12.1 The Levenshtein Algorithm

WER is computed via the Levenshtein (edit) distance at the word level. The algorithm fills in a dynamic programming matrix \(d\) of size \((|R|+1) \times (|H|+1)\), where \(|R|\) and \(|H|\) are the number of words in the reference and hypothesis respectively.

The recurrence relation is:

\[d[i][j] = \min \begin{cases} d[i-1][j] + 1 & \text{(deletion)} \\ d[i][j-1] + 1 & \text{(insertion)} \\ d[i-1][j-1] + \text{cost}(i,j) & \text{(substitution or match)} \end{cases}\]

where \(\text{cost}(i,j) = 0\) if \(R_i = H_j\) (the words match) and \(1\) otherwise.

The final WER is \(d[|R|][|H|] \,/\, \max(|R|, 1)\).

Why implement it from scratch?

The jiwer library (installed below) provides a fast, battle-tested WER implementation. The custom implementation in this section is included for pedagogical purposes: seeing the matrix construction step-by-step makes the algorithm transparent and verifiable.

In production, prefer jiwer.wer(reference, hypothesis) for speed and correctness.

9.12.2 Step 1 — Install `jiwer`

!pip install jiwer

9.13 About `jiwer`

jiwer is a lightweight Python package for computing common ASR evaluation metrics: WER, Match Error Rate (MER), Word Information Lost (WIL), and Character Error Rate (CER). It also provides optional text-normalisation transforms (e.g., lowercasing and punctuation removal) that you can apply before scoring, and it is widely used for ASR benchmarking in Python.

9.13.1 Step 2 — Import Libraries and Load Data

# Environment: Python 3.9.16
# pip install jiwer (already installed above)
import pandas as pd
from jiwer import wer

# Load the transcription data
data = pd.read_csv('(Supplementary)_02_Input_Transcript.csv')
print("Columns in DataFrame:", data.columns.tolist())

9.14 Input File Format

The input CSV ((Supplementary)_02_Input_Transcript.csv) must contain at least the following columns:

Column	Content
`STANDARD_TRANSCRIPT`	The gold-standard human-verified reference transcript
`vosk-model-small-en-us-0.15`	ASR output from the Vosk small model
`Wav2vec2`	ASR output from Facebook’s Wav2Vec 2.0
`HuBERT`	ASR output from Facebook’s HuBERT model
`Whisper_Base-En`	ASR output from Whisper Base (English)
`Whisper_Large-v3-EN`	ASR output from Whisper Large-v3 (English)
`Azure`	ASR output from Microsoft Azure Speech

Each row represents one audio recording. The STANDARD_TRANSCRIPT column is the reference against which all other columns are evaluated.

pd.read_csv() loads the file into a pandas DataFrame — a tabular in-memory data structure where each column is a named series and each row is one observation.

9.14.1 Step 3 — Specify ASR Columns to Evaluate

# Define the list of ASR transcription columns to compare
asr_columns = [
    "STANDARD_TRANSCRIPT", "vosk-model-small-en-us-0.15", "Wav2vec2",
    "HuBERT", "Whisper_Base-En", "Whisper_Large-v3-EN", "Azure"
]

9.15 Why include `STANDARD_TRANSCRIPT` in the list?

The list includes the reference column itself. The for-loop below skips it explicitly (column != 'STANDARD_TRANSCRIPT'), so no WER is computed for the reference against itself. Keeping it in the list provides a single place to inspect all column names. Note that the loop does not check whether the reference column exists in the DataFrame, so confirm this from the printed column list before running it.

9.15.1 Step 4 — Implement the Custom WER Function

# Custom WER function using the Levenshtein (edit distance) algorithm
def calculate_wer(reference, hypothesis):
    # Return None if either value is missing (NaN or None)
    if pd.isnull(reference) or pd.isnull(hypothesis):
        return None

    ref_words = str(reference).strip().split()   # tokenise reference into words
    hyp_words = str(hypothesis).strip().split()  # tokenise hypothesis into words

    r_len = len(ref_words)
    h_len = len(hyp_words)

    # ── Initialise the (r_len+1) × (h_len+1) edit distance matrix ──────────
    d = [[0] * (h_len + 1) for _ in range(r_len + 1)]

    # Base cases: transforming N words into 0 words requires N deletions
    for i in range(r_len + 1):
        d[i][0] = i  # delete all reference words
    for j in range(h_len + 1):
        d[0][j] = j  # insert all hypothesis words

    # ── Fill the matrix ─────────────────────────────────────────────────────
    for i in range(1, r_len + 1):
        for j in range(1, h_len + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j]     + 1,    # deletion
                d[i][j - 1]     + 1,    # insertion
                d[i - 1][j - 1] + cost  # substitution (cost=0 if words match)
            )

    # ── Compute WER ─────────────────────────────────────────────────────────
    wer_result = d[r_len][h_len] / max(r_len, 1)  # guard against empty reference
    return wer_result

9.16 Walking Through the Algorithm

Tokenisation

ref_words = str(reference).strip().split()
hyp_words = str(hypothesis).strip().split()

str().strip().split() converts the transcript to a string, removes leading/trailing whitespace, and splits on any whitespace — yielding a list of individual words. This is a naïve tokeniser: it is case-sensitive and does not remove punctuation. For cleaner WER values, pre-process the text (lowercase, strip punctuation) before calling the function.
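The pre-processing suggested above can be sketched with the standard library alone. This is a minimal example under stated assumptions: the function name normalise is introduced here, and string.punctuation covers ASCII punctuation only, so typographic quotes such as ’ are left in place. Stripping apostrophes also turns "it's" into "its", which may or may not be what you want.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalise(text):
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    cleaned = str(text).lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())

print(normalise("Hello, World!  It's fine."))  # → hello world its fine
```

Apply the same normalisation to both the reference and the hypothesis before passing them to the WER function; normalising only one side would inflate the error rate.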

Matrix initialisation

       ""   w1   w2  … (hypothesis words)
  ""  [ 0,   1,   2, … ]
  r1  [ 1,   ?,   ?, … ]
  r2  [ 2,   ?,   ?, … ]
  …

d[i][0] = i means: turning \(i\) reference words into an empty hypothesis requires \(i\) deletions. d[0][j] = j means: turning an empty reference into \(j\) hypothesis words requires \(j\) insertions.

The three operations at each cell d[i][j]

Operation	Cost	Meaning
`d[i-1][j] + 1`	+1 deletion	Reference word \(R_i\) was dropped
`d[i][j-1] + 1`	+1 insertion	Hypothesis word \(H_j\) was added
`d[i-1][j-1] + cost`	+0 (match) or +1 (subst.)	Words are the same or swapped

The algorithm picks the minimum of these three options, ensuring the final value in d[r_len][h_len] is the smallest number of edits needed.
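The matrix stores only the total number of edits, but the separate counts S, D, and I from the WER formula can be recovered by walking back from d[r_len][h_len] to d[0][0]. The sketch below is an illustrative extension, not part of the tutorial's calculate_wer; the name edit_counts is an assumption. When several alignments are equally cheap, the individual counts depend on the order in which the operations are tried (here: match, substitution, deletion, insertion), while their sum is always the edit distance.

```python
def edit_counts(reference, hypothesis):
    """Count substitutions, deletions and insertions along one optimal alignment."""
    ref = str(reference).split()
    hyp = str(hypothesis).split()
    n, m = len(ref), len(hyp)

    # Build the same edit-distance matrix as in the WER function.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, recording which edit was used.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1          # match, no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1                    # substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1                    # deletion (reference word dropped)
            i -= 1
        else:
            ins += 1                     # insertion (extra hypothesis word)
            j -= 1
    return {"S": subs, "D": dels, "I": ins, "N": n}

print(edit_counts("the cat sat on the mat", "the cat sit on mat"))
# → {'S': 1, 'D': 1, 'I': 0, 'N': 6}
```

A breakdown like this helps diagnose a model: many insertions often point to hallucinated text, while many deletions suggest skipped speech.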

Guard against empty reference

wer_result = d[r_len][h_len] / max(r_len, 1)

If the reference is empty (r_len == 0), dividing by zero would raise a ZeroDivisionError. max(r_len, 1) returns 1 in that edge case, yielding a defined (though uninformative) result.

9.16.1 Step 5 — Apply WER Across All ASR Columns

# Calculate WER for each ASR column and add results as new columns
for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        wer_column_name = f"WER_{column}"
        data[wer_column_name] = data.apply(
            lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
            axis=1
        )
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")

# Save the updated DataFrame to a new CSV file
data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)

# Preview the first 20 rows
print(data.head(20))

# End.

9.17 Code Breakdown

The for-loop

for column in asr_columns:
    if column in data.columns and column != 'STANDARD_TRANSCRIPT':
        ...
    elif column != 'STANDARD_TRANSCRIPT':
        print(f"Column '{column}' does not exist in the DataFrame.")

The loop skips STANDARD_TRANSCRIPT (the reference) and also prints a warning if a listed column is absent from the DataFrame — a defensive check that catches typos in column names early.

data.apply() with axis=1

data[wer_column_name] = data.apply(
    lambda row: calculate_wer(row['STANDARD_TRANSCRIPT'], row[column]),
    axis=1
)

DataFrame.apply(..., axis=1) calls the given function once per row, passing the entire row as a Series. Here, the lambda extracts the reference and hypothesis values for that row and passes them to calculate_wer. The result is assigned back to a new column named WER_<column>.

Dynamic column naming

wer_column_name = f"WER_{column}"

For example, processing the column Whisper_Large-v3-EN creates a new column WER_Whisper_Large-v3-EN. This naming convention keeps WER columns clearly paired with their source transcription columns.

Output file

data.to_csv('(Supplementary)_02_Output_Transcript_WER.csv', index=False)

index=False omits the auto-generated row numbers (0, 1, 2, …) from the CSV file, keeping the output clean for subsequent use in Excel or R.

9.17.1 Expected Output Structure

After the loop finishes, the DataFrame gains one new WER column for each ASR system:

Filename	STANDARD_TRANSCRIPT	Whisper_Large-v3-EN	…	WER_Whisper_Large-v3-EN	WER_Whisper_Base-En	…
rec_001	the cat sat …	the cat sat …	…	0.00	0.08	…
rec_002	she sells shells …	she sells shells …	…	0.05	0.20	…

Lower WER values indicate closer agreement with the human reference transcript.

Aggregating WER Across Files

To get a single summary WER per model, compute the mean of each WER column:

wer_columns = [c for c in data.columns if c.startswith("WER_")]
summary = data[wer_columns].mean().sort_values()
print(summary)

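This gives a ranked list of models from best (lowest mean WER) to worst. Note that the mean of per-file WERs weights every file equally, regardless of its length; many ASR benchmarks instead report a corpus-level WER, i.e. the total number of errors divided by the total number of reference words.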
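The mean of per-file WERs gives a short file the same weight as a long one. A corpus-level WER pools the errors and the reference words across all files before dividing. The sketch below illustrates the difference in plain Python; the names word_edit_distance and corpus_wer and the example sentences are assumptions made for this illustration.

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, keeping only one matrix row in memory."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def corpus_wer(pairs):
    """Total edits divided by total reference words over (reference, hypothesis) pairs."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    return total_errors / max(total_ref_words, 1)

pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),  # 0 errors, 6 words
    ("yes", "no"),                                         # 1 error, 1 word
]
per_file = [word_edit_distance(r.split(), h.split()) / len(r.split()) for r, h in pairs]
print(sum(per_file) / len(per_file))  # mean of per-file WERs → 0.5
print(round(corpus_wer(pairs), 4))    # corpus-level WER → 0.1429
```

In this toy example the single wrong one-word file dominates the per-file mean, while the corpus-level figure reflects that only one word in seven is wrong. Whichever aggregation you report, state it explicitly.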

9.18 Troubleshooting

Problem	Likely Cause	Solution
`CUDA out of memory`	GPU VRAM exceeded	Switch to a smaller model such as `whisper-medium` or `whisper-small` (`float16` is already set)
`No module named transformers`	Package not installed	Re-run the `!pip install` cell
Drive path not found	Wrong path string	Check Drive file browser; confirm folder exists and path matches exactly
Empty transcription output	Silent or corrupt audio	Inspect the file manually; Whisper may return empty string for silent audio
Very slow inference on CPU	No GPU runtime selected	Runtime → Change runtime type → GPU (T4)
`KeyError: 'Transcriptions'`	Sheet name mismatch	Open the Excel file and confirm the sheet is named “Transcriptions”
`WER > 1.0` for some rows	Many insertions in hypothesis	Expected behaviour — WER is unbounded above 1.0; check for hallucination
`None` values in WER columns	Missing transcript (`NaN`)	Inspect rows where `STANDARD_TRANSCRIPT` or the ASR column is empty
Column not found warning	Typo in `asr_columns` list	Check `data.columns.tolist()` output and correct the column name

9.19 Summary

In this tutorial you learned how to:

Install and configure the HuggingFace transformers library in Google Colab
Mount Google Drive and define input/output paths
Load Whisper-Large-v3 and Whisper-Base as independent pipelines
Define a reusable transcribe_audio_files() function that safely creates or appends to an Excel file
Run each model separately and save results to its own dedicated .xlsx file
Understand the WER metric and its mathematical foundation (Levenshtein distance)
Implement a custom WER function using dynamic programming
Apply WER across multiple ASR columns and export results to a structured CSV

Next Steps

Aggregate WER reporting: Use data[wer_columns].mean() to produce a summary table ranking all models by accuracy.
Text normalisation: Lowercase transcripts and strip punctuation before WER calculation to avoid penalising capitalisation differences.
Character Error Rate (CER): Use jiwer.cer() for languages without clear word boundaries (e.g., Mandarin, Japanese).
Forced alignment: Use whisperx for word-level timestamps aligned to the audio waveform.
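Custom vocabulary / prompting: Bias Whisper toward domain-specific vocabulary with a short text prompt. In transformers, this is done by building prompt_ids with processor.get_prompt_ids(...) and passing them to generation; initial_prompt is the corresponding argument in the original openai-whisper package.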
Gradio demo: Wrap the pipeline in a Gradio interface for interactive transcription without writing code.

9.20 References

OpenAI Whisper GitHub: https://github.com/openai/whisper
HuggingFace Whisper model page: https://huggingface.co/openai/whisper-large-v3
HuggingFace pipeline documentation: https://huggingface.co/docs/transformers/main_classes/pipelines
openpyxl documentation: https://openpyxl.readthedocs.io
jiwer documentation: https://jitsi.github.io/jiwer/