10 DTW-Based Prosody Analysis
10.1 Overview
This notebook implements two variants of Dynamic Time Warping (DTW) to measure prosodic similarity between native speaker (NS) reference recordings and student learner recordings. The core acoustic feature used is F0 (fundamental frequency), which captures the pitch contour of an utterance — a primary carrier of prosodic information such as intonation and stress patterns.
The pipeline consists of four stages:
| Stage | Description |
|---|---|
| 1. Standard DTW | Aligns raw z-score-normalised F0 contours and computes a distance score |
| 2. Derivative DTW (DDTW) | Applies DTW to the first-order derivative of F0, emphasising shape over absolute pitch |
| 3. DTW Contour Plots | Visualises pitch contours and the accumulated-cost matrix for Standard DTW |
| 4. DDTW Contour Plots | Visualises pitch + derivative contours and the cost matrix for DDTW |
10.2 Background
10.2.1 Dynamic Time Warping (DTW)
DTW is a sequence-alignment algorithm that finds the optimal non-linear mapping between two time series of different lengths. Unlike simple Euclidean distance, DTW is robust to time-axis distortions — a crucial property when comparing speech from speakers who produce the same utterance at different rates.
Given two sequences \(X = (x_1, \ldots, x_n)\) and \(Y = (y_1, \ldots, y_m)\), DTW constructs an \(n \times m\) accumulated cost matrix \(D\) where each cell stores the minimum cumulative cost to align up to that point:
\[D(i, j) = d(x_i, y_j) + \min\bigl(D(i-1,j),\; D(i,j-1),\; D(i-1,j-1)\bigr)\]
The warping path is then traced back from \(D(n, m)\) to \(D(1, 1)\), and the final DTW distance equals \(D(n, m)\).
10.2.2 Derivative DTW (DDTW)
Standard DTW can be misled by absolute pitch differences between speakers (e.g., a male NS vs. a female learner). DDTW (Keogh & Pazzani, 2001) addresses this by aligning the first-order derivatives of the two series rather than the raw values. This makes the comparison sensitive to the shape of the pitch contour — rises, falls, and plateaus — rather than its overall level.
The derivative is estimated with a symmetric finite-difference scheme:
\[x'_i = \begin{cases} x_2 - x_1 & i = 1 \\[4pt] \dfrac{x_{i+1} - x_{i-1}}{2} & 1 < i < n \\[4pt] x_n - x_{n-1} & i = n \end{cases}\]
10.2.3 F0 Extraction
F0 is extracted with librosa.pyin, a probabilistic YIN algorithm that estimates the fundamental frequency frame by frame. Unvoiced frames produce NaN values and are removed before DTW is applied.
10.2.4 Z-Score Normalisation
Before comparison, each F0 sequence is standardised to zero mean and unit variance:
\[z_i = \frac{x_i - \bar{x}}{\sigma_x}\]
This normalisation removes speaker-specific pitch range differences and makes distances interpretable across different sentence-speaker pairs.
10.3 Dependencies and Installation
Before running any code, install the two non-standard audio packages:
%%capture
# !pip install dtaidistance
!pip install librosaThe %%capture magic suppresses noisy installation output in the notebook. dtaidistance is commented out because it is assumed to already be installed; uncomment it if you encounter an ImportError.
| Package | Role in this project |
|---|---|
pandas |
Reading the audio list spreadsheet; writing result Excel files |
numpy |
Array arithmetic and finite-difference derivative computation |
librosa |
Loading .wav files and extracting F0 via the pyin algorithm |
scipy.stats |
Z-score normalisation (zscore) |
scipy.signal |
find_peaks — imported for completeness, not actively called |
dtaidistance |
Optimised Standard DTW computation (Stages 1 and 3) |
matplotlib |
Generating and saving two-panel contour figures |
PIL (Pillow) |
Imported for potential image annotation; not actively used |
10.4 Data Import
10.4.1 Code
import pandas as pd
import os
# Load the combined DataFrame
df_audio_list = pd.read_excel('DTW_audio_list.xlsx')
# Directory paths
reference_dir = 'NS_wav' # Native Speaker Reference Audio
test_files_dir = 'Students_wav' # Student Audio Files
df_audio_list.head()10.4.2 Explanation
This cell sets up the data index that drives every subsequent stage.
pd.read_excel('DTW_audio_list.xlsx') Reads the spreadsheet into a DataFrame. Each row represents one NS-student comparison pair. The expected columns are:
| Column | Description |
|---|---|
Sent_NO |
Integer ID of the sentence; used to locate the NS file (e.g., 1.wav) |
Sentence |
Text transcription of the target sentence; appears in plot titles |
Audio_File |
Filename of the student recording (e.g., S01_Sent01.wav) |
Directory variables reference_dir and test_files_dir are string path prefixes combined with os.path.join() later to build full file paths. The working directory must contain these two subdirectories.
df_audio_list.head() Previews the first five rows to verify the spreadsheet loaded correctly before running the computationally expensive audio pipeline.
10.5 Stage 1: Standard DTW
10.5.1 Overview
Standard DTW computes a single scalar distance between two z-score-normalised F0 contours after finding the optimal non-linear time alignment. A smaller distance indicates that the student’s pitch contour more closely matches the native speaker’s.
10.5.2 Full Code
from PIL import Image, ImageDraw, ImageFont
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import euclidean # For distance calculation
from scipy.signal import find_peaks
import librosa # For audio processing
import librosa.display
from scipy import stats # For z-score calculation
from dtaidistance import dtw # Using dtaidistance for original DTW
def extract_f0(file_path):
y, sr = librosa.load(file_path)
f0, voiced_flag, voiced_probs = librosa.pyin(
y,
fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7')
)
return f0
# Initialize list to store results
results = []
# Process each row in the DataFrame
for index, row in df_audio_list.iterrows():
# Reference file
reference_file_num = row['Sent_NO']
reference_file_sent = row['Sentence']
reference_file_path = os.path.join(reference_dir, f"{reference_file_num}.wav")
# Test file
test_file_name = row['Audio_File']
test_file_path = os.path.join(test_files_dir, test_file_name)
# Check if files exist
if not os.path.exists(reference_file_path) or not os.path.exists(test_file_path):
print(f"File not found: {reference_file_path} or {test_file_path}")
continue
# Load and extract F0 from the reference and test audio files
f0_reference = extract_f0(reference_file_path)
f0_test = extract_f0(test_file_path)
# Remove NaNs (in case of unvoiced segments)
f0_reference = f0_reference[~np.isnan(f0_reference)]
f0_test = f0_test[~np.isnan(f0_test)]
# Normalize the F0 values using z-score
f0_reference_z = stats.zscore(f0_reference)
f0_test_z = stats.zscore(f0_test)
# Perform DTW and get the best path
distance, path = dtw.warping_paths(f0_reference_z, f0_test_z)
best_path = np.array(dtw.best_path(path))
# Store results
results.append({
'Sent_NO': reference_file_num,
'Test File': test_file_name,
'DTW_Original_Distance_z': distance,
'Sentence': reference_file_sent
})
# Create DataFrame from results and display
df_results = pd.DataFrame(results)
print(df_results)
# Save results to an Excel file
excel_file_path = 'DTW_Output/01_Standard_DTW_z.xlsx'
df_results.to_excel(excel_file_path, index=False)
print(f"Results saved to {excel_file_path}")10.5.3 Explanation
10.5.3.1 Import block
The import block pulls in all required libraries for this stage. Only from dtaidistance import dtw is specific to DTW computation; the rest provide general-purpose array, audio, and statistics utilities. PIL, scipy.spatial.distance, and scipy.signal are imported for completeness but are not directly called in this stage.
10.5.3.2 extract_f0(file_path) — F0 extraction helper
def extract_f0(file_path):
y, sr = librosa.load(file_path)
f0, voiced_flag, voiced_probs = librosa.pyin(
y,
fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7')
)
return f0This reusable function wraps two librosa calls:
| Line | What it does |
|---|---|
librosa.load(file_path) |
Decodes the .wav file and resamples to 22 050 Hz (librosa default). Returns waveform y and sample rate sr. |
librosa.pyin(...) |
Runs the probabilistic YIN (pYIN) pitch tracker frame by frame. Returns f0 (Hz per frame), voiced_flag (boolean), and voiced_probs (voicing probability). |
fmin=librosa.note_to_hz('C2') |
Sets the lower search bound to ~65 Hz — below the lowest typical male speaking voice. |
fmax=librosa.note_to_hz('C7') |
Sets the upper bound to ~2093 Hz — well above the highest typical female speaking voice. |
return f0 |
Returns only the pitch array; voiced_flag and voiced_probs are discarded here but could be used for voicing-based filtering. |
Unvoiced frames (silence, fricatives, stops) are represented as NaN in f0.
10.5.3.3 Processing loop
for index, row in df_audio_list.iterrows():
reference_file_path = os.path.join(reference_dir, f"{reference_file_num}.wav")
test_file_path = os.path.join(test_files_dir, test_file_name)iterrows() yields one (index, Series) pair per spreadsheet row. os.path.join() constructs OS-independent file paths from the directory prefix and filename, handling slash differences across platforms automatically.
10.5.3.4 File existence check
if not os.path.exists(reference_file_path) or not os.path.exists(test_file_path):
print(f"File not found: ...")
continueGuards against missing audio files. Rather than crashing the entire loop, continue skips the current pair and moves to the next row. This is important when a dataset has gaps or filename mismatches.
10.5.3.5 NaN removal
f0_reference = f0_reference[~np.isnan(f0_reference)]
f0_test = f0_test[~np.isnan(f0_test)]np.isnan() returns a boolean mask; ~ inverts it; boolean indexing keeps only voiced frames. DTW cannot handle NaN values — leaving them in would propagate nan to the entire distance computation.
10.5.3.6 Z-score normalisation
f0_reference_z = stats.zscore(f0_reference)
f0_test_z = stats.zscore(f0_test)scipy.stats.zscore subtracts the mean and divides by the standard deviation, producing a sequence with mean 0 and standard deviation 1. This step is essential because male and female speakers can differ by an octave in absolute F0, yet their intonation patterns may be nearly identical in shape.
10.5.3.7 DTW computation
distance, path = dtw.warping_paths(f0_reference_z, f0_test_z)
best_path = np.array(dtw.best_path(path))| Call | Return | Meaning |
|---|---|---|
dtw.warping_paths(x, y) |
(distance, path_matrix) |
Fills the entire \(n \times m\) accumulated cost matrix; distance is the bottom-right cell value. |
dtw.best_path(path_matrix) |
List of (i, j) tuples |
Traces the optimal alignment path from (0,0) to (n-1, m-1). |
np.array(...) |
Shape (L, 2) array |
Converts the path to NumPy for easy column-wise indexing during plotting. |
best_path is stored but not used further in this stage; it is essential in Stage 3 for the warping-path overlay on the cost matrix heatmap.
10.5.3.8 Results collection and export
results.append({...})
df_results = pd.DataFrame(results)
df_results.to_excel('DTW_Output/01_Standard_DTW_z.xlsx', index=False)Results are accumulated in a plain Python list and converted to a DataFrame only once at the end — growing a DataFrame row-by-row inside a loop is significantly slower due to repeated memory allocation. index=False prevents pandas from writing row numbers into the Excel file.
10.5.4 Output columns
| Column | Description |
|---|---|
Sent_NO |
Sentence identifier |
Test File |
Student audio filename |
DTW_Original_Distance_z |
DTW distance on z-scored F0 (lower = more similar) |
Sentence |
Sentence transcription |
10.6 Stage 2: Derivative DTW (DDTW)
10.6.1 Overview
DDTW compares the shape of pitch contours rather than their absolute values by running DTW on the first-order derivatives of the F0 sequences. This makes the metric less sensitive to overall pitch level and more sensitive to prosodic patterns such as rises, falls, and plateaus.
10.6.2 Full Code
from PIL import Image, ImageDraw, ImageFont
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import librosa # For audio processing
def compute_derivative(x):
"""Compute the derivative of the time series."""
derivative = np.zeros(len(x))
# Compute the central difference for the middle elements
derivative[1:-1] = (x[2:] - x[:-2]) / 2.0
# Forward difference for the first element
derivative[0] = x[1] - x[0]
# Backward difference for the last element
derivative[-1] = x[-1] - x[-2]
return derivative
def ddtw(x, y):
"""Compute the Derivative Dynamic Time Warping (DDTW) distance."""
# Compute derivatives of the input time series
x_derivative = compute_derivative(x)
y_derivative = compute_derivative(y)
# Initialize the cost matrix
n, m = len(x), len(y)
cost_matrix = np.zeros((n, m))
# Initialize the first row and column of the cost matrix
cost_matrix[0, 0] = abs(x_derivative[0] - y_derivative[0])
for i in range(1, n):
cost_matrix[i, 0] = cost_matrix[i-1, 0] + abs(x_derivative[i] - y_derivative[0])
for j in range(1, m):
cost_matrix[0, j] = cost_matrix[0, j-1] + abs(x_derivative[0] - y_derivative[j])
# Fill the rest of the cost matrix using dynamic programming
for i in range(1, n):
for j in range(1, m):
cost = abs(x_derivative[i] - y_derivative[j])
cost_matrix[i, j] = cost + min(
cost_matrix[i-1, j], # Insertion
cost_matrix[i, j-1], # Deletion
cost_matrix[i-1, j-1] # Match
)
# Backtrack to find the best path
i, j = n - 1, m - 1
path = []
while i > 0 and j > 0:
path.append((i, j))
min_cost = min(cost_matrix[i-1, j], cost_matrix[i, j-1], cost_matrix[i-1, j-1])
if min_cost == cost_matrix[i-1, j-1]:
i -= 1
j -= 1
elif min_cost == cost_matrix[i-1, j]:
i -= 1
else:
j -= 1
path.append((0, 0))
path.reverse()
return cost_matrix[-1, -1], np.array(path)
# Initialize list to store results
results = []
for index, row in df_audio_list.iterrows():
reference_file_num = row['Sent_NO']
reference_file_sent = row['Sentence']
reference_file_path = os.path.join(reference_dir, f"{reference_file_num}.wav")
test_file_name = row['Audio_File']
test_file_path = os.path.join(test_files_dir, test_file_name)
if not os.path.exists(reference_file_path) or not os.path.exists(test_file_path):
print(f"File not found: {reference_file_path} or {test_file_path}")
continue
f0_reference = extract_f0(reference_file_path)
f0_test = extract_f0(test_file_path)
f0_reference = f0_reference[~np.isnan(f0_reference)]
f0_test = f0_test[~np.isnan(f0_test)]
f0_reference_z = stats.zscore(f0_reference)
f0_test_z = stats.zscore(f0_test)
# Perform DDTW and get the best path
distance, best_path = ddtw(f0_reference_z, f0_test_z)
results.append({
'Sent_NO': reference_file_num,
'Test File': test_file_name,
'DDTW_Distance_z': distance,
'Sentence': reference_file_sent
})
df_results = pd.DataFrame(results)
print(df_results)
excel_file_path = 'DTW_Output/02_DDTW_z.xlsx'
df_results.to_excel(excel_file_path, index=False)
print(f"Results saved to {excel_file_path}")10.6.3 Explanation
10.6.3.1 compute_derivative(x) — finite-difference derivative
def compute_derivative(x):
derivative = np.zeros(len(x))
derivative[1:-1] = (x[2:] - x[:-2]) / 2.0 # central difference
derivative[0] = x[1] - x[0] # forward difference
derivative[-1] = x[-1] - x[-2] # backward difference
return derivativeThis function approximates the instantaneous rate of change of the pitch contour at each frame. Three different finite-difference schemes are used depending on position:
| Region | Slice notation | Formula | Accuracy |
|---|---|---|---|
| Interior frames | [1:-1] |
\((x_{i+1} - x_{i-1}) / 2\) | Second-order (symmetric) |
| First frame | [0] |
\(x_1 - x_0\) | First-order forward |
| Last frame | [-1] |
\(x_{n-1} - x_{n-2}\) | First-order backward |
The result is a same-length array where each value encodes whether the pitch was rising (positive), flat (near zero), or falling (negative) at that moment. This shape information is exactly what DDTW compares.
10.6.3.2 ddtw(x, y) — custom DDTW implementation
Step 1: Derive both series
x_derivative = compute_derivative(x)
y_derivative = compute_derivative(y)Converts raw z-scored F0 sequences into slope sequences before DTW is applied. From this point forward, the algorithm is identical to Standard DTW except that it operates on these derivative arrays rather than the raw F0 values.
Step 2: Initialise boundary conditions
cost_matrix[0, 0] = abs(x_derivative[0] - y_derivative[0])
for i in range(1, n):
cost_matrix[i, 0] = cost_matrix[i-1, 0] + abs(x_derivative[i] - y_derivative[0])
for j in range(1, m):
cost_matrix[0, j] = cost_matrix[0, j-1] + abs(x_derivative[0] - y_derivative[j])The first row and column have no predecessors on both axes, so they can only accumulate costs along a single direction. This forces the warping path to start at (0, 0) and creates the boundary for the recurrence.
Step 3: Dynamic programming fill
for i in range(1, n):
for j in range(1, m):
cost = abs(x_derivative[i] - y_derivative[j])
cost_matrix[i, j] = cost + min(
cost_matrix[i-1, j], # Insertion: NS advances, student stays
cost_matrix[i, j-1], # Deletion: student advances, NS stays
cost_matrix[i-1, j-1] # Match: both advance together
)Each cell’s value is the local derivative distance plus the cheapest way to have arrived at that cell. The three predecessors correspond to three types of alignment moves:
| Predecessor | Move | Speech interpretation |
|---|---|---|
(i-1, j) |
NS advances alone | NS pitch change has no matching student change yet |
(i, j-1) |
Student advances alone | Student pitch change has no matching NS change yet |
(i-1, j-1) |
Both advance | One NS frame matched to one student frame |
Step 4: Backtrack the optimal path
i, j = n - 1, m - 1
path = []
while i > 0 and j > 0:
path.append((i, j))
min_cost = min(cost_matrix[i-1, j], cost_matrix[i, j-1], cost_matrix[i-1, j-1])
if min_cost == cost_matrix[i-1, j-1]: i -= 1; j -= 1
elif min_cost == cost_matrix[i-1, j]: i -= 1
else: j -= 1
path.append((0, 0))
path.reverse()Starts at the bottom-right corner (n-1, m-1) and greedily steps toward (0, 0) by always choosing the neighbour with the lowest accumulated cost. path.append((0, 0)) manually adds the origin, and path.reverse() re-orders from start to end for consistent downstream use.
Return values
return cost_matrix[-1, -1], np.array(path)cost_matrix[-1, -1] is the total DDTW distance (the scalar stored in the Excel output). np.array(path) is a 2-D array of shape (L, 2). Note that the full matrix is not returned here — that variant is introduced in Stage 4 for visualisation purposes.
10.6.3.3 Comparison with Standard DTW
| Aspect | Standard DTW (Stage 1) | DDTW (Stage 2) |
|---|---|---|
| Input to DTW | Raw z-scored F0 | First derivative of z-scored F0 |
| Local cost | |f0_ref[i] - f0_test[j]| |
|Df0_ref[i] - Df0_test[j]| |
| Sensitivity | Absolute pitch level | Pitch slope (rising/falling) |
| Implementation | dtaidistance C extension |
Pure Python nested loop |
| Output column | DTW_Original_Distance_z |
DDTW_Distance_z |
The pure Python double loop in ddtw() is \(O(n \cdot m)\) in both time and memory. For long utterances this can be slow. If performance is a concern, consider a NumPy-vectorised or Cython implementation.
10.7 Stage 3: DTW Contour Plots
10.7.1 Overview
For every NS-student pair, this stage produces a two-panel figure saved as DTW_Output/Contours/DTW_<student_filename>.png. The left panel overlays the raw pitch contours; the right panel shows the accumulated cost matrix with the optimal warping path.
10.7.2 Full Code
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from dtaidistance import dtw
os.makedirs('DTW_Output/Contours', exist_ok=True)
for index, row in df_audio_list.iterrows():
reference_file_num = row['Sent_NO']
reference_file_sent = row['Sentence']
reference_file_path = os.path.join(reference_dir, f"{reference_file_num}.wav")
test_file_name = row['Audio_File']
test_file_path = os.path.join(test_files_dir, test_file_name)
if not os.path.exists(reference_file_path) or not os.path.exists(test_file_path):
print(f"File not found: {reference_file_path} or {test_file_path}")
continue
# Extract and z-score normalize F0
f0_ref = extract_f0(reference_file_path)
f0_test = extract_f0(test_file_path)
f0_ref = f0_ref[~np.isnan(f0_ref)]
f0_test = f0_test[~np.isnan(f0_test)]
f0_ref_z = stats.zscore(f0_ref)
f0_test_z = stats.zscore(f0_test)
# Compute DTW accumulated cost matrix and best warping path
distance, path_matrix = dtw.warping_paths(f0_ref_z, f0_test_z)
best_path = np.array(dtw.best_path(path_matrix))
# Clip -inf sentinels used by dtaidistance for display
disp_matrix = np.where(path_matrix < 0, np.nan, path_matrix)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle(
f'Standard DTW | "{reference_file_sent}" | {test_file_name} vs NS {reference_file_num}',
fontsize=12, fontweight='bold'
)
# Left panel: F0 pitch contours
ax1 = axes[0]
t_ref = np.linspace(0, 1, len(f0_ref_z))
t_test = np.linspace(0, 1, len(f0_test_z))
ax1.plot(t_ref, f0_ref_z, color='steelblue', linewidth=2, label='NS Reference')
ax1.plot(t_test, f0_test_z, color='tomato', linewidth=2, label='Student')
ax1.set_xlabel('Normalized Time')
ax1.set_ylabel('F0 (z-score)')
ax1.set_title('F0 Pitch Contours')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Right panel: DTW accumulated cost matrix contour with warping path
ax2 = axes[1]
x_idx = np.arange(disp_matrix.shape[1])
y_idx = np.arange(disp_matrix.shape[0])
masked = np.ma.array(disp_matrix, mask=np.isnan(disp_matrix))
cf = ax2.contourf(x_idx, y_idx, masked, levels=20, cmap='YlOrRd')
ax2.plot(best_path[:, 1], best_path[:, 0], 'b-', linewidth=2, label='Warping Path')
ax2.plot(best_path[ 0, 1], best_path[ 0, 0], 'b^', markersize=8)
ax2.plot(best_path[-1, 1], best_path[-1, 0], 'bo', markersize=8)
plt.colorbar(cf, ax=ax2, label='Accumulated Cost')
ax2.set_xlabel('Student Frame Index')
ax2.set_ylabel('NS Reference Frame Index')
ax2.set_title(f'DTW Cost Matrix (distance = {distance:.3f})')
ax2.legend(loc='lower right')
plt.tight_layout()
out_path = f'DTW_Output/Contours/DTW_{test_file_name[:-4]}.png'
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved: {out_path}")10.7.3 Explanation
10.7.3.1 Output directory creation
os.makedirs('DTW_Output/Contours', exist_ok=True)Creates the nested output directory if it does not already exist. exist_ok=True prevents an error if the directory is already present, making the script safe to re-run without manual cleanup.
10.7.3.2 Sentinel value removal
disp_matrix = np.where(path_matrix < 0, np.nan, path_matrix)dtaidistance stores sentinel values of \(-\infty\) in cells that were not visited during computation (e.g., cells outside a Sakoe-Chiba constraint band). These negative values would collapse the colour scale of the contour plot to a single colour, so they are replaced with NaN before display. np.where(condition, x, y) acts as a vectorised conditional: it returns x where the condition is True and y elsewhere.
10.7.3.3 Figure layout
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle(...)Creates a 1-row, 2-column figure approximately 14 inches wide and 5 inches tall. fig.suptitle places a shared title above both panels that identifies the sentence text and the exact files being compared.
10.7.3.4 Left panel — pitch contour overlay
t_ref = np.linspace(0, 1, len(f0_ref_z))
t_test = np.linspace(0, 1, len(f0_test_z))
ax1.plot(t_ref, f0_ref_z, color='steelblue', linewidth=2, label='NS Reference')
ax1.plot(t_test, f0_test_z, color='tomato', linewidth=2, label='Student')np.linspace(0, 1, n) maps any sequence of length n to the unit interval [0, 1], making different-length contours visually comparable on the same x-axis. Without this normalisation, a longer student recording would appear wider than the NS reference, making shape comparison harder.
The two contours use contrasting colours (steel blue for NS, tomato for student) to allow side-by-side pitch pattern inspection without DTW alignment. Differences here that shrink after alignment (Stage 4 right panel) indicate timing differences, not intonation differences.
10.7.3.5 Right panel — cost matrix heatmap
masked = np.ma.array(disp_matrix, mask=np.isnan(disp_matrix))
cf = ax2.contourf(x_idx, y_idx, masked, levels=20, cmap='YlOrRd')np.ma.array creates a masked array so NaN cells (sentinels) are rendered transparent rather than assigned a colour. contourf draws 20 filled contour bands using the YlOrRd colour map — yellow for low accumulated cost, dark red for high cost. Low-cost regions are where the two sequences are locally most similar.
ax2.plot(best_path[:, 1], best_path[:, 0], 'b-', linewidth=2, label='Warping Path')
ax2.plot(best_path[ 0, 1], best_path[ 0, 0], 'b^', markersize=8) # start triangle
ax2.plot(best_path[-1, 1], best_path[-1, 0], 'bo', markersize=8) # end circlebest_path[:, 1] are the student (column) indices; best_path[:, 0] are the NS (row) indices. Note the axis order: x-axis = student frames, y-axis = NS frames. The triangle marks where alignment begins; the circle marks where it ends.
Reading the plot: A warping path that hugs the main diagonal indicates that the two speakers are well-aligned in time — similar speech rate and rhythm. Large horizontal excursions mean the student held a syllable while the NS moved on; large vertical excursions mean the opposite.
10.7.3.6 Colour bar and saving
plt.colorbar(cf, ax=ax2, label='Accumulated Cost')Adds a labelled colour bar on the right of ax2 to decode the numeric cost values from the heatmap colours.
out_path = f'DTW_Output/Contours/DTW_{test_file_name[:-4]}.png'
plt.savefig(out_path, dpi=150, bbox_inches='tight')test_file_name[:-4] strips the .wav extension so S01_Sent01.wav becomes DTW_S01_Sent01.png. dpi=150 produces medium-resolution output suitable for screen review and standard printing. bbox_inches='tight' trims empty margins around the figure.
10.8 Stage 4: DDTW Contour Plots
10.8.1 Overview
This stage mirrors Stage 3 but for DDTW. It introduces a new helper ddtw_with_matrix() that also returns the full accumulated cost matrix, and extends the left panel to show both raw F0 and its derivative simultaneously via a dual y-axis.
10.8.2 Full Code
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
os.makedirs('DTW_Output/Contours', exist_ok=True)
def ddtw_with_matrix(x, y):
"""DDTW returning (distance, accumulated_cost_matrix, best_path)."""
x_d = compute_derivative(x)
y_d = compute_derivative(y)
n, m = len(x), len(y)
C = np.zeros((n, m))
C[0, 0] = abs(x_d[0] - y_d[0])
for i in range(1, n):
C[i, 0] = C[i-1, 0] + abs(x_d[i] - y_d[0])
for j in range(1, m):
C[0, j] = C[0, j-1] + abs(x_d[0] - y_d[j])
for i in range(1, n):
for j in range(1, m):
C[i, j] = abs(x_d[i] - y_d[j]) + min(C[i-1, j], C[i, j-1], C[i-1, j-1])
i, j = n - 1, m - 1
path = []
while i > 0 and j > 0:
path.append((i, j))
mc = min(C[i-1, j], C[i, j-1], C[i-1, j-1])
if mc == C[i-1, j-1]: i -= 1; j -= 1
elif mc == C[i-1, j]: i -= 1
else: j -= 1
path.append((0, 0))
path.reverse()
return C[-1, -1], C, np.array(path)
for index, row in df_audio_list.iterrows():
reference_file_num = row['Sent_NO']
reference_file_sent = row['Sentence']
reference_file_path = os.path.join(reference_dir, f"{reference_file_num}.wav")
test_file_name = row['Audio_File']
test_file_path = os.path.join(test_files_dir, test_file_name)
if not os.path.exists(reference_file_path) or not os.path.exists(test_file_path):
print(f"File not found: {reference_file_path} or {test_file_path}")
continue
f0_ref = extract_f0(reference_file_path)
f0_test = extract_f0(test_file_path)
f0_ref = f0_ref[~np.isnan(f0_ref)]
f0_test = f0_test[~np.isnan(f0_test)]
f0_ref_z = stats.zscore(f0_ref)
f0_test_z = stats.zscore(f0_test)
# Compute DDTW cost matrix and warping path
distance, cost_matrix, best_path = ddtw_with_matrix(f0_ref_z, f0_test_z)
# Derivative series for the left panel
dx_ref = compute_derivative(f0_ref_z)
dx_test = compute_derivative(f0_test_z)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle(
f'DDTW | "{reference_file_sent}" | {test_file_name} vs NS {reference_file_num}',
fontsize=12, fontweight='bold'
)
# Left panel: F0 contours + derivative contours (dual y-axis)
ax1 = axes[0]
ax1_r = ax1.twinx()
t_ref = np.linspace(0, 1, len(f0_ref_z))
t_test = np.linspace(0, 1, len(f0_test_z))
ax1.plot(t_ref, f0_ref_z, color='steelblue', linewidth=2, label='NS F0')
ax1.plot(t_test, f0_test_z, color='tomato', linewidth=2, label='Student F0')
ax1_r.plot(t_ref, dx_ref, color='steelblue', linewidth=1.2, linestyle='--', alpha=0.6, label='NS dF0')
ax1_r.plot(t_test, dx_test, color='tomato', linewidth=1.2, linestyle='--', alpha=0.6, label='Student dF0')
ax1.set_xlabel('Normalized Time')
ax1.set_ylabel('F0 (z-score)')
ax1_r.set_ylabel('dF0 (derivative)', color='gray')
ax1.set_title('F0 and dF0 Contours')
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax1_r.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, fontsize=8, loc='upper right')
ax1.grid(True, alpha=0.3)
# Right panel: DDTW accumulated cost matrix contour with warping path
ax2 = axes[1]
x_idx = np.arange(cost_matrix.shape[1])
y_idx = np.arange(cost_matrix.shape[0])
cf = ax2.contourf(x_idx, y_idx, cost_matrix, levels=20, cmap='YlOrRd')
ax2.plot(best_path[:, 1], best_path[:, 0], 'b-', linewidth=2, label='Warping Path')
ax2.plot(best_path[ 0, 1], best_path[ 0, 0], 'b^', markersize=8)
ax2.plot(best_path[-1, 1], best_path[-1, 0], 'bo', markersize=8)
plt.colorbar(cf, ax=ax2, label='Accumulated Cost')
ax2.set_xlabel('Student Frame Index')
ax2.set_ylabel('NS Reference Frame Index')
ax2.set_title(f'DDTW Cost Matrix (distance = {distance:.3f})')
ax2.legend(loc='lower right')
plt.tight_layout()
out_path = f'DTW_Output/Contours/DDTW_{test_file_name[:-4]}.png'
plt.savefig(out_path, dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved: {out_path}")10.8.3 Explanation
10.8.3.1 ddtw_with_matrix(x, y) — extended DDTW helper
This function is algorithmically identical to ddtw() from Stage 2 but returns three values instead of two:
return C[-1, -1], C, np.array(path)
# ^^^^^^^^ ^ ^^^^^^^^^^^^
# distance full matrix warping pathThe full cost matrix C is needed to render the heatmap in the right panel. In Stage 2, only the scalar distance was required for the Excel output, so C was deliberately not returned there.
10.8.3.2 Computing derivative series for display
dx_ref = compute_derivative(f0_ref_z)
dx_test = compute_derivative(f0_test_z)These derivative arrays are computed after z-scoring so their scale is consistent with the z-scored F0 on the primary axis. They are used exclusively for the left-panel visualisation — ddtw_with_matrix internally re-derives them for the cost calculation.
10.8.3.3 Dual y-axis left panel
ax1 = axes[0]
ax1_r = ax1.twinx()ax1.twinx() creates a second y-axis (right) that shares the same x-axis as ax1 (left). This allows two different value ranges to coexist:
| Axis | Lines | Style | Scale label |
|---|---|---|---|
ax1 (left y) |
Raw z-scored F0 | Solid, linewidth=2 |
F0 (z-score) |
ax1_r (right y) |
First-order derivative dF0 | Dashed, alpha=0.6 |
dF0 (derivative) |
Both axes use the same colours (steel blue = NS, tomato = student), linking the two representations visually. This lets you simultaneously read where the pitch is (solid lines) and whether it is rising or falling (dashed lines) at each moment in the utterance.
10.8.3.4 Combined legend for dual-axis plot
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax1_r.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, fontsize=8, loc='upper right')By default, a matplotlib legend only captures the lines belonging to its own Axes object. Because ax1 and ax1_r are two separate Axes objects, their handle lists must be retrieved separately and concatenated before passing to ax1.legend(). This produces a single legend box listing all four lines (NS F0, Student F0, NS dF0, Student dF0).
10.8.3.5 Right panel — DDTW cost matrix
cf = ax2.contourf(x_idx, y_idx, cost_matrix, levels=20, cmap='YlOrRd')Unlike Stage 3, no sentinel masking is needed here because ddtw_with_matrix initialises C with np.zeros() and every cell is populated during the dynamic programming pass — there are no unvisited cells. The warping path overlay, colour bar, axis labels, and saving logic are identical to Stage 3.
10.8.3.6 Interpreting DDTW vs Standard DTW results side-by-side
| Observation | Interpretation |
|---|---|
| Lower DDTW distance than DTW | Student matches the shape of intonation better than the absolute pitch level |
| Higher DDTW distance than DTW | Student’s pitch level is close but rises and falls occur at different moments |
| Diagonal warping path (both) | Good temporal alignment — similar speech rate and rhythm |
| Path deviates near a region | Timing mismatch at that part of the utterance (e.g., different pause length) |
| High-cost cluster in matrix | The two speakers diverge strongly in that local time region |
10.9 Output Summary
| File | Content |
|---|---|
DTW_Output/01_Standard_DTW_z.xlsx |
Per-pair Standard DTW distances on z-scored F0 |
DTW_Output/02_DDTW_z.xlsx |
Per-pair DDTW distances on z-scored derivative F0 |
DTW_Output/Contours/DTW_*.png |
Standard DTW two-panel visualisations |
DTW_Output/Contours/DDTW_*.png |
DDTW two-panel visualisations |