Audio to Text Transcription Tutorial

This notebook walks through how to:

  1. Download audio from YouTube
  2. Transcribe speech to text with OpenAI Whisper
  3. Run speaker diarization with pyannote.audio (who spoke when)
  4. Combine both to produce a labeled transcript: [SPEAKER_00] 12.3s: Hello...

Note: This tutorial assumes basic familiarity with Python, including variables, lists, dictionaries, loops, and functions. It was developed with assistance from Claude Code.

Use Cases in Computational Social Science

Transcription makes it possible to run text analysis on non-text data such as podcasts and videos. For example:

  • Analyzing political speeches, press conferences, or legislative debates at scale
  • Studying how media coverage differs across radio/TV outlets
  • Tracking mentions of topics or actors over time in news broadcasts
  • Cross-lingual research: transcribing interviews or media from multiple countries

Audio-specific applications:

  • Studying conversational dynamics (turn-taking, interruptions) using diarization
  • Transcribing your own research outputs (recorded talks, interviews) to make them searchable, quotable, and easier to share

What is Whisper?

Whisper is an open-source Automatic Speech Recognition (ASR) model released by OpenAI in 2022 (Radford et al., 2022). It was trained on 680,000 hours of multilingual audio from the web and supports transcription in 99 languages. Unlike older ASR tools, it generalizes well out of the box, so most languages need no fine-tuning. Transcription accuracy still varies considerably across languages; see the Whisper GitHub repository for per-language results.

Training data: Podcasts, YouTube, audiobooks, lectures. About 65% of the 680,000 hours is English; roughly 117,000 hours cover 96 other languages. Transcripts were not human-verified (weak supervision), so accuracy varies by language and domain.

Model sizes and (relative) speeds:

Model     Parameters   VRAM      Speed (vs. large)
tiny      39 M         ~1 GB     ~10×
base      74 M         ~1 GB     ~7×
small     244 M        ~2 GB     ~4×
medium    769 M        ~5 GB     ~2×
large     1550 M       ~10 GB    1×
turbo     809 M        ~6 GB     ~8×
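
To make the trade-off concrete, here is a small sketch that lists which models fit a given amount of GPU memory, fastest first. The numbers are copied from the table above and are approximate; WHISPER_MODELS and models_that_fit are hypothetical names used only for illustration, not part of the Whisper API.

```python
# Approximate VRAM needs (GB) and relative speeds, copied from the table above.
WHISPER_MODELS = {
    "tiny":   {"vram_gb": 1,  "speed": 10},
    "base":   {"vram_gb": 1,  "speed": 7},
    "small":  {"vram_gb": 2,  "speed": 4},
    "medium": {"vram_gb": 5,  "speed": 2},
    "large":  {"vram_gb": 10, "speed": 1},
    "turbo":  {"vram_gb": 6,  "speed": 8},
}


def models_that_fit(available_vram_gb):
    """Return the model names that fit in the given VRAM, fastest first."""
    fitting = [
        name for name, spec in WHISPER_MODELS.items()
        if spec["vram_gb"] <= available_vram_gb
    ]
    return sorted(fitting, key=lambda name: WHISPER_MODELS[name]["speed"], reverse=True)


print(models_that_fit(6))
```

Speed is only one criterion: smaller models are faster but generally less accurate, so the first entry in the list is not automatically the best choice.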

Why do we need to do this in Python?

We don't! If you only want to transcribe a single audio file, you can upload it to a Whisper-based in-browser transcription and diarization tool: https://whispertranscribe.ai/

Drawbacks:

  • not feasible if you need to transcribe many files (because you need to keep the browser open the whole time)
  • costly (limits on how many minutes you can transcribe for free)
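
By contrast, a script can work through a whole folder unattended. The sketch below shows one way to structure such a batch run; it assumes a transcribe_fn callable that takes a file path and returns text (for example a thin wrapper around Whisper). The function names, the extension list, and the one-.txt-per-file layout are all illustrative assumptions.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".webm", ".mp3", ".wav", ".m4a"}  # assumed set; extend as needed


def pending_audio_files(audio_dir, transcript_dir):
    """List audio files in audio_dir that have no matching .txt transcript yet."""
    audio_dir, transcript_dir = Path(audio_dir), Path(transcript_dir)
    pending = []
    for path in sorted(audio_dir.iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # skip non-audio files
        if not (transcript_dir / f"{path.stem}.txt").exists():
            pending.append(path)
    return pending


def transcribe_all(audio_dir, transcript_dir, transcribe_fn):
    """Run transcribe_fn(path) -> str on each pending file and save the text."""
    transcript_dir = Path(transcript_dir)
    transcript_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for path in pending_audio_files(audio_dir, transcript_dir):
        text = transcribe_fn(path)
        (transcript_dir / f"{path.stem}.txt").write_text(text, encoding="utf-8")
        done.append(path.name)
    return done
```

Because finished files are skipped, an interrupted run can simply be restarted and will pick up where it left off.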

1. Install dependencies

The diarization model is hosted on Hugging Face, so we need a Hugging Face access token that can read it. Create one here: https://huggingface.co/settings/tokens

Local environment:
Set the token as an environment variable before launching Jupyter:

# Windows PowerShell
$env:HF_TOKEN = "hf_..."

# Linux/macOS
export HF_TOKEN="hf_..."

Google Colab: use the Secrets manager instead (key icon in the left sidebar):

  1. Click the key icon → Add new secret
  2. Name: HF_TOKEN, Value: your token
  3. Toggle Notebook access on

Then load it in your notebook:

from google.colab import userdata
import os
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

Accept the model license at: https://huggingface.co/pyannote/speaker-diarization-community-1
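
A missing token usually only surfaces later as a confusing download error, so it can help to check for it up front. The helper below is a hypothetical convenience, not part of any library; the "hf_" prefix check reflects the token format shown above and is only treated as a warning.

```python
import os


def require_hf_token(env=os.environ):
    """Return the Hugging Face token, or raise a readable error if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create a token at "
            "https://huggingface.co/settings/tokens and set it as described above."
        )
    if not token.startswith("hf_"):
        # Assumption: user access tokens start with "hf_", as in the examples above.
        print("Warning: HF_TOKEN does not start with 'hf_'; double-check the value.")
    return token
```

Calling require_hf_token() before loading the diarization pipeline fails fast with a clear message instead of a long traceback.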

In [22]:
from google.colab import userdata
import os
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
In [23]:
%pip install pyannote.audio openai-whisper av yt-dlp scipy
Requirement already satisfied: pyannote.audio in /usr/local/lib/python3.12/dist-packages (4.0.4)
Requirement already satisfied: openai-whisper in /usr/local/lib/python3.12/dist-packages (20250625)
Requirement already satisfied: av in /usr/local/lib/python3.12/dist-packages (17.0.1)
Requirement already satisfied: yt-dlp in /usr/local/lib/python3.12/dist-packages (2026.3.17)
Requirement already satisfied: scipy in /usr/local/lib/python3.12/dist-packages (1.16.3)

2. Imports

In [24]:
import os
import numpy as np
import torch
import av
from math import gcd
from scipy.signal import resample_poly
import whisper
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

3. Download test audio

The video we will use today is this two-minute NPR video: https://www.youtube.com/watch?v=ThN7CkeEXXk

We use yt_dlp to download the best-quality audio stream from a YouTube video. For YouTube, the result is typically a .webm file (Opus-encoded audio).

In [27]:
from google.colab import drive
drive.mount('/content/drive')
DATA_DIR = "/content/drive/MyDrive/transcription_tutorial"  # adjust path as needed
VIDEO_ID  = "ThN7CkeEXXk"
webm_path = f"{DATA_DIR}/{VIDEO_ID}.webm"

if not os.path.exists(webm_path):
    import yt_dlp
    ydl_opts = {
        "format": "bestaudio",
        "outtmpl": f"{DATA_DIR}/{VIDEO_ID}.%(ext)s",
        "noplaylist": True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={VIDEO_ID}"])
else:
    print("Audio already downloaded.")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[youtube] Extracting URL: https://www.youtube.com/watch?v=ThN7CkeEXXk
[youtube] ThN7CkeEXXk: Downloading webpage
WARNING: [youtube] No supported JavaScript runtime could be found. Only deno is enabled by default; to use another runtime add  --js-runtimes RUNTIME[:PATH]  to your command/config. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See  https://github.com/yt-dlp/yt-dlp/wiki/EJS  for details on installing one
[youtube] ThN7CkeEXXk: Downloading android vr player API JSON
WARNING: [youtube] No title found in player responses; falling back to title from initial data. Other metadata may also be missing
ERROR: [youtube] ThN7CkeEXXk: Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See  https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp  for how to manually pass cookies. Also see  https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies  for tips on effectively exporting YouTube cookies
DownloadError: ERROR: [youtube] ThN7CkeEXXk: Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See  https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp  for how to manually pass cookies. Also see  https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies  for tips on effectively exporting YouTube cookies
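
The run above stopped at YouTube's "Sign in to confirm you're not a bot" check, and the error message points to cookies as the way around it. A minimal sketch, assuming you have exported a Netscape-format cookies file from your browser and uploaded it: yt-dlp's Python API reads such a file through the cookiefile option. build_ydl_opts is a hypothetical helper and the cookie file name in the usage note is a placeholder. Passing cookies may still not be enough in every environment.

```python
import os


def build_ydl_opts(data_dir, video_id, cookie_file=None):
    """Build the yt-dlp options used above, optionally adding a cookies file."""
    opts = {
        "format": "bestaudio",
        "outtmpl": f"{data_dir}/{video_id}.%(ext)s",
        "noplaylist": True,
    }
    if cookie_file is not None:
        if not os.path.exists(cookie_file):
            raise FileNotFoundError(f"Cookie file not found: {cookie_file}")
        # "cookiefile" is the Python-API counterpart of the --cookies CLI flag
        opts["cookiefile"] = cookie_file
    return opts
```

Usage would look like yt_dlp.YoutubeDL(build_ydl_opts(DATA_DIR, VIDEO_ID, cookie_file="cookies.txt")). If that also fails, a fallback is to download the audio on your own machine and upload the file to DATA_DIR, since the download cell skips the download when the file already exists.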

4. Load audio into a tensor

We need to read the downloaded audio file into memory and package it in the format pyannote expects: a dictionary {"waveform": tensor, "sample_rate": int}.

What is PyAV?

PyAV is a Python library that wraps FFmpeg, i.e. the standard tool for decoding audio and video files. We use it to read the .webm file and extract the raw audio samples as a NumPy array, entirely in memory.

What is a tensor?

A tensor is a multi-dimensional array of numbers, similar to a NumPy array, but from the PyTorch library. PyTorch tensors can run on a GPU and support the operations deep learning models need. pyannote is built in PyTorch, so it requires input in this format.

What does pyannote expect?

A dictionary with two keys:

  • "waveform" — audio data as a tensor of shape (channels, samples), where channels is 1 (mono) or 2 (stereo) and samples is the total number of audio snapshots
  • "sample_rate" — how many samples per second (e.g. 48000)
In [ ]:
# Open the file with PyAV; the with-block closes the container when done
with av.open(webm_path) as container:
    audio_stream = container.streams.audio[0]
    sample_rate = audio_stream.sample_rate  # read this before the container closes

    # Decode every frame; for planar sample formats each frame
    # has shape (channels, samples)
    frames = [frame.to_ndarray() for frame in container.decode(audio_stream)]

# Stack the frames along the time axis and convert to a float32 tensor
waveform = torch.tensor(np.concatenate(frames, axis=1), dtype=torch.float32)

audio_input = {"waveform": waveform, "sample_rate": sample_rate}

print(f"Waveform shape: {waveform.shape}  (channels × samples)")
print(f"Sample rate:    {sample_rate} Hz")
print(f"Duration:       {waveform.shape[1] / sample_rate:.1f} s")

5. Transcribe with Whisper

Whisper expects mono, float32 audio at 16 kHz. Our downloaded audio is likely stereo and at 48 kHz, so we need to convert it first.

Mono vs. stereo: Stereo audio has two channels (left and right). Whisper was trained on single-channel audio, so we average the two channels into one using audio_np.mean(axis=0).

float32: Audio samples are numbers representing air pressure over time (roughly between -1.0 and 1.0). Whisper expects them stored as 32-bit floating point. For Opus audio, PyAV typically decodes to this format already; other codecs may yield integer samples that need rescaling first.

Sample rate: The sample rate is how many audio snapshots are taken per second. YouTube audio is typically 48,000 (48 kHz); Whisper expects 16,000 (16 kHz). If we feed it the wrong rate, Whisper will treat the audio as 3× longer (and slower) than it is and the transcription will be wrong. We use resample_poly from scipy to convert. The code reduces the ratio between the two rates to its simplest form using their GCD (16,000/48,000 = 1/3), and resample_poly then downsamples accordingly: it low-pass filters the signal and keeps one sample out of every three.
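
The GCD step is easy to check by hand. The sketch below computes the (up, down) factors that would be passed to resample_poly for a few common source rates; resample_factors is a hypothetical helper name.

```python
from math import gcd


def resample_factors(src_rate, target_rate):
    """Reduce target_rate / src_rate to lowest terms and return (up, down)."""
    g = gcd(src_rate, target_rate)
    return target_rate // g, src_rate // g


for src in (48000, 44100, 16000):
    up, down = resample_factors(src, 16000)
    print(f"{src} Hz -> 16000 Hz: up={up}, down={down}")
```

For 48 kHz input the factors are (1, 3). For 44.1 kHz the GCD is only 100, giving (160, 441), which is why the reduction matters: without it the intermediate upsampled signal would be far larger than necessary.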

Before running: Set your runtime to GPU for significantly faster performance: Runtime → Change runtime type → T4 GPU. Both Whisper and pyannote run 5–10× faster on GPU than CPU.

In [ ]:
# Step 1: Convert waveform tensor to a NumPy array
audio_np = waveform.numpy()
print(f"Original shape: {audio_np.shape}  (channels × samples)")

# Step 2: Mix to mono
if audio_np.shape[0] > 1:
    audio_np = audio_np.mean(axis=0)   # average across channels
else:
    audio_np = audio_np[0]
print(f"Mono shape:     {audio_np.shape}")

# Step 3: Resample to 16 kHz if needed
src_rate = audio_input["sample_rate"]  # stored earlier, before the container was closed
target_rate = 16000
if src_rate != target_rate:
    g = gcd(src_rate, target_rate)
    audio_np = resample_poly(audio_np, target_rate // g, src_rate // g)
    print(f"Resampled from {src_rate} Hz \u2192 {target_rate} Hz")

audio_np = audio_np.astype(np.float32)
print(f"Final shape:    {audio_np.shape}")
In [ ]:
# Load the Whisper model (the weights are downloaded on first use)
model = whisper.load_model("medium")

# fp16 is faster but only works on an NVIDIA (CUDA) GPU, so enable it only when one is available.
# verbose=True prints each segment as it's decoded.
use_fp16 = torch.cuda.is_available()
result = model.transcribe(audio_np, task="transcribe", language="en", fp16=use_fp16, verbose=True)

Reading the diagnostics: As a rough guide, segments where no_speech_prob > 0.6 are likely silence or background noise, and segments where avg_logprob < -1.0 are likely low-confidence transcriptions. You can filter these out before downstream analysis.

In [ ]:
for seg in result["segments"]:
    print(
        f"{seg['start']:6.1f}s | "
        f"no_speech={seg['no_speech_prob']:.2f} | "
        f"logprob={seg['avg_logprob']:.2f} | "
        #f"compression={seg['compression_ratio']:.2f} | "
        #f"temp={seg['temperature']} | "
        f"{seg['text'].strip()}"
    )
In [ ]:
result["segments"][10]
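
The thresholds described above can be turned into a simple filter. This is a minimal sketch: the helper name is_reliable and the example segments are made up for illustration, and the cutoffs are the rough guide values from the text rather than tuned settings.

```python
def is_reliable(seg, max_no_speech=0.6, min_avg_logprob=-1.0):
    """True if a segment passes both rough confidence thresholds."""
    return (seg["no_speech_prob"] <= max_no_speech
            and seg["avg_logprob"] >= min_avg_logprob)


# Made-up segments that mimic the keys found in result["segments"]
example_segments = [
    {"start": 0.0, "text": " Good morning.", "no_speech_prob": 0.02, "avg_logprob": -0.21},
    {"start": 4.5, "text": " Thank you.", "no_speech_prob": 0.85, "avg_logprob": -0.40},
    {"start": 9.0, "text": " uh the the", "no_speech_prob": 0.10, "avg_logprob": -1.35},
]

kept = [seg for seg in example_segments if is_reliable(seg)]
print([seg["text"].strip() for seg in kept])  # ['Good morning.']
```

In the notebook, the same filter could be applied as [seg for seg in result["segments"] if is_reliable(seg)]. It is worth reading a sample of the dropped segments before discarding them, since the cutoffs are only a rough guide.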

6. Load the pyannote diarization pipeline¶

The pipeline is downloaded from Hugging Face. If running locally, you only need to do this once — the model is cached automatically. With Google Colab, you'll need to re-download it every session unless you explicitly save the model to your Google Drive.

This requires that your HF_TOKEN is set and that you have accepted the model license on Hugging Face.

We move the pipeline to the GPU if one is available. CUDA is NVIDIA's framework for running computations on a GPU, and diarization runs significantly faster on GPU than CPU.

In [ ]:
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=os.environ["HF_TOKEN"]
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
print(f"Pipeline running on: {device}")

7. Run speaker diarization¶

The pipeline returns an output object whose speaker_diarization attribute is an Annotation containing time-stamped speaker turns. Each turn has a start time, end time, and a speaker label like SPEAKER_00, SPEAKER_01, etc.

The labels are arbitrary: SPEAKER_00 is just whichever speaker the model encountered first. You'd need a separate speaker identification step to map them to real names.

In [ ]:
with ProgressHook() as hook:
    output = pipeline(audio_input, hook=hook)
In [ ]:
print(f"Type of output: {type(output)}")
print(f"Attributes of output: {dir(output)}")
In [ ]:
print("\nDiarization output:")
for turn, _, speaker in output.speaker_diarization.itertracks(yield_label=True):
    print(f"  {speaker}  {turn.start:6.1f}s → {turn.end:6.1f}s")
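
Once the turns are available, a common first summary is total speaking time per speaker. The sketch below assumes the turns have been copied into plain (start, end, speaker) tuples; the helper name speaking_time and the example turns are made up for illustration.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum turn durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Made-up turns as (start, end, speaker) tuples
example_turns = [
    (0.0, 12.5, "SPEAKER_00"),
    (12.5, 20.0, "SPEAKER_01"),
    (20.0, 31.0, "SPEAKER_00"),
]

totals = speaking_time(example_turns)
print(totals)  # {'SPEAKER_00': 23.5, 'SPEAKER_01': 7.5}
```

With real output, the tuples could be built from the same loop used above: [(turn.start, turn.end, speaker) for turn, _, speaker in output.speaker_diarization.itertracks(yield_label=True)].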

8. Assign speakers to transcript segments¶

Whisper returns a list of segments, each with start, end, and text. We match each segment to the diarization output by finding the speaker turn that overlaps most with that time window.

The get_speaker function iterates over all diarization turns, computes each turn's overlap with the Whisper segment as min(turn.end, end) - max(turn.start, start), and returns the speaker of the turn with the largest overlap. If no turn overlaps the segment, it returns "UNKNOWN".

In [ ]:
def get_speaker(start, end, diarization_output):
    """Return the speaker with the most overlap with the time window [start, end]."""
    max_overlap = 0.0
    best_speaker = None

    for turn, _, speaker in diarization_output.speaker_diarization.itertracks(yield_label=True):
        overlap = min(turn.end, end) - max(turn.start, start)
        if overlap > max_overlap:
            max_overlap = overlap
            best_speaker = speaker

    return best_speaker if max_overlap > 0 else "UNKNOWN"
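
get_speaker picks the single turn with the largest overlap. When one speaker talks in several short turns within the same Whisper segment, summing overlap per speaker can give a different answer. The variant below is a sketch of that alternative, not the method used in this notebook; the name get_speaker_total_overlap is hypothetical, and it takes plain (start, end, speaker) tuples so it runs without pyannote.

```python
from collections import defaultdict


def get_speaker_total_overlap(start, end, turns):
    """Return the speaker with the largest summed overlap with [start, end].

    `turns` is a list of (turn_start, turn_end, speaker) tuples.
    """
    totals = defaultdict(float)
    for turn_start, turn_end, speaker in turns:
        overlap = min(turn_end, end) - max(turn_start, start)
        if overlap > 0:
            totals[speaker] += overlap
    if not totals:
        return "UNKNOWN"
    return max(totals, key=totals.get)


# Made-up turns: SPEAKER_01 has one long turn, SPEAKER_00 has two short ones
example_turns = [
    (0.0, 3.0, "SPEAKER_00"),
    (3.0, 7.0, "SPEAKER_01"),
    (7.0, 10.0, "SPEAKER_00"),
]

# Segment 0-10 s: SPEAKER_00 overlaps 3 + 3 = 6 s in total, SPEAKER_01 only 4 s
print(get_speaker_total_overlap(0.0, 10.0, example_turns))   # SPEAKER_00
print(get_speaker_total_overlap(20.0, 25.0, example_turns))  # UNKNOWN
```

On the same example, the single-best-turn rule would pick SPEAKER_01, whose 4-second turn is the longest individual overlap. Which rule is preferable depends on how choppy the diarization turns are.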
In [ ]:
# Print the labeled transcript
print("Labeled transcript:\n")
for seg in result["segments"]:
    speaker = get_speaker(seg["start"], seg["end"], output)
    print(f"[{speaker}] {seg['start']:6.1f}s: {seg['text'].strip()}")

9. Export to a DataFrame¶

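It's useful to store the labeled transcript as a CSV for downstream analysis. The cell below first merges consecutive Whisper segments from the same speaker into a single row (one row per speaker turn), then builds a DataFrame from those rows.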

In [ ]:
import pandas as pd

rows = []
current_speaker = None
current_text = []
current_start = None

for seg in result["segments"]:
    speaker = get_speaker(seg["start"], seg["end"], output)

    if speaker == current_speaker:
        current_text.append(seg["text"].strip())
    else:
        # If there was a previous speaker, save their combined segment
        if current_speaker is not None:
            rows.append({
                "speaker": current_speaker,
                "start": current_start,
                "end": seg["start"], # End at the start of the new speaker's segment
                "text": " ".join(current_text),
            })

        # Start a new segment for the new speaker
        current_speaker = speaker
        current_start = seg["start"]
        current_text = [seg["text"].strip()]

# Add the last accumulated segment
if current_speaker is not None:
    # Use the end time of the very last segment for the final entry
    last_seg_end = result["segments"][-1]["end"]
    rows.append({
        "speaker": current_speaker,
        "start": current_start,
        "end": last_seg_end,
        "text": " ".join(current_text),
    })

df = pd.DataFrame(rows)
df.head(10)
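
The merging step can also be sketched with itertools.groupby, which groups consecutive items that share a key. This is an alternative formulation, not the notebook's code: the function name merge_by_speaker is hypothetical, it assumes each segment already carries a "speaker" key, and it ends each row at the last segment's own end time rather than at the start of the next speaker's segment.

```python
from itertools import groupby


def merge_by_speaker(labeled_segments):
    """Merge runs of consecutive segments that share a speaker label."""
    rows = []
    for speaker, run in groupby(labeled_segments, key=lambda s: s["speaker"]):
        run = list(run)
        rows.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": " ".join(s["text"].strip() for s in run),
        })
    return rows


# Made-up labeled segments
example_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0, "text": " Hello."},
    {"speaker": "SPEAKER_00", "start": 4.2, "end": 7.0, "text": " Welcome back."},
    {"speaker": "SPEAKER_01", "start": 7.5, "end": 9.0, "text": " Thanks."},
]

merged = merge_by_speaker(example_segments)
for row in merged:
    print(row["speaker"], row["start"], row["end"], row["text"])
```

The resulting list of dictionaries has the same keys as rows above, so it could be passed to pd.DataFrame in the same way.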
In [ ]:
out_path = f"{DATA_DIR}/npr_diarized.csv"
df.to_csv(out_path, index=False)
print(f"Saved to {out_path}")

Notes and limitations¶

  • Hallucinations — on silent or noisy segments, Whisper can generate plausible-sounding but incorrect text. Check no_speech_prob, avg_logprob, and compression_ratio to flag low-confidence segments.
  • Timestamp drift — segment boundaries are approximate (±1–2 s), which can cause speaker misassignment at fast turn boundaries.
  • pyannote works best with 1–10 speakers — crowded rooms or overlapping speech may confuse the model.
  • GPU strongly recommended — medium on CPU runs at roughly real-time; GPU is 5–10× faster for both Whisper and pyannote.
  • Long audio — Whisper processes 30-second windows internally; for recordings over ~30 minutes, chunking first is recommended to avoid memory issues and accumulated drift.
  • Language accuracy varies — Spanish, French, and German work very well; low-resource language varieties may need manual spot-checking.
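
For the long-audio point above, one way to plan chunks is to compute overlapping time windows before slicing the array. This is a minimal sketch: the function name chunk_bounds is hypothetical, and the 10-minute chunk length and 5-second overlap are illustrative defaults, not values recommended by Whisper's documentation.

```python
def chunk_bounds(total_seconds, chunk_seconds=600.0, overlap_seconds=5.0):
    """Return (start, end) times covering [0, total_seconds] in overlapping chunks."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap_seconds must be smaller than chunk_seconds")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break
        # Step back slightly so speech at a chunk boundary is not cut off
        start = end - overlap_seconds
    return bounds


# A 25-minute (1500 s) recording in 10-minute chunks with 5 s of overlap
print(chunk_bounds(1500.0))
# [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500.0)]
```

Each pair could then be turned into a slice of the 16 kHz array, e.g. audio_np[int(start * 16000):int(end * 16000)]. When stitching results back together, segment timestamps need to be shifted by the chunk's start time, and text in the overlapping seconds will appear twice unless it is deduplicated.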

Extra: Language Auto-Detection and Multilingual Research¶

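Whisper can automatically detect the language of an audio clip, which is useful when working with multilingual corpora or when you don't know the source language in advance. Set language=None in model.transcribe() to enable this (it is also the default when language is omitted). Whisper will report the detected language code in result["language"]. Detection is based on the first 30 seconds of audio, so a clip that switches languages partway through is labeled by how it opens.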

This is particularly valuable in cross-national research, because you can run the same pipeline on media from many countries without specifying languages up front.

In [ ]:
result_auto = model.transcribe(audio_np, task="transcribe", language=None, fp16=False, verbose=False)
print(f"Detected language: {result_auto['language']}")

Extra: Transcribe and translate a WAV file with language auto-detection¶

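Whisper accepts a file path directly for standard formats like WAV; it decodes the file with ffmpeg and resamples it to 16 kHz mono internally, so no manual preprocessing is needed. Here we first transcribe chunk_0.wav with automatic language detection, then run a second pass with task="translate" to produce English text.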

In [ ]:
wav_path = f"{DATA_DIR}/chunk_0.wav"
result_wav = model.transcribe(wav_path, task="transcribe", language=None, fp16=False, verbose=True)
print(f"\nDetected language: {result_wav['language']}")
In [ ]:
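wav_path = f"{DATA_DIR}/chunk_0.wav"
# task="translate" returns English text. language="es" assumes the clip is Spanish;
# pass language=None instead to keep auto-detection.
result_wav = model.transcribe(wav_path, task="translate", language="es", fp16=False, verbose=True)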