%pip install requests beautifulsoup4 lxml selenium

Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (2.32.4)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)
Requirement already satisfied: lxml in /usr/local/lib/python3.12/dist-packages (6.1.0)
Requirement already satisfied: selenium in /usr/local/lib/python3.12/dist-packages (4.44.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests) (3.13)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests) (2.7.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests) (2026.4.22)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)
Requirement already satisfied: trio<1.0,>=0.31.0 in /usr/local/lib/python3.12/dist-packages (from selenium) (0.33.0)
Requirement already satisfied: trio-websocket<1.0,>=0.12.2 in /usr/local/lib/python3.12/dist-packages (from selenium) (0.12.2)
Requirement already satisfied: websocket-client<2.0,>=1.8.0 in /usr/local/lib/python3.12/dist-packages (from selenium) (1.9.0)
Requirement already satisfied: attrs>=23.2.0 in /usr/local/lib/python3.12/dist-packages (from trio<1.0,>=0.31.0->selenium) (26.1.0)
Requirement already satisfied: sortedcontainers in /usr/local/lib/python3.12/dist-packages (from trio<1.0,>=0.31.0->selenium) (2.4.0)
Requirement already satisfied: outcome in /usr/local/lib/python3.12/dist-packages (from trio<1.0,>=0.31.0->selenium) (1.3.0.post0)
Requirement already satisfied: sniffio>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from trio<1.0,>=0.31.0->selenium) (1.3.1)
Requirement already satisfied: wsproto>=0.14 in /usr/local/lib/python3.12/dist-packages (from trio-websocket<1.0,>=0.12.2->selenium) (1.3.2)
Requirement already satisfied: pysocks!=1.5.7,<2.0,>=1.5.6 in /usr/local/lib/python3.12/dist-packages (from urllib3[socks]<3.0,>=2.6.3->selenium) (1.7.1)
Requirement already satisfied: h11<1,>=0.16.0 in /usr/local/lib/python3.12/dist-packages (from wsproto>=0.14->trio-websocket<1.0,>=0.12.2->selenium) (0.16.0)

import requests

URL = "https://debates.org/voter-education/debate-transcripts/october-3-2012-debate-transcript/"

# A custom User-Agent identifies our request as coming from a research script,
# rather than appearing as an anonymous bot.
headers = {
    "User-Agent": "Mozilla/5.0 (research scraper; contact: your@email.com)"
}

# timeout stops the call from hanging indefinitely if the server never answers
response = requests.get(URL, headers=headers, timeout=30)

print(f"Status code : {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type')}")
print(f"Page size   : {len(response.text):,} characters")

Status code : 200
Content-Type: text/html; charset=UTF-8
Page size   : 115,578 characters

# Safe pattern: raise immediately if something went wrong
response.raise_for_status()

# Peek at the raw HTML
print(response.text[:500])

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
    <head>
        <meta http-equiv="Content-Language" content="en-us"/>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
        <title>CPD: October 3, 2012 Debate Transcript</title>
        <link href="/wp-content/themes/debates2019/css/reset.css" rel="stylesheet" type="text/css"/>
        <link href="/wp-content/themes/debates2019/css/jc-main.css" rel="stylesheet" type="text/css" media="screen,pr

from bs4 import BeautifulSoup

# 'lxml' is the fastest parser; 'html.parser' is built-in but slower
soup = BeautifulSoup(response.text, "lxml")

print(type(soup))
print("Page title:", soup.title.string)

<class 'bs4.BeautifulSoup'>
Page title: CPD: October 3, 2012 Debate Transcript

# Find by tag name
first_p = soup.find("p")
print("First <p>:", first_p.get_text()[:80])

# Find by id attribute
content_div = soup.find("div", id="content-sm")
print("\nFound content div:", content_div is not None)

# Find by CSS class
# soup.find("div", class_="some-class")  # note: class_ not class (reserved word)

# Count all <p> tags on the whole page
all_p = soup.find_all("p")
print(f"\nTotal <p> tags on page: {len(all_p)}")

# Count <p> tags only inside the content div
content_p = content_div.find_all("p")
print(f"<p> tags inside content div: {len(content_p)}")

First <p>: PRESIDENT BARACK OBAMA AND FORMER GOV. MITT ROMNEY, R-MASS., PRESIDENTIAL CANDID

Found content div: True

Total <p> tags on page: 478
<p> tags inside content div: 477

# Extract the page title
title = soup.find("h1").get_text(strip=True)
print("Title:", title)

# Extract all paragraph text from the content div
paragraphs = [p.get_text(strip=True) for p in content_p]

# Preview the first few
for i, para in enumerate(paragraphs[:6]):
    print(f"\n[{i}] {para[:120]}")

Title: October 3, 2012 Debate Transcript

[0] PRESIDENT BARACK OBAMA AND FORMER GOV. MITT ROMNEY,R-MASS., PRESIDENTIAL CANDIDATE, PARTICIPATE IN ACANDIDATES DEBATE, U

[1] OCTOBER 3, 2012

[2] SPEAKERS: FORMER GOV. MITT ROMNEY, R-MASS.

[3] PRESIDENT BARACK OBAMA

[4] JIM LEHRER, MODERATOR

[5] LEHRER: Good evening from the Magness Arena at the University of Denver in Denver, Colorado. I’m Jim Lehrer of the “PBS

import re

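# Pattern: the paragraph starts with an ALL-CAPS name (spaces and periods allowed), then a colon.
# Note: header labels such as "SPEAKERS:" also match and are counted as a speaker below.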
SPEAKER_RE = re.compile(r'^([A-Z][A-Z\s\.]+):\s*')

turns = []
for para in paragraphs:
    match = SPEAKER_RE.match(para)
    if match:
        speaker = match.group(1).strip()
        text    = para[match.end():].strip()
        turns.append({"speaker": speaker, "text": text})

print(f"Found {len(turns)} speaker turns\n")
for turn in turns[:10]:
    print(f"[{turn['speaker']}]  {turn['text'][:100]}")

Found 207 speaker turns

[SPEAKERS]  FORMER GOV. MITT ROMNEY, R-MASS.
[LEHRER]  Good evening from the Magness Arena at the University of Denver in Denver, Colorado. I’m Jim Lehrer 
[LEHRER]  This debate and the next three — two presidential, one vice presidential — are sponsored by the Comm
[LEHRER]  You have two minutes. Each of you have two minutes to start. A coin toss has determined, Mr. Preside
[OBAMA]  Well, thank you very much, Jim, for this opportunity. I want to thank Governor Romney and the Univer
[LEHRER]  Governor Romney, two minutes.
[ROMNEY]  Thank you, Jim. It’s an honor to be here with you, and I appreciate the chance to be with the presid
[ROMNEY]  Now, I’m concerned that the path that we’re on has just been unsuccessful. The president has a view 
[LEHRER]  Mr. President, please respond directly to what the governor just said about trickle-down — his trick
[OBAMA]  Well, let me talk specifically about what I think we need to do. First, we’ve got to improve our edu
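The loop above keeps only paragraphs that begin with a speaker label, so unlabeled paragraphs are dropped and the `SPEAKERS:` header is recorded as a speaker. A minimal sketch of one alternative follows. It assumes that an unlabeled paragraph continues the previous speaker's turn, which should be checked against the page before relying on it. `group_turns` and `NON_SPEAKERS` are hypothetical names, and the sample paragraphs are made up.

```python
import re

# Same pattern as SPEAKER_RE in the cell above
SPEAKER_RE = re.compile(r'^([A-Z][A-Z\s\.]+):\s*')

# Assumption: labels that match the pattern but are page headers, not speakers
NON_SPEAKERS = {"SPEAKERS"}

def group_turns(paragraphs):
    """Build speaker turns, appending unlabeled paragraphs to the previous turn."""
    turns = []
    for para in paragraphs:
        match = SPEAKER_RE.match(para)
        if match:
            speaker = match.group(1).strip()
            if speaker in NON_SPEAKERS:
                continue  # skip header lines such as "SPEAKERS: ..."
            turns.append({"speaker": speaker, "text": para[match.end():].strip()})
        elif turns and para.strip():
            # No label: assume the current speaker is still talking
            turns[-1]["text"] += " " + para.strip()
    return turns

# Made-up sample in the same shape as the scraped paragraphs
sample = [
    "OCTOBER 3, 2012",
    "SPEAKERS: FORMER GOV. MITT ROMNEY, R-MASS.",
    "LEHRER: Good evening.",
    "Welcome to the first debate.",
    "OBAMA: Thank you, Jim.",
]
for turn in group_turns(sample):
    print(turn)
```

Paragraphs that appear before the first real speaker label are ignored, because there is no turn to attach them to yet.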

from collections import Counter

speaker_counts = Counter(t["speaker"] for t in turns)
print("Turn counts per speaker:")
for speaker, count in speaker_counts.most_common():
    print(f"  {speaker:10s}  {count}")

# Word counts per speaker
word_counts = Counter()
for turn in turns:
    word_counts[turn["speaker"]] += len(turn["text"].split())

print("\nWord counts per speaker:")
for speaker, count in word_counts.most_common():
    print(f"  {speaker:10s}  {count:,}")

Turn counts per speaker:
  LEHRER      80
  ROMNEY      71
  OBAMA       55
  SPEAKERS    1

Word counts per speaker:
  ROMNEY      2,368
  OBAMA       1,890
  LEHRER      1,280
  SPEAKERS    5

import html

def clean_text(text: str) -> str:
    text = html.unescape(text)                     # decode HTML entities
    text = text.replace("\xa0", " ")               # non-breaking spaces
    text = re.sub(r"\([A-Z ]+\)", "", text)        # stage directions
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

# Apply to all turns
for turn in turns:
    turn["text"] = clean_text(turn["text"])

raw_example = """
Governor Romney has a perspective that says if we cut taxes, skewed towards the wealthy, and roll back regulations, that we&#8217;ll be better off.
I&#8217;ve got a different view.
"""
print("Before:", raw_example[:200])
print("After: ", clean_text(raw_example)[:200])

Before: 
Governor Romney has a perspective that says if we cut taxes, skewed towards the wealthy, and roll back regulations, that we&#8217;ll be better off.
I&#8217;ve got a different view.

After:  Governor Romney has a perspective that says if we cut taxes, skewed towards the wealthy, and roll back regulations, that we’ll be better off. I’ve got a different view.
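The stage-direction rule in `clean_text` is worth probing with a few edge cases. The sketch below repeats the function so it runs on its own and feeds it two made-up strings. It suggests that only ALL-CAPS parentheticals are removed, so a lowercase note such as `(inaudible)` would survive.

```python
import html
import re

def clean_text(text: str) -> str:
    text = html.unescape(text)                     # decode HTML entities
    text = text.replace("\xa0", " ")               # non-breaking spaces
    text = re.sub(r"\([A-Z ]+\)", "", text)        # ALL-CAPS stage directions only
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

# Made-up inputs, not taken from the transcript
examples = [
    "Thank you.\xa0(APPLAUSE) Let&#8217;s begin.",
    "That is (inaudible) not true. (CROSSTALK)",
]
for raw in examples:
    print(repr(clean_text(raw)))
```

If lowercase directions turn out to matter for an analysis, the pattern could be widened, at the risk of also deleting ordinary parenthetical remarks.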

# Plain text — readable transcript
with open("debate_2012_oct3.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")
    f.write("=" * len(title) + "\n\n")
    for turn in turns:
        f.write(f"{turn['speaker']}: {turn['text']}\n\n")

print("Saved debate_2012_oct3.txt")

Saved debate_2012_oct3.txt

import json

debate_data = {"title": title, "url": URL, "turns": turns}

with open("debate_2012_oct3.json", "w", encoding="utf-8") as f:
    json.dump(debate_data, f, indent=2, ensure_ascii=False)

print("Saved debate_2012_oct3.json")
print(f"  {len(turns)} turns, {sum(len(t['text'].split()) for t in turns):,} words")

Saved debate_2012_oct3.json
  207 turns, 5,542 words
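A quick way to check a saved file is to load it back and compare it with the data in memory. The sketch below does this with a small made-up `debate_data` dictionary and a temporary directory, so it does not touch the files written above. The file name and contents are placeholders.

```python
import json
import os
import tempfile

# Made-up stand-in with the same structure as debate_data above
debate_data = {
    "title": "Example Debate Transcript",
    "url": "https://example.org/debate",
    "turns": [
        {"speaker": "LEHRER", "text": "Good evening."},
        {"speaker": "OBAMA", "text": "I’ve got a different view."},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "debate_example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(debate_data, f, indent=2, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print("Round trip intact:", loaded == debate_data)
print(loaded["turns"][1]["text"])
```

With `ensure_ascii=False` the curly apostrophe is stored as a readable character rather than a `\u2019` escape, which is why the file is opened with `encoding="utf-8"` on both sides.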

# Step 1: find all transcript links from the sidebar
BASE_URL = "https://debates.org"

left_menu = soup.find("div", id="leftmenu")
transcript_links = [
    {"text": a.get_text(strip=True), "url": BASE_URL + a["href"]}
    for a in left_menu.find_all("a")
    if "/voter-education/debate-transcripts/" in a.get("href", "")
    and "translations" not in a.get("href", "")
]

print(f"Found {len(transcript_links)} transcript links\n")
for link in transcript_links[:5]:
    print(f"  {link['text']}")
    print(f"  {link['url']}\n")

Found 48 transcript links

  Debate Transcripts
  https://debates.org/voter-education/debate-transcripts/

  October 07, 2020 Vice Presidential Debate Transcript
  https://debates.org/voter-education/debate-transcripts/vice-presidential-debate-at-the-university-of-utah-in-salt-lake-city-utah/

  October 22, 2020 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/october-22-2020-debate-transcript/

  September 29, 2020 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/september-29-2020-debate-transcript/

  October 19, 2016 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/october-19-2016-debate-transcript/

# Step 2: filter to links with dates
import re

transcript_links = [
    {"text": a.get_text(strip=True), "url": BASE_URL + a["href"]}
    for a in left_menu.find_all("a")
    if "/voter-education/debate-transcripts/" in a.get("href", "")
    and "translations" not in a.get("href", "")
    and re.search(r'\d{4}', a.get_text())  # only links with a year in the text
]

print(f"Found {len(transcript_links)} transcript links\n")
for link in transcript_links[:5]:
    print(f"  {link['text']}")
    print(f"  {link['url']}\n")

Found 47 transcript links

  October 07, 2020 Vice Presidential Debate Transcript
  https://debates.org/voter-education/debate-transcripts/vice-presidential-debate-at-the-university-of-utah-in-salt-lake-city-utah/

  October 22, 2020 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/october-22-2020-debate-transcript/

  September 29, 2020 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/september-29-2020-debate-transcript/

  October 19, 2016 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/october-19-2016-debate-transcript/

  October 9, 2016 Debate Transcript
  https://debates.org/voter-education/debate-transcripts/october-9-2016-debate-transcript/
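Building URLs with `BASE_URL + a["href"]` works when every `href` is a site-relative path starting with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with absolute and page-relative links. In the sketch below the third `href` is a hypothetical page-relative link added for illustration.

```python
from urllib.parse import urljoin

# The page the links were found on
page_url = "https://debates.org/voter-education/debate-transcripts/"

hrefs = [
    "/voter-education/debate-transcripts/october-3-2012-debate-transcript/",  # site-relative
    "https://debates.org/voter-education/debate-transcripts/",                # already absolute
    "october-9-2016-debate-transcript/",                                      # hypothetical page-relative link
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

Plain concatenation would turn the second entry into a doubled domain and the third into a URL with no separating slash, so `urljoin` is the more forgiving choice when the link format is not known in advance.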

import time

def scrape_transcript(url: str, fallback_title: str = "Unknown") -> dict:
    r = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging on a stalled server
    r.raise_for_status()

    page_soup   = BeautifulSoup(r.text, "lxml")
    content_div = page_soup.find("div", id="content-sm")
    if not content_div:
        return {}

    title_tag  = page_soup.find("h1")
    title      = title_tag.get_text(strip=True) if title_tag else fallback_title
    paragraphs = [p.get_text(strip=True) for p in content_div.find_all("p")]

    turns = []
    for para in paragraphs:
        match = SPEAKER_RE.match(para)
        if match:
            turns.append({
                "speaker": match.group(1).strip(),
                "text":    clean_text(para[match.end():])
            })

    return {"title": title, "url": url, "turns": turns}


# Scrape the first 3 transcripts as a demo
corpus = []
for link in transcript_links[:3]:
    print(f"Scraping: {link['text']}...")
    data = scrape_transcript(link["url"], fallback_title=link["text"])
    if data:
        corpus.append(data)
        print(f"  → {len(data['turns'])} turns, "
              f"{sum(len(t['text'].split()) for t in data['turns']):,} words")
    time.sleep(2)  # be polite

print(f"\nCorpus: {len(corpus)} debates")

Scraping: October 07, 2020 Vice Presidential Debate Transcript...
  → 246 turns, 14,926 words
Scraping: October 22, 2020 Debate Transcript...
  → 354 turns, 17,997 words
Scraping: September 29, 2020 Debate Transcript...
  → 858 turns, 17,263 words

Corpus: 3 debates
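Pausing between requests is one part of polite scraping. Another is checking a site's `robots.txt` before crawling many pages. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt` supplied as a string, so no request is made. The rules, the `research-scraper` agent name, and the `example.org` URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "research-scraper"  # assumed user-agent token
urls = [
    "https://example.org/transcripts/1",
    "https://example.org/private/notes",
]
allowed = {url: parser.can_fetch(AGENT, url) for url in urls}
for url, ok in allowed.items():
    print(f"{url} -> {'allowed' if ok else 'disallowed'}")

# Fall back to 2 seconds if the file states no crawl delay
delay = parser.crawl_delay(AGENT) or 2
print("Delay between requests:", delay, "seconds")
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the real file instead of parsing a string.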

with open("debate_corpus.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, indent=2, ensure_ascii=False)

print("Saved debate_corpus.json\n")
for debate in corpus:
    speakers = Counter(t["speaker"] for t in debate["turns"])
    words    = sum(len(t["text"].split()) for t in debate["turns"])
    print(f"  {debate['title'][:50]}")
    print(f"    words: {words:,}   speakers: {dict(speakers.most_common(3))}")

Saved debate_corpus.json

  October 07, 2020 Vice Presidential Debate Transcri
    words: 14,926   speakers: {'PAGE': 93, 'PENCE': 89, 'HARRIS': 62}
  October 22, 2020 Debate Transcript
    words: 17,997   speakers: {'WELKER': 146, 'TRUMP': 122, 'BIDEN': 84}
  September 29, 2020 Debate Transcript
    words: 17,263   speakers: {'TRUMP': 341, 'BIDEN': 269, 'WALLACE': 246}

# SAVE_DIR is defined in the Google Drive cell further down; run that cell first.
with open(SAVE_DIR + "debate_corpus.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, indent=2, ensure_ascii=False)

print(f"Saved debate_corpus.json ({len(corpus)} debates)")

Saved debate_corpus.json (3 debates)

import matplotlib.pyplot as plt

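# squeeze=False keeps `axes` two-dimensional even when the corpus holds a single debate
fig, axes = plt.subplots(1, len(corpus), figsize=(14, 5), sharey=True, squeeze=False)

for ax, debate in zip(axes[0], corpus):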
    speaker_counts = Counter(t["speaker"] for t in debate["turns"])
    top = [(s, c) for s, c in speaker_counts.most_common() if c > 1]
    if not top:
        continue
    names, counts = zip(*top)

    ax.bar(names, counts)
    ax.set_title(debate["title"][:40], fontsize=9)
    ax.set_xlabel("Speaker")
    ax.set_ylabel("Turns")
    ax.tick_params(axis="x", rotation=15)

plt.suptitle("Speaker Turns per Debate", fontsize=12)
plt.tight_layout()
plt.show()

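# squeeze=False keeps `axes` two-dimensional even when the corpus holds a single debate
fig, axes = plt.subplots(1, len(corpus), figsize=(14, 5), sharey=True, squeeze=False)

for ax, debate in zip(axes[0], corpus):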
    speaker_counts = Counter(t["speaker"] for t in debate["turns"])
    word_totals = Counter()
    for t in debate["turns"]:
        word_totals[t["speaker"]] += len(t["text"].split())

    top_speakers = [s for s, _ in speaker_counts.most_common() if speaker_counts[s] > 1]
    avg_words = [word_totals[s] / speaker_counts[s] for s in top_speakers]

    ax.bar(top_speakers, avg_words)
    ax.set_title(debate["title"][:40], fontsize=9)
    ax.set_xlabel("Speaker")
    ax.set_ylabel("Avg words per turn")
    ax.tick_params(axis="x", rotation=15)

plt.suptitle("Average Words per Turn by Speaker", fontsize=12)
plt.tight_layout()
plt.show()

# Install Google Chrome in Colab (skip if running locally with Chrome already installed)
import sys
if "google.colab" in sys.modules:
    !wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    !apt-get install -y ./google-chrome-stable_current_amd64.deb -qq
    print("Chrome installed")
else:
    print("Not in Colab — using local Chrome installation")

Chrome installed

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver(headless: bool = True) -> webdriver.Chrome:
    """Create a configured Chrome WebDriver."""
    options = Options()
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1080")
    options.add_argument(
        "user-agent=Mozilla/5.0 (research scraper; contact: your@email.com)"
    )
    # No binary_location or Service needed — Selenium Manager finds the matching
    # chromedriver automatically for a standard Google Chrome install.
    return webdriver.Chrome(options=options)

driver = make_driver()
print("Driver created:", driver.capabilities["browserName"],
      driver.capabilities.get("browserVersion", ""))

Driver created: chrome 148.0.7778.178

from selenium.webdriver.common.by import By

driver.get(URL)
print("Title  :", driver.title)
print("URL    :", driver.current_url)
print("HTML   :", len(driver.page_source), "chars")

Title  : CPD: October 3, 2012 Debate Transcript
URL    : https://debates.org/voter-education/debate-transcripts/october-3-2012-debate-transcript/
HTML   : 109440 chars

# --- Finding elements ---

# By ID  (equivalent to soup.find('div', id='content-sm'))
content_div = driver.find_element(By.ID, "content-sm")
print("Found content div:", content_div.tag_name)

# By CSS selector  (equivalent to soup.select('#content-sm p'))
paragraphs_sel = driver.find_elements(By.CSS_SELECTOR, "#content-sm p")
print(f"Paragraphs via CSS selector: {len(paragraphs_sel)}")

# By tag name
heading = driver.find_element(By.TAG_NAME, "h1")
print("H1 text:", heading.text)

# Getting text and attributes from an element
first_link = driver.find_element(By.CSS_SELECTOR, "#leftmenu a")
print("First sidebar link text:", first_link.text)
print("First sidebar link href:", first_link.get_attribute("href"))

Found content div: div
Paragraphs via CSS selector: 478
H1 text: October 3, 2012 Debate Transcript
First sidebar link text: Debate Videos
First sidebar link href: https://debates.org/voter-education/debate-videos/

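# Implicit wait: retry all find_element calls for up to 5 seconds.
# Note: Selenium's documentation advises against combining implicit and explicit
# waits, because together they can produce unpredictable wait times.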
driver.implicitly_wait(5)
print("Implicit wait set to 5 seconds")

Implicit wait set to 5 seconds

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Reload the page and wait until the content div is present
driver.get(URL)

wait = WebDriverWait(driver, timeout=10)

# Wait until the element is present in the DOM
content_div = wait.until(
    EC.presence_of_element_located((By.ID, "content-sm"))
)
print("Content div found after wait:", content_div.tag_name)

# Other useful expected conditions:
# EC.visibility_of_element_located(...)  — element exists AND is visible
# EC.element_to_be_clickable(...)        — element is visible and enabled
# EC.text_to_be_present_in_element(...)  — element contains specific text
# EC.staleness_of(element)               — wait for an element to disappear (e.g. after a reload)

Content div found after wait: div
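Conceptually, an explicit wait is a polling loop: it calls a condition repeatedly until the condition returns something truthy or the timeout expires. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` and `element_ready` are hypothetical names standing in for `WebDriverWait.until` and an expected condition.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Stand-in for "the element has appeared": truthy on the third check
attempts = {"n": 0}

def element_ready():
    attempts["n"] += 1
    return "div" if attempts["n"] >= 3 else None

found = wait_until(element_ready, timeout=2, poll=0.01)
print("Found:", found, "after", attempts["n"], "checks")
```

The value returned by the condition is passed back to the caller, which mirrors how `wait.until(...)` above returns the located element.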

from selenium.webdriver.common.action_chains import ActionChains

# Scroll to the bottom of the page (useful for infinite-scroll sites)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print("Scrolled to bottom")

# Scroll back to top
driver.execute_script("window.scrollTo(0, 0);")

# Click an element
# button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
# button.click()

# Type into a text input
# search_box = driver.find_element(By.NAME, "q")
# search_box.send_keys("Obama Romney debate")
# search_box.submit()

Scrolled to bottom
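On an infinite-scroll page a single scroll is rarely enough. A common pattern is to keep scrolling until the document height stops growing. The sketch below shows that loop under stated assumptions: `scroll_to_end` is a hypothetical helper, and `FakeDriver` is a stand-in that imitates `execute_script` so the loop can run without a browser.

```python
import time

def scroll_to_end(driver, pause=1.0, max_rounds=20, sleep=time.sleep):
    """Scroll down until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: the page grows with each scroll."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.position]
        # A scroll: move to the next height if the page has more content
        self.position = min(self.position + 1, len(self.heights) - 1)
        return None


fake = FakeDriver([1000, 2000, 3000])
rounds_needed = scroll_to_end(fake, pause=0, sleep=lambda seconds: None)
print("Scrolls performed:", rounds_needed)
```

With a real driver the same function could be called as `scroll_to_end(driver)`. The `max_rounds` cap guards against pages that never stop loading.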

<div id="transcript">
  <!-- content loaded by script -->
</div>

BLUESKY_URL = "https://bsky.app/profile/hadley.nz"

# Attempt with requests — this is what BeautifulSoup sees
r = requests.get(BLUESKY_URL, headers=headers, timeout=30)
bs_soup = BeautifulSoup(r.text, "lxml")

print(f"Status  : {r.status_code}")
print(f"Title   : {bs_soup.title.string if bs_soup.title else 'none'}")
print(f"HTML size: {len(r.text):,} chars\n")

# Try to find any post text in the raw HTML
body_text = bs_soup.get_text(separator=" ", strip=True)
print("Body text from requests (first 400 chars):")
print(body_text[:400])

Status  : 200
Title   : @hadley.nz on Bluesky
HTML size: 8,391 chars

Body text from requests (first 400 chars):
@hadley.nz on Bluesky JavaScript Required This is a heavily interactive web application, and JavaScript is required. Simple HTML interfaces are possible, but that is not what this is. Learn more about Bluesky at bsky.social and atproto.com . Profile Hadley Wickham hadley.nz did:plc:iz6v2itga76zik4okvzlv6di R, data, 🐕, 🍸, 🌈. He/him.

# Now try with Selenium — the browser runs the JavaScript and populates the page
driver2 = make_driver()
driver2.get(BLUESKY_URL)

# Wait until post text elements are present in the DOM
wait2 = WebDriverWait(driver2, timeout=15)
wait2.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[data-testid='postText']")))

bsky_soup = BeautifulSoup(driver2.page_source, "lxml")

print(f"Page title: {driver2.title}")
print(f"HTML size : {len(driver2.page_source):,} chars\n")

# Extract post text — Bluesky marks each post body with data-testid="postText"
post_elements = bsky_soup.find_all(attrs={"data-testid": "postText"})
bsky_posts = [{"post_num": i + 1, "text": p.get_text(strip=True)}
              for i, p in enumerate(post_elements)]

print(f"Found {len(bsky_posts)} posts\n")
for post in bsky_posts[:5]:
    print(f"[{post['post_num']}] {post['text'][:120]}")
    print()

driver2.quit()

Page title: Hadley Wickham (@hadley.nz) — Bluesky
HTML size : 685,934 chars

Found 29 posts

[1] I hate that “an SQL query” and “a SQL query” might be both correct.

[2] AI is expensive, but we're doing our best to make it available as cheaply as possible.

[3] We’re excited to launch the Posit Open Source website! 🚀

We’ve unified 15+ years of tools, 900+ blogs, and 1,600+ video

[4] I've been thinking that not only did I learn to code before AI, I learned to code before StackOverflow, and even before 

[5] Back in my day we really did have to walk up hill to school both ways, in the snow.

# Mount Google Drive so we can save files directly to it
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import csv, os

SAVE_DIR = "/content/drive/My Drive/Teaching/web_scraping_and_langextract_tutorial/"
os.makedirs(SAVE_DIR, exist_ok=True)

with open(SAVE_DIR + "bsky_posts.csv", "w", newline="", encoding="utf-8") as f:
    # Field names must match the keys of each post dict ("post_num" and "text")
    writer = csv.DictWriter(f, fieldnames=["post_num", "text"])
    writer.writeheader()
    writer.writerows(bsky_posts)

print(f"Saved bsky_posts.csv  ({len(bsky_posts)} rows)")

with open(SAVE_DIR + "bsky_posts.txt", "w", encoding="utf-8") as f:
    f.write(f"Bluesky posts from: {BLUESKY_URL}\n")
    f.write("=" * 60 + "\n\n")
    for post in bsky_posts:
        f.write(f"[{post['post_num']}] {post['text']}\n\n")

print("Saved bsky_posts.txt")

Saved bsky_posts.csv  (29 rows)
Saved bsky_posts.txt
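Post text often contains commas and line breaks, which is why the file is opened with `newline=""` and written through the `csv` module rather than by hand. The sketch below writes two made-up posts to an in-memory buffer and reads them back to show that such text survives the round trip. The posts are invented for illustration.

```python
import csv
import io

# Made-up posts in the same shape as bsky_posts
posts = [
    {"post_num": 1, "text": "A short post."},
    {"post_num": 2, "text": "A post with a line break:\n\nand a comma, too."},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["post_num", "text"])
writer.writeheader()
writer.writerows(posts)

# Read the same data back
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print("Rows read back:", len(rows))
print(repr(rows[1]["text"]))
```

Values come back as strings, so `post_num` would need `int(...)` if it is used as a number later.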

Code	Meaning
200	OK — request succeeded
403	Forbidden — server refused the request
404	Not Found — the URL does not exist
429	Too Many Requests — you are being rate-limited
5xx	Server error — try again later
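The table suggests a simple policy: retry on 429 and 5xx responses after a pause, and treat the other codes as final. The sketch below is one way to express that, under stated assumptions. `fetch_with_retry`, `RETRYABLE`, and `FakeResponse` are hypothetical names, and the `get` argument is any callable that returns an object with a `status_code` attribute, so the example runs without network access.

```python
import time

# Assumption: these codes are worth retrying; 403 and 404 are treated as final
RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retry(get, url, retries=3, backoff=2.0, sleep=time.sleep):
    """Call get(url) until the status is not retryable or the attempts run out."""
    for attempt in range(retries + 1):
        response = get(url)
        if response.status_code not in RETRYABLE:
            return response
        if attempt < retries:
            sleep(backoff * 2 ** attempt)  # wait 2 s, then 4 s, then 8 s
    return response  # still failing after the last attempt


class FakeResponse:
    """Stand-in for a response object, carrying only a status code."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a server that rate-limits once, errors once, then succeeds
statuses = iter([429, 503, 200])
waits = []
result = fetch_with_retry(
    lambda url: FakeResponse(next(statuses)),
    "https://example.org/page",
    sleep=waits.append,  # record the waits instead of sleeping
)
print("Final status:", result.status_code)
print("Waits (seconds):", waits)
```

With `requests`, the `get` argument could be something like `lambda u: requests.get(u, headers=headers, timeout=30)`, followed by `raise_for_status()` on the returned response.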

Situation	Use
Static HTML page (like debates.org)	`requests` + BeautifulSoup ✓
Content loaded by JavaScript after page load	Selenium ✓
Infinite scroll / "Load more" buttons	Selenium ✓
Login required before accessing content	Selenium ✓
Filling in search forms or dropdowns	Selenium ✓
Large-scale crawling of static pages	`requests` + BeautifulSoup (faster) ✓

Method	Description
`driver.get(url)`	Navigate to a URL
`driver.page_source`	Get the current page's full HTML
`driver.title`	Get the page title
`driver.current_url`	Get the current URL
`driver.back()` / `driver.forward()`	Browser navigation
`driver.quit()`	Close the browser

`By.*`	Example	Equivalent CSS/HTML
`By.ID`	`By.ID, "content-sm"`	`#content-sm`
`By.CLASS_NAME`	`By.CLASS_NAME, "lmtop"`	`.lmtop`
`By.TAG_NAME`	`By.TAG_NAME, "h1"`	`h1`
`By.CSS_SELECTOR`	`By.CSS_SELECTOR, "#content-sm p"`	any CSS selector
`By.XPATH`	`By.XPATH, "//div[@id='content-sm']//p"`	XPath expression
`By.LINK_TEXT`	`By.LINK_TEXT, "Debate Videos"`	exact `<a>` text

Web Scraping with BeautifulSoup and Selenium

Table of Contents

1. Before You Scrape: Ethical and Legal Considerations

2. Installation

3. Fetching a Webpage with `requests`

Understanding the status code

4. Parsing HTML with BeautifulSoup

How BeautifulSoup represents HTML

5. Navigating the HTML Tree

6. Extracting the Debate Transcript

Identifying speakers

7. Cleaning the Text

8. Saving to File

9. Scaling Up

Data Visualization

Part 2: Selenium

10. When to Use Selenium

11. Setting Up Selenium

12. Basic Navigation and Element Selection

Locator strategies at a glance

13. Waits and Dynamic Content

Implicit wait

Explicit wait

Simulating user interactions

14. When Selenium is the only option

Demo: Bluesky

15. Exercises

Sources
