# Install core library (Gemini support included)
%pip install langextract

Requirement already satisfied: langextract in /usr/local/lib/python3.12/dist-packages (1.5.0)

import os
from google.colab import userdata

os.environ["LANGEXTRACT_API_KEY"] = userdata.get('GEMINI_API_KEY')

import langextract as lx
import textwrap

print("LangExtract imported successfully")

LangExtract imported successfully

# --- Sample text ---
news_text = """
Francisco Avelino Vieira de Carvalho (born 4 December 1970) is a Cape Verdean politician and the leader of the African Party for the Independence of Cape Verde.
He is also the mayor of the municipality of Praia since 2020. In the 2026 parliamentary election, Carvalho's party secured a majority in the National Assembly,
making him the prime minister–designate of the country.
"""

# --- Extraction prompt ---
ner_prompt = textwrap.dedent("""
    Extract named entities from the text.
    Classify each entity as one of: person, title, location, or date.
    Use the exact text from the source. Do not paraphrase.
""")

# --- Few-shot examples ---
ner_examples = [
    lx.data.ExampleData(
        text="Adam Roberts is a PhD candidate in Political Science at the University of Rochester in Rochester, NY. He started his PhD on September 1, 2022.",
        extractions=[
            lx.data.Extraction(extraction_class="person",       extraction_text="Adam Roberts"),
            lx.data.Extraction(extraction_class="title",        extraction_text="PhD candidate"),
            lx.data.Extraction(extraction_class="location",     extraction_text="Rochester, NY"),
            lx.data.Extraction(extraction_class="date",         extraction_text="September 1, 2022"),
        ]
    )
]

# --- Run extraction ---
ner_result = lx.extract(
    text_or_documents=news_text,
    prompt_description=ner_prompt,
    examples=ner_examples,
    model_id="gemini-2.5-flash",
)

print(f"\nFound {len(ner_result.extractions)} entities\n")
for e in ner_result.extractions:
    print(f"  [{e.extraction_class:12s}]  '{e.extraction_text}'")

LangExtract: model=gemini-2.5-flash, current=376 chars, processed=0 chars:  [00:03]

Found 12 entities

  [person      ]  'Francisco Avelino Vieira de Carvalho'
  [date        ]  '4 December 1970'
  [title       ]  'Cape Verdean politician'
  [title       ]  'leader of the African Party for the Independence of Cape Verde'
  [location    ]  'Cape Verde'
  [title       ]  'mayor'
  [location    ]  'Praia'
  [date        ]  '2020'
  [date        ]  '2026'
  [person      ]  'Carvalho'
  [location    ]  'National Assembly'
  [title       ]  'prime minister–designate'

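for e in ner_result.extractions:
    # char_interval may be None if an extraction could not be aligned to the source text
    if e.char_interval is None:
        continue
    start = e.char_interval.start_pos
    end   = e.char_interval.end_pos
    span  = news_text[start:end]  # end_pos is exclusive, so a plain slice recovers the span
    print(f"chars {start:4d}-{end:4d}  [{e.extraction_class}]  '{span}'")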

chars    1-  37  [person]  'Francisco Avelino Vieira de Carvalho'
chars   44-  59  [date]  '4 December 1970'
chars   66-  89  [title]  'Cape Verdean politician'
chars   98- 104  [title]  'leader'
chars  112- 160  [institution]  'African Party for the Independence of Cape Verde'
chars  150- 160  [location]  'Cape Verde'
chars  177- 182  [title]  'mayor'
chars  206- 211  [location]  'Praia'
chars  218- 222  [date]  '2020'
chars  231- 235  [date]  '2026'
chars  260- 268  [person]  'Carvalho'
chars  303- 320  [institution]  'National Assembly'
chars  337- 361  [title]  'prime minister–designate'
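
The offsets printed above act as half-open slice bounds (start inclusive, end exclusive), so spans can be compared numerically. Below is a minimal sketch of how one might detect nested spans, such as 'Cape Verde' inside the party name. It uses stand-in tuples copied by hand from the output above instead of a live `ner_result`, and the helper name `find_nested` is hypothetical, not part of LangExtract.

```python
# Stand-in spans copied from the printed output: (start, end, class, text).
spans = [
    (98, 104, "title", "leader"),
    (112, 160, "institution", "African Party for the Independence of Cape Verde"),
    (150, 160, "location", "Cape Verde"),
    (177, 182, "title", "mayor"),
]

def find_nested(spans):
    """Return (inner_text, outer_text) pairs where one span lies inside another."""
    nested = []
    for inner in spans:
        for outer in spans:
            if inner is outer:
                continue
            # Half-open intervals: inner is nested when outer fully covers it.
            if outer[0] <= inner[0] and inner[1] <= outer[1]:
                nested.append((inner[3], outer[3]))
    return nested

nested_pairs = find_nested(spans)
for inner_text, outer_text in nested_pairs:
    print(f"'{inner_text}' is nested inside '{outer_text}'")
```

With live results, the same tuples could be built from `e.char_interval.start_pos`, `e.char_interval.end_pos`, `e.extraction_class`, and `e.extraction_text`, skipping extractions whose `char_interval` is None.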

# Survey question: "How do you feel about the changes to public transport in your city?"
survey_responses = """
Response 1: The new bus routes are so much more convenient! I've actually started
leaving the car at home.

Response 2: It's a disaster. They cut the night service and now I can't get home
after late shifts.

Response 3: Mixed feelings honestly. The app is great but the actual frequency
hasn't improved at all.

Response 4: I suppose it's fine. Nothing's really changed for me one way or another.
"""

survey_prompt = textwrap.dedent("""
    Thematically code open-ended survey responses about public transport.
    Extract each distinct theme raised by a respondent.
    Use the exact words from the response. Do not paraphrase.
    For each theme, record:
      sentiment: positive, negative, mixed, or neutral
      intensity: strong or weak
""")

survey_examples = [
    lx.data.ExampleData(
        text="Response A: The new cycle lanes are brilliant, though the parking situation has become a nightmare.",
        extractions=[
            lx.data.Extraction(
                extraction_class="infrastructure",
                extraction_text="new cycle lanes are brilliant",
                attributes={"sentiment": "positive", "intensity": "strong"}
            ),
            lx.data.Extraction(
                extraction_class="parking",
                extraction_text="parking situation has become a nightmare",
                attributes={"sentiment": "negative", "intensity": "strong"}
            ),
        ]
    )
]

survey_result = lx.extract(
    text_or_documents=survey_responses,
    prompt_description=survey_prompt,
    examples=survey_examples,
    model_id="gemini-2.5-flash",
)

print(f"\nCoded {len(survey_result.extractions)} theme mentions\n")
for e in survey_result.extractions:
    attrs = e.attributes or {}
    print(f"  [{e.extraction_class:15s}]  {attrs.get('sentiment','?'):8s} / {attrs.get('intensity','?'):6s}  '{e.extraction_text}'")

LangExtract: model=gemini-2.5-flash, current=492 chars, processed=0 chars:  [00:13]

Coded 5 theme mentions

  [infrastructure ]  positive / strong  'The new bus routes are so much more convenient!'
  [infrastructure ]  negative / strong  'They cut the night service and now I can't get home after late shifts.'
  [infrastructure ]  positive / strong  'The app is great'
  [infrastructure ]  negative / strong  'the actual frequency hasn't improved at all'
  [infrastructure ]  neutral  / weak    'it's fine. Nothing's really changed for me one way or another.'

from collections import Counter

theme_counts     = Counter(e.extraction_class for e in survey_result.extractions)
# Default to "?" so a missing attribute cannot break the string formatting below
sentiment_counts = Counter((e.attributes or {}).get("sentiment", "?") for e in survey_result.extractions)
negative_themes  = [e for e in survey_result.extractions
                    if (e.attributes or {}).get("sentiment") == "negative"]

print("Theme frequency:")
for theme, n in theme_counts.most_common():
    print(f"  {theme:18s}  {n}")

print("\nOverall sentiment breakdown:")
total = sum(sentiment_counts.values())
for sentiment, n in sentiment_counts.most_common():
    print(f"  {sentiment:8s}  {n}  ({n/total*100:.0f}%)")

print("\nNegative mentions (flagged for follow-up):")
for e in negative_themes:
    intensity = (e.attributes or {}).get("intensity", "?")
    print(f"  [{e.extraction_class}] ({intensity})  '{e.extraction_text}'")

Theme frequency:
  infrastructure      5

Overall sentiment breakdown:
  positive  2  (40%)
  negative  2  (40%)
  neutral   1  (20%)

Negative mentions (flagged for follow-up):
  [infrastructure] (strong)  'They cut the night service and now I can't get home after late shifts'
  [infrastructure] (strong)  'the actual frequency hasn't improved at all'
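
The prompt records both sentiment and intensity, but the summary above counts sentiment only. Here is a minimal sketch of counting the two attributes jointly. The records are stand-ins copied by hand from the coded output above, not a live `survey_result`.

```python
from collections import Counter

# Stand-in records copied from the coded output: (sentiment, intensity).
coded = [
    ("positive", "strong"),
    ("negative", "strong"),
    ("positive", "strong"),
    ("negative", "strong"),
    ("neutral", "weak"),
]

# Counter accepts tuples as keys, so each (sentiment, intensity) pair is one cell.
pair_counts = Counter(coded)
for (sentiment, intensity), n in pair_counts.most_common():
    print(f"  {sentiment:8s} / {intensity:6s}  {n}")
```

With live results, the pairs could be built as `((e.attributes or {}).get("sentiment", "?"), (e.attributes or {}).get("intensity", "?"))` for each `e` in `survey_result.extractions`.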

interview_text = """
I first got involved through Maria Chen, who was my professor at City College.
She introduced me to the local tenant rights group, and that's where I met David Okafor,
who ran the outreach program. David had been organizing in this neighborhood for fifteen
years; he knew everyone. It was through David that I was introduced to Councilwoman
Patricia Reyes, who ended up being a crucial ally when we pushed the housing bill.
Maria and Patricia had known each other since the early nineties through the teachers' union.
There was also a guy named Greg Holloway. He was officially neutral, a city planner,
but he would quietly pass us zoning documents. I always felt like he was sympathetic
but couldn't be seen taking sides.
"""

rel_prompt = textwrap.dedent("""
    Extract facts about people mentioned in this oral history interview.
    For each fact, record the mentioned person's full name in the 'person_group'
    attribute so all facts about the same person can be grouped together.
    Extraction classes to use:
      person        — the person's name
      role          — their occupation, title, or organizational affiliation
      relationship  — how they are connected to the narrator or to another person
      tie_strength  — characterise the tie as: ally, neutral, or adversary
    Use exact text from the interview. Do not paraphrase.
""")

rel_examples = [
    lx.data.ExampleData(
        text="James introduced me to Rosa Park, a union organizer who had mentored him back in Detroit. Rosa was skeptical of our approach at first.",
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Rosa Park",
                attributes={"person_group": "Rosa Park"}
            ),
            lx.data.Extraction(
                extraction_class="role",
                extraction_text="union organizer",
                attributes={"person_group": "Rosa Park"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="mentored him back in Detroit",
                attributes={"person_group": "Rosa Park"}
            ),
            lx.data.Extraction(
                extraction_class="tie_strength",
                extraction_text="skeptical of our approach",
                attributes={"person_group": "Rosa Park", "tie_strength": "neutral"}
            ),
        ]
    )
]

rel_result = lx.extract(
    text_or_documents=interview_text,
    prompt_description=rel_prompt,
    examples=rel_examples,
    model_id="gemini-2.5-flash",
)

# Group by person
from collections import defaultdict

person_groups = defaultdict(list)
for e in rel_result.extractions:
    group = (e.attributes or {}).get("person_group", "unknown")
    person_groups[group].append(e)

for person, facts in person_groups.items():
    print(f"\n{person}:")
    for f in facts:
        print(f"  [{f.extraction_class:14s}]  '{f.extraction_text}'")

LangExtract: model=gemini-2.5-flash, current=724 chars, processed=0 chars:  [00:06]

Maria Chen:
  [person        ]  'Maria Chen'
  [role          ]  'professor at City College'
  [relationship  ]  'my professor at City College'
  [relationship  ]  'introduced me to the local tenant rights group'
  [relationship  ]  'had known each other since the early nineties through the teachers' union'

David Okafor:
  [person        ]  'David Okafor'
  [role          ]  'ran the outreach program'
  [relationship  ]  'met David Okafor'
  [relationship  ]  'had been organizing in this neighborhood for fifteen years'
  [relationship  ]  'knew everyone'
  [relationship  ]  'introduced to Councilwoman Patricia Reyes'

Councilwoman Patricia Reyes:
  [person        ]  'Councilwoman Patricia Reyes'
  [role          ]  'Councilwoman'
  [tie_strength  ]  'crucial ally'
  [relationship  ]  'had known each other since the early nineties through the teachers' union'

Greg Holloway:
  [person        ]  'Greg Holloway'
  [role          ]  'city planner'
  [tie_strength  ]  'officially neutral'
  [tie_strength  ]  'sympathetic but couldn't be seen taking sides'
  [relationship  ]  'would quietly pass us zoning documents'
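
Grouping relies on the model writing the same `person_group` string every time. In the output above one key carries a title ('Councilwoman Patricia Reyes'), so a variant without the title would land in a separate group. Below is a minimal sketch of one way to normalize keys before grouping. The `TITLES` tuple and the `normalize_person` helper are hypothetical, and the second fact uses an invented key variant purely for illustration.

```python
from collections import defaultdict

# Hypothetical list of honorifics to strip; extend it for your own corpus.
TITLES = ("Councilwoman", "Councilman", "Professor", "Dr.", "Mr.", "Ms.")

def normalize_person(name):
    """Drop a leading honorific and surrounding whitespace from a name."""
    name = name.strip()
    for title in TITLES:
        if name.startswith(title + " "):
            return name[len(title) + 1:].strip()
    return name

# Stand-in (person_group, class, text) facts. The texts come from the output above;
# the bare "Patricia Reyes" key is an assumed variant, not something the model returned.
facts = [
    ("Councilwoman Patricia Reyes", "role", "Councilwoman"),
    ("Patricia Reyes", "tie_strength", "crucial ally"),
    ("Greg Holloway", "role", "city planner"),
]

merged = defaultdict(list)
for group, cls, text in facts:
    merged[normalize_person(group)].append((cls, text))

for person, items in merged.items():
    print(person, items)
```

Prefix stripping is deliberately simple; it does not resolve first-name-only mentions such as 'David' or 'Maria', which would need a separate lookup.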

# Save to JSONL
lx.io.save_annotated_documents(
    [ner_result, survey_result, rel_result],
    output_name="tutorial_extractions.jsonl",
    output_dir=".",
)
print("Saved tutorial_extractions.jsonl")

LangExtract: Saving to tutorial_extractions.jsonl: 3 docs [00:00, 1667.94 docs/s]

✓ Saved 3 documents to tutorial_extractions.jsonl
Saved tutorial_extractions.jsonl

html_content = lx.visualize("tutorial_extractions.jsonl")

html_str = html_content.data if hasattr(html_content, "data") else html_content
with open("tutorial_visualization.html", "w", encoding="utf-8") as f:
    f.write(html_str)

print("Saved tutorial_visualization.html — open it in a browser to explore results")

# Render inline in the notebook (works in Jupyter)
from IPython.display import HTML
HTML(html_str)

LangExtract: Loading tutorial_extractions.jsonl: 100%|██████████| 12.2k/12.2k [00:00<00:00, 12.9MB/s]

✓ Loaded 3 documents from tutorial_extractions.jsonl
Saved tutorial_visualization.html — open it in a browser to explore results

# Visualize the survey result inline
lx.io.save_annotated_documents([survey_result], output_name="survey_extractions.jsonl", output_dir=".")
survey_html = lx.visualize("survey_extractions.jsonl")
survey_html_str = survey_html.data if hasattr(survey_html, "data") else survey_html
HTML(survey_html_str)

LangExtract: Saving to survey_extractions.jsonl: 1 docs [00:00, 1761.57 docs/s]

✓ Saved 1 documents to survey_extractions.jsonl

LangExtract: Loading survey_extractions.jsonl: 100%|██████████| 2.19k/2.19k [00:00<00:00, 8.74MB/s]

✓ Loaded 1 documents from survey_extractions.jsonl

# Visualize the relationship result inline
# (note: this reuses survey_extractions.jsonl and overwrites the survey export written above)
lx.io.save_annotated_documents([rel_result], output_name="survey_extractions.jsonl", output_dir=".")
rel_html = lx.visualize("survey_extractions.jsonl")
rel_html_str = rel_html.data if hasattr(rel_html, "data") else rel_html
HTML(rel_html_str)

LangExtract: Saving to survey_extractions.jsonl: 1 docs [00:00, 846.48 docs/s]

✓ Saved 1 documents to survey_extractions.jsonl

LangExtract: Loading survey_extractions.jsonl: 100%|██████████| 6.67k/6.67k [00:00<00:00, 16.6MB/s]

✓ Loaded 1 documents from survey_extractions.jsonl

# Load the data
from google.colab import drive
import json

drive.mount("/content/drive", force_remount=False)

CORPUS_PATH = "/content/drive/My Drive/Teaching/langextract_tutorial/debate_corpus.json"
with open(CORPUS_PATH, encoding="utf-8") as f:
    corpus = json.load(f)

print(f"Loaded {len(corpus)} debates:")
for i, d in enumerate(corpus):
    print(f"  [{i}] {d['title']} — {len(d['turns'])} turns")

# Build an excerpt from the October 22, 2020 presidential debate (TRUMP vs BIDEN)
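# Select by title instead of position so a reordered corpus file still picks the right debate
debate = next(d for d in corpus if d["title"].startswith("October 22, 2020"))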
candidate_speakers = {"TRUMP", "BIDEN"}
candidate_turns = [t for t in debate["turns"] if t["speaker"] in candidate_speakers][:4]

debate_text = "\n\n".join(f"{t['speaker']}: {t['text']}" for t in candidate_turns)

print(f"\nUsing: {debate['title']}")
print(f"Excerpt ({len(candidate_turns)} candidate turns):\n")
print(debate_text[:500] + "...")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loaded 3 debates:
  [0] October 07, 2020 Vice Presidential Debate Transcript — 246 turns
  [1] October 22, 2020 Debate Transcript — 354 turns
  [2] September 29, 2020 Debate Transcript — 858 turns

Using: October 22, 2020 Debate Transcript
Excerpt (4 candidate turns):

TRUMP: So, as you know, 2.2 million people, modeled out, were expected to die. We closed up the greatest economy in the world in order to fight this horrible disease that came from China. It’s a worldwide pandemic. It’s all over the world. You see the spikes in Europe and many other places right now. If you notice, the mortality rate is down, 85%. The excess mortality rate is way down, and much lower than almost any other country. And we’re fighting it and we’re fighting it hard. There is a spik...

# Write prompt, give few-shot examples, and run the model
politics_prompt = textwrap.dedent("""
    Analyse a political debate transcript and extract two types of content:

    1. policy_position — a stated stance on a policy issue.
       Attributes:
         actor: the speaker's last name (TRUMP or BIDEN)
         issue_area: the policy domain (economy, healthcare, foreign_policy, environment, or jobs)
         stance: support, oppose, or ambiguous

    2. rhetorical_move — a persuasive strategy used by a speaker.
       Attributes:
         actor: the speaker's last name
         strategy: one of appeal_to_values, appeal_to_authority,
                   statistical_claim, fear_appeal, or opponent_attack

    Use the exact text from the transcript. Do not paraphrase.
""")

politics_examples = [
    lx.data.ExampleData(
        text=(
            "Candidate Rivera: Experts agree that tax cuts spur growth, "
            "but Mayor Scott's reckless spending will bankrupt the city. "
            "Hard-working residents deserve better."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="rhetorical_move",
                extraction_text="Experts agree that tax cuts spur growth",
                attributes={"actor": "Rivera", "strategy": "appeal_to_authority"}
            ),
            lx.data.Extraction(
                extraction_class="rhetorical_move",
                extraction_text="Mayor Scott's reckless spending will bankrupt the city",
                attributes={"actor": "Rivera", "strategy": "opponent_attack"}
            ),
            lx.data.Extraction(
                extraction_class="policy_position",
                extraction_text="tax cuts spur growth",
                attributes={"actor": "Rivera", "issue_area": "economy", "stance": "support"}
            ),
            lx.data.Extraction(
                extraction_class="rhetorical_move",
                extraction_text="Hard-working residents deserve better",
                attributes={"actor": "Rivera", "strategy": "appeal_to_values"}
            ),
        ]
    )
]

politics_result = lx.extract(
    text_or_documents=debate_text,
    prompt_description=politics_prompt,
    examples=politics_examples,
    model_id="gemini-2.5-flash",
)

WARNING:absl:Prompt alignment: non-exact match: [example#0] class='policy_position' status=AlignmentStatus.MATCH_FUZZY text='tax cuts spur growth' char_span=(37, 57)
LangExtract: model=gemini-2.5-flash, current=4,030 chars, processed=0 chars:  [00:14]

# Print results
print(f"Found {len(politics_result.extractions)} extractions\n")
for e in politics_result.extractions:
    attrs = e.attributes or {}
    attr_str = ", ".join(f"{k}={v}" for k, v in attrs.items())
    print(f"  [{e.extraction_class:18s}]  '{e.extraction_text[:60]}'")
    print(f"  {'':20s}   {attr_str}")

Found 24 extractions

  [rhetorical_move   ]  '2.2 million people, modeled out, were expected to die'
                         actor=TRUMP, strategy=statistical_claim
  [rhetorical_move   ]  'horrible disease that came from China'
                         actor=TRUMP, strategy=opponent_attack
  [rhetorical_move   ]  'the mortality rate is down, 85%'
                         actor=TRUMP, strategy=statistical_claim
  [rhetorical_move   ]  'The excess mortality rate is way down, and much lower than a'
                         actor=TRUMP, strategy=statistical_claim
  [policy_position   ]  'We have a vaccine that’s coming, it’s ready. It’s going to b'
                         actor=TRUMP, issue_area=healthcare, stance=support
  [policy_position   ]  'Operation Warp Speed, which is the military, is going to dis'
                         actor=TRUMP, issue_area=healthcare, stance=support
  [rhetorical_move   ]  'I can tell you from personal experience that I was in the ho'
                         actor=UNKNOWN, strategy=appeal_to_authority
  [rhetorical_move   ]  'More and more people are getting better.'
                         actor=UNKNOWN, strategy=statistical_claim
  [rhetorical_move   ]  'I’ve been congratulated by the heads of many countries'
                         actor=UNKNOWN, strategy=appeal_to_authority
  [policy_position   ]  'We’re now making ventilators.'
                         actor=UNKNOWN, issue_area=healthcare, stance=support
  [rhetorical_move   ]  '220,000 Americans dead'
                         actor=BIDEN, strategy=statistical_claim
  [rhetorical_move   ]  'Anyone who’s responsible for not taking control — in fact, n'
                         actor=BIDEN, strategy=opponent_attack
  [rhetorical_move   ]  'We’re in a situation where there are thousands of deaths a d'
                         actor=BIDEN, strategy=statistical_claim
  [rhetorical_move   ]  'as the New England Medical Journal said'
                         actor=BIDEN, strategy=appeal_to_authority
  [rhetorical_move   ]  'The expectation is we’ll have another 200,000 Americans dead'
                         actor=BIDEN, strategy=fear_appeal
  [policy_position   ]  'If we just wore these masks... we could save 100,000 lives'
                         actor=BIDEN, issue_area=healthcare, stance=support
  [rhetorical_move   ]  'the President’s own advisors have told him'
                         actor=BIDEN, strategy=appeal_to_authority
  [rhetorical_move   ]  'the President, thus far, still has no plan. No comprehensive'
                         actor=BIDEN, strategy=opponent_attack
  [policy_position   ]  'make sure we have everyone encouraged to wear a mask, all th'
                         actor=BIDEN, issue_area=healthcare, stance=support
  [policy_position   ]  'move in the direction of rapid testing, investing in rapid t'
                         actor=BIDEN, issue_area=healthcare, stance=support
  [policy_position   ]  'set up national standards as to how to open up schools and o'
                         actor=BIDEN, issue_area=healthcare, stance=support
  [rhetorical_move   ]  'the New England Medical Journal — one of the serious, most s'
                         actor=BIDEN, strategy=appeal_to_authority
  [rhetorical_move   ]  'the way this President has responded to this crisis has been'
                         actor=BIDEN, strategy=opponent_attack
  [rhetorical_move   ]  'I will take care of this, I will end this, I will make sure '
                         actor=BIDEN, strategy=appeal_to_values
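
Several extractions above come back with actor=UNKNOWN. Because the excerpt is built by joining "SPEAKER: text" turns, the speaker can in principle be recovered from an extraction's character offset. Here is a minimal sketch of that lookup using `bisect`. The two turns are shortened stand-ins for `debate["turns"]`, and the helper name `speaker_at` is hypothetical.

```python
from bisect import bisect_right

# Stand-in turns; the real ones come from debate["turns"].
turns = [
    {"speaker": "TRUMP", "text": "We're now making ventilators."},
    {"speaker": "BIDEN", "text": "220,000 Americans dead."},
]

# Rebuild the text the same way the excerpt is built, recording where each turn starts.
SEP = "\n\n"
parts, starts, pos = [], [], 0
for t in turns:
    part = f"{t['speaker']}: {t['text']}"
    starts.append(pos)
    parts.append(part)
    pos += len(part) + len(SEP)
text = SEP.join(parts)

def speaker_at(char_pos):
    """Return the speaker of the turn that contains character offset char_pos."""
    # bisect_right finds the first turn starting after char_pos; step back one turn.
    idx = bisect_right(starts, char_pos) - 1
    return turns[idx]["speaker"]

offset = text.index("220,000")
print(speaker_at(offset))
```

With live results, `char_pos` would be `e.char_interval.start_pos` for each extraction whose actor is UNKNOWN, assuming the extraction was aligned (its `char_interval` is not None) and `debate_text` was passed to the model unchanged.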

from collections import defaultdict, Counter

# Split by actor
by_actor = defaultdict(lambda: defaultdict(list))
for e in politics_result.extractions:
    actor = (e.attributes or {}).get("actor", "unknown")
    by_actor[actor][e.extraction_class].append(e)

for actor, classes in sorted(by_actor.items()):
    print(f"\n{'='*55}")
    print(f"  {actor}")
    print(f"{'='*55}")

    positions = classes.get("policy_position", [])
    if positions:
        print(f"\n  Policy positions ({len(positions)}):")
        for p in positions:
            attrs = p.attributes or {}
            print(f"    [{attrs.get('issue_area','?'):12s}] {attrs.get('stance','?'):8s}  '{p.extraction_text[:55]}'")

    moves = classes.get("rhetorical_move", [])
    if moves:
        strategy_counts = Counter((m.attributes or {}).get("strategy", "?") for m in moves)
        print(f"\n  Rhetorical strategies ({len(moves)} total):")
        for strategy, n in strategy_counts.most_common():
            examples = [m.extraction_text[:45] for m in moves
                        if (m.attributes or {}).get("strategy") == strategy]
            print(f"    {strategy:22s} x{n}  e.g. '{examples[0]}'")

=======================================================
  BIDEN
=======================================================

  Policy positions (4):
    [healthcare  ] support   'If we just wore these masks... we could save 100,000 li'
    [healthcare  ] support   'make sure we have everyone encouraged to wear a mask, a'
    [healthcare  ] support   'move in the direction of rapid testing, investing in ra'
    [healthcare  ] support   'set up national standards as to how to open up schools '

  Rhetorical strategies (10 total):
    opponent_attack        x3  e.g. 'Anyone who’s responsible for not taking contr'
    appeal_to_authority    x3  e.g. 'as the New England Medical Journal said'
    statistical_claim      x2  e.g. '220,000 Americans dead'
    fear_appeal            x1  e.g. 'The expectation is we’ll have another 200,000'
    appeal_to_values       x1  e.g. 'I will take care of this, I will end this, I '

=======================================================
  TRUMP
=======================================================

  Policy positions (6):
    [economy     ] support   'We closed up the greatest economy in the world in order'
    [healthcare  ] support   'We have a vaccine that’s coming, it’s ready. It’s going'
    [healthcare  ] support   'Operation Warp Speed, which is the military, is going t'
    [healthcare  ] support   'We’re now making ventilators'
    [healthcare  ] support   'Johnson and Johnson is doing very well. Moderna is doin'
    [healthcare  ] support   'we’re working on very closely with other countries, in '

  Rhetorical strategies (8 total):
    appeal_to_authority    x3  e.g. 'the military, is going to distribute the vacc'
    statistical_claim      x2  e.g. '2.2 million people, modeled out, were expecte'
    appeal_to_values       x2  e.g. 'greatest economy in the world'
    opponent_attack        x1  e.g. 'horrible disease that came from China'
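
The two speakers have different totals (10 moves versus 8), so raw counts are awkward to compare directly. The following is a minimal sketch that turns the counts printed above into per-speaker shares. The numbers are copied by hand from that output rather than recomputed from the model, and the helper name `strategy_shares` is an illustrative assumption, not part of LangExtract.

```python
from collections import Counter

# Counts copied by hand from the printed output above (not recomputed from the model).
counts = {
    "BIDEN": Counter(opponent_attack=3, appeal_to_authority=3,
                     statistical_claim=2, fear_appeal=1, appeal_to_values=1),
    "TRUMP": Counter(appeal_to_authority=3, statistical_claim=2,
                     appeal_to_values=2, opponent_attack=1),
}

def strategy_shares(counter):
    """Return each strategy's share of the speaker's total moves (0.0 to 1.0)."""
    total = sum(counter.values())
    return {name: n / total for name, n in counter.items()} if total else {}

shares = {speaker: strategy_shares(c) for speaker, c in counts.items()}

# Union of strategies, so a strategy one speaker never used shows up as 0%.
all_strategies = sorted(set().union(*counts.values()))

print(f"{'strategy':22s} {'BIDEN':>7s} {'TRUMP':>7s}")
for name in all_strategies:
    b = shares["BIDEN"].get(name, 0.0)
    t = shares["TRUMP"].get(name, 0.0)
    print(f"{name:22s} {b:7.0%} {t:7.0%}")
```

With samples this small, the shares are best read as a description of this one excerpt, not as evidence about either speaker's general style.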

# Use this for the exercise

sample_text = "Barack Obama served as the 44th President of the United States from 2009 to 2017."

simple_prompt = "Extract named entities (person, role, organization, date, location)."
simple_examples = [
    lx.data.ExampleData(
        text="Nelson Mandela was president of South Africa from 1994 to 1999.",
        extractions=[
            lx.data.Extraction(extraction_class="person",   extraction_text="Nelson Mandela"),
            lx.data.Extraction(extraction_class="role",     extraction_text="president"),
            lx.data.Extraction(extraction_class="location", extraction_text="South Africa"),
            lx.data.Extraction(extraction_class="date",     extraction_text="1994 to 1999"),
        ]
    )
]

# --- Google Gemini ---
result_gemini = lx.extract(
    text_or_documents=sample_text,
    prompt_description=simple_prompt,
    examples=simple_examples,
    model_id="gemini-2.5-flash",      # Best schema-constraint support
)

# --- OpenAI (requires langextract[openai] and OPENAI_API_KEY) ---
# result_openai = lx.extract(
#     text_or_documents=sample_text,
#     prompt_description=simple_prompt,
#     examples=simple_examples,
#     model_id="gpt-4o-mini",
#     fence_output=True,           # Required for OpenAI
#     use_schema_constraints=False,
# )

# --- Local Ollama (requires Ollama running on localhost) ---
# result_ollama = lx.extract(
#     text_or_documents=sample_text,
#     prompt_description=simple_prompt,
#     examples=simple_examples,
#     model_id="gemma2:2b",
#     model_url="http://localhost:11434",
#     fence_output=False,
#     use_schema_constraints=False,
# )

# --- Anthropic Claude (not via LangExtract; uses the Anthropic SDK directly) ---
# LangExtract does not natively support Anthropic, but Claude's tool use feature
# offers a comparable route to structured output: the tool's input_schema describes
# the JSON shape the model should return. You lose LangExtract's source grounding
# and visualization.
#
# %pip install anthropic
# import os
# import anthropic
# from google.colab import userdata   # Colab only; set the key another way elsewhere
# os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
#
# client = anthropic.Anthropic()
# response = client.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=1024,
#     tools=[{
#         "name": "extract_entities",
#         "description": simple_prompt,
#         "input_schema": {
#             "type": "object",
#             "properties": {
#                 "extractions": {
#                     "type": "array",
#                     "items": {
#                         "type": "object",
#                         "properties": {
#                             "extraction_class": {"type": "string"},
#                             "extraction_text":  {"type": "string"},
#                         },
#                         "required": ["extraction_class", "extraction_text"],
#                     }
#                 }
#             },
#             "required": ["extractions"],
#         }
#     }],
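#     tool_choice={"type": "tool", "name": "extract_entities"},  # force the tool call
#     messages=[{"role": "user", "content": sample_text}],
# )
# # Pick the tool_use block explicitly instead of assuming it is content[0].
# tool_call = next(b for b in response.content if b.type == "tool_use")
# result_claude = tool_call.input["extractions"]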
# print("Claude results:")
# for e in result_claude:
#     print(f"  [{e['extraction_class']:12s}]  '{e['extraction_text']}'")

print("Gemini results:")
for e in result_gemini.extractions:
    print(f"  [{e.extraction_class:12s}]  '{e.extraction_text}'")

LangExtract: model=gemini-2.5-flash, current=81 chars, processed=0 chars:  [00:02]

Gemini results:
  [person      ]  'Barack Obama'
  [role        ]  '44th President'
  [location    ]  'United States'
  [date        ]  '2009 to 2017'
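
The Anthropic path above returns plain dicts without character offsets. As a rough substitute for source grounding, one could look each extracted string up in the input text. The sketch below assumes extractions are dicts with `extraction_class` and `extraction_text` keys (the shape of `result_claude`). It uses exact substring matching only, which is not how LangExtract aligns text, and `locate_spans` is a hypothetical helper name.

```python
def locate_spans(source_text, extractions):
    """Attach (start, end) character offsets to extractions given as plain dicts.

    Uses exact substring search; text the model paraphrased gets None offsets.
    Repeated mentions are matched left to right, so each hit is used at most once.
    """
    located = []
    next_search_from = {}  # extraction_text -> index to resume searching from
    for item in extractions:
        needle = item["extraction_text"]
        start = source_text.find(needle, next_search_from.get(needle, 0))
        if start == -1:
            located.append({**item, "start": None, "end": None})
            continue
        end = start + len(needle)
        next_search_from[needle] = end
        located.append({**item, "start": start, "end": end})
    return located


sample_text = "Barack Obama served as the 44th President of the United States from 2009 to 2017."

# Same shape as result_claude; the values are copied from the Gemini output above.
extractions = [
    {"extraction_class": "person",   "extraction_text": "Barack Obama"},
    {"extraction_class": "role",     "extraction_text": "44th President"},
    {"extraction_class": "location", "extraction_text": "United States"},
    {"extraction_class": "date",     "extraction_text": "2009 to 2017"},
]

located = locate_spans(sample_text, extractions)
for e in located:
    print(f"  [{e['extraction_class']:12s}]  {e['start']}-{e['end']}  '{e['extraction_text']}'")
```

An extraction with `None` offsets is a useful warning sign: the model returned text that does not appear verbatim in the input.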

| Argument | Type | Purpose |
|---|---|---|
| `text_or_documents` | `str` or list | The text you want to analyze |
| `prompt_description` | `str` | Natural-language instructions for what to extract |
| `examples` | `list[lx.data.ExampleData]` | Few-shot examples that teach the schema |

| Task | Extraction classes | Useful attributes |
|---|---|---|
| Manifesto coding | `commitment`, `policy_area` | `direction` (positive/negative), `domain` |
| Legislative speech analysis | `claim`, `evidence`, `appeal` | `party`, `topic` |
| Social media political discourse | `stance`, `target` | `sentiment`, `platform` |
| Propaganda detection | `technique` | `type` (e.g. scapegoating, glittering_generality) |
| Cross-national media comparison | `frame` | `country`, `outlet`, `valence` |
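
Each row of the table above can be turned into a starting prompt. The sketch below shows one possible way to do that with the standard library only. `TaskSpec` and `build_prompt` are hypothetical names introduced for illustration, and the instruction wording is an assumption to adapt, not text prescribed by LangExtract.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical container for one row of the task table."""
    task: str
    classes: list[str]
    attributes: dict[str, str] = field(default_factory=dict)  # attribute name -> hint


def build_prompt(spec: TaskSpec) -> str:
    """Render a plain-text instruction string from a task specification."""
    lines = [
        f"Task: {spec.task}.",
        "Extract the following classes: " + ", ".join(spec.classes) + ".",
        "Use exact text from the input for every extraction; do not paraphrase.",
    ]
    if spec.attributes:
        lines.append("For each extraction, provide these attributes:")
        for name, hint in spec.attributes.items():
            lines.append(f"- {name}: {hint}" if hint else f"- {name}")
    return "\n".join(lines)


# The "Propaganda detection" row from the table.
propaganda = TaskSpec(
    task="Propaganda detection",
    classes=["technique"],
    attributes={"type": "e.g. scapegoating, glittering_generality"},
)

prompt = build_prompt(propaganda)
print(prompt)
```

The resulting string could be passed as `prompt_description`. It does not replace the few-shot `examples`, which still have to be written by hand for each task.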

LangExtract Tutorial

Table of Contents

1. Installation & Setup

API Key Configuration

2. Core Concepts

3. Basic Extraction — Named Entity Recognition

Inspecting character positions

4. Extraction with Attributes

Summarising the coded responses

5. Relationship Extraction

6. Saving Results & Visualization

Generate Interactive Visualizations for Each Example

Political Content Analysis

Comparing candidates side by side

What else is possible?

10. Exercise

EXTRA: Multi-Provider Support