LangExtract Tutorial¶
LangExtract is an open-source Python library by Google for extracting structured information from unstructured text using large language models (LLMs). Key features:
- Source grounding: every extraction is mapped to its exact character location in the source text
- Structured outputs: enforces a consistent schema using few-shot examples
- Long document handling: automatically chunks and processes large texts in parallel
- Interactive visualization: generates self-contained HTML files to review extractions
- Multi-model support: works with Google Gemini, OpenAI, and local Ollama models
NOTE: API keys and cost: The examples in this tutorial call a live LLM, so you will get the best results by providing a paid API key. Google Gemini is recommended (it has the strongest schema-enforcement support in LangExtract). You can get a Gemini API key at Google AI Studio. Running every cell in this notebook from start to finish should cost less than one cent in API credits. Google AI Studio requires a minimum initial deposit of $10 to activate paid access; you will not come close to spending that on this tutorial, and any remaining balance carries over for future use.
Further reading: Google Developers Blog announcement · DataCamp tutorial · PyPI package
Note: This tutorial assumes basic familiarity with Python, including variables, lists, dictionaries, loops, and functions. It was developed with assistance from Claude Code.
Table of Contents¶
- Installation & Setup
- Core Concepts
- Basic Extraction — Named Entity Recognition
- Extraction with Attributes
- Relationship Extraction
- Saving Results & Visualization
- Multi-Provider Support (OpenAI, Ollama)
- Production Configuration
- Political Content Analysis
- Exercises
# Install core library (Gemini support included)
%pip install langextract
API Key Configuration¶
LangExtract reads your API key from the environment. Set whichever key matches your chosen model provider.
import os
from google.colab import userdata
os.environ["LANGEXTRACT_API_KEY"] = userdata.get('GEMINI_API_KEY')
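Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of one alternative, assuming you are willing to type the key interactively when the environment variable is not already set. `ensure_api_key` is a hypothetical helper written for this tutorial, not part of LangExtract.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="LANGEXTRACT_API_KEY", prompt_fn=getpass):
    """Return the key stored in env_var, prompting for it only if it is missing.

    Hypothetical helper for this tutorial; it is not a LangExtract function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value so it does not appear in notebook output
        key = prompt_fn(f"Paste the value for {env_var}: ").strip()
        if not key:
            raise ValueError(f"No value provided for {env_var}")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` once before the first `lx.extract()` call should be enough; later calls return the stored value without prompting again.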
2. Core Concepts¶
Every lx.extract() call needs three things (full API reference on GitHub):
| Argument | Type | Purpose |
|---|---|---|
| text_or_documents | str or list | The text you want to analyze |
| prompt_description | str | Natural-language instructions for what to extract |
| examples | list[lx.data.ExampleData] | Few-shot examples that teach the schema |
Results come back as an AnnotatedDocument with an .extractions list of Extraction objects.
import langextract as lx
import textwrap
print("LangExtract imported successfully")
LangExtract imported successfully
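To make the result shape concrete, here is a small sketch that flattens Extraction-like objects into plain dictionaries, which are easier to inspect or hand to a table library. It relies only on the fields used in this tutorial (`extraction_class`, `extraction_text`, `attributes`, `char_interval`). The `to_rows` helper and the `SimpleNamespace` stand-ins are assumptions made for illustration, so the block runs without calling a model.

```python
from types import SimpleNamespace


def to_rows(extractions):
    """Flatten Extraction-like objects into one plain dict per extraction."""
    rows = []
    for e in extractions:
        interval = getattr(e, "char_interval", None)
        row = {
            "class": e.extraction_class,
            "text": e.extraction_text,
            "start": interval.start_pos if interval is not None else None,
            "end": interval.end_pos if interval is not None else None,
        }
        # Attributes are optional; prefix them so they cannot clash with the fixed keys
        for name, value in (getattr(e, "attributes", None) or {}).items():
            row[f"attr_{name}"] = value
        rows.append(row)
    return rows


# Stand-in objects that mimic the fields of a real extraction (hypothetical data)
demo_extractions = [
    SimpleNamespace(
        extraction_class="person",
        extraction_text="Adam Roberts",
        attributes=None,
        char_interval=SimpleNamespace(start_pos=0, end_pos=12),
    ),
    SimpleNamespace(
        extraction_class="parking",
        extraction_text="parking situation has become a nightmare",
        attributes={"sentiment": "negative"},
        char_interval=None,
    ),
]

demo_rows = to_rows(demo_extractions)
for row in demo_rows:
    print(row)
```

With a real result, the same function could be called as `to_rows(ner_result.extractions)`.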
3. Basic Extraction — Named Entity Recognition¶
The simplest use case: extract named entities (people, places, organizations) from a passage. Example text is from this Wikipedia page: https://en.wikipedia.org/wiki/Francisco_Carvalho
# --- Sample text ---
news_text = """
Francisco Avelino Vieira de Carvalho (born 4 December 1970) is a Cape Verdean politician and the leader of the African Party for the Independence of Cape Verde.
He is also the mayor of the municipality of Praia since 2020. In the 2026 parliamentary election, Carvalho's party secured a majority in the National Assembly,
making him the prime minister–designate of the country.
"""
# --- Extraction prompt ---
ner_prompt = textwrap.dedent("""
Extract named entities from the text.
Classify each entity as one of: person, title, location, or date.
Use the exact text from the source. Do not paraphrase.
""")
# --- Few-shot examples ---
ner_examples = [
lx.data.ExampleData(
text="Adam Roberts is a PhD candidate in Political Science at the University of Rochester in Rochester, NY. He started his PhD on September 1, 2022.",
extractions=[
lx.data.Extraction(extraction_class="person", extraction_text="Adam Roberts"),
lx.data.Extraction(extraction_class="title", extraction_text="PhD candidate"),
lx.data.Extraction(extraction_class="location", extraction_text="Rochester, NY"),
lx.data.Extraction(extraction_class="date", extraction_text="September 1, 2022"),
]
)
]
# --- Run extraction ---
ner_result = lx.extract(
text_or_documents=news_text,
prompt_description=ner_prompt,
examples=ner_examples,
model_id="gemini-2.5-flash",
)
print(f"\nFound {len(ner_result.extractions)} entities\n")
for e in ner_result.extractions:
    print(f" [{e.extraction_class:12s}] '{e.extraction_text}'")
LangExtract: model=gemini-2.5-flash, current=376 chars, processed=0 chars: [00:03]
Found 12 entities
 [person ] 'Francisco Avelino Vieira de Carvalho'
 [date ] '4 December 1970'
 [title ] 'Cape Verdean politician'
 [title ] 'leader of the African Party for the Independence of Cape Verde'
 [location ] 'Cape Verde'
 [title ] 'mayor'
 [location ] 'Praia'
 [date ] '2020'
 [date ] '2026'
 [person ] 'Carvalho'
 [location ] 'National Assembly'
 [title ] 'prime minister–designate'
Inspecting character positions¶
Every extraction carries the exact character span where it was found (the "source grounding" feature), so each entity can be located precisely in the source text.
for e in ner_result.extractions:
    if e.char_interval is None:
        continue  # skip any extraction that carries no span
    start = e.char_interval.start_pos
    end = e.char_interval.end_pos
    span = news_text[start:end]
    print(f"chars {start:4d}-{end:4d} [{e.extraction_class}] '{span}'")
chars 1- 37 [person] 'Francisco Avelino Vieira de Carvalho'
chars 44- 59 [date] '4 December 1970'
chars 66- 89 [title] 'Cape Verdean politician'
chars 98- 104 [title] 'leader'
chars 112- 160 [institution] 'African Party for the Independence of Cape Verde'
chars 150- 160 [location] 'Cape Verde'
chars 177- 182 [title] 'mayor'
chars 206- 211 [location] 'Praia'
chars 218- 222 [date] '2020'
chars 231- 235 [date] '2026'
chars 260- 268 [person] 'Carvalho'
chars 303- 320 [institution] 'National Assembly'
chars 337- 361 [title] 'prime minister–designate'
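Since the spans index directly into the source string, a quick sanity check is to re-slice the source and compare the slice with the extraction text. The sketch below does that, assuming an extraction may occasionally come back without a span or with a span whose text differs from `extraction_text`. `check_grounding` is a hypothetical helper, and the stand-in objects exist only so the example runs without a model call.

```python
from types import SimpleNamespace


def check_grounding(source_text, extractions):
    """Split extractions into (exact, needs_review) by re-slicing the source text."""
    exact, needs_review = [], []
    for e in extractions:
        interval = getattr(e, "char_interval", None)
        if interval is None or interval.start_pos is None or interval.end_pos is None:
            needs_review.append(e)  # no usable span
            continue
        span = source_text[interval.start_pos:interval.end_pos]
        if span == e.extraction_text:
            exact.append(e)
        else:
            needs_review.append(e)  # span and text disagree
    return exact, needs_review


def _stand_in(snippet, start):
    """Build a hypothetical Extraction-like object for the demo."""
    interval = None
    if start is not None:
        interval = SimpleNamespace(start_pos=start, end_pos=start + len(snippet))
    return SimpleNamespace(extraction_text=snippet, char_interval=interval)


demo_text = "Adam Roberts is a PhD candidate in Political Science."
demo_extractions = [
    _stand_in("Adam Roberts", demo_text.index("Adam Roberts")),        # span matches
    _stand_in("doctoral candidate", demo_text.index("PhD candidate")),  # paraphrase, will not match
    _stand_in("Political Science", None),                               # no span at all
]

exact, needs_review = check_grounding(demo_text, demo_extractions)
print(f"{len(exact)} exact, {len(needs_review)} to review")
```

On a real result the call would be `check_grounding(news_text, ner_result.extractions)`; anything in the second list is worth reading by hand before it is counted.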
4. Extraction with Attributes¶
Extractions can carry arbitrary key-value attributes. This is useful when you need metadata alongside the entity text — e.g., sentiment, intensity, or category.
A natural social science application is thematic coding of open-ended survey responses. Qualitative researchers normally do this manually, working through responses and tagging each passage with a theme and notes. LangExtract can apply the same codebook at scale across hundreds of responses.
# Survey question: "How do you feel about the changes to public transport in your city?"
survey_responses = """
Response 1: The new bus routes are so much more convenient! I've actually started
leaving the car at home.
Response 2: It's a disaster. They cut the night service and now I can't get home
after late shifts.
Response 3: Mixed feelings honestly. The app is great but the actual frequency
hasn't improved at all.
Response 4: I suppose it's fine. Nothing's really changed for me one way or another.
"""
survey_prompt = textwrap.dedent("""
Thematically code open-ended survey responses about public transport.
Extract each distinct theme raised by a respondent.
Use the exact words from the response. Do not paraphrase.
For each theme, record:
sentiment: positive, negative, mixed, or neutral
intensity: strong or weak
""")
survey_examples = [
lx.data.ExampleData(
text="Response A: The new cycle lanes are brilliant, though the parking situation has become a nightmare.",
extractions=[
lx.data.Extraction(
extraction_class="infrastructure",
extraction_text="new cycle lanes are brilliant",
attributes={"sentiment": "positive", "intensity": "strong"}
),
lx.data.Extraction(
extraction_class="parking",
extraction_text="parking situation has become a nightmare",
attributes={"sentiment": "negative", "intensity": "strong"}
),
]
)
]
survey_result = lx.extract(
text_or_documents=survey_responses,
prompt_description=survey_prompt,
examples=survey_examples,
model_id="gemini-2.5-flash",
)
print(f"\nCoded {len(survey_result.extractions)} theme mentions\n")
for e in survey_result.extractions:
    attrs = e.attributes or {}
    print(f" [{e.extraction_class:15s}] {attrs.get('sentiment','?'):8s} / {attrs.get('intensity','?'):6s} '{e.extraction_text}'")
LangExtract: model=gemini-2.5-flash, current=492 chars, processed=0 chars: [00:13]
Coded 5 theme mentions
 [infrastructure ] positive / strong 'The new bus routes are so much more convenient!'
 [infrastructure ] negative / strong 'They cut the night service and now I can't get home after late shifts.'
 [infrastructure ] positive / strong 'The app is great'
 [infrastructure ] negative / strong 'the actual frequency hasn't improved at all'
 [infrastructure ] neutral / weak 'it's fine. Nothing's really changed for me one way or another.'
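The prompt restricts `sentiment` and `intensity` to a fixed set of values, but nothing forces the model to stay inside that set. A minimal sketch of a codebook check follows. The allowed values are copied from the prompt above, while `find_codebook_violations` and the demo objects (including the off-codebook value `medium`) are hypothetical.

```python
from types import SimpleNamespace

# Allowed values, copied from the survey prompt
CODEBOOK = {
    "sentiment": {"positive", "negative", "mixed", "neutral"},
    "intensity": {"strong", "weak"},
}


def find_codebook_violations(extractions, codebook=CODEBOOK):
    """Return (extraction_text, attribute, value) for each missing or off-codebook value."""
    violations = []
    for e in extractions:
        attrs = getattr(e, "attributes", None) or {}
        for name, allowed in codebook.items():
            value = attrs.get(name)
            if value not in allowed:
                violations.append((e.extraction_text, name, value))
    return violations


# Hypothetical stand-ins; the second one uses a value that is not in the codebook
demo_extractions = [
    SimpleNamespace(
        extraction_text="The app is great",
        attributes={"sentiment": "positive", "intensity": "strong"},
    ),
    SimpleNamespace(
        extraction_text="the actual frequency hasn't improved at all",
        attributes={"sentiment": "negative", "intensity": "medium"},
    ),
]

demo_violations = find_codebook_violations(demo_extractions)
for text, name, value in demo_violations:
    print(f"{name}={value!r} is outside the codebook: '{text}'")
```

Running `find_codebook_violations(survey_result.extractions)` before counting anything gives a short list of codes to correct or drop.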
Summarising the coded responses¶
Once every response is coded, standard Python gives a quick frequency table similar to one that would appear in a research paper's results section.
from collections import Counter
theme_counts = Counter(e.extraction_class for e in survey_result.extractions)
# Fall back to "?" so a missing sentiment still formats cleanly in the table below
sentiment_counts = Counter((e.attributes or {}).get("sentiment") or "?" for e in survey_result.extractions)
negative_themes = [e for e in survey_result.extractions
                   if (e.attributes or {}).get("sentiment") == "negative"]
print("Theme frequency:")
for theme, n in theme_counts.most_common():
    print(f" {theme:18s} {n}")
print("\nOverall sentiment breakdown:")
total = sum(sentiment_counts.values())
for sentiment, n in sentiment_counts.most_common():
    print(f" {sentiment:8s} {n} ({n/total*100:.0f}%)")
print("\nNegative mentions (flagged for follow-up):")
for e in negative_themes:
    intensity = (e.attributes or {}).get("intensity", "?")
    print(f" [{e.extraction_class}] ({intensity}) '{e.extraction_text}'")
Theme frequency:
 infrastructure 5
Overall sentiment breakdown:
 positive 2 (40%)
 negative 2 (40%)
 neutral 1 (20%)
Negative mentions (flagged for follow-up):
 [infrastructure] (strong) 'They cut the night service and now I can't get home after late shifts'
 [infrastructure] (strong) 'the actual frequency hasn't improved at all'
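The counts above are per mention, not per respondent: Response 3 contributes two mentions. Because every extraction carries a character offset, each mention can be traced back to the response it came from. The sketch below assumes the responses are marked with `Response N:` as in `survey_responses`. The helper names and stand-in objects are hypothetical.

```python
import re
from bisect import bisect_right
from types import SimpleNamespace


def response_starts(text, pattern=r"Response (\d+):"):
    """Return (offset, response_id) pairs for each response marker, in text order."""
    return [(m.start(), int(m.group(1))) for m in re.finditer(pattern, text)]


def response_for(extraction, starts):
    """Map an extraction to the response whose marker most recently precedes it."""
    interval = getattr(extraction, "char_interval", None)
    if interval is None or interval.start_pos is None:
        return None
    offsets = [offset for offset, _ in starts]
    idx = bisect_right(offsets, interval.start_pos) - 1
    return starts[idx][1] if idx >= 0 else None


demo_text = "Response 1: The app is great.\nResponse 2: It's a disaster."


def _stand_in(snippet):
    """Hypothetical Extraction-like object located via str.index."""
    start = demo_text.index(snippet)
    return SimpleNamespace(
        extraction_text=snippet,
        char_interval=SimpleNamespace(start_pos=start, end_pos=start + len(snippet)),
    )


demo_starts = response_starts(demo_text)
demo_ids = [response_for(_stand_in(s), demo_starts)
            for s in ("The app is great", "It's a disaster")]
print(demo_ids)
```

With the real data, `response_starts(survey_responses)` computed once and `response_for(e, starts)` applied to each extraction would let the frequency table be rebuilt per respondent instead of per mention.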
5. Relationship Extraction¶
Use a shared attribute (a group ID) to link related entities together. For example, all facts about the same person mentioned in an interview can be grouped under a common ID.
This is a common task in qualitative social science research. Oral histories and in-depth interviews often contain rich descriptions of the respondent's social network: mentors, collaborators, rivals, and gatekeepers. Extracting and grouping these relationship facts systematically is the first step toward ego network analysis: mapping who the respondent knows, through what channels, and in what capacity.
interview_text = """
I first got involved through Maria Chen, who was my professor at City College.
She introduced me to the local tenant rights group, and that's where I met David Okafor,
who ran the outreach program. David had been organizing in this neighborhood for fifteen
years; he knew everyone. It was through David that I was introduced to Councilwoman
Patricia Reyes, who ended up being a crucial ally when we pushed the housing bill.
Maria and Patricia had known each other since the early nineties through the teachers' union.
There was also a guy named Greg Holloway. He was officially neutral, a city planner,
but he would quietly pass us zoning documents. I always felt like he was sympathetic
but couldn't be seen taking sides.
"""
rel_prompt = textwrap.dedent("""
Extract facts about people mentioned in this oral history interview.
For each fact, record the mentioned person's full name in the 'person_group'
attribute so all facts about the same person can be grouped together.
Extraction classes to use:
person — the person's name
role — their occupation, title, or organizational affiliation
relationship — how they are connected to the narrator or to another person
tie_strength — characterise the tie as: ally, neutral, or adversary
Use exact text from the interview. Do not paraphrase.
""")
rel_examples = [
lx.data.ExampleData(
text="James introduced me to Rosa Park, a union organizer who had mentored him back in Detroit. Rosa was skeptical of our approach at first.",
extractions=[
lx.data.Extraction(
extraction_class="person",
extraction_text="Rosa Park",
attributes={"person_group": "Rosa Park"}
),
lx.data.Extraction(
extraction_class="role",
extraction_text="union organizer",
attributes={"person_group": "Rosa Park"}
),
lx.data.Extraction(
extraction_class="relationship",
extraction_text="mentored him back in Detroit",
attributes={"person_group": "Rosa Park"}
),
lx.data.Extraction(
extraction_class="tie_strength",
extraction_text="skeptical of our approach",
attributes={"person_group": "Rosa Park", "tie_strength": "neutral"}
),
]
)
]
rel_result = lx.extract(
text_or_documents=interview_text,
prompt_description=rel_prompt,
examples=rel_examples,
model_id="gemini-2.5-flash",
)
# Group by person
from collections import defaultdict
person_groups = defaultdict(list)
for e in rel_result.extractions:
    group = (e.attributes or {}).get("person_group", "unknown")
    person_groups[group].append(e)
for person, facts in person_groups.items():
    print(f"\n{person}:")
    for f in facts:
        print(f" [{f.extraction_class:14s}] '{f.extraction_text}'")
LangExtract: model=gemini-2.5-flash, current=724 chars, processed=0 chars: [00:06]
Maria Chen:
 [person ] 'Maria Chen'
 [role ] 'professor at City College'
 [relationship ] 'my professor at City College'
 [relationship ] 'introduced me to the local tenant rights group'
 [relationship ] 'had known each other since the early nineties through the teachers' union'
David Okafor:
 [person ] 'David Okafor'
 [role ] 'ran the outreach program'
 [relationship ] 'met David Okafor'
 [relationship ] 'had been organizing in this neighborhood for fifteen years'
 [relationship ] 'knew everyone'
 [relationship ] 'introduced to Councilwoman Patricia Reyes'
Councilwoman Patricia Reyes:
 [person ] 'Councilwoman Patricia Reyes'
 [role ] 'Councilwoman'
 [tie_strength ] 'crucial ally'
 [relationship ] 'had known each other since the early nineties through the teachers' union'
Greg Holloway:
 [person ] 'Greg Holloway'
 [role ] 'city planner'
 [tie_strength ] 'officially neutral'
 [tie_strength ] 'sympathetic but couldn't be seen taking sides'
 [relationship ] 'would quietly pass us zoning documents'
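The grouped facts can be collapsed into an edge list, the usual input for an ego network. The following sketch builds one edge per person from the `person_group` attribute. `build_ego_edges`, the `"narrator"` label for the ego, and the stand-in objects are assumptions made for illustration; the demo facts are taken from the output above.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_ego_edges(extractions, ego="narrator"):
    """Collapse grouped facts into one (ego, person, details) edge per person."""
    details = defaultdict(lambda: defaultdict(list))
    for e in extractions:
        attrs = getattr(e, "attributes", None) or {}
        person = attrs.get("person_group")
        if person is None:
            continue  # a fact without a group ID cannot be attached to anyone
        facts = details[person]  # creates the node even if only the name is known
        if e.extraction_class != "person":
            facts[e.extraction_class].append(e.extraction_text)
    return [(ego, person, dict(facts)) for person, facts in details.items()]


def _fact(cls, text, person):
    """Hypothetical Extraction-like object for the demo."""
    return SimpleNamespace(
        extraction_class=cls,
        extraction_text=text,
        attributes={"person_group": person},
    )


demo_facts = [
    _fact("person", "Greg Holloway", "Greg Holloway"),
    _fact("role", "city planner", "Greg Holloway"),
    _fact("tie_strength", "officially neutral", "Greg Holloway"),
    _fact("person", "Maria Chen", "Maria Chen"),
    _fact("role", "professor at City College", "Maria Chen"),
]

demo_edges = build_ego_edges(demo_facts)
for ego, person, facts in demo_edges:
    print(ego, "->", person, facts)
```

Each tuple maps naturally onto an edge with attributes in a graph library, should you want to draw or analyse the network; that step is left out here.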
6. Saving Results & Visualization¶
LangExtract can save results to JSONL and generate interactive HTML visualizations where extractions are highlighted in the source text.
# Save to JSONL
lx.io.save_annotated_documents(
[ner_result, survey_result, rel_result],
output_name="tutorial_extractions.jsonl",
output_dir=".",
)
print("Saved tutorial_extractions.jsonl")
LangExtract: Saving to tutorial_extractions.jsonl: 3 docs [00:00, 1667.94 docs/s]
✓ Saved 3 documents to tutorial_extractions.jsonl
Saved tutorial_extractions.jsonl
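JSONL is simply one JSON object per line, so the saved file can be inspected with the standard library alone. The sketch below reads such a file and lists the top-level keys of each record. It deliberately assumes nothing about the field names LangExtract writes; `read_jsonl` and `summarize_jsonl` are hypothetical helpers.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_jsonl(path):
    """Return the sorted top-level keys of each record, without assuming a schema."""
    return [sorted(record) for record in read_jsonl(path)]
```

For example, `summarize_jsonl("tutorial_extractions.jsonl")` should return three lists of keys, one per saved document, which is a quick way to see what was actually written before building anything on top of it.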
Generate Interactive Visualizations for Each Example¶
html_content = lx.visualize("tutorial_extractions.jsonl")
html_str = html_content.data if hasattr(html_content, "data") else html_content
with open("tutorial_visualization.html", "w", encoding="utf-8") as f:
    f.write(html_str)
print("Saved tutorial_visualization.html — open it in a browser to explore results")
# Render inline in the notebook (works in Jupyter)
from IPython.display import HTML
HTML(html_str)
LangExtract: Loading tutorial_extractions.jsonl: 100%|██████████| 12.2k/12.2k [00:00<00:00, 12.9MB/s]
✓ Loaded 3 documents from tutorial_extractions.jsonl
Saved tutorial_visualization.html — open it in a browser to explore results
# Visualize the survey result inline
lx.io.save_annotated_documents([survey_result], output_name="survey_extractions.jsonl", output_dir=".")
survey_html = lx.visualize("survey_extractions.jsonl")
survey_html_str = survey_html.data if hasattr(survey_html, "data") else survey_html
HTML(survey_html_str)
LangExtract: Saving to survey_extractions.jsonl: 1 docs [00:00, 1761.57 docs/s]
✓ Saved 1 documents to survey_extractions.jsonl
LangExtract: Loading survey_extractions.jsonl: 100%|██████████| 2.19k/2.19k [00:00<00:00, 8.74MB/s]
✓ Loaded 1 documents from survey_extractions.jsonl
# Visualize the relationship result inline (this reuses, and overwrites, survey_extractions.jsonl)
lx.io.save_annotated_documents([rel_result], output_name="survey_extractions.jsonl", output_dir=".")
survey_html = lx.visualize("survey_extractions.jsonl")
survey_html_str = survey_html.data if hasattr(survey_html, "data") else survey_html
HTML(survey_html_str)
LangExtract: Saving to survey_extractions.jsonl: 1 docs [00:00, 846.48 docs/s]
✓ Saved 1 documents to survey_extractions.jsonl
LangExtract: Loading survey_extractions.jsonl: 100%|██████████| 6.67k/6.67k [00:00<00:00, 16.6MB/s]
✓ Loaded 1 documents from survey_extractions.jsonl
Political Content Analysis¶
In this section we apply LangExtract to a real presidential debate transcript scraped in the web scraping tutorial. We extract two types of content from candidate turns: policy positions (what each candidate advocates for) and rhetorical moves (the persuasive strategies they use). This kind of analysis mirrors computational approaches used in political communication research to study framing, argumentation, and candidate differentiation at scale.
# Load the data
from google.colab import drive
import json
drive.mount("/content/drive", force_remount=False)
CORPUS_PATH = "/content/drive/My Drive/Teaching/langextract_tutorial/debate_corpus.json"
with open(CORPUS_PATH, encoding="utf-8") as f:
    corpus = json.load(f)
print(f"Loaded {len(corpus)} debates:")
for i, d in enumerate(corpus):
    print(f" [{i}] {d['title']} — {len(d['turns'])} turns")
# Build an excerpt from the October 22, 2020 presidential debate (TRUMP vs BIDEN)
debate = corpus[1]
candidate_speakers = {"TRUMP", "BIDEN"}
candidate_turns = [t for t in debate["turns"] if t["speaker"] in candidate_speakers][:4]
debate_text = "\n\n".join(f"{t['speaker']}: {t['text']}" for t in candidate_turns)
print(f"\nUsing: {debate['title']}")
print(f"Excerpt ({len(candidate_turns)} candidate turns):\n")
print(debate_text[:500] + "...")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loaded 3 debates:
[0] October 07, 2020 Vice Presidential Debate Transcript — 246 turns
[1] October 22, 2020 Debate Transcript — 354 turns
[2] September 29, 2020 Debate Transcript — 858 turns
Using: October 22, 2020 Debate Transcript
Excerpt (4 candidate turns):
TRUMP: So, as you know, 2.2 million people, modeled out, were expected to die. We closed up the greatest economy in the world in order to fight this horrible disease that came from China. It’s a worldwide pandemic. It’s all over the world. You see the spikes in Europe and many other places right now. If you notice, the mortality rate is down, 85%. The excess mortality rate is way down, and much lower than almost any other country. And we’re fighting it and we’re fighting it hard. There is a spik...
# Write prompt, give few-shot examples, and run the model
politics_prompt = textwrap.dedent("""
Analyse a political debate transcript and extract two types of content:
1. policy_position — a stated stance on a policy issue.
Attributes:
actor: the speaker's last name (TRUMP or BIDEN)
issue_area: the policy domain (economy, healthcare, foreign_policy, environment, or jobs)
stance: support, oppose, or ambiguous
2. rhetorical_move — a persuasive strategy used by a speaker.
Attributes:
actor: the speaker's last name
strategy: one of appeal_to_values, appeal_to_authority,
statistical_claim, fear_appeal, or opponent_attack
Use the exact text from the transcript. Do not paraphrase.
""")
politics_examples = [
lx.data.ExampleData(
text=(
"Candidate Rivera: Experts agree that tax cuts spur growth, "
"but Mayor Scott's reckless spending will bankrupt the city. "
"Hard-working residents deserve better."
),
extractions=[
lx.data.Extraction(
extraction_class="rhetorical_move",
extraction_text="Experts agree that tax cuts spur growth",
attributes={"actor": "Rivera", "strategy": "appeal_to_authority"}
),
lx.data.Extraction(
extraction_class="rhetorical_move",
extraction_text="Mayor Scott's reckless spending will bankrupt the city",
attributes={"actor": "Rivera", "strategy": "opponent_attack"}
),
lx.data.Extraction(
extraction_class="policy_position",
extraction_text="tax cuts spur growth",
attributes={"actor": "Rivera", "issue_area": "economy", "stance": "support"}
),
lx.data.Extraction(
extraction_class="rhetorical_move",
extraction_text="Hard-working residents deserve better",
attributes={"actor": "Rivera", "strategy": "appeal_to_values"}
),
]
)
]
politics_result = lx.extract(
text_or_documents=debate_text,
prompt_description=politics_prompt,
examples=politics_examples,
model_id="gemini-2.5-flash",
)
WARNING:absl:Prompt alignment: non-exact match: [example#0] class='policy_position' status=AlignmentStatus.MATCH_FUZZY text='tax cuts spur growth' char_span=(37, 57)
LangExtract: model=gemini-2.5-flash, current=4,030 chars, processed=0 chars: [00:14]
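The warning reports that the example extraction 'tax cuts spur growth' was matched fuzzily, not exactly, even though the phrase appears verbatim in the example text. One plausible cause, which the warning itself does not confirm, is that the phrase sits inside the span of an earlier extraction and is listed out of text order. The sketch below is a pre-flight check under that assumption; `check_example_order` is a hypothetical helper and does not reproduce LangExtract's own alignment logic.

```python
def check_example_order(text, extraction_texts):
    """Flag example extractions that are missing, out of order, or overlapping.

    Returns a list of (extraction_text, problem) tuples; an empty list means
    every snippet was found verbatim, after the end of the previous one.
    """
    problems = []
    cursor = 0  # end of the most recently placed snippet
    for snippet in extraction_texts:
        if text.find(snippet) == -1:
            problems.append((snippet, "not found verbatim"))
            continue
        pos = text.find(snippet, cursor)
        if pos == -1:
            problems.append((snippet, "starts before the previous extraction ends"))
            continue
        cursor = pos + len(snippet)
    return problems


# The few-shot example from this section, in the order the extractions are listed
example_text = (
    "Candidate Rivera: Experts agree that tax cuts spur growth, "
    "but Mayor Scott's reckless spending will bankrupt the city. "
    "Hard-working residents deserve better."
)
example_snippets = [
    "Experts agree that tax cuts spur growth",
    "Mayor Scott's reckless spending will bankrupt the city",
    "tax cuts spur growth",
    "Hard-working residents deserve better",
]

example_problems = check_example_order(example_text, example_snippets)
for snippet, problem in example_problems:
    print(f"'{snippet}': {problem}")
```

If the assumption holds, listing extractions in the order they appear in the text and avoiding nested spans in few-shot examples may silence the warning; that is worth testing on your own examples before relying on it.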
# Print results
print(f"Found {len(politics_result.extractions)} extractions\n")
for e in politics_result.extractions:
    attrs = e.attributes or {}
    attr_str = ", ".join(f"{k}={v}" for k, v in attrs.items())
    print(f" [{e.extraction_class:18s}] '{e.extraction_text[:60]}'")
    print(f" {'':20s} {attr_str}")
Found 24 extractions
[rhetorical_move ] '2.2 million people, modeled out, were expected to die'
actor=TRUMP, strategy=statistical_claim
[rhetorical_move ] 'horrible disease that came from China'
actor=TRUMP, strategy=opponent_attack
[rhetorical_move ] 'the mortality rate is down, 85%'
actor=TRUMP, strategy=statistical_claim
[rhetorical_move ] 'The excess mortality rate is way down, and much lower than a'
actor=TRUMP, strategy=statistical_claim
[policy_position ] 'We have a vaccine that’s coming, it’s ready. It’s going to b'
actor=TRUMP, issue_area=healthcare, stance=support
[policy_position ] 'Operation Warp Speed, which is the military, is going to dis'
actor=TRUMP, issue_area=healthcare, stance=support
[rhetorical_move ] 'I can tell you from personal experience that I was in the ho'
actor=UNKNOWN, strategy=appeal_to_authority
[rhetorical_move ] 'More and more people are getting better.'
actor=UNKNOWN, strategy=statistical_claim
[rhetorical_move ] 'I’ve been congratulated by the heads of many countries'
actor=UNKNOWN, strategy=appeal_to_authority
[policy_position ] 'We’re now making ventilators.'
actor=UNKNOWN, issue_area=healthcare, stance=support
[rhetorical_move ] '220,000 Americans dead'
actor=BIDEN, strategy=statistical_claim
[rhetorical_move ] 'Anyone who’s responsible for not taking control — in fact, n'
actor=BIDEN, strategy=opponent_attack
[rhetorical_move ] 'We’re in a situation where there are thousands of deaths a d'
actor=BIDEN, strategy=statistical_claim
[rhetorical_move ] 'as the New England Medical Journal said'
actor=BIDEN, strategy=appeal_to_authority
[rhetorical_move ] 'The expectation is we’ll have another 200,000 Americans dead'
actor=BIDEN, strategy=fear_appeal
[policy_position ] 'If we just wore these masks... we could save 100,000 lives'
actor=BIDEN, issue_area=healthcare, stance=support
[rhetorical_move ] 'the President’s own advisors have told him'
actor=BIDEN, strategy=appeal_to_authority
[rhetorical_move ] 'the President, thus far, still has no plan. No comprehensive'
actor=BIDEN, strategy=opponent_attack
[policy_position ] 'make sure we have everyone encouraged to wear a mask, all th'
actor=BIDEN, issue_area=healthcare, stance=support
[policy_position ] 'move in the direction of rapid testing, investing in rapid t'
actor=BIDEN, issue_area=healthcare, stance=support
[policy_position ] 'set up national standards as to how to open up schools and o'
actor=BIDEN, issue_area=healthcare, stance=support
[rhetorical_move ] 'the New England Medical Journal — one of the serious, most s'
actor=BIDEN, strategy=appeal_to_authority
[rhetorical_move ] 'the way this President has responded to this crisis has been'
actor=BIDEN, strategy=opponent_attack
[rhetorical_move ] 'I will take care of this, I will end this, I will make sure '
actor=BIDEN, strategy=appeal_to_values
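Several extractions in this run come back with actor=UNKNOWN. Because `debate_text` was assembled from turns whose speakers are known, the speaker can be recovered from the character offset instead of trusting the model. This is a minimal sketch, assuming the offsets index into the same string that was passed to `lx.extract()` and that the turns were joined with `"\n\n"` as above. The helper names and demo data are hypothetical.

```python
from types import SimpleNamespace


def turn_spans(turns, sep="\n\n"):
    """Recompute each turn's (start, end, speaker) span inside the joined transcript."""
    spans, cursor = [], 0
    for t in turns:
        block = f"{t['speaker']}: {t['text']}"
        spans.append((cursor, cursor + len(block), t["speaker"]))
        cursor += len(block) + len(sep)
    return spans


def speaker_at(position, spans):
    """Return the speaker whose turn contains the character offset, else None."""
    for start, end, speaker in spans:
        if start <= position < end:
            return speaker
    return None


def backfill_actor(extractions, spans, known=("TRUMP", "BIDEN")):
    """Return {index: speaker} for extractions whose actor is missing or not in `known`."""
    fixes = {}
    for i, e in enumerate(extractions):
        attrs = getattr(e, "attributes", None) or {}
        interval = getattr(e, "char_interval", None)
        if attrs.get("actor") in known or interval is None or interval.start_pos is None:
            continue
        speaker = speaker_at(interval.start_pos, spans)
        if speaker is not None:
            fixes[i] = speaker
    return fixes


# Hypothetical two-turn transcript built the same way as debate_text
demo_turns = [
    {"speaker": "TRUMP", "text": "We're now making ventilators."},
    {"speaker": "BIDEN", "text": "220,000 Americans dead."},
]
demo_debate = "\n\n".join(f"{t['speaker']}: {t['text']}" for t in demo_turns)
demo_spans = turn_spans(demo_turns)


def _stand_in(snippet, actor):
    """Hypothetical Extraction-like object located via str.index."""
    start = demo_debate.index(snippet)
    return SimpleNamespace(
        extraction_text=snippet,
        attributes={"actor": actor},
        char_interval=SimpleNamespace(start_pos=start, end_pos=start + len(snippet)),
    )


demo_extractions = [
    _stand_in("We're now making ventilators", "UNKNOWN"),
    _stand_in("220,000 Americans dead", "BIDEN"),
]
demo_fixes = backfill_actor(demo_extractions, demo_spans)
print(demo_fixes)
```

The function returns proposed corrections without mutating anything, so they can be reviewed first. With the real data the calls would be `turn_spans(candidate_turns)` and `backfill_actor(politics_result.extractions, spans)`.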
Comparing candidates side by side¶
Use the actor attribute to split the extractions by candidate, then summarise each one's policy positions and rhetorical repertoire.
from collections import defaultdict, Counter
# Split by actor
by_actor = defaultdict(lambda: defaultdict(list))
for e in politics_result.extractions:
    actor = (e.attributes or {}).get("actor", "unknown")
    by_actor[actor][e.extraction_class].append(e)
for actor, classes in sorted(by_actor.items()):
    print(f"\n{'='*55}")
    print(f" {actor}")
    print(f"{'='*55}")
    positions = classes.get("policy_position", [])
    if positions:
        print(f"\n Policy positions ({len(positions)}):")
        for p in positions:
            attrs = p.attributes or {}
            print(f" [{attrs.get('issue_area','?'):12s}] {attrs.get('stance','?'):8s} '{p.extraction_text[:55]}'")
    moves = classes.get("rhetorical_move", [])
    if moves:
        strategy_counts = Counter((m.attributes or {}).get("strategy", "?") for m in moves)
        print(f"\n Rhetorical strategies ({len(moves)} total):")
        for strategy, n in strategy_counts.most_common():
            examples = [m.extraction_text[:45] for m in moves
                        if (m.attributes or {}).get("strategy") == strategy]
            print(f" {strategy:22s} x{n} e.g. '{examples[0]}'")
=======================================================
BIDEN
=======================================================
Policy positions (4):
[healthcare ] support 'If we just wore these masks... we could save 100,000 li'
[healthcare ] support 'make sure we have everyone encouraged to wear a mask, a'
[healthcare ] support 'move in the direction of rapid testing, investing in ra'
[healthcare ] support 'set up national standards as to how to open up schools '
Rhetorical strategies (10 total):
opponent_attack x3 e.g. 'Anyone who’s responsible for not taking contr'
appeal_to_authority x3 e.g. 'as the New England Medical Journal said'
statistical_claim x2 e.g. '220,000 Americans dead'
fear_appeal x1 e.g. 'The expectation is we’ll have another 200,000'
appeal_to_values x1 e.g. 'I will take care of this, I will end this, I '
=======================================================
TRUMP
=======================================================
Policy positions (6):
[economy ] support 'We closed up the greatest economy in the world in order'
[healthcare ] support 'We have a vaccine that’s coming, it’s ready. It’s going'
[healthcare ] support 'Operation Warp Speed, which is the military, is going t'
[healthcare ] support 'We’re now making ventilators'
[healthcare ] support 'Johnson and Johnson is doing very well. Moderna is doin'
[healthcare ] support 'we’re working on very closely with other countries, in '
Rhetorical strategies (8 total):
appeal_to_authority x3 e.g. 'the military, is going to distribute the vacc'
statistical_claim x2 e.g. '2.2 million people, modeled out, were expecte'
appeal_to_values x2 e.g. 'greatest economy in the world'
opponent_attack x1 e.g. 'horrible disease that came from China'
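Because the two candidates have different totals (10 rhetorical moves for Biden, 8 for Trump), raw counts are not directly comparable. The sketch below converts the counts printed above into per-candidate proportions. The counts are copied by hand from that output rather than read from `politics_result`, and the helper name `strategy_shares` is an illustrative choice, not part of LangExtract.

```python
from collections import Counter

# Strategy counts copied from the per-candidate summary printed above.
strategy_counts = {
    "BIDEN": Counter(opponent_attack=3, appeal_to_authority=3,
                     statistical_claim=2, fear_appeal=1, appeal_to_values=1),
    "TRUMP": Counter(appeal_to_authority=3, statistical_claim=2,
                     appeal_to_values=2, opponent_attack=1),
}


def strategy_shares(counts):
    """Convert raw counts into proportions of that actor's total moves."""
    total = sum(counts.values())
    return {strategy: n / total for strategy, n in counts.items()}


shares = {actor: strategy_shares(c) for actor, c in strategy_counts.items()}

# Union of all strategies, so a strategy one candidate never used shows as 0%.
all_strategies = sorted({s for c in strategy_counts.values() for s in c})

print(f"{'strategy':22s} {'BIDEN':>7s} {'TRUMP':>7s}")
for s in all_strategies:
    biden = shares["BIDEN"].get(s, 0.0)
    trump = shares["TRUMP"].get(s, 0.0)
    print(f"{s:22s} {biden:7.0%} {trump:7.0%}")
```

With only 18 moves from a single debate excerpt, these proportions are descriptive at best; treat differences between the candidates as prompts for closer reading rather than as findings.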
What else is possible?¶
The same pattern generalises to many other political content analysis tasks:
| Task | Extraction classes | Useful attributes |
|---|---|---|
| Manifesto coding | commitment, policy_area | direction (positive/negative), domain |
| Legislative speech analysis | claim, evidence, appeal | party, topic |
| Social media political discourse | stance, target | sentiment, platform |
| Propaganda detection | technique | type (e.g. scapegoating, glittering_generality) |
| Cross-national media comparison | frame | country, outlet, valence |
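Whichever task you pick, it helps to write the schema down explicitly and check the model's output against it, since few-shot examples guide the model but do not guarantee that every attribute value stays within your codebook. The sketch below does this for the manifesto-coding row using plain dictionaries. `MANIFESTO_SCHEMA`, `validate`, the `domain` values, and the two sample records are all hypothetical; in practice you would build the records from each extraction's `extraction_class` and `attributes`.

```python
# Hypothetical codebook for the "manifesto coding" row of the table:
# the allowed extraction classes and, per attribute, the accepted values.
MANIFESTO_SCHEMA = {
    "classes": {"commitment", "policy_area"},
    "attributes": {
        "direction": {"positive", "negative"},
        "domain": {"economy", "healthcare", "environment"},  # illustrative values
    },
}


def validate(record, schema):
    """Return a list of problems found in one extraction record (empty if none)."""
    problems = []
    if record.get("extraction_class") not in schema["classes"]:
        problems.append(f"unknown class: {record.get('extraction_class')!r}")
    for name, value in (record.get("attributes") or {}).items():
        allowed = schema["attributes"].get(name)
        if allowed is None:
            problems.append(f"unknown attribute: {name!r}")
        elif value not in allowed:
            problems.append(f"bad value for {name}: {value!r}")
    return problems


# Made-up records for illustration only.
good = {
    "extraction_class": "commitment",
    "extraction_text": "We will expand rural broadband.",
    "attributes": {"direction": "positive", "domain": "economy"},
}
bad = {
    "extraction_class": "promise",
    "extraction_text": "We will cut waiting times.",
    "attributes": {"direction": "neutral"},
}

print(validate(good, MANIFESTO_SCHEMA))
print(validate(bad, MANIFESTO_SCHEMA))
```

Records that fail validation are good candidates for manual review, or for an extra few-shot example that demonstrates the correct coding.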
10. Exercise¶
Pick a text domain you work with (for example legal clauses, academic abstracts, or news articles) and design a schema with at least three extraction classes and two attributes.
# Use this cell for the exercise
EXTRA: Multi-Provider Support¶
LangExtract works with several model providers. The model_id argument selects the provider automatically. See the GitHub README for the full list of supported models.
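The calls in this section differ only in a few keyword arguments per provider, so one way to keep them tidy is to hold those arguments in a dictionary and merge them with the shared ones. This is a minimal sketch: `PROVIDER_SETTINGS` and `extract_kwargs` are hypothetical names, and the values simply mirror the settings used in this section's cell rather than anything LangExtract defines itself.

```python
# Per-provider keyword arguments, mirroring the calls in this section.
PROVIDER_SETTINGS = {
    "gemini": {"model_id": "gemini-2.5-flash"},
    "openai": {
        "model_id": "gpt-4o-mini",
        "fence_output": True,
        "use_schema_constraints": False,
    },
    "ollama": {
        "model_id": "gemma2:2b",
        "model_url": "http://localhost:11434",
        "fence_output": False,
        "use_schema_constraints": False,
    },
}


def extract_kwargs(provider, **shared):
    """Merge the shared arguments with one provider's settings."""
    if provider not in PROVIDER_SETTINGS:
        raise ValueError(f"unknown provider: {provider!r}")
    # Provider settings come last so they override any shared defaults.
    return {**shared, **PROVIDER_SETTINGS[provider]}


kwargs = extract_kwargs("openai", text_or_documents="some text")
print(kwargs)
# Intended use (not run here): lx.extract(**kwargs, prompt_description=..., examples=...)
```

Keeping the settings in one place makes it easier to run the same prompt and examples against several providers and compare the results.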
sample_text = "Barack Obama served as the 44th President of the United States from 2009 to 2017."
simple_prompt = "Extract named entities (person, role, organization, date, location)."
simple_examples = [
    lx.data.ExampleData(
        text="Nelson Mandela was president of South Africa from 1994 to 1999.",
        extractions=[
            lx.data.Extraction(extraction_class="person", extraction_text="Nelson Mandela"),
            lx.data.Extraction(extraction_class="role", extraction_text="president"),
            lx.data.Extraction(extraction_class="location", extraction_text="South Africa"),
            lx.data.Extraction(extraction_class="date", extraction_text="1994 to 1999"),
        ]
    )
]
# --- Google Gemini ---
result_gemini = lx.extract(
    text_or_documents=sample_text,
    prompt_description=simple_prompt,
    examples=simple_examples,
    model_id="gemini-2.5-flash",  # Best schema-constraint support
)
# --- OpenAI (requires langextract[openai] and OPENAI_API_KEY) ---
# result_openai = lx.extract(
#     text_or_documents=sample_text,
#     prompt_description=simple_prompt,
#     examples=simple_examples,
#     model_id="gpt-4o-mini",
#     fence_output=True,  # Required for OpenAI
#     use_schema_constraints=False,
# )
# --- Local Ollama (requires Ollama running on localhost) ---
# result_ollama = lx.extract(
#     text_or_documents=sample_text,
#     prompt_description=simple_prompt,
#     examples=simple_examples,
#     model_id="gemma2:2b",
#     model_url="http://localhost:11434",
#     fence_output=False,
#     use_schema_constraints=False,
# )
# --- Anthropic Claude (not via LangExtract; uses the Anthropic SDK directly) ---
# LangExtract does not natively support Anthropic, but Claude's tool use feature
# provides comparable structured extraction. You lose LangExtract's source grounding
# and visualization, but the tool's input_schema constrains the output shape in a similar way.
#
# %pip install anthropic
# import os
# import anthropic
# os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
#
# client = anthropic.Anthropic()
# response = client.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=1024,
#     tools=[{
#         "name": "extract_entities",
#         "description": simple_prompt,
#         "input_schema": {
#             "type": "object",
#             "properties": {
#                 "extractions": {
#                     "type": "array",
#                     "items": {
#                         "type": "object",
#                         "properties": {
#                             "extraction_class": {"type": "string"},
#                             "extraction_text": {"type": "string"},
#                         },
#                         "required": ["extraction_class", "extraction_text"],
#                     }
#                 }
#             },
#             "required": ["extractions"],
#         }
#     }],
#     # Force the tool call; otherwise the reply may start with a plain text block
#     tool_choice={"type": "tool", "name": "extract_entities"},
#     messages=[{"role": "user", "content": sample_text}],
# )
# # Select the tool_use block instead of assuming it is the first content block
# tool_block = next(b for b in response.content if b.type == "tool_use")
# result_claude = tool_block.input["extractions"]
# print("Claude results:")
# for e in result_claude:
#     print(f" [{e['extraction_class']:12s}] '{e['extraction_text']}'")
print("Gemini results:")
for e in result_gemini.extractions:
    print(f" [{e.extraction_class:12s}] '{e.extraction_text}'")
LangExtract: model=gemini-2.5-flash, current=81 chars, processed=0 chars: [00:02]
Gemini results:
 [person      ] 'Barack Obama'
 [role        ] '44th President'
 [location    ] 'United States'
 [date        ] '2009 to 2017'
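When extractions come from a route that bypasses LangExtract, such as the commented-out Anthropic SDK call in the cell above, they arrive as plain dictionaries with no character positions. The sketch below recovers approximate grounding by exact substring search. It is a deliberately simple stand-in and not LangExtract's own alignment logic; the function name `locate` and the `start`/`end` keys are hypothetical, and the `records` list is written by hand to match the Gemini output shown above.

```python
def locate(extractions, source_text):
    """Attach (start, end) character offsets by exact substring search."""
    located = []
    cursor = 0  # prefer matches after the previous one, so repeated strings map in order
    for item in extractions:
        text = item["extraction_text"]
        start = source_text.find(text, cursor)
        if start == -1:
            # Fall back to a search from the beginning (out-of-order extractions).
            start = source_text.find(text)
        if start == -1:
            # The model paraphrased or invented the span: leave it ungrounded.
            located.append({**item, "start": None, "end": None})
            continue
        end = start + len(text)
        located.append({**item, "start": start, "end": end})
        cursor = end
    return located


sample_text = "Barack Obama served as the 44th President of the United States from 2009 to 2017."
records = [
    {"extraction_class": "person", "extraction_text": "Barack Obama"},
    {"extraction_class": "role", "extraction_text": "44th President"},
    {"extraction_class": "location", "extraction_text": "United States"},
    {"extraction_class": "date", "extraction_text": "2009 to 2017"},
    {"extraction_class": "organization", "extraction_text": "White House"},  # not in the text
]

for r in locate(records, sample_text):
    print(f"[{r['extraction_class']:12s}] {r['start']}-{r['end']} '{r['extraction_text']}'")
```

Spans that come back with `None` offsets do not appear verbatim in the source, which is a useful signal that the model paraphrased or hallucinated them.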