Entity Graph from a News Corpus

The Problem

Extraction is easy. Resolution is hard.

When articles mention "Obama," "Barack Obama," and "President Obama," are those the same entity? When "Apple" appears in a tech article and a farming article, should they merge? A traditional data pipeline can extract rows, but deciding when two mentions refer to the same thing requires judgment. And the cases that need that judgment only become visible after the first pass, not before.

How It Works

Four phases. Two topologies. Agents only where they're needed.

The workflow uses map_reduce for the extraction phase and sharded_queue for targeted adjudication. Deterministic reducers handle the merge logic in between. The adjudication phase runs only if the reducers find ambiguity in the extraction outputs.

01

Sample and prepare documents

Download a parquet file from Hugging Face (Awesome075/multi_news_parquet), select a sample of document clusters, and write normalized JSON inputs. Each input contains a document ID, the source text, and a reference summary. Deterministic, reproducible with a seed.

Function call
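
As a rough illustration of the sampling step, the sketch below picks clusters with a seeded random generator and writes one normalized JSON file per cluster. It is not the demo's actual code: loading the parquet file is left out, the rows are assumed to be dicts with "document" and "summary" keys, and the function names, output field names, and ID format are all hypothetical.

```python
import json
import random
from pathlib import Path


def sample_clusters(rows, sample_size, seed):
    """Pick `sample_size` rows reproducibly: the same seed gives the same sample."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    count = min(sample_size, len(rows))
    indices = sorted(rng.sample(range(len(rows)), k=count))
    return [(i, rows[i]) for i in indices]


def normalize(index, row):
    """Build one normalized input record (field names are assumptions)."""
    return {
        "doc_id": f"cluster-{index:06d}",
        "source_text": " ".join(row["document"].split()),  # collapse whitespace
        "reference_summary": row["summary"].strip(),
    }


def write_inputs(rows, out_dir, sample_size, seed):
    """Write one JSON input file per sampled cluster and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, row in sample_clusters(rows, sample_size, seed):
        record = normalize(index, row)
        path = out / f"{record['doc_id']}.json"
        path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Because the generator is created from the seed inside the function, two runs with the same seed and the same rows select the same clusters.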

02

Fan out extraction across agents

One agent per document cluster. Each agent reads its input and produces strict JSON: a short summary, keywords, typed entities (PERSON, ORG, GPE), candidate relations, evidence snippets, and confidence scores. This is parallel work, but it is structured — agents produce machine-readable intermediate outputs, not free-form prose.

Agent — parallel map_reduce
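
"Strict JSON" only helps if something checks it. The sketch below shows one way a caller might validate a single agent response before it reaches the reducers. The key names, the ExtractionError class, and the rule that confidence lies between 0 and 1 are assumptions made for illustration; only the three entity types come from the description above.

```python
import json

ALLOWED_TYPES = {"PERSON", "ORG", "GPE"}
REQUIRED_KEYS = {"summary", "keywords", "entities", "relations"}  # assumed key names


class ExtractionError(ValueError):
    """Raised when an agent's output does not match the expected shape."""


def parse_extraction(raw):
    """Parse one agent response; reject anything that is not JSON of the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ExtractionError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ExtractionError("top-level value must be an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ExtractionError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["entities"], list):
        raise ExtractionError("entities must be a list")
    for entity in data["entities"]:
        if not isinstance(entity, dict):
            raise ExtractionError("each entity must be an object")
        if entity.get("type") not in ALLOWED_TYPES:
            raise ExtractionError(f"unknown entity type: {entity.get('type')!r}")
        confidence = entity.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            raise ExtractionError(f"bad confidence for {entity.get('name')!r}")
    return data
```

Raising a dedicated error type lets the caller tell "the agent produced bad output" apart from other failures and retry only that task.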

03

Detect ambiguity across outputs

Deterministic reducers merge all extraction outputs and look for likely duplicates: alias overlap ("Barack Obama" vs "Obama"), similar names that might refer to the same person, conflicting type signals across documents. This is the key transition — the system inspects the outputs of the first phase and decides whether more work is needed.

Deterministic reducer
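
To make the two main signals concrete, here is a minimal sketch of a deterministic ambiguity detector. It flags a name seen with more than one type, and a pair of names where one name's tokens are contained in the other's. The honorific list, the token-subset heuristic, and the case format are assumptions for illustration, not the reducer the demo ships.

```python
from collections import defaultdict
from itertools import combinations

TITLES = {"president", "mr", "mrs", "ms", "dr", "sen", "senator"}  # assumed list


def name_tokens(name):
    """Lowercase a name, drop punctuation and honorifics, return its tokens."""
    words = (w.strip(".,").lower() for w in name.split())
    return frozenset(w for w in words if w and w not in TITLES)


def find_ambiguities(mentions):
    """Return candidate cases from an iterable of (name, type) mentions.

    Two heuristics: the same name was seen with more than one type, or one
    name's tokens are a subset of (or equal to) another name's tokens.
    """
    types_by_name = defaultdict(set)
    for name, entity_type in mentions:
        types_by_name[name].add(entity_type)

    cases = []
    for name, types in sorted(types_by_name.items()):
        if len(types) > 1:
            cases.append({"kind": "type_conflict", "names": [name], "types": sorted(types)})

    # Pairwise comparison is quadratic in the number of distinct names;
    # acceptable for a sketch, too slow for a large corpus.
    for a, b in combinations(sorted(types_by_name), 2):
        ta, tb = name_tokens(a), name_tokens(b)
        if ta and tb and (ta <= tb or tb <= ta):
            cases.append({"kind": "alias_overlap", "names": [a, b]})
    return cases
```

The detector only proposes cases. Whether "Obama" and "Barack Obama" really are the same entity is left to the adjudication phase.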

04

Adjudicate only the hard cases

If ambiguity candidates exist, a second agent phase launches using sharded_queue. Each adjudication agent gets one ambiguity case and decides whether to merge the mentions or keep them separate, along with the canonical name, entity type, aliases to preserve, and supporting evidence. The finalizer applies those decisions and writes the canonical entity graph.

Agent — targeted sharded_queue
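
The finalizer's job can be sketched as a pure function from extracted entities plus adjudication decisions to a canonical table. The decision fields used here ("action", "names", "canonical_name", "entity_type") and the per-entity "mentions" count are hypothetical names chosen for the example, and chained merges (A into B, then B into C) are not handled.

```python
def apply_decisions(entities, decisions):
    """Fold adjudication decisions into a canonical entity table.

    `entities` maps a surface name to its record; each decision either
    merges the listed names under one canonical name or leaves them apart.
    """
    canonical_of = {name: name for name in entities}  # default: every name stands alone
    for decision in decisions:
        if decision["action"] == "merge":
            for name in decision["names"]:
                canonical_of[name] = decision["canonical_name"]

    merged = {}
    for name, record in entities.items():
        target = canonical_of[name]
        node = merged.setdefault(
            target,
            {"name": target, "type": record["type"], "aliases": set(), "mentions": 0},
        )
        if name != target:
            node["aliases"].add(name)  # keep the merged-away name as an alias
        node["mentions"] += record["mentions"]

    # An explicit type from the adjudicator overrides the extraction-time type.
    for decision in decisions:
        if decision["action"] == "merge" and decision["canonical_name"] in merged:
            merged[decision["canonical_name"]]["type"] = decision["entity_type"]

    return {name: {**node, "aliases": sorted(node["aliases"])} for name, node in merged.items()}
```

Keeping this step deterministic means the same set of decisions always produces the same graph, regardless of the order in which agents finished.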

Why This Matters

Output-driven workflows are the real use case for agents.

Escalation, not uniform processing

It is wasteful to run high-touch reasoning on every item. This workflow processes everything cheaply in parallel, detects where uncertainty exists, and spends agent effort only on those cases. In human terms: junior analysts do the first pass, and specialists resolve the tricky cases.

Different work needs different topologies

Extraction uses map_reduce because its results need hierarchical aggregation. Adjudication uses sharded_queue because follow-up cases are flat, independent, and judgment-heavy. Real workflows should not use one topology for everything.
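
The difference between the two shapes of work can be shown with a generic analogy in plain Python. This is not how Epsilon implements map_reduce or sharded_queue; tree_reduce and run_flat are hypothetical helpers that only illustrate "combine partial results level by level" versus "handle each case on its own".

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tree_reduce(partials, fan_in=4):
    """Hierarchical aggregation: combine partial results in groups until one remains."""
    level = list(partials)
    if not level:
        return Counter()
    while len(level) > 1:
        # Each pass merges up to `fan_in` neighbours into one partial result.
        level = [sum(level[i:i + fan_in], Counter()) for i in range(0, len(level), fan_in)]
    return level[0]


def run_flat(cases, handler, workers=4):
    """Flat, independent work: each case is handled alone, with no combine step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, cases))
```

In the first helper every result feeds into a shared aggregate, so the order and grouping of merges is part of the design. In the second, results never meet, so the only thing to tune is how many workers drain the list.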

Agents + deterministic steps together

The agents extract and adjudicate. The reducers and finalizers merge, detect, and format. Malformed agent output does not collapse the whole run — retries are localized to failed tasks. This is how production workflows actually need to work.
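
Localized retries can be sketched with the standard library alone. In the example below, call_agent is a hypothetical callable standing in for an agent invocation that returns a string, and "malformed" is narrowed to "not parseable as JSON"; the demo's actual retry policy may differ.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, call_agent, max_attempts=3):
    """Run one task, retrying only this task when its output is not valid JSON."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            output = json.loads(call_agent(task))
            return {"task": task, "ok": True, "output": output, "attempts": attempt}
        except json.JSONDecodeError as exc:
            last_error = str(exc)  # remember why the last attempt failed
    return {"task": task, "ok": False, "error": last_error, "attempts": max_attempts}


def run_all(tasks, call_agent, workers=8):
    """Fan tasks out; a failure is recorded on its own result and never aborts the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: run_with_retries(t, call_agent), tasks))
```

A task that never produces valid output ends up as a single failed record in the results, which a run report can then count, while every other task completes normally.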

What You Get

Structured outputs, not a wall of text.

entity_graph.json

Canonical entities with types, aliases, and relations. The final product — a knowledge graph built from the raw corpus.

entity_index.json

Lightweight lookup table for fast entity access by name or ID.
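
The relationship between the graph and the index can be illustrated as follows. The schema is an assumption: the sketch supposes the graph holds an "entities" list whose items carry "id", "name", and optional "aliases". The real files may be laid out differently, so check entity_graph.json before relying on any of these keys.

```python
def build_index(graph):
    """Map every ID, canonical name, and alias (lowercased) to an entity."""
    by_id = {}
    by_name = {}
    for entity in graph["entities"]:
        by_id[entity["id"]] = entity
        for name in [entity["name"], *entity.get("aliases", [])]:
            by_name.setdefault(name.lower(), entity["id"])  # first entity wins on a name clash
    return {"by_id": by_id, "by_name": by_name}


def lookup(index, key):
    """Resolve a key that may be an entity ID, a canonical name, or an alias."""
    if key in index["by_id"]:
        return index["by_id"][key]
    entity_id = index["by_name"].get(key.lower())
    return index["by_id"].get(entity_id)  # None when nothing matches
```

With an index like this, looking up "obama" and looking up the entity's ID would return the same record, provided "Obama" was kept as an alias during adjudication.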

document_summaries.jsonl

One structured summary per document cluster with extracted keywords and entities.

run_report.md

High-level run summary: documents processed, entities found, ambiguities resolved, and failures recovered.

Run It Yourself

This ships with Epsilon.

The full source is in the repo at examples/hf_entity_graph/. Clone, set your API key, run the command.

# run the entity graph demo on 100 document clusters
$ OPENAI_API_KEY=... \
  PYTHONPATH=. python examples/hf_entity_graph/run_demo.py \
    --sample-size 100 \
    --sample-mode random \
    --sample-seed 17 \
    --worker-count 8

Phase 1: map_reduce extraction over 100 clusters
  workers: 8  topology: map_reduce
  ...extracting entities and relations
  ...reducing across shards
  ambiguity candidates found: 23

Phase 2: sharded_queue adjudication
  23 cases assigned to adjudication agents
  ...resolving merge/keep decisions

Finalizing entity graph
  entities: 847  relations: 1,204  adjudicated: 23
  output: final/entity_graph.json
