This is a bundled, runnable demo. Given 100 clusters of news articles from Hugging Face, Epsilon extracts entities, detects ambiguity, adjudicates only the hard cases, and produces a canonical entity graph. The whole thing runs with one command.
When two articles mention "Obama," "Barack Obama," and "President Obama," are those the same entity? When "Apple" appears in a tech article and a farming article, should they merge? A traditional data pipeline can extract rows. But handling ambiguity — deciding when two mentions refer to the same thing — requires judgment. And that judgment only becomes visible after the first pass, not before.
The workflow uses map_reduce for the extraction phase and sharded_queue for targeted adjudication. Deterministic reducers handle the merge logic in between. The second phase only exists if the first phase discovers ambiguity.
Download a parquet file from Hugging Face (Awesome075/multi_news_parquet), select a sample of document clusters, and write normalized JSON inputs. Each input contains a document ID, the source text, and a reference summary. Sampling is deterministic and reproducible from a seed.
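The seeded sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo's actual code: the helper `sample_clusters`, the row keys `document` and `summary`, and the output field names `doc_id`, `source_text`, and `reference_summary` are all hypothetical, and in-memory rows stand in for the parquet file.

```python
import json
import random


def sample_clusters(rows, sample_size, seed):
    """Pick a reproducible random sample and normalize each row.

    `rows` is a list of dicts with the (assumed) keys "document" and "summary".
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    size = min(sample_size, len(rows))
    picked = sorted(rng.sample(range(len(rows)), size))
    return [
        {
            "doc_id": f"cluster-{i:05d}",  # hypothetical ID scheme
            "source_text": rows[i]["document"].strip(),
            "reference_summary": rows[i]["summary"].strip(),
        }
        for i in picked
    ]


# Stand-in rows; a real run would read them from the downloaded parquet file.
rows = [{"document": f" text {n} ", "summary": f" summary {n} "} for n in range(50)]
inputs = sample_clusters(rows, sample_size=5, seed=17)
print(json.dumps(inputs[0], indent=2))
```

Because the generator is built from the seed on every call, the same seed always selects the same clusters, which is what makes a run repeatable.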
One agent per document cluster. Each agent reads its input and produces strict JSON: a short summary, keywords, typed entities (PERSON, ORG, GPE), candidate relations, evidence snippets, and confidence scores. This is parallel work, but it is structured — agents produce machine-readable intermediate outputs, not free-form prose.
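To make "strict JSON" concrete, here is a hedged sketch of a check a pipeline could run on each agent reply before accepting it. The key names, the `parse_extraction` helper, and the 0 to 1 confidence range are assumptions for illustration; the demo's real schema may differ.

```python
import json

ALLOWED_TYPES = {"PERSON", "ORG", "GPE"}
# Assumed top-level keys, modeled on the fields listed above.
REQUIRED_KEYS = {"summary", "keywords", "entities", "relations", "evidence"}


def parse_extraction(raw):
    """Parse one agent reply; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for entity in data["entities"]:
        if not isinstance(entity, dict):
            raise ValueError("each entity must be an object")
        if entity.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"bad entity type: {entity.get('type')!r}")
        confidence = entity.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            raise ValueError(f"bad confidence for {entity.get('name')!r}")
    return data


good_reply = json.dumps({
    "summary": "Obama spoke in Chicago.",
    "keywords": ["speech"],
    "entities": [{"name": "Barack Obama", "type": "PERSON", "confidence": 0.97}],
    "relations": [],
    "evidence": ["Obama spoke in Chicago on Tuesday."],
})
print(parse_extraction(good_reply)["entities"][0]["name"])
```

Rejecting a reply with a single exception type gives the caller one place to decide whether to retry the task.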
Deterministic reducers merge all extraction outputs and look for likely duplicates: alias overlap ("Barack Obama" vs "Obama"), similar names that might refer to the same person, and conflicting type signals across documents. This is the key transition: the system inspects the outputs of the first phase and decides whether more work is needed.
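One way such a detection pass could work is sketched below. The token-subset heuristic, the `find_ambiguities` helper, and the shape of the candidate records are assumptions chosen for illustration; the demo's reducers may use different rules.

```python
from itertools import combinations


def tokens(name):
    """Lowercased word set for a name, with periods dropped."""
    return frozenset(name.lower().replace(".", "").split())


def find_ambiguities(mentions):
    """Return candidate cases worth a second look.

    `mentions` is a list of (name, entity_type) tuples merged from all documents.
    """
    types_by_name = {}
    for name, entity_type in mentions:
        types_by_name.setdefault(name, set()).add(entity_type)

    candidates = []
    # Conflicting type signals for the same surface form.
    for name, types in sorted(types_by_name.items()):
        if len(types) > 1:
            candidates.append(
                {"kind": "type_conflict", "names": [name], "types": sorted(types)}
            )
    # Alias overlap: one name's tokens are a strict subset of the other's.
    for a, b in combinations(sorted(types_by_name), 2):
        ta, tb = tokens(a), tokens(b)
        if ta < tb or tb < ta:
            candidates.append({"kind": "alias_overlap", "names": [a, b]})
    return candidates


mentions = [
    ("Barack Obama", "PERSON"),
    ("Obama", "PERSON"),
    ("President Obama", "PERSON"),
    ("Washington", "PERSON"),
    ("Washington", "GPE"),
]
for case in find_ambiguities(mentions):
    print(case)
```

The pass only flags candidates. It never merges anything itself, so the decision stays with the adjudication phase, and an empty result means no second phase is needed.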
If ambiguity candidates exist, a second phase launches using sharded_queue. Each adjudication agent gets one ambiguity case and decides: merge or keep separate, canonical name, entity type, aliases to preserve, supporting evidence. The finalizer applies those decisions and writes the canonical entity graph.
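The finalizer step might look roughly like the sketch below. The decision fields (`action`, `names`, `canonical_name`, `entity_type`) and the `apply_decisions` helper are hypothetical names modeled on the list above, not the demo's actual format.

```python
def apply_decisions(entities, decisions):
    """Fold adjudication decisions into a canonical entity table.

    `entities` maps name -> {"type": str, "aliases": [str, ...]}.
    Each decision is a dict with "action" ("merge" or "keep_separate"),
    "names", "canonical_name", and "entity_type".
    """
    canonical = {
        name: {"type": e["type"], "aliases": set(e["aliases"])}
        for name, e in entities.items()
    }
    for decision in decisions:
        if decision["action"] != "merge":
            continue  # keep_separate leaves the entities untouched
        target = decision["canonical_name"]
        merged_aliases = set()
        for name in set(decision["names"]) | {target}:
            old = canonical.pop(name, None)
            if old is not None:
                merged_aliases |= old["aliases"]  # preserve existing aliases
            if name != target:
                merged_aliases.add(name)  # the losing name becomes an alias
        merged_aliases.discard(target)
        canonical[target] = {"type": decision["entity_type"], "aliases": merged_aliases}
    return {
        name: {"type": e["type"], "aliases": sorted(e["aliases"])}
        for name, e in sorted(canonical.items())
    }


entities = {
    "Barack Obama": {"type": "PERSON", "aliases": []},
    "Obama": {"type": "PERSON", "aliases": ["President Obama"]},
    "Apple": {"type": "ORG", "aliases": []},
}
decisions = [
    {
        "action": "merge",
        "names": ["Barack Obama", "Obama"],
        "canonical_name": "Barack Obama",
        "entity_type": "PERSON",
    }
]
graph = apply_decisions(entities, decisions)
print(graph["Barack Obama"])
```

Keeping this step deterministic means the same set of decisions always yields the same graph, whichever agent produced each decision.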
It is wasteful to run high-touch reasoning on every item. This workflow processes everything cheaply in parallel, detects where uncertainty exists, and spends agent effort only on those cases. Junior analysts do the first pass; specialists resolve the tricky ones.
The demo uses map_reduce for extraction because results need hierarchical aggregation, and sharded_queue for adjudication because follow-up cases are flat, independent, and judgment-heavy. Real workflows should not use one topology for everything.
The agents extract and adjudicate. The reducers and finalizers merge, detect, and format. Malformed agent output does not collapse the whole run — retries are localized to failed tasks. This is how production workflows actually need to work.
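A minimal sketch of localized retries follows, assuming a hypothetical `call_agent(payload, attempt)` callable that returns raw text. It runs tasks sequentially for clarity, whereas the demo runs them in parallel, and it is not Epsilon's actual retry mechanism.

```python
import json


def run_with_retries(tasks, call_agent, max_attempts=3):
    """Run each task independently; retry only those whose output fails to parse."""
    results, failures = {}, {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            raw = call_agent(payload, attempt)
            try:
                results[task_id] = json.loads(raw)
            except json.JSONDecodeError as exc:
                failures[task_id] = f"attempt {attempt}: {exc.msg}"
            else:
                failures.pop(task_id, None)  # a later attempt succeeded
                break
    return results, failures


def flaky_agent(payload, attempt):
    """Stand-in agent: doc-b fails once, doc-c never returns valid JSON."""
    if payload == "doc-b" and attempt == 1:
        return "not json"
    if payload == "doc-c":
        return "{broken"
    return json.dumps({"doc": payload, "entities": []})


tasks = {"a": "doc-a", "b": "doc-b", "c": "doc-c"}
results, failures = run_with_retries(tasks, flaky_agent)
print(sorted(results), sorted(failures))
```

A task that never produces valid output ends up in `failures` while every other task still completes, so one bad reply cannot sink the run.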
entity_graph.json
Canonical entities with types, aliases, and relations. The final product — a knowledge graph built from the raw corpus.
entity_index.json
Lightweight lookup table for fast entity access by name or ID.
document_summaries.jsonl
One structured summary per document cluster with extracted keywords and entities.
run_report.md
High-level run summary: how many documents processed, entities found, ambiguities resolved, and failures recovered.
The full source is in the repo at examples/hf_entity_graph/. Clone, set your API key, run the command.
# run the entity graph demo on 100 document clusters
$ OPENAI_API_KEY=... \
  PYTHONPATH=. python examples/hf_entity_graph/run_demo.py \
  --sample-size 100 \
  --sample-mode random \
  --sample-seed 17 \
  --worker-count 8

Phase 1: map_reduce extraction over 100 clusters
  workers: 8
  topology: map_reduce
  ...extracting entities and relations
  ...reducing across shards
  ambiguity candidates found: 23

Phase 2: sharded_queue adjudication
  23 cases assigned to adjudication agents
  ...resolving merge/keep decisions

Finalizing entity graph
  entities: 847
  relations: 1,204
  adjudicated: 23
  output: final/entity_graph.json