Problem
The kleptocracy timeline project built a sophisticated actor alias detection system (`suggest_actor_aliases.py`) that uses fuzzy matching, Levenshtein distance, and a known acronym dictionary to detect duplicate actor names and suggest canonical mappings. This infrastructure exists only in the kleptocracy timeline repo and needs to be absorbed into Pyrite as a core capability.
Currently, Pyrite entries have an `aliases` field, and the wikilink service resolves links by checking aliases. But there is no tooling to:
1. Detect potential duplicates across entries (e.g., "FBI" and "Federal Bureau of Investigation" are the same entity)
2. Suggest alias mappings automatically using fuzzy matching
3. Maintain a known acronym dictionary (50+ acronyms like FBI, CIA, NSA, DOJ, DOGE, ACLU, etc.)
4. Interactively review and approve alias suggestions
Reference Implementation
`/Users/markr/kleptocracy-timeline/timeline/scripts/maintenance/suggest_actor_aliases.py` (~680 lines, self-contained Python). This is the source to port — read it for algorithm details, edge cases, and the interactive review UX.
Context
The kleptocracy timeline's `suggest_actor_aliases.py` (~680 lines) is a multi-pass duplicate detection pipeline that scans all actor strings across events and proposes merge groups. It processes ~1,235 unique actors across 4,400+ events, producing an `actor_aliases.json` mapping file (880 lines, ~250 canonical actors with variants).
This is a general-purpose capability that applies to any KB with person, organization, or actor entries — not just timelines.
Detection Pipeline (6 passes, sequential)
Each pass removes matched actors so later passes don't re-process them:
1. Case-insensitive exact match (100% confidence): "donald trump" ↔ "Donald Trump"
2. Slug match (95% confidence): Normalize to lowercase slugs (strip possessives, punctuation, unicode). Catches "Trump's DOJ" vs "Trumps DOJ"
3. Prefix stripping (90% confidence): Remove U.S./US/United States prefixes, then check whether the stripped version exists separately. "U.S. Department of Justice" ↔ "Department of Justice"
4. Parenthetical extraction (90% confidence): Split "American Legislative Exchange Council (ALEC)" into base name + acronym, merge all variants
5. Known acronym table (85% confidence): Hardcoded dictionary of ~50 government/political acronyms. If "FBI" and "Federal Bureau of Investigation" both appear, group them. Also checks U.S. prefix variants
6. Fuzzy matching (variable confidence, threshold 85%): `difflib.SequenceMatcher` ratio. Groups by first word to avoid O(n²). Only considers actors with 2+ appearances. Confidence = match ratio × 100
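The six passes above can be sketched roughly as follows. This is a minimal illustration written from the description in this document, not a copy of `suggest_actor_aliases.py`. All function names (`slugify`, `keyed_groups`, `detect_duplicates`, and so on) are hypothetical, the acronym table is a two-entry stand-in, and several details are assumptions: `slugify` drops apostrophes (one reading of "strip possessives" that makes the "Trump's DOJ" example match), the prefix pass groups any names that agree after stripping, and the fuzzy pass only forms pairs.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher

# Illustrative two-entry stand-in for the ~50-entry acronym table.
KNOWN_ACRONYMS = {
    "FBI": "Federal Bureau of Investigation",
    "DOJ": "Department of Justice",
}
US_PREFIX = re.compile(r"^(?:U\.S\.|US|United States)\s+", re.IGNORECASE)
PAREN = re.compile(r"^(.+?)\s*\(([^()]+)\)$")


def slugify(name):
    """Lowercase slug: fold unicode to ASCII, drop apostrophes, hyphenate the rest."""
    text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def strip_prefix(name):
    """Lowercased name with any leading U.S./US/United States removed."""
    return US_PREFIX.sub("", name).lower()


def keyed_groups(names, key):
    """Bucket names by key(name) and keep buckets with two or more members."""
    buckets = defaultdict(set)
    for name in names:
        buckets[key(name)].add(name)
    return [group for group in buckets.values() if len(group) > 1]


def paren_groups(names):
    """For 'Base Name (ACRO)', propose {full string, base name, acronym}."""
    groups = []
    for name in names:
        match = PAREN.match(name)
        if match:
            groups.append({name, match.group(1), match.group(2)})
    return groups


def acronym_groups():
    """Propose each acronym with its expansion and U.S.-prefixed variants."""
    return [
        {acro, full, f"U.S. {full}", f"US {full}", f"United States {full}"}
        for acro, full in KNOWN_ACRONYMS.items()
    ]


def fuzzy_pairs(names, counts, threshold=0.85):
    """Pair similar names, comparing only within first-word buckets."""
    buckets = defaultdict(list)
    for name in sorted(names):
        if counts.get(name, 0) >= 2 and name.split():
            buckets[name.split()[0].lower()].append(name)
    pairs, used = [], set()
    for bucket in buckets.values():
        for i, left in enumerate(bucket):
            for right in bucket[i + 1:]:
                if left in used or right in used:
                    continue
                ratio = SequenceMatcher(None, left.lower(), right.lower()).ratio()
                if ratio >= threshold:
                    pairs.append(({left, right}, round(ratio * 100)))
                    used.update((left, right))
    return pairs


def detect_duplicates(counts):
    """Run the six passes in order; counts maps actor string to usage count."""
    remaining = set(counts)
    found = []

    def claim(groups, confidence, reason):
        for group in groups:
            present = group & remaining  # ignore names that are absent or already claimed
            if len(present) > 1:
                found.append((sorted(present), confidence, reason))
                remaining.difference_update(present)

    claim(keyed_groups(remaining, str.lower), 100, "case")
    claim(keyed_groups(remaining, slugify), 95, "slug")
    claim(keyed_groups(remaining, strip_prefix), 90, "prefix")
    claim(paren_groups(remaining), 90, "parenthetical")
    claim(acronym_groups(), 85, "acronym")
    for pair, confidence in fuzzy_pairs(remaining, counts):
        claim([pair], confidence, "fuzzy")
    return found
```

Calling `detect_duplicates({"FBI": 10, "Federal Bureau of Investigation": 6})` would return a single group tagged `"acronym"` at confidence 85. The `claim` helper is what enforces the "later passes don't re-process" rule: once a name lands in a group it leaves `remaining`, so a cheaper, higher-confidence pass always wins over the fuzzy pass.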
Canonical Name Selection
When merging a group, pick the "best" name using this priority:
1. Full name over acronym
2. Without U.S. prefix (unless 80%+ of usage has the prefix)
3. Proper case over lowercase
4. Without parenthetical over with
5. Highest usage count breaks ties
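One way to express this priority list is as a sort key, where each rule becomes one element of a tuple and earlier elements dominate later ones. The sketch below is an assumption-laden illustration, not the reference implementation: `pick_canonical` and `is_acronym` are hypothetical names, the acronym test is a simple heuristic (short, all caps, no spaces), the 80% rule is computed over usage counts within the group, and the final alphabetical tie-break is added only to keep the result deterministic.

```python
import re

US_PREFIX = re.compile(r"^(?:U\.S\.|US|United States)\s+")


def is_acronym(name):
    """Heuristic (an assumption): short, all caps, no spaces."""
    return name.isupper() and " " not in name and len(name) <= 6


def pick_canonical(group, counts):
    """Return the preferred name from group; counts maps name to usage count."""
    total = sum(counts.get(name, 0) for name in group)
    prefixed = sum(counts.get(name, 0) for name in group if US_PREFIX.match(name))
    # Keep the U.S. prefix only when prefixed forms carry 80%+ of the usage.
    prefer_prefix = total > 0 and prefixed / total >= 0.8

    def rank(name):
        has_prefix = bool(US_PREFIX.match(name))
        return (
            is_acronym(name),             # 1. full name over acronym
            has_prefix != prefer_prefix,  # 2. prefix only if it dominates usage
            name == name.lower(),         # 3. proper case over lowercase
            "(" in name,                  # 4. without parenthetical over with
            -counts.get(name, 0),         # 5. highest usage count breaks ties
            name,                         # deterministic last resort (assumption)
        )

    # False sorts before True, so the smallest tuple is the best candidate.
    return min(group, key=rank)
```

Because the tuple is compared element by element, a heavily used acronym such as "FBI" still loses to a rarely used full name, which matches the ordering of the rules above.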
Review Modes
Scope — What Pyrite Absorbs
The core algorithm ports directly. Key adaptations:
1. Input source: Query the Pyrite index for entries and their string fields instead of reading YAML frontmatter directly
2. Output target: Update entry `aliases` fields directly, or optionally create/merge actor entries (instead of writing `actor_aliases.json`)
3. Acronym dictionary: Configurable per KB (in `kb.yaml` or a separate config file) instead of hardcoded
4. Generalization: Works on any string field, not just actors (e.g., `pyrite qa suggest-aliases --field=actors --type=cascade_event`), and can also deduplicate entries of any type
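For the output-target adaptation, the step of folding an approved merge group into an entry's `aliases` field might look like the sketch below. It works on plain lists and makes no claim about Pyrite's actual entry or index API. The function name `merged_aliases` is hypothetical, and exact-string deduplication is an assumption (a case-insensitive comparison may be more appropriate depending on how the wikilink service resolves aliases).

```python
def merged_aliases(existing_aliases, group, canonical):
    """Return existing aliases plus every non-canonical variant in group.

    Order is preserved for existing aliases; new variants are appended in
    sorted order so repeated runs produce the same list.
    """
    seen = {canonical}  # the canonical name is never its own alias
    merged = []
    for alias in list(existing_aliases) + sorted(group):
        if alias not in seen:
            seen.add(alias)
            merged.append(alias)
    return merged
```

The function is idempotent: running it again with the same group leaves the alias list unchanged, which is useful if the command is re-run after a partial review.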
CLI Command
`pyrite qa suggest-aliases` with flags: