Source content verification via LLM (Phase 2)
backlog_item

qa, cascade, sources, fact-checking, llm

Problem

Even when source URLs are live, there's no verification that the source content actually supports the claims in the KB entry. AI research agents may accurately cite real URLs but misrepresent or hallucinate the content of those sources.

Context

This is Phase 2 of the source verification pipeline. Phase 1 (URL liveness checking) must be complete first. This phase requires LLM integration and has associated API costs.

Scope

  • Fetch source URL content and extract key claims from the entry body
  • Use an LLM to compare the fetched content against the extracted claims: "Does this source support these claims?"
  • Flag entries where sources don't appear to support key claims
  • Store verification results as QA assessment entries linked to the original entry
  • Support batch processing with cost controls
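
The comparison step could be sketched roughly as follows. This is a minimal illustration, not Pyrite's implementation: `Verification`, `build_prompt`, `verify`, the JSON reply format, and the 0.5 review threshold are all assumptions made for the example, and `complete` stands in for whatever callable the existing LLMService exposes.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verification:
    score: float        # 0 = unsupported, 1 = fully supported
    needs_review: bool  # True when the score falls below the threshold
    rationale: str


def build_prompt(claims: list[str], source_text: str) -> str:
    """Assemble the comparison prompt (wording and reply format are assumed)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(claims, 1))
    return (
        "Does this source support these claims?\n"
        'Reply with JSON: {"score": <0-1>, "rationale": "<short reason>"}\n\n'
        f"Claims:\n{numbered}\n\nSource:\n{source_text}"
    )


def verify(
    claims: list[str],
    source_text: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, raw reply out
    review_threshold: float = 0.5,   # assumed cutoff for human review
) -> Verification:
    raw = complete(build_prompt(claims, source_text))
    try:
        data = json.loads(raw)
        # Clamp so a misbehaving model cannot produce a score outside 0-1.
        score = min(1.0, max(0.0, float(data["score"])))
        rationale = str(data.get("rationale", ""))
    except (ValueError, KeyError, TypeError):
        # An unparseable reply is treated as unverified, so it gets flagged.
        score, rationale = 0.0, "unparseable model reply"
    return Verification(score, score < review_threshold, rationale)
```

Taking `complete` as a plain callable keeps the sketch provider-agnostic and lets the scoring logic be tested with a stub instead of a paid API call.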

Acceptance Criteria

  • `pyrite qa verify-sources --kb=timeline --sample=50` verifies a random sample
  • Each entry gets a verification score (0-1) based on source-claim alignment
  • Low-scoring entries are flagged for human review
  • Cost-controlled: respects `--max-cost` budget parameter
  • Results stored as QA assessment entries with evidence links
  • Incremental: skips entries already verified unless `--force`
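
One way the sampling, budget, and incremental criteria might fit together is sketched below. Only the `--sample`, `--max-cost`, and `--force` semantics come from this item. The function `run_batch` and its parameters `verify_one`, `estimate_cost`, and `already_verified` are hypothetical names, and the sketch assumes a per-entry cost can be estimated before the LLM call.

```python
import random
from typing import Callable, Iterable, Optional


def run_batch(
    entry_ids: Iterable[str],
    already_verified: set[str],            # hypothetical: IDs with a stored assessment
    verify_one: Callable[[str], float],    # hypothetical: returns a 0-1 score
    estimate_cost: Callable[[str], float], # hypothetical: estimated cost per entry
    sample: int = 50,
    max_cost: float = 1.0,
    force: bool = False,
    seed: Optional[int] = None,
) -> tuple[dict[str, float], float]:
    """Verify a random sample without exceeding the cost budget."""
    rng = random.Random(seed)
    # Incremental: drop entries that already have a result unless forced.
    candidates = [e for e in entry_ids if force or e not in already_verified]
    chosen = rng.sample(candidates, min(sample, len(candidates)))

    spent = 0.0
    results: dict[str, float] = {}
    for entry_id in chosen:
        cost = estimate_cost(entry_id)
        # Stop before the call that would push spending past the budget.
        if spent + cost > max_cost:
            break
        results[entry_id] = verify_one(entry_id)
        spent += cost
    return results, spent
```

Filtering before sampling means repeated runs keep drawing from the unverified pool, so `--sample=50` yields up to 50 new verifications rather than 50 draws that may mostly be skipped. Whether that is the intended behavior is a design choice this item leaves open.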

Open Questions

  • Which LLM provider to use? Should respect Pyrite's existing LLMService config (BYOK)
  • What content extraction strategy for web pages? (readability, trafilatura, etc.)
  • How to handle paywalled sources? (skip with warning? use cached content?)