Source URL liveness checking for QA (Phase 1)
backlog_item

Tags: validation, qa, cascade, sources

Problem

Pyrite's QA service validates structural aspects of entries (required fields, date formats, schema compliance) but does not verify that source URLs actually resolve. This is a significant gap for research KBs, where entries are created by AI agents that may cite nonexistent URLs.

The kleptocracy timeline identified this as its biggest QA gap: 4,400+ events created by research agents with no automated verification that source URLs are live.

Context

The kleptocracy timeline's existing QA framework includes:

  • Structural validation (YAML, required fields, date logic)
  • Source quality scoring (tier-1/2/3 distribution)
  • Quality auditing (body completeness, source count, metadata)

What's missing: URL liveness checks (do source URLs return 200?).

Scope

  • Add a `pyrite qa check-urls` command that validates all source URLs in a KB
  • Check HTTP status codes (200 OK, 301 redirect, 404 not found, etc.)
  • Handle rate limiting and timeouts gracefully (configurable concurrency, per-domain rate limits)
  • Cache results to avoid re-checking unchanged URLs (SQLite or file-based cache with TTL)
  • Report: valid, redirected, broken, unreachable URLs per entry
  • Support `--fix` to remove broken sources or mark entries for review
  • Support `--sample=N` to check a random sample (useful for large KBs)
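
The result buckets, TTL cache, and sampling described above could be sketched roughly as follows. This is a minimal illustration, not Pyrite's implementation: `classify`, `UrlCache`, `sample_urls`, the `url_checks` table, and the one-week default TTL are all hypothetical names and values. Treating 429 and 5xx responses as "unreachable" rather than "broken" is likewise an assumption, made to avoid flagging transient failures.

```python
import random
import sqlite3
import time


def classify(status):
    """Map an HTTP status code (None = no response) to a report bucket."""
    # Assumption: 429 and 5xx are treated as transient, not as broken links.
    if status is None or status == 429 or status >= 500:
        return "unreachable"
    if 200 <= status < 300:
        return "valid"
    if 300 <= status < 400:
        return "redirected"
    return "broken"


class UrlCache:
    """SQLite-backed cache of check results with a time-to-live (hypothetical)."""

    def __init__(self, path=":memory:", ttl_seconds=7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS url_checks ("
            "url TEXT PRIMARY KEY, status INTEGER, checked_at REAL)"
        )

    def put(self, url, status, now=None):
        """Store or overwrite the latest result for a URL."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO url_checks VALUES (?, ?, ?)",
            (url, status, now),
        )
        self.db.commit()

    def get(self, url, now=None):
        """Return (True, status) for a fresh entry, else (False, None)."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT status, checked_at FROM url_checks WHERE url = ?", (url,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return (False, None)
        return (True, row[0])


def sample_urls(urls, n, seed=None):
    """Pick up to n URLs at random, as `--sample=N` might."""
    urls = list(urls)
    return random.Random(seed).sample(urls, min(n, len(urls)))
```

A `--force` run would simply skip the `get` lookup and overwrite each entry through `put`.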

Acceptance Criteria

  • `pyrite qa check-urls --kb=timeline` checks all source URLs
  • Broken URLs reported with entry ID and source index
  • Results cached (re-run skips already-checked URLs unless `--force`)
  • Exits with non-zero status if broken URLs found (CI-friendly)
  • Rate limiting prevents being blocked by target sites
  • `--fix` marks entries with broken URLs for review (adds tag or status)
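
As an illustration of the rate-limiting, reporting, and exit-status criteria, here is one possible shape. It is a sketch under stated assumptions: `DomainRateLimiter`, `broken_lines`, `exit_status`, the result tuple layout, and the report line format are all hypothetical, and the limiter is single-threaded (a concurrent checker would need a lock around `wait`).

```python
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        delay = max(0.0, self.next_allowed.get(host, 0.0) - now)
        if delay:
            self.sleep(delay)
        self.next_allowed[host] = now + delay + self.min_interval
        return delay


def broken_lines(results):
    """Format broken URLs with entry ID and source index.

    results: iterable of (entry_id, source_index, url, category) tuples.
    """
    return [
        f"{entry_id} source[{index}]: {url}"
        for entry_id, index, url, category in results
        if category == "broken"
    ]


def exit_status(results):
    """Return 1 if any URL is broken, else 0, so CI can fail the build."""
    return 1 if any(r[3] == "broken" for r in results) else 0
```

A command entry point could then print `broken_lines(results)` and call `sys.exit(exit_status(results))`.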