PostgreSQL as Optional Storage Backendbacklog_item

infrastructuredeploymentfeaturestoragepostgres
3 min read · Edit on Pyrite

Problem

SQLite is the right default for single-user and agent workflows — zero config, embedded, fast. But multi-user deployments (demo site, corporate teams, hosted solution) need concurrent writes, row-level locking, and operational tooling that SQLite can't provide. The demo site in particular would benefit from Postgres: proper concurrent access for multiple visitors, pgvector for semantic search, and standard hosting on every cloud platform.

Solution

Add PostgreSQL as an optional storage backend. Most of the data access layer already uses SQLAlchemy, so the core migration is a configuration change. The work is in the areas that bypass SQLAlchemy or use SQLite-specific features.

What's Already Abstracted (low effort)

  • Entry CRUD (SQLAlchemy ORM)
  • Settings, starred entries, collections (SQLAlchemy ORM)
  • Basic queries and filters (SQLAlchemy query builder)
  • What Needs Work

    Full-text search:

  • SQLite: FTS5 virtual tables
  • Postgres: `tsvector`/`tsquery` with `GIN` index
  • Postgres FTS is actually more capable (ranking, stemming, language-aware)
  • Need: abstract FTS behind a search interface, implement both backends
  • Semantic search / embeddings:

  • SQLite: custom vector storage in `embed_queue` table, manual cosine similarity
  • Postgres: pgvector extension, `vector` column type, `<=>` cosine distance operator
  • pgvector is cleaner — native SQL operator instead of application-level math
  • Need: abstract embedding storage and similarity queries, implement pgvector backend
  • Atomic operations:

  • SQLite: `json_extract()` for metadata queries (e.g., task_claim CAS)
  • Postgres: native JSON operators (`->`, `->>`, `jsonb_set()`)
  • Need: abstract JSON query operations or use SQLAlchemy's JSON column support
  • Ephemeral KBs:

  • Already DB-only (no git) — natural fit for Postgres
  • TTL-based garbage collection maps cleanly to Postgres `DELETE` with timestamp comparison
  • Concurrent ephemeral KB creation is safer with row-level locking
  • Configuration

    ```yaml

    pyrite.yaml — SQLite (default, no change)

    storage: backend: sqlite path: .pyrite/index.db

    pyrite.yaml — PostgreSQL

    storage: backend: postgres url: postgresql://user:pass@localhost:5432/pyrite # Or via environment variable url: ${DATABASE_URL} ```

    Migration Path

  • `pyrite index sync` rebuilds the database from source files regardless of backend
  • No data migration needed — Postgres instance starts empty, index sync populates it
  • This is the "derived data" advantage: the DB is always rebuildable from git
  • Hosted Solution Angle

    A Postgres backend opens the door to a hosted Pyrite offering:

  • Multi-tenant with schema-per-KB or row-level security
  • Managed Postgres (Neon, Supabase, RDS) for zero-ops
  • pgvector included in most managed Postgres offerings
  • Standard backup/restore via pg_dump
  • Connection pooling (PgBouncer) for high concurrency
  • This isn't on the immediate roadmap, but having the backend ready keeps the option open.

    Prerequisites

  • SQLAlchemy already used for most data access
  • Background embedding pipeline (#57, done) defines the embedding storage interface
  • FTS5 usage identified and bounded
  • Success Criteria

  • `pyrite serve` works with `storage.backend: postgres` configuration
  • Full-text search and semantic search functional on Postgres
  • Ephemeral KBs work on Postgres
  • Task claim CAS atomicity works (row-level locking is actually better than SQLite here)
  • `pyrite index sync` populates Postgres from git-backed files
  • Demo site runs on Postgres
  • All existing tests pass against both backends (parameterized test suite)
  • Delivered (0.10)

    Implemented as `PostgresBackend` in `pyrite/storage/backends/postgres_backend.py` via the `SearchBackend` protocol (ADR-0014).

  • 66/66 conformance tests passing — full protocol parity with SQLiteBackend
  • tsvector/tsquery with GIN index for weighted full-text search
  • pgvector with HNSW index for cosine similarity semantic search
  • Hybrid search combining both with application-level RRF reranking
  • Performance: ~3x slower indexing, ~2x slower queries vs SQLite — acceptable for server deployments where SQLite's single-writer lock is the real bottleneck
  • LanceDB evaluated as alternative but rejected (49-66x slower) — see ADR-0016
  • What's Deferred

  • Demo site deployment on Postgres (#85)
  • Multi-tenant schema-per-KB or row-level security
  • Ephemeral KBs on Postgres
  • pgvector managed hosting optimization
  • Launch Context

    Valuable for the demo site (#85) — Postgres is standard on Fly.io/Railway and handles concurrent visitors. Also opens the door to a future hosted offering. Could ship as part of 0.8 if demo site needs it, or as a fast-follow.