Iceberg archive — design

Status: writer landed (archives::iceberg::author_snapshot); wiring it into the nightly snapshot phase + BigLake registration are the remaining steps. This doc names every table, its partitioning, the GCS layout, the BigQuery wiring, and the retention policy.

Evaluation outcome (recorded 2026-06-14)

The writer uses the iceberg crate (0.9) for format correctness — a subtly-wrong hand-rolled Avro manifest is a silently unreadable archive, and this lake holds binding legal records. Two frictions shaped the integration:

v1 simplifications (documented so a reader isn't surprised): the table is unpartitioned — point-in-time queries come from the snapshot log (each run appends a snapshot), which is the design's stated mechanism; dt= identity partitioning and per-file column bounds are deferred refinements. The append-only snapshot log is preserved across runs by chaining from the prior v<N>.metadata.json (TableMetadataBuilder::new_from_metadata).

Why

The nightly Archives Restate workflow already snapshots every Postgres table to Parquet on GCS (archives/src/snapshot.rs, archives/src/parquet_io.rs). The objects already live under an Iceberg-shaped prefix:

gs://<project>-exports/iceberg/<table>/data/<yyyy-mm-dd>/part-<uuid_v7>.parquet

But they are loose Parquet files, not a table: there is no Iceberg metadata, so a reader has to glob the prefix and guess the schema, schema evolution is invisible, and "the documents table as of last Tuesday" is not a query — it is a file hunt. Promoting these snapshots to Apache Iceberg table format (table metadata + manifest lists + manifest files alongside the existing data/ Parquet) turns each prefix into an evolving, time-travelable, engine-portable table. "Archive everything to Iceberg" then has a precise meaning: Postgres entity snapshots and the email send/delivery streams are one queryable lake, readable from BigQuery (and Spark, Trino, DuckDB) without a copy.

What "promote to Iceberg" means concretely

For each table prefix we additionally write, every nightly run:

The data files we already produce are reused as-is; the new bytes are metadata. The snapshot is append-only: each run adds a new snapshot pointing at that night's data files, and prior snapshots stay valid — so the lake mirrors the append-only per-Project git repos and never rewrites history.

Tables

Two families, one lake.

1. Postgres entity snapshots (full-table, nightly)

The 25 tables in archives::tables::ALL_TABLES, one Iceberg table each:

addresses, answers, blobs, credentials, disclosures, documents, entities, entity_billing_profiles, entity_types, git_repositories, invoice_line_items, invoices, jurisdictions, letters, mailrooms, notation_events, notations, person_entity_roles, person_project_roles, persons, questions, relationship_logs, sent_emails, share_issuances, templates.

Each nightly run writes a full snapshot of the table (the current Postgres state), so a snapshot is a consistent as-of-that-night image — point-in-time queries come from Iceberg's snapshot log, not from diffing.

2. Email streams (append-only events)

Adding email_events to ALL_TABLES is the only table-set change; everything else is the metadata-layer promotion.

Partitioning

Partition spec is recorded in the table metadata, so a reader prunes by dt without scanning.

GCS layout

One bucket, gs://<project>-exports (the existing NAVIGATOR_EXPORTS_BUCKET; cloud::exports_from_env). Per table:

gs://<project>-exports/iceberg/<table>/
  metadata/
    version-hint.text
    v<N>.metadata.json
    snap-<snapshot-id>-<uuid>.avro      # manifest list
    <uuid>-m0.avro                       # manifest file(s)
  data/
    dt=2026-06-10/part-<uuid_v7>.parquet
    dt=2026-06-11/part-<uuid_v7>.parquet

Bytes stay in cloud::StorageService (GCS in prod, FsStorage in dev) — the writer goes through the trait, never the GCS SDK directly, per CLAUDE.md.

Catalog + BigQuery wiring — operator decision required

Two ways to make BigQuery read the lake; pick one before building the writer:

  1. BigLake / BigQuery managed Iceberg tables (recommended). Register each Iceberg table once via a BigLake connection; BigQuery reads the Iceberg metadata directly, so schema evolution and snapshot adds show up without a per-run DDL. This is the closest to "it's just a table." Cost: a one-time BigLake connection + IAM on the bucket.
  2. BigQuery external tables over the Parquet (status quo, simplest). A per-table CREATE EXTERNAL TABLE with OPTIONS(format = 'PARQUET', uris = ['gs://…/iceberg/<table>/data/*']) — the pattern already used for the email-events stream (see the email-events pipeline). No Iceberg metadata needed, but no time-travel and no schema evolution: it globs the data files. Use this only if BigLake is unavailable.

Recommendation: BigLake managed Iceberg, because the whole point of promoting to Iceberg is time-travel + schema evolution; option 2 throws both away. The writer should emit standards-compliant Iceberg metadata regardless, so the catalog choice does not change the bytes on GCS — only how BigQuery is pointed at them.

Flag for the operator: confirm BigLake-managed-Iceberg vs external-tables-over-Parquet before the writer commit. This is the one externally-constrained decision; the rest of this doc holds either way.

Retention

Where it runs

Inside the existing nightly Archives Restate workflow on workflows-serviceno new worker pod (all workflows live in workflows-service). The Iceberg metadata write is one additional journaled ctx.run step per table, after the Parquet write it already does. The diagnostic email gains an "Iceberg tables" line per table (snapshot id + metadata version).

Council consensus (folded in)

Run inline; the deliberation is not kept, only the decisions:

Writer follow-up — status

  1. Evaluate iceberg-rustdone: chosen (see Evaluation outcome above); not hand-rolled.
  2. Add email_events to archives::tables::ALL_TABLES (+ its fetch_batch arm). Pending.
  3. Emit metadata/ (table metadata JSON + manifest list + manifest), append-onlydone as a reusable writer (archives::iceberg::author_snapshot, unit-tested for metadata round-trip + snapshot-log chaining). Calling it from the nightly snapshot phase is the next wiring step.
  4. Rename the data partition dir to dt=<date>. Pending (the writer is unpartitioned in v1, so this is cosmetic until identity-partitioning lands).
  5. Wire BigLake (or external tables, per the operator's call) and add a smoke query. Pending (machine-bound; the one offline-unverifiable part — needs a live BigQuery to confirm the authored metadata is BigLake-readable).
  6. Surface per-table Iceberg snapshot ids in the nightly diagnostic email. Pending (with the wiring in step 3).