Email events pipeline — SendGrid delivery stream to BigQuery
Two halves of the outbound-email audit picture, joined in BigQuery.
- Request side —
web::email'sLoggingEmailwrites onesent_emailsrow per attempt. A SendGrid202proves the message was accepted, not delivered. Each row now carriessg_message_id(SendGrid'sX-Message-Id), the join key to the delivery stream. - Delivery side — SendGrid's Event Webhook POSTs lifecycle events (
processed,delivered,open,click,bounce,dropped,deferred,spam_report,unsubscribe). Theweb::email_eventshandler lands each POST as one Snappy Parquet object on GCS — the same Parquet-on-GCS shape thearchivessnapshot uses, so BigQuery reads it through an external table.
How the join works
At send time SendGridEmail::build_request_body stamps top-level custom_args — template_slug and person_id (see
OutboundEmail::with_person). SendGrid echoes those keys on every lifecycle event for that message, so the delivery
rows carry template_slug / person_id directly and the analytics join needs no address parsing:
sent_emails.sg_message_id ── 1:N ── email_events.sg_message_id
email_events.person_id ───────── persons.id
The admin "Send welcome" path sets person_id; workflow-driven sends carry template_slug only until EmailPayload
threads a person id (a follow-up).
Object layout
One POST becomes one object:
email-events/data/dt=<YYYY-MM-DD>/<sha256(body)>.parquet
dt=<date>is a Hive-style partition (date of the first event in the batch) so a BigQuery external table prunes by day.- The filename is the SHA-256 of the raw request body, so SendGrid's at-least-once retries (it re-POSTs the identical
body on any non-2xx for 24h) overwrite the same object — file-level idempotency with no dedupe table.
sg_event_idis unique per event for row-level dedupe at query time.
Columns (all nullable Utf8 except event_unix_ts which is Int64): sg_event_id, sg_message_id, event, email,
template_slug, person_id, url, reason, status, timestamp_utc, event_unix_ts, raw_json. raw_json keeps
the whole event so a field we don't model yet is never lost.
The events land in web's configured storage bucket under the email-events/ prefix. A deployer who wants delivery
analytics on a separate lifecycle (and IAM) can point a dedicated bucket — e.g. YOUR_PROJECT_ID-events — at the prefix
instead; nothing in the handler hard-codes a bucket.
Operator setup (one-time, machine-bound)
These run on the operator's machine against the live project — web issues no BigQuery DDL itself.
1. Configure the SendGrid Event Webhook
Point SendGrid's Event Webhook at the path-secret URL and store the secret so prod boot accepts it:
# Pick a long random token; SendGrid posts to this exact URL.
kubectl -n navigator create secret generic navigator-web-secrets \
--from-literal=SENDGRID_EVENTS_SECRET="$(openssl rand -hex 24)" \
--dry-run=client -o yaml | kubectl apply -f -
# In the SendGrid console, set the Event Webhook POST URL to:
# https://www.your-domain.example/api/email-events/<that-token>
SENDGRID_EVENTS_SECRET is a production boot invariant (web::config::enforce_prod_invariants), so a deploy without it
fails fast rather than serving an unauthenticated endpoint. The ECDSA "Signed Event Webhook" is the stronger next layer
— it authenticates the payload, not just the URL.
2. Create the BigQuery external table
Reuse the navigator_bi dataset and connection from the archives bootstrap (see
cloud/README.md), then:
CREATE EXTERNAL TABLE `YOUR_PROJECT_ID.navigator_bi.email_events`
WITH CONNECTION `us-west4.exports`
OPTIONS (
format = 'PARQUET',
uris = ['gs://YOUR_PROJECT_ID-assets/email-events/data/*'],
hive_partition_uri_prefix = 'gs://YOUR_PROJECT_ID-assets/email-events/data',
require_hive_partition_filter = false
);
Substitute the bucket web actually writes to (its NAVIGATOR_GCS_BUCKET / storage backend). New partitions show up on
the next query — BigLake external tables re-scan their uris glob, so there is no refresh step.
Analytics
Per-template delivery funnel, joined to the request side:
SELECT
e.template_slug,
COUNT(DISTINCT e.sg_message_id) AS messages,
COUNTIF(e.event = 'delivered') AS delivered,
COUNTIF(e.event = 'open') AS opened,
COUNTIF(e.event = 'click') AS clicked,
COUNTIF(e.event IN ('bounce', 'dropped', 'spam_report')) AS problems
FROM `YOUR_PROJECT_ID.navigator_bi.email_events` AS e
WHERE e.dt >= '2026-05-01'
GROUP BY e.template_slug
ORDER BY messages DESC;
Non-goals
- Live dashboards. The lake is for analytics; the
sent_emailstable (admin/portal/admin/email-log) stays the operational request-side view. - Replacing
sent_emails. Postgres holds the request side; Parquet-on-GCS holds the delivery side; BigQuery joins them. - Iceberg-managed metadata. Same deferral as
archives—format = 'ICEBERG'with authoredmetadata/v<n>.metadata.jsonis a later follow-up; the Parquet external table is the v1.