GitOps: edit → merge → release → deploy
Navigator's entire lifecycle hangs off one branch — main. Every change reaches it the same way (a PR that
auto-merges), main is what the production cluster pulls, and the daily release rides off main's history. This doc is
the source of truth for that flow; the workspace CLAUDE.md carries only the short rules and links here.
For agents, this collapses to two codebase actions: create a PR, or review/update an existing PR. The branch ceremony, test gate, release tag, and deploy hand-off are all supporting steps inside those actions.
main is sacred and squash-merge-only
- Never commit directly to main. It advances solely through pull requests; there is no direct push, ever.
- Every PR lands by squash. Squash is the only merge strategy: each PR collapses to exactly one commit on main, regardless of how many commits (or Merge branch 'main' commits) the branch carried. Merge commits and rebase-merge are disabled on the repo, so there is no other way to land. main's history is therefore one linear commit per PR, and a branch's internal history never reaches it.
- main is what production runs. The GKE cluster's Config Sync pulls examples/deploy/k8s/gke from main (see gke-prod.md), and the nightly release tag is cut from main's tip. A bad merge to main is a production concern, not just a code-review one.
The branch → PR → auto-merge flow
Every task — agent or human — follows the same three steps. No workflow invents its own branch ceremony; they all inherit this.
- Branch. Before the first edit, create a topic branch: git switch -c <kebab-topic> (e.g. git switch -c daily-cd-pipeline). If you find yourself on main with uncommitted work, branch first and carry the changes over; never commit them to main.
- Push + open a PR. git push -u origin <branch> then gh pr create.
- Enable auto-merge. gh pr merge --auto --squash. Always --squash: it is the only strategy the repo accepts, so the flag matches what GitHub would do anyway, but pass it explicitly so intent is never ambiguous. GitHub squash-merges the moment every required check goes green; you do not babysit the merge or merge by hand. The whole PR becomes one commit on main; write the PR title as the Conventional Commit you want in main's history, since that title (not the branch's individual commits) is the squashed commit's subject.
Auto-merge is a GitHub-native repo setting, not a workflow — which is why the three workflows below still suffice.
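The three steps above can be sketched as a pair of shell functions. This is a minimal illustration, not a script the repo ships: the names run, start_topic, open_pr, and DRY_RUN are assumptions introduced here, and passing --title with --fill to gh pr create is one way to set the squashed subject while taking the body from the branch's commits.

```shell
#!/usr/bin/env bash
# Sketch of the branch -> PR -> auto-merge steps.
# run, start_topic, open_pr, and DRY_RUN are illustrative names, not part of the repo.
set -euo pipefail

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: create the topic branch before the first edit.
# Usage: start_topic <kebab-topic>
start_topic() {
  run git switch -c "$1"
}

# Steps 2 and 3: push, open the PR, and enable squash auto-merge.
# Call this after the work has been committed on the topic branch.
# Usage: open_pr <branch> <conventional-commit-title>
open_pr() {
  local branch="$1" title="$2"
  run git push -u origin "$branch"
  run gh pr create --title "$title" --fill
  run gh pr merge --auto --squash
}
```

Setting DRY_RUN=1 prints the commands instead of running them, which is a cheap way to check the sequence before touching a real branch.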
TDD and the pre-commit gate
- Tests land in the same commit as the implementation they cover.
- Always run before committing:
    cargo fmt
    cargo clippy --workspace --all-targets -- -D warnings
    cargo test --workspace
  Plus the markdown lint (cargo run -p cli -- validate --markdown-only --no-default-excludes <path>) if you touched any .md file.
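The gate above could be wrapped in a single function that adds the markdown lint only for changed .md paths. This is a hedged sketch: pre_commit_gate, run, and DRY_RUN are hypothetical names, and running the lint once per markdown path is an assumption about how <path> is meant to be supplied.

```shell
#!/usr/bin/env bash
# Sketch of the pre-commit gate as one function.
# pre_commit_gate, run, and DRY_RUN are illustrative names, not part of the repo.
set -euo pipefail

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: pre_commit_gate [changed-path ...]
pre_commit_gate() {
  run cargo fmt
  run cargo clippy --workspace --all-targets -- -D warnings
  run cargo test --workspace
  # Assumption: the markdown lint is invoked once per touched .md file.
  local path
  for path in "$@"; do
    case "$path" in
      *.md)
        run cargo run -p cli -- validate --markdown-only --no-default-excludes "$path"
        ;;
    esac
  done
}
```

Because of set -e, the first failing command stops the gate, so a later step never hides an earlier failure. The changed paths would typically come from git diff --cached --name-only, assuming none of them contain whitespace.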
CI/CD — three workflows, plus maintenance
GitHub Actions carries exactly three CI/CD workflows, one per trigger — do not fold new gate logic into a fourth. Periodic housekeeping is the one carve-out: it lives in a separate maintenance workflow on its own cron, outside the CI/CD path, so a retention change never lands in a release diff and a cleanup run never shares state with a deploy.
| Workflow | Trigger | Job |
|---|---|---|
| ci.yml | pull_request → main | lean fmt + clippy + cargo test --workspace |
| release-tag.yml | cron 05:00 PST | cut + push the YY.MM.DD tag |
| deploy.yml | YY.MM.DD tag push | integration → push images → Slack hand-off |
| cleanup.yml | cron 07:00 PST | prune ghcr versions > 14 days (maintenance) |
PR flow — ci.yml
Runs on every pull_request targeting main and on no other trigger. It never runs on push, so main itself runs no CI on merge (it
advances merge-only, and the heavy paths ride the release tag). Lean by design: a format check, a clippy pass with
warnings as errors, then the workspace test suite — nothing else. The job keeps target artifacts out of the cache,
disables CI debug info, and runs cargo clean between clippy and test so the standard hosted runner has enough disk.
One shared postgres:17-alpine container backs the whole job via TEST_DATABASE_URL (so store::test_support makes a
per-test schema in that single container instead of spawning a testcontainer per binary).
Integration/KIND/docker/browser work does not run here.
Cron flow — release-tag.yml
Fires daily at 05:00 PST (0 13 * * * UTC). Its only job is to cut a calendar release tag YY.MM.DD (e.g.
26.06.18 for 2026-06-18) and push it with a PAT (secrets.RELEASE_PAT) so the push re-triggers the tag flow below.
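The YY.MM.DD format can be illustrated with a small helper that turns an ISO date into the tag. This is only a sketch of the naming rule; release_tag is a hypothetical name, and how release-tag.yml actually computes the tag is not shown in this doc.

```shell
#!/usr/bin/env bash
# Sketch of the calendar-tag format. release_tag is an illustrative name,
# not the workflow's actual implementation.
set -euo pipefail

# Format an ISO date (YYYY-MM-DD) as a YY.MM.DD tag.
# Usage: release_tag 2026-06-18
release_tag() {
  local iso="$1"
  local year="${iso%%-*}"   # e.g. 2026
  local rest="${iso#*-}"    # e.g. 06-18
  # Drop the century from the year, keep month and day zero-padded as given.
  printf '%s.%s.%s\n' "${year#??}" "${rest%%-*}" "${rest#*-}"
}
```

With the doc's own example, release_tag 2026-06-18 prints 26.06.18. At 13:00 UTC the UTC and Pacific calendar dates coincide, so feeding it date -u +%Y-%m-%d would yield the same tag either way.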
Tag flow — deploy.yml
Triggered by the YY.MM.DD tag push. Runs the full KIND integration suite, then builds and pushes every image — the
two service images (navigator-web, navigator-workflows-service) and the five CronJob trigger images
(navigator-*-trigger) — to ghcr.io tagged with that date plus latest. On success it posts a "ready to
deploy" message to the engineering Slack channel (the prod ops incoming webhook, secrets.SLACK_WEBHOOK_URL, synced
from Doppler), tagging Nick with the exact power-push command to roll the new images to prod; a failure on any stage
posts a separate alert to the same channel, also tagging Nick. The images are published, not rolled out — see
Publish vs. roll out below.
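The "date plus latest" tagging can be sketched as a function that lists the image references for one release. Treat the details as assumptions: OWNER is a placeholder for the ghcr.io namespace, which this doc does not state, image_refs is a hypothetical name, and only the two service images are listed because the trigger image names are not spelled out here.

```shell
#!/usr/bin/env bash
# Sketch of the image references published for one release tag.
# OWNER and image_refs are illustrative; the real namespace is not given in this doc.
set -euo pipefail

REGISTRY_PREFIX="ghcr.io/OWNER"
SERVICE_IMAGES="navigator-web navigator-workflows-service"

# Print every reference for a release: each image under the dated tag and latest.
# Usage: image_refs 26.06.18
image_refs() {
  local tag="$1" image
  for image in $SERVICE_IMAGES; do
    printf '%s/%s:%s\n' "$REGISTRY_PREFIX" "$image" "$tag"
    printf '%s/%s:latest\n' "$REGISTRY_PREFIX" "$image"
  done
}
```

The trigger images would follow the same two-tag pattern. Listing references separately from pushing keeps the naming rule easy to inspect before anything is published.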
Maintenance flow — cleanup.yml
Separate from the CI/CD three, on its own cron and knowing nothing about tags. Fires daily at 07:00 PST (15:00 UTC)
— two hours after the tag cut, so the day's fresh images already exist — and prunes ghcr: it deletes every navigator-*
container version older than 14 days via
snok/container-retention-policy, authenticated with
secrets.RELEASE_PAT (the PAT's package scope is what drives the navigator-* wildcard; the short-lived GITHUB_TOKEN
cannot do that). latest and the recent dated tags are re-pushed daily by deploy.yml, so their versions stay under
the cutoff and only stale images are swept. It then posts a Slack summary, tagging Nick on failure. New scheduled
maintenance belongs here, not in a CI/CD workflow.
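The 14-day cutoff reduces to a simple age comparison. The sketch below illustrates the rule only; it is not how snok/container-retention-policy is implemented, and is_stale and RETENTION_DAYS are names invented for this example.

```shell
#!/usr/bin/env bash
# Sketch of the retention rule: a version is stale once it is older than 14 days.
# is_stale and RETENTION_DAYS are illustrative names, not the action's internals.
set -euo pipefail

RETENTION_DAYS=14

# Succeeds when a version last pushed at <pushed-epoch> is past the cutoff.
# Usage: is_stale <pushed-epoch-seconds> <now-epoch-seconds>
is_stale() {
  local pushed="$1" now="$2"
  [ $(( now - pushed )) -gt $(( RETENTION_DAYS * 86400 )) ]
}
```

A version re-pushed earlier the same day has an age of a few hours, so it can never cross the cutoff, which is why the daily re-push keeps latest out of the sweep.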
Publish vs. roll out
The tag flow publishes dated images to ghcr.io; it does not roll them onto the cluster. There is no automatic production rollout, by design — promoting a dated image to prod is a separate, deliberate, operator-driven step. This keeps every cluster mutation in the hands of a human at a trusted, authenticated workstation: GitHub Actions holds no GCP credential, no cluster access, and no path to write to prod.
The manual deploy
When the "ready to deploy" Slack message lands (the green-deploy hand-off from deploy.yml), an operator rolls the
published image onto the GKE cluster with power-push — this is the exact command the Slack message hands you, with the
date filled in:
doppler run --project navigator --config prd -- \
cargo run --release -p cli -- power-push --tag YY.MM.DD
power-push builds nothing — the images already exist from the tag flow. It resolves the published tag, confirms
the prod Secret satisfies the new binary's boot invariants, pins both deployments (navigator-web and
workflows-service) plus the trigger CronJobs to that tag, rolls them out together, and re-registers the worker with
Restate. The full recipe — the pre-roll Secret check, the manifest-drift guard, and the no-rebuild restart path for a
bare secret rotation — lives in cloud-operations.md. The cluster's pull-based, credential-free
image delivery is documented in gke-prod.md.
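Since the operator fills in the date by hand, a thin wrapper could reject a malformed tag before anything reaches the cluster. This is a hypothetical convenience, not part of the repo: is_release_tag and roll_out are invented names, and the doppler and power-push invocation is copied from the command above.

```shell
#!/usr/bin/env bash
# Sketch of a guard around the manual deploy command.
# is_release_tag and roll_out are illustrative names, not part of the repo.
set -euo pipefail

# Succeeds when the argument has the YY.MM.DD shape (digits only).
is_release_tag() {
  case "$1" in
    [0-9][0-9].[0-9][0-9].[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Refuse to call power-push unless the tag looks like a calendar release.
# Usage: roll_out 26.06.18
roll_out() {
  local tag="$1"
  if ! is_release_tag "$tag"; then
    echo "not a YY.MM.DD tag: $tag" >&2
    return 1
  fi
  doppler run --project navigator --config prd -- \
    cargo run --release -p cli -- power-push --tag "$tag"
}
```

The check is purely about shape: it catches a pasted placeholder such as YY.MM.DD or a stray latest, while power-push itself remains responsible for resolving whether the tag was actually published.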
Forks that run a GitOps controller (Config Sync, Argo CD, Flux) can let the controller reconcile the overlay instead of
running power-push by hand; this repo's production roll is the manual power-push above.