Cloud operations

This page replaces the old private cloud runbooks with one common operating model. Public docs are the shared surface every LLM and human maintainer should read first.

Navigator is wired for GCP in production but stays provider-agnostic in code. The production path uses GKE Autopilot, Cloud SQL for Postgres, GCS, Secret Manager, Cloud Logging, Cloud Trace, BigQuery billing export, and Restate Cloud. The application code keeps the cloud boundary behind traits, protocols, and env vars: cloud::StorageService, SeaORM/Postgres, OIDC, OPA, Restate, SendGrid, Kubernetes, and web::agent_router::AgentRouter.

Former private-runbook coverage

The collapse rule is simple: durable policy, invariants, architecture, and operator recipes live in docs/.

Local development

The standard local loop is KIND through the navigator CLI:

cargo run --release -p cli -- start-dev-server
set -a; source .devx/env; set +a
cargo run -p web
cargo run --release -p cli -- down

start-dev-server brings up Postgres, Keycloak, fake-gcs-server, OPA, Restate, workflows-service, and Grafana LGTM in KIND, then writes .devx/env for the host-side web process.

Cursor Cloud is the exception documented in ../AGENTS.md: KIND does not work there because the VM cannot initialize the KIND node's cgroup stack under fuse-overlayfs. On that VM, use the standalone-container recipe in ../AGENTS.md and the baked local Postgres.

Scratch artifacts go under /tmp, never the repo. Screenshots normally go under /tmp/navigator-screenshots/.

GCP setup

navigator gcp setup provisions GCP by calling REST APIs directly from cli/src/devx/gcp/ with reqwest. There is no gcloud shell-out for the setup pipeline and no broad Google SDK wrapper. That is deliberate: raw REST gives the CLI a single dry-run intercept point and keeps endpoint behavior testable with wiremock.
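To make the "single dry-run intercept point" concrete, here is a minimal sketch using only the standard library. `PlannedRequest`, `Executor`, and the example URL are hypothetical names for illustration, not the types in cli/src/devx/gcp/, and the `send` closure stands in for the reqwest call.

```rust
/// Hypothetical description of one REST call a setup step wants to make.
#[derive(Debug, Clone, PartialEq)]
pub struct PlannedRequest {
    pub method: &'static str,
    pub url: String,
}

/// Hypothetical executor: the one place that decides whether a call is sent.
pub struct Executor {
    pub dry_run: bool,
    /// Every request that passed through `execute`, sent or not.
    pub planned: Vec<PlannedRequest>,
}

impl Executor {
    pub fn new(dry_run: bool) -> Self {
        Self { dry_run, planned: Vec::new() }
    }

    /// Records the request, then calls `send` unless dry-run is on.
    /// Returns true when the request was actually sent.
    pub fn execute<F>(&mut self, req: PlannedRequest, send: F) -> bool
    where
        F: FnOnce(&PlannedRequest),
    {
        self.planned.push(req.clone());
        if self.dry_run {
            return false;
        }
        send(&req);
        true
    }
}
```

Because every step goes through one function, a dry run can print `planned` without any step needing its own dry-run branch.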

When touching cli/src/devx/gcp/, keep four things correct:

Every step follows the same conventions:

When an endpoint drifts, update the module's wiremock test to match Google's current docs first, then update the implementation and run the dry-run command from oss-install.md.

Production deploy

Code reaches production through PRs and dated images:

  1. Merge through the normal PR flow in gitops.md.
  2. The release-tag workflow cuts a YY.MM.DD tag.
  3. The deploy workflow builds and publishes both images to ghcr.io: navigator-web and navigator-workflows-service.
  4. An operator rolls GKE onto the published tag.

Always roll navigator-web and workflows-service together at the same YY.MM.DD tag. Version skew between the web surface and the durable worker is an avoidable production risk.
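The same-tag rule is easy to check mechanically before a rollout. The sketch below is one possible pre-flight check, not part of the navigator CLI. It assumes zero-padded two-digit fields in the YY.MM.DD tag and image references without a registry port; the `ghcr.io/example/...` paths in the usage line are placeholders.

```rust
/// Returns true if `tag` looks like YY.MM.DD: three two-digit fields
/// with a plausible month and day. Zero-padding is an assumption.
pub fn is_dated_tag(tag: &str) -> bool {
    let parts: Vec<&str> = tag.split('.').collect();
    if parts.len() != 3
        || parts
            .iter()
            .any(|p| p.len() != 2 || !p.bytes().all(|b| b.is_ascii_digit()))
    {
        return false;
    }
    let month: u32 = parts[1].parse().unwrap_or(0);
    let day: u32 = parts[2].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Extracts the text after the last ':' of an image reference.
/// Simplification: does not handle registries with a port and no tag.
pub fn image_tag(image: &str) -> Option<&str> {
    image.rsplit_once(':').map(|(_, tag)| tag)
}

/// True only when both images carry the same valid dated tag.
pub fn same_dated_tag(web_image: &str, workflows_image: &str) -> bool {
    match (image_tag(web_image), image_tag(workflows_image)) {
        (Some(a), Some(b)) => a == b && is_dated_tag(a),
        _ => false,
    }
}
```

For example, `same_dated_tag("ghcr.io/example/navigator-web:25.01.15", "ghcr.io/example/navigator-workflows-service:25.01.14")` returns false, which flags the skew before anything is rolled.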

Before a rollout, check the new binary's required env/secret keys against the live production Secret. web enforces boot invariants and crash-loops loudly when a required key is missing. If the image tag is unchanged and only a Secret changed, restart the deployments so pods re-read envFrom.
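The pre-rollout key check amounts to a set difference between what the new binary requires and what the live Secret provides. A minimal sketch follows; the key names in the usage line are hypothetical, and how the two lists are obtained (from the binary and from the cluster) is outside this example.

```rust
use std::collections::BTreeSet;

/// Returns the required keys that are absent from the Secret,
/// sorted and de-duplicated. Only key names are compared; values
/// are never read, so nothing secret passes through this function.
pub fn missing_keys<'a>(required: &[&'a str], present: &[&str]) -> Vec<&'a str> {
    let present: BTreeSet<&str> = present.iter().copied().collect();
    let mut missing: Vec<&'a str> = required
        .iter()
        .copied()
        .filter(|k| !present.contains(*k))
        .collect();
    missing.sort_unstable();
    missing.dedup();
    missing
}
```

With `required = ["DATABASE_URL", "OIDC_ISSUER"]` and `present = ["DATABASE_URL"]`, the function returns `["OIDC_ISSUER"]`; an empty result means the rollout will not crash-loop on a missing key.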

Run production cluster commands under the production secret context. Never paste real secret values into chat, docs, commits, or PR bodies.

Production database

Production is Cloud SQL for Postgres. Ad-hoc access goes through cloud-sql-proxy with IAM service-account impersonation, not password shortcuts.

Read-only SELECTs are allowed when the user asks for inspection. Before any INSERT, UPDATE, DELETE, or DDL:

The canonical seed is idempotent: it inserts missing rows and does not update existing production rows. A live data fix needs a guarded update, a migration, or an app seam.
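The insert-only behavior can be illustrated with an in-memory map standing in for a table. This is a sketch of the semantics only, not the seed's implementation; `apply_seed` and the row keys are hypothetical. In Postgres the same shape is commonly written as INSERT ... ON CONFLICT DO NOTHING.

```rust
use std::collections::BTreeMap;

/// Inserts seed rows whose key is absent and leaves existing rows
/// untouched. Returns the number of rows inserted.
pub fn apply_seed(table: &mut BTreeMap<String, String>, seed: &[(&str, &str)]) -> usize {
    let mut inserted = 0;
    for (key, value) in seed {
        // An existing row wins, even if its value differs from the seed.
        if !table.contains_key(*key) {
            table.insert(key.to_string(), value.to_string());
            inserted += 1;
        }
    }
    inserted
}
```

Running the seed twice inserts nothing the second time, and a row that was edited in production keeps its edited value. That is why the seed cannot be used to correct live data.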

Spend reporting

Report GCP spend from the BigQuery Cloud Billing export, not console guesses or rate-card math. Always show:

Discover the project from env and the billing table from BigQuery. Do not hard-code the table names that the billing account generates into docs or code.

Observability

Every service binary emits through telemetry::init("navigator-<name>"). With no OTEL_EXPORTER_OTLP_ENDPOINT, logs stay human-readable on stdout. With the endpoint set, logs become JSON and traces/metrics export through OTLP.
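The endpoint switch described above can be sketched as a pure function of the env var's value. `LogMode` and `log_mode` are hypothetical names, not the internals of telemetry::init, and treating an empty or whitespace-only value as unset is an assumption of this sketch.

```rust
/// Hypothetical summary of the two output modes.
#[derive(Debug, PartialEq)]
pub enum LogMode {
    /// Human-readable logs on stdout, nothing exported.
    HumanStdout,
    /// JSON logs, with traces and metrics exported through OTLP.
    JsonWithOtlp { endpoint: String },
}

/// Chooses the mode from the value of OTEL_EXPORTER_OTLP_ENDPOINT.
/// Assumption: an empty or blank value counts as "not set".
pub fn log_mode(otlp_endpoint: Option<&str>) -> LogMode {
    match otlp_endpoint.map(str::trim) {
        Some(ep) if !ep.is_empty() => LogMode::JsonWithOtlp { endpoint: ep.to_string() },
        _ => LogMode::HumanStdout,
    }
}
```

A caller would pass `std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok().as_deref()`; keeping the decision separate from the env read makes both branches easy to test.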

The load-bearing rule is:

Identifiers and counts, never content.

Safe telemetry fields include ids, service names, outcomes, durations, status codes, and counts. Unsafe fields include client names, email addresses, answer bodies, document bodies, privileged facts, and full request or tool arguments. This rule applies in local Grafana LGTM, Cloud Logging, Cloud Trace, BigQuery, and any future sink.
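One way to apply the rule in code is an allowlist applied before an event reaches any sink. The sketch below is illustrative, not the project's mechanism; the field names in `SAFE_FIELDS` are hypothetical examples of the safe categories listed above.

```rust
/// Hypothetical allowlist: identifiers, outcomes, durations, and counts only.
const SAFE_FIELDS: &[&str] = &[
    "request_id",
    "service",
    "outcome",
    "duration_ms",
    "status_code",
    "count",
];

/// Keeps only fields whose name is on the allowlist, preserving order.
/// Anything not explicitly listed is dropped, so a new content-bearing
/// field is excluded by default.
pub fn safe_fields<'a>(fields: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    fields
        .iter()
        .copied()
        .filter(|(name, _)| SAFE_FIELDS.iter().any(|safe| safe == name))
        .collect()
}
```

An allowlist is chosen here over a denylist because the unsafe categories are open-ended, while the safe ones are few and stable.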

Use navigator doctor, Cloud Logging/BigQuery, the Restate console, and the six-hourly Heartbeat email to debug missing periodic jobs or durable workflow failures. The architecture details live in observability.md and durable-workflows.md.

Website publication

Top-level files in docs/ are already published at /docs/:slug by web::docs. The site bakes the docs into the binary with include_str!, renders markdown under the Foundation brand, and rewrites top-level doc links to site routes. That gives every maintainer and LLM the same documentation surface.
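The link rewrite can be pictured as a small pure function. This sketch is not the code in web::docs; it assumes the slug is the file name without its .md extension and that any fragment is preserved, and `rewrite_doc_link` is a hypothetical name.

```rust
/// Maps a top-level doc link such as "observability.md" to "/docs/observability".
/// Returns None for links this sketch leaves alone: external URLs,
/// paths containing a directory, and files that do not end in ".md".
pub fn rewrite_doc_link(href: &str) -> Option<String> {
    // Split off an optional "#fragment" so it can be re-attached.
    let (path, fragment) = match href.split_once('#') {
        Some((p, f)) => (p, Some(f)),
        None => (href, None),
    };
    // Only bare file names count as top-level docs.
    if path.contains('/') || path.contains(':') {
        return None;
    }
    let slug = path.strip_suffix(".md")?;
    if slug.is_empty() {
        return None;
    }
    Some(match fragment {
        Some(f) => format!("/docs/{slug}#{f}"),
        None => format!("/docs/{slug}"),
    })
}
```

Under these assumptions, a link to gitops.md becomes /docs/gitops, while ../AGENTS.md is left untouched because it is not a top-level file in docs/.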

Good next steps for the website: