Durable workflows
How Navigator runs long-lived, crash-safe work — retainer intake, Drive sync, the nightly Archives backup — on Restate, and how an operator tells why one didn't run.
The one rule that costs two hours when forgotten: a registered Restate deployment is a snapshot, not a subscription. Rolling a new worker image does not re-register it. A service you just added (or its new handlers) stays invisible at the ingress —
404 "service not found"— until you re-register the deployment. See The registration gotcha.The mental model: Kubernetes owns the clock; Restate owns the journal. A trigger fires an invocation once; Restate makes its execution durable — journals every step, retries failures, runs it to completion on the worker.
Two sides: submit vs. run
Durable execution is split across two crates so the rest of the workspace never binds to restate-sdk.
workflows (lib) | workflows-service (bin) | |
|---|---|---|
| Role | Outbound — submit a job | Inbound — run the handlers |
| Who calls it | web and the archives trigger | Restate dials into it |
| Runtime | InMemoryRuntime (dev/CI) or RestateRuntime | the worker itself |
restate-sdk | no | yes — the only crate with it |
| Tested via | wiremock (exact HTTP shape) | cargo test -p workflows-service |
One worker pod hosts every service — new workflows bind onto the same endpoint, never a new pod. Today that worker
serves one virtual object — notation (questionnaire + workflow timelines on one journal) — and the durable workflows
Archives, Statutes, Heartbeat, BillingCanary, MatterCloseInvoice, RecurringBilling, and ReconcileInvoices.
The exact set is the single source of truth in workflows_service::registry, whose tests assert every workflow name is
PascalCase (template filenames follow the separate snake_case convention N103 enforces) and that the registry never
drifts from the worker's actual .bind(...) calls. In the reference deploy the worker runs behind
workflows.your-domain.example (Restate worker + Envoy sidecar).
The runtime is chosen by RESTATE_BROKER_URL: unset means in-process / in-memory, so KIND works with zero config; set
means the RestateRuntime adapter posts to the broker over HTTP. The same selection is used in web::main,
web::drive_sync, and the archives trigger.
Three ways a workflow starts
Every workflow is kicked off in exactly one of three ways. All three land on the same worker.
| Mode | Fired by | Example | Code |
|---|---|---|---|
| Event-driven | web, on a user action | retainer intake; Drive sync | web::retainer_walk |
| Scheduled | a Kubernetes CronJob | Archives; statutes; canary | archives/statutes/billing-workflows |
| Manual | an admin button | POST /portal/admin/archives/run | web::archives |
The submit shape is identical in all three: POST {ingress}/{Service}/{key}/run (append /send for one-way), with the
optional bearer. The shared helper is workflows::start_workflow.
Where the schedule lives
Restate has no cron. The nightly schedule is a Kubernetes CronJob named archives-trigger in the navigator
namespace — stored in the cluster (etcd), evaluated by the kube-controller-manager, sourced from
examples/deploy/k8s/exports/cron-archives-trigger.yaml (schedule: "0 10 * * *", UTC, = 02:00 PST). Each firing runs
the thin navigator-archives-trigger image — one POST to the ingress, then it exits. Restate owns the retry schedule
from there.
kube CronJob (0 10 * * *) --fires--> trigger pod --POST /Archives/<date>/run/send--> Restate ingress
| Accepted
v
worker runs: snapshot -> cost -> notify (journaled)
To inspect or fire the schedule by hand:
kubectl -n navigator get cronjob archives-trigger
kubectl -n navigator create job --from=cronjob/archives-trigger archives-trigger-manual-001
Idempotency is the workflow key
Restate admits at most one invocation per workflow key. The key choice is the idempotency policy:
- Nightly Archives keys on the UTC run date, so a double-fire on the same day is a silent no-op — exactly what a backup wants.
- Manual runs key on a unique
manual-<uuid>, so every click actually executes and notifies — a test button that deduped against the nightly run would look broken.
Auth: two tokens, two ports
The single most error-prone area: there are two different credentials on two different ports, and conflating them is what silently broke prod.
- Submit / trigger an invocation — ingress, port
:8080. Credential:RESTATE_AUTH_TOKEN, the Restate Cloudkey_…API key (72 chars). Lives in the k8s secretnavigator-web-secrets, sourced from Dopplerprd. - Register a deployment — admin API, port
:9070. Credential: the SSO access token (a long JWT written byrestate cloud login), in~/.config/restate/config.tomlunder[global.cloud] access_token.
The secret takes the ingress key_, never the SSO JWT — they look nothing alike. If navigator-web-secrets holds a
long eyJ… JWT, it is wrong and the ingress answers 401 Unauthenticated. Doppler prd is the source of truth
(see secrets in Doppler). We once had this token drift across three places — Doppler key_,
Secret Manager stub, and a stale SSO JWT in the k8s secret — which is the failure this section exists to prevent.
The registration gotcha
Restate routes the ingress to registered services. Registration is a snapshot of the worker's handler list at register time — it does not follow new deploys.
- In KIND,
cargo run -p cli -- restate registerwires the worker URL into the in-cluster broker, so the dev loop just works. - In Restate Cloud, registration is an explicit admin operation. Rolling a new worker image does not re-register, so a service or handler added since the last registration is invisible at the ingress:
POST :8080/Archives/<key>/run/send
404 {"message":"service 'Archives' not found, make sure to register the service before calling it."}
Fix — re-register the deployment (re-runs discovery against the live worker and picks up every service). Either use the Restate Cloud console (your env, Deployments, Register deployment, overwrite the existing endpoint), or the admin REST API authenticated with the SSO token:
ADMIN="https://<env>.env.<region>.restate.cloud:9070"
TOK=$(sed -n 's/^access_token = "\(.*\)"/\1/p' ~/.config/restate/config.toml | head -1)
# dry-run first — confirm the discovered service list before committing:
curl -s -X POST "$ADMIN/deployments" -H "Authorization: Bearer $TOK" \
-d '{"uri":"https://workflows.your-domain.example/","force":true,"dry_run":true}' | jq '.services[].name'
# then commit (drop dry_run):
curl -s -X POST "$ADMIN/deployments" -H "Authorization: Bearer $TOK" \
-d '{"uri":"https://workflows.your-domain.example/","force":true}'
The restate CLI is configured for the env but may report "Unable to connect" to :9070 even when the host is
reachable; the admin REST API above works directly with the same SSO token.
How power-push re-registers (step 7d)
After rolling both deployments, power-push re-registers the worker so any handler added since the last registration is
reachable. Two design points:
- It targets the real worker URL, not the placeholder. The
navigatorCLI resolves the register target in precedence order: explicit--url→NAVIGATOR_WORKFLOWS_URL→ derivedhttps://workflows.<NAVIGATOR_PRIMARY_DOMAIN>/→ theworkflows.example.complaceholder of last resort. The derivation step exists because the 2026-06-10 ship had a domain configured but no explicitNAVIGATOR_WORKFLOWS_URL, fell through to the placeholder, and silently no-op'd the register. Underdoppler run --config prdthe explicit URL is now injected; the derivation is the belt-and-suspenders default for an operator who hasn't set it. - It picks its transport from the environment. When
RESTATE_ADMIN_URLandRESTATE_ADMIN_TOKENare both set (wired in Dopplerprd),restate_registerPOSTs{"uri":<worker>,"force":true}straight to the admin REST API — headless, needs norestate cloud env configure(which requires a TTY) and works with a non-expiring admin-scoped API key. Otherwise it shells out to the pinnedrestateCLI, which only reaches Restate Cloud when the operator has a selected environment in~/.config/restate/config.tomland a fresh SSO token; with neither, the CLI defaults to thelocalenvironment (localhost:9070) and the step fails. Wiring the two env vars is what makes an unattended ship re-register reliably. - It warns and continues — it does not gate the ship (yet). A
:9070admin endpoint that is firewalled from the operator's network, or an expired SSO token, would otherwise block every ship. Re-register is therefore best-effort: a failure printsWARN: Restate re-register failed (continuing)and the ship completes. The cost of warn-and-continue is a silent404on a newly added service until someone re-registers by hand; theNotation(retainer) service is already registered and unaffected. Flip step 7d to fail-the-ship only once:9070reachability is proven from the ship host and an admin-scoped token (RESTATE_ADMIN_TOKEN+RESTATE_ADMIN_URLin Dopplerprd) is wired — until then a hard gate trades a rare silent 404 for a frequent blocked ship.
Adding a workflow
- Author the spec in a notation template's
workflow:frontmatter (see notation authoring) or, for non-notation flows, bind a new Restate service inworkflows-service. - Signal it from
web(event-driven) or add a trigger (scheduled / manual). - Ship the worker — see GKE production and cloud operations. Always ship both
navigator-webandworkflows-serviceat one SHA. - Re-register the deployment (above) — otherwise the new service
404s at the ingress no matter how clean the deploy was. This step is invisible inkubectland easy to forget.
The heartbeat: proving the engine itself is alive
Every other scheduled workflow proves an integration — Archives proves the database and GCS are reachable,
BillingCanary proves Xero still agrees with us. None answers the bluntest operator question: is durable execution
itself alive right now? A silent Archives is ambiguous (engine down, or just a GCS outage?).
Heartbeat removes the ambiguity. It is a two-step Restate workflow (beat → notify) that depends on nothing —
no database, no object storage, no third-party API — so a green run can only mean the engine accepted an invocation,
journaled step one, and ran step two to completion. It fires every six hours (0 */6 * * * UTC), keyed on the UTC
date + hour so the four daily runs each get a distinct workflow key (a date-only key would dedupe three of four into
no-ops). Each run posts firm ops a "Where to look" notice to the engineering Slack channel carrying the exact
Restate Cloud + GCP console links and the kubectl/curl chain below — so the same notice that confirms health onboards
whoever debugs its absence. (Ops notices go to Slack only; the duplicate email was dropped once Slack proved itself.)
The signal that matters most is the missing one: a six-hour window with no heartbeat notice in Slack means the engine
may be down — walk the chain below. Like every new service, Heartbeat is invisible at the ingress until the
deployment is re-registered (see the registration gotcha); the absence of its first
notice after a ship is itself the test that re-register happened.
Debugging "the workflow didn't run"
Work down the chain; the break is almost always near the top:
- Did the trigger fire?
kubectl -n navigator get cronjob archives-trigger(last schedule) plus the trigger pod logs. For manual: did the admin button return the confirmation page or an error? - Did the ingress accept it? A
401is a wrong or staleRESTATE_AUTH_TOKEN; a404 service not foundis the registration gotcha — both have dedicated sections above. - Did the worker run it? Check the invocation in the Restate Cloud console (Invocations) or via the admin API; a failing step retries and surfaces there.
- Did the side effect happen? Email transmits through SendGrid only when
NAVIGATOR_EMAIL_BACKEND=sendgridandSENDGRID_API_KEYare present; otherwise the worker silently uses a capturing backend that logs "sent" without sending. See cloud operations for manifest-drift notes.
See also
- The what of each individual workflow: notation, retainer intake, Nautilus workflows.
- Scheduling any periodic job (the CronJob pattern, both flavors): Scheduled jobs.
- Deploy and secret mechanics: GKE production, secrets in Doppler.
- Crate entry points:
workflows/README.mdandworkflows-service/README.md.