Durable workflows

How Navigator runs long-lived, crash-safe work — retainer intake, Drive sync, the nightly Archives backup — on Restate, and how an operator tells why one didn't run.

The one rule that costs two hours when forgotten: a registered Restate deployment is a snapshot, not a subscription. Rolling a new worker image does not re-register it. A service you just added (or its new handlers) stays invisible at the ingress — 404 "service not found" — until you re-register the deployment. See The registration gotcha.

The mental model: Kubernetes owns the clock; Restate owns the journal. A trigger fires an invocation once; Restate makes its execution durable — journals every step, retries failures, runs it to completion on the worker.

Two sides: submit vs. run

Durable execution is split across two crates so the rest of the workspace never binds to restate-sdk.

| | workflows (lib) | workflows-service (bin) |
|---|---|---|
| Role | Outbound: submit a job | Inbound: run the handlers |
| Who calls it | web and the archives trigger | Restate dials into it |
| Runtime | InMemoryRuntime (dev/CI) or RestateRuntime | the worker itself |
| restate-sdk | no | yes (the only crate with it) |
| Tested via | wiremock (exact HTTP shape) | cargo test -p workflows-service |

One worker pod hosts every service — new workflows bind onto the same endpoint, never a new pod. Today that worker serves one virtual object — notation (questionnaire + workflow timelines on one journal) — and the durable workflows Archives, Statutes, Heartbeat, BillingCanary, MatterCloseInvoice, RecurringBilling, and ReconcileInvoices. The exact set is the single source of truth in workflows_service::registry, whose tests assert every workflow name is PascalCase (template filenames follow the separate snake_case convention N103 enforces) and that the registry never drifts from the worker's actual .bind(...) calls. In the reference deploy the worker runs behind workflows.your-domain.example (Restate worker + Envoy sidecar).
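The two registry invariants can be sketched as below. This is an illustration only, not the code in workflows_service::registry: the function names, the exact definition of PascalCase, and the shape of the drift check are assumptions.

```rust
/// Workflow names as listed in this document; the authoritative list
/// lives in `workflows_service::registry`.
const WORKFLOWS: [&str; 7] = [
    "Archives",
    "Statutes",
    "Heartbeat",
    "BillingCanary",
    "MatterCloseInvoice",
    "RecurringBilling",
    "ReconcileInvoices",
];

/// Assumed definition of PascalCase: non-empty, ASCII alphanumeric only,
/// first character an uppercase letter. The real test may be stricter.
fn is_pascal_case(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first.is_ascii_uppercase() => chars.all(|c| c.is_ascii_alphanumeric()),
        _ => false,
    }
}

/// Returns (names in the registry but not bound, names bound but not in the registry).
/// Both lists are empty when the registry and the worker agree.
fn drift<'a>(registry: &[&'a str], bound: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let unbound = registry.iter().copied().filter(|n| !bound.contains(n)).collect();
    let unregistered = bound.iter().copied().filter(|n| !registry.contains(n)).collect();
    (unbound, unregistered)
}
```

Every name listed above passes is_pascal_case, while a snake_case name such as billing_canary does not.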

The runtime is chosen by RESTATE_BROKER_URL: unset means in-process / in-memory, so KIND works with zero config; set means the RestateRuntime adapter posts to the broker over HTTP. The same selection is used in web::main, web::drive_sync, and the archives trigger.
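A minimal sketch of that selection rule, assuming a hypothetical RuntimeChoice enum standing in for the real InMemoryRuntime and RestateRuntime types (the actual constructors in the workflows crate are not shown in this document):

```rust
/// Hypothetical stand-in for the two runtimes named above.
#[derive(Debug, PartialEq)]
enum RuntimeChoice {
    /// In-process / in-memory: what dev, CI and KIND get with zero config.
    InMemory,
    /// Posts to the Restate broker over HTTP.
    Restate { broker_url: String },
}

/// Pure selection rule: unset means in-memory, set means Restate.
fn choose_runtime(broker_url: Option<String>) -> RuntimeChoice {
    match broker_url {
        None => RuntimeChoice::InMemory,
        Some(url) => RuntimeChoice::Restate { broker_url: url },
    }
}

/// Reads RESTATE_BROKER_URL from the process environment.
/// `std::env::var` errors when the variable is unset (or not valid Unicode),
/// and `.ok()` maps that case to `None`.
fn runtime_from_env() -> RuntimeChoice {
    choose_runtime(std::env::var("RESTATE_BROKER_URL").ok())
}
```

Keeping the rule in a pure function is one way to let web::main, web::drive_sync, and the archives trigger share it and test it without touching the environment.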

Three ways a workflow starts

Every workflow is kicked off in exactly one of three ways. All three land on the same worker.

| Mode | Fired by | Example | Code |
|---|---|---|---|
| Event-driven | web, on a user action | retainer intake; Drive sync | web::retainer_walk |
| Scheduled | a Kubernetes CronJob | Archives; statutes; canary | archives/statutes/billing-workflows |
| Manual | an admin button | POST /portal/admin/archives/run | web::archives |

The submit shape is identical in all three: POST {ingress}/{Service}/{key}/run (append /send for one-way), with the optional bearer. The shared helper is workflows::start_workflow.
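The URL shape can be sketched as a small helper. This is not the body of workflows::start_workflow; the function name is illustrative, and it assumes the service name and key are already URL-safe (no percent-encoding is applied).

```rust
/// Builds the submit URL: {ingress}/{Service}/{key}/run, with /send appended
/// for a one-way (fire-and-forget) submit.
fn submit_url(ingress: &str, service: &str, key: &str, one_way: bool) -> String {
    // Tolerate a trailing slash on the configured ingress base URL.
    let base = ingress.trim_end_matches('/');
    let suffix = if one_way { "/send" } else { "" };
    format!("{base}/{service}/{key}/run{suffix}")
}
```

For example, submit_url("http://localhost:8080", "Archives", "2024-05-01", true) yields http://localhost:8080/Archives/2024-05-01/run/send, the same path the archives trigger posts to.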

Where the schedule lives

Restate has no cron. The nightly schedule is a Kubernetes CronJob named archives-trigger in the navigator namespace — stored in the cluster (etcd), evaluated by the kube-controller-manager, sourced from examples/deploy/k8s/exports/cron-archives-trigger.yaml (schedule: "0 10 * * *", UTC, = 02:00 PST). Each firing runs the thin navigator-archives-trigger image — one POST to the ingress, then it exits. Restate owns the retry schedule from there.
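Because the schedule is fixed in UTC, the local firing time moves with daylight saving. A small sketch of the arithmetic (the function name is illustrative):

```rust
/// Converts an hour of day in UTC to local time for a whole-hour UTC offset.
/// `rem_euclid` keeps the result in 0..24 even when the sum is negative.
fn local_hour(utc_hour: i32, utc_offset_hours: i32) -> i32 {
    (utc_hour + utc_offset_hours).rem_euclid(24)
}
```

With the 10:00 UTC schedule, local_hour(10, -8) is 2 (02:00 PST) and local_hour(10, -7) is 3 (03:00 PDT).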

kube CronJob (0 10 * * *) --fires--> trigger pod --POST /Archives/<date>/run/send--> Restate ingress
                                                                                          | Accepted
                                                                                          v
                                                          worker runs: snapshot -> cost -> notify (journaled)

To inspect or fire the schedule by hand:

kubectl -n navigator get cronjob archives-trigger
kubectl -n navigator create job --from=cronjob/archives-trigger archives-trigger-manual-001

Idempotency is the workflow key

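Restate admits at most one invocation per workflow key, so the key choice is the idempotency policy. Archives is keyed on the date (/Archives/<date>/run/send), so a repeat trigger for the same date is a no-op; Heartbeat is keyed on the UTC date + hour, so each of its four daily runs is admitted.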
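The at-most-once rule can be pictured with a toy model. This is only an illustration of the semantics, not how Restate stores or checks keys; the Admission type is hypothetical.

```rust
use std::collections::HashSet;

/// Toy model of "at most one invocation per workflow key".
struct Admission {
    seen: HashSet<String>,
}

impl Admission {
    fn new() -> Self {
        Admission { seen: HashSet::new() }
    }

    /// Returns true when the (service, key) pair is new and the invocation is
    /// admitted; false when that key was already used and the submit is a no-op.
    fn submit(&mut self, service: &str, key: &str) -> bool {
        // `HashSet::insert` returns false if the value was already present.
        self.seen.insert(format!("{service}/{key}"))
    }
}
```

Submitting Archives twice with the same date key admits only the first; a new date is admitted again.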

Auth: two tokens, two ports

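The single most error-prone area: there are two different credentials on two different ports, and conflating them is what silently broke prod. The ingress (:8080) takes the key_ token (RESTATE_AUTH_TOKEN); the admin API (:9070) takes the SSO token.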

The secret takes the ingress key_, never the SSO JWT — they look nothing alike. If navigator-web-secrets holds a long eyJ… JWT, it is wrong and the ingress answers 401 Unauthenticated. Doppler prd is the source of truth (see secrets in Doppler). We once had this token drift across three places — Doppler key_, Secret Manager stub, and a stale SSO JWT in the k8s secret — which is the failure this section exists to prevent.
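Since the two credentials look so different, a preflight check can catch the swap before a deploy. The sketch below is a heuristic under stated assumptions: the key_ and eyJ prefixes come from this section, the three-segment test is the standard JWT layout, and the type and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum TokenKind {
    /// Looks like the ingress credential (starts with `key_`).
    IngressKey,
    /// Looks like a JWT: starts with `eyJ` and has three dot-separated segments.
    SsoJwt,
    /// Neither shape; treat as misconfigured.
    Unknown,
}

/// Classifies a token by shape only. It does not validate the token.
fn classify_token(token: &str) -> TokenKind {
    let t = token.trim();
    if t.starts_with("key_") {
        TokenKind::IngressKey
    } else if t.starts_with("eyJ") && t.matches('.').count() == 2 {
        TokenKind::SsoJwt
    } else {
        TokenKind::Unknown
    }
}
```

A check like this could run against the value destined for navigator-web-secrets and refuse anything that is not IngressKey.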

The registration gotcha

Restate routes the ingress to registered services. Registration is a snapshot of the worker's handler list at register time — it does not follow new deploys.

POST :8080/Archives/<key>/run/send
404 {"message":"service 'Archives' not found, make sure to register the service before calling it."}

Fix — re-register the deployment (re-runs discovery against the live worker and picks up every service). Either use the Restate Cloud console (your env, Deployments, Register deployment, overwrite the existing endpoint), or the admin REST API authenticated with the SSO token:

ADMIN="https://<env>.env.<region>.restate.cloud:9070"
TOK=$(sed -n 's/^access_token = "\(.*\)"/\1/p' ~/.config/restate/config.toml | head -1)
# dry-run first — confirm the discovered service list before committing:
curl -s -X POST "$ADMIN/deployments" -H "Authorization: Bearer $TOK" \
  -d '{"uri":"https://workflows.your-domain.example/","force":true,"dry_run":true}' | jq '.services[].name'
# then commit (drop dry_run):
curl -s -X POST "$ADMIN/deployments" -H "Authorization: Bearer $TOK" \
  -d '{"uri":"https://workflows.your-domain.example/","force":true}'

The restate CLI is configured for the env but may report "Unable to connect" to :9070 even when the host is reachable; the admin REST API above works directly with the same SSO token.
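The dry-run exists so the discovered service list can be compared with what the worker is expected to serve before committing. A sketch of that comparison, assuming the names have already been extracted from the dry-run response (the function name is illustrative):

```rust
/// Returns the expected services that discovery did not report.
/// An empty result means the dry-run saw everything and it is safe to commit.
fn missing_services<'a>(expected: &[&'a str], discovered: &[&str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|name| !discovered.iter().any(|d| d == name))
        .collect()
}
```

If a newly added service shows up in the result, the live worker is not serving it yet and re-registering will not help until the right image is running.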

How power-push re-registers (step 7d)

After rolling both deployments, power-push re-registers the worker so any handler added since the last registration is reachable. Two design points:

Adding a workflow

  1. Author the spec in a notation template's workflow: frontmatter (see notation authoring) or, for non-notation flows, bind a new Restate service in workflows-service.
  2. Signal it from web (event-driven) or add a trigger (scheduled / manual).
  3. Ship the worker — see GKE production and cloud operations. Always ship both navigator-web and workflows-service at one SHA.
  4. Re-register the deployment (above) — otherwise the new service 404s at the ingress no matter how clean the deploy was. This step is invisible in kubectl and easy to forget.

The heartbeat: proving the engine itself is alive

Every other scheduled workflow proves an integration: Archives proves the database and GCS are reachable, BillingCanary proves Xero still agrees with us. None answers the bluntest operator question: is durable execution itself alive right now? A silent Archives is ambiguous (engine down, or just a GCS outage?).

Heartbeat removes the ambiguity. It is a two-step Restate workflow (beat -> notify) that depends on nothing (no database, no object storage, no third-party API), so a green run can only mean the engine accepted an invocation, journaled step one, and ran step two to completion. It fires every six hours (0 */6 * * * UTC), keyed on the UTC date + hour so the four daily runs each get a distinct workflow key (a date-only key would dedupe three of four into no-ops). Each run posts firm ops a "Where to look" notice to the engineering Slack channel carrying the exact Restate Cloud + GCP console links and the kubectl/curl chain below, so the same notice that confirms health onboards whoever debugs its absence. (Ops notices go to Slack only; the duplicate email was dropped once Slack proved itself.)
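The effect of the key choice can be checked with a short sketch. The key format below is an assumption for illustration; the document only states that the key combines the UTC date and hour.

```rust
use std::collections::HashSet;

/// Hypothetical key format: UTC date plus hour, e.g. "2024-05-01T06".
fn heartbeat_key(year: u32, month: u32, day: u32, hour: u32) -> String {
    format!("{year:04}-{month:02}-{day:02}T{hour:02}")
}

/// Date-only alternative, shown only to illustrate the dedupe problem.
fn date_only_key(year: u32, month: u32, day: u32) -> String {
    format!("{year:04}-{month:02}-{day:02}")
}

/// Counts how many distinct workflow keys a set of firings produces.
fn distinct(keys: &[String]) -> usize {
    keys.iter().collect::<HashSet<_>>().len()
}
```

For the four firings of one day (hours 0, 6, 12, 18), the date + hour keys are all distinct, while the date-only keys collapse to one, so only the first run of the day would be admitted.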

The signal that matters most is the missing one: a six-hour window with no heartbeat notice in Slack means the engine may be down — walk the chain below. Like every new service, Heartbeat is invisible at the ingress until the deployment is re-registered (see the registration gotcha); the absence of its first notice after a ship is itself the test that re-register happened.
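Watching for the missing notice amounts to a staleness check. A minimal sketch, assuming timestamps in Unix seconds and a caller-chosen grace period (both the function and the grace parameter are hypothetical, not part of the Heartbeat workflow):

```rust
/// Heartbeat cadence from the schedule above: every six hours.
const HEARTBEAT_INTERVAL_SECS: u64 = 6 * 60 * 60;

/// True when more than one interval plus the grace period has passed since
/// the last notice. `saturating_sub` avoids underflow if clocks disagree
/// and `now_secs` is earlier than `last_notice_secs`.
fn heartbeat_overdue(last_notice_secs: u64, now_secs: u64, grace_secs: u64) -> bool {
    now_secs.saturating_sub(last_notice_secs) > HEARTBEAT_INTERVAL_SECS + grace_secs
}
```

An overdue result is a prompt to walk the debugging chain, not proof that the engine is down.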

Debugging "the workflow didn't run"

Work down the chain; the break is almost always near the top:

  1. Did the trigger fire? kubectl -n navigator get cronjob archives-trigger (last schedule) plus the trigger pod logs. For manual: did the admin button return the confirmation page or an error?
  2. Did the ingress accept it? A 401 is a wrong or stale RESTATE_AUTH_TOKEN; a 404 service not found is the registration gotcha — both have dedicated sections above.
  3. Did the worker run it? Check the invocation in the Restate Cloud console (Invocations) or via the admin API; a failing step retries and surfaces there.
  4. Did the side effect happen? Email transmits through SendGrid only when NAVIGATOR_EMAIL_BACKEND=sendgrid and SENDGRID_API_KEY are present; otherwise the worker silently uses a capturing backend that logs "sent" without sending. See cloud operations for manifest-drift notes.
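Step 2 of the chain maps cleanly onto the ingress response. The sketch below encodes only the two failure cases this page describes; the enum, the function name, and the substring match on the 404 body are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// 2xx: the ingress accepted the submit; continue to step 3.
    Accepted,
    /// 401: wrong or stale RESTATE_AUTH_TOKEN.
    BadAuthToken,
    /// 404 "service ... not found": the deployment needs re-registering.
    NotRegistered,
    /// Anything else: not covered by this page.
    Other(u16),
}

/// Classifies an ingress response by status code and body text.
fn diagnose_ingress(status: u16, body: &str) -> Diagnosis {
    match status {
        200..=299 => Diagnosis::Accepted,
        401 => Diagnosis::BadAuthToken,
        404 if body.contains("not found") => Diagnosis::NotRegistered,
        other => Diagnosis::Other(other),
    }
}
```

Fed the 404 body shown in the registration gotcha, this returns NotRegistered, which points at re-registration rather than at the deploy itself.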

See also