Track and respond to platform-wide latency SLOs
Outcome
The four SLOs that bound platform health — cold path, instance health, code rollout, and migration window — are visible, measured, and within target.
Prerequisites
- Grafana access to the platform dashboard.
- Familiarity with pg-boss supervisor and tenant resolver basics.
Targets
| Measure | Target | Why |
|---|---|---|
| First-request cold path (new tenant pool) | p99 < 300 ms | Resolver + Key Vault + Knex-pool acquire + one SELECT. Above 300 ms, tenants experience a visible spinner on the first request after idle demotion. |
| New instance healthy (boot → /health/ready green) | < 60 s | Blue/green + autoscaler both assume a new pod joins rotation inside one minute. Above this, rolling deploys can starve capacity. |
| Code rollout (canary → 100%) | < 10 min | Mirrors the CI deploy window. Above this, an urgent fix cannot be shipped same-hour. |
| Fleet-wide cohorted migration (1000 tenants) | < 2 h | Sets the ceiling on how long a schema change can sit half-applied. |
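The 300 ms cold-path target has to cover all four steps named in the first row. A back-of-envelope sketch with hypothetical per-step budgets (the real split is whatever the promote-duration histogram shows, not these numbers):

```python
# Hypothetical per-step budgets (ms) for the cold path; the real numbers
# come from rcm_supervisor_promote_duration_seconds, not this sketch.
cold_path_budget_ms = {
    "tenant_resolver_lookup": 40,
    "key_vault_secret_fetch": 120,  # typically the dominant step
    "knex_pool_acquire": 80,
    "first_select": 40,
}
total_ms = sum(cold_path_budget_ms.values())
headroom_ms = 300 - total_ms  # slack against the p99 target
```

If any single step's budget grows, the headroom shows how much drift the target can absorb before the p99 breaches.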
How to measure
| Measure | Source | Query / probe |
|---|---|---|
| Cold path p99 | Prometheus | histogram_quantile(0.99, rate(rcm_supervisor_promote_duration_seconds_bucket[5m])) |
| Instance healthy | Deploy system (Azure App Gateway / Container Apps) | Internal: rcm_master_query_duration_seconds for master latency |
| Code rollout | GitHub Actions deploy timer | Deploy dashboard |
| Migration window | Master DB | identity.migration_cohort_run.finished_at - started_at for the most recent run |
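The `histogram_quantile` expression in the cold-path row interpolates within cumulative histogram buckets. A minimal sketch of that calculation, using made-up bucket bounds rather than the real `rcm_supervisor_promote_duration_seconds` buckets:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Interpolate the q-quantile from cumulative (upper_bound, count)
    buckets, the way Prometheus's histogram_quantile() does.
    Simplified: assumes sorted buckets with strictly increasing counts."""
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical buckets: (upper bound in seconds, cumulative count).
buckets = [(0.1, 900), (0.3, 990), (0.5, 998), (1.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # lands right at 0.3 s, the target
```

This is why bucket boundaries matter: with a bucket edge at 0.3 s, the 300 ms target can be read off the histogram without interpolation error.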
When a target is breached
| Breach | First action |
|---|---|
| Cold path p99 > 300 ms for 15 min | Inspect rcm_supervisor_promote_duration_seconds + rcm_resolver_hot_tenants. If pool is thrashing, raise PER_TENANT_METRICS_HOT_CAP and/or PGBOSS_HOT_POOL_MAX. |
| Instance healthy > 60 s | Master latency. Check rcm_master_query_duration_seconds + DATABASE_MASTER_READ_URL replica health. |
| Rollout > 10 min | Deploy-system ownership; outside this runbook. Check GitHub Actions for the deploy job. |
| Migration > 2 h | Reduce --cohort-size on the next run; confirm per-tenant migration time hasn't regressed (report.perTenantMs in migration_cohort_run). See Cohorted migrations. |
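For the migration breach, the relationship between per-tenant time and the 2 h window is roughly linear, so the next `--cohort-size` can be estimated directly. A back-of-envelope sketch with assumed numbers (the parallelism model is an assumption, not a documented property of the migration runner):

```python
def max_tenants_in_window(per_tenant_ms: float, parallelism: int,
                          window_h: float = 2.0) -> int:
    """Largest tenant count that fits the migration window, assuming
    tenants migrate in `parallelism` concurrent streams, each taking
    `per_tenant_ms` per tenant."""
    window_ms = window_h * 3600 * 1000
    return int(window_ms / per_tenant_ms) * parallelism

# If report.perTenantMs regressed to 9000 ms with 2 parallel streams,
# 800 per stream * 2 = 1600 tenants fit in 2 h: the 1000-tenant target
# still fits, but a larger fan-out would need a smaller cohort size.
limit = max_tenants_in_window(per_tenant_ms=9000, parallelism=2)
```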
Verification cadence
Steps
Weekly: eyeball the four series in Grafana. Bookmark the dashboard.
Per deploy: confirm the deploy pipeline records code-rollout time. Investigate any regression immediately — this is the SLO most likely to silently drift.
Per cohort migration: compare the new migration_cohort_run row against the prior five. Alert if duration grows > 50% without a matching tenant-count growth.

```sql
SELECT run_id,
       total_tenants,
       finished_at - started_at AS duration,
       (report->>'perTenantP95')::int AS per_tenant_p95_ms
FROM identity.migration_cohort_run
WHERE finished_at IS NOT NULL
ORDER BY started_at DESC
LIMIT 5;
```

On breach, follow the table above. File a follow-up ticket with the metric snapshot.
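The > 50% growth rule can be checked mechanically against the rows that query returns. A sketch, assuming durations already normalized to seconds and the hypothetical field names below:

```python
def migration_duration_alert(runs: list[dict], threshold: float = 0.5) -> bool:
    """Return True if the newest run's per-tenant duration grew more than
    `threshold` (50%) over the mean of the prior runs.

    Normalizing by tenant count implements "without a matching
    tenant-count growth": a run that is longer only because it covered
    more tenants does not alert. `runs` is newest-first, e.g. rows
    from identity.migration_cohort_run."""
    newest, prior = runs[0], runs[1:]
    if not prior:
        return False
    per_tenant = lambda r: r["duration_s"] / r["total_tenants"]
    baseline = sum(per_tenant(r) for r in prior) / len(prior)
    return per_tenant(newest) > baseline * (1 + threshold)

runs = [
    {"duration_s": 5400, "total_tenants": 1000},  # newest: 5.4 s/tenant
    {"duration_s": 3000, "total_tenants": 1000},  # prior: ~3 s/tenant
    {"duration_s": 3100, "total_tenants": 1000},
    {"duration_s": 2900, "total_tenants": 1000},
]
alert = migration_duration_alert(runs)  # 5.4 > 3.0 * 1.5, so this alerts
```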
Validation
| Check | Expected |
|---|---|
| Cold path p99 last 24 h | Within target |
| /health/ready p95 boot time | Within target |
| Most recent deploy duration | Within target |
| Most recent cohort migration duration | Within target |
Deferred work
Synthetic load-test verification of these targets (a 1000-tenant cold burst and a 3k-tenant cohorted fan-out) is not continuously executed. The targets are documented as operational goals; the load harness that proves them at scale is out of scope and will ship with the first real production scale-up.
Cross-references
- pg-boss supervisor — the cold-path p99 series comes from here.
- Cohorted migrations for migration-window root cause analysis.