Track and respond to platform-wide latency SLOs
Outcome
The four SLOs that bound platform health — cold path, instance health, code rollout, and migration window — are visible, measured, and within target.
Prerequisites
- Grafana access to the platform dashboard.
- Familiarity with pg-boss supervisor and tenant resolver basics.
Targets
| Measure | Target | Why |
|---|---|---|
| First-request cold path (new tenant pool) | p99 < 300 ms | Resolver + Key Vault + Knex-pool acquire + one SELECT. Above 300 ms, tenants experience a visible spinner on the first request after idle demotion. |
| New instance healthy (boot → /health/ready green) | < 60 s | Blue/green + autoscaler both assume a new pod joins rotation inside one minute. Above this, rolling deploys can starve capacity. |
| Code rollout (canary → 100%) | < 10 min | Mirrors the CI deploy window. Above this, an urgent fix cannot be shipped same-hour. |
| Fleet-wide cohorted migration (1000 tenants) | < 2 h | Sets the ceiling on how long a schema change can sit half-applied. |
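The 300 ms cold-path target has to cover all four steps named in the first row. A back-of-envelope sketch with hypothetical per-step budgets (the real split is whatever the promote-duration histogram shows, not these numbers):

```python
# Hypothetical per-step budgets (ms) for the cold path; the real numbers
# come from rcm_supervisor_promote_duration_seconds, not this sketch.
cold_path_budget_ms = {
    "tenant_resolver_lookup": 40,
    "key_vault_secret_fetch": 120,  # typically the dominant step
    "knex_pool_acquire": 80,
    "first_select": 40,
}
total_ms = sum(cold_path_budget_ms.values())
headroom_ms = 300 - total_ms  # slack against the p99 target
```

If any single step's budget grows, the headroom shows how much drift the target can absorb before the p99 breaches.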
How to measure
| Measure | Source | Query / probe |
|---|---|---|
| Cold path p99 | Prometheus | histogram_quantile(0.99, rate(rcm_supervisor_promote_duration_seconds_bucket[5m])) |
| Instance healthy | Deploy system (Azure App Gateway / Container Apps) | Internal: rcm_master_query_duration_seconds for master latency |
| Code rollout | GitHub Actions deploy timer | Deploy dashboard |
| Migration window | Master DB | identity.migration_cohort_run.finished_at - started_at for the most recent run |
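The `histogram_quantile` expression in the cold-path row interpolates within cumulative histogram buckets. A minimal sketch of that calculation, using made-up bucket bounds rather than the real `rcm_supervisor_promote_duration_seconds` buckets:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Interpolate the q-quantile from cumulative (upper_bound, count)
    buckets, the way Prometheus's histogram_quantile() does.
    Simplified: assumes sorted buckets with strictly increasing counts."""
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical buckets: (upper bound in seconds, cumulative count).
buckets = [(0.1, 900), (0.3, 990), (0.5, 998), (1.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # lands right at 0.3 s, the target
```

This is why bucket boundaries matter: with a bucket edge at 0.3 s, the 300 ms target can be read off the histogram without interpolation error.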
When a target is breached
| Breach | First action |
|---|---|
| Cold path p99 > 300 ms for 15 min | Inspect rcm_supervisor_promote_duration_seconds + rcm_resolver_hot_tenants. If pool is thrashing, raise PER_TENANT_METRICS_HOT_CAP and/or PGBOSS_HOT_POOL_MAX. |
| Instance healthy > 60 s | Master latency. Check rcm_master_query_duration_seconds + DATABASE_MASTER_READ_URL replica health. |
| Rollout > 10 min | Deploy-system ownership; outside this runbook. Check GitHub Actions for the deploy job. |
| Migration > 2 h | Reduce --cohort-size on the next run; confirm per-tenant migration time hasn't regressed (report.perTenantMs in migration_cohort_run). See Cohorted migrations. |
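For the migration breach, the relationship between per-tenant time and the 2 h window is roughly linear, so the next `--cohort-size` can be estimated directly. A back-of-envelope sketch with assumed numbers (the parallelism model is an assumption, not a documented property of the migration runner):

```python
def max_tenants_in_window(per_tenant_ms: float, parallelism: int,
                          window_h: float = 2.0) -> int:
    """Largest tenant count that fits the migration window, assuming
    tenants migrate in `parallelism` concurrent streams, each taking
    `per_tenant_ms` per tenant."""
    window_ms = window_h * 3600 * 1000
    return int(window_ms / per_tenant_ms) * parallelism

# If report.perTenantMs regressed to 9000 ms with 2 parallel streams,
# 800 per stream * 2 = 1600 tenants fit in 2 h: the 1000-tenant target
# still fits, but a larger fan-out would need a smaller cohort size.
limit = max_tenants_in_window(per_tenant_ms=9000, parallelism=2)
```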
Verification cadence
Steps
Weekly: eyeball the four series in Grafana. Bookmark the dashboard.
Per deploy: confirm the deploy pipeline records code-rollout time. Investigate any regression immediately — this is the SLO most likely to silently drift.
Per cohort migration: compare the new migration_cohort_run row against the prior five. Alert if duration grows > 50% without a matching tenant-count growth.

```sql
SELECT run_id,
       total_tenants,
       finished_at - started_at AS duration,
       (report->>'perTenantP95')::int AS per_tenant_p95_ms
FROM identity.migration_cohort_run
WHERE finished_at IS NOT NULL
ORDER BY started_at DESC
LIMIT 5;
```

On breach, follow the table above. File a follow-up ticket with the metric snapshot.
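The > 50% growth rule can be checked mechanically against the rows that query returns. A sketch, assuming durations already normalized to seconds and the hypothetical field names below:

```python
def migration_duration_alert(runs: list[dict], threshold: float = 0.5) -> bool:
    """Return True if the newest run's per-tenant duration grew more than
    `threshold` (50%) over the mean of the prior runs.

    Normalizing by tenant count implements "without a matching
    tenant-count growth": a run that is longer only because it covered
    more tenants does not alert. `runs` is newest-first, e.g. rows
    from identity.migration_cohort_run."""
    newest, prior = runs[0], runs[1:]
    if not prior:
        return False
    per_tenant = lambda r: r["duration_s"] / r["total_tenants"]
    baseline = sum(per_tenant(r) for r in prior) / len(prior)
    return per_tenant(newest) > baseline * (1 + threshold)

runs = [
    {"duration_s": 5400, "total_tenants": 1000},  # newest: 5.4 s/tenant
    {"duration_s": 3000, "total_tenants": 1000},  # prior: ~3 s/tenant
    {"duration_s": 3100, "total_tenants": 1000},
    {"duration_s": 2900, "total_tenants": 1000},
]
alert = migration_duration_alert(runs)  # 5.4 > 3.0 * 1.5, so this alerts
```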
Validation
| Check | Expected |
|---|---|
| Cold path p99 last 24 h | Within target |
| /health/ready p95 boot time | Within target |
| Most recent deploy duration | Within target |
| Most recent cohort migration duration | Within target |
Deferred work
Synthetic load-test verification of these targets (a 1000-tenant cold burst and a 3k-tenant cohorted fan-out) is not continuously executed. The targets are documented as operational goals; the load harness that proves them at scale is out of scope and will ship with the first real production scale-up.
Cross-references
- pg-boss supervisor — the cold-path p99 series comes from here.
- Cohorted migrations for migration-window root cause analysis.