
Track and respond to platform-wide latency SLOs

Outcome

The four SLOs that bound platform health — cold path, instance health, code rollout, and migration window — are visible, measured, and within target.

Prerequisites

Targets

| Measure | Target | Why |
| --- | --- | --- |
| First-request cold path (new tenant pool) | p99 < 300 ms | Resolver + Key Vault + Knex-pool acquire + one SELECT. Above 300 ms, tenants experience a visible spinner on the first request after idle demotion. |
| New instance healthy (boot → /health/ready green) | < 60 s | Blue/green + autoscaler both assume a new pod joins rotation inside one minute. Above this, rolling deploys can starve capacity. |
| Code rollout (canary → 100%) | < 10 min | Mirrors the CI deploy window. Above this, an urgent fix cannot be shipped same-hour. |
| Fleet-wide cohorted migration (1000 tenants) | < 2 h | Sets the ceiling on how long a schema change can sit half-applied. |

How to measure

| Measure | Source | Query / probe |
| --- | --- | --- |
| Cold path p99 | Prometheus | histogram_quantile(0.99, rate(rcm_supervisor_promote_duration_seconds_bucket[5m])) |
| Instance healthy | Deploy system (Azure App Gateway / Container Apps) | Internal: rcm_master_query_duration_seconds for master latency |
| Code rollout | GitHub Actions deploy timer | Deploy dashboard |
| Migration window | Master DB | identity.migration_cohort_run.finished_at - started_at for the most recent run |
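The cold-path probe can also be run outside Grafana. The following is a minimal sketch, assuming Prometheus is reachable over its standard HTTP API (`/api/v1/query`); the base URL and the helper names are hypothetical. Because the query as written carries no `sum by (le)` aggregation, it can return one series per label combination, so the sketch reports the worst series rather than a fleet-wide aggregate.

```python
import json
import math
import urllib.parse
import urllib.request
from typing import Optional

# Query copied verbatim from the "How to measure" table.
COLD_PATH_P99_QUERY = (
    "histogram_quantile(0.99, "
    "rate(rcm_supervisor_promote_duration_seconds_bucket[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def worst_sample_seconds(payload: dict) -> Optional[float]:
    """Return the highest non-NaN sample in an instant-query response.

    Returns None when no series has data (for example, no promotions
    happened inside the rate window).
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    values = [float(series["value"][1]) for series in payload["data"]["result"]]
    values = [v for v in values if not math.isnan(v)]
    return max(values) if values else None


def fetch_cold_path_p99(base_url: str) -> Optional[float]:
    """Run the cold-path query against base_url (hypothetical endpoint)."""
    url = build_query_url(base_url, COLD_PATH_P99_QUERY)
    with urllib.request.urlopen(url, timeout=10) as response:
        return worst_sample_seconds(json.load(response))
```

A returned value of 0.3 or more means the 300 ms target is currently exceeded on at least one series.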

When a target is breached

| Breach | First action |
| --- | --- |
| Cold path p99 > 300 ms for 15 min | Inspect rcm_supervisor_promote_duration_seconds + rcm_resolver_hot_tenants. If pool is thrashing, raise PER_TENANT_METRICS_HOT_CAP and/or PGBOSS_HOT_POOL_MAX. |
| Instance healthy > 60 s | Master latency. Check rcm_master_query_duration_seconds + DATABASE_MASTER_READ_URL replica health. |
| Rollout > 10 min | Deploy-system ownership; outside this runbook. Check GitHub Actions for the deploy job. |
| Migration > 2 h | Reduce --cohort-size on the next run; confirm per-tenant migration time hasn't regressed (report.perTenantMs in migration_cohort_run). See Cohorted migrations. |
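The cold-path breach condition is a sustained one: p99 above 300 ms for 15 minutes, not a single bad sample. A minimal sketch of that check, assuming the p99 series has been exported as time-ordered `(unix_timestamp, seconds)` pairs; the function name and input shape are illustrative, not part of the platform.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold_s: float = 0.300,   # cold-path target: p99 < 300 ms
    window_s: float = 15 * 60,    # breach must persist for 15 min
) -> bool:
    """Return True if the series stays above threshold_s for window_s.

    samples must be sorted by timestamp. Any sample at or below the
    threshold resets the run, so brief spikes do not count as a breach.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold_s:
            if run_start is None:
                run_start = timestamp  # first sample of a new run
            if timestamp - run_start >= window_s:
                return True
        else:
            run_start = None  # recovered; start over
    return False
```

This roughly mirrors how a `for:` clause behaves on a Prometheus alerting rule: the condition has to hold continuously before the alert fires.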

Verification cadence

Steps

  1. Weekly: eyeball the four series in Grafana. Bookmark the dashboard.

  2. Per deploy: confirm the deploy pipeline records code-rollout time. Investigate any regression immediately — this is the SLO most likely to silently drift.

  3. Per cohort migration: compare the new migration_cohort_run row against the prior five. Alert if duration grows > 50% without a matching tenant-count growth.

    SELECT run_id, total_tenants, finished_at - started_at AS duration,
    (report->>'perTenantP95')::int AS per_tenant_p95_ms
    FROM identity.migration_cohort_run
    WHERE finished_at IS NOT NULL
    ORDER BY started_at DESC
    LIMIT 5;

  4. On breach, follow the first-action table under "When a target is breached". File a follow-up ticket with the metric snapshot.
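The comparison in step 3 can be sketched as follows. This is one possible reading of "grows > 50% without a matching tenant-count growth", stated as an assumption: the baseline is the median of the prior runs, and duration growth is allowed to scale with tenant-count growth before the 50% margin applies. The function name and tuple layout are hypothetical; the inputs correspond to `finished_at - started_at` (in seconds) and `total_tenants` from the query above.

```python
from statistics import median
from typing import Sequence, Tuple

Run = Tuple[float, int]  # (duration_seconds, total_tenants)


def migration_regressed(
    new_run: Run,
    prior_runs: Sequence[Run],
    max_growth: float = 0.5,  # 50% growth allowance from step 3
) -> bool:
    """Return True if new_run is slower than tenant growth explains."""
    if not prior_runs:
        return False  # nothing to compare against yet
    base_duration = median(duration for duration, _ in prior_runs)
    base_tenants = median(tenants for _, tenants in prior_runs)
    duration_growth = new_run[0] / base_duration
    # Never let a shrinking fleet tighten the allowance below 1x.
    tenant_growth = max(1.0, new_run[1] / base_tenants)
    return duration_growth > (1 + max_growth) * tenant_growth
```

With five prior runs of one hour at 1000 tenants, a 100-minute run at 1000 tenants is flagged, while the same 100-minute run at 1500 tenants is not.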

Validation

| Check | Expected |
| --- | --- |
| Cold path p99 last 24 h | Within target |
| /health/ready p95 boot time | Within target |
| Most recent deploy duration | Within target |
| Most recent cohort migration duration | Within target |
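The four checks reduce to comparing a measured value against the limits in the Targets section. A minimal sketch, with the limits copied from that section and everything else (the dictionary keys, the `breached` helper) chosen for illustration only:

```python
from typing import Dict, List

# Limits in seconds, taken from the Targets section.
# The keys are illustrative names, not platform identifiers.
TARGET_LIMITS_S: Dict[str, float] = {
    "cold_path_p99": 0.300,            # p99 < 300 ms
    "instance_healthy": 60.0,          # < 60 s
    "code_rollout": 10 * 60.0,         # < 10 min
    "migration_window": 2 * 3600.0,    # < 2 h
}


def breached(measurements_s: Dict[str, float]) -> List[str]:
    """Return the checks whose measured value is at or above its limit.

    Checks missing from measurements_s are skipped rather than failed,
    so a partial snapshot can still be validated.
    """
    return [
        name
        for name, limit in TARGET_LIMITS_S.items()
        if name in measurements_s and measurements_s[name] >= limit
    ]
```

An empty result means every supplied measurement is within target.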

Deferred work

Synthetic load-test verification of these targets (a 1000-tenant cold burst + a 3k-tenant cohorted fan-out) is not continuously executed. The targets are documented as operational goals; the load harness that proves them at scale is out of scope and will ship with the first real production scale-up.

Cross-references
