# Restore a tenant whose database is unreachable 🚨
## Outcome

A tenant whose `/health/ready` was 503-flagged or whose customers report `TENANT_DB_UNAVAILABLE` is back to serving traffic, with a clear root cause recorded.
## Prerequisites

- Key Vault read access.
- `psql` on PATH for direct connection probes.
- On-call escalation rights for region/server-wide outages.
## Decision tree

- Only one tenant on the server failing → follow "Steps — one tenant only".
- Every tenant on the server failing → jump to "Steps — all tenants on a server".
## Steps — one tenant only

1. Confirm scope by checking other tenants on the same server (Grafana `rcm_tenants_on_server{server}` plus sample-tenant probes; a probe sketch follows this list):
   - If all tenants on the server are 503-ing → the server is down. Skip to the "all tenants on a server" branch below.
   - If only one tenant → keep reading.
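A minimal probe sketch, assuming per-tenant hosts; the tenant slugs and host pattern are placeholders for your environment:

```bash
# Probe /health/ready for a few sample tenants on the suspect server.
# Tenant slugs and the host pattern are hypothetical; substitute your own.
for t in acme globex initech; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://${t}.example.com/health/ready")
  echo "$t: $code"
done
# All 503 → server-wide branch; a single 503 → continue below.
```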
2. Probe the Key Vault secret:

   ```bash
   az keyvault secret show \
     --vault-name "$KV" \
     --name "tenant-db-acme"
   ```

   | Result | Meaning | Fix |
   |---|---|---|
   | Secret missing or wrong | `db_config_ref` drifted, often after a manual DB rename | Restore from KV soft-delete within the recovery window (sketch below); reconcile `identity.tenant.db_config_ref`. |
   | Secret returns the expected URL | — | Move on. |
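If the secret was soft-deleted, a recovery sketch (this assumes the vault has soft-delete enabled and the secret is still inside its retention window):

```bash
# Confirm the tenant's secret shows up among recoverable deleted secrets.
az keyvault secret list-deleted --vault-name "$KV" -o table

# Recover it in place under its original name.
az keyvault secret recover --vault-name "$KV" --name "tenant-db-acme"
```

Afterwards, reconcile `identity.tenant.db_config_ref` so it points at the recovered secret.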
psql "$(az keyvault secret show --vault-name $KV \--name tenant-db-acme --query value -o tsv)"Result Meaning Fix Connection refused / timeout Tenant DB itself is down Check pg_databaseon the server; if missing, restore from PITR or pg_dump.Connects but queries hang Stuck queries Investigate pg_stat_activityfor long-running queries; considerpg_terminate_backend.Works fine App resolver pool is stuck Rolling-restart rcm-core workers to force fresh pool creation. Mitigate while investigating by flipping the tenant to
4. Mitigate while investigating by flipping the tenant to `suspended` (§1.3) so customers see a clean 503 with a clear message instead of raw `TENANT_DB_UNAVAILABLE`. Flip back to `active` once fixed.
5. Post-mortem any incident where a tenant was unreachable for > 5 min. Update `identity.db_server.capacity_hint` if the incident correlated with resource pressure, as in the sketch below.
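A sketch of that `capacity_hint` update; only `identity.db_server.capacity_hint` appears in this runbook, so `$IDENTITY_DB_URL`, the key column, and the values below are assumptions:

```bash
# Record the post-incident capacity observation on the affected server row.
psql "$IDENTITY_DB_URL" -c "
  UPDATE identity.db_server
  SET capacity_hint = 'near-capacity'  -- value is illustrative
  WHERE id = 'pg-flex-prod-03';        -- hypothetical server id
"
```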
## Steps — all tenants on a server
1. Escalate to Azure immediately:
   - Check Azure Portal alerts on the Postgres Flexible Server.
   - File an Azure support ticket with the resource ID and incident time.
   - Grab the server state from the CLI as sketched below.
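A quick state check while the ticket is in flight; the resource group and server name are placeholders:

```bash
# "Ready" is the healthy state for a Postgres Flexible Server.
az postgres flexible-server show \
  --resource-group "$RG" --name "$SERVER" \
  --query state -o tsv
```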
2. Communicate to affected customers via the status page. Note: every tenant whose `db_server_ref` points at the down server is impacted.
3. While Azure investigates, options for the fastest restore:
   | Option | When |
   |---|---|
   | Wait for Azure to bring the server back | Default — usually < 30 min |
   | PITR restore to a sibling server (sketch below) | Server is unrecoverable; the retention window covers the needed restore time |
   | Geo-redundant backup restore | PITR retention exceeded; daily granularity |
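A sketch of the PITR option; the restore time and resource names are placeholders:

```bash
# Restore the server to a new sibling from a point-in-time backup.
az postgres flexible-server restore \
  --resource-group "$RG" \
  --name "${SERVER}-restored" \
  --source-server "$SERVER" \
  --restore-time "2024-01-15T08:00:00Z"
```

Once the data is verified, repoint each affected tenant's `db_server_ref` at the restored server.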
4. After recovery, post-mortem: was capacity, version, or backup configuration the contributor? Update `capacity_hint`, file follow-up tickets, and document.
## Validation
| Check | Expected |
|---|---|
| `/health/ready` | 200, all sampled tenants healthy |
| Customer can sign in | Yes |
| `pg_stat_activity` on the affected DB | No long-running stuck queries |
| Audit log | Has incident-correlated `STATUS_CHANGED` rows if the tenant was suspended/restored |
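A quick sweep over the health and query checks, reusing the assumed variables from the probes above:

```bash
# The health endpoint should report 200 once the tenant is back.
curl -s -o /dev/null -w 'health/ready: %{http_code}\n' \
  "https://acme.example.com/health/ready"

# No non-idle queries older than 5 minutes should remain.
psql "$DB_URL" -c "
  SELECT count(*) AS stuck
  FROM pg_stat_activity
  WHERE state <> 'idle' AND now() - query_start > interval '5 minutes';"
```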
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Customer sees `TENANT_DB_UNAVAILABLE` despite suspend | Status cache hasn't propagated | Wait one TTL window or rolling-restart the fleet. |
| `psql` works but the resolver still fails | App pool is wedged with stale connections | Rolling-restart all rcm-core workers (sketch below). |
| Multiple tenants on different servers fail simultaneously | Region or pgBouncer outage | Escalate; treat as a multi-server incident. |
| `db_config_ref` rolled forward but old DB still accessible | Stale resolver cache | Wait `TENANT_METADATA_CACHE_TTL_MS` or rolling-restart. |
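Several fixes above call for a rolling restart of the rcm-core workers. If the fleet runs on Kubernetes (an assumption; the deployment name is inferred from the worker name and may differ):

```bash
# Restart pods per the deployment's rolling-update strategy, then wait
# for the rollout to settle.
kubectl rollout restart deployment/rcm-core
kubectl rollout status deployment/rcm-core --timeout=5m
```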
## Cross-references
- Suspend or read-only a tenant (§1.3) for the customer-facing 503 message.
- Connection budget alerts — if the unreachable tenant correlates with budget pressure.