Restore a tenant whose database is unreachable 🚨

Outcome

A tenant whose /health/ready returned 503 or whose customers reported TENANT_DB_UNAVAILABLE is back to serving traffic, with a clear root cause recorded.

Prerequisites

  • Key Vault read access.
  • psql on PATH for direct connection probes.
  • On-call escalation rights for region/server-wide outages.
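
A quick pre-flight for these prerequisites (a sketch; assumes $KV holds your vault name, as in the steps below):

    # Verify tooling and access before you need them under pressure.
    command -v psql >/dev/null || echo "psql missing"
    az account show --query user.name -o tsv || echo "run az login"
    az keyvault secret list --vault-name "$KV" --query "[0].name" -o tsv \
      || echo "no Key Vault read access"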

Decision tree

(Diagram: branch on scope, one tenant only vs. all tenants on a server; step 1 below makes the same probe-based call.)

Steps — one tenant only

  1. Confirm scope by checking other tenants on the same server (Grafana rcm_tenants_on_server{server} + sample-tenant probes):

    • If all tenants on the server are 503-ing → server is down. Skip to the "all tenants on a server" branch below.
    • If only one tenant → keep reading.
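    A minimal sample-tenant probe, assuming each tenant answers /health/ready on a per-tenant hostname (slugs and URL pattern are illustrative):

    # All 503s points at the server itself; a single 503 is tenant-scoped.
    for t in acme globex initech; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://$t.example.com/health/ready")
      echo "$t: $code"
    done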
  2. Probe the Key Vault secret:

    az keyvault secret show \
      --vault-name "$KV" \
      --name "tenant-db-acme"
    | Result | Meaning | Fix |
    | --- | --- | --- |
    | Secret missing or wrong | db_config_ref drifted, often after a manual DB rename | Restore from KV soft-delete (within the recovery window); reconcile identity.tenant.db_config_ref. |
    | Secret returns expected URL | | Move on. |
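    If the secret was soft-deleted, it can often be recovered in place; a sketch using the same vault and secret name as above:

    # List the deleted secret, then recover it within the vault's retention window.
    az keyvault secret show-deleted --vault-name "$KV" --name tenant-db-acme
    az keyvault secret recover --vault-name "$KV" --name tenant-db-acme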
  3. Probe direct connectivity bypassing the app:

    psql "$(az keyvault secret show --vault-name $KV \
    --name tenant-db-acme --query value -o tsv)"
    | Result | Meaning | Fix |
    | --- | --- | --- |
    | Connection refused / timeout | Tenant DB itself is down | Check pg_database on the server; if missing, restore from PITR or pg_dump. |
    | Connects but queries hang | Stuck queries | Investigate pg_stat_activity for long-running queries; consider pg_terminate_backend. |
    | Works fine | App resolver pool is stuck | Rolling-restart rcm-core workers to force fresh pool creation. |
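    For the "queries hang" row, a sketch of the pg_stat_activity triage ($TENANT_DB_URL is the connection string fetched above; terminate only a confirmed offender):

    # Find long-running, non-idle backends on the tenant DB.
    psql "$TENANT_DB_URL" -c "
      SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state <> 'idle' AND now() - query_start > interval '5 minutes'
      ORDER BY runtime DESC;"
    # For a confirmed offender:
    # psql "$TENANT_DB_URL" -c "SELECT pg_terminate_backend(<pid>);"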
  4. Mitigate while investigating by flipping the tenant to suspended (§1.3) so customers see a clean 503 with a clear message instead of raw TENANT_DB_UNAVAILABLE. Flip back to active once fixed.
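    A sketch of the underlying flip, assuming a status column and slug identifier on identity.tenant and an admin connection in $ADMIN_DB_URL (all hypothetical; prefer the §1.3 tooling, which also writes the STATUS_CHANGED audit row):

    # Hypothetical direct flip; verify column names against §1.3 first.
    psql "$ADMIN_DB_URL" -c "UPDATE identity.tenant SET status = 'suspended' WHERE slug = 'acme';"
    # ...and back once fixed:
    psql "$ADMIN_DB_URL" -c "UPDATE identity.tenant SET status = 'active' WHERE slug = 'acme';"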

  5. Post-mortem any incident where a tenant was unreachable for > 5 min. Update identity.db_server.capacity_hint if the incident correlated with resource pressure.
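    A sketch of the capacity_hint follow-up, assuming the column lives on identity.db_server; $ADMIN_DB_URL, the hint value, and the server name are placeholders:

    # Record the observed pressure so placement decisions can account for it.
    psql "$ADMIN_DB_URL" -c \
      "UPDATE identity.db_server SET capacity_hint = 'cpu-bound-at-peak' WHERE name = 'pg-flex-01';"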

Steps — all tenants on a server

  1. Escalate to Azure immediately:

    • Check Azure Portal alerts on the Postgres Flexible Server.
    • File an Azure support ticket with the resource ID and incident time.
  2. Communicate to affected customers via the status page. Note: every tenant whose db_server_ref points at the down server is impacted.
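    To enumerate impacted tenants for the status page, a sketch assuming db_server_ref and a slug column on identity.tenant (connection string and server name are placeholders):

    # Every tenant pointing at the down server needs a status-page entry.
    psql "$ADMIN_DB_URL" -c \
      "SELECT slug FROM identity.tenant WHERE db_server_ref = 'pg-flex-01';"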

  3. While Azure investigates, options for fastest restore:

    | Option | When |
    | --- | --- |
    | Wait for Azure to bring the server back | Default; usually < 30 min |
    | PITR restore to a sibling server | Server is unrecoverable and the retention window covers the needed restore point |
    | Geo-redundant backup restore | PITR retention exceeded; daily granularity |
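    For the PITR row, Flexible Server restores into a new server; a sketch (resource group, server names, and timestamp are placeholders):

    # Creates a NEW server from the point-in-time backup; afterwards, repoint
    # the affected tenants' db_config_ref secrets at the restored server.
    az postgres flexible-server restore \
      --resource-group rcm-prod \
      --name pg-flex-01-restored \
      --source-server pg-flex-01 \
      --restore-time "2024-05-01T07:45:00Z"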
  4. After recovery, post-mortem: was capacity, version, or backup configuration a contributor? Update capacity_hint, file follow-up tickets, and document the findings.

Validation

| Check | Expected |
| --- | --- |
| /health/ready | 200, all sampled tenants healthy |
| Customer can sign in | Yes |
| pg_stat_activity on the affected DB | No long-running stuck queries |
| Audit log | Has incident-correlated STATUS_CHANGED rows if the tenant was suspended/restored |
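
The first and third checks can be scripted; a sketch reusing the earlier probes (hostname pattern is illustrative, $TENANT_DB_URL as fetched in step 3):

    # Readiness probe plus a stuck-query count; expect 200 and 0.
    curl -s -o /dev/null -w '/health/ready: %{http_code}\n' \
      "https://acme.example.com/health/ready"
    psql "$TENANT_DB_URL" -c \
      "SELECT count(*) AS stuck FROM pg_stat_activity
       WHERE state <> 'idle' AND now() - query_start > interval '5 minutes';"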

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Customer sees TENANT_DB_UNAVAILABLE despite suspend | Status cache hasn't propagated | Wait one TTL window; rolling-restart the fleet. |
| psql works but the resolver still fails | App pool is wedged with stale connections | Rolling-restart all rcm-core workers. |
| Multiple tenants on different servers fail simultaneously | Region or pgBouncer outage | Escalate; treat as a multi-server incident. |
| db_config_ref rolled forward but old DB still accessible | Stale resolver cache | Wait TENANT_METADATA_CACHE_TTL_MS or rolling-restart. |
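
Several fixes above call for a rolling restart of rcm-core; a sketch assuming the workers run as a Kubernetes Deployment named rcm-core (platform and name are assumptions):

    # Forces fresh resolver pools; the rollout proceeds pod-by-pod, so
    # serving capacity stays up while stale connections are dropped.
    kubectl rollout restart deployment/rcm-core
    kubectl rollout status deployment/rcm-core --timeout=5m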

Next

1.11 — Cold-start & deploy SLOs