Restore a tenant whose database is unreachable 🚨

Outcome

A tenant whose /health/ready returned 503 or whose customers reported TENANT_DB_UNAVAILABLE is back to serving traffic, with a clear root cause recorded.

Prerequisites

  • Key Vault read access.
  • psql on PATH for direct connection probes.
  • On-call escalation rights for region/server-wide outages.
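
A quick pre-flight for these prerequisites (a sketch; assumes $KV holds your vault name, as in the steps below):

    # Verify tooling and access before you need them under pressure.
    command -v psql >/dev/null || echo "psql missing"
    az account show --query user.name -o tsv || echo "run az login"
    az keyvault secret list --vault-name "$KV" --query "[0].name" -o tsv \
      || echo "no Key Vault read access"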

Decision tree

(Diagram: branch on scope, one tenant only vs. all tenants on a server; step 1 below makes the same probe-based call.)

Steps — one tenant only

  1. Confirm scope by checking other tenants on the same server (Grafana rcm_tenants_on_server{server} + sample-tenant probes):

    • If all tenants on the server are 503-ing → server is down. Skip to the "all tenants on a server" branch below.
    • If only one tenant → keep reading.
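    A minimal sample-tenant probe, assuming each tenant answers /health/ready on a per-tenant hostname (slugs and URL pattern are illustrative):

    # All 503s points at the server itself; a single 503 is tenant-scoped.
    for t in acme globex initech; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://$t.example.com/health/ready")
      echo "$t: $code"
    done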
  2. Probe the Key Vault secret:

    az keyvault secret show \
      --vault-name "$KV" \
      --name "tenant-db-acme"
    | Result | Meaning | Fix |
    | --- | --- | --- |
    | Secret missing or wrong | db_config_ref drifted, often after a manual DB rename | Restore from KV soft-delete (within the recovery window); reconcile identity.tenant.db_config_ref. |
    | Secret returns expected URL | | Move on. |
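    If the secret was soft-deleted, it can often be recovered in place; a sketch using the same vault and secret name as above:

    # List the deleted secret, then recover it within the vault's retention window.
    az keyvault secret show-deleted --vault-name "$KV" --name tenant-db-acme
    az keyvault secret recover --vault-name "$KV" --name tenant-db-acme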
  3. Probe direct connectivity bypassing the app:

    psql "$(az keyvault secret show --vault-name $KV \
    --name tenant-db-acme --query value -o tsv)"
    | Result | Meaning | Fix |
    | --- | --- | --- |
    | Connection refused / timeout | Tenant DB itself is down | Check pg_database on the server; if missing, restore from PITR or pg_dump. |
    | Connects but queries hang | Stuck queries | Investigate pg_stat_activity for long-running queries; consider pg_terminate_backend. |
    | Works fine | App resolver pool is stuck | Rolling-restart rcm-core workers to force fresh pool creation. |
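    For the "queries hang" row, a sketch of the pg_stat_activity triage ($TENANT_DB_URL is the connection string fetched above; terminate only a confirmed offender):

    # Find long-running, non-idle backends on the tenant DB.
    psql "$TENANT_DB_URL" -c "
      SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
      FROM pg_stat_activity
      WHERE state <> 'idle' AND now() - query_start > interval '5 minutes'
      ORDER BY runtime DESC;"
    # For a confirmed offender:
    # psql "$TENANT_DB_URL" -c "SELECT pg_terminate_backend(<pid>);"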
  4. Mitigate while investigating by flipping the tenant to suspended (§1.3) so customers see a clean 503 with a clear message instead of raw TENANT_DB_UNAVAILABLE. Flip back to active once fixed.
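    A sketch of the underlying flip, assuming a status column and slug identifier on identity.tenant and an admin connection in $ADMIN_DB_URL (all hypothetical; prefer the §1.3 tooling, which also writes the STATUS_CHANGED audit row):

    # Hypothetical direct flip; verify column names against §1.3 first.
    psql "$ADMIN_DB_URL" -c "UPDATE identity.tenant SET status = 'suspended' WHERE slug = 'acme';"
    # ...and back once fixed:
    psql "$ADMIN_DB_URL" -c "UPDATE identity.tenant SET status = 'active' WHERE slug = 'acme';"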

  5. Post-mortem any incident where a tenant was unreachable for > 5 min. Update identity.db_server.capacity_hint if the incident correlated with resource pressure.
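    A sketch of the capacity_hint follow-up, assuming the column lives on identity.db_server; $ADMIN_DB_URL, the hint value, and the server name are placeholders:

    # Record the observed pressure so placement decisions can account for it.
    psql "$ADMIN_DB_URL" -c \
      "UPDATE identity.db_server SET capacity_hint = 'cpu-bound-at-peak' WHERE name = 'pg-flex-01';"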

Steps — all tenants on a server

  1. Escalate to Azure immediately:

    • Check Azure Portal alerts on the Postgres Flexible Server.
    • File an Azure support ticket with the resource ID and incident time.
  2. Communicate to affected customers via the status page. Note: every tenant whose db_server_ref points at the down server is impacted.
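    To enumerate impacted tenants for the status page, a sketch assuming db_server_ref and a slug column on identity.tenant (connection string and server name are placeholders):

    # Every tenant pointing at the down server needs a status-page entry.
    psql "$ADMIN_DB_URL" -c \
      "SELECT slug FROM identity.tenant WHERE db_server_ref = 'pg-flex-01';"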

  3. While Azure investigates, options for fastest restore:

    | Option | When |
    | --- | --- |
    | Wait for Azure to bring the server back | Default; usually < 30 min |
    | PITR restore to a sibling server | Server is unrecoverable and the retention window covers the needed restore point |
    | Geo-redundant backup restore | PITR retention exceeded; daily granularity |
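    For the PITR row, Flexible Server restores into a new server; a sketch (resource group, server names, and timestamp are placeholders):

    # Creates a NEW server from the point-in-time backup; afterwards, repoint
    # the affected tenants' db_config_ref secrets at the restored server.
    az postgres flexible-server restore \
      --resource-group rcm-prod \
      --name pg-flex-01-restored \
      --source-server pg-flex-01 \
      --restore-time "2024-05-01T07:45:00Z"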
  4. After recovery, post-mortem: was capacity, version, or backup configuration a contributor? Update capacity_hint, file follow-up tickets, and document the findings.

Validation

| Check | Expected |
| --- | --- |
| /health/ready | 200, all sampled tenants healthy |
| Customer can sign in | Yes |
| pg_stat_activity on the affected DB | No long-running stuck queries |
| Audit log | Has incident-correlated STATUS_CHANGED rows if the tenant was suspended/restored |
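
The first and third checks can be scripted; a sketch reusing the earlier probes (hostname pattern is illustrative, $TENANT_DB_URL as fetched in step 3):

    # Readiness probe plus a stuck-query count; expect 200 and 0.
    curl -s -o /dev/null -w '/health/ready: %{http_code}\n' \
      "https://acme.example.com/health/ready"
    psql "$TENANT_DB_URL" -c \
      "SELECT count(*) AS stuck FROM pg_stat_activity
       WHERE state <> 'idle' AND now() - query_start > interval '5 minutes';"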

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Customer sees TENANT_DB_UNAVAILABLE despite suspend | Status cache hasn't propagated | Wait one TTL window; rolling-restart the fleet. |
| psql works but the resolver still fails | App pool is wedged with stale connections | Rolling-restart all rcm-core workers. |
| Multiple tenants on different servers fail simultaneously | Region or pgBouncer outage | Escalate; treat as a multi-server incident. |
| db_config_ref rolled forward but old DB still accessible | Stale resolver cache | Wait TENANT_METADATA_CACHE_TTL_MS or rolling-restart. |
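
Several fixes above call for a rolling restart of rcm-core; a sketch assuming the workers run as a Kubernetes Deployment named rcm-core (platform and name are assumptions):

    # Forces fresh resolver pools; the rollout proceeds pod-by-pod, so
    # serving capacity stays up while stale connections are dropped.
    kubectl rollout restart deployment/rcm-core
    kubectl rollout status deployment/rcm-core --timeout=5m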

Next

1.11 — Cold-start & deploy SLOs