Restore one tenant to a prior moment 🚨
Outcome
One tenant's database is reverted to a specific timestamp without affecting any other tenant on the same server. Validated against Postgres 17 on Azure Database for PostgreSQL — Flexible Server.
When to use this
- Tenant reports logical corruption (bad migration, accidental delete, app bug that mangled rows) and wants rollback to a specific minute.
- A compliance event requires producing an "as-of" snapshot of one tenant's DB without touching siblings.
- Hardware or region failure made the live server unrecoverable and geo-redundant PITR is the restore path.
Do not use this for routine point-lookups into a prior state; run pg_dump against the live server instead.
Prerequisites
- PLATFORM_ADMIN + Azure RBAC: Microsoft.DBforPostgreSQL/flexibleServers/read and .../write on the resource group.
- pg_dump + pg_restore on PATH at the server's major version.
- Scratch disk ≥ 2× the tenant's DB size (dump + safety margin).
- identity.db_server row for the source server with a resolvable admin_secret_ref.
Restore-time target
PITR granularity: 1 second inside the server's retention window (default 7 days, configurable up to 35). Outside retention, geo-redundant backup is the only option and is daily-coarse.
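Before starting, it can help to confirm that the requested timestamp falls inside the retention window. The following is a minimal sketch, not part of the tooling: `within_retention` is a hypothetical helper name, it assumes GNU `date` (for `-d`), and it takes the retention period as an argument rather than reading it from Azure. The earliest restore point reported by Azure for the server remains authoritative.

```shell
# Sketch: is the requested restore time inside the PITR retention window?
# Assumes GNU date; within_retention is a hypothetical helper, not a project CLI.

# within_retention RESTORE_ISO RETENTION_DAYS [NOW_ISO]
# Returns 0 when RESTORE_ISO is not in the future and is no older than
# RETENTION_DAYS before NOW_ISO (NOW_ISO defaults to the current time).
# Returns 1 when outside the window, 2 when a timestamp cannot be parsed.
within_retention() {
  local restore_iso=$1 retention_days=$2 now_iso=${3:-}
  local restore_epoch now_epoch oldest
  restore_epoch=$(date -u -d "$restore_iso" +%s) || return 2
  if [ -n "$now_iso" ]; then
    now_epoch=$(date -u -d "$now_iso" +%s) || return 2
  else
    now_epoch=$(date -u +%s)
  fi
  oldest=$(( now_epoch - retention_days * 86400 ))
  [ "$restore_epoch" -ge "$oldest" ] && [ "$restore_epoch" -le "$now_epoch" ]
}
```

For example, `within_retention "2026-04-22T14:30:00Z" 7 && echo "PITR possible"` checks the timestamp against the default 7-day window; pass 35 instead if the server is configured for the maximum.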
Plan ≈ 2 hours wall-clock for a single-tenant DB ≤ 50 GB on General-Purpose SKUs:
| Phase | Duration |
|---|---|
| Azure provisions the PITR instance | 60–90 min |
| Dump + restore | 15–30 min |
| Master flip + cache invalidation | 5 min |
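Summing the phase bounds in the table shows where the two-hour figure comes from: the plan sits near the upper end of the range. The snippet below is only a worked version of that arithmetic; `plan_minutes` is a hypothetical name.

```shell
# Sum the per-phase bounds (in minutes) from the timing table.
# plan_minutes is a hypothetical helper; it prints "<min> <max>".
plan_minutes() {
  local min=$(( 60 + 15 + 5 ))   # provision + dump/restore + flip, best case
  local max=$(( 90 + 30 + 5 ))   # provision + dump/restore + flip, worst case
  echo "$min $max"
}

plan_minutes   # prints "80 125"
```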
Flow
Playbook
1. Freeze the tenant:

        pnpm --filter @rcm/rcm-core tenant-status \
          --slug acme --to read_only \
          --reason "PITR restore to 2026-04-22T14:30Z"

   Wait one full TENANT_STATUS_CACHE_TTL_MS (default 30 s) for cache propagation.

2. Trigger the Azure PITR to a parallel server. Never overwrite live.

        az postgres flexible-server restore \
          --resource-group rcm-prod \
          --name pg-us-east-1-pitr-202604221430 \
          --source-server pg-us-east-1 \
          --restore-time "2026-04-22T14:30:00Z"

   Capture the new server's FQDN. Expect 60–90 min.

3. Dump the one tenant DB from the PITR instance:

        pg_dump -Fc -Z 6 \
          -h pg-us-east-1-pitr-202604221430.postgres.database.azure.com \
          -U <admin-user> \
          -d tenant_acme \
          -f /tmp/rcm-restore-acme-202604221430.dump

   The DB-name convention matches provision-tenant: hyphens in the slug become underscores in the database name.

4. Load into the live server, into a new database, so you can compare before cutting over:

        psql -h pg-us-east-1.postgres.database.azure.com -U <admin-user> \
          -c "CREATE DATABASE tenant_acme_restore_202604221430"

        pg_restore --clean --if-exists --no-owner --no-privileges \
          -h pg-us-east-1.postgres.database.azure.com \
          -U <admin-user> \
          -d tenant_acme_restore_202604221430 \
          /tmp/rcm-restore-acme-202604221430.dump

5. Run the parity probe before cutting over, using the same canary tables move-tenant uses:

        SELECT
          (SELECT count(*) FROM identity.organization) AS org_count,
          (SELECT count(*) FROM security.app_user) AS user_count,
          (SELECT count(*) FROM members.member) AS member_count;

   Restore counts should be ≤ live for rows created after the restore time. Stop and reconsider if numbers are orders-of-magnitude off, or zero where non-zero is expected.

6. Cut over. Two supported strategies:

   - Swap-and-keep (default). Key Vault and master already point to the canonical name, so this needs only renames:

            -- Terminate every connection to the old DB first
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE datname = 'tenant_acme';

            ALTER DATABASE tenant_acme RENAME TO tenant_acme_pre_pitr_202604221430;
            ALTER DATABASE tenant_acme_restore_202604221430 RENAME TO tenant_acme;

     Schedule the _pre_pitr_* drop for one week out.

   - Flip-the-ref. Leave the restored DB under its suffix, update Key Vault to point at it, and update master:

            UPDATE identity.tenant
            SET db_config_ref = 'tenant-db-acme-202604221430'
            WHERE slug = 'acme';

     Use this when you want a documented, auditable rename.

7. Invalidate caches so the next request hits the restored DB:

        SELECT pg_notify('tenant.status.changed', '<tenant-uuid>');

   Operators should also rolling-restart the rcm-core fleet so TenantMetadataCache + TenantStatusCache + the resolver pool clear. Without this, existing Node processes may hold a pool pointing at the pre-restore DB for up to TENANT_METADATA_CACHE_TTL_MS (default 30 s).

8. Unfreeze:

        pnpm --filter @rcm/rcm-core tenant-status \
          --slug acme --to active --reason "PITR restore complete"

9. Audit: insert a TENANT_PITR_RESTORED row into identity.tenant_audit manually; there is no dedicated CLI yet.
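The playbook reuses one timestamp suffix across the PITR server name, the dump file, and the restore and pre-PITR database names. Deriving them once from the slug and restore time reduces the chance of a typo during an incident. This is a sketch under stated assumptions: the function names are hypothetical, GNU `date` is assumed, and the name patterns are inferred from the examples in the playbook rather than taken from provision-tenant itself.

```shell
# Sketch: derive the names used in the playbook from slug, server, and time.
# All function names here are hypothetical; GNU date is assumed for -d.

# tenant_db_name SLUG: hyphens in the slug become underscores.
tenant_db_name() {
  printf 'tenant_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# restore_stamp ISO_TIME: minute-resolution UTC suffix, e.g. 202604221430.
restore_stamp() {
  date -u -d "$1" +%Y%m%d%H%M
}

# restore_names SLUG SOURCE_SERVER ISO_TIME: print key=value lines.
restore_names() {
  local slug=$1 server=$2 iso=$3 db stamp
  db=$(tenant_db_name "$slug")
  stamp=$(restore_stamp "$iso") || return 1
  echo "pitr_server=${server}-pitr-${stamp}"
  echo "restore_db=${db}_restore_${stamp}"
  echo "pre_pitr_db=${db}_pre_pitr_${stamp}"
  echo "dump_file=/tmp/rcm-restore-${slug}-${stamp}.dump"
}
```

Running `restore_names acme pg-us-east-1 "2026-04-22T14:30:00Z"` reproduces the names in the examples above, starting with `pitr_server=pg-us-east-1-pitr-202604221430`.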
Dry-run procedure (recommended quarterly)
Exercise every step except the cutover. Useful for surfacing IAM / network / disk problems before you need them in earnest.
- Steps 1–4 above.
- Run the parity probe (step 5) and compare against a known-good snapshot.
- Drop the temporary restore DB and delete the PITR server.
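The parity comparison in a dry run can be made mechanical. The sketch below flags the two red lines named in the playbook: zero where non-zero is expected, and counts that are orders of magnitude apart. `parity_check` is a hypothetical helper, and the 10x threshold is an assumption chosen for illustration; the counts are expected to be pasted in from the probe output and the reference snapshot.

```shell
# Sketch: compare one parity-probe count against a reference count.
# parity_check is hypothetical; the 10x threshold is an assumed cut-off.

# parity_check NAME REFERENCE RESTORED
# Prints a verdict and returns 1 on a red flag, 0 otherwise.
parity_check() {
  local name=$1 reference=$2 restored=$3 lo hi
  if [ "$restored" -eq 0 ] && [ "$reference" -gt 0 ]; then
    echo "$name: FAIL restored=0 but reference=$reference"
    return 1
  fi
  # Order the two counts so the ratio test works in either direction.
  if [ "$reference" -lt "$restored" ]; then
    lo=$reference; hi=$restored
  else
    lo=$restored; hi=$reference
  fi
  if [ "$hi" -gt 0 ] && [ "$hi" -ge $(( lo * 10 )) ]; then
    echo "$name: FAIL reference=$reference restored=$restored differ by 10x or more"
    return 1
  fi
  echo "$name: OK reference=$reference restored=$restored"
}
```

A call such as `parity_check member_count 1000 950` passes, while `parity_check user_count 500 0` fails. A failure is a prompt to stop and look, not proof that the restore is bad.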
Rollback
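Because the default cutover keeps tenant_<slug>_pre_pitr_<ts> for a week, rollback is a reversed rename. Terminate connections to tenant_acme first, as in the cutover step, because PostgreSQL refuses to rename a database that other sessions are connected to:

    ALTER DATABASE tenant_acme RENAME TO tenant_acme_rollback_202604221430;
    ALTER DATABASE tenant_acme_pre_pitr_202604221430 RENAME TO tenant_acme;

Then re-invalidate caches per step 7.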
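To avoid hand-editing database names under pressure, the rollback statements can be generated from the slug and the timestamp suffix. This is an illustrative sketch: `rollback_sql` is a hypothetical helper, and it only prints SQL for review; nothing is executed against the server.

```shell
# Sketch: print the rollback SQL for a swap-and-keep cutover.
# rollback_sql is hypothetical; it prints statements and executes nothing.

# rollback_sql SLUG STAMP   (STAMP is the YYYYMMDDHHMM suffix used at cutover)
rollback_sql() {
  local slug=$1 stamp=$2 db
  db="tenant_$(printf '%s' "$slug" | tr '-' '_')"
  cat <<SQL
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '${db}';
ALTER DATABASE ${db} RENAME TO ${db}_rollback_${stamp};
ALTER DATABASE ${db}_pre_pitr_${stamp} RENAME TO ${db};
SQL
}
```

Review the output of `rollback_sql acme 202604221430`, then run it from a session connected to a different database on the same server.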
Validation
| Check | Expected |
|---|---|
| New PITR server provisioned in Azure | Yes |
| Dump file size | Reasonable for tenant volume |
| Parity probe | Counts match restore-point expectation |
| Tenant active post-cutover | Yes |
| Customer can sign in | Yes |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Restore-time outside retention window | Past 7 days (default) | Use geo-redundant daily backup instead. |
| pg_terminate_backend returns rows you didn't expect | Background workers | Re-run after pg-boss is paused, or flip the tenant to suspended briefly. |
| First request after cutover hits old DB | Cache hadn't expired | Rolling-restart, or wait TENANT_METADATA_CACHE_TTL_MS. |
| pg_restore complains about extensions | Extension not pre-installed on live server | Install on live server with CREATE EXTENSION before restore. |