Restore one tenant to a prior moment 🚨

Outcome

One tenant's database is reverted to a specific timestamp without affecting any other tenant on the same server. Validated against Postgres 17 on Azure Database for PostgreSQL — Flexible Server.

When to use this

A tenant reports logical corruption (bad migration, accidental delete, app bug that mangled rows) and wants a rollback to a specific minute.
A compliance event requires producing an "as-of" snapshot of one tenant's DB without touching siblings.
Hardware or region failure made the live server unrecoverable and geo-redundant PITR is the restore path.

Do not use this for routine point-lookups into a prior state — run pg_dump against the live server instead.

Prerequisites

PLATFORM_ADMIN + Azure RBAC: Microsoft.DBforPostgreSQL/flexibleServers/read and .../write on the resource group.
pg_dump + pg_restore on PATH, matching the server's major version.
Scratch disk ≥ 2× the tenant's DB size (dump + safety margin).
identity.db_server row for the source server with a resolvable admin_secret_ref.
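
The tooling and disk prerequisites above can be checked before starting. The following is a minimal sketch, not part of the rcm tooling: the helper names `pg_major`, `required_scratch_kb`, and `free_kb` are hypothetical, and the version parsing assumes the usual `pg_dump (PostgreSQL) 17.x` banner.

```shell
# Hypothetical preflight helpers; names are illustrative, not part of rcm-core.

# Extract the major version from a banner such as "pg_dump (PostgreSQL) 17.2".
pg_major() {
  printf '%s\n' "$1" | sed -n 's/^.*(PostgreSQL) \([0-9][0-9]*\).*$/\1/p'
}

# Scratch space needed in KiB: 2x the tenant DB size (dump + safety margin).
required_scratch_kb() {
  echo $(( $1 * 2 ))
}

# Free space in KiB on the filesystem that holds the given directory.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example (not executed here; db_size_kb is a value you supply):
#   [ "$(pg_major "$(pg_dump --version)")" = "17" ] || echo "pg_dump major mismatch"
#   [ "$(free_kb /tmp)" -ge "$(required_scratch_kb "$db_size_kb")" ] || echo "not enough scratch disk"
```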

Restore-time target

PITR granularity: 1 second inside the server's retention window (default 7 days, configurable up to 35). Outside retention, geo-redundant backup is the only option and is daily-coarse.
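
Whether a requested restore time falls inside the retention window can be estimated before calling Azure. This is a rough sketch under stated assumptions: `within_retention` is a hypothetical helper, it works on Unix epoch seconds, and it only approximates the window from the retention setting. Azure's own view of the earliest restore point remains authoritative.

```shell
# Hypothetical helper: succeeds when the restore time is inside the window.
# Arguments: restore time and current time as Unix epoch seconds,
# plus the retention in days (defaults to the documented 7).
within_retention() {
  restore_epoch="$1"; now_epoch="$2"; retention_days="${3:-7}"
  earliest=$(( now_epoch - retention_days * 86400 ))
  [ "$restore_epoch" -ge "$earliest" ] && [ "$restore_epoch" -le "$now_epoch" ]
}

# Example with GNU date (the -d flag is not available in BSD/macOS date):
#   within_retention "$(date -u -d '2026-04-22T14:30:00Z' +%s)" "$(date -u +%s)" 7 \
#     || echo "outside retention: use the geo-redundant backup"
```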

Plan ≈ 2 hours wall-clock for a single-tenant DB ≤ 50 GB on General-Purpose SKUs:

| Phase | Duration |
| --- | --- |
| Azure provisions the PITR instance | 60–90 min |
| Dump + restore | 15–30 min |
| Master flip + cache invalidation | 5 min |

Flow

Playbook

Freeze the tenant

```
pnpm --filter @rcm/rcm-core tenant-status \
  --slug acme --to read_only \
  --reason "PITR restore to 2026-04-22T14:30Z"
```

Wait one full TENANT_STATUS_CACHE_TTL_MS (default 30 s) for cache propagation.
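
The wait can be scripted rather than timed by hand. A small sketch, assuming the TTL is available to the operator's shell under the same variable name the service uses (this is an assumption); `ttl_seconds` is a hypothetical helper.

```shell
# Hypothetical helper: convert a TTL in milliseconds to whole seconds, rounded up.
ttl_seconds() {
  echo $(( ($1 + 999) / 1000 ))
}

# Example (not executed here); falls back to the documented 30 s default:
#   sleep "$(ttl_seconds "${TENANT_STATUS_CACHE_TTL_MS:-30000}")"
```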

Trigger the Azure PITR to a parallel server — never overwrite live.

```
az postgres flexible-server restore \
  --resource-group rcm-prod \
  --name pg-us-east-1-pitr-202604221430 \
  --source-server pg-us-east-1 \
  --restore-time "2026-04-22T14:30:00Z"
```

Capture the new server's FQDN. Expect 60–90 min.
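
One way to capture the FQDN without watching the portal is to poll the new server until it is ready. The sketch below is illustrative only: `pitr_fqdn` is a hypothetical helper, and it assumes the `state` and `fullyQualifiedDomainName` fields in the `az postgres flexible-server show` output, which are worth confirming against your CLI version.

```shell
# Hypothetical helper: poll until the restored server reports "Ready",
# then print its fully qualified domain name.
# Args: resource group, server name, poll interval in seconds (default 60),
#       maximum number of polls (default 120, i.e. about 2 hours).
pitr_fqdn() {
  rg="$1"; name="$2"; interval="${3:-60}"; max_polls="${4:-120}"
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    state="$(az postgres flexible-server show \
      --resource-group "$rg" --name "$name" \
      --query state --output tsv 2>/dev/null)"
    if [ "$state" = "Ready" ]; then
      az postgres flexible-server show \
        --resource-group "$rg" --name "$name" \
        --query fullyQualifiedDomainName --output tsv
      return 0
    fi
    polls=$(( polls + 1 ))
    sleep "$interval"
  done
  echo "timed out waiting for $name to become Ready" >&2
  return 1
}

# Example (not executed here):
#   PITR_HOST="$(pitr_fqdn rcm-prod pg-us-east-1-pitr-202604221430)"
```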

Dump the one tenant DB from the PITR instance:

```
pg_dump -Fc -Z 6 \
  -h pg-us-east-1-pitr-202604221430.postgres.database.azure.com \
  -U <admin-user> \
  -d tenant_acme \
  -f /tmp/rcm-restore-acme-202604221430.dump
```

The DB-name convention matches provision-tenant: hyphens in the slug become underscores in the database name.
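
The naming convention can be expressed as two small functions, which helps when the slug contains hyphens. This is a sketch of the convention as described in this playbook, not the actual provision-tenant code; `tenant_db_name` and `restore_db_name` are hypothetical names.

```shell
# Hypothetical helpers mirroring the naming used in this playbook.

# tenant_db_name acme-health  ->  tenant_acme_health
tenant_db_name() {
  printf 'tenant_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# restore_db_name acme 202604221430  ->  tenant_acme_restore_202604221430
restore_db_name() {
  printf '%s_restore_%s\n' "$(tenant_db_name "$1")" "$2"
}
```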

Load into the live server, into a new database, so you can compare before cutting over:

```
psql -h pg-us-east-1.postgres.database.azure.com -U <admin-user> \
  -c "CREATE DATABASE tenant_acme_restore_202604221430"

pg_restore --clean --if-exists --no-owner --no-privileges \
  -h pg-us-east-1.postgres.database.azure.com \
  -U <admin-user> \
  -d tenant_acme_restore_202604221430 \
  /tmp/rcm-restore-acme-202604221430.dump
```

Run the parity probe before cutting over — same canary tables move-tenant uses:
```
SELECT
  (SELECT count(*) FROM identity.organization) AS org_count,
  (SELECT count(*) FROM security.app_user)     AS user_count,
  (SELECT count(*) FROM members.member)        AS member_count;
```
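Expect restored counts to be ≤ live counts, because rows created after the restore time are absent from the restore. Tables affected by the incident itself (for example an accidental delete) can legitimately show more rows in the restore. Stop and reconsider if the numbers are orders of magnitude off, or zero where non-zero is expected.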
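
The stop conditions can be turned into a mechanical check per canary table. The sketch below is one possible interpretation, not an official gate: `parity_ok` is a hypothetical helper, and treating "orders of magnitude" as a factor of 10 is an assumption you may want to tighten or loosen.

```shell
# Hypothetical helper: compare one live count against one restored count.
# Fails when the restore is empty but live is not, or when the two counts
# differ by more than 10x in either direction (10x is an assumed threshold).
parity_ok() {
  live="$1"; restored="$2"
  if [ "$restored" -eq 0 ] && [ "$live" -gt 0 ]; then
    return 1
  fi
  if [ $(( restored * 10 )) -lt "$live" ] || [ "$restored" -gt $(( live * 10 )) ]; then
    return 1
  fi
  return 0
}

# Example (not executed here); -At prints the bare count with no decoration:
#   live=$(psql -At -h "$LIVE_HOST" -U "$ADMIN" -d tenant_acme \
#     -c "SELECT count(*) FROM members.member")
#   restored=$(psql -At -h "$LIVE_HOST" -U "$ADMIN" -d tenant_acme_restore_202604221430 \
#     -c "SELECT count(*) FROM members.member")
#   parity_ok "$live" "$restored" || echo "members.member: investigate before cutover"
```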

Cut over. Two supported strategies:

Swap-and-keep (default) — Key Vault and master already point to the canonical name, so this needs only renames:

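```
-- Run while connected to a different database (e.g. postgres):
-- a database cannot be renamed while you are connected to it.
-- Terminate every other connection to the old DB first
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'tenant_acme'
  AND pid <> pg_backend_pid();

ALTER DATABASE tenant_acme RENAME TO tenant_acme_pre_pitr_202604221430;
ALTER DATABASE tenant_acme_restore_202604221430 RENAME TO tenant_acme;
```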

Schedule the _pre_pitr_* drop for one week out.

Flip-the-ref — leave the restored DB under its suffix, update Key Vault to point at it, and update master:

```
UPDATE identity.tenant
   SET db_config_ref = 'tenant-db-acme-202604221430'
 WHERE slug = 'acme';
```

Use this when you want the cutover recorded as a documented, auditable change in master rather than a database rename.

Invalidate caches so the next request hits the restored DB:
```
SELECT pg_notify('tenant.status.changed', '<tenant-uuid>');
```
Also perform a rolling restart of the rcm-core fleet so that TenantMetadataCache, TenantStatusCache, and the resolver pool are cleared. Without the restart, existing Node processes may hold a pool pointing at the pre-restore DB for up to TENANT_METADATA_CACHE_TTL_MS (default 30 s).

Unfreeze:

```
pnpm --filter @rcm/rcm-core tenant-status \
  --slug acme --to active --reason "PITR restore complete"
```

Audit: insert a TENANT_PITR_RESTORED row into identity.tenant_audit manually — there is no dedicated CLI yet.

Dry-run procedure (recommended quarterly)

Exercise every step except the cutover. This surfaces IAM, network, and disk problems before you need the procedure in earnest.

Steps 1–4 above.
Run the parity probe (step 5) and compare against a known-good snapshot.
Drop the temporary restore DB and delete the PITR server.
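
The dry-run cleanup can be scripted as below. This is a sketch under stated assumptions: `cleanup_dry_run` is a hypothetical helper, it assumes a `postgres` maintenance database exists on the live server to connect through, and it refuses any database name that does not contain `_restore_` so a typo cannot drop the live tenant DB.

```shell
# Hypothetical helper: drop the temporary restore DB, then delete the PITR server.
# Args: live server host, admin user, restore DB name, resource group, PITR server name.
cleanup_dry_run() {
  live_host="$1"; admin_user="$2"; restore_db="$3"; rg="$4"; pitr_server="$5"
  # Guard: only names following the *_restore_* convention may be dropped.
  case "$restore_db" in
    *_restore_*) ;;
    *) echo "refusing to drop '$restore_db': not a restore database" >&2; return 1 ;;
  esac
  # Connect via the maintenance DB (assumed to be "postgres").
  psql -h "$live_host" -U "$admin_user" -d postgres \
    -c "DROP DATABASE IF EXISTS \"$restore_db\"" || return 1
  az postgres flexible-server delete \
    --resource-group "$rg" --name "$pitr_server" --yes
}

# Example (not executed here):
#   cleanup_dry_run pg-us-east-1.postgres.database.azure.com "$ADMIN" \
#     tenant_acme_restore_202604221430 rcm-prod pg-us-east-1-pitr-202604221430
```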

Rollback

Because the default cutover keeps tenant_<slug>_pre_pitr_<ts> for a week, rollback is a reversed rename:

```
ALTER DATABASE tenant_acme RENAME TO tenant_acme_rollback_202604221430;
ALTER DATABASE tenant_acme_pre_pitr_202604221430 RENAME TO tenant_acme;
```

Then re-invalidate caches per step 7.

Validation

| Check | Expected |
| --- | --- |
| New PITR server provisioned in Azure | Yes |
| Dump file size | Reasonable for tenant volume |
| Parity probe | Counts match restore-point expectation |
| Tenant `active` post-cutover | Yes |
| Customer can sign in | Yes |

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Restore-time outside retention window | Past 7 days (default) | Use geo-redundant daily backup instead. |
| `pg_terminate_backend` returns rows you didn't expect | Background workers | Re-run after `pg-boss` is paused, or flip the tenant to `suspended` briefly. |
| First request after cutover hits old DB | Cache hadn't expired | Rolling-restart, or wait `TENANT_METADATA_CACHE_TTL_MS`. |
| `pg_restore` complains about extensions | Extension not pre-installed on live server | Install on live server with `CREATE EXTENSION` before restore. |
