Restore one tenant to a prior moment 🚨

Outcome

One tenant's database is reverted to a specific timestamp without affecting any other tenant on the same server. Validated against Postgres 17 on Azure Database for PostgreSQL — Flexible Server.

When to use this

  • Tenant reports logical corruption (bad migration, accidental delete, app bug that mangled rows) and wants rollback to a specific minute.
  • A compliance event requires producing an "as-of" snapshot of one tenant's DB without touching siblings.
  • Hardware or region failure made the live server unrecoverable and geo-redundant PITR is the restore path.

Do not use this for routine point-lookups into a prior state — run pg_dump against the live server instead.

Prerequisites

  • PLATFORM_ADMIN + Azure RBAC: Microsoft.DBforPostgreSQL/flexibleServers/read and .../write on the resource group.
  • pg_dump + pg_restore on PATH at the server's major version.
  • Scratch disk ≥ 2× the tenant's DB size (dump + safety margin).
  • identity.db_server row for the source server with a resolvable admin_secret_ref.
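
The tool and disk prerequisites above can be checked before starting. The following is a minimal sketch; the helper names (pg_major_of, has_scratch_space, check_pg_tools) are hypothetical and not part of the rcm tooling, and the byte counts are assumed to be supplied by the operator.

```shell
# Preflight sketch. All helper names here are illustrative assumptions.

# Extract the major version from output such as "pg_dump (PostgreSQL) 17.2".
pg_major_of() {
  printf '%s\n' "$1" | sed -E 's/^.*\) ([0-9]+).*$/\1/'
}

# Succeed when free scratch space is at least 2x the tenant DB size (both in bytes).
has_scratch_space() {
  local free_bytes=$1 db_bytes=$2
  [ "$free_bytes" -ge $((db_bytes * 2)) ]
}

# Check that pg_dump and pg_restore are on PATH and report the expected major version.
check_pg_tools() {
  local want_major=$1 tool
  for tool in pg_dump pg_restore; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
    [ "$(pg_major_of "$("$tool" --version)")" = "$want_major" ] \
      || { echo "wrong major version: $tool" >&2; return 1; }
  done
}
```

For example, `check_pg_tools 17` covers the client tools, and `has_scratch_space "$free" "$db_size"` covers the disk requirement once both values are known in bytes.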

Restore-time target

PITR granularity is 1 second within the server's backup retention window (default 7 days, configurable up to 35). Outside the window, the geo-redundant backup is the only option, and its granularity is daily.
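
Whether a requested restore time falls inside the retention window can be checked before calling Azure. This is a sketch under stated assumptions: it needs GNU date (the -d flag), within_retention is a hypothetical helper, and the window is computed purely from the retention setting, so the earliest restore point Azure actually offers may differ.

```shell
# Returns 0 when the requested restore time lies inside the PITR retention window.
# Assumes GNU date; now_epoch can be passed in to make the check reproducible.
within_retention() {
  local restore_time=$1 retention_days=${2:-7} now_epoch=${3:-$(date -u +%s)}
  local restore_epoch oldest
  restore_epoch=$(date -u -d "$restore_time" +%s) || return 2
  oldest=$((now_epoch - retention_days * 86400))
  # Reject times older than the window and times in the future.
  [ "$restore_epoch" -ge "$oldest" ] && [ "$restore_epoch" -le "$now_epoch" ]
}
```

For example, `within_retention "2026-04-22T14:30:00Z" 7` succeeds only if that instant is at most seven days old at the time of the call.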

Plan ≈ 2 hours wall-clock for a single-tenant DB ≤ 50 GB on General-Purpose SKUs:

| Phase | Duration |
| --- | --- |
| Azure provisions the PITR instance | 60–90 min |
| Dump + restore | 15–30 min |
| Master flip + cache invalidation | 5 min |

Playbook

  1. Freeze the tenant

    pnpm --filter @rcm/rcm-core tenant-status \
    --slug acme --to read_only \
    --reason "PITR restore to 2026-04-22T14:30Z"

    Wait one full TENANT_STATUS_CACHE_TTL_MS (default 30 s) for cache propagation.

  2. Trigger the Azure PITR to a parallel server — never overwrite live.

    az postgres flexible-server restore \
    --resource-group rcm-prod \
    --name pg-us-east-1-pitr-202604221430 \
    --source-server pg-us-east-1 \
    --restore-time "2026-04-22T14:30:00Z"

    Capture the new server's FQDN. Expect 60–90 min.

  3. Dump the one tenant DB from the PITR instance:

    pg_dump -Fc -Z 6 \
    -h pg-us-east-1-pitr-202604221430.postgres.database.azure.com \
    -U <admin-user> \
    -d tenant_acme \
    -f /tmp/rcm-restore-acme-202604221430.dump

    The DB-name convention matches provision-tenant: hyphens in the slug become underscores in the database name.

  4. Load into the live server, into a new database, so you can compare before cutting over:

    psql -h pg-us-east-1.postgres.database.azure.com -U <admin-user> -d postgres \
    -c "CREATE DATABASE tenant_acme_restore_202604221430"

    pg_restore --clean --if-exists --no-owner --no-privileges \
    -h pg-us-east-1.postgres.database.azure.com \
    -U <admin-user> \
    -d tenant_acme_restore_202604221430 \
    /tmp/rcm-restore-acme-202604221430.dump

  5. Run the parity probe before cutting over, using the same canary tables that move-tenant uses:

    SELECT
    (SELECT count(*) FROM identity.organization) AS org_count,
    (SELECT count(*) FROM security.app_user) AS user_count,
    (SELECT count(*) FROM members.member) AS member_count;

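    Run the probe against both tenant_acme (live) and tenant_acme_restore_202604221430. Restored counts should normally be ≤ live counts, because rows created after the restore time are absent from the restore; they can exceed live only where the restore brings back rows that were deleted. Stop and reconsider if the numbers are orders of magnitude off, or zero where non-zero is expected.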

  6. Cut over. Two supported strategies:

    Swap-and-keep (default) — Key Vault and master already point to the canonical name, so this needs only renames:

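    -- Run from another database (for example postgres): the current database cannot be renamed.
    -- Terminate every other connection to the old DB first
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE datname = 'tenant_acme'
    AND pid <> pg_backend_pid();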

    ALTER DATABASE tenant_acme RENAME TO tenant_acme_pre_pitr_202604221430;
    ALTER DATABASE tenant_acme_restore_202604221430 RENAME TO tenant_acme;

    Schedule the _pre_pitr_* drop for one week out.

    Flip-the-ref — leave the restored DB under its suffix, update Key Vault to point at it, and update master:

    UPDATE identity.tenant
    SET db_config_ref = 'tenant-db-acme-202604221430'
    WHERE slug = 'acme';

    Use this when you want a documented, auditable change to the tenant's config reference instead of a database rename.

  7. Invalidate caches so the next request hits the restored DB:

    SELECT pg_notify('tenant.status.changed', '<tenant-uuid>');

    Also rolling-restart the rcm-core fleet so that TenantMetadataCache, TenantStatusCache, and the resolver pool are cleared. Without the restart, existing Node processes may keep a pool pointing at the pre-restore DB for up to TENANT_METADATA_CACHE_TTL_MS (default 30 s).

  8. Unfreeze:

    pnpm --filter @rcm/rcm-core tenant-status \
    --slug acme --to active --reason "PITR restore complete"

  9. Audit: manually insert a TENANT_PITR_RESTORED row into identity.tenant_audit; there is no dedicated CLI yet.
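
The playbook reuses one slug and one timestamp suffix across the PITR server name, the database names, and the dump path. A small helper can derive them all from the inputs so nothing is retyped by hand. This is a sketch: slug_to_db, restore_suffix, and pitr_names are hypothetical names, and only the hyphen-to-underscore rule stated in step 3 is implemented; provision-tenant may apply further rules.

```shell
# Map a tenant slug to its database name: hyphens become underscores.
slug_to_db() {
  printf 'tenant_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# Derive the compact suffix from an ISO-8601 UTC time:
# 2026-04-22T14:30:00Z becomes 202604221430 (seconds are dropped).
restore_suffix() {
  printf '%s\n' "$1" | tr -d ':TZ-' | cut -c1-12
}

# Print every name used in the playbook as KEY=value lines.
pitr_names() {
  local slug=$1 restore_time=$2 source_server=$3
  local db suffix
  db=$(slug_to_db "$slug")
  suffix=$(restore_suffix "$restore_time")
  cat <<EOF
PITR_SERVER=${source_server}-pitr-${suffix}
TENANT_DB=${db}
RESTORE_DB=${db}_restore_${suffix}
PRE_PITR_DB=${db}_pre_pitr_${suffix}
DUMP_FILE=/tmp/rcm-restore-${slug}-${suffix}.dump
EOF
}
```

Calling `pitr_names acme 2026-04-22T14:30:00Z pg-us-east-1` prints the same names that appear in steps 2 through 6.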

Dry run

Exercise every step except the cutover. This surfaces IAM, network, and disk problems before you need the procedure in earnest.

  1. Steps 1–4 above.
  2. Run the parity probe (step 5) and compare against a known-good snapshot.
  3. Drop the temporary restore DB and delete the PITR server.
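
The cleanup in the last step might look like the sketch below. The run and cleanup_drill helpers are hypothetical; by default they only print the commands (DRY_RUN=1), so the output can be reviewed before anything is deleted. The psql and az flags used are standard ones, but the host, user, and resource names are whatever the operator passes in.

```shell
# Print each command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Drop the temporary restore DB on the live server, then delete the PITR server.
cleanup_drill() {
  local live_host=$1 admin_user=$2 restore_db=$3 resource_group=$4 pitr_server=$5
  run psql -h "$live_host" -U "$admin_user" -d postgres \
    -c "DROP DATABASE IF EXISTS $restore_db"
  run az postgres flexible-server delete \
    --resource-group "$resource_group" --name "$pitr_server" --yes
}
```

Set DRY_RUN=0 only after checking that the printed database and server names are the temporary ones, never the live tenant_acme or pg-us-east-1.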

Rollback

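Because the default cutover keeps tenant_<slug>_pre_pitr_<ts> for a week, rollback is a reversed rename. Terminate connections to tenant_acme first, as in step 6, because a database cannot be renamed while sessions are connected to it: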

ALTER DATABASE tenant_acme RENAME TO tenant_acme_rollback_202604221430;
ALTER DATABASE tenant_acme_pre_pitr_202604221430 RENAME TO tenant_acme;

Then re-invalidate caches per step 7.

Validation

| Check | Expected |
| --- | --- |
| New PITR server provisioned in Azure | Yes |
| Dump file size | Reasonable for tenant volume |
| Parity probe | Counts match restore-point expectation |
| Tenant active post-cutover | Yes |
| Customer can sign in | Yes |
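
The parity-probe check can be made mechanical once the counts from the live and restored databases are in hand. The sketch below is one possible reading of the stop conditions in step 5: parity_status is a hypothetical helper, and the 10x threshold for "orders of magnitude off" is an assumption, not a documented limit.

```shell
# Compare one live count with the restored count for the same table.
# Prints "zero" (restored is 0 where live is not), "far" (more than 10x apart), or "ok".
parity_status() {
  local live=$1 restored=$2
  if [ "$restored" -eq 0 ] && [ "$live" -gt 0 ]; then
    echo zero
  elif [ "$restored" -gt $((live * 10)) ] || [ "$live" -gt $((restored * 10)) ]; then
    echo far
  else
    echo ok
  fi
}
```

Each count can be fetched with psql in unaligned, tuples-only mode, for example `psql -At -d tenant_acme -c "SELECT count(*) FROM members.member"`, and passed in per table. Any result other than ok is a prompt for a human to look, not an automatic verdict.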

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Restore-time outside retention window | Past 7 days (default) | Use geo-redundant daily backup instead. |
| pg_terminate_backend returns rows you didn't expect | Background workers | Re-run after pg-boss is paused, or flip the tenant to suspended briefly. |
| First request after cutover hits old DB | Cache hadn't expired | Rolling-restart, or wait TENANT_METADATA_CACHE_TTL_MS. |
| pg_restore complains about extensions | Extension not pre-installed on live server | Install on live server with CREATE EXTENSION before restore. |
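
Extension mismatches can be spotted before pg_restore runs by comparing the extension lists of the two databases. This is a sketch; missing_extensions is a hypothetical helper that takes two files, each holding one extension name per line.

```shell
# Print extensions present in the first list but absent from the second.
# Each argument is a file with one extension name per line.
missing_extensions() {
  local source_list=$1 target_list=$2
  local tmp_a tmp_b
  tmp_a=$(mktemp)
  tmp_b=$(mktemp)
  # comm needs sorted input; sort -u also removes duplicate names.
  sort -u "$source_list" >"$tmp_a"
  sort -u "$target_list" >"$tmp_b"
  comm -23 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}
```

The lists can be produced with `psql -At -c "SELECT extname FROM pg_extension"` against the tenant DB on the PITR instance and against the new restore DB on the live server. On Flexible Server an extension typically also has to be allow-listed in the azure.extensions server parameter before CREATE EXTENSION succeeds.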
