
Apply pending migrations to every tenant

Outcome

Every registered tenant database is at the latest knex migration head, and CI blocks the deploy if any tenant fails.

Prerequisites

  • PLATFORM_ADMIN (CLI access).
  • Master migrations already applied (migrate:master).
  • The deploy pipeline calls this between master migrations and the app rollout. Manual runs are reserved for post-incident convergence.

When to use fan-out vs. cohorts

| Situation | Tool |
| --- | --- |
| ≤ 200 tenants, single-environment deploy | Fan-out — this chapter |
| 200–1000 tenants, schema-only migration | Fan-out with `--concurrency 3` |
| 1000+ tenants, OR migration carries backfill, OR first-time touching a hot table | Cohorted migrations |
| Any deploy where a partial failure should pause the rest | Cohorted migrations |

Rule of thumb: if you would hate to learn 30 minutes in that 800 of 1000 tenants silently drifted, use cohorts.

Flow

Steps

  1. Run the fan-out

    pnpm --filter @rcm/rcm-core migrate-all-tenants

    The CLI walks every active + read-only tenant in master, applies pending migrations, and prints a per-tenant + final summary report. Default concurrency: 3 tenants in parallel; default per-tenant timeout: 120 s.

  2. Read the summary

    | Outcome | Action |
    | --- | --- |
    | All tenants succeed (exit 0) | Proceed to app deploy |
    | One or more failed (exit non-zero) | CI blocks deploy. Triage per step 3. |
  3. Triage failures from the summary block. Common causes:

    | Error excerpt | Cause | Fix |
    | --- | --- | --- |
    | `connection refused` / `ETIMEDOUT` | Transient network loss or pgBouncer hiccup | Re-run the failed slug only; if it persists, check `identity.db_server` health. |
    | `Key Vault secret "<ref>" missing` | `db_config_ref` drift — secret was deleted or renamed | Restore the secret from KV soft-delete, or re-provision the tenant. |
    | `relation already exists` / `column already exists` | Schema was hand-edited and `knex_migrations` is out of sync | Reconcile `knex_migrations` manually, then re-run. |
    | `migrate.latest timed out after 120000ms` | Genuine long migration, or a lock holder | Re-run with `--timeout-ms 600000`. Check for blockers: `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL`. |
  4. Converge the failed tenant without rescanning the whole estate:

    pnpm --filter @rcm/rcm-core migrate-all-tenants --only acme
  5. If a tenant must stay offline while under repair, flip its status to suspended (see Suspend a tenant). Future fan-out runs skip suspended tenants automatically and CI stops blocking on them.
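Conceptually, the fan-out in step 1 behaves like a bounded worker pool: a fixed number of tenants in flight, each migration wrapped in a timeout, and per-tenant outcomes collected rather than failing fast. A minimal sketch of that shape in TypeScript — the names (`fanOut`, `migrateTenant`-style callbacks) are illustrative, not the CLI's internals:

```typescript
// Sketch of a fan-out scheduler: run one task per tenant with bounded
// concurrency and a per-task timeout, collecting per-tenant outcomes
// so the summary can report every failure (hypothetical helper names).
type Outcome = { slug: string; ok: boolean; error?: string };

async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function fanOut(
  slugs: string[],
  migrate: (slug: string) => Promise<void>,
  concurrency = 3,
  timeoutMs = 120_000,
): Promise<Outcome[]> {
  const queue = [...slugs];
  const results: Outcome[] = [];
  // Each worker pulls the next slug until the shared queue drains.
  const worker = async () => {
    while (true) {
      const slug = queue.shift();
      if (slug === undefined) return;
      try {
        await withTimeout(migrate(slug), timeoutMs);
        results.push({ slug, ok: true });
      } catch (e) {
        results.push({ slug, ok: false, error: String(e) });
      }
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

A non-zero exit code then falls out naturally: exit 1 if any outcome has `ok: false`, which is what lets CI block the deploy.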

Pre-flag tenants read_only before risky migrations

For a migration that must not run against live writes, set identity.tenant.status = read_only ahead of the fan-out. read_only is in the default status filter so the migration still applies, but enforceTenantStatus middleware rejects write verbs while the migration runs.

# Pre-flag each affected tenant (repeat per slug)
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to read_only \
--reason "Pre-deploy gate for migration 105"
# ... fan out ...
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to active \
--reason "Migration 105 complete"
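The gate that read_only buys you amounts to a verb check against tenant status. A minimal sketch of that logic, in the spirit of enforceTenantStatus (illustrative only, not the real middleware):

```typescript
// Status gate sketch: read_only tenants still serve reads, but write
// verbs are rejected while the migration runs; suspended tenants are
// offline entirely. Illustrative, not the actual enforceTenantStatus.
type TenantStatus = "active" | "read_only" | "suspended";

const WRITE_VERBS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

function allowRequest(status: TenantStatus, method: string): boolean {
  if (status === "suspended") return false; // offline entirely
  if (status === "read_only") return !WRITE_VERBS.has(method.toUpperCase());
  return true; // active: all verbs allowed
}
```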

For 1000+ tenants this is manual toil — graduate to cohorted migrations where the gate is automatic per cohort.

Validation

| Check | Expected |
| --- | --- |
| CLI exit code | 0 |
| Per-tenant `knex_migrations` head | Matches latest migration filename |
| `identity.tenant_audit` | No migration-failure rows |
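The per-tenant head check is easy to script. A sketch, assuming you have already fetched each tenant's newest `knex_migrations.name` elsewhere (the helper name is invented):

```typescript
// Given each tenant's newest applied migration name (queried from that
// tenant's knex_migrations table elsewhere) and the latest migration
// filename on disk, report drifted tenants. Illustrative helper only.
function findDrifted(
  heads: Record<string, string>, // slug -> newest applied migration name
  latest: string, // latest migration filename on disk
): string[] {
  return Object.entries(heads)
    .filter(([, head]) => head !== latest)
    .map(([slug]) => slug)
    .sort();
}
```

An empty result means every tenant is at the migration head; any slugs returned are candidates for a `--only <slug>` re-run.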

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Same tenant fails repeatedly with connectivity error | pgBouncer or `db_server` issue | See Tenant DB unreachable. |
| Migration applies cleanly in dev, fails in prod | Production has rows that violate the new constraint | Run a backfill before tightening the constraint; or split into two migrations (loosen first, tighten next deploy). |
| `--concurrency 3` saturates the master replica | Read-replica is undersized | Drop `--concurrency` to 1 for the affected wave; consider promoting to cohorted migrations. |
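The loosen-first / tighten-later split can be sketched as two Postgres migrations: deploy 1 adds the constraint `NOT VALID` (no scan of existing rows, so it cannot fail on dirty data), a backfill runs between deploys, and deploy 2 validates it. The table and column names below are invented for illustration, and the knex handle is reduced to a bare `raw` executor:

```typescript
// Two-deploy constraint tightening (sketch; "invoices"/"status" are
// invented names). Deploy 1 attaches the CHECK as NOT VALID, so existing
// violating rows don't break the migration. After backfilling between
// deploys, deploy 2 runs VALIDATE CONSTRAINT, which only scans the table
// and does not block writes for the duration.
type Executor = { raw: (sql: string) => Promise<unknown> };

async function up_deploy1(knex: Executor) {
  await knex.raw(
    `ALTER TABLE invoices
       ADD CONSTRAINT invoices_status_not_null
       CHECK (status IS NOT NULL) NOT VALID`,
  );
}

async function up_deploy2(knex: Executor) {
  // Run only after the backfill, e.g.:
  //   UPDATE invoices SET status = 'unknown' WHERE status IS NULL;
  await knex.raw(
    `ALTER TABLE invoices
       VALIDATE CONSTRAINT invoices_status_not_null`,
  );
}
```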

Next

1.6 — Cohorted migrations for large fleets