Apply pending migrations to every tenant
Outcome
Every registered tenant database is at the latest knex migration head, and CI blocks the deploy if any tenant fails.
Prerequisites
- PLATFORM_ADMIN (CLI access).
- Master migrations already applied (`migrate:master`).
- The deploy pipeline calls this between master migrations and the app rollout. Manual runs are reserved for post-incident convergence.
When to use fan-out vs. cohorts
| Situation | Tool |
|---|---|
| ≤ 200 tenants, single-environment deploy | Fan-out — this chapter |
| 200–1000 tenants, schema-only migration | Fan-out with --concurrency 3 |
| 1000+ tenants, OR migration carries backfill, OR first-time touching a hot table | Cohorted migrations |
| Any deploy where a partial failure should pause the rest | Cohorted migrations |
Rule of thumb: if you would hate to learn 30 minutes in that 800 of 1000 tenants silently drifted, use cohorts.
Flow
Steps
Run the fan-out
```shell
pnpm --filter @rcm/rcm-core migrate-all-tenants
```
The CLI walks every active and read-only tenant in master, applies pending migrations, and prints a per-tenant and final summary report. Default concurrency: 3 tenants in parallel; default per-tenant timeout: 120 s.
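Under the hood, the fan-out is essentially a concurrency-limited loop over tenants. A minimal sketch of such a runner (illustrative only; the real CLI's internals may differ):

```typescript
// Minimal sketch of a concurrency-limited fan-out runner.
// Illustrative only: not the actual CLI source.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the list is drained.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
  return results;
}
```

With `limit = 3` this mirrors the default concurrency. In a real runner, per-tenant failures would be caught inside `fn` and rolled into the summary rather than rejecting the whole batch.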
Read the summary
| Outcome | Action |
|---|---|
| All tenants succeed (exit 0) | Proceed to app deploy |
| One or more failed (exit non-zero) | CI blocks deploy; triage per below |

Triage failures from the summary block. Common causes:

| Error excerpt | Cause | Fix |
|---|---|---|
| `connection refused` / `ETIMEDOUT` | Transient network loss or pgBouncer hiccup | Re-run the failed slug only; if it persists, check `identity.db_server` health. |
| `Key Vault secret "<ref>" missing` | `db_config_ref` drift: the secret was deleted or renamed | Restore the secret from KV soft-delete, or re-provision the tenant. |
| `relation already exists` / `column already exists` | Schema was hand-edited and `knex_migrations` is out of sync | Reconcile `knex_migrations` manually, then re-run. |
| `migrate.latest timed out after 120000ms` | Genuinely long migration, or a lock holder | Re-run with `--timeout-ms 600000`; check for blockers: `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL` |

Converge the failed tenant without rescanning the whole estate:
```shell
pnpm --filter @rcm/rcm-core migrate-all-tenants --only acme
```
If a tenant must stay offline while under repair, flip its status to `suspended` (see Suspend a tenant). Future fan-out runs skip suspended tenants automatically, and CI stops blocking on them.
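The skip behavior follows from the default status filter. Conceptually (illustrative types and names, not the actual CLI source):

```typescript
// Illustrative sketch of the status filter the fan-out applies.
type TenantStatus = "active" | "read_only" | "suspended";

// Default filter: active and read_only tenants are migrated;
// suspended tenants are skipped and do not block CI.
const DEFAULT_STATUS_FILTER: ReadonlySet<TenantStatus> = new Set<TenantStatus>([
  "active",
  "read_only",
]);

function shouldMigrate(status: TenantStatus): boolean {
  return DEFAULT_STATUS_FILTER.has(status);
}
```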
Pre-flag tenants read_only before risky migrations
For a migration that must not run against live writes, set `identity.tenant.status = read_only` ahead of the fan-out. `read_only` is in the default status filter, so the migration still applies, but the `enforceTenantStatus` middleware rejects write verbs while the migration runs.
```shell
# Pre-flag each in-scope tenant (repeat per slug)
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to read_only \
  --reason "Pre-deploy gate for migration 105"
# ... fan out ...
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to active \
  --reason "Migration 105 complete"
```
For 1000+ tenants this is manual toil — graduate to cohorted migrations where the gate is automatic per cohort.
Validation
| Check | Expected |
|---|---|
| CLI exit code | 0 |
| Per-tenant `knex_migrations` head | matches latest migration filename |
| `identity.tenant_audit` | no migration-failure rows |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Same tenant fails repeatedly with connectivity error | pgBouncer or db_server issue | See Tenant DB unreachable. |
| Migration applies cleanly to dev, fails in prod | Production has rows that violate the new constraint | Run a backfill before tightening the constraint; or split into two migrations (loosen first, tighten next deploy). |
| Concurrency=3 saturates the master replica | Read-replica is undersized | Drop --concurrency to 1 for the affected wave; consider promoting to cohorted migrations. |
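The loosen-first, tighten-later split from the table above can be sketched as a pair of knex migrations. Table and column names here are illustrative, and the knex client is typed loosely to keep the sketch dependency-free (in the repo this would be `import type { Knex } from "knex"`):

```typescript
// Illustrative loosen-first migration (deploy N): add the column as
// nullable and backfill, so existing production rows cannot violate it.
type KnexClient = any; // stand-in for Knex to keep the sketch self-contained

export async function up(knex: KnexClient): Promise<void> {
  await knex.schema.alterTable("invoices", (t: any) => {
    t.string("currency"); // nullable for now
  });
  // Backfill existing rows before the next deploy tightens the constraint.
  await knex("invoices").whereNull("currency").update({ currency: "USD" });
}

export async function down(knex: KnexClient): Promise<void> {
  await knex.schema.alterTable("invoices", (t: any) => {
    t.dropColumn("currency");
  });
}

// Migration N+1 (next deploy) would then tighten the constraint:
//   t.string("currency").notNullable().alter();
```

Splitting the change this way means the fan-out at deploy N can never fail on dirty rows, and deploy N+1 only tightens a constraint the backfill has already satisfied.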