
Apply pending migrations to every tenant

Outcome

Every registered tenant database is at the latest knex migration head, and CI blocks the deploy if any tenant fails.

Prerequisites

  • PLATFORM_ADMIN (CLI access).
  • Master migrations already applied (migrate:master).
  • The deploy pipeline calls this between master migrations and the app rollout. Manual runs are reserved for post-incident convergence.

When to use fan-out vs. cohorts

| Situation | Tool |
| --- | --- |
| ≤ 200 tenants, single-environment deploy | Fan-out — this chapter |
| 200–1000 tenants, schema-only migration | Fan-out with `--concurrency 3` |
| 1000+ tenants, OR migration carries backfill, OR first-time touching a hot table | Cohorted migrations |
| Any deploy where a partial failure should pause the rest | Cohorted migrations |

Rule of thumb: if you would hate to learn 30 minutes in that 800 of 1000 tenants silently drifted, use cohorts.

Flow

Steps

  1. Run the fan-out

    pnpm --filter @rcm/rcm-core migrate-all-tenants

    The CLI walks every active + read-only tenant in master, applies pending migrations, and prints a per-tenant + final summary report. Default concurrency: 3 tenants in parallel; default per-tenant timeout: 120 s.

  2. Read the summary

    | Outcome | Action |
    | --- | --- |
    | All tenants succeed (exit 0) | Proceed to app deploy |
    | One or more failed (exit non-zero) | CI blocks deploy. Triage per step 3. |
  3. Triage failures from the summary block. Common causes:

    | Error excerpt | Cause | Fix |
    | --- | --- | --- |
    | `connection refused` / `ETIMEDOUT` | Transient network loss or pgBouncer hiccup | Re-run the failed slug only; if it persists, check `identity.db_server` health. |
    | `Key Vault secret "<ref>" missing` | `db_config_ref` drift — secret was deleted or renamed | Restore the secret from KV soft-delete, or re-provision the tenant. |
    | `relation already exists` / `column already exists` | Schema was hand-edited and `knex_migrations` is out of sync | Reconcile `knex_migrations` manually, then re-run. |
    | `migrate.latest timed out after 120000ms` | Genuine long migration, or a lock holder | Re-run with `--timeout-ms 600000`. Check for blockers: `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL`. |
  4. Converge the failed tenant without rescanning the whole estate:

    pnpm --filter @rcm/rcm-core migrate-all-tenants --only acme
  5. If a tenant must stay offline while under repair, flip its status to suspended (see Suspend a tenant). Future fan-out runs skip suspended tenants automatically and CI stops blocking on them.
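Conceptually, the fan-out in step 1 behaves like a bounded worker pool: a fixed number of tenants in flight, each migration wrapped in a timeout, and per-tenant outcomes collected rather than failing fast. A minimal sketch of that shape in TypeScript — the names (`fanOut`, `migrateTenant`-style callbacks) are illustrative, not the CLI's internals:

```typescript
// Sketch of a fan-out scheduler: run one task per tenant with bounded
// concurrency and a per-task timeout, collecting per-tenant outcomes
// so the summary can report every failure (hypothetical helper names).
type Outcome = { slug: string; ok: boolean; error?: string };

async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function fanOut(
  slugs: string[],
  migrate: (slug: string) => Promise<void>,
  concurrency = 3,
  timeoutMs = 120_000,
): Promise<Outcome[]> {
  const queue = [...slugs];
  const results: Outcome[] = [];
  // Each worker pulls the next slug until the shared queue drains.
  const worker = async () => {
    while (true) {
      const slug = queue.shift();
      if (slug === undefined) return;
      try {
        await withTimeout(migrate(slug), timeoutMs);
        results.push({ slug, ok: true });
      } catch (e) {
        results.push({ slug, ok: false, error: String(e) });
      }
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

A non-zero exit code then falls out naturally: exit 1 if any outcome has `ok: false`, which is what lets CI block the deploy.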

Pre-flag tenants read_only before risky migrations

For a migration that must not run against live writes, set identity.tenant.status = read_only ahead of the fan-out. read_only is in the default status filter so the migration still applies, but enforceTenantStatus middleware rejects write verbs while the migration runs.

# Pre-flag each affected tenant (repeat per slug)
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to read_only \
--reason "Pre-deploy gate for migration 105"
# ... fan out ...
pnpm --filter @rcm/rcm-core tenant-status --slug acme --to active \
--reason "Migration 105 complete"
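The gate that read_only buys you amounts to a verb check against tenant status. A minimal sketch of that logic, in the spirit of enforceTenantStatus (illustrative only, not the real middleware):

```typescript
// Status gate sketch: read_only tenants still serve reads, but write
// verbs are rejected while the migration runs; suspended tenants are
// offline entirely. Illustrative, not the actual enforceTenantStatus.
type TenantStatus = "active" | "read_only" | "suspended";

const WRITE_VERBS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

function allowRequest(status: TenantStatus, method: string): boolean {
  if (status === "suspended") return false; // offline entirely
  if (status === "read_only") return !WRITE_VERBS.has(method.toUpperCase());
  return true; // active: all verbs allowed
}
```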

For 1000+ tenants this is manual toil — graduate to cohorted migrations where the gate is automatic per cohort.

Validation

| Check | Expected |
| --- | --- |
| CLI exit code | 0 |
| Per-tenant `knex_migrations` head | Matches latest migration filename |
| `identity.tenant_audit` | No migration-failure rows |
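The per-tenant head check is easy to script. A sketch, assuming you have already fetched each tenant's newest `knex_migrations.name` elsewhere (the helper name is invented):

```typescript
// Given each tenant's newest applied migration name (queried from that
// tenant's knex_migrations table elsewhere) and the latest migration
// filename on disk, report drifted tenants. Illustrative helper only.
function findDrifted(
  heads: Record<string, string>, // slug -> newest applied migration name
  latest: string, // latest migration filename on disk
): string[] {
  return Object.entries(heads)
    .filter(([, head]) => head !== latest)
    .map(([slug]) => slug)
    .sort();
}
```

An empty result means every tenant is at the migration head; any slugs returned are candidates for a `--only <slug>` re-run.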

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Same tenant fails repeatedly with connectivity error | pgBouncer or `db_server` issue | See Tenant DB unreachable. |
| Migration applies cleanly in dev, fails in prod | Production has rows that violate the new constraint | Run a backfill before tightening the constraint; or split into two migrations (loosen first, tighten next deploy). |
| `--concurrency 3` saturates the master replica | Read-replica is undersized | Drop `--concurrency` to 1 for the affected wave; consider promoting to cohorted migrations. |
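The loosen-first / tighten-later split can be sketched as two Postgres migrations: deploy 1 adds the constraint `NOT VALID` (no scan of existing rows, so it cannot fail on dirty data), a backfill runs between deploys, and deploy 2 validates it. The table and column names below are invented for illustration, and the knex handle is reduced to a bare `raw` executor:

```typescript
// Two-deploy constraint tightening (sketch; "invoices"/"status" are
// invented names). Deploy 1 attaches the CHECK as NOT VALID, so existing
// violating rows don't break the migration. After backfilling between
// deploys, deploy 2 runs VALIDATE CONSTRAINT, which only scans the table
// and does not block writes for the duration.
type Executor = { raw: (sql: string) => Promise<unknown> };

async function up_deploy1(knex: Executor) {
  await knex.raw(
    `ALTER TABLE invoices
       ADD CONSTRAINT invoices_status_not_null
       CHECK (status IS NOT NULL) NOT VALID`,
  );
}

async function up_deploy2(knex: Executor) {
  // Run only after the backfill, e.g.:
  //   UPDATE invoices SET status = 'unknown' WHERE status IS NULL;
  await knex.raw(
    `ALTER TABLE invoices
       VALIDATE CONSTRAINT invoices_status_not_null`,
  );
}
```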

Next

1.6 — Cohorted migrations for large fleets