Cohorted tenant migrations for large fleets
Outcome
A schema or backfill migration is applied across the fleet in waves — each wave staggered, monitored, and aborted if too many tenants fail in a single cohort. Aborted runs are resumable.
Prerequisites
- PLATFORM_ADMIN.
- Master DB has identity.migration_cohort_run (RM-13 S13).
- 1000+ tenants, OR a migration that carries backfill, OR the first time touching a hot table. Below this threshold, the fan-out is fine.
Why cohorts
A single fan-out can saturate the master catalog or the network when the fleet crosses ~1000 tenants. Cohorts replace one big wave with N smaller ones that can be audited, aborted, and resumed.
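To make the wave arithmetic concrete, here is a minimal sketch of splitting a fleet into cohorts. It assumes simple sequential chunking and counts only the stagger pauses between cohorts. The names `planCohorts` and `staggerFloorMs` are hypothetical, and the CLI's actual cohort ordering and projected-duration formula are not documented here.

```typescript
// Hypothetical helpers; the real migrate-cohorts internals are not shown in this runbook.

/** Split tenant slugs into consecutive cohorts of at most `cohortSize`. */
function planCohorts(tenants: string[], cohortSize: number): string[][] {
  if (!Number.isInteger(cohortSize) || cohortSize < 1) {
    throw new Error("cohortSize must be a positive integer");
  }
  const cohorts: string[][] = [];
  for (let i = 0; i < tenants.length; i += cohortSize) {
    cohorts.push(tenants.slice(i, i + cohortSize));
  }
  return cohorts;
}

/** Lower bound on run time: one stagger pause between each pair of adjacent cohorts. */
function staggerFloorMs(cohortCount: number, staggerMs: number): number {
  return Math.max(0, cohortCount - 1) * staggerMs;
}

const fleet: string[] = [];
for (let i = 0; i < 1000; i++) fleet.push(`tenant-${i}`);

const defaults = planCohorts(fleet, 100); // default --cohort-size
const cautious = planCohorts(fleet, 50);  // smaller cohorts for a risky migration

console.log(defaults.length, staggerFloorMs(defaults.length, 60000));  // 10 540000
console.log(cautious.length, staggerFloorMs(cautious.length, 120000)); // 20 2280000
```

Under these assumptions, the default shape spends at least 9 minutes in stagger pauses for 1000 tenants, and the cautious shape at least 38 minutes, before counting any migration time.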
Steps
1. Plan a dry run to validate cohort composition without applying anything.

       pnpm --filter @rcm/rcm-core migrate-cohorts --dry-run

   The CLI prints the per-cohort tenant list and the projected duration. Confirm nothing looks unexpected (e.g., a read_only tenant you forgot about).

2. Pick the cohort shape.

   | Lever | Default | When to change |
   |---|---|---|
   | --cohort-size | 100 | Smaller (e.g., 50) for risky migrations; larger only after proven clean. |
   | --stagger-ms | 60000 | Longer (e.g., 120000) to give monitoring + on-call humans more time between waves. |
   | --abort-rate | 0.1 | Normally stays at 0.1 (10% of a cohort). Lower it for migrations that must not partial-apply. |

3. Run with the chosen shape.

       pnpm --filter @rcm/rcm-core migrate-cohorts \
         --cohort-size 50 --stagger-ms 120000

   The CLI prints a per-cohort progress line and a final aggregate report. Every run writes one row to identity.migration_cohort_run at start, updates it per cohort, and finalises it at end-of-run.

4. If a cohort trips --abort-rate, the run halts. Inspect:

       SELECT run_id, started_at, finished_at,
              total_tenants, cohort_size, stagger_ms,
              succeeded_count, failed_count,
              aborted_at, abort_reason,
              report->'failedTenants' AS failed
       FROM identity.migration_cohort_run
       ORDER BY started_at DESC
       LIMIT 1;

5. Run migrate-all-tenants --only <slug> per failing tenant to surface the root error. Once fixed, resume:

       pnpm --filter @rcm/rcm-core migrate-cohorts --resume <run_id>

   Resume picks up at the next unapplied cohort, not from the top; Knex's migration tracking ensures already-applied tenants are skipped.
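The abort-and-resume behaviour described in the steps can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the CLI's implementation: `runCohorts` and `RunState` are hypothetical names, the abort test is assumed to be "failure fraction strictly greater than the abort rate", resume is assumed to restart at the cohort that aborted, and a `Set` stands in for Knex's per-tenant migration tracking.

```typescript
// Simulation only: names, the ">" comparison, and the resume point are assumptions.

type ApplyFn = (tenant: string) => boolean; // true means the migration succeeded

interface RunState {
  applied: Set<string>;           // tenants already migrated (stand-in for Knex tracking)
  failed: Set<string>;            // tenants whose latest attempt failed
  abortedAtCohort: number | null; // index of the cohort that tripped the abort rate
}

function runCohorts(
  cohorts: string[][],
  apply: ApplyFn,
  abortRate: number,
  prior?: RunState, // pass the previous state to resume an aborted run
): RunState {
  const state: RunState = {
    applied: new Set<string>(prior ? Array.from(prior.applied) : []),
    failed: new Set<string>(prior ? Array.from(prior.failed) : []),
    abortedAtCohort: null,
  };
  // Fresh runs start at cohort 0; resumed runs start at the cohort that aborted.
  const start = prior?.abortedAtCohort ?? 0;

  for (let i = start; i < cohorts.length; i++) {
    const cohort = cohorts[i] ?? [];
    let failedInCohort = 0;
    for (const tenant of cohort) {
      if (state.applied.has(tenant)) continue; // already applied, skip on resume
      if (apply(tenant)) {
        state.applied.add(tenant);
        state.failed.delete(tenant);
      } else {
        failedInCohort++;
        state.failed.add(tenant);
      }
    }
    if (cohort.length > 0 && failedInCohort / cohort.length > abortRate) {
      state.abortedAtCohort = i; // halt the run; later cohorts are untouched
      return state;
    }
    // A real run would pause for --stagger-ms here before the next cohort.
  }
  return state;
}

const cohorts = [["a", "b"], ["c", "d"], ["e", "f"]];
const broken = new Set<string>(["c"]);

// First attempt: tenant "c" fails, so cohort index 1 exceeds the 10% abort rate.
const first = runCohorts(cohorts, (t) => !broken.has(t), 0.1);

// After fixing the root cause, resume from the saved state.
broken.clear();
const resumed = runCohorts(cohorts, (t) => !broken.has(t), 0.1, first);
console.log(first.abortedAtCohort, resumed.abortedAtCohort, resumed.applied.size);
```

In the simulation, the resumed run re-enters the aborted cohort, skips tenant "d" because it was already applied, and then continues with the untouched third cohort.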
Code-vs-schema compatibility
Every tenant migration may export minCompatibleCodeVersion. The startup hook
samples 10 active tenants and refuses to boot if the running package.json
version is older than any sampled tenant's max-applied
minCompatibleCodeVersion.
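The comparison behind this check can be sketched as follows. This is an illustration only: it assumes plain numeric dotted versions with no pre-release tags, and `compareVersions` and `tenantsAheadOfCode` are hypothetical names rather than functions from rcm-core.

```typescript
// Sketch only: assumes versions like "2.10.0" (numeric parts, no pre-release suffix).

/** Returns -1, 0, or 1 when a is older than, equal to, or newer than b. */
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0); // missing parts count as 0
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

/** Tenants whose max-applied minCompatibleCodeVersion is newer than the running code. */
function tenantsAheadOfCode(
  codeVersion: string,
  tenantMinVersions: Record<string, string>, // tenant slug -> max-applied minCompatibleCodeVersion
): string[] {
  return Object.keys(tenantMinVersions).filter(
    (slug) => compareVersions(codeVersion, tenantMinVersions[slug] ?? "0") < 0,
  );
}

// "2.9.0" is older than "2.10.0" numerically, even though it sorts later as a string.
const ahead = tenantsAheadOfCode("2.9.0", { acme: "2.8.1", globex: "2.10.0" });
console.log(ahead); // only globex is ahead of the build
```

Comparing part by part as numbers matters here: a plain string comparison would treat 2.9.0 as newer than 2.10.0 and let an incompatible build boot.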
For deploy-time safety, run the full-fleet check before the cohort:
pnpm --filter @rcm/rcm-core check-code-version
| Exit | Meaning |
|---|---|
| 0 | Deploy is safe for every active tenant. |
| Non-zero | Prints the tenants whose DB is ahead of the build. Either roll forward the code version or pin the deploy to tenants still on the old schema. |
Deploy sequence when a cohorted migration adds a minCompatibleCodeVersion:
1. Ship the new code first (it must tolerate both old and new schema).
2. Run check-code-version on a canary tenant; it must exit 0.
3. Run migrate-cohorts.
4. Re-run check-code-version across the fleet to confirm parity.
Never run migrate-cohorts (step 3) before shipping the new code (step 1). A tenant on the new schema with the old code will likely fail at query time.
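One way to keep the order from being reversed by accident is to run the sequence through a gate that stops at the first non-zero exit code. The sketch below is hypothetical: `Step` and `runInOrder` are illustrative names, and each step is an injected function, so no CLI flags beyond those documented here are assumed.

```typescript
// Hypothetical gate: steps are injected callbacks, so no real CLI flags are assumed.

interface Step {
  name: string;
  run: () => number; // process-style exit code; 0 means success
}

/** Run steps in order and stop at the first one that exits non-zero. */
function runInOrder(steps: Step[]): { completed: string[]; failedAt: string | null } {
  const completed: string[] = [];
  for (const step of steps) {
    if (step.run() !== 0) return { completed, failedAt: step.name };
    completed.push(step.name);
  }
  return { completed, failedAt: null };
}

const result = runInOrder([
  { name: "ship new code", run: () => 0 },
  { name: "check-code-version (canary)", run: () => 1 }, // simulate a failing canary check
  { name: "migrate-cohorts", run: () => 0 },
  { name: "check-code-version (fleet)", run: () => 0 },
]);
console.log(result.failedAt, result.completed);
```

Because the simulated canary check fails, the gate stops after "ship new code" and migrate-cohorts never runs.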
Validation
| Check | Expected |
|---|---|
| migration_cohort_run.finished_at | non-null on the latest run |
| migration_cohort_run.failed_count | 0 (or known + suspended) |
| migration_cohort_run.aborted_at | null |
| check-code-version post-cohort | exit 0 |
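These checks can be combined into one pass over the latest run row. The sketch below is illustrative: the column names follow identity.migration_cohort_run as used in this runbook, while `validateRun` and the `knownSuspendedFailures` parameter are hypothetical, and timestamps are assumed to arrive as strings or null.

```typescript
// Sketch only: validateRun and knownSuspendedFailures are hypothetical names.

interface CohortRunRow {
  finished_at: string | null;
  aborted_at: string | null;
  failed_count: number;
}

/** Returns the validation checks that failed; an empty list means the run is clean. */
function validateRun(
  row: CohortRunRow,
  checkCodeVersionExit: number,
  knownSuspendedFailures: number = 0, // failures already triaged as known + suspended
): string[] {
  const problems: string[] = [];
  if (row.finished_at === null) problems.push("finished_at is null");
  if (row.aborted_at !== null) problems.push("aborted_at is set");
  if (row.failed_count > knownSuspendedFailures) problems.push("unexplained failed tenants");
  if (checkCodeVersionExit !== 0) problems.push("check-code-version exited non-zero");
  return problems;
}

const clean = validateRun(
  { finished_at: "2024-01-01T00:10:00Z", aborted_at: null, failed_count: 0 },
  0,
);
const aborted = validateRun(
  { finished_at: null, aborted_at: "2024-01-01T00:05:00Z", failed_count: 12 },
  1,
);
console.log(clean.length, aborted.length); // 0 4
```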
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Run aborted at cohort 1 | New migration is unsafe against production data | Inspect report->'failedTenants'; fix the migration; re-run from the start (do not resume: the bad migration would still apply). |
| finished_at IS NULL AND aborted_at IS NULL | Run is still live | Wait, or check the rcm-core logs to see which cohort is in flight. |
| failed_count > 0 AND finished_at IS NOT NULL | A small number of tenants failed but stayed below abort_rate | Triage individually with migrate-all-tenants --only <slug>. Resume is not required: the run is finished. |
| Cohort run row missing report | Older runs predate the report column | Expected for legacy runs; new runs always populate the field. |
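The symptom conditions in the table can be read as a small decision function over the run row. This is a sketch: `classifyRun` and the status labels are hypothetical, and `aborted_at` is checked first because this runbook does not say whether `finished_at` is also set when a run aborts.

```typescript
// Sketch only: classifyRun and the status labels are hypothetical.

type RunStatus = "aborted" | "live" | "finished-with-failures" | "finished-clean";

function classifyRun(row: {
  finished_at: string | null;
  aborted_at: string | null;
  failed_count: number;
}): RunStatus {
  // Check aborted_at first so the result does not depend on finished_at for aborted runs.
  if (row.aborted_at !== null) return "aborted";
  if (row.finished_at === null) return "live";
  return row.failed_count > 0 ? "finished-with-failures" : "finished-clean";
}

const live = classifyRun({ finished_at: null, aborted_at: null, failed_count: 0 });
const partial = classifyRun({
  finished_at: "2024-01-01T00:10:00Z",
  aborted_at: null,
  failed_count: 3,
});
console.log(live, partial); // live finished-with-failures
```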