
Cohorted tenant migrations for large fleets

Outcome

A schema or backfill migration is applied across the fleet in waves. Each wave is staggered and monitored, and the run aborts if too many tenants fail in a single cohort. Aborted runs are resumable.

Prerequisites

  • PLATFORM_ADMIN.
  • Master DB has identity.migration_cohort_run (RM-13 S13).
  • At least one of: 1000+ tenants, a migration that carries a backfill, or a first-time change to a hot table. If none of these applies, a single fan-out is fine.

Why cohorts

A single fan-out can saturate the master catalog or the network when the fleet crosses ~1000 tenants. Cohorts replace one big wave with N smaller ones that can be audited, aborted, and resumed.
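
To make the idea concrete, here is a minimal sketch of how a tenant list could be cut into cohorts, and how much time the stagger alone adds. The helper names (chunkIntoCohorts, minStaggerWaitMs) are hypothetical and not part of the CLI, and the sketch assumes the stagger is applied only between consecutive cohorts.

```typescript
// Hypothetical helper: split an ordered tenant list into fixed-size cohorts.
// The last cohort may be smaller than cohortSize.
function chunkIntoCohorts<T>(tenants: readonly T[], cohortSize: number): T[][] {
  if (Math.floor(cohortSize) !== cohortSize || cohortSize < 1) {
    throw new RangeError("cohortSize must be a positive integer");
  }
  const cohorts: T[][] = [];
  for (let i = 0; i < tenants.length; i += cohortSize) {
    cohorts.push(tenants.slice(i, i + cohortSize));
  }
  return cohorts;
}

// Assumption: one stagger pause per gap between cohorts, none after the last.
// This is only a lower bound on run time; migration time itself is excluded.
function minStaggerWaitMs(cohortCount: number, staggerMs: number): number {
  return Math.max(0, cohortCount - 1) * staggerMs;
}
```

Under these assumptions, 1000 tenants at a cohort size of 100 give 10 cohorts, and a 60000 ms stagger adds at least 9 × 60000 ms = 9 minutes of waiting on top of the migration work.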

Steps

  1. Plan a dry run to validate cohort composition without applying anything.

    pnpm --filter @rcm/rcm-core migrate-cohorts --dry-run

    The CLI prints the per-cohort tenant list and the projected duration. Confirm nothing looks unexpected (e.g., a read_only tenant you forgot about).

  2. Pick the cohort shape.

    Lever         | Default | When to change
    --cohort-size | 100     | Smaller (e.g., 50) for risky migrations; larger only after proven clean.
    --stagger-ms  | 60000   | Longer (e.g., 120000) to give monitoring + on-call humans more time between waves.
    --abort-rate  | 0.1     | Stays at 10% per cohort. Lower for migrations that must not partial-apply.

  3. Run with the chosen shape.

    pnpm --filter @rcm/rcm-core migrate-cohorts \
    --cohort-size 50 --stagger-ms 120000

    The CLI prints a per-cohort progress line and a final aggregate report. Every run writes one row to identity.migration_cohort_run at start, updates it per cohort, and finalises it at end-of-run.

  4. If a cohort trips --abort-rate, the run halts. Inspect:

    SELECT run_id, started_at, finished_at,
    total_tenants, cohort_size, stagger_ms,
    succeeded_count, failed_count,
    aborted_at, abort_reason,
    report->'failedTenants' AS failed
    FROM identity.migration_cohort_run
    ORDER BY started_at DESC
    LIMIT 1;

    Run migrate-all-tenants --only <slug> for each failing tenant to surface the root error. Once the cause is fixed, resume:

    pnpm --filter @rcm/rcm-core migrate-cohorts --resume <run_id>

    Resume picks up at the next unapplied cohort rather than restarting from the top; Knex's migration tracking ensures already-applied tenants are skipped.
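
The abort and resume behaviour in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the CLI's actual logic: the type and function names are hypothetical, the abort is assumed to trip when a cohort's failure fraction is strictly greater than --abort-rate (the real comparison may be inclusive), and resume is assumed to start at the first cohort that still contains an unapplied tenant.

```typescript
// Hypothetical per-cohort tally; not the shape of the real report column.
interface CohortResult {
  succeeded: number;
  failed: number;
}

// Assumption: strict ">" comparison against the abort rate.
function tripsAbort(result: CohortResult, abortRate: number): boolean {
  const total = result.succeeded + result.failed;
  if (total === 0) return false; // an empty cohort cannot fail
  return result.failed / total > abortRate;
}

// Index of the first cohort that would halt the run, or -1 if none does.
function firstAbortingCohort(
  results: readonly CohortResult[],
  abortRate: number
): number {
  for (let i = 0; i < results.length; i++) {
    if (tripsAbort(results[i], abortRate)) return i;
  }
  return -1;
}

// Index of the first cohort with at least one unapplied tenant, or -1 when
// every tenant is already migrated. isApplied stands in for whatever
// per-tenant migration tracking is available.
function nextCohortToResume(
  cohorts: readonly string[][],
  isApplied: (slug: string) => boolean
): number {
  for (let i = 0; i < cohorts.length; i++) {
    if (cohorts[i].some((slug) => !isApplied(slug))) return i;
  }
  return -1;
}
```

With a cohort of 50 and an abort rate of 0.1, 5 failures (exactly 10%) would not halt the run under the strict comparison assumed here, while 6 failures (12%) would.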

Code-vs-schema compatibility

A tenant migration may export minCompatibleCodeVersion. At startup, a hook samples 10 active tenants and refuses to boot if the running package.json version is older than the highest minCompatibleCodeVersion among the migrations applied to any sampled tenant.
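
The comparison behind that check might look like the sketch below. It is not the startup hook's real code: parseVersion, compareVersions, and isBootSafe are hypothetical names, and the sketch assumes plain major.minor.patch versions with no pre-release or build suffixes.

```typescript
// Parse "major.minor.patch" into three numbers.
// Assumption: no pre-release or build metadata; this sketch does not handle them.
function parseVersion(v: string): number[] {
  const parts = v.split(".").map((p) => parseInt(p, 10));
  if (parts.length !== 3 || parts.some((n) => isNaN(n))) {
    throw new Error("unsupported version: " + v);
  }
  return parts;
}

// Negative if a < b, zero if equal, positive if a > b.
// Components are compared numerically, so 1.10.0 sorts after 1.9.3.
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

// sampledMinVersions holds, per sampled tenant, the highest
// minCompatibleCodeVersion among its applied migrations.
// Booting is safe only if the running version is at least every one of them.
function isBootSafe(
  runningVersion: string,
  sampledMinVersions: readonly string[]
): boolean {
  return sampledMinVersions.every(
    (min) => compareVersions(runningVersion, min) >= 0
  );
}
```

Comparing numerically rather than as strings matters here: a plain string comparison would rank "1.9.3" above "1.10.0" and could let an outdated build boot.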

For deploy-time safety, run the full-fleet check before the cohort:

pnpm --filter @rcm/rcm-core check-code-version

Exit     | Meaning
0        | Deploy is safe for every active tenant.
Non-zero | Prints the tenants whose DB is ahead of the build. Either roll forward the code version or pin the deploy to tenants still on the old schema.

Deploy sequence when a cohorted migration adds a minCompatibleCodeVersion:

  1. Ship the new code first (it must tolerate both old and new schema).
  2. Run check-code-version on a canary tenant — must exit 0.
  3. Run migrate-cohorts.
  4. Re-run check-code-version across the fleet to confirm parity.

Never reverse steps 1 and 3. A tenant on the new schema with the old code will likely fail at query time.

Validation

Check                             | Expected
migration_cohort_run.finished_at  | non-null on the latest run
migration_cohort_run.failed_count | 0 (or known + suspended)
migration_cohort_run.aborted_at   | null
check-code-version post-cohort    | exit 0

Troubleshooting

Symptom | Cause | Fix
Run aborted at cohort 1 | New migration is unsafe in production data | Inspect report->'failedTenants'; fix the migration; re-run from the start (do not resume: the bad migration would still apply).
finished_at IS NULL AND aborted_at IS NULL | Run is still live | Wait, or check the rcm-core logs to see which cohort is in flight.
failed_count > 0 AND finished_at IS NOT NULL | A small number of tenants failed but stayed below abort_rate | Triage individually with migrate-all-tenants --only <slug>. Resume is not required; the run is finished.
Cohort run row missing report | Older runs predate the report column | Expected for legacy runs; new runs always populate the field.
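
The symptom conditions above can be read as a small decision procedure over the latest run row. The sketch below is one possible encoding: the CohortRunRow shape and the state names are hypothetical, timestamps are assumed to arrive as strings or null, and aborted_at is checked first because the source does not say whether an aborted run also sets finished_at.

```typescript
// Hypothetical subset of an identity.migration_cohort_run row.
interface CohortRunRow {
  finished_at: string | null;
  aborted_at: string | null;
  failed_count: number;
}

type RunState =
  | "aborted"
  | "live"
  | "finished_with_failures"
  | "finished_clean";

function classifyRun(row: CohortRunRow): RunState {
  // Checked first so the result does not depend on whether an aborted
  // run also records finished_at (an open question in this sketch).
  if (row.aborted_at !== null) return "aborted";
  // Neither aborted nor finished: the run is still in flight.
  if (row.finished_at === null) return "live";
  // Finished below the abort rate; any failures need individual triage.
  return row.failed_count > 0 ? "finished_with_failures" : "finished_clean";
}
```

Only the "finished_clean" state satisfies the Validation checks on its own; "finished_with_failures" passes only when the failing tenants are already accounted for.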
