Multi-instance worker ownership via lease

Outcome

Multiple TenantJobSupervisor processes coexist against the same master, partition hot tenants across themselves via rendezvous hashing, and gracefully hand off when the worker set changes.

Prerequisites

  • 1000+ tenants OR multi-instance worker fleet. Below that, single-instance mode is the right answer — the lease is not used when only one worker process exists.
  • PLATFORM_ADMIN for env tuning and lease inspection.

How it works

  1. Each worker process claims a row in identity.worker_lease (UPSERT on worker_id). The row records hostname, pid, started_at, last_heartbeat_at, generation, and expires_at = last_heartbeat_at + 30s.
  2. A heartbeat timer (default 10 s) updates last_heartbeat_at. If the process dies, expires_at lapses within 30 s.
  3. listActive() returns rows with expires_at > NOW(), ordered by worker_id for stability.
  4. For every tenant, ownsTenant(tenantId, workerId, activeWorkers) uses rendezvous hashing: hash (tenantId, workerId) to 64 bits; the worker with the highest score wins.
  5. On every ownership tick (default 30 s), each supervisor compares its hot-tenant set against current ownership and gracefully drains tenants it no longer owns.
  6. promote() is gated: a non-owner raises NotOwnerError; cold + fan-out dispatchers honor this and skip to the next candidate.
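Steps 3–4 can be sketched as follows. This is illustrative, not the production implementation: the FNV-1a hash and the exact score formula are assumptions; only the highest-score-wins structure is taken from the description above.

```typescript
// Assumed 64-bit FNV-1a hash over a string; the real code may use a
// different 64-bit hash, but any stable one works for rendezvous hashing.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n; // FNV-1a 64-bit offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn; // multiply, keep 64 bits
  }
  return h;
}

// Rendezvous (highest-random-weight) hashing: score every (tenant, worker)
// pair; the worker with the highest score owns the tenant.
function ownerOf(tenantId: string, activeWorkers: string[]): string {
  let best = activeWorkers[0];
  let bestScore = hash64(`${tenantId}:${best}`);
  for (const w of activeWorkers.slice(1)) {
    const score = hash64(`${tenantId}:${w}`);
    // Tie-break on worker id so the result is deterministic regardless of
    // the order listActive() returned the rows in.
    if (score > bestScore || (score === bestScore && w < best)) {
      best = w;
      bestScore = score;
    }
  }
  return best;
}

function ownsTenant(tenantId: string, workerId: string, activeWorkers: string[]): boolean {
  return ownerOf(tenantId, activeWorkers) === workerId;
}
```

Because every worker computes the same scores from the same listActive() snapshot, no coordination beyond the lease table is needed: each worker independently reaches the same answer.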

Operating properties

| Event | Behavior |
| --- | --- |
| Add a worker | New lease appears in listActive() within ≤ 10 s. On the next ownership tick (≤ 30 s), every existing worker re-evaluates; on average 1/N of hot tenants drain to the new one. |
| Remove a worker (graceful) | stop() deletes the lease row. Remaining workers pick up its tenants on the next ownership tick. |
| Remove a worker (crash) | The lease row stays but last_heartbeat_at goes stale. Within 30 s it ages out of listActive(). |
| Remap scope | Rendezvous hashing guarantees adding/removing one worker only remaps 1/N tenants — no mass churn. |
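The remap-scope property can be checked with a small offline simulation. Everything here (the hash, the names, the worker/tenant counts) is illustrative, not the production code; what it demonstrates is the structural guarantee that when a worker joins, a tenant's owner changes only if the new worker wins, so tenants never shuffle between surviving workers.

```typescript
// Assumed 64-bit FNV-1a hash, mirroring the scheme described in "How it works".
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn;
  }
  return h;
}

// Highest score wins.
function ownerOf(tenantId: string, workers: string[]): string {
  return workers.reduce((best, w) =>
    hash64(`${tenantId}:${w}`) > hash64(`${tenantId}:${best}`) ? w : best);
}

const tenants = Array.from({ length: 1000 }, (_, i) => `tenant-${i}`);
const before = ["w-1", "w-2", "w-3"];
const after = [...before, "w-4"];

// Tenants whose owner changed when w-4 joined.
const moved = tenants.filter((t) => ownerOf(t, before) !== ownerOf(t, after));

// Every remapped tenant moved TO the new worker; none shuffled between
// survivors, because the relative scores of the old workers are unchanged.
console.log(moved.every((t) => ownerOf(t, after) === "w-4"), moved.length);
```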

Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| WORKER_ID | ${hostname}-${pid} | Stable worker identity. Override in orchestrators that rotate pods so lease rows don't accumulate. |
| WORKER_LEASE_HEARTBEAT_MS | 10000 | Heartbeat cadence. Faster = smaller stale window, but more master writes. |
| WORKER_LEASE_TTL_MS | 30000 | Lease expiry. Applied at read time; no background cleanup. Keep at ≥ 3× the heartbeat. |
| WORKER_OWNERSHIP_TICK_MS | 30000 | How often each supervisor re-evaluates ownership. |

Until env vars are wired into createServer(), every deployment runs single-instance with no lease. Wiring is the "turn on multi-instance mode" flip — treat it as a deliberate scale-out moment.
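When that wiring happens, the parsing might look like the sketch below. The variable names and defaults come from the table above; the loader function, its signature, and the validation are assumptions, not the actual createServer() code.

```typescript
import * as os from "os";

interface LeaseConfig {
  workerId: string;
  heartbeatMs: number;
  ttlMs: number;
  ownershipTickMs: number;
}

// Parse a positive integer from the environment, falling back to the default.
function intFrom(env: Record<string, string | undefined>, name: string, fallback: number): number {
  const n = Number.parseInt(env[name] ?? "", 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Hypothetical loader; defaults match the configuration table.
function loadLeaseConfig(env: Record<string, string | undefined> = process.env): LeaseConfig {
  const cfg: LeaseConfig = {
    workerId: env.WORKER_ID ?? `${os.hostname()}-${process.pid}`,
    heartbeatMs: intFrom(env, "WORKER_LEASE_HEARTBEAT_MS", 10_000),
    ttlMs: intFrom(env, "WORKER_LEASE_TTL_MS", 30_000),
    ownershipTickMs: intFrom(env, "WORKER_OWNERSHIP_TICK_MS", 30_000),
  };
  // Enforce the TTL >= 3x heartbeat guidance from the table above.
  if (cfg.ttlMs < 3 * cfg.heartbeatMs) {
    throw new Error("WORKER_LEASE_TTL_MS must be >= 3x WORKER_LEASE_HEARTBEAT_MS");
  }
  return cfg;
}
```

Failing fast on a TTL below 3× the heartbeat is safer than silently accepting a configuration where one missed heartbeat can expire a healthy worker's lease.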

Inspect active workers

```sql
SELECT worker_id, hostname, pid, started_at,
       last_heartbeat_at, generation,
       expires_at > NOW() AS active
FROM identity.worker_lease
ORDER BY active DESC, worker_id;
```

| Pattern | Meaning |
| --- | --- |
| active = false + old last_heartbeat_at | Worker died; safe to delete the row to tidy up (not required for correctness). |
| Rapidly-incrementing generation on one row | Worker is restart-looping; check its logs. |
| started_at drift between rows > a few seconds | Rolling deploy in progress; normal during rollout. |

Manually evict a stuck worker

If a worker holds a lease but is wedged (no heartbeat updates despite expires_at still valid — rare; usually clock skew or paused process):

```sql
DELETE FROM identity.worker_lease WHERE worker_id = '<stuck>';
```

Remaining workers pick up the tenants on the next ownership tick. The stuck worker, if it recovers, will re-UPSERT — harmless.

Terminating the process itself is still preferred; the DELETE is for cases where the process is unreachable but still running.

Validation

| Check | Expected |
| --- | --- |
| identity.worker_lease row count | Matches the running worker count |
| Every active row has a fresh last_heartbeat_at | Within WORKER_LEASE_TTL_MS |
| Hot-tenant distribution across workers | Roughly even (within 30% deviation at 100+ tenants × 10+ workers) |
| rcm_supervisor_demote_total{reason='ownership-change'} | Stable; not flapping |
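The distribution check can be estimated offline before trusting production metrics. This mirrors the rendezvous scheme described in "How it works" with an assumed FNV-1a hash and synthetic tenant/worker names, so exact shares will differ from production; the point is to see per-worker counts and the deviation from the mean.

```typescript
// Assumed 64-bit FNV-1a hash; illustrative only.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn;
  }
  return h;
}

function ownerOf(tenantId: string, workers: string[]): string {
  return workers.reduce((best, w) =>
    hash64(`${tenantId}:${w}`) > hash64(`${tenantId}:${best}`) ? w : best);
}

// Assign 1000 synthetic tenants across 10 workers and tally each share.
const workers = Array.from({ length: 10 }, (_, i) => `worker-${i}`);
const counts = new Map<string, number>(workers.map((w) => [w, 0]));
for (let i = 0; i < 1000; i++) {
  const w = ownerOf(`tenant-${i}`, workers);
  counts.set(w, (counts.get(w) ?? 0) + 1);
}

// Largest relative deviation from the ideal 100-tenants-per-worker mean.
const mean = 1000 / workers.length;
const maxDeviation = Math.max(
  ...[...counts.values()].map((c) => Math.abs(c - mean) / mean),
);
console.log([...counts.values()], maxDeviation.toFixed(2));
```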

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Adding a worker doesn't shed load from existing workers | Ownership tick has not fired yet | Wait up to WORKER_OWNERSHIP_TICK_MS. |
| Lease rows accumulate forever | WORKER_ID rotates on every pod restart | Pin a stable identity (e.g., the StatefulSet name). |
| Repeated drain + promote churn | Worker restart-looping | Fix the underlying crash; the system absorbs single flaps but not chronic ones. A 2-tick hysteresis is a known follow-up. |
| Tenant pinned to Region A served by a worker in Region B | Cross-region awareness not yet wired | Document the gap; add region affinity when cross-region deploy lands. |

Known limits

  • Small-N variance: at 2–3 workers + low tenant count, one worker may carry 60% of load temporarily. Plan capacity for the worst-case share.
  • Flap hysteresis: chronic worker churn requires fixing the worker; graceful drain absorbs single flaps only.
  • No region affinity: ownership ignores identity.tenant.preferred_region.

Next

2.1 — Configure a new payer