# Multi-instance worker ownership via lease
## Outcome
Multiple `TenantJobSupervisor` processes coexist against the same master,
partition hot tenants across themselves via rendezvous hashing, and gracefully
hand off when the worker set changes.
## Prerequisites
- 1000+ tenants OR a multi-instance worker fleet. Below that, single-instance mode is the right answer — the lease is not used when only one worker process exists.
- `PLATFORM_ADMIN` for env tuning and lease inspection.
## How it works
- Each worker process claims a row in `identity.worker_lease` (UPSERT on `worker_id`). The row records `hostname`, `pid`, `started_at`, `last_heartbeat_at`, `generation`, and `expires_at = last_heartbeat_at + 30s`.
- A heartbeat timer (default 10 s) updates `last_heartbeat_at`. If the process dies, `expires_at` lapses within 30 s.
- `listActive()` returns rows with `expires_at > NOW()`, ordered by `worker_id` for stability.
- For every tenant, `ownsTenant(tenantId, workerId, activeWorkers)` uses rendezvous hashing: hash `(tenantId, workerId)` to 64 bits; the worker with the highest score wins.
- On every 30 s ownership tick, supervisors compare the hot-tenant set against ownership and gracefully drain tenants they no longer own.
- `promote()` is gated: a non-owner raises `NotOwnerError`; cold and fan-out dispatchers honor this and skip to the next candidate.
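The ownership rule can be sketched as follows. This is a minimal illustration, with an FNV-1a 64-bit hash standing in for whatever hash `ownsTenant` actually uses; the function shapes mirror the signatures above but are not the real implementation.

```typescript
// Minimal rendezvous (highest-random-weight) ownership sketch.
// FNV-1a 64-bit is a stand-in; the production hash may differ.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n; // FNV offset basis
  for (const ch of s) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn; // FNV prime, wrapped to 64 bits
  }
  return h;
}

// Score every (tenant, worker) pair; the highest score wins. With activeWorkers
// ordered by worker_id (as listActive() guarantees), the result is deterministic.
function ownerOf(tenantId: string, activeWorkers: string[]): string {
  let best = activeWorkers[0];
  let bestScore = -1n;
  for (const w of activeWorkers) {
    const score = hash64(`${tenantId}:${w}`);
    if (score > bestScore) {
      best = w;
      bestScore = score;
    }
  }
  return best;
}

function ownsTenant(tenantId: string, workerId: string, activeWorkers: string[]): boolean {
  return ownerOf(tenantId, activeWorkers) === workerId;
}
```

Because each (tenant, worker) pair is scored independently, adding or removing one worker never changes the relative order of the remaining scores; that independence is what bounds remapping to roughly 1/N.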
## Operating properties
| Event | Behavior |
|---|---|
| Add a worker | New lease appears in `listActive()` within ≤ 10 s. On the next ownership tick (≤ 30 s), every existing worker re-evaluates; on average 1/N of hot tenants drain to the new one. |
| Remove a worker (graceful) | `stop()` deletes the lease row. Remaining workers pick up its tenants on the next ownership tick. |
| Remove a worker (crash) | The lease row stays but `last_heartbeat_at` goes stale. Within 30 s it ages out of `listActive()`. |
| Remap scope | Rendezvous hashing guarantees adding/removing one worker only remaps ~1/N of tenants — no mass churn. |
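The remap-scope row can be checked empirically: under rendezvous hashing, a tenant that changes owner when a worker joins can only move to that new worker, because every existing (tenant, worker) score is unchanged. A small simulation, using an FNV-1a stand-in hash and hypothetical worker names:

```typescript
// Verifies the remap bound: adding w5 to a 4-worker fleet moves only the
// tenants whose new highest score belongs to w5.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  for (const ch of s) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn;
  }
  return h;
}

function ownerOf(tenantId: string, workers: string[]): string {
  let best = workers[0];
  let bestScore = -1n;
  for (const w of workers) {
    const score = hash64(`${tenantId}:${w}`);
    if (score > bestScore) { best = w; bestScore = score; }
  }
  return best;
}

const before = ["w1", "w2", "w3", "w4"];
const after = [...before, "w5"];
let moved = 0;
for (let i = 0; i < 1000; i++) {
  const tenant = `tenant-${i}`;
  const oldOwner = ownerOf(tenant, before);
  const newOwner = ownerOf(tenant, after);
  if (oldOwner !== newOwner) {
    moved++;
    // Existing pair scores are untouched, so a changed owner must be w5.
    if (newOwner !== "w5") throw new Error("remapped tenant must land on the new worker");
  }
}
// Roughly 1/5 of tenants move to w5; none shuffle between the old workers.
console.log(`moved ${moved} of 1000 tenants`);
```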
## Configuration
| Variable | Default | Effect |
|---|---|---|
| `WORKER_ID` | `${hostname}-${pid}` | Stable worker identity. Override in orchestrators that rotate pods so lease rows don't accumulate. |
| `WORKER_LEASE_HEARTBEAT_MS` | `10000` | Heartbeat cadence. Faster = smaller stale window, more master writes. |
| `WORKER_LEASE_TTL_MS` | `30000` | Lease expiry. Applied at read time; no background cleanup. Keep at ≥ 3× the heartbeat. |
| `WORKER_OWNERSHIP_TICK_MS` | `30000` | How often each supervisor re-evaluates ownership. |
Until the env vars are wired into `createServer()`, every deployment runs
single-instance with no lease. Wiring them is the "turn on multi-instance mode"
flip — treat it as a deliberate scale-out moment.
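When that wiring happens, the env parsing might look like the following sketch. Only the variable names and defaults come from the table above; the `LeaseConfig` shape, the `leaseConfigFromEnv` helper, and the up-front TTL validation are assumptions, not the real API.

```typescript
import * as os from "node:os";

type Env = Record<string, string | undefined>;

interface LeaseConfig {
  workerId: string;
  heartbeatMs: number;
  ttlMs: number;
  ownershipTickMs: number;
}

// Hypothetical loader for the documented lease env vars.
function leaseConfigFromEnv(env: Env): LeaseConfig {
  const num = (name: string, dflt: number): number => {
    const raw = env[name];
    const n = raw === undefined ? dflt : Number(raw);
    if (!Number.isInteger(n) || n <= 0) throw new Error(`${name} must be a positive integer (ms)`);
    return n;
  };
  const cfg: LeaseConfig = {
    // Stable identity; defaults to hostname-pid as the table documents.
    workerId: env["WORKER_ID"] ?? `${os.hostname()}-${process.pid}`,
    heartbeatMs: num("WORKER_LEASE_HEARTBEAT_MS", 10_000),
    ttlMs: num("WORKER_LEASE_TTL_MS", 30_000),
    ownershipTickMs: num("WORKER_OWNERSHIP_TICK_MS", 30_000),
  };
  // Enforce the documented "TTL >= 3x heartbeat" guideline at startup.
  if (cfg.ttlMs < 3 * cfg.heartbeatMs) {
    throw new Error("WORKER_LEASE_TTL_MS must be >= 3x WORKER_LEASE_HEARTBEAT_MS");
  }
  return cfg;
}
```

Rejecting a too-short TTL at startup, rather than at read time, keeps a misconfigured worker from ever appearing briefly active and then flapping.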
## Inspect active workers
```sql
SELECT worker_id, hostname, pid, started_at,
       last_heartbeat_at, generation,
       expires_at > NOW() AS active
FROM identity.worker_lease
ORDER BY active DESC, worker_id;
```
| Pattern | Meaning |
|---|---|
| `active = false` + old `last_heartbeat_at` | Worker died; safe to delete the row to tidy up (not required for correctness). |
| Rapidly incrementing `generation` on one row | Worker is restart-looping; check its logs. |
| `started_at` drift between rows > a few seconds | Rolling deploy in progress; normal during rollout. |
## Manually evict a stuck worker
If a worker holds a lease but is wedged (no heartbeat updates despite
`expires_at` still being valid — rare; usually clock skew or a paused process):

```sql
DELETE FROM identity.worker_lease WHERE worker_id = '<stuck>';
```
Remaining workers pick up the tenants on the next ownership tick. The stuck worker, if it recovers, will re-UPSERT — harmless.
Terminating the process itself is still preferred; the DELETE is for cases
where the process is unreachable but still running.
## Validation
| Check | Expected |
|---|---|
| `identity.worker_lease` row count | Matches running worker count |
| Every active row has a fresh `last_heartbeat_at` | Within `WORKER_LEASE_TTL_MS` |
| Hot-tenant distribution across workers | Roughly even (within 30% deviation at 100+ tenants × 10+ workers) |
| `rcm_supervisor_demote_total{reason='ownership-change'}` | Stable; not flapping |
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Adding a worker doesn't shed load from existing workers | Ownership tick has not fired yet | Wait ≤ `WORKER_OWNERSHIP_TICK_MS`. |
| Lease rows accumulate forever | `WORKER_ID` rotates per pod restart | Pin a stable identity (e.g., the StatefulSet name). |
| Repeated drain + promote churn | Worker is restart-looping | Fix the underlying crash; the system absorbs single flaps but not chronic ones. A 2-tick hysteresis is a known follow-up. |
| Tenant pinned to Region A served by a worker in Region B | Cross-region awareness not yet wired | Document it; add region affinity when cross-region deploy lands. |
## Known limits
- Small-N variance: at 2–3 workers and a low tenant count, one worker may temporarily carry 60% of the load. Plan capacity for the worst-case share.
- Flap hysteresis: chronic worker churn requires fixing the worker; graceful drain absorbs single flaps only.
- No region affinity: ownership ignores `identity.tenant.preferred_region`.