# Multi-instance worker ownership via lease
## Outcome
Multiple `TenantJobSupervisor` processes coexist against the same master,
partition hot tenants across themselves via rendezvous hashing, and gracefully
hand off when the worker set changes.
## Prerequisites
- 1000+ tenants OR a multi-instance worker fleet. Below that, single-instance mode is the right answer — the lease is not used when only one worker process exists.
- `PLATFORM_ADMIN` for env tuning and lease inspection.
## How it works
- Each worker process claims a row in `identity.worker_lease` (UPSERT on `worker_id`). The row records `hostname`, `pid`, `started_at`, `last_heartbeat_at`, `generation`, and `expires_at = last_heartbeat_at + 30s`.
- A heartbeat timer (default 10 s) updates `last_heartbeat_at`. If the process dies, `expires_at` lapses within 30 s.
- `listActive()` returns rows with `expires_at > NOW()`, ordered by `worker_id` for stability.
- For every tenant, `ownsTenant(tenantId, workerId, activeWorkers)` uses rendezvous hashing: hash `(tenantId, workerId)` to 64 bits; the worker with the highest score wins.
- On every 30 s ownership tick, supervisors compare the hot-tenant set against ownership and gracefully drain tenants they no longer own.
- `promote()` is gated: a non-owner raises `NotOwnerError`; cold and fan-out dispatchers honor this and skip to the next candidate.
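The ownership rule can be sketched as follows. This is a minimal illustration, with an FNV-1a 64-bit hash standing in for whatever hash `ownsTenant` actually uses; the function shapes mirror the signatures above but are not the real implementation.

```typescript
// Minimal rendezvous (highest-random-weight) ownership sketch.
// FNV-1a 64-bit is a stand-in; the production hash may differ.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n; // FNV offset basis
  for (const ch of s) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn; // FNV prime, wrapped to 64 bits
  }
  return h;
}

// Score every (tenant, worker) pair; the highest score wins. With activeWorkers
// ordered by worker_id (as listActive() guarantees), the result is deterministic.
function ownerOf(tenantId: string, activeWorkers: string[]): string {
  let best = activeWorkers[0];
  let bestScore = -1n;
  for (const w of activeWorkers) {
    const score = hash64(`${tenantId}:${w}`);
    if (score > bestScore) {
      best = w;
      bestScore = score;
    }
  }
  return best;
}

function ownsTenant(tenantId: string, workerId: string, activeWorkers: string[]): boolean {
  return ownerOf(tenantId, activeWorkers) === workerId;
}
```

Because each (tenant, worker) pair is scored independently, adding or removing one worker never changes the relative order of the remaining scores; that independence is what bounds remapping to roughly 1/N.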
## Operating properties
| Event | Behavior |
|---|---|
| Add a worker | New lease appears in `listActive()` within ≤ 10 s. On the next ownership tick (≤ 30 s), every existing worker re-evaluates; on average 1/N of hot tenants drain to the new one. |
| Remove a worker (graceful) | `stop()` deletes the lease row. Remaining workers pick up its tenants on the next ownership tick. |
| Remove a worker (crash) | The lease row stays but `last_heartbeat_at` goes stale. Within 30 s it ages out of `listActive()`. |
| Remap scope | Rendezvous hashing guarantees adding/removing one worker only remaps ~1/N of tenants — no mass churn. |
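The remap-scope row can be checked empirically: under rendezvous hashing, a tenant that changes owner when a worker joins can only move to that new worker, because every existing (tenant, worker) score is unchanged. A small simulation, using an FNV-1a stand-in hash and hypothetical worker names:

```typescript
// Verifies the remap bound: adding w5 to a 4-worker fleet moves only the
// tenants whose new highest score belongs to w5.
function hash64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  for (const ch of s) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn;
  }
  return h;
}

function ownerOf(tenantId: string, workers: string[]): string {
  let best = workers[0];
  let bestScore = -1n;
  for (const w of workers) {
    const score = hash64(`${tenantId}:${w}`);
    if (score > bestScore) { best = w; bestScore = score; }
  }
  return best;
}

const before = ["w1", "w2", "w3", "w4"];
const after = [...before, "w5"];
let moved = 0;
for (let i = 0; i < 1000; i++) {
  const tenant = `tenant-${i}`;
  const oldOwner = ownerOf(tenant, before);
  const newOwner = ownerOf(tenant, after);
  if (oldOwner !== newOwner) {
    moved++;
    // Existing pair scores are untouched, so a changed owner must be w5.
    if (newOwner !== "w5") throw new Error("remapped tenant must land on the new worker");
  }
}
// Roughly 1/5 of tenants move to w5; none shuffle between the old workers.
console.log(`moved ${moved} of 1000 tenants`);
```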
## Configuration
| Variable | Default | Effect |
|---|---|---|
| `WORKER_ID` | `${hostname}-${pid}` | Stable worker identity. Override in orchestrators that rotate pods so lease rows don't accumulate. |
| `WORKER_LEASE_HEARTBEAT_MS` | `10000` | Heartbeat cadence. Faster = smaller stale window, more master writes. |
| `WORKER_LEASE_TTL_MS` | `30000` | Lease expiry. Applied at read time; no background cleanup. Keep at ≥ 3× the heartbeat. |
| `WORKER_OWNERSHIP_TICK_MS` | `30000` | How often each supervisor re-evaluates ownership. |
Until the env vars are wired into `createServer()`, every deployment runs
single-instance with no lease. Wiring them is the "turn on multi-instance mode"
flip — treat it as a deliberate scale-out moment.
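When that wiring happens, the env parsing might look like the following sketch. Only the variable names and defaults come from the table above; the `LeaseConfig` shape, the `leaseConfigFromEnv` helper, and the up-front TTL validation are assumptions, not the real API.

```typescript
import * as os from "node:os";

type Env = Record<string, string | undefined>;

interface LeaseConfig {
  workerId: string;
  heartbeatMs: number;
  ttlMs: number;
  ownershipTickMs: number;
}

// Hypothetical loader for the documented lease env vars.
function leaseConfigFromEnv(env: Env): LeaseConfig {
  const num = (name: string, dflt: number): number => {
    const raw = env[name];
    const n = raw === undefined ? dflt : Number(raw);
    if (!Number.isInteger(n) || n <= 0) throw new Error(`${name} must be a positive integer (ms)`);
    return n;
  };
  const cfg: LeaseConfig = {
    // Stable identity; defaults to hostname-pid as the table documents.
    workerId: env["WORKER_ID"] ?? `${os.hostname()}-${process.pid}`,
    heartbeatMs: num("WORKER_LEASE_HEARTBEAT_MS", 10_000),
    ttlMs: num("WORKER_LEASE_TTL_MS", 30_000),
    ownershipTickMs: num("WORKER_OWNERSHIP_TICK_MS", 30_000),
  };
  // Enforce the documented "TTL >= 3x heartbeat" guideline at startup.
  if (cfg.ttlMs < 3 * cfg.heartbeatMs) {
    throw new Error("WORKER_LEASE_TTL_MS must be >= 3x WORKER_LEASE_HEARTBEAT_MS");
  }
  return cfg;
}
```

Rejecting a too-short TTL at startup, rather than at read time, keeps a misconfigured worker from ever appearing briefly active and then flapping.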
## Inspect active workers
```sql
SELECT worker_id, hostname, pid, started_at,
       last_heartbeat_at, generation,
       expires_at > NOW() AS active
FROM identity.worker_lease
ORDER BY active DESC, worker_id;
```
| Pattern | Meaning |
|---|---|
| `active = false` + old `last_heartbeat_at` | Worker died; safe to delete the row to tidy up (not required for correctness). |
| Rapidly incrementing `generation` on one row | Worker is restart-looping; check its logs. |
| `started_at` drift between rows > a few seconds | Rolling deploy in progress; normal during rollout. |
## Manually evict a stuck worker
If a worker holds a lease but is wedged (no heartbeat updates despite
`expires_at` still being valid — rare; usually clock skew or a paused process):

```sql
DELETE FROM identity.worker_lease WHERE worker_id = '<stuck>';
```
Remaining workers pick up the tenants on the next ownership tick. The stuck worker, if it recovers, will re-UPSERT — harmless.
Terminating the process itself is still preferred; the DELETE is for cases
where the process is unreachable but still running.
## Validation
| Check | Expected |
|---|---|
| `identity.worker_lease` row count | Matches running worker count |
| Every active row has a fresh `last_heartbeat_at` | Within `WORKER_LEASE_TTL_MS` |
| Hot-tenant distribution across workers | Roughly even (within 30% deviation at 100+ tenants × 10+ workers) |
| `rcm_supervisor_demote_total{reason='ownership-change'}` | Stable; not flapping |
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Adding a worker doesn't shed load from existing workers | Ownership tick has not fired yet | Wait ≤ `WORKER_OWNERSHIP_TICK_MS`. |
| Lease rows accumulate forever | `WORKER_ID` rotates per pod restart | Pin a stable identity (e.g., the StatefulSet name). |
| Repeated drain + promote churn | Worker is restart-looping | Fix the underlying crash; the system absorbs single flaps but not chronic ones. A 2-tick hysteresis is a known follow-up. |
| Tenant pinned to Region A served by a worker in Region B | Cross-region awareness not yet wired | Document it; add region affinity when cross-region deploy lands. |
## Known limits
- Small-N variance: at 2–3 workers and a low tenant count, one worker may temporarily carry 60% of the load. Plan capacity for the worst-case share.
- Flap hysteresis: chronic worker churn requires fixing the worker; graceful drain absorbs single flaps only.
- No region affinity: ownership ignores `identity.tenant.preferred_region`.