pg-boss supervisor operations
Outcome
You can read pg-boss hot-pool capacity, diagnose stuck cold tenants, force-promote when needed, and tune the supervisor for your fleet without surprising anyone.
Prerequisites
- Grafana / Prometheus access for supervisor metrics.
- `PLATFORM_ADMIN` for env tuning.
How it works
The ingestion pipeline runs under a `TenantJobSupervisor` that holds one
pg-boss instance per hot tenant. Jobs live in each tenant's own
`pgboss.*` schema, so a worker for tenant A physically cannot touch
tenant B's data.
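As a rough sketch of that shape (TypeScript; the `hot` map, `promote` signature, and `connectionStringFor` helper are illustrative assumptions, not the real implementation):

```ts
import PgBoss from 'pg-boss';

// Assumed helper: resolves a tenant's dedicated database connection string.
declare function connectionStringFor(tenantId: string): string;

// Hot pool: one running pg-boss instance per hot tenant.
const hot = new Map<string, PgBoss>();

async function promote(tenantId: string): Promise<PgBoss> {
  const existing = hot.get(tenantId);
  if (existing) return existing;

  // Each instance points at the tenant's own database, so its pgboss.*
  // schema is physically isolated from every other tenant's.
  const boss = new PgBoss({ connectionString: connectionStringFor(tenantId) });
  await boss.start();
  hot.set(tenantId, boss);
  return boss;
}
```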
Key metrics
| Metric | Meaning | Alert threshold |
|---|---|---|
| `rcm_supervisor_hot_tenants` | Current hot-pool occupancy | — informational |
| `rcm_supervisor_hot_pool_max` | Configured ceiling (`PGBOSS_HOT_POOL_MAX`) | static |
| `rcm_supervisor_promote_total{reason}` | Cumulative promotes; reason ∈ {enqueue, cold-dispatch, fanout, manual} | spike → tenant churn |
| `rcm_supervisor_demote_total{reason}` | Cumulative demotes; reason ∈ {idle, eviction, shutdown, error} | error > 0 → investigate |
| `rcm_supervisor_promote_duration_seconds` | pg-boss cold-start latency | p99 > 2 s warn |
| `rcm_supervisor_enqueue_total{queue}` | Enqueue rate by queue | baseline drift |
| `rcm_cold_dispatcher_scan_duration_seconds` | One cold-tier tick | p99 > 1 s warn |
| `rcm_cold_dispatcher_tenants_scanned_total` | Probe volume | — |
| `rcm_cold_dispatcher_promotions_total` | Cold-triggered wake-ups | sustained > 0 → hot pool too small |
Hot-pool capacity tuning
Three env levers bound the per-worker Postgres connection budget:

`connections_per_worker ≈ PGBOSS_HOT_POOL_MAX × (TENANT_POOL_MAX + PGBOSS_TENANT_POOL_MAX)`
Defaults (50 × (5 + 2) = 350) sit inside Azure Flexible Server's 428-conn
ceiling on the General Purpose SKU.
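A worked pass over that formula with the documented defaults:

```ts
// Documented defaults for this fleet.
const PGBOSS_HOT_POOL_MAX = 50;   // hot_pool_max
const TENANT_POOL_MAX = 5;        // app pool per hot tenant
const PGBOSS_TENANT_POOL_MAX = 2; // pg-boss pool per hot tenant

// 50 × (5 + 2) = 350, inside the 428-connection ceiling.
const connectionsPerWorker =
  PGBOSS_HOT_POOL_MAX * (TENANT_POOL_MAX + PGBOSS_TENANT_POOL_MAX);
console.log(connectionsPerWorker); // 350
```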
If `rcm_connection_budget_ratio` approaches 1.0, reduce
`PGBOSS_HOT_POOL_MAX` first; eviction will demote idle tenants and the
cold dispatcher will wake them as needed. See
Connection budget alerts.
When a tenant is stuck in cold
Symptom: jobs accumulate in a tenant's `pgboss.job` table but the worker
never fires.
Check

1. `listHot()` via the admin surface (or log-grep for `supervisor.promote` with the tenant's id).
2. If the cold dispatcher is scanning (`rcm_cold_dispatcher_*`) but the tenant isn't being probed → hot pool is full. Either increase `PGBOSS_HOT_POOL_MAX` or wait for the reaper.
3. If the tenant is probed but never promotes, run the probe SQL manually against that tenant's DB:

   ```sql
   SELECT state, count(*) FROM pgboss.job GROUP BY state;
   ```

4. If rows are all in `state='active'` with stale `started_on`, a previous worker crashed mid-lease. pg-boss reclaims them after `expire_in` (default 15 min); if you need it sooner:

   ```sql
   UPDATE pgboss.job
   SET state = 'retry'
   WHERE state = 'active'
     AND started_on < now() - interval '5 min';
   ```
Force-promote a tenant
For on-call triage (no CLI yet — follow-up):
```js
// In a REPL connected to the running process:
await supervisor.promote('<tenant-id>', 'manual');
```
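A manual promote increments `rcm_supervisor_promote_total{reason='manual'}`, so you can confirm in Grafana that it took effect.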
Graceful restart
SIGTERM triggers `supervisor.stop()`, which:
- Stops the reaper interval.
- Awaits any in-flight promotes.
- Gracefully stops every hot tenant's pg-boss (`stop({ graceful: true })`).
In-flight jobs finish before shutdown; in the 90th-percentile case, a rolling deploy pushes < 2% of jobs into retry (pg-boss retries on timeout).
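A minimal sketch of that sequence, assuming hypothetical `reaperTimer`, `inFlightPromotes`, and `hot` internals:

```ts
import PgBoss from 'pg-boss';

// Hypothetical internals: names are illustrative, not the real fields.
declare const reaperTimer: NodeJS.Timeout;
declare const inFlightPromotes: Promise<unknown>[];
declare const hot: Map<string, PgBoss>;

async function stop(): Promise<void> {
  clearInterval(reaperTimer);                 // 1. stop the reaper interval
  await Promise.allSettled(inFlightPromotes); // 2. await in-flight promotes
  // 3. gracefully stop every hot tenant's pg-boss; in-flight jobs finish
  await Promise.all(
    [...hot.values()].map((boss) => boss.stop({ graceful: true })),
  );
}

process.on('SIGTERM', () => {
  void stop().then(() => process.exit(0));
});
```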
Deployment note — tenantId on payloads
Every ingestion job carries `tenantId` in its payload. Before rolling
out a new build that adds a `tenantId` field to any new queue, drain
the old queue first (or accept that in-flight jobs without the field
are skipped with a `supervisor.handler.payload_mismatch` warn).
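A sketch of the guard that produces that warn, assuming a v9-style single-job `work()` handler and illustrative `logger` / `handle` bindings ('ingestion-queue' is a placeholder queue name):

```ts
import PgBoss from 'pg-boss';

declare const boss: PgBoss;
declare const logger: { warn(msg: string, meta?: object): void };
declare function handle(tenantId: string, data: unknown): Promise<void>;

await boss.work('ingestion-queue', async (job) => {
  const { tenantId } = job.data as { tenantId?: string };
  if (!tenantId) {
    // Old-build payloads without tenantId are skipped, not failed.
    logger.warn('supervisor.handler.payload_mismatch', { jobId: job.id });
    return;
  }
  await handle(tenantId, job.data);
});
```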
Validation
| Check | Expected |
|---|---|
| `rcm_supervisor_hot_tenants` | Below `hot_pool_max` |
| `rcm_supervisor_promote_duration_seconds` p99 | < 2 s |
| `rcm_supervisor_demote_total{reason='error'}` | 0 |
| Stuck cold tenant after probe | Promotes after manual SQL or env-bump |
Cross-references
- Worker ownership & lease for multi-instance partitioning.
- Connection budget alerts — supervisor levers are also the budget levers.
- Cold-start & deploy SLOs — promote-duration is one of the four SLOs.