pg-boss supervisor operations
Outcome
You can read pg-boss hot-pool capacity, diagnose stuck cold tenants, force-promote when needed, and tune the supervisor for your fleet without surprising anyone.
Prerequisites
- Grafana / Prometheus access for supervisor metrics.
- `PLATFORM_ADMIN` for env tuning.
How it works
The ingestion pipeline runs under a `TenantJobSupervisor` that holds one
pg-boss instance per hot tenant. Jobs live in each tenant's own
`pgboss.*` schema, so a worker for tenant A physically cannot touch
tenant B's data.
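As a rough sketch of that shape (TypeScript; the `hot` map, `promote` signature, and `connectionStringFor` helper are illustrative assumptions, not the real implementation):

```ts
import PgBoss from 'pg-boss';

// Assumed helper: resolves a tenant's dedicated database connection string.
declare function connectionStringFor(tenantId: string): string;

// Hot pool: one running pg-boss instance per hot tenant.
const hot = new Map<string, PgBoss>();

async function promote(tenantId: string): Promise<PgBoss> {
  const existing = hot.get(tenantId);
  if (existing) return existing;

  // Each instance points at the tenant's own database, so its pgboss.*
  // schema is physically isolated from every other tenant's.
  const boss = new PgBoss({ connectionString: connectionStringFor(tenantId) });
  await boss.start();
  hot.set(tenantId, boss);
  return boss;
}
```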
Key metrics
| Metric | Meaning | Alert threshold |
|---|---|---|
| `rcm_supervisor_hot_tenants` | Current hot-pool occupancy | — informational |
| `rcm_supervisor_hot_pool_max` | Configured ceiling (`PGBOSS_HOT_POOL_MAX`) | static |
| `rcm_supervisor_promote_total{reason}` | Cumulative promotes; reason ∈ {enqueue, cold-dispatch, fanout, manual} | spike → tenant churn |
| `rcm_supervisor_demote_total{reason}` | Cumulative demotes; reason ∈ {idle, eviction, shutdown, error} | error > 0 → investigate |
| `rcm_supervisor_promote_duration_seconds` | pg-boss cold-start latency | p99 > 2 s warn |
| `rcm_supervisor_enqueue_total{queue}` | Enqueue rate by queue | baseline drift |
| `rcm_cold_dispatcher_scan_duration_seconds` | One cold-tier tick | p99 > 1 s warn |
| `rcm_cold_dispatcher_tenants_scanned_total` | Probe volume | — |
| `rcm_cold_dispatcher_promotions_total` | Cold-triggered wake-ups | sustained > 0 → hot pool too small |
Hot-pool capacity tuning
Three env levers bound the per-worker Postgres connection budget:

`connections_per_worker ≈ PGBOSS_HOT_POOL_MAX × (TENANT_POOL_MAX + PGBOSS_TENANT_POOL_MAX)`
Defaults (50 × (5 + 2) = 350) sit inside Azure Flexible Server's 428-conn
ceiling on the General Purpose SKU.
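A worked pass over that formula with the documented defaults:

```ts
// Documented defaults for this fleet.
const PGBOSS_HOT_POOL_MAX = 50;   // hot_pool_max
const TENANT_POOL_MAX = 5;        // app pool per hot tenant
const PGBOSS_TENANT_POOL_MAX = 2; // pg-boss pool per hot tenant

// 50 × (5 + 2) = 350, inside the 428-connection ceiling.
const connectionsPerWorker =
  PGBOSS_HOT_POOL_MAX * (TENANT_POOL_MAX + PGBOSS_TENANT_POOL_MAX);
console.log(connectionsPerWorker); // 350
```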
If `rcm_connection_budget_ratio` approaches 1.0, reduce
`PGBOSS_HOT_POOL_MAX` first; eviction will demote idle tenants and the
cold dispatcher will wake them as needed. See
Connection budget alerts.
When a tenant is stuck in cold
Symptom: jobs accumulate in a tenant's `pgboss.job` table but the worker
never fires.
Check

1. `listHot()` via the admin surface (or log-grep for `supervisor.promote` with the tenant's id).
2. If the cold dispatcher is scanning (`rcm_cold_dispatcher_*`) but the tenant isn't being probed → hot pool is full. Either increase `PGBOSS_HOT_POOL_MAX` or wait for the reaper.
3. If the tenant is probed but never promotes, run the probe SQL manually against that tenant's DB:

   ```sql
   SELECT state, count(*) FROM pgboss.job GROUP BY state;
   ```

4. If rows are all in `state='active'` with stale `started_on`, a previous worker crashed mid-lease. pg-boss reclaims them after `expire_in` (default 15 min); if you need it sooner:

   ```sql
   UPDATE pgboss.job
   SET state = 'retry'
   WHERE state = 'active'
     AND started_on < now() - interval '5 min';
   ```
Force-promote a tenant
For on-call triage (no CLI yet — follow-up):
```js
// In a REPL connected to the running process:
await supervisor.promote('<tenant-id>', 'manual');
```
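A manual promote increments `rcm_supervisor_promote_total{reason='manual'}`, so you can confirm in Grafana that it took effect.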
Graceful restart
SIGTERM triggers `supervisor.stop()`, which:
- Stops the reaper interval.
- Awaits any in-flight promotes.
- Gracefully stops every hot tenant's pg-boss (`stop({ graceful: true })`).
In-flight jobs finish before shutdown; in the 90th-percentile case, a rolling deploy pushes < 2% of jobs into retry (pg-boss retries on timeout).
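A minimal sketch of that sequence, assuming hypothetical `reaperTimer`, `inFlightPromotes`, and `hot` internals:

```ts
import PgBoss from 'pg-boss';

// Hypothetical internals: names are illustrative, not the real fields.
declare const reaperTimer: NodeJS.Timeout;
declare const inFlightPromotes: Promise<unknown>[];
declare const hot: Map<string, PgBoss>;

async function stop(): Promise<void> {
  clearInterval(reaperTimer);                 // 1. stop the reaper interval
  await Promise.allSettled(inFlightPromotes); // 2. await in-flight promotes
  // 3. gracefully stop every hot tenant's pg-boss; in-flight jobs finish
  await Promise.all(
    [...hot.values()].map((boss) => boss.stop({ graceful: true })),
  );
}

process.on('SIGTERM', () => {
  void stop().then(() => process.exit(0));
});
```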
Deployment note — tenantId on payloads
Every ingestion job carries `tenantId` in its payload. Before rolling
out a new build that adds a `tenantId` field to any new queue, drain
the old queue first (or accept that in-flight jobs without the field
are skipped with a `supervisor.handler.payload_mismatch` warn).
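A sketch of the guard that produces that warn, assuming a v9-style single-job `work()` handler and illustrative `logger` / `handle` bindings ('ingestion-queue' is a placeholder queue name):

```ts
import PgBoss from 'pg-boss';

declare const boss: PgBoss;
declare const logger: { warn(msg: string, meta?: object): void };
declare function handle(tenantId: string, data: unknown): Promise<void>;

await boss.work('ingestion-queue', async (job) => {
  const { tenantId } = job.data as { tenantId?: string };
  if (!tenantId) {
    // Old-build payloads without tenantId are skipped, not failed.
    logger.warn('supervisor.handler.payload_mismatch', { jobId: job.id });
    return;
  }
  await handle(tenantId, job.data);
});
```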
Validation
| Check | Expected |
|---|---|
| `rcm_supervisor_hot_tenants` | Below `hot_pool_max` |
| `rcm_supervisor_promote_duration_seconds` p99 | < 2 s |
| `rcm_supervisor_demote_total{reason='error'}` | 0 |
| Stuck cold tenant after probe | Promotes after manual SQL or env-bump |
Cross-references
- Worker ownership & lease for multi-instance partitioning.
- Connection budget alerts — supervisor levers are also the budget levers.
- Cold-start & deploy SLOs — promote-duration is one of the four SLOs.