pg-boss supervisor operations

Outcome

You can read pg-boss hot-pool capacity, diagnose stuck cold tenants, force-promote when needed, and tune the supervisor for your fleet without surprising anyone.

Prerequisites

  • Grafana / Prometheus access for supervisor metrics.
  • PLATFORM_ADMIN for env tuning.

How it works

The ingestion pipeline runs under a TenantJobSupervisor that holds one pg-boss instance per hot tenant. Jobs live in each tenant's own pgboss.* schema, so a worker for tenant A physically cannot touch tenant B's data.
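
A minimal sketch of that shape, assuming the supervisor wraps pg-boss directly; TenantJobSupervisor's internals and connectionStringFor are illustrative, not the real implementation:

import PgBoss from 'pg-boss';

// Hypothetical lookup for a tenant's own database URL.
const connectionStringFor = (tenantId: string): string =>
  process.env[`TENANT_DB_URL_${tenantId}`]!;

class TenantJobSupervisor {
  private hot = new Map<string, PgBoss>(); // tenantId → running instance

  async promote(tenantId: string, reason: string): Promise<PgBoss> {
    const existing = this.hot.get(tenantId);
    if (existing) return existing;

    // One pg-boss per hot tenant, pointed at that tenant's own database,
    // so its pgboss.* schema is physically isolated from every other tenant.
    const boss = new PgBoss({
      connectionString: connectionStringFor(tenantId),
      max: Number(process.env.PGBOSS_TENANT_POOL_MAX ?? 2),
    });
    await boss.start(); // reason feeds rcm_supervisor_promote_total{reason}
    this.hot.set(tenantId, boss);
    return boss;
  }
}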

Key metrics

| Metric | Meaning | Alert threshold |
| --- | --- | --- |
| rcm_supervisor_hot_tenants | Current hot-pool occupancy | — (informational) |
| rcm_supervisor_hot_pool_max | Configured ceiling (PGBOSS_HOT_POOL_MAX) | — (static) |
| rcm_supervisor_promote_total{reason} | Cumulative promotes; reason ∈ {enqueue, cold-dispatch, fanout, manual} | spike → tenant churn |
| rcm_supervisor_demote_total{reason} | Cumulative demotes; reason ∈ {idle, eviction, shutdown, error} | error > 0 → investigate |
| rcm_supervisor_promote_duration_seconds | pg-boss cold-start latency | p99 > 2 s → warn |
| rcm_supervisor_enqueue_total{queue} | Enqueue rate by queue | baseline drift |
| rcm_cold_dispatcher_scan_duration_seconds | Duration of one cold-tier tick | p99 > 1 s → warn |
| rcm_cold_dispatcher_tenants_scanned_total | Probe volume | — |
| rcm_cold_dispatcher_promotions_total | Cold-triggered wake-ups | sustained > 0 → hot pool too small |
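
For orientation, the counter and histogram rows above might be registered with prom-client roughly like this; the metric names come from the table, but the wiring is an assumption:

import { Counter, Histogram } from 'prom-client';

// Names match the table above; the label set and buckets are assumed.
const promoteTotal = new Counter({
  name: 'rcm_supervisor_promote_total',
  help: 'Cumulative promotes by reason',
  labelNames: ['reason'],
});

const promoteDuration = new Histogram({
  name: 'rcm_supervisor_promote_duration_seconds',
  help: 'pg-boss cold-start latency',
  buckets: [0.1, 0.25, 0.5, 1, 2, 5],
});

// Inside promote():
promoteTotal.inc({ reason: 'enqueue' });
const end = promoteDuration.startTimer();
// ... await boss.start() ...
end();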

Hot-pool capacity tuning

Three env levers bound the per-worker Postgres connection budget:

connections_per_worker ≈ PGBOSS_HOT_POOL_MAX × (TENANT_POOL_MAX + PGBOSS_TENANT_POOL_MAX)

The defaults (50 × (5 + 2) = 350) sit inside Azure Flexible Server's 428-connection ceiling on the General Purpose SKU.

If rcm_connection_budget_ratio approaches 1.0, reduce PGBOSS_HOT_POOL_MAX first — eviction will demote idle tenants and the cold dispatcher will wake them as needed. See Connection budget alerts.
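
To see headroom before changing anything, the formula can be mirrored in a quick script; a sketch where MAX_CONNECTIONS (the 428 ceiling above) and the defaults are assumed inputs:

// Mirrors the budget formula and rcm_connection_budget_ratio; the defaults
// are the documented ones, MAX_CONNECTIONS is an assumed env var.
const hotPoolMax = Number(process.env.PGBOSS_HOT_POOL_MAX ?? 50);
const tenantPoolMax = Number(process.env.TENANT_POOL_MAX ?? 5);
const pgbossTenantPoolMax = Number(process.env.PGBOSS_TENANT_POOL_MAX ?? 2);
const ceiling = Number(process.env.MAX_CONNECTIONS ?? 428);

const budget = hotPoolMax * (tenantPoolMax + pgbossTenantPoolMax); // 50 × 7 = 350
console.log(`budget=${budget} ratio=${(budget / ceiling).toFixed(2)}`); // 350/428 ≈ 0.82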

When a tenant is stuck in cold

Symptom: jobs accumulate in a tenant's pgboss.job table but no worker ever picks them up.

  1. Check listHot() via the admin surface (or log-grep for supervisor.promote with the tenant's id).

  2. If the cold dispatcher is scanning (rcm_cold_dispatcher_*) but the tenant isn't being probed → hot pool is full. Either increase PGBOSS_HOT_POOL_MAX or wait for the reaper.

  3. If the tenant is probed but never promotes, run the probe SQL manually against that tenant's DB:

    SELECT state, count(*) FROM pgboss.job GROUP BY state;

    If the rows are all in state='active' with a stale started_on, a previous worker crashed mid-lease. pg-boss reclaims them after expire_in (default 15 min); if you need them back sooner:

    UPDATE pgboss.job
    SET state = 'retry'
    WHERE state = 'active'
    AND started_on < now() - interval '5 min';

Force-promote a tenant

For on-call triage (no CLI yet; a follow-up will add one):

// In a REPL connected to the running process:
await supervisor.promote('<tenant-id>', 'manual');

Graceful restart

SIGTERM triggers supervisor.stop() which:

  1. Stops the reaper interval.
  2. Awaits any in-flight promotes.
  3. Gracefully stops every hot tenant's pg-boss (stop({ graceful: true })).
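
Continuing the hypothetical TenantJobSupervisor sketch from "How it works", the sequence might look like this; reaperTimer and inflightPromotes are assumed internal names:

// Inside TenantJobSupervisor; assumed internals, same caveats as above.
async stop(): Promise<void> {
  clearInterval(this.reaperTimer);                 // 1. stop the reaper
  await Promise.allSettled(this.inflightPromotes); // 2. await in-flight promotes
  await Promise.all(                               // 3. graceful per-tenant stop
    [...this.hot.values()].map((boss) => boss.stop({ graceful: true })),
  );
  this.hot.clear();
}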

In-flight jobs finish before shutdown. In the 90th-percentile case, a rolling deploy drops < 2% of jobs into retry (pg-boss retries on timeout).

Deployment note — tenantId on payloads

Every ingestion job carries tenantId in its payload. Before rolling out a new build that adds a tenantId field to any new queue, drain the old queue first (or accept that in-flight jobs without the field are skipped with a supervisor.handler.payload_mismatch warn).
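
A sketch of that guard, assuming pg-boss v10's batch work() handler; the 'ingest' queue name, logger, and handleIngest are illustrative:

// Old-build jobs lack tenantId: warn with the documented event and skip.
boss.work('ingest', async (jobs) => {
  for (const job of jobs) {
    const { tenantId } = job.data as { tenantId?: string };
    if (!tenantId) {
      logger.warn({ jobId: job.id }, 'supervisor.handler.payload_mismatch');
      continue;
    }
    await handleIngest(tenantId, job.data); // hypothetical per-tenant handler
  }
});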

Validation

| Check | Expected |
| --- | --- |
| rcm_supervisor_hot_tenants | Below PGBOSS_HOT_POOL_MAX |
| rcm_supervisor_promote_duration_seconds p99 | < 2 s |
| rcm_supervisor_demote_total{reason='error'} | 0 |
| Stuck cold tenant after probe | Promotes after manual SQL or env bump |

Next

9.2 — Distributed tracing & alert response