
Respond to a connection-budget alert 🚨

Outcome

The rcm_connection_budget_ratio metric is back below the warning threshold and the platform has a documented mitigation choice.

Prerequisites

  • Grafana / Prometheus access.
  • PLATFORM_ADMIN for any mitigation that touches infrastructure or env.
  • On-call escalation rights to file an incident if the breach is sustained.

What the metric measures

rcm_connection_budget_ratio = (activeTenantCount × TENANT_POOL_MAX) / POSTGRES_SERVER_CONNECTION_LIMIT

| Threshold | Severity | What it means |
| --- | --- | --- |
| > 0.8 | Warning | Connection budget is tight; no slack for spikes. |
| > 1.0 | Page | We are oversold; a connection storm will starve queries. |

The ratio is per-server when sharding is in play. Identify the offender from identity.db_server + rcm_tenants_count{status='active'} on that server.
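As a worked example, here is the arithmetic with illustrative numbers (the counts and limits below are hypothetical, not taken from any real environment):

```python
# Illustrative numbers only; substitute the real values for the hot server.
active_tenant_count = 45         # rcm_tenants_count{status='active'} on this server
tenant_pool_max = 10             # TENANT_POOL_MAX
server_connection_limit = 500    # POSTGRES_SERVER_CONNECTION_LIMIT

ratio = active_tenant_count * tenant_pool_max / server_connection_limit
print(f"rcm_connection_budget_ratio = {ratio:.2f}")
# 0.90 -> above the 0.8 warning threshold, below the 1.0 page threshold
```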

Decision flow

Steps

  1. Confirm which server is hot. With sharding, the ratio is per-server. Open Grafana and inspect (a query sketch follows these steps):

    • rcm_connection_budget_ratio{server} — find the offending server.
    • rcm_tenants_on_server{server} vs. rcm_db_server_capacity_hint{server}.
    • rcm_connection_budget_remaining_per_server{server} for headroom.
  2. Pick the mitigation based on severity and trajectory.

    | Mitigation | When to choose | Effect |
    | --- | --- | --- |
    | Raise POSTGRES_SERVER_CONNECTION_LIMIT | One-time spike; no immediate growth in tenant count | Buys time; requires a restart of the Postgres Flexible Server (Azure Portal → Compute + storage → Max connections). |
    | Lower TENANT_POOL_MAX | Per-tenant pool is over-sized for actual concurrency | Requires a restart of rcm-core workers; reduces per-tenant throughput but unblocks onboarding. |
    | Add a shard (§1.7) | Sustained ratio > 1.0, or an unfavourable growth trajectory | Permanent capacity relief; preferred for any incident past the page threshold. |
  3. Apply the chosen mitigation and watch the metric for 30 minutes. After an env reconfiguration the ratio should drop within one TTL window; after raising max_connections and restarting Postgres, it should drop immediately. The sketch after these steps includes a 30-minute watch loop.

  4. Document the incident in the on-call log: the chosen mitigation, the projected time until the next breach (given onboarding rate), and any follow-up tickets (e.g., schedule the next shard).
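The following sketch automates steps 1 and 3 against the Prometheus HTTP API. It is illustrative only: PROM_URL and the server label value are hypothetical, and the metric names are the ones used in this runbook.

```python
# Sketch only: assumes a Prometheus HTTP API reachable at PROM_URL (hypothetical).
import time
import requests

PROM_URL = "http://prometheus.internal:9090"  # adjust to your environment

def instant_query(expr: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Step 1: find the offending server(s) above the warning threshold.
for sample in instant_query("rcm_connection_budget_ratio > 0.8"):
    server = sample["metric"].get("server", "<unknown>")
    print(f"{server}: ratio={float(sample['value'][1]):.2f}")

# Step 3: watch the hot server for 30 minutes after mitigation.
HOT_SERVER = "pg-shard-03"  # hypothetical; use the server found above
deadline = time.time() + 30 * 60
while time.time() < deadline:
    result = instant_query(f'rcm_connection_budget_ratio{{server="{HOT_SERVER}"}}')
    ratio = float(result[0]["value"][1]) if result else 0.0
    print(f"ratio={ratio:.2f}")
    if ratio < 0.8:
        print("Back below the warning threshold.")
        break
    time.sleep(60)
```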

Things NOT to do

  • Do not increase perTenantPoolMax in a hot-fix scramble: it multiplies the problem rather than dividing it, because every tenant pool's ceiling grows at once.
  • Do not drop tenants from the resolver LRU just to shed connections. They'll be re-resolved on the next request and the cycle repeats.

Validation

| Check | Expected |
| --- | --- |
| rcm_connection_budget_ratio | Below 0.8, sustained |
| rcm_resolver_hot_tenants | Stable; not flapping |
| Customer-side latency | Returns to baseline |
| Incident log | Entry filed with mitigation and follow-up |
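To script the first check, a minimal sketch against the same (hypothetical) Prometheus endpoint, using max_over_time to confirm the ratio stayed below 0.8 for the whole window:

```python
# Validation sketch: which servers stayed below 0.8 for the last 30 minutes?
# PROM_URL is hypothetical; the metric name comes from this runbook.
import requests

PROM_URL = "http://prometheus.internal:9090"
expr = "max_over_time(rcm_connection_budget_ratio[30m]) < 0.8"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
resp.raise_for_status()
passing = resp.json()["data"]["result"]  # only series that satisfy the filter
for sample in passing:
    print(f'{sample["metric"].get("server", "<unknown>")}: 30m peak {sample["value"][1]}')
```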

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Ratio drops after restart, then climbs back within minutes | Mitigation only treated the symptom; growth is real | Schedule a shard. |
| Restart of Postgres appears to have failed | A manual Azure Portal restart can take 5–10 min | Wait; if it stalls, file an Azure support ticket. |
| Multiple servers above threshold simultaneously | Estate-wide growth | Stand up a shard immediately and onboard new tenants there only. |
| Ratio shows 0 unexpectedly | Metric scrape broken | See "Distributed tracing & alert response". |

Next

1.10 — Tenant DB unreachable