# Respond to a connection-budget alert 🚨
## Outcome

The `rcm_connection_budget_ratio` metric is back below the warning threshold,
and the platform has a documented mitigation choice.
## Prerequisites

- Grafana / Prometheus access.
- `PLATFORM_ADMIN` for any mitigation that touches infrastructure or env.
- On-call escalation rights to file an incident if the breach is sustained.
## What the metric measures

`rcm_connection_budget_ratio = (activeTenantCount × TENANT_POOL_MAX) / POSTGRES_SERVER_CONNECTION_LIMIT`
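As a worked example with illustrative numbers (not real limits): 40 active tenants with `TENANT_POOL_MAX=20` against a server whose `POSTGRES_SERVER_CONNECTION_LIMIT` is 800 gives (40 × 20) / 800 = 1.0, i.e. exactly at the page threshold.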
| Threshold | Severity | What it means |
|---|---|---|
| > 0.8 | Warning | Connection budget is tight; no slack for spikes. |
| > 1.0 | Page | We are oversold; a connection storm will starve queries. |
The ratio is per-server when sharding is in play. Identify the offender
from `identity.db_server` + `rcm_tenants_count{status='active'}` on that
server.
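From the Prometheus console, a minimal sketch for surfacing the offender; the `server` label name and the example value are assumptions, not confirmed by the metrics above:

```promql
# Worst servers first (assumes the per-server label is `server`)
sort_desc(rcm_connection_budget_ratio)

# Active tenants on the suspected server (label value hypothetical)
rcm_tenants_count{status='active', server='pg-shard-02'}
```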
## Decision flow

### Steps

1. Confirm which server is hot. With sharding, the ratio is per-server. Open Grafana and inspect (see the query sketch below):
   - `rcm_connection_budget_ratio{server}` to find the offending server.
   - `rcm_tenants_on_server{server}` vs. `rcm_db_server_capacity_hint{server}`.
   - `rcm_connection_budget_remaining_per_server{server}` for headroom.
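   For a quick headroom read outside Grafana, one possible sketch; it assumes both gauges are in tenant units and share a `server` label:

   ```promql
   # Tenant slots still available per server (assumes matching `server` labels)
   rcm_db_server_capacity_hint - on(server) rcm_tenants_on_server
   ```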
2. Pick the mitigation based on severity and trajectory.

   | Mitigation | When to choose | Effect |
   |---|---|---|
   | Raise `POSTGRES_SERVER_CONNECTION_LIMIT` | One-time spike; no immediate growth in tenant count | Buys time; requires a restart of Postgres Flexible Server (Azure Portal → Compute + storage → Max connections). |
   | Lower `TENANT_POOL_MAX` | Per-tenant pool is over-sized for actual concurrency | Requires a restart of rcm-core workers. Reduces per-tenant throughput; unblocks onboarding. |
   | Add a shard (§1.7) | Sustained ratio > 1.0, or growth trajectory unfavourable | Permanent capacity relief; preferred for any incident past the page threshold. |

3. Apply the chosen mitigation and watch the metric for 30 min (see the recovery query after this list). The ratio should drop within one TTL window after env reconfiguration, or immediately after a `max_connections` raise + restart.

4. Document the incident in the on-call log: the chosen mitigation, the projected time until the next breach (given onboarding rate), and any follow-up tickets (e.g., schedule the next shard).
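One way to watch the recovery window from the console, as a sketch (the `server` label value is hypothetical):

```promql
# Worst ratio seen in the last 30 min on the affected server;
# should settle below 0.8 as the mitigation takes effect
max_over_time(rcm_connection_budget_ratio{server='pg-shard-02'}[30m])
```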
## Things NOT to do

- Do not increase `perTenantPoolMax` in a hot-fix scramble; it multiplies instead of divides the problem, because every tenant pool's max grows (see the arithmetic below).
- Do not drop tenants from the resolver LRU just to shed connections. They'll be re-resolved on the next request and the cycle repeats.
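To see the multiplication with the same illustrative numbers as above: raising the per-tenant max from 20 to 30 across 40 active tenants lifts worst-case demand from 800 to 1,200 connections against the same 800-connection server, moving the ratio from 1.0 to 1.5.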
## Validation

| Check | Expected |
|---|---|
| `rcm_connection_budget_ratio` | Below 0.8 sustained |
| `rcm_resolver_hot_tenants` | Stable; not flapping |
| Customer-side latency | Returns to baseline |
| Incident log | Entry filed with mitigation + follow-up |
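The sustained check can be expressed as a single query; a sketch, assuming the 30-minute watch window from the steps above:

```promql
# Returns only the servers whose worst value over the window stayed below 0.8
max_over_time(rcm_connection_budget_ratio[30m]) < 0.8
```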
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Ratio drops after restart, climbs back within minutes | Mitigation only treated symptom; growth is real | Schedule a shard. |
| Postgres restart appears stuck or failed | Manual Azure Portal restarts can take 5–10 min | Wait it out; if it stalls beyond that, file an Azure support ticket. |
| Multiple servers above threshold simultaneously | Estate-wide growth | Stand up a shard immediately and onboard new tenants there only. |
| Ratio shows 0 unexpectedly | Metric scrape broken | See Distributed tracing & alert response. |
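If the ratio reads 0, two quick scrape-health probes before reaching for tracing; the job name is an assumption:

```promql
# Is the exporter target down? (job name hypothetical)
up{job='rcm-core'} == 0

# Non-empty only if the series has vanished from the scrape entirely
absent(rcm_connection_budget_ratio)
```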