
Respond to a connection-budget alert 🚨

Outcome

The rcm_connection_budget_ratio metric is back below the warning threshold and the platform has a documented mitigation choice.

Prerequisites

  • Grafana / Prometheus access.
  • PLATFORM_ADMIN for any mitigation that touches infrastructure or env.
  • On-call escalation rights to file an incident if the breach is sustained.

What the metric measures

rcm_connection_budget_ratio = (activeTenantCount × TENANT_POOL_MAX) / POSTGRES_SERVER_CONNECTION_LIMIT

| Threshold | Severity | What it means |
| --- | --- | --- |
| > 0.8 | Warning | Connection budget is tight; no slack for spikes. |
| > 1.0 | Page | We are oversold; a connection storm will starve queries. |

The ratio is per-server when sharding is in play. Identify the offender from identity.db_server + rcm_tenants_count{status='active'} on that server.
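As a worked example, here is the arithmetic with illustrative numbers (the counts and limits below are hypothetical, not taken from any real environment):

```python
# Illustrative numbers only; substitute the real values for the hot server.
active_tenant_count = 45         # rcm_tenants_count{status='active'} on this server
tenant_pool_max = 10             # TENANT_POOL_MAX
server_connection_limit = 500    # POSTGRES_SERVER_CONNECTION_LIMIT

ratio = active_tenant_count * tenant_pool_max / server_connection_limit
print(f"rcm_connection_budget_ratio = {ratio:.2f}")
# 0.90 -> above the 0.8 warning threshold, below the 1.0 page threshold
```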

Decision flow

Steps

  1. Confirm which server is hot. With sharding, the ratio is per-server. Open Grafana and inspect (a query sketch follows these steps):

    • rcm_connection_budget_ratio{server} — find the offending server.
    • rcm_tenants_on_server{server} vs. rcm_db_server_capacity_hint{server}.
    • rcm_connection_budget_remaining_per_server{server} for headroom.
  2. Pick the mitigation based on severity and trajectory.

    | Mitigation | When to choose | Effect |
    | --- | --- | --- |
    | Raise POSTGRES_SERVER_CONNECTION_LIMIT | One-time spike; no immediate growth in tenant count | Buys time; requires a restart of the Postgres Flexible Server (Azure Portal → Compute + storage → Max connections). |
    | Lower TENANT_POOL_MAX | Per-tenant pool is over-sized for actual concurrency | Requires a restart of rcm-core workers; reduces per-tenant throughput but unblocks onboarding. |
    | Add a shard (§1.7) | Sustained ratio > 1.0, or an unfavourable growth trajectory | Permanent capacity relief; preferred for any incident past the page threshold. |
  3. Apply the chosen mitigation and watch the metric for 30 minutes. After an env reconfiguration the ratio should drop within one TTL window; after raising max_connections and restarting Postgres, it should drop immediately. The sketch after these steps includes a 30-minute watch loop.

  4. Document the incident in the on-call log: the chosen mitigation, the projected time until the next breach (given onboarding rate), and any follow-up tickets (e.g., schedule the next shard).
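The following sketch automates steps 1 and 3 against the Prometheus HTTP API. It is illustrative only: PROM_URL and the server label value are hypothetical, and the metric names are the ones used in this runbook.

```python
# Sketch only: assumes a Prometheus HTTP API reachable at PROM_URL (hypothetical).
import time
import requests

PROM_URL = "http://prometheus.internal:9090"  # adjust to your environment

def instant_query(expr: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Step 1: find the offending server(s) above the warning threshold.
for sample in instant_query("rcm_connection_budget_ratio > 0.8"):
    server = sample["metric"].get("server", "<unknown>")
    print(f"{server}: ratio={float(sample['value'][1]):.2f}")

# Step 3: watch the hot server for 30 minutes after mitigation.
HOT_SERVER = "pg-shard-03"  # hypothetical; use the server found above
deadline = time.time() + 30 * 60
while time.time() < deadline:
    result = instant_query(f'rcm_connection_budget_ratio{{server="{HOT_SERVER}"}}')
    ratio = float(result[0]["value"][1]) if result else 0.0
    print(f"ratio={ratio:.2f}")
    if ratio < 0.8:
        print("Back below the warning threshold.")
        break
    time.sleep(60)
```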

Things NOT to do

  • Do not increase perTenantPoolMax in a hot-fix scramble: it multiplies the problem rather than dividing it, because every tenant pool's ceiling grows at once.
  • Do not drop tenants from the resolver LRU just to shed connections. They'll be re-resolved on the next request and the cycle repeats.

Validation

| Check | Expected |
| --- | --- |
| rcm_connection_budget_ratio | Below 0.8, sustained |
| rcm_resolver_hot_tenants | Stable; not flapping |
| Customer-side latency | Returns to baseline |
| Incident log | Entry filed with mitigation and follow-up |
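To script the first check, a minimal sketch against the same (hypothetical) Prometheus endpoint, using max_over_time to confirm the ratio stayed below 0.8 for the whole window:

```python
# Validation sketch: which servers stayed below 0.8 for the last 30 minutes?
# PROM_URL is hypothetical; the metric name comes from this runbook.
import requests

PROM_URL = "http://prometheus.internal:9090"
expr = "max_over_time(rcm_connection_budget_ratio[30m]) < 0.8"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
resp.raise_for_status()
passing = resp.json()["data"]["result"]  # only series that satisfy the filter
for sample in passing:
    print(f'{sample["metric"].get("server", "<unknown>")}: 30m peak {sample["value"][1]}')
```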

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Ratio drops after restart, then climbs back within minutes | Mitigation only treated the symptom; growth is real | Schedule a shard. |
| Restart of Postgres appears to have failed | A manual Azure Portal restart can take 5–10 min | Wait; if it stalls, file an Azure support ticket. |
| Multiple servers above threshold simultaneously | Estate-wide growth | Stand up a shard immediately and onboard new tenants there only. |
| Ratio shows 0 unexpectedly | Metric scrape broken | See "Distributed tracing & alert response". |

Next

1.10 — Tenant DB unreachable