Distributed tracing and alert response
Outcome
You can pull a single claim's full journey out of logs and traces, and you have a pre-written response for the Prometheus alerts most likely to page you.
Prerequisites
- Grafana / Prometheus access.
- OTel UI (Jaeger / Tempo / your collector's UI) access.
- Log aggregator with structured JSON fields.
Three identifiers, three scopes
| Field | Where it lives | Scope |
|---|---|---|
| `request_id` | log line; `x-request-id`/`x-trace-id` response headers | one HTTP request |
| `correlation_id` | log line; `x-correlation-id` request header; event-bus envelope; pg-boss payload | end-to-end across many requests (one user action, one batch run) |
| `trace_id` (OTel) | log line; exported span attribute | full distributed trace including downstream services |
`request_id` and the OTel `trace_id` are different identifiers; both are emitted on every log line so that legacy greps and OTel queries both work.
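For orientation, a minimal sketch of one structured log line carrying all three identifiers. Only `request_id`, `correlation_id`, and `trace_id` are guaranteed by this section; the surrounding fields and all values are illustrative:

```json
{
  "timestamp": "2024-05-01T12:00:00.000Z",
  "level": "info",
  "msg": "claim accepted",
  "request_id": "req-2f6c1e",
  "correlation_id": "b8e1a7c4-claim-submit",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```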
Pulling a single claim's journey out of logs
1. Find the inbound submission in the UI (or the `claim.submitted` event row) and copy its `correlationId`.
2. Grep all rcm-core / edi-gateway logs for that id:

   ```bash
   gunzip -c rcm-core-*.log.gz | jq 'select(.correlation_id=="<id>")'
   gunzip -c edi-gateway-*.log.gz | jq 'select(.correlation_id=="<id>")'
   ```

3. Order by timestamp. Expect, in this order: inbound HTTP request log → `supervisor.handler.start` → `supervisor.handler.finish` → consumer `start` → `x12.generate.*` span attribute lines → outbound submission.
4. If a step is missing, search the OTel UI by the `trace_id` from the last surviving log line; the trace will show where the chain dropped.
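Putting steps 2 and 3 together, a sketch of a one-liner that collects the journey from both services and orders it by timestamp (this assumes the timestamp field is literally named `timestamp` and the message field `msg`; adjust to your logger's keys):

```bash
CID="<correlation-id>"
gunzip -c rcm-core-*.log.gz edi-gateway-*.log.gz \
  | jq -c --arg cid "$CID" 'select(.correlation_id == $cid)' \
  | jq -s -r 'sort_by(.timestamp)[] | "\(.timestamp)  \(.msg)"'
```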
Alert response playbooks
Files: `ops/alerts/rcm-core.yml`, `ops/alerts/edi-gateway.yml`.
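For orientation only, a rough sketch of what one rule in `ops/alerts/rcm-core.yml` might look like. The expression mirrors the scrape-failure alert described below; the `job` label value is a guess, not taken from the real file:

```yaml
groups:
  - name: rcm-core
    rules:
      - alert: RcmMetricsScrapeFailing
        expr: up{job="rcm-core"} == 0
        for: 2m
        annotations:
          summary: "rcm-core scrape failing: process down or stale bearer token"
```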
RcmTenantPoolSaturation
Connection budget is above 80%. Refer to Connection budget alerts. Identify the chattiest tenants via `rcm_resolver_hot_tenants` and decide whether to lift `TENANT_POOL_MAX` or shed traffic.
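A starting-point query for the "chattiest tenants" step, assuming `rcm_resolver_hot_tenants` is a per-tenant gauge (if it is a counter, wrap it in `rate()` first):

```
topk(10, rcm_resolver_hot_tenants)
```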
RcmTenantStatusRejectionSpike
`enforceTenantStatus` is rejecting requests at > 5/s sustained for 5 minutes. The cause is almost always one of:

- A tenant was moved to `read_only`/`suspended` and a client wasn't drained.
- An operator forgot to refresh the metadata cache after a status change (the TTL is `TENANT_METADATA_CACHE_TTL_MS`).
Cross-check `rcm_tenant_status_rejection_total{status,verb}`. If a single tenant dominates, page that tenant's account team.
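A breakdown over the labels named above; if the metric also carries a tenant label (not documented here), add it to the `by` clause to confirm whether one tenant dominates:

```
topk(5, sum by (status, verb) (rate(rcm_tenant_status_rejection_total[5m])))
```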
RcmIngestionQueueLag
Enqueue rate is exceeding the promote rate by > 100/s: the supervisor hot pool is saturated. Either:

- Raise `PGBOSS_HOT_POOL_MAX` (note: this also raises the connection budget; re-check RcmTenantPoolSaturation). See the kubectl sketch after this list.
- Confirm the cold dispatcher is running: `rate(rcm_cold_dispatcher_scan_duration_seconds_count[10m]) > 0`.
- Restart rcm-core if both pool size and dispatcher look healthy and lag persists; this usually means a wedged tenant boss instance.

See pg-boss supervisor for the full playbook.
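A sketch of the first and third options with kubectl, following the same pattern as the sampling command further down. The pool value is illustrative, not a recommendation; changing the env var rolls the deployment on its own:

```bash
# Option 1: raise the hot pool. Mind the connection budget and re-check
# RcmTenantPoolSaturation afterwards.
kubectl set env deploy/rcm-core PGBOSS_HOT_POOL_MAX=20

# Option 3: restart rcm-core to clear a wedged tenant boss instance.
kubectl rollout restart deploy/rcm-core
```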
RcmDenialRateSpike
Denial rate is > 2× the same hour yesterday. Look at the dominant CARC (claim adjustment reason code):

```
topk(5, sum by (carc_code) (rate(rcm_claim_denial_total[1h])))
```
Common drivers:
| Driver | Where to check |
|---|---|
| Payer config drift (e.g., timely-filing window changed) | Configure a new payer |
| Auto-correction handler regression | `v_auto_correction_success_rate`; see Auto-correction |
| Upstream contract change | `payer_contract.effective_to` for any expired contracts in the last week (query sketch below) |
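For the third driver, a query sketch against the primary database. Only `payer_contract.effective_to` is named in this section, so the connection string and any columns you select beyond that are up to you:

```bash
psql "$DATABASE_URL" -c "
  SELECT *
  FROM payer_contract
  WHERE effective_to >= now() - interval '7 days'
    AND effective_to < now();"
```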
RcmEdiResponseTimeoutHigh (edi-gateway)
Currently informational — metric not yet wired in edi-gateway.
When wired: outbound submissions are not receiving their paired inbound responses. Either the trading-partner connection is down (see Handle SFTP failure) or the response poller is wedged (the `ResponsePoller` next-run timestamp will be stale).
RcmMetricsScrapeFailing / EdiGatewayMetricsScrapeFailing
Prometheus `up` has been 0 for 2 minutes. Either the process is down, or the bearer token rotated and the scrape config is stale. Hit `/metrics` directly with the configured `Authorization: Bearer …` header to disambiguate (sketch below).
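A quick disambiguation sketch; the host, port, and token variable are placeholders, only the `/metrics` path and the bearer header come from this section. A 401/403 means the token rotated and the scrape config is stale; a connection refusal or timeout means the process is down:

```bash
curl -fsS -H "Authorization: Bearer $METRICS_TOKEN" \
  http://rcm-core.internal:9464/metrics | head
```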
Sampling tuning
The default `OTEL_SAMPLE_RATE` is 1.0 in dev and 0.1 in prod. During an incident, raise it to 1.0 in prod for the duration:

```bash
kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=1.0
```

Roll back when the incident closes; sustained 1.0 in prod will balloon collector storage.
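The rollback is the same command with the prod default from above:

```bash
kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=0.1
```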
Validation
| Check | Expected |
|---|---|
| `correlation_id` on every log line | Yes |
| OTel UI shows full trace for a sample claim | Yes |
| Alert rules synced to Prometheus | Yes (CI lint) |
| `OTEL_SAMPLE_RATE` rolled back after incident | Yes |
Follow-ups
- `rcm_edi_response_timeout_total` is API-shaped on rcm-core's `MetricsHandle`, but the canonical increment will live in edi-gateway's `ResponsePoller` once edi-gateway has its own Prom-client registry. Until then the alert rule is documentation.
- Tail-based sampling (instead of head-based) once trace volume warrants it.
Cross-references
- pg-boss supervisor for queue-lag triage.
- Connection budget alerts.
- Real-time event stream for `correlation_id` propagation through events.