Distributed tracing and alert response

Outcome

You can pull a single claim's full journey out of logs + traces, and you have a pre-written response for the four Prometheus alerts most likely to page you.

Prerequisites

  • Grafana / Prometheus access.
  • OTel UI (Jaeger / Tempo / your collector's UI) access.
  • Log aggregator with structured JSON fields.

Three identifiers, three scopes

| Field | Lives in | Scope |
| --- | --- | --- |
| request_id | log line, x-request-id / x-trace-id response headers | one HTTP request |
| correlation_id | log line, x-correlation-id request header, event-bus envelope, pg-boss payload | end-to-end across many requests (one user action, one batch run) |
| trace_id (OTel) | log line, exported span attribute | full distributed trace including downstream services |

request_id and OTel trace_id are different — both emitted on every line so legacy greps and OTel queries both work.
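
To confirm all three identifiers actually land on a service's log lines, a quick spot check with jq works. A minimal sketch, assuming the archived-log naming used in the next section (missing fields print as null):

    # Print the three identifier fields from a handful of recent rcm-core lines.
    gunzip -c rcm-core-*.log.gz \
      | jq -c '{request_id, correlation_id, trace_id}' \
      | head -5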

Pulling a single claim's journey out of logs

  1. Find the inbound submission in the UI (or the claim.submitted event row) and copy its correlationId.

  2. Grep all rcm-core / edi-gateway logs for that id:

    gunzip -c rcm-core-*.log.gz | jq 'select(.correlation_id=="<id>")'
    gunzip -c edi-gateway-*.log.gz | jq 'select(.correlation_id=="<id>")'
  3. Order by timestamp. Expect (in this order): inbound HTTP request log → supervisor.handler.start → supervisor.handler.finish → consumer start → x12.generate.* span attribute lines → outbound submission. (A merge-and-sort sketch follows this list.)

  4. If a step is missing, use the OTel UI and search by trace_id from the last surviving log line — the trace will show where the chain dropped.
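
The grep-and-order steps can be collapsed into one pass. A minimal sketch, assuming the log lines carry top-level time and msg fields (substitute whatever keys your lines actually use):

    CID="<id>"
    # Filter both services' archives to one correlation id, then sort the merged
    # stream by timestamp and print a compact view of each hop.
    gunzip -c rcm-core-*.log.gz edi-gateway-*.log.gz \
      | jq -c --arg cid "$CID" 'select(.correlation_id == $cid)' \
      | jq -s -c 'sort_by(.time)[] | {time, msg, request_id, trace_id}'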

Alert response playbooks

Files: ops/alerts/rcm-core.yml, ops/alerts/edi-gateway.yml.

RcmTenantPoolSaturation

Connection budget > 80%. Refer to Connection budget alerts. Identify the chattiest tenants via rcm_resolver_hot_tenants and decide whether to lift TENANT_POOL_MAX or shed traffic.
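
If you want the hot-tenant list from the CLI rather than Grafana, a sketch against the Prometheus HTTP API (the metric name comes from this page; the tenant label name and topk shape are assumptions about how the gauge is labelled):

    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=topk(5, rcm_resolver_hot_tenants)' \
      | jq '.data.result[] | {tenant: .metric.tenant, value: .value[1]}'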

RcmTenantStatusRejectionSpike

enforceTenantStatus is rejecting requests at > 5/s sustained for 5 min. The cause is almost always one of:

  • A tenant got moved to read_only / suspended and a client wasn't drained.
  • An operator forgot to refresh the metadata cache after a status change (TTL is TENANT_METADATA_CACHE_TTL_MS).

Cross-check rcm_tenant_status_rejection_total{status,verb}. If a single tenant dominates, page that tenant's account team.
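A sketch of that cross-check from the CLI, assuming a tenant label alongside the documented status and verb labels:

    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=topk(5, sum by (tenant, status) (rate(rcm_tenant_status_rejection_total[5m])))' \
      | jq '.data.result[] | {labels: .metric, per_sec: .value[1]}'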

RcmIngestionQueueLag

Enqueue rate is exceeding promote rate by > 100/s, meaning the supervisor hot pool is saturated. Work through, in order:

  • Raise PGBOSS_HOT_POOL_MAX (note: this also raises connection budget — re-check RcmTenantPoolSaturation).
  • Confirm the cold dispatcher is running: rate(rcm_cold_dispatcher_scan_duration_seconds_count[10m]) > 0.
  • Restart rcm-core if both pool size and dispatcher look healthy and lag persists — usually means a wedged tenant boss instance.
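
A sketch of those checks in sequence (deploy name, env var, and metric come from this page; $PROM_URL is a placeholder):

    # 1. Current hot-pool ceiling on the running deployment.
    kubectl set env deploy/rcm-core --list | grep PGBOSS_HOT_POOL_MAX
    # 2. Is the cold dispatcher scanning at all? Expect a non-zero rate.
    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=rate(rcm_cold_dispatcher_scan_duration_seconds_count[10m])' \
      | jq '.data.result[0].value[1]'
    # 3. Last resort once pool size and dispatcher both look healthy:
    kubectl rollout restart deploy/rcm-core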

See pg-boss supervisor for the full playbook.

RcmDenialRateSpike

Denial rate > 2× same-hour-yesterday. Look at the dominant CARC:

topk(5, sum by (carc_code) (rate(rcm_claim_denial_total[1h])))

Common drivers:

| Driver | Where to check |
| --- | --- |
| Payer config drift (e.g., timely-filing window changed) | Configure a new payer |
| Auto-correction handler regression | v_auto_correction_success_rate; see Auto-correction |
| Upstream contract change | payer_contract.effective_to for any expired contracts in the last week |
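
To check the contract-expiry driver quickly, a minimal SQL sketch via psql (payer_contract.effective_to is from the row above; the payer_id column and connection string are assumptions):

    psql "$DATABASE_URL" -c \
      "SELECT payer_id, effective_to
         FROM payer_contract
        WHERE effective_to >= now() - interval '7 days'
          AND effective_to < now();"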

RcmEdiResponseTimeoutHigh (edi-gateway)

Currently informational — metric not yet wired in edi-gateway.

When wired: outbound submissions are not getting paired inbounds. Either the trading-partner connection is down (see Handle SFTP failure) or the response-poller is wedged (ResponsePoller next-run timestamp will be stale).

RcmMetricsScrapeFailing / EdiGatewayMetricsScrapeFailing

Prometheus up is 0 for 2m. Process is down OR the bearer token rotated and the scrape config is stale. Hit /metrics directly with the configured Authorization: Bearer … header to disambiguate.
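
For example (the URL and token variable are placeholders for however you store them):

    # 401/403 → token rotated and scrape config stale; connection refused → process is down.
    curl -si "$RCM_CORE_URL/metrics" \
      -H "Authorization: Bearer $METRICS_BEARER_TOKEN" | head -20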

Sampling tuning

Default OTEL_SAMPLE_RATE: 1.0 in dev, 0.1 in prod. During an incident, raise to 1.0 in prod for the duration:

kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=1.0

Roll back when the incident closes — sustained 1.0 in prod will balloon collector storage.
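
The rollback is the same command with the documented prod default:

kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=0.1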

Validation

| Check | Expected |
| --- | --- |
| correlation_id on every log line | Yes |
| OTel UI shows full trace for a sample claim | Yes |
| Alert rules synced to Prometheus | Yes (CI lint) |
| OTEL_SAMPLE_RATE rolled back after incident | Yes |

Follow-ups

  • rcm_edi_response_timeout_total is API-shaped on rcm-core's MetricsHandle, but the canonical increment will live in edi-gateway's ResponsePoller once edi-gateway has its own Prom-client registry. Until then the alert rule is documentation.
  • Tail-based sampling (instead of head-based) once trace volume warrants it.

Cross-references

Next

9.3 — Deployability + DR drill