Distributed tracing and alert response

Outcome

You can pull a single claim's full journey out of logs + traces, and you have a pre-written response for the four Prometheus alerts most likely to page you.

Prerequisites

  • Grafana / Prometheus access.
  • OTel UI (Jaeger / Tempo / your collector's UI) access.
  • Log aggregator with structured JSON fields.

Three identifiers, three scopes

| Field | Lives in | Scope |
| --- | --- | --- |
| request_id | log line, x-request-id / x-trace-id response headers | one HTTP request |
| correlation_id | log line, x-correlation-id request header, event-bus envelope, pg-boss payload | end-to-end across many requests (one user action, one batch run) |
| trace_id (OTel) | log line, exported span attribute | full distributed trace including downstream services |

request_id and OTel trace_id are different — both emitted on every line so legacy greps and OTel queries both work.
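
To confirm all three identifiers actually land on a service's log lines, a quick spot check with jq works. A minimal sketch, assuming the archived-log naming used in the next section (missing fields print as null):

    # Print the three identifier fields from a handful of recent rcm-core lines.
    gunzip -c rcm-core-*.log.gz \
      | jq -c '{request_id, correlation_id, trace_id}' \
      | head -5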

Pulling a single claim's journey out of logs

  1. Find the inbound submission in the UI (or the claim.submitted event row) and copy its correlationId.

  2. Grep all rcm-core / edi-gateway logs for that id:

    gunzip -c rcm-core-*.log.gz | jq 'select(.correlation_id=="<id>")'
    gunzip -c edi-gateway-*.log.gz | jq 'select(.correlation_id=="<id>")'
  3. Order by timestamp. Expect (in this order): inbound HTTP request log → supervisor.handler.start → supervisor.handler.finish → consumer start → x12.generate.* span attribute lines → outbound submission. (A merge-and-sort sketch follows this list.)

  4. If a step is missing, use the OTel UI and search by trace_id from the last surviving log line — the trace will show where the chain dropped.
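
The grep-and-order steps can be collapsed into one pass. A minimal sketch, assuming the log lines carry top-level time and msg fields (substitute whatever keys your lines actually use):

    CID="<id>"
    # Filter both services' archives to one correlation id, then sort the merged
    # stream by timestamp and print a compact view of each hop.
    gunzip -c rcm-core-*.log.gz edi-gateway-*.log.gz \
      | jq -c --arg cid "$CID" 'select(.correlation_id == $cid)' \
      | jq -s -c 'sort_by(.time)[] | {time, msg, request_id, trace_id}'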

Alert response playbooks

Files: ops/alerts/rcm-core.yml, ops/alerts/edi-gateway.yml.

RcmTenantPoolSaturation

Connection budget > 80%. Refer to Connection budget alerts. Identify the chattiest tenants via rcm_resolver_hot_tenants and decide whether to lift TENANT_POOL_MAX or shed traffic.
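
If you want the hot-tenant list from the CLI rather than Grafana, a sketch against the Prometheus HTTP API (the metric name comes from this page; the tenant label name and topk shape are assumptions about how the gauge is labelled):

    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=topk(5, rcm_resolver_hot_tenants)' \
      | jq '.data.result[] | {tenant: .metric.tenant, value: .value[1]}'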

RcmTenantStatusRejectionSpike

enforceTenantStatus is rejecting requests at > 5/s sustained for 5 min. The cause is almost always one of:

  • A tenant got moved to read_only / suspended and a client wasn't drained.
  • An operator forgot to refresh the metadata cache after a status change (TTL is TENANT_METADATA_CACHE_TTL_MS).

Cross-check rcm_tenant_status_rejection_total{status,verb}. If a single tenant dominates, page that tenant's account team.
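A sketch of that cross-check from the CLI, assuming a tenant label alongside the documented status and verb labels:

    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=topk(5, sum by (tenant, status) (rate(rcm_tenant_status_rejection_total[5m])))' \
      | jq '.data.result[] | {labels: .metric, per_sec: .value[1]}'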

RcmIngestionQueueLag

Enqueue rate is exceeding promote rate by > 100/s, meaning the supervisor hot pool is saturated. Work through, in order:

  • Raise PGBOSS_HOT_POOL_MAX (note: this also raises connection budget — re-check RcmTenantPoolSaturation).
  • Confirm the cold dispatcher is running: rate(rcm_cold_dispatcher_scan_duration_seconds_count[10m]) > 0.
  • Restart rcm-core if both pool size and dispatcher look healthy and lag persists — usually means a wedged tenant boss instance.
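
A sketch of those checks in sequence (deploy name, env var, and metric come from this page; $PROM_URL is a placeholder):

    # 1. Current hot-pool ceiling on the running deployment.
    kubectl set env deploy/rcm-core --list | grep PGBOSS_HOT_POOL_MAX
    # 2. Is the cold dispatcher scanning at all? Expect a non-zero rate.
    curl -sG "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=rate(rcm_cold_dispatcher_scan_duration_seconds_count[10m])' \
      | jq '.data.result[0].value[1]'
    # 3. Last resort once pool size and dispatcher both look healthy:
    kubectl rollout restart deploy/rcm-core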

See pg-boss supervisor for the full playbook.

RcmDenialRateSpike

Denial rate > 2× same-hour-yesterday. Look at the dominant CARC:

topk(5, sum by (carc_code) (rate(rcm_claim_denial_total[1h])))

Common drivers:

| Driver | Where to check |
| --- | --- |
| Payer config drift (e.g., timely-filing window changed) | Configure a new payer |
| Auto-correction handler regression | v_auto_correction_success_rate; see Auto-correction |
| Upstream contract change | payer_contract.effective_to for any expired contracts in the last week |
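
To check the contract-expiry driver quickly, a minimal SQL sketch via psql (payer_contract.effective_to is from the row above; the payer_id column and connection string are assumptions):

    psql "$DATABASE_URL" -c \
      "SELECT payer_id, effective_to
         FROM payer_contract
        WHERE effective_to >= now() - interval '7 days'
          AND effective_to < now();"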

RcmEdiResponseTimeoutHigh (edi-gateway)

Currently informational — metric not yet wired in edi-gateway.

When wired: outbound submissions are not getting paired inbounds. Either the trading-partner connection is down (see Handle SFTP failure) or the response-poller is wedged (ResponsePoller next-run timestamp will be stale).

RcmMetricsScrapeFailing / EdiGatewayMetricsScrapeFailing

Prometheus up is 0 for 2m. Process is down OR the bearer token rotated and the scrape config is stale. Hit /metrics directly with the configured Authorization: Bearer … header to disambiguate.
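
For example (the URL and token variable are placeholders for however you store them):

    # 401/403 → token rotated and scrape config stale; connection refused → process is down.
    curl -si "$RCM_CORE_URL/metrics" \
      -H "Authorization: Bearer $METRICS_BEARER_TOKEN" | head -20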

Sampling tuning

Default OTEL_SAMPLE_RATE: 1.0 in dev, 0.1 in prod. During an incident, raise to 1.0 in prod for the duration:

kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=1.0

Roll back when the incident closes — sustained 1.0 in prod will balloon collector storage.
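
The rollback is the same command with the documented prod default:

kubectl set env deploy/rcm-core OTEL_SAMPLE_RATE=0.1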

Validation

| Check | Expected |
| --- | --- |
| correlation_id on every log line | Yes |
| OTel UI shows full trace for a sample claim | Yes |
| Alert rules synced to Prometheus | Yes (CI lint) |
| OTEL_SAMPLE_RATE rolled back after incident | Yes |

Follow-ups

  • rcm_edi_response_timeout_total is API-shaped on rcm-core's MetricsHandle, but the canonical increment will live in edi-gateway's ResponsePoller once edi-gateway has its own Prom-client registry. Until then the alert rule is documentation.
  • Tail-based sampling (instead of head-based) once trace volume warrants it.

Cross-references

Next

9.3 — Deployability + DR drill