Why the gateway is a central visibility point
An API gateway sits in the request path for many (or all) API calls, which makes it a high-leverage place to observe traffic patterns and failures consistently. Unlike individual services, the gateway sees every incoming request before it fans out to upstream services, and it can standardize what gets recorded (fields, formats, IDs) even when upstream teams use different frameworks.
At the gateway, you can answer operational questions quickly: Which clients are failing? Which routes are slow? Are errors caused by authentication, rate limits, or upstream timeouts? To do that reliably, you instrument the gateway using the three pillars of observability: logs, metrics, and traces, and you connect them using correlation IDs and trace context propagation.
The three pillars at the gateway
1) Logs (event details)
Gateway logs capture discrete events with rich context: a request arrived, it was rejected, it was forwarded, the upstream responded, a timeout occurred, etc. Logs are best for investigations and for answering “what happened for this specific request?”
2) Metrics (aggregated numbers)
Metrics are numeric time series (counters, gauges, histograms) that let you detect changes and trends: error rate spikes, p95 latency increases, or a surge in 401/403 responses. Metrics are best for dashboards and alerting.
3) Traces (end-to-end request path)
Distributed tracing shows how a single request flows through the gateway and upstream services, with timing for each hop. Traces are best for pinpointing where latency is introduced (gateway processing vs. upstream time vs. retries).
What data to capture at the gateway
Capture enough to troubleshoot and measure performance, but not so much that you leak sensitive data or create excessive cost. A practical baseline is to standardize a “gateway access log schema,” a “gateway metrics set,” and “trace attributes.”
Request/response metadata (logs + trace attributes)
- Timestamp (and timezone) and environment (prod/stage)
- Request ID (gateway-generated) and correlation ID (propagated)
- Trace context (trace ID/span ID) if tracing is enabled
- Client identity: API key ID / OAuth client ID / subject ID (log stable identifiers, not raw tokens)
- Source info: client IP (or forwarded-for), user agent, TLS version (optional)
- Route info: host, method, normalized path/route template (e.g., /v1/orders/{id}), gateway service name, upstream target
- Response info: status code, response size, content type (optional)
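If you want to enforce this baseline consistently, one option is to pin the schema down in code so every route logs the same fields. The sketch below is illustrative (field names and types are assumptions, not a standard):

    # Illustrative gateway access-log schema; adapt field names to your own conventions.
    from typing import Optional, TypedDict

    class GatewayAccessLog(TypedDict):
        timestamp: str             # RFC 3339, UTC
        env: str                   # prod / stage
        request_id: str            # gateway-generated
        correlation_id: str        # propagated from the client or generated at the gateway
        trace_id: Optional[str]    # present when tracing is enabled
        client_id: Optional[str]   # stable identifier, never a raw token
        client_ip: Optional[str]
        http_method: str
        route: str                 # normalized template, e.g. /v1/orders/{id}
        upstream: Optional[str]
        status: int
        response_bytes: Optional[int]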
Latency breakdown (metrics + traces)
At minimum, record total request duration. Ideally, break it down so you can tell where time is spent:
- Gateway processing time (routing, policy evaluation, transformations)
- Upstream connect time (TCP/TLS handshake)
- Upstream time to first byte (TTFB)
- Upstream response time (full body)
- Retry time (if the gateway retries upstream calls)
Expose latency as histograms so you can compute p50/p95/p99 per route and per upstream.
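As a sketch of that setup, assuming a Prometheus-style Python client and illustrative metric names, per-route and per-upstream histograms could be registered and fed like this:

    # Sketch: latency histograms labeled by route template and upstream
    # (metric names and buckets are illustrative).
    from prometheus_client import Histogram

    REQUEST_SECONDS = Histogram(
        "gateway_request_duration_seconds",
        "Total request duration observed at the gateway",
        ["route", "upstream"],
        buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
    )
    UPSTREAM_SECONDS = Histogram(
        "gateway_upstream_duration_seconds",
        "Upstream response time (connect + TTFB + body)",
        ["route", "upstream"],
    )

    def record_latency(route: str, upstream: str, total_s: float, upstream_s: float) -> None:
        # Observe both series so dashboards can split gateway time from upstream time.
        REQUEST_SECONDS.labels(route=route, upstream=upstream).observe(total_s)
        UPSTREAM_SECONDS.labels(route=route, upstream=upstream).observe(upstream_s)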
Status codes and error categories (metrics + logs)
Record both the raw HTTP status code and a normalized error category so dashboards stay readable:
- 2xx success
- 4xx client errors (split out 401, 403, 404, 429)
- 5xx server errors (gateway-generated vs upstream-generated)
- Timeouts (connect timeout, read timeout)
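A small normalization helper keeps dashboards consistent; the category names below are illustrative, not a standard:

    # Map a raw status code (plus timeout and origin flags) to a normalized error category.
    def error_category(status: int, gateway_generated: bool = False,
                       timed_out: bool = False) -> str:
        if timed_out:
            return "upstream_timeout"
        if 200 <= status < 300:
            return "success"
        if status in (401, 403):
            return "auth_error"
        if status == 429:
            return "rate_limited"
        if 400 <= status < 500:
            return "client_error"
        if status >= 500:
            return "gateway_error" if gateway_generated else "upstream_error"
        return "other"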
Rate-limit events (logs + metrics)
When requests are throttled, capture fields that explain “who was limited and why”:
- Limit key (e.g., client ID, API key ID, IP, tenant)
- Policy name / rule ID
- Decision: allowed vs rejected
- Remaining and reset (if available)
- HTTP status (often 429)
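As one possible shape for such an event, the hypothetical helper below assembles those fields into a structured record (field names are illustrative):

    # Build a throttle-decision log record; None values can simply be dropped by the logger.
    def rate_limit_event(limit_key, policy, allowed, remaining=None, reset_epoch_s=None):
        return {
            "event": "rate_limit_decision",
            "limit_key": limit_key,        # e.g., client ID, API key ID, IP, tenant
            "policy": policy,              # policy name / rule ID
            "decision": "allowed" if allowed else "rejected",
            "remaining": remaining,
            "reset": reset_epoch_s,
            "http_status": None if allowed else 429,
        }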
Authentication/authorization failures (logs + metrics)
For 401/403, record enough to diagnose configuration issues and attacks without logging secrets:
- Failure type: missing credential, invalid signature, expired token, insufficient scope/role
- Credential identifier: token issuer/audience, key ID, client ID (never the raw token)
- Route and method
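A minimal sketch of an auth-failure record, with illustrative field and reason names, and deliberately excluding the raw credential:

    # Record the failure type and stable credential identifiers, never the token itself.
    FAILURE_TYPES = {"missing_credential", "invalid_signature",
                     "expired_token", "insufficient_scope"}

    def auth_failure_record(failure_type, route, method,
                            issuer=None, key_id=None, client_id=None):
        return {
            "event": "auth_failure",
            "failure_type": failure_type if failure_type in FAILURE_TYPES else "other",
            "token_issuer": issuer,    # metadata extracted during verification
            "key_id": key_id,
            "client_id": client_id,
            "route": route,
            "http_method": method,
            # no Authorization header value, signature, or token body is logged
        }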
Correlation IDs and trace context propagation
Correlation is how you connect a gateway log line to an upstream service log line and to a distributed trace. You typically use two related mechanisms:
- Correlation ID: a stable request identifier passed via a header (commonly X-Correlation-ID).
- Trace context: standardized headers for distributed tracing (commonly W3C traceparent and tracestate).
Step-by-step: implement correlation IDs at the gateway
Goal: every request has a correlation ID, and it is returned to the client and forwarded upstream.
- Step 1: Accept an incoming correlation ID if present. If the client sends X-Correlation-ID, validate it (length/charset) to prevent header injection or log pollution.
- Step 2: Generate one if missing. Use a UUID (or similar) generated by the gateway.
- Step 3: Store it in gateway context. Make it available to logging, metrics labels (careful), and tracing attributes.
- Step 4: Forward it upstream. Add or overwrite X-Correlation-ID on the upstream request.
- Step 5: Return it to the client. Add X-Correlation-ID to the response so support teams can ask clients for it.
    # Pseudocode for a gateway policy/middleware flow (conceptual)
    if request.header['X-Correlation-ID'] is valid:
        corr_id = request.header['X-Correlation-ID']
    else:
        corr_id = uuid_v4()

    context['correlation_id'] = corr_id
    upstream_request.header['X-Correlation-ID'] = corr_id
    response.header['X-Correlation-ID'] = corr_id
    log.fields['correlation_id'] = corr_id

Step-by-step: propagate trace context headers
Goal: traces show the gateway span and upstream spans under the same trace.
- Step 1: Enable tracing in the gateway so it can create spans for inbound requests and outbound upstream calls.
- Step 2: Extract incoming trace context. If the request includes traceparent, continue that trace; otherwise start a new trace.
- Step 3: Inject trace context to upstream. Ensure the gateway forwards traceparent and tracestate (and any vendor-specific headers if required by your tracing backend).
- Step 4: Add useful span attributes such as route template, upstream name, HTTP status code, and error flags.
- Step 5: Ensure upstream services also extract and continue context (their HTTP clients/servers must be instrumented).
    # Headers to propagate (common baseline)
    traceparent: 00-<trace-id>-<span-id>-01
    tracestate: <optional vendor state>
    X-Correlation-ID: <your correlation id>

Practical tip: keep correlation ID and trace ID both available. Correlation IDs are often used in human workflows (support tickets), while trace IDs are used by tracing systems. You can log both and optionally map correlation ID to trace ID in log fields.
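If the gateway (or a plugin/sidecar) is instrumented with OpenTelemetry, the extract-and-inject flow from the steps above looks roughly like the sketch below. It assumes the SDK and exporter are configured elsewhere, and the handler and call_upstream signatures are hypothetical:

    # Sketch: continue the caller's trace (if any) and propagate context upstream.
    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject
    from opentelemetry.trace import SpanKind

    tracer = trace.get_tracer("gateway")

    def handle(request_headers: dict, route: str, call_upstream):
        ctx = extract(request_headers)  # continues the trace if traceparent is present
        with tracer.start_as_current_span(route, context=ctx, kind=SpanKind.SERVER) as span:
            span.set_attribute("http.route", route)
            upstream_headers = {}
            inject(upstream_headers)    # writes traceparent/tracestate for the upstream hop
            upstream_headers["X-Correlation-ID"] = request_headers.get("X-Correlation-ID", "")
            response = call_upstream(upstream_headers)
            span.set_attribute("http.status_code", response.status)
            return response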
Dashboards you can build from gateway telemetry
Dashboards should be route- and upstream-aware. Use route templates (not raw paths) to avoid high-cardinality explosions (e.g., /orders/123, /orders/124).
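Most gateways can expose the matched route template directly; if yours only gives you the raw path, a small normalization step keeps label cardinality bounded. The patterns below are illustrative:

    # Map raw paths to route templates before using them as metric labels.
    import re

    ROUTE_PATTERNS = [
        (re.compile(r"^/v1/orders/[^/]+$"), "/v1/orders/{id}"),
        (re.compile(r"^/v1/users/[^/]+/orders$"), "/v1/users/{id}/orders"),
    ]

    def route_template(path: str) -> str:
        for pattern, template in ROUTE_PATTERNS:
            if pattern.match(path):
                return template
        return "unmatched"  # never emit raw, unbounded paths as labels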
Traffic and health overview
- Requests per second by route template and by upstream
- Error rate (4xx vs 5xx) stacked over time
- Top status codes (200, 401, 403, 404, 429, 500, 502, 503, 504)
Latency dashboard (focus on percentiles)
- p50/p95/p99 latency per route template
- Latency breakdown (gateway vs upstream) if available
- Slowest routes table: p95 latency, request volume, error rate
Auth and policy outcomes
- 401 rate and 403 rate over time
- Auth failure reasons (expired token, invalid signature, missing credential)
- Denied requests by client ID/tenant (use carefully to avoid cardinality and privacy issues)
Rate limiting and abuse signals
- 429 count by route and by limit key type (client/tenant/IP)
- Throttle ratio = throttled / total
- Top throttled clients (bounded list, sampled if needed)
Upstream reliability
- Upstream timeouts (connect/read) over time
- Gateway-generated 5xx vs upstream 5xx (separate series)
- Retry rate and retry success rate (if retries are enabled)
Alerts that work well at the gateway
Alerts should be actionable and avoid noise. Prefer multi-window or burn-rate style alerts for error budgets when possible, but even simple thresholds can be effective if tuned.
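As a sketch of the burn-rate idea, assuming a 99.9% availability SLO, a two-window check might look like this (the 14.4x factor is the commonly used page threshold for burning roughly 2% of a 30-day error budget in one hour):

    # Two-window burn-rate check: the long window confirms sustained impact,
    # the short window confirms the problem is still happening.
    def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
        return error_ratio / (1.0 - slo)

    def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
        return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4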
Error rate spikes
- Alert: 5xx rate > X% for Y minutes (overall and per critical route)
- Refine: separate gateway 5xx (policy failures, misroutes) from upstream 5xx (backend incidents)
- Attach context: top affected routes, top upstreams, recent deploy marker if available
p95 latency regression
- Alert: p95 latency > threshold for Y minutes on critical routes
- Refine: require minimum request volume to avoid low-traffic false positives
- Debug path: compare gateway processing time vs upstream time to localize the issue
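The volume guard can be as simple as this sketch (thresholds are illustrative):

    # Suppress the latency alert when traffic is too low for percentiles to be meaningful.
    def p95_alert(p95_ms: float, request_count: int,
                  threshold_ms: float = 500.0, min_requests: int = 100) -> bool:
        return request_count >= min_requests and p95_ms > threshold_ms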
401/403 surges
- Alert: 401 rate or 403 rate increases by N× baseline
- Common causes: expired signing keys, misconfigured issuer/audience, client rollout sending bad tokens, clock skew
- Include: top routes, top client IDs (if safe), dominant failure reason
Backend timeouts and 504s
- Alert: upstream timeout count > threshold or 504 rate > threshold
- Refine: break down by upstream and by region/zone if applicable
- Include: p95 upstream time, connect vs read timeout split
Safe logging practices (avoid secrets and PII)
Because the gateway sees credentials and user data, unsafe logging is a common source of incidents. Treat gateway logs as sensitive by default.
What not to log
- Authorization headers (Bearer tokens), API keys, session cookies
- Request/response bodies by default (often contain PII or secrets)
- Passwords, one-time codes, private keys, client secrets
- Full payment data, government IDs, or other regulated identifiers
How to log safely (practical steps)
- Step 1: Define an allowlist of headers to log (e.g., Content-Type, Accept, User-Agent, X-Correlation-ID, traceparent). Everything else is excluded unless explicitly allowed.
- Step 2: Redact sensitive headers if you must record their presence. Example: log Authorization: [REDACTED] or log only token metadata like issuer and key ID extracted during verification.
- Step 3: Normalize paths to route templates to avoid leaking IDs in URLs and to reduce cardinality.
- Step 4: Hash or tokenize identifiers when you need grouping but not raw values (e.g., hash an email with a keyed hash). Ensure the hashing approach matches your privacy requirements.
- Step 5: Sample verbose logs (like debug logs) and keep access logs structured and minimal.
- Step 6: Set retention and access controls for logs (shorter retention for more sensitive fields; restrict who can query).
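The sketch below combines steps 1, 2, and 4: header allowlisting, redaction of sensitive headers, and keyed hashing of identifiers. The allowlist contents and key management are assumptions to adapt to your environment:

    # Allowlist + redact headers, and pseudonymize identifiers with a keyed hash.
    import hashlib
    import hmac

    ALLOWED_HEADERS = {"content-type", "accept", "user-agent",
                       "x-correlation-id", "traceparent"}
    REDACTED_HEADERS = {"authorization", "cookie", "x-api-key"}

    def loggable_headers(headers: dict) -> dict:
        out = {}
        for name, value in headers.items():
            key = name.lower()
            if key in ALLOWED_HEADERS:
                out[key] = value
            elif key in REDACTED_HEADERS:
                out[key] = "[REDACTED]"   # record presence, never the value
        return out                        # everything else is dropped entirely

    def pseudonymize(identifier: str, secret_key: bytes) -> str:
        # Keyed hash so values can be grouped without storing the raw identifier.
        return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()[:16]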
Structured log example (safe baseline)
{ "timestamp": "2026-01-16T12:34:56Z", "env": "prod", "correlation_id": "2f1c2c2a-3c7a-4b1c-9b2f-2f3d0a1b9e3a", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "http_method": "GET", "route": "/v1/orders/{id}", "status": 504, "gateway_latency_ms": 12, "upstream_latency_ms": 4988, "upstream": "orders-service", "error_category": "upstream_timeout", "rate_limited": false, "auth_outcome": "success", "client_id": "client_123" }Notice what is missing: no raw tokens, no cookies, no request body, no full URL with query parameters. If query parameters are needed for debugging, log an allowlisted subset (e.g., page, limit) and redact the rest.