Cloud-Native Web Serving with Kubernetes Ingress and Service Mesh

Ingress and Service Monitoring Dashboards and Alerting Signals

Chapter 11

Why dashboards and alert signals matter for Ingress and service mesh

Ingress and service mesh sit on the request path for most web traffic, so small degradations can quickly become user-visible incidents. Monitoring for these layers is not just “is it up,” but “is it behaving correctly under load, routing to the right backends, and enforcing security and policy.” In practice, you want dashboards that answer: Are requests flowing? Are errors rising? Is latency increasing? Are retries/timeouts amplifying load? Are certificates and gateways healthy? And you want alerting signals that are specific enough to be actionable without paging you for every short-lived spike.

Monitoring scope: edge (Ingress) vs in-cluster (mesh)

Edge: Ingress controller and gateway

The edge layer includes the Ingress controller (for example, NGINX Ingress, HAProxy Ingress, Traefik) or a dedicated gateway (for example, an Envoy-based gateway). Key responsibilities include HTTP routing, TLS termination, request normalization, header policies, and sometimes WAF/rate limiting. Monitoring here focuses on: request volume by host/path, 4xx/5xx rates, TLS handshake/cert issues, upstream connection health, and controller resource saturation (CPU, memory, worker connections, queueing).

Service mesh: sidecars and mesh gateways

The mesh layer (for example, Istio, Linkerd, Consul) adds per-service telemetry, mTLS, retries/timeouts, circuit breaking, and policy enforcement. Monitoring here focuses on: service-to-service success rate, latency percentiles, retry rates, TCP connection resets, mTLS handshake failures, and gateway-to-service behavior. Mesh telemetry is often richer, but it can also be noisier; dashboards should emphasize golden signals and a few mesh-specific indicators like retries and mTLS errors.

Dashboards: design around questions, not charts

Start with “golden signals” per layer

For both Ingress and mesh, build dashboards around four primary signals: traffic (requests per second), errors (4xx/5xx), latency (p50/p95/p99), and saturation (resource or capacity constraints). The difference is where you slice them: at the edge you slice by host/path and upstream service; in the mesh you slice by source workload, destination workload, and route.
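
As a concrete starting point, the two queries below sketch how the same traffic signal is sliced differently at each layer. The metric and label names (ingress_requests_total, mesh_requests_total, host, source_workload, destination_workload) are illustrative and mirror the alert examples later in this chapter; substitute whatever your controller and mesh actually export.

# Edge: request rate per hostname (illustrative metric/label names)
sum by (host) (rate(ingress_requests_total[5m]))

# Mesh: request rate per source/destination workload pair
sum by (source_workload, destination_workload) (rate(mesh_requests_total[5m]))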

Use a drill-down structure

A practical dashboard layout is: (1) Global overview, (2) Edge/Gateway overview, (3) Per-host/per-route view, (4) Per-service view, (5) Infrastructure health (pods/nodes). Each panel should have a clear “next click” for investigation. For example: a global 5xx panel links to a table of top routes by 5xx, which links to a per-route latency and upstream breakdown.


Ingress controller dashboards: what to include

Traffic and status codes by host and path

At the edge, aggregate requests by host and path prefix. You want stacked status code panels (2xx/3xx/4xx/5xx) and a “top N” table for 4xx and 5xx. This quickly distinguishes client-side issues (often 4xx due to auth, redirects, or request size) from server-side issues (5xx due to upstream failures, timeouts, or misroutes).
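
A "top N" table is usually a single topk query. The sketch below assumes an ingress_requests_total counter with host, path, and status labels; adapt the names to your controller's metrics.

# Top 10 host/path combinations by 5xx rate over the last 5 minutes
topk(10, sum by (host, path) (rate(ingress_requests_total{status=~"5.."}[5m])))

# Same idea for client-side errors
topk(10, sum by (host, path) (rate(ingress_requests_total{status=~"4.."}[5m])))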

Latency percentiles and upstream timing

Ingress latency should be tracked as percentiles (p50/p95/p99). If your controller exports upstream timing (for example, upstream response time), chart both “request duration at ingress” and “upstream response time.” A growing gap between them can indicate queueing, connection pool exhaustion, TLS overhead, or rate limiting at the edge.
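
One way to visualize that gap is to chart the two percentiles side by side. The histogram names below (ingress_request_duration_seconds_bucket and ingress_upstream_response_seconds_bucket) are assumptions; use whatever duration histograms your controller exposes.

# p95 of total request duration measured at the ingress
histogram_quantile(0.95,
  sum by (le) (rate(ingress_request_duration_seconds_bucket{host="example.com"}[5m])))

# p95 of upstream response time, if the controller exports it
histogram_quantile(0.95,
  sum by (le) (rate(ingress_upstream_response_seconds_bucket{host="example.com"}[5m])))

If the first percentile grows while the second stays flat, the extra time is being spent at the edge rather than in the backend.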

Upstream health and error reasons

Many controllers expose counters for upstream connect errors, timeouts, and resets. These are more actionable than generic 5xx. For example, a spike in upstream timeouts suggests slow backends or too-aggressive timeouts; a spike in connect errors suggests endpoints are missing, readiness is failing, or network policies are blocking.

TLS and certificate signals

Even if certificate automation is handled elsewhere, the edge should be monitored for TLS handshake failures and certificate expiration windows. A dashboard panel for “days to cert expiry” (or “cert notAfter timestamp”) per hostname helps prevent sudden outages due to misissued or non-rotating certs. Also track handshake errors and SNI mismatch errors if available.
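
If your controller or a certificate exporter exposes the notAfter timestamp as a metric (the metric name below is an assumption), a "days to cert expiry" panel is a one-line expression.

# Days until certificate expiry, per hostname (assumed metric name)
(ingress_certificate_expiry_timestamp_seconds - time()) / 86400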

Controller health and saturation

Include CPU/memory, pod restarts, and a few controller-specific saturation metrics such as worker connections, request queue length, or config reload failures. A config reload failure panel is critical: routing changes may silently not apply, leading to confusing incidents where manifests look correct but traffic still routes incorrectly.

Service mesh dashboards: what to include

Service-to-service golden signals

Build a “service graph” view (even if it is just a table) that shows request rate, success rate, and p95 latency for each destination service, with filters for namespace and workload. The key is to identify which service is the bottleneck and whether the issue is inbound to a service, outbound from a service, or at the gateway.
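
A table-style service graph can be driven by two queries per destination service. The metric names below (mesh_requests_total, mesh_request_duration_seconds_bucket) and their labels are illustrative; most meshes export equivalents.

# Success rate per destination service (non-5xx / total) over 5m
sum by (destination_service) (rate(mesh_requests_total{response_code!~"5.."}[5m]))
/
sum by (destination_service) (rate(mesh_requests_total[5m]))

# p95 latency per destination service
histogram_quantile(0.95,
  sum by (destination_service, le) (rate(mesh_request_duration_seconds_bucket[5m])))

Together with a plain request-rate query per destination service, these give the three columns of the table; add namespace and workload label filters for drill-down.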

Retries, timeouts, and circuit breaking

Retries can hide errors while increasing load and latency. Add panels for retry rate and retry budget (retries per original request). If retry rate rises alongside latency, you may have a cascading failure pattern. Also chart timeout counts and circuit breaker open events (if your mesh exposes them). These signals are often the difference between “backend is slow” and “policy is amplifying the problem.”

mTLS and identity failures

When mTLS is enabled, monitor handshake failures and authorization denials. A spike in mTLS errors often correlates with certificate rotation issues, clock skew, or misconfigured peer authentication policies. Authorization denials can indicate policy rollouts that are too strict or missing service accounts.

Gateway-specific panels

If you use a mesh ingress gateway, treat it like an edge component: request rate, status codes, latency, and upstream errors, plus gateway pod saturation. Also add a panel that compares “edge ingress” vs “mesh gateway” traffic if you have both, to detect routing bypass or misconfiguration.
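
A simple comparison panel can plot the edge and mesh-gateway request rates for the same hostname. Both metric names here are assumptions; the point is that the two series should track each other closely if all traffic flows edge → gateway.

# Requests per second entering through the edge ingress for one host
sum(rate(ingress_requests_total{host="example.com"}[5m]))

# Requests per second arriving at the mesh ingress gateway for the same host
sum(rate(mesh_gateway_requests_total{host="example.com"}[5m]))

A persistent gap between the two usually means traffic is bypassing one layer or a route is misconfigured.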

Step-by-step: build a minimal, actionable dashboard set

Step 1: define the SLO-aligned views

Pick the user-facing entry points (hostnames or APIs) and define what “good” looks like: success rate and latency targets. Your dashboards should show these targets as reference lines or annotations. Even without formal SLO tooling, you can display error ratio and p95 latency for the critical routes.
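
One lightweight way to do this without SLO tooling is to precompute the two indicators as Prometheus recording rules and plot them against static target lines in the dashboard. The rule group below is a sketch with illustrative metric names.

groups:
  - name: slo-views
    rules:
      # 5xx ratio per critical host/route
      - record: route:error_ratio:5m
        expr: |
          sum by (host, route) (rate(ingress_requests_total{status=~"5.."}[5m]))
          /
          sum by (host, route) (rate(ingress_requests_total[5m]))
      # p95 latency per critical host/route
      - record: route:latency_p95_seconds:5m
        expr: |
          histogram_quantile(0.95,
            sum by (host, route, le) (rate(ingress_request_duration_seconds_bucket[5m])))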

Step 2: choose labels carefully to avoid cardinality explosions

Ingress and mesh metrics often include labels like path, user agent, or full URL. Avoid high-cardinality labels in dashboards and alerts. Prefer normalized route labels (for example, “/api/*” instead of “/api/orders/12345”). If your controller or mesh supports route naming, use it to keep metrics stable and cheap.
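
If the controller cannot emit a normalized route label itself, you can approximate one at scrape time with Prometheus metric relabeling. The snippet below is a sketch that assumes the controller exports a raw path label; it belongs inside the scrape_config for the controller, and the regexes must be adapted to your URL scheme.

metric_relabel_configs:
  # Collapse ID-bearing paths into a stable route label
  - source_labels: [path]
    regex: "/api/orders/.*"
    target_label: route
    replacement: "/api/orders/*"
  - source_labels: [path]
    regex: "/api/users/.*"
    target_label: route
    replacement: "/api/users/*"
  # Drop the raw high-cardinality label once the route label exists
  - action: labeldrop
    regex: "path"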

Step 3: create an “overview” dashboard first

Start with 8–12 panels that answer “is the edge healthy” and “is the mesh healthy.” Example panels: total RPS, 5xx rate, p95 latency, upstream timeouts, gateway CPU, gateway restarts, retry rate, mTLS errors. Only after this is useful should you add per-host and per-service drilldowns.

Step 4: add drilldowns that match your on-call workflow

For edge: drill down by hostname, then by route, then by upstream service. For mesh: drill down by destination service, then by source service, then by workload/pod. Include tables that list “top offenders” (top routes by 5xx, top services by p95 latency) because they reduce time-to-diagnosis.

Step 5: annotate deployments and config changes

Dashboards become far more useful when you can correlate spikes with changes. Add annotations for Ingress/controller config reloads, gateway deployments, and policy changes. If you have a CI/CD system, emit an annotation event when a new version is rolled out. This helps distinguish regressions from traffic anomalies.

Alerting signals: pick symptoms that page, and causes that ticket

Alert philosophy for edge and mesh

Alerts should be layered. Page on user-visible symptoms (high 5xx, high latency, sustained traffic drop) and create non-paging alerts for likely causes (increased retries, config reload failures, certificate nearing expiry). This reduces alert fatigue while still surfacing early warnings.

Symptom alerts (page-worthy)

  • High 5xx error ratio at the edge for a critical hostname/route over a sustained window (for example, 5–10 minutes).
  • High p95 or p99 latency at the edge for critical routes, sustained and above a user-impact threshold.
  • Significant traffic drop to near-zero for a hostname (can indicate DNS, certificate, gateway crash, or routing misconfig).
  • Mesh-wide success rate drop for a critical destination service (especially if multiple sources are affected).

Cause and early-warning alerts (non-paging or lower severity)

  • Ingress/controller config reload failures or unusually frequent reloads (can indicate flapping config or invalid snippets).
  • Upstream timeout rate increasing (often precedes 5xx spikes).
  • Retry rate increasing beyond a budget (indicates hidden errors and load amplification).
  • mTLS handshake failures or authorization denials increasing after a policy change.
  • Certificate expiration within a threshold (for example, 14 days warning, 3 days critical).
  • Gateway/controller pod crash loop, high restart rate, or CPU throttling.

Practical Prometheus-style alert examples (Ingress and mesh)

The exact metric names differ by controller and mesh, but the patterns are consistent: compute ratios, use percentiles, and require sustained duration. The examples below show the structure you can adapt to your environment.

Ingress: 5xx error ratio by hostname

# Error ratio = 5xx / total over 5m; alert if > 1% sustained for 10m (example thresholds)
(
  sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[5m]))
  /
  sum(rate(ingress_requests_total{host="example.com"}[5m]))
) > 0.01

Ingress: p95 latency for a critical route

# If a histogram is available: histogram_quantile over the rate of buckets
histogram_quantile(0.95,
  sum by (le) (rate(ingress_request_duration_seconds_bucket{host="example.com", route="/api/*"}[5m]))
) > 0.5

Ingress: upstream timeouts increasing

sum(rate(ingress_upstream_timeouts_total{host="example.com"}[5m])) > 1

Mesh: destination service success rate drop

# Success rate = 1 - (5xx / total) for the destination service over 5m
(
  1 - (
    sum(rate(mesh_requests_total{destination_service="orders", response_code=~"5.."}[5m]))
    /
    sum(rate(mesh_requests_total{destination_service="orders"}[5m]))
  )
) < 0.99

Mesh: retry budget exceeded

# Retries per original request over 10m (example threshold)
(
  sum(rate(mesh_retries_total{destination_service="orders"}[10m]))
  /
  sum(rate(mesh_requests_total{destination_service="orders"}[10m]))
) > 0.2

Mesh: mTLS handshake failures

sum(rate(mesh_mtls_handshake_failures_total[5m])) > 0

Step-by-step: tune alerts to reduce noise

Step 1: alert on ratios, not raw counts (most of the time)

Raw 5xx counts scale with traffic. A ratio (5xx/total) is more stable across peak and off-peak. Use raw counts for “traffic drop to zero” or “timeouts per second” when absolute volume matters.
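
The two patterns side by side, using the same illustrative metric names as the earlier examples:

# Ratio-based: 5xx error ratio, comparable across peak and off-peak traffic
sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[5m]))
/
sum(rate(ingress_requests_total{host="example.com"}[5m]))

# Count-based: traffic drop to near-zero, where absolute volume is the signal
sum(rate(ingress_requests_total{host="example.com"}[5m])) < 0.1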

Step 2: use multi-window, multi-burn thinking for latency and errors

Even without formal SLO tooling, you can approximate it by using two alert rules: a fast one (short window, higher threshold) and a slow one (long window, lower threshold). This catches both sudden outages and slow degradation while reducing paging on brief spikes.
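
A minimal sketch of the two-rule approach, reusing the illustrative edge metrics from earlier; thresholds and windows are examples to tune, not recommendations.

groups:
  - name: edge-error-burn
    rules:
      # Fast rule: short window, high threshold, catches sudden outages
      - alert: IngressErrorRatioFast
        expr: |
          sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[5m]))
          /
          sum(rate(ingress_requests_total{host="example.com"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
      # Slow rule: long window, low threshold, catches slow degradation
      - alert: IngressErrorRatioSlow
        expr: |
          sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[1h]))
          /
          sum(rate(ingress_requests_total{host="example.com"}[1h])) > 0.01
        for: 30m
        labels:
          severity: page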

Step 3: add “for” durations and minimum traffic guards

Require the condition to hold for a duration (for example, 10 minutes) to avoid flapping. Also add a minimum request rate guard so you do not alert on a single error during low traffic. For example, only evaluate error ratio if RPS > 1.
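
In PromQL the traffic guard is an "and" clause on the same ratio expression; the sustain requirement comes from the "for" duration on the alerting rule. Metric names remain illustrative.

# Error ratio above 1%, evaluated only while the host receives more than 1 request/s;
# pair this expression with "for: 10m" in the alerting rule to require a sustained breach
(
  sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[5m]))
  /
  sum(rate(ingress_requests_total{host="example.com"}[5m]))
) > 0.01
and
sum(rate(ingress_requests_total{host="example.com"}[5m])) > 1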

Step 4: route alerts to the right owner

Edge alerts (Ingress/gateway saturation, TLS issues, config reload failures) often belong to the platform team. Service-level mesh alerts (destination service latency, success rate) often belong to the application team. Use labels like severity, team, service, and environment to route notifications and to create focused on-call views.
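
In Prometheus alerting rules, this routing metadata is just labels on each rule, which Alertmanager can then match on. Two rule fragments as a compact sketch, with assumed metric names:

# Platform-owned, ticket-level edge alert
- alert: IngressConfigReloadFailing
  expr: increase(ingress_config_reload_failures_total[15m]) > 0
  for: 10m
  labels:
    severity: ticket
    team: platform
    environment: production

# Application-owned, paging service-level alert
- alert: OrdersSuccessRateLow
  expr: |
    (
      1 - (
        sum(rate(mesh_requests_total{destination_service="orders", response_code=~"5.."}[5m]))
        /
        sum(rate(mesh_requests_total{destination_service="orders"}[5m]))
      )
    ) < 0.99
  for: 10m
  labels:
    severity: page
    team: orders
    service: orders
    environment: production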

Step 5: link alerts to dashboards and runbooks

Every alert should include a link to the relevant dashboard and a short runbook snippet: what to check first, and what common causes look like. For example, an “Ingress 5xx high” alert should link to a dashboard filtered by host and show upstream timeouts, backend endpoint count, and recent config reload status.
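
Prometheus alert annotations are a convenient place for those links. The URLs below are placeholders for your own Grafana and runbook locations.

- alert: IngressHigh5xxRatio
  expr: |
    (
      sum(rate(ingress_requests_total{host="example.com", status=~"5.."}[5m]))
      /
      sum(rate(ingress_requests_total{host="example.com"}[5m]))
    ) > 0.01
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High 5xx ratio for example.com at the edge"
    dashboard: "https://grafana.example.com/d/ingress-overview?var-host=example.com"
    runbook: "https://runbooks.example.com/ingress-5xx"
    description: "Check upstream timeouts, backend endpoint count, and recent config reload status first."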

Dashboards for incident response: common investigative paths

When 5xx rises at the edge

Check whether the 5xx is generated by the edge or by upstream. If upstream timeouts/connect errors rise, inspect backend readiness, endpoint count, and recent deployments. If edge-generated errors rise (for example, request too large, rate limited), check policy changes and request size distributions. Correlate with controller reload failures and gateway pod saturation.

When latency rises but errors do not

Look for increased retries (masking errors), increased upstream response time, or saturation at the gateway (CPU throttling, connection limits). In the mesh, compare source-to-destination latency vs destination workload latency; if the destination is fine but end-to-end is slow, the issue may be at the gateway, DNS, or network.

When traffic drops suddenly

Validate whether the drop is at the edge (no requests arriving) or inside the cluster (requests arrive but do not reach services). Edge-level drops can be DNS, certificate, load balancer, or gateway crash. In-mesh drops can be routing policy, authorization, or service discovery issues. Dashboards should make this distinction obvious by showing edge RPS alongside gateway-to-service RPS.

Operational checklist: what “good” looks like in a steady state

  • Edge dashboard shows stable p95 latency and low 5xx ratio per critical hostname/route.
  • Upstream timeout/connect error panels are near zero during normal operation.
  • Gateway/controller pods have headroom (CPU not constantly throttled, memory stable, low restarts).
  • Mesh dashboards show stable success rate and latency per destination service, with low retry ratios.
  • mTLS and authorization error panels are quiet except during controlled policy rollouts.
  • Certificate expiry panels show comfortable renewal windows for all hostnames.
  • Alerts are few, actionable, and consistently link to the right drill-down dashboards.

Now answer the exercise about the content:

Which alerting approach best reduces noise while still catching user-impacting problems in Ingress and service mesh?


Use layered alerts: page on sustained user-impact symptoms (errors, latency, traffic drop) and send non-paging early warnings for causes (retries, reload failures, cert expiry). This reduces alert fatigue while keeping signals actionable.

Next chapter

Reliability Patterns: Timeouts, Retries, Circuit Breaking, and Backpressure
