What 502 and 504 Mean in Kubernetes Web Serving
Concept and symptoms: In cloud-native web serving, a 502 (Bad Gateway) or 504 (Gateway Timeout) typically means an intermediary component (Ingress controller, gateway, or proxy) could not successfully complete a request to an upstream backend. The key is that the client is not talking directly to your Pod; it is talking to a hop that is forwarding traffic. A 502 usually indicates the hop received an invalid response or could not establish/maintain a usable connection to the upstream. A 504 usually indicates the hop waited for a response but did not get one within its configured timeout. These errors often appear during deployments, scaling events, configuration changes, or partial outages, and they can be intermittent, affecting only some paths, hosts, or subsets of Pods.
Where the error is generated: The HTTP status code is emitted by the component that is acting as a gateway at that moment. In Kubernetes, that is commonly an Ingress controller (NGINX Ingress, HAProxy, Traefik), a cloud load balancer, an API gateway, or a sidecar proxy in front of the application. Knowing which hop generated the code is essential: the same 504 can mean “Ingress timed out waiting for Service endpoints” or “application timed out processing a request,” depending on where it was produced.
Routing mismatches as a root cause: Many 502/504 incidents are not “performance problems” but “routing problems.” A routing mismatch is any situation where the request is forwarded to the wrong destination or forwarded in a way the destination cannot handle. Examples include: wrong Service selector (no endpoints), wrong port mapping (Service port vs containerPort), wrong protocol (HTTP forwarded to HTTPS port), wrong Host header expectations, path rewrite mistakes, and stale endpoints during rollouts. These issues can look like timeouts because the proxy keeps trying to connect to a target that is not actually serving the expected traffic.
A Practical Triage Flow: Identify the Failing Hop
Step 1: Reproduce and capture request details: Start by capturing the exact URL, Host header, path, method, and whether TLS is involved. If possible, capture a request ID header (for example, x-request-id) from the response or add one at the client. Also note whether the failure is consistent or intermittent and whether it affects all clients or only some networks. These details will later help you correlate logs across components.
Step 2: Determine which component returned 502/504: Inspect response headers. Many gateways add identifying headers (for example, server: nginx or custom headers). If you have access logs, check which component logged the request with a 502/504. For NGINX Ingress, the controller logs will show upstream connection errors and timeouts. For a cloud load balancer, you may need its access logs. The goal is to answer: “Did the error come from the Ingress, from an internal proxy, or from the application itself?”
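As a sketch of this step, assuming an ingress-nginx install whose controller Deployment is named ingress-nginx-controller in the ingress-nginx namespace (adjust the names, host, and path to your environment):
# Capture response headers and any request ID the edge adds
curl -sv -o /dev/null https://api.example.com/v1/status 2>&1 | grep -iE 'HTTP/|server:|x-request-id'
# Search the Ingress controller logs for 502/504 on that path
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=15m | grep -E ' (502|504) ' | grep '/v1/status'
If the controller logs show the error, the gateway produced it; if they do not, look one hop further out (for example, the cloud load balancer's access logs).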
Step 3: Split the path into hops and test each hop: Test connectivity progressively: (a) client to Ingress VIP, (b) Ingress to Service ClusterIP, (c) Service to Pod IP, (d) inside the Pod to the application process. You can do this with ephemeral debug Pods and simple tools like curl, wget, nc, and dig. The first hop where the request fails is usually where you should focus.
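A minimal hop-by-hop sketch using an ephemeral debug Pod; everything in angle brackets is a placeholder, and the final exec step assumes the application image ships wget or a similar client:
# (a) client to Ingress VIP, with the Host header the rule expects
curl -v -H 'Host: api.example.com' http://<ingress-ip>/v1/status
# (b) and (c) Ingress-to-Service and Service-to-Pod, tested from inside the cluster
kubectl run -n <ns> hopcheck --rm -it --restart=Never --image=curlimages/curl -- sh
# inside that shell:
curl -v http://<svc>.<ns>.svc.cluster.local:<port>/v1/status   # Service ClusterIP hop
curl -v http://<podIP>:<targetPort>/v1/status                  # direct Pod hop
# (d) inside the Pod, confirm the process answers locally
kubectl exec -n <ns> <pod> -- wget -qO- http://127.0.0.1:<targetPort>/v1/status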
Common Root Cause 1: Service Has No Endpoints (Selector/Labels Mismatch)
Concept: A Kubernetes Service routes to Pods via label selectors. If the selector does not match any Pods, the Service has zero endpoints. Many Ingress controllers will return 502/503 in this case; some will return 504 if they keep trying to connect and time out. This often happens after a deployment where labels changed, or when a Service selector was edited incorrectly.
Step-by-step diagnosis:
- Check Service selector: kubectl get svc -n <ns> <svc> -o yaml and inspect spec.selector.
- Check matching Pods: kubectl get pods -n <ns> -l key=value --show-labels. If no Pods match, you found the issue.
- Check endpoints: kubectl get endpoints -n <ns> <svc> -o wide (or kubectl get endpointslices -n <ns>). If empty, the Service has nothing to route to.
- Check readiness: Even if Pods exist, they may not be Ready, which also results in no usable endpoints. Use kubectl get pods -n <ns> and kubectl describe pod to see readiness probe failures.
Practical fix: Align labels and selectors. If a Deployment uses app: web but the Service selects app: website, update one side. If readiness probes are failing, fix the probe path/port or the application startup behavior so Pods become Ready and endpoints appear.
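For example, a hedged sketch of checking and realigning the selector, assuming the Pods carry the correct label app: web and a Service named web-svc is the side that drifted:
# Compare the selector the Service uses with the labels the Pods actually carry
kubectl get svc -n <ns> web-svc -o jsonpath='{.spec.selector}'; echo
kubectl get pods -n <ns> --show-labels | grep web
# One possible fix: point the Service selector back at the real Pod label
kubectl patch svc -n <ns> web-svc -p '{"spec":{"selector":{"app":"web"}}}'
# Endpoints should appear once Ready Pods match the selector again
kubectl get endpoints -n <ns> web-svc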
Common Root Cause 2: Port and Protocol Mismatches (Service Port, targetPort, containerPort)
Concept: The Service port is what clients connect to; targetPort is what the Service forwards to on the Pod. If targetPort is wrong, the proxy will connect to a port where nothing is listening, causing connection refused (often 502) or timeouts (often 504). Protocol mismatches are equally common: forwarding HTTP to a backend that expects HTTPS, or forwarding HTTP/2 or gRPC to a plain HTTP server without proper configuration.
Step-by-step diagnosis:
- Inspect Service ports: kubectl get svc -n <ns> <svc> -o yaml and confirm spec.ports[].port and spec.ports[].targetPort.
- Inspect Pod container ports: kubectl get deploy -n <ns> <deploy> -o yaml and confirm the container listens on the expected port. Note: containerPort is documentation for many runtimes; the real check is whether the process listens on that port.
- Test from inside the cluster: Run a debug Pod and curl the Service and Pod directly: kubectl run -n <ns> tmp --rm -it --image=curlimages/curl -- sh, then curl -v http://<svc>:<port>/health. Also try the Pod IP: curl -v http://<podIP>:<targetPort>/health.
- Check for TLS expectations: If the backend expects HTTPS, curling with HTTP will show connection resets or unreadable responses. Try curl -vk https://<podIP>:<port> to confirm.
Practical fix: Correct targetPort to match the actual listening port. If TLS is required on the backend, ensure the gateway is configured to use HTTPS upstream (or terminate TLS at the right layer). For gRPC, ensure the Ingress/gateway supports HTTP/2 to the upstream and that the backend protocol is configured consistently.
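A sketch of verifying the real listening port and correcting targetPort, assuming the container image ships netstat or ss and using 8080 as a stand-in for the actual port:
# Confirm what the process actually listens on (tooling inside the image is an assumption)
kubectl exec -n <ns> <pod> -- netstat -tlnp   # or: ss -tlnp
# If the Service targetPort is wrong, point it at the real port
kubectl patch svc -n <ns> <svc> --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":8080}]'
# Re-test the Pod directly on the corrected port
curl -v http://<podIP>:8080/health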
Common Root Cause 3: Path, Host, and Rewrite Routing Mismatches
Concept: Ingress rules route based on hostnames and paths. A mismatch can send traffic to the wrong Service or to a Service that does not serve that path. Rewrites can also break applications: for example, rewriting /api to / might be required for one backend but harmful for another. These issues often manifest as 502/504 when the wrong upstream is selected and cannot respond properly, or when the upstream responds slowly due to unexpected routes (for example, a heavy page instead of a lightweight health endpoint).
Step-by-step diagnosis:
- List Ingress rules: kubectl get ingress -A and identify which Ingress should match the failing host/path.
- Describe the Ingress: kubectl describe ingress -n <ns> <ing> to see rules, backends, and events (including annotation parsing errors).
- Validate host and path matching: Confirm the request Host header matches the Ingress spec.rules[].host. Confirm the path type (Prefix vs Exact) matches your intent. A Prefix rule for / can shadow a more specific path if ordering or controller behavior is misunderstood.
- Check rewrite annotations: If using NGINX Ingress, verify annotations like nginx.ingress.kubernetes.io/rewrite-target. Ensure the rewritten path is what the backend expects.
- Test with explicit Host header: From a debug Pod or your workstation, run curl -v -H 'Host: example.com' http://<ingress-ip>/some/path to ensure you are hitting the intended rule.
Practical fix: Adjust host/path rules to be unambiguous. Avoid overly broad catch-all rules unless you intend them. If you need rewrites, document them and add backend tests that verify the effective path. For multi-service Ingresses, consider splitting into separate Ingress resources per hostname to reduce accidental overlaps.
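One way to express this is a single-host Ingress with an explicit pathType; the sketch below uses placeholder names (api, api-svc, namespace <ns>) and assumes an nginx ingress class:
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: <ns>
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com          # one hostname per Ingress to avoid accidental overlaps
    http:
      paths:
      - path: /api
        pathType: Prefix            # explicit pathType instead of relying on defaults
        backend:
          service:
            name: api-svc           # must live in the same namespace as this Ingress
            port:
              number: 80
EOF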
Common Root Cause 4: Readiness, Liveness, and Startup Timing Issues During Rollouts
Concept: During deployments, Pods may exist but not be ready to serve. If readiness probes are too permissive, traffic can be sent to Pods that are still warming up, causing connection resets or timeouts. If readiness probes are too strict or misconfigured, Pods never become Ready, leaving no endpoints. Both patterns can produce 502/504 at the gateway.
Step-by-step diagnosis:
- Check rollout status: kubectl rollout status deploy/<deploy> -n <ns> and look for stalled rollouts.
- Inspect probe configuration: kubectl get deploy -n <ns> <deploy> -o yaml and review readiness/liveness/startup probes (path, port, initialDelaySeconds, periodSeconds, failureThreshold).
- Describe failing Pods: kubectl describe pod -n <ns> <pod> to see probe failures and events.
- Correlate with gateway errors: If 502/504 spikes align with rollout times, suspect readiness timing or termination behavior.
Practical fix: Ensure readiness reflects “can serve real traffic,” not merely “process is running.” If your app needs warm caches or migrations, use a startup probe to avoid premature restarts, and keep readiness false until the app can respond quickly. Also ensure graceful termination so Pods stop receiving traffic before they exit (for example, by allowing time for endpoint removal and in-flight requests to finish).
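A hedged sketch of these adjustments as a strategic merge patch; the container name web, the probe paths, and the timings are placeholders to tune per application:
kubectl patch deploy -n <ns> <deploy> --type=strategic -p '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: web
        startupProbe:                        # gives slow starts room without tripping liveness
          httpGet: {path: /healthz, port: 8080}
          failureThreshold: 30
          periodSeconds: 5
        readinessProbe:                      # "can serve real traffic", not just "process is up"
          httpGet: {path: /ready, port: 8080}
          periodSeconds: 5
        lifecycle:
          preStop:                           # give endpoint removal time before the process exits
            exec: {command: ["sleep", "10"]}
'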
Common Root Cause 5: DNS and Service Discovery Problems Inside the Cluster
Concept: Gateways and Pods often reach backends via DNS names like service.namespace.svc.cluster.local. If CoreDNS is unhealthy, overloaded, or misconfigured, upstream resolution can fail. Some proxies will treat DNS failures as upstream connection errors (502) or will retry until a timeout (504). DNS issues can be intermittent and correlate with cluster load.
Step-by-step diagnosis:
- Test DNS from a Pod: kubectl run -n <ns> dns --rm -it --image=busybox -- sh, then nslookup <svc>.<ns>.svc.cluster.local.
- Check CoreDNS health: kubectl get pods -n kube-system -l k8s-app=kube-dns and kubectl logs -n kube-system deploy/coredns for errors/timeouts.
- Check for NXDOMAIN vs timeout: NXDOMAIN suggests a naming/namespace mistake; timeouts suggest DNS performance or network policy issues.
Practical fix: Correct service names and namespaces in upstream configuration. If CoreDNS is resource constrained, scale it and review caching and upstream resolvers. If network policies exist, ensure DNS traffic to CoreDNS is allowed from namespaces that need it.
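If you already restrict egress with NetworkPolicies, the sketch below shows one way to keep DNS reachable; the namespace label and the k8s-app: kube-dns selector match common CoreDNS deployments but should be verified in your cluster:
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: <ns>
spec:
  podSelector: {}                  # every Pod in the namespace
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns        # CoreDNS Pods carry this label in most distributions
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF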
Common Root Cause 6: NetworkPolicy, Firewall Rules, and Node-Level Connectivity
Concept: Even when Services and endpoints are correct, traffic can be blocked by Kubernetes NetworkPolicies, cloud security groups, or node firewall rules. The gateway might be unable to reach Pod IPs, resulting in connection timeouts (504) or immediate failures (502) depending on how the block manifests. This is especially common when an Ingress controller runs in a dedicated namespace and NetworkPolicies restrict cross-namespace traffic.
Step-by-step diagnosis:
- List NetworkPolicies: kubectl get netpol -A and identify policies affecting the gateway namespace and the backend namespace.
- Confirm allowed ports: Ensure policies allow ingress to backend Pods from the gateway’s Pod labels and namespace, on the correct port.
- Test TCP connectivity: From the gateway Pod (or a debug Pod in the same namespace), run nc -vz <podIP> <port> or curl -v to the Pod IP.
- Check node placement: If failures only occur for some Pods, compare which nodes they run on: kubectl get pods -n <ns> -o wide. Node-specific firewall or CNI issues can create partial reachability.
Practical fix: Update NetworkPolicies to explicitly allow the gateway to reach backend Pods. If node-specific, investigate CNI logs and node firewall configuration; cordon/drain problematic nodes to confirm the hypothesis.
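A sketch of such an allow rule, assuming the gateway is ingress-nginx in its default namespace and the backend Pods are labeled app: web and listen on port 8080 (all of these are placeholders to adapt):
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: <ns>                     # namespace of the backend Pods
spec:
  podSelector:
    matchLabels:
      app: web                        # labels of the backend Pods
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx   # gateway namespace (assumption)
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx        # gateway Pod labels (assumption)
    ports:
    - protocol: TCP
      port: 8080                      # the containerPort the Service targets
EOF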
Common Root Cause 7: Upstream Connection Handling and Keepalive Issues
Concept: Proxies maintain connection pools and reuse upstream connections. If an upstream closes idle connections unexpectedly, or if there is a mismatch in keepalive settings, the proxy may attempt to reuse a stale connection and receive a reset, often surfacing as intermittent 502s. This can be more visible under low traffic (connections go idle) or after backend restarts.
Step-by-step diagnosis:
- Look for reset patterns in logs: Ingress logs may show messages like “upstream prematurely closed connection” or “connection reset by peer.”
- Compare with backend restarts: Check kubectl get events and Pod restarts (kubectl get pods) to see if resets align with restarts or scaling.
- Test with and without keepalive: If your gateway allows toggling upstream keepalive, temporarily adjust to see if errors disappear (do this in a controlled environment).
Practical fix: Align idle timeouts between gateway and backend (application server, load balancer, and any intermediate proxies). Ensure the backend gracefully handles connection reuse and does not close connections too aggressively. If using HTTP/2 upstream, confirm the backend’s HTTP/2 settings and max concurrent streams are appropriate.
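If you run ingress-nginx, its controller ConfigMap exposes upstream keepalive settings; the names and values below are an example to verify against your controller version, not a drop-in fix:
# Inspect current upstream keepalive settings (ConfigMap name follows the default chart)
kubectl get configmap -n ingress-nginx ingress-nginx-controller -o yaml | grep -i keepalive
# Example adjustment: keep the proxy's idle timeout below the backend's so the proxy closes first
kubectl patch configmap -n ingress-nginx ingress-nginx-controller \
  -p '{"data":{"upstream-keepalive-timeout":"30"}}'
# Watch controller logs for residual reset messages
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=10m | grep -i 'prematurely closed'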
Log and Metric Correlation: Turning Symptoms into Evidence
Ingress/gateway access logs: Access logs tell you which requests failed, which upstream was selected, and how long each phase took (request time, upstream connect time, upstream response time). Enable or increase log detail temporarily when troubleshooting. For NGINX Ingress, a custom log format that includes upstream status and upstream response time is invaluable for distinguishing “could not connect” from “connected but slow.”
Application logs: If the gateway shows upstream timeouts but the application logs show no corresponding requests, the request likely never reached the app (routing, network, endpoints, or protocol mismatch). If the application logs show the request but it ends late or errors, the issue is inside the app or its dependencies. Use request IDs to correlate; if you do not have them, add them at the edge and propagate them to the app logs.
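A hedged correlation sketch, again assuming ingress-nginx log names and using a made-up request ID (a1b2c3d4) as the value to search for:
# Pull one failing request from the gateway logs (field layout depends on your log format)
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=15m | grep ' 504 ' | head -n 1
# Search the application logs for the same request ID
kubectl logs -n <ns> deploy/<deploy> --since=15m | grep 'a1b2c3d4'
# No hit in the app logs usually means the request never reached the Pod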
Kubernetes events: Events often reveal configuration and lifecycle problems: failed mounts, image pull errors, readiness probe failures, endpoint updates, and Ingress controller reload issues. Use kubectl get events -n <ns> --sort-by=.metadata.creationTimestamp during the incident window.
EndpointSlice churn: During rollouts, EndpointSlices can change rapidly. If your gateway reloads configuration frequently or has delays applying endpoint updates, you can see transient 502/504. Inspect EndpointSlices and compare timestamps with error spikes.
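For example, to watch the slices for one Service while a rollout happens (the kubernetes.io/service-name label is set by Kubernetes on EndpointSlices):
# Snapshot the slices backing the Service
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc> -o wide
# Watch churn live during the rollout and compare with the 502/504 spike in gateway logs
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc> -w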
Hands-On Checklist: A Repeatable 15-Minute Investigation
Step 1: Confirm the failing route: Identify the exact host/path and which Ingress resource should match. Use kubectl describe ingress and a curl with explicit Host header to confirm rule selection.
Step 2: Confirm the Service has ready endpoints: Check kubectl get endpoints or EndpointSlices. If empty, inspect selectors and readiness.
Step 3: Confirm port/protocol alignment: Validate Service targetPort and test Pod IP connectivity on that port. Verify whether the backend expects HTTP vs HTTPS.
Step 4: Check policy and connectivity: If endpoints exist and ports are right, test connectivity from the gateway namespace to Pod IPs. Review NetworkPolicies and node placement.
Step 5: Correlate logs: Use gateway logs to determine connect vs response timeout. Use app logs to confirm whether requests arrive. Use events to identify rollout/probe issues.
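The five steps can be kept as a small script; every angle-bracket value is a placeholder to fill in first, and the ingress-nginx names are assumptions:
#!/bin/sh
# Quick 502/504 triage sketch
NS=<ns>; SVC=<svc>; ING=<ing>; HOST=api.example.com; INGRESS_IP=<ingress-ip>
kubectl describe ingress -n "$NS" "$ING"                                   # 1. which rule should match
curl -sv -o /dev/null -H "Host: $HOST" "http://$INGRESS_IP/"               #    confirm rule selection
kubectl get endpointslices -n "$NS" -l kubernetes.io/service-name="$SVC"   # 2. ready endpoints?
kubectl get svc -n "$NS" "$SVC" -o yaml | grep -A3 'ports:'                # 3. port/targetPort alignment
kubectl get netpol -n "$NS"                                                # 4. policies in the backend namespace
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=15m | grep -E ' (502|504) '   # 5. gateway evidence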
Example Commands and Outputs to Recognize Patterns
Service with no endpoints: You might see an empty subsets list.
kubectl get endpoints -n shop web-svc -o yaml
subsets: []
Port mismatch symptoms: Curling the Pod IP shows connection refused.
curl -v http://10.244.3.21:8080/health
connect to 10.244.3.21 port 8080 failed: Connection refused
Host mismatch symptoms: Curling the Ingress IP without the correct Host header hits a default backend or wrong rule.
curl -v http://<ingress-ip>/
Correct test with Host header:
curl -v -H 'Host: api.example.com' http://<ingress-ip>/v1/status
Routing Mismatch Scenarios You Can Spot Quickly
Wrong namespace reference: Ingress points to a Service name that exists in a different namespace. Kubernetes Ingress backends are namespace-scoped; a typo or copy-paste can silently route to a non-existent Service, leading to controller errors and 502/504 at runtime. Verify the Service exists in the same namespace as the Ingress.
Shadowed paths: A catch-all path like / routes to a default Service, while a more specific path like /api was intended for another Service. Depending on controller behavior and path types, requests may go to the wrong backend. Ensure your path rules are specific and use the correct pathType.
Rewrite breaks static assets or API base paths: A rewrite that strips a prefix may cause the backend to generate redirects to unexpected locations or to 404 internally, which can cascade into gateway errors if health checks or upstream expectations rely on certain paths. Validate effective paths by logging the received path at the application.
Protocol confusion on shared ports: Some stacks expose both HTTP and gRPC or HTTP and WebSocket behavior. If the gateway routes a WebSocket upgrade request to a backend that does not support it (or strips upgrade headers), clients may see 502. Confirm that the route supports required headers and protocols end-to-end.
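A quick end-to-end check of a WebSocket route, with placeholder host and path and the RFC 6455 sample key:
curl -v --max-time 5 -H 'Host: api.example.com' \
  -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
  http://<ingress-ip>/ws
# A healthy route answers HTTP/1.1 101 Switching Protocols; a 502 here points at the gateway or the backend's protocol handling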