Why Reliability Patterns Matter in Cloud-Native Traffic
In Kubernetes-based web serving, failures are normal: pods restart, nodes drain, DNS hiccups happen, and upstream dependencies slow down under load. Reliability patterns are small, repeatable controls you apply to request flows so that a single slow or failing dependency does not cascade into a full outage. This chapter focuses on four core patterns—timeouts, retries, circuit breaking, and backpressure—and how they are typically implemented at the edge (Ingress/API gateway), at the service-to-service layer (service mesh sidecars), and inside the application.
A key idea is that these patterns must be coordinated. A retry without a timeout can hang longer than the original request. A timeout without backpressure can still overload a dependency. Circuit breaking without good retry policy can cause unnecessary errors. The goal is to shape traffic so that the system fails fast, recovers quickly, and stays within safe capacity.
Timeouts: Bound the Cost of Waiting
What a Timeout Does
A timeout sets an upper limit on how long a client will wait for a response. When the limit is reached, the client stops waiting and treats the request as failed. Timeouts protect your resources: each in-flight request consumes memory, threads, file descriptors, and connection slots. Without timeouts, a slow dependency can cause request queues to grow until the service becomes unresponsive.
In distributed systems, you usually need multiple timeouts: connection timeout (establishing TCP/TLS), request/response timeout (waiting for bytes), and idle timeout (no data flowing). You also need to consider end-to-end deadlines: if a user-facing request has a 2-second budget, internal calls must have smaller budgets so the overall request can still complete or fail gracefully.
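As a minimal sketch, the Go example below wires these timeout types into a single HTTP client: a connection timeout on the dialer, a TLS and response-header timeout on the transport, an idle timeout for pooled connections, and an overall cap on the client. The specific values and the in-cluster catalog URL are illustrative assumptions, not recommendations.

package main

import (
    "fmt"
    "net"
    "net/http"
    "time"
)

// newClient builds an HTTP client with a timeout at each stage of a request.
// The values are placeholders; derive real ones from your endpoint budgets.
func newClient() *http.Client {
    transport := &http.Transport{
        DialContext: (&net.Dialer{
            Timeout: 2 * time.Second, // connection timeout: TCP connect
        }).DialContext,
        TLSHandshakeTimeout:   2 * time.Second,  // TLS setup
        ResponseHeaderTimeout: 3 * time.Second,  // waiting for the first response bytes
        IdleConnTimeout:       90 * time.Second, // idle timeout for pooled connections
    }
    return &http.Client{
        Transport: transport,
        Timeout:   5 * time.Second, // end-to-end cap: connect + write + read
    }
}

func main() {
    // "http://catalog/api/items" is a hypothetical in-cluster URL.
    resp, err := newClient().Get("http://catalog/api/items")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}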
Practical Steps: Set Timeouts at Each Layer
Step 1: Define an end-to-end budget per endpoint. Start from user expectations and SLOs. For example, a product page might have a 1500 ms budget, while a checkout operation might allow 3000 ms but must be more consistent.
Step 2: Allocate sub-budgets to dependencies. If the product page calls catalog and reviews, you might allocate 400 ms to catalog, 300 ms to reviews, leaving time for rendering and network overhead. Keep some slack for retries if you plan to use them.
Step 3: Configure timeouts at the edge (Ingress). This prevents slow clients or slow upstreams from tying up proxy resources. In NGINX Ingress, you typically tune proxy read/send timeouts and upstream connect timeouts. Example annotations (names can vary by controller version):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "2"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "5"
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 80

Step 4: Configure timeouts for service-to-service calls. In a service mesh, you can set per-route timeouts so that the sidecar enforces them consistently across languages. For example, with Istio you can apply a VirtualService timeout:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: catalog
spec:
  hosts:
  - catalog
  http:
  - match:
    - uri:
        prefix: /api/
    timeout: 0.4s
    route:
    - destination:
        host: catalog

Step 5: Enforce timeouts in the application client as well. This is your last line of defense and helps when traffic bypasses the mesh (batch jobs, migrations, admin tools). Ensure your HTTP client has connect and read timeouts, and that you propagate deadlines (for example, via request context) to downstream calls.
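A minimal sketch of deadline propagation in a Go handler, assuming a 400 ms sub-budget and a hypothetical catalog URL: the incoming request's context carries the overall deadline, and the downstream call is bound by a derived context so it can never outlive the caller.

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

var client = &http.Client{Timeout: 2 * time.Second} // last-resort cap

func productHandler(w http.ResponseWriter, r *http.Request) {
    // Derive a sub-budget from the incoming request's context so the
    // downstream call inherits and never exceeds the caller's deadline.
    ctx, cancel := context.WithTimeout(r.Context(), 400*time.Millisecond)
    defer cancel()

    // "http://catalog/api/items" is a hypothetical in-cluster URL.
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://catalog/api/items", nil)
    if err != nil {
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }
    resp, err := client.Do(req)
    if err != nil {
        // Timed out or failed: fail fast with a clear error.
        http.Error(w, "catalog unavailable", http.StatusServiceUnavailable)
        return
    }
    defer resp.Body.Close()
    w.Write([]byte("rendered product page\n"))
}

func main() {
    http.HandleFunc("/product", productHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}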
Common Timeout Pitfalls
Timeouts that are too long are almost as dangerous as having none, because they allow queues to build. Timeouts that are too short can create self-inflicted failures during normal latency spikes. Another common pitfall is mismatch: if the Ingress times out at 5 seconds but the service mesh times out at 400 ms, you may see confusing errors at the edge. Align timeouts so that inner layers time out first, and outer layers have slightly larger limits to allow clean error handling and logging.
Retries: Recover from Transient Failures Without Amplifying Load
What a Retry Does (and When It Helps)
A retry re-attempts a failed request, often after a short delay. Retries are useful for transient failures such as brief network drops, connection resets, or a pod that is temporarily not ready. They are risky when the failure is due to overload or a persistent bug, because retries increase traffic and can turn a partial outage into a full one.
Retries must be selective. You typically retry on network errors and certain response codes (like 503) but avoid retrying on application-level errors (like 400). You also need to consider idempotency: repeating a request must not cause unintended side effects. GET is usually safe; POST may not be unless you have idempotency keys or server-side deduplication.
Practical Steps: Design a Safe Retry Policy
Step 1: Classify operations by idempotency. For non-idempotent operations (payments, order creation), implement idempotency keys so that a client can safely retry without double charging or duplicate orders. A common approach is to require a unique header like Idempotency-Key and store the result keyed by that value for a limited time.
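The sketch below shows one way the server side of this could look in Go, with a hypothetical in-memory store keyed by the Idempotency-Key header. A real implementation would persist results with a TTL in a shared store and handle concurrent requests carrying the same key.

package main

import (
    "io"
    "log"
    "net/http"
    "sync"
)

// store remembers the response for each idempotency key. In-memory only;
// a real system needs a shared store with expiry and must handle two
// concurrent requests arriving with the same key.
var store sync.Map

func createOrder(w http.ResponseWriter, r *http.Request) {
    key := r.Header.Get("Idempotency-Key")
    if key == "" {
        http.Error(w, "Idempotency-Key header required", http.StatusBadRequest)
        return
    }
    // If this key was already processed, replay the stored result
    // instead of creating a duplicate order.
    if prev, ok := store.Load(key); ok {
        w.Write([]byte(prev.(string)))
        return
    }
    _, _ = io.ReadAll(r.Body) // read the order payload (processing omitted in this sketch)

    result := `{"order_id":"123","status":"created"}` // placeholder result
    store.Store(key, result)
    w.Write([]byte(result))
}

func main() {
    http.HandleFunc("/api/orders", createOrder)
    log.Fatal(http.ListenAndServe(":8080", nil))
}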
Step 2: Retry only within a time budget. Retries must fit inside the timeout budget you defined earlier. If your per-call timeout is 400 ms and you allow 2 attempts, you might set 150–180 ms per attempt plus backoff.
Step 3: Use bounded retries with exponential backoff and jitter. Exponential backoff reduces pressure on a struggling dependency; jitter prevents synchronized retry storms. A typical pattern is: base delay 25–50 ms, multiply by 2 each attempt, cap at 200–500 ms, add random jitter.
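A sketch of that pattern in Go, using the base and cap values from this step; the 5xx check stands in for whatever retry classification you choose, and the URL and attempt count are illustrative.

package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

// doWithRetries performs up to maxAttempts attempts with exponential
// backoff and full jitter between them.
func doWithRetries(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
    base := 50 * time.Millisecond
    maxBackoff := 400 * time.Millisecond

    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if attempt > 0 {
            // Exponential backoff: base * 2^(attempt-1), capped, with full jitter.
            backoff := base << (attempt - 1)
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
            time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
        }
        resp, err := client.Get(url)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil // success or a non-retryable application error
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("server error: %s", resp.Status)
        }
    }
    return nil, fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
    client := &http.Client{Timeout: 180 * time.Millisecond} // per-attempt timeout
    // "http://reviews/api/reviews" is a hypothetical in-cluster URL.
    resp, err := doWithRetries(client, "http://reviews/api/reviews", 2)
    if err != nil {
        fmt.Println("giving up:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}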
Step 4: Configure retries at the mesh or proxy layer for consistency. Example Istio VirtualService retry configuration:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - retries:
      attempts: 2
      perTryTimeout: 0.2s
      retryOn: gateway-error,connect-failure,refused-stream,reset
    timeout: 0.5s
    route:
    - destination:
        host: reviews

Step 5: Avoid retrying at multiple layers for the same request. If the application retries and the mesh retries and the gateway retries, you can multiply traffic unexpectedly. Decide where retries live. A common approach is: do limited retries in the mesh for safe, idempotent calls; keep application retries for special cases where the app can make smarter decisions.
Retry Pitfalls and How to Detect Them
The most common failure mode is retry amplification: a dependency slows down, timeouts occur, clients retry, and the dependency gets even more load. Detect this by monitoring request rate, error rate, and latency together. If error rate rises and request rate rises at the same time without an external traffic increase, retries may be amplifying. Another pitfall is retrying large payloads, which can saturate bandwidth; consider disabling retries for large uploads or streaming endpoints.
Circuit Breaking: Stop Sending Traffic to a Failing Dependency
What Circuit Breaking Does
A circuit breaker prevents a client from repeatedly calling an unhealthy dependency. When failures exceed a threshold, the circuit “opens” and requests fail fast (or are routed elsewhere) for a cooling period. After that, the circuit enters a “half-open” state where a limited number of test requests are allowed. If those succeed, the circuit closes and normal traffic resumes.
Circuit breaking is about protecting both sides: it protects the caller from wasting resources on doomed calls, and it protects the dependency from being hammered while it is recovering. In Kubernetes, this is especially important during rolling updates, partial outages, or when a downstream service is overloaded.
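To make the state machine concrete, here is a deliberately simplified, hand-rolled breaker in Go. In practice you would rely on mesh-level outlier detection (next section) or a maintained library rather than this sketch.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker is a minimal circuit breaker: it opens after maxFailures
// consecutive failures, fails fast while open, lets calls through again
// once the cooling period has passed (half-open), and closes on the next
// success. A production breaker would also limit how many trial calls
// are allowed through in the half-open state.
type Breaker struct {
    mu          sync.Mutex
    failures    int
    maxFailures int
    openUntil   time.Time
    coolDown    time.Duration
}

func NewBreaker(maxFailures int, coolDown time.Duration) *Breaker {
    return &Breaker{maxFailures: maxFailures, coolDown: coolDown}
}

// Call runs fn unless the circuit is currently open.
func (b *Breaker) Call(fn func() error) error {
    b.mu.Lock()
    if time.Now().Before(b.openUntil) {
        b.mu.Unlock()
        return ErrOpen
    }
    b.mu.Unlock()

    err := fn()

    b.mu.Lock()
    defer b.mu.Unlock()
    if err != nil {
        b.failures++
        if b.failures >= b.maxFailures {
            b.openUntil = time.Now().Add(b.coolDown) // open the circuit
        }
        return err
    }
    b.failures = 0 // a success closes the circuit again
    return nil
}

func main() {
    b := NewBreaker(3, 30*time.Second)
    failing := func() error { return errors.New("dependency unavailable") }
    for i := 0; i < 5; i++ {
        fmt.Println("attempt", i, "->", b.Call(failing))
    }
    // After 3 consecutive failures the remaining calls fail fast with ErrOpen.
}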
Practical Steps: Apply Circuit Breaking in a Service Mesh
Step 1: Decide what you want to limit. Common limits include maximum concurrent connections, maximum pending requests, and maximum requests per connection. These limits prevent queue buildup and reduce tail latency.
Step 2: Configure outlier detection (eject unhealthy endpoints). Outlier detection removes individual pods from load balancing when they return too many errors. In Istio, this is done via a DestinationRule. Example:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: catalog
spec:
  host: catalog
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 50
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Step 3: Combine circuit breaking with timeouts and limited retries. Circuit breaking works best when failures are detected quickly. If timeouts are too long, the breaker reacts slowly. If retries are too aggressive, you may trip the breaker unnecessarily or create noise.
Step 4: Decide on fallback behavior. When the circuit is open, you can fail fast with a clear error, return cached data, return a degraded response (for example, omit recommendations), or route to a secondary region if available. The mesh can help with routing, but the application often needs to implement the degraded response logic.
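A sketch of the degraded-response logic in Go, assuming a hypothetical recommendations service: any failure on the call, including a fail-fast error from an open breaker, results in the page rendering without recommendations instead of failing outright.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

var recClient = &http.Client{Timeout: 300 * time.Millisecond}

// fetchRecommendations returns product recommendations, or nil as a
// clearly-degraded result if the dependency cannot answer in time.
func fetchRecommendations(productID string) []string {
    // "http://recommendations/api/for/..." is a hypothetical in-cluster URL.
    resp, err := recClient.Get("http://recommendations/api/for/" + productID)
    if err != nil {
        return nil // degrade: omit recommendations rather than fail the page
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil
    }
    var recs []string
    if err := json.NewDecoder(resp.Body).Decode(&recs); err != nil {
        return nil
    }
    return recs
}

func main() {
    recs := fetchRecommendations("42")
    if len(recs) == 0 {
        fmt.Println("rendering page without recommendations (degraded)")
        return
    }
    fmt.Println("recommendations:", recs)
}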
Operational Considerations
Thresholds should reflect normal behavior. If a service occasionally returns 5xx during deployments, a very low consecutive error threshold may cause frequent ejections. Start with conservative settings, observe, and tune. Also ensure you can distinguish between “endpoint unhealthy” and “service overloaded.” Outlier detection ejects endpoints; it does not fix overload if all endpoints are slow. That is where backpressure and capacity controls become essential.
Backpressure: Keep the System Within Safe Capacity
What Backpressure Means in Request Flows
Backpressure is the mechanism by which a system signals “slow down” to prevent overload. Without backpressure, a fast producer can overwhelm a slower consumer, causing queues to grow, memory to spike, and latency to explode. In HTTP systems, backpressure often appears as limiting concurrency, rejecting excess requests quickly, or shaping traffic so that downstream services are not saturated.
Backpressure is not only about rejecting traffic; it is about controlling work-in-progress. A service that accepts unlimited concurrent requests can become slower for everyone. A service that caps concurrency may reject some requests but stays responsive for the rest, which is often the better outcome.
Practical Steps: Implement Backpressure at Multiple Points
Step 1: Set concurrency limits in the application. If your service uses a thread pool or async worker pool, cap it so that you do not exceed CPU and memory. For example, limit the number of concurrent database queries or external HTTP calls. When the limit is reached, fail fast with a 503 or a domain-specific “try again” response.
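One way to implement this cap in Go is a buffered channel used as a semaphore, as in the sketch below; the limit of 50 is illustrative. Rejecting in the default branch, rather than blocking on the semaphore, is what turns the cap into backpressure: callers get a fast, explicit signal instead of a slow timeout.

package main

import (
    "log"
    "net/http"
    "time"
)

// limitConcurrency wraps a handler so that at most maxInFlight requests
// are processed at once; the rest are rejected immediately with 503.
func limitConcurrency(maxInFlight int, next http.Handler) http.Handler {
    sem := make(chan struct{}, maxInFlight)
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        select {
        case sem <- struct{}{}:
            defer func() { <-sem }()
            next.ServeHTTP(w, r)
        default:
            w.Header().Set("Retry-After", "1")
            http.Error(w, "server busy, try again", http.StatusServiceUnavailable)
        }
    })
}

func main() {
    slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(100 * time.Millisecond) // simulate real work
        w.Write([]byte("ok\n"))
    })
    http.Handle("/api/items", limitConcurrency(50, slow))
    log.Fatal(http.ListenAndServe(":8080", nil))
}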
Step 2: Use queue bounds and load shedding. If you have an internal queue (for example, for background processing), make it bounded. When full, drop or reject new work rather than letting the queue grow without limit. For user-facing requests, load shedding can be as simple as returning 429 Too Many Requests when concurrency is high.
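A sketch of a bounded queue with load shedding in Go; the queue size, worker count, and Job type are placeholders. Enqueue never blocks: when the queue is full, new work is rejected so memory and latency stay bounded.

package main

import (
    "errors"
    "fmt"
    "time"
)

// Job represents a unit of background work (placeholder payload).
type Job struct{ ID int }

var ErrQueueFull = errors.New("queue full: shedding load")

// queue is bounded; when it is full, Enqueue rejects instead of blocking.
var queue = make(chan Job, 100)

func Enqueue(j Job) error {
    select {
    case queue <- j:
        return nil
    default:
        return ErrQueueFull // caller can return 429/503 or drop the work
    }
}

func worker() {
    for j := range queue {
        time.Sleep(10 * time.Millisecond) // simulate processing
        fmt.Println("processed job", j.ID)
    }
}

func main() {
    go worker()
    for i := 0; i < 200; i++ {
        if err := Enqueue(Job{ID: i}); err != nil {
            fmt.Println("rejected job", i, ":", err)
        }
    }
    time.Sleep(time.Second)
}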
Step 3: Apply proxy-level limits. Many Ingress controllers and sidecars can enforce rate limits or connection limits. Even without a full rate-limiting system, you can cap connections and pending requests to avoid proxy overload. In Envoy-based meshes, connection pool and pending request limits (shown earlier) are a form of backpressure: they prevent unbounded queuing inside the proxy.
Step 4: Prefer adaptive signals when possible. Static limits are a starting point, but adaptive backpressure reacts to current conditions. Examples include limiting concurrency based on observed latency (if latency rises, reduce concurrency), or using token-bucket rate limiting tied to CPU utilization. If you implement adaptive logic in the application, keep it simple and observable: expose current limits and rejection counts as metrics.
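One simple adaptive scheme, sketched below under the assumption that observed latency is the signal, is additive-increase/multiplicative-decrease (AIMD): raise the concurrency limit slowly while latency stays under a target and cut it sharply when it does not. All numbers are placeholders, and a real version would also export the current limit and rejection count as metrics.

package main

import (
    "sync"
    "time"
)

// AdaptiveLimiter adjusts its concurrency limit from observed latency:
// additive increase while latency is under target, multiplicative
// decrease when it is over (AIMD).
type AdaptiveLimiter struct {
    mu       sync.Mutex
    limit    int
    inFlight int
    minLimit int
    maxLimit int
    target   time.Duration
}

func NewAdaptiveLimiter(start, minLimit, maxLimit int, target time.Duration) *AdaptiveLimiter {
    return &AdaptiveLimiter{limit: start, minLimit: minLimit, maxLimit: maxLimit, target: target}
}

// Acquire reports whether a new request may start.
func (l *AdaptiveLimiter) Acquire() bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.inFlight >= l.limit {
        return false // caller should reject with 429/503
    }
    l.inFlight++
    return true
}

// Release records the request's latency and adapts the limit.
func (l *AdaptiveLimiter) Release(latency time.Duration) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.inFlight--
    if latency > l.target {
        l.limit = l.limit / 2 // multiplicative decrease under pressure
        if l.limit < l.minLimit {
            l.limit = l.minLimit
        }
    } else if l.limit < l.maxLimit {
        l.limit++ // additive increase while healthy
    }
}

func main() {
    l := NewAdaptiveLimiter(20, 5, 200, 250*time.Millisecond)
    if l.Acquire() {
        start := time.Now()
        time.Sleep(10 * time.Millisecond) // simulate handling a request
        l.Release(time.Since(start))
    }
}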
Step 5: Coordinate client behavior. Backpressure works best when clients respect it. If a client receives 429 or 503, it should not immediately retry in a tight loop. Combine backpressure responses with Retry-After headers where appropriate, and ensure client retry logic uses backoff and jitter.
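A sketch of a well-behaved client in Go: on 429 or 503 it honors the server's Retry-After value when present and otherwise falls back to a jittered delay. The URL, fallback duration, and single retry are illustrative, and this sketch only parses the seconds form of Retry-After, not the HTTP-date form.

package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "strconv"
    "time"
)

// retryDelay decides how long to wait after a 429/503 response,
// preferring the server's Retry-After header (seconds) if present.
func retryDelay(resp *http.Response, fallback time.Duration) time.Duration {
    if s := resp.Header.Get("Retry-After"); s != "" {
        if secs, err := strconv.Atoi(s); err == nil {
            return time.Duration(secs) * time.Second
        }
    }
    // Jittered fallback so many clients do not retry in lockstep.
    return fallback/2 + time.Duration(rand.Int63n(int64(fallback/2)))
}

func main() {
    client := &http.Client{Timeout: 2 * time.Second}
    // "http://frontend/api/items" is a hypothetical in-cluster URL.
    resp, err := client.Get("http://frontend/api/items")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
        wait := retryDelay(resp, 500*time.Millisecond)
        fmt.Println("backing off for", wait, "before retrying")
        time.Sleep(wait)
        // ...retry once here, staying within the overall request budget.
    }
}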
Backpressure and Streaming/Long-Lived Connections
Long-lived connections (WebSockets, server-sent events, gRPC streams) can consume capacity differently than short HTTP requests. Backpressure here often means limiting the number of concurrent streams, enforcing per-connection flow control, and setting idle timeouts. If you allow unlimited streams, you can run out of file descriptors or memory even if request rate is low. Ensure your proxies and applications have explicit limits for concurrent streams and that you monitor active connections as a first-class metric.
Putting the Patterns Together: A Coordinated Policy
Build a Reliability “Contract” Per Endpoint
Instead of configuring patterns ad hoc, define a small contract per endpoint: timeout budget, retry policy, circuit breaker thresholds, and backpressure behavior. For example, for an idempotent GET /api/catalog/items endpoint you might choose: 400 ms timeout, 1 retry with 200 ms per-try timeout, outlier detection eject after 5 consecutive 5xx, and a concurrency cap that returns 503 when saturated. For a POST /api/orders endpoint you might choose: 1500 ms timeout, no automatic retries unless idempotency key is present, stricter circuit breaking to protect the database, and explicit 429 responses when the system is under load.
Step-by-Step: Configure a Mesh Policy for a Single Dependency
Step 1: Set a per-route timeout and limited retries in a VirtualService for an idempotent call. Ensure total timeout is greater than perTryTimeout times attempts, but still within your end-to-end budget.
Step 2: Add a DestinationRule with connection pool limits and outlier detection. Start with modest max pending requests and max connections, then tune based on observed traffic and latency.
Step 3: Validate behavior under failure. Induce a controlled failure (for example, scale the dependency down to zero in a staging environment or inject 5xx responses) and observe: do requests fail fast, do retries stay bounded, does the proxy stop sending traffic to bad endpoints, and does the caller remain responsive?
Step 4: Add application-level fallbacks. If the dependency is optional, return a degraded response. If it is required, return a clear error quickly and log the failure reason with correlation IDs so you can trace it.
Step-by-Step: Avoiding Retry Storms During Partial Outages
Step 1: Ensure timeouts are short enough to detect failure quickly but not so short that normal p95 latency triggers them. Use observed latency distributions to set initial values.
Step 2: Limit retries to 1–2 attempts for most calls. More attempts rarely help and often harm.
Step 3: Add jittered backoff. If your tooling does not support jitter at the proxy layer, keep retries in the application where you can implement it correctly, and disable proxy retries.
Step 4: Use circuit breaking/outlier detection so that retries do not keep targeting the same failing pod.
Step 5: Add backpressure so that when the caller is overloaded, it rejects quickly rather than queuing and timing out later.
Observability Signals to Tune These Patterns
Metrics That Directly Reflect Reliability Controls
To tune timeouts, retries, circuit breaking, and backpressure, you need visibility into both normal and failure behavior. Track request latency percentiles (p50, p95, p99), error rates by status code, and request volume. Add specific signals: number of retries attempted, number of requests that timed out, number of circuit breaker open events, number of ejected endpoints, and number of rejected requests due to concurrency or rate limits.
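As one way to expose these signals, the sketch below defines a few counters with the Prometheus Go client (assuming github.com/prometheus/client_golang is available); the metric names and labels are illustrative, not a standard.

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counters for the reliability controls discussed in this chapter.
var (
    retriesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "client_retries_total",
        Help: "Retry attempts, by downstream route.",
    }, []string{"route"})

    timeoutsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "client_timeouts_total",
        Help: "Requests that hit a timeout, by downstream route.",
    }, []string{"route"})

    sheddedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "requests_shed_total",
        Help: "Requests rejected by concurrency or rate limits.",
    }, []string{"route", "reason"})
)

func main() {
    // Example increments from inside retry/timeout/backpressure code paths.
    retriesTotal.WithLabelValues("catalog").Inc()
    timeoutsTotal.WithLabelValues("reviews").Inc()
    sheddedTotal.WithLabelValues("frontend", "concurrency_limit").Inc()

    // Expose /metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}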
Correlate these signals. If timeouts increase while latency p99 increases, you may need more capacity or a higher timeout. If retries increase while success rate decreases, retries may be amplifying load. If circuit breaker opens frequently during deployments, your thresholds may be too sensitive or your readiness signaling may be insufficient. If backpressure rejections spike, it may indicate a real traffic surge or a downstream bottleneck that needs scaling or optimization.
Logging and Tracing for Root Cause
When a request fails, you want to know which control triggered: timeout, retry exhaustion, circuit open, or backpressure rejection. Ensure your proxies and applications emit structured logs that include response flags or error reasons, upstream cluster/endpoint, and timing. In tracing, annotate spans with retry count, timeout values, and whether a fallback path was used. This makes it possible to distinguish “dependency slow” from “dependency down” from “caller overloaded,” which is essential for correct tuning.
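As a sketch of the tracing side, assuming the OpenTelemetry Go API (go.opentelemetry.io/otel) is initialized elsewhere, a caller can attach the retry count, configured timeout, and fallback flag to its span; the span name and attribute keys are illustrative.

package main

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// callCatalog annotates its span with the reliability controls that were
// applied, so a failure can later be attributed to a specific control.
func callCatalog(ctx context.Context, timeout time.Duration) {
    tracer := otel.Tracer("frontend")
    ctx, span := tracer.Start(ctx, "catalog.get_items")
    defer span.End()

    retries := 1          // how many retries were actually attempted
    usedFallback := true  // whether a degraded/fallback path was taken

    span.SetAttributes(
        attribute.Int("retry.count", retries),
        attribute.String("timeout.configured", timeout.String()),
        attribute.Bool("fallback.used", usedFallback),
    )
    _ = ctx // the derived context would be passed to the actual HTTP call
}

func main() {
    callCatalog(context.Background(), 400*time.Millisecond)
}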