What load balancing is (and what it is not)
Load balancing is the practice of distributing incoming requests across multiple backend instances (servers, containers, pods, or processes) so that no single instance becomes a bottleneck and so the service can keep working when some instances fail. A load balancer sits in front of a pool of backends and decides, for each request (or connection), which backend should handle it.
It helps you achieve three practical goals: (1) scale out by adding more backends, (2) improve availability by routing around failures, and (3) smooth performance by preventing hotspots. Load balancing is not the same as caching (serving responses without hitting the app), and it is not the same as autoscaling (adding/removing instances). Those can work together, but the load balancer’s core job is request distribution and failure handling.
Where load balancing happens
In real systems, you often have multiple layers of load balancing:
- Edge / public load balancer: the first hop inside your infrastructure, receiving internet traffic and distributing it to your application tier.
- Internal load balancer: used between services (for example, web tier to API tier) to spread traffic and isolate failures.
- Client-side load balancing: the client (or a library) chooses a backend from a list (common in microservices with service discovery).
This chapter focuses on patterns and operational behaviors rather than the basics of HTTP/TLS or reverse proxies (covered earlier). Think of a load balancer as a traffic director with rules, health checks, and memory about recent failures.
Core distribution algorithms (patterns)
Round robin
Round robin sends requests to backends in a rotating order: A, then B, then C, then back to A. It’s simple and works well when requests are similar in cost and backends are identical.
Practical notes:
- If one backend is slower, it can still receive the same number of requests and become overloaded.
- Round robin can be weighted to give more traffic to stronger machines (for example, 2x traffic to a backend with double CPU).
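A minimal sketch of weighted round robin in Python, assuming a static backend pool with integer weights (the backend names and weights here are illustrative):

import itertools

# Hypothetical backend pool; weights reflect relative capacity.
BACKENDS = {"app-1": 2, "app-2": 1, "app-3": 1}

def weighted_round_robin(pool):
    """Yield backends in a repeating order, proportional to weight."""
    expanded = [name for name, weight in pool.items() for _ in range(weight)]
    return itertools.cycle(expanded)

picker = weighted_round_robin(BACKENDS)
# next(picker) -> "app-1", "app-1", "app-2", "app-3", then the cycle repeats.

Note that this naive expansion sends a backend's extra share in a burst; many production balancers interleave weighted slots more smoothly, but the proportion of traffic ends up the same.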
Least connections (or least outstanding requests)
Least connections sends new requests to the backend currently handling the fewest active connections (or requests). This helps when request durations vary, such as a mix of quick API calls and slower report generation.
Practical notes:
- It assumes “active connections” correlates with load; this is often true but not always (one connection can be CPU-heavy).
- It can react faster to imbalances than round robin.
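A least-connections picker can be as small as a per-backend counter of in-flight requests. A minimal sketch, assuming the balancer increments and decrements the counter around each request (the forward() helper and backend names are illustrative):

# In-flight request counts, updated as requests start and finish.
active = {"app-1": 0, "app-2": 0, "app-3": 0}

def pick_least_connections(counts):
    """Choose the backend with the fewest outstanding requests."""
    return min(counts, key=counts.get)

def handle(request):
    backend = pick_least_connections(active)
    active[backend] += 1
    try:
        return forward(request, backend)   # forward() is assumed to exist
    finally:
        active[backend] -= 1               # always release, even on errors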
Least response time
Least response time (sometimes combined with least connections) prefers backends that are responding faster. This can improve tail latency, especially when some instances are degraded (for example, noisy neighbor, GC pauses, disk pressure).
Practical notes:
- Requires measuring response times and maintaining rolling statistics.
- Can create feedback loops: a backend that gets less traffic may “look fast” and then suddenly get flooded. Many implementations dampen changes with smoothing.
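One common way to dampen that feedback loop is an exponentially weighted moving average (EWMA) of observed latency, so a single fast or slow sample does not swing routing. A sketch; the smoothing factor and seed values are arbitrary choices:

ALPHA = 0.2  # smoothing factor: higher reacts faster, lower is more stable

# Smoothed latency per backend, seeded pessimistically so a backend with
# no history is not flooded just because it "looks fast".
ewma_ms = {"app-1": 50.0, "app-2": 50.0, "app-3": 50.0}

def record_latency(backend, observed_ms):
    """Fold a new observation into the rolling average."""
    ewma_ms[backend] = ALPHA * observed_ms + (1 - ALPHA) * ewma_ms[backend]

def pick_fastest():
    """Route to the backend with the lowest smoothed latency."""
    return min(ewma_ms, key=ewma_ms.get)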
Hash-based routing (consistent hashing)
Hash-based routing chooses a backend based on a hash of something stable, such as client IP, a session cookie, or a user ID. The goal is stickiness: the same user tends to land on the same backend.
Consistent hashing is a refinement that minimizes remapping when backends are added/removed. Without it, adding one backend can reshuffle most users, causing cache misses and session disruptions.
Practical notes:
- Great for stateful workloads or per-user caches.
- Be careful with client-IP hashing when many users share an IP (mobile carriers, corporate NAT), which can overload a single backend.
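A minimal consistent-hash ring sketch using virtual nodes (multiple points per backend) to even out the distribution; the replica count, hash function, and key choice are assumptions:

import bisect
import hashlib

class HashRing:
    def __init__(self, backends, replicas=100):
        # Each backend gets `replicas` points on the ring so load spreads evenly.
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b) for b in backends for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def pick(self, key):
        """Map a stable key (user ID, session ID) to a backend."""
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["app-1", "app-2", "app-3"])
backend = ring.pick("user-42")  # the same user keeps landing on the same backend

Because each backend owns only scattered points on the ring, adding or removing one backend remaps only the keys adjacent to its points rather than reshuffling most users.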
Power of two choices
This pattern picks two random backends and sends the request to the one with fewer active requests (or lower latency). It’s a simple technique that often performs close to least-connections with less global coordination.
Practical notes:
- Works well at large scale.
- Reduces the chance of hotspots without requiring perfect metrics.
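Power of two choices needs only a per-backend load counter and a random sample of two; a sketch (assumes at least two backends in the pool):

import random

def pick_power_of_two(active_counts):
    """Sample two distinct backends at random; take the less-loaded one."""
    a, b = random.sample(list(active_counts), 2)
    return a if active_counts[a] <= active_counts[b] else b

# Example: with {"app-1": 7, "app-2": 2, "app-3": 5}, most picks land on
# app-2 or app-3, without ever scanning every backend's counter.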
Sticky sessions (session affinity): when and how
Sticky sessions mean requests from the same user are routed to the same backend for some period. This is commonly implemented via a cookie set by the load balancer (preferred) or via source IP affinity (less reliable).
Use stickiness when:
- Your application stores session state in memory on the backend (not ideal, but common in legacy apps).
- You maintain per-user in-memory caches that would be expensive to rebuild.
Avoid stickiness when:
- You can store session state in a shared store (database/redis) or make the app stateless.
- You need the load balancer to freely reroute traffic during failures and scaling events.
Failure trade-off: with stickiness, if a backend fails, users bound to it may experience errors until the affinity expires or the load balancer breaks the stickiness. Many systems support “failover stickiness”: keep affinity unless the backend is unhealthy.
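A sketch of cookie-based affinity with that "failover stickiness" behavior; the cookie name, the request object shape, and the fallback picker are illustrative assumptions:

AFFINITY_COOKIE = "lb_backend"  # illustrative cookie name set by the balancer

def route(request, healthy_backends, fallback_picker):
    """Honor the affinity cookie unless the pinned backend is unhealthy."""
    pinned = request.cookies.get(AFFINITY_COOKIE)   # assumes a cookies dict
    if pinned in healthy_backends:
        return pinned, None
    # Pinned backend missing or unhealthy: break affinity and re-pin.
    backend = fallback_picker(healthy_backends)
    set_cookie = (AFFINITY_COOKIE, backend)          # balancer re-sets the cookie
    return backend, set_cookie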
Health checks: the foundation of failure handling
Load balancers need a way to decide whether a backend should receive traffic. That’s what health checks do. A health check is a periodic probe (HTTP request, TCP connect, or custom script) that marks a backend healthy or unhealthy.
Types of health checks
- TCP check: can the balancer establish a TCP connection to the backend port? Fast, but shallow; the app could be wedged while the port still accepts connections.
- HTTP check: request a path like /health and expect a specific status code/body. More meaningful; can validate dependencies.
- Deep health check: verifies critical dependencies (database, queue, downstream APIs). Useful, but can cause cascading failures if it marks instances unhealthy due to a transient dependency issue.
Designing a good health endpoint
A practical pattern is to expose two endpoints:
- Liveness (is the process running?): returns OK if the app loop is alive. Used to restart crashed/hung processes.
- Readiness (should it receive traffic?): returns OK only if the instance can serve requests (for example, warmed up, connected to dependencies, not in maintenance).
Even if your platform doesn’t explicitly separate them, you can implement the same idea with different URLs and configure the load balancer to use readiness for routing decisions.
Thresholds, intervals, and flapping
Health checks should not toggle rapidly. Use:
- Interval: how often to check (for example, every 5s).
- Timeout: how long to wait for a response (for example, 2s).
- Healthy threshold: how many consecutive successes to mark healthy (for example, 2).
- Unhealthy threshold: how many consecutive failures to mark unhealthy (for example, 3).
This reduces “flapping” where a backend alternates between healthy/unhealthy due to brief spikes.
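These thresholds translate directly into a small per-backend state machine; a sketch using the example values above (the probe loop that applies the interval and timeout is not shown):

INTERVAL_S = 5           # probe every 5 seconds (used by the probe loop, not shown)
TIMEOUT_S = 2            # give up on a probe after 2 seconds
HEALTHY_THRESHOLD = 2    # consecutive successes to mark healthy
UNHEALTHY_THRESHOLD = 3  # consecutive failures to mark unhealthy

class BackendHealth:
    def __init__(self):
        self.healthy = True
        self.successes = 0
        self.failures = 0

    def record(self, probe_ok):
        """Flip state only after enough consecutive results, to avoid flapping."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= HEALTHY_THRESHOLD:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False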
Failover behaviors: what happens when a backend dies mid-flight
Failures come in different forms, and the load balancer’s behavior depends on whether it is balancing per-request or per-connection.
Connection failures before a request is sent
If the balancer can’t connect to a chosen backend, it can immediately retry another backend. Many load balancers do this automatically, but you should confirm:
- How many retries are attempted?
- Are retries limited to idempotent requests only?
- Is there a per-request retry budget to avoid amplifying load?
Backend fails after receiving the request
This is trickier. If the backend received the request and then died, retrying might cause duplicate side effects (for example, charging a credit card twice). Safe retry depends on request semantics and your application’s idempotency design.
Practical mitigation patterns:
- Retry only idempotent methods (GET/HEAD; PUT/DELETE are idempotent by specification, but only if your application actually implements them that way).
- Idempotency keys: clients send a unique key; the backend deduplicates repeated attempts.
- At-least-once vs exactly-once: accept that retries may happen and build deduplication into the app.
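A sketch of server-side deduplication with idempotency keys, assuming the client sends a unique key with each logical operation and the backend stores outcomes keyed by it (the store and the charge_card() call are illustrative):

# Illustrative in-memory store; real systems use a shared store
# (database or Redis) with a TTL so keys eventually expire.
seen_results = {}

def handle_charge(idempotency_key, charge_request):
    """Execute the side effect at most once per key; replay the stored result on retries."""
    if idempotency_key in seen_results:
        return seen_results[idempotency_key]      # retry: return the previous outcome
    result = charge_card(charge_request)          # charge_card() is assumed to exist
    seen_results[idempotency_key] = result
    return result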
Load balancing and overload: protecting backends from traffic spikes
Even with perfect distribution, you can overload every backend if traffic exceeds capacity. Load balancers often provide controls to prevent collapse.
Connection limits and queueing
A load balancer may cap concurrent connections per backend. When a backend hits its limit, new requests are sent elsewhere or queued. Queueing can help absorb brief bursts, but long queues increase latency and can time out clients.
A practical approach is to keep queues short and fail fast when the system is saturated, so clients can retry later rather than waiting.
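A sketch of that "short queue, fail fast" idea using a bounded queue with a maximum wait; the limits are illustrative:

import queue

MAX_QUEUE_DEPTH = 50     # keep the backlog short
MAX_WAIT_S = 0.5         # fail fast rather than hold clients indefinitely

pending = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def admit(request):
    """Admit a request only if the queue has room soon; otherwise reject (e.g. 503)."""
    try:
        pending.put(request, timeout=MAX_WAIT_S)
        return True
    except queue.Full:
        return False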
Circuit breakers (ejecting bad backends)
A circuit breaker pattern temporarily stops sending traffic to a backend (or a whole service) when error rates or timeouts exceed a threshold. This is related to health checks but often reacts faster and can be based on real request outcomes.
Typical states:
- Closed: traffic flows normally.
- Open: no traffic sent; backend is considered unhealthy.
- Half-open: send a small amount of traffic to test recovery.
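A minimal circuit breaker sketch with those three states; the failure threshold and open duration are illustrative:

import time

FAILURE_THRESHOLD = 5    # consecutive failures that open the circuit
OPEN_SECONDS = 30        # how long to stay open before probing again

class CircuitBreaker:
    def __init__(self):
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= OPEN_SECONDS:
                self.state = "half-open"   # let a trial request through
                return True
            return False
        return True  # closed or half-open

    def record(self, success):
        if success:
            self.state = "closed"
            self.failures = 0
        else:
            self.failures += 1
            if self.state == "half-open" or self.failures >= FAILURE_THRESHOLD:
                self.state = "open"
                self.opened_at = time.monotonic()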
Rate limiting and load shedding
Rate limiting restricts request rates per client, API key, or path. Load shedding intentionally rejects some requests when overloaded (for example, return 503) to preserve core functionality.
Pattern: prioritize critical endpoints (login, checkout) and shed non-critical ones (analytics, expensive searches) first.
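A sketch of a per-client token bucket combined with priority-based shedding; the rates, bucket size, and the critical-path list are assumptions:

import time

RATE = 10.0        # tokens added per second per client
BURST = 20.0       # bucket capacity
CRITICAL_PATHS = {"/login", "/checkout"}   # shed everything else first

buckets = {}  # client_id -> (tokens, last_refill_timestamp)

def allow(client_id, path, overloaded):
    tokens, last = buckets.get(client_id, (BURST, time.monotonic()))
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last check
    if tokens < 1.0:
        return False   # rate limit exceeded -> reject (e.g. 429)
    if overloaded and path not in CRITICAL_PATHS:
        return False   # load shedding: protect critical endpoints (e.g. 503)
    buckets[client_id] = (tokens - 1.0, now)
    return True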
Multi-zone and multi-region balancing
To handle infrastructure failures, you typically spread backends across failure domains.
Active-active across zones
Backends run in multiple zones, and the load balancer distributes traffic across them. If one zone fails, health checks remove its backends and traffic shifts to remaining zones.
Practical considerations:
- Ensure each zone has enough capacity to handle failover load (N+1 planning).
- Watch for cross-zone data dependencies that can become bottlenecks.
Active-passive (standby)
One zone or region is primary; another is standby and receives little or no traffic until failover. This can simplify data consistency but requires reliable failover automation and regular testing.
Geo load balancing
Traffic is directed to the nearest or best-performing region. The decision can be based on latency measurements, geography, or regional health. The key pattern is to avoid sending users to a region that is “up” but degraded.
Step-by-step: choosing a load balancing pattern for an API
This practical checklist helps you pick an initial strategy and avoid common failure modes.
Step 1: Decide what you are balancing (requests vs connections)
- If your traffic is mostly short HTTP requests, request-level balancing is ideal.
- If you have long-lived connections (streaming, websockets), connection-level balancing matters more; least-connections or hashing can be more stable.
Step 2: Determine whether you need stickiness
- If the app is stateless, disable stickiness to maximize flexibility during failures.
- If you must keep in-memory session state, enable cookie-based affinity and configure failover behavior (break affinity when unhealthy).
Step 3: Pick a distribution algorithm
- Start with round robin for uniform workloads.
- Use least connections if request durations vary widely.
- Use least response time if you see degraded instances and want latency-aware routing.
- Use consistent hashing if you need per-user stickiness or cache locality.
Step 4: Implement readiness health checks
Create a lightweight endpoint (for example, /ready) that returns success only when the instance can serve production traffic. Keep it fast and avoid expensive dependency calls unless necessary.
// Example pseudo-handler for /ready (language-agnostic idea)
if (app_is_warmed_up && can_accept_requests && not_in_maintenance) {
  return 200 "ok"
} else {
  return 503 "not ready"
}
Configure the load balancer with conservative thresholds (for example, 3 failures to mark unhealthy, 2 successes to mark healthy) to reduce flapping.
Step 5: Decide retry policy to avoid duplicate side effects
- Enable automatic retries only for connection failures and timeouts on idempotent requests.
- For non-idempotent operations, use idempotency keys and deduplication in the application.
Step 6: Add overload protection
- Set per-backend connection/request limits.
- Define a maximum queue time; fail fast beyond it.
- Implement rate limiting for abusive clients and load shedding for non-critical endpoints.
Step 7: Validate with failure drills
Test how the system behaves when things go wrong:
- Terminate one backend: confirm traffic shifts and error rate stays acceptable.
- Introduce latency on one backend: confirm latency-aware routing or circuit breaking reduces impact.
- Kill an entire zone: confirm remaining zones handle load and that health checks remove the failed zone quickly.
Step-by-step: rolling deployments without dropping traffic
Load balancers are central to safe deployments. A common pattern is to gradually replace old instances with new ones while keeping the service available.
Step 1: Add new instances as “not ready”
Start new instances but keep readiness failing until initialization is complete (config loaded, caches warmed, migrations done if applicable).
Step 2: Mark ready and ramp traffic
Once readiness passes, the load balancer starts sending traffic. If you support weights, begin with a small weight for the new version and increase gradually.
Step 3: Drain old instances
Draining means stopping new requests from being assigned to an instance while allowing in-flight requests to finish. Configure a drain timeout appropriate to your longest typical request.
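A sketch of draining, assuming the balancer tracks in-flight requests per backend (as in the least-connections example), keeps healthy backends in a set, and uses a drain timeout sized to your longest typical request:

import time

DRAIN_TIMEOUT_S = 60   # should exceed the longest typical request

def drain(backend, active_counts, healthy_backends):
    """Stop assigning new requests, then wait for in-flight work to finish."""
    healthy_backends.discard(backend)          # no new requests land here
    deadline = time.monotonic() + DRAIN_TIMEOUT_S
    while active_counts.get(backend, 0) > 0 and time.monotonic() < deadline:
        time.sleep(1)                          # poll until idle or timeout
    return active_counts.get(backend, 0) == 0  # True if drained cleanly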
Step 4: Remove old instances
After drain completes, terminate old instances. If you use sticky sessions, ensure the stickiness TTL is compatible with draining; otherwise, users may keep getting routed to draining instances.
Common pitfalls and how to avoid them
Uneven load due to “hot keys”
If you hash on user ID or another key, a small number of “heavy” users can overload one backend. Mitigations include:
- Use a different key (or include more entropy).
- Shard heavy tenants explicitly across multiple backends.
- Use power-of-two choices with load awareness instead of strict hashing.
Health checks that cause outages
If readiness depends on a flaky dependency, you can mark all backends unhealthy and take the service down even though it could have served partial functionality. Consider:
- Make readiness reflect the ability to serve core requests.
- Use separate checks for observability vs routing decisions.
- Degrade gracefully: serve cached or limited responses rather than failing readiness.
Retry storms
When a backend is slow, clients and load balancers may retry, multiplying traffic and making the problem worse. Controls:
- Limit retries and use exponential backoff.
- Use a retry budget (cap retries as a fraction of total traffic).
- Combine with circuit breaking to stop sending traffic to failing instances.
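A client-side sketch combining capped attempts, exponential backoff with jitter, and a simple retry budget (retries allowed only as a fraction of recent requests); all of the numbers are illustrative:

import random
import time

MAX_ATTEMPTS = 3
BASE_DELAY_S = 0.1
RETRY_BUDGET = 0.1   # allow retries up to 10% of total requests

stats = {"requests": 0, "retries": 0}

def call_with_retries(send):
    """`send` is an assumed callable that raises an exception on failure."""
    for attempt in range(MAX_ATTEMPTS):
        stats["requests"] += 1
        try:
            return send()
        except Exception:
            over_budget = stats["retries"] > RETRY_BUDGET * stats["requests"]
            if attempt == MAX_ATTEMPTS - 1 or over_budget:
                raise                     # fail fast instead of amplifying load
            stats["retries"] += 1
            backoff = BASE_DELAY_S * (2 ** attempt)
            time.sleep(backoff + random.uniform(0, backoff))  # add jitter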
Long-lived connections pin traffic
With websockets or streaming, a backend can accumulate many long connections and become “full” even if request rate is low. Use least-connections, enforce connection limits, and consider connection draining during deploys.
Observability signals that matter for load balancing
To know whether your load balancing pattern is working, monitor:
- Per-backend request rate: should be roughly even unless intentionally weighted.
- Error rate by backend: helps detect partial failures and bad deploys.
- Latency percentiles (p50/p95/p99) by backend and overall: identifies stragglers.
- Health check status changes: frequent toggling indicates flapping thresholds or unstable instances.
- Retry counts: rising retries can precede an outage.
- Backend saturation: CPU, memory, connection counts, queue depth.
Use these metrics to tune weights, thresholds, and routing algorithms. For example, if one backend consistently shows higher p99 latency, least-response-time routing or circuit breaking can reduce its impact while you investigate the root cause.