What load balancing is (and what it is not)
Load balancing is the practice of distributing incoming requests across multiple backend instances (servers, containers, pods, or processes) so that no single instance becomes a bottleneck and so the service can keep working when some instances fail. A load balancer sits in front of a pool of backends and decides, for each request (or connection), which backend should handle it.
It helps you achieve three practical goals: (1) scale out by adding more backends, (2) improve availability by routing around failures, and (3) smooth performance by preventing hotspots. Load balancing is not the same as caching (serving responses without hitting the app), and it is not the same as autoscaling (adding/removing instances). Those can work together, but the load balancer’s core job is request distribution and failure handling.
Where load balancing happens
In real systems, you often have multiple layers of load balancing:
- Edge / public load balancer: the first hop inside your infrastructure, receiving internet traffic and distributing it to your application tier.
- Internal load balancer: used between services (for example, web tier to API tier) to spread traffic and isolate failures.
- Client-side load balancing: the client (or a library) chooses a backend from a list (common in microservices with service discovery).
This chapter focuses on patterns and operational behaviors rather than the basics of HTTP/TLS or reverse proxies (covered earlier). Think of a load balancer as a traffic director with rules, health checks, and memory about recent failures.
Core distribution algorithms (patterns)
Round robin
Round robin sends requests to backends in a rotating order: A, then B, then C, then back to A. It’s simple and works well when requests are similar in cost and backends are identical.
Practical notes:
- If one backend is slower, it can still receive the same number of requests and become overloaded.
- Round robin can be weighted to give more traffic to stronger machines (for example, 2x traffic to a backend with double CPU).
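A minimal sketch of weighted round robin in Python, assuming a static backend pool with integer weights (the backend names and weights here are illustrative):

import itertools

# Hypothetical backend pool; weights reflect relative capacity.
BACKENDS = {"app-1": 2, "app-2": 1, "app-3": 1}

def weighted_round_robin(pool):
    """Yield backends in a repeating order, proportional to weight."""
    expanded = [name for name, weight in pool.items() for _ in range(weight)]
    return itertools.cycle(expanded)

picker = weighted_round_robin(BACKENDS)
# next(picker) -> "app-1", "app-1", "app-2", "app-3", then the cycle repeats.

Note that this naive expansion sends a backend's extra share in a burst; many production balancers interleave weighted slots more smoothly, but the proportion of traffic ends up the same.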
Least connections (or least outstanding requests)
Least connections sends new requests to the backend currently handling the fewest active connections (or requests). This helps when request durations vary, such as a mix of quick API calls and slower report generation.
Practical notes:
- It assumes “active connections” correlates with load; this is often true but not always (one connection can be CPU-heavy).
- It can react faster to imbalances than round robin.
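A least-connections picker can be as small as a per-backend counter of in-flight requests. A minimal sketch, assuming the balancer increments and decrements the counter around each request (the forward() helper and backend names are illustrative):

# In-flight request counts, updated as requests start and finish.
active = {"app-1": 0, "app-2": 0, "app-3": 0}

def pick_least_connections(counts):
    """Choose the backend with the fewest outstanding requests."""
    return min(counts, key=counts.get)

def handle(request):
    backend = pick_least_connections(active)
    active[backend] += 1
    try:
        return forward(request, backend)   # forward() is assumed to exist
    finally:
        active[backend] -= 1               # always release, even on errors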
Least response time
Least response time (sometimes combined with least connections) prefers backends that are responding faster. This can improve tail latency, especially when some instances are degraded (for example, noisy neighbor, GC pauses, disk pressure).
Practical notes:
- Requires measuring response times and maintaining rolling statistics.
- Can create feedback loops: a backend that gets less traffic may “look fast” and then suddenly get flooded. Many implementations dampen changes with smoothing.
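One common way to dampen that feedback loop is an exponentially weighted moving average (EWMA) of observed latency, so a single fast or slow sample does not swing routing. A sketch; the smoothing factor and seed values are arbitrary choices:

ALPHA = 0.2  # smoothing factor: higher reacts faster, lower is more stable

# Smoothed latency per backend, seeded pessimistically so a backend with
# no history is not flooded just because it "looks fast".
ewma_ms = {"app-1": 50.0, "app-2": 50.0, "app-3": 50.0}

def record_latency(backend, observed_ms):
    """Fold a new observation into the rolling average."""
    ewma_ms[backend] = ALPHA * observed_ms + (1 - ALPHA) * ewma_ms[backend]

def pick_fastest():
    """Route to the backend with the lowest smoothed latency."""
    return min(ewma_ms, key=ewma_ms.get)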
Hash-based routing (consistent hashing)
Hash-based routing chooses a backend based on a hash of something stable, such as client IP, a session cookie, or a user ID. The goal is stickiness: the same user tends to land on the same backend.
Consistent hashing is a refinement that minimizes remapping when backends are added/removed. Without it, adding one backend can reshuffle most users, causing cache misses and session disruptions.
Practical notes:
- Great for stateful workloads or per-user caches.
- Be careful with client-IP hashing when many users share an IP (mobile carriers, corporate NAT), which can overload a single backend.
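A minimal consistent-hash ring sketch using virtual nodes (multiple points per backend) to even out the distribution; the replica count, hash function, and key choice are assumptions:

import bisect
import hashlib

class HashRing:
    def __init__(self, backends, replicas=100):
        # Each backend gets `replicas` points on the ring so load spreads evenly.
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b) for b in backends for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def pick(self, key):
        """Map a stable key (user ID, session ID) to a backend."""
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["app-1", "app-2", "app-3"])
backend = ring.pick("user-42")  # the same user keeps landing on the same backend

Because each backend owns only scattered points on the ring, adding or removing one backend remaps only the keys adjacent to its points rather than reshuffling most users.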
Power of two choices
This pattern picks two random backends and sends the request to the one with fewer active requests (or lower latency). It’s a simple technique that often performs close to least-connections with less global coordination.
Practical notes:
- Works well at large scale.
- Reduces the chance of hotspots without requiring perfect metrics.
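Power of two choices needs only a per-backend load counter and a random sample of two; a sketch (assumes at least two backends in the pool):

import random

def pick_power_of_two(active_counts):
    """Sample two distinct backends at random; take the less-loaded one."""
    a, b = random.sample(list(active_counts), 2)
    return a if active_counts[a] <= active_counts[b] else b

# Example: with {"app-1": 7, "app-2": 2, "app-3": 5}, most picks land on
# app-2 or app-3, without ever scanning every backend's counter.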
Sticky sessions (session affinity): when and how
Sticky sessions mean requests from the same user are routed to the same backend for some period. This is commonly implemented via a cookie set by the load balancer (preferred) or via source IP affinity (less reliable).
Use stickiness when:
- Your application stores session state in memory on the backend (not ideal, but common in legacy apps).
- You maintain per-user in-memory caches that would be expensive to rebuild.
Avoid stickiness when:
- You can store session state in a shared store (database/redis) or make the app stateless.
- You need the load balancer to freely reroute traffic during failures and scaling events.
Failure trade-off: with stickiness, if a backend fails, users bound to it may experience errors until the affinity expires or the load balancer breaks the stickiness. Many systems support “failover stickiness”: keep affinity unless the backend is unhealthy.
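A sketch of cookie-based affinity with that "failover stickiness" behavior; the cookie name, the request object shape, and the fallback picker are illustrative assumptions:

AFFINITY_COOKIE = "lb_backend"  # illustrative cookie name set by the balancer

def route(request, healthy_backends, fallback_picker):
    """Honor the affinity cookie unless the pinned backend is unhealthy."""
    pinned = request.cookies.get(AFFINITY_COOKIE)   # assumes a cookies dict
    if pinned in healthy_backends:
        return pinned, None
    # Pinned backend missing or unhealthy: break affinity and re-pin.
    backend = fallback_picker(healthy_backends)
    set_cookie = (AFFINITY_COOKIE, backend)          # balancer re-sets the cookie
    return backend, set_cookie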
Health checks: the foundation of failure handling
Load balancers need a way to decide whether a backend should receive traffic. That’s what health checks do. A health check is a periodic probe (HTTP request, TCP connect, or custom script) that marks a backend healthy or unhealthy.
Types of health checks
- TCP check: can the balancer establish a TCP connection to the backend port? Fast, but shallow; the app could be wedged while the port still accepts connections.
- HTTP check: request a path like /health and expect a specific status code/body. More meaningful; can validate dependencies.
- Deep health check: verifies critical dependencies (database, queue, downstream APIs). Useful, but can cause cascading failures if it marks instances unhealthy due to a transient dependency issue.
Designing a good health endpoint
A practical pattern is to expose two endpoints:
- Liveness (is the process running?): returns OK if the app loop is alive. Used to restart crashed/hung processes.
- Readiness (should it receive traffic?): returns OK only if the instance can serve requests (for example, warmed up, connected to dependencies, not in maintenance).
Even if your platform doesn’t explicitly separate them, you can implement the same idea with different URLs and configure the load balancer to use readiness for routing decisions.
Thresholds, intervals, and flapping
Health checks should not toggle rapidly. Use:
- Interval: how often to check (for example, every 5s).
- Timeout: how long to wait for a response (for example, 2s).
- Healthy threshold: how many consecutive successes to mark healthy (for example, 2).
- Unhealthy threshold: how many consecutive failures to mark unhealthy (for example, 3).
This reduces “flapping” where a backend alternates between healthy/unhealthy due to brief spikes.
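These thresholds translate directly into a small per-backend state machine; a sketch using the example values above (the probe loop that applies the interval and timeout is not shown):

INTERVAL_S = 5           # probe every 5 seconds (used by the probe loop, not shown)
TIMEOUT_S = 2            # give up on a probe after 2 seconds
HEALTHY_THRESHOLD = 2    # consecutive successes to mark healthy
UNHEALTHY_THRESHOLD = 3  # consecutive failures to mark unhealthy

class BackendHealth:
    def __init__(self):
        self.healthy = True
        self.successes = 0
        self.failures = 0

    def record(self, probe_ok):
        """Flip state only after enough consecutive results, to avoid flapping."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= HEALTHY_THRESHOLD:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False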
Failover behaviors: what happens when a backend dies mid-flight
Failures come in different forms, and the load balancer’s behavior depends on whether it is balancing per-request or per-connection.
Connection failures before a request is sent
If the balancer can’t connect to a chosen backend, it can immediately retry another backend. Many load balancers do this automatically, but you should confirm:
- How many retries are attempted?
- Are retries limited to idempotent requests only?
- Is there a per-request retry budget to avoid amplifying load?
Backend fails after receiving the request
This is trickier. If the backend received the request and then died, retrying might cause duplicate side effects (for example, charging a credit card twice). Safe retry depends on request semantics and your application’s idempotency design.
Practical mitigation patterns:
- Retry only idempotent methods (GET/HEAD; PUT/DELETE are idempotent by specification, but only if your application actually implements them that way).
- Idempotency keys: clients send a unique key; the backend deduplicates repeated attempts.
- At-least-once vs exactly-once: accept that retries may happen and build deduplication into the app.
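A sketch of server-side deduplication with idempotency keys, assuming the client sends a unique key with each logical operation and the backend stores outcomes keyed by it (the store and the charge_card() call are illustrative):

# Illustrative in-memory store; real systems use a shared store
# (database or Redis) with a TTL so keys eventually expire.
seen_results = {}

def handle_charge(idempotency_key, charge_request):
    """Execute the side effect at most once per key; replay the stored result on retries."""
    if idempotency_key in seen_results:
        return seen_results[idempotency_key]      # retry: return the previous outcome
    result = charge_card(charge_request)          # charge_card() is assumed to exist
    seen_results[idempotency_key] = result
    return result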
Load balancing and overload: protecting backends from traffic spikes
Even with perfect distribution, you can overload every backend if traffic exceeds capacity. Load balancers often provide controls to prevent collapse.
Connection limits and queueing
A load balancer may cap concurrent connections per backend. When a backend hits its limit, new requests are sent elsewhere or queued. Queueing can help absorb brief bursts, but long queues increase latency and can time out clients.
A practical approach is to keep queues short and fail fast when the system is saturated, so clients can retry later rather than waiting.
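A sketch of that "short queue, fail fast" idea using a bounded queue with a maximum wait; the limits are illustrative:

import queue

MAX_QUEUE_DEPTH = 50     # keep the backlog short
MAX_WAIT_S = 0.5         # fail fast rather than hold clients indefinitely

pending = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def admit(request):
    """Admit a request only if the queue has room soon; otherwise reject (e.g. 503)."""
    try:
        pending.put(request, timeout=MAX_WAIT_S)
        return True
    except queue.Full:
        return False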
Circuit breakers (ejecting bad backends)
A circuit breaker pattern temporarily stops sending traffic to a backend (or a whole service) when error rates or timeouts exceed a threshold. This is related to health checks but often reacts faster and can be based on real request outcomes.
Typical states:
- Closed: traffic flows normally.
- Open: no traffic sent; backend is considered unhealthy.
- Half-open: send a small amount of traffic to test recovery.
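A minimal circuit breaker sketch with those three states; the failure threshold and open duration are illustrative:

import time

FAILURE_THRESHOLD = 5    # consecutive failures that open the circuit
OPEN_SECONDS = 30        # how long to stay open before probing again

class CircuitBreaker:
    def __init__(self):
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= OPEN_SECONDS:
                self.state = "half-open"   # let a trial request through
                return True
            return False
        return True  # closed or half-open

    def record(self, success):
        if success:
            self.state = "closed"
            self.failures = 0
        else:
            self.failures += 1
            if self.state == "half-open" or self.failures >= FAILURE_THRESHOLD:
                self.state = "open"
                self.opened_at = time.monotonic()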
Rate limiting and load shedding
Rate limiting restricts request rates per client, API key, or path. Load shedding intentionally rejects some requests when overloaded (for example, return 503) to preserve core functionality.
Pattern: prioritize critical endpoints (login, checkout) and shed non-critical ones (analytics, expensive searches) first.
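A sketch of a per-client token bucket combined with priority-based shedding; the rates, bucket size, and the critical-path list are assumptions:

import time

RATE = 10.0        # tokens added per second per client
BURST = 20.0       # bucket capacity
CRITICAL_PATHS = {"/login", "/checkout"}   # shed everything else first

buckets = {}  # client_id -> (tokens, last_refill_timestamp)

def allow(client_id, path, overloaded):
    tokens, last = buckets.get(client_id, (BURST, time.monotonic()))
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last check
    if tokens < 1.0:
        return False   # rate limit exceeded -> reject (e.g. 429)
    if overloaded and path not in CRITICAL_PATHS:
        return False   # load shedding: protect critical endpoints (e.g. 503)
    buckets[client_id] = (tokens - 1.0, now)
    return True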
Multi-zone and multi-region balancing
To handle infrastructure failures, you typically spread backends across failure domains.
Active-active across zones
Backends run in multiple zones, and the load balancer distributes traffic across them. If one zone fails, health checks remove its backends and traffic shifts to remaining zones.
Practical considerations:
- Ensure each zone has enough capacity to handle failover load (N+1 planning).
- Watch for cross-zone data dependencies that can become bottlenecks.
Active-passive (standby)
One zone or region is primary; another is standby and receives little or no traffic until failover. This can simplify data consistency but requires reliable failover automation and regular testing.
Geo load balancing
Traffic is directed to the nearest or best-performing region. The decision can be based on latency measurements, geography, or regional health. The key pattern is to avoid sending users to a region that is “up” but degraded.
Step-by-step: choosing a load balancing pattern for an API
This practical checklist helps you pick an initial strategy and avoid common failure modes.
Step 1: Decide what you are balancing (requests vs connections)
- If your traffic is mostly short HTTP requests, request-level balancing is ideal.
- If you have long-lived connections (streaming, websockets), connection-level balancing matters more; least-connections or hashing can be more stable.
Step 2: Determine whether you need stickiness
- If the app is stateless, disable stickiness to maximize flexibility during failures.
- If you must keep in-memory session state, enable cookie-based affinity and configure failover behavior (break affinity when unhealthy).
Step 3: Pick a distribution algorithm
- Start with round robin for uniform workloads.
- Use least connections if request durations vary widely.
- Use least response time if you see degraded instances and want latency-aware routing.
- Use consistent hashing if you need per-user stickiness or cache locality.
Step 4: Implement readiness health checks
Create a lightweight endpoint (for example, /ready) that returns success only when the instance can serve production traffic. Keep it fast and avoid expensive dependency calls unless necessary.
// Example pseudo-handler for /ready (language-agnostic idea)
if (app_is_warmed_up && can_accept_requests && not_in_maintenance) {
  return 200 "ok"
} else {
  return 503 "not ready"
}
Configure the load balancer with conservative thresholds (for example, 3 failures to mark unhealthy, 2 successes to mark healthy) to reduce flapping.
Step 5: Decide retry policy to avoid duplicate side effects
- Enable automatic retries only for connection failures and timeouts on idempotent requests.
- For non-idempotent operations, use idempotency keys and deduplication in the application.
Step 6: Add overload protection
- Set per-backend connection/request limits.
- Define a maximum queue time; fail fast beyond it.
- Implement rate limiting for abusive clients and load shedding for non-critical endpoints.
Step 7: Validate with failure drills
Test how the system behaves when things go wrong:
- Terminate one backend: confirm traffic shifts and error rate stays acceptable.
- Introduce latency on one backend: confirm latency-aware routing or circuit breaking reduces impact.
- Kill an entire zone: confirm remaining zones handle load and that health checks remove the failed zone quickly.
Step-by-step: rolling deployments without dropping traffic
Load balancers are central to safe deployments. A common pattern is to gradually replace old instances with new ones while keeping the service available.
Step 1: Add new instances as “not ready”
Start new instances but keep readiness failing until initialization is complete (config loaded, caches warmed, migrations done if applicable).
Step 2: Mark ready and ramp traffic
Once readiness passes, the load balancer starts sending traffic. If you support weights, begin with a small weight for the new version and increase gradually.
Step 3: Drain old instances
Draining means stopping new requests from being assigned to an instance while allowing in-flight requests to finish. Configure a drain timeout appropriate to your longest typical request.
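A sketch of draining, assuming the balancer tracks in-flight requests per backend (as in the least-connections example), keeps healthy backends in a set, and uses a drain timeout sized to your longest typical request:

import time

DRAIN_TIMEOUT_S = 60   # should exceed the longest typical request

def drain(backend, active_counts, healthy_backends):
    """Stop assigning new requests, then wait for in-flight work to finish."""
    healthy_backends.discard(backend)          # no new requests land here
    deadline = time.monotonic() + DRAIN_TIMEOUT_S
    while active_counts.get(backend, 0) > 0 and time.monotonic() < deadline:
        time.sleep(1)                          # poll until idle or timeout
    return active_counts.get(backend, 0) == 0  # True if drained cleanly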
Step 4: Remove old instances
After drain completes, terminate old instances. If you use sticky sessions, ensure the stickiness TTL is compatible with draining; otherwise, users may keep getting routed to draining instances.
Common pitfalls and how to avoid them
Uneven load due to “hot keys”
If you hash on user ID or another key, a small number of “heavy” users can overload one backend. Mitigations include:
- Use a different key (or include more entropy).
- Shard heavy tenants explicitly across multiple backends.
- Use power-of-two choices with load awareness instead of strict hashing.
Health checks that cause outages
If readiness depends on a flaky dependency, you can mark all backends unhealthy and take the service down even though it could have served partial functionality. Consider:
- Make readiness reflect the ability to serve core requests.
- Use separate checks for observability vs routing decisions.
- Degrade gracefully: serve cached or limited responses rather than failing readiness.
Retry storms
When a backend is slow, clients and load balancers may retry, multiplying traffic and making the problem worse. Controls:
- Limit retries and use exponential backoff.
- Use a retry budget (cap retries as a fraction of total traffic).
- Combine with circuit breaking to stop sending traffic to failing instances.
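A client-side sketch combining capped attempts, exponential backoff with jitter, and a simple retry budget (retries allowed only as a fraction of recent requests); all of the numbers are illustrative:

import random
import time

MAX_ATTEMPTS = 3
BASE_DELAY_S = 0.1
RETRY_BUDGET = 0.1   # allow retries up to 10% of total requests

stats = {"requests": 0, "retries": 0}

def call_with_retries(send):
    """`send` is an assumed callable that raises an exception on failure."""
    for attempt in range(MAX_ATTEMPTS):
        stats["requests"] += 1
        try:
            return send()
        except Exception:
            over_budget = stats["retries"] > RETRY_BUDGET * stats["requests"]
            if attempt == MAX_ATTEMPTS - 1 or over_budget:
                raise                     # fail fast instead of amplifying load
            stats["retries"] += 1
            backoff = BASE_DELAY_S * (2 ** attempt)
            time.sleep(backoff + random.uniform(0, backoff))  # add jitter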
Long-lived connections pin traffic
With websockets or streaming, a backend can accumulate many long connections and become “full” even if request rate is low. Use least-connections, enforce connection limits, and consider connection draining during deploys.
Observability signals that matter for load balancing
To know whether your load balancing pattern is working, monitor:
- Per-backend request rate: should be roughly even unless intentionally weighted.
- Error rate by backend: helps detect partial failures and bad deploys.
- Latency percentiles (p50/p95/p99) by backend and overall: identifies stragglers.
- Health check status changes: frequent toggling indicates flapping thresholds or unstable instances.
- Retry counts: rising retries can precede an outage.
- Backend saturation: CPU, memory, connection counts, queue depth.
Use these metrics to tune weights, thresholds, and routing algorithms. For example, if one backend consistently shows higher p99 latency, least-response-time routing or circuit breaking can reduce its impact while you investigate the root cause.