
Cloud-Native Web Serving with Kubernetes Ingress and Service Mesh


Scaling and Performance Bottlenecks in Ingress and Web Server Layers

Chapter 19


Where Scaling Breaks: A Mental Model for the Ingress and Web Server Layers

What this layer does: the ingress and web server layers are the “north-south” entry point for user traffic. They terminate or pass through TLS, apply L7 routing rules, and forward requests to Kubernetes Services and Pods. Performance problems here often look like “the whole site is slow,” even when application Pods are healthy, because these components sit on the critical path for every request.

How to reason about bottlenecks: treat the path as a pipeline of constrained resources: (1) client-to-edge network, (2) load balancer to ingress, (3) ingress CPU/memory and worker concurrency, (4) connection pools and upstream keepalive, (5) kube-proxy/IPVS/iptables and Service routing, (6) node network and Pod networking, and (7) the web server inside the workload (for example NGINX/Apache/Envoy sidecar or app server). A bottleneck is usually a saturation of one resource (CPU, memory, file descriptors, conntrack, bandwidth, or queue depth) that causes latency to rise nonlinearly.

Why scaling is tricky: scaling ingress replicas increases capacity, but it also changes connection distribution, cache hit rates, and TLS session reuse. Scaling web servers increases upstream capacity, but can expose new limits such as node-level conntrack, Service load balancing overhead, or uneven traffic spread due to sticky sessions or long-lived connections.

Key Bottleneck Categories and Their Symptoms

CPU Saturation (TLS, Regex Routing, Compression, WAF)

Typical symptoms: ingress Pod CPU near limits, increased request latency, and rising 5xx due to timeouts under load. CPU spikes often correlate with TLS handshakes, heavy header manipulation, complex path matching, or response compression.

Common causes: too many full TLS handshakes (no session reuse), expensive cipher suites, large certificate chains, per-request regex evaluation in routing rules, or enabling gzip/brotli at the ingress for already-compressed content.
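
If you run the community ingress-nginx controller, several of these costs are tunable through its ConfigMap. The sketch below is conceptual, not a drop-in configuration: the ConfigMap name and key names are specific to ingress-nginx, and the values are illustrative.

# Conceptual ingress-nginx ConfigMap tuning for TLS and compression cost (verify keys for your version)
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # typical name when installed from the official manifests/Helm chart
  namespace: ingress-nginx
data:
  use-gzip: "false"                # avoid recompressing already-compressed responses at the edge
  ssl-protocols: "TLSv1.2 TLSv1.3" # modern protocols only
  ssl-session-cache-size: "10m"    # improve TLS session reuse across handshakes
  ssl-session-timeout: "10m"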


Connection and File Descriptor Limits (Too Many Sockets)

Typical symptoms: errors like “too many open files,” sudden inability to accept new connections, or a plateau in RPS even though CPU is available. Long-lived connections (WebSockets, SSE, gRPC streams) amplify this.

Common causes: low ulimit for open files, insufficient worker connections, too few ingress replicas for the number of concurrent clients, or upstream keepalive misconfiguration causing excessive connection churn.

Queueing and Concurrency Limits (Workers, Threads, Event Loops)

Typical symptoms: latency climbs while throughput stays flat; p95/p99 explode first. You may see backlog growth at the load balancer or ingress accept queue.

Common causes: ingress worker processes too low, per-worker connection limits too low, web server thread pools saturated, or application server concurrency lower than expected (for example, a single-threaded runtime behind a high-RPS ingress).

Memory Pressure and Buffering (Large Headers/Bodies)

Typical symptoms: OOMKills of ingress/web server Pods, or sudden latency spikes when buffering to disk occurs. Large request bodies (uploads) and large response bodies can cause buffering and memory fragmentation.

Common causes: buffering enabled with large bodies, insufficient memory limits, large headers (cookies, JWTs), or mis-sized proxy buffers.

Node and Kernel Limits (conntrack, ephemeral ports, NIC bandwidth)

Typical symptoms: packet drops, intermittent connection failures under load, or performance that degrades only when traffic crosses a threshold. These issues can appear “random” because they depend on kernel tables and timeouts.

Common causes: conntrack table exhaustion, too-low conntrack hash size, ephemeral port exhaustion on egress-heavy nodes, or NIC bandwidth saturation. In Kubernetes, these are often node-level, so adding more Pods on the same node does not help.
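
One place these node-level limits surface in Kubernetes is kube-proxy's configuration. The sketch below is illustrative only and assumes the kubeproxy.config.k8s.io/v1alpha1 API; size conntrack against measured usage, not these example numbers.

# Conceptual kube-proxy conntrack tuning (values are examples, not recommendations)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 65536           # raise the per-core conntrack budget
  min: 262144                 # lower bound on the conntrack table size
  tcpEstablishedTimeout: 1h   # shorter than the long default so stale entries are freed sooner
  tcpCloseWaitTimeout: 1h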

Uneven Load Distribution (Hot Pods, Sticky Sessions, L7 Hashing)

Typical symptoms: some ingress replicas or backend Pods are overloaded while others are idle. Overall cluster metrics look fine, but tail latency is bad.

Common causes: session affinity, consistent hashing, long-lived connections pinned to a subset of replicas, or external load balancer behavior that does not rebalance frequently.

Measuring the Right Things: A Practical Observability Checklist

Ingress-Level Golden Signals

Start with four signals at the ingress: traffic (RPS), errors (4xx/5xx split), latency (p50/p95/p99), and saturation (CPU, memory, open connections, request queue depth). The goal is to determine whether the ingress is the bottleneck or merely the messenger for upstream slowness.

  • Latency breakdown: measure time to first byte (TTFB) and upstream response time if your ingress exposes it. Rising upstream time points to backend saturation; rising ingress processing time points to a bottleneck in the ingress itself.
  • Connection metrics: active connections, accepted connections per second, and connection errors. For long-lived protocols, track concurrent connections as a first-class capacity metric.
  • Resource headroom: CPU throttling (cgroup throttled time), memory working set, and OOM events. CPU throttling can cause latency even when average CPU looks moderate.

Node-Level and Service Routing Signals

When ingress metrics suggest saturation but Pods look fine, check node and kernel constraints. Node-level issues often manifest as sudden cliffs in performance.

  • conntrack usage: current entries vs max, and drop counters.
  • Network: NIC throughput, packet drops, retransmits, and queue length.
  • Service routing overhead: if using iptables mode, large rule sets can add latency; IPVS can behave differently under churn. Watch for CPU usage in kube-proxy and node softirq.

Step-by-Step: Capacity Planning for Ingress Replicas

Step 1: Establish a Baseline Load Profile

Define the traffic shape you must handle: peak RPS, peak concurrent connections, request/response sizes, and protocol mix (HTTP/1.1, HTTP/2, gRPC, WebSockets). Concurrency matters as much as RPS because it drives open sockets and memory buffers.

# Example baseline targets (write these down before testing)
peak_rps: 5000
peak_concurrent_connections: 20000
avg_response_size_kb: 25
p95_target_ms: 200
protocols: [http1, http2, websocket]

Step 2: Find Per-Replica Limits Under Realistic Conditions

Run a load test against a single ingress replica (or a controlled small number) with production-like TLS settings and routing rules. Increase load until you hit one of these: CPU saturation, connection limit, memory pressure, or latency target breach. Record the maximum sustainable RPS and maximum sustainable concurrent connections separately.

  • RPS capacity per replica is often limited by CPU (TLS, routing, compression).
  • Connection capacity per replica is often limited by file descriptors and worker connection settings.

Step 3: Convert Targets into Replica Counts with Headroom

Compute replicas for RPS and for concurrency, then take the larger number and add headroom for failover and uneven distribution.

# Example calculation
per_replica_rps_sustainable = 800
per_replica_conn_sustainable = 3500
target_rps = 5000
target_conn = 20000
replicas_for_rps = ceil(5000 / 800) = 7
replicas_for_conn = ceil(20000 / 3500) = 6
base_replicas = max(7, 6) = 7
headroom_factor = 1.3
final_replicas = ceil(7 * 1.3) = 10

Also account for disruption: if you want to survive one node failure or one replica eviction without breaching SLOs, ensure the remaining replicas can carry the load.
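
One way to encode that requirement is a PodDisruptionBudget on the ingress Deployment, so voluntary disruptions (node drains, cluster upgrades) cannot take you below the capacity you sized. A minimal sketch; the label selector is a hypothetical one for your ingress Pods.

# Conceptual PodDisruptionBudget for the ingress replicas (selector label is an assumption)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-pdb
spec:
  minAvailable: 8              # with final_replicas = 10, tolerate losing at most 2 at once
  selector:
    matchLabels:
      app: ingress-nginx       # hypothetical label; match your controller's Pod labels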

Step 4: Configure Autoscaling on the Right Metric

CPU-based autoscaling is sometimes sufficient for pure HTTP request/response traffic, but it can fail for long-lived connections where CPU stays low while connection counts are high. Prefer autoscaling signals that match your bottleneck.

  • CPU-based HPA: good when TLS and request processing dominate.
  • Connection-based scaling: better for WebSockets/SSE/gRPC streams. Use a custom metric such as active connections per Pod (see the conceptual snippet after the CPU example below).
  • Latency-based scaling: can work but is reactive; use carefully to avoid oscillations.
# Example HPA snippet (conceptual)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 4
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
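
For the connection-based option, HPA v2 can scale on a Pods-type custom metric served by a metrics adapter. The snippet below is conceptual and assumes a hypothetical per-Pod metric named nginx_ingress_active_connections is already exposed in your metrics pipeline.

# Conceptual connection-based HPA (metric name and target value are assumptions)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 4
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: nginx_ingress_active_connections   # hypothetical custom metric per Pod
      target:
        type: AverageValue
        averageValue: "2500"                      # roughly 70% of the per-replica connection capacity above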

Step-by-Step: Eliminating Common Ingress Performance Bottlenecks

1) Reduce TLS Handshake Cost and Improve Reuse

If CPU spikes correlate with new connections, focus on handshake reduction. Ensure clients can reuse connections (keepalive) and sessions. Also verify that your external load balancer and ingress support HTTP/2 where appropriate, because multiplexing reduces connection count.

  • Enable and tune keepalive between client and ingress where safe.
  • Enable upstream keepalive from ingress to backends to reduce connection churn.
  • Prefer modern ciphers and avoid unnecessary certificate chain bloat.
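
With ingress-nginx, client-side and upstream keepalive are both controlled from the controller ConfigMap. A hedged sketch; key names are specific to that controller and the values are starting points to test, not recommendations.

# Conceptual keepalive tuning for ingress-nginx (verify keys for your controller version)
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  keep-alive: "75"                       # client keepalive timeout in seconds
  keep-alive-requests: "1000"            # requests allowed per client connection
  upstream-keepalive-connections: "200"  # idle connections kept open to each upstream
  upstream-keepalive-requests: "10000"
  upstream-keepalive-timeout: "60"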

2) Simplify Routing Rules and Avoid Expensive Matching

Ingress rules that rely on complex regular expressions or many path rules can increase per-request CPU. Consolidate rules, avoid regex where possible, and keep rule sets small. If you have hundreds of hosts and paths, consider splitting ingress resources across multiple controllers or shards.

  • Group related hosts into separate ingress controllers to reduce config size per controller.
  • Prefer prefix/path matching over regex when possible.
  • Watch reload frequency: frequent config reloads can cause latency spikes.
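
As a concrete illustration, prefix matching can usually replace regex-style routing. The resource below is a minimal sketch with placeholder host and Service names.

# Conceptual Ingress using prefix matching instead of regex (names are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-routes
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix        # cheap prefix match, no per-request regex evaluation
        backend:
          service:
            name: api-v1        # placeholder Service name
            port:
              number: 80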

3) Tune Worker Concurrency and File Descriptor Limits

For event-driven proxies, concurrency is governed by worker processes and per-worker connection limits. For threaded servers, it is thread pools and accept queues. Ensure OS and container limits allow the desired number of sockets.

  • Increase open file limits (ulimit) and verify inside the container.
  • Set worker connections to match expected concurrency with headroom.
  • Ensure the node has sufficient ephemeral ports and conntrack capacity.
# Quick checks (run in the ingress Pod)
cat /proc/sys/fs/file-max
ulimit -n
ss -s
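
If those checks show kernel headroom but the proxy still refuses connections, the limits are usually in the controller configuration. A hedged ingress-nginx ConfigMap sketch; values are illustrative.

# Conceptual worker and file-descriptor tuning for ingress-nginx
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  worker-processes: "auto"           # one worker per available CPU core
  max-worker-connections: "65536"    # per-worker connection ceiling
  max-worker-open-files: "131072"    # per-worker open file descriptor ceiling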

4) Avoid Unnecessary Buffering and Right-Size Buffers

Large uploads and large headers can force buffering. If you terminate TLS at ingress and accept large bodies, ensure buffer sizes and max body limits are explicit. If you stream to backends, confirm that buffering is not forcing full-body reads into memory.

  • Set explicit maximum request body size aligned with product needs.
  • Right-size header buffers if you carry large cookies or tokens.
  • Prefer streaming for large uploads when supported end-to-end.
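
With ingress-nginx these limits are commonly set per-Ingress through annotations. The sketch below uses placeholder values to show where each limit lives, not recommended numbers.

# Conceptual per-Ingress buffering limits (annotation names are ingress-nginx specific)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: upload-api
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"          # explicit maximum request body
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"        # header buffer, sized for large cookies/JWTs
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"  # stream large uploads instead of buffering them
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: upload-service   # placeholder Service name
      port:
        number: 80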

5) Spread Load Evenly and Prevent Hot Spots

Even with enough replicas, uneven distribution can overload a subset. Validate distribution at three points: external load balancer to ingress Pods, ingress to backend Pods, and backend Pod placement across nodes.

  • Use Pod anti-affinity or topology spread constraints for ingress Pods to avoid co-locating too many on one node.
  • Ensure readiness probes are accurate so overloaded Pods can be removed from rotation.
  • Be cautious with session affinity; if required, size for worst-case skew.
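
For the first bullet, topology spread constraints keep ingress replicas from piling onto a single node. A minimal sketch to place under the ingress Deployment's Pod template spec; the label is a hypothetical one for your controller.

# Conceptual topology spread for ingress Pods (goes under the Deployment's Pod template spec)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname   # spread across nodes
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: ingress-nginx                # hypothetical label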

Web Server Layer Bottlenecks Inside the Workload

Static File Serving vs Dynamic App Serving

Many Kubernetes workloads include a web server (NGINX/Apache/Caddy) in front of an app server. Static file serving is typically limited by disk I/O, kernel page cache, and network throughput, while dynamic serving is limited by CPU and upstream calls. If you serve large static assets, consider whether the ingress should pass through to a dedicated static tier or object storage, reducing pressure on both ingress and app Pods.

Thread Pools, Event Loops, and Blocking Work

Web servers and app servers have different concurrency models. A common bottleneck is blocking work on an event loop (for example, synchronous filesystem or DNS calls), which reduces effective concurrency and increases tail latency. Another is a thread pool sized too small for peak concurrency, causing queueing.

  • Measure request queue depth at the web server/app server.
  • Increase worker threads/processes only if CPU and memory allow it.
  • Prefer non-blocking I/O patterns for high concurrency workloads.

Keepalive and Upstream Connection Pooling

Connection churn between the web server and upstream app can dominate latency under load. Ensure keepalive is enabled and sized appropriately. Too small a pool causes frequent reconnects; too large a pool can waste file descriptors and memory.

# Conceptual targets to document
upstream_keepalive_connections_per_pod: 200
max_idle_time_seconds: 30
max_requests_per_connection: 10000

Scaling Patterns: Sharding, Multi-Ingress, and Dedicated Gateways

When a Single Ingress Controller Becomes a Scaling Wall

At high scale, a single ingress controller may hit limits in configuration size, reload time, or control-plane churn. A practical pattern is to run multiple ingress controllers, each responsible for a subset of domains or namespaces. This reduces per-controller config and isolates noisy tenants.

  • Shard by domain: example.com on one controller, api.example.com on another.
  • Shard by namespace/team: each team gets a controller with quotas.
  • Shard by protocol: separate controllers for WebSockets/gRPC vs standard HTTP.
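
Sharding is usually expressed with one IngressClass per controller deployment, and Ingress objects selecting a class via spec.ingressClassName. A hedged sketch; the second controller value is a hypothetical example.

# Conceptual IngressClass sharding (second controller value is a placeholder)
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: public-web
spec:
  controller: k8s.io/ingress-nginx          # served by one ingress-nginx deployment
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: realtime
spec:
  controller: example.com/ingress-realtime  # hypothetical controller dedicated to WebSockets/gRPC

Each team or traffic type then sets ingressClassName on its Ingress resources to land on the intended shard.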

Dedicated Ingress for Heavy Endpoints

Some endpoints are disproportionately expensive: large downloads, uploads, or streaming. Put them behind a dedicated ingress class with tuned buffers, timeouts, and scaling metrics. This prevents heavy traffic from degrading latency for the rest of the site.

Load Testing and Bottleneck Isolation Workflow

Step-by-Step: A Repeatable Experiment Loop

Use a disciplined loop to avoid guessing. Change one variable at a time, and keep test conditions stable.

  • Step 1: pick a single hypothesis (for example, “TLS handshakes are saturating CPU”).
  • Step 2: choose a metric that should change if the hypothesis is true (CPU per request, handshake rate, new connections/sec).
  • Step 3: run a controlled load test and capture ingress, node, and backend metrics.
  • Step 4: apply one change (enable HTTP/2, increase keepalive, add replicas, adjust worker connections).
  • Step 5: rerun the same test and compare p95/p99 latency, error rate, and saturation metrics.

Separating Ingress Bottlenecks from Backend Bottlenecks

A practical technique is to add a lightweight “echo” backend that returns immediately. If ingress latency is high even when upstream is trivial, the ingress is the bottleneck. If ingress latency is low with echo but high with the real service, the bottleneck is downstream.

# Example echo deployment idea (conceptual)
# /echo returns 200 OK with a small body and no upstream calls
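
A concrete version of that idea, assuming the commonly used echoserver test image (any minimal HTTP responder will do), might look like the sketch below; route a test path such as /echo to the Service through the same ingress rules you use for real traffic.

# Conceptual echo backend for bottleneck isolation (image choice is an assumption)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - name: echo
        image: registry.k8s.io/echoserver:1.4   # small HTTP echo image; substitute any equivalent
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  selector:
    app: echo
  ports:
  - port: 80
    targetPort: 8080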

Kubernetes-Specific Scaling Gotchas at the Edge

Pod Placement and Node Sizing

Ingress Pods are network-intensive. If you schedule too many on a single node, you can saturate that node’s NIC or conntrack table even if cluster-wide capacity looks sufficient. Use topology spread constraints and ensure nodes hosting ingress have adequate network bandwidth and CPU for TLS.

External Load Balancer Behavior

Not all load balancers distribute evenly under all conditions. Some use connection-based balancing, which can create hot spots with long-lived connections. Validate distribution by comparing per-ingress-Pod connection counts and request rates. If skew is persistent, you may need more ingress replicas, different balancing settings, or to separate long-lived traffic onto a dedicated entry point.

Config Reload and Control-Plane Churn

Frequent changes to ingress resources can trigger reloads or config updates that temporarily reduce throughput. If you deploy many services frequently, consider batching changes, reducing unnecessary ingress object churn, or sharding controllers so each sees fewer updates.

Now answer the exercise about the content:

When autoscaling ingress replicas, which metric is typically more appropriate for long-lived connections like WebSockets or gRPC streams?


Answer: active connections per Pod. Long-lived connections can saturate file descriptors and concurrency while CPU remains moderate, so scaling on active connections per Pod matches the bottleneck better than CPU-based scaling.

Next chapter

Production Readiness Checklist for a Cloud-Provider Agnostic Web Platform
