Cloud-Native Web Serving with Kubernetes Ingress and Service Mesh

Cloud-Native Web Serving Architecture on Kubernetes

Chapter 1

What “Cloud-Native Web Serving” Means on Kubernetes

Cloud-native web serving on Kubernetes is an architectural approach where HTTP(S) traffic enters a cluster through well-defined entry points, is routed to the correct workloads, and is managed with consistent policies for reliability, security, and observability. Instead of treating a web server as a single machine that listens on port 80 or 443, you model the system as a set of cooperating components: edge routing, internal service discovery, workload scaling, and optional service-to-service traffic management. The goal is to make web delivery repeatable across environments, resilient to failures, and adaptable to changing traffic patterns.

In practice, a Kubernetes-based web serving architecture separates concerns: (1) how traffic gets into the cluster, (2) how requests are routed to the right application, (3) how the application talks to other services, and (4) how you apply cross-cutting policies such as TLS, authentication, rate limiting, retries, timeouts, and metrics. This separation allows teams to evolve each layer independently: you can change routing rules without rebuilding images, scale workloads without changing DNS, and apply security policies without modifying application code.

Core Building Blocks and Their Responsibilities

Workloads: Pods, Deployments, and Autoscaling

At the bottom of the stack are your workloads: Pods running your web applications and supporting components. Most web apps are managed by a Deployment (or StatefulSet when stable identity/storage is required). The Deployment ensures the desired number of replicas are running and replaces unhealthy Pods. For traffic spikes, you typically add a HorizontalPodAutoscaler (HPA) to scale replicas based on CPU, memory, or custom metrics (for example, requests per second).

A key architectural decision is to keep Pods stateless whenever possible. Stateless web Pods can be scaled horizontally and replaced freely. If you need sessions, prefer external session stores (like Redis) or use sticky sessions at the edge only when necessary. For file uploads or generated assets, use object storage or persistent volumes behind a dedicated service rather than writing to the container filesystem.

Service: Stable Virtual IP and Load Balancing Inside the Cluster

A Kubernetes Service provides a stable virtual IP and DNS name that load-balances traffic to a set of Pods selected by labels. This decouples clients from Pod IPs, which change frequently. For web serving, Services are the internal “targets” that routing layers send traffic to. Services also enable rolling updates: as new Pods become ready, they join the Service endpoints; as old Pods terminate, they leave.

Architecturally, you can think of a Service as the contract boundary for a backend. Route to Services, not directly to Pods. This keeps routing rules stable and makes it easier to introduce canary releases, blue/green deployments, or versioned backends.

Ingress / Gateway: HTTP Entry Point and L7 Routing

To expose HTTP(S) applications, you typically use an Ingress controller or a Gateway API implementation. This component terminates client connections, applies L7 routing rules (hostnames, paths, headers), and forwards requests to Services. It is the edge of your cluster from the perspective of web traffic. Even if you use a cloud load balancer, the load balancer usually forwards to the Ingress/Gateway, which then performs the application-aware routing.

At this layer you define: which hostnames map to which backends, how TLS certificates are selected, whether HTTP is redirected to HTTPS, and how to handle path rewrites. This is also a common place to implement request buffering limits, maximum body size, and basic rate limiting (depending on the controller).

Service Mesh (Optional): Consistent Policies for East-West Traffic

A service mesh adds a dedicated data plane (often sidecar proxies or node-level proxies) and a control plane to manage service-to-service traffic inside the cluster. While the edge layer handles north-south traffic (client to cluster), the mesh focuses on east-west traffic (service to service). It can provide mutual TLS between services, fine-grained traffic policies, retries/timeouts, circuit breaking, and rich telemetry without requiring each application to implement these features.

Architecturally, the mesh becomes the “runtime network” for internal calls. You still route external traffic through an Ingress/Gateway, but once inside, requests can be governed by mesh policies. This is especially useful in microservice architectures where consistent security and observability are hard to implement uniformly in application code.
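
For example, if Istio is the mesh, retries and timeouts for an internal service can be declared once and enforced by the proxies, with no application changes. The sketch below is illustrative only: it assumes Istio and an internal Service named orders.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
  - orders                        # internal Service name (illustrative)
  http:
  - route:
    - destination:
        host: orders
    timeout: 2s                   # total time budget for a request to this service
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure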

Reference Architecture: Request Flow End-to-End

A typical request flow looks like this: a client resolves a DNS name to a public IP (often a cloud load balancer). The load balancer forwards traffic to the cluster’s edge component (Ingress/Gateway). The edge component terminates TLS (or passes it through), selects a route based on host/path, and forwards the request to a Kubernetes Service. The Service load-balances to one of the ready Pods. The Pod may call other internal Services; if a service mesh is present, those calls are intercepted and managed by the mesh data plane. Responses travel back along the same path to the client.

This flow highlights where you attach policies. TLS and WAF-like controls are commonly at the edge. Authentication and authorization may be at the edge, in the application, or both. Retries and timeouts can be applied at the edge for client-to-service calls and in the mesh for service-to-service calls. Observability spans all layers: edge access logs, application logs, and mesh metrics/traces.

Design Principles for Cloud-Native Web Serving

Separate Edge Concerns from Application Concerns

Keep the edge responsible for routing and transport-level policies (TLS, redirects, basic request limits). Keep the application responsible for business logic and domain-level authorization. When you blur these boundaries, changes become risky: a routing change might require an app redeploy, or an app change might inadvertently break TLS behavior. A clean separation also enables platform teams to manage edge configuration while application teams focus on code.

Prefer Declarative Configuration and Immutable Artifacts

In Kubernetes, you describe desired state in manifests. For web serving, treat routing rules, TLS settings, and policy configuration as versioned, reviewed artifacts. Avoid manual changes in controllers or load balancers. This enables repeatable deployments across dev/stage/prod and supports GitOps workflows where the cluster converges to the declared configuration.

Design for Failure: Health Checks, Readiness, and Timeouts

Web serving architectures must assume that Pods will be restarted, nodes will be drained, and networks will occasionally fail. Use readiness probes so Pods only receive traffic when they are actually ready (for example, after warming caches or completing migrations). Use liveness probes carefully to detect deadlocks, but avoid overly aggressive settings that cause restart loops. Apply timeouts at the edge and between services so failures do not cascade and exhaust resources.

Make Scaling Predictable

Scaling is not only about adding replicas; it is also about ensuring the edge and internal routing layers can handle increased connections. Ensure your Ingress/Gateway has enough replicas and resources, and consider autoscaling it as well. For applications, define resource requests/limits, and scale based on meaningful metrics (CPU alone may not reflect request load for I/O-bound services). If you use a mesh, consider the overhead of proxies when sizing Pods.

Practical Step-by-Step: Build a Minimal Web Serving Stack

Step 1: Deploy a Simple Web Application

This example uses a basic HTTP echo service to demonstrate routing. You can replace it with your own container image later. Create a Deployment and a Service. The Service will be the stable backend target for routing.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-echo
  labels:
    app: web-echo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-echo
  template:
    metadata:
      labels:
        app: web-echo
    spec:
      containers:
      - name: echo
        image: ealen/echo-server:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            cpu: "250m"
            memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: web-echo
spec:
  selector:
    app: web-echo
  ports:
  - name: http
    port: 80
    targetPort: 80

Step 2: Add an Edge Route (Ingress Example)

Assuming you already have an Ingress controller installed in the cluster, define an Ingress that routes a hostname to the Service. Choose a hostname you control in DNS (or use a local hosts file for testing). The Ingress controller will watch this resource and program its proxy accordingly.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-echo
spec:
  rules:
  - host: echo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-echo
            port:
              number: 80

After applying, verify that the Ingress has an address and that requests reach the backend. In many environments, you will point DNS for echo.example.com to the external IP of the load balancer in front of the Ingress controller. If you are in a local cluster, you may use port-forwarding or a local load balancer integration.

Step 3: Enable TLS Termination at the Edge

For HTTPS, the edge component needs a certificate. A common pattern is to store the certificate and key in a Kubernetes Secret of type kubernetes.io/tls, then reference it from the Ingress. The example below assumes you already created a Secret named echo-tls in the same namespace.
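
If you manage the certificate yourself, a minimal sketch of that Secret looks like the following; the data values are placeholders for your own base64-encoded PEM certificate and key.

apiVersion: v1
kind: Secret
metadata:
  name: echo-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>   # placeholder
  tls.key: <base64-encoded private key>   # placeholder

With the Secret in place, reference it from the Ingress: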

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-echo
spec:
  tls:
  - hosts:
    - echo.example.com
    secretName: echo-tls
  rules:
  - host: echo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-echo
            port:
              number: 80

Architecturally, TLS termination at the edge simplifies backend services: they can speak plain HTTP inside the cluster while clients use HTTPS. If you require end-to-end encryption, you can also use TLS from edge to backend, but that introduces certificate distribution and validation complexity. Decide based on your threat model and compliance requirements.

Step 4: Add Readiness Probes to Protect Users During Rollouts

Readiness probes prevent traffic from reaching Pods that are not ready. For web apps, a dedicated /healthz endpoint is common. For the echo server, you may not have a custom endpoint, but in real applications you should implement one that checks critical dependencies (for example, database connectivity) without being too expensive. The manifest below repeats the Deployment from Step 1 with a readiness probe added.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-echo
  labels:
    app: web-echo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-echo
  template:
    metadata:
      labels:
        app: web-echo
    spec:
      containers:
      - name: echo
        image: ealen/echo-server:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            cpu: "250m"
            memory: "256Mi"
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 2
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3

With readiness in place, rolling updates become safer: Kubernetes will only add new Pods to the Service endpoints after the probe succeeds. This reduces the chance of users seeing errors during deployments.

Step 5: Introduce Autoscaling for the Web Tier

Autoscaling helps handle variable traffic. The simplest starting point is CPU-based scaling. Ensure your Deployment has CPU requests set (as in Step 1), then create an HPA. In production, consider scaling on request rate or latency if you have metrics available, because CPU is not always correlated with web load.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-echo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-echo
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
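
As an alternative to CPU-based scaling, the sketch below scales on a per-Pod request rate. It assumes your cluster runs a custom metrics adapter (for example, a Prometheus-based one) that exposes a per-Pod metric named http_requests_per_second; the metric name and target value are illustrative and depend entirely on your setup. Use it instead of, not alongside, the CPU-based HPA above, since two HPAs should not target the same Deployment.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-echo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-echo
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed metric from your metrics adapter
      target:
        type: AverageValue
        averageValue: "50"               # target of roughly 50 requests per second per Pod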

Routing Patterns for Real Applications

Host-Based Routing for Multi-App Clusters

When multiple applications share the same cluster, host-based routing is a clean pattern: app1.example.com routes to Service app1, app2.example.com routes to Service app2. This keeps URL paths stable and reduces the need for path rewriting. It also aligns well with TLS, since certificates can be issued per hostname or as a wildcard.
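
A sketch of that pattern on a single Ingress (the hostnames and Service names app1 and app2 are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: multi-app
spec:
  rules:
  - host: app1.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app1            # Service for the first application
            port:
              number: 80
  - host: app2.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app2            # Service for the second application
            port:
              number: 80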

Path-Based Routing for Consolidated Domains

Path-based routing is useful when you want a single domain such as example.com with /api, /app, and /static routed to different backends. Be careful with path rewriting and trailing slashes, and ensure your applications generate correct absolute URLs. For APIs, path-based routing is common; for browser apps, host-based routing often avoids subtle issues with cookies, CORS, and asset paths.
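
A sketch of path-based routing on one host (the backend Service names api and app are illustrative; whether the prefix is stripped before reaching the backend depends on your controller's rewrite settings):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-com
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api             # API backend
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app             # browser app backend (catch-all)
            port:
              number: 80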

Blue/Green and Canary at the Routing Layer

A cloud-native architecture allows you to shift traffic between versions without changing clients. One approach is to run two Deployments (v1 and v2) behind two Services, then update routing rules to switch traffic. Another approach is to use weighted routing if your edge or mesh supports it, gradually sending a percentage of requests to the new version. Even without advanced features, you can approximate canaries by using separate hostnames (canary.example.com) for validation before switching the main hostname.
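
If your edge implements the Gateway API, weighted routing can be declared directly on an HTTPRoute. The sketch below is illustrative: it assumes a Gateway named web-gateway and two Services, web-echo-v1 and web-echo-v2, one per Deployment version.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-echo-canary
spec:
  parentRefs:
  - name: web-gateway              # assumed Gateway at the cluster edge
  hostnames:
  - echo.example.com
  rules:
  - backendRefs:
    - name: web-echo-v1
      port: 80
      weight: 90                   # 90% of requests stay on the current version
    - name: web-echo-v2
      port: 80
      weight: 10                   # 10% of requests go to the canary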

Security and Policy Placement in the Architecture

Transport Security: TLS at the Edge and Beyond

At minimum, terminate TLS at the edge and enforce HTTPS. Decide whether to encrypt internal traffic. If you operate in a zero-trust model or have strict compliance requirements, encrypting service-to-service traffic is often mandatory. A service mesh can standardize mutual TLS between services, but you must also manage identity (service accounts), certificate rotation, and authorization policies.
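
As one example, if Istio is the mesh, a namespace-wide policy like the sketch below requires mutual TLS for all workloads in that namespace (the namespace name is illustrative):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: web                   # illustrative namespace
spec:
  mtls:
    mode: STRICT                   # reject plaintext service-to-service traffic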

Authentication and Authorization: Edge, App, and Mesh

Authentication can be implemented at the edge (for example, validating JWTs) to offload common checks and block unauthenticated traffic early. Authorization is often application-specific and belongs in the app, but some coarse-grained policies (for example, “service A may call service B”) can be enforced by a mesh. A practical pattern is: edge validates identity and basic scopes, application enforces business rules, and mesh enforces service-to-service access boundaries.
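
In an Istio-based mesh, the "service A may call service B" rule can be written as an AuthorizationPolicy. The sketch below is illustrative: the namespace, workload label, and service account names are assumptions.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-frontend
  namespace: web
spec:
  selector:
    matchLabels:
      app: orders                                          # policy applies to the orders workload
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/web/sa/frontend"]   # identity of the allowed caller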

Rate Limiting and Abuse Protection

Rate limiting is typically most effective at the edge because it prevents abusive traffic from consuming internal resources. Depending on your controller, you may configure per-IP or per-token limits. For internal APIs, you may also apply rate limits between services to prevent a noisy neighbor from overwhelming a dependency. Keep in mind that rate limiting requires careful key selection (IP, user ID, API key) and clear error handling (429 responses with retry guidance).
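
The mechanism depends on your controller. As one example, the NGINX Ingress Controller supports per-client limits through annotations; the values below are illustrative.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-echo
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"           # ~10 requests per second per client IP
    nginx.ingress.kubernetes.io/limit-connections: "20"   # concurrent connections per client IP
spec:
  rules:
  - host: echo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-echo
            port:
              number: 80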

Observability: Knowing What the Web Layer Is Doing

Logs: Edge Access Logs and Application Logs

For web serving, you usually need both edge access logs (method, path, status, latency, upstream) and application logs (business events, errors). Ensure logs include correlation identifiers such as request IDs. Many edge proxies can generate a request ID if the client does not provide one; propagate it to backends via headers so you can trace a request across services.

Metrics: Golden Signals for Web Serving

Track latency, traffic, errors, and saturation. At the edge, measure request rate and response codes per route/host. At the Service/Pod level, measure resource usage and application-level metrics such as handler latency. If you use a mesh, it can provide consistent metrics for service-to-service calls, including retries and timeouts, which helps distinguish application errors from network or policy issues.

Tracing: Following a Request Through Multiple Services

Distributed tracing becomes important as soon as a request touches multiple services. Ensure trace context headers are propagated. If you use a mesh, it may automatically emit spans for service-to-service hops; your applications should still create spans for internal operations (database queries, cache calls) to make traces actionable.

Common Pitfalls and How to Avoid Them

Routing Directly to Pods Instead of Services

Routing directly to Pod IPs breaks during rescheduling and rolling updates. Always route to Services so Kubernetes can manage endpoint membership based on readiness and health.

Missing Timeouts Leading to Resource Exhaustion

Without timeouts, slow upstreams can cause connection buildup at the edge and in clients, eventually exhausting worker threads or file descriptors. Define reasonable timeouts at the edge and between services, and ensure your applications handle cancellations properly.

Overloading the Edge Component

The Ingress/Gateway is a shared critical component. If it runs with too few replicas or insufficient resources, it becomes a bottleneck. Monitor its CPU/memory, connection counts, and request latency. Consider isolating high-traffic apps with dedicated gateways or namespaces and apply resource quotas to prevent one team from starving others.

Ignoring Readiness During Startup and Deployments

Applications that take time to initialize must not receive traffic too early. Use readiness probes and, when needed, startup probes. Also ensure graceful termination is configured so Pods stop receiving traffic before shutting down, allowing in-flight requests to complete.
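
A sketch of those settings for the web-echo Deployment (probe thresholds and the preStop delay are illustrative, and the preStop hook assumes the container image ships a sleep binary):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-echo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-echo
  template:
    metadata:
      labels:
        app: web-echo
    spec:
      terminationGracePeriodSeconds: 30    # time allowed for in-flight requests to finish
      containers:
      - name: echo
        image: ealen/echo-server:latest
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          periodSeconds: 5
          failureThreshold: 12             # allow up to ~60s of startup before restarting
        readinessProbe:
          httpGet:
            path: /
            port: 80
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]      # brief pause so endpoint removal propagates before shutdown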

Exercise

Why is it recommended to route edge traffic to a Kubernetes Service instead of directly to Pod IPs?

Answer: A Service offers a stable IP/DNS and selects endpoints based on labels and readiness, so traffic keeps flowing as Pods are replaced or rescheduled. Routing directly to Pod IPs breaks during updates and failures.

Next chapter

Local Kubernetes Lab Environment with Portable Manifests and Helm
