Cloud-Native Web Serving with Kubernetes Ingress and Service Mesh

Progressive Delivery Rollouts with Health Checks and Safe Fallbacks

Chapter 9

What “Progressive Delivery” Means in Practice

Progressive delivery is the discipline of releasing changes to production in small, controlled steps while continuously validating that the system remains healthy. Instead of treating a deployment as a single event, you treat it as a sequence of checkpoints: shift a little traffic, observe health signals, decide whether to continue, pause, or roll back. The key difference from “basic canary” discussions is that progressive delivery is driven by explicit health checks, automated analysis, and predefined safe fallbacks. In Kubernetes terms, you are not only updating Pods; you are also defining how to measure success (readiness, liveness, synthetic probes, SLO signals) and what to do when success criteria are not met (abort, rollback, route to stable, or degrade gracefully).

In this chapter, you will focus on the mechanics that make progressive delivery safe: health checks that reflect real user experience, rollout controllers that gate traffic shifts, and fallback strategies that keep your service available even when a release is unhealthy. You will also see how to structure manifests so that rollouts are repeatable and auditable.

Health Checks: The Foundation of Safe Rollouts

Health checks are the signals that decide whether a rollout can proceed. If your health checks are too shallow, you will ship broken releases confidently. If they are too strict or noisy, you will roll back good releases and create alert fatigue. A robust progressive delivery setup typically uses multiple layers of checks: Kubernetes probes (liveness/readiness/startup), application-level endpoints, and external or synthetic checks that emulate real traffic.

Readiness vs Liveness vs Startup Probes

Readiness probes answer: “Should this Pod receive traffic right now?” During a rollout, readiness is the most important probe because it gates whether new Pods enter the load balancer and start serving users. Liveness probes answer: “Is this process stuck and should it be restarted?” Liveness is useful for self-healing but can be dangerous during rollouts if it restarts Pods that are merely slow to initialize. Startup probes answer: “Is the app still starting?” They prevent liveness from killing a slow-starting container.

A common progressive delivery failure mode is a readiness probe that only checks “process is up” (for example, returns 200 if the HTTP server responds) while the app is not actually ready to serve real requests (missing migrations, cannot reach dependencies, cache warmup incomplete). Your readiness probe should reflect “can serve a typical request successfully,” not “port is open.”

Designing a Useful /healthz Endpoint

For web workloads, implement a dedicated endpoint such as /healthz (or /readyz) that performs lightweight checks: configuration loaded, critical dependencies reachable (with timeouts), and internal queues not overloaded. Keep it fast and deterministic. Avoid expensive database queries; instead, do a short connection check or a cached “last successful dependency ping” value. If you have multiple dependencies, consider reporting partial status in the response body, but keep the HTTP status code strict: 200 only when the service can safely receive traffic.

Example: Probes in a Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:1.2.3
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 2
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 2
          failureThreshold: 3
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

In this example, /readyz can be stricter than /healthz. The startup probe gives the application up to 60 seconds (a failureThreshold of 30 at a 2-second period) to become responsive before the liveness probe takes effect. This reduces false restarts during cold starts and makes rollout behavior more predictable.

Progressive Delivery Controllers: Why “kubectl apply” Isn’t Enough

Kubernetes Deployments already support rolling updates, but they do not provide advanced gating logic: automated analysis windows, step-by-step traffic shifts, pauses, and metric-based promotion. Progressive delivery controllers add those capabilities. Two common patterns are: (1) a rollout custom resource that replaces Deployment behavior with explicit steps, and (2) a pipeline-driven approach that updates weights and pauses based on checks. In both cases, the goal is the same: encode release intent and safety checks into declarative configuration.

When you use a rollout controller, you typically define: the stable and canary ReplicaSets, the traffic routing mechanism (for example, via a service mesh or ingress), the step sequence (set weight, pause, analyze), and the rollback behavior. The controller then orchestrates the rollout and records status, making it easier to audit what happened and why.

Step-by-Step: A Progressive Rollout with Automated Analysis (Argo Rollouts Example)

This section shows a practical, end-to-end example using Argo Rollouts concepts. Even if you use a different tool, the structure is transferable: define a rollout resource, configure traffic routing, add analysis checks, and define safe fallback behavior.

Step 1: Install or Enable a Rollout Controller

Progressive delivery requires a controller running in the cluster. The exact installation method depends on your environment, but the operational requirement is consistent: the controller needs RBAC permissions to manage ReplicaSets, Services, and the traffic routing resources you choose. Ensure you also have observability data available (metrics and logs) because analysis steps depend on it.
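
The exact manifests come from your tool's installation instructions, but as an illustration of the permission surface involved, a minimal ClusterRole for a hypothetical rollout controller might look like the sketch below (the name and rules are illustrative, not an official install manifest).

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rollout-controller-minimal  # hypothetical name, for illustration only
rules:
# Manage the ReplicaSets that back the stable and canary versions
- apiGroups: ["apps"]
  resources: ["replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Update Service selectors and observe Pods during traffic shifts
- apiGroups: [""]
  resources: ["services", "pods"]
  verbs: ["get", "list", "watch", "update", "patch"]
# Manage the rollout and analysis custom resources themselves
- apiGroups: ["argoproj.io"]
  resources: ["rollouts", "rollouts/status", "analysisruns", "analysistemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]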

Step 2: Create Stable and Canary Services

Many rollout controllers use two Services: one that always points to the stable ReplicaSet and one that points to the canary ReplicaSet. Traffic routing then decides how much user traffic goes to each Service. This separation is useful for safe fallback: if the canary is unhealthy, you can immediately route all traffic back to stable without waiting for Pods to terminate.

apiVersion: v1
kind: Service
metadata:
  name: web-stable
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-canary
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080

During a rollout, the controller updates selectors behind the scenes so that web-stable points to the stable ReplicaSet and web-canary points to the new ReplicaSet.
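
With Argo Rollouts, for example, this works by injecting a rollouts-pod-template-hash selector into each Service, so mid-rollout the stable Service looks roughly like the sketch below (the hash value is a placeholder chosen by the controller).

apiVersion: v1
kind: Service
metadata:
  name: web-stable
spec:
  selector:
    app: web
    rollouts-pod-template-hash: 6d4b8f7c9a  # placeholder; set by the controller
  ports:
  - port: 80
    targetPort: 8080

Because you never edit this selector yourself, the Services in your repository stay simple, and the controller remains free to repoint them during promotions and aborts.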

Step 3: Define the Rollout Steps and Traffic Weights

A rollout spec encodes the progressive steps. A typical sequence is: start at a small percentage, pause to observe, run analysis, increase weight, repeat. Pauses can be time-based (for example, 2 minutes) or manual (require an operator to resume). Automated analysis can run at each step or at key milestones.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 6
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3
  strategy:
    canary:
      stableService: web-stable
      canaryService: web-canary
      steps:
      - setWeight: 5
      - pause:
          duration: 120s
      - analysis:
          templates:
          - templateName: web-error-rate
      - setWeight: 20
      - pause:
          duration: 180s
      - analysis:
          templates:
          - templateName: web-latency
      - setWeight: 50
      - pause:
          duration: 300s

This example uses explicit weights and pauses. The analysis steps reference analysis templates that define what “healthy” means. If analysis fails, the controller can automatically abort and roll back (depending on configuration), which is your first major safe fallback.

Step 4: Define Analysis Templates (Metric-Based Health Gates)

Analysis templates define checks such as error rate, latency, saturation, or custom business metrics. The most useful gates are those that correlate with user experience: HTTP 5xx rate, p95 latency, and request success ratio. You can also add domain-specific checks such as “checkout success rate” or “login failures,” but start with universal signals.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-error-rate
spec:
  metrics:
  - name: http-5xx-rate
    interval: 30s
    count: 5
    successCondition: result[0] < 0.01
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{app="web",status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{app="web"}[2m]))

Here, the rollout will consider the step successful if the 5xx ratio stays below 1% over the analysis window. The failureLimit controls sensitivity: a single failed measurement can fail the analysis, which is appropriate for severe signals like high 5xx rates. For noisier metrics, allow more failures or use longer windows.
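
The Rollout above also references a web-latency template that is not shown. A sketch of what it could contain follows, assuming the same Prometheus address and a histogram metric named http_request_duration_seconds_bucket (both names are assumptions; substitute your own instrumentation).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-latency
spec:
  metrics:
  - name: p95-latency
    interval: 30s
    count: 5
    # Fail the step if p95 latency exceeds 500ms; tune per service.
    successCondition: result[0] < 0.5
    # Latency is noisier than 5xx rate, so tolerate more failed measurements.
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{app="web"}[2m])) by (le)
          )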

Safe Fallbacks: What Happens When Things Go Wrong

Progressive delivery is not only about detecting problems; it is about responding safely. A safe fallback is a preplanned action that reduces user impact when the new version is unhealthy. The best fallback depends on the failure mode: application bug, dependency outage, configuration error, or performance regression.

Fallback 1: Automatic Abort and Rollback

The most common fallback is to abort the rollout and return traffic to the stable version. With a rollout controller, this is typically automatic when readiness fails, analysis fails, or a step times out. The rollback should be fast and should not require deleting resources manually. Operationally, you want a single command (or an automated policy) to revert to the last known good ReplicaSet.

To make rollback reliable, keep a small revisionHistoryLimit so the stable ReplicaSet remains available, and avoid mutating stable configuration during the rollout. If you change ConfigMaps or Secrets, consider versioning them (for example, web-config-v12) so stable and canary can run with different configurations without conflict.
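
A versioned ConfigMap can be as simple as the sketch below (web-config-v12 matches the example name above; the keys are illustrative). The Rollout's pod template then references the versioned name explicitly, for example via envFrom, and each release ships a new ConfigMap name alongside the new image.

apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config-v12  # one name per release; never edited in place
immutable: true
data:
  REQUEST_TIMEOUT_MS: "2000"      # illustrative keys
  CHECKOUT_V2_ENABLED: "false"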

Fallback 2: Route-to-Stable “Kill Switch”

Sometimes you need an immediate traffic shift back to stable even before analysis completes (for example, a sudden spike in errors). A practical pattern is a “kill switch” that sets canary weight to 0% and stable to 100% quickly. Depending on your routing layer, this might be a patch to a routing resource or a rollout command. The important part is procedural: define who can trigger it, how it is audited, and how you verify it worked (for example, confirm stable request volume rises and canary drops).
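
If traffic is routed through a service mesh, the kill switch can be a patch that forces route weights back to 100% stable. A sketch using an Istio VirtualService (assuming Istio is your routing layer; the host is a placeholder):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts:
  - web.example.com          # placeholder host
  http:
  - route:
    - destination:
        host: web-stable     # Service defined earlier
      weight: 100
    - destination:
        host: web-canary
      weight: 0              # canary receives no user traffic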

Fallback 3: Graceful Degradation Instead of Full Rollback

Not every incident requires rolling back the entire release. If the new version introduces an optional feature that is failing, you can degrade gracefully by disabling that feature via a runtime flag while keeping the new version running. This is especially useful when rollback is risky (for example, schema changes) or when the new version contains important fixes unrelated to the failing feature.

To make this safe, design feature flags to be fast to toggle and safe to evaluate. Keep the “off” path well-tested. During progressive delivery, you can include an analysis check that validates the flag system itself (for example, a synthetic request that exercises both on and off paths).
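
A minimal runtime flag can live in a ConfigMap that the application re-reads periodically (for example, from a mounted volume); the names below are illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: web-flags
data:
  # Toggling a flag to "false" degrades gracefully without a rollback.
  recommendations_enabled: "false"
  checkout_v2_enabled: "true"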

Fallback 4: Automated Scaling and Load Shedding

Some rollouts fail not because of correctness but because of performance regressions: increased CPU usage, memory growth, or slower responses. A safe fallback can be to temporarily scale up the canary or stable workloads, or to enable load shedding (returning 429 for noncritical endpoints) to protect core functionality. This is not a substitute for fixing the regression, but it can prevent cascading failures while you decide whether to continue or roll back.

If you use autoscaling, ensure that your analysis windows account for scaling delays. A rollout might look unhealthy for 1–2 minutes while new Pods start and caches warm up. Use startup probes and appropriate pause durations to avoid rolling back due to expected transient behavior.
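
As a sketch, an HPA can target the Rollout from the earlier example directly, since the rollout controller exposes the scale subresource (CPU-based scaling shown; the thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: web
  minReplicas: 6
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out before throttling degrades latency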

Choosing Health Signals That Actually Protect Users

Progressive delivery succeeds when your gates match user impact. Consider layering signals from “closest to the user” to “closest to the container.” Start with user-facing signals (success rate, latency), then add system signals (CPU throttling, memory, saturation), and finally add container signals (probe failures, restarts). Avoid using only infrastructure metrics; a service can be “green” on CPU and still be returning incorrect responses.

Recommended Baseline Gates

  • HTTP success ratio (for example, 2xx/3xx vs total) for the canary and overall service.
  • HTTP 5xx rate threshold with a short analysis window for fast detection.
  • p95 latency threshold compared to baseline (stable) to catch regressions.
  • Readiness probe success rate (should be near 100% once warmed up).
  • Restart rate (high restarts often indicate crashes or OOMKills).

When possible, compare canary metrics to stable metrics rather than absolute thresholds. Absolute thresholds can be misleading during low traffic. Relative comparison helps you detect regressions even when overall traffic is small.
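
A sketch of a relative gate follows, assuming requests are labeled by the Service that routed them (the service label on http_requests_total is an assumption about your instrumentation): the step fails only if the canary's 5xx ratio exceeds the stable ratio by more than two percentage points.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-canary-vs-stable
spec:
  metrics:
  - name: relative-5xx
    interval: 60s
    count: 5
    # Canary 5xx ratio may exceed the stable ratio by at most 0.02 (2 points).
    successCondition: result[0] < 0.02
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          (
            sum(rate(http_requests_total{service="web-canary",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{service="web-canary"}[2m]))
          )
          -
          (
            sum(rate(http_requests_total{service="web-stable",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{service="web-stable"}[2m]))
          )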

Step-by-Step: Adding Synthetic Checks to a Rollout

Synthetic checks are controlled requests that validate critical user journeys. They are valuable because they can detect issues that raw error rates miss (for example, responses are 200 but content is wrong, authentication flows break, or a downstream integration fails silently). During progressive delivery, you can run synthetic checks at each step and fail the rollout if they do not pass.

Step 1: Create a Kubernetes Job That Runs a Smoke Test

apiVersion: batch/v1
kind: Job
metadata:
  name: web-smoke
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: smoke
        image: curlimages/curl:8.5.0
        command: ["sh","-c"]
        args:
        - |
          set -e
          curl -fsS http://web-canary/healthz
          curl -fsS http://web-canary/api/version
          curl -fsS http://web-canary/api/checkout/smoke

This job targets the canary Service directly, ensuring you test the new version even when it receives only a small percentage of user traffic. Keep smoke tests short and deterministic. If a test depends on external systems, use timeouts and clear failure messages.

Step 2: Wire the Job into an Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-smoke-test
spec:
  metrics:
  - name: smoke
    provider:
      job:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: smoke
                image: curlimages/curl:8.5.0
                command: ["sh","-c"]
                args:
                - |
                  set -e
                  curl -fsS http://web-canary/healthz
                  curl -fsS http://web-canary/api/checkout/smoke

Now you can add an analysis step referencing web-smoke-test. If the job fails, the rollout fails and triggers your rollback policy.

Operational Guardrails: Timeouts, Pauses, and Manual Approval

Even with good health checks, you need guardrails to prevent a rollout from getting stuck or progressing too quickly. Timeouts ensure that a step cannot run indefinitely. Pauses provide observation windows. Manual approvals are useful for high-risk releases or when you want a human to review dashboards before proceeding.

A practical approach is to automate early steps (5% to 20%) and require manual approval before going beyond a threshold (for example, 50%). This balances speed and safety: you catch obvious issues quickly, and you still have a checkpoint before most users are affected.
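
With Argo Rollouts, a pause step without a duration waits indefinitely until an operator promotes the rollout, so the strategy from Step 3 could add the manual checkpoint like this:

  strategy:
    canary:
      stableService: web-stable
      canaryService: web-canary
      steps:
      - setWeight: 5
      - pause:
          duration: 120s        # automated observation window
      - setWeight: 20
      - pause:
          duration: 180s
      # No duration: the rollout holds here until someone promotes it,
      # typically after reviewing dashboards for the 20% step.
      - pause: {}
      - setWeight: 50
      - pause:
          duration: 300s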

Common Failure Modes and How to Prevent Them

Probe Misconfiguration Causes False Rollbacks

If readiness probes are too strict (short timeouts, low failure thresholds), transient latency spikes can mark Pods unready and starve the canary of capacity, which then looks like an error-rate regression. Use realistic timeouts, and ensure your app responds quickly to health endpoints even under load. Consider separate lightweight endpoints for probes.

Low Traffic Makes Metrics Noisy

At 5% traffic, you might only have a handful of requests per minute. A single error can look like a huge error rate. Mitigate this by using longer analysis windows, minimum request thresholds, or synthetic checks that generate enough signal. If your tooling supports it, gate on “at least N requests observed” before evaluating ratios.
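
If your tooling supports it, a separate request-volume metric can express this gate: too little traffic neither passes nor fails, so the rollout pauses for a human decision instead of promoting or aborting on noise. A sketch using Argo Rollouts' inconclusive handling (the service label and the 10 requests-per-minute threshold are illustrative assumptions):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-request-volume
spec:
  metrics:
  - name: canary-request-volume
    interval: 60s
    count: 3
    # Measurements that satisfy neither condition are marked Inconclusive,
    # which pauses the rollout for review rather than aborting it.
    successCondition: result[0] > 10   # at least ~10 requests/min observed
    failureCondition: result[0] < 0    # never true; low traffic stays inconclusive
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{service="web-canary"}[2m])) * 60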

Dependency Incidents Look Like Release Incidents

A rollout might fail because a shared dependency is down, not because the new version is bad. If you always roll back, you may churn releases unnecessarily. Improve diagnosis by comparing canary vs stable: if both are failing equally, it is likely an external incident. You can encode this into analysis: fail only if canary is significantly worse than stable.

Schema and Data Changes Complicate Rollback

If a release includes database schema changes, rollback may not be straightforward. Progressive delivery still helps, but you must plan compatibility: make schema changes backward-compatible, deploy them first, then deploy application changes, and only later remove old schema elements. For safe fallback, ensure the stable version can still operate with the new schema during the rollout window.

Now answer the exercise about the content:

In a progressive rollout, which probe most directly controls whether new Pods begin receiving user traffic?

Answer: Readiness probes answer whether a Pod should receive traffic right now, so they gate entry into the load balancer during a rollout.

Next chapter

Observability for Web Serving: Metrics, Logs, and Distributed Tracing
