Retries, Exponential Backoff, and Robust Failure Handling

Chapter 11

Estimated reading time: 11 minutes


Why retries matter in offline-first sync

In an offline-first app, failures are not exceptional; they are routine. Requests time out, radios sleep, captive portals intercept traffic, servers shed load, and background execution windows end abruptly. A robust sync client treats these failures as expected signals and responds with controlled retries, backoff, and safe recovery. The goal is not “never fail,” but “fail predictably”: avoid draining battery, avoid hammering servers, avoid duplicating writes, and avoid leaving the user in a confusing state.

This chapter focuses on retry policies, exponential backoff, jitter, and failure handling patterns that make a sync engine resilient. It assumes you already have a sync engine, operation queue, and idempotent operations; here we focus on how to execute them safely under unreliable conditions.

Failure taxonomy: decide what is retryable

A retry strategy starts with classification. Not all failures deserve another attempt, and some require different delays or user intervention. A practical taxonomy:

  • Transient network failures: DNS lookup failure, TCP reset, TLS handshake timeout, connection refused due to temporary routing. Usually retryable with backoff.
  • Timeouts: request timeout, read timeout. Often retryable, but may indicate server overload or a too-aggressive timeout setting.
  • Server transient errors: HTTP 502/503/504, gRPC UNAVAILABLE, gateway timeouts. Retryable; respect Retry-After if present.
  • Rate limiting: HTTP 429. Retryable with server-provided delay; also adjust concurrency and batch sizes.
  • Auth/session issues: HTTP 401/403 due to expired token. Not a “retry” until you refresh credentials; then retry once.
  • Permanent client errors: HTTP 400/404/422 for malformed payload or invalid state. Not retryable; mark operation failed and surface a fix path.
  • Conflict/semantic errors: domain-level rejection (e.g., “cannot delete because already archived”). Not retryable; requires domain handling.
  • Local resource constraints: disk full, database locked, encryption key unavailable. Retry may help (e.g., lock contention) but often needs user/system action.

Implement classification as a pure function that maps an error (including HTTP status, network error codes, and response body) into a FailureClass with fields like retryable, minDelay, maxDelay, requiresAuthRefresh, and userActionRequired.
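One possible shape for these types, sketched in Kotlin (all names and fields are illustrative, not a prescribed API):

// Illustrative Kotlin sketch of a normalized error and its classification result.
// Names and fields are assumptions, not a prescribed API.
data class SyncError(
    val httpStatus: Int?,          // null if the request never produced an HTTP response
    val transportError: String?,   // e.g. "dns_failure", "timeout", "connection_reset"
    val serverErrorCode: String?,  // application-level code from the response body, if any
    val requestWasSent: Boolean    // best effort: did the request reach the server?
)

data class FailureClass(
    val code: String,                  // stable identifier for logging and metrics
    val retryable: Boolean,
    val minDelayMs: Long = 0,
    val maxDelayMs: Long = 0,
    val requiresAuthRefresh: Boolean = false,
    val userActionRequired: Boolean = false,
    val retryAfterMs: Long? = null     // populated from Retry-After when the server provides it
)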

Step-by-step: build an error classifier

1) Normalize errors into a single internal type (e.g., SyncError) that includes: transport error, HTTP status, server error code, and whether the request reached the server (if known).


2) Create a mapping table (a classifier sketch follows these steps). Example rules:

  • status in [500, 502, 503, 504] => retryable, backoff
  • status == 429 => retryable, delay from Retry-After
  • status == 401 => requires token refresh, then retry once
  • status in [400, 404, 409, 422] => not retryable (or special-case 409 if your API uses it for concurrency)
  • Network timeouts => retryable, backoff

3) Add “circuit breakers” for repeated permanent failures: if the same operation fails with the same non-retryable reason, stop reprocessing it and mark it terminal.

4) Log classification decisions with enough context to debug (operation type, endpoint, status, error code), but avoid logging sensitive payloads.
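A minimal classifier that applies the mapping table from step 2, assuming the SyncError and FailureClass types sketched earlier, might look like this:

// Sketch of the mapping rules from step 2. Status groupings and codes are
// illustrative; adjust them to your API's semantics (especially 409).
fun classify(err: SyncError, retryAfterMs: Long? = null): FailureClass = when (err.httpStatus) {
    500, 502, 503, 504 ->
        FailureClass(code = "server_transient", retryable = true, retryAfterMs = retryAfterMs)
    429 ->
        FailureClass(code = "rate_limited", retryable = true, retryAfterMs = retryAfterMs)
    401 ->
        FailureClass(code = "auth_expired", retryable = false, requiresAuthRefresh = true)
    400, 404, 409, 422 ->
        FailureClass(code = "client_error", retryable = false, userActionRequired = true)
    null -> when (err.transportError) {
        "timeout", "connection_reset", "dns_failure" ->
            FailureClass(code = "network_transient", retryable = true)
        else ->
            FailureClass(code = "unknown_transport", retryable = false)
    }
    else ->
        FailureClass(code = "unexpected_status", retryable = false)
}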

Exponential backoff: the default retry schedule

Exponential backoff increases the delay between attempts, reducing load during outages and giving the system time to recover. A common formula:

delay = min(maxDelay, baseDelay * 2^attempt)

Where:

  • baseDelay: often 0.5s–2s for foreground, 2s–10s for background sync
  • attempt: 0 for first retry, 1 for second, etc.
  • maxDelay: cap such as 1–5 minutes for interactive operations, 15–60 minutes for background

Backoff should be per operation (or per endpoint) rather than global-only. A single failing endpoint should not stall unrelated work, but you also want to avoid stampeding the server with many operations retrying simultaneously.

Jitter: prevent synchronized retry storms

If thousands of clients all retry at 1s, 2s, 4s, 8s, they create synchronized spikes (“thundering herd”). Add randomness (“jitter”) to spread retries. Common jitter strategies:

  • Full jitter: delay = random(0, expDelay)
  • Equal jitter: delay = expDelay/2 + random(0, expDelay/2)
  • Decorrelated jitter: delay = min(maxDelay, random(baseDelay, prevDelay*3))

Full jitter is simple and effective for large fleets. Decorrelated jitter tends to behave well when outages are long and you want retries to “wander” rather than lock into a pattern.
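As a concrete illustration, a decorrelated-jitter helper in Kotlin could look like the sketch below; the only state carried between attempts is the previous delay.

import kotlin.random.Random

// Sketch of decorrelated jitter: each delay is drawn from [baseDelay, prevDelay * 3],
// capped at maxDelay. Seed with baseDelay and feed each returned delay back in.
fun decorrelatedJitter(prevDelayMs: Long, baseDelayMs: Long, maxDelayMs: Long): Long {
    val upper = (prevDelayMs * 3).coerceAtLeast(baseDelayMs + 1)
    val candidate = Random.nextLong(baseDelayMs, upper)
    return candidate.coerceAtMost(maxDelayMs)
}

// Usage:
// var delay = baseDelayMs
// repeat(retries) { delay = decorrelatedJitter(delay, baseDelayMs, maxDelayMs) }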

Practical step-by-step: compute a retry delay

1) Choose baseDelay and maxDelay per context (foreground vs background) and per class (rate limit vs transient).

2) Compute exponential delay: expDelay = min(maxDelay, baseDelay * pow(2, attempt)).

3) Apply jitter. For full jitter: delay = random(0, expDelay).

4) If server provides Retry-After (seconds or HTTP date), override: delay = max(delay, retryAfter).

5) Persist the next eligible time for the operation (e.g., nextAttemptAt) so app restarts don’t reset the schedule.

// Pseudocode (language-agnostic) for full-jitter exponential backoff with Retry-After support
function computeDelay(attempt, baseDelayMs, maxDelayMs, retryAfterMsOrNull):
    expDelay = min(maxDelayMs, baseDelayMs * (2 ^ attempt))
    jittered = randomInt(0, expDelay)
    if retryAfterMsOrNull != null:
        return max(jittered, retryAfterMsOrNull)
    return jittered
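The pseudocode assumes Retry-After has already been converted to milliseconds. A hedged Kotlin helper for that conversion, covering both the delta-seconds and HTTP-date forms of the header (java.time requires Android API 26+ or core library desugaring):

import java.time.Duration
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter

// Sketch: convert a Retry-After header value ("120" or an HTTP-date such as
// "Wed, 21 Oct 2015 07:28:00 GMT") into a millisecond delay.
// Returns null when the header is absent or unparseable.
fun retryAfterMillis(headerValue: String?, now: ZonedDateTime = ZonedDateTime.now()): Long? {
    if (headerValue.isNullOrBlank()) return null
    headerValue.trim().toLongOrNull()?.let { seconds ->
        return (seconds * 1000).coerceAtLeast(0L)
    }
    return try {
        val at = ZonedDateTime.parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
        Duration.between(now, at).toMillis().coerceAtLeast(0L)
    } catch (e: Exception) {
        null
    }
}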

Retry budgets and caps: avoid infinite loops

Retries must be bounded. Without caps, a single bad operation can consume resources forever. Use multiple limits:

  • Max attempts: e.g., 8–12 attempts for a single operation before marking it “paused” or “needs attention.”
  • Max elapsed time: stop retrying after, say, 24 hours since first failure, then require user action or a new trigger.
  • Retry budget per sync run: limit total retries in one session to protect battery and data usage.
  • Per-host/endpoint concurrency limits: even if many operations are eligible, only run a few at a time.

When a cap is reached, do not silently drop data. Mark the operation as blocked with a reason and a recommended remediation path (e.g., re-authenticate, update app, contact support, edit invalid data).
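A small sketch of these cap checks before scheduling another attempt; the limits and the firstFailedAtMs field are illustrative assumptions:

// Sketch: decide whether an operation may retry again, or which blockedReason
// to record. Limits and field names (firstFailedAtMs) are illustrative.
data class RetryCaps(
    val maxAttempts: Int = 10,
    val maxElapsedMs: Long = 24 * 60 * 60 * 1000L   // 24 hours since first failure
)

// Returns null if retrying is still allowed, otherwise a blockedReason code.
fun checkCaps(attemptCount: Int, firstFailedAtMs: Long, nowMs: Long, caps: RetryCaps): String? =
    when {
        attemptCount >= caps.maxAttempts -> "retry_limit"
        nowMs - firstFailedAtMs >= caps.maxElapsedMs -> "retry_window_expired"
        else -> null
    }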

Handling ambiguous outcomes safely

The hardest failures are ambiguous: you don’t know whether the server applied the request. This happens when the client times out after sending, or the connection drops mid-response. Retrying blindly can create duplicates unless your API supports safe replay.

Robust handling relies on two techniques:

  • Request identifiers: include a unique request ID per operation attempt (or per logical operation) so the server can deduplicate and return the original result.
  • Status reconciliation: after an ambiguous failure, query operation status (if supported) or fetch the affected resource to confirm whether the change applied.

Even with idempotent operations, you still need to treat ambiguous outcomes differently in UX: show “Syncing…” or “Pending confirmation” rather than “Failed,” because it may have succeeded.

Step-by-step: ambiguous write timeout policy

1) Detect ambiguity: timeout after request body was sent, or network error after bytes written.

2) Mark operation state as unknown (distinct from failed).

3) Schedule a reconciliation task: either call a /operations/{id} endpoint, or fetch the resource and compare server state.

4) If reconciliation confirms success, mark operation complete without retrying the write.

5) If reconciliation confirms not applied, retry with backoff.
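A sketch of this policy, assuming a hypothetical status lookup (your API may instead require fetching the resource and comparing server state):

// Sketch of the ambiguous-write policy above. OperationState and the status
// lookup are assumptions about your API, not a standard contract.
enum class OperationState { PENDING, SYNCING, UNKNOWN, COMPLETE, RETRY_SCHEDULED, BLOCKED }

suspend fun reconcileAmbiguousWrite(
    opId: String,
    // true = server applied the write, false = it did not, null = still unknown
    fetchApplied: suspend (String) -> Boolean?
): OperationState = when (fetchApplied(opId)) {
    true -> OperationState.COMPLETE          // confirmed applied: do not resend the write
    false -> OperationState.RETRY_SCHEDULED  // confirmed not applied: retry with backoff
    null -> OperationState.UNKNOWN           // keep "pending confirmation" and reconcile again later
}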

Rate limiting and server overload: adapt, don’t just wait

When you receive 429 or overload signals, waiting is necessary but not sufficient. You should also reduce pressure:

  • Lower concurrency: reduce parallel requests to the host.
  • Increase batch size carefully: fewer requests can help, but too-large payloads can increase timeouts; tune based on observed latency.
  • Prioritize: send user-visible changes first; defer background maintenance.
  • Honor server hints: Retry-After, custom headers like X-RateLimit-Reset, or gRPC retry info.

Implement a per-host “backpressure controller” that adjusts concurrency and pacing based on recent responses. This is especially important when many queued operations become eligible at once (e.g., after reconnecting).
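One simple way to build such a controller is additive-increase/multiplicative-decrease (AIMD) on the per-host concurrency limit; the sketch below is one possible shape, not a prescribed design:

// Sketch of a per-host backpressure controller using AIMD:
// shrink the concurrency limit sharply on overload signals (429/5xx),
// grow it back slowly on success.
class BackpressureController(
    private val minConcurrency: Int = 1,
    private val maxConcurrency: Int = 8
) {
    private val limits = mutableMapOf<String, Int>()

    fun limitFor(host: String): Int = limits[host] ?: maxConcurrency

    fun recordSuccess(host: String) {
        limits[host] = (limitFor(host) + 1).coerceAtMost(maxConcurrency)   // additive increase
    }

    fun recordOverload(host: String) {
        limits[host] = (limitFor(host) / 2).coerceAtLeast(minConcurrency)  // multiplicative decrease
    }
}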

Circuit breakers: stop hammering a broken dependency

A circuit breaker prevents repeated attempts when a dependency is clearly down. Instead of each operation independently retrying and failing, the breaker opens and blocks new attempts for a cool-down period.

Typical states:

  • Closed: normal operation.
  • Open: requests are short-circuited (fail fast) for a period.
  • Half-open: allow a small number of test requests; if they succeed, close; if they fail, open again.

Use circuit breakers per host or per critical endpoint. Combine with exponential backoff: the breaker controls whether you attempt at all; backoff controls when individual operations become eligible.

Practical step-by-step: implement a simple per-host breaker

1) Track rolling failure rate: e.g., last 20 requests or last 2 minutes.

2) Define thresholds: open if failure rate > 50% and at least N failures (to avoid noise).

3) When open, set openUntil (e.g., now + 30s, then 60s, then 2m with caps).

4) In half-open, allow 1–3 requests; decide based on their outcome.

5) Persist breaker state across app restarts for a short time window (optional but helpful in background sync).
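A minimal per-host breaker along the lines of these steps; window size, thresholds, and cool-down schedule are illustrative, and half-open behaviour is implicit (once openUntil passes, the next requests act as probes):

// Minimal sketch of the per-host breaker described above (one instance per host).
class HostBreaker(
    private val windowSize: Int = 20,
    private val minFailures: Int = 5,
    private val failureRateToOpen: Double = 0.5
) {
    private val outcomes = ArrayDeque<Boolean>()   // true = success, false = failure
    private var openUntilMs: Long = 0
    private var cooldownMs: Long = 30_000          // 30s, doubled on repeated opens

    fun isOpen(nowMs: Long): Boolean = nowMs < openUntilMs
    fun openUntil(): Long = openUntilMs

    fun record(success: Boolean, nowMs: Long) {
        outcomes.addLast(success)
        if (outcomes.size > windowSize) outcomes.removeFirst()
        if (success) {
            cooldownMs = 30_000                    // recovery observed: reset the cool-down
            return
        }
        val failures = outcomes.count { !it }
        if (failures >= minFailures && failures.toDouble() / outcomes.size > failureRateToOpen) {
            openUntilMs = nowMs + cooldownMs
            cooldownMs = (cooldownMs * 2).coerceAtMost(2 * 60_000L)   // cap cool-down at 2 minutes
        }
    }
}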

Timeouts: choose them intentionally

Timeouts are part of failure handling. Too short and you create false failures; too long and you block the queue and waste battery. Use layered timeouts:

  • Connect timeout: time to establish TCP/TLS (e.g., 5–10s).
  • Request timeout: total time for request+response (e.g., 15–30s for typical APIs).
  • Per-byte/read timeout: for streaming or large downloads.

Adjust timeouts by operation type. Uploading media or large batches needs longer request timeouts but should be chunked or resumable where possible. For small JSON writes, shorter timeouts reduce queue blockage.
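As a sketch, per-operation timeout selection can be a small lookup; the executor sketch later in this chapter refers to this as timeoutsFor(op). The values and operation kinds below are illustrative.

// Sketch of layered, per-operation timeouts. Values and kinds are illustrative.
data class Timeouts(val connectMs: Long, val requestMs: Long, val readMs: Long)

enum class OpKind { SMALL_JSON_WRITE, LARGE_UPLOAD, READ }

fun timeoutsFor(kind: OpKind): Timeouts = when (kind) {
    OpKind.SMALL_JSON_WRITE -> Timeouts(connectMs = 10_000, requestMs = 20_000, readMs = 10_000)
    OpKind.LARGE_UPLOAD     -> Timeouts(connectMs = 10_000, requestMs = 120_000, readMs = 30_000) // prefer chunked/resumable uploads
    OpKind.READ             -> Timeouts(connectMs = 10_000, requestMs = 30_000, readMs = 15_000)
}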

When a timeout occurs, treat it as potentially ambiguous for writes. For reads, it is usually safe to retry immediately with backoff.

Queue-aware retries: prioritize and isolate failures

In a sync engine, retries interact with scheduling. A robust approach is to make retry behavior explicit in the operation record:

  • attemptCount
  • lastAttemptAt
  • nextAttemptAt
  • lastErrorClass
  • blockedReason (optional)

Then the scheduler selects eligible operations (nextAttemptAt <= now) and runs them with concurrency limits. Important patterns:

  • Don’t let one failing operation block the whole queue: keep processing other eligible operations unless ordering constraints require otherwise.
  • Group by dependency: if operations target different services, isolate failures per service.
  • Priority lanes: user-initiated actions (e.g., “Send message”) should have a higher lane than background maintenance.

Step-by-step: scheduling with nextAttemptAt

1) When an operation fails with a retryable class, compute delay and set nextAttemptAt = now + delay.

2) When it fails non-retryably, set blockedReason and remove it from the runnable set.

3) The scheduler loop queries runnable operations ordered by priority then nextAttemptAt.

4) Apply per-host concurrency and circuit breaker checks before dispatch.

5) On app resume or connectivity change, do not reset nextAttemptAt; instead, you may “nudge” eligible operations to run immediately if they are already due.
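Putting the record fields and the scheduling rules together, here is a sketch of the operation record and the eligibility query; an in-memory list stands in for the persisted queue, and names are illustrative.

// Sketch of a retry-aware operation record and scheduler selection.
data class SyncOperation(
    val id: String,
    val targetHost: String,
    val priority: Int,                 // lower value = higher-priority lane
    var attemptCount: Int = 0,
    var lastAttemptAt: Long? = null,
    var nextAttemptAt: Long = 0,       // eligible when nextAttemptAt <= now
    var lastErrorClass: String? = null,
    var blockedReason: String? = null
)

fun selectRunnable(
    queue: List<SyncOperation>,
    nowMs: Long,
    perHostLimit: (String) -> Int,
    breakerOpen: (String) -> Boolean
): List<SyncOperation> =
    queue
        .filter { it.blockedReason == null && it.nextAttemptAt <= nowMs }   // due and not terminal
        .filter { !breakerOpen(it.targetHost) }                             // fail fast while a breaker is open
        .sortedWith(compareBy<SyncOperation>({ it.priority }, { it.nextAttemptAt }))
        .groupBy { it.targetHost }
        .flatMap { (host, ops) -> ops.take(perHostLimit(host)) }            // per-host concurrency cap
        .sortedWith(compareBy<SyncOperation>({ it.priority }, { it.nextAttemptAt }))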

Robust failure UX: communicate state without panic

Failure handling is not only technical; it affects user trust. The UI should reflect the true state of work:

  • Pending: operation queued but not attempted yet.
  • Syncing: in progress.
  • Will retry: failed transiently; show subtle indicator and optionally “Retrying in a moment.”
  • Needs attention: permanent failure; show actionable message (e.g., “Sign in again,” “Fix invalid field,” “Storage full”).

Avoid showing repeated error toasts for background retries. Prefer a single persistent indicator (e.g., a small banner) and a detailed “Sync status” screen for diagnostics. For user-initiated actions, provide immediate feedback: “Sent when online” or “Will retry,” plus a manual retry button if appropriate.

Observability for retries: measure what matters

Without metrics, retry logic can silently harm performance. Track:

  • Retry counts per operation type and endpoint
  • Time-to-success distribution (p50/p95) for queued operations
  • Failure class rates (timeouts vs 5xx vs 401 vs 4xx)
  • Backoff delay totals (how much time spent waiting)
  • Circuit breaker opens and durations

On-device logs should include correlation IDs and operation IDs to reconstruct sequences. In production analytics, aggregate and sample to avoid excessive telemetry. Use these signals to tune base delays, max delays, concurrency, and timeouts.

Reference implementation sketch: retry policy and executor

// Pseudocode: operation execution with classification, backoff, and breaker
function runOperation(op):
    if now < op.nextAttemptAt:
        return SKIP
    host = op.targetHost
    if breaker.isOpen(host):
        op.nextAttemptAt = breaker.openUntil(host)
        return SKIP
    result = httpClient.send(op.request, timeoutsFor(op))
    if result.success:
        op.markComplete()
        breaker.recordSuccess(host)
        return OK
    err = normalizeError(result)
    cls = classify(err)
    breaker.recordOutcome(host, cls)
    if cls.requiresAuthRefresh:
        if auth.refreshToken():
            // retry once immediately or with small delay
            result2 = httpClient.send(op.request, timeoutsFor(op))
            if result2.success:
                op.markComplete()
                return OK
            err = normalizeError(result2)
            cls = classify(err)
    if cls.retryable:
        op.attemptCount += 1
        if op.attemptCount > op.maxAttempts:
            op.blockedReason = "retry_limit"
            return BLOCKED
        delay = computeDelay(op.attemptCount - 1, baseDelayFor(op, cls), maxDelayFor(op, cls), cls.retryAfterMs)
        op.nextAttemptAt = now + delay
        op.lastErrorClass = cls.code
        return RETRY_SCHEDULED
    else:
        op.blockedReason = cls.userActionRequired ? cls.code : "non_retryable"
        op.lastErrorClass = cls.code
        return BLOCKED

This sketch highlights key decisions: classify first, handle auth refresh as a separate branch, schedule retries by persisting nextAttemptAt, and integrate a circuit breaker to fail fast during outages.

Common pitfalls and how to avoid them

  • Immediate tight-loop retries: retrying instantly on failure drains battery and overloads servers. Always back off.
  • No jitter: synchronized retries create spikes after outages. Add jitter by default.
  • Resetting retry state on app restart: users who force-close the app can accidentally create a retry storm. Persist attempt counters and next attempt times.
  • Treating all 4xx as permanent: some 409/412 patterns may be recoverable via refresh/retry flows. Classify based on your API semantics.
  • Ignoring Retry-After: if the server tells you when to come back, listen.
  • Single global circuit breaker: one failing endpoint should not block all sync. Scope breakers per host or per route group.
  • Overly aggressive maxDelay: if max delay is hours for user-visible actions, the app feels broken. Use shorter caps for interactive operations and longer for background.

Now answer the exercise about the content:

Which approach best reduces synchronized retry spikes when many clients retry after an outage?


Answer: exponential backoff with jitter. Exponential backoff increases delays between attempts, and jitter randomizes those delays to avoid clients retrying at the same times and causing a thundering herd.

Next chapter

Background Jobs and Scheduling Across Mobile OS Constraints
