Failure Handling as a First-Class Requirement
In event-driven systems, failures are not exceptional; they are expected. Reliability by design means you assume messages can be duplicated, arrive late, arrive out of order, be malformed, or fail mid-processing. Your goal is to make processing safe under at-least-once delivery and to ensure failures are observable, bounded, and recoverable.
Design reliability around four pillars:
- Retries for transient failures (with backoff and limits).
- Idempotency so duplicates are safe (and “exactly-once” is approximated).
- Poison message handling so bad events don’t block the system.
- Timeouts and partial failure control so work is bounded and consistent.
Common Failure Modes You Must Design For
Transient network errors
Symptoms: intermittent timeouts, DNS hiccups, connection resets, 5xx responses. These are often resolved by retrying after a short delay. Treat them as retryable unless proven otherwise.
Throttling and rate limits
Symptoms: HTTP 429, “server busy”, Service Bus quota pressure, downstream APIs enforcing per-minute limits. Retrying immediately makes it worse; you need backoff and sometimes a circuit-breaker-like pause.
Malformed or unexpected events
Symptoms: invalid JSON, missing required fields, schema drift, unexpected enum values. Retrying will not fix these. These should be classified as non-retryable and routed to a dead-letter/poison path with enough context for investigation.
Partial failures
Symptoms: you successfully wrote to one system but failed writing to another; you updated state but failed to publish a follow-up event; you processed a batch where some items succeeded and others failed. These require careful ordering, idempotent writes, and compensating actions.
Duplicate and out-of-order delivery
Symptoms: the same message is processed multiple times, or events arrive in a different order than produced. This is normal in distributed systems. Your handler must be safe under duplicates and should not assume ordering unless you enforce it explicitly.
Retries: Platform-Level vs Code-Level
When to prefer platform-level retries
Use platform-level retries when the failure is likely transient and the operation is safe to retry (idempotent or protected by deduplication). Platform-level retries are simpler and keep retry logic consistent across functions.
Typical candidates:
- Transient failures calling a downstream HTTP API.
- Temporary database connectivity issues.
- Service Bus lock lost due to brief processing hiccup (when safe to reprocess).
When to implement code-level retries
Use code-level retries when you need fine-grained control: different retry policies per dependency, custom classification of retryable errors, or you want to stop retrying when a circuit-breaker is open.
Typical candidates:
- Downstream API returns 429 with a Retry-After header you must honor.
- Different backoff for different endpoints (e.g., payment provider vs internal service).
- You need to retry only a sub-step, not the whole function execution.
Step-by-step: Implementing platform-level retries (Azure Functions host.json)
Configure retries at the function host level so transient failures automatically re-run the invocation. Choose between a fixed delay and exponential backoff.
{ "version": "2.0", "retry": { "strategy": "exponentialBackoff", "maxRetryCount": 5, "minimumInterval": "00:00:02", "maximumInterval": "00:01:00" }}Practical guidance:
- maxRetryCount: keep it small enough to avoid long backlogs; start with 3–5.
- minimumInterval: avoid immediate retry storms; 1–5 seconds is typical.
- maximumInterval: cap to prevent excessive delays; 30–120 seconds is common.
Important: retries re-run the entire function invocation. If your function performs multiple side effects, you must make those side effects idempotent or guarded (covered below).
Step-by-step: Code-level retries with exponential backoff (HTTP example)
Implement a small retry helper that retries only on retryable conditions (timeouts, 5xx, 429) and uses exponential backoff with jitter.
```csharp
static bool IsRetryable(HttpResponseMessage? resp, Exception? ex)
{
    if (ex is TaskCanceledException) return true; // timeout
    if (resp is null) return false;
    var code = (int)resp.StatusCode;
    if (code == 429) return true;  // throttled
    if (code >= 500) return true;  // server-side failure
    return false;
}

// The request is created per attempt because an HttpRequestMessage cannot be sent twice.
static async Task<HttpResponseMessage> SendWithRetryAsync(
    HttpClient client, Func<HttpRequestMessage> createRequest, int maxAttempts, CancellationToken ct)
{
    var rng = new Random();
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        HttpResponseMessage? resp = null;
        try
        {
            using var req = createRequest();
            resp = await client.SendAsync(req, ct);
            if (!IsRetryable(resp, null)) return resp;
        }
        catch (Exception ex)
        {
            if (!IsRetryable(null, ex)) throw;
        }

        // Honor Retry-After for 429 when present
        if (resp?.StatusCode == System.Net.HttpStatusCode.TooManyRequests &&
            resp.Headers.RetryAfter?.Delta is TimeSpan ra)
        {
            resp.Dispose();
            await Task.Delay(ra, ct);
            continue;
        }
        resp?.Dispose();

        // Exponential backoff with jitter: 200ms, 400ms, 800ms..., capped at 5s
        var baseDelayMs = (int)(Math.Pow(2, attempt - 1) * 200);
        var jitterMs = rng.Next(0, 200);
        var delay = TimeSpan.FromMilliseconds(Math.Min(baseDelayMs + jitterMs, 5000));
        await Task.Delay(delay, ct);
    }
    throw new HttpRequestException($"Failed after {maxAttempts} attempts");
}
```
Practical guidance:
- Keep maxAttempts low (3–5) and rely on the trigger’s redelivery for longer recovery.
- Retry only idempotent requests (GET/PUT with idempotency key, not unsafe POST without protection).
- Use jitter to avoid synchronized retries across instances.
Exponential Backoff and “Circuit Breaker” Patterns
Why backoff matters
Without backoff, retries amplify outages: more load hits an already struggling dependency. Exponential backoff spreads retries over time and increases the chance the dependency recovers.
Circuit-breaker-like behavior in serverless
A classic circuit breaker is stateful (closed/open/half-open). In serverless, instances are ephemeral, so you typically implement a shared circuit signal using a durable store (e.g., a cache or database flag) or you approximate it with throttling and short-circuit checks.
Use a circuit-breaker-like approach when:
- A dependency is consistently failing (e.g., repeated 503s).
- Retries are causing cost and backlog growth.
- You can safely defer work (e.g., re-queue with delay) rather than hammering the dependency.
Step-by-step: Simple shared “open circuit” flag
Pattern: before calling a dependency, check a shared flag (with TTL). If open, skip the call and fail fast (or re-queue). When failures exceed a threshold, open the circuit for a cool-down period.
- Store: Redis/Cache, Cosmos DB, or any low-latency store.
- Key: per dependency (e.g., circuit:payments-api).
- TTL: 30–120 seconds to cool down.
Pseudocode sketch:
```csharp
// 1) Check circuit flag (shared store)
if (await circuitStore.IsOpenAsync("payments-api", ct))
{
    // Fail fast so the message will be retried later, or re-queue with delay
    throw new Exception("Payments API circuit open");
}

// 2) Call dependency; on repeated failures, open circuit
try
{
    await CallPaymentsApiAsync(ct);
}
catch (Exception)
{
    await circuitStore.RecordFailureAsync("payments-api", ct);
    if (await circuitStore.ShouldOpenAsync("payments-api", ct))
    {
        await circuitStore.OpenAsync("payments-api", ttl: TimeSpan.FromSeconds(60), ct);
    }
    throw;
}
```
Key point: failing fast is often better than slow timeouts because it preserves compute and reduces lock durations on messages.
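The circuitStore above is an abstraction, not a specific SDK. As one possible backing implementation, here is a minimal sketch using StackExchange.Redis, where the cool-down is simply the TTL of an "open" key and failures are counted in a rolling window (class and method names mirror the pseudocode; cancellation tokens are omitted because the Redis client does not accept them):

```csharp
using StackExchange.Redis;

public sealed class CircuitStore
{
    private readonly IDatabase _db;
    private readonly int _failureThreshold;

    public CircuitStore(IConnectionMultiplexer redis, int failureThreshold = 5)
    {
        _db = redis.GetDatabase();
        _failureThreshold = failureThreshold;
    }

    // The circuit is "open" while the flag key exists; its TTL is the cool-down.
    public Task<bool> IsOpenAsync(string dependency) =>
        _db.KeyExistsAsync($"circuit:{dependency}:open");

    // Count failures in a rolling window; the counter key expires on its own.
    public async Task RecordFailureAsync(string dependency)
    {
        var key = $"circuit:{dependency}:failures";
        await _db.StringIncrementAsync(key);
        await _db.KeyExpireAsync(key, TimeSpan.FromSeconds(60));
    }

    public async Task<bool> ShouldOpenAsync(string dependency)
    {
        var value = await _db.StringGetAsync($"circuit:{dependency}:failures");
        return value.TryParse(out long failures) && failures >= _failureThreshold;
    }

    // Open the circuit for a cool-down period; it closes when the key expires.
    public Task OpenAsync(string dependency, TimeSpan ttl) =>
        _db.StringSetAsync($"circuit:{dependency}:open", "1", ttl);
}
```

Because both the failure counter and the open flag expire on their own, the circuit closes automatically after the cool-down without extra bookkeeping.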
Idempotency: Making At-Least-Once Delivery Safe
The core idea
At-least-once delivery means your function can run more than once for the same event. Idempotency ensures that running the same event multiple times produces the same final outcome (or at least does not produce incorrect side effects).
“Exactly-once” vs “exactly-once illusion”
True exactly-once processing across multiple systems is rarely achievable without heavy coordination. Instead, you build an “exactly-once illusion” by combining:
- Deduplication keys (event IDs, message IDs, business keys).
- Conditional writes (insert-if-not-exists, optimistic concurrency).
- Idempotent side effects (upserts, deterministic updates).
- Outbox/inbox patterns when publishing events alongside state changes.
Choosing a deduplication key
Prefer a stable, unique identifier that is consistent across retries and redeliveries:
- Service Bus: MessageId (and/or a business correlation ID).
- Event Grid: the id field.
- Custom events: include an eventId or operationId generated by the producer.
If you cannot trust producer IDs, derive a key from immutable business data (e.g., orderId + eventType + version), but be careful: hashing payloads can break if payload ordering changes.
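If you do fall back to deriving a key, make the derivation explicit and deterministic rather than hashing the whole payload. A minimal sketch, assuming hypothetical orderId, eventType, and version fields supplied by the producer:

```csharp
using System.Security.Cryptography;
using System.Text;

// Build a deterministic dedup key from immutable business fields rather than
// hashing the raw payload, so formatting or property-order changes don't matter.
static string BuildDedupKey(string orderId, string eventType, int version)
{
    var raw = $"{orderId}:{eventType}:{version}";

    // Hash only if you need a fixed-length key; the plain concatenation works too.
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(raw));
    return Convert.ToHexString(hash);
}
```

Hashing a fixed, ordered set of fields gives a compact key while avoiding the payload-ordering problem.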
Step-by-step: Inbox (dedup store) + conditional write
Pattern: before applying side effects, record the event ID in an “inbox” table/container with a unique constraint. If the insert fails because it already exists, treat the event as already processed and exit successfully.
Steps:
- 1) Extract the dedupKey from the message.
- 2) Try to create an inbox record with that key (insert-if-not-exists).
- 3) If created: proceed with processing.
- 4) If already exists: return success (do not redo side effects).
- 5) Optionally store processing metadata (timestamp, handler version, result).
```csharp
// Pseudocode (store must support conditional insert)
var dedupKey = message.MessageId;
try
{
    await inboxStore.InsertAsync(
        new InboxRecord { Id = dedupKey, ReceivedAtUtc = DateTime.UtcNow },
        insertIfNotExists: true, ct);
}
catch (ConflictException)
{
    // Duplicate delivery; safe to ignore
    return;
}

// Now do side effects safely
await ApplyBusinessUpdateAsync(message, ct);
await PublishFollowUpEventAsync(message, ct);
```
Important: the inbox insert must happen before side effects, and it must be atomic with respect to duplicates (unique key / conditional write).
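As one concrete option for the conditional insert, here is a sketch assuming Cosmos DB as the inbox store (container partitioned by id; names are illustrative). The SDK returns a conflict when the same id is created twice, which is exactly the duplicate signal we need:

```csharp
using System.Net;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

public sealed class InboxRecord
{
    [JsonProperty("id")]                 // Cosmos DB requires a lowercase "id" property
    public string Id { get; set; } = "";
    public DateTime ReceivedAtUtc { get; set; }
}

// Returns true on first delivery, false if the event was already recorded.
static async Task<bool> TryRecordInboxAsync(Container inbox, string dedupKey, CancellationToken ct)
{
    try
    {
        await inbox.CreateItemAsync(
            new InboxRecord { Id = dedupKey, ReceivedAtUtc = DateTime.UtcNow },
            new PartitionKey(dedupKey),  // assumes the container is partitioned by id
            cancellationToken: ct);
        return true;
    }
    catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
    {
        // The id already exists in this partition: duplicate delivery.
        return false;
    }
}
```

The handler would call TryRecordInboxAsync first and return early when it yields false.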
Idempotent writes: prefer upserts and conditional updates
When updating state, avoid “increment” operations that double-count on retries unless you can guarantee exactly-once. Prefer:
- Upsert by natural key (e.g., set order status to “Paid”).
- Compare-and-swap using ETags/row versions to prevent stale updates.
- Append with uniqueness (e.g., store a list of processed event IDs per aggregate, bounded by TTL).
Example: conditional state transition (pseudocode):
```csharp
// Only move from Pending -> Paid once
var order = await orders.GetAsync(orderId, ct);
if (order.Status == "Paid") return; // already applied
order.Status = "Paid";
await orders.ReplaceAsync(order, ifMatchEtag: order.ETag, ct);
```
Idempotency for outbound HTTP calls
If you call an external API that supports idempotency keys, use them. If it doesn’t, you must protect the call with your own dedup store so you don’t create duplicates (e.g., duplicate charges).
- Generate an idempotency key from the event ID.
- Send it as a header if supported (e.g., Idempotency-Key); a minimal sketch follows this list.
- Record the external operation result keyed by event ID so retries can reuse it.
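A minimal sketch of the header-based variant; the Idempotency-Key header name is common but provider-specific, so check the API you are calling:

```csharp
// Reuse the event ID as the idempotency key so retries of the same event send
// the same key and the provider can deduplicate the operation on its side.
static HttpRequestMessage BuildChargeRequest(Uri endpoint, string eventId, HttpContent body)
{
    var request = new HttpRequestMessage(HttpMethod.Post, endpoint) { Content = body };
    request.Headers.Add("Idempotency-Key", eventId);
    return request;
}
```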
Poison Messages and Dead-Letter Queues (Service Bus Focus)
What is a poison message?
A poison message is one that repeatedly fails processing and will never succeed without intervention (malformed payload, missing referenced entity, incompatible schema). If you keep retrying it, it can block progress or consume resources indefinitely.
Service Bus delivery count and dead-lettering
Service Bus tracks delivery count. When processing fails and the message is abandoned, it can be redelivered. After MaxDeliveryCount is exceeded, the message is moved to the dead-letter queue (DLQ).
Design goals:
- Let transient failures retry.
- Detect non-retryable failures and dead-letter quickly (don’t waste retries).
- Ensure DLQ messages contain enough context to diagnose and replay safely.
Step-by-step: Classify errors and dead-letter non-retryable messages
In a Service Bus-triggered function, classify exceptions into retryable vs non-retryable. For non-retryable cases, explicitly dead-letter with a reason and description.
```csharp
// Pseudocode shape (exact APIs depend on runtime model)
try
{
    var evt = Deserialize(message.Body);
    Validate(evt);
    await ProcessAsync(evt, ct);
}
catch (JsonException ex)
{
    await DeadLetterAsync(message, reason: "MalformedJson", description: ex.Message, ct);
    return;
}
catch (ValidationException ex)
{
    await DeadLetterAsync(message, reason: "ValidationFailed", description: ex.Message, ct);
    return;
}
catch (Exception)
{
    // Transient/unknown: let it retry by rethrowing
    throw;
}
```
Practical guidance:
- Dead-letter on schema/validation errors immediately.
- Dead-letter a message that references a nonexistent entity only if that entity truly won’t appear later; otherwise treat the failure as retryable with delay (or re-queue).
- Include correlation IDs, event ID, and a failure category in DLQ metadata.
Step-by-step: Operational workflow for DLQ
DLQ is not just a bin; it’s a workflow.
- 1) Monitor DLQ length and age; alert when thresholds are exceeded.
- 2) Triage by reason (MalformedJson vs ValidationFailed vs DependencyDown).
- 3) Fix root cause (producer bug, schema update, missing data).
- 4) Replay safely: only after idempotency is in place. Replaying without idempotency can duplicate side effects.
When replaying, preserve the original MessageId/eventId so deduplication still works.
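A replay sketch using the Azure.Messaging.ServiceBus SDK (queue name and batch size are illustrative); the important detail is copying the original MessageId and correlation data onto the resubmitted message:

```csharp
using Azure.Messaging.ServiceBus;

// Move messages from the DLQ back to the main queue, preserving MessageId
// so downstream deduplication still recognizes them.
static async Task ReplayDeadLettersAsync(ServiceBusClient client, string queueName, CancellationToken ct)
{
    await using var receiver = client.CreateReceiver(queueName,
        new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });
    await using var sender = client.CreateSender(queueName);

    var deadLetters = await receiver.ReceiveMessagesAsync(maxMessages: 10, cancellationToken: ct);
    foreach (var dead in deadLetters)
    {
        var replay = new ServiceBusMessage(dead.Body)
        {
            MessageId = dead.MessageId,          // preserve for dedup
            CorrelationId = dead.CorrelationId,  // keep tracing intact
            ContentType = dead.ContentType
        };
        await sender.SendMessageAsync(replay, ct);
        await receiver.CompleteMessageAsync(dead, ct);
    }
}
```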
Timeouts: Bounding Work and Avoiding Resource Exhaustion
Why timeouts are part of correctness
A function that “eventually finishes” is not reliable if it holds message locks too long, blocks concurrency, or causes cascading retries. Timeouts define an upper bound on work and force you to handle slow dependencies intentionally.
Three layers of timeouts to set
- Function execution timeout: the maximum time an invocation is allowed to run (varies by hosting plan and configuration).
- Dependency timeouts: HTTP client timeout, database command timeout, SDK operation timeout (see the sketch after this list).
- Message lock/visibility time: how long you have to complete processing before the message becomes available again (Service Bus lock duration; queue visibility timeout).
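A minimal sketch of the dependency layer, assuming an HTTP dependency and illustrative values taken from a per-message time budget:

```csharp
// In a real app the HttpClient would be injected (e.g., IHttpClientFactory) and reused.
static class Dependencies
{
    public static readonly HttpClient Http = new()
    {
        Timeout = TimeSpan.FromSeconds(3) // hard upper bound for any HTTP call
    };

    public static async Task<string> GetStatusAsync(CancellationToken ct)
    {
        // Tighter per-call bound layered on top of the invocation-wide token.
        using var callCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        callCts.CancelAfter(TimeSpan.FromSeconds(2));

        // URL is illustrative.
        using var response = await Http.GetAsync("https://internal-service.example/status", callCts.Token);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(callCts.Token);
    }
}
```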
Structured approach to setting timeouts
Use a budget approach: start from the maximum time you can safely hold the message and allocate time to each step.
- 1) Determine the maximum safe processing time per message (e.g., 30 seconds) based on lock duration, throughput goals, and cost.
- 2) Allocate time budgets to dependencies (e.g., HTTP call 3 seconds, DB write 2 seconds) and keep headroom for retries within the invocation.
- 3) Use a single CancellationToken with a deadline for the whole invocation and pass it to all I/O calls.
- 4) If the deadline is exceeded, fail fast and let the platform retry later.
Step-by-step: Deadline-based cancellation token
```csharp
// Pseudocode
var overallBudget = TimeSpan.FromSeconds(25);
using var cts = CancellationTokenSource.CreateLinkedTokenSource(functionCt);
cts.CancelAfter(overallBudget);
var ct = cts.Token;

// Pass ct to all downstream calls
await CallDependencyAsync(ct);
await WriteStateAsync(ct);
```
Practical guidance:
- Prefer short dependency timeouts and rely on retries rather than waiting a long time.
- Ensure timeouts are consistent: dependency timeout should be less than overall budget.
- Avoid unbounded operations (no infinite retries inside one invocation).
Handling Partial Failures Safely
Order your side effects
When you must do multiple actions, order them so that retries are safe:
- Prefer writing idempotent state first (with conditional logic), then emitting follow-up events.
- If you emit an event and then fail to write state, you may create downstream effects without local consistency.
Outbox pattern (conceptual)
To avoid “state updated but event not published” (or vice versa), store outgoing events in an outbox alongside your state update, then publish from the outbox asynchronously. This reduces partial failure windows and makes retries deterministic.
Minimal steps:
- 1) In the same transactional/atomic boundary as your state update, write an outbox record with the event payload and a unique ID.
- 2) A separate function reads outbox records and publishes them.
- 3) Mark outbox records as sent using conditional updates to avoid duplicates (see the publisher sketch below).
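A minimal sketch of the publisher side, assuming hypothetical IOutboxStore and IEventPublisher abstractions; any store that supports conditional updates (ETag or row version) can play the outbox role:

```csharp
// Hypothetical shapes; swap in your actual store and publisher.
public sealed class OutboxRecord
{
    public string Id { get; set; } = "";      // unique event ID
    public string Payload { get; set; } = ""; // serialized event
    public string ETag { get; set; } = "";    // concurrency token for conditional updates
}

public interface IOutboxStore
{
    Task<IReadOnlyList<OutboxRecord>> GetUnsentAsync(int maxItems, CancellationToken ct);
    Task MarkSentAsync(string id, string ifMatchEtag, CancellationToken ct);
}

public interface IEventPublisher
{
    Task PublishAsync(string eventId, string payload, CancellationToken ct);
}

// A separate function (timer- or change-feed-triggered) drains the outbox.
static async Task PublishPendingAsync(IOutboxStore outboxStore, IEventPublisher eventPublisher, CancellationToken ct)
{
    foreach (var record in await outboxStore.GetUnsentAsync(50, ct))
    {
        await eventPublisher.PublishAsync(record.Id, record.Payload, ct);

        // Conditional update: if another instance already marked it sent, the
        // ETag mismatch surfaces and the record is simply skipped.
        await outboxStore.MarkSentAsync(record.Id, record.ETag, ct);
    }
}
```

Publishing before marking as sent means the worst case is a duplicate publish, which consumers already tolerate because they deduplicate on the event ID.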
Batch processing: isolate failures
If a single trigger delivers multiple items (or your handler processes a batch), avoid failing the entire batch for one bad item unless required; a sketch follows the list below.
- Validate each item; dead-letter or quarantine bad items.
- Process items independently with per-item idempotency keys.
- Record per-item results so retries don’t redo successful items.
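A sketch of per-item isolation, assuming direct use of a ServiceBusReceiver (in an Azure Functions trigger you would use the equivalent message settlement actions); ProcessSingleAsync is a placeholder for your per-item handler:

```csharp
using System.Text.Json;
using Azure.Messaging.ServiceBus;

// Placeholder for your per-item business logic (with its own idempotency key).
static Task ProcessSingleAsync(ServiceBusReceivedMessage message, CancellationToken ct)
    => Task.CompletedTask;

// Settle each message independently so one bad item does not fail the batch.
static async Task HandleBatchAsync(
    ServiceBusReceiver receiver,
    IReadOnlyList<ServiceBusReceivedMessage> messages,
    CancellationToken ct)
{
    foreach (var message in messages)
    {
        try
        {
            await ProcessSingleAsync(message, ct);
            await receiver.CompleteMessageAsync(message, ct);
        }
        catch (JsonException ex)
        {
            // Non-retryable for this item only: quarantine it and keep going.
            await receiver.DeadLetterMessageAsync(message, "MalformedJson", ex.Message, ct);
        }
        catch (Exception)
        {
            // Transient for this item: abandon so it is redelivered later.
            await receiver.AbandonMessageAsync(message, cancellationToken: ct);
        }
    }
}
```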
Putting It Together: A Reliability Checklist for Event Handlers
- Classify failures: transient vs throttling vs malformed vs business-rule failure.
- Retries: use exponential backoff; cap attempts; honor Retry-After; add jitter.
- Idempotency: choose a dedup key; implement an inbox/conditional insert; make writes idempotent.
- Poison handling: dead-letter non-retryable messages with reason codes; monitor DLQ; replay safely.
- Timeouts: set an overall budget; set dependency timeouts; fail fast; avoid long blocking calls.
- Partial failures: order side effects; use outbox/inbox where needed; ensure reprocessing is safe.