Why observability matters for serverless
In event-driven systems, failures can be intermittent, distributed, and hard to reproduce. Observability is the combination of logs, metrics, and traces that lets you answer: What happened? Where did it happen? Why did it happen? How often is it happening? For Azure Functions, the goal is fast diagnosis and operational confidence without logging sensitive data or flooding storage with noise.
Structured logging in Azure Functions
What “structured” means
Structured logs are machine-queryable records (key/value fields) rather than unstructured text. This enables reliable filtering and aggregation (for example, “show all failures for tenantId=123” or “count timeouts by dependencyName”). In .NET Functions, structured logging is done through ILogger message templates and scopes; in JavaScript/TypeScript, it’s typically JSON objects written to the console or a logger library that outputs JSON.
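For example, in .NET the difference between an interpolated string and a message template is what makes a log queryable. A minimal sketch (the event name and field names are illustrative):

using Microsoft.Extensions.Logging;

public static class LoggingExamples
{
    public static void Compare(ILogger logger, string tenantId)
    {
        // Unstructured: the tenant ID is baked into the message text and can only be found with string matching.
        logger.LogInformation($"Validation failed for tenant {tenantId}");

        // Structured: EventName and TenantId are emitted as fields (customDimensions in Application Insights)
        // and can be filtered and aggregated directly.
        logger.LogInformation("Validation failed. event={EventName} tenantId={TenantId}",
            "Order.ValidationFailed", tenantId);
    }
}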
Core fields to include
- Correlation identifiers: operation_Id (trace ID), operation_parentId (span/parent), and your own correlationId when you need to stitch business workflows across triggers.
- Log level: use levels consistently so alerts can key off Error/Critical while keeping Information useful and Debug optional.
- Event name: a stable identifier like Order.ValidationFailed or Payment.ProviderTimeout to group incidents across code changes.
- Safe payload: only log what you can safely store (no secrets, tokens, full request bodies, or personal data). Prefer hashes, counts, and redacted summaries.
Step-by-step: implement correlation IDs and scopes (.NET isolated example)
The pattern is: extract or generate a correlation ID, add it to a logging scope, and include it in downstream calls. Scopes automatically attach fields to all logs within the scope.
using System.Diagnostics;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;

public class ProcessOrder
{
    private readonly ILogger _logger;

    public ProcessOrder(ILoggerFactory loggerFactory) =>
        _logger = loggerFactory.CreateLogger<ProcessOrder>();

    [Function("ProcessOrder")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        FunctionContext ctx)
    {
        var correlationId = GetOrCreateCorrelationId(req);

        using (_logger.BeginScope(new Dictionary<string, object>
        {
            ["correlationId"] = correlationId,
            ["functionInvocationId"] = ctx.InvocationId,
            ["traceId"] = Activity.Current?.TraceId.ToString()
        }))
        {
            _logger.LogInformation("Order processing started. event={EventName}", "Order.ProcessingStarted");
            try
            {
                // ... validate, call downstream services, enqueue work, etc.
                _logger.LogInformation("Order processing completed. event={EventName}", "Order.ProcessingCompleted");
                var res = req.CreateResponse(System.Net.HttpStatusCode.Accepted);
                return res;
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Order processing failed. event={EventName}", "Order.ProcessingFailed");
                var res = req.CreateResponse(System.Net.HttpStatusCode.InternalServerError);
                return res;
            }
        }
    }

    private static string GetOrCreateCorrelationId(HttpRequestData req)
    {
        if (req.Headers.TryGetValues("x-correlation-id", out var values))
        {
            var v = values.FirstOrDefault();
            if (!string.IsNullOrWhiteSpace(v)) return v;
        }

        return Guid.NewGuid().ToString("N");
    }
}

Safe payload logging patterns
- Redact: log only whitelisted fields (e.g., orderId, tenantId, itemCount), never raw payloads.
- Summarize: log sizes and counts (e.g., payloadBytes, recordCount).
- Hash: if you need to correlate values without storing them, log a stable hash (e.g., SHA-256 of an email) and keep the salt in a secure location (see the sketch after this list).
- Sampling: for high-volume success logs, consider sampling (e.g., log 1% of successes) while logging all failures.
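A minimal sketch of the redact/summarize/hash patterns, assuming .NET 5 or later (for SHA256.HashData and Convert.ToHexString); the field names and the salt parameter are illustrative:

using System.Security.Cryptography;
using System.Text;
using Microsoft.Extensions.Logging;

public static class SafeLogging
{
    public static void LogOrderAccepted(ILogger logger, string orderId, int itemCount, string email, string salt)
    {
        // Hash instead of logging the raw email; keep the salt outside source control (for example, in Key Vault).
        var emailHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(salt + email.Trim().ToLowerInvariant())));

        // Only whitelisted fields and summaries reach the log: an ID, a count, and a hash.
        logger.LogInformation(
            "Order accepted. event={EventName} orderId={OrderId} itemCount={ItemCount} emailHash={EmailHash}",
            "Order.Accepted", orderId, itemCount, emailHash);
    }
}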
Log level guidance for operational usefulness
- Information: lifecycle milestones and business events (started, completed, accepted, skipped) with minimal fields.
- Warning: recoverable issues (validation failures, transient dependency slowness, retries starting) that may indicate emerging problems.
- Error: failed invocations, exhausted retries, poison handling, or dependency failures that impact outcomes.
- Critical: widespread outage indicators (cannot reach core dependency, configuration missing, repeated startup failures).
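As a rough illustration of that mapping (the event names are made up for this sketch):

using Microsoft.Extensions.Logging;

public static class LevelExamples
{
    public static void Illustrate(ILogger logger, int attempt, Exception ex)
    {
        // Information: lifecycle milestone.
        logger.LogInformation("Payment accepted. event={EventName}", "Payment.Accepted");
        // Warning: recoverable issue, retry in progress.
        logger.LogWarning("Payment provider slow, retrying. event={EventName} attempt={Attempt}", "Payment.Retrying", attempt);
        // Error: retries exhausted, the outcome is affected.
        logger.LogError(ex, "Payment failed after retries. event={EventName}", "Payment.Failed");
        // Critical: outage indicator.
        logger.LogCritical("Payment provider unreachable from all instances. event={EventName}", "Payment.ProviderUnreachable");
    }
}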
Application Insights concepts you will use daily
Telemetry types and how they map to questions
- Requests: represent incoming operations (HTTP triggers and function invocations). Use them to track latency, success rate, and volume.
- Dependencies: outbound calls (HTTP to APIs, database calls, storage operations). Use them to find which downstream service is slow or failing.
- Exceptions: captured stack traces and exception types. Use them to group failures by root cause.
- Traces: your logs. Use them for narrative context around a request/operation.
- Custom events: business milestones (e.g., OrderPaid, InvoiceGenerated) to measure funnel progression.
- Custom metrics: numeric time series (e.g., OrdersProcessed, QueueLagSeconds) for dashboards and alerts.
Step-by-step: add custom events and metrics
In many Function setups, Application Insights is enabled at the app level and logs flow automatically. Custom events and metrics are for business and operational signals that aren’t captured as requests/dependencies by default.
.NET isolated (using TelemetryClient):
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class BillingFunction
{
    private readonly TelemetryClient _telemetry;
    private readonly ILogger _logger;

    public BillingFunction(TelemetryClient telemetry, ILoggerFactory loggerFactory)
    {
        _telemetry = telemetry;
        _logger = loggerFactory.CreateLogger<BillingFunction>();
    }

    [Function("BillCustomer")]
    public async Task Run([QueueTrigger("billing")] string msg)
    {
        var props = new Dictionary<string, string> { ["eventName"] = "Billing.Received" };
        _telemetry.TrackEvent("Billing.Received", props);

        var metric = _telemetry.GetMetric("Billing.MessagesProcessed");
        metric.TrackValue(1);

        try
        {
            // ... billing logic
            _telemetry.TrackEvent("Billing.Succeeded");
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex, new Dictionary<string, string> { ["stage"] = "charge" });
            _logger.LogError(ex, "Billing failed");
            throw;
        }
    }
}

Keep custom metric names stable and low-cardinality. Avoid putting user IDs or order IDs into metric dimensions; use those in logs/traces instead.
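The constructor injection above assumes TelemetryClient is registered in the isolated worker's Program.cs. A minimal sketch, assuming the Microsoft.ApplicationInsights.WorkerService and Microsoft.Azure.Functions.Worker.ApplicationInsights packages are referenced and the APPLICATIONINSIGHTS_CONNECTION_STRING setting is configured:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .ConfigureServices(services =>
    {
        // Registers TelemetryClient and routes the worker's logs and telemetry to Application Insights.
        services.AddApplicationInsightsTelemetryWorkerService();
        services.ConfigureFunctionsApplicationInsights();
    })
    .Build();

host.Run();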
Distributed tracing across triggers and downstream services
How tracing works in practice
Distributed tracing links telemetry from the trigger through internal steps and outbound dependencies using a shared trace context. In Azure, this is typically W3C Trace Context (traceparent/tracestate). When correctly propagated, you can open a single end-to-end transaction view and see: trigger invocation → dependency calls → downstream service spans → exceptions.
What to propagate
- HTTP: forward traceparent and tracestate headers automatically when using modern HTTP clients with instrumentation; also forward your x-correlation-id if you use one for business correlation.
- Messaging: include trace context in message metadata/properties when possible. If you can't, include a correlation ID in the message body (carefully) or a dedicated field. A sketch of the messaging case follows this list.
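For the messaging case, a minimal sketch of stamping outgoing messages, assuming Azure.Messaging.ServiceBus (recent SDK versions can propagate trace context automatically when instrumentation is enabled, so treat this as a fallback pattern); the property names are illustrative:

using System.Diagnostics;
using Azure.Messaging.ServiceBus;

public static class TracedMessageFactory
{
    public static ServiceBusMessage Create(BinaryData body, string correlationId)
    {
        var message = new ServiceBusMessage(body)
        {
            // Business correlation ID, separate from the W3C trace context.
            CorrelationId = correlationId
        };

        // Activity.Current?.Id is the W3C traceparent under the default W3C ID format.
        var activity = Activity.Current;
        if (activity?.Id is { } traceparent)
        {
            message.ApplicationProperties["traceparent"] = traceparent;
        }
        if (!string.IsNullOrEmpty(activity?.TraceStateString))
        {
            message.ApplicationProperties["tracestate"] = activity!.TraceStateString;
        }

        return message;
    }
}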
Step-by-step: ensure outbound HTTP calls are traceable (.NET)
Use IHttpClientFactory and avoid creating raw HttpClient instances per invocation. With Application Insights/OpenTelemetry instrumentation, dependency telemetry is captured and correlated automatically.
using System.Net.Http;
using Microsoft.Extensions.Logging;

public class EnrichmentClient
{
    private readonly HttpClient _http;
    private readonly ILogger _logger;

    public EnrichmentClient(HttpClient http, ILoggerFactory loggerFactory)
    {
        _http = http;
        _logger = loggerFactory.CreateLogger<EnrichmentClient>();
    }

    public async Task<string> GetEnrichmentAsync(string id, string correlationId)
    {
        using var req = new HttpRequestMessage(HttpMethod.Get, $"/enrich/{id}");
        req.Headers.TryAddWithoutValidation("x-correlation-id", correlationId);

        var res = await _http.SendAsync(req);
        _logger.LogInformation("Enrichment call completed. status={StatusCode}", (int)res.StatusCode);

        res.EnsureSuccessStatusCode();
        return await res.Content.ReadAsStringAsync();
    }
}

If you see dependency calls in Application Insights but they are not connected to the triggering request, it usually indicates missing trace context propagation or conflicting instrumentation libraries.
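The typed client above is registered once with IHttpClientFactory so the handler pipeline (and its tracing instrumentation) is shared across invocations. A minimal registration sketch, assuming the Microsoft.Extensions.Http package and the same HostBuilder setup shown earlier; the base address is a placeholder:

using System;
using Microsoft.Extensions.DependencyInjection;

public static class EnrichmentClientRegistration
{
    // Call this from ConfigureServices in Program.cs.
    public static IServiceCollection AddEnrichmentClient(this IServiceCollection services)
    {
        services.AddHttpClient<EnrichmentClient>(client =>
        {
            client.BaseAddress = new Uri("https://enrichment.example.invalid"); // placeholder endpoint
        });
        return services;
    }
}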
Tracing across asynchronous triggers
When a workflow spans multiple triggers (for example, an HTTP-triggered function that enqueues work and a queue-triggered function that processes it), you need a linking strategy:
- Preferred: store W3C trace context in message properties/metadata and restore it in the consumer to continue the trace (see the consumer-side sketch after this list).
- Fallback: store a correlationId and use it to query logs and events across invocations. This won't create a single distributed trace graph, but it still enables fast investigation.
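A minimal consumer-side sketch, assuming the producer stored the W3C traceparent in a message property (as in the earlier messaging sketch) and that your trigger exposes it as a string; the activity name is illustrative:

using System.Diagnostics;

public static class TraceContextRestorer
{
    // Starts an Activity parented to the producer's traceparent so telemetry emitted
    // while processing the message carries the same trace ID.
    public static Activity StartFrom(string? traceparent)
    {
        var activity = new Activity("Queue.ProcessMessage");
        if (!string.IsNullOrWhiteSpace(traceparent))
        {
            // Accepts a W3C traceparent string (version-traceid-spanid-flags) under the default W3C ID format.
            activity.SetParentId(traceparent);
        }
        return activity.Start();
    }
}

Stop or dispose the returned activity when processing completes. Whether the spans appear linked in the end-to-end transaction view still depends on the instrumentation in use, so verify with a test trace.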
Dashboard blueprint: what to visualize
Golden signals for Azure Functions
- Traffic/throughput: request count, trigger execution count, messages processed per minute.
- Errors: failure rate, exception count by type, dependency failure rate.
- Latency: end-to-end duration and dependency duration, with percentiles (P50/P95/P99).
- Saturation/backlog: queue length, oldest message age, event lag, throttling indicators.
Recommended dashboard sections
- Service overview: total executions, success rate, P95 latency, top failing functions.
- Dependencies: dependency P95 latency, failure rate by dependency name, top timeouts.
- Backlog health: queue length, queue age, processing rate vs arrival rate.
- Retries and poison indicators: count of retry attempts, number of messages moved to dead-letter/poison handling, repeated failures by message type.
- Cost-related signals: execution count trends, average duration, data ingestion volume to Application Insights, dependency call volume (especially paid APIs).
Example KQL queries for dashboards
Failure rate by function (last 1 hour)
requests
| where timestamp > ago(1h)
| summarize total=count(), failed=countif(success == false) by operation_Name
| extend failureRate = todouble(failed) / todouble(total)
| order by failureRate desc

Latency percentiles by function (last 1 hour)

requests
| where timestamp > ago(1h)
| summarize p50=percentile(duration, 50), p95=percentile(duration, 95), p99=percentile(duration, 99) by operation_Name
| order by p95 desc

Top dependency failures (last 1 hour)

dependencies
| where timestamp > ago(1h)
| where success == false
| summarize failures=count() by name, type, resultCode
| order by failures desc

Exceptions grouped by type and outer message (last 24 hours)

exceptions
| where timestamp > ago(24h)
| summarize count() by type, outerMessage
| order by count_ desc

Retry-like patterns (heuristic via traces)

traces
| where timestamp > ago(6h)
| where message has "retry" or message has "Retry"
| summarize count() by operation_Name
| order by count_ desc

For queue length and message age, prefer platform metrics from the underlying service (Storage Queues/Service Bus/Event Hubs) in Azure Monitor, then pin them to the same dashboard as Application Insights charts.
Alerting blueprint: what to page on vs what to ticket
Principles for actionable alerts
- Alert on symptoms, not noise: page on sustained failure rate or backlog growth, not on a single exception.
- Use multi-window logic: require the condition to hold for 5–15 minutes to avoid flapping.
- Route by ownership: dependency failures may belong to another team; include dependency name and region in the alert payload.
- Include context: link to the relevant dashboard and a pre-filtered logs query.
Recommended alert rules
- Failure rate: requests success rate below threshold (e.g., < 99% over 10 minutes) for critical functions.
- Latency: P95 duration above threshold (e.g., > 2s over 15 minutes) or sudden regression compared to baseline.
- Backlog: queue length above threshold or oldest message age above threshold (stronger signal than length alone).
- Dependency health: dependency failure rate above threshold; dependency P95 latency above threshold.
- Retry counts: spikes in retry-related logs/events; repeated failures for the same operation.
- Cost signals: unusual increase in executions, duration, or telemetry ingestion (can indicate loops, poison storms, or verbose logging).
Example KQL for an alert: high failure rate (last 10 minutes)
requests
| where timestamp > ago(10m)
| summarize total=count(), failed=countif(success == false) by operation_Name
| extend failureRate = todouble(failed) / todouble(total)
| where total > 50 and failureRate > 0.02

Example KQL for an alert: dependency timeouts spike

dependencies
| where timestamp > ago(10m)
| where success == false
| where resultCode in ("408", "504") or name has "timeout"
| summarize failures=count() by name
| where failures > 10

Runbook-style diagnostics for common incidents
Incident: sudden spike in failures
- Step 1: confirm scope: check failure rate by operation_Name and identify the top failing function(s).
- Step 2: identify dominant exception: query exceptions grouped by type and outerMessage for the same time window.
- Step 3: check dependency failures: look at dependencies for increased failure rate or timeouts; note dependency name and result codes.
- Step 4: correlate a sample trace: pick one failed request and open the end-to-end transaction view; verify where time is spent and which call failed.
- Step 5: validate configuration drift: look for errors indicating missing settings, auth failures, or forbidden responses; compare to recent deployment changes.
Incident: backlog growing (queue length or message age increasing)
- Step 1: determine if arrivals exceed processing: compare incoming event rate vs processed count (requests/executions) over the same period.
- Step 2: check latency regression: inspect P95/P99 duration; a small code change can reduce throughput by increasing average duration.
- Step 3: check downstream throttling: dependency telemetry may show 429/503 or increased latency; backlog is often a symptom of a slow dependency.
- Step 4: look for hot partitions or skew: if one tenant/message type dominates, filter logs by tenant/message attributes to see if a subset is causing slow processing.
- Step 5: inspect poison/retry behavior: repeated retries can consume capacity; search traces for repeated failures with the same correlation ID or message identifier.
Incident: latency spike but low error rate
- Step 1: identify where latency increased: compare request duration percentiles and dependency duration percentiles.
- Step 2: isolate the dependency: find which dependency has the largest P95 increase; check if it correlates with a region or endpoint.
- Step 3: verify payload size changes: use safe payload logging (sizes/counts) to see if requests got larger.
- Step 4: check for retry-induced latency: warnings about retries can increase latency without increasing final failure rate.
Incident: noisy logs or high telemetry ingestion cost
- Step 1: find top talkers: query traces, counting by operation_Name and message pattern, to identify spammy log statements.
- Step 2: reduce verbosity safely: downgrade frequent success logs to Debug or sample them; keep errors and key milestones.
- Step 3: remove high-cardinality dimensions: avoid logging unique IDs as metric dimensions; keep them in traces only.
- Step 4: validate payload logging: ensure no large bodies are being logged; replace with size/count summaries.
Incident: cannot correlate events across multiple functions
- Step 1: verify correlation ID presence: check that logs include correlationId and that it is consistent across steps.
- Step 2: verify trace context propagation: for HTTP, confirm traceparent is forwarded; for messaging, confirm trace context or correlation ID is included in message metadata.
- Step 3: standardize event names: ensure all functions emit stable eventName values so you can query by event rather than by free-text messages.