Why observability matters for serverless
In event-driven systems, failures can be intermittent, distributed, and hard to reproduce. Observability is the combination of logs, metrics, and traces that lets you answer: What happened? Where did it happen? Why did it happen? How often is it happening? For Azure Functions, the goal is fast diagnosis and operational confidence without logging sensitive data or flooding storage with noise.
Structured logging in Azure Functions
What “structured” means
Structured logs are machine-queryable records (key/value fields) rather than unstructured text. This enables reliable filtering and aggregation (for example, “show all failures for tenantId=123” or “count timeouts by dependencyName”). In .NET Functions, structured logging is done through ILogger message templates and scopes; in JavaScript/TypeScript, it’s typically JSON objects written to the console or a logger library that outputs JSON.
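For example, in .NET the difference between an interpolated string and a message template is what makes a log queryable. A minimal sketch (the event name and field names are illustrative):

using Microsoft.Extensions.Logging;

public static class LoggingExamples
{
    public static void Compare(ILogger logger, string tenantId)
    {
        // Unstructured: the tenant ID is baked into the message text and can only be found with string matching.
        logger.LogInformation($"Validation failed for tenant {tenantId}");

        // Structured: EventName and TenantId are emitted as fields (customDimensions in Application Insights)
        // and can be filtered and aggregated directly.
        logger.LogInformation("Validation failed. event={EventName} tenantId={TenantId}",
            "Order.ValidationFailed", tenantId);
    }
}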
Core fields to include
- Correlation identifiers: operation_Id (trace ID), operation_parentId (span/parent), and your own correlationId when you need to stitch business workflows across triggers.
- Log level: use levels consistently so alerts can key off Error/Critical while keeping Information useful and Debug optional.
- Event name: a stable identifier like Order.ValidationFailed or Payment.ProviderTimeout to group incidents across code changes.
- Safe payload: only log what you can safely store (no secrets, tokens, full request bodies, or personal data). Prefer hashes, counts, and redacted summaries.
Step-by-step: implement correlation IDs and scopes (.NET isolated example)
The pattern is: extract or generate a correlation ID, add it to a logging scope, and include it in downstream calls. Scopes automatically attach fields to all logs within the scope.
using System.Diagnostics;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;

public class ProcessOrder
{
    private readonly ILogger _logger;

    public ProcessOrder(ILoggerFactory loggerFactory) =>
        _logger = loggerFactory.CreateLogger<ProcessOrder>();

    [Function("ProcessOrder")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        FunctionContext ctx)
    {
        var correlationId = GetOrCreateCorrelationId(req);

        using (_logger.BeginScope(new Dictionary<string, object>
        {
            ["correlationId"] = correlationId,
            ["functionInvocationId"] = ctx.InvocationId,
            ["traceId"] = Activity.Current?.TraceId.ToString()
        }))
        {
            _logger.LogInformation("Order processing started. event={EventName}", "Order.ProcessingStarted");
            try
            {
                // ... validate, call downstream services, enqueue work, etc.
                _logger.LogInformation("Order processing completed. event={EventName}", "Order.ProcessingCompleted");
                var res = req.CreateResponse(System.Net.HttpStatusCode.Accepted);
                return res;
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Order processing failed. event={EventName}", "Order.ProcessingFailed");
                var res = req.CreateResponse(System.Net.HttpStatusCode.InternalServerError);
                return res;
            }
        }
    }

    private static string GetOrCreateCorrelationId(HttpRequestData req)
    {
        if (req.Headers.TryGetValues("x-correlation-id", out var values))
        {
            var v = values.FirstOrDefault();
            if (!string.IsNullOrWhiteSpace(v)) return v;
        }

        return Guid.NewGuid().ToString("N");
    }
}

Safe payload logging patterns
- Redact: log only whitelisted fields (e.g., orderId, tenantId, itemCount), never raw payloads.
- Summarize: log sizes and counts (e.g., payloadBytes, recordCount).
- Hash: if you need to correlate values without storing them, log a stable hash (e.g., SHA-256 of an email) and keep the salt in a secure location (see the sketch after this list).
- Sampling: for high-volume success logs, consider sampling (e.g., log 1% of successes) while logging all failures.
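A minimal sketch of the redact/summarize/hash patterns, assuming .NET 5 or later (for SHA256.HashData and Convert.ToHexString); the field names and the salt parameter are illustrative:

using System.Security.Cryptography;
using System.Text;
using Microsoft.Extensions.Logging;

public static class SafeLogging
{
    public static void LogOrderAccepted(ILogger logger, string orderId, int itemCount, string email, string salt)
    {
        // Hash instead of logging the raw email; keep the salt outside source control (for example, in Key Vault).
        var emailHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(salt + email.Trim().ToLowerInvariant())));

        // Only whitelisted fields and summaries reach the log: an ID, a count, and a hash.
        logger.LogInformation(
            "Order accepted. event={EventName} orderId={OrderId} itemCount={ItemCount} emailHash={EmailHash}",
            "Order.Accepted", orderId, itemCount, emailHash);
    }
}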
Log level guidance for operational usefulness
- Information: lifecycle milestones and business events (started, completed, accepted, skipped) with minimal fields.
- Warning: recoverable issues (validation failures, transient dependency slowness, retries starting) that may indicate emerging problems.
- Error: failed invocations, exhausted retries, poison handling, or dependency failures that impact outcomes.
- Critical: widespread outage indicators (cannot reach core dependency, configuration missing, repeated startup failures).
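As a rough illustration of that mapping (the event names are made up for this sketch):

using Microsoft.Extensions.Logging;

public static class LevelExamples
{
    public static void Illustrate(ILogger logger, int attempt, Exception ex)
    {
        // Information: lifecycle milestone.
        logger.LogInformation("Payment accepted. event={EventName}", "Payment.Accepted");
        // Warning: recoverable issue, retry in progress.
        logger.LogWarning("Payment provider slow, retrying. event={EventName} attempt={Attempt}", "Payment.Retrying", attempt);
        // Error: retries exhausted, the outcome is affected.
        logger.LogError(ex, "Payment failed after retries. event={EventName}", "Payment.Failed");
        // Critical: outage indicator.
        logger.LogCritical("Payment provider unreachable from all instances. event={EventName}", "Payment.ProviderUnreachable");
    }
}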
Application Insights concepts you will use daily
Telemetry types and how they map to questions
- Requests: represent incoming operations (HTTP triggers and function invocations). Use them to track latency, success rate, and volume.
- Dependencies: outbound calls (HTTP to APIs, database calls, storage operations). Use them to find which downstream service is slow or failing.
- Exceptions: captured stack traces and exception types. Use them to group failures by root cause.
- Traces: your logs. Use them for narrative context around a request/operation.
- Custom events: business milestones (e.g., OrderPaid, InvoiceGenerated) to measure funnel progression.
- Custom metrics: numeric time series (e.g., OrdersProcessed, QueueLagSeconds) for dashboards and alerts.
Step-by-step: add custom events and metrics
In many Function setups, Application Insights is enabled at the app level and logs flow automatically. Custom events and metrics are for business and operational signals that aren’t captured as requests/dependencies by default.
.NET isolated (using TelemetryClient):
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class BillingFunction
{
    private readonly TelemetryClient _telemetry;
    private readonly ILogger _logger;

    public BillingFunction(TelemetryClient telemetry, ILoggerFactory loggerFactory)
    {
        _telemetry = telemetry;
        _logger = loggerFactory.CreateLogger<BillingFunction>();
    }

    [Function("BillCustomer")]
    public async Task Run([QueueTrigger("billing")] string msg)
    {
        var props = new Dictionary<string, string> { ["eventName"] = "Billing.Received" };
        _telemetry.TrackEvent("Billing.Received", props);

        var metric = _telemetry.GetMetric("Billing.MessagesProcessed");
        metric.TrackValue(1);

        try
        {
            // ... billing logic
            _telemetry.TrackEvent("Billing.Succeeded");
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex, new Dictionary<string, string> { ["stage"] = "charge" });
            _logger.LogError(ex, "Billing failed");
            throw;
        }
    }
}

Keep custom metric names stable and low-cardinality. Avoid putting user IDs or order IDs into metric dimensions; use those in logs/traces instead.
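The constructor injection above assumes TelemetryClient is registered in the isolated worker's Program.cs. A minimal sketch, assuming the Microsoft.ApplicationInsights.WorkerService and Microsoft.Azure.Functions.Worker.ApplicationInsights packages are referenced and the APPLICATIONINSIGHTS_CONNECTION_STRING setting is configured:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .ConfigureServices(services =>
    {
        // Registers TelemetryClient and routes the worker's logs and telemetry to Application Insights.
        services.AddApplicationInsightsTelemetryWorkerService();
        services.ConfigureFunctionsApplicationInsights();
    })
    .Build();

host.Run();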
Distributed tracing across triggers and downstream services
How tracing works in practice
Distributed tracing links telemetry from the trigger through internal steps and outbound dependencies using a shared trace context. In Azure, this is typically W3C Trace Context (traceparent/tracestate). When correctly propagated, you can open a single end-to-end transaction view and see: trigger invocation → dependency calls → downstream service spans → exceptions.
What to propagate
- HTTP: forward traceparent and tracestate headers automatically when using modern HTTP clients with instrumentation; also forward your x-correlation-id if you use one for business correlation.
- Messaging: include trace context in message metadata/properties when possible. If you can't, include a correlation ID in the message body (carefully) or a dedicated field. A sketch of the messaging case follows this list.
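For the messaging case, a minimal sketch of stamping outgoing messages, assuming Azure.Messaging.ServiceBus (recent SDK versions can propagate trace context automatically when instrumentation is enabled, so treat this as a fallback pattern); the property names are illustrative:

using System.Diagnostics;
using Azure.Messaging.ServiceBus;

public static class TracedMessageFactory
{
    public static ServiceBusMessage Create(BinaryData body, string correlationId)
    {
        var message = new ServiceBusMessage(body)
        {
            // Business correlation ID, separate from the W3C trace context.
            CorrelationId = correlationId
        };

        // Activity.Current?.Id is the W3C traceparent under the default W3C ID format.
        var activity = Activity.Current;
        if (activity?.Id is { } traceparent)
        {
            message.ApplicationProperties["traceparent"] = traceparent;
        }
        if (!string.IsNullOrEmpty(activity?.TraceStateString))
        {
            message.ApplicationProperties["tracestate"] = activity!.TraceStateString;
        }

        return message;
    }
}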
Step-by-step: ensure outbound HTTP calls are traceable (.NET)
Use IHttpClientFactory and avoid creating raw HttpClient instances per invocation. With Application Insights/OpenTelemetry instrumentation, dependency telemetry is captured and correlated automatically.
using System.Net.Http;
using Microsoft.Extensions.Logging;

public class EnrichmentClient
{
    private readonly HttpClient _http;
    private readonly ILogger _logger;

    public EnrichmentClient(HttpClient http, ILoggerFactory loggerFactory)
    {
        _http = http;
        _logger = loggerFactory.CreateLogger<EnrichmentClient>();
    }

    public async Task<string> GetEnrichmentAsync(string id, string correlationId)
    {
        using var req = new HttpRequestMessage(HttpMethod.Get, $"/enrich/{id}");
        req.Headers.TryAddWithoutValidation("x-correlation-id", correlationId);

        var res = await _http.SendAsync(req);
        _logger.LogInformation("Enrichment call completed. status={StatusCode}", (int)res.StatusCode);

        res.EnsureSuccessStatusCode();
        return await res.Content.ReadAsStringAsync();
    }
}

If you see dependency calls in Application Insights but they are not connected to the triggering request, it usually indicates missing trace context propagation or conflicting instrumentation libraries.
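The typed client above is registered once with IHttpClientFactory so the handler pipeline (and its tracing instrumentation) is shared across invocations. A minimal registration sketch, assuming the Microsoft.Extensions.Http package and the same HostBuilder setup shown earlier; the base address is a placeholder:

using System;
using Microsoft.Extensions.DependencyInjection;

public static class EnrichmentClientRegistration
{
    // Call this from ConfigureServices in Program.cs.
    public static IServiceCollection AddEnrichmentClient(this IServiceCollection services)
    {
        services.AddHttpClient<EnrichmentClient>(client =>
        {
            client.BaseAddress = new Uri("https://enrichment.example.invalid"); // placeholder endpoint
        });
        return services;
    }
}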
Tracing across asynchronous triggers
When a workflow spans multiple triggers (for example, an HTTP-triggered function that enqueues work and a queue-triggered function that processes it), you need a linking strategy:
- Preferred: store W3C trace context in message properties/metadata and restore it in the consumer to continue the trace (see the consumer-side sketch after this list).
- Fallback: store a correlationId and use it to query logs and events across invocations. This won't create a single distributed trace graph, but it still enables fast investigation.
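A minimal consumer-side sketch, assuming the producer stored the W3C traceparent in a message property (as in the earlier messaging sketch) and that your trigger exposes it as a string; the activity name is illustrative:

using System.Diagnostics;

public static class TraceContextRestorer
{
    // Starts an Activity parented to the producer's traceparent so telemetry emitted
    // while processing the message carries the same trace ID.
    public static Activity StartFrom(string? traceparent)
    {
        var activity = new Activity("Queue.ProcessMessage");
        if (!string.IsNullOrWhiteSpace(traceparent))
        {
            // Accepts a W3C traceparent string (version-traceid-spanid-flags) under the default W3C ID format.
            activity.SetParentId(traceparent);
        }
        return activity.Start();
    }
}

Stop or dispose the returned activity when processing completes. Whether the spans appear linked in the end-to-end transaction view still depends on the instrumentation in use, so verify with a test trace.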
Dashboard blueprint: what to visualize
Golden signals for Azure Functions
- Traffic/throughput: request count, trigger execution count, messages processed per minute.
- Errors: failure rate, exception count by type, dependency failure rate.
- Latency: end-to-end duration and dependency duration, with percentiles (P50/P95/P99).
- Saturation/backlog: queue length, oldest message age, event lag, throttling indicators.
Recommended dashboard sections
- Service overview: total executions, success rate, P95 latency, top failing functions.
- Dependencies: dependency P95 latency, failure rate by dependency name, top timeouts.
- Backlog health: queue length, queue age, processing rate vs arrival rate.
- Retries and poison indicators: count of retry attempts, number of messages moved to dead-letter/poison handling, repeated failures by message type.
- Cost-related signals: execution count trends, average duration, data ingestion volume to Application Insights, dependency call volume (especially paid APIs).
Example KQL queries for dashboards
Failure rate by function (last 1 hour)
requests
| where timestamp > ago(1h)
| summarize total=count(), failed=countif(success == false) by operation_Name
| extend failureRate = todouble(failed) / todouble(total)
| order by failureRate desc

Latency percentiles by function (last 1 hour)

requests
| where timestamp > ago(1h)
| summarize p50=percentile(duration, 50), p95=percentile(duration, 95), p99=percentile(duration, 99) by operation_Name
| order by p95 desc

Top dependency failures (last 1 hour)

dependencies
| where timestamp > ago(1h)
| where success == false
| summarize failures=count() by name, type, resultCode
| order by failures desc

Exceptions grouped by type and outer message (last 24 hours)

exceptions
| where timestamp > ago(24h)
| summarize count() by type, outerMessage
| order by count_ desc

Retry-like patterns (heuristic via traces)

traces
| where timestamp > ago(6h)
| where message has "retry" or message has "Retry"
| summarize count() by operation_Name
| order by count_ desc

For queue length and message age, prefer platform metrics from the underlying service (Storage Queues/Service Bus/Event Hubs) in Azure Monitor, then pin them to the same dashboard as Application Insights charts.
Alerting blueprint: what to page on vs what to ticket
Principles for actionable alerts
- Alert on symptoms, not noise: page on sustained failure rate or backlog growth, not on a single exception.
- Use multi-window logic: require the condition to hold for 5–15 minutes to avoid flapping.
- Route by ownership: dependency failures may belong to another team; include dependency name and region in the alert payload.
- Include context: link to the relevant dashboard and a pre-filtered logs query.
Recommended alert rules
- Failure rate: requests success rate below threshold (e.g., < 99% over 10 minutes) for critical functions.
- Latency: P95 duration above threshold (e.g., > 2s over 15 minutes) or sudden regression compared to baseline.
- Backlog: queue length above threshold or oldest message age above threshold (stronger signal than length alone).
- Dependency health: dependency failure rate above threshold; dependency P95 latency above threshold.
- Retry counts: spikes in retry-related logs/events; repeated failures for the same operation.
- Cost signals: unusual increase in executions, duration, or telemetry ingestion (can indicate loops, poison storms, or verbose logging).
Example KQL for an alert: high failure rate (last 10 minutes)
requests
| where timestamp > ago(10m)
| summarize total=count(), failed=countif(success == false) by operation_Name
| extend failureRate = todouble(failed) / todouble(total)
| where total > 50 and failureRate > 0.02

Example KQL for an alert: dependency timeouts spike

dependencies
| where timestamp > ago(10m)
| where success == false
| where resultCode in ("408", "504") or name has "timeout"
| summarize failures=count() by name
| where failures > 10

Runbook-style diagnostics for common incidents
Incident: sudden spike in failures
- Step 1: confirm scope: check failure rate by operation_Name and identify the top failing function(s).
- Step 2: identify dominant exception: query exceptions grouped by type and outerMessage for the same time window.
- Step 3: check dependency failures: look at dependencies for increased failure rate or timeouts; note dependency name and result codes.
- Step 4: correlate a sample trace: pick one failed request and open the end-to-end transaction view; verify where time is spent and which call failed.
- Step 5: validate configuration drift: look for errors indicating missing settings, auth failures, or forbidden responses; compare to recent deployment changes.
Incident: backlog growing (queue length or message age increasing)
- Step 1: determine if arrivals exceed processing: compare incoming event rate vs processed count (requests/executions) over the same period.
- Step 2: check latency regression: inspect P95/P99 duration; a small code change can reduce throughput by increasing average duration.
- Step 3: check downstream throttling: dependency telemetry may show 429/503 or increased latency; backlog is often a symptom of a slow dependency.
- Step 4: look for hot partitions or skew: if one tenant/message type dominates, filter logs by tenant/message attributes to see if a subset is causing slow processing.
- Step 5: inspect poison/retry behavior: repeated retries can consume capacity; search traces for repeated failures with the same correlation ID or message identifier.
Incident: latency spike but low error rate
- Step 1: identify where latency increased: compare request duration percentiles and dependency duration percentiles.
- Step 2: isolate the dependency: find which dependency has the largest P95 increase; check if it correlates with a region or endpoint.
- Step 3: verify payload size changes: use safe payload logging (sizes/counts) to see if requests got larger.
- Step 4: check for retry-induced latency: warnings about retries can increase latency without increasing final failure rate.
Incident: noisy logs or high telemetry ingestion cost
- Step 1: find top talkers: query traces, counting by operation_Name and message pattern, to identify spammy log statements.
- Step 2: reduce verbosity safely: downgrade frequent success logs to Debug or sample them; keep errors and key milestones.
- Step 3: remove high-cardinality dimensions: avoid logging unique IDs as metric dimensions; keep them in traces only.
- Step 4: validate payload logging: ensure no large bodies are being logged; replace with size/count summaries.
Incident: cannot correlate events across multiple functions
- Step 1: verify correlation ID presence: check that logs include correlationId and that it is consistent across steps.
- Step 2: verify trace context propagation: for HTTP, confirm traceparent is forwarded; for messaging, confirm trace context or correlation ID is included in message metadata.
- Step 3: standardize event names: ensure all functions emit stable eventName values so you can query by event rather than by free-text messages.