Observability in production: the three-layer model
In production, you need fast answers to three questions: (1) Is the platform healthy? (2) What exactly is failing and where? (3) Who should act, and how quickly? Azure Monitor supports this with three layers that work together:
- Platform metrics: numeric time-series signals (CPU, memory, requests, latency percentiles) for quick health checks and dashboards.
- Logs: detailed event records (application traces, HTTP access logs, system events) for investigation and root cause analysis.
- Alerts: automated detection (threshold-based and basic anomaly-style) that notifies or triggers actions when something deviates from normal.
A practical baseline is: use metrics for “is it broken?”, logs for “why?”, and alerts for “tell me before users do.”
Layer 1: Platform metrics (CPU, memory, requests, latency)
What metrics are (and what they are not)
Metrics are lightweight and near real-time. They are ideal for dashboards and alerting because they are structured and fast to query. They are not a replacement for logs: metrics usually won’t tell you which URL failed or which exception was thrown.
Key web-hosting metrics to watch
- Traffic and success: request count, HTTP 2xx/4xx/5xx rates (or equivalent), throughput.
- Latency: average and percentile response times (p50/p95/p99 where available).
- Compute saturation: CPU percentage, memory working set/usage, process restarts.
- Capacity signals: instance count (autoscale), queue length (if applicable), connection counts.
How to use metrics effectively
- Use rates and ratios (e.g., 5xx per minute, 5xx/requests) instead of raw counts when traffic varies.
- Compare current vs baseline: a “normal” CPU for your app might be 70% during peak; what matters is unexpected change.
- Correlate: a spike in p95 latency plus stable CPU suggests downstream dependency issues; a spike in CPU plus latency suggests compute saturation.
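As a concrete illustration of the ratio-and-correlation approach above, the sketch below puts the 5xx ratio and p95 latency on the same 5-minute timeline. It assumes App Service-style HTTP logs (AppServiceHTTPLogs with ScStatus and TimeTaken columns); adjust the table and column names to whatever your service emits.
// Sketch: 5xx ratio and p95 latency per 5-minute bin (assumes AppServiceHTTPLogs; adjust names)
AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| summarize Total=count(), Errors=countif(ScStatus >= 500), p95=percentile(TimeTaken, 95) by bin(TimeGenerated, 5m)
| extend ErrorRate = todouble(Errors) / todouble(Total)
| project TimeGenerated, ErrorRate, p95, Total
| order by TimeGenerated asc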
Layer 2: Logs (application logs, web server logs, and platform diagnostics)
Why logs matter
Logs answer “what happened?” and “where?” They include request details, exceptions, dependency failures, and platform events. For web hosting, you typically want:
- Application logs: your app’s structured logs, exceptions, traces.
- Web server/access logs: HTTP method, URL, status, duration, client IP (where available), user agent.
- Platform diagnostics: container stdout/stderr, VM system logs, App Service diagnostics, and resource health signals.
Centralize logs with a Log Analytics workspace
Log Analytics is the common destination for Azure Monitor Logs. Centralizing logs enables cross-resource queries (App Service + Container Apps + VMs) and consistent alerting.
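As a taste of what centralization buys you, the sketch below unions several tables in one workspace to show which sources have sent data recently. The table names listed are only examples; swap in whichever tables your diagnostic settings actually produce.
// Sketch: which sources in this workspace received data in the last hour?
// isfuzzy=true tolerates tables that don't exist yet; the listed tables are examples
union withsource=SourceTable isfuzzy=true AppServiceHTTPLogs, AppServiceConsoleLogs, Syslog, Event
| where TimeGenerated > ago(1h)
| summarize Rows=count() by SourceTable
| order by Rows desc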
Step-by-step: create a Log Analytics workspace
- In the Azure portal, search for Log Analytics workspaces > Create.
- Select the appropriate subscription and resource group.
- Choose a region close to your workloads (or follow your organization’s policy).
- Create the workspace, then note its name for diagnostics settings.
Enable diagnostics and route data to Log Analytics
App Service: enable diagnostics and send to Log Analytics
App Service can emit platform logs (HTTP logs, application logs, detailed errors) and metrics. For centralized investigation, route logs to Log Analytics using a diagnostic setting.
Step-by-step (portal):
- Open your App Service resource.
- Go to Monitoring > Diagnostic settings (or Diagnostics settings).
- Select Add diagnostic setting.
- Choose categories such as AppServiceHTTPLogs, AppServiceConsoleLogs (if applicable), AppServiceAppLogs, and any available platform categories.
- Set destination to Send to Log Analytics workspace and select your workspace.
- Save.
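Data can take a few minutes to start flowing after you save the setting. A quick sanity check is to confirm rows are arriving in the workspace; this sketch assumes the AppServiceHTTPLogs category was enabled.
// Sanity check: confirm HTTP logs are flowing into the workspace, and from which resource
AppServiceHTTPLogs
| where TimeGenerated > ago(4h)
| summarize Rows=count() by bin(TimeGenerated, 1h), _ResourceId
| order by TimeGenerated desc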
Practical tip: keep HTTP/access logs for a shorter retention if cost is a concern, and keep application error logs longer. Use workspace retention policies to match your needs.
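Before tuning retention, it helps to see where log volume (and therefore cost) is actually going. A hedged sketch against the workspace's standard Usage table:
// Which tables ingested the most billable data over the last 7 days
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize IngestedMB=sum(Quantity) by DataType
| order by IngestedMB desc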
Container Apps: send container logs and platform diagnostics to Log Analytics
Container Apps commonly rely on container stdout/stderr and platform events. Centralizing these logs is essential for troubleshooting crashes, failed revisions, and latency spikes.
Step-by-step (portal):
- Open your Container App.
- Go to Monitoring > Logs to confirm log access, then configure Diagnostic settings (if available for your resource).
- Add a diagnostic setting and select relevant categories (container logs, system logs, ingress/access logs where available).
- Send to the same Log Analytics workspace.
Practical tip: ensure your application writes structured logs to stdout (JSON if possible). This makes queries and filtering far more effective.
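If the app emits JSON to stdout, queries can project fields directly. Table and column names vary by setup (for example, Container Apps console logs often land in a ContainerAppConsoleLogs_CL custom table with a Log_s column); treat all of the names below as placeholders for your own schema.
// Sketch: parse JSON log lines from container stdout (table, column, and field names are placeholders)
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| extend entry = parse_json(Log_s)
| where tostring(entry.level) == "error"
| project TimeGenerated, Message=tostring(entry.message), Path=tostring(entry.path)
| take 50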
Virtual Machines: enable Azure Monitor Agent and collect OS + app logs
For VMs, you typically collect:
- Metrics: CPU, memory, disk, network (via VM insights/agent).
- OS logs: Windows Event Logs or Linux syslog.
- Web server logs: IIS logs or Nginx/Apache access/error logs (often via custom log collection).
Step-by-step: connect a VM to Log Analytics
- Open the Virtual Machine resource.
- Go to Monitoring > Insights (or VM insights), then enable it.
- Select the target Log Analytics workspace.
- Install/enable the Azure Monitor Agent when prompted.
Step-by-step: collect OS logs
- In the portal, open Azure Monitor > Data Collection Rules (DCR).
- Create a DCR targeting your VM(s).
- Choose sources: Windows Event Logs (e.g., Application/System) or Syslog (e.g., auth, daemon).
- Set destination to your Log Analytics workspace.
- Associate the DCR to the VM(s).
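Once the DCR is associated and the agent is reporting, a quick check of the agent heartbeat and the OS log tables confirms collection is working; a sketch assuming the standard Heartbeat, Event, and Syslog tables:
// Is the agent reporting?
Heartbeat
| where TimeGenerated > ago(30m)
| summarize LastHeartbeat=max(TimeGenerated) by Computer

// Are OS logs arriving? (Event on Windows, Syslog on Linux)
union isfuzzy=true Event, Syslog
| where TimeGenerated > ago(1h)
| summarize Rows=count() by Computer, Type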
Practical tip: for web server access logs on VMs, start with OS logs and application logs first, then add access log collection once you know the exact file paths and formats you need.
Starter Log Analytics queries for errors and latency
How to think about queries
Most investigations follow the same pattern: pick a time window, filter to the affected service, summarize by status/error, then drill into examples. Use these starter queries as templates and adjust table names to your environment (different services emit different tables).
Query 1: find recent server errors (5xx) from HTTP/access logs
// Template: HTTP access logs (table name varies by service/diagnostic category)
// Adjust the table and column names to match your data
AppServiceHTTPLogs
| where TimeGenerated > ago(30m)
| where ScStatus >= 500
| summarize Errors=count() by ScStatus, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
Use it when: users report failures or you see an error-rate spike in metrics.
Query 2: top failing endpoints (group by URL/path)
// Template: group failures by requested URL
AppServiceHTTPLogs
| where TimeGenerated > ago(60m)
| where ScStatus >= 500
| summarize Errors=count() by CsUriStem
| order by Errors desc
| take 20
Use it when: you need to know whether failures are isolated to a specific route (e.g., /checkout) or global.
Query 3: latency investigation (p95 by endpoint)
// Template: latency percentiles (field names vary)
AppServiceHTTPLogs
| where TimeGenerated > ago(60m)
| summarize p95=percentile(TimeTaken, 95), AvgTimeTaken=avg(TimeTaken), Requests=count() by CsUriStem
| order by p95 desc
| take 20
Use it when: the site is “slow” but not necessarily failing. Focus on p95 to capture tail latency.
Query 4: correlate errors with deployments or restarts (time-based correlation)
// Template: correlate error spikes with platform events (example table names may differ)
let window = 2h;
let errors = AppServiceHTTPLogs
| where TimeGenerated > ago(window)
| where ScStatus >= 500
| summarize ErrorCount=count() by bin(TimeGenerated, 5m);
let events = AppServiceConsoleLogs
| where TimeGenerated > ago(window)
| summarize EventCount=count() by bin(TimeGenerated, 5m);
errors
| join kind=fullouter events on TimeGenerated
| order by TimeGenerated asc
Use it when: you suspect a new release, configuration change, or restart correlates with the incident window.
Query 5: VM-side investigation (Windows/Linux OS signals)
// Windows example: application errors in Event Logs (table depends on DCR setup)
Event
| where TimeGenerated > ago(2h)
| where EventLog == "Application"
| where EventLevelName == "Error"
| summarize Errors=count() by Source, EventID
| order by Errors desc

// Linux example: high-severity syslog entries (table depends on DCR setup)
Syslog
| where TimeGenerated > ago(2h)
| where SeverityLevel in ("emerg", "alert", "crit", "err")
| summarize count() by ProcessName, Facility
| order by count_ desc
Layer 3: Alerts (threshold and anomaly-style basics)
Alert types you will use for web hosting
- Metric alerts: fast, near real-time, ideal for CPU, memory, request rate, and latency thresholds.
- Log alerts (scheduled query rules): powerful, based on KQL queries; ideal for “count of 5xx > X” or “exception signature appears”.
- Activity log alerts: detect control-plane changes (e.g., someone stopped a VM, changed a setting).
For “anomaly-style” basics, start with dynamic thresholds (where available) or use log-based baselines (compare current window vs previous window) to detect unusual spikes without hardcoding a single number.
Action groups: who gets notified and what happens next
Alerts become actionable when they route to an Action Group (email, SMS, push, webhook, ITSM connector, automation runbook). Create one action group per team/on-call rotation and reuse it across alerts.
Step-by-step: create an action group
- Open Azure Monitor > Alerts > Action groups > Create.
- Add notification receivers (email/on-call distribution, webhook to incident system).
- Optionally add an action (automation) such as triggering a runbook.
Examples of actionable alert definitions
1) High 5xx error rate (log alert)
- Signal: Log query count of 5xx in last 5 minutes.
- Condition: count > 20 (tune to your traffic), evaluated every 1 minute.
- Scope: App Service / Container App logs in the workspace.
- Action: page on-call, include top failing endpoints in alert description.
// Example KQL for a scheduled query alert
AppServiceHTTPLogs
| where TimeGenerated > ago(5m)
| where ScStatus >= 500
| summarize ErrorCount=count()
2) Latency regression (metric alert or log alert)
- Metric approach: alert when average response time > 1.5s for 10 minutes.
- Log approach: alert when p95(TimeTaken) > 2s for 10 minutes.
// Example KQL for p95 latency alert
AppServiceHTTPLogs
| where TimeGenerated > ago(10m)
| summarize p95=percentile(TimeTaken, 95)
| where p95 > 2000
3) Compute saturation (metric alert)
- App Service: CPU percentage > 80% for 10 minutes.
- Container Apps: CPU/memory usage near limits for sustained period.
- VM: CPU > 85% and memory available below threshold (if collected) for 10 minutes.
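These are most naturally expressed as metric alerts, but if you also want a log-based view of VM CPU (for example, to establish a baseline), the Perf table can be queried. A sketch, assuming the "% Processor Time" counter is being collected into the workspace:
// Sketch: 10-minute average CPU per VM above 85% (assumes the processor counter is collected into Perf)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCpu=avg(CounterValue) by Computer, bin(TimeGenerated, 10m)
| where AvgCpu > 85
| order by TimeGenerated desc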
4) Restart/crash loop detection (log alert)
- Signal: container/app restart events or repeated startup messages.
- Condition: restarts > 3 in 15 minutes.
- Action: notify on-call and include revision/instance identifiers.
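There is no single universal table for restart events, so treat this as a sketch: for Container Apps, platform lifecycle messages often land in a system-log table (commonly ContainerAppSystemLogs_CL with a Log_s column). Substitute whatever your environment emits and tune the keyword filter to the messages you actually see.
// Sketch: count restart/backoff style events per app over 15 minutes (table, columns, and keywords are assumptions)
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(15m)
| where Log_s has_any ("restart", "BackOff", "Started container")
| summarize RestartEvents=count() by ContainerAppName_s
| where RestartEvents > 3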
Basic anomaly-style detection without advanced tooling
If you don’t know the right static threshold yet, use a relative comparison: “current 5-minute error count is 3x higher than the previous hour’s 5-minute average.” This reduces false positives during traffic peaks.
// Example: spike detection using baseline comparison
let current = AppServiceHTTPLogs
| where TimeGenerated between (ago(5m) .. now())
| where ScStatus >= 500
| summarize cur=count();
let baseline = AppServiceHTTPLogs
| where TimeGenerated between (ago(65m) .. ago(5m))
| where ScStatus >= 500
| summarize BinErrors=count() by bin(TimeGenerated, 5m)
| summarize base=avg(todouble(BinErrors));
current
| extend joinKey=1
// leftouter + coalesce keeps the comparison working even if the baseline window has no 5xx rows
| join kind=leftouter (baseline | extend joinKey=1) on joinKey
| project cur, base=coalesce(base, 0.0)
| extend spikeRatio = todouble(cur) / iif(base == 0.0, 1.0, base)
| where spikeRatio >= 3
Note: tune time windows and ratios to your traffic patterns.
Incident workflow: detect, triage, correlate, confirm recovery
1) Detect
Detection should come from alerts tied to user-impacting symptoms:
- 5xx error rate above threshold
- p95 latency above threshold
- availability probe failures (if you have synthetic checks)
- resource saturation (CPU/memory) sustained
Example: A log alert fires: “5xx errors > 20 in 5 minutes” for your App Service.
2) Triage (is it real, and how big is it?)
In triage, you quickly answer: scope, severity, and immediate mitigation.
- Scope: one endpoint or all endpoints? one region or all?
- Severity: are errors continuous or spiky? is latency elevated for all users?
- Mitigation: can you reduce impact quickly (scale out, rollback, route traffic)?
Practical triage checklist:
- Open the alert and confirm the time window.
- Check platform metrics: requests, 5xx rate, latency, CPU/memory.
- Run “top failing endpoints” query to see if a single route is failing.
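For the scope question specifically (one endpoint or all?), a variation of the top-failing-endpoints query that shows each route's share of total errors answers it quickly; again this assumes App Service-style HTTP logs.
// Sketch: what share of 5xx errors does each endpoint account for? (assumes AppServiceHTTPLogs)
let total = toscalar(
    AppServiceHTTPLogs
    | where TimeGenerated > ago(30m)
    | where ScStatus >= 500
    | count);
AppServiceHTTPLogs
| where TimeGenerated > ago(30m)
| where ScStatus >= 500
| summarize Errors=count() by CsUriStem
| extend ShareOfErrors = todouble(Errors) / todouble(total)
| order by Errors desc
| take 10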
3) Correlate (connect metrics, logs, and changes)
Correlation is where Azure Monitor becomes powerful: align multiple signals on the same timeline.
- Metrics: did CPU spike first, or did latency spike first?
- Logs: do exceptions match the same timestamp as the metric spike?
- Changes: did a deployment, configuration change, or restart occur right before the incident?
Example correlation path:
- Metrics show p95 latency rising at 10:05.
- Logs show increased 500s and exceptions “SQL timeout” starting at 10:06.
- CPU is stable, suggesting downstream dependency slowness rather than compute saturation.
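To pin down the exact minute an error signature like this first appeared (and line it up against the metric timeline), a sketch against application/console logs helps. The table name and the "timeout" search term are assumptions; use your own log table and exception text.
// Sketch: when did a specific error signature start? (table name and search term are assumptions)
AppServiceConsoleLogs
| where TimeGenerated > ago(2h)
| where ResultDescription has "timeout"
| summarize Occurrences=count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc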
4) Confirm recovery (prove it’s fixed)
Recovery is not “we deployed a fix”; it is “signals returned to normal and stayed there.” Confirm with:
- Alert clears (or error count returns below threshold).
- Metrics: 5xx rate returns to baseline; p95 latency normalizes.
- Logs: exception signature disappears; request success rate improves.
Practical confirmation query:
// Confirm that 5xx errors dropped after a fix time
AppServiceHTTPLogs
| where TimeGenerated > ago(60m)
| summarize Errors=countif(ScStatus >= 500), Total=count() by bin(TimeGenerated, 5m)
| extend ErrorRate = todouble(Errors) / todouble(Total)
| order by TimeGenerated asc
Putting it together: a minimal monitoring setup per hosting option
App Service minimal setup
- Enable diagnostic settings to Log Analytics (HTTP logs + app logs where available).
- Create metric alerts: CPU high, response time high (if exposed), request count drop (optional).
- Create log alerts: 5xx spike, top exception signature count.
- Use a shared action group for on-call notifications.
Container Apps minimal setup
- Route container stdout/stderr and platform diagnostics to Log Analytics.
- Alert on restart loops and 5xx spikes (via ingress/access logs where available).
- Track CPU/memory saturation relative to configured limits.
VM minimal setup
- Enable VM insights / Azure Monitor Agent to Log Analytics.
- Collect OS logs via DCR (Windows Event Logs or syslog).
- Alert on CPU/memory/disk saturation and critical OS errors.
- Add web server log collection if you need per-request visibility at the VM layer.