Why Observability Matters in Offline-First Sync
Offline-first apps fail differently than always-online apps. A user can create data for hours without connectivity, then reconnect and trigger a burst of queued operations, conflict resolution, file transfers, and token refreshes. When something goes wrong, the symptoms are often indirect: “items stuck syncing,” “duplicate records,” “battery drain,” “uploads never finish,” or “data missing on another device.” Observability is the discipline of instrumenting your app and backend so you can answer: what happened, to whom, on which device, under what conditions, and why.
In practice, observability is built from four pillars: metrics (aggregated numbers over time), logs (event records), traces (end-to-end timing and causality), and sync failure diagnosis (domain-specific signals that explain why synchronization did not converge). For offline-first systems, you need all four because a single sync attempt spans multiple layers: local store, operation queue, network stack, auth, backend APIs, and sometimes background execution constraints. The goal is not to collect everything; the goal is to collect the minimum set of signals that let you quickly detect issues, triage them, and verify fixes.
Core Concepts: Metrics vs Logs vs Traces
Metrics
Metrics are numeric measurements aggregated over time. They answer “how often” and “how much,” and are ideal for dashboards and alerting. Examples: sync success rate, median sync duration, number of queued operations, upload throughput, or rate of 401 responses.
Logs
Logs are discrete records of events with context. They answer “what happened” and “what was the state.” Logs are essential for debugging specific user reports and for understanding rare edge cases. For offline-first apps, logs should capture the lifecycle of a sync attempt, the decisions made (e.g., “skipped due to metered network”), and the errors returned by the server.
Traces
Traces connect multiple operations into a single end-to-end view. They answer “where did time go” and “which component caused the failure.” In sync, a single user action can lead to a chain: enqueue operation → batch request → server processing → conflict response → follow-up fetch → local apply. Tracing ties these together with a trace ID and spans.
Sync Failure Diagnosis
Sync failure diagnosis is the offline-first specialization: you need structured reasons why sync is not progressing. Generic errors like “network error” are insufficient. You want machine-readable categories such as “auth_expired,” “schema_mismatch,” “conflict_unresolved,” “payload_too_large,” “rate_limited,” “background_execution_denied,” or “local_db_corruption.” These categories become metrics, appear in logs, and can be attached to traces.
Instrumentation Strategy: What to Measure and Why
A practical approach is to instrument around “sync sessions.” A sync session is a bounded attempt to reconcile local and remote state. It might be triggered by app foregrounding, a background job, a push notification, or a user action. Every session should have a unique session ID and should emit consistent signals across metrics, logs, and traces.
Define a Sync Session Envelope
- sync_session_id: UUID generated on the client per attempt.
- device_id: stable, privacy-safe identifier (rotating or hashed if needed).
- user_id: internal ID; avoid logging PII like email.
- app_version, os_version, device_model.
- trigger: foreground, background, manual, push, periodic.
- network_type: wifi, cellular, none; optionally “metered.”
- power_state: charging, low_power_mode, battery_level bucket.
- local_queue_depth_start, local_queue_depth_end.
- result: success, partial, failed, canceled.
- failure_category and failure_code when not successful.
This envelope is the backbone for correlation. If a user reports “sync stuck,” you can search logs by device_id and time, find sync sessions, and see exactly where they failed.
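As a concrete starting point, here is a minimal sketch of the envelope as a Kotlin data class; the field names mirror the list above and are illustrative, not a required schema:

import java.util.UUID

// Hypothetical envelope carried through the sync engine and attached to logs,
// trace spans, and (coarse fields only) metric dimensions.
data class SyncSessionEnvelope(
    val syncSessionId: String = UUID.randomUUID().toString(),
    val deviceId: String,                  // stable, privacy-safe identifier
    val userId: String?,                   // internal ID; null when logged out
    val appVersion: String,
    val osVersion: String,
    val deviceModel: String,
    val trigger: String,                   // foreground | background | manual | push | periodic
    val networkType: String,               // wifi | cellular | none (optionally "metered")
    val powerState: String,                // charging | low_power_mode | battery bucket
    var localQueueDepthStart: Int = 0,
    var localQueueDepthEnd: Int = 0,
    var result: String = "unknown",        // success | partial | failed | canceled
    var failureCategory: String? = null,
    var failureCode: String? = null
)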
Metrics: A Minimal, High-Value Set
Metrics should be few, stable, and actionable. Prefer counters, gauges, and histograms. Keep cardinality low: do not attach user_id to metrics; use dimensions like app_version, platform, and failure_category.
Recommended Client-Side Metrics
- sync_sessions_total (counter) with labels: platform, app_version, trigger, result.
- sync_session_duration_ms (histogram) with labels: platform, trigger, result.
- sync_failure_total (counter) with labels: platform, failure_category, failure_code (coarse).
- queue_depth (gauge) sampled periodically and at session start/end.
- operations_applied_total (counter) with labels: type (create/update/delete), direction (upload/download).
- bytes_uploaded_total and bytes_downloaded_total (counter) with labels: endpoint group (metadata/files).
- retry_attempts_total (counter) with labels: failure_category.
- background_sync_denied_total (counter) when OS constraints prevent execution.
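A sketch of recording a few of these on the client, assuming a hypothetical low-cardinality Metrics facade rather than any specific SDK; it reuses the hypothetical SyncSessionEnvelope sketched earlier:

// Hypothetical metrics facade; swap in your metrics SDK of choice.
interface Metrics {
    fun increment(name: String, labels: Map<String, String>, by: Long = 1L)
    fun recordHistogram(name: String, labels: Map<String, String>, valueMs: Long)
    fun setGauge(name: String, labels: Map<String, String>, value: Long)
}

fun recordSessionOutcome(metrics: Metrics, env: SyncSessionEnvelope, platform: String, durationMs: Long) {
    // Only coarse, low-cardinality labels: never user_id or device_id.
    val labels = mapOf(
        "platform" to platform,
        "app_version" to env.appVersion,
        "trigger" to env.trigger,
        "result" to env.result
    )
    metrics.increment("sync_sessions_total", labels)
    metrics.recordHistogram("sync_session_duration_ms", labels, durationMs)
    if (env.result == "failed") {
        metrics.increment(
            "sync_failure_total",
            mapOf(
                "platform" to platform,
                "failure_category" to (env.failureCategory ?: "unknown"),
                "failure_code" to (env.failureCode ?: "unknown")
            )
        )
    }
    metrics.setGauge("queue_depth", mapOf("platform" to platform), env.localQueueDepthEnd.toLong())
}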
Recommended Server-Side Metrics (Sync APIs)
- sync_requests_total with labels: endpoint, status_code class (2xx/4xx/5xx), client_version.
- sync_request_latency_ms histogram with labels: endpoint, status_code class.
- conflict_responses_total with labels: conflict_type.
- payload_size_bytes histogram with labels: endpoint.
- rate_limited_total with labels: endpoint, client_version.
These metrics let you detect systemic issues: a new app release causing higher 409 conflicts, a spike in 401s due to token refresh bugs, or increased latency on a particular endpoint.
Logging: Structured, Correlatable, and Safe
For offline-first apps, logs are most useful when they are structured (JSON-like fields), consistent, and privacy-aware. Avoid free-form text as the primary format; use event names and fields so you can query them.
Log Levels and What Belongs Where
- DEBUG: detailed state transitions, payload sizes, decision branches. Usually disabled in production unless sampled.
- INFO: sync session start/end, counts, major milestones.
- WARN: recoverable issues such as transient network failures, rate limits, or a partial apply.
- ERROR: unrecoverable failures such as schema mismatch, local DB corruption, or repeated auth failures.
Canonical Sync Log Events
- sync_session_started: includes envelope fields and queue depth.
- sync_batch_prepared: number of operations, estimated bytes, batch_id.
- sync_request_sent: endpoint, method, request_id, trace_id.
- sync_response_received: status, latency_ms, server_request_id.
- sync_operation_applied: operation_id, type, entity_type, result (applied/skipped/conflict).
- sync_conflict_detected: entity_type, conflict_type, resolution_path.
- sync_session_failed: failure_category, failure_code, error_message (sanitized), retry_in_ms.
- sync_session_completed: counts, duration, queue depth end.
Privacy and Security Rules for Logs
- Never log access tokens, refresh tokens, passwords, or full authorization headers.
- Do not log raw user content (notes, messages, file names) unless you have explicit consent and strong controls.
- Hash or redact identifiers that could be PII. Prefer internal IDs over emails/phone numbers.
- Use sampling for high-volume logs; keep ERROR logs unsampled.
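A minimal sketch of enforcing these rules before a structured event is emitted, assuming events are built as field maps; the key lists are illustrative and should match your own field names:

import java.security.MessageDigest

private val FORBIDDEN_KEYS = setOf("access_token", "refresh_token", "password", "authorization")
private val HASHED_KEYS = setOf("email", "phone_number")

// Hash PII-like identifiers so they stay correlatable without being readable.
fun sha256(value: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.toByteArray())
        .joinToString("") { "%02x".format(it) }

// Drop secrets outright, hash PII, pass everything else through unchanged.
fun sanitizeLogFields(fields: Map<String, Any?>): Map<String, Any?> =
    fields
        .filterKeys { it.lowercase() !in FORBIDDEN_KEYS }
        .mapValues { (key, value) ->
            if (key.lowercase() in HASHED_KEYS && value is String) sha256(value) else value
        }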
Example: Structured Log Payload
{"event":"sync_session_failed","timestamp":"2026-01-11T10:15:22.120Z","sync_session_id":"b7b2...","device_id":"d_91f...","user_id":"u_123","app_version":"3.4.0","platform":"android","trigger":"background","network_type":"cellular","queue_depth_start":48,"failure_category":"auth","failure_code":"token_refresh_failed","http_status":401,"retry_in_ms":60000}Tracing: End-to-End Visibility Across Client and Server
Tracing is the fastest way to pinpoint where time is spent and where failures originate. For sync, you want a trace that starts on the client when a sync session begins and continues through each HTTP request to the server, including downstream calls (database, object storage, message queues).
Trace Design for Sync
- Create a root span per sync session: name it something like SyncSession.
- Create child spans for phases: PrepareBatch, UploadBatch, DownloadChanges, ApplyChanges, UploadFiles, DownloadFiles.
- For each HTTP request, create a span with tags: endpoint, method, payload_bytes, attempt_number.
- Propagate trace context headers so server spans join the same trace.
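A sketch of this structure using the OpenTelemetry API; the span names and attribute keys are the illustrative ones used in this article, not a required convention:

import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.Span
import io.opentelemetry.context.Context

fun runSyncSessionTraced(sessionId: String, queueDepth: Int) {
    val tracer = GlobalOpenTelemetry.getTracer("sync-engine")

    // Root span: one per sync session, correlated with logs via sync_session_id.
    val root: Span = tracer.spanBuilder("SyncSession")
        .setAttribute("sync_session_id", sessionId)
        .setAttribute("queue_depth", queueDepth.toLong())
        .startSpan()
    try {
        // One child span per phase; the parent is set explicitly for clarity.
        val prepare = tracer.spanBuilder("PrepareBatch")
            .setParent(Context.current().with(root))
            .startSpan()
        try {
            // ... build the batch here ...
        } finally {
            prepare.end()
        }
        // UploadBatch, DownloadChanges, ApplyChanges, etc. follow the same pattern,
        // with an HTTP span around each request and trace context headers propagated.
    } finally {
        root.end()
    }
}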
Span Attributes That Matter
- sync_session_id and batch_id as attributes for correlation with logs.
- queue_depth at start of session.
- changed_entities_count and operations_count.
- failure_category and failure_code on error spans.
Example Trace Breakdown (What You Want to See)
- SyncSession (12.4s)
  - PrepareBatch (120ms)
  - UploadBatch POST /sync/ops (2.1s) → server span shows DB lock contention (1.8s)
  - DownloadChanges GET /sync/changes (1.3s)
  - ApplyChanges (7.8s) → client span indicates local DB transaction slow
  - SyncSession ends with result=partial due to conflict_unresolved
This view immediately tells you whether the bottleneck is local apply time, server latency, or network conditions.
Diagnosing Sync Failures: A Step-by-Step Playbook
When a sync issue is reported, you need a consistent diagnostic flow. The key is to move from symptom to a specific failure category, then to the root cause, then to a fix or mitigation.
Step 1: Identify the User Impact Pattern
- Is it stuck syncing (queue depth not decreasing)?
- Is it data divergence (device A differs from device B)?
- Is it duplication (same item appears twice)?
- Is it performance (sync drains battery, takes minutes)?
- Is it files (uploads stuck, corrupted downloads)?
Each pattern maps to different signals to inspect first.
Step 2: Locate the Relevant Sync Session(s)
Use the user’s approximate time window and device identifier to find sync_session_started events and their corresponding completion/failure events. If you have tracing, search by sync_session_id or trace_id.
Step 3: Check High-Level Outcome Metrics
- Did sync_sessions_total show a spike in failures for the user’s app_version?
- Is sync_session_duration_ms elevated for a specific platform or endpoint?
- Is sync_failure_total dominated by one failure_category?
This tells you whether it’s isolated or systemic.
Step 4: Determine the Failure Category and Next Action
Define a taxonomy and make it consistent across client and server. Example categories and what to do next:
- network: inspect network_type, timeouts, DNS failures; check server availability and client retry behavior.
- auth: inspect 401/403 rates; verify token refresh spans; check clock skew issues if applicable.
- rate_limit: look for 429; confirm backoff and batching; check if a release increased request volume.
- validation: 400 with field errors; indicates client/server contract mismatch or bad local data.
- schema_mismatch: server rejects unknown fields or client cannot parse response; often tied to version skew.
- conflict: 409 or domain conflict responses; inspect conflict_type distribution and resolution outcomes.
- local_storage: local DB errors, migration issues, corruption; inspect ApplyChanges span and local error logs.
- background_constraints: background execution denied, job killed; correlate with OS version and power state.
- file_transfer: checksum mismatch, partial content errors, storage quota exceeded.
Step 5: Confirm Whether the Queue Is Making Progress
A common “stuck” report is actually “progress is too slow.” Track progress explicitly:
- Log and metric: queue_depth_start and queue_depth_end per session.
- Log: operations_applied_count and operations_failed_count.
- Metric: queue_depth gauge trend over time for sampled devices.
If queue depth never decreases, inspect whether the same operation_id fails repeatedly. If it decreases slowly, inspect batch sizing, local apply performance, and server latency.
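As a sketch, the same signals can drive an automated check over a device's recent sessions; the shapes below are hypothetical summaries of the per-session logs:

// Summary of one completed session, as recorded in logs/metrics.
data class SessionProgress(
    val queueDepthStart: Int,
    val queueDepthEnd: Int,
    val failedOperationIds: List<String>
)

// "Stuck": queue depth never decreases across recent sessions.
fun isQueueStuck(recent: List<SessionProgress>): Boolean =
    recent.isNotEmpty() && recent.all { it.queueDepthEnd >= it.queueDepthStart }

// A likely "poison" operation: the same operation_id fails in several recent sessions.
fun repeatedlyFailingOperations(recent: List<SessionProgress>, minSessions: Int = 3): Set<String> =
    recent.flatMap { it.failedOperationIds }
        .groupingBy { it }
        .eachCount()
        .filterValues { it >= minSessions }
        .keys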
Step 6: Reproduce with the Same Conditions (When Possible)
Use the envelope fields to approximate conditions: platform, app_version, network_type, and trigger. If the failure is tied to background execution, reproduce with the same OS power mode and background restrictions. If it’s tied to payload size, reproduce with similar queue depth and file sizes.
Designing a Failure Taxonomy and Error Codes
To diagnose quickly, errors must be categorized consistently. A good pattern is: failure_category (broad) + failure_code (specific) + retryable (boolean) + recommended_action (enum). This can be produced by the client, the server, or both.
Example Failure Codes
- auth: token_expired, token_refresh_failed, forbidden_scope
- network: timeout, dns_failure, tls_error, offline
- validation: missing_field, invalid_state_transition
- schema: unsupported_client_version, response_parse_error
- rate_limit: too_many_requests, quota_exceeded
- local_storage: db_locked, migration_failed, corruption_detected
- background: job_canceled, execution_window_too_short
- file_transfer: checksum_mismatch, insufficient_storage, upload_session_expired
Make sure the client maps raw exceptions into these codes. For example, many different network exceptions should map into a small set of network failure codes so metrics remain meaningful.
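As an illustration, a sketch of collapsing many network exception types into the small set of codes above; the NormalizedFailure shape is hypothetical:

import java.io.IOException
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

// Hypothetical normalized shape: broad category + specific code + retry hint.
data class NormalizedFailure(
    val category: String,
    val code: String,
    val retryable: Boolean
)

// Many exception types, few codes: keeps failure metrics meaningful.
fun normalizeNetworkException(e: IOException): NormalizedFailure = when (e) {
    is SocketTimeoutException -> NormalizedFailure("network", "timeout", retryable = true)
    is UnknownHostException -> NormalizedFailure("network", "dns_failure", retryable = true)
    is SSLException -> NormalizedFailure("network", "tls_error", retryable = true)
    // Coarse fall-back; a connectivity check could distinguish a true "offline" state.
    else -> NormalizedFailure("network", "offline", retryable = true)
}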
Practical Implementation: Step-by-Step Instrumentation Plan
Step 1: Add a Sync Session Context Object
Create a small object that is passed through your sync engine and attached to every log and trace span. It should hold sync_session_id, trigger, and environment fields. Ensure it can be created even when the user is logged out (user_id optional).
Step 2: Instrument Phase Timers
Wrap each major phase with timing measurement and emit both a trace span and a histogram metric. Keep phase names stable so dashboards don’t break.
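For example, a sketch of a reusable phase timer built on the hypothetical Metrics facade from earlier; a trace span can be opened and closed in the same wrapper:

// Times a sync phase, records a histogram, and rethrows failures unchanged.
inline fun <T> timedPhase(
    metrics: Metrics,
    phase: String,                   // keep phase names stable: PrepareBatch, UploadBatch, ...
    labels: Map<String, String>,
    block: () -> T
): T {
    val startNanos = System.nanoTime()
    try {
        return block()
    } finally {
        val elapsedMs = (System.nanoTime() - startNanos) / 1_000_000
        metrics.recordHistogram("sync_phase_duration_ms", labels + ("phase" to phase), elapsedMs)
    }
}

Usage would look like timedPhase(metrics, "PrepareBatch", baseLabels) { prepareBatch() }, so the timing code never drifts out of sync with the phase boundaries.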
Step 3: Emit Progress Signals
At minimum, record queue depth at start/end and number of operations attempted/applied/failed. This turns “sync stuck” into a measurable condition.
Step 4: Normalize Errors into Your Taxonomy
Implement a single error-mapping function that takes raw errors (HTTP status, exception types, server error payloads) and outputs failure_category, failure_code, retryable, and user_message_key (for UI). Log the normalized error, not the raw stack trace, in production.
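A sketch of such a mapping function, reusing the hypothetical NormalizedFailure and normalizeNetworkException from the taxonomy sketch above; the user_message_key values are placeholder localization keys:

import java.io.IOException

// Hypothetical result of normalization: taxonomy fields plus a UI message key.
data class MappedError(
    val failure: NormalizedFailure,
    val userMessageKey: String       // key into localized UI strings, never raw server text
)

fun normalizeError(
    httpStatus: Int?,
    serverErrorCode: String?,
    cause: Throwable?
): MappedError = when {
    httpStatus == 401 || httpStatus == 403 ->
        MappedError(NormalizedFailure("auth", serverErrorCode ?: "token_expired", retryable = false), "error_sign_in_again")
    httpStatus == 429 ->
        MappedError(NormalizedFailure("rate_limit", "too_many_requests", retryable = true), "error_try_later")
    httpStatus == 409 ->
        MappedError(NormalizedFailure("conflict", serverErrorCode ?: "conflict_unresolved", retryable = false), "error_conflict")
    httpStatus == 400 ->
        MappedError(NormalizedFailure("validation", serverErrorCode ?: "missing_field", retryable = false), "error_contact_support")
    cause is IOException ->
        MappedError(normalizeNetworkException(cause), "error_check_connection")
    else ->
        MappedError(NormalizedFailure("unknown", "unexpected", retryable = false), "error_generic")
}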
Step 5: Correlate Client and Server
Include sync_session_id as a header on sync requests, and propagate trace context. On the server, log sync_session_id and server_request_id. This enables “follow the request” debugging from a mobile log to server logs and traces.
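On clients that use OkHttp, for example, a small interceptor can attach the correlation header; the header name is an assumption, and trace context headers are typically injected by your tracing SDK's HTTP instrumentation:

import okhttp3.Interceptor
import okhttp3.Response

// Adds the sync session ID to every sync request so server logs can be joined
// with client logs. The header name is project-specific, not a standard.
class SyncSessionHeaderInterceptor(
    private val sessionIdProvider: () -> String?
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val sessionId = sessionIdProvider() ?: return chain.proceed(chain.request())
        val request = chain.request().newBuilder()
            .header("X-Sync-Session-Id", sessionId)
            .build()
        return chain.proceed(request)
    }
}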
Step 6: Add Sampling and Rate Controls
Offline-first sync can be chatty. Implement sampling rules:
- Always capture ERROR logs and failed traces.
- Sample INFO logs for successful sessions (e.g., 1–5%).
- For DEBUG logs, enable only via remote config for specific device_ids during investigation.
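A sketch of a sampling decision that implements these rules, assuming remote config delivers the INFO sample rate and a DEBUG device allowlist:

enum class LogLevel { DEBUG, INFO, WARN, ERROR }

fun shouldEmit(
    level: LogLevel,
    sessionFailed: Boolean,
    syncSessionId: String,
    deviceId: String,
    infoSampleRate: Double,             // e.g. 0.02 for 2%
    debugDeviceAllowlist: Set<String>   // pushed via remote config during investigations
): Boolean = when {
    level == LogLevel.ERROR || sessionFailed -> true   // never sample failures away
    level == LogLevel.WARN -> true
    level == LogLevel.DEBUG -> deviceId in debugDeviceAllowlist
    else -> {
        // Deterministic per-session sampling: a whole session is kept or dropped together.
        val bucket = ((syncSessionId.hashCode() and 0x7fffffff) % 10_000) / 10_000.0
        bucket < infoSampleRate
    }
}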
Common Diagnostic Scenarios and What to Look For
Scenario: “Sync stuck” after reconnect
- Check queue_depth trend: does it remain constant across sessions?
- Find repeated failure_code for the same operation_id (e.g., validation.invalid_state_transition).
- Trace: is time spent in UploadBatch (server) or ApplyChanges (client)?
- Server metrics: spike in 400/409 for that client_version?
Scenario: “Battery drain” complaints
- Metrics: sync_sessions_total by trigger; an unusually high rate of background triggers points to a scheduling loop.
- Logs: frequent sync_session_started with result=canceled or background_constraints.
- Trace: repeated short sessions with no progress suggest thrashing.
Scenario: “Data missing on second device”
- Confirm the originating device logged sync_session_completed with operations_applied_total > 0 and a lower queue_depth_end.
- On the receiving device, check DownloadChanges spans and whether ApplyChanges failed with local_storage errors.
- Server logs: did the upload request succeed (2xx) and commit?
Scenario: “Uploads fail on cellular”
- Compare bytes_uploaded_total and file_transfer failure codes by network_type.
- Look for timeouts or upload_session_expired; check if background execution windows are too short.
- Trace: long gaps between upload chunks can indicate OS suspension.
Operational Dashboards and Alerts (What to Alert On)
Alerts should be tied to user impact and should be resilient to normal offline behavior. Good alert candidates:
- Sync success rate drops below threshold for a specific app_version or platform.
- Auth failures spike (token_refresh_failed, forbidden_scope).
- Server 5xx rate increases on sync endpoints.
- Median sync duration increases significantly (regression).
- Queue depth p95 increases over time (backlog building).
- Conflict responses spike unexpectedly (possible server-side logic change).
Pair each alert with a runbook link or at least a checklist: which dashboard to open, which logs to query, and which trace view to inspect.