Project Goal and Scenario
What you are building: an IoT anomaly detector that runs on a device (or gateway) and continues to function when the network is unreliable. The device should detect unusual behavior in sensor signals locally, trigger an alert immediately, and then synchronize events and summaries to the cloud when connectivity returns.
Why intermittent connectivity changes the design: you cannot assume continuous streaming to a server for inference or storage. The device must (1) make decisions locally, (2) buffer data and events safely, (3) avoid duplicate uploads after reconnect, and (4) degrade gracefully when storage is full or time is unsynchronized.
Example setting: a remote pump station with vibration, temperature, and current sensors. Cellular coverage drops for hours. You want to detect bearing wear, cavitation, overheating, or electrical issues without waiting for the cloud.
Concept: Anomaly Detection Under Network Gaps
Anomaly detection definition in this project: learn a baseline of “normal” sensor behavior and flag deviations that are unlikely under that baseline, whether judged by simple statistics or a learned model. In many industrial IoT cases, anomalies are rare and labels are incomplete, so you usually start with an unsupervised or semi-supervised approach.
Connectivity-aware anomaly detection: the anomaly detector itself is not the only model. The overall system includes a local inference loop plus a local “event ledger” that records what happened, when, and with what confidence. When the network is down, the ledger grows; when the network is up, the ledger drains to the cloud.
Two outputs, not one: (1) immediate local actions (e.g., blink LED, trip relay, store high-resolution snippet) and (2) eventual cloud reporting (e.g., event metadata, daily summaries, selected raw windows). This split is essential to keep bandwidth and storage under control.
Mini-Project Scope and Deliverables
Deliverable A: On-device anomaly scoring pipeline that consumes sensor windows, computes features, and outputs an anomaly score plus a discrete severity level.
Deliverable B: Offline-first event buffering that stores anomaly events and a small amount of contextual data while offline, with deduplication and retry logic.
Deliverable C: Sync protocol that uploads events and summaries when connectivity returns, using idempotent identifiers so the cloud can safely accept retries.
Deliverable D: Test plan that simulates network dropouts, clock drift, storage pressure, and sensor glitches.
Step 1: Choose Sensors, Sampling, and Windowing
Pick a realistic sensor set: for a first build, use 2–4 channels. Common combinations: vibration (accelerometer), temperature, motor current, pressure, humidity. If you do not have hardware yet, you can prototype with public datasets or synthetic signals, but keep the interface identical to what a device would provide.
Sampling rates: choose rates that match the phenomenon. Vibration might be 1–10 kHz; temperature might be 1 Hz. If you mix rates, you need a strategy: either resample to a common timeline or compute per-sensor features and concatenate.
Windowing strategy: define a fixed window length and stride. Example: 2-second windows with 50% overlap for vibration; 60-second windows with 10-second stride for slow sensors. For mixed sensors, you can align by producing one “feature vector” every stride interval, built from the most recent window per sensor.
Practical rule: choose a stride that meets your detection latency needs, and a window long enough to capture the pattern. If you need alerts within 5 seconds, do not use 60-second windows for the critical signal.
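To make the window-and-stride idea concrete, here is a minimal sketch for one vibration channel; the sample rate, window length, and names (on_samples, buf) are illustrative assumptions, not part of a specific device API.
# Sketch: fixed window + stride for one vibration channel (illustrative, not a specific API)
from collections import deque

FS = 1000                      # assumed sample rate in Hz
N = int(2.0 * FS)              # 2-second window
STRIDE = int(1.0 * FS)         # 50% overlap -> emit a window every 1 second of new samples

buf = deque(maxlen=4 * N)      # simple ring buffer
new_samples = 0

def on_samples(chunk):
    """Push a chunk of samples; return a full window once a stride has elapsed."""
    global new_samples
    buf.extend(chunk)
    new_samples += len(chunk)
    if new_samples >= STRIDE and len(buf) >= N:
        new_samples = 0
        return list(buf)[-N:]  # most recent N samples -> feature pipeline
    return None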
Step 2: Define “Normal” and Collect Baseline Data
Baseline data is your training anchor: for unsupervised anomaly detection, you primarily need representative normal operation. Collect data across expected operating modes: idle, ramp-up, steady load, high load, environmental changes.
Mode coverage matters more than volume: 2 hours of one steady state can be less useful than 20 minutes each across 6 modes. If you cannot collect all modes, at least tag the data with context (e.g., RPM bucket, valve state) so you can later segment baselines.
Store baseline locally during development: keep raw windows and computed features. In production, you may only keep features or summaries, but during the mini-project you want the ability to inspect what the model saw.
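One lightweight way to capture baseline features together with context tags during development is an append-only JSONL file; the field names and values below are only an example layout, not a required schema.
# Sketch: append baseline feature records with context tags (example layout only)
import json, time

def log_baseline_record(path, features, context):
    record = {
        "timestamp": time.time(),          # wall time if available
        "features": features,              # e.g., [rms, peak_to_peak, band1, band2]
        "context": context,                # e.g., {"rpm_bucket": "1500-1800", "valve": "open"}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_baseline_record("baseline_features.jsonl",
                    features=[0.42, 1.7, 0.08, 0.03],
                    context={"rpm_bucket": "1500-1800", "valve": "open"})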
Step 3: Build a Feature Pipeline That Survives Offline Operation
Why features first: even if you plan to use an end-to-end model later, a feature pipeline is easier to debug and can be computed deterministically on-device. It also lets you store compact representations when offline storage is limited.
Example features for vibration: RMS, peak-to-peak, kurtosis, crest factor, bandpower in a few frequency bands, dominant frequency, spectral entropy.
Example features for temperature/current: mean, slope, variance, min/max, exponentially weighted moving average (EWMA) residuals.
Implementation tip: compute features in a streaming way. Maintain rolling sums and FFT buffers so you do not allocate large arrays repeatedly. This reduces jitter and makes behavior more predictable under resource pressure.
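For slow channels like temperature or current, the EWMA residual mentioned above can be maintained in constant memory, one update per sample, in line with this streaming approach. A minimal sketch, assuming alpha is a tuning parameter matched to the channel's time constant:
# Sketch: streaming EWMA residual for a slow channel (alpha is an assumed tuning value)
class EwmaResidual:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.ewma = None
    def update(self, x):
        if self.ewma is None:
            self.ewma = x
        self.ewma = (1 - self.alpha) * self.ewma + self.alpha * x
        return x - self.ewma               # residual: deviation from the smoothed baseline

temp_residual = EwmaResidual(alpha=0.05)
r = temp_residual.update(71.3)             # feed one temperature reading per interval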
# Pseudocode: compute features per window (conceptual, not tied to a runtime)
import numpy as np

window = ring_buffer.last(N)                       # most recent N samples
rms = np.sqrt(np.mean(window ** 2))
peak_to_peak = np.max(window) - np.min(window)

# bandpower via FFT bins
spectrum = np.fft.rfft(window * np.hanning(N))
power = np.abs(spectrum) ** 2
band1 = np.sum(power[f1:f2])                       # f1..f4 are precomputed bin indices
band2 = np.sum(power[f3:f4])

features = [rms, peak_to_peak, band1, band2]

Step 4: Pick an Anomaly Scoring Method Suitable for the Device
Start simple and robust: for a mini-project, a strong baseline is a statistical model on features. Two common approaches are (1) z-score / robust z-score per feature with aggregation, and (2) distance to a learned normal distribution (e.g., Mahalanobis distance) on the feature vector.
Robust z-score approach: for each feature, estimate median and MAD (median absolute deviation) from baseline. Score each new feature value by how many MADs away it is. Aggregate by max or weighted sum. This is resilient to outliers in baseline data.
Mahalanobis approach: estimate mean vector μ and covariance Σ of baseline features. Anomaly score is (x−μ)ᵀ Σ⁻¹ (x−μ). This captures correlations (e.g., vibration and current rising together might be normal under load).
Practical choice guidance: if you have few features and want simplicity, use robust z-scores. If you have 10–50 features and correlations matter, use Mahalanobis with regularization (e.g., add λI to Σ).
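A compact sketch of the Mahalanobis score with the λI regularization mentioned above, using NumPy; fit once on baseline features, then score new vectors. The names baseline_features and new_feature_vector are placeholders for the data collected in Steps 2 and 3.
# Sketch: regularized Mahalanobis scoring on baseline features (NumPy, illustrative)
import numpy as np

def fit_baseline(X, lam=1e-3):
    """X: (n_samples, n_features) baseline feature matrix."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + lam * np.eye(X.shape[1])   # add lambda*I for stability
    cov_inv = np.linalg.inv(cov)
    return mu, cov_inv

def mahalanobis_score(x, mu, cov_inv):
    d = x - mu
    return float(d @ cov_inv @ d)            # (x - mu)^T Sigma^-1 (x - mu)

mu, cov_inv = fit_baseline(baseline_features)
score = mahalanobis_score(new_feature_vector, mu, cov_inv)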
# Pseudocode: robust z-score aggregation
# baseline: median[i], mad[i] per feature i; eps: small constant to avoid division by zero
z = [abs(x[i] - median[i]) / (mad[i] + eps) for i in range(len(x))]
score = max(z)        # or: score = sum(w[i] * z[i] for i in range(len(x)))

Step 5: Convert Scores Into Events With Hysteresis
Why you need event logic: raw anomaly scores fluctuate. If you alert on every spike, you will spam logs and waste storage while offline. Instead, convert scores into “events” with start/end boundaries and severity.
Define thresholds: choose at least two: warning and critical. Example: warning if score > 6, critical if score > 10. Calibrate using baseline plus a small set of known abnormal examples or synthetic injections.
Add hysteresis: once an event starts, require the score to fall below a lower threshold for a minimum duration before closing the event. This prevents rapid toggling.
Event fields: event_id, start_time, end_time, max_score, severity, sensor_context (e.g., RPM bucket), and pointers to stored snippets (e.g., file offsets).
# Pseudocode: event state machine (run once per new score at time t)
if state == IDLE and score > warn_th:
    start_event(t)
    below_since = None
    state = ACTIVE
if state == ACTIVE:
    update_max(score)
    if score > crit_th:
        severity = CRITICAL
    if score < clear_th:
        below_since = below_since or t            # first time the score dropped below clear_th
        if t - below_since >= clear_duration:     # hysteresis: must stay low long enough
            end_event(t)
            state = IDLE
    else:
        below_since = None

Step 6: Design the Offline-First Buffer (Local Event Ledger)
Core idea: treat the device as the source of truth while offline. Every event is appended to a local ledger (a small database or log file). Uploading is a separate process that reads from the ledger and marks items as acknowledged by the server.
Data to buffer: store (1) event metadata (small), (2) optional feature snapshots around the event (medium), and (3) optional raw sensor snippets (large). Make raw snippets conditional on severity or available storage.
Storage format options: SQLite is convenient for atomic writes and queries. A line-delimited JSON log is simpler but needs careful handling for corruption and partial writes. For embedded systems, a ring-buffer file with fixed-size records can be extremely robust.
Idempotency and deduplication: generate a stable event_id on-device. A good pattern is ULID/UUID plus a monotonic counter. The cloud endpoint should accept repeated uploads of the same event_id without creating duplicates.
Offline constraints to handle explicitly: (1) disk full, (2) power loss mid-write, (3) clock not set, (4) reboot loops. Your ledger design should survive these without losing the ability to continue detecting anomalies.
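A minimal SQLite-backed ledger sketch is shown below, with an events table keyed by event_id so re-appends after a reboot and re-uploads after a retry stay idempotent. The schema, WAL setting, and method names are assumptions for this sketch, and fetch_unsent/mark_sent are simplified versions of the interface used by the sync worker in Step 7.
# Sketch: SQLite event ledger with idempotent inserts (schema is an example, not prescriptive)
import sqlite3, json

class LedgerStore:
    def __init__(self, path="ledger.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("PRAGMA journal_mode=WAL")     # more robust against power loss on most media
        self.db.execute("""CREATE TABLE IF NOT EXISTS events (
            event_id TEXT PRIMARY KEY,
            payload  TEXT NOT NULL,
            severity INTEGER NOT NULL,
            sent     INTEGER NOT NULL DEFAULT 0)""")
        self.db.commit()

    def append(self, event_id, event_dict, severity):
        # INSERT OR IGNORE: re-appending the same event_id after a retry or reboot is harmless
        self.db.execute("INSERT OR IGNORE INTO events (event_id, payload, severity) VALUES (?, ?, ?)",
                        (event_id, json.dumps(event_dict), severity))
        self.db.commit()

    def fetch_unsent(self, limit=20):
        rows = self.db.execute(
            "SELECT event_id, payload FROM events WHERE sent = 0 "
            "ORDER BY severity DESC, event_id LIMIT ?", (limit,)).fetchall()  # ULIDs sort roughly by time
        return [(eid, json.loads(p)) for eid, p in rows]

    def mark_sent(self, event_ids):
        self.db.executemany("UPDATE events SET sent = 1 WHERE event_id = ?",
                            [(eid,) for eid in event_ids])
        self.db.commit()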
Step 7: Implement a Sync Loop That Tolerates Flaky Networks
Separate concerns: inference loop should never block on network calls. Run sync in a separate thread/task/process with backoff and timeouts.
Upload strategy: send small batches (e.g., 10–50 events) and wait for server acknowledgments. Mark events as “sent” only after ack. If the connection drops mid-batch, retry later; idempotent event_ids prevent duplication.
Bandwidth shaping: when reconnecting after a long outage, you may have hundreds of events. Avoid saturating the link. Implement a rate limit (events per minute) and prioritize critical events first.
Time handling: if the device clock is unreliable, store both “device monotonic time” (time since boot) and “best-known wall time.” When NTP becomes available, you can map monotonic timestamps to wall time approximately. At minimum, preserve ordering.
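One way to implement this dual-timestamp idea: record monotonic time for every event plus the best-known wall time and a flag for whether the clock was trusted; once NTP syncs, a single (monotonic, wall) anchor pair lets the cloud approximate wall times for earlier events from the same boot. The helper clock_is_synced() is a placeholder for a device-specific check.
# Sketch: dual timestamps (monotonic + best-known wall time), illustrative only
import time

def stamp_event():
    return {
        "mono_s": time.monotonic(),        # increases since boot (resets on reboot); preserves ordering
        "wall_s": time.time(),             # may be wrong before NTP sync
        "clock_synced": clock_is_synced(), # device-specific check; placeholder here
    }

# After NTP sync, record one anchor pair so old mono_s values can be mapped:
#   approx_wall = anchor["wall_s"] + (event_mono - anchor["mono_s"])
anchor = {"mono_s": time.monotonic(), "wall_s": time.time()}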
# Pseudocode: sync worker (runs in its own thread/task; never blocks the inference loop)
while True:
    if not network_available():
        sleep(backoff)
        continue
    batch = ledger.fetch_unsent(limit=20, order_by='severity_desc_then_time')
    if not batch:
        sleep(idle_interval)
        continue
    resp = post('/events', json=batch, timeout=request_timeout)  # timeout so a dead link cannot hang the worker
    if resp.ack_ids:
        ledger.mark_sent(resp.ack_ids)     # only acknowledged events are marked as sent
    sleep(short_delay)

Step 8: Add Summaries to Reduce Upload and Improve Observability
Why summaries matter: cloud teams often want trends, not raw data. When offline, you also want to conserve storage. Summaries let you keep a compact record of normal operation and anomaly rates.
Daily (or hourly) summary record: counts of warning/critical events, max score, mean score, percent time in each operating mode, and basic sensor statistics. Store these summaries in the ledger too, and upload them even if you drop raw snippets.
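As one possible layout for such a summary record (field names and values are illustrative):
# Sketch: hourly/daily summary record stored in the ledger (example field layout)
summary = {
    "period_start": "2024-05-01T00:00:00Z",   # or a monotonic anchor if the clock is unsynced
    "period_end":   "2024-05-02T00:00:00Z",
    "warning_events": 3,
    "critical_events": 1,
    "max_score": 14.2,
    "mean_score": 1.1,
    "mode_time_pct": {"idle": 40.0, "steady": 55.0, "high_load": 5.0},
    "sensor_stats": {"temp_mean_c": 61.2, "vib_rms_mean": 0.43},
}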
Triggered snapshots: for critical events, store a short raw snippet (e.g., 5 seconds before and after) plus the feature vector at peak score. For warning events, store only features unless storage is plentiful.
Step 9: Simulate Intermittent Connectivity and Failure Conditions
Network dropout simulation: during testing, deliberately disable the network for random intervals (e.g., 5–30 minutes). Verify that (1) detection continues, (2) events are appended, (3) sync resumes and drains backlog, (4) no duplicates appear in the cloud.
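A simple way to simulate dropouts in tests is to replace the connectivity check with a toggle that goes down for random intervals; the sketch below assumes the sync worker calls network_available() as in Step 7, and the interval ranges are arbitrary test values.
# Sketch: simulated flaky connectivity for tests (wraps the network_available() check)
import random, time

class FlakyNetwork:
    def __init__(self, min_outage_s=300, max_outage_s=1800, min_up_s=120, max_up_s=600):
        self.up = True
        self.outage = (min_outage_s, max_outage_s)
        self.uptime = (min_up_s, max_up_s)
        self.next_flip = time.monotonic() + random.uniform(*self.uptime)

    def network_available(self):
        now = time.monotonic()
        if now >= self.next_flip:
            self.up = not self.up                          # flip between up and down
            lo, hi = self.uptime if self.up else self.outage
            self.next_flip = now + random.uniform(lo, hi)  # schedule the next flip
        return self.up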
Power loss test: kill power or force reboot while writing events. After reboot, verify ledger integrity and that the system can continue appending. If using SQLite, ensure you use safe journaling settings appropriate for your storage medium.
Storage pressure test: fill the disk to near capacity. Confirm your policy: do you drop oldest raw snippets first, then oldest features, while preserving event metadata? Ensure the device does not crash or block inference when storage is low.
Clock drift test: set incorrect system time, generate events, then correct time later. Confirm that event ordering remains correct and that the cloud can interpret timestamps (or at least sequence numbers).
Step 10: Practical Calibration and Evaluation
Calibrate thresholds using baseline percentiles: compute anomaly scores on baseline data and set warning threshold at, for example, the 99.5th percentile and critical at the 99.9th percentile. Then validate against a small set of known anomalies or injected faults.
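A minimal calibration sketch using NumPy percentiles over baseline scores; the percentiles are the ones suggested above, and scorer/baseline_feature_vectors stand in for the components built in earlier steps.
# Sketch: threshold calibration from baseline score percentiles (NumPy)
import numpy as np

baseline_scores = np.array([scorer.score(fv) for fv in baseline_feature_vectors])
warn_th = np.percentile(baseline_scores, 99.5)
crit_th = np.percentile(baseline_scores, 99.9)
# Then validate: injected or known faults should exceed warn_th (ideally crit_th),
# while normal holdout data should rarely cross warn_th.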
Inject synthetic anomalies: add spikes, increased noise, frequency shifts, or step changes to your sensor stream. For vibration, inject a narrowband tone or increase energy in a band. For temperature, inject an abnormal slope. Confirm that the detector triggers and that event logic groups the anomaly into a single event.
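For the narrowband-tone injection mentioned above, a sketch of adding a tone to a vibration window; the frequency and amplitude are arbitrary test values.
# Sketch: inject a narrowband tone into a vibration window (test values are arbitrary)
import numpy as np

def inject_tone(window, fs, freq_hz=1200.0, amplitude=0.5):
    t = np.arange(len(window)) / fs
    return np.asarray(window) + amplitude * np.sin(2 * np.pi * freq_hz * t)

faulty = inject_tone(clean_window, fs=10_000)   # should raise bandpower features and the anomaly score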
Measure false alert rate under offline conditions: because you cannot rely on cloud-side filtering, your on-device false positive rate must be acceptable. Track “events per hour” during normal operation and adjust thresholds, hysteresis, and feature scaling.
Reference Implementation Blueprint (Putting It Together)
Runtime loop structure: (1) sensor acquisition fills ring buffers, (2) every stride interval compute features, (3) compute anomaly score, (4) update event state machine, (5) append events and optional snippets to ledger, (6) sync worker uploads when possible.
Suggested module boundaries: SensorReader, FeatureExtractor, AnomalyScorer, EventManager, LedgerStore, SyncWorker, StorageManager (for quotas and eviction).
Minimal data contracts: FeatureVector {timestamp, values[], context}; Event {event_id, start, end, severity, max_score, context, attachments}; Attachment {type: raw|features, location, size}.
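These contracts could be expressed as simple dataclasses during the mini-project; the types and optional fields below are one possible reading of the contracts above, not a fixed schema.
# Sketch: data contracts as dataclasses (one possible reading of the fields above)
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FeatureVector:
    timestamp: float
    values: List[float]
    context: dict

@dataclass
class Attachment:
    type: str                 # "raw" or "features"
    location: str             # file path or offset reference
    size: int

@dataclass
class Event:
    event_id: str
    start: float
    end: Optional[float]
    severity: str
    max_score: float
    context: dict
    attachments: List[Attachment] = field(default_factory=list)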
# Pseudocode: main loop skeleton
init_modules()
while True:
    samples = sensor.read()
    ring_buffer.push(samples)
    if time_to_compute_features():
        fv = features.compute(ring_buffer)
        score = scorer.score(fv)
        maybe_event = events.update(fv.timestamp, score, fv.context)
        if maybe_event.created_or_updated:
            ledger.append(maybe_event)
            storage.maybe_store_attachments(maybe_event, ring_buffer, fv)
    sleep(loop_delay)

What to Submit for the Mini-Project
Code artifacts: feature extraction code, anomaly scoring code, event state machine, ledger storage implementation, and sync worker with retry and idempotency.
Configuration artifacts: sampling rates, window sizes, feature list, thresholds, hysteresis parameters, storage quotas, and upload batch sizes.
Test artifacts: scripts or instructions to simulate network outages, power loss, and storage pressure, plus logs showing correct buffering and eventual sync.
Operational checklist: verify that inference never blocks on the network, ledger writes are atomic, storage eviction preserves critical metadata, and the cloud achieves effectively exactly-once processing at the event level (at-least-once delivery deduplicated via idempotent event IDs).