
Edge AI in Practice: Building Privacy-Preserving, Low-Latency Intelligence on Devices


System Architecture for On-Device Intelligence

Chapter 2


What “System Architecture” Means for On-Device Intelligence

System architecture for on-device intelligence is the end-to-end design of how data is captured, processed, inferred on, and acted upon directly on a device (phone, camera, wearable, industrial sensor, kiosk), including how the model is packaged, updated, monitored, and secured. The goal is not only to “run a model locally,” but to build a reliable product system: predictable latency, bounded memory and power, safe failure modes, and maintainable updates across many device variants.

In practice, architecture decisions answer questions like: where does preprocessing happen, which components run on which cores, how do you schedule inference relative to other tasks, how do you handle intermittent connectivity, what telemetry is collected, and how do you roll out model updates without breaking performance or safety constraints.

Core Architectural Building Blocks

1) Data acquisition and sensor pipeline

This block interfaces with sensors (camera, microphone, IMU, radar, temperature, current clamps, etc.) and turns raw signals into time-stamped frames or windows. Architectural concerns include sampling rate, buffering strategy, clock synchronization, and backpressure when downstream processing is slower than the sensor rate.

A common pattern is a ring buffer per sensor stream. The buffer decouples acquisition from processing and allows multiple consumers (e.g., one path for inference, another for logging or UI). For audio and IMU, you typically process fixed-size windows (e.g., 20–40 ms audio frames, 1–2 s IMU windows). For video, you may downscale or drop frames to meet compute budgets.
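As a minimal illustration, the sketch below shows a ring buffer in Python built on a bounded deque; the class and method names are hypothetical, not taken from any particular SDK.

import threading
from collections import deque

class SensorRingBuffer:
    """Fixed-capacity buffer that decouples acquisition from processing.

    When the buffer is full, the oldest sample is discarded so the
    producer (sensor callback) never blocks.
    """

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)   # oldest items drop automatically
        self._lock = threading.Lock()

    def push(self, timestamp_ns: int, sample) -> None:
        with self._lock:
            self._buf.append((timestamp_ns, sample))

    def latest_window(self, n: int):
        """Return up to the n most recent samples for one consumer."""
        with self._lock:
            return list(self._buf)[-n:]

# Example: a 2 s IMU window at 100 Hz needs the last 200 samples.
imu_buffer = SensorRingBuffer(capacity=400)
# imu_buffer.push(ts, reading) would be called from the sensor callback;
# a consumer then reads imu_buffer.latest_window(200) per inference.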

2) Preprocessing and feature extraction

Preprocessing transforms sensor data into the model’s expected input: resizing, normalization, color conversion, spectrogram generation, filtering, or feature computation. Architecturally, preprocessing is often as important as inference for latency and power. It should be treated as a first-class component with its own performance budget and test coverage.

Two key design choices are: (a) where preprocessing runs (CPU vs GPU vs DSP/NPU), and (b) whether you use “fused” operations to reduce memory copies. For example, converting camera YUV to RGB and resizing can be fused via hardware-accelerated pipelines; for audio, computing a log-mel spectrogram can be offloaded to a DSP if available.

3) Model runtime and inference engine

The inference engine loads the model, manages tensors, and executes operators on available accelerators (CPU, GPU, NPU, DSP). The architecture must account for model format, operator support, quantization, and memory planning. A robust design also includes a compatibility layer so the same application logic can run across different device chipsets and OS versions.

Important runtime concerns include warm-up (first inference often slower), thread affinity, tensor arena sizing, and avoiding repeated allocations. Many systems pre-allocate buffers and reuse them across inferences to keep latency stable.
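The sketch below illustrates warm-up and buffer reuse, assuming the TensorFlow Lite Python interpreter and a hypothetical model file name; other runtimes expose similar concepts under different APIs.

import numpy as np
import tensorflow as tf  # assumes the TensorFlow Lite interpreter is available

# Hypothetical model path; tensors are allocated once and reused per inference.
interpreter = tf.lite.Interpreter(model_path="detector_int8.tflite", num_threads=2)
interpreter.allocate_tensors()  # sizes the tensor arena up front

input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Pre-allocated input buffer, reused across inferences to avoid per-call allocation.
input_buffer = np.zeros(input_detail["shape"], dtype=input_detail["dtype"])

# Warm-up: the first invocation is often slower (delegate setup, cache fills),
# so run it once before latency-sensitive traffic starts.
interpreter.set_tensor(input_detail["index"], input_buffer)
interpreter.invoke()

def infer(preprocessed_frame: np.ndarray) -> np.ndarray:
    # preprocessed_frame is assumed to already match the model's input shape/dtype.
    np.copyto(input_buffer, preprocessed_frame)  # reuse the same buffer
    interpreter.set_tensor(input_detail["index"], input_buffer)
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])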

4) Postprocessing and decision logic

Postprocessing converts raw model outputs into actionable decisions: decoding bounding boxes, applying non-maximum suppression, smoothing predictions over time, thresholding, or mapping class probabilities to UI states. This block is where product behavior is defined: hysteresis to avoid flicker, rate limiting, and confidence calibration.

Architecturally, keep postprocessing deterministic and testable. If you rely on temporal smoothing, define the state machine explicitly (e.g., “armed,” “triggered,” “cooldown”) rather than scattering thresholds across the UI layer.
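A minimal sketch of such an explicit state machine, with illustrative thresholds and cooldown duration:

import time

class TriggerStateMachine:
    """Explicit armed -> triggered -> cooldown logic with hysteresis.

    Thresholds and the cooldown duration are illustrative values.
    """

    def __init__(self, on_thresh=0.8, off_thresh=0.5, cooldown_s=2.0):
        self.on_thresh = on_thresh      # confidence needed to trigger
        self.off_thresh = off_thresh    # lower exit threshold avoids flicker
        self.cooldown_s = cooldown_s
        self.state = "armed"
        self._cooldown_until = 0.0

    def update(self, confidence, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "armed" and confidence >= self.on_thresh:
            self.state = "triggered"
        elif self.state == "triggered" and confidence < self.off_thresh:
            self.state = "cooldown"
            self._cooldown_until = now + self.cooldown_s
        elif self.state == "cooldown" and now >= self._cooldown_until:
            self.state = "armed"
        return self.state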

5) Application integration and actuation

This block integrates inference results into the rest of the device: UI updates, notifications, motor control, camera autofocus, audio prompts, or industrial PLC signals. It must respect real-time constraints and safety rules. For example, if an inference triggers a mechanical action, you may require redundant checks, watchdog timers, and safe defaults when the model is uncertain.

6) Storage, caching, and on-device data management

Even when inference is local, you often need local persistence: model files, configuration, calibration parameters, and small caches (e.g., last N predictions, rolling statistics). Architecture should define what is stored, how it is encrypted, how it is rotated, and how it is versioned so updates can be rolled back.

7) Update, rollout, and lifecycle management

On-device intelligence is a living system: models evolve, preprocessing changes, thresholds are tuned, and device firmware updates can affect performance. A good architecture treats models as versioned artifacts with staged rollout, compatibility checks, and rollback. This block also includes A/B testing mechanisms (when appropriate) and safeguards that prevent a bad model from degrading performance or breaking the product in the field.

8) Observability and diagnostics

You need visibility into latency, memory, battery impact, and failure rates in the field. Observability on-device typically includes lightweight metrics (histograms of inference time, dropped frames, accelerator usage), structured logs, and crash reports. The architecture should define sampling rates and storage limits so diagnostics do not become the main workload.

9) Security boundaries and trust model

System architecture must define trust boundaries: which components can access raw sensor data, where keys are stored, how model files are verified, and how to prevent tampering. Even if you do not transmit raw data, local attackers could try to replace models, alter thresholds, or extract intellectual property. Common measures include signed model bundles, secure boot, and least-privilege permissions for sensor access.

Reference Architecture: A Typical On-Device Inference Pipeline

A practical reference architecture is a staged pipeline with explicit queues between stages: (1) acquisition, (2) preprocessing, (3) inference, (4) postprocessing, (5) application. Each stage has a bounded queue to prevent unbounded memory growth. When the system is overloaded, you drop or coalesce work at well-defined points (e.g., drop video frames before preprocessing rather than after expensive transforms).

For example, a camera-based detector might run acquisition at 30 FPS, but inference at 10 FPS. The architecture can downsample frames (take every third frame) and keep UI rendering independent. This avoids the common failure mode where the UI becomes sluggish because inference blocks the main thread.

Latency, Throughput, and Scheduling: Designing for Predictability

Latency budget decomposition

Start by decomposing end-to-end latency into measurable segments: sensor capture, preprocessing, inference, postprocessing, and actuation/UI. Assign a budget to each segment. For instance, if you need a 100 ms response time, you might allocate 10 ms capture + 20 ms preprocessing + 40 ms inference + 20 ms postprocessing + 10 ms UI/actuation. This budget becomes a contract for implementation and testing.
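One way to make this contract enforceable is to encode it as data and check measured runs against it; the segment names and numbers below are the example values above.

# Latency budget from the example above, expressed as a testable contract.
# Values in milliseconds; the segment names are illustrative.
LATENCY_BUDGET_MS = {
    "capture": 10,
    "preprocess": 20,
    "inference": 40,
    "postprocess": 20,
    "actuation_ui": 10,
}

def check_budget(measured_ms: dict) -> list:
    """Return the segments that exceeded their budget in a measured run."""
    return [
        stage for stage, budget in LATENCY_BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > budget
    ]

# Example: a run where inference overshoots its 40 ms allocation.
violations = check_budget(
    {"capture": 8, "preprocess": 18, "inference": 47, "postprocess": 15, "actuation_ui": 6}
)
assert violations == ["inference"]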

Threading model and queues

A common threading model uses: a high-priority acquisition thread, a worker thread pool for preprocessing, a dedicated inference thread pinned to a performance core (or to the accelerator’s driver thread requirements), and a lightweight postprocessing thread. Use lock-free queues or bounded blocking queues to reduce contention. Avoid running inference on the UI thread.

Backpressure strategies

When compute cannot keep up, you need explicit backpressure. Typical strategies include dropping oldest frames, dropping newest frames, reducing input resolution, reducing inference frequency, or switching to a smaller model. The architecture should define the policy and the trigger conditions (e.g., queue length, moving average latency, thermal state).
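A sketch of a policy function that maps overload signals to a single explicit action is shown below; the trigger conditions and action names are illustrative.

def select_backpressure_action(queue_len, avg_latency_ms, thermal_state,
                               max_queue=3, latency_budget_ms=100):
    """Map overload signals to one explicit action.

    The trigger conditions and actions are illustrative; a real system
    would tune them per device tier and log every transition.
    """
    if thermal_state in ("hot", "critical"):
        return "switch_to_smaller_model"
    if avg_latency_ms > 1.5 * latency_budget_ms:
        return "reduce_inference_frequency"
    if queue_len >= max_queue:
        return "drop_oldest_frame"
    if avg_latency_ms > latency_budget_ms:
        return "reduce_input_resolution"
    return "no_action"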

Memory Architecture: Avoiding Copies and Controlling Peak Usage

On-device systems often fail due to memory spikes rather than average usage. Architecture should specify: tensor arena size, intermediate buffers, image planes, and caching. Prefer pre-allocation and buffer reuse. For camera pipelines, use zero-copy buffers when possible (e.g., hardware buffers shared between camera and GPU) to avoid expensive memcpy operations.

Define a “peak memory envelope” per device class. If you target multiple SKUs, create tiers (low, mid, high) and ensure the app selects appropriate model variants and buffer sizes at runtime.
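A minimal sketch of tier selection, with hypothetical model file names and a deliberately rough heuristic:

# Illustrative device tiers and the variants/buffers each one selects.
DEVICE_TIERS = {
    "low":  {"model": "detector_int8_192.tflite", "input_size": 192, "queue_len": 1},
    "mid":  {"model": "detector_int8_256.tflite", "input_size": 256, "queue_len": 2},
    "high": {"model": "detector_int8_320.tflite", "input_size": 320, "queue_len": 3},
}

def classify_tier(total_ram_mb: int, has_npu: bool) -> str:
    """Very rough tiering heuristic; real products combine a device list
    with on-device benchmarking rather than relying on RAM alone."""
    if has_npu and total_ram_mb >= 6000:
        return "high"
    if total_ram_mb >= 3000:
        return "mid"
    return "low"

config = DEVICE_TIERS[classify_tier(total_ram_mb=4096, has_npu=False)]
# config["model"] and config["input_size"] then drive buffer allocation.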

Power and Thermal Architecture: Staying Within the Device’s Limits

Power and thermal constraints are architectural, not just optimization tasks. Sustained inference can trigger thermal throttling, which increases latency and can cause cascading failures (queue growth, dropped frames, UI lag). Architecture should include adaptive policies: duty cycling (run inference intermittently), event-driven activation (only infer when motion is detected), and quality scaling (reduce resolution or model size under thermal pressure).

Integrate with platform signals when available (thermal state APIs, battery saver mode). Define what the system does in each state: normal, warm, hot, critical. For example, in “hot,” you might halve inference rate and disable non-essential logging.
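The per-state behavior can be expressed directly as data, which keeps it explicit and testable; the values below are illustrative.

# Illustrative mapping from thermal state to pipeline settings.
THERMAL_POLICIES = {
    "normal":   {"inference_fps": 10, "input_size": 320, "debug_logging": True},
    "warm":     {"inference_fps": 7,  "input_size": 320, "debug_logging": True},
    "hot":      {"inference_fps": 5,  "input_size": 256, "debug_logging": False},
    "critical": {"inference_fps": 1,  "input_size": 192, "debug_logging": False},
}

def apply_thermal_policy(state: str) -> dict:
    # Fall back to the most conservative settings for unknown states.
    return THERMAL_POLICIES.get(state, THERMAL_POLICIES["critical"])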

Model Packaging and Compatibility Layers

Real deployments face heterogeneity: different chipsets, different operator support, and different performance characteristics. A practical architecture uses a model bundle that can contain multiple variants: quantized vs float, different input sizes, and possibly different backends. At runtime, a capability detector selects the best compatible variant.

Also version the preprocessing configuration alongside the model. If you update the model but forget to update normalization constants or label maps, you can silently degrade accuracy. Treat “model + preprocessing + postprocessing parameters” as a single atomic unit.
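A sketch of such an atomic bundle, expressed as a hypothetical JSON manifest plus a loader that refuses to activate an incomplete bundle:

import json

# A hypothetical bundle manifest: the model, its preprocessing constants,
# and postprocessing parameters are versioned and shipped as one unit.
manifest = {
    "bundle_version": "2025.03.1",
    "model_file": "detector_int8_256.tflite",
    "preprocessing": {
        "input_size": [256, 256],
        "color_space": "RGB",
        "mean": [127.5, 127.5, 127.5],
        "std": [127.5, 127.5, 127.5],
    },
    "postprocessing": {
        "score_threshold": 0.5,
        "nms_iou_threshold": 0.45,
        "label_map": {"0": "person", "1": "vehicle"},
    },
}

def load_bundle(path: str) -> dict:
    """Load and minimally validate a bundle; reject it if any part is missing,
    so the model can never be activated without its matching parameters."""
    with open(path) as f:
        bundle = json.load(f)
    for key in ("bundle_version", "model_file", "preprocessing", "postprocessing"):
        if key not in bundle:
            raise ValueError(f"incomplete bundle, missing '{key}'")
    return bundle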

Practical Step-by-Step: Designing an On-Device Intelligence Architecture

Step 1: Define the system contract

Write down measurable requirements: maximum end-to-end latency, minimum throughput, acceptable battery impact, memory ceiling, and acceptable degradation modes. Include device tiers and OS constraints. This contract guides every architectural choice.

Step 2: Draw the pipeline and choose boundaries

Sketch the stages and decide where boundaries (queues) exist. For each boundary, define: data format (e.g., YUV420 frame, float32 tensor, int8 tensor), maximum queue length, and drop policy. Keep the number of format conversions minimal.

Step 3: Choose runtime and acceleration strategy

Select the inference runtime(s) and define a backend abstraction. Decide which operations can run on accelerators and which must run on CPU. Plan for fallback paths when an accelerator is unavailable or when operator support is missing. Ensure the application can detect and report which backend is active.

Step 4: Plan memory and buffer reuse

List all buffers: sensor buffers, preprocessing scratch space, model input/output tensors, postprocessing working arrays. Decide which are persistent and which are per-inference. Implement a buffer pool to reuse allocations. Set hard limits and fail gracefully if memory is insufficient (e.g., switch to a smaller model variant).
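A minimal buffer-pool sketch in Python (the numpy buffers and the class name are illustrative):

import queue
import numpy as np

class BufferPool:
    """Fixed set of pre-allocated numpy buffers, reused across inferences.

    acquire() blocks when all buffers are in use, which also acts as a
    natural form of backpressure on the producer.
    """

    def __init__(self, count: int, shape, dtype=np.uint8):
        self._free = queue.Queue(maxsize=count)
        for _ in range(count):
            self._free.put(np.empty(shape, dtype=dtype))

    def acquire(self, timeout=None):
        return self._free.get(timeout=timeout)   # raises queue.Empty on timeout

    def release(self, buf) -> None:
        self._free.put(buf)

# Example: three reusable 256x256 RGB input buffers for the whole pipeline.
pool = BufferPool(count=3, shape=(256, 256, 3))
buf = pool.acquire()
# ... fill buf during preprocessing, hand it to the inference stage ...
pool.release(buf)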

Step 5: Implement scheduling and backpressure

Implement bounded queues and define overload behavior. Add a controller that monitors moving averages of stage latencies and queue depths. When overload is detected, apply a policy: reduce frame rate, downscale input, or switch models. Make the controller deterministic and testable.

Step 6: Build observability from day one

Add timers around each stage and record metrics: p50/p95 latency, dropped frames, inference frequency, and memory usage. Store metrics locally and upload only aggregated summaries when connectivity is available (or keep them for user support workflows). Ensure logging is rate-limited.
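A sketch of bounded, lightweight per-stage metrics with approximate percentiles; the window size and stage names are illustrative.

from collections import defaultdict, deque

class StageMetrics:
    """Bounded window of per-stage latencies with approximate percentiles.

    The fixed window size keeps diagnostics memory-bounded.
    """

    def __init__(self, window: int = 500):
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, stage: str, latency_ms: float) -> None:
        self._samples[stage].append(latency_ms)

    @staticmethod
    def _percentile(ordered, q):
        idx = min(len(ordered) - 1, int(len(ordered) * q))
        return ordered[idx]

    def summary(self) -> dict:
        out = {}
        for stage, values in self._samples.items():
            ordered = sorted(values)
            out[stage] = {
                "p50": self._percentile(ordered, 0.50),
                "p95": self._percentile(ordered, 0.95),
                "count": len(ordered),
            }
        return out

metrics = StageMetrics()
metrics.record("inference", 42.3)
metrics.record("inference", 38.9)
# Only the aggregated summary() would be uploaded, never raw per-frame logs.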

Step 7: Add update and rollback mechanisms

Define how model bundles are delivered, verified, and activated. Use a two-phase approach: download to a staging area, verify signature and compatibility, then switch the active pointer. Keep the previous version for rollback. Record which version is active and expose it in diagnostics.
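The sketch below illustrates the two-phase switch with a "previous" pointer kept for rollback; it verifies only a hash, whereas a real deployment would verify a cryptographic signature and check runtime compatibility before switching.

import hashlib
import os

def activate_bundle(staging_dir: str, active_link: str, previous_link: str,
                    expected_sha256: str) -> None:
    """Two-phase activation sketch: verify in staging, then switch pointers."""
    model_path = os.path.join(staging_dir, "model.tflite")  # hypothetical layout
    with open(model_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise ValueError("bundle verification failed; keeping current version")

    # Keep the previously active bundle pointer for rollback.
    if os.path.lexists(active_link):
        if os.path.lexists(previous_link):
            os.remove(previous_link)
        os.rename(active_link, previous_link)

    # Switch: the 'active' pointer now refers to the staged bundle.
    os.symlink(staging_dir, active_link)

def rollback(active_link: str, previous_link: str) -> None:
    """Point 'active' back at the previous bundle if the new one misbehaves."""
    if not os.path.lexists(previous_link):
        raise RuntimeError("no previous bundle available")
    if os.path.lexists(active_link):
        os.remove(active_link)
    os.rename(previous_link, active_link)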

Step 8: Validate with end-to-end tests on real devices

Create tests that measure end-to-end latency under realistic conditions: background apps running, low battery mode, thermal stress, and intermittent connectivity. Validate not just average latency but tail latency and stability over time. Confirm that overload policies behave as designed.

Pattern Catalog: Common Architectural Patterns

Always-on vs event-driven inference

Always-on inference runs continuously at a fixed cadence. It is simpler but can be power-hungry. Event-driven inference uses a low-cost trigger (motion, sound energy, simple heuristic) to activate the heavier model. Architecturally, event-driven designs require a small “sentinel” pipeline and a state machine that controls when the main model runs.

Cascaded models (coarse-to-fine)

A cascade uses a small, fast model to filter most inputs and a larger model for ambiguous cases. The architecture needs a routing decision and a shared feature or buffer strategy to avoid duplicating preprocessing. Cascades are effective when most frames are “easy negatives.”
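A sketch of the routing decision, with the gate and main models passed in as callables and preprocessing shared so it runs only once:

def cascaded_detect(frame, preprocess, gate_model, main_model, gate_threshold=0.3):
    """Coarse-to-fine routing sketch.

    preprocess, gate_model, and main_model are placeholder callables; the
    threshold is illustrative and would be tuned on validation data.
    """
    preprocessed = preprocess(frame)          # shared work, done once
    gate_score = gate_model(preprocessed)     # fast, small model
    if gate_score < gate_threshold:
        return []                             # easy negative: skip the heavy model
    return main_model(preprocessed)           # ambiguous case: run the full detector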

Streaming inference with temporal state

For audio, IMU, and some video tasks, models may maintain temporal context. Architecture must manage state across windows: resetting on gaps, handling variable frame rates, and ensuring state is not corrupted by dropped windows. Define how state is stored and versioned, especially across model updates.

Hybrid local + remote augmentation (optional)

Some systems run locally by default but can optionally call a remote service for enhanced results when connectivity and user settings allow. Architecturally, treat remote augmentation as a separate, non-blocking path: local results must be available within the latency budget, and remote results can update the UI later if they arrive. This avoids coupling core functionality to network variability.

Example: Camera Object Detection Architecture (Concrete Blueprint)

Pipeline

Acquisition: camera delivers YUV frames into a ring buffer.
Preprocessing: a GPU-accelerated resize converts YUV to the model's input size and format, writing into a pre-allocated input tensor.
Inference: the NPU runs an int8 model at 10 FPS on a dedicated inference thread.
Postprocessing: CPU decodes boxes and applies non-maximum suppression, then a tracker smooths results across frames.
Application: UI overlays boxes and labels, and a rate limiter prevents rapid UI churn.

Overload handling

If inference latency exceeds the budget for 1 second, the controller reduces inference rate to 7 FPS. If thermal state becomes “hot,” it switches to a smaller model variant and reduces input resolution. If memory pressure is detected, it shortens queues and disables optional debug overlays.

Diagnostics

Metrics include per-stage latency, dropped frame count, active model version, backend type (CPU/GPU/NPU), and thermal state transitions. A debug screen shows these values so field testers can capture reproducible reports.

Example: Audio Keyword Spotting Architecture (Concrete Blueprint)

Pipeline

Acquisition: microphone audio is captured into a circular buffer at a fixed sample rate.
Preprocessing: a DSP or CPU computes log-mel features for overlapping windows.
Inference: a small quantized model runs every 20 ms and outputs keyword probabilities.
Postprocessing: a smoothing filter and a state machine require sustained confidence for a trigger.
Application: on trigger, the system wakes a heavier component (e.g., command parser) while keeping the keyword model running.

State and gaps

The architecture defines how to handle audio dropouts: if a gap exceeds a threshold, the feature pipeline resets and the model state is cleared to avoid false triggers. The system also defines a cooldown period after a trigger to prevent repeated activations.

Implementation Sketch: Pipeline with Bounded Queues

// Pseudocode illustrating architectural structure (language-agnostic)

queue<Frame>  captureQ(max=3)
queue<Tensor> preprocQ(max=2)
queue<Output> inferQ(max=2)

thread captureThread:
    while running:
        frame = camera.read()
        if captureQ.full():
            captureQ.drop_oldest()
        captureQ.push(frame)

thread preprocThread:
    while running:
        frame = captureQ.pop_blocking()
        tensor = buffers.get_input_tensor()
        preprocess(frame, tensor)
        if preprocQ.full():
            preprocQ.drop_oldest()
        preprocQ.push(tensor)

thread inferThread:
    warmup_once()
    while running:
        tensor = preprocQ.pop_blocking()
        out = runtime.run(tensor)
        if inferQ.full():
            inferQ.drop_oldest()
        inferQ.push(out)

thread postThread:
    while running:
        out = inferQ.pop_blocking()
        decision = postprocess(out)
        app.apply(decision)

This structure makes boundaries explicit, prevents unbounded memory growth, and provides clear insertion points for metrics and overload policies. In production, you would add timestamps, per-stage timers, and a controller that adjusts rates and model variants based on observed performance and device state.

Now answer the exercise about the content:

Why does an on-device inference pipeline often use bounded queues between stages like capture, preprocessing, inference, and postprocessing?


Bounded queues make stage boundaries explicit, keep memory from growing without limits, and allow controlled overload handling (for example, dropping frames before expensive work) to maintain predictable behavior.

Next chapter

Data Collection and Labeling for Edge Scenarios
