
Edge AI in Practice: Building Privacy-Preserving, Low-Latency Intelligence on Devices


Data Collection and Labeling for Edge Scenarios

Chapter 3


Why edge data collection is different

Key idea: in edge scenarios, the “dataset” is shaped by device constraints, real-world variability, and privacy boundaries. Unlike cloud-centric pipelines where you can often log everything and label later, edge pipelines must decide up front what to capture, how to store it, and how to label it without leaking sensitive information.

Practical implication: you should treat data collection and labeling as part of product engineering, not a one-off ML task. The device’s sensors, compute budget, power modes, and connectivity patterns determine what is feasible. Your labeling strategy must also account for the fact that raw data may never leave the device, or may be transformed (blurred, quantized, embedded) before export.

Define the learning target and the “label contract”

Concept: before collecting anything, define what the model must output and what a label means operationally. A “label contract” is a precise specification of: classes (or regression targets), allowable ambiguity, temporal alignment, and how to handle “unknown” or “not applicable.”

Example: for a wake-word detector, the label contract might specify: (1) positive segments contain the wake phrase spoken by any speaker, (2) negatives include near-miss phrases and background speech, (3) timestamps must align to the start of the phrase within ±50 ms, and (4) “uncertain” is allowed when audio is clipped or heavily overlapped.
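
One way to keep capture scripts, labeling UIs, and training code in sync is to write the contract down in machine-readable form so they all share one definition. A minimal Python sketch for the wake-word example above; the field names and the acceptance threshold are hypothetical illustrations, not a fixed schema:

# A minimal, machine-readable label contract (hypothetical field names).
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelContract:
    positive_definition: str       # what counts as a positive segment
    negative_definition: str       # what counts as a negative
    allow_uncertain: bool          # whether annotators may mark "uncertain"
    timestamp_tolerance_ms: int    # alignment tolerance for event onsets
    min_examples_per_class: int    # acceptance criterion

WAKE_WORD_CONTRACT = LabelContract(
    positive_definition="wake phrase spoken by any speaker",
    negative_definition="near-miss phrases and background speech",
    allow_uncertain=True,          # clipped or heavily overlapped audio
    timestamp_tolerance_ms=50,     # onset must align within +/-50 ms
    min_examples_per_class=500,    # hypothetical acceptance threshold
)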

Step-by-step: write a label contract

  • List outputs: class names, multi-label rules, or numeric targets.
  • Define boundaries: what counts as positive vs. negative; include edge cases.
  • Specify timing: for streaming sensors, define window length, stride, and how labels map to windows.
  • Define “unknown/other”: explicitly decide whether to include an “other” class or to reject out-of-distribution inputs.
  • Set acceptance criteria: inter-annotator agreement targets, allowable noise, and minimum examples per class.

Plan for sensor reality: sampling, drift, and missingness

Concept: edge sensors behave differently across devices and over time. Microphones vary in frequency response; cameras vary in exposure and lens distortion; IMUs drift; environmental conditions change. Your dataset must represent this variability, or your model will fail in the field.


Practical implication: collect data across device models, firmware versions, and operating conditions. Also plan for missing data: dropped frames, sensor pauses, and power-saving modes. Labeling must reflect what the model will see, not what you wish it saw.
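
One way to operationalize this coverage is a quota table keyed by device and condition; the step-by-step list below describes how to build the full matrix. A minimal sketch, with hypothetical device SKUs, condition tags, and quota values:

# Coverage matrix as a quota table: (device, condition) -> minutes still needed.
# Device SKUs, condition tags, and the 30-minute quota are hypothetical.
from itertools import product

DEVICES = ["sku_a", "sku_b"]
CONDITIONS = ["quiet", "tv_noise", "street_noise"]
QUOTA_MINUTES = 30.0

coverage = {cell: QUOTA_MINUTES for cell in product(DEVICES, CONDITIONS)}

def record_capture(device: str, condition: str, minutes: float) -> None:
    """Decrement the remaining quota for one cell of the matrix."""
    key = (device, condition)
    coverage[key] = max(0.0, coverage[key] - minutes)

def undercovered_cells() -> list:
    """Cells that still need data, largest remaining quota first."""
    remaining = [(cell, left) for cell, left in coverage.items() if left > 0]
    return sorted(remaining, key=lambda item: item[1], reverse=True)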

Step-by-step: create a sensor coverage matrix

  • Enumerate devices: target SKUs, chipsets, sensor vendors.
  • Enumerate conditions: lighting, noise levels, motion patterns, temperatures.
  • Enumerate user behaviors: typical and atypical interactions.
  • Define quotas: minimum minutes/images per cell of the matrix.
  • Track metadata: store device ID hash, sensor settings, timestamp, and environment tags (when safe).

Choose a collection strategy: on-device logging vs. guided capture

Concept: edge datasets are commonly built using a mix of (1) guided capture (structured sessions where users or testers follow prompts) and (2) in-the-wild logging (passive capture triggered by events). Guided capture yields clean labels and balanced classes; in-the-wild logging yields realism and long-tail coverage.

Practical implication: you often start with guided capture to bootstrap a baseline model, then iterate with targeted in-the-wild logging to fix failure modes. The key is to design triggers and sampling so you do not collect redundant data or violate privacy constraints.
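
Both strategies benefit from an explicit keep/drop decision at capture time. A minimal sketch of such a sampling gate, assuming a hypothetical per-user daily cap and per-stratum keep probabilities; the protocols below cover what to capture in each mode:

# Keep/drop gate for captured samples, assuming a hypothetical daily cap and
# per-stratum keep probabilities (common contexts kept rarely, rare ones always).
import random

DAILY_CAP_PER_USER = 20       # hypothetical cap on kept samples per user per day
KEEP_PROBABILITY = {          # hypothetical strata for stratified sampling
    "common": 0.05,           # keep only 5% of common-context captures
    "rare": 1.0,              # keep every rare-context capture
}

def should_keep(samples_kept_today: int, stratum: str) -> bool:
    """Cap per-user/per-day volume, then sample more aggressively from rare strata."""
    if samples_kept_today >= DAILY_CAP_PER_USER:
        return False
    return random.random() < KEEP_PROBABILITY.get(stratum, 0.0)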

Step-by-step: design a guided capture protocol

  • Write prompts: scripts for speech, gestures, or actions; include variations.
  • Randomize order: reduce bias from fatigue or learning effects.
  • Capture negatives: explicitly record confusing non-target cases.
  • Instrument context: record safe metadata (distance bins, noise bins) rather than raw sensitive context.
  • Quality checks: immediate on-device checks for clipping, blur, or sensor dropout.

Step-by-step: design in-the-wild logging

  • Define triggers: model uncertainty, user feedback, rare sensor patterns, or error events.
  • Sampling policy: cap per-user/per-day volume; use stratified sampling to avoid over-collecting common cases.
  • Minimize payload: prefer short snippets, downsampled frames, or derived features when possible.
  • Consent and controls: clear opt-in, pause controls, and data deletion pathways.
  • Secure storage: encrypt at rest; rotate keys; enforce retention limits.

Event-based capture and “hard example” mining on the edge

Concept: because edge devices may not upload everything, you can use the current model to decide what to capture. This is a practical form of active learning: log examples where the model is uncertain, disagrees with user actions, or hits known failure signatures.

Example: for a gesture classifier, log short IMU windows when the predicted class confidence is below 0.4, or when the user immediately retries an action (a proxy for failure). For a camera-based detector, log frames where the detector produces many low-confidence boxes (possible clutter) or where tracking breaks.
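
A minimal sketch of such a trigger, assuming the on-device model exposes class probabilities for at least two classes; the margin threshold and cooldown value are hypothetical:

# Hard-example trigger: log when the model is uncertain, with a cooldown
# to avoid repeated near-identical captures. Margin and cooldown are hypothetical.
CONFIDENCE_THRESHOLD = 0.4    # from the gesture example above
MARGIN_THRESHOLD = 0.15       # hypothetical top-2 margin threshold
COOLDOWN_MS = 5000            # hypothetical debounce interval

_last_trigger_ms = -COOLDOWN_MS

def should_log(probs: list, now_ms: int) -> bool:
    """Return True if this window should be captured as a hard example."""
    global _last_trigger_ms
    if now_ms - _last_trigger_ms < COOLDOWN_MS:
        return False                                  # debounce: still cooling down
    top1, top2 = sorted(probs, reverse=True)[:2]
    uncertain = top1 < CONFIDENCE_THRESHOLD or (top1 - top2) < MARGIN_THRESHOLD
    if uncertain:
        _last_trigger_ms = now_ms
    return uncertain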

Step-by-step: implement hard-example triggers

  • Pick signals: entropy of softmax, margin between top-2 classes, anomaly score, or post-processing failures.
  • Set thresholds: start conservative to limit volume; adjust after measuring yield.
  • Debounce: avoid logging repeated near-identical windows; add cooldown timers.
  • Attach context: include safe metadata like sensor stats, not raw user identifiers.
  • Measure yield: percent of logged samples that become useful labeled data.

Labeling modalities: what “label” means for edge sensors

Concept: labels differ by modality and task. For edge, labeling must often be fast, consistent, and compatible with streaming windows. Common label types include: classification labels, bounding boxes, segmentation masks, keypoints, timestamps, and sequence labels (e.g., activity over time).

Practical implication: choose the simplest label that supports the product requirement. Overly detailed labels can be expensive and slow iteration. Under-specified labels can make the model impossible to train or evaluate.

Examples of label choices

  • Audio: wake-word timestamp, speaker count, noise condition tags.
  • Vision: bounding boxes vs. instance segmentation; “visible” vs. “occluded” flags.
  • IMU: activity segments with start/end times; transition labels.
  • Multimodal: synchronized labels across streams (e.g., video + IMU) with clock alignment tolerances.

On-device labeling and weak supervision

Concept: sometimes labels can be generated on-device using proxies: user interactions, system states, or simple heuristics. This is weak supervision: labels are noisy but cheap. For edge scenarios, weak labels can be the only scalable option when raw data cannot be exported.

Examples: (1) “User pressed cancel” as a negative label for a voice command interpretation; (2) “Screen unlocked successfully” as a positive label for face authentication; (3) “Device placed on charger” as a proxy for being stationary; (4) “Button long-press” as a ground-truth marker for an event in sensor streams.
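
A minimal sketch of turning such proxy events into weak labels with a per-proxy confidence; the event names and confidence values are hypothetical and should be calibrated by the verification step described below:

# Map proxy events to weak labels with a per-proxy confidence estimate.
# Event names and confidence values are hypothetical placeholders.
PROXY_RULES = {
    "user_pressed_cancel":   ("negative", 0.7),
    "screen_unlock_success": ("positive", 0.9),
    "placed_on_charger":     ("stationary", 0.8),
    "button_long_press":     ("event_marker", 0.95),
}

def weak_label(event_name: str):
    """Return (label, confidence) for a proxy event, or None if unmapped."""
    return PROXY_RULES.get(event_name)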

Step-by-step: build a weak-label pipeline

  • Identify proxies: UI events, system logs, hardware interrupts, or app states.
  • Estimate noise: sample and manually verify a subset to measure label error.
  • Model the noise: use robust losses, label smoothing, or confidence-weighted training.
  • Filter: drop ambiguous cases (e.g., rapid repeated taps) to improve precision.
  • Iterate: refine proxies as product flows change.

Human labeling workflows tailored to edge constraints

Concept: when human annotation is needed, edge constraints affect what annotators can see. You may need to label transformed data (blurred faces, redacted audio, cropped regions, or embeddings). This changes tooling and guidelines: annotators must understand what information is missing and how to label consistently anyway.

Practical implication: design annotation tasks around the data representation you can safely provide. If you can only export short audio snippets, ensure the labeling UI supports precise timestamp marking within those snippets. If you export downsampled frames, ensure bounding box guidelines account for reduced resolution.

Step-by-step: set up an annotation job

  • Create guidelines: include positive/negative examples, borderline cases, and “do not label” rules.
  • Define units: frame, clip, window, or session; keep units small enough for consistent labeling.
  • Qualification: test annotators on a gold set; require minimum accuracy.
  • Redundancy: multiple labels per item; resolve disagreements with adjudication.
  • Audit: continuous sampling of labeled items; track per-annotator error patterns.

Temporal alignment and windowing for streaming inference

Concept: many edge models run on sliding windows (e.g., 1-second audio windows every 100 ms, or 2-second IMU windows). Labels must be aligned to these windows. Misalignment is a common hidden cause of poor edge performance because training labels do not match inference-time framing.

Example: if an activity starts at time t=2.05s and your window boundaries are at 2.0s and 2.1s, you must decide whether both windows are positive, only the later one, or whether you use a soft label proportional to overlap.
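
A minimal sketch of an overlap rule that maps a labeled event interval onto sliding windows as soft labels; the window length, stride, and positive-overlap threshold are illustrative and must match the runtime pipeline:

# Map a labeled event interval to sliding windows via fractional overlap.
# Window length, stride, and the positive threshold are illustrative.
WINDOW_S = 1.0
STRIDE_S = 0.1
POSITIVE_OVERLAP = 0.5   # mark a window positive if >= 50% of it is covered

def window_labels(event_start: float, event_end: float, stream_end: float) -> list:
    """Return (window_start, soft_label) pairs for one labeled event."""
    labels = []
    t = 0.0
    while t + WINDOW_S <= stream_end:
        overlap = max(0.0, min(t + WINDOW_S, event_end) - max(t, event_start))
        labels.append((round(t, 3), overlap / WINDOW_S))   # soft label in [0, 1]
        t += STRIDE_S
    return labels

# Hard labels: positive when the soft label is >= POSITIVE_OVERLAP.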

Step-by-step: align labels to windows

  • Choose window parameters: length and stride matching the runtime pipeline.
  • Define overlap rules: minimum overlap percentage to mark positive.
  • Handle transitions: introduce a “transition” label or ignore boundary windows.
  • Synchronize clocks: for multimodal data, correct for sensor timestamp offsets and drift.
  • Validate: replay labeled streams through the same windowing code used on-device.

Dataset balance, long-tail coverage, and edge-specific negatives

Concept: edge deployments often fail on “negatives” that look similar to positives: near-wake phrases, gestures that resemble the target, objects with similar silhouettes, or background patterns that trigger false positives. These hard negatives are more important than random negatives.

Practical implication: plan explicit collection for confusing negatives and rare contexts. Also, balance must be considered at the window level, not just at the clip/session level, because sliding windows can create many near-duplicate samples.
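
One simple guard against near-duplicate windows is to deduplicate by hashing coarsely quantized features before counting class balance. A minimal sketch, with a hypothetical quantization step:

# Deduplicate near-identical windows by hashing coarsely quantized features,
# then count class balance at the window level. QUANT_STEP is hypothetical.
import hashlib
from collections import Counter

QUANT_STEP = 0.05   # hypothetical coarseness used to spot near-duplicates

def window_key(features: list) -> str:
    """Hash coarsely quantized features so near-identical windows collide."""
    quantized = tuple(round(x / QUANT_STEP) for x in features)
    return hashlib.sha1(repr(quantized).encode()).hexdigest()

def balanced_counts(windows: list) -> Counter:
    """windows: list of (features, label) pairs. Count labels after deduplication."""
    seen, counts = set(), Counter()
    for features, label in windows:
        key = window_key(features)
        if key not in seen:
            seen.add(key)
            counts[label] += 1
    return counts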

Step-by-step: build a hard-negative set

  • Mine from logs: collect false positives flagged by user correction or downstream checks.
  • Generate near-misses: scripted phrases, similar gestures, adversarial lighting/noise conditions.
  • Deduplicate: cluster by embeddings or simple hashes to avoid overcounting repeats.
  • Weight training: upweight hard negatives or use focal loss to emphasize them.
  • Track metrics: false positive rate in the specific confusing contexts.

Data minimization and privacy-preserving transformations during collection

Concept: edge data collection should minimize sensitive content while preserving learning signal. Common transformations include cropping to regions of interest, blurring faces, removing background audio, downsampling, quantization, and converting raw data to features (e.g., MFCCs for audio) before storage or upload.

Practical implication: transformations can change what is labelable and what the model can learn. If you train on transformed data, ensure the on-device inference uses the same representation. If the device uses raw sensors but you only export features, you may need a “teacher” model on-device to generate labels or embeddings consistently.
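
One way to keep representations consistent is to route both the capture pipeline and the training data loader through a single transformation function. A minimal sketch with a hypothetical downsample-and-quantize transform; a real pipeline might crop or extract log-mel features instead:

# A single transform shared by capture and training, so the model always sees
# the same representation. Downsample factor and bit depth are hypothetical.
def privacy_transform(samples: list, downsample: int = 4, bits: int = 8) -> list:
    """Downsample and coarsely quantize a signal (assumed normalized to [-1, 1])
    before storage or upload; capture and training both call this function."""
    levels = 2 ** (bits - 1)
    reduced = samples[::downsample]
    return [round(x * levels) / levels for x in reduced]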

Step-by-step: choose a transformation policy

  • Identify sensitive fields: faces, license plates, background speech, location traces.
  • Select minimal representation: crop, downsample, or feature-extract to retain task signal.
  • Test learnability: train a small prototype model on transformed data to confirm performance.
  • Document: record transformation parameters as part of dataset versioning.
  • Enforce: implement transformations in the capture pipeline, not as an optional post-step.

Quality control: catching label noise and sensor artifacts

Concept: edge datasets often contain artifacts: clipped audio, motion blur, rolling shutter, IMU saturation, or corrupted packets. Label noise also appears from weak supervision or rushed annotation. Quality control must detect both data defects and label defects.

Practical implication: build automated checks that run at ingestion time and before training. Then add targeted human review for the most impactful slices (e.g., high-loss samples, disagreement cases, or samples that dominate false positives).
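
A minimal sketch of ingestion-time checks covering clipping, sample-rate consistency, and timestamps; the expected rate and thresholds are hypothetical:

# Ingestion-time QC: flag samples with clipping, the wrong sample rate, or
# missing timestamps. Expected rate and thresholds are hypothetical.
EXPECTED_SAMPLE_RATE = 16000   # hypothetical target rate
MAX_CLIP_FRACTION = 0.01       # at most 1% of samples at full scale

def qc_check(audio: list, sample_rate: int, t_start_ms, t_end_ms) -> list:
    """Return a list of failure reasons; an empty list means the sample passes."""
    failures = []
    if t_start_ms is None or t_end_ms is None or t_end_ms <= t_start_ms:
        failures.append("bad_timestamps")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        failures.append("sample_rate_mismatch")
    if not audio:
        failures.append("empty_payload")
    elif sum(1 for x in audio if abs(x) >= 0.999) / len(audio) > MAX_CLIP_FRACTION:
        failures.append("clipping")
    return failures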

Step-by-step: implement QC checks

  • Signal checks: audio RMS/clipping rate, image blur score, IMU saturation counts.
  • Schema checks: missing timestamps, inconsistent sample rates, invalid label IDs.
  • Outlier checks: embedding-based anomaly detection to catch corrupted or off-domain samples.
  • Label consistency: confusion matrix by annotator, agreement rates, and drift over time.
  • Slice review: inspect top error slices (device model, environment tag, firmware version).

Dataset versioning and reproducibility for iterative edge releases

Concept: edge models are updated iteratively, and data collection continues after deployment. Without strict dataset versioning, it becomes impossible to attribute improvements or regressions to data vs. model changes. Versioning must include raw/transformed data identifiers, labeling guidelines versions, and sampling policies.

Practical implication: treat datasets like code: immutable versions, changelogs, and reproducible splits. For edge, also record capture firmware/app versions because they affect sensor preprocessing and triggers.
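
A minimal sketch of an immutable snapshot manifest with per-file checksums, plus a deterministic user-level split; the split fraction and file paths are hypothetical:

# Snapshot manifest with checksums, plus a deterministic user-level split so
# no user appears in both train and test. Split fraction is hypothetical.
import hashlib
import json

def file_checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def build_manifest(version: str, paths: list) -> dict:
    """Freeze a dataset version as a checksummed file list."""
    return {"version": version, "files": {p: file_checksum(p) for p in paths}}

def write_manifest(manifest: dict, out_path: str) -> None:
    """Serialize the manifest deterministically so releases are diffable."""
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)

def split_for_user(user_id_hash: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign a user to 'train' or 'test' to avoid leakage."""
    bucket = int(hashlib.sha256(user_id_hash.encode()).hexdigest(), 16) % 1000
    return "test" if bucket < test_fraction * 1000 else "train"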

Step-by-step: create a dataset release process

  • Immutable snapshot: freeze a dataset version with checksums and metadata.
  • Changelog: document new sources, new labels, guideline changes, and transformations.
  • Stable splits: split by user/device/session to avoid leakage; keep a fixed test set.
  • Provenance: store capture app version, device type, and trigger type per sample.
  • Rebuild script: automate dataset assembly from source logs and labels.

Practical example: building an edge dataset for an on-device keyword spotter

Scenario: you need a keyword spotter that runs continuously on-device. Data must represent far-field speech, different accents, background noise, and near-miss phrases, while keeping capture minimal.

Step-by-step: end-to-end plan

  • Label contract: positive = keyword spoken once; timestamp at keyword start; negatives include near-miss phrases and conversational speech; “uncertain” allowed for overlap.
  • Guided capture: collect scripted positives from diverse speakers; collect scripted near-misses; record in multiple noise conditions (quiet room, TV noise, street noise) using the target devices.
  • In-the-wild triggers: log 2-second audio snippets when the model confidence is between 0.3 and 0.6, and when the user repeats the command within 10 seconds (possible miss).
  • Minimize data: store only short snippets; optionally export log-mel features instead of raw audio if that preserves labelability for keyword presence.
  • Human labeling: annotators mark keyword present/absent and timestamp; ambiguous clips go to adjudication.
  • QC: reject clipped audio; ensure sample rate consistency; audit annotator agreement on near-miss cases.
  • Balance: maintain a hard-negative pool of near-misses and background speech; deduplicate repeated household noise patterns.
  • Versioning: snapshot dataset v1.0 for baseline; v1.1 adds hard negatives from field false positives; keep a fixed device-held-out test set.

Practical example: labeling IMU-based activities with minimal user burden

Scenario: you want to detect activities like walking, running, cycling, and “still” from phone IMU. Continuous manual labeling is unrealistic, and activities transition frequently.

Step-by-step: hybrid labeling approach

  • Label contract: labels are per 2-second window with 50% overlap; transitions within a window are labeled “transition” or ignored.
  • Guided capture: short sessions where testers follow a script: 2 minutes walking, 2 minutes running, 2 minutes still, with a button press at each transition to create ground-truth markers.
  • Weak supervision: use GPS speed bins (when available and permitted) as a noisy label for walking vs. cycling; use “screen off + charging” as a proxy for stillness in some contexts.
  • Alignment: align button press timestamps to IMU stream; apply a fixed offset if the UI event is delayed.
  • QC: detect IMU saturation (phone shaking) and drop those windows; verify that sample rate is stable across devices.
  • Hard negatives: collect “phone in bag,” “phone in hand while gesturing,” and “vehicle ride” to reduce false positives.

Implementation sketch: capture metadata and labels safely

Concept: a practical edge pipeline stores small, structured records: a sample ID, minimal sensor payload (or features), and metadata needed for debugging and balancing. Labels may be added later by humans or generated by weak supervision.

# Record format for an edge-captured sample (runnable Python dataclass sketch)
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SampleRecord:
    sample_id: str            # random UUID
    t_start_ms: int
    t_end_ms: int
    device_model_hash: str
    app_version: str
    trigger_type: str         # e.g., "uncertainty", "user_feedback"
    payload: bytes            # e.g., compressed features or short snippet
    meta: Dict[str, str] = field(default_factory=dict)  # safe tags: noise_bin, light_bin, etc.

@dataclass
class LabelRecord:
    sample_id: str
    label_version: str        # ties to guideline version
    label: str                # class or structured JSON
    confidence: float         # for weak labels or adjudication
    source: str               # "human", "heuristic", "user_action"

Practical implication: by separating sample records from label records, you can relabel without recapturing, track guideline changes, and combine human and weak labels while keeping provenance.
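
For example, training pairs can be assembled by joining label records to sample records on sample_id, preferring human labels when several sources exist. A minimal sketch, assuming the record definitions above; the priority ordering is a hypothetical choice:

# Join LabelRecords to SampleRecords on sample_id, preferring human labels
# over user-action or heuristic labels when more than one exists.
SOURCE_PRIORITY = {"human": 0, "user_action": 1, "heuristic": 2}

def join_labels(samples: list, labels: list) -> list:
    """Pair each SampleRecord with its best LabelRecord (human labels win)."""
    best = {}
    for rec in labels:
        current = best.get(rec.sample_id)
        if current is None or (SOURCE_PRIORITY.get(rec.source, 99)
                               < SOURCE_PRIORITY.get(current.source, 99)):
            best[rec.sample_id] = rec
    return [(s, best[s.sample_id]) for s in samples if s.sample_id in best]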

Now answer the exercise about the content:

When collecting data for an edge AI model, which approach best supports privacy-preserving iteration when you may need to update labels over time?


Answer: keep sample records separate from label records. This allows relabeling without recapture, supports dataset version changes, and preserves provenance (human, heuristic, user action) while limiting sensitive payloads.

Next chapter

Model Architectures Optimized for Edge Constraints
