Mini-Project Overview: What You Will Build
In this mini-project you will build a small keyword spotting (KWS) system that runs fully on-device: a microphone audio pipeline continuously captures audio, converts it into compact features, and a lightweight classifier predicts whether a target keyword was spoken. The focus here is not on inventing a new neural architecture, but on wiring a reliable end-to-end audio pipeline that behaves well in real rooms: it should handle background noise, variable loudness, and different speaking styles, while keeping false triggers low. You will implement the pipeline in modular stages so you can test each stage independently: audio capture, buffering, preprocessing, feature extraction, inference, and post-processing (smoothing and trigger logic).
Keyword Spotting Concept: Streaming Classification Over Short Windows
Keyword spotting is a streaming classification problem: instead of classifying an entire recording, the device repeatedly classifies short, overlapping windows of audio (for example, 1 second of audio updated every 20 milliseconds). Each window is converted into a time-frequency representation such as log-mel spectrogram features. The model outputs probabilities for classes like “keyword” and “background” (or multiple keywords). A post-processing stage then decides when to emit a trigger event. The practical challenge is that the model’s output is noisy frame-to-frame; you need smoothing and thresholding so the system triggers once per utterance and does not chatter.
Project Targets and Constraints
Define concrete targets before coding so you can make trade-offs intentionally. A typical KWS target might be: detect a single keyword (for example “Hey Device”) with a detection latency under 300 ms after the word ends, with a false trigger rate below 1 per hour in typical home noise. You also need to decide the sampling rate (often 16 kHz), the window length (commonly 1.0 s), the hop size (commonly 20 ms), and the feature dimensions (for example 40 mel bands). These parameters determine memory usage, compute cost, and responsiveness. In this chapter you will implement the pipeline with these common defaults and learn where to tune them.
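To keep these defaults consistent across stages, it helps to define them once as named constants that every block reads. The sketch below captures them; the constant names are illustrative, and the same values reappear in the ring buffer code later in this chapter.

// Pipeline defaults used throughout this chapter (names are illustrative).
const int   SAMPLE_RATE_HZ = 16000;  // 16 kHz mono PCM
const float WINDOW_SECONDS = 1.0f;   // length of each classification window
const int   HOP_MS         = 20;     // how often a new window is classified
const int   N_MELS         = 40;     // mel bands per feature frame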
System Decomposition: The On-Device Audio Pipeline Blocks
Implement the system as a set of blocks with clear interfaces. (1) Audio capture: reads PCM samples from the microphone in small chunks. (2) Ring buffer: stores a rolling history of samples so you can assemble windows without gaps. (3) Preprocessing: optional DC removal, pre-emphasis, automatic gain control (AGC), and noise suppression if available. (4) Feature extraction: computes log-mel spectrogram frames for the latest window. (5) Inference: runs the model on the feature tensor. (6) Post-processing: smooths probabilities over time and applies trigger logic with debounce and refractory periods. This decomposition makes it easy to unit test: you can feed recorded audio into the ring buffer and verify features and triggers deterministically.
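Before implementing the blocks, it can help to write their interfaces down as plain function signatures. The sketch below previews the functions used in the following steps; the names and exact signatures are a suggestion, not a fixed API.

// Sketch of the block interfaces (adapt to your codebase).
void on_audio_callback(const int16_t* pcm, int n);            // (1) capture feeds (2) ring buffer
void get_latest_window(int16_t* out);                         // (2) assemble the latest window
void pcm16_to_float(const int16_t* in, float* out, int n);    // (3) preprocessing
void compute_logmel(const float* audio, int n_samples,
                    float* feats, int n_frames, int n_mels);  // (4) feature extraction
// (5) inference: a model wrapper exposing run(const float* feats)
// (6) post-processing: bool update_trigger(float p_keyword)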
Step 1: Choose Audio I/O Format and Buffering Strategy
Start by choosing a microphone format that is widely supported and easy to process: 16-bit PCM mono at 16,000 Hz is a common baseline. Decide the capture callback size (for example, 10 ms = 160 samples or 20 ms = 320 samples). Smaller callbacks reduce latency but increase overhead; larger callbacks reduce overhead but can make your system less responsive. Use a ring buffer sized to hold at least your window length plus some margin; for a 1-second window at 16 kHz, that is 16,000 samples. If you use 20 ms hops, you will read 320 new samples per step and reuse most of the previous window, which is efficient.
Ring Buffer Pseudocode
// 16 kHz, 1.0 s window, 20 ms hop, int16 PCM input.
const int SR = 16000;
const int WINDOW_SAMPLES = SR * 1;
const int HOP_SAMPLES = SR / 50;  // 20 ms
const int RING_SIZE = WINDOW_SAMPLES + 2 * HOP_SAMPLES;

int16_t ring[RING_SIZE];
int write_idx = 0;

void on_audio_callback(const int16_t* in, int n) {
  for (int i = 0; i < n; i++) {
    ring[write_idx] = in[i];
    write_idx = (write_idx + 1) % RING_SIZE;
  }
}

// Copy the latest WINDOW_SAMPLES into a linear buffer for feature extraction.
void get_latest_window(int16_t* out) {
  int start = (write_idx - WINDOW_SAMPLES + RING_SIZE) % RING_SIZE;
  for (int i = 0; i < WINDOW_SAMPLES; i++) {
    out[i] = ring[(start + i) % RING_SIZE];
  }
}

Step 2: Normalize and Convert PCM to Float
Most feature extraction code expects float samples in the range [-1, 1]. Convert int16 PCM by dividing by 32768.0. Also consider removing DC offset (subtract the mean) if your microphone chain has bias. Keep this stage simple and deterministic; if you later add AGC or noise suppression, do it as a separate optional module so you can A/B test its effect on false triggers.
PCM to Float and DC Removal
void pcm16_to_float(const int16_t* in, float* out, int n) {
  // Estimate and remove the DC offset, then scale to [-1, 1).
  float mean = 0.0f;
  for (int i = 0; i < n; i++) mean += (float)in[i];
  mean /= (float)n;
  for (int i = 0; i < n; i++) {
    out[i] = ((float)in[i] - mean) / 32768.0f;
  }
}

Step 3: Frame the Audio and Apply a Window Function
Log-mel features are computed from short-time Fourier transforms (STFT). Choose a frame length and frame step. A common choice is 25 ms frames with a 10 ms step. At 16 kHz, that is 400-sample frames with 160-sample hop. For a 1-second window you will have about 98 frames (depending on padding). Apply a Hann window to each frame before the FFT to reduce spectral leakage. Precompute the Hann window coefficients once to avoid repeated computation.
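Precomputing the Hann window is only a few lines. The sketch below assumes 400-sample frames and uses the symmetric form of the window; some training pipelines use the periodic form (divide by N instead of N-1), so match whichever your feature extractor used.

#include <math.h>

const int FRAME_LEN = 400;  // 25 ms at 16 kHz
static float hann[FRAME_LEN];

// Call once at startup; reuse the table for every frame.
void init_hann(void) {
  const float PI_F = 3.14159265358979f;
  for (int i = 0; i < FRAME_LEN; i++) {
    hann[i] = 0.5f * (1.0f - cosf(2.0f * PI_F * i / (FRAME_LEN - 1)));
  }
}

// Apply in place to one frame before the FFT.
void apply_hann(float* frame) {
  for (int i = 0; i < FRAME_LEN; i++) frame[i] *= hann[i];
}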
Framing Parameters
Be consistent between training and inference. If your model was trained on 40 mel bands with 25 ms / 10 ms STFT settings, your on-device pipeline must match those exactly: sample rate, frame length, hop length, FFT size, mel filter bank, log scaling, and any normalization. Many KWS failures in production come from subtle mismatches in feature extraction rather than the model itself.
Step 4: Compute Log-Mel Spectrogram Features
Compute the magnitude spectrum for each frame, apply a mel filter bank, then take the log. Use an FFT size that is at least the frame length (commonly 512 for 400-sample frames). The mel filter bank maps FFT bins to mel bands; 40 bands is a common compromise between accuracy and compute. After taking the log, you may apply per-band normalization (for example mean/variance normalization) if that was used during training. If you did not train with normalization, do not add it at inference time.
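If you need to build the mel filter bank yourself, the sketch below shows an HTK-style construction with triangular filters. The frequency range, the absence of Slaney-style area normalization, and the mel formula are assumptions; they must match your training pipeline exactly.

#include <math.h>

// HTK-style mel scale; Slaney-style differs below 1 kHz -- match training.
static float hz_to_mel(float hz)  { return 2595.0f * log10f(1.0f + hz / 700.0f); }
static float mel_to_hz(float mel) { return 700.0f * (powf(10.0f, mel / 2595.0f) - 1.0f); }

// Fill weights as a row-major [n_mels][n_fft/2 + 1] matrix of triangular filters.
// fmin/fmax (e.g., 20 Hz .. 7600 Hz) are assumptions; use your training values.
void build_mel_bank(float* weights, int n_mels, int n_fft, int sr,
                    float fmin, float fmax) {
  int n_bins = n_fft / 2 + 1;
  float mel_lo = hz_to_mel(fmin), mel_hi = hz_to_mel(fmax);
  // n_mels + 2 edge frequencies, equally spaced on the mel scale.
  for (int m = 0; m < n_mels; m++) {
    float left   = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m    ) / (n_mels + 1));
    float center = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 1) / (n_mels + 1));
    float right  = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 2) / (n_mels + 1));
    for (int k = 0; k < n_bins; k++) {
      float f = (float)k * sr / n_fft;  // center frequency of FFT bin k
      float w = 0.0f;
      if (f > left && f < center)        w = (f - left) / (center - left);
      else if (f >= center && f < right) w = (right - f) / (right - center);
      weights[m * n_bins + k] = w;
    }
  }
}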
Feature Tensor Shape
Most KWS models expect an input tensor shaped like [frames, mel_bins] or [1, frames, mel_bins, 1]. For a 1-second window with 10 ms hop you might have ~100 frames, so an input like 100 x 40. Decide whether you compute features for the whole window each hop (simpler) or update incrementally (more efficient). For a mini-project, compute the full window each hop first; once it works, optimize by caching overlapping frames.
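The exact frame count follows from the window, frame, and hop lengths. The short calculation below, assuming the defaults used in this chapter, reconciles the ~100 figure with the 98 frames mentioned earlier.

// Frames without padding: 1 + floor((window - frame_len) / hop)
//   1 + (16000 - 400) / 160 = 98 frames of 40 mel bands each.
// Pipelines that pad the window (e.g., "same"-style framing) can yield
// up to 100 frames; use whatever count your training pipeline produced.
const int N_FRAMES = 1 + (WINDOW_SAMPLES - 400) / 160;  // 98 with these defaults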
Log-Mel Extraction Pseudocode (High Level)
// High-level outline; FFT and mel bank details omitted for brevity.
void compute_logmel(const float* audio, int n_samples, float* out,
                    int n_frames, int n_mels) {
  for (int f = 0; f < n_frames; f++) {
    // 1) slice frame audio[f*hop : f*hop + frame_len]
    // 2) apply Hann window
    // 3) FFT -> magnitude spectrum
    // 4) mel filter bank multiply -> mel energies
    // 5) log(mel_energy + eps)
    // 6) write to out[f*n_mels + m]
  }
}

Step 5: Integrate the Model Inference Call
At this point you have a feature tensor for the latest window. Feed it into your on-device inference runtime and obtain output probabilities. Keep the inference wrapper minimal: allocate input and output tensors once, reuse them, and avoid per-hop allocations. If your runtime supports it, pin buffers to avoid copies. The output is typically a vector of class scores. For a single keyword detector you might have two classes: background and keyword. For multiple keywords you might have N+1 classes with background included.
Inference Wrapper Pattern
// Pseudocode: runtime-specific calls omitted.
struct KwsModel {
  float* input;
  float* output;
  int n_out;
  void init();
  void run(const float* features);
};

void KwsModel::run(const float* features) {
  // copy/assign features to input tensor
  // invoke runtime
  // read output tensor into output
}

Step 6: Post-Processing: Smoothing, Thresholding, and Trigger Logic
Raw per-window probabilities fluctuate. A robust KWS system uses temporal smoothing and a trigger policy. Common approaches include: moving average over the last K windows, exponential moving average (EMA), or requiring K consecutive windows above a threshold. You also need a refractory period (cooldown) after a trigger so the system does not fire multiple times for the same utterance. Additionally, consider a “minimum background confidence” rule: only trigger if keyword probability is high and background probability is low, which can reduce false positives in noisy environments.
EMA Smoothing and Debounce Example
float ema = 0.0f;
const float alpha = 0.2f;  // smoothing factor
const float TH = 0.8f;     // trigger threshold
int cooldown_steps = 0;
const int COOLDOWN = 50;   // e.g., 50 hops * 20 ms = 1 s

bool update_trigger(float p_keyword) {
  ema = alpha * p_keyword + (1.0f - alpha) * ema;
  if (cooldown_steps > 0) {
    cooldown_steps--;
    return false;
  }
  if (ema > TH) {
    cooldown_steps = COOLDOWN;
    return true;
  }
  return false;
}

Tuning the Trigger Policy
Thresholds and smoothing are not one-size-fits-all. If you see missed detections, reduce the threshold or increase the window length (more context) at the cost of latency. If you see false triggers, increase the threshold, increase smoothing, or require multiple consecutive windows above threshold. Always tune using representative audio: quiet rooms, TV in the background, music, far-field speech, and different microphones. Keep a small test suite of audio clips and run them through your pipeline to compare changes objectively.
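The "multiple consecutive windows above threshold" policy mentioned above is a useful alternative to compare against EMA smoothing. A minimal sketch, reusing the cooldown variables from the previous example, looks like this (the constants are illustrative starting points):

// Require K consecutive windows above threshold before triggering.
const float TH_CONSEC = 0.7f;  // per-window threshold (tune on your data)
const int   K_CONSEC  = 5;     // e.g., 5 hops * 20 ms = 100 ms of agreement
int above_count = 0;

bool update_trigger_consecutive(float p_keyword) {
  if (cooldown_steps > 0) { cooldown_steps--; above_count = 0; return false; }
  above_count = (p_keyword > TH_CONSEC) ? above_count + 1 : 0;
  if (above_count >= K_CONSEC) {
    above_count = 0;
    cooldown_steps = COOLDOWN;
    return true;
  }
  return false;
}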
Step 7: Build a Reproducible Offline Test Harness
Before running on live microphone input, build an offline harness that reads WAV files and simulates streaming by feeding chunks into the ring buffer at the same hop size. This lets you debug deterministically, print intermediate values, and plot features. The harness should output a timeline of keyword probabilities and trigger events. With this harness you can quickly iterate on feature extraction correctness and post-processing without dealing with device-specific audio APIs.
Streaming Simulation Outline
// Read WAV into pcm[] then simulate callback chunks.
// Stop before a final partial chunk so we never read past the buffer.
for (int i = 0; i + HOP_SAMPLES <= total_samples; i += HOP_SAMPLES) {
  on_audio_callback(&pcm[i], HOP_SAMPLES);
  get_latest_window(window_pcm);
  pcm16_to_float(window_pcm, window_f, WINDOW_SAMPLES);
  compute_logmel(window_f, WINDOW_SAMPLES, feats, N_FRAMES, N_MELS);
  model.run(feats);
  float p_kw = model.output[KW_INDEX];
  bool trig = update_trigger(p_kw);
  // log time, p_kw, ema, trig
}

Step 8: Validate Feature Extraction Against a Reference
Feature extraction is the most common source of “it works on my laptop but not on device” issues. Validate your on-device (or C/C++) feature pipeline against a known-good reference implementation used during training. Do this by running the same WAV through both pipelines and comparing the resulting log-mel tensors. They will not match bit-for-bit across libraries, but they should be close. If they diverge significantly, check: sample rate, pre-emphasis, window function, FFT scaling, mel filter bank definition (HTK vs Slaney-style), log base (natural log vs log10), and epsilon added before log. Fix mismatches before tuning thresholds; otherwise you will be tuning around a broken input representation.
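A simple way to quantify the mismatch is to dump the log-mel tensor from both pipelines for the same WAV file and compare them element-wise. A minimal sketch of such a comparison follows; what counts as "close" depends on your feature scale, so treat the printed statistics as a diagnostic rather than a pass/fail test.

#include <math.h>
#include <stdio.h>

// Compare a device-computed log-mel tensor against a reference dump.
// Prints mean and max absolute difference and returns the max.
float compare_features(const float* device, const float* reference, int n) {
  float max_abs = 0.0f, sum_abs = 0.0f;
  for (int i = 0; i < n; i++) {
    float d = fabsf(device[i] - reference[i]);
    sum_abs += d;
    if (d > max_abs) max_abs = d;
  }
  printf("mean |diff| = %f, max |diff| = %f\n", sum_abs / n, max_abs);
  return max_abs;
}

// Large systematic offsets usually point at log base, epsilon, or FFT
// scaling mismatches; band-dependent errors point at the mel filter bank.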
Step 9: Handle Real-World Audio Issues
Live microphones introduce issues that do not appear in curated datasets. Clipping: if users speak loudly near the mic, int16 samples saturate; consider detecting clipping and suppressing triggers during heavy clipping. Variable loudness: if your model is sensitive to amplitude, consider a simple per-window RMS normalization consistent with training, but be careful: aggressive normalization can amplify noise and increase false triggers. Non-speech noise: fans, keyboards, and TV can produce patterns that resemble parts of the keyword; post-processing and a stricter threshold often help more than heavy preprocessing. Multi-channel audio: if the device provides stereo, downmix to mono consistently (average channels) and ensure phase issues do not distort the signal.
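If you do adopt stereo downmixing or per-window RMS normalization, keep each as a small, explicit helper so it can be switched off for A/B tests. The sketches below show one way to write them; the target RMS and gain cap are assumptions to tune, and RMS normalization should only be used if training applied the same step.

#include <math.h>
#include <stdint.h>

// Average L/R into mono; assumes interleaved stereo int16 input.
void downmix_stereo_to_mono(const int16_t* stereo, int16_t* mono, int n_frames) {
  for (int i = 0; i < n_frames; i++) {
    mono[i] = (int16_t)(((int)stereo[2 * i] + (int)stereo[2 * i + 1]) / 2);
  }
}

// Per-window RMS normalization on float samples (illustrative constants).
void rms_normalize(float* x, int n, float target_rms) {
  float sum_sq = 0.0f;
  for (int i = 0; i < n; i++) sum_sq += x[i] * x[i];
  float rms = sqrtf(sum_sq / n);
  if (rms < 1e-6f) return;         // avoid amplifying near-silence
  float gain = target_rms / rms;
  if (gain > 10.0f) gain = 10.0f;  // cap gain so background noise is not blown up
  for (int i = 0; i < n; i++) x[i] *= gain;
}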
Simple Clipping Detector
bool is_clipped(const int16_t* pcm, int n) {
  int clipped = 0;
  for (int i = 0; i < n; i++) {
    if (pcm[i] == 32767 || pcm[i] == -32768) clipped++;
  }
  return (clipped > n / 100);  // >1% of samples clipped
}

Step 10: Add a “Wake Word State Machine” for Better UX
Even with smoothing, a single threshold can feel jumpy. A small state machine improves behavior: (A) Idle: listen for keyword. (B) Candidate: keyword probability rising; require it to stay above a lower threshold for a short duration. (C) Triggered: emit event and enter cooldown. This reduces false triggers from brief spikes while keeping latency low. You can also add a “cancel” path: if probability drops quickly, return to Idle. Keep the state machine simple and test it with the offline harness so you can see why it triggered.
State Machine Sketch
enum State { IDLE, CANDIDATE, TRIGGERED };
State st = IDLE;
int cand_steps = 0;
const float TH_LOW = 0.6f;
const float TH_HIGH = 0.85f;
const int CAND_MIN = 3;  // e.g., 3 hops = 60 ms

bool update_sm(float p) {
  switch (st) {
    case IDLE:
      if (p > TH_LOW) { st = CANDIDATE; cand_steps = 1; }
      return false;
    case CANDIDATE:
      if (p > TH_LOW) cand_steps++;
      else { st = IDLE; return false; }
      if (p > TH_HIGH && cand_steps >= CAND_MIN) {
        st = TRIGGERED;
        cooldown_steps = COOLDOWN;
        return true;
      }
      return false;
    case TRIGGERED:
      if (cooldown_steps > 0) { cooldown_steps--; return false; }
      st = IDLE;
      return false;
  }
  return false;
}

Step 11: Instrumentation: What to Log On Device
When you move from offline harness to device, add lightweight instrumentation so you can diagnose issues without dumping raw audio. Useful logs include: timestamp, smoothed keyword probability, trigger state, clipping flag, and simple signal statistics like RMS energy. If you can afford it, log a small rolling buffer of features (not raw audio) around trigger events for debugging; features are less sensitive than raw waveforms and often sufficient to understand false triggers. Keep logs rate-limited so they do not affect real-time behavior.
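A minimal rate-limited log can be as simple as printing one line every N hops plus one line at every trigger. The sketch below illustrates the idea; the field names, printing interval, and use of printf are placeholders for whatever logging facility your device provides.

#include <stdio.h>
#include <stdint.h>

// One record per hop, but only emitted every LOG_EVERY_N hops (or on a trigger)
// so logging cannot starve the audio thread.
const int LOG_EVERY_N = 25;  // e.g., 25 hops * 20 ms = every 0.5 s
static int log_counter = 0;

void log_kws_state(uint32_t t_ms, float p_smooth, int state,
                   int clipped, float rms, int triggered) {
  log_counter++;
  if (!triggered && (log_counter % LOG_EVERY_N) != 0) return;
  printf("t=%u ms p=%.2f state=%d clip=%d rms=%.4f trig=%d\n",
         (unsigned)t_ms, p_smooth, state, clipped, rms, triggered);
}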
Step 12: Practical Tuning Workflow With Real Audio
Use a structured tuning loop. (1) Collect a small set of positive clips (keyword spoken in different conditions) and negative clips (background speech, TV, noise). (2) Run the offline harness and compute metrics: detection rate, false triggers per minute, and average detection delay. (3) Adjust post-processing first (thresholds, smoothing, state machine) because it is fast and often yields large gains. (4) If performance is still poor, revisit feature extraction consistency and window/hop parameters. (5) Only then consider adding optional preprocessing like noise suppression, and evaluate whether it helps or hurts in your target environment.
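The metrics in step (2) can be computed directly in the offline harness from two lists: the trigger timestamps your pipeline emitted and the ground-truth keyword end times from your labels. The sketch below counts detections and false triggers; the matching window is an assumption tied to your latency target, and average detection delay can be added the same way.

#include <stdio.h>

// triggers[] and keyword_ends[] are timestamps in milliseconds;
// audio_ms is the total duration of the evaluated audio.
void score_run(const int* triggers, int n_trig,
               const int* keyword_ends, int n_kw, int audio_ms) {
  const int MATCH_MS = 750;  // trigger must land within this window after the keyword ends
  int detected = 0, false_trig = 0;
  // A trigger that matches no keyword is a false trigger
  // (simple O(n*m) matching is fine for a small test set).
  for (int t = 0; t < n_trig; t++) {
    int matched = 0;
    for (int k = 0; k < n_kw; k++) {
      int dt = triggers[t] - keyword_ends[k];
      if (dt >= 0 && dt <= MATCH_MS) { matched = 1; break; }
    }
    if (!matched) false_trig++;
  }
  // A keyword matched by any trigger counts as detected.
  for (int k = 0; k < n_kw; k++) {
    for (int t = 0; t < n_trig; t++) {
      int dt = triggers[t] - keyword_ends[k];
      if (dt >= 0 && dt <= MATCH_MS) { detected++; break; }
    }
  }
  printf("detection rate: %d/%d, false triggers/min: %.2f\n",
         detected, n_kw, false_trig * 60000.0f / audio_ms);
}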
Optional Extensions for the Mini-Project
If the basic pipeline works, extend it in ways that teach practical engineering. Add multi-keyword support: output multiple keyword probabilities and implement per-keyword thresholds and cooldowns. Add a “push-to-talk fallback”: if the system is uncertain (probability in a gray zone), require a button press. Add adaptive thresholds: estimate background noise level from RMS and slightly raise the threshold in noisy conditions. Add incremental feature updates: compute only the newest STFT frames each hop and reuse previous frames to reduce compute. Each extension should be validated with the offline harness and then tested on device.
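For the adaptive-threshold extension, one simple approach is to track a slow estimate of background RMS and raise the trigger threshold as it rises. A minimal sketch follows; all constants are illustrative starting points, not tuned values.

// Slowly track background level and raise the trigger threshold in noise.
static float noise_rms = 0.0f;
const float NOISE_ALPHA = 0.01f;  // slow EMA over per-window RMS
const float TH_BASE     = 0.80f;
const float TH_MAX      = 0.92f;
const float NOISE_GAIN  = 0.5f;   // how strongly noise raises the threshold

float adaptive_threshold(float window_rms, float p_keyword) {
  // Only update the noise estimate when the keyword is unlikely,
  // so utterances containing the keyword do not inflate it.
  if (p_keyword < 0.3f) {
    noise_rms = NOISE_ALPHA * window_rms + (1.0f - NOISE_ALPHA) * noise_rms;
  }
  float th = TH_BASE + NOISE_GAIN * noise_rms;
  return (th > TH_MAX) ? TH_MAX : th;
}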