
Edge AI in Practice: Building Privacy-Preserving, Low-Latency Intelligence on Devices


Compression Techniques: Quantization, Pruning, and Distillation

Chapter 5


Why Compression Matters on the Edge

Goal and constraints: On-device models must fit within tight limits on memory, storage, compute, and power while still meeting latency and accuracy targets. Compression techniques reduce model size and runtime cost so you can deploy stronger models on the same hardware, or meet real-time requirements without moving inference to the cloud.

What you can compress: In practice you compress (1) weights (parameters), (2) activations (intermediate tensors), and sometimes (3) the computation graph itself (operator fusion, constant folding). This chapter focuses on three model-level techniques used constantly in edge deployments: quantization, pruning, and knowledge distillation.

How to evaluate compression: Track at least four metrics: model size on disk, peak RAM during inference, latency (p50/p95), and task quality (accuracy, mAP, WER, etc.). Also track energy if you can (battery drain or power draw). Compression often shifts bottlenecks: a smaller model might become memory-bandwidth bound, or a pruned model might not speed up unless the runtime exploits sparsity.
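
For the latency part of that list, a small measurement loop is usually enough to get p50/p95 numbers. The sketch below is framework-agnostic Python; run_inference is a hypothetical callable that wraps whatever invokes your deployed model on the device.

# Latency benchmark sketch; run_inference is a placeholder for your own runtime call.
import time
import numpy as np

def benchmark_latency(run_inference, sample, warmup=10, iters=100):
    for _ in range(warmup):                      # warm caches, JITs, and power states
        run_inference(sample)
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference(sample)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return {"p50_ms": float(np.percentile(times_ms, 50)),
            "p95_ms": float(np.percentile(times_ms, 95))}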

Quantization

Concept: Quantization represents weights and/or activations with fewer bits than float32, typically int8. This reduces memory footprint and can accelerate inference on hardware with integer SIMD, DSPs, NPUs, or microcontroller instructions. Quantization can be applied after training (post-training quantization) or during training (quantization-aware training) to preserve accuracy.

Key idea: mapping floats to integers: A float tensor x is mapped to an integer tensor q using a scale s and zero-point z: q = round(x / s) + z. During inference, the runtime uses integer arithmetic and rescales as needed. Two common schemes are symmetric (z = 0, signed int8) and asymmetric (z may be nonzero, uint8). Symmetric is simpler and often faster; asymmetric can represent non-zero-centered distributions better.
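
To make the mapping concrete, here is a small NumPy sketch of asymmetric int8 quantization and dequantization, with the scale and zero-point derived from the observed range of a toy tensor (illustrative values only):

import numpy as np

def quantize(x, s, z, qmin=-128, qmax=127):
    # q = round(x / s) + z, clamped to the int8 range
    q = np.round(x / s) + z
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, s, z):
    # approximate reconstruction of the original floats
    return (q.astype(np.float32) - z) * s

x = np.array([-0.62, 0.0, 0.91, 1.30], dtype=np.float32)
s = (x.max() - x.min()) / 255.0              # asymmetric scale over the observed range
z = int(round(-128 - x.min() / s))           # zero-point aligns x.min() with qmin
x_hat = dequantize(quantize(x, s, z), s, z)  # close to x, up to rounding error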


Quantization targets and trade-offs

Weight-only vs. weight+activation: Weight-only quantization reduces model size and memory bandwidth but may not speed up much if activations remain float. Full integer quantization (weights and activations) enables end-to-end int8 kernels and usually yields the best latency improvements on edge accelerators.

Per-tensor vs. per-channel: Per-tensor uses one scale for an entire tensor; per-channel uses a separate scale per output channel (common for conv and linear weights). Per-channel typically preserves accuracy better, especially for depthwise convolutions and models with varied channel magnitudes, at a slight metadata and compute cost.
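
The difference is easy to see in code. This NumPy sketch computes one symmetric scale for a hypothetical conv weight tensor versus one scale per output channel:

import numpy as np

w = np.random.randn(32, 16, 3, 3).astype(np.float32)  # (out_ch, in_ch, kH, kW), toy weights

# Per-tensor: a single symmetric scale for the whole tensor.
scale_tensor = np.abs(w).max() / 127.0

# Per-channel: one symmetric scale per output channel.
scale_channel = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
q_per_channel = np.clip(np.round(w / scale_channel[:, None, None, None]),
                        -127, 127).astype(np.int8)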

Dynamic range and outliers: Quantization error grows when a few outliers force a large scale, reducing effective precision for most values. Mitigations include per-channel quantization, percentile-based clipping during calibration, or keeping sensitive layers in higher precision.

Post-Training Quantization (PTQ): step-by-step

When to use PTQ: Use PTQ when you have a trained float model and want a fast path to int8 deployment, especially when you cannot afford retraining. PTQ works best when the model is already well-behaved (no extreme activation outliers) and the task is not extremely sensitive.

Step 1 — Choose the quantization scope: Decide whether you need full int8 (weights+activations) or weight-only. If your target runtime/hardware supports int8 operators broadly, prefer full int8. If some ops are unsupported, plan for mixed precision (some float fallback), and measure the latency impact.

Step 2 — Prepare a representative calibration set: Collect a small dataset (often 100–1000 samples) that matches real inference distributions: lighting, sensor noise, typical inputs, and edge cases. Calibration does not require labels; it needs representative activations. If your input pipeline includes normalization, resizing, or feature extraction, calibration must run through the exact same preprocessing.

Step 3 — Calibrate activation ranges: Run the model on the calibration set and record min/max or histogram statistics for each activation tensor. Histogram-based methods (e.g., KL divergence minimization) often outperform simple min/max because they reduce the influence of outliers. Many toolchains expose options like “minmax”, “percentile”, or “entropy/KL”.
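
A percentile-based range is one simple way to tame outliers. The helper below is a framework-agnostic sketch over activation arrays you have recorded yourself while running the calibration set:

import numpy as np

def percentile_range(recorded_activations, lo_pct=0.1, hi_pct=99.9):
    # Clip the calibration range to percentiles instead of raw min/max so a
    # handful of outliers does not inflate the scale for everything else.
    flat = np.concatenate([a.ravel() for a in recorded_activations])
    return np.percentile(flat, lo_pct), np.percentile(flat, hi_pct)

# Example (hypothetical): lo, hi = percentile_range(acts_per_tensor["conv3_out"])
#                         scale = (hi - lo) / 255.0  # asymmetric 8-bit range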

Step 4 — Quantize and export: Convert weights to int8 (or int16 for some layers), insert quantize/dequantize nodes if needed, and export to your deployment format. Validate that the exported model uses integer kernels on the target device (not just in a desktop simulator).
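
As one example of this step, if your deployment format is TensorFlow Lite, full-integer conversion looks roughly like the sketch below. The SavedModel path and the random calibration inputs are placeholders; in practice the representative dataset must yield real, preprocessed samples.

import numpy as np
import tensorflow as tf

saved_model_dir = "exported_float_model"  # placeholder path

def representative_dataset():
    # Replace with ~100-1000 real, preprocessed calibration samples.
    for _ in range(200):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())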

Step 5 — Validate accuracy and regressions: Compare float vs. quantized outputs. For classification, check top-1/top-5; for detection, check mAP; for speech, WER; for regression, MAE/RMSE. Also check “behavioral” regressions: confidence calibration, false positives in rare conditions, and stability across input noise.

Step 6 — Iterate with targeted fixes: If accuracy drops, try per-channel weights, better calibration (percentile/histogram), or exclude sensitive layers from quantization (e.g., first/last layers, attention softmax, layer norm). Mixed precision is common: keep a few layers in float16/float32 while quantizing the rest.

Quantization-Aware Training (QAT): step-by-step

When to use QAT: Use QAT when PTQ accuracy is insufficient, or when the model has quantization-sensitive components (e.g., small models, depthwise-heavy networks, certain transformers). QAT simulates quantization during training so the model learns to be robust to rounding and clipping.

Step 1 — Start from a strong float checkpoint: Begin with a converged float model. QAT is usually a fine-tuning stage, not a full training from scratch, unless your pipeline is already built for it.

Step 2 — Insert fake-quantization modules: During forward passes, weights and activations are “fake-quantized” (quantize then dequantize) so gradients still flow in float. This exposes the network to quantization noise while keeping training stable.
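
Conceptually, fake quantization looks like the PyTorch-style sketch below: quantize, dequantize, and use a straight-through estimator so the backward pass treats the rounding as the identity.

import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Forward: quantize then dequantize; backward: gradients flow as if this
    # function were the identity (straight-through estimator).
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_q = (q - zero_point) * scale
    return x + (x_q - x).detach()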

Step 3 — Use a small learning rate and short schedule: QAT often converges in a few epochs. Use a reduced learning rate, and consider freezing batch norm statistics or folding batch norm into conv weights before final export, depending on your framework and runtime.

Step 4 — Monitor layer-wise saturation: Watch for activations that frequently clip at the quantization range. If clipping is heavy, adjust calibration ranges, use per-channel quantization, or modify the model (e.g., replace problematic activations, add normalization, or change scaling).

Step 5 — Export and verify integer execution: After QAT, export to an int8 model and confirm that the target runtime uses int8 kernels. Re-run the same latency and memory benchmarks on-device.

Practical notes for edge deployment

Operator coverage matters: Quantization only speeds up what runs in integer. If your model uses unsupported ops (custom activations, certain resize modes, some attention patterns), you may end up with float fallbacks that dominate latency. Consider replacing ops with quantization-friendly alternatives or using a model variant designed for your runtime.

Beware of small batch behavior: Edge inference typically runs with batch size 1. Some quantized kernels are optimized for larger batches; always benchmark with realistic batch sizes and input shapes.

Measure end-to-end, not just model time: Preprocessing and postprocessing can dominate latency after quantization speeds up the core network. If you quantize a detector but NMS remains slow, overall gains may be limited.

Pruning

Concept: Pruning removes parameters or structures that contribute little to the output. The goal is to reduce compute and memory while maintaining accuracy. Pruning can be unstructured (removing individual weights) or structured (removing channels, filters, heads, or entire blocks). Structured pruning is usually more deployable on edge devices because it maps to smaller dense tensors that standard kernels can accelerate.

Types of pruning and what they buy you

Unstructured pruning (sparsity): You set many individual weights to zero. This can yield high parameter sparsity (e.g., 80–95%), but speedups depend on sparse kernel support. Many mobile runtimes do not accelerate arbitrary sparsity well, so you might reduce model size (with sparse encoding) but see limited latency gains.

Structured pruning (channels/filters): You remove entire channels or filters, reducing tensor dimensions. This reliably reduces FLOPs and memory bandwidth and typically speeds up inference on standard dense kernels. The trade-off is that structured pruning can hurt accuracy more per removed parameter than unstructured pruning, so it requires careful tuning.

Block pruning and N:M sparsity: Some hardware supports structured sparsity patterns such as 2:4, where at most two weights in every group of four are nonzero. This can provide real speedups while keeping kernels efficient. If your target accelerator supports it, N:M sparsity can be a sweet spot between deployability and compression.

Structured channel pruning: step-by-step

Step 1 — Decide the pruning granularity: For CNNs, common units are output channels of conv layers. For transformers, common units include attention heads, MLP hidden dimensions, or entire layers. Pick a granularity that your runtime can exploit (channels and heads are usually safe).

Step 2 — Choose an importance criterion: Popular criteria include L1/L2 norm of weights per channel, activation-based importance (average magnitude), or gradient-based saliency. Norm-based criteria are simple and often effective; activation/gradient-based methods can be more accurate but require extra instrumentation.
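
For the norm-based option, per-channel importance comes straight from the weight tensor, as in this PyTorch-style sketch (the pruning fraction is an illustrative parameter):

import torch

def channel_l1_importance(conv_weight):
    # L1 norm of each output channel of a conv weight shaped (out_ch, in_ch, kH, kW).
    return conv_weight.abs().flatten(1).sum(dim=1)

def channels_to_prune(conv_weight, prune_fraction=0.1):
    # Indices of the least important output channels for this pruning round.
    scores = channel_l1_importance(conv_weight)
    n_prune = int(prune_fraction * scores.numel())
    return torch.argsort(scores)[:n_prune]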

Step 3 — Set a pruning schedule: Avoid pruning too aggressively in one shot. Use iterative pruning: prune a small fraction (e.g., 5–20%), fine-tune, then repeat. This helps the network recover and redistributes capacity to remaining channels.

Step 4 — Handle dependencies in the graph: Removing channels affects downstream layers. For residual connections, concatenations, and depthwise separable convolutions, you must prune consistently so tensor shapes still match. Many pruning libraries provide graph-aware pruning; if you do it manually, map each pruned channel to corresponding input channels in the next layer.

Step 5 — Fine-tune with regularization: After pruning, fine-tune on your training data with a modest learning rate. Consider adding regularizers that encourage sparsity or stability (e.g., weight decay, knowledge distillation loss from the original model). Monitor accuracy and latency after each pruning round.

Step 6 — Export a compact dense model: The deployable artifact should be a smaller dense model, not a large model with masked zeros. Physically remove pruned channels and re-save weights so the runtime sees reduced tensor shapes and can realize speedups.
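
The sketch below shows the idea for two consecutive plain Conv2d layers in PyTorch: the kept output channels of the first layer are sliced out, and the matching input channels of the next layer are sliced to agree. It assumes default groups and dilation and no channel-wise layer (such as BatchNorm) in between; if one exists, its parameters must be sliced the same way.

import torch.nn as nn

def remove_output_channels(conv, next_conv, keep_idx):
    # Rebuild both layers with physically smaller weight tensors.
    new_conv = nn.Conv2d(conv.in_channels, len(keep_idx), conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep_idx].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep_idx].clone()

    new_next = nn.Conv2d(len(keep_idx), next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep_idx].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next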

Unstructured pruning: when it still helps

Model size reduction: If your primary constraint is storage (flash) rather than latency, unstructured pruning combined with sparse compression can reduce on-disk size. This is useful when distributing models over-the-air to devices with limited storage.

Hybrid approach: A common edge strategy is structured pruning for latency plus mild unstructured pruning for additional size reduction, followed by quantization. The ordering matters: prune first (to change shapes), then quantize the final compact model.

Pruning pitfalls

Latency does not always drop: If pruning reduces FLOPs but increases memory fragmentation or prevents kernel fusion, latency may not improve. Always benchmark on the target device and ensure the runtime uses optimized kernels for the new shapes.

Accuracy cliffs: Some layers are more sensitive (early feature extractors, downsampling stages, final projection layers). Use layer-wise pruning ratios rather than a uniform global ratio, and protect sensitive layers.

Distribution shift sensitivity: Pruned models can become less robust to input changes. Validate on stress tests (noise, blur, lighting changes) that match your edge environment.

Knowledge Distillation

Concept: Knowledge distillation trains a smaller “student” model to mimic a larger “teacher” model. Instead of learning only from hard labels, the student learns from the teacher’s soft outputs (probabilities or logits) and sometimes intermediate features. Distillation often yields a student that is more accurate than training the same small model from scratch, especially when data is limited or labels are noisy.

What the student learns from the teacher

Soft targets (logit distillation): The teacher’s logits contain information about class similarities (e.g., “cat” vs “dog”). Using a temperature parameter T smooths the distribution: softmax(logits/T). Higher T reveals more structure but can weaken gradients if too high.

Feature distillation: The student matches intermediate representations (feature maps, attention maps) from the teacher. This is useful when the student architecture differs but still has comparable stages. Feature matching can stabilize training and improve transfer beyond final outputs.
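
A minimal feature-matching loss, assuming the student and teacher feature maps share spatial size but may differ in channel count (a 1x1 convolution bridges the widths), could look like this PyTorch-style sketch:

import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    # Match one student feature map to one teacher feature map.
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Teacher features are detached; only the student (and the projection) learn.
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())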

Response-based vs. relation-based: Some methods distill relationships (pairwise distances between samples, attention relations) rather than raw features. These can help when feature dimensions differ significantly.

Distillation: step-by-step

Step 1 — Pick a teacher and define the deployment target: The teacher should be accurate and stable. The student must meet edge constraints (latency, memory). Distillation is most effective when the teacher is significantly stronger than the student but still solves the same task.

Step 2 — Choose the distillation losses: A typical setup combines (a) standard supervised loss with ground-truth labels and (b) distillation loss between teacher and student outputs. For classification, use KL divergence between softened probability distributions. For detection or segmentation, distill class logits, box regressions, or mask logits, and optionally intermediate features.

Step 3 — Set temperature and loss weights: Common practice is T in the range 2–8 for classification, but it is task-dependent. Balance the losses with weights (e.g., alpha for distillation, 1-alpha for supervised) and tune them on a validation set; too much distillation can cause the student to inherit teacher biases, while too little yields minimal benefit.

Step 4 — Train the student with the teacher frozen: Keep the teacher fixed to provide a stable target. Run teacher forward passes to produce logits/features, then train the student to match them. If compute is a concern during training, precompute teacher outputs for the dataset (when feasible) or use a smaller teacher for iterative experimentation.

Step 5 — Validate and stress-test: Distillation can improve average accuracy but sometimes worsens calibration or rare-class behavior. Check confusion matrices, calibration curves, and edge-case performance relevant to your deployment.

Distillation as a compression pipeline component

Distill then quantize: A common pattern is to distill a compact student, then quantize it. The student can be designed to be quantization-friendly (operator choices, normalization, activation functions). This often yields better final int8 accuracy than quantizing a larger, more complex model directly.

Distill to recover pruning/quantization loss: Distillation can be used as a fine-tuning objective after pruning or during QAT. The teacher is typically the original float model; the student is the pruned and/or quantized model. This helps the compressed model match the teacher’s behavior and regain accuracy.

Putting It Together: Practical Compression Recipes

Recipe 1: Fastest path to deployment (PTQ-first)

Use when: You need a quick win and your model is likely quantization-friendly.

  • Export float model and benchmark on-device (baseline size, RAM, latency, accuracy).
  • Run PTQ with a representative calibration set; prefer per-channel weight quantization.
  • Benchmark again; verify integer kernels are used.
  • If accuracy drop is acceptable, stop; otherwise move to QAT for a short fine-tune.

Recipe 2: Latency-driven compression (structured pruning + quantization)

Use when: Latency is the primary constraint and your runtime benefits from smaller dense shapes.

  • Profile the model to find expensive layers (often early high-resolution convs or large MLP blocks).
  • Apply structured pruning iteratively with conservative per-layer ratios.
  • After each iteration, export a physically smaller dense model and benchmark on-device.
  • Once latency target is met or accuracy starts to cliff, stop pruning.
  • Quantize the pruned model (PTQ or QAT) and re-benchmark.

Recipe 3: Accuracy-preserving compression (distillation-centered)

Use when: You must keep accuracy high while shrinking the model significantly.

  • Train or select a strong teacher model.
  • Design a student architecture that meets edge constraints and uses quantization-friendly ops.
  • Train the student with distillation (logits and optionally features).
  • Quantize the student; if needed, run QAT with distillation from the teacher.
  • Validate robustness and calibration; benchmark end-to-end latency including pre/post steps.

Implementation Checklist and Debugging Tips

Checklist: before you compress: Lock down preprocessing, input shapes, and evaluation metrics. Establish a repeatable on-device benchmark harness. Save baseline outputs for a fixed test set so you can detect subtle regressions after each compression step.

Checklist: after each compression step: Confirm the exported model is actually smaller, confirm peak RAM, and confirm operator execution mode (int8 vs float). If latency does not improve, inspect runtime logs or profiling traces to identify float fallbacks, slow postprocessing, or memory-bound operators.

Debugging quantization issues: If accuracy collapses, check for (1) incorrect input normalization, (2) calibration set mismatch, (3) layers with extreme activation ranges, (4) unsupported ops causing unexpected dequantize/quantize boundaries. Try per-channel weights, better calibration, or QAT.

Debugging pruning issues: If latency does not drop, ensure you exported a compact dense model rather than masked weights. If accuracy drops sharply, reduce pruning in sensitive layers, prune iteratively, and fine-tune longer. Consider distillation during fine-tuning to stabilize recovery.

Debugging distillation issues: If the student underperforms, tune temperature and loss weights, ensure the teacher is strong on your data distribution, and consider feature distillation at matching stages. If the student becomes overconfident, add calibration-aware evaluation and consider label smoothing or temperature scaling at inference (if supported).

# Sketch: combining distillation with QAT fine-tuning (PyTorch-style).
# Assumes `teacher` (frozen float model), `student` (with fake-quant modules
# inserted by your QAT toolchain), `data_loader`, and `optimizer` already exist.
import torch
import torch.nn.functional as F

T = 4.0       # distillation temperature
alpha = 0.7   # weight of the distillation loss vs. the supervised loss

teacher.eval()  # teacher stays frozen and provides stable targets
for x, y in data_loader:
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)  # fake-quantization happens inside the student
    loss_sup = F.cross_entropy(s_logits, y)
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    loss_kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean") * (T * T)
    loss = (1 - alpha) * loss_sup + alpha * loss_kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Now answer the exercise about the content:

In edge deployments, why does structured pruning typically produce more reliable inference speedups than unstructured pruning?


Answer: Structured pruning reduces tensor dimensions by removing whole channels/filters, which maps well to smaller dense operations that typical edge runtimes can speed up. Unstructured sparsity may not be accelerated unless sparse kernels are supported.

Next chapter

Hardware-Aware Optimization and Accelerator Utilization
