Edge AI in Practice: Building Privacy-Preserving, Low-Latency Intelligence on Devices

Deployment Runtimes: TensorFlow Lite, ONNX Runtime, and Core ML

Chapter 9

What “Deployment Runtime” Means on the Edge

A deployment runtime is the software layer that loads a trained model, prepares inputs, executes inference efficiently on the target device, and returns outputs in a format your application can use. On-device runtimes sit between your app code (camera pipeline, audio capture, sensor fusion, UI) and the compute backends (CPU, GPU, DSP, NPU/ANE). In practice, choosing a runtime is not only about raw speed; it also determines which model formats you can ship, how you integrate with mobile/embedded build systems, what operators are supported, how you handle pre/post-processing, and how you validate correctness across devices.

In this chapter, you will work with three widely used runtimes: TensorFlow Lite (TFLite), ONNX Runtime (ORT), and Apple Core ML. They overlap in purpose but differ in model formats, tooling, hardware delegation, and platform fit. A common production pattern is to keep a “source of truth” model in a training framework, export to an interchange format (or multiple), then package per-platform artifacts with runtime-specific optimizations and validation.

Comparing TensorFlow Lite, ONNX Runtime, and Core ML

Model formats and packaging

TensorFlow Lite consumes .tflite flatbuffer models. These are compact, designed for mobile/embedded distribution, and can embed metadata (labels, normalization hints, input shapes) to help downstream apps. ONNX Runtime consumes .onnx models (Open Neural Network Exchange), a graph format intended for portability across frameworks and runtimes. Core ML consumes .mlmodel sources (or .mlpackage bundles for newer ML Program models) and typically compiles them to .mlmodelc for deployment; compilation can specialize the model for the target device and OS version.

Platform fit

TFLite is strong on Android, Linux embedded, and microcontroller-adjacent environments (with TFLite Micro as a separate stack). ONNX Runtime is strong when you want one inference API across multiple OSes (Android, iOS, Windows, Linux) and when your organization standardizes on ONNX as the interchange format. Core ML is the native choice for Apple platforms (iOS, iPadOS, macOS, watchOS, tvOS) and integrates tightly with Apple hardware acceleration and developer tooling.

Hardware acceleration and “delegates/providers”

TFLite uses delegates to offload parts of the graph to accelerators: GPU delegate, NNAPI delegate (Android), and vendor-specific delegates. ONNX Runtime uses execution providers (EPs) such as CPU, NNAPI (Android), CoreML EP (iOS), and others. Core ML uses Apple’s internal scheduling across CPU/GPU/Neural Engine, and you can influence behavior via compute units and model configuration.


Operator support and graph compatibility

Each runtime supports a set of operators (ops) and data types. If your model uses an op not supported by the runtime or by a specific delegate/provider, the runtime may fall back to CPU for that subgraph, or fail to load. This makes “operator coverage” a practical constraint: you often need to adjust the model graph (e.g., replace unsupported ops, freeze shapes, avoid dynamic control flow) to achieve stable, accelerated inference on-device.

TensorFlow Lite in Practice

When TFLite is a good choice

TFLite is a good choice when you target Android and embedded Linux, want a small runtime footprint, and can export models cleanly from TensorFlow/Keras. It is also common when you want to use TFLite metadata, or when you rely on Android’s NNAPI for vendor acceleration. TFLite can run on CPU with optimized kernels (XNNPACK) and can delegate to GPU/NNAPI for acceleration depending on device capabilities.

Step-by-step: Exporting a model to .tflite

The typical workflow is: start from a SavedModel (or Keras model), convert to TFLite, then validate numerics and performance on representative inputs.

# Python: convert a TensorFlow SavedModel to TFLite (float model example)
import tensorflow as tf

saved_model_dir = "./saved_model"
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

# Common toggles: enable TF ops (bigger binary), or keep only built-ins
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
# If your model needs select TF ops (less portable), uncomment:
# converter.target_spec.supported_ops = [
#     tf.lite.OpsSet.TFLITE_BUILTINS,
#     tf.lite.OpsSet.SELECT_TF_OPS
# ]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

After conversion, run a quick sanity check by loading the model with the TFLite Interpreter in Python and comparing outputs to the original model on a small set of test inputs. This catches shape mismatches, preprocessing differences, and conversion issues early.
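
The sketch below is one way to do that check, assuming the SavedModel came from Keras and the float .tflite produced above; the tolerances are model-dependent and should be tuned to your own outputs.

# Python: sanity-check TFLite outputs against the source Keras model (illustrative)
import numpy as np
import tensorflow as tf
# Assumes the SavedModel was produced from a Keras model and the float .tflite from above
keras_model = tf.keras.models.load_model("./saved_model")
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
for _ in range(8):  # a handful of random (or, better, representative) inputs
    x = np.random.rand(*inp["shape"]).astype(np.float32)
    ref = keras_model(x).numpy()
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    got = interpreter.get_tensor(out["index"])
    # Tolerances are model-dependent; tighten or relax as needed
    np.testing.assert_allclose(ref, got, rtol=1e-3, atol=1e-4)
print("TFLite outputs match the Keras reference within tolerance")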

Step-by-step: Running inference with the TFLite Interpreter

At runtime, you allocate tensors, set input data, invoke, and read outputs. The key is to match input dtype and layout exactly (for example, NHWC float32 vs uint8). If you use metadata, you can store normalization parameters and label maps alongside the model, but your app still needs to apply the correct preprocessing.

# Python: minimal TFLite inference loop
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Example input: batch=1, height=224, width=224, channels=3
x = np.random.rand(1, 224, 224, 3).astype(input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
y = interpreter.get_tensor(output_details[0]["index"])
print(y.shape, y.dtype)

Delegates: GPU and NNAPI

Delegates can significantly change latency, but they also change which ops are accelerated and sometimes impose constraints (static shapes, supported dtypes). On Android, NNAPI is the standard path to vendor NPUs. The GPU delegate is often effective for vision models with large convolution workloads. A practical integration approach is to attempt delegate initialization, fall back gracefully to CPU, and log which path was used so you can diagnose device-specific behavior.

// Android (Kotlin): sketch of enabling the NNAPI delegate
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate

val options = Interpreter.Options()
try {
    val nnApiDelegate = NnApiDelegate()
    options.addDelegate(nnApiDelegate)
} catch (e: Exception) {
    // Fallback to CPU if the delegate cannot be created on this device
}
// loadModelFile(...) is your app's helper that maps the .tflite asset into a ByteBuffer
val interpreter = Interpreter(loadModelFile("model.tflite"), options)

Common pitfalls with TFLite deployments

  • Input preprocessing drift: training used mean/std normalization but the app uses a different scale or channel order.

  • Dynamic shapes: some delegates prefer fixed input sizes; resizing at runtime can force CPU fallback.

  • Unsupported ops: conversion succeeds with SELECT_TF_OPS but increases binary size and may reduce portability.

  • Threading: CPU performance depends on thread count and device; set threads explicitly and benchmark (see the sketch after this list).
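
As a rough starting point for the threading item above, the sketch below sets the thread count explicitly and times warm invocations; the right value is device-specific, so treat the number as something to benchmark per target.

# Python: set CPU threads explicitly and take a rough warm-latency measurement
import time
import numpy as np
import tensorflow as tf
interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
x = np.random.rand(*inp["shape"]).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()  # warm-up
start = time.perf_counter()
runs = 50
for _ in range(runs):
    interpreter.invoke()
print("mean latency (ms):", (time.perf_counter() - start) / runs * 1000.0)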

ONNX Runtime in Practice

When ONNX Runtime is a good choice

ONNX Runtime is a good choice when your model originates from multiple training frameworks, when you want a consistent inference API across platforms, or when you want to standardize on ONNX as the interchange format for a heterogeneous device fleet. ORT provides a flexible execution provider system, graph optimizations, and tooling for model inspection. It is commonly used in cross-platform products where Android and iOS share a large portion of inference logic.

Step-by-step: Exporting to ONNX

Export depends on the training framework. For PyTorch, export is typically done via torch.onnx.export. The most important practical decisions are: opset version, static vs dynamic axes, and naming inputs/outputs clearly so app code can bind tensors reliably.

# Python (PyTorch): export to ONNX
import torch
import torch.onnx

model = ...  # your trained torch.nn.Module
model.eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    opset_version=17,
    do_constant_folding=True,
    dynamic_axes=None  # prefer static for mobile unless you truly need dynamic
)

After export, validate the ONNX graph with a checker and run a small inference test using ONNX Runtime on desktop to confirm outputs are reasonable before integrating into mobile builds.
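
A desktop-side check could look like the following sketch, assuming the 1x3x224x224 "input"/"logits" interface from the export above.

# Python: validate the exported graph, then run a quick desktop inference check
import numpy as np
import onnx
import onnxruntime as ort
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)  # raises if the graph is structurally invalid
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
(logits,) = sess.run(["logits"], {"input": x})
print("output shape:", logits.shape)  # eyeball shapes and value ranges before mobile work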

Step-by-step: Running inference with ONNX Runtime

ONNX Runtime uses sessions. You create an InferenceSession with session options, then call run with a dictionary mapping input names to numpy arrays (or platform-specific tensor types). For mobile, you typically use ORT Mobile packages to reduce binary size.

# Python: ONNX Runtime inference
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(["logits"], {"input": x})
logits = outputs[0]
print(logits.shape, logits.dtype)

Execution Providers on mobile: NNAPI and CoreML EP

On Android, the NNAPI Execution Provider can offload supported subgraphs to device accelerators. On iOS, the CoreML Execution Provider can translate parts of the ONNX graph to Core ML for acceleration. In both cases, you should expect partial offload: some ops run on the accelerator, others on CPU. Your benchmarking should measure end-to-end latency including data movement and any required layout conversions.

A practical approach is to ship a configuration that tries an accelerator EP first, then falls back to CPU. You also want to log the chosen provider and, when possible, enable profiling to see which nodes ran where during development builds.
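
The sketch below shows that pattern with the Python API; the same idea applies through the Android and iOS bindings. It assumes an ONNX Runtime build that includes the CoreML EP (on Android you would list the NNAPI EP instead).

# Python: try an accelerator EP first, fall back to CPU, and log what was actually used
import onnxruntime as ort
session_options = ort.SessionOptions()
session_options.enable_profiling = True  # development builds only; writes a JSON trace
preferred = ["CoreMLExecutionProvider", "CPUExecutionProvider"]  # NNAPI EP on Android
available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]
sess = ort.InferenceSession("model.onnx", sess_options=session_options, providers=providers)
print("requested providers:", providers)
print("session providers:", sess.get_providers())  # log this for device diagnostics
# ...run inference, then collect the trace to see per-node provider assignment:
# trace_path = sess.end_profiling()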

Graph optimizations and model compatibility

ORT can apply graph optimizations (constant folding, operator fusion, layout optimizations) at session creation time. For edge deployments, you want predictable startup time and memory use, so you may choose a balanced optimization level and cache compiled artifacts when supported. Compatibility issues often come from unsupported ops in a mobile EP, or from dynamic shapes that complicate compilation. If you need dynamic shapes, constrain them (for example, fixed batch size, bounded image sizes) and test on your lowest-end target device.
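
As an illustration, a session could be configured with a balanced optimization level and a cached optimized graph roughly as follows; whether caching pays off depends on your runtime version and providers.

# Python: pick a balanced optimization level and cache the optimized graph
import onnxruntime as ort
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
session_options.optimized_model_filepath = "model.opt.onnx"  # written once, reusable at next startup
sess = ort.InferenceSession("model.onnx", sess_options=session_options,
                            providers=["CPUExecutionProvider"])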

Common pitfalls with ONNX Runtime deployments

  • Opset mismatches: exporting with a newer opset than your runtime supports can cause load failures.

  • Dynamic axes everywhere: convenient for experimentation but can reduce accelerator offload and increase overhead.

  • Binary size: full ORT packages can be large; prefer ORT Mobile and include only needed ops when possible.

  • Input layout mismatch: many ONNX vision models use NCHW; your camera pipeline may naturally produce NHWC (see the layout-conversion sketch after this list).
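
As a small illustration of the layout item above, the sketch below transposes an NHWC frame to NCHW; on-device you would do the equivalent in your app language, ideally fused with the float conversion to avoid an extra pass over the buffer.

# Python: convert an NHWC camera-style tensor to the NCHW layout many ONNX models expect
import numpy as np
frame_nhwc = np.zeros((1, 224, 224, 3), dtype=np.float32)  # stand-in for a preprocessed frame
frame_nchw = np.ascontiguousarray(np.transpose(frame_nhwc, (0, 3, 1, 2)))
print(frame_nchw.shape)  # (1, 3, 224, 224)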

Core ML in Practice

When Core ML is a good choice

Core ML is the best choice when your primary target is Apple devices and you want the most native integration with Apple’s hardware acceleration and developer ecosystem. Core ML models can be integrated directly into Xcode projects, and inference can be performed via Vision (for image-related tasks) or directly via Core ML APIs. Core ML also supports model compilation, which can improve load time and performance consistency on device.

Step-by-step: Converting a model to Core ML

Conversion is commonly done using coremltools from either a TensorFlow or PyTorch source, or from ONNX (depending on your pipeline). The practical steps are: define inputs with correct shapes and scaling, convert, set metadata, then compile and test on a real device. Pay close attention to image input configuration: channel order, color space, and normalization parameters should match what your app will provide.

# Python: convert to Core ML with coremltools (illustrative)
import coremltools as ct

# Assume you have a TorchScript model or an ML program-compatible source
model = ...

# Define input type; for images, you can use ct.ImageType with scale/bias
mlmodel = ct.convert(
    model,
    inputs=[ct.TensorType(name="input", shape=(1, 3, 224, 224))],
    convert_to="mlprogram"  # common for newer OS versions
)
# ML Program models are saved as .mlpackage bundles
# (the older neural network format uses the .mlmodel extension)
mlmodel.save("Model.mlpackage")

In many real projects, you will use ct.ImageType to encode preprocessing (scale and bias) into the model interface so the app can pass raw pixel buffers more directly. If you do this, ensure the same preprocessing is not applied twice (once in app code and once in the model).
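
An illustrative ct.ImageType setup could look like the following; the scale and bias values assume inputs normalized to [0, 1] during training, so substitute your own values and then drop that step from app code.

# Python: encode scale/bias preprocessing in the model interface via ct.ImageType (illustrative)
import coremltools as ct
model = ...  # same TorchScript / ML Program-compatible source as above
image_input = ct.ImageType(
    name="input",
    shape=(1, 3, 224, 224),
    scale=1.0 / 255.0,        # maps raw 0-255 pixels to [0, 1]; use your training normalization
    bias=[0.0, 0.0, 0.0],     # per-channel bias applied after scaling
    color_layout=ct.colorlayout.RGB,
)
mlmodel = ct.convert(model, inputs=[image_input], convert_to="mlprogram")
mlmodel.save("Model.mlpackage")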

Step-by-step: Using Core ML in an iOS app

Once you add the .mlmodel (or .mlpackage) to your Xcode project, Xcode generates a typed Swift/Objective-C interface. You can run inference either directly or through Vision. Vision is convenient when you already have camera frames as CVPixelBuffer and want built-in resizing/cropping options, but direct Core ML can be simpler when you manage preprocessing yourself.

// Swift: direct Core ML inference (sketch)
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // .cpuOnly, .cpuAndGPU, .all

let model = try MyModel(configuration: config)

// Prepare input (often CVPixelBuffer or MLMultiArray depending on model signature)
let input = MyModelInput(input: myMultiArray)
let output = try model.prediction(input: input)
let logits = output.logits

Compute units, model types, and OS compatibility

Core ML lets you choose compute units to influence scheduling. In production, you may want to default to .all and keep a remote-configurable override for troubleshooting. Also note that Core ML has evolved: newer “ML Program” models can offer better performance and flexibility but may require newer OS versions. This creates a packaging decision: you might ship different model variants per minimum OS, or choose a more compatible format if you must support older devices.

Common pitfalls with Core ML deployments

  • Preprocessing mismatch: image scaling and color channel order are frequent sources of accuracy loss.

  • Unsupported layers during conversion: the converter may rewrite parts of the graph, affecting numerics.

  • Device-only behavior: performance and sometimes numerics can differ between simulator and real hardware; always test on device.

  • Model interface drift: if you change input names/shapes during conversion, app bindings can silently break.

Choosing a Runtime: Practical Decision Checklist

Start from your product constraints

Choose the runtime based on where you ship and what you can support operationally. If you are Android-first and already train in TensorFlow, TFLite is usually the shortest path. If you need one model artifact and one inference API across Android and iOS, ONNX Runtime can reduce fragmentation, but you must validate execution provider coverage on both platforms. If you are Apple-first and want the most native acceleration and tooling, Core ML is the default choice.

Check operator coverage early

Before committing, run a “representative” model through the full export and runtime load path and confirm it executes with your intended accelerator path. For example, a model that runs on ORT CPU may not offload well to NNAPI EP; a model that converts to Core ML may still run partly on CPU if certain ops are not supported on the Neural Engine. Treat operator coverage as a first-class acceptance criterion, not a late-stage surprise.

Define a stable model interface contract

For long-lived products, define a contract for inputs and outputs: tensor names, shapes, dtypes, and semantic meaning (for example, “logits are unnormalized scores over 1000 classes”). Store this contract in a small document and, when possible, in model metadata. Then write small conformance tests that load the model and verify the interface at app startup in development builds. This reduces integration bugs when models are updated.
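
A minimal conformance check might look like the sketch below, written here against an ONNX artifact; the contract values and file paths are illustrative, and the same idea applies to TFLite and Core ML bindings.

# Python: verify a model artifact against its interface contract (illustrative, ONNX version)
import onnxruntime as ort
CONTRACT = {
    "inputs": {"input": [1, 3, 224, 224]},
    "outputs": {"logits": [1, 1000]},  # unnormalized scores over 1000 classes
}
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for tensor in sess.get_inputs():
    expected = CONTRACT["inputs"].get(tensor.name)
    assert expected is not None, f"unexpected input: {tensor.name}"
    assert list(tensor.shape) == expected, f"{tensor.name}: {tensor.shape} != {expected}"
for tensor in sess.get_outputs():
    expected = CONTRACT["outputs"].get(tensor.name)
    assert expected is not None, f"unexpected output: {tensor.name}"
    assert list(tensor.shape) == expected, f"{tensor.name}: {tensor.shape} != {expected}"
print("model interface matches the contract")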

Integration Patterns: Pre/Post-Processing and Data Plumbing

Keep preprocessing deterministic and testable

Most “runtime bugs” are actually preprocessing bugs. Decide where preprocessing lives: in app code, in the model graph, or encoded in runtime-specific metadata. A practical pattern is to keep preprocessing in app code for maximum transparency, but to also provide a reference implementation in Python that uses the same steps. Then you can test that a given input frame produces the same tensor values in both environments.
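
A reference implementation could be as simple as the sketch below; the resize, mean, and std values are placeholders for whatever your training pipeline actually used.

# Python: reference preprocessing to diff against app-side preprocessing (illustrative values)
import numpy as np
from PIL import Image
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
def preprocess(path):
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
    x = (x - MEAN) / STD                            # per-channel normalization
    return x[np.newaxis, ...]                       # add batch dim -> NHWC
# Dump the tensor for a known test frame, then diff it against what the app produces
np.save("reference_input.npy", preprocess("test_frame.png"))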

Minimize copies and conversions

Edge inference often becomes bottlenecked by memory movement rather than compute. Each runtime has preferred layouts and buffer types. For example, Core ML and Vision work naturally with CVPixelBuffer; TFLite on Android often works with ByteBuffer/DirectByteBuffer; ORT may require NCHW float tensors. Design your pipeline to avoid repeated format conversions (YUV to RGB to float to another layout) and prefer zero-copy paths where available.

Batching and streaming considerations

Many on-device apps process a stream (camera frames, audio windows). Runtimes differ in how they handle stateful models and recurrent inputs. If you use a model with internal state (for example, streaming audio), define how state is passed between invocations and ensure the runtime supports the required ops. When batching is possible (for example, processing multiple crops), measure whether batching helps or hurts latency due to increased memory pressure and scheduling overhead.

Validation and Debugging Across Runtimes

Numerical parity testing

When you export the same model to multiple runtimes, outputs will rarely match bit-for-bit. Instead, define acceptable tolerances and compare outputs on a fixed test set. For classifiers, compare top-k agreement and confidence deltas; for regressions, compare mean absolute error and worst-case error. Keep a small “golden” dataset of representative inputs that you can run through TFLite, ORT, and Core ML builds to detect regressions when you update conversion tooling or runtime versions.
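
For classifiers, such a comparison could look like the sketch below; the file names and the threshold are illustrative and should be tuned per model and quantization scheme.

# Python: parity metrics for classifier outputs collected from two runtimes (illustrative)
import numpy as np
def parity_report(ref_logits, test_logits, k=5):
    # top-k agreement: is the test top-1 class inside the reference top-k set?
    ref_topk = np.argsort(-ref_logits, axis=1)[:, :k]
    ref_top1 = np.argmax(ref_logits, axis=1)
    test_top1 = np.argmax(test_logits, axis=1)
    return {
        "top1_agreement": float(np.mean(ref_top1 == test_top1)),
        "topk_agreement": float(np.mean([t in r for r, t in zip(ref_topk, test_top1)])),
        "max_abs_diff": float(np.max(np.abs(ref_logits - test_logits))),
    }
# Example usage with logits saved from the golden dataset on each runtime
report = parity_report(np.load("ref_logits.npy"), np.load("tflite_logits.npy"))
assert report["top1_agreement"] >= 0.99, report  # thresholds depend on model and quantization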

Runtime introspection

Use runtime tools to understand what actually ran on the accelerator. In TFLite, delegate logs can indicate which nodes were delegated. In ONNX Runtime, profiling can show per-node execution time and provider assignment. In Core ML, Xcode Instruments and model load logs can help you see if the Neural Engine is used. Make these diagnostics part of your development build so you can quickly spot CPU fallbacks that would otherwise look like “mysterious latency.”

Versioning and rollout strategy

Runtimes evolve quickly, and operator support changes. Pin runtime versions per app release, and treat upgrades as a model compatibility event: rerun conversion, parity tests, and on-device benchmarks. If your product supports remote model updates, ensure the runtime version on device is compatible with the model artifact you plan to deliver, or maintain multiple model variants keyed by runtime/OS capabilities.

Putting It Together: A Multi-Platform Packaging Workflow

One training model, multiple deployment artifacts

A practical production workflow is to maintain one “source” model and generate per-platform artifacts in CI. For Android you might produce model.tflite plus optional metadata; for cross-platform you might produce model.onnx; for iOS you might produce Model.mlmodel and compile it to .mlmodelc during the app build. Each artifact should be accompanied by a small manifest describing input/output names, shapes, expected preprocessing, and a checksum so the app can verify integrity.
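
Generating that manifest in CI could look like the sketch below; the field names and paths are illustrative.

# Python: emit a per-artifact manifest with interface info and a checksum (illustrative fields)
import hashlib
import json
def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
manifest = {
    "artifact": "model.tflite",
    "sha256": sha256("model.tflite"),
    "inputs": {"input": {"shape": [1, 224, 224, 3], "dtype": "float32", "layout": "NHWC"}},
    "outputs": {"logits": {"shape": [1, 1000], "dtype": "float32"}},
    "preprocessing": "RGB, resize to 224x224, scale to [0, 1], mean/std normalize",
}
with open("model.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)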

Step-by-step: CI checks you should automate

  • Export check: conversion scripts run in a clean environment and produce artifacts deterministically.

  • Load check: each runtime loads the artifact successfully and reports expected input/output signatures.

  • Golden inference: run a small set of inputs and compare outputs to reference tolerances.

  • Smoke benchmark: run a short benchmark to detect accidental slowdowns (for example, CPU fallback after a conversion change).

  • Packaging check: verify the app bundle includes the correct artifact and that the runtime can locate it.

Now answer the exercise about the content:

Why is operator coverage an important factor when choosing an edge deployment runtime and its accelerator path?

Answer: If an operator is not supported by a runtime or a delegate/provider, execution may fall back to CPU for that subgraph or fail to load. This affects performance and stability, so coverage should be checked early.

Next chapter

Secure Model Packaging, Delivery, and On-Device Updates
