
Polyglot Performance Patterns: Writing Fast, Safe Code Across Python, Ruby, Java, and C


Numeric Routines and Data-Oriented Optimization

Chapter 11


What “Numeric Routines” Mean in Performance Work

Numeric routines are the tight loops and kernels that transform numbers: vector and matrix operations, statistics, signal processing, geometry, simulations, scoring functions, and any repeated arithmetic over arrays. They often dominate runtime because they execute the same small set of operations millions or billions of times. The key performance lever is not only “fewer operations,” but “operations arranged so the CPU can execute them efficiently,” which is where data-oriented optimization becomes concrete: you shape loops and data access so the hardware can stream through memory, predict branches, and use SIMD (vector) instructions.

Across Python, Ruby, Java, and C, the same high-level principles apply, but the practical techniques differ because of runtime overheads and available libraries. In managed languages (Python, Ruby, Java), the fastest numeric code typically pushes work into optimized native kernels (BLAS, vectorized libraries, JVM intrinsics) or uses representations that avoid per-element object overhead. In C, you can write the kernel directly and control layout and vectorization more explicitly.

Typical numeric hot spots

  • Reductions: sum, dot product, min/max, norm, histogram, count of predicate.
  • Elementwise transforms: clamp, scale, normalize, sigmoid, log/exp, polynomial evaluation.
  • Stencil-like loops: convolution, moving averages, finite differences (a minimal moving-average sketch follows this list).
  • Small linear algebra: 3D transforms, covariance updates, Kalman steps.
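
As one concrete instance of a stencil-like loop, here is a minimal C sketch of a simple moving average; the function name and the sliding-window formulation are illustrative assumptions, not taken from the text above.

// Minimal sketch: moving average with window w over n samples (illustrative).
// Writes n - w + 1 outputs; assumes w >= 1 and n >= w.
void moving_average(const double* x, double* out, size_t n, size_t w) {
  double acc = 0.0;
  for (size_t i = 0; i < w; i++) acc += x[i];   // initial window sum
  out[0] = acc / (double)w;
  for (size_t i = w; i < n; i++) {
    acc += x[i] - x[i - w];                     // slide the window: add new, drop old
    out[i - w + 1] = acc / (double)w;
  }
}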

Data-Oriented Optimization for Numeric Kernels

Data-oriented optimization focuses on how data flows through the CPU rather than how objects relate conceptually. For numeric routines, that means: contiguous arrays, predictable access patterns, minimal branching, and loop structures that enable vectorization and instruction-level parallelism.

Core ideas you can apply everywhere

  • Make the inner loop simple: avoid function calls, dynamic dispatch, and unpredictable branches inside the hottest loop.
  • Stream through memory: prefer sequential access (i from 0..n-1) and avoid striding patterns that skip around.
  • Separate phases: if you need filtering, transformation, and reduction, consider two passes if it makes each pass branch-free and vectorizable (see the two-pass sketch after this list).
  • Minimize conversions: repeated int-to-float, boxing/unboxing, or parsing inside loops can dominate arithmetic.
  • Exploit associativity carefully: reordering floating-point operations changes rounding; decide if you need strict reproducibility or can tolerate small differences.
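
To make the "separate phases" idea concrete, here is a minimal C sketch, assuming the task is to sum the squares of only the non-negative elements: the first pass compacts qualifying values into a scratch buffer, and the second pass is a branch-free reduction. The function and buffer names are illustrative.

// Minimal sketch: two-pass "filter then reduce" (illustrative).
// Pass 1 compacts non-negative values into tmp; pass 2 is a simple,
// branch-free reduction the compiler can more easily vectorize.
#include <stddef.h>

double sum_squares_nonneg(const double* x, double* tmp, size_t n) {
  size_t m = 0;
  for (size_t i = 0; i < n; i++) {      // pass 1: filtering (the only branchy part)
    if (x[i] >= 0.0) tmp[m++] = x[i];
  }
  double acc = 0.0;
  for (size_t i = 0; i < m; i++) {      // pass 2: branch-free transform + reduction
    acc += tmp[i] * tmp[i];
  }
  return acc;
}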

Step-by-Step: Turning a Scalar Loop into a Fast Numeric Kernel

This section walks through a repeatable workflow you can apply to a numeric routine in any language. The examples use a dot product and a “clamped linear transform” because they appear in many domains.

Step 1: Write a correct scalar reference

Start with the simplest correct implementation. Keep it readable and testable. Example dot product:


// C reference dot product (double precision)
double dot_ref(const double* a, const double* b, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) {
    acc += a[i] * b[i];
  }
  return acc;
}

In Java, you would use primitive arrays to avoid boxing:

// Java reference dot product
static double dotRef(double[] a, double[] b) {
  double acc = 0.0;
  for (int i = 0; i < a.length; i++) {
    acc += a[i] * b[i];
  }
  return acc;
}

In Python and Ruby, a pure loop over the language's built-in (boxed) numeric objects is usually a correctness baseline, not a fast path:

# Python reference (correctness baseline)
def dot_ref(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

# Ruby reference (correctness baseline)
def dot_ref(a, b)
  acc = 0.0
  a.length.times do |i|
    acc += a[i] * b[i]
  end
  acc
end

Step 2: Ensure data is in a numeric array form suitable for fast iteration

The biggest win in Python and Ruby is often switching from “array of objects” to “typed numeric buffer.” In Java and C, it means using primitive arrays or contiguous buffers and avoiding wrapper types.

  • Python: prefer NumPy arrays (dtype float32/float64) or array('d') for simple cases. NumPy enables vectorized kernels and BLAS-backed operations.
  • Ruby: consider Numo::NArray for numeric arrays, or pack data into binary strings and use native extensions when needed. Plain Ruby Arrays store references to objects, which is costly for numeric kernels.
  • Java: use double[]/float[]/int[]; avoid Double[] and streams for hot loops.
  • C: use contiguous arrays (malloc/new) and keep alignment in mind if you plan to use SIMD intrinsics.

Step 3: Remove branches from the inner loop (when possible)

Branches that depend on data are hard to predict and can stall pipelines. For numeric routines, common branch sources are clamping, conditional accumulation, and NaN handling. A standard technique is to convert branches into arithmetic or to split the work into two passes.

Example: clamp each element to [0, 1] and then accumulate. Branchy version:

// C branchy clamp+sum
double clamp_sum_branchy(const double* x, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) {
    double v = x[i];
    if (v < 0.0) v = 0.0;
    if (v > 1.0) v = 1.0;
    acc += v;
  }
  return acc;
}

Branchless version using min/max (often compiled to vector-friendly instructions):

// C branchless clamp+sum (compiler may vectorize)
#include <math.h>

double clamp_sum_branchless(const double* x, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) {
    double v = x[i];
    v = fmax(0.0, fmin(1.0, v));
    acc += v;
  }
  return acc;
}

In Java, Math.min/Math.max can be similarly effective, and the JIT may intrinsify them:

// Java clamp+sum, branchless-ish
static double clampSum(double[] x) {
  double acc = 0.0;
  for (int i = 0; i < x.length; i++) {
    double v = x[i];
    v = Math.max(0.0, Math.min(1.0, v));
    acc += v;
  }
  return acc;
}

In Python with NumPy, you can express this as a vectorized operation that runs in optimized C loops:

# Python/NumPy clamp+sum
import numpy as np

def clamp_sum_np(x):
    return np.clip(x, 0.0, 1.0).sum()

Step 4: Unroll and use multiple accumulators to increase throughput

Even without SIMD, modern CPUs can execute multiple independent operations per cycle. A single accumulator can create a dependency chain (each addition depends on the previous). Using multiple accumulators breaks the chain and can increase instruction-level parallelism.

// C dot product with 4 accumulators
double dot_ilp(const double* a, const double* b, size_t n) {
  size_t i = 0;
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  for (; i + 3 < n; i += 4) {
    s0 += a[i]     * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  double acc = (s0 + s1) + (s2 + s3);
  for (; i < n; i++) acc += a[i] * b[i];
  return acc;
}

Java can benefit from the same pattern, though the JIT may already apply unrolling. If you do it manually, keep it simple and verify it helps in your environment.

// Java dot with multiple accumulators
static double dotILP(double[] a, double[] b) {
  int n = a.length;
  int i = 0;
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (; i + 3 < n; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
  }
  double acc = (s0 + s1) + (s2 + s3);
  for (; i < n; i++) acc += a[i] * b[i];
  return acc;
}

Be aware: changing reduction order changes floating-point rounding. If you require bitwise identical results to a scalar loop, you may need to keep strict order or use compensated summation (at a cost).

Step 5: Enable SIMD (vectorization) deliberately

SIMD executes the same operation on multiple data elements per instruction. How you access it differs by language:

  • C: rely on compiler auto-vectorization (with optimization flags) or use intrinsics (SSE/AVX/NEON). Ensure pointers don’t alias (use restrict) and loops are simple.
  • Java: the HotSpot JIT can auto-vectorize some loops; newer JDKs include the Vector API (incubator/preview in some versions) for explicit SIMD.
  • Python: NumPy, SciPy, and similar libraries already use SIMD in their C/Fortran kernels; your job is to express operations in array form.
  • Ruby: use numeric libraries implemented in C (e.g., Numo) or write a C extension; Ruby itself won’t auto-vectorize numeric loops.

In C, a small change like adding restrict can help vectorization by telling the compiler arrays don’t overlap:

// C: hint non-aliasing for better vectorization
double dot_restrict(const double* restrict a, const double* restrict b, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
  return acc;
}
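
If auto-vectorization does not kick in, explicit intrinsics are the next step. Below is a minimal sketch of an AVX dot product, assuming an x86-64 target compiled with AVX enabled (e.g., -mavx); it is illustrative rather than a drop-in replacement, and a scalar tail handles the leftover elements.

// Minimal sketch: AVX dot product with a scalar tail (illustrative).
// Assumes an x86-64 build with AVX enabled (e.g., -mavx).
#include <immintrin.h>
#include <stddef.h>

double dot_avx(const double* a, const double* b, size_t n) {
  __m256d vacc = _mm256_setzero_pd();
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256d va = _mm256_loadu_pd(a + i);   // 4 doubles per iteration
    __m256d vb = _mm256_loadu_pd(b + i);
    vacc = _mm256_add_pd(vacc, _mm256_mul_pd(va, vb));
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, vacc);           // horizontal sum of the 4 lanes
  double acc = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
  for (; i < n; i++) acc += a[i] * b[i];   // scalar tail for the remainder
  return acc;
}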

Precision, Stability, and Reproducibility in Fast Numeric Code

Optimizing numeric routines is not only about speed. You must decide what “correct” means for floating-point results. Common trade-offs:

  • Float32 vs Float64: float32 halves memory bandwidth and can be faster, but increases rounding error and risk of overflow/underflow.
  • Fused multiply-add (FMA): can be faster and more accurate for expressions like a*b + c, but may change results compared to separate multiply and add.
  • Reduction order: parallel or unrolled reductions change rounding. If you need stable sums, consider Kahan or Neumaier summation.

Example compensated summation (Neumaier) in Java for a sum reduction:

// Java Neumaier summation for better accuracy
static double sumNeumaier(double[] x) {
  double sum = 0.0;
  double c = 0.0;
  for (double v : x) {
    double t = sum + v;
    if (Math.abs(sum) >= Math.abs(v)) c += (sum - t) + v;
    else c += (v - t) + sum;
    sum = t;
  }
  return sum + c;
}

This is slower than a plain sum but can be essential for long reductions or ill-conditioned data.

Structuring Numeric Pipelines: Kernel Fusion vs Library Calls

Numeric workloads often look like pipelines: transform A, combine with B, apply nonlinearity, reduce. You can implement them as multiple passes (each a simple kernel) or fuse them into one pass. The best choice depends on language and library support.

When to fuse

  • You are memory-bandwidth bound: multiple passes reread and rewrite large arrays, wasting bandwidth.
  • The per-element computation is small and the arrays are large.
  • You can keep the fused loop branch-free and vectorizable.

When not to fuse

  • A library call uses a highly optimized kernel (BLAS, vectorized ufunc) that beats your fused loop.
  • Fusing introduces complex branching or prevents vectorization.
  • You need intermediate results for reuse or debugging.

Python example: two-pass vs fused. A two-pass version is readable but allocates intermediate arrays; plain NumPy cannot truly fuse the loop, but chaining ufuncs with out= gets close by eliminating temporary allocations.

# Python/NumPy: avoid temporaries with out=
import numpy as np

def fused_like(a, b, out):
    # out = clip(a*b + b, 0, 1) without extra arrays
    np.multiply(a, b, out=out)      # out = a*b
    np.add(out, b, out=out)         # out = a*b + b
    np.clip(out, 0.0, 1.0, out=out) # in-place clamp
    return out

Java example: a fused loop is often the right default because you control allocation and keep everything in primitives:

// Java fused transform: y[i] = clamp(a[i]*b[i] + b[i], 0, 1)
static void fused(double[] a, double[] b, double[] y) {
  for (int i = 0; i < y.length; i++) {
    double v = a[i] * b[i] + b[i];
    v = Math.max(0.0, Math.min(1.0, v));
    y[i] = v;
  }
}

Working with Strides, Blocking, and Locality in Numeric Loops

Many numeric routines operate on 2D or 3D data. Performance depends heavily on iteration order and blocking (tiling). The goal is to reuse data in caches before it is evicted.

Row-major vs column-major access

C and Java arrays of primitives are effectively row-major when you store a matrix in a flat array and index as i*cols + j. If you traverse by columns (j outer, i inner), you create a large stride and defeat caching and prefetching.

// C/Java-style flat matrix indexing: A[i*cols + j]
// Prefer i outer, j inner for row-major storage
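
For a concrete illustration, here is a minimal C sketch contrasting the two traversal orders over the same flat, row-major matrix; the function names are illustrative. On large matrices, the row-order version typically wins because each cache line is consumed fully before moving on.

// Minimal sketch: same sum, two traversal orders over a row-major matrix (illustrative).
#include <stddef.h>

double sum_row_order(const double* A, size_t rows, size_t cols) {
  double acc = 0.0;
  for (size_t i = 0; i < rows; i++)        // i outer, j inner: unit-stride access
    for (size_t j = 0; j < cols; j++)
      acc += A[i*cols + j];
  return acc;
}

double sum_col_order(const double* A, size_t rows, size_t cols) {
  double acc = 0.0;
  for (size_t j = 0; j < cols; j++)        // j outer, i inner: stride of cols doubles
    for (size_t i = 0; i < rows; i++)
      acc += A[i*cols + j];
  return acc;
}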

Blocking (tiling) for matrix multiplication

Even if you never implement full GEMM yourself (often you should call BLAS), blocking is a useful pattern for custom kernels like small convolutions or batched transforms. Here is a simplified blocked matrix multiply in C to illustrate the idea:

// C: blocked matrix multiply C = A*B (naive illustration)
void matmul_blocked(const double* A, const double* B, double* C, int n, int bs) {
  for (int ii = 0; ii < n; ii += bs) {
    for (int kk = 0; kk < n; kk += bs) {
      for (int jj = 0; jj < n; jj += bs) {
        int i_max = (ii + bs < n) ? ii + bs : n;
        int k_max = (kk + bs < n) ? kk + bs : n;
        int j_max = (jj + bs < n) ? jj + bs : n;
        for (int i = ii; i < i_max; i++) {
          for (int k = kk; k < k_max; k++) {
            double a = A[i*n + k];
            for (int j = jj; j < j_max; j++) {
              C[i*n + j] += a * B[k*n + j];
            }
          }
        }
      }
    }
  }
}

The blocking size bs is hardware-dependent; you tune it empirically. The key is that the inner loops reuse a block of A and B while updating a block of C, reducing cache misses.

Language-Specific Patterns for Numeric Performance

Python: express kernels in array operations or compiled loops

For numeric routines, Python’s per-iteration overhead makes pure Python loops slow. The common pattern is to move the inner loop to optimized native code:

  • Use NumPy vectorized operations, reductions, and broadcasting.
  • Use SciPy/BLAS for linear algebra.
  • For custom kernels that cannot be expressed as ufuncs, use Numba/Cython or write a C extension.

Step-by-step: converting a Python loop to NumPy for dot product:

  • Ensure inputs are NumPy arrays with a numeric dtype: a = np.asarray(a, dtype=np.float64).
  • Use np.dot(a, b) or (a*b).sum().
  • If you need to avoid temporaries, prefer np.dot or use np.multiply with out and then sum.
# Python/NumPy dot product
import numpy as np

def dot_np(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b))

Ruby: avoid per-element Ruby numeric objects in hot loops

Ruby’s Numeric objects and dynamic dispatch make tight numeric loops expensive. Practical options:

  • Use Numo::NArray (or similar) to run numeric kernels in C.
  • Use FFI/native extensions for custom kernels.
  • If you must stay in Ruby, reduce loop overhead: avoid blocks in hot paths, use while loops, and keep operations simple, but expect limits.
# Ruby: while loop tends to be faster than each for hot numeric loops
def dot_while(a, b)
  i = 0
  n = a.length
  acc = 0.0
  while i < n
    acc += a[i] * b[i]
    i += 1
  end
  acc
end

This reduces loop overhead but does not change the fundamental cost of boxed numerics. For large arrays, moving the kernel to native code is usually the decisive step.

Java: rely on primitives, JIT optimizations, and careful loop structure

Java can be excellent for numeric routines when you use primitive arrays and write straightforward loops. Practical guidance:

  • Use double[]/float[] and keep loops simple.
  • Avoid streams and lambdas in hot numeric kernels; they can inhibit optimizations and add overhead.
  • Consider float when acceptable to reduce bandwidth.
  • Warm up code paths before measuring; the JIT needs execution to optimize.

Example: computing mean and variance in one pass (Welford) for stability, still fast and branch-light:

// Java: Welford online variance
static double variance(double[] x) {
  double mean = 0.0;
  double m2 = 0.0;
  int n = 0;
  for (double v : x) {
    n++;
    double delta = v - mean;
    mean += delta / n;
    double delta2 = v - mean;
    m2 += delta * delta2;
  }
  return (n > 1) ? (m2 / (n - 1)) : 0.0;
}

C: make vectorization and aliasing explicit when needed

C gives you control and responsibility. For numeric kernels:

  • Keep arrays contiguous and aligned when possible.
  • Use restrict to help auto-vectorization.
  • Prefer simple loops; avoid hidden aliasing and function calls in the inner loop.
  • If auto-vectorization fails, consider intrinsics, but isolate them behind a clean API and provide a scalar fallback.

Example: polynomial evaluation with Horner’s method (good for instruction count and pipeline):

// C: Horner polynomial p(x) = c0 + c1*x + ... + ck*x^k
double poly_horner(const double* c, size_t k, double x) {
  double acc = c[k];
  for (size_t i = k; i-- > 0;) {
    acc = acc * x + c[i];
  }
  return acc;
}
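
The earlier point about isolating intrinsics behind a clean API with a scalar fallback can look roughly like the sketch below, assuming a compile-time check for AVX; the function names and the __AVX__ guard illustrate the pattern and are not a prescribed interface.

// Minimal sketch: one public entry point, intrinsics isolated behind a
// compile-time check with a scalar fallback (illustrative).
#include <stddef.h>

double dot_scalar(const double* a, const double* b, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
  return acc;
}

#ifdef __AVX__
double dot_avx(const double* a, const double* b, size_t n);  // defined in an AVX-only translation unit
#endif

double dot(const double* a, const double* b, size_t n) {
#ifdef __AVX__
  return dot_avx(a, b, n);      // vectorized path when the build enables AVX
#else
  return dot_scalar(a, b, n);   // portable fallback
#endif
}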

Handling Edge Cases Without Destroying Throughput

Numeric routines often need to handle NaNs, infinities, or domain constraints. If you check for these per element with branches, you may slow down the common case. A common pattern is to separate “fast path” and “slow path”:

  • Fast path assumes typical inputs and runs branch-free.
  • Slow path handles rare cases (NaNs, out-of-range) when detected.

Example strategy (language-agnostic):

  • First pass: compute and also track a flag if any element is problematic (e.g., has_nan |= isnan(v)).
  • If the flag is false, return the fast result.
  • If true, rerun with robust handling or handle only the problematic indices if you recorded them.

In C, you can implement a cheap NaN check using v != v (NaN is not equal to itself), though using isnan is clearer:

// C: fast pass with NaN detection
double sum_with_nan_check(const double* x, size_t n, int* has_nan) {
  double acc = 0.0;
  int flag = 0;
  for (size_t i = 0; i < n; i++) {
    double v = x[i];
    acc += v;
    if (v != v) flag = 1;
  }
  *has_nan = flag;
  return acc;
}
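
A possible slow path, assuming the policy is simply to skip NaNs once the fast pass reports them, might look like this minimal sketch; the policy itself (skip, replace, or report) is application-specific and not specified above.

// Minimal sketch: slow path that skips NaNs, run only when the fast pass
// sets has_nan (illustrative; the NaN policy is application-specific).
#include <math.h>
#include <stddef.h>

double sum_skipping_nan(const double* x, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) {
    double v = x[i];
    if (!isnan(v)) acc += v;   // rare path: tolerate the branch here
  }
  return acc;
}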

Choosing Numeric Types and Scaling for Throughput

Numeric throughput is often limited by memory bandwidth rather than compute. Reducing bytes moved per element can be as powerful as reducing operations.

  • Use float32 when acceptable: halves memory traffic and can double SIMD width (more lanes).
  • Use int32 for counters and indices: smaller types can improve cache residency, but watch overflow.
  • Normalize once: if you repeatedly scale inputs (e.g., divide by 255), do it once into a float buffer rather than per kernel.

In Java, switching from double to float can be a big win for large arrays, but ensure your error tolerance allows it. In Python/NumPy, choose dtype=np.float32 when appropriate and keep it consistent to avoid implicit upcasting.
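
As a sketch of the "normalize once" idea, assuming 8-bit input samples that several kernels will consume as floats, convert them a single time into a float32 buffer instead of dividing by 255 inside every kernel; the names below are illustrative.

// Minimal sketch: normalize uint8 samples into a float32 buffer once (illustrative).
#include <stddef.h>
#include <stdint.h>

void normalize_u8_once(const uint8_t* src, float* dst, size_t n) {
  const float scale = 1.0f / 255.0f;
  for (size_t i = 0; i < n; i++) {
    dst[i] = (float)src[i] * scale;   // pay the conversion cost one time
  }
  // Downstream kernels now read dst directly, with no per-kernel division.
}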

Micro-Kernel Mindset: Identify the Kernel, Then Optimize Around It

Many real systems have a small “micro-kernel” that dominates time: a scoring function, a distance computation, a transform. The data-oriented approach is to isolate that kernel, define its inputs as flat numeric arrays, and then optimize the loop structure and math. A practical workflow:

  • Extract the kernel into a standalone function with explicit array inputs and sizes.
  • Write tests for numeric correctness and acceptable error bounds.
  • Implement a scalar baseline.
  • Apply: branch removal, multiple accumulators, vectorization-friendly structure.
  • Decide on precision (float vs double) and reproducibility requirements.
  • Integrate back into the pipeline with minimal format conversions.

This workflow is portable: in Python and Ruby, the “kernel extraction” often becomes “move kernel to NumPy/Numo or native code.” In Java and C, it becomes “make the kernel JIT/auto-vectorizer friendly” or “use intrinsics where justified.”

Now answer the exercise about the content:

Which change is most likely to make a tight numeric routine faster by helping the CPU stream through memory and enabling vectorization?


Data-oriented optimization favors contiguous numeric buffers, predictable sequential access, and simple inner loops with minimal branching, which improves cache behavior and makes SIMD/auto-vectorization more likely.

Next chapter

IO Throughput and Backpressure-Aware Pipelines
