All courses > Information Technology > Programming Languages ( Python, Ruby, Java, C ) ::

Data Representation and Layout for Speed and Safety

Capítulo 6

Estimated reading time: 16 minutes

Why Data Representation Dominates Performance (and Safety)

Once you have a correct algorithm and you are not wasting time on avoidable allocations, the next large lever is how your data is represented in memory and how it is laid out for access. “Representation” means the types you choose (integers vs floats, strings vs bytes, structs vs maps) and the invariants you encode (ranges, nullability, ownership). “Layout” means how those values are arranged: contiguous arrays vs pointer-linked nodes, packed vs padded structs, row-major vs columnar storage, and whether you store “array of structs” (AoS) or “struct of arrays” (SoA). These choices affect CPU cache behavior, branch prediction, vectorization, and the amount of work needed to validate and safely manipulate data.

This chapter focuses on patterns you can apply across Python, Ruby, Java, and C to make data access fast and predictable while also reducing the surface area for bugs (overflow, invalid states, encoding errors, and unsafe aliasing). We will avoid rehashing benchmarking, profiling, memory models, allocation behavior, and garbage collection pressure; instead, we will concentrate on representation and layout decisions that stand on their own.

Core Principles

1) Prefer Contiguous, Homogeneous Storage for Hot Data

Modern CPUs are optimized for sequential access. When your hot loop walks contiguous memory, you get fewer cache misses and better prefetching. When your data is homogeneous (same type/size), the runtime and JIT can optimize more aggressively, and you can often use SIMD/vector operations.

Good: arrays of primitives (C arrays, Java int[], Python array/memoryview, Ruby packed strings used as byte buffers).
Risky for speed: lists/arrays of objects where each element is a pointer to separately allocated objects (Python list of ints, Ruby Array of Integers, Java Integer[]).

2) Encode Invariants in Types and Layout

Safety improves when invalid states are unrepresentable or at least hard to construct. Examples: using unsigned types for non-negative values, using fixed-width integers for protocol fields, using enums for small sets of states, and using length-prefixed buffers to avoid scanning for terminators.

In C, this means careful use of uint32_t, size_t, and structs with explicit fields.
In Java, it means preferring primitives and using record or small immutable classes to preserve invariants.
In Python and Ruby, it means using dataclasses/Struct, validating at boundaries, and using bytes-like objects for binary data.

3) Separate “Hot” Fields from “Cold” Fields

Many objects contain a mix of frequently accessed fields (hot) and rarely used metadata (cold). If you store them together, every access drags cold bytes into cache. A common pattern is to store hot fields in tight arrays and keep cold data in side tables keyed by index or ID.

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

This is a layout decision that can be applied even in high-level languages: you can keep parallel arrays (SoA) for hot numeric fields and a dictionary/hash for occasional metadata.

4) Choose AoS vs SoA Intentionally

AoS (array of structs) stores each record contiguously: [x,y,z][x,y,z].... SoA (struct of arrays) stores each field contiguously: [x,x,x...] [y,y,y...] [z,z,z...]. AoS is convenient and good when you touch most fields per element. SoA is often faster when you process one or two fields across many elements (vectorization, fewer cache lines).

Practical Pattern: AoS vs SoA for Coordinate Updates

Consider updating positions with velocities for many particles. If your update loop touches only x and vx (and similarly for y), SoA can be faster.

C Example

AoS:

typedef struct { float x, y; float vx, vy; } Particle; void step_aos(Particle* p, size_t n, float dt) { for (size_t i = 0; i < n; i++) { p[i].x += p[i].vx * dt; p[i].y += p[i].vy * dt; } }

SoA:

typedef struct { float* x; float* y; float* vx; float* vy; } Particles; void step_soa(Particles p, size_t n, float dt) { for (size_t i = 0; i < n; i++) { p.x[i] += p.vx[i] * dt; p.y[i] += p.vy[i] * dt; } }

Step-by-step decision process:

List the fields used in the hot loop (here: all four fields, but accessed in a streaming way).
Estimate whether you often process subsets of fields (e.g., sometimes only x for projections). If yes, SoA tends to win.
Check alignment and padding in AoS: structs may include padding that increases memory bandwidth needs.
In C, consider whether SoA enables vectorization (compiler can more easily auto-vectorize contiguous arrays).

Java Example

Java AoS often becomes “array of object references,” which is not AoS in memory. If you write Particle[] where Particle is a class, you get an array of pointers to heap objects. For hot numeric loops, prefer SoA with primitive arrays:

final class Particles { final float[] x, y, vx, vy; Particles(int n) { x = new float[n]; y = new float[n]; vx = new float[n]; vy = new float[n]; } void step(float dt) { for (int i = 0; i < x.length; i++) { x[i] += vx[i] * dt; y[i] += vy[i] * dt; } } }

Safety note: keep arrays the same length and encapsulate them so callers cannot accidentally mismatch indices. Expose methods that operate on an index rather than exposing raw arrays.

Python Example

Python lists store references, so a list of floats is not contiguous float storage. For numeric hot paths, use array, memoryview, or a library that provides packed numeric arrays. Here is a standard-library approach using array:

from array import array def step(x, y, vx, vy, dt): for i in range(len(x)): x[i] += vx[i] * dt y[i] += vy[i] * dt # setup n = 100000 x = array('f', [0.0]) * n y = array('f', [0.0]) * n vx = array('f', [1.0]) * n vy = array('f', [1.0]) * n step(x, y, vx, vy, 0.016)

Step-by-step: (1) identify numeric fields, (2) replace list-of-floats with array('f') for packed floats, (3) keep arrays encapsulated in a class to preserve invariants, (4) use memoryview if you need slicing without copying.

Ruby Example

Ruby Arrays also store references. For packed numeric storage, Ruby does not have a built-in primitive array like Java, but you can represent numeric data in a packed binary string and unpack when needed, or use String as a byte buffer for fixed-width fields. For floats, packing/unpacking each iteration is expensive; the pattern is most useful for binary protocols and IO buffers. For compute-heavy numeric loops, Ruby typically relies on native extensions or specialized libraries, but the representation principle still applies: keep hot data in contiguous bytes when you can.

Binary Data: Bytes, Endianness, and Bounds Safety

When dealing with file formats, network protocols, or serialization, representation choices determine both speed and safety. Parsing text (JSON, CSV) is flexible but expensive. Parsing binary with fixed-width fields is fast but demands careful bounds checks and endianness handling.

Step-by-step: Designing a Safe Binary Record Layout

Define a fixed header with a magic number and version to reject incompatible data early.
Use fixed-width integer types for fields with known ranges (e.g., uint32 for counts).
Store lengths explicitly (length-prefixed) rather than relying on terminators.
Decide endianness (often little-endian) and document it.
Validate all lengths before reading payloads to avoid out-of-bounds reads and denial-of-service.

C: Parsing with Explicit Bounds

#include <stdint.h> #include <string.h> typedef struct { uint32_t len; uint32_t type; } Header; int parse_record(const uint8_t* buf, size_t n, Header* out, const uint8_t** payload) { if (n < sizeof(Header)) return 0; Header h; memcpy(&h, buf, sizeof(Header)); // If data is little-endian on the wire, convert here if needed. if (h.len > n - sizeof(Header)) return 0; *out = h; *payload = buf + sizeof(Header); return 1; }

Safety points: never cast a pointer to a struct and dereference it unless you are sure about alignment and packing; use memcpy to avoid undefined behavior. Always validate len against remaining buffer size before computing pointers.

Java: ByteBuffer with Explicit Order

import java.nio.ByteBuffer; import java.nio.ByteOrder; final class Record { final int len; final int type; final ByteBuffer payload; Record(int len, int type, ByteBuffer payload) { this.len = len; this.type = type; this.payload = payload; } } static Record parse(ByteBuffer buf) { buf.order(ByteOrder.LITTLE_ENDIAN); if (buf.remaining() < 8) return null; int len = buf.getInt(); int type = buf.getInt(); if (len < 0 || len > buf.remaining()) return null; ByteBuffer payload = buf.slice(); payload.limit(len); buf.position(buf.position() + len); return new Record(len, type, payload); }

Speed and safety: ByteBuffer gives bounds checks and avoids manual pointer arithmetic. Using slice() creates a view without copying, but you must manage positions carefully to avoid confusing state; encapsulate parsing in one place.

Python: struct + memoryview for Zero-Copy Slices

import struct def parse(buf: bytes): mv = memoryview(buf) if len(mv) < 8: return None len_, type_ = struct.unpack_from('<II', mv, 0) if len_ > len(mv) - 8: return None payload = mv[8:8+len_] return (len_, type_, payload)

memoryview allows slicing without copying. struct.unpack_from reads from the buffer directly. Always validate len_ before slicing.

Ruby: unpack with Length Checks

def parse(buf) return nil if buf.bytesize < 8 len, type = buf.unpack('V2') # little-endian uint32 return nil if len > buf.bytesize - 8 payload = buf.byteslice(8, len) [len, type, payload] end

Use byteslice to avoid encoding surprises and to operate on raw bytes. Validate lengths before slicing.

Strings vs Bytes: Encoding as a Representation Choice

Text handling is a frequent source of both performance issues and security bugs. The key representation decision is whether your data is “text” (characters) or “bytes” (opaque). Mixing them leads to repeated transcoding, incorrect length calculations, and vulnerabilities in boundary checks.

Use bytes for protocol payloads, hashes, compressed data, and encrypted blobs.
Decode to text at the boundary where you need to interpret characters, and validate encoding once.
When you need to count bytes (e.g., network limits), count bytes, not characters.

Python Boundary Pattern

def handle_request(raw: bytes): # raw is bytes from socket; keep as bytes for parsing headers # decode only the specific header value that is defined as UTF-8 name_bytes = extract_name_field(raw) name = name_bytes.decode('utf-8', errors='strict') validate_name(name) return name

Ruby Boundary Pattern

def handle_request(raw) raw = raw.b # force ASCII-8BIT (bytes) name_bytes = extract_name_field(raw) name = name_bytes.force_encoding('UTF-8') raise 'bad encoding' unless name.valid_encoding? validate_name(name) name end

In Ruby, .b and ASCII-8BIT help keep data as bytes. Convert to UTF-8 only when you need text semantics, and validate encoding explicitly.

Bit Packing and Field Compression (When It’s Worth It)

Sometimes the fastest representation is smaller, because it reduces memory bandwidth and improves cache residency. Bit packing can store multiple small fields in one machine word. This can be a win when you process huge arrays of records and the fields have small ranges (flags, small enums, bounded counters). The trade-off is complexity and the risk of bugs in masking/shifting.

Step-by-step: Safe Bit Packing

Write down the range of each field and compute required bits.
Allocate bit positions and document them as constants.
Provide accessor functions (get/set) rather than open-coded shifts everywhere.
Validate inputs before packing (range checks).
In higher-level languages, keep packed representation internal and expose a clear API.

C Example: Packing Two 12-bit Values and 8 Flags

#include <stdint.h> enum { A_BITS = 12, B_BITS = 12, F_BITS = 8 }; enum { A_SHIFT = 0, B_SHIFT = A_SHIFT + A_BITS, F_SHIFT = B_SHIFT + B_BITS }; static inline uint32_t pack(uint32_t a, uint32_t b, uint32_t f) { if (a >= (1u << A_BITS)) return 0; if (b >= (1u << B_BITS)) return 0; if (f >= (1u << F_BITS)) return 0; return (a << A_SHIFT) | (b << B_SHIFT) | (f << F_SHIFT); } static inline uint32_t get_a(uint32_t x) { return (x >> A_SHIFT) & ((1u << A_BITS) - 1u); }

Safety: range checks prevent silent truncation. Returning 0 on error is not always safe if 0 is a valid value; consider an out-parameter with a success flag.

Java Example: Packed int with Accessors

final class Packed { static final int A_BITS = 12, B_BITS = 12, F_BITS = 8; static final int A_MASK = (1 << A_BITS) - 1; static final int B_MASK = (1 << B_BITS) - 1; static int pack(int a, int b, int f) { if ((a & ~A_MASK) != 0) throw new IllegalArgumentException(); if ((b & ~B_MASK) != 0) throw new IllegalArgumentException(); if ((f & ~((1 << F_BITS) - 1)) != 0) throw new IllegalArgumentException(); return a | (b << A_BITS) | (f << (A_BITS + B_BITS)); } static int a(int x) { return x & A_MASK; } }

Java’s defined shift behavior and exceptions make it easier to keep safe. Keep these operations centralized to avoid scattered bit logic.

Struct Layout, Padding, and Alignment (C) and Their Cross-Language Echo

In C, struct layout is a direct performance and correctness concern. Field order can introduce padding; misaligned access can be slower or even fault on some platforms. Even if you are not writing C, understanding these effects helps when you define binary formats consumed by multiple languages.

Step-by-step: Minimizing Padding

Group fields by decreasing alignment requirement (e.g., 8-byte, then 4-byte, then 2-byte, then 1-byte).
Use fixed-width types for on-disk/on-wire formats (uint32_t, uint16_t).
Do not rely on compiler-specific packing for external formats; define serialization explicitly.

C Example: Field Reordering

typedef struct { uint8_t tag; uint64_t id; uint32_t count; } Bad; typedef struct { uint64_t id; uint32_t count; uint8_t tag; } Better;

Bad likely inserts padding after tag to align id. Better reduces padding by placing the 8-byte field first.

Cross-language note: if you serialize struct bytes directly, other languages may not match your padding/alignment. Prefer explicit serialization (write each field in a defined endianness) rather than dumping raw structs.

Nullability and Sentinel Values

Representing “missing” values is both a safety and performance decision. Options include: nullable references, sentinel values (like -1), separate validity bitsets, or separate arrays for present values. Each has trade-offs.

Nullable references are convenient but can add branches and risk NullPointerException/AttributeError.
Sentinels are fast but dangerous if the sentinel can collide with valid data.
Validity bitsets (one bit per element) are compact and can be processed efficiently; they also make missingness explicit.

Java: Sentinel vs Optional

Using Optional in hot data structures can add overhead. For dense numeric arrays, a sentinel or a parallel boolean array is often clearer and faster:

final class IntColumn { final int[] values; final boolean[] present; IntColumn(int n) { values = new int[n]; present = new boolean[n]; } int getOrDefault(int i, int def) { return present[i] ? values[i] : def; } }

Safety: the presence array prevents accidental use of uninitialized values and makes missingness explicit without relying on a magic number.

Python/Ruby: Explicit Missingness

In Python and Ruby, None/nil are common, but in hot loops they add branches and type checks. For numeric columns, consider parallel arrays: one packed numeric array plus a bytearray/String bitmask for presence. Keep the representation internal and provide methods that hide the complexity.

Copying vs Views: Representing Slices Safely

Another representation choice is whether a “slice” is a copy or a view. Views avoid copying and can be faster, but they require careful lifetime and mutation rules.

Java ByteBuffer.slice() is a view: changes affect the same underlying buffer.
Python memoryview is a view: it keeps the underlying object alive and can expose writable access.
Ruby byteslice returns a new string in many cases (implementation-dependent), but it is still safer to treat slices as separate values unless you know the sharing behavior.
C pointer+length pairs are views; you must ensure the backing storage remains valid.

Step-by-step: Safe View Design

Represent a view as (pointer, length) or (buffer, offset, length) rather than relying on terminators.
Make mutability explicit: provide separate APIs for read-only vs writable views.
Ensure the backing storage outlives the view (ownership/lifetime rule).
Validate offsets and lengths at creation time, not repeatedly in inner loops.

C Example: Slice Type

typedef struct { const uint8_t* p; size_t n; } Slice; int slice_from(const uint8_t* buf, size_t n, size_t off, size_t len, Slice* out) { if (off > n) return 0; if (len > n - off) return 0; out->p = buf + off; out->n = len; return 1; }

This centralizes bounds checks and makes it harder to accidentally compute an out-of-range pointer.

Choosing Representations at Module Boundaries

A reliable way to keep both speed and safety is to standardize representations at boundaries: parse/validate once, then operate on a compact internal form.

Step-by-step Boundary Workflow

At input: accept flexible external forms (JSON/text, user objects), but immediately validate and convert.
Internally: store in a representation optimized for access (packed arrays, IDs instead of strings, normalized encoding).
At output: convert back to external form only when needed.

Examples of internal normalization:

Replace repeated strings with integer IDs (dictionary encoding) and store IDs in an int array.
Store timestamps as integer epoch nanoseconds/milliseconds rather than strings.
Store booleans as bitsets when you have millions of them.

Python Example: Dictionary Encoding for Repeated Strings

def encode(values): # values: list[str] table = {} ids = [] for s in values: i = table.get(s) if i is None: i = len(table) table[s] = i ids.append(i) return table, ids

Even without specialized libraries, this reduces repeated string comparisons in later stages and makes hot loops operate on integers.

Java Example: Intern Table + int[]

import java.util.*; final class DictEncoded { final Map<String,Integer> dict = new HashMap<>(); final ArrayList<String> rev = new ArrayList<>(); int id(String s) { Integer i = dict.get(s); if (i != null) return i; int nid = rev.size(); rev.add(s); dict.put(s, nid); return nid; } }

Store the resulting IDs in an int[] for processing. Safety: keep the mapping encapsulated so IDs are always valid indices into rev.

Checklist: Representation and Layout Decisions You Can Apply Immediately

Hot numeric data: use contiguous primitive storage (C arrays, Java primitive arrays, Python array/memoryview, Ruby byte buffers for binary).
AoS vs SoA: choose based on access pattern; SoA often wins for columnar processing.
Binary formats: fixed-width fields, explicit endianness, length-prefixed payloads, validate lengths before slicing.
Text vs bytes: keep bytes as bytes; decode once at boundaries with strict validation.
Missing values: avoid ambiguous sentinels; consider parallel presence arrays/bitsets.
Views: represent slices as (buffer, offset, length); validate once; make mutability explicit.
Struct padding (C): reorder fields to reduce padding; never serialize raw structs for cross-language interchange.

Now answer the exercise about the content:

In a hot numeric loop that processes one field (or a small subset of fields) across many elements, which layout choice is typically faster and why?

You are right! Congratulations, now go to the next page

You missed! Try again.

SoA keeps each field in a contiguous array, which is often faster when the loop touches only one or two fields across many elements. This improves cache locality and can make vectorization easier.