
Polyglot Performance Patterns: Writing Fast, Safe Code Across Python, Ruby, Java, and C

Concurrency Primitives and Parallel Work Strategies

Chapter 13

Why Concurrency Primitives Matter

Concurrency is about structuring a program so multiple tasks can make progress without blocking each other unnecessarily. Parallelism is about executing tasks at the same time on multiple CPU cores. The two overlap, but they are not the same: you can have concurrency on a single core (interleaving work) and parallelism without much concurrency (a batch of independent tasks running simultaneously). Concurrency primitives are the building blocks that let you coordinate tasks safely: they protect shared state, communicate between producers and consumers, and express ordering constraints.

Across Python, Ruby, Java, and C, the primitives differ in ergonomics and guarantees, but the core problems are consistent: (1) avoid data races, (2) avoid deadlocks, (3) avoid starvation and priority inversion, (4) keep contention low, and (5) make failure and cancellation manageable. This chapter focuses on practical primitives and work strategies you can apply across languages, with concrete examples and step-by-step guidance.

Choosing a Concurrency Model: Shared State vs Message Passing

Shared-state concurrency

Shared-state concurrency means multiple threads (or tasks) access the same memory and coordinate with locks/atomics. It can be efficient but is easy to get wrong. Use it when you need low-latency access to shared structures (counters, caches, in-memory indexes) and when copying data would be too expensive.

Message passing

Message passing means tasks communicate by sending immutable (or effectively immutable) messages through queues/channels. It reduces the surface area for races because ownership is transferred or data is copied. Use it when you can structure work as pipelines, fan-out/fan-in, or actor-like components.

Practical rule: default to message passing for coordination and use shared-state primitives only at well-defined boundaries (e.g., a single shared map protected by one lock, or a small set of atomic counters).


Core Primitives and What They Solve

Mutex / Lock

A mutex provides mutual exclusion: only one thread enters a critical section at a time. Use it to protect invariants of shared data structures. Keep critical sections small and avoid calling unknown code while holding a lock (callbacks, user code, IO), because it increases contention and deadlock risk.
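
For instance, in Java (a minimal sketch; the Registry class and its score computation are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

class Registry {
  private final ReentrantLock lock = new ReentrantLock();
  private final Map<String, Integer> counts = new HashMap<>();

  int record(String key) {
    int score = computeScore(key);   // expensive work stays outside the lock
    lock.lock();
    try {
      return counts.merge(key, score, Integer::sum); // brief critical section
    } finally {
      lock.unlock();                 // always release, even if merge throws
    }
  }

  private int computeScore(String key) { return key.length(); } // placeholder
}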

Read-write lock

A read-write lock allows multiple readers or one writer. It can help when reads dominate and writes are rare, but it can also introduce overhead and writer starvation if not implemented fairly. Prefer a simple mutex unless you have a demonstrated read-heavy contention problem.
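
For illustration, a read-mostly configuration map in Java (the Config class is hypothetical; benchmark before preferring this over a plain mutex):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class Config {
  private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
  private final Map<String, String> settings = new HashMap<>();

  String get(String key) {
    rw.readLock().lock();              // many readers may hold this at once
    try { return settings.get(key); }
    finally { rw.readLock().unlock(); }
  }

  void put(String key, String value) {
    rw.writeLock().lock();             // a writer gets exclusive access
    try { settings.put(key, value); }
    finally { rw.writeLock().unlock(); }
  }
}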

Semaphore

A semaphore controls access to a limited resource (e.g., only N concurrent operations). It is often used to bound concurrency and protect external systems (database connections, rate-limited APIs) without serializing everything behind a single lock.
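
A Java sketch that bounds a hypothetical fetch operation to 10 concurrent calls:

import java.util.concurrent.Semaphore;

class BoundedFetcher {
  private final Semaphore permits = new Semaphore(10); // at most 10 in flight

  String fetch(String url) throws InterruptedException {
    permits.acquire();                 // blocks while 10 fetches are running
    try {
      return doFetch(url);
    } finally {
      permits.release();               // always return the permit
    }
  }

  private String doFetch(String url) { return "response"; } // placeholder IO
}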

Condition variable

A condition variable lets threads wait until a predicate becomes true, releasing a mutex while waiting and reacquiring it when signaled. It is used for producer-consumer queues and for coordinating state transitions. Always wait in a loop that re-checks the predicate, because wakeups can be spurious or the condition may no longer hold.
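
The same rules, sketched with Java's Condition API as a minimal unbounded mailbox:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class Mailbox<T> {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition nonEmpty = lock.newCondition();
  private final Deque<T> items = new ArrayDeque<>();

  void put(T item) {
    lock.lock();
    try {
      items.addLast(item);
      nonEmpty.signal();               // wake one waiting consumer
    } finally { lock.unlock(); }
  }

  T take() throws InterruptedException {
    lock.lock();
    try {
      while (items.isEmpty()) {        // loop: wakeups can be spurious
        nonEmpty.await();              // releases the lock while waiting
      }
      return items.removeFirst();
    } finally { lock.unlock(); }
  }
}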

Atomic operations

Atomics provide lock-free reads/writes and read-modify-write operations (increment, compare-and-swap). They are ideal for counters, flags, and simple state machines. They are not a replacement for locks when you need to update multiple fields consistently.
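
For example, in Java (a sketch; the two-state machine is illustrative):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class Stats {
  private final AtomicLong requests = new AtomicLong();
  private final AtomicInteger state = new AtomicInteger(0); // 0=idle, 1=running

  void onRequest() { requests.incrementAndGet(); }  // lock-free counter

  boolean tryStart() {
    // compare-and-swap: exactly one caller wins the 0 -> 1 transition
    return state.compareAndSet(0, 1);
  }
}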

Futures / Promises

Futures represent a value that will be available later. They simplify composition: you can submit work and then wait, poll, or attach continuations. In Java, futures are central to executor-based parallelism; in Python, they appear in concurrent.futures and asyncio; in Ruby, they appear in libraries such as concurrent-ruby, while with Ractors you typically pass messages and wait for results; in C, you often build your own future-like struct with a mutex/condvar.
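
A small Java sketch of submitting work and attaching a continuation:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FutureDemo {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    CompletableFuture<Integer> f =
        CompletableFuture.supplyAsync(() -> 6 * 7, pool) // submit work
                         .thenApply(n -> n + 1);         // continuation
    System.out.println(f.join()); // block for the final value (43)
    pool.shutdown();
  }
}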

Barriers and latches

Barriers synchronize phases: all workers reach a point before any proceed. Latches allow one-time synchronization (wait until N events occur). These are useful in parallel algorithms with distinct stages (parse, transform, aggregate) where each stage must complete before the next begins.
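
For example, a one-time latch in Java gating completion of a phase (a minimal sketch):

import java.util.concurrent.CountDownLatch;

public class LatchDemo {
  public static void main(String[] args) throws InterruptedException {
    int workers = 4;
    CountDownLatch done = new CountDownLatch(workers); // one-time latch

    for (int i = 0; i < workers; i++) {
      new Thread(() -> {
        // ... phase-1 work would run here ...
        done.countDown();              // this worker has finished phase 1
      }).start();
    }

    done.await();                      // block until all four count down
    System.out.println("phase 1 complete");
  }
}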

Deadlocks, Starvation, and Contention: Practical Avoidance

Lock ordering

Deadlocks often happen when two threads acquire locks in different orders. Establish a global ordering rule (e.g., lock A then lock B) and follow it everywhere. If you must acquire multiple locks, acquire them in a consistent order and release in reverse order.
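
One way to make the ordering mechanical, sketched in Java (System.identityHashCode is used as an illustrative sort key; it can collide, so production code needs a distinct per-lock ID as a tiebreaker):

import java.util.concurrent.locks.ReentrantLock;

class Ordered {
  static void withBoth(ReentrantLock a, ReentrantLock b, Runnable body) {
    // Acquire in a globally consistent order; release in reverse.
    boolean aFirst = System.identityHashCode(a) <= System.identityHashCode(b);
    ReentrantLock first = aFirst ? a : b;
    ReentrantLock second = aFirst ? b : a;
    first.lock();
    try {
      second.lock();
      try { body.run(); }
      finally { second.unlock(); }
    } finally {
      first.unlock();
    }
  }
}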

Minimize lock scope

Compute outside the lock, then lock briefly to publish results. This reduces contention and improves throughput. A common pattern is: read shared state into locals, unlock, compute, lock, validate state hasn’t changed (or accept it), then update.

Prefer coarse locks over many fine locks until proven otherwise

Many fine-grained locks can increase complexity and deadlock risk. Start with a single lock protecting a structure. If contention is too high, consider sharding (striped locks) or switching to message passing.

Use timeouts and try-locks for resilience

In systems code, a try-lock or timed lock can prevent a stuck thread from blocking progress indefinitely. Use it to detect contention hotspots and to fail fast when a lock cannot be acquired in time. Do not use try-lock loops as a default; they can spin and waste CPU.
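
A timed-lock sketch in Java (the 200 ms budget is arbitrary):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class TimedSection {
  private final ReentrantLock lock = new ReentrantLock();

  boolean tryUpdate(Runnable update) throws InterruptedException {
    // Fail fast instead of blocking indefinitely on a hot or stuck lock.
    if (!lock.tryLock(200, TimeUnit.MILLISECONDS)) {
      return false;                    // caller can retry, shed load, or log
    }
    try {
      update.run();
      return true;
    } finally {
      lock.unlock();
    }
  }
}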

Parallel Work Strategies That Scale

Strategy 1: Task parallelism with a bounded worker pool

Task parallelism splits work into independent tasks and runs them on a fixed number of workers. The key is bounding: too many tasks in flight increases overhead and memory pressure; too few underutilizes cores. A bounded queue plus a worker pool is a robust default.

Step-by-step:

  • Choose a worker count (often number of cores for CPU-bound work; for IO-bound work, higher counts can help but should be bounded).
  • Create a queue for tasks (or use a built-in executor).
  • Submit tasks; if the queue is bounded, submission may block or fail when full.
  • Workers pull tasks, execute, and publish results to a results queue or future.
  • Signal completion (close the queue, send sentinel values, or track outstanding tasks).

Python example (CPU-bound vs IO-bound)

In CPython, threads are good for IO-bound concurrency but do not run Python bytecode in parallel due to the GIL. For CPU-bound parallelism, use processes.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

def io_task(url):
    # placeholder for network IO
    return url, 200

def cpu_task(n):
    # placeholder for CPU work
    s = 0
    for i in range(n):
        s += i*i
    return s

urls = [f"https://example.com/{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=32) as ex:
    futures = [ex.submit(io_task, u) for u in urls]
    for f in as_completed(futures):
        url, status = f.result()

nums = [200_000] * 16
# Note: on platforms that start workers via "spawn" (Windows, macOS),
# run ProcessPoolExecutor code under an `if __name__ == "__main__":` guard.
with ProcessPoolExecutor(max_workers=8) as ex:
    futures = [ex.submit(cpu_task, n) for n in nums]
    results = [f.result() for f in futures]

Practical guidance: use ThreadPoolExecutor for IO-bound tasks (HTTP, disk, waiting on subprocesses). Use ProcessPoolExecutor for CPU-bound tasks. Keep payloads small when using processes because arguments/results are serialized.

Ruby example: Threads vs Ractors

Ruby threads provide concurrency, but parallel execution of Ruby code depends on the Ruby implementation. In CRuby, the GVL limits parallel execution of Ruby bytecode, but threads still help for IO-bound work. For parallel CPU-bound work in modern Ruby, consider Ractors, which enforce isolation and communicate via message passing.

# IO-bound concurrency with threads
threads = 20.times.map do |i|
  Thread.new do
    # simulate IO
    sleep(0.01)
    i
  end
end
results = threads.map(&:value)

# CPU-bound parallelism with Ractors (message passing)
ractors = 4.times.map do
  Ractor.new do
    loop do
      msg = Ractor.receive
      break if msg == :stop
      n = msg
      s = 0
      (0...n).each { |i| s += i*i }
      Ractor.yield s
    end
  end
end

work = [200_000, 200_000, 200_000, 200_000]
work.each_with_index { |n, idx| ractors[idx % ractors.size].send(n) }
partial = work.size.times.map { Ractor.select(*ractors).last }

ractors.each { |r| r.send(:stop) }

Practical guidance: Ractors require shareable objects or copying; design messages as simple immutable data (numbers, frozen strings, arrays of primitives). This pushes you toward safer concurrency by construction.

Java example: ExecutorService and CompletionService

Java’s standard approach is executors and thread pools. For many tasks where you want results as they finish, CompletionService reduces bookkeeping.

import java.util.*;
import java.util.concurrent.*;

public class BatchSum {
  public static void main(String[] args) throws Exception {
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);

    List<Callable<Integer>> tasks = new ArrayList<>();
    for (int t = 0; t < 100; t++) {
      final int n = 200_000;
      tasks.add(() -> {
        int s = 0; // placeholder CPU burn; integer overflow is harmless here
        for (int i = 0; i < n; i++) s += i * i;
        return s;
      });
    }

    for (Callable<Integer> task : tasks) cs.submit(task);
    int total = 0;
    for (int i = 0; i < tasks.size(); i++) {
      Future<Integer> f = cs.take(); // blocks until the next task finishes
      total += f.get();              // rethrows task exceptions, if any
    }

    pool.shutdown();
  }
}

Practical guidance: prefer a bounded pool; avoid creating threads per task. Use cancellation (Future.cancel) and timeouts (get with timeout) when integrating with unreliable dependencies.

C example: pthread worker pool with mutex + condition variable

In C, you often build a worker pool using pthreads. The essential pieces are a task queue, a mutex to protect it, and a condition variable to wake workers.

#include <pthread.h>
#include <stdlib.h>

typedef void (*task_fn)(void*);

typedef struct task {
  task_fn fn;
  void* arg;
  struct task* next;
} task_t;

typedef struct {
  pthread_mutex_t mu;
  pthread_cond_t cv;
  task_t* head;
  task_t* tail;
  int stop;
} queue_t;

void queue_init(queue_t* q) {
  pthread_mutex_init(&q->mu, NULL);
  pthread_cond_init(&q->cv, NULL);
  q->head = q->tail = NULL;
  q->stop = 0;
}

void queue_push(queue_t* q, task_fn fn, void* arg) {
  task_t* t = (task_t*)malloc(sizeof(task_t));
  t->fn = fn; t->arg = arg; t->next = NULL;
  pthread_mutex_lock(&q->mu);
  if (q->tail) q->tail->next = t; else q->head = t;
  q->tail = t;
  pthread_cond_signal(&q->cv);
  pthread_mutex_unlock(&q->mu);
}

task_t* queue_pop(queue_t* q) {
  pthread_mutex_lock(&q->mu);
  while (!q->stop && q->head == NULL) {
    pthread_cond_wait(&q->cv, &q->mu);
  }
  task_t* t = q->head;
  if (t) {
    q->head = t->next;
    if (!q->head) q->tail = NULL;
  }
  pthread_mutex_unlock(&q->mu);
  return t;
}

void* worker(void* arg) {
  queue_t* q = (queue_t*)arg;
  for (;;) {
    task_t* t = queue_pop(q);
    if (!t) {
      pthread_mutex_lock(&q->mu);
      int stopping = q->stop;
      pthread_mutex_unlock(&q->mu);
      if (stopping) break;
      continue;
    }
    t->fn(t->arg);
    free(t);
  }
  return NULL;
}

/* Shutdown helper: set stop under the mutex, then wake every worker. */
void queue_stop(queue_t* q) {
  pthread_mutex_lock(&q->mu);
  q->stop = 1;
  pthread_cond_broadcast(&q->cv);
  pthread_mutex_unlock(&q->mu);
}

Step-by-step usage:

  • Initialize the queue.
  • Create N worker threads running worker.
  • Push tasks with queue_push.
  • To stop: call queue_stop (it sets stop=1 under the mutex and broadcasts the condition variable), then join the threads.

Practical guidance: always guard the queue with the mutex; always wait on the condition variable in a loop; use broadcast on shutdown so all workers wake.

Strategy 2: Data parallelism with chunking

Data parallelism splits a large dataset into chunks processed independently. The main performance lever is chunk size: too small increases scheduling overhead; too large causes load imbalance. Use dynamic scheduling (work stealing or a shared atomic index) when per-item cost varies.

Step-by-step with an atomic index (language-agnostic idea):

  • Store the next index to process in an atomic integer.
  • Each worker repeatedly fetch-adds a chunk size to claim a range.
  • Process that range locally without locks.
  • Stop when the claimed start index is beyond the dataset size.

Java has built-in support via parallel streams and ForkJoinPool, but explicit chunking often gives more predictable control. In C, you can implement the atomic index with C11 atomics; in Python/Ruby, you typically use process-based chunking for CPU-bound work.
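
A sketch of the atomic-index idea in Java (chunk size, thread count, and the toy workload are illustrative):

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkedSum {
  public static void main(String[] args) throws InterruptedException {
    int[] data = new int[1_000_000];
    Arrays.fill(data, 1);
    final int chunk = 4096;
    AtomicInteger next = new AtomicInteger(0);   // next unclaimed index
    AtomicLong total = new AtomicLong(0);

    Runnable work = () -> {
      for (;;) {
        int start = next.getAndAdd(chunk);       // claim a range via fetch-add
        if (start >= data.length) break;         // nothing left to claim
        int end = Math.min(start + chunk, data.length);
        long local = 0;
        for (int i = start; i < end; i++) local += data[i]; // lock-free work
        total.addAndGet(local);                  // one shared update per chunk
      }
    };

    Thread[] ts = new Thread[4];
    for (int i = 0; i < ts.length; i++) (ts[i] = new Thread(work)).start();
    for (Thread t : ts) t.join();
    System.out.println(total.get());             // prints 1000000
  }
}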

Strategy 3: Pipeline parallelism with staged queues

Pipeline parallelism breaks work into stages (e.g., decode, transform, aggregate) connected by bounded queues. Each stage can have its own concurrency level. This improves throughput when stages have different costs and lets you isolate blocking operations.

Step-by-step:

  • Define stages with clear input/output message types.
  • Place bounded queues between stages to prevent unbounded in-flight work.
  • Run each stage with a small worker pool.
  • Use sentinel messages or close semantics to signal end-of-stream.
  • Ensure each stage handles cancellation and propagates termination downstream.

Even when you already have backpressure patterns elsewhere, the concurrency-specific detail here is how to terminate cleanly: every stage must know when no more inputs will arrive, and must ensure all workers exit without leaving consumers blocked.
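
A minimal two-stage Java sketch with bounded queues and a sentinel (one worker per stage for brevity; with N workers per stage you would enqueue N sentinels):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
  private static final String EOF = "\u0000EOF"; // sentinel: end of stream

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<String> raw = new ArrayBlockingQueue<>(64);   // bounded
    BlockingQueue<String> upper = new ArrayBlockingQueue<>(64);

    Thread transform = new Thread(() -> {
      try {
        for (;;) {
          String s = raw.take();
          if (s.equals(EOF)) { upper.put(EOF); break; } // propagate shutdown
          upper.put(s.toUpperCase());
        }
      } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    });

    Thread sink = new Thread(() -> {
      try {
        for (;;) {
          String s = upper.take();
          if (s.equals(EOF)) break;
          System.out.println(s);
        }
      } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    });

    transform.start(); sink.start();
    for (String s : new String[] {"a", "b", "c"}) raw.put(s);
    raw.put(EOF);            // signal end-of-stream into the first stage
    transform.join(); sink.join();
  }
}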

Cancellation, Timeouts, and Failure Propagation

Designing for cancellation

Cancellation is a coordination problem: one task decides the overall operation should stop (timeout, error, user request), and other tasks must observe that decision and stop promptly. The primitive varies: a shared atomic flag, a cancellation token, interrupting threads, or closing channels.

Practical steps:

  • Create a shared cancellation signal (atomic boolean or language token).
  • Check it at safe points inside loops and between phases.
  • Ensure blocking operations have timeouts or are interruptible.
  • On cancellation, stop accepting new work and drain/stop workers.

Java: interruption and timeouts

In Java, thread interruption is a standard mechanism. If you use blocking queues or sleep, interruption can wake threads. Always restore the interrupt status if you catch InterruptedException and cannot exit immediately.

// Worker loop; assumes a BlockingQueue<Runnable> named queue is in scope.
try {
  while (!Thread.currentThread().isInterrupted()) {
    Runnable r = queue.poll(100, TimeUnit.MILLISECONDS); // wakes on interrupt
    if (r != null) r.run();
  }
} catch (InterruptedException e) {
  Thread.currentThread().interrupt(); // restore the interrupt status
}

Python: cooperative cancellation

With thread pools, cancellation is mostly cooperative (you can cancel futures that haven’t started; running tasks must check a flag). With asyncio, cancellation is integrated via CancelledError, but you still need to ensure your code awaits at cancellation points and cleans up resources.

C: cancellation tokens and shutdown broadcast

In C, a common approach is a shared atomic flag plus condition-variable broadcast. Workers check the flag between tasks and also wake up when shutdown is requested.

Reducing Contention with Sharding and Combining

Sharded locks (striping)

If a single lock becomes hot, shard the data structure into multiple independent segments, each with its own lock. For example, a concurrent map can be split into 64 buckets by hash, each protected by a mutex. This reduces contention when keys are well-distributed.

Practical steps:

  • Choose shard count (power of two is convenient).
  • Map key to shard via hash & (shards-1).
  • Lock only the shard you need.
  • Keep operations that span shards rare or carefully ordered to avoid deadlocks.
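
A striped map sketch in Java (the shard count and hash spreading are illustrative choices):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

class StripedMap<K, V> {
  private static final int SHARDS = 64;          // power of two
  private final ReentrantLock[] locks = new ReentrantLock[SHARDS];
  @SuppressWarnings("unchecked")
  private final Map<K, V>[] maps = (Map<K, V>[]) new Map[SHARDS];

  StripedMap() {
    for (int i = 0; i < SHARDS; i++) {
      locks[i] = new ReentrantLock();
      maps[i] = new HashMap<>();
    }
  }

  private int shard(K key) {
    int h = key.hashCode();
    h ^= (h >>> 16);                             // spread high bits downward
    return h & (SHARDS - 1);                     // mask selects the shard
  }

  V get(K key) {
    int i = shard(key);
    locks[i].lock();                             // lock only this shard
    try { return maps[i].get(key); }
    finally { locks[i].unlock(); }
  }

  V put(K key, V value) {
    int i = shard(key);
    locks[i].lock();
    try { return maps[i].put(key, value); }
    finally { locks[i].unlock(); }
  }
}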

Thread-local accumulation + periodic merge

For counters and aggregations, avoid a shared atomic increment in a tight loop. Instead, accumulate in thread-local storage and merge periodically under a lock (or at the end). This is a work strategy that often outperforms fine-grained atomic updates under high contention.

Example idea (language-agnostic): each worker keeps a local sum; after processing its chunk, it adds the local sum to a shared total once. This turns many contended updates into a few uncontended ones.
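
In Java, for example (a sketch; java.util.concurrent.atomic.LongAdder packages the same idea for counters):

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

public class LocalThenMerge {
  public static void main(String[] args) throws InterruptedException {
    long[] items = new long[4_000_000];
    Arrays.fill(items, 2L);
    AtomicLong total = new AtomicLong(0);
    int workers = 4;
    Thread[] ts = new Thread[workers];

    for (int w = 0; w < workers; w++) {
      final int id = w;
      ts[w] = new Thread(() -> {
        long local = 0;                          // uncontended accumulator
        for (int i = id; i < items.length; i += workers) local += items[i];
        total.addAndGet(local);                  // one contended update, total
      });
      ts[w].start();
    }
    for (Thread t : ts) t.join();
    System.out.println(total.get());             // prints 8000000
  }
}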

Safe Publication and Ownership Rules

Concurrency bugs often come from unclear ownership: who is allowed to mutate an object, and when. A practical way to stay safe across languages is to adopt explicit ownership rules:

  • Immutable messages: once sent to another thread/task, do not mutate.
  • Single-writer principle: for a given piece of state, designate one thread as the writer; others read via snapshots or messages.
  • Confine mutable state: keep mutable objects thread-local whenever possible.

In Java, safe publication is often achieved by final fields, volatile references, or synchronization. In C, it’s mutex-protected handoff or atomics with correct ordering. In Python and Ruby, the runtime reduces some low-level hazards, but logical races still occur; treat shared mutable objects as unsafe unless protected.
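
In Java, these rules combine naturally into an immutable-snapshot, single-writer sketch:

class Published {
  static final class Snapshot {
    final int version;                 // final fields are safely published
    final long timestamp;
    Snapshot(int version, long timestamp) {
      this.version = version;
      this.timestamp = timestamp;
    }
  }

  // volatile: readers always see a fully constructed Snapshot;
  // one designated writer thread swaps in replacements.
  private volatile Snapshot current = new Snapshot(0, 0L);

  Snapshot read() { return current; }              // readers take snapshots
  void publish(Snapshot next) { current = next; }  // called by the one writer
}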

Putting It Together: Selecting Primitives by Problem Shape

Shared counters and metrics

  • Low contention: atomic increment (Java AtomicLong; C11 atomic; in Python/Ruby, a lock around increments if correctness matters).
  • High contention: thread-local counters + merge.

Producer-consumer work queue

  • Use a blocking queue/channel when available (Java BlockingQueue; Python queue.Queue; Ruby Queue).
  • In C, implement with mutex + condition variable and a clear shutdown protocol.

Bounded concurrency to protect a resource

  • Semaphore (Java Semaphore; Python threading.Semaphore; Ruby SizedQueue or a semaphore from a library such as concurrent-ruby; C sem_t or a custom counter + condvar).

Parallel map/reduce style batch

  • Java: executor + futures or ForkJoinPool for recursive splits.
  • Python: process pool with chunking.
  • Ruby: Ractors for CPU-bound, threads for IO-bound.
  • C: worker pool with chunked ranges and local accumulation.

Common Step-by-Step Patterns You Can Reuse

Pattern: Bounded worker pool with graceful shutdown

Steps:

  • Create a bounded queue for tasks.
  • Start N workers that block on the queue.
  • Submit tasks; if submission blocks, you have built-in throttling.
  • On shutdown: stop accepting new tasks, signal workers (close queue or send N sentinels), then join/wait.
  • Ensure workers handle exceptions and report failures (results queue, future, or shared error state).

Pattern: Fan-out/fan-in with early exit on error

Steps:

  • Submit tasks and keep handles (futures or worker IDs).
  • As results arrive, if any task fails, trigger cancellation.
  • Stop scheduling new tasks and wait for in-flight tasks to observe cancellation.
  • Return the first error (or aggregate errors if needed).

This pattern is especially important when parallelizing validation or transformation: you often want to stop quickly on the first fatal error rather than waste CPU.
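
A Java sketch of the fail-fast shape (the failing input is contrived for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class FanOutFanIn {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
    List<Future<Integer>> handles = new ArrayList<>();

    for (int i = 0; i < 16; i++) {
      final int n = i;
      handles.add(cs.submit(() -> {
        if (n == 7) throw new IllegalStateException("fatal input: " + n);
        return n * n;
      }));
    }

    try {
      for (int i = 0; i < handles.size(); i++) {
        cs.take().get();                     // rethrows the first task failure
      }
    } catch (ExecutionException e) {
      for (Future<Integer> f : handles) f.cancel(true); // cancel the rest
      System.err.println("failed fast: " + e.getCause().getMessage());
    } finally {
      pool.shutdownNow();                    // interrupt anything still running
    }
  }
}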

Pattern: Two-phase parallel work with a barrier

Steps:

  • Phase 1: each worker computes local results independently.
  • Barrier: wait until all workers finish phase 1.
  • Phase 2: workers consume shared results (often read-only) or perform a second pass.

In Java, you can use CyclicBarrier or Phaser; in C, pthread_barrier_t (where available) or a custom barrier with mutex/condvar; in Python/Ruby, barriers are often built using condition variables or higher-level libraries.
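
For example, with CyclicBarrier in Java (a minimal sketch):

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class TwoPhase {
  public static void main(String[] args) {
    int workers = 4;
    CyclicBarrier barrier = new CyclicBarrier(workers,
        () -> System.out.println("phase 1 done; starting phase 2"));

    for (int i = 0; i < workers; i++) {
      final int id = i;
      new Thread(() -> {
        try {
          // Phase 1: independent local work would run here.
          barrier.await();                 // nobody proceeds until all arrive
          // Phase 2: consume shared (now read-only) phase-1 results.
          System.out.println("worker " + id + " in phase 2");
        } catch (InterruptedException | BrokenBarrierException e) {
          Thread.currentThread().interrupt();
        }
      }).start();
    }
  }
}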

Now answer the exercise about the content:

When designing coordination between concurrent tasks, which approach should typically be the default, and how should shared-state primitives be used?

Answer: Message passing reduces the surface area for races by transferring ownership or copying data. Shared-state locks/atomics should be limited to well-defined boundaries such as a single shared structure protected by one lock.

Next chapter

Interoperability and Cross-Language Boundaries
