
Polyglot Performance Patterns: Writing Fast, Safe Code Across Python, Ruby, Java, and C

Profiling and Observability Tooling

Chapter 3

What “Profiling” and “Observability” Mean in Practice

Profiling and observability are complementary ways to answer different performance questions. Profiling is about attributing cost: where time, allocations, lock contention, or I/O waits are spent inside a process. Observability is about understanding a running system from the outside using telemetry: metrics, logs, and traces that explain behavior over time and across services.

In a single-process script, profiling often gets you to the fix fastest: identify hot functions, excessive allocations, or slow system calls. In a production service, observability is the safety net: it tells you when performance regresses, which requests are slow, and whether the issue is CPU, memory, I/O, downstream dependencies, or concurrency.

A useful mental model is: profiling answers “why is this code slow?”, while observability answers “when is the system slow, for whom, and what changed?”. Mature performance work uses both: observability to detect and scope, profiling to pinpoint and remediate.

Telemetry Building Blocks: Metrics, Logs, Traces, and Profiles

Metrics

Metrics are numeric time series sampled or aggregated over intervals (e.g., requests per second, p95 latency, CPU usage, heap size). They are cheap to store and query, and ideal for dashboards and alerting. For performance, prefer distributions (histograms) over averages: p50/p95/p99 latency, queue depth percentiles, GC pause histograms, and request size histograms.
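
As a quick illustration of why averages mislead, here is a small self-contained sketch (the latency sample is made up) comparing the mean with a nearest-rank p95:

import math

# made-up latency sample: mostly fast requests plus two slow outliers
latencies_ms = [12, 13, 14, 15, 15, 16, 18, 20, 250, 900]

mean = sum(latencies_ms) / len(latencies_ms)        # ~127 ms, looks tolerable
ranked = sorted(latencies_ms)
p95 = ranked[math.ceil(0.95 * len(ranked)) - 1]     # nearest-rank p95 = 900 ms
print(f"mean={mean:.0f}ms p95={p95}ms")             # the tail is what users actually feel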

Logs

Logs are event records. For performance, unstructured logs are hard to query; structured logs (JSON) let you filter by request id, user id, endpoint, and latency. Logs are best for explaining “what happened” around an incident: errors, retries, timeouts, circuit breaker opens, and slow query statements (with care for sensitive data).
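
A minimal sketch of a structured log line in Python (field names such as request_id are illustrative, not a required schema):

import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")

start = time.perf_counter()
# ... handle the request ...
logger.info(json.dumps({
    "event": "request.completed",
    "request_id": "req-123",          # illustrative correlation id
    "route": "/orders/:id",           # route template, not the raw URL
    "status": 200,
    "duration_ms": round((time.perf_counter() - start) * 1000, 2),
}))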


Traces

Distributed tracing follows a request across boundaries: web handler → database → cache → downstream service. Traces reveal where latency accumulates and which dependency dominates. Tracing is especially valuable when the slow part is not in your code (e.g., DNS, TLS handshake, database lock waits).

Profiles

Profiles capture internal cost attribution. Common types: CPU profiles (where CPU time is spent), allocation/heap profiles (where memory is allocated), lock profiles (where threads wait), and I/O profiles (where time is blocked). Profiles can be collected in development (on-demand) or in production (continuous or sampled) using low-overhead profilers.

Choosing the Right Tool: A Decision Checklist

  • If you have a specific slow endpoint in production: start with metrics (latency percentiles, error rate), then traces to find the slow span, then profile the responsible service if the span is inside your code.

  • If CPU is high: use CPU profiling (sampling) and check for lock contention; correlate with request rate and payload size metrics.

  • If memory grows: use heap/allocation profiling, GC metrics, and look for object retention (leaks) versus churn (high allocation rate).

  • If latency spikes: look for tail latency causes—GC pauses, lock contention, thread pool saturation, connection pool exhaustion, or downstream timeouts—using metrics and traces.

  • If “it’s slow locally”: use deterministic profiling runs and flame graphs; add targeted instrumentation around suspected regions.

Profiling Patterns That Transfer Across Python, Ruby, Java, and C

Sampling vs instrumentation profilers

Sampling profilers periodically capture stack traces (e.g., 100 Hz) and infer where time is spent. They have low overhead and are suitable for production. Instrumentation profilers wrap function entry/exit to measure exact durations; they can be more precise but often add significant overhead and distort results, especially in dynamic languages.
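
To make the sampling idea concrete, here is a toy signal-based sampler in Python (Unix-only, main thread only; real profilers are far more careful, so treat this purely as a sketch of the mechanism):

import collections, signal, traceback

samples = collections.Counter()

def _sample(signum, frame):
    # record the interrupted stack as a tuple of (file, function) pairs
    stack = tuple((f.filename, f.name) for f in traceback.extract_stack(frame))
    samples[stack] += 1

signal.signal(signal.SIGPROF, _sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # ~100 Hz, driven by CPU time

# ... run workload ...

signal.setitimer(signal.ITIMER_PROF, 0, 0)         # stop sampling
for stack, count in samples.most_common(5):
    print(count, " <- ".join(name for _, name in reversed(stack)))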

Wall time vs CPU time

Wall time includes waiting (I/O, locks, scheduling). CPU time measures actual CPU execution. A function can have high wall time but low CPU time if it blocks on I/O. When diagnosing latency, wall-time flame graphs are often more actionable; when diagnosing CPU saturation, CPU-time profiles are key.
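
The distinction is easy to see directly in Python; a minimal sketch in which a blocking sleep advances wall time but barely touches CPU time:

import time

wall0, cpu0 = time.perf_counter(), time.process_time()

time.sleep(1.0)                           # blocks: wall time grows, CPU time does not
total = sum(i * i for i in range(10**6))  # computes: both clocks grow

print(f"wall={time.perf_counter() - wall0:.2f}s cpu={time.process_time() - cpu0:.2f}s")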

Inclusive vs exclusive cost

Inclusive cost includes time in callees; exclusive cost is time in the function itself. Hot “dispatcher” functions often have high inclusive cost but low exclusive cost; the fix is usually in deeper callees.
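
In a profile of the toy code below (names are invented for illustration), dispatcher shows high inclusive (cumulative) cost but low exclusive (self) cost; the actionable hotspot is process_item:

def process_item(n):
    return sum(i * i for i in range(n))        # expensive leaf: high exclusive cost

def dispatcher(items):
    # cheap by itself, but inherits its callees' cost: high inclusive, low exclusive
    return [process_item(n) for n in items]

dispatcher([50_000] * 200)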

Allocation rate vs retained heap

High allocation rate causes GC pressure and latency spikes even if memory does not “leak.” Retained heap indicates objects staying alive too long. Allocation profiling helps reduce churn; heap dump analysis helps find retention paths.
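
A minimal illustration of the difference (the functions are contrived): churn allocates heavily but retains nothing, while retain grows the live heap:

RETAINED = []

def churn():
    # high allocation rate, low retained heap: objects die immediately
    for _ in range(100_000):
        _tmp = "x" * 64

def retain():
    # retained heap grows: a long-lived reference keeps every object alive
    for _ in range(100_000):
        RETAINED.append("x" * 64)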

Python: Practical Profiling and Production-Friendly Observability

Step-by-step: CPU profiling with cProfile and pstats

Use cProfile for quick attribution in a script or unit-test-like harness. It is deterministic but adds overhead; prefer it in development.

python -m cProfile -o profile.pstats your_script.py
python -c "import pstats; p=pstats.Stats('profile.pstats'); p.sort_stats('cumtime').print_stats(30)"

Interpretation tips: sort by cumulative time to find call chains; sort by tottime to find expensive leaf functions. If a function appears frequently with small per-call cost, consider batching or reducing call count.
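
When only one region of a larger program matters, the same profiler can be driven programmatically; a minimal sketch, where run_workload is a hypothetical stand-in for the code under test:

import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()          # hypothetical: the code path you want to attribute
profiler.disable()

pstats.Stats(profiler).sort_stats("cumtime").print_stats(15)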

Step-by-step: line-level profiling (when function-level is not enough)

Line profilers are useful when a single function dominates but you need to know which line is responsible. They add significant overhead; run on representative inputs.

# install: pip install line_profiler
# in code: annotate functions with @profile, then run:
kernprof -l -v your_script.py

Step-by-step: memory allocation profiling with tracemalloc

tracemalloc tracks Python object allocations and can show which code paths allocate the most memory.

import tracemalloc

tracemalloc.start()
# run the workload; this throwaway list stands in for real application code
data = ["x" * 100 for _ in range(100_000)]
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')[:10]
for stat in top:
    print(stat)

Use this to reduce churn (e.g., avoid repeated string concatenations, unnecessary list copies, or per-item regex compilation).
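
To find growth rather than absolute allocation volume, compare two snapshots taken around a batch of work; a minimal sketch:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
# ... run one batch of the workload ...
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)   # size and count deltas per source line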

Observability instrumentation: OpenTelemetry for metrics and traces

For services, add request metrics and traces. A minimal pattern is: create a tracer, start spans around key operations (DB, cache, external HTTP), and propagate context across boundaries. Use automatic instrumentation where possible, then add custom spans for domain-specific work.

# conceptual example (framework-specific setup omitted)
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("compute_quote") as span:
    span.set_attribute("items", len(items))
    price = pricing_engine(items)
    span.set_attribute("price", price)

Keep span attributes low-cardinality (bounded set of values). High-cardinality attributes (user ids, order ids) can explode storage costs; use them in logs, not metrics, and only in traces if sampling is controlled.
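
The same cardinality rule applies to metrics; a minimal sketch using the OpenTelemetry metrics API (provider and exporter setup omitted; the metric name is illustrative):

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_duration = meter.create_histogram(
    "http.server.duration", unit="ms", description="Server request latency"
)

# record against the route *template*, never the raw URL or a user id
request_duration.record(37.2, attributes={"http.route": "/users/:id", "http.method": "GET"})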

Ruby: Profiling Hot Paths and Measuring Allocation Pressure

Step-by-step: CPU profiling with ruby-prof

ruby-prof can produce call graphs and flat profiles. It is best in development or staging due to overhead.

# Gemfile: gem 'ruby-prof'
require 'ruby-prof'
RubyProf.start
# run workload
result = RubyProf.stop
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT, min_percent: 1)

Look for methods with high total time and high call counts. In Ruby, reducing object churn and method dispatch overhead can matter; consider memoization, avoiding intermediate arrays, and being careful with iterator chains that allocate intermediate objects.

Step-by-step: allocation profiling with stackprof

stackprof supports CPU and object allocation profiling and can output flame graphs. Allocation profiling is particularly useful for GC-related latency.

# Gemfile: gem 'stackprof'
require 'stackprof'
StackProf.run(mode: :object, out: 'stackprof.dump') do
  # run workload
end
# then:
stackprof stackprof.dump --text --limit 30

Use results to eliminate avoidable allocations (e.g., repeated Hash creation in tight loops, converting between strings and symbols, building arrays only to immediately iterate again).

Production observability: structured logs and trace correlation

In Ruby services, structured logging plus trace/span ids is a strong baseline. Emit JSON logs with fields like request_id, trace_id, span_id, duration_ms, status, and route. Ensure your logger is non-blocking or buffered to avoid turning logging into a latency source.

Java: Low-Overhead Profiling and JVM Observability

JVM metrics you should always have

  • Heap usage (used, committed), allocation rate if available

  • GC pause time distribution and count (by collector)

  • Thread counts and states (RUNNABLE, BLOCKED, WAITING)

  • CPU usage (process and system), load average

  • Request latency histograms and error rates

  • Connection pool metrics (DB, HTTP clients)

These metrics help you distinguish: CPU-bound work, GC pressure, lock contention, and downstream saturation.

Step-by-step: Java Flight Recorder (JFR) for profiling in staging or production

JFR is built into the JVM and designed for low overhead. You can start a recording without restarting the process (depending on JVM and permissions).

# start a 60s recording
jcmd <pid> JFR.start name=perf settings=profile duration=60s filename=/tmp/perf.jfr
# stop explicitly if needed
jcmd <pid> JFR.stop name=perf

Open the .jfr file in Java Mission Control to inspect CPU hotspots, allocation pressure, thread contention, and I/O. For latency spikes, correlate “Thread Park,” “Monitor Enter,” and GC pause events with application-level request latency metrics.

Step-by-step: async-profiler for CPU and allocation flame graphs

async-profiler uses sampling and can generate flame graphs with relatively low overhead.

# CPU profile for 30 seconds, output as flame graph HTML
./profiler.sh -d 30 -f /tmp/cpu.html <pid>
# Allocation profile
./profiler.sh -d 30 -e alloc -f /tmp/alloc.html <pid>

Interpretation: CPU flame graphs show hot stacks; allocation flame graphs show where objects are created. If allocation hotspots align with latency spikes, focus on reducing churn or changing data structures.

Tracing: OpenTelemetry with context propagation

In Java, automatic instrumentation can capture HTTP server spans, client spans, and JDBC spans. Add custom spans around expensive domain operations. Ensure you propagate context across thread pools; missing propagation yields broken traces and misleading latency attribution.

C: Profiling Native Code and System-Level Observability

Step-by-step: sampling CPU profiling with perf

On Linux, perf provides powerful sampling of CPU cycles, cache misses, and more.

# record CPU samples for a running process
sudo perf record -F 99 -p <pid> -g -- sleep 30
# report with call stacks
sudo perf report

For accurate stacks, compile with frame pointers (or ensure unwind info is available):

gcc -O2 -g -fno-omit-frame-pointer -o app app.c

perf can also generate flame graphs (via external scripts) to visualize hotspots. If you see time in memcpy/memmove, consider data layout and copying; if you see time in syscalls, consider batching I/O or reducing context switches.

Step-by-step: heap profiling with heaptrack (or similar)

Heap profilers attribute allocations and show leaks and churn. A common workflow is to run the program under the profiler with representative input.

# example usage (tool availability varies by distro)
heaptrack ./app --workload
heaptrack_print heaptrack.app.*.gz | head

Use results to identify allocation-heavy paths and replace frequent small allocations with arenas, object pools (carefully), or stack allocation where safe.

System observability: eBPF and syscall-level insight

When native services are slow, the bottleneck may be in the kernel: disk I/O, network retransmits, scheduler latency, or lock contention in kernel paths. eBPF-based tools can measure latency distributions of syscalls, TCP retransmits, and block I/O without modifying the application. The transferable idea is to correlate application latency spikes with system-level events rather than guessing.

Cross-Language Step-by-Step Workflow: From Symptom to Root Cause

1) Detect and scope with metrics

Start with a small set of high-signal charts: request rate, error rate, p50/p95/p99 latency, CPU, memory/heap, GC pauses (if applicable), and saturation metrics (thread pool queue depth, connection pool usage). The goal is to answer: is this a general slowdown or isolated to certain routes, tenants, or payload sizes?

2) Identify the slow path with traces

Use tracing to find which span dominates the critical path. Common patterns:

  • DB span dominates: investigate query plans, locks, pool saturation, or N+1 patterns.

  • HTTP client span dominates: investigate downstream latency, retries, timeouts, DNS/TLS, or connection reuse.

  • Internal compute span dominates: move to profiling.

3) Pinpoint with profiling

Collect a profile during the problematic window (or reproduce in staging). Choose the profile type based on symptoms:

  • High CPU: CPU sampling profile

  • High GC or memory churn: allocation profile

  • Threads blocked: lock/contention profile

  • Latency without CPU: wall-time profile or syscall tracing

Prefer short, targeted captures (10–60 seconds) during the incident window to reduce noise and overhead.

4) Validate with targeted instrumentation

Profiles tell you “where,” but not always “why.” Add minimal instrumentation around the suspected region: measure batch sizes, cache hit rates, queue times, and retry counts. In all languages, a common pattern is to measure time spent waiting versus time spent computing.

# pseudo-pattern: measure queue wait vs work time
# now(), wait_for_worker(), do_work(), emit_metric() are placeholders, not a specific API;
# now() is assumed to return a monotonic timestamp in milliseconds
t0 = now()
wait_for_worker()      # time spent waiting for a worker / connection / lock
t1 = now()
do_work()              # time spent actually computing
t2 = now()
emit_metric("queue_wait_ms", t1 - t0)
emit_metric("work_ms", t2 - t1)

5) Guardrails: sampling, cardinality, and overhead budgets

Observability can harm performance if implemented carelessly. Apply these guardrails:

  • Sampling: sample traces (and sometimes logs) under high load; keep metrics aggregated.

  • Cardinality control: avoid unbounded labels/tags in metrics (user_id, order_id, full URL). Use route templates (e.g., /users/:id) and bounded enums.

  • Async export: export telemetry asynchronously; avoid blocking request threads on network I/O to telemetry backends.

  • Payload hygiene: never log sensitive data; redact or hash identifiers if needed.

  • Overhead budget: decide acceptable overhead (e.g., <1–2% CPU) for always-on telemetry; use on-demand profiling for deeper dives.

Practical Examples of Performance Questions and the Telemetry to Use

“p99 latency doubled after a deploy”

  • Metrics: compare p99 by route and status code; check GC pause histograms (Java), allocation rate (Ruby/Java), CPU.

  • Traces: compare critical path spans before/after; look for new downstream calls or increased span durations.

  • Profiles: capture CPU and allocation profiles during peak; look for new hotspots or increased call counts.

“CPU is pegged but throughput is flat”

  • Metrics: CPU, run queue length, thread states, request rate; check if retries increased.

  • Profiles: CPU sampling flame graph; look for busy loops, excessive serialization/deserialization, regex hotspots, or contention spin.

  • Traces: ensure time is not spent in downstream waits (if it is, CPU pegging may be due to retry storms or tight timeout loops).

“Memory keeps growing over hours”

  • Metrics: heap usage trend, GC frequency, RSS vs heap (native memory), cache sizes.

  • Profiles: allocation profiling to find churn; heap dump/retention analysis (Java) or tracemalloc snapshots (Python) to find retention paths.

  • Logs: correlate growth with specific tenants/routes/payload sizes.

Designing an Observability Baseline for Polyglot Systems

Standardize semantic conventions

Across Python, Ruby, Java, and C services, standardize names and dimensions so dashboards and alerts are consistent. Use the same metric names for request latency and error rate, the same tag keys for service, route, method, status, and environment, and the same trace propagation format across HTTP and messaging.

Correlate logs, metrics, and traces with shared identifiers

Ensure each request has a request id and trace id. Put them in:

  • Response headers (for client correlation)

  • Structured logs (trace_id, span_id, request_id)

  • Trace context (propagated to downstream calls)

This enables a workflow: alert fires → open dashboard → click into exemplar trace → pivot to logs for that trace id → capture a profile on the implicated service.
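
A minimal Python sketch of stamping the active trace and span ids onto a structured log line (assumes the OpenTelemetry SDK is configured and a span is active; field names are illustrative):

import json, logging
from opentelemetry import trace

logger = logging.getLogger("app")

def log_with_trace(message, **fields):
    ctx = trace.get_current_span().get_span_context()
    fields["trace_id"] = format(ctx.trace_id, "032x")   # hex form used by trace backends
    fields["span_id"] = format(ctx.span_id, "016x")
    logger.info(json.dumps({"message": message, **fields}))

log_with_trace("request completed", request_id="req-123", route="/users/:id", duration_ms=42)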

Use exemplars and high-cardinality data safely

Some metric systems support exemplars: attach a trace id to a histogram bucket sample. This gives you the best of both worlds: cheap aggregated metrics plus a direct link to a representative trace for a latency spike, without turning metrics into a high-cardinality store.

Common Pitfalls and How to Avoid Them

Profiling the wrong workload

Profiles are only as good as the workload. If you profile a small input but production uses large payloads, you may optimize the wrong thing. Use traces to identify representative slow requests, then replay or simulate those shapes in staging.

Confusing “hot” with “important”

A function can be hot because it is called frequently, but optimizing it may not reduce end-to-end latency if it is not on the critical path. Use traces to confirm critical path, then profile within that path.

Letting observability become the bottleneck

Excessive logging, synchronous exporters, and high-cardinality metrics can degrade performance. Treat telemetry as production code: load test it, measure overhead, and apply sampling and backpressure.

Ignoring contention and queues

Many latency problems are not “slow code” but waiting: locks, thread pools, connection pools, and kernel scheduling. Ensure you measure queue depth and wait time explicitly, and use profilers that can show blocked time (JFR, wall-time profilers, eBPF tools).

Now answer the exercise about the content:

In a production service, what sequence best matches a recommended workflow to diagnose a specific slow endpoint?

Answer: Metrics show when and where latency regresses, traces identify the slow span on the critical path, and profiles attribute internal cost to pinpoint what to fix when the issue is inside the service.

Next chapter

Memory Models, Allocation Behavior, and Object Lifetimes
