Building Reproducible Environments with Dockerfiles

Chapter 3

Estimated reading time: 10 minutes

Why Dockerfiles Matter for Reproducible Environments

A reproducible environment means that the same project can be built and run in the same way today, tomorrow, and on another machine, producing consistent behavior. In practice, reproducibility reduces “works on my machine” problems by making the environment explicit: operating system base, system packages, language runtime, dependencies, configuration files, and startup commands.

A Dockerfile is a build recipe that describes how to create an image in an explicit, reviewable way. Instead of manually setting up a server or relying on undocumented local steps, you encode the environment as a sequence of instructions. When you build from that Dockerfile, you get an image that can be run anywhere Docker is available.

Reproducibility is not automatic just because you use Docker. You still need to make careful choices: pin versions, avoid “latest” tags, control dependency resolution, and structure your Dockerfile to minimize surprises. This chapter focuses on those practices and shows step-by-step Dockerfiles you can adapt.

How Dockerfile Instructions Create a Deterministic Build

A Dockerfile is processed top-to-bottom. Instructions that change the filesystem (RUN, COPY, ADD) each create a new layer, while the others record image metadata. Each step's result is cached, which speeds up rebuilds when that step and everything before it are unchanged. Reproducibility and speed often align: when stable steps (like dependency installation) come before frequently changing ones (like copying your source code), you get faster rebuilds and fewer accidental changes.

Key instructions you will use

FROM: selects a base image. This is the starting point for your environment.
WORKDIR: sets the working directory for subsequent instructions.
COPY and ADD: bring files into the image. Prefer COPY for clarity.
RUN: executes commands at build time (install packages, compile assets, etc.).
ENV: sets environment variables inside the image.
EXPOSE: documents which port the containerized app listens on.
USER: switches to a non-root user for better security.
CMD and ENTRYPOINT: define the default process when the container starts.

Build-time vs run-time

RUN happens during image build; CMD/ENTRYPOINT happen when the container starts. Reproducible environments depend on doing as much as possible at build time (installing dependencies, compiling) and keeping run-time steps minimal and predictable.

Pinning: The Foundation of Reproducibility

If you want the same build result over time, you must control versions. The biggest sources of drift are base images and dependency downloads.

Pin base images

A tag like python:3.12-slim is better than python:latest, but it can still change as maintainers update the image. For stronger reproducibility, pin to a digest. A digest identifies the exact content of an image.

FROM python:3.12-slim@sha256:REPLACE_WITH_REAL_DIGEST

You can obtain the digest from your registry or by pulling and inspecting the image. Pinning to a digest ensures that the base layer does not silently change.
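
One way to look up the digest locally is sketched below. The helper name `print_pinned_from` is hypothetical, and the sketch assumes the Docker CLI is installed and the tag can be pulled from a registry (locally built images have no repo digest). It pulls the tag, reads the first `RepoDigests` entry, and prints a `FROM` line that could be pasted into a Dockerfile.

```shell
# Hypothetical helper: print a digest-pinned FROM line for an image tag.
# Assumes the Docker CLI is available and the tag can be pulled from a registry.
print_pinned_from() {
  image=$1
  # Pull quietly so only the FROM line reaches stdout.
  docker pull --quiet "$image" >/dev/null || return 1
  # RepoDigests entries look like "python@sha256:<hex>".
  repo_digest=$(docker image inspect --format '{{index .RepoDigests 0}}' "$image") || return 1
  # Keep the part after "@" and attach it to the original tag.
  echo "FROM ${image}@${repo_digest#*@}"
}
```

Calling `print_pinned_from python:3.12-slim` would print a line in the same shape as the placeholder above, with the real digest filled in.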

Pin OS packages and language dependencies

When you install system packages with a package manager, you are depending on a repository state that changes. You can mitigate this by:

Installing only what you need.
Using a stable distribution release (for example, Debian stable).
Pinning versions where feasible.
Keeping the package list explicit and consistent.

For language dependencies, use lock files (for example, package-lock.json, poetry.lock, requirements.txt with pinned versions). The lock file is often the single most important artifact for reproducible builds.
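
As a quick sanity check before building, a small shell helper can flag requirement lines that are not pinned to an exact version. This is only a sketch: `unpinned_requirements` is a hypothetical name, and it simply looks for `==` on each non-comment line.

```shell
# Hypothetical helper: list requirement lines that are not pinned with "==".
# Simplification: comments and blank lines are skipped; option lines such as
# "-r other.txt" would also be reported.
unpinned_requirements() {
  grep -vE '^[[:space:]]*(#|$)' "$1" | grep -v '=='
}
```

Running `unpinned_requirements requirements.txt` prints nothing when every entry is pinned.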

Step-by-Step: A Reproducible Python Web API Dockerfile

This example shows a typical pattern: install dependencies first (cached), then copy application code, then run. It also demonstrates non-root execution and predictable dependency installation.

Project layout (example)

myapi/
  app.py
  requirements.txt
  Dockerfile
  .dockerignore

requirements.txt (pinned)

flask==3.0.3
gunicorn==22.0.0

app.py (minimal)

from flask import Flask

app = Flask(__name__)

@app.get("/")
def home():
    return {"status": "ok"}

.dockerignore

A .dockerignore improves reproducibility and speed by preventing accidental files (like local virtual environments) from being sent to the build context.

__pycache__/
venv/
.venv/
.git/
.DS_Store

Dockerfile (reproducible pattern)

FROM python:3.12-slim
WORKDIR /app

# Prevent Python from writing .pyc files and enable unbuffered logs
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Install system dependencies (keep minimal)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
  && rm -rf /var/lib/apt/lists/*

# Create a non-root user
RUN useradd -m -u 10001 appuser

# Copy only dependency files first to leverage build cache
COPY requirements.txt /app/

# Install Python dependencies in a deterministic way
RUN pip install --no-cache-dir --upgrade pip \
  && pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code
COPY . /app/

# Switch to non-root
USER appuser
EXPOSE 8000
CMD ["gunicorn", "-b", "0.0.0.0:8000", "app:app"]

Notes on reproducibility choices:

Copy requirements first: if only your code changes, dependency installation stays cached.
--no-cache-dir: avoids pip caching inside the image, making builds smaller and less dependent on previous state.
Minimal apt packages: fewer moving parts means fewer surprises.
Non-root user: not directly about reproducibility, but a best practice for predictable and safer runtime behavior.

Build and run (step-by-step)

# From the myapi directory
docker build -t myapi:1.0 .

# Run it
docker run --rm -p 8000:8000 myapi:1.0

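When someone else builds the same Dockerfile with the same inputs (base image, requirements.txt, source), they should get the same versions of the listed dependencies and the same startup command. Transitive dependencies that are not listed in requirements.txt can still float, so for full control list them too (for example, by generating the file with pip freeze).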
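
To check this claim in practice, the installed package lists of two builds can be compared. The sketch below uses hypothetical helper names (`deps_of`, `same_deps`) and assumes both images contain pip and define their default process with CMD only, as the Dockerfile above does.

```shell
# Hypothetical helpers: compare the Python packages installed in two images.
# Assumes both images contain pip and set their default process with CMD only.
deps_of() {
  # Arguments after the image name replace CMD, so this runs "pip freeze".
  docker run --rm "$1" pip freeze | sort
}

same_deps() {
  first=$(deps_of "$1")
  second=$(deps_of "$2")
  # An empty list means the command failed, so treat it as a mismatch.
  [ -n "$first" ] && [ "$first" = "$second" ]
}
```

For example, `same_deps myapi:1.0 myapi:rebuilt && echo "same packages"`, where `myapi:rebuilt` stands for a second build of the same commit (an illustrative tag, not one defined in this chapter).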

Step-by-Step: Reproducible Node.js App with npm ci

Node projects often become non-reproducible when dependency resolution is allowed to float. The most reliable approach is to commit a lock file and use npm ci, which installs exactly what the lock file specifies.

Project layout (example)

nodeapp/
  package.json
  package-lock.json
  server.js
  Dockerfile
  .dockerignore

Dockerfile using npm ci

FROM node:20.11.1-bookworm-slim
WORKDIR /app
ENV NODE_ENV=production

# Copy dependency manifests first
COPY package.json package-lock.json /app/

# Install exactly what's in package-lock.json
RUN npm ci --omit=dev

# Copy application code
COPY server.js /app/

EXPOSE 3000
CMD ["node", "server.js"]

Why this is reproducible:

node:20.11.1-... pins the major/minor/patch runtime.
npm ci fails if the lock file and package.json are out of sync, preventing accidental drift.
Copying lock files first ensures caching works well.

Multi-Stage Builds: Same Result, Smaller and Cleaner Images

Multi-stage builds let you separate “build tools” from “runtime.” This improves reproducibility by making the runtime image contain only what is needed to run, not whatever happened to be installed during compilation. It also reduces the chance that build-only tools affect runtime behavior.

Step-by-step: Build a small static site with a builder stage

Imagine a frontend project that produces static files in a dist/ directory. You can build with Node and serve with a minimal web server image.

# Stage 1: build assets
FROM node:20.11.1-bookworm-slim AS builder
WORKDIR /src
COPY package.json package-lock.json /src/
RUN npm ci
COPY . /src/
RUN npm run build

# Stage 2: serve static files
FROM nginx:1.27.0-alpine
WORKDIR /usr/share/nginx/html
COPY --from=builder /src/dist/ /usr/share/nginx/html/
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

Reproducibility benefits:

The runtime stage is isolated from build dependencies.
The final image is smaller and has fewer packages that could change behavior.
Each stage can be pinned independently.

Controlling Build Context: .dockerignore and Deterministic Copies

The “build context” is what Docker sends to the builder. If your context includes random local files, you can accidentally bake machine-specific artifacts into the image. This harms reproducibility and can leak secrets.

Common items to ignore

Local dependency folders: node_modules/, venv/, .venv/
Build outputs: dist/, build/ (unless you intentionally copy them)
VCS metadata: .git/
Editor/OS files: .DS_Store
Secrets: .env (prefer passing env vars at runtime or using secret mechanisms)
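
A rough way to confirm that a project's .dockerignore covers the items above is sketched here. The helper name `missing_ignores` is hypothetical, and the check is deliberately naive: it only looks for exact lines.

```shell
# Hypothetical helper: report expected patterns that .dockerignore lacks.
# Simplification: exact line matches only; globs and "!" negations are not interpreted.
missing_ignores() {
  file=$1
  shift
  for pattern in "$@"; do
    grep -qxF -- "$pattern" "$file" || echo "$pattern"
  done
}
```

For example, `missing_ignores .dockerignore .git/ .env node_modules/` prints each of the three patterns that the file does not list.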

Prefer explicit COPY

Instead of copying everything early, copy only what you need for each step. For example, copy dependency manifests first, then install, then copy application code. This reduces the chance that unrelated file changes trigger dependency reinstalls or change the build outcome.

Environment Variables and Configuration: Keep Builds Predictable

Environment variables can make behavior vary between runs. For reproducibility, decide what should be baked into the image and what should be provided at runtime.

Guidelines

Build-time defaults: use ENV for stable defaults that should be the same everywhere (for example, PYTHONUNBUFFERED=1).
Runtime configuration: pass environment variables at docker run time for values that differ per environment (for example, database URLs, API keys).
Avoid copying local .env files into images. It makes builds depend on local state and risks leaking secrets.

Example: runtime configuration without rebuilding

# Build once
docker build -t myapi:1.0 .

# Run with different settings without changing the image
docker run --rm -p 8000:8000 -e LOG_LEVEL=debug myapi:1.0

Reducing Non-Determinism in RUN Steps

RUN steps often download content from the internet. If those downloads are not pinned, your build can change over time.

Best practices

Avoid curl | bash installers unless you pin versions and verify checksums.
Pin artifact versions (for example, download a specific release tarball).
Verify integrity with checksums when practical.
Keep RUN commands explicit and avoid relying on interactive defaults.

Example: downloading a pinned binary with checksum verification

RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
  && rm -rf /var/lib/apt/lists/*

RUN curl -fsSL -o /tmp/tool.tar.gz https://example.com/tool-1.2.3-linux-amd64.tar.gz \
  && echo "REPLACE_WITH_SHA256  /tmp/tool.tar.gz" | sha256sum -c - \
  && tar -xzf /tmp/tool.tar.gz -C /usr/local/bin \
  && rm /tmp/tool.tar.gz

This pattern makes the build fail if the downloaded file changes unexpectedly.
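
The checksum mechanism can be tried outside Docker. The sketch below uses a throwaway file in place of the real archive (the file content and the `verify` helper are made up for illustration) and assumes the GNU coreutils `sha256sum` command is available. It records a checksum, verifies the file against it, then modifies the file to show the check failing.

```shell
# Demo on a throwaway file that stands in for the downloaded tool.tar.gz.
workdir=$(mktemp -d)
printf 'pretend release archive\n' > "$workdir/tool.tar.gz"

# 1. Record the checksum once, from a copy you trust.
expected=$(sha256sum "$workdir/tool.tar.gz" | cut -d ' ' -f 1)

# 2. Check a file against it, using the same "HASH  FILE" input as the RUN step.
verify() {
  echo "$expected  $1" | sha256sum -c - >/dev/null 2>&1
}

verify "$workdir/tool.tar.gz" && echo "unchanged: check passes"

# 3. Any change to the file makes the check fail.
printf 'extra bytes\n' >> "$workdir/tool.tar.gz"
verify "$workdir/tool.tar.gz" || echo "modified: check fails"

rm -rf "$workdir"
```

The value stored in `expected` is what would replace REPLACE_WITH_SHA256 in the Dockerfile. Inside a RUN step, the failing check returns a non-zero exit status, which stops the build.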

Choosing CMD vs ENTRYPOINT for Reproducible Startup

Reproducible environments also mean reproducible startup behavior. Containers should start the same way unless the operator intentionally overrides it.

Common patterns

CMD only: easy to override the command when running the container.
ENTRYPOINT + CMD: ENTRYPOINT defines the fixed executable; CMD provides default arguments.

Example: ENTRYPOINT with default args

ENTRYPOINT ["gunicorn"]
CMD ["-b", "0.0.0.0:8000", "app:app"]

This makes it easy to override arguments while keeping the executable consistent.
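
As an illustration of overriding only the arguments, the sketch below wraps `docker run` in a hypothetical helper, `run_on_port`. It assumes an image built with the ENTRYPOINT and CMD shown above.

```shell
# Hypothetical wrapper: start the image on a different port.
# Arguments placed after the image name replace CMD; ENTRYPOINT (gunicorn) stays.
run_on_port() {
  image=$1
  port=$2
  docker run --rm -p "$port:$port" "$image" -b "0.0.0.0:$port" app:app
}
```

`run_on_port myapi:1.0 9000` would start gunicorn bound to port 9000. Because the extra arguments replace CMD as a whole, `app:app` has to be supplied again. Changing the executable itself requires the explicit `--entrypoint` flag of `docker run`, which keeps accidental overrides unlikely.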

Reproducible Builds Across Teams: Practical Checklist

Use this checklist when writing or reviewing a Dockerfile intended for consistent builds across machines and over time.

Base image pinned to a specific version (and ideally a digest).
Dependencies pinned via lock files or explicit versions.
Minimal system packages installed; remove package lists after install.
Use .dockerignore to keep context clean and stable.
Copy dependency manifests first, then install, then copy source.
Multi-stage builds for compiled assets or binaries.
Non-root user for runtime consistency and safety.
Runtime configuration passed via environment variables, not baked from local files.
Downloads pinned and verified when possible.
Startup command defined clearly with CMD/ENTRYPOINT.
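
The first checklist item lends itself to a quick automated check. The sketch below is a naive grep-based helper with a hypothetical name, `unpinned_from_lines`; it is not a full Dockerfile parser.

```shell
# Hypothetical helper: list FROM lines that are not pinned to a digest.
# Simplification: a later stage that refers to an earlier stage by name
# (for example "FROM builder") is reported too.
unpinned_from_lines() {
  grep -iE '^[[:space:]]*FROM[[:space:]]' "$1" | grep -v '@sha256:'
}
```

Running `unpinned_from_lines Dockerfile` prints nothing when every base image carries a digest.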

Hands-On Exercise: Turn a “Messy” Dockerfile into a Reproducible One

Below is an intentionally problematic Dockerfile. It might work today, but it is likely to drift and rebuild unpredictably.

Problematic Dockerfile

FROM python:latest
RUN apt-get update
RUN apt-get install -y curl
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD python app.py

Issues to spot

python:latest can change at any time.
Running apt-get update and apt-get install in separate RUN steps lets a cached, stale package index be reused by a later install, so the two can fall out of sync.
No cleanup of apt lists, leading to larger images.
COPYing everything before installing dependencies reduces caching and can bake in unwanted files.
pip install may resolve floating versions if requirements are not pinned.
Runs as root by default.

Improved reproducible Dockerfile (step-by-step rewrite)

FROM python:3.12-slim
WORKDIR /app

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Install only what you need, in one layer, and clean up
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
  && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN useradd -m -u 10001 appuser

# Copy dependency file(s) first
COPY requirements.txt /app/

# Install dependencies deterministically (requires pinned versions in requirements.txt)
RUN pip install --no-cache-dir --upgrade pip \
  && pip install --no-cache-dir -r requirements.txt

# Copy application code last
COPY . /app/

USER appuser
CMD ["python", "app.py"]

If you want to take this further, pin the base image by digest and ensure requirements.txt contains exact versions. The goal is that two developers building the same commit get the same environment.

Now answer the exercise about the content:

Which Dockerfile change most directly improves reproducibility by preventing the base environment from silently changing over time?

Answer: pinning the base image to a digest. A digest fixes the exact image content, so future builds do not change if the tag is updated. This reduces drift and helps ensure consistent builds across machines and time.