Why Dockerfiles Matter for Reproducible Environments
A reproducible environment means that the same project can be built and run in the same way today, tomorrow, and on another machine, producing consistent behavior. In practice, reproducibility reduces “works on my machine” problems by making the environment explicit: operating system base, system packages, language runtime, dependencies, configuration files, and startup commands.
A Dockerfile is a build recipe that describes how to create an image in an explicit, reviewable way. Instead of manually setting up a server or relying on undocumented local steps, you encode the environment as a sequence of instructions. When you build from that Dockerfile, you get an image that can be run anywhere Docker is available.
Reproducibility is not automatic just because you use Docker. You still need to make careful choices: pin versions, avoid “latest” tags, control dependency resolution, and structure your Dockerfile to minimize surprises. This chapter focuses on those practices and shows step-by-step Dockerfiles you can adapt.
How Dockerfile Instructions Create a Deterministic Build
A Dockerfile is processed top to bottom. Instructions that modify the filesystem (RUN, COPY, ADD) each create a new layer, while the others record metadata. Docker caches the result of each step and reuses it on rebuild as long as that step and everything before it is unchanged. Reproducibility and speed often align: when you structure steps so that stable parts (like dependency installation) happen before frequently changing parts (like copying your source code), you get faster rebuilds and fewer accidental changes.
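The effect of instruction order on caching can be illustrated with a toy model. The sketch below is not Docker's actual implementation: it assumes each step's cache key is a hash of the previous key, the instruction text, and a fingerprint of any copied files. The function names and fingerprint strings are hypothetical.

```python
import hashlib


def cache_keys(steps):
    """Return one cache key per step; each key chains on the previous one.

    `steps` is a list of (instruction, input_fingerprint) pairs. This is a
    simplified model of layer caching, not Docker's real algorithm.
    """
    keys = []
    previous = ""
    for instruction, fingerprint in steps:
        digest = hashlib.sha256(
            f"{previous}|{instruction}|{fingerprint}".encode()
        ).hexdigest()
        keys.append(digest)
        previous = digest
    return keys


def reused_steps(old_steps, new_steps):
    """Count how many leading steps keep the same key between two builds."""
    count = 0
    for old, new in zip(cache_keys(old_steps), cache_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count


# Manifest copied before source: a source edit leaves the install step cached.
good_v1 = [("COPY requirements.txt", "reqs-v1"), ("RUN pip install", ""), ("COPY . /app", "src-v1")]
good_v2 = [("COPY requirements.txt", "reqs-v1"), ("RUN pip install", ""), ("COPY . /app", "src-v2")]

# Everything copied first: the same source edit invalidates the install step.
bad_v1 = [("COPY . /app", "src-v1"), ("RUN pip install", "")]
bad_v2 = [("COPY . /app", "src-v2"), ("RUN pip install", "")]

print(reused_steps(good_v1, good_v2))  # 2
print(reused_steps(bad_v1, bad_v2))    # 0
```

In this model a change to the source fingerprint only invalidates steps at or after the COPY that reads it, which is why dependency manifests are usually copied on their own first.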
Key instructions you will use
FROM: selects a base image. This is the starting point for your environment.
WORKDIR: sets the working directory for subsequent instructions.
COPY and ADD: bring files into the image. Prefer COPY for clarity.
RUN: executes commands at build time (install packages, compile assets, etc.).
ENV: sets environment variables inside the image.
EXPOSE: documents which port the containerized app listens on.
USER: switches to a non-root user for better security.
CMD and ENTRYPOINT: define the default process when the container starts.
Build-time vs run-time
RUN happens during image build; CMD/ENTRYPOINT happen when the container starts. Reproducible environments depend on doing as much as possible at build time (installing dependencies, compiling) and keeping run-time steps minimal and predictable.
Pinning: The Foundation of Reproducibility
If you want the same build result over time, you must control versions. The biggest sources of drift are base images and dependency downloads.
Pin base images
A tag like python:3.12-slim is better than python:latest, but it can still change as maintainers update the image. For stronger reproducibility, pin to a digest, which identifies the exact image content.

    FROM python:3.12-slim@sha256:REPLACE_WITH_REAL_DIGEST

You can obtain the digest from your registry or by pulling and inspecting the image. Pinning to a digest ensures that the base layer does not silently change.
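To make the three levels of pinning concrete, here is a minimal sketch that classifies an image reference as floating, tag-pinned, or digest-pinned. The function names are hypothetical, and the parsing is simplified: it assumes no registry host with a port in the reference.

```python
def parse_image_ref(ref):
    """Split an image reference into (name, tag, digest).

    Simplified: assumes no registry host with a port (e.g. localhost:5000/img).
    """
    name_and_tag, _, digest = ref.partition("@")
    name, _, tag = name_and_tag.partition(":")
    return name, tag or None, digest or None


def pinning_level(ref):
    """Classify how tightly a FROM reference is pinned."""
    _, tag, digest = parse_image_ref(ref)
    if digest:
        return "digest"
    if tag and tag != "latest":
        return "tag"
    return "floating"


for ref in (
    "python",
    "python:latest",
    "python:3.12-slim",
    "python:3.12-slim@sha256:REPLACE_WITH_REAL_DIGEST",
):
    print(ref, "->", pinning_level(ref))
```

A check like this could run in code review or CI to flag base images that are not yet pinned as tightly as the team wants.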
Pin OS packages and language dependencies
When you install system packages with a package manager, you are depending on a repository state that changes. You can mitigate this by:
Installing only what you need.
Using a stable distribution release (for example, Debian stable).
Pinning versions where feasible.
Keeping the package list explicit and consistent.
For language dependencies, use lock files (for example, package-lock.json, poetry.lock, requirements.txt with pinned versions). The lock file is often the single most important artifact for reproducible builds.
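As a rough illustration of what "pinned" means for a requirements.txt, the sketch below flags lines that do not use an exact == version. It is deliberately simplified (a hypothetical helper, not part of pip): it skips pip option lines and does not understand environment markers or hashes.

```python
import re

# Matches "name==version", optionally with extras such as name[extra]==1.0
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s*]+$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned with an exact '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


sample = """\
flask==3.0.3
gunicorn>=22.0   # range, can drift
requests
"""
print(unpinned_requirements(sample))  # ['gunicorn>=22.0', 'requests']
```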
Step-by-Step: A Reproducible Python Web API Dockerfile
This example shows a typical pattern: install dependencies first (cached), then copy application code, then run. It also demonstrates non-root execution and predictable dependency installation.
Project layout (example)
    myapi/
      app.py
      requirements.txt
      Dockerfile
      .dockerignore

requirements.txt (pinned)

    flask==3.0.3
    gunicorn==22.0.0

app.py (minimal)

    from flask import Flask

    app = Flask(__name__)

    @app.get("/")
    def home():
        return {"status": "ok"}

.dockerignore
A .dockerignore improves reproducibility and speed by preventing accidental files (like local virtual environments) from being sent to the build context.
    __pycache__/
    venv/
    .venv/
    .git/
    .DS_Store

Dockerfile (reproducible pattern)

    FROM python:3.12-slim
    WORKDIR /app

    # Prevent Python from writing .pyc files and enable unbuffered logs
    ENV PYTHONDONTWRITEBYTECODE=1
    ENV PYTHONUNBUFFERED=1

    # Install system dependencies (keep minimal)
    RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        && rm -rf /var/lib/apt/lists/*

    # Create a non-root user
    RUN useradd -m -u 10001 appuser

    # Copy only dependency files first to leverage build cache
    COPY requirements.txt /app/

    # Install Python dependencies in a deterministic way
    RUN pip install --no-cache-dir --upgrade pip \
        && pip install --no-cache-dir -r requirements.txt

    # Copy the rest of the application code
    COPY . /app/

    # Switch to non-root
    USER appuser
    EXPOSE 8000
    CMD ["gunicorn", "-b", "0.0.0.0:8000", "app:app"]

Notes on reproducibility choices:
Copy requirements first: if only your code changes, dependency installation stays cached.
--no-cache-dir: stops pip from storing its download cache in the image, keeping the image smaller and the install independent of previously cached files.
Minimal apt packages: fewer moving parts means fewer surprises.
Non-root user: not directly about reproducibility, but a best practice for predictable and safer runtime behavior.
Build and run (step-by-step)
    # From the myapi directory
    docker build -t myapi:1.0 .

    # Run it
    docker run --rm -p 8000:8000 myapi:1.0

When someone else builds the same Dockerfile with the same inputs (base image, requirements.txt, source), they should get the same dependency versions and the same startup command.
Step-by-Step: Reproducible Node.js App with npm ci
Node projects often become non-reproducible when dependency resolution is allowed to float. The most reliable approach is to commit a lock file and use npm ci, which installs exactly what the lock file specifies.
Project layout (example)
    nodeapp/
      package.json
      package-lock.json
      server.js
      Dockerfile
      .dockerignore

Dockerfile using npm ci

    FROM node:20.11.1-bookworm-slim
    WORKDIR /app
    ENV NODE_ENV=production

    # Copy dependency manifests first
    COPY package.json package-lock.json /app/

    # Install exactly what's in package-lock.json
    RUN npm ci --omit=dev

    # Copy application code
    COPY server.js /app/

    EXPOSE 3000
    CMD ["node", "server.js"]

Why this is reproducible:
node:20.11.1-... pins the major/minor/patch runtime.
npm ci fails if the lock file and package.json are out of sync, preventing accidental drift.
Copying lock files first ensures caching works well.
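To give a feel for what "out of sync" means, the sketch below compares the dependency ranges in package.json with the copy recorded under the root entry of the lock file's "packages" map. It is only a rough approximation that assumes lockfileVersion 2 or 3; npm's own check is stricter, and the package names in the demo are hypothetical.

```python
import json


def out_of_sync(package_json_text, package_lock_text):
    """Return dependency names whose range differs between the two files.

    Compares package.json against the root entry ("") of the lock file's
    "packages" map. This is an approximation, not npm's real algorithm.
    """
    manifest = json.loads(package_json_text)
    lock = json.loads(package_lock_text)
    locked_root = lock.get("packages", {}).get("", {})
    mismatched = []
    for section in ("dependencies", "devDependencies"):
        wanted = manifest.get(section, {})
        recorded = locked_root.get(section, {})
        for name in sorted(set(wanted) | set(recorded)):
            if wanted.get(name) != recorded.get(name):
                mismatched.append(name)
    return mismatched


# Hypothetical manifest: left-pad was added without regenerating the lock file.
package_json = json.dumps(
    {"dependencies": {"express": "^4.19.2", "left-pad": "^1.3.0"}}
)
package_lock = json.dumps(
    {"lockfileVersion": 3, "packages": {"": {"dependencies": {"express": "^4.19.2"}}}}
)
print(out_of_sync(package_json, package_lock))  # ['left-pad']
```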
Multi-Stage Builds: Same Result, Smaller and Cleaner Images
Multi-stage builds let you separate “build tools” from “runtime.” This improves reproducibility by making the runtime image contain only what is needed to run, not whatever happened to be installed during compilation. It also reduces the chance that build-only tools affect runtime behavior.
Step-by-step: Build a small static site with a builder stage
Imagine a frontend project that produces static files in a dist/ directory. You can build with Node and serve with a minimal web server image.
    # Stage 1: build assets
    FROM node:20.11.1-bookworm-slim AS builder
    WORKDIR /src
    COPY package.json package-lock.json /src/
    RUN npm ci
    COPY . /src/
    RUN npm run build

    # Stage 2: serve static files
    FROM nginx:1.27.0-alpine
    WORKDIR /usr/share/nginx/html
    COPY --from=builder /src/dist/ /usr/share/nginx/html/
    EXPOSE 80
    CMD ["nginx", "-g", "daemon off;"]

Reproducibility benefits:
The runtime stage is isolated from build dependencies.
The final image is smaller and has fewer packages that could change behavior.
Each stage can be pinned independently.
Controlling Build Context: .dockerignore and Deterministic Copies
The “build context” is what Docker sends to the builder. If your context includes random local files, you can accidentally bake machine-specific artifacts into the image. This harms reproducibility and can leak secrets.
Common items to ignore
Local dependency folders: node_modules/, venv/, .venv/
Build outputs: dist/, build/ (unless you intentionally copy them)
VCS metadata: .git/
Editor/OS files: .DS_Store
Secrets: .env (prefer passing env vars at runtime or using secret mechanisms)
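A list like this is easy to let slip, so a small review-time check can help. The sketch below reports which recommended entries are absent from a .dockerignore. It is a hypothetical helper that compares entries literally and does not evaluate glob patterns, so treat it as a starting point only.

```python
RECOMMENDED = ["node_modules/", "venv/", ".venv/", ".git/", ".DS_Store", ".env"]


def missing_ignores(dockerignore_text, recommended=RECOMMENDED):
    """Return recommended entries that do not appear in a .dockerignore."""
    present = set()
    for raw in dockerignore_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            present.add(line.rstrip("/"))  # treat "venv" and "venv/" alike
    return [entry for entry in recommended if entry.rstrip("/") not in present]


current = """\
__pycache__/
venv/
.venv/
.git/
.DS_Store
"""
print(missing_ignores(current))  # ['node_modules/', '.env']
```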
Prefer explicit COPY
Instead of copying everything early, copy only what you need for each step. For example, copy dependency manifests first, then install, then copy application code. This reduces the chance that unrelated file changes trigger dependency reinstalls or change the build outcome.
Environment Variables and Configuration: Keep Builds Predictable
Environment variables can make behavior vary between runs. For reproducibility, decide what should be baked into the image and what should be provided at runtime.
Guidelines
Build-time defaults: use ENV for stable defaults that should be the same everywhere (for example, PYTHONUNBUFFERED=1).
Runtime configuration: pass environment variables at docker run time for values that differ per environment (for example, database URLs, API keys).
Avoid copying local .env files into images. It makes builds depend on local state and risks leaking secrets.
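On the application side, this usually means reading settings from the environment with explicit defaults, so the same image behaves predictably when a variable is not set. A minimal sketch follows; the variable names LOG_LEVEL and DATABASE_URL and the function name are illustrative assumptions.

```python
import os


def load_settings(environ=os.environ):
    """Read runtime settings from environment variables with safe defaults."""
    return {
        "log_level": environ.get("LOG_LEVEL", "info").lower(),
        # DATABASE_URL is a hypothetical setting; None means "not configured".
        "database_url": environ.get("DATABASE_URL"),
    }


# Passing a plain dict stands in for the container's environment.
print(load_settings({}))
print(load_settings({"LOG_LEVEL": "DEBUG"}))
```

Accepting the environment as a parameter keeps the function easy to test without touching the real process environment.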
Example: runtime configuration without rebuilding
    # Build once
    docker build -t myapi:1.0 .

    # Run with different settings without changing the image
    docker run --rm -p 8000:8000 -e LOG_LEVEL=debug myapi:1.0

Reducing Non-Determinism in RUN Steps
RUN steps often download content from the internet. If those downloads are not pinned, your build can change over time.
Best practices
Avoid curl | bash installers unless you pin versions and verify checksums.
Pin artifact versions (for example, download a specific release tarball).
Verify integrity with checksums when practical.
Keep RUN commands explicit and avoid relying on interactive defaults.
Example: downloading a pinned binary with checksum verification
    RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
        && rm -rf /var/lib/apt/lists/*

    RUN curl -fsSL -o /tmp/tool.tar.gz https://example.com/tool-1.2.3-linux-amd64.tar.gz \
        && echo "REPLACE_WITH_SHA256  /tmp/tool.tar.gz" | sha256sum -c - \
        && tar -xzf /tmp/tool.tar.gz -C /usr/local/bin \
        && rm /tmp/tool.tar.gz

This pattern makes the build fail if the downloaded file changes unexpectedly.
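The same verify-before-use idea applies to any script that fetches artifacts. As a minimal sketch using only the standard library (the helper names are hypothetical, and a throwaway temp file stands in for the downloaded artifact):

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")


# Demo on a throwaway file standing in for a downloaded artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
artifact = tmp.name
expected = hashlib.sha256(b"hello").hexdigest()

verify(artifact, expected)      # matching digest: passes silently
try:
    verify(artifact, "0" * 64)  # wrong digest: raises
except ValueError:
    print("mismatch detected")
os.remove(artifact)
```

The expected digest should come from a trusted source and be committed alongside the build files, not computed from the download itself as the demo does.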
Choosing CMD vs ENTRYPOINT for Reproducible Startup
Reproducible environments also mean reproducible startup behavior. Containers should start the same way unless the operator intentionally overrides it.
Common patterns
CMD only: easy to override the command when running the container.
ENTRYPOINT + CMD: ENTRYPOINT defines the fixed executable; CMD provides default arguments.
Example: ENTRYPOINT with default args
    ENTRYPOINT ["gunicorn"]
    CMD ["-b", "0.0.0.0:8000", "app:app"]

This makes it easy to override arguments while keeping the executable consistent.
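How the two instructions combine can be modeled in a few lines. The sketch below covers only the exec (JSON array) form and does not model overriding the entrypoint itself; the function name is hypothetical.

```python
def effective_command(entrypoint, cmd, run_args=None):
    """Model the process argv for exec-form ENTRYPOINT and CMD.

    Arguments given after the image name in `docker run` replace CMD;
    ENTRYPOINT stays in place (overriding it is not modeled here).
    """
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


entrypoint = ["gunicorn"]
cmd = ["-b", "0.0.0.0:8000", "app:app"]

# No extra arguments: CMD supplies the defaults.
print(effective_command(entrypoint, cmd))
# Extra arguments replace CMD but keep the executable.
print(effective_command(entrypoint, cmd, ["-b", "0.0.0.0:9000", "app:app"]))
# With CMD only, extra arguments replace the whole command.
print(effective_command([], ["python", "app.py"], ["sh"]))
```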
Reproducible Builds Across Teams: Practical Checklist
Use this checklist when writing or reviewing a Dockerfile intended for consistent builds across machines and over time.
Base image pinned to a specific version (and ideally a digest).
Dependencies pinned via lock files or explicit versions.
Minimal system packages installed; remove package lists after install.
Use .dockerignore to keep context clean and stable.
Copy dependency manifests first, then install, then copy source.
Multi-stage builds for compiled assets or binaries.
Non-root user for runtime consistency and safety.
Runtime configuration passed via environment variables, not baked from local files.
Downloads pinned and verified when possible.
Startup command defined clearly with CMD/ENTRYPOINT.
Hands-On Exercise: Turn a “Messy” Dockerfile into a Reproducible One
Below is an intentionally problematic Dockerfile. It might work today, but it is likely to drift and rebuild unpredictably.
Problematic Dockerfile
    FROM python:latest
    RUN apt-get update
    RUN apt-get install -y curl
    COPY . /app
    WORKDIR /app
    RUN pip install -r requirements.txt
    CMD python app.py

Issues to spot
python:latest can change at any time.
Separate apt-get update and apt-get install can cause cache and repo mismatch problems.
No cleanup of apt lists, leading to larger images.
COPYing everything before installing dependencies reduces caching and can bake in unwanted files.
pip install may resolve floating versions if requirements are not pinned.
Runs as root by default.
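Several of these issues can be caught mechanically. The sketch below is a deliberately naive line-by-line check, not a real Dockerfile parser: it ignores line continuations and build flags, and the function name and warning texts are hypothetical.

```python
def lint_dockerfile(text):
    """Return warnings for a few common reproducibility problems."""
    warnings = []
    copied_everything = False
    saw_user = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, rest = line.partition(" ")
        keyword, rest = keyword.upper(), rest.strip()
        if keyword == "FROM":
            image = (rest.split() or [""])[0]
            if image.split("@")[0].endswith(":latest"):
                warnings.append("base image uses the latest tag")
        elif keyword == "RUN":
            if "apt-get update" in rest and "apt-get install" not in rest:
                warnings.append("apt-get update is separate from apt-get install")
            if "pip install" in rest and copied_everything:
                warnings.append("dependencies installed after COPY of the whole context")
        elif keyword == "COPY" and rest.split()[:1] == ["."]:
            copied_everything = True
        elif keyword == "USER":
            saw_user = True
    if not saw_user:
        warnings.append("no USER instruction")
    return warnings


messy = """\
FROM python:latest
RUN apt-get update
RUN apt-get install -y curl
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD python app.py
"""
for warning in lint_dockerfile(messy):
    print(warning)
```

Run against the problematic Dockerfile, this reports four warnings; it cannot tell whether requirements.txt is pinned or whether apt lists are cleaned up, so those still need a human reviewer.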
Improved reproducible Dockerfile (step-by-step rewrite)
    FROM python:3.12-slim
    WORKDIR /app
    ENV PYTHONDONTWRITEBYTECODE=1
    ENV PYTHONUNBUFFERED=1

    # Install only what you need, in one layer, and clean up
    RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        && rm -rf /var/lib/apt/lists/*

    # Create non-root user
    RUN useradd -m -u 10001 appuser

    # Copy dependency file(s) first
    COPY requirements.txt /app/

    # Install dependencies deterministically (requires pinned versions in requirements.txt)
    RUN pip install --no-cache-dir --upgrade pip \
        && pip install --no-cache-dir -r requirements.txt

    # Copy application code last
    COPY . /app/
    USER appuser
    CMD ["python", "app.py"]

If you want to take this further, pin the base image by digest and ensure requirements.txt contains exact versions. The goal is that two developers building the same commit get the same environment.