In this blog, I share how I reduced a Kubeflow pipeline Docker image from 3.17 GB down to 354 MB (an 89% reduction), and the reasoning behind every change I made.
A little backstory.
As part of our MLOps workflow, I took over an existing Kubeflow pipeline developed by another ML engineer.
While working on the pipeline, I checked the image size: 3.17 GB. In a discussion with the DevOps team, we agreed this was unacceptable for a pipeline that runs on every model retrain.
So my goals were simple:
- Reduce the size of the Docker image significantly
- Make sure the pipeline still runs without errors
Here is what I learned along the way, including a few things I wish I had known before I started.
The Starting Point: A Bloated 3.17 GB Docker Image
To run my Kubeflow pipeline, I need to install the packages and dependencies listed in my requirements.txt file.
This was the requirements.txt file before optimization:
# training
scikit-learn
numpy
pandas
pandera
matplotlib
seaborn
boto3
python-dotenv
# (optional)
pyarrow
joblib
# Feast/pandas can read s3:// URLs:
s3fs
# kubeflow
kfp
kfp-kubernetes
kubernetes
kubeflow
kubeflow-training
# mlflow
mlflow
# API / Serving
flask
flask-cors
# feast-feature-store
feast
feast[redis]
feast[postgres]
torch
And this was the Dockerfile before optimization:
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y git && apt-get clean
# Set working directory
WORKDIR /app
# Copy entire project
COPY . .
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Install your project as a package
RUN pip install --no-cache-dir .
When I ran the Docker build command, it produced a 3.17 GB image and took nearly 800 seconds (about 13 minutes) to build.
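For reference, the build itself was a plain docker build; the tag matches the image listing below:

```bash
docker build -t kubeflow_pipeline:v1 .
```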
$ docker images
IMAGE                  ID             DISK USAGE   CONTENT SIZE
kubeflow_pipeline:v1   09e88c766290   9.47GB       3.17GB
[+] Building 797.2s (12/12) FINISHED
Audit Packages and Dependencies
Let's look at the history of the image build, so we can see which layer contributes the most to the size, and then remove the packages we don't need at runtime.
For that, we need to run this command:
docker history kubeflow_pipeline:v1 --no-trunc --format "table {{.Size}}\t{{.CreatedBy}}"
Here is the output:
$ docker history kubeflow_pipeline:v1 --no-trunc --format "table {{.Size}}\t{{.CreatedBy}}"
SIZE CREATED BY
319kB RUN /bin/sh -c pip install --no-cache-dir . # buildkit
5.98GB RUN /bin/sh -c pip install --no-cache-dir -r requirements.txt # buildkit
10.3MB COPY . . # buildkit
8.19kB WORKDIR /app
138MB RUN /bin/sh -c apt-get update && apt-get install -y git && apt-get clean # buildkit
0B CMD ["python3"]
As you can see, the requirements.txt layer in the Dockerfile takes up by far the most space.
Now, let's go through the strategies for image optimization step by step.
Three Steps for Image Optimization
To reduce the size of my Docker image, I followed a few key steps.
- Package and dependency optimization
- Multi-stage build (separate build artifacts from runtime)
- .dockerignore file
Step 1: Package and Dependency Optimization
During development, the previous ML engineer added many dependencies to the requirements.txt file. On analysis, I found that many of those packages are not required for the Kubeflow training pipeline.
So I removed the unwanted and duplicate packages and dependencies from the requirements.txt file.
Now you might wonder:
How do I figure out that a specific package is not needed in my project?
There are two ways to find out.
Grep method: We can use the grep command to check whether a package installed from the requirements.txt file is actually used anywhere in the project:
kubeflow-training-pipeline % grep -R "import numpy" .
You will get output like:
./src/model_development/_09_evaluation.py:import numpy as np
This tells us that the installed package numpy is imported and used at that specific path.
In this way, I removed some of the unwanted packages, like torch, from the requirements.txt file.
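To repeat this check for every entry in requirements.txt at once, a small shell loop works as a rough first pass. This is only a sketch: it assumes one package per line, searches a src/ directory (adjust the path to your repo), and assumes the import name matches the package name, which is not always true (scikit-learn, for instance, is imported as sklearn):

```bash
# Rough audit: grep the source tree for each requirement.
# Caveat: import names don't always match package names.
while read -r line; do
  # Strip comments, extras ([redis]), and version pins.
  pkg=$(echo "$line" | sed -e 's/#.*//' -e 's/\[.*//' -e 's/[<>=].*//' | tr -d '[:space:]')
  [ -z "$pkg" ] && continue
  if ! grep -Rq -e "import ${pkg}" -e "from ${pkg}" src/; then
    echo "no direct import found for: ${pkg}"
  fi
done < requirements.txt
```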
But we should not rely on the grep command alone. When we run grep for kfp-kubernetes, it shows no output, yet we should not remove kfp-kubernetes from the requirements.txt file.
That's because some packages are needed for:
- pipeline compilation
- kubernetes execution
- SDK behavior
and these are not necessarily imported in .py files.
That brings us to the second method.
Validation method: Remove a package from the requirements.txt file, rebuild the container, and run the pipeline.
For example, I removed kfp-kubernetes, rebuilt the container, and ran the pipeline.
If I see any errors like:
- Component not found
- Kubernetes configuration issues
- Pipeline execution failed
then I know the kfp-kubernetes package really is required by our Kubeflow pipeline, and I put it back.
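In practice the cycle looks roughly like this. The entry-point script _kubeflow/compile_pipeline.py is a placeholder; substitute whatever script compiles and submits your pipeline:

```bash
# Sketch of the remove -> rebuild -> run cycle for one package.
sed -i.bak '/^kfp-kubernetes/d' requirements.txt   # temporarily drop the package
docker build -t kubeflow_pipeline:test .           # rebuild the image
python _kubeflow/compile_pipeline.py               # placeholder: compile and submit the pipeline
# If the run fails with component or Kubernetes errors, restore the file:
mv requirements.txt.bak requirements.txt
```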
Now let's see which packages and dependencies were removed during optimization, and why.
Let's go through them one by one:
- I removed torch because Torch alone can weigh anywhere from roughly 700 MB to 2 GB. If the pipeline component is only responsible for data preprocessing, model evaluation, or orchestration, a GPU training package like Torch is not needed inside the image.
- matplotlib and seaborn are data visualization libraries; visualization belongs in notebooks or a separate analysis step. Removing them saves space, since matplotlib carries several heavy dependencies of its own.
- kubeflow and kubeflow-training are also not needed; kfp and kfp-kubernetes already give us everything we need to define, compile, and run Kubeflow pipelines. The kubeflow and kubeflow-training packages are high-level SDKs for training operators like PyTorchJob or TFJob.
- flask and flask-cors are web frameworks used for serving APIs. A Kubeflow training pipeline container is not an API server; it just runs a job and exits.
- Before optimization, feast was duplicated across three separate lines:
feast
feast[redis]
feast[postgres]
We consolidated it into a single line, feast[redis,postgres], because multiple lines for the same package can cause conflicts and redundant installs.
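After consolidating the extras, pip's built-in dependency checker is a quick way to confirm the environment still resolves cleanly:

```bash
pip install -r requirements.txt
pip check   # prints "No broken requirements found." on success
```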
And after optimization, my final requirements.txt file looks like this:
numpy
pandas
scikit-learn
joblib
pandera
python-dotenv
s3fs
boto3
botocore
kfp
kfp-kubernetes
kubernetes
mlflow
feast[redis,postgres]
pyarrow
fsspec
After rebuilding, I checked my Docker image size:
$ docker images
IMAGE                    ID             DISK USAGE   CONTENT SIZE
kubeflow_pipeline:test   09e88c766290   1.8GB        410MB
And the build time was cut nearly in half, to about 400 seconds (roughly 6 minutes):
[+] Building 402.9s (12/12) FINISHED
In an MLOps environment, package auditing is the first line of defense against bloated pipeline images.
Step 2: Implementing Multi-Stage Build
After removing the unwanted dependencies, the next step is a multi-stage Dockerfile.
It's not an application, it's just a base image for a pipeline, so why use a multi-stage build?
The answer: it's not mandatory, but it's really useful.
A multi-stage build means we build in one stage, copy only what we need, and run in a clean final stage.
So I rewrote my Dockerfile to multi-stage:
# Stage 1: Build
FROM python:3.11-slim AS builder
WORKDIR /install
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
# Stage 2: Runtime (lighter)
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY setup.py .
COPY src/ ./src/
COPY _feast/ ./_feast/
COPY _kubeflow/ ./_kubeflow/
COPY _mlflow/ ./_mlflow/
RUN pip install --no-cache-dir .
Stage 1 is a temporary workspace.
It installs the Python dependencies from requirements.txt into a separate install folder. This stage is cleaned away after the build, so it never ends up in your final image.
The second stage is the actual final image. It starts with a clean Python environment and copies only the installed packages from Stage 1, which means no build tools, no cache, no junk.
Then it copies your project folders into the /app directory inside the container.
Finally, it runs pip install ., which installs the project itself as a package so it is importable anywhere within the container.
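To confirm that the site-packages copied from the builder stage actually work in the final image, a throwaway container import test is a cheap sanity check (the packages named here are just a few from our requirements.txt):

```bash
docker run --rm kubeflow_pipeline:v2 python -c "import kfp, mlflow, feast; print('imports OK')"
```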
$ docker images
IMAGE                  ID             DISK USAGE   CONTENT SIZE
kubeflow_pipeline:v2   09e88c766290   1.59GB       353MB
As you can see, after the multi-stage build the image size dropped from 410 MB to 353 MB, a reduction of roughly 60 MB. Not much, but still helpful.
And the build time dropped to just about 200 seconds (roughly 3 minutes):
[+] Building 197.4s (12/12) FINISHED
Step 3: The .dockerignore File
The .dockerignore file acts as a control center that decides which files and folders should not go into our Docker image.
In our original Dockerfile, there was this line:
COPY . .
When this line runs, it does not just copy files into the image.
It sends your entire project directory to the Docker daemon; this is called the build context.
What if there is no .dockerignore file?
Without a .dockerignore file, Docker includes everything inside your project directory:
- .git
- virtual environments (venv/, .venv/)
- environment files (.env)
- cache files (__pycache__)
Even if we don't use them, they still get uploaded and copied into our image.
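For reference, a minimal .dockerignore covering the items above might look like this; extend it to match your own repo layout:

```
.git
venv/
.venv/
.env
**/__pycache__
**/*.pyc
```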
Verify That .dockerignore Is Working
There is a simple command to verify the .dockerignore file:
docker build --no-cache --progress=plain -t test .
This command gives us output like:
=> => transferring context: 249B
This is what we are looking for. If we instead see output like:
Sending build context to Docker daemon  847MB
it means large files are being sent, including the directories I mentioned earlier (.git, virtual environments, etc.).
In that case, recheck your .dockerignore file before building the Docker image.
Final Results
The following table shows the before and after optimization results.
| What Improved | Before | After |
|---|---|---|
| Image Size | 3.17 GB | 354 MB (↓ 89%) |
| Build Time | 13 minutes | 3 minutes (↓ 75%) |
| Dependencies | Unused + duplicate packages | Minimal required packages |
| Dockerfile | Single-stage | Multi-stage |
| Build Context | Large | Optimized with .dockerignore |
Real World Impact
The image size is reduced, but does this actually matter?
Yes it does.
- Reduced build time: The smaller the image, the less time it takes to build; we cut the build time from 13 minutes to just 3 minutes.
- Faster pulls: New EKS nodes pull the image much faster.
- Lower costs: Smaller images avoid unwanted registry storage costs.
Conclusion
By combining multi-stage Docker builds with a leaner, deduplicated requirements.txt, I reduced the image size from 3.17 GB to just 354 MB, almost an 89% reduction.
The key takeaway is simple: only include what your pipeline actually needs at runtime. Image optimization is a collaborative effort between data scientists, developers, and DevOps engineers.
A smaller image means faster builds, quicker deployments, and a cleaner production environment, all without sacrificing functionality.