From 3.17 GB to 354 MB: How I Reduced My Kubeflow Docker Image by 89%

Docker Image Size Optimization for a Kubeflow Pipeline

In this blog, I share how I reduced a Kubeflow pipeline Docker image from 3.17 GB down to 354 MB (an 89% reduction), along with the reasoning behind every change I made.

A little backstory.

As part of our MLOps workflow, I took over an existing Kubeflow pipeline developed by a different ML engineer.

While working on the pipeline, I checked the image size: 3.17 GB. During a discussion with the DevOps team, we agreed this was unacceptable for a pipeline that runs on every model retrain.

So my goals were simple: cut the image size and, along with it, the build time.

Here is what I learned along the way, including a few things I wish I had known before I started.

The Starting Point: A Bloated 3.17 GB Docker Image

To run my Kubeflow pipeline, I needed the following packages and dependencies, listed in my requirements.txt file.

This was the requirements.txt file before optimization.

# training
scikit-learn
numpy
pandas
pandera
matplotlib
seaborn
boto3
python-dotenv

# (optional)
pyarrow
joblib
# Feast/pandas can read s3:// URLs:
s3fs

# kubeflow
kfp
kfp-kubernetes
kubernetes
kubeflow
kubeflow-training

# mlflow
mlflow

# API / Serving 
flask
flask-cors

# feast-feature-store
feast
feast[redis]
feast[postgres]

torch

And, this was the Dockerfile before optimization.

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y git && apt-get clean

# Set working directory
WORKDIR /app

# Copy entire project
COPY . .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Install your project as a package
RUN pip install --no-cache-dir .

When I ran the Docker build command, it produced a 3.17 GB image. The build also took nearly 800 seconds (about 13 minutes).

$ docker images

IMAGE                    ID              DISK USAGE    CONTENT SIZE
kubeflow_pipeline:v1   09e88c766290        9.47GB        3.17GB
[+] Building 797.2s (12/12) FINISHED

Audit packages and dependencies

Let's look at the image's build history so we can see which layer contributes the most size, and then remove the packages we don't need at runtime.

For that, we need to run this command:

docker history kubeflow_pipeline:v1 --no-trunc --format "table {{.Size}}\t{{.CreatedBy}}"

Here is the output.

$ docker history kubeflow_pipeline:v1 --no-trunc --format "table {{.Size}}\t{{.CreatedBy}}"

SIZE      CREATED BY
319kB     RUN /bin/sh -c pip install --no-cache-dir . # buildkit
5.98GB    RUN /bin/sh -c pip install --no-cache-dir -r requirements.txt # buildkit
10.3MB    COPY . . # buildkit
8.19kB    WORKDIR /app
138MB     RUN /bin/sh -c apt-get update && apt-get install -y git && apt-get clean # buildkit
0B        CMD ["python3"]

As you can see, the requirements.txt layer in the Dockerfile takes up by far the most space.

💡
Always remember: if your Docker image is huge, do not start with the Dockerfile; start by optimizing your packages and dependencies.

Now, let's go through the strategies for image optimization step by step.

Three Strategies for Image Optimization

To reduce the size of my Docker image, I followed three key steps.

  1. Package and dependency optimization
  2. Multi-stage build (separate build artifacts from runtime)
  3. .dockerignore file

Step 1: Package and Dependency Optimization

During development, the previous ML engineer added many dependencies to the requirements.txt file. On analysis, I found that many packages are not required for the Kubeflow training pipeline.

So, from the requirements.txt file, I removed many unwanted and duplicate packages and dependencies.

Now you might wonder:

How do I determine whether a specific package is actually needed in my project?

There are 2 ways to find out:

Grep method: We can use the grep command to check whether a package installed from the requirements.txt file is actually used anywhere in our project.

kubeflow-training-pipeline % grep -R "import numpy" .

You will get output like this:

./src/model_development/_09_evaluation.py:import numpy as np

This tells us that the installed package numpy is imported and used at that specific path.

In this way, I removed some of the unwanted packages, such as torch, from the requirements.txt file.

But we should not rely on grep alone. When we run grep for kfp-kubernetes, it shows no output, yet we must not remove kfp-kubernetes from the requirements.txt file.

That is because some packages are needed for:

  • pipeline compilation
  • kubernetes execution
  • SDK behavior

These are not necessarily imported in any .py file.
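To speed up the first pass, the grep check can be wrapped in a small shell loop over requirements.txt. This is an illustrative sketch (the `audit_requirements` helper is my own, not a standard tool), and it carries the same caveat: PyPI names often differ from import names (scikit-learn imports as sklearn), so a "no match" result is only a candidate for removal, not proof.

```shell
# audit_requirements REQFILE SRCDIR
# Prints every package in REQFILE that never appears anywhere in SRCDIR.
# Heuristic only: some packages (like kfp-kubernetes) are required
# without ever being imported in a .py file.
audit_requirements() {
  reqfile="$1"; srcdir="$2"
  while IFS= read -r pkg; do
    case "$pkg" in ''|'#'*) continue ;; esac    # skip blanks and comments
    name="${pkg%%[\[=<>]*}"                     # strip extras and version pins
    grep -Rq "$name" "$srcdir" || echo "possibly unused: $name"
  done < "$reqfile"
}
```

Running `audit_requirements requirements.txt src/` prints the candidates; each one still needs a rebuild-and-run check before it is actually removed.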

That brings us to the second method.

Validation method: remove a package from the requirements.txt file, rebuild the container, and run the pipeline.

For example, I removed kfp-kubernetes, rebuilt the container, and ran the pipeline.

If I see any errors like:

  • Component not found
  • Kubernetes configuration issues
  • Pipeline execution failed

In this way, I can confirm that the kfp-kubernetes package is genuinely required by our Kubeflow pipeline.

Now let's see which packages and dependencies were removed during the optimization, and why.

Let's go through them one by one:

  • I removed torch because torch alone can weigh anywhere from roughly 700 MB to 2 GB, depending on the build.
  • If a pipeline component is only responsible for data preprocessing, model evaluation, or orchestration, a GPU training package like torch is not needed inside the image.
💡
You can check this link to see which GPU packages are installed by torch.
  • matplotlib and seaborn are data visualization libraries; visualization belongs in notebooks or a separate analysis step. Removing them saves space, since matplotlib pulls in several heavy dependencies of its own.
  • kubeflow and kubeflow-training are also not needed; kfp and kfp-kubernetes already give us everything we need to define, compile, and run Kubeflow pipelines. The kubeflow and kubeflow-training packages are higher-level SDKs for training operators like PyTorchJob or TFJob.
  • flask and flask-cors are web frameworks used for serving APIs. A Kubeflow training pipeline container is not an API server; it just runs a job and exits.
  • Before optimization, feast was duplicated across three separate lines:
feast
feast[redis]
feast[postgres]

We consolidated it into a single line, feast[redis,postgres], because multiple lines for the same package can cause conflicts and redundant installs.

And after optimization, my final requirements.txt file looks like this:

numpy
pandas
scikit-learn
joblib
pandera
python-dotenv
s3fs
boto3
botocore
kfp
kfp-kubernetes
kubernetes
mlflow
feast[redis,postgres]
pyarrow
fsspec

After I checked my Docker image size:

$ docker images

IMAGE                    ID              DISK USAGE    CONTENT SIZE
kubeflow_pipeline:test   09e88c766290      1.8GB         410MB

And the build time dropped by nearly half, to about 400 seconds (under 7 minutes).

[+] Building 402.9s (12/12) FINISHED
💡
Key Takeaway:

In an MLOps environment, package auditing is the first line of defense against bloated pipeline images.

Step 2: Implementing Multi-Stage Build

After removing the unwanted dependencies, the next step is to create a multi-stage Dockerfile.

It's not an application, just a base image for a pipeline, so why use a multi-stage build?

The answer: it is not mandatory, but it is really useful.

Multi-stage means building everything in one stage, copying only what we need, and running from a clean final stage.

So I rewrote my Dockerfile as a multi-stage build:

# Stage 1: Build
FROM python:3.11-slim AS builder

# Work in /build so requirements.txt does not end up inside the /install
# prefix that we copy into the runtime stage
WORKDIR /build
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Stage 2: Runtime (lighter)
FROM python:3.11-slim

WORKDIR /app

COPY --from=builder /install /usr/local

COPY setup.py .
COPY src/ ./src/
COPY _feast/ ./_feast/
COPY _kubeflow/ ./_kubeflow/
COPY _mlflow/ ./_mlflow/

RUN pip install --no-cache-dir .

Stage 1 is a temporary stage, or workspace.

It installs the Python dependencies from requirements.txt into a separate /install folder. This stage is discarded after the build, so it never ends up in your final image.

The second stage is the actual final image. It starts from a clean Python environment and copies only the installed packages from Stage 1: no build tools, no cache, no junk.

Then it copies your project folders into the /app directory inside the container.

Finally, it runs pip install . so that the project itself is installed as a package and is importable anywhere in the container.

$ docker images

IMAGE                    ID              DISK USAGE    CONTENT SIZE
kubeflow_pipeline:v2   09e88c766290        1.59GB        353MB

As you can see, after the multi-stage build my Docker image size dropped from 410 MB to 353 MB, a further reduction of about 57 MB. That is not much, but still helpful.

But the build time was cut to just about 200 seconds (roughly 3 minutes):

[+] Building 197.4s (12/12) FINISHED

Step 3: The .dockerignore File

The .dockerignore file acts as a control center that decides which files and folders should not go into our Docker image.

In our previous Dockerfile, we can see this line:

COPY . .

When you run a build, this does not just copy files into the image.

Docker first sends your entire project directory to the Docker daemon; this is called the build context.

What if there is no .dockerignore file?

Without a .dockerignore file, Docker includes everything inside your project directory:

  • .git
  • virtual environments (venv/ , .venv/)
  • environment files ( .env )
  • cache files (__pycache__)

Even if we don't use them, they still get uploaded to the daemon and copied into our image.
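A starting-point .dockerignore for a project like this could look as follows. The specific entries (notebooks/, data/) are illustrative examples of things that commonly bloat a build context; adjust them to your repository's actual layout.

```
# Keep the build context lean (illustrative; tailor to your project)
.git
.env
venv/
.venv/
__pycache__/
*.pyc
*.ipynb
notebooks/
data/
```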

Verify that .dockerignore is working

We can verify the .dockerignore file with the following command:

docker build --no-cache --progress=plain -t test .

In the output, look for a line like this:

=> => transferring context: 249B

That is what we want. If instead you see output like this:

Sending build context to Docker daemon  847MB

it means large files are being sent, including the directories mentioned earlier (.git, virtual environments, and so on).

In that case, recheck your .dockerignore file before building the Docker image.

Final Results

The following table shows the final before-and-after optimization results.

| What Improved | Before | After |
| --- | --- | --- |
| Image Size | 3.17 GB | 354 MB (↓ 89%) |
| Build Time | 13 minutes | 3 minutes (↓ 75%) |
| Dependencies | Unused + duplicate packages | Minimal required packages |
| Dockerfile | Single-stage | Multi-stage |
| Build Context | Large | Optimized with .dockerignore |

Real World Impact

The image size is reduced, but does this actually matter?

Yes, it does.

  • Reduced build time: the smaller the image, the less time it takes to build; we cut the build time from 13 minutes to just 3 minutes.
  • Faster pulls: new EKS nodes pull the image faster, so pods start sooner.
  • Lower costs: avoids unnecessary registry storage costs.

Conclusion

By combining multi-stage Docker builds with a leaner, deduplicated requirements.txt, I reduced the image size from 3.17 GB to just 354 MB, an 89% reduction.

The key takeaway is simple: as a developer, include only what your pipeline actually needs at runtime. Image optimization is a collaborative effort between data scientists, developers, and DevOps engineers.

A smaller image means faster builds, quicker deployments, and a cleaner production environment, all without sacrificing functionality.

About the author
Hashim N
