Reducing size of Docker images
Making efficient final images
Most blogs and manuals recommend the simpler approaches to reducing the size of your Docker image. We’ll go a little further today, but let’s reiterate them anyway:
- Use a reduced variant of the base image (alpine is usually recommended) and avoid SDKs in final images
- Use multistage builds, and do not copy temporary files or sources into the final image
- Take care of your .dockerignore and ignore as much as possible
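To make the first two points concrete, here is a minimal multistage sketch for a Python service (the base images, paths, and entrypoint are illustrative assumptions, not taken from any particular project):

```dockerfile
# Build stage: full image with compilers and headers available
FROM python:3.7 AS build
WORKDIR /usr/src/app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Final stage: slim base; copy only the installed packages, not the build debris
FROM python:3.7-slim
COPY --from=build /install /usr/local
WORKDIR /usr/src/app
COPY . .
CMD ["python3", "main.py"]
```

The key idea is that nothing from the build stage reaches the final image unless it is explicitly copied with COPY --from.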
Having said that, it is still possible that you’ll end up with a very large Docker image, and it can be difficult to know what the next step is from here.
This is where this post comes in.
Once the image is built, we can inspect the steps (commits) that produced it to identify the major space hogs:
docker history alphadock/screaming-bot:latest
IMAGE CREATED CREATED BY SIZE
e90b021bcc11 5 minutes ago /bin/sh -c #(nop) VOLUME [/usr/src/app/text... 0B
f479a24b3b2f 5 minutes ago /bin/sh -c #(nop) VOLUME [/usr/src/app/sett... 0B
1f2647d86908 5 minutes ago /bin/sh -c python -c "import nltk; nltk.down... 25MB
27d27678dd10 5 minutes ago /bin/sh -c python -m spacy download en 45.1MB
6e37b83f7be5 6 minutes ago /bin/sh -c pip install --no-cache-dir -r req... 463MB
c5e124630280 13 minutes ago /bin/sh -c #(nop) COPY dir:414b0e689b3fdeac6... 18.6kB
b83ddc2b496c 2 months ago /bin/sh -c #(nop) WORKDIR /usr/src/app 0B
746a826ed9d7 2 months ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 months ago /bin/sh -c set -ex; wget -O get-pip.py 'ht... 6MB
<missing> 2 months ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=10... 0B
<missing> 2 months ago /bin/sh -c cd /usr/local/bin && ln -s idle3... 32B
<missing> 2 months ago /bin/sh -c set -ex && wget -O python.tar.x... 69.9MB
<missing> 2 months ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.7.0 0B
<missing> 2 months ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E... 0B
<missing> 2 months ago /bin/sh -c apt-get update && apt-get install... 16.8MB
<missing> 2 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 months ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/... 0B
<missing> 2 months ago /bin/sh -c set -ex; apt-get update; apt-ge... 556MB
<missing> 2 months ago /bin/sh -c apt-get update && apt-get install... 142MB
<missing> 2 months ago /bin/sh -c set -ex; if ! command -v gpg > /... 7.8MB
<missing> 2 months ago /bin/sh -c apt-get update && apt-get install... 23.2MB
<missing> 2 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 months ago /bin/sh -c #(nop) ADD file:370028dca6e8ca9ed... 101MB
As you can see from my example, there is one clear command that is taking most of the space: pip install. Makes sense: this step probably generates a lot of temporary files and downloads that we don’t need to keep.
But what exactly are they?
To find out, we can inspect the filesystem to see what that layer generated. Docker keeps layers as directories on the host filesystem, so we can dive into the right directory to check it out.
To find out where that layer is stored, we can use the inspect command:
docker inspect 6e37b83f7be5
(lots of json output)
Actually, we just care about the GraphDriver section, with the location specified in Data.WorkDir. So we can pass a format template so that we get only that portion back:
docker inspect -f "{{.GraphDriver.Data.WorkDir}}" 6e37b83f7be5
/var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3/work
Nice!
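Note that WorkDir points at the overlay’s work subdirectory; the layer’s root directory (which contains diff, work, and friends) is its parent. A tiny helper to strip that suffix (the function name is my own; it is pure string manipulation, so it works even without a Docker daemon):

```shell
#!/bin/sh
# layer_root: strip the trailing /work from an overlay2 WorkDir path,
# leaving the layer's root directory.
layer_root() {
  printf '%s\n' "${1%/work}"
}

# With a running Docker daemon you would feed it the inspect output:
#   layer_root "$(docker inspect -f '{{.GraphDriver.Data.WorkDir}}' 6e37b83f7be5)"
layer_root /var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3/work
# -> /var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3
```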
If you’re working on a Linux machine you can just cd into that directory. But if you’re on Windows, you won’t find it: it is stored inside the Linux VM that Docker runs under the hypervisor.
This post suggests using the following command to access it, all through the power of Docker:
docker run -it --privileged --pid=host debian nsenter -t 1 -m -u -i sh
# now inside the debian sh
cd /var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3
At this point we get to see the changes to the filesystem inside the diff directory. These are the particular changes that this layer introduced, and we can explore them like we would any other filesystem.
I just executed du -h . and was able to find the major space hogs, like the __pycache__ directories and other temporary build files.
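To rank the offenders rather than scan the raw du output, a sorted variant helps (this assumes GNU coreutils, whose sort supports -h for human-readable sizes):

```shell
# From inside the layer's diff directory: list every file and
# directory with a human-readable size, biggest entries first.
du -ah . | sort -rh | head -n 20
```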
By repeating this process and identifying the primary causes of large layers, you can trim down the final version of your image.
Having said this, it is still possible that you cannot control what particular steps of your build process do. For instance, in my case, I could not avoid the creation of the __pycache__ directories as part of the build process.
This is where multistage builds come in.