Heavy Docker

Most blogs and manuals recommend the simpler approaches to reducing the size of your Docker image. We’ll go a little further today, but let’s reiterate them anyway:

  • Use a slimmed-down base image (Alpine is usually recommended) and avoid SDK images for the final image
  • Use multistage builds, and do not copy over temporary files or sources
  • Take care of the .dockerignore file and ignore as much as possible

Having said that, you may still end up with a huge Docker image, and it can be hard to know what the next step is from here.

This is where this post comes in.

Once the image is built, we can inspect the steps (layers) that went into it to identify what the major space hog is:

docker history alphadock/screaming-bot:latest

IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
e90b021bcc11        5 minutes ago       /bin/sh -c #(nop)  VOLUME [/usr/src/app/text…   0B
f479a24b3b2f        5 minutes ago       /bin/sh -c #(nop)  VOLUME [/usr/src/app/sett…   0B
1f2647d86908        5 minutes ago       /bin/sh -c python -c "import nltk; nltk.down…   25MB
27d27678dd10        5 minutes ago       /bin/sh -c python -m spacy download en          45.1MB
6e37b83f7be5        6 minutes ago       /bin/sh -c pip install --no-cache-dir -r req…   463MB
c5e124630280        13 minutes ago      /bin/sh -c #(nop) COPY dir:414b0e689b3fdeac6…   18.6kB
b83ddc2b496c        2 months ago        /bin/sh -c #(nop) WORKDIR /usr/src/app          0B
746a826ed9d7        2 months ago        /bin/sh -c #(nop)  CMD ["python3"]              0B
<missing>           2 months ago        /bin/sh -c set -ex;   wget -O get-pip.py 'ht…   6MB
<missing>           2 months ago        /bin/sh -c #(nop)  ENV PYTHON_PIP_VERSION=10…   0B
<missing>           2 months ago        /bin/sh -c cd /usr/local/bin  && ln -s idle3…   32B
<missing>           2 months ago        /bin/sh -c set -ex   && wget -O python.tar.x…   69.9MB
<missing>           2 months ago        /bin/sh -c #(nop)  ENV PYTHON_VERSION=3.7.0     0B
<missing>           2 months ago        /bin/sh -c #(nop)  ENV GPG_KEY=0D96DF4D4110E…   0B
<missing>           2 months ago        /bin/sh -c apt-get update && apt-get install…   16.8MB
<missing>           2 months ago        /bin/sh -c #(nop)  ENV LANG=C.UTF-8             0B
<missing>           2 months ago        /bin/sh -c #(nop)  ENV PATH=/usr/local/bin:/…   0B
<missing>           2 months ago        /bin/sh -c set -ex;  apt-get update;  apt-ge…   556MB
<missing>           2 months ago        /bin/sh -c apt-get update && apt-get install…   142MB
<missing>           2 months ago        /bin/sh -c set -ex;  if ! command -v gpg > /…   7.8MB
<missing>           2 months ago        /bin/sh -c apt-get update && apt-get install…   23.2MB
<missing>           2 months ago        /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>           2 months ago        /bin/sh -c #(nop) ADD file:370028dca6e8ca9ed…   101MB

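As you can see from my example, among the steps my own Dockerfile adds, one command clearly takes most of the space: pip install, at 463MB. (The even larger 556MB layer further down is two months old and comes with the base image.) That makes sense: pip install is probably generating a lot of temporary files and downloads that we don’t need to keep.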
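On a longer history, picking out the biggest layer by eye gets tedious. Here is a minimal sketch that ranks layers by size instead. The function names layer_sizes and largest_layers are my own, not part of Docker; the sketch relies on docker history accepting --format and --human=false, which prints sizes as raw byte counts so they can be sorted numerically.

```shell
# List every layer of an image as "<size in bytes> <command>".
# --human=false asks docker for raw byte counts instead of "463MB".
layer_sizes() {
  docker history --human=false --format '{{.Size}} {{.CreatedBy}}' "$1"
}

# Keep the N largest layers (default 5) from lines shaped like the
# output of layer_sizes, read from standard input.
largest_layers() {
  sort -rn | head -n "${1:-5}"
}
```

To use it: layer_sizes alphadock/screaming-bot:latest | largest_layers 3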

But what exactly are they?

To find out, we can inspect the filesystem to see what was generated by that layer. Docker actually keeps layers as part of the filesystem, so we can dive into that directory to check it out.

To find out where that layer is stored, we can use the inspect command:

docker inspect 6e37b83f7be5

(lots of json output)

Actually, we just care about the GraphDriver section, with the location specified in Data.WorkDir. We can pass that field as a format template so that we get only that portion back:

docker inspect -f "{{.GraphDriver.Data.WorkDir}}" 6e37b83f7be5

/var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3/work

Nice!
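The files the layer actually added live in a diff directory that sits next to this work directory. As a small sketch, the helper below turns the WorkDir path into that sibling path. The name layer_diff_dir is hypothetical, and the layout is an assumption that holds for the overlay2 driver shown here; other storage drivers may differ.

```shell
# Given the WorkDir path printed by docker inspect, print the sibling
# "diff" directory, which holds the files this layer added or changed.
# Assumes the overlay2 layout: <layer-dir>/work next to <layer-dir>/diff.
layer_diff_dir() {
  printf '%s/diff\n' "$(dirname "$1")"
}
```

To use it: layer_diff_dir "$(docker inspect -f '{{.GraphDriver.Data.WorkDir}}' 6e37b83f7be5)"

With overlay2, the UpperDir field in the same GraphDriver section should point at that same diff directory, so asking for it directly is an alternative worth trying.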

If you’re working on a Linux machine, you can just cd into that directory (you will probably need root). But if you’re on Windows, you won’t find it: it is stored inside the Linux VM that Docker runs under the hypervisor.

This post suggests using the following command to access it, all through the power of Docker:

docker run -it --privileged --pid=host debian nsenter -t 1 -m -u -i sh

# now inside the debian sh

cd /var/lib/docker/overlay2/ed87e8cb41e0f5d2f40e2205db550f2c9e888224fb6928f6a1c626176ad2beb3

At this point we get to see the changes in the file system inside the diff directory. These are the particular changes that this layer introduced. We can explore it like we would with any other filesystem.

I just ran du -h . and was able to find the major space hogs, like the __pycache__ directories and other temporary build files.
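Plain du -h output gets long quickly, so sorting it helps. A minimal sketch, with biggest_dirs being a name I made up: it prints sizes in KiB rather than human-readable units so that a plain numeric sort works everywhere.

```shell
# Print the N largest directories (default 10) under a path.
# Sizes are in KiB (du -k) so that sort -rn can order them numerically.
biggest_dirs() {
  du -k "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

To use it from inside the layer directory: biggest_dirs diff 20

The first line is always the directory itself, since it contains everything else; the interesting entries start on the second line.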

By repeating this process and identifying the primary causes of large layers, you can trim down the final version of your image.

Having said this, it is still possible that you cannot control what particular steps in your build process do. For instance, in my case, I could not avoid the creation of the __pycache__ directories as part of the build process.

This is where multistage builds come in.