Using placement constraints with Docker Swarm

Label it!

As we’ve previously seen, Docker Swarm mode is a pretty powerful tool for deploying containers across a cluster. It has self-healing capabilities, built-in network load balancing, scaling, private VXLAN networks, and more.

Docker Swarm will automatically try to place your containers to provide maximum resiliency within the service. So, for example, if you request 3 running copies of a container, it will try to place them on three different machines. Only if resources are unavailable will two containers be placed on the same host.

We saw this with our simple pinger application; it ran on 3 nodes:

% docker service ps pinger
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
xxn5tnej6bao        pinger.1            centos:latest       test1.spuddy.org    Running             Running 5 minutes ago
dcmno194ymjf        pinger.2            centos:latest       test2.spuddy.org    Running             Running 2 seconds ago
g89lpi6xeuji        pinger.3            centos:latest       test3.spuddy.org    Running             Running 2 seconds ago
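For reference, a replicated service like this is created with a single docker service create command. The actual command the pinger container runs isn’t shown here, so the trailing command below is just a placeholder:

$ docker service create --name pinger --replicas 3 centos:latest sleep infinity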

Sometimes, however, you need to control where a container is run. This may be for functionality reasons; for example, a container that monitors and reports on the Swarm state needs to run on a manager node in order to get the data it needs. Or there may be OS requirements (a container designed to run on a Windows machine shouldn’t be deployed to a Linux machine!).
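Swarm has built-in node attributes for exactly these cases. As a sketch (this service isn’t part of the stack file used in this article, and the image name is made up):

  monitor:
    image: example/swarm-monitor
    deploy:
      placement:
        constraints:
          - node.role == manager        # only schedule on manager nodes
          - node.platform.os == linux   # match the node's operating system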

Frequently, however, this is because only some nodes have the necessary resources, and the most common of these are “volume” dependencies. Remember that while the containers may be run on multiple nodes, a Swarm is really just a collection of standalone Docker engines pulling images from a registry. That means that backend resources, such as filesystem volumes, are served locally from each node; the contents of /myapp on server test1 may be different from /myapp on test2. We also saw this, in passing, with the MySQL container; it was constrained to run only on a specific node so that the backing datafiles were consistent.

  db:
    image: mysql:5.5
    networks:
      - appdb
    environment:
      - MYSQL_ROOT_PASSWORD=foobar
      - MYSQL_DATABASE=mydb1
    volumes:
      - db-data:/var/lib/mysql
    deploy:
      placement:
        constraints: [node.hostname == test1.spuddy.org]

In this case we use a named volume rather than a filesystem directory, but the constraint is still required; each node would have its own unique “db-data” volume.
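For completeness, a named volume like db-data also needs a matching entry in the top-level volumes: section of the stack file, which isn’t shown in the snippet above:

volumes:
  db-data: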

Now constraining by hostname works, but you’re limited to just a single host. What if you wanted to run multiple copies of Tomcat? We need to constrain it to run only on the nodes where the config files are, and we can’t just list all the hostnames because multiple constraints are ANDed together.
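In other words, something like this would never be scheduled anywhere, since no node can match both hostnames at once:

    deploy:
      replicas: 2
      placement:
        constraints:
          - node.hostname == test1.spuddy.org
          - node.hostname == test2.spuddy.org   # ANDed with the line above, so impossible to satisfy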

So, instead, we can define a label and constrain it to that:

  tomcat:
    image: sweh/test:fake_tomcat
    deploy:
      replicas: 2
      placement:
        constraints: [node.labels.Tomcat == true]
    volumes:
      - "/myapp/apache/certs:/etc/pki/tls/certs/myapp"
      - "/myapp/apache/logs:/etc/httpd/logs"
      - "/myapp/tomcat/webapps:/usr/local/apache-tomcat-8.5.3/webapps"
      - "/myapp/tomcat/logs:/usr/local/apache-tomcat-8.5.3/logs"
    ports:
      - "8443:443"

(“fake_tomcat” is just a dummy program I wrote that listens on the requested port; it doesn’t do any real work).

I want to run this on test1 and test2, so on those two machines I create the necessary directories.
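That means something like this on both nodes, matching the host side of the volume mappings above (the actual certificates, logs and webapps still have to be put in place, of course):

$ mkdir -p /myapp/apache/certs /myapp/apache/logs /myapp/tomcat/webapps /myapp/tomcat/logs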

We now need to add labels to tell the Swarm that these nodes are able to run Tomcat:

$ docker node update --label-add Tomcat=true test1.spuddy.org
test1.spuddy.org
$ docker node update --label-add Tomcat=true test2.spuddy.org
test2.spuddy.org
$ docker node inspect --format '{{ .Spec.Labels }}' test1.spuddy.org
map[Tomcat:true]
$ docker node inspect --format '{{ .Spec.Labels }}' test2.spuddy.org
map[Tomcat:true]

I then create the stack, and we can see it running:

$ docker stack deploy -c stack.tomcat myapp
Creating network myapp_default
Creating service myapp_tomcat

$ docker stack ls
NAME                SERVICES
myapp               1

$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                   PORTS
bnhvu1cc3orx        myapp_tomcat        replicated          2/2                 sweh/test:fake_tomcat   *:8443->443/tcp

$ docker service ps myapp_tomcat
ID                  NAME                IMAGE                   NODE                DESIRED STATE       CURRENT STATE                ERROR               PORTS
to0yvevap2om        myapp_tomcat.1      sweh/test:fake_tomcat   test2.spuddy.org    Running             Running about a minute ago
dk9pnthgvdcp        myapp_tomcat.2      sweh/test:fake_tomcat   test1.spuddy.org    Running             Running about a minute ago

We can see this is working by testing port 8443:

$ echo hello | nc localhost 8443
You are caller #1 to have reached fake_tomcat on 38be98384cb9:443
fake_tomcat has received message `hello'

$ echo hello | nc localhost 8443
You are caller #1 to have reached fake_tomcat on d095e9f2490f:443
fake_tomcat has received message `hello'

The two calls hit different containers (as shown by the different hostnames in the output).

Multiple services

Let’s create a more complicated stack which has Tomcat, Memcached, and Zookeeper.

The additional lines to the stack are:

  zookeeper:
    image: sweh/test:fake_zookeeper
    deploy:
      replicas: 2
      placement:
        constraints: [node.labels.Zookeeper == true]
    volumes:
      - "/myapp/zookeeper/data:/usr/zookeeper/data"
      - "/myapp/zookeeper/logs:/usr/zookeeper/logs"
      - "/myapp/zookeeper/conf:/usr/zookeeper/conf"
    environment:
      CFG_FILE: /usr/zookeeper/conf/zoo.cfg

  memcached:
    image: sweh/test:fake_memcached
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.Memcached == true]

And let’s create the relevant labels to distribute the services over the three test servers. The resulting labels look like this:

docker-ce.spuddy.org
    map[]

test1.spuddy.org
    map[Memcached:true Tomcat:true]

test2.spuddy.org
    map[Tomcat:true Zookeeper:true]

test3.spuddy.org
    map[Zookeeper:true]

Note there’s nothing labeled on the manager node; my app won’t run there.
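For reference, the commands to add these labels follow the same pattern as before; roughly:

$ docker node update --label-add Memcached=true test1.spuddy.org
$ docker node update --label-add Zookeeper=true test2.spuddy.org
$ docker node update --label-add Zookeeper=true test3.spuddy.org

(test1 and test2 keep the Tomcat labels they were given earlier.)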

Again, we need to ensure the directories exist, with the necessary configuration, on each labeled node.
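On the Zookeeper nodes that means something like this, based on the volume mappings in the stack file (the zoo.cfg here is a stand-in for whatever configuration your application actually needs):

$ mkdir -p /myapp/zookeeper/data /myapp/zookeeper/logs /myapp/zookeeper/conf
$ cp zoo.cfg /myapp/zookeeper/conf/zoo.cfg

With those in place, remove the old stack and deploy the full one: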

$ docker stack rm myapp
Removing service myapp_tomcat
Removing network myapp_default

$ docker stack deploy -c stack.full myapp
Creating network myapp_default
Creating service myapp_memcached
Creating service myapp_tomcat
Creating service myapp_zookeeper

$ docker stack ps myapp
ID                  NAME                IMAGE                      NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
iidao6f3vt6z        myapp_zookeeper.1   sweh/test:fake_zookeeper   test2.spuddy.org    Running             Running 34 seconds ago
haq4vvben495        myapp_tomcat.1      sweh/test:fake_tomcat      test1.spuddy.org    Running             Running 35 seconds ago
6u4s6w4fs0cm        myapp_memcached.1   sweh/test:fake_memcached   test1.spuddy.org    Running             Running 37 seconds ago
zixk0eu2aahu        myapp_zookeeper.2   sweh/test:fake_zookeeper   test3.spuddy.org    Running             Running 35 seconds ago
4u8ytws4kl5v        myapp_tomcat.2      sweh/test:fake_tomcat      test2.spuddy.org    Running             Running 35 seconds ago

Rescale

In my fake environment I decided that memcached is running slow (and, besides, only having one copy isn’t very resilient!). Now we can see the power of labels; I can add the Memcached label to another node and then rescale:

$ docker node update --label-add Memcached=true test3.spuddy.org
test3.spuddy.org

$ docker service scale myapp_memcached=2
myapp_memcached scaled to 2

$ docker service ps myapp_memcached
ID                  NAME                IMAGE                      NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
6u4s6w4fs0cm        myapp_memcached.1   sweh/test:fake_memcached   test1.spuddy.org    Running             Running 3 minutes ago
pf56ix45e887        myapp_memcached.2   sweh/test:fake_memcached   test3.spuddy.org    Running             Running 7 seconds ago

Now that was easy because memcached didn’t have any external volumes to depend on, but we could do the same for Zookeeper or Tomcat: just create the necessary volumes and configuration on the new node, then add the label.
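For example, adding a third Tomcat replica on test3 would look roughly like this (the real certificates, webapps and configuration would also need to be copied into those directories):

$ mkdir -p /myapp/apache/certs /myapp/apache/logs /myapp/tomcat/webapps /myapp/tomcat/logs
$ docker node update --label-add Tomcat=true test3.spuddy.org
$ docker service scale myapp_tomcat=3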

What if you forget the volumes?

Let’s add Zookeeper to test1 but “forget” to make the volumes:

$ docker node update --label-add Zookeeper=true test1.spuddy.org
test1.spuddy.org
$ docker service scale myapp_zookeeper=3
myapp_zookeeper scaled to 3
$ docker service ps myapp_zookeeper
ID                  NAME                    IMAGE                      NODE                DESIRED STATE       CURRENT STATE                     ERROR                              PORTS
iidao6f3vt6z        myapp_zookeeper.1       sweh/test:fake_zookeeper   test2.spuddy.org    Running             Running 6 minutes ago
zixk0eu2aahu        myapp_zookeeper.2       sweh/test:fake_zookeeper   test3.spuddy.org    Running             Running 6 minutes ago
ytyp3impjzyf        myapp_zookeeper.3       sweh/test:fake_zookeeper   test1.spuddy.org    Ready               Rejected less than a second ago   "invalid mount config for type…"
sudgw3w694ja         \_ myapp_zookeeper.3   sweh/test:fake_zookeeper   test1.spuddy.org    Shutdown            Rejected 5 seconds ago            "invalid mount config for type…"
3gv64qgli504         \_ myapp_zookeeper.3   sweh/test:fake_zookeeper   test1.spuddy.org    Shutdown            Rejected 6 seconds ago            "invalid mount config for type…"

That’s messy!

Eventually the system settles down and runs two copies on an existing node:

$ docker service ps myapp_zookeeper
ID                  NAME                    IMAGE                      NODE                DESIRED STATE       CURRENT STATE                 ERROR                              PORTS
iidao6f3vt6z        myapp_zookeeper.1       sweh/test:fake_zookeeper   test2.spuddy.org    Running             Running 8 minutes ago
zixk0eu2aahu        myapp_zookeeper.2       sweh/test:fake_zookeeper   test3.spuddy.org    Running             Running 8 minutes ago
ib7eiekpes3c        myapp_zookeeper.3       sweh/test:fake_zookeeper   test2.spuddy.org    Running             Running 49 seconds ago

Monitoring of service state becomes very important in this environment!
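Even a simple periodic check of the stack’s tasks will catch this kind of rejection quickly; for example:

$ watch -n 5 'docker stack ps myapp'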

Migration of services between nodes

For performance reasons I want to migrate the memcached instance currently running on test3 over to test2, so it’s on the same machine as the Tomcat instance. We can force this migration by adding the label to the new node and removing it from the old one:

$ docker service ps myapp_memcached
ID                  NAME                IMAGE                      NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
6u4s6w4fs0cm        myapp_memcached.1   sweh/test:fake_memcached   test1.spuddy.org    Running             Running 13 minutes ago
pf56ix45e887        myapp_memcached.2   sweh/test:fake_memcached   test3.spuddy.org    Running             Running 9 minutes ago

$ docker node update --label-add Memcached=true test2.spuddy.org
test2.spuddy.org

$ docker node update --label-rm Memcached test3.spuddy.org
test3.spuddy.org

$ docker service ps myapp_memcached
ID                  NAME                    IMAGE                      NODE                DESIRED STATE       CURRENT STATE             ERROR               PORTS
6u4s6w4fs0cm        myapp_memcached.1       sweh/test:fake_memcached   test1.spuddy.org    Running             Running 14 minutes ago
n7bv9331opoq        myapp_memcached.2       sweh/test:fake_memcached   test2.spuddy.org    Running             Running 8 seconds ago
pf56ix45e887         \_ myapp_memcached.2   sweh/test:fake_memcached   test3.spuddy.org    Shutdown            Rejected 14 seconds ago

The instance that had been running on test3 has been rejected and a new instance started on test2. Remember the Docker scheduler will attempt to pick a node that isn’t already running a copy of the container. In this case the only valid servers to meet the placement constraints were test1 and test2, and test1 already had a copy running.

Summary

Using labels allows for a very dynamic way of determining where your workloads run; you can rescale, migrate and even extend the cluster (add new nodes, add labels to the node, modify the scaling) without needing to redeploy the stack.

This becomes more important as multiple workloads in multiple stacks are deployed to the same swarm; the stack owners don’t need to know the underlying node names, they can just use labels, and the same stack can be deployed to different targets (how easy would it be to spin this up on AWS EC2 instances? No stack changes needed at all!).

Of course this doesn’t come for free; it takes time for containers to spin up (about 6 seconds in my fake_memcached migration), so make sure your services can handle short outages. The closer your apps are to 12-factor, the better they will handle dynamic migration of containers.

And, of course, there’s still the persistent data volume question to handle; this can complicate your deployments.

But despite these complications I would recommend taking a look at labels if you have a need to constrain where your containers run.