A look at Docker Swarm

A Brief Introduction

In my previous entry I took a quick look at some of the Docker orchestration tools. I spent a bit of time poking at docker-compose and mentioned Swarm.

In this entry I’m going to poke a little at Swarm; after all, it now comes as part of the platform and is a key foundation of Docker Enterprise Edition.

Docker Swarm tries to take the concepts of the single-host model and extend them to a cluster. Some of the same concepts (e.g. networking) are extended in a cluster-aware manner (e.g. using VXLAN overlay networks). Where possible it tries to hide this complexity.

Now this entry may look long… but that’s because Swarm does a lot. I’m really only touching on some highlights! There’s a lot more happening behind the scenes.

Complications

Now a Swarm is a collection of Docker Engines. This means that a lot of the stuff we’ve seen in previous entries still applies. In particular, you can still access each node directly and examine the running state. A bit of a gotcha is that you really need a registry to run images.

In previous examples I did something similar to

docker build -t myimage .
docker run myimage

Now this works because the engine has a copy of the image. If I tried to do the docker run command on another machine then it’d fail, because it wouldn’t have a copy of myimage (indeed, it’d try to reach out to docker.io to find a copy there).

There are a few workarounds (e.g. manually copying images around the cluster), but they don’t really scale. The correct solution is to run a registry and configure your engines to talk to it.

Docker provides a hosted registry at Docker Hub. The free offering is suitable for public storage of images (and you get one private repo for free). You can pay for additional private repos, which can be configured to require authentication (a number of vendors do this). There is also a free registry image that you can pull and run to host your own internal registry. And there are standalone registry products (e.g. Docker Datacenter, or VMware Harbor); some are commercial. For an enterprise you should look carefully at running your own private registry. For this blog I’m going to use the sweh repo on Docker Hub.
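
If you just want to experiment, the official registry image gives you a quick local registry. A minimal (insecure, non-production) sketch, run on whichever node will host it:

% docker run -d -p 5000:5000 --name registry registry:2
% docker tag myimage localhost:5000/myimage
% docker push localhost:5000/myimage

(In a real swarm you’d reference the registry by a name every node can resolve, and sort out TLS and authentication.)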

Initializing a Swarm

Once you have your servers built and the Docker software installed and configured (datastores, etc.) we can get started and build a swarm out of them.

On my network I have 4 VMs configured

  1. docker-ce
  2. test1
  3. test2
  4. test3

(my hostnames are so imaginative!). docker-ce will be the manager node, so we can initialize it.

Before we do this, let’s quickly check the state of the network. It should match what we’ve seen before

% docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
20b900380c5e        bridge              bridge              local
9a4091b44793        host                host                local
5e9fc3c20b10        none                null                local

% brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02420dfebfca       no

% ip -4 addr show dev docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Now we can start the swarm.

% docker swarm init
Swarm initialized: current node (f8t7omvmys6pa0nx66wpathuf) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1234567890123456789069ugkou4sr7k476qlulvi9y3rkpsbt-9ynm7135rlebwycgqnlcap824 10.0.0.164:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

That’s it! We now have a swarm! A swarm of one node, but a swarm.

What does the network look like now?

% docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
20b900380c5e        bridge              bridge              local
4c524f9b4565        docker_gwbridge     bridge              local
9a4091b44793        host                host                local
7yg0jizx89dt        ingress             overlay             swarm
5e9fc3c20b10        none                null                local

% brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02420dfebfca       no
docker_gwbridge 8000.024293da454e       no              veth9368d19

% ip -4 addr show dev docker_gwbridge
8: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    inet 172.18.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever

So we can see a new bridge network (docker_gwbridge) and a new overlay network (ingress) have been created.

Now the “swarm token” presented should be kept secret; it’s needed to join another Docker node to the swarm. If you don’t keep it secret then anyone (who can reach the manager) can add their machine… and potentially get access to your programs or data.

Fortunately you don’t need to remember this or write it down; you can ask Docker to repeat it

% docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1234567890123456789069ugkou4sr7k476qlulvi9y3rkpsbt-9ynm7135rlebwycgqnlcap824 10.0.0.164:2377
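
If you’re worried the token has leaked, you can rotate it (new joins then need the new token; nodes already in the swarm are unaffected):

% docker swarm join-token --rotate worker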

Workers and managers

There are two types of nodes in Swarm

  • Workers
    A worker node is where the compute happens; when you tell Swarm to run 5 copies of your image then it will distribute the jobs across the workers.

  • Managers
    A manager node is also a worker node, so it can run jobs. But a manager node is also used to manage the swarm. You can (and should!) have multiple managers, for resiliency. The managers communicate using the Raft consensus protocol, which means that more than half of the managers need to agree on the state of the swarm. Typically that means you need an odd number of managers: with 3 managers you can suffer the loss of 1 and keep running; with 5 managers you can lose 2; with 7 managers you can lose 3. Picking the right number of managers is an availability question and may depend on the underlying infrastructure “availability zones”.

Side note: managers can be made to not act as workers by setting their availability to “drain” (docker node update --availability drain), which tells Swarm to remove any running jobs and not schedule new ones.

A worker can be promoted to a manager (docker node promote), and a manager can be demoted to a worker (docker node demote).
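
For example, to promote test1 to a manager and then demote it back to a worker:

% docker node promote test1.spuddy.org
% docker node demote test1.spuddy.org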

Adding workers

You simply need to run the docker swarm join command that was mentioned when the swarm was initialized (and that docker swarm join-token worker reports).

% docker swarm join --token SWMTKN-1-1234567890123456789069ugkou4sr7k476qlulvi9y3rkpsbt-9ynm7135rlebwycgqnlcap824 10.0.0.164:2377
This node joined a swarm as a worker.

You can also join a node directly as a manager using the token returned by docker swarm join-token manager.

Now my swarm has some compute power!

% docker node ls
ID                            HOSTNAME               STATUS              AVAILABILITY        MANAGER STATUS
k4rec71pftizbzjfcmgio6cz5 *   docker-ce.spuddy.org   Ready               Active              Leader
vbhqixyoe8lcvotwav5dfhf7i     test2.spuddy.org       Ready               Active
wqg2tdlbfow7bftzu5sexz32g     test1.spuddy.org       Ready               Active
zqx3vkx0cdqswlq9q9xn91ype     test3.spuddy.org       Ready               Active

You might have noticed my prompt was %; you don’t need to be root to do any of this if you are in the docker group (membership of which is very powerful!).
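
(For reference, membership is typically granted with something like usermod -aG docker youruser, where youruser is a placeholder for your account, followed by a fresh login.)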

If you look at a worker node network it also has the new docker_gwbridge and ingress networks created. But, importantly, the ID of the ingress network is consistent across all nodes. This shows it’s shared across the swarm, whereas the other networks are still local to the node. (Indeed docker_gwbridge is 172.18.0.1/16 on each node).
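
You can check this from the manager; the output below is illustrative, but the ingress ID should match the one we saw earlier:

% ssh test1 docker network ls | grep ingress
7yg0jizx89dt        ingress             overlay             swarm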

First service

Docker Swarm works on the concepts of services. These define the unit to be deployed across the swarm. It can be as simple as a single container, or it could be a complicated setup similar to those described by docker-compose.

Let’s start a copy of centos and just set it pinging localhost:

% docker service create --detach=true --replicas 1 --name pinger centos ping localhost
vfdxae3ck4r2xt5ig4fd7aqqv

Well, that looks very similar to a docker run command…

We can see the service state, and we can even see what server is running it:

% docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
vfdxae3ck4r2        pinger              replicated          1/1                 centos:latest

% docker service ps pinger
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                ERROR               PORTS
xxn5tnej6bao        pinger.1            centos:latest       test1.spuddy.org    Running             Running 25 seconds ago

OK, so that’s running on test1. We can log in to that machine and use the docker commands we already know to look at this:

% ssh test1 docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
3aac8b07c701        centos:latest       "ping localhost"    About a minute ago   Up About a minute                       pinger.1.xxn5tnej6bao96up565ak1xny

% ssh test1 docker logs pinger.1.xxn5tnej6bao96up565ak1xny | tail -3
64 bytes from localhost (127.0.0.1): icmp_seq=160 ttl=64 time=0.017 ms
64 bytes from localhost (127.0.0.1): icmp_seq=161 ttl=64 time=0.347 ms
64 bytes from localhost (127.0.0.1): icmp_seq=162 ttl=64 time=0.048 ms

% ssh test1 docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
centos              <none>              36540f359ca3        10 days ago          193MB

% ssh test3 docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

%

This should be a big clue that Swarm really just sits on top of the existing docker constructs! test3 doesn’t have the centos image because it hasn’t needed to pull it yet.

Re-scale

One container is boring. I want 3 of them!

% docker service scale pinger=3
pinger scaled to 3

% docker service ps pinger
ID                  NAME                IMAGE               NODE                   DESIRED STATE       CURRENT STATE              ERROR               PORTS
xxn5tnej6bao        pinger.1            centos:latest       test1.spuddy.org       Running             Running 4 minutes ago
dcmno194ymjf        pinger.3            centos:latest       test2.spuddy.org       Running             Preparing 14 seconds ago
g89lpi6xeuji        pinger.2            centos:latest       test3.spuddy.org       Running             Preparing 14 seconds ago

% ssh test3 docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
centos              <none>              36540f359ca3        10 days ago          193MB

We can see the state is Preparing, which means it isn’t running yet. Part of this preparation is pulling down the image from the repository. Once the image has been pulled, the container will start up.

But finally:

% docker service ps pinger
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
xxn5tnej6bao        pinger.1            centos:latest       test1.spuddy.org    Running             Running 5 minutes ago
dcmno194ymjf        pinger.2            centos:latest       test2.spuddy.org    Running             Running 2 seconds ago
g89lpi6xeuji        pinger.3            centos:latest       test3.spuddy.org    Running             Running 2 seconds ago

The service can be stopped with docker service rm.
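
For example:

% docker service rm pinger
pinger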

Self-healing

Let’s start again, and run 3 new copies of my service:

% docker service ps pinger
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
k9myef61ebqe        pinger.1            centos:latest       test3.spuddy.org    Running             Running 9 seconds ago
quelws25339z        pinger.2            centos:latest       test1.spuddy.org    Running             Running 9 seconds ago
r020adnj1nbd        pinger.3            centos:latest       test2.spuddy.org    Running             Running 9 seconds ago

On test1 I’m going to kill the image:

% ssh test1 docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
e49af88bdfac        centos:latest       "ping localhost"    About a minute ago   Up 59 seconds                           pinger.2.quelws25339z0zkfxek89k5fa
% ssh test1 docker kill e49af88bdfac
e49af88bdfac

We told the Swarm we wanted three replicas, so what did it do?

% docker service ps pinger
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                ERROR                         PORTS
k9myef61ebqe        pinger.1            centos:latest       test3.spuddy.org    Running             Running about a minute ago
qul4kor2n9j1        pinger.2            centos:latest       test1.spuddy.org    Running             Running 18 seconds ago
quelws25339z         \_ pinger.2        centos:latest       test1.spuddy.org    Shutdown            Failed 24 seconds ago        "task: non-zero exit (137)"
r020adnj1nbd        pinger.3            centos:latest       test2.spuddy.org    Running             Running about a minute ago

Ha! It noticed that one container (pinger.2) had failed and brought up a new one to replace it.

Docker Swarm maintains a history of terminated containers to assist in diagnostics; by default this is something like 5 containers per replica, but can be changed.
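
If you want to tune how many old tasks are kept, there’s a swarm-wide setting for it; something like:

% docker swarm update --task-history-limit 2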

Let’s be a little more drastic; hard power down test1.

% docker service ps pinger
ID                  NAME                IMAGE               NODE                   DESIRED STATE       CURRENT STATE                     ERROR                         PORTS
k9myef61ebqe        pinger.1            centos:latest       test3.spuddy.org       Running             Running 7 minutes ago
838i3nnx6ptj        pinger.2            centos:latest       docker-ce.spuddy.org   Running             Starting less than a second ago
qul4kor2n9j1         \_ pinger.2        centos:latest       test1.spuddy.org       Shutdown            Running 6 minutes ago
quelws25339z         \_ pinger.2        centos:latest       test1.spuddy.org       Shutdown            Failed 6 minutes ago              "task: non-zero exit (137)"
r020adnj1nbd        pinger.3            centos:latest       test2.spuddy.org       Running             Running 7 minutes ago

In under a minute the Swarm detected the worker node had failed, and restarted the required job on another node.

This is a great way of doing HA; if you have multiple copies of your service running (scale > 1) then you can handle single server failures almost unnoticed!

The Swarm also tells us the node isn’t available:

% docker node ls
ID                            HOSTNAME               STATUS              AVAILABILITY        MANAGER STATUS
k4rec71pftizbzjfcmgio6cz5 *   docker-ce.spuddy.org   Ready               Active              Leader
vbhqixyoe8lcvotwav5dfhf7i     test2.spuddy.org       Ready               Active
wqg2tdlbfow7bftzu5sexz32g     test1.spuddy.org       Down                Active
zqx3vkx0cdqswlq9q9xn91ype     test3.spuddy.org       Ready               Active

Planned outages

If a server is planned to be taken down (e.g. for patching or rebooting) then the workload can be drained off it first, and the node made quiescent:

% docker service ps pinger
ID                  NAME                IMAGE               NODE                   DESIRED STATE       CURRENT STATE             ERROR               PORTS
vobdvumj1ve3        pinger.1            centos:latest       test3.spuddy.org       Running             Running 6 seconds ago
q1z3333fmcxa        pinger.2            centos:latest       docker-ce.spuddy.org   Running             Preparing 4 seconds ago
5ddwrc3dcuh8        pinger.3            centos:latest       test1.spuddy.org       Running             Running 1 second ago

% docker node update --availability drain docker-ce.spuddy.org
docker-ce.spuddy.org

% docker node ls
ID                            HOSTNAME               STATUS              AVAILABILITY        MANAGER STATUS
k4rec71pftizbzjfcmgio6cz5 *   docker-ce.spuddy.org   Ready               Drain               Leader
vbhqixyoe8lcvotwav5dfhf7i     test2.spuddy.org       Ready               Active
wqg2tdlbfow7bftzu5sexz32g     test1.spuddy.org       Ready               Active
zqx3vkx0cdqswlq9q9xn91ype     test3.spuddy.org       Ready               Active

% docker service ps pinger
ID                  NAME                IMAGE               NODE                   DESIRED STATE       CURRENT STATE            ERROR               PORTS
vobdvumj1ve3        pinger.1            centos:latest       test3.spuddy.org       Running             Running 47 seconds ago
3ww4npd4yc9g        pinger.2            centos:latest       test2.spuddy.org       Ready               Ready 9 seconds ago
q1z3333fmcxa         \_ pinger.2        centos:latest       docker-ce.spuddy.org   Shutdown            Running 9 seconds ago
5ddwrc3dcuh8        pinger.3            centos:latest       test1.spuddy.org       Running             Running 41 seconds ago

We can see the job that was on the docker-ce node has been shut down and migrated to test2.

Remember, this is a restart of the service, so state internal to the container is lost… but you’re externalizing your state, aren’t you?

After maintenance is complete, the node can be made available again:

% docker node update --availability active docker-ce.spuddy.org
docker-ce.spuddy.org

Load balancer

Docker Swarm has a built-in load balancer. Let’s build a simple web server with a CGI script that reports the container’s hostname and addresses. I’m going to push this to Docker Hub with the name sweh/test. Once it’s pushed I can delete the local version (and the base image I built on).
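
(As a rough sketch, and purely my illustration rather than what sweh/test actually contains, the image could be as simple as:

FROM centos
RUN yum -y install httpd iproute
COPY t.cgi /var/www/cgi-bin/t.cgi
RUN chmod 755 /var/www/cgi-bin/t.cgi
EXPOSE 80
CMD ["/usr/sbin/httpd", "-DFOREGROUND"]

where t.cgi is a small shell script that prints $REMOTE_ADDR, the container hostname, and the output of ip -4 addr.)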

% docker build -t test .
[...stuff happens...]
Successfully tagged test:latest

% docker tag test sweh/test

% docker push sweh/test
The push refers to a repository [docker.io/sweh/test]
[...stuff happens...]

Let’s run 3 copies of this, exposing port 80. Notice I’m referring to the image by its Docker Hub name (sweh/test), so each engine can pull it from the registry.

% docker service create --detach=true --replicas 3 --publish 80:80 --name httpd sweh/test
0fsfgeylod67ff4iwx9uiu4i9

% docker service ps httpd
ID                  NAME                IMAGE               NODE                   DESIRED STATE       CURRENT STATE            ERROR               PORTS
v1uwsqfpho2c        httpd.1             sweh/test:latest    docker-ce.spuddy.org   Running             Running 32 seconds ago
40z9wjqhokwq        httpd.2             sweh/test:latest    test1.spuddy.org       Running             Running 9 seconds ago
jwnab0dz2d1s        httpd.3             sweh/test:latest    test2.spuddy.org       Running             Running 19 seconds ago

OK, so I exposed port 80… but what IP address is the service on? The answer is every IP address in the swarm. Each node (even if it’s not running the service) will listen on port 80 and direct the request to a running container.

% curl localhost/cgi-bin/t.cgi
Remote: 10.255.0.2
Hostname: baded58d0501
My Addresses:
    lo 127.0.0.1/8
  eth0 10.255.0.9/16 10.255.0.6/32
  eth1 172.18.0.3/16

% curl localhost/cgi-bin/t.cgi
Remote: 10.255.0.2
Hostname: 69b94822a2f4
My Addresses:
    lo 127.0.0.1/8
  eth0 10.255.0.7/16 10.255.0.6/32
  eth1 172.18.0.3/16

% curl test3/cgi-bin/t.cgi
Remote: 10.255.0.5
Hostname: baded58d0501
My Addresses:
    lo 127.0.0.1/8
  eth0 10.255.0.9/16 10.255.0.6/32
  eth1 172.18.0.3/16

So two calls to localhost hit different containers (different hostnames), and hitting a node not running the service still worked.

This makes it pretty easy to put an HAProxy (or similar) instance in front of your swarm; it doesn’t matter where your container is running, the combination of HAProxy and the Swarm load balancer means the request will be forwarded.
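
As a very rough sketch (assuming HAProxy running on a separate box, using my node hostnames, and omitting the usual global/defaults tuning):

frontend http-in
    bind *:80
    default_backend swarm_web

backend swarm_web
    balance roundrobin
    server docker-ce docker-ce.spuddy.org:80 check
    server test1     test1.spuddy.org:80 check
    server test2     test2.spuddy.org:80 check
    server test3     test3.spuddy.org:80 check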

The underlying constructs are very similar to those we saw previously on a single node. There are some interesting IP addresses in there, which hint at how Docker is doing this magic; note that the primary address on eth0 is unique to each instance, but that a secondary address (and the address on eth1) is consistent across them.

Now we know eth1 is on the docker_gwbridge bridge local to each node. But what is this 10.255.0.0/16 network?

% docker network inspect --format='{{ .IPAM.Config }}' ingress
[{10.255.0.0/16  10.255.0.1 map[]}]

Ah ha! It’s the ingress network, whose ID we’d previously seen was consistent across the nodes. Yet it doesn’t appear as a network interface on the host.

This ingress network is the Swarm equivalent of docker0, but using the overlay driver across the swarm. If you look back at the example web server output, you’ll see the “Remote” address was a 10.255.0.x value; we’re seeing the load balancer at work on the ingress network.

Docker Swarm and docker-compose

Swarm cannot run docker-compose directly. If you try, it’ll warn you that you’re just running a single-node service. Your compose file will work, but only on the node you started it on; there’s no Swarm interaction.

However, docker stack can read a compose file and create a stack of services out of it.

There are limitations, of course, due to the distributed nature of the runtime. A big one, for example, is that filesystem-based volumes don’t work so well (/my/dir on the manager may not be present on all the compute nodes!). There are workarounds (different volume drivers; NFS; …) but this complexity is inherent in a multi-server model.
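
For example, a named volume can be backed by NFS via the local driver. A sketch, assuming an NFS server reachable as nfs.spuddy.org exporting /export/db-data (both are placeholders for your own environment):

volumes:
  db-data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=nfs.spuddy.org,rw"   # placeholder NFS server
      device: ":/export/db-data"    # placeholder export path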

In earlier examples of docker-compose I created a 3-tier architecture (web, app, DB). Let’s see if we can do the same with Swarm. Now the DB is awkward; it needs access to specific files, so let’s run that on a fixed node.

% cat deploy.yaml
version: "3"

networks:
  webapp:
  appdb:

volumes:
  db-data:

services:
  web:
    image: sweh/test
    networks:
      - webapp
    ports:
      - 80:80

  app:
    image: centos
    networks:
      - webapp
      - appdb
    entrypoint: /bin/sh
    stdin_open: true
    tty: true

  db:
    image: mysql:5.5
    networks:
      - appdb
    environment:
      - MYSQL_ROOT_PASSWORD=foobar
      - MYSQL_DATABASE=mydb1
    volumes:
      - db-data:/var/lib/mysql
    deploy:
      placement:
        constraints: [node.hostname == test1.spuddy.org]

WARNING: This is a really bad setup. db-data is local to each node, so there’s no data persistence if the database is allowed to bounce around the swarm; that’s why we only allow it to run on test1. Do not do this for any production setup (use a proper network-aware volume setup!); I’m just showing it here for simplicity.

% docker stack deploy -c deploy.yaml 3tier
Creating network 3tier_appdb
Creating network 3tier_webapp
Creating service 3tier_db
Creating service 3tier_web
Creating service 3tier_app

You can see this looks very similar to the previous compose file.

% docker stack ls
NAME                SERVICES
3tier               3

% docker stack services 3tier
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
7f02afdt674n        3tier_app           replicated          1/1                 centos:latest
qj9ken7d69tq        3tier_web           replicated          1/1                 sweh/test:latest    *:80->80/tcp
u97lwk6kc8cj        3tier_db            replicated          1/1                 mysql:5.5

Of course a stack uses services, so everything we’ve seen still works

% docker service ps 3tier
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
b0kk8dkq7n1i        3tier_app.1         centos:latest       test3.spuddy.org    Running             Running 47 seconds ago
a5wrrwjh07ni        3tier_web.1         sweh/test:latest    test2.spuddy.org    Running             Running 48 seconds ago
fs0owmxbm45h        3tier_db.1          mysql:5.5           test1.spuddy.org    Running             Running 28 seconds ago

It should be no surprise to see the mysql instance running on test1!

Swarm stack networks

We can see the networking side of things as well. test1 only runs the db layer, so it only has the appdb network:

% ssh test1 docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
1b5cs8cs2sf3        3tier_appdb         overlay             swarm
...

(output edited to remove other networks we’ve already seen)

Similarly, test2 only has the webapp network, but test3 runs the app and so needs both networks:

% ssh test2 docker network ls | grep 3t
s6u8tobcz9rp        3tier_webapp        overlay             swarm

% ssh test3 docker network ls | grep 3t
1b5cs8cs2sf3        3tier_appdb         overlay             swarm
s6u8tobcz9rp        3tier_webapp        overlay             swarm

Again, these networks are overlay networks across the swarm, and they match the bridges we saw for docker-compose.

We can inspect each service to see what IPs have been assigned:

% docker service inspect --format='{{ .Endpoint.VirtualIPs }}' 3tier_web
[{7yg0jizx89dtc9t5kurmpxn95 10.255.0.6/16} {s6u8tobcz9rpcezii6gpuh6c6 10.0.1.2/24}]

% docker service inspect --format='{{ .Endpoint.VirtualIPs }}' 3tier_app
[{1b5cs8cs2sf3zxjo9n40f7g80 10.0.0.4/24} {s6u8tobcz9rpcezii6gpuh6c6 10.0.1.4/24}]

% docker service inspect --format='{{ .Endpoint.VirtualIPs }}' 3tier_db 
[{1b5cs8cs2sf3zxjo9n40f7g80 10.0.0.2/24}]

So the appdb network is 10.0.0.0/24 and the webapp network is 10.0.1.0/24. The web service also appears on the ingress network because of the published port.

Name resolution works similarly to docker-compose; services can reach each other by name (e.g. from the app instance you can ping db).

Now this seems to hit an issue on my setup; I use 10.0.0.0/24 for my main network (e.g. test1 is 10.0.0.158), which clashes with the range Swarm picked for the appdb overlay. There’s an occasional slowdown in resolving hostnames (e.g. ping db may take 5 seconds, but then it works).

I’m wondering if the Docker instances have trouble routing between the VXLAN and the primary network. So let’s shut this down (docker stack rm 3tier) and modify our stack definition to specify explicit, non-conflicting network ranges:

networks:
  webapp:
    ipam:
      config:
        - subnet: 10.20.1.0/24
  appdb:
    ipam:
      config:
        - subnet: 10.20.2.0/24
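
Then deploy it again as before:

% docker stack deploy -c deploy.yaml 3tier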

Now when we run the stack the services work as we expect, mirroring what we saw with docker-compose; the web layer cannot see the db layer, but the app layer can see both.

  web: eth0=10.20.1.3/24, 10.20.1.2/32   (webapp)
       eth1=172.18.0.3/16                (docker_gwbridge)
       eth2=10.255.0.7/16, 10.255.0.6/32 (ingress)

  app: eth0=10.20.2.3/24, 10.20.2.2/32   (appdb)
       eth1=172.18.0.3/16                (docker_gwbridge)
       eth2=10.20.1.5/24, 10.20.1.4/32   (webapp)

   db: eth0=10.20.2.5/24, 10.20.2.4/32   (appdb)
       eth1=172.18.0.3/16                (docker_gwbridge)

In all cases the default route is to 172.18.0.1 (i.e. the local bridge on each node).

e.g. on the app instance, ip route shows:

default via 172.18.0.1 dev eth1
10.20.1.0/24 dev eth2  proto kernel  scope link  src 10.20.1.5
10.20.2.0/24 dev eth0  proto kernel  scope link  src 10.20.2.3
172.18.0.0/16 dev eth1  proto kernel  scope link  src 172.18.0.3

Summary

Argh, this post has grown far too long. I didn’t even talk about secrets management…

Docker Swarm is a powerful extension of the standalone Docker Engine model to a distributed resilient model; run your containers over multiple machines and let Swarm handle the uptime. With stacks it becomes relatively easy to migrate from a one-node compose to a multi-node clustered setup.

Of course there are complications; resources (e.g. directories, repositories) may not be consistent across the cluster. This is inherent complexity in a multi-server solution, and you’ll see similar constraints on other orchestration systems (e.g. Kubernetes). The closer your apps are to 12 Factor, the easier they will be to run in a swarm.

I’m not sure why Swarm decided it was free to re-use my primary network range; that definitely caused issues. This is probably only a problem with the stack command; if I had manually created the networks and attached services to them, it wouldn’t have been a problem. If you’ve seen this issue then please let me know in the comments :-)

Understanding Docker Swarm is important to understanding Docker Datacenter (and Docker Enterprise Edition), since that uses Swarm as the underlying technology.