In this glorious new world I’ve been writing about, applications are non-persistent. They spin up and are destroyed at will. They have no state in them. They can be rebuilt, scaled out, migrated, replaced, and your application shouldn’t notice… if written properly!
But applications are pointless if they don’t have data to work on. In traditional compute an app is associated with a machine (or set of machines). These machines have filesystems. We can write data there. If we want to share data between machines we can use something like NFS. It’s very easy to persist data.
In our new dynamically scalable app migrating world we don’t have this.
So where do we store our data?
The standard answer is to use an external datastore, such as MySQL or CouchDB or an object store (typically presented with an Amazon S3 compatible API, even if not actually using Amazon S3). Your application doesn’t persist any data; these resources are attached (or bound) to your app so they can be used.
Even for users of databases this may require a change in behaviour; you might write out all your important data into the database but write out logs and performance data to the filesystem. You can’t do that any more; everything you want to keep needs to be stored in the database or S3 store. And that requires a code rewrite.
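If your logs currently land on the local filesystem, part of that rewrite can be as small as shipping them to the object store before the container goes away. A minimal sketch using the AWS CLI — the bucket name and paths here are invented for illustration:

```shell
# Hypothetical: push a log file to an S3-compatible store instead of
# relying on the container's filesystem surviving.
# "my-log-bucket" and the file paths are made-up names for illustration.
aws s3 cp /var/log/myapp/app.log \
    "s3://my-log-bucket/myapp/$(hostname)-$(date +%s).log"
```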
But I’ve never been a fan of that. Why can’t we treat a filesystem as if it was another attached resource? This data could also be shared between instances of the app.
With docker we have some of this ability with the `-v` flag. Let’s create a directory and share it across two running instances:
In terminal window 1:
```
tty1$ sudo mkdir -p /export/myapp
tty1$ docker run --rm -it -v /export/myapp:/myapp --name inst1 centos
[root@cd0c4a2b0055 /]# ls /myapp
[root@cd0c4a2b0055 /]# echo world > /myapp/hello
[root@cd0c4a2b0055 /]#
```
In window 2:
```
tty2$ docker run --rm -it -v /export/myapp:/myapp --name inst2 centos
[root@988151662ef1 /]# cat /myapp/hello
world
[root@988151662ef1 /]#
```
We can shut down both containers, and they’ll be destroyed (because of the `--rm` flag), but the data will persist on the host:
```
[root@cd0c4a2b0055 /]# exit
tty1$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
tty1$ ls /export/myapp/
hello
tty1$ docker run --rm -it -v /export/myapp:/myapp --name inst3 centos
[root@7ed3b5e955f6 /]# cat /myapp/hello
world
```
So here’s an easy way of persisting data in a manner that developers are already used to.
It’s not that easy
There are a number of problems with this model. First and most important is data security. Because the data is present on the host, anyone with access to the host might be able to read it. This is why I’ve recommended that production container servers treat the parent OS with high access restrictions; as restrictive as your hypervisors in a traditional VM.
There may be problems with SELinux or other labelling systems; the directory we created didn’t have any of the right labels, so if the OS was set to enforcing mode then access to this directory may be rejected (indeed, I had to do a `setenforce 0` for these tests).
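A less drastic alternative than `setenforce 0`: assuming a reasonably recent docker, the `:z` volume suffix asks docker to relabel the shared directory for you, or you can label it yourself on the host. A sketch:

```shell
# The :z suffix relabels the volume so containers can use it under SELinux
# (use :Z instead for a private, per-container label):
docker run --rm -it -v /export/myapp:/myapp:z --name inst1 centos

# Alternatively, label the host directory directly; the type name varies
# by distro (svirt_sandbox_file_t traditionally, container_file_t on
# newer systems):
sudo chcon -Rt svirt_sandbox_file_t /export/myapp
```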
We’ve only looked at a single host; in a real-world environment your app may spin up and down across dozens of different servers, so you may need your persistent datastore to come from an NFS server or similar. You might just mount that at the host level. Docker also allows for different backends to be used with the `-v` flag; it can talk directly to an NFS server, for example. That’s pretty powerful!
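As a sketch of that, the local volume driver has NFS options: create a named volume backed by an NFS export and attach it by name. The server address and export path below are invented for illustration:

```shell
# Create a named volume backed by an NFS export; 192.168.1.10 and
# /export/myapp are hypothetical values.
docker volume create --driver local \
  --opt type=nfs \
  --opt o=addr=192.168.1.10,rw \
  --opt device=:/export/myapp \
  myapp-data

# Attach it by name rather than by host path:
docker run --rm -it -v myapp-data:/myapp --name inst1 centos
```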
A lot of this docker functionality is documented in a tutorial.
You also need your orchestration tool to be able to support this configuration and start up your containers with the right flags. Mesos, for example, supports persistent volumes; Kubernetes supports similar.
There’s nothing that says this has to be docker only; for example, if you use systemd-nspawn to manage your containers then there is a `--bind` option to do similar stuff.
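For instance, a bind mount with systemd-nspawn might look like this (the container root path is hypothetical):

```shell
# Bind the host directory into an nspawn container, much like docker's -v:
sudo systemd-nspawn -D /var/lib/machines/myapp --bind=/export/myapp:/myapp
```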
Even Amazon are in this game with Elastic File System, which presents your storage as an NFSv4.1-accessible filesystem that you can mount onto your EC2 server. This means you can take the persistence out of your AMI and put it into EFS; the result is something very similar to the docker examples earlier.
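A sketch of what that mount looks like from the EC2 side — the filesystem ID and region here are made up; substitute your own EFS DNS name:

```shell
# Mount an EFS filesystem over NFSv4.1; fs-12345678 and us-east-1 are
# example values only.
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```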
We don’t have to use storage tools such as S3 to keep to a 12-factor design. It’s perfectly possible to keep your standard filesystem semantics while keeping your application layer immutable, and all the rest of the goodness.
This can even make operations easier; that same persistent volume could be shared read-only with an “operations application”; your operations team can read the logs and analyse performance statistics on demand (only spinning up that app when needed) without needing to access the production application container.
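That read-only sharing is just the `:ro` volume suffix; a minimal sketch:

```shell
# The operations container sees the same data but cannot modify it;
# any write inside /myapp fails with "Read-only file system".
docker run --rm -it -v /export/myapp:/myapp:ro --name opsapp centos
```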