Monitoring my router with graphs

collectd, influxdb, grafana

WARNING: technical content ahead! There’s also a tonne of config files, which make this page look longer than it really is, but hopefully they’ll help other people who want to do similar work.

A few months back I replaced my OpenWRT router with a CentOS 7 based one. This machine has been working very well, and handles my Gigabit FIOS traffic without any issues.

I came across a post from Jim Perrin around using tools to collect data and draw graphs on it.

Now I’d started to play with collectd a few years ago, but never went anywhere with it (I still had the source tree from where I compiled it!). Jim’s solution looked simple enough…

He had everything running on the router, but I wanted to keep as much load off the router as possible; let it handle the job of routing packets; databases and graphing can be done elsewhere :-) So I built a small VM (512Mb RAM, 15Gb disk) and called it “monitor”. This also allows me to collect data from other machines and consolidate to one point.

Starting with influxDB

I did things in a slightly different order; I started with the database. That way I could check that collectd was sending data correctly.

So, following Jim’s instructions, I added the repo and installed the software.

% cat /etc/yum.repos.d/influxdb.repo 
[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/centos/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key

% sudo yum install influxdb
[...]

We now need to configure it. I noticed that the defaults log a lot of information; so this configuration turns off all that. You might want to leave it on while building, but beware that your /var/log/messages file will grow large!

The effective configuration is:

reporting-disabled = true
[meta]
  dir = "/var/lib/influxdb/meta"
  logging-enabled=false

[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  query-log-enabled=false

[http]
  enabled = true
  bind-address = ":8086"
  log-enabled=false

[[collectd]]
  enabled = true
  bind-address = ":8096"
  database = "collectd"
  typesdb = "/usr/share/collectd"

There are more sections in the file which probably need to be kept there, but everything inside those sections are commented out; they’re the defaults.

Now the collectd section needs to see types.db file from collectd. If you have the collectd RPM installed then this lives in /usr/share/collectd/types.db, otherwise you’ll need to copy the file over. The config file points to the directory the file lives in.

Startup and run the server:

% sudo systemctl enable --now influxdb

% ps -ef | grep influx
influxdb  2597     1  0 Aug31 ?        00:25:34 /usr/bin/influxd -config /etc/influxdb/influxdb.conf

collectd

Now we can configure collectd on the router. Since this is in EPEL, make sure you have the repo enabled and then it’s simple to install:

% sudo yum install collectd

The configuration file is pretty simple:

Hostname    "router"
FQDNLookup   true

BaseDir     "/var/lib/collectd"
PIDFile     "/var/run/collectd.pid"
PluginDir   "/usr/lib64/collectd"
TypesDB     "/usr/share/collectd/types.db"

LoadPlugin syslog
<Plugin syslog>
        LogLevel info
</Plugin>

LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin network

<Plugin df>
        ReportInodes true
</Plugin>

<Plugin disk>
        Disk "/^([hsv]|xv)d[a-z][0-9]?$/"
</Plugin>

<Plugin network>
        Server "10.0.0.150" "8096"
</Plugin>

(I’ll come to the “Disk” line a little later).

Let’s start it up!

systemctl enable --now collectd

Verifying it works

At this point collectd should be sending data into InfluxDB. We can check this:

% influx
Connected to http://localhost:8086 version 1.3.5
InfluxDB shell version: 1.3.5
> use collectd;
Using database collectd
> show measurements;
name: measurements
name
----
cpu_value
df_value
disk_io_time
disk_read
disk_value
disk_weighted_io_time
disk_write
interface_rx
interface_tx
load_longterm
load_midterm
load_shortterm
memory_value

Excellent! InfluxDB has a simple SELECT syntax so you can look inside each of the measurements for the raw data.

The default settings for influxDB allows it to retain data for 7 days. If you want to keep data for longer then you’ll need to create a new retention schedule. In my case I wanted 2 months:

> create retention policy two_month on collectd duration 62d replication 1 default;
> drop retention policy autogen on collectd;
> show retention policies;
name      duration  shardGroupDuration replicaN default
----      --------  ------------------ -------- -------
two_month 1488h0m0s 24h0m0s            1        true

Grafana

Let’s head back to Jim’s tutorial… we can now install Grafana on the monitor machine.

% cat /etc/yum.repos.d/grafana.repo 
[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/6/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

% sudo yum install grafana

Configuration is pretty simple; just set the admin user/password in the [security] section. (We’ll take a look at security in a little bit).

% sudo systemctl enable --now grafana

At this point I was able to go to http://monitor:3000/ and login as the admin user.

We’ll be required to set up a datasource. Because influxDB is on the same machine, we can simply use these settings:

grafana datasource
It’s important to make sure the localhost URL is highlighted and not dim.

Creating a dashboard

We’re now ready to create a dashboard to look at all this data. Jim recommended the Host overview dashboard, so that’s where I started.

I realised that some of these graphs were not suitable for a router; for example the networking graphs take the total over all interfaces, which means that “received” and “sent” values are almost identical. After all a router receives data on one interface and sends it back out again on another :-)

The GUI makes it pretty easy to edit panels, move them around, add/remove entries. I wanted the network panels to only refer to br-wan (the interface I use for connecting to the internet). I also didn’t care too much about disk space on this dashboard.

The resulting JSON file is here, if it helps. I can now quickly see my internet usage and CPU load on the server.

Multiple hosts

At this point it becomes very easy to deploy this to other servers. Since my internet hosts are connected via a VPN I can use the same file everywhere. A simple ansible script let me deploy to all the hosts I wanted:

---
- name: Set up collected
  hosts: collectd
  become: False

  tasks:
    - yum: name=collectd state=present

    - service: name=collectd enabled=yes state=started

    - name: collectd.conf
      template: src=Files/collectd/collectd.conf.j2 dest=/etc/collectd.conf owner=root group=root mode=644
      notify:
      - reconfig

  handlers:
    - name: reconfig
      service: name=collectd state=restarted

The “conf” file is identical to the one previously shown, except for the one line:

Hostname    "{{ ansible_hostname }}"

I also added a line for my VM server to collect virt stats:

{% if ansible_hostname == "penfold" %}
LoadPlugin virt
{% endif %}

A quick ansible-playbook command and a few minutes and now I have 7 machines all sending data to influxDB and they now all magically show in the ‘Host Overview” dashboard.

This also explains the odd “Disk” line in the config. The various machines may show hd* or sd* or vd* or even xvd* (Xen client). I also wanted to not collect data for metadisks (md0, md6, etc).

Security

Now this configuration has a lot of things that I’d scream about at work. We’re exposing services to the network without proper access control (primarily influxDB). This means that anyone on my network could send funky data and pollute my database or even read data. There’s no sensitive data stored here, but it’s not a good configuration and would never be acceptable in an enterprise environment.

The tools do allow for authentication and TLS; InfluxDB documentation goes into quite some detail on how to configure this, and both collectd and Grafana can be configured to authenticate to talk to it.

However, I’m not an enterprise; guests are on the guestnet and can’t see this server; I’m not exposing it to the internet…

No doubt this will come back to haunt me at some point :-)

Conclusion

I’m actually not a fan of GUIs for management. This form of “eyes on glass” monitoring means that issues may be lost. However they do have benefits; oddities may stand out.

Indeed, one oddity showed up almost immediately:

grafana network

Why is my network baseline downloaded 800Mbytes/hour? That seems high!

I was about to hunt down what server was doing a constant 200Kbyte/sec when it struck me… my speed check tester downloads 200Mbytes every 15 minutes; 4 times an hour. That’s 800Mbytes, right there :-)

This graph also clearly shows when I downloaded backups from my internet hosted servers, and when I uploaded my backups to Amazon cloud :-)