Using Nagios to make things good

2017-03-31

At LMAX Exchange, Nagios is one of our essential tools for monitoring and verifying the operation of our systems. We use it for three distinct purposes:

  • Alerting when things break.
  • Recording trends so that we can predict when problems will occur and then mitigate them.
  • Verifying the overall structure of our environments.

Things have broken

Alerting on failures is perhaps the most common use of Nagios. These checks need to run often, perhaps every few seconds. As an example, consider a web server and some of the tests we might want to run against it:

  • Does the server respond on ports 80 and 443?
  • Is Apache running on the server?
  • Are all the network interfaces up?
  • Are all the fans working?
  • Has a disk failed?
  • Are there unexpected users logged in?

Some of these tests use the standard Nagios checks; others we write ourselves. What they have in common is that they return a binary result: pass or fail. We also want to know as soon as one fails, and depending on the test we send an email or an SMS alert.
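To make the pass/fail idea concrete, here is a minimal sketch of what a home-grown check for the first item in the list might look like. It is not our production plugin: the function names, the default host and the timeout are assumptions chosen for illustration. The only convention relied on is the standard Nagios plugin one, where exit code 0 means OK and 2 means CRITICAL, and the first line of output becomes the status text.

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_ports(host, ports, probe=port_is_open):
    """Binary check: CRITICAL if any port fails to respond, otherwise OK.

    `probe` is injectable (an assumption of this sketch) so the logic
    can be exercised without touching the network.
    """
    closed = [p for p in ports if not probe(host, p)]
    if closed:
        return CRITICAL, "CRITICAL - no response on port(s) " + ", ".join(map(str, closed))
    return OK, "OK - all ports responding: " + ", ".join(map(str, ports))

def main(argv):
    # Hypothetical command line: the host name is the first argument.
    host = argv[1] if len(argv) > 1 else "localhost"
    code, message = check_ports(host, [80, 443])
    print(message)  # Nagios displays the first line of output
    return code
```

Saved as a script with `import sys` and a final `sys.exit(main(sys.argv))`, this would hand Nagios the exit code it uses to decide whether to raise an alert.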

Things will break

Some tests check for trends, and here the graphing feature is useful. We use these for resources we can do something about, e.g. filesystem storage, memory, CPU and network utilisation. For these checks we want to know before they become an issue. Typically we set a threshold, and when it is reached we get an alert and make a judgement call as to what to do, e.g. assign more disk space to the filesystem or more cores to the VM. Some trends need more planning to address, so for those we want to be alerted sooner.
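As a rough sketch of a threshold check of this kind, the example below classifies filesystem usage against a warning and a critical level and estimates how long until the filesystem fills. The thresholds (80% and 90%), the function names and the straight-line extrapolation are all assumptions for illustration, not a description of our actual checks. The text after the `|` follows the Nagios performance data format, which is what graphing tools consume.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage onto a Nagios state."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK

def days_until_full(samples):
    """Estimate days until 100% usage from (day, used_pct) samples.

    Uses a straight line through the oldest and newest sample, which is
    a deliberate simplification. Returns None if usage is not growing.
    """
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    if d1 <= d0 or u1 <= u0:
        return None
    rate = (u1 - u0) / (d1 - d0)  # percentage points per day
    return (100.0 - u1) / rate

def check_filesystem(path, warn=80.0, crit=90.0):
    """Return (exit_code, message) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    perfdata = f"used_pct={used_pct:.1f}%;{warn:g};{crit:g};0;100"
    return code, f"DISK {label} - {path} at {used_pct:.1f}% | {perfdata}"
```

For a resource that takes longer to expand, lowering `warn` gives more notice, and `days_until_full` gives a crude idea of how much time is left once the warning has fired.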

We might break things

The third kind of monitoring is for things that are incorrectly configured, might cause us annoyance in the future, or are simply not as we expect them to be. Examples include the versions of certain firmware, which network card is plugged into which switch, and the size of a disk volume. The faults that cause these tests to fail have normally happened because someone has changed something. We usually run most of these tests once a week, but when we make a change we trigger them manually to pre-empt the alert. They are also useful when a new system is built: we run the checks to verify that it has been built correctly, then rectify the configuration until all tests are green. The same applies after work has been completed in a rack, when we run all the tests to verify that nothing was disturbed. Luke will cover this in more depth in a blog post next month.
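One way to picture a check like this is as a comparison between what we expect and what is actually there. The sketch below is illustrative only: the setting names, the values and the choice of WARNING as the severity are hypothetical, and gathering the observed values from the host is left out entirely.

```python
OK, WARNING = 0, 1  # standard Nagios plugin exit codes

def find_mismatches(expected, actual):
    """List differences between expected and observed settings."""
    problems = []
    for key in sorted(expected):
        if key not in actual:
            problems.append(f"{key}: expected {expected[key]!r}, nothing found")
        elif actual[key] != expected[key]:
            problems.append(f"{key}: expected {expected[key]!r}, found {actual[key]!r}")
    return problems

def check_configuration(expected, actual):
    """Return (exit_code, message); WARNING severity is an assumption."""
    problems = find_mismatches(expected, actual)
    if problems:
        return WARNING, "WARNING - " + "; ".join(problems)
    return OK, f"OK - {len(expected)} settings match"

# Hypothetical expectations for one server.
EXPECTED = {
    "nic_firmware": "1.2.3",
    "eth0_switch_port": "switch-a:12",
    "data_volume_gb": 500,
}
```

Running `check_configuration(EXPECTED, observed)` against a freshly built machine, and again after each fix, gives the "rectify until green" loop described above.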

In Summary

Nagios can be used for much more than up/down polling. At LMAX Exchange, Nagios helps us maintain a consistent environment through trend analysis and configuration anomaly detection.