I recommend Munin for general-purpose Linux/Unix server state monitoring (CPU, memory, disk, network traffic, etc.).
One of the strengths of Munin is that both its implementation and its state model are very simple. For each server (myserver1, myserver2) you have multiple plugins (cpu, df), each of which returns one or more values (/dev/sda1 disk free, /dev/sda2 disk free). You configure 'warning' and 'critical' levels and the notifications tell you which values went over those levels, so you can easily see whether issues are at the "keep an eye on it" stage or the "drop everything and fix it" stage. When the values go back down to 'normal' levels, you get a second notification that they're now OK, so if you haven't got that, things are still bad.
This is generally a good system because you don't want to get another set of alert notices every time it polls - which is every 5 minutes. And it combines all the new alerts for each host into a single email, so when a server goes offline you'll get one email about it, not 20.
The only problem is that when something goes seriously wrong in your datacenter, especially with something shared (a power failure, say, or your database server going temporarily offline), you'll probably get not one warning/critical notice but dozens, because it affects a number of servers. You will, of course, be told when each one goes back to OK levels.
But in practice, there's a catch here: you would have to carefully look through the list of original notices and check to see if you've received an OK notification for each. In reality, when you get about 30 alert emails and later get about 30 "OK" emails, you tend to assume that everything's fine now - and not notice that one of the services didn't come back up.
As such I recommend that you always configure Munin to periodically re-send any alerts, so that you realise that there's still a problem.
There's no configuration option in Munin to do this, but it's very easy to set up yourself, because Munin simply stores the list of warning states it has issued in a single file - /var/lib/munin/limits.
When all your measurements are normal, it's empty except for a version header:
version 1.4.4
Whereas when there are any outstanding issues it looks more like:
version 1.4.4
myserver1;df;_dev_sda1;warning Value is 95.99. Warning range (:85) exceeded
myserver1;df;_dev_sda1;state warning
myserver1;if_err_eth2;trans;unknown Value is unknown.
myserver1;if_err_eth2;trans;state unknown
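If you just want a quick look at what Munin currently considers outstanding, the file is easy to summarise. The following is only a rough sketch: the function name outstanding_states is my own invention, and it assumes the file has exactly the layout shown in the sample above (a version header, then one entry per line in the form host;plugin;field;state level), which may differ between Munin versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Munin): list outstanding states from a
# limits file. Assumes one entry per line, in the form
#   host;plugin;field;state <level>
# as in the sample above; other lines (version header, messages) are skipped.
outstanding_states() {
    limits_file="${1:-/var/lib/munin/limits}"

    # A missing file is treated like an empty one: nothing outstanding.
    [ -f "$limits_file" ] || return 0

    # Split on ';', keep only lines whose 4th field starts with "state ",
    # then print host, plugin, field and the level, separated by spaces.
    awk -F';' '$4 ~ /^state / {
        sub(/^state /, "", $4)
        print $1, $2, $3, $4
    }' "$limits_file"
}
```

Calling outstanding_states with no argument reads /var/lib/munin/limits; pass a path to try it against a copy of the file instead.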
You could just send yourself this file, but then you'd miss out on the translations from the plugin field names to the "real" human-readable labels, so I think it's better to trigger a resend from Munin itself, and that's easy to do. Munin copes just fine if this file goes missing - a missing limits file is treated exactly the same as an empty limits file - so if you want alerts to be resent, just make a cronjob to rm the file:
45 * * * * root rm -f /var/lib/munin/limits
There I've set up Munin for reminders every hour (at 45 minutes past) since we need to keep a close eye on that project, but for a lot of projects every few hours or even every day would probably be more reasonable.
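For the less urgent cases, only the time fields of the cron entry need to change. These are untested variations on the line above, in the same system-crontab format (with the user field); the particular hours are arbitrary choices of mine, so pick whatever suits the project, and use only one of them:

```
# Reminder every 4 hours, at 45 minutes past the hour
45 */4 * * * root rm -f /var/lib/munin/limits

# Reminder once a day, at 08:45
45 8 * * * root rm -f /var/lib/munin/limits
```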