Nagios is both loved and hated in the industry. It has great brand recognition, but suffers from a less-than-intuitive setup and configuration that is difficult to automate. With passive checks and the right Chef cookbooks, a Nagios/Icinga setup becomes very effective. Another Nagios weakness is poor implementation. Fortunately, Nagios poorly implements the right ideas, so not everything is lost. This war story is about fixing the Nagios NSCA implementation. NSCA is a Nagios passive-check daemon written in C that runs on a Nagios server and processes all passive checks sent by Nagios clients.
In our setup, all nodes in the server farm are configured as NSCA clients that send updates to an NSCA server (or servers), which processes the information and forwards updates to Icinga through a named pipe. Monitors are configured in two places: on the NSCA client and on the Icinga server.
The actual problem we saw was that the Icinga server would stop receiving information from the NSCA server. In cases like this, what I often do is look at the strace logs of the failing process (the NSCA server in this case).
The first problem I saw in the strace output was a failure reading from /dev/urandom:
EAGAIN (Resource temporarily unavailable)
This had nothing to do with the problem I was trying to fix, but the error messages littered the strace logs, so I decided to fix it. It turns out the NSCA server does a straight read from /dev/urandom, forgetting that the read is buffered, so far more bytes are requested from the device than the code actually needs. Disabling the read buffer right after opening /dev/urandom fixed the EAGAIN errors.
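As a rough illustration of this kind of fix, here is a minimal sketch that assumes the random bytes are read through C stdio. The helper name read_random_bytes is hypothetical and is not taken from the NSCA source; the one line that matters is the setvbuf call made immediately after the open.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the NSCA source): fill buf with len
 * bytes from /dev/urandom. Returns 0 on success, -1 on failure. */
int read_random_bytes(unsigned char *buf, size_t len)
{
    FILE *fp = fopen("/dev/urandom", "rb");
    if (fp == NULL)
        return -1;

    /* Disable stdio buffering before any read, so that fread() does not
     * try to fill a larger internal buffer than the caller asked for. */
    if (setvbuf(fp, NULL, _IONBF, 0) != 0) {
        fclose(fp);
        return -1;
    }

    size_t got = fread(buf, 1, len, fp);
    fclose(fp);
    return got == len ? 0 : -1;
}
```

setvbuf has to be called before any other operation on the stream, which is why it sits directly after fopen.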
The next thing that attracted my attention was:
write(6, "[1341698843] PROCESS_SERVICE_CHE"..., 291) = -1 EPIPE (Broken pipe) <0.000208>
This time, the error actually led to the NSCA server crashing. It turns out the NSCA server does a very naive write to the pipe file descriptor connected to the Icinga process:
fprintf(command_file_fp, "[%lu] PROCESS_HOST_CHECK_RESULT;%s;%d;%s\n", ...
The NSCA server does not handle fprintf failures at all. The simple fix checks the write result and attempts to reopen the pipe if the write fails.
The last issue to fix in the NSCA server was the code that handles failures opening the pipe to the Icinga process. The NSCA server would fail to open the pipe with ENXIO:
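A minimal sketch of that check-and-reopen idea might look like the following. The names ignore_sigpipe and submit_command are hypothetical, and ignoring SIGPIPE is my assumption about how to keep the process alive long enough to see the EPIPE return value; it is not necessarily what the actual patch does. For brevity the sketch opens the file with plain fopen, which blocks on a FIFO until a reader appears.

```c
#include <signal.h>
#include <stdio.h>
#include <time.h>

/* Assumption: without this, a write to a pipe with no reader raises
 * SIGPIPE, whose default action terminates the process. */
void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* Hypothetical helper: write one external command line to the command
 * file at path, reopening the file once if the write fails.
 * *fpp caches the open stream between calls (NULL means "not open").
 * Returns 0 on success, -1 on failure. */
int submit_command(const char *path, FILE **fpp, const char *command)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        if (*fpp == NULL) {
            *fpp = fopen(path, "a");
            if (*fpp == NULL)
                return -1;
        }

        /* fprintf() may only fill the stdio buffer; fflush() forces the
         * underlying write so that an error such as EPIPE shows up here. */
        if (fprintf(*fpp, "[%lu] %s\n",
                    (unsigned long)time(NULL), command) >= 0
            && fflush(*fpp) == 0)
            return 0;

        /* The write failed: drop the stale stream and try a fresh one. */
        fclose(*fpp);
        *fpp = NULL;
    }
    return -1;
}
```

A caller would run ignore_sigpipe() once at startup and then pass the same FILE pointer to every submit_command() call, so the pipe is only reopened after a failed write.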
open("/var/icinga/rw/icinga.cmd", O_WRONLY|O_APPEND|O_NONBLOCK) = -1 ENXIO (No such device or address) <0.000030>
The ENXIO error shows up in this case because the NSCA server attempts to open the Icinga pipe for writing before the Icinga process has opened it for reading. With O_NONBLOCK set, such an open fails immediately instead of blocking until a reader appears.
After applying the three fixes listed above, our Icinga setup became rock solid.
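One plausible way to handle this, sketched under my own assumptions, is to treat ENXIO as "no reader yet" and retry after a short delay. The function name open_fifo_for_writing and its retry parameters are hypothetical and do not come from the NSCA source; only the open flags are copied from the strace line above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a FIFO for writing, retrying while no
 * reader has it open. Returns a file descriptor on success, or -1
 * with errno set on failure. */
int open_fifo_for_writing(const char *path, int max_attempts,
                          unsigned delay_seconds)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = open(path, O_WRONLY | O_APPEND | O_NONBLOCK);
        if (fd >= 0)
            return fd;

        /* Anything other than ENXIO is a real error: give up now. */
        if (errno != ENXIO)
            return -1;

        /* ENXIO: nobody is reading the FIFO yet, so wait and retry. */
        if (attempt + 1 < max_attempts)
            sleep(delay_seconds);
    }
    errno = ENXIO;
    return -1;
}
```

The retry count and delay are tuning knobs. A daemon could equally keep the failed check result queued and attempt the open again when the next result arrives.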