Cacti is a really useful tool for monitoring and graphing the state of a Linux system, such as CPU load, free disk space or database stats, using RRDtool. Recently, however, I found that one of my shell-script-based cacti data input methods didn’t terminate as expected, driving my CPU crazy. It took me some time to track down the cause, so I thought it might be worth writing a line or two about it.
How I use cacti
To monitor the performance of one of my applications, I wrote a small bash script as a cacti data input method. It analyzes the application’s logfile and outputs the average timings. Basically the script looks like this:
#!/bin/bash
# tac reads the log backwards, grep stops after 50 "checkpoint" lines,
# cut extracts the 8th space-separated field of each match
for i in $(tac application.log | grep -m 50 "checkpoint" | cut -d" " -f8)
do
    : # checkpoint calculations
done
# output average to cacti
This script reads the application.log file starting from the end (in reverse, hence tac) and pipes it to grep, which selects lines containing “checkpoint” and stops after 50 matches (-m 50). cut then extracts the eighth space-separated field of each match. No magic here. Called directly from bash, the script returns after 1-2 seconds, even on logfiles 2 GB in size. Normally the script is called every 5 minutes by cacti’s poller.php command, invoked by php using crontab.
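For illustration, here is a minimal sketch of what a complete version of such a script could look like. It assumes that the eighth field of every checkpoint line is a plain number (the timing). The function name average_checkpoints and the awk-based averaging are hypothetical and not taken from the original script.

```shell
#!/bin/bash
# Hypothetical helper: print the average of field 8 over the last N
# "checkpoint" lines of a logfile (N defaults to 50).
average_checkpoints() {
    local logfile=$1 count=${2:-50}
    tac "$logfile" | grep -m "$count" "checkpoint" | cut -d" " -f8 |
        LC_ALL=C awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Only run against the real log when a filename is given on the command line.
if [ -n "$1" ]; then
    average_checkpoints "$1"
fi
```

Called as `./script.sh application.log`, this prints a single number. How that number is mapped to the output fields of the data input method is left out here.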
The problem
Recently, several tac commands were running at the same time, nearly killing my machine (a load of 18 (!)). After terminating the tac commands, I tried to reproduce the problem by calling the cacti bash script on the cmdline, without luck. It returned after 1-2 seconds as usual. Then, 5 minutes later, cacti invoked the same script and it got stuck again. After scratching my head for a while, I attached to the CPU-killing tac process via strace -p and could see that it produced SIGPIPE (EPIPE) aka “Broken Pipe” errors en masse. But instead of terminating because of those errors, tac continued processing the whole 2 GB log file. Since that took more than 5 minutes, crontab started another poller.php command calling the bash script. Again, and again. That was the reason why CPU usage went up like crazy.
What is this SIGPIPE aka Broken Pipe?
After doing some research I figured out that SIGPIPE is a signal the kernel sends to a process that writes to a pipe whose reading end has already been closed, typically because the process on the other side has terminated. By default, this signal terminates the writing process. If the signal is ignored or caught instead, write() fails with EPIPE, and it is up to the program to react to that error and exit.
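The difference between the two cases can be sketched with a small experiment. It uses yes as an endless writer and head as a reader that quits early; neither is part of the original setup. The exit statuses in the comments assume a typical Linux system with GNU coreutils.

```shell
#!/bin/bash
# Case 1: SIGPIPE has its default disposition. The kernel kills `yes` as soon
# as it writes after `head` has exited. bash reports this as 128 + 13 = 141,
# provided the calling environment does not itself ignore SIGPIPE.
default_status=$(bash -c 'yes | head -n 1 >/dev/null; echo "${PIPESTATUS[0]}"')

# Case 2: SIGPIPE is ignored. write() now fails with EPIPE and the program has
# to notice the error itself. `yes` does so and exits with status 1.
ignored_status=$(bash -c 'trap "" PIPE; yes 2>/dev/null | head -n 1 >/dev/null; echo "${PIPESTATUS[0]}"')

echo "writer exit status, default: $default_status, ignored: $ignored_status"
```

A writer that does not check the result of write() in the second case simply keeps going, which matches what tac did here.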
Why is SIGPIPE apparently ignored?
In my specific case, tac kept processing the logfile and sending data to the grep command, although grep had already terminated after 50 matches. So why did tac keep processing the logfile instead of handling SIGPIPE correctly? Well, the only explanation I could find is that someone trapped or ignored that signal. Normally that can be done with the trap command in bash. The trap command allows you to run custom commands when a specific signal is received within your script. It also gives you the option to simply ignore signals, or to reset signals that have been trapped or ignored before. Some examples look like this:
# delete lockfile when the script exits
trap "rm -f .lockfile" EXIT
# ignore EXIT (a pseudo-signal raised when the shell exits; real signals such as PIPE are ignored the same way)
trap '' EXIT
# remove any commands/ignores that were assigned to EXIT earlier
trap - EXIT
Within my bash script I tried the last one to restore the original signal handling, but that didn’t work as expected. As I found out, a signal can’t be reset if the caller had already ignored it when the shell was started :(. My guess is that the php process calling poller.php, and in the end my bash script, ignores or traps that signal somehow. Since I didn’t want to modify the cacti scripts, I had to find another solution.
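This restriction can be reproduced without cacti or php. In the sketch below the outer subshell stands in for the caller and is assumed to ignore SIGPIPE; yes and head again serve as a stand-in pipeline and are not part of the original script.

```shell
#!/bin/bash
# The command substitution runs in a subshell that plays the role of the
# caller: it ignores SIGPIPE and then starts a non-interactive bash.
status=$(
    trap '' PIPE
    bash -c 'trap - PIPE   # try to restore the default disposition
             yes 2>/dev/null | head -n 1 >/dev/null
             echo "${PIPESTATUS[0]}"'
)

# 1 means `yes` saw EPIPE (SIGPIPE still ignored); 141 would mean that the
# reset worked and the signal killed it.
echo "writer exit status: $status"
```

If the trap - PIPE had taken effect, the writer would have been killed by the signal. Instead it still runs with SIGPIPE ignored and only stops because yes checks for write errors itself.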
How can I fix it?
The main difference between calling the bash script manually (working) and having it called by the php interpreter (not working) is that the manual call is started from a so-called ‘interactive shell’. That means, for example, that commands can ask the user for input interactively instead of receiving it via cmdline switches. A typical interaction would be a yes/no confirmation.
So I simply added the -i switch to my bash script, which forces bash to run in interactive mode:
#!/bin/bash -i
After that, the script behaved exactly as it did when called on the cmdline. The tac process terminated as expected and never again got stuck, generating a horde of ‘zombie’ processes trying to kill my machine :).
Note: Don’t forget to add bash -i to the Data Input Method / Input String as well! Also check for pending jobs in the Poller Cache and rebuild it if necessary, so that the changes are reflected properly.
Conclusion
To be honest, I don’t really know where exactly the root cause of the problem lies. Maybe php ignores SIGPIPE signals during popen()/fgets() calls in general, or the tac command doesn’t handle SIGPIPE signals appropriately when called from non-interactive shells. But the fact is that using -i in the script made things work for me. Maybe for you too. Just try it and let me know :)!
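One way to narrow down the first guess is to look at the signal dispositions of the running processes. The sketch below is Linux-specific and reads the SigIgn bitmask from /proc/<pid>/status. The function name sigpipe_ignored is hypothetical.

```shell
#!/bin/bash
# Succeeds (exit status 0) if the process with PID $1 currently ignores SIGPIPE.
sigpipe_ignored() {
    local mask
    # SigIgn is a hexadecimal bitmask of ignored signals.
    mask=$(awk '/^SigIgn:/ { print $2 }' "/proc/$1/status") || return 2
    [ -n "$mask" ] || return 2
    # SIGPIPE is signal 13, so it is bit 12 of the mask.
    (( (16#$mask >> 12) & 1 ))
}
```

Running something like `sigpipe_ignored "$(pgrep -o -x tac)" && echo "SIGPIPE ignored"` while a poller run is active would show whether tac inherited an ignored SIGPIPE. The same check on the php process would show where it comes from.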
What also came to my mind: maybe it is a bug in one of my installed debian packages. So I list some of them here for you to compare:
- cacti 0.8.7b-2.1+len
- coreutils 6.10-6 (contains tac 6.10)
- php 5.2.6-1+lenny6 with Suhosin-Patch 0.9.6.2
- bash 3.2.39(1)
After I fixed the problem as described above, I stumbled over the fact that cacti also supports “spine” for calling data input method scripts. The major differences seem to be that it is written in C and is faster than the php-based poller.php. Maybe switching to it would also have fixed the problem. Please let me know if you’ve tried it with success.