Cacti is a really useful tool for monitoring and graphing the state of a Linux system (CPU load, free disk space, database stats etc.) using RRDtool. But recently I found that one of my shell-script-based cacti data input methods didn't terminate as expected, driving my CPU crazy. It took me some time to track down the cause, so I thought it might be worth writing a line or two about it.
How I use cacti
To monitor the performance of one of my applications, I wrote a small bash script as a cacti data input method that analyzes the application's logfile and outputs the average timings. Basically the script looks like this:
    #!/bin/bash
    for i in `tac application.log|grep -m 50 "checkpoint"|cut -d" " -f8`
    do
        # checkpoint calculations
    done
    # output average to cacti
This script reads the application.log file starting from the end (in reverse, hence tac), greps for lines containing "checkpoint", pipes them to cut and stops after 50 matching lines have been processed. No magic here. Calling the script directly from bash returns after 1-2 seconds, even on logfiles of 2 GB in size. Normally the script is called every 5 minutes by cacti's poller.php command, invoked by crontab.
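The parts left out of the script above could be fleshed out roughly like this. It is only a sketch under assumptions: the sample log format, the function name average_checkpoint, the match limit of 3 and the avg: output label are all made up for illustration. Only the tac | grep -m | cut pipeline is taken from the script above.

```shell
#!/bin/bash
# Hypothetical sample log: field 8 (space-separated) holds the timing.
log=$(mktemp)
for n in 10 20 30 40; do
    echo "2010-01-01 12:00:00 INFO app worker checkpoint took $n ms" >> "$log"
done

# Average field 8 of the newest $2 "checkpoint" lines in file $1.
average_checkpoint() {
    tac "$1" | grep -m "$2" "checkpoint" | cut -d" " -f8 |
        LC_ALL=C awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR; else print "0.00" }'
}

avg=$(average_checkpoint "$log" 3)   # newest three timings: 40, 30, 20
echo "avg:$avg"
rm -f "$log"
```

Doing the arithmetic in awk at the end of the pipeline avoids the shell loop altogether. The early exit of grep after the requested number of matches stays the same.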
Recently it happened that several tac commands were running at the same time, nearly killing my machine (a load of 18!). After terminating the tac commands, I tried to reproduce the problem by calling the cacti bash script on the command line, without luck. It returned after 1-2 seconds as usual. Then, 5 minutes later, cacti invoked the same script and it got stuck again. After scratching my head for a while, I attached to the CPU-killing tac process and could see that it produced SIGPIPE (EPIPE), aka "Broken pipe", errors en masse. But instead of terminating because of those errors, tac continued processing the whole 2 GB log file. Since that took more than 5 minutes, crontab started another poller.php command calling the bash script. Again, and again. That was the reason why CPU usage went up like crazy.
What is this SIGPIPE aka Broken Pipe?
After doing some research I figured out that SIGPIPE is a signal sent by the kernel to a process that writes to a pipe nobody is reading from anymore, typically because the process at the other end has already terminated. If the signal is ignored or caught, the write() call fails with EPIPE instead. Normally the signal's default action terminates the writing process, so it stops doing its work and exits.
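The effect can be observed directly in a shell. The following sketch is not part of my cacti script. It uses yes and head as stand-ins for tac and grep. A process killed by a signal is reported with exit status 128 plus the signal number, which gives 141 for SIGPIPE (signal 13 on Linux).

```shell
#!/bin/bash
# 'yes' writes endlessly; 'head' exits after one line, closing the read end.
# The inner bash reports the exit status of the writer via PIPESTATUS.
writer_status=$(bash -c 'yes 2>/dev/null | head -n 1 >/dev/null; echo "${PIPESTATUS[0]}"')

if [ "$writer_status" -eq 141 ]; then
    echo "writer was killed by SIGPIPE (status 141 = 128 + 13)"
else
    echo "writer exited with status $writer_status, SIGPIPE is probably ignored here"
fi
```

Run from a normal login shell this should take the first branch. Any other status hints that the calling environment ignores the signal.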
Why is SIGPIPE apparently ignored?
In my specific case, tac kept processing the logfile and sending data to the grep command, although grep had already terminated after 50 matches. So why did tac keep processing that logfile instead of handling SIGPIPE correctly? Well, the only explanation I could find is that someone trapped/ignored that signal. Normally that can be done with the trap command in bash. The trap command allows you to run custom commands when a specific signal is received within your script. It also gives you the option to simply ignore signals or to restore signals that have been ignored before. Some examples look like this:
    # delete lockfile when the script exits
    trap "rm -f .lockfile" EXIT
    # ignore the EXIT signal
    trap '' EXIT
    # remove any commands/ignores that were assigned to EXIT earlier
    trap - EXIT
Within my bash script I tried the last one to restore the original signal handling, but that didn't work as expected. As I found out, the signal can't be restored if the caller has ignored it in the first place :(. My guess is that the php process running poller.php, which in the end calls my bash script, ignores/traps that signal somehow. Since I didn't want to modify the cacti scripts, I had to find another solution.
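That a child shell cannot undo an inherited ignore can be reproduced in isolation. In this sketch the outer script plays the role of the caller by ignoring SIGPIPE (an assumption standing in for whatever php does). The inner non-interactive bash then tries trap - PIPE before running a pipeline, again with yes and head as stand-ins for tac and grep.

```shell
#!/bin/bash
# Play the caller: ignore SIGPIPE. Child processes inherit this disposition.
trap '' PIPE

# The child shell tries to restore the default action, then runs a pipeline
# whose reader exits early. It prints the writer's exit status.
writer_status=$(bash -c 'trap - PIPE
yes 2>/dev/null | head -n 1 >/dev/null
echo "${PIPESTATUS[0]}"')

# 141 would mean "killed by SIGPIPE". Any other value means the writer only
# saw a failed write (EPIPE), i.e. the signal was still ignored.
echo "writer exit status: $writer_status"
```

Here yes still terminates, because it gives up on the first failed write. A program that keeps going after EPIPE would run until its input is exhausted, which matches what I saw with tac.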
How can I fix it?
The main difference between calling the bash script manually (working) and having it called by the php interpreter (not working) is that running it manually uses a so-called 'interactive shell'. That means that commands can, for example, ask the user to input data interactively instead of receiving it via command-line switches. A typical interaction would be a confirmation by entering yes/no.
So I simply added the -i switch to my bash script, which forces bash to run in interactive mode.
After that, the script behaved exactly as it does when called on the command line. The tac process terminated as expected and never got stuck again generating a horde of 'zombie' processes trying to kill my machine :).
Note: Don't forget to put bash -i into the Data Input Method / Input String as well! Also check for pending jobs in the Poller Cache and rebuild it if necessary, so that the changes are reflected properly.
To be honest, I don't really know where exactly the root cause of the problem lies. Maybe php ignores SIGPIPE signals during popen()/fgets() calls in general, or the tac command doesn't handle SIGPIPE signals appropriately when called from non-interactive shells. But the fact is that using -i in the script made things work out for me. Maybe for you too - just try it and let me know :)!
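On Linux, the guesswork can be narrowed down by checking whether a given process ignores SIGPIPE. The SigIgn line in /proc/<pid>/status is a hexadecimal bitmask in which bit 12 stands for signal 13. A small sketch, assuming a Linux /proc filesystem; the function name sigpipe_state is made up for illustration:

```shell
#!/bin/bash
# Print "ignored" or "not ignored" for SIGPIPE of the process with PID $1,
# or "unknown" if its /proc status file cannot be read.
sigpipe_state() {
    status_file="/proc/$1/status"
    [ -r "$status_file" ] || { echo "unknown"; return; }
    mask=$(awk '/^SigIgn:/ { print $2 }' "$status_file")
    # SIGPIPE is signal 13, which is bit 12 of the mask.
    if [ $(( (0x$mask >> 12) & 1 )) -eq 1 ]; then
        echo "ignored"
    else
        echo "not ignored"
    fi
}

sigpipe_state $$    # state of this shell itself
```

Calling it with the PID of a stuck tac process, or of the php process running the poller, would show whether the ignore is inherited from the caller.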
It also occurred to me that this might be a bug in one of my installed Debian packages, so I list some of them here for you to compare:
- cacti 0.8.7b-2.1+len
- coreutils 6.10-6 (contains tac 6.10)
- php 5.2.6-1+lenny6 with Suhosin-Patch 0.9.6.2
- bash 3.2.39(1)
After I fixed the problem as described above, I stumbled over the fact that cacti also supports "spine" for calling data input method scripts. The major difference seems to be that it is written in C and is faster than the php-based poller.php. Maybe switching to this method would also have fixed the problem. Please let me know if you've tried it with success.