Fixing non-terminating cacti shell scripts

Cacti is a really useful tool for monitoring and graphing the state of a Linux system (CPU load, free disk space, database stats etc.) using RRDtool. Recently, though, one of my shell-script-based cacti data input methods didn’t terminate as expected and drove my CPU crazy. It took me some time to track down the cause, so I thought it might be worth writing a line or two about it.

How I use cacti

To monitor the performance of one of my applications, I wrote a small bash script as a cacti data input method. It analyzes the logfile of the application and outputs the average timings. Basically the script looks like this:

#!/bin/bash
# walk backwards through the log, take field 8 of the last 50 checkpoint lines
for i in $(tac application.log | grep -m 50 "checkpoint" | cut -d" " -f8)
do
   # checkpoint calculations
   :
done
# output average to cacti

This script reads the application.log file starting from the end (in reverse, hence tac), greps for lines containing “checkpoint” and stops after 50 matches (-m 50), then pipes the result to cut to extract the eighth field. No magic here. Called directly from bash, the script returns after 1-2 seconds, even on logfiles 2 GB in size. Normally the script is called every 5 minutes by cacti’s poller.php, which is run by php from a cron job.
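The following is a minimal sketch of how the elided parts could be filled in, not the original script. The function name average_checkpoint_time is made up for illustration, and the sketch assumes that field 8 of every checkpoint line holds a plain number. It replaces the for loop with awk, which sums and averages in one pass.

```bash
#!/bin/bash
# Sketch only: average the 8th space-separated field of the last N lines
# containing "checkpoint". The field layout is an assumption.
average_checkpoint_time() {
    local logfile=$1 count=${2:-50}
    tac "$logfile" \
        | grep -m "$count" "checkpoint" \
        | cut -d" " -f8 \
        | LC_ALL=C awk '{ sum += $1; n++ }
                        END { if (n) printf "%.2f\n", sum / n; else print "0.00" }'
}
```

Calling average_checkpoint_time application.log prints a single number, which is one of the output forms a cacti data input method can return.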

The problem

Recently several tac commands were running at the same time, nearly killing my machine (a load of 18 (!)). After terminating the tac commands, I tried to reproduce the problem by calling the cacti bash script on the cmdline, without luck. It returned after 1-2 seconds as usual. Then, 5 minutes later, cacti invoked the same script and it got stuck again. After scratching my head for a while, I attached to the CPU-killing tac process via

strace -p <pid>

and could see that it produced SIGPIPE (EPIPE) aka “Broken Pipe” errors en masse. But instead of terminating because of those errors, tac continued processing the whole 2 GB log file. Since that took more than 5 minutes, cron started another poller.php run that called the bash script. Again, and again. That was the reason why CPU usage went up like crazy.

What is this SIGPIPE aka Broken Pipe?

After doing some research I figured out that SIGPIPE is a signal the kernel sends to a process that writes to a pipe whose reading end has already been closed, typically because the process on the other side has terminated. By default the signal terminates the writer. If the signal is ignored or caught instead, the write() call fails with EPIPE, and it is up to the program itself to notice that error, stop doing its work and exit.
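The difference between the two cases can be seen with a small experiment. This is only a sketch: writer_status is a made-up helper name, and yes and head stand in for tac and grep.

```bash
#!/bin/bash
# Run "yes | head -n 1" and print how the writer (yes) ended.
# 141 = 128 + 13 means it was killed by SIGPIPE; any other non-zero value
# means yes saw the failed write() (EPIPE) and exited on its own.
writer_status() {
    yes 2>/dev/null | head -n 1 >/dev/null
    echo "${PIPESTATUS[0]}"
}

echo "default SIGPIPE: $(writer_status)"
# Ignore SIGPIPE in a subshell only; child processes inherit the ignore.
echo "ignored SIGPIPE: $(trap '' PIPE; writer_status)"
```

Started from a normal shell on a GNU/Linux system, the first line typically reports 141 and the second a plain error status. A program that does not check its write() errors would simply keep running in the second case.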

Why is SIGPIPE apparently ignored?

In my specific case, tac kept processing the logfile and sending the data to grep, although grep had already terminated after 50 matches. So why did tac keep processing the logfile instead of being stopped by SIGPIPE? Well, the only explanation I could find is that someone trapped/ignored that signal. Normally that can be done with the trap command in bash. It lets you run custom commands when your script receives a specific signal. It also gives you the option to simply ignore signals, or to reset signals that were trapped or ignored before. Some examples look like this:

# delete lockfile when the script exits
trap "rm -f .lockfile" EXIT

# do nothing on EXIT; with a real signal such as PIPE, '' ignores the signal
trap '' EXIT

# remove any commands/ignores that were assigned to EXIT earlier
trap - EXIT

Within my bash script I tried the last one to restore the original signal behavior, but that didn’t work as expected. As I found out, a signal can’t be reset if the caller had already ignored it when the script was started :(. My guess is that the php process running poller.php, and in the end my bash script, ignores/traps that signal somehow. Since I didn’t want to modify the cacti scripts, I had to find another solution.
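The failed reset can be reproduced without cacti or php. In this sketch a subshell plays the role of the caller that ignores SIGPIPE; status_after_reset is a made-up name, and yes and head again stand in for tac and grep.

```bash
#!/bin/bash
# Start a child bash with SIGPIPE already ignored, let it try "trap - PIPE",
# and print how the writer of a pipe ended inside that child.
# 141 (128 + 13) would mean the reset worked and SIGPIPE killed the writer.
status_after_reset() {
    (
        trap '' PIPE    # the "caller" ignores SIGPIPE
        bash -c 'trap - PIPE
                 yes 2>/dev/null | head -n 1 >/dev/null
                 echo "${PIPESTATUS[0]}"'
    )
}

status_after_reset
```

With a non-interactive bash this should not print 141: the trap - PIPE is accepted without an error message but has no effect, because the signal was already ignored when the shell started.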

How can I fix it?

The main difference between calling the bash script manually (working) and having it called by the php interpreter (not working) is that, done manually, it is started from a so-called ‘interactive shell’. In an interactive shell, commands can e.g. ask the user to input data interactively instead of getting it via cmdline switches. A typical interaction would be a yes/no confirmation.

So I simply added the -i switch to my bash script, which forces bash to run in interactive mode:

#!/bin/bash -i

After that, the script behaved exactly as it does when called on the cmdline. The tac process terminated as expected and never got stuck again, so there was no more horde of ‘zombie’ processes trying to kill my machine :).

Note: Don’t forget to put bash -i into the Data Input Method / Input String as well! Also check for pending jobs in the Poller Cache and rebuild it if necessary, so that the changes are picked up properly.
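On newer systems there may be an alternative that leaves the shebang untouched. GNU env from coreutils 8.31 or later has a --default-signal option that resets a signal to its default action for a single command, even if the caller ignores it. That was not an option with coreutils 6.10, and I have not tested it together with cacti, so take this as a sketch. The function name last_matches is made up.

```bash
#!/bin/bash
# Sketch: run tac with the default SIGPIPE action restored, so it is
# terminated as soon as grep has its matches and closes the pipe.
# Requires GNU env from coreutils >= 8.31 (--default-signal).
last_matches() {
    local logfile=$1 pattern=$2 count=${3:-50}
    env --default-signal=PIPE tac "$logfile" | grep -m "$count" -- "$pattern"
}
```

For example, last_matches application.log checkpoint 50 would yield the same lines as the tac/grep part of the original pipeline.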

Conclusion

To be honest, I don’t really know where exactly the root cause of the problem lies. Maybe php ignores SIGPIPE during popen()/fgets() calls in general, or the tac command doesn’t handle SIGPIPE appropriately when called from non-interactive shells. But the fact is that using -i in the script made things work for me. Maybe for you too, just try it and let me know :)!
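One way to narrow this down on Linux is to check which signals a running process ignores. The kernel exposes them as a hex bitmask in the SigIgn line of /proc/<pid>/status, where signal n is bit n-1, so SIGPIPE (13) is 0x1000. The helper name sigpipe_ignored is made up, and this is only a sketch.

```bash
#!/bin/bash
# Return 0 if the process with the given PID ignores SIGPIPE, 1 if it does
# not, and 2 if its status file cannot be read. Linux only (/proc).
sigpipe_ignored() {
    local pid=$1 mask
    mask=$(awk '/^SigIgn:/ { print $2 }' "/proc/$pid/status") || return 2
    [ -n "$mask" ] || return 2
    # SigIgn is hexadecimal; test the bit for signal 13.
    (( (16#$mask & 0x1000) != 0 ))
}
```

Called as sigpipe_ignored $$ at the top of the data input script, or with the PID of a stuck tac process, it shows whether the signal was already ignored at that point.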

What also came to my mind: maybe it is a bug in one of my installed Debian packages. So I list some of them here for you to compare:

cacti 0.8.7b-2.1+len
coreutils 6.10-6 (contains tac 6.10)
php 5.2.6-1+lenny6 with Suhosin-Patch 0.9.6.2
bash 3.2.39(1)

After I fixed the problem as described above, I stumbled over the fact that cacti also supports “spine” for calling data input method scripts. The major difference seems to be that it is written in C and is faster than the php-based poller.php. Maybe switching to it would have fixed the problem as well. Please let me know if you’ve tried it with success.