Finding the cause of high CPU utilisation on Linux

We've all had it happen from time to time: suddenly, sites on a server we manage appear to have slowed to a crawl. Having logged in, you take a look at the running processes to find the cause. Nine times out of ten it's something like a backup routine overrunning, so you kill the task and everything's OK again.

What do we do, though, if that's not the cause? It can sometimes be difficult to narrow down exactly what's responsible when the process list shows everything's currently OK. The slowdown may only kick in for a few seconds, but the perfectionist in us still needs to know the cause so we can mitigate it (if it's a particular page on a particular site, what happens when it suddenly gets a lot of hits?).

This documentation details ways to monitor usage without needing to spend all day watching top, the main aim being to identify which process is responsible so we can then dig a little further.


I'm assuming you've already done the basic investigation (top, iostat and so on). This is more about how to implement logging to try and pick up the longer-term issues.
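For reference, the sort of quick one-off checks I mean are along these lines (assuming the procps ps and top, and the sysstat iostat, that ship with most Linux distributions):

# ten most CPU-hungry processes right now
ps aux --sort=-%cpu | head -n 11

# a single batch-mode snapshot of top
top -b -n 1 | head -n 20

# extended per-device I/O statistics, refreshed every five seconds
iostat -x 5

If those don't show an obvious culprit while the problem is actually happening, it's time to start logging.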


The first thing to do is to set up a script originally found on UnixForums, with a few tweaks to iron out a couple of minor bugs.

#!/bin/bash
# temporary files for the two ps snapshots we'll compare
fa=/tmp/ps.a.$$
fb=/tmp/ps.b.$$
PS='aux'
# clean up the snapshot files when the script exits
trap 'rm -f "$fa" "$fb"' EXIT

zh=$(ps $PS | head -n 1)   # capture the ps header line
ps $PS | sort > "$fb"      # prime the comparison

while true
do
    sleep 60
    mv -f "$fb" "$fa"
    date
    ps $PS | sort > "$fb"
    echo "$zh"
    # show only entries that changed, ignoring anything with just 0-1s of CPU time
    comm -13 "$fa" "$fb" | grep -v '0:0[01] '
    echo
done

The script takes a snapshot of the output of ps, waits a minute, then takes another to compare. It strips out any processes that have only accumulated 0-1 seconds of CPU time (as they're not a major concern) and then spits out details of the processes whose entries have changed.
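That filtering is done by the grep on the comm line, which matches against the TIME column of the ps output, so if you want to ignore more (or fewer) of the short-lived processes, that pattern is the place to tweak. A rough sketch of how that line could look inside the script if, say, you wanted to hide anything with under ten seconds of accumulated CPU time:

# original: hide entries showing only 0:00 or 0:01 of CPU time
comm -13 "$fa" "$fb" | grep -v '0:0[01] '

# possible tweak: hide anything under ten seconds of CPU time
comm -13 "$fa" "$fb" | grep -v '0:0[0-9] '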

To be of real use we want to leave this running for quite some time, preferably logging to a file, so assuming we saved the script as ps_monitor.sh we run

ps_monitor.sh > ps_log.log
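
Since the script loops forever, it's probably worth backgrounding it and detaching it from the terminal so it carries on after you log out; something along these lines should do (assuming the script is executable and sitting in the current directory):

nohup ./ps_monitor.sh > ps_log.log 2>&1 &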

We'll then be building up a historic log, but we probably don't want to read it all manually. Given the format it gives us, we can check for excessively high usage another way. Create a Bash script called parselog.sh:

#!/bin/bash

# Get the filename to read from the arguments
FILENAME=$1
MAXLIM=30
RESULTS=/tmp/pslogresults.$$

echo "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND"

while read -r a
do
    if [ "$a" != "" ]
    then
        # %CPU is the third whitespace-separated column of ps aux output
        CPUUSAGE=$(echo "$a" | awk '{print $3}')
        # skip the date lines, repeated headers and anything else non-numeric
        if echo "$CPUUSAGE" | grep -Eq '^[0-9]+\.[0-9]+$'
        then
            ABOVEMAX=$(echo "$CPUUSAGE > $MAXLIM" | bc)
            if [ "$ABOVEMAX" == "1" ]
            then
                echo "$a" >> "$RESULTS"
            fi
        fi
    fi
done < "$FILENAME"

# list the offenders, highest %CPU first (skip if nothing was found)
if [ -f "$RESULTS" ]
then
    sort -k3,3 -nr "$RESULTS"
    rm -f "$RESULTS"
fi

To run the script, we simply invoke it (you've already made it executable, right?) and pass it the name of the file we directed our output to in the earlier command. So in this example we'd run

parselog.sh ps_log.log
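
If you haven't made the scripts executable yet, a quick chmod covers both (assuming they're sitting in the current directory):

chmod +x ps_monitor.sh parselog.sh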

Although not pretty, this will output a list of all processes recorded as having a higher CPU utilisation than that specified in the parselog script. You may find (as I did) that one instance has a ridiculously high usage (90%), so you'll then be able to go and take a look.
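
Once you have a PID (or at least a command name) to chase, the usual /proc and strace digging applies; for example (1234 here is just a placeholder for the real PID, and strace needs to run as root or as the process owner):

# what exactly is it running, and from where?
tr '\0' ' ' < /proc/1234/cmdline; echo
ls -l /proc/1234/cwd /proc/1234/exe

# what is it doing right now? (Ctrl+C to detach)
strace -f -p 1234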

Keep an eye out in the original log file for signs of infinite loops as well. Between them, these two scripts can be very helpful in tracking down errant scripts that are hogging resources.