Monday, July 11, 2016

Troubleshooting high load average on Linux

Troubleshooting high load average on Linux

This morning I was quite surprised to see one of my lab servers suffering from a high load average. To be fair I haven’t looked at it for a while, it’s just one of these demo systems ticking along … However this was an interesting case!
High load averages are usually first felt when you get sluggish response from the terminal. I got very sluggish response but attributed it to the SSH connection. Nope, it turned out that the other VMs were a lot more responsive, and I know that they are on the same host.
A few options exist in Linux to see what is going on. The uptime command is a pretty good first indicator.
1
2
# uptime
 09:14:05 up 28 days, 11:29,  1 user,  load average: 19.13, 19.15, 19.13
OK so this has been going on for a little while (hey remember its my lab server!). But what’s contributing? The next stop was top for me. Here’s the abridged output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# top -b -n 1 | head
top - 03:11:40 up 28 days, 11:32,  1 user,  load average: 19.08, 19.15, 19.13
Tasks: 217 total,   1 running, 215 sleeping,   0 stopped,   1 zombie
Cpu0  :  5.5%us,  2.9%sy,  0.0%ni, 91.0%id,  0.1%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu1  :  1.5%us,  1.2%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.1%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.8%us,  0.7%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  1.5%us,  1.1%sy,  0.0%ni, 97.3%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  1.1%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.8%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.6%us,  0.6%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  2.9%us,  1.8%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
...
Swap: 4194304k total,        0k used, 4194304k free, 4194302k cached
The interesting bit here is that the CPUs are mostly idle. No single process consumed more than a few percent of CPU, most a lot less actually. Only 1 zombie process listed so that’s not an issue either. In 2.4 kernels zombie processes could cause problems because they a) sometimes couldn’t be killed (“super zombies”) and b) easily could hog a CPU. No swap is in use which in combination with the absence of kswapd activity allowed me to discard a memory shortage.
I have seen high load averages in the past as the result of database activity (lots of sessions with lots of nested loop joins hammering the CPUs with access requests to the same set of hot blocks) or swapping. Both could be ruled out here. Swapping would most likely show as high %SYS percentages. Additionally poorly performing Oracle queries would also show up using %USR cycles and list high in the output of top.
So maybe it’s something else then?
So it had to be something else then. If the CPUs are idle then maybe what’s contributing to the load is a “hung” process with high priority or maybe something generating a lot of I/O.
Two approaches came to my mind:
  • using iotop
  • using top in batch mode and filtering processes in suspicious states
The output of iotop was not revealing any further clues to the problem so I carried on with a suggestion I found using my favorite search engine. The suggestion was to filter the output of top for processes in state “D”. According to the man page of top, the following states are defined
1
2
3
4
5
6
7
w: S  --  Process Status
   The status of the task which can be one of:
      ’D’ = uninterruptible sleep
      ’R’ = running
      ’S’ = sleeping
      ’T’ = traced or stopped
      ’Z’ = zombie
So a process in state D is in uninterruptible sleep-that does not sound too good because if anything is in an uninterruptible sleep then it hogs the CPU. Running the following on the server revealed the culprits very quickly:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
> top -n 1 -b | egrep "PID|updatedb"
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1535 root      39  19  3868  796  524 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
 7138 root      39  19  3868  788  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
 8246 root      39  19  3868  768  508 D  0.0  0.0   0:00.57 /usr/bin/updatedb -f ...
 9279 root      39  19  3868  788  524 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
12125 root      39  19  3868  772  508 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
12860 root      39  19  3868  772  508 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
13480 root      39  19  3868  788  524 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
14632 root      39  19  3868  796  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
16745 root      39  19  3868  788  524 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
18962 root      20   0 61192  716  624 S  0.0  0.0   0:00.00 egrep PID|updatedb
19052 root      39  19  3868  788  524 D  0.0  0.0   0:00.00 /usr/bin/updatedb -f ...
19988 root      39  19  3868  792  524 D  0.0  0.0   0:00.01 /usr/bin/updatedb -f ...
21520 root      39  19  3868  796  524 D  0.0  0.0   0:00.03 /usr/bin/updatedb -f ...
26830 root      39  19  3868  788  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
27192 root      39  19  3868  768  508 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
28528 root      39  19  3868  796  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
29751 root      39  19  3868  768  508 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
32080 root      39  19  3868  772  508 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
32386 root      39  19  3868  792  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
32649 root      39  19  3868  792  524 D  0.0  0.0   0:00.02 /usr/bin/updatedb -f ...
There are 19 of these processes occupying the machine. Although they literally don’t do anything, their process state is worrying me. And I only have 8 threads in total, so 19 is definitely too high a number. I decided to use a kill -9 approach removing the processes from the system. At worst it would leave me with a stale index for use with locate. And sure enough the load average went down.
1
2
# uptime
 09:18:22 up 28 days, 13:03,  2 users,  load average: 0.53, 3.72, 11.54
RESULT!
Oh, I should have said that updatedb is part of the findutils-locate package, updatedb indexes your disk allowing you to run “locate -i ” to find a file without running a find command across “/”. The call to updatedb is scheduled in /etc/cron.daily/mlocate.cron. I didn’t spend time to find out why the processes got stuck

Isolating Linux High System Load

Uptime

The first thing I'll do when I can get logged into a non-responsive system is do an "uptime". If the load is high, I know I need to start digging into things with the other tools. Uptime gives you 3 numbers which indicate the 1, 5, and 15 minute load averages. From this you can tell if the load is trending up, neutral, or going down:
guin:~$ uptime
 13:26:32 up 1 day, 16:52, 21 users,  load average: 0.00, 0.14, 0.15
guin:~$
On my laptop, the load is fairly low, but it is trending down (1 minute average of 0.00, 5 minute of 0.14, and over 15 minutes it's 0.15).
database:~ # uptime
 12:29pm  up 1 day 13:29,  1 user,  load average: 0.84, 0.82, 0.80
database:~ #
On this database server the load is somewhat low (it's a quad CPU box, so I wouldn't consider it saturated until it was around 4).

Dmesg

It's also useful to look at the bottom of the "dmesg" output. Usually it isn't particularly revealing, but in the case of hardware errors or the out of memory killer it can very quickly reveal a problem.

Vmstat

Next I will often run "vmstat 1", which prints out statistics every second on the system utilization. The first line is the average since the system was last booted:
denver-database:~ # vmstat 1
procs ---------memory---------- --swap-- --io--- -system-- -----cpu-----
 r  b swpd   free   buff  cache  si  so  bi  bo   in   cs  us sy id wa st
 0  0  116 158096 259308 3083748   0   0  47  39   30   58 11  8 76  5  0
 2  0  116 158220 259308 3083748   0   0   0   0 1706 4899 22 14 64  0  0
 1  0  116 158220 259308 3083748   0   0   0 276 1435 1490  4  2 93  0  0
 0  0  116 158220 259308 3083748   0   0   0   0 1502 1569  5  3 92  0  0
 0  0  116 158220 259308 3083748   0   0   0 892 1394 1529  2  1 97  0  0
 1  0  116 158592 259308 3083748   0   0   0 216 1702 1825  8  7 84  1  0
 0  0  116 158344 259308 3083748   0   0   0 368 1465 1461  8  7 84  0  0
 0  0  116 158344 259308 3083748   0   0   0 940 1992 2115  2  2 95  0  0
 0  0  116 158344 259308 3083748   0   0   0 240 1906 1982  6  7 87  0  0
The first thing I'll look at here is the "wa" column; the mount of CPU time spent waiting. If this is high you almost certainly have something hitting the disc hard.
If the "wa" is high, the next thing I'd look at is the "swap" columns "si" and "so". If these are much above 0 on a regular basis, it probably means you're out of memory and the system is swapping. Since RAM is around a million times faster than a hard drive (10ns instead of 10ms), swapping much can cause the system to really grind to a halt. Note however that some swapping, particularly swapping out, is normal.
Next I'd look at the "id" column under "cpu" for the amount of idle CPU time. If this is around 0, it means the CPU is heavily used. If it is, the "sy" and "us" columns tell us how much time is being used by the kernel and user-space processes.
If CPU "sy" time is high, this can often indicate that there are some large directories (say a user's "spam" mail directory) with hundreds of thousand or millions of entries, or other large directory trees. Another common cause of high "sy" CPU time is the system firewall: iptables. There are other causes of course but these seem to be the primary ones.
If CPU "us" is high, that's easy to track down with "top".

ps awwlx --sort=vsz

If there is swapping going on I like to look at the big processes via "ps awwlx --sort=vsz". This shows processes sorted by virtual sizes (which does include shared libraries, but also counts blocks swapped out to disc).

Iostat

For systems where there is a lot of I/O activity (shown via the "bi" and "bo" being high, but "si" and "so" being low), iostat can tell you more about what hard drives the activity is happening on, and what the utilization is. Normally I will run "iostat -x 5" which causes it to print out updated stats every 5 seconds:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.64    0.00    3.95    0.30    0.00   90.11

Device: rrqm/s wrqm/s   r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda       0.00   9.60  0.60  2.40   6.40  97.60    34.67     0.01  4.80  4.80  1.44
I'll first look at the "%util" column, if it's approaching 100% then that device is being hit hard. In this case we only have one device, so I can't use this to isolate where the heavy activity might be happening, but if the database were on it's own partition that could help track it down.
"await" is a very useful column, it tells us how long the device takes to service a request. If this gets high, it probably indicates saturation.
Other information iostat gives can tell us if the activity is read-oriented or writes, and whether they are small or large writes (based on the sec/s sectors per second rate and the number of read/writes per second).

Iotop

This requires a very recent kernel (2.6.20 or newer), so this isn't something I tend to run very often: most of the systems I maintain are enterprise distros, so they have older kernels. RHEL/CentOS 3/4/5 are too old, Ubuntu Hardy doesn't have iotop, but Lucid does support it.
iotop is like top but it will show processes that are doing heavy I/O. However, often this may be a kernel process so you still may not be able to tell exactly what process is causing the I/O load. It's much better than what we had in the past though.

Top

In the case of high user CPU time, "top" is a great tool for telling you what is using the most CPU.

Munin

Munin is a great tool that tracks long-term system performance trends. However, it's not something you can start using when you have a performance problem. It's the sort of thing you should set up on all your systems so that you can build up the historic usage and have it available when you need it.
It will give you extensive stats about CPU, disc, RAM, network, and other resources, and allow you to see trends to determine an upgrade will be needed in the coming months, rather than that you needed to do one a few months ago. :-)

Conclusions

When performance problems hit, there are many great tools for helping to isolate and resolve them. Using these techniques I've been able to quickly and accurately identify and mitigate performance issues.

No comments:

Post a Comment