I have a dual Xeon 5130 server (four CPUs total) running CentOS 5.7. Approximately every 17 hours, the load on this server slowly creeps up until it hits 20, then slowly goes back down. The most recent example started around 2:00am this morning. Outside of these episodes, the load never exceeds 2.0 (and in fact spends the overwhelming majority of its time at 1.0).

So this morning, a few data points:

- 2:06 to 2:07 load increased from 1.0 to 2.0
- At 2:09 it hit 4.0
- At 2:10 it hit 5.34
- At 2:16 it hit 10.02
- At 2:17 it hit 11.0
- At 2:24 it hit 17.0
- At 2:27 it hit 19.0 and stayed there +/- 1.0 until
- At 2:48 it was 18.96 and looks like it started to go down (very slowly)
- At 2:57 it was 17.84
- At 3:05 it was 16.76
- At 3:16 it was 15.03
- At 3:27 it was 9.3
- At 3:39 it was 4.08
- At 3:44 it was 1.92, and it stayed under 2.0 from there on

This is the 1-minute load average, by the way (i.e. the first number in /proc/loadavg, as reported by top, uptime, etc).

Running top while this occurs shows very little CPU usage. The standard cause of high load with low CPU usage seems to be processes stuck in the "D" state, i.e. waiting on I/O. But we're not seeing that. In fact, the system runs sar, and I've collected copious amounts of data, but I don't see anything that jumps out as correlating with these events: no surges in disk I/O, disk read/write bytes, network traffic, etc. The system *never* uses any swap.

I also used dstat to collect everything it can for 24 hours (so it captured one of these events). I used 1-second samples and loaded the data into a huge spreadsheet, but again, didn't see any obvious "trigger" or anything interesting going on while the load spiked.

All the programs running on the system seem to work fine while this is happening... but it triggers all kinds of monitoring alerts, which is annoying. We've been tracking the events too, and as I said above, it seems to happen every 17 hours. I checked all our cron jobs, and nothing jumped out as an obvious culprit.

Anyone seen anything like this? Any thoughts or ideas?

Thanks,
Matt
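[A quick way to double-check the "waiting on I/O" theory during a spike is to list any tasks in uninterruptible sleep and the kernel wait channel they are blocked on. A minimal sketch, not from the original post; run it while the load is climbing:]

    # Show the header plus any tasks in state D, with the kernel function
    # (wchan) each one is sleeping in.
    ps -eo state,pid,user,wchan:32,args | awk 'NR == 1 || $1 ~ /D/'
    # The raw figures the monitoring alerts are keyed on.
    cat /proc/loadavg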
Miranda Hawarden-Ogata
2014-Mar-27 22:44 UTC
[CentOS] High load average, low CPU utilization
On 2014/03/27 12:20, Matt Garman wrote:
> Anyone seen anything like this? Any thoughts or ideas?
>
> Thanks,
> Matt

Something of a shot in the dark, but when we had a server with a high load average where nothing obvious was causing it, it turned out to be multiple df commands hanging on a stale NFS mount. This command helped us identify it:

top -b -n 1 | awk '{if (NR <= 7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: " count}'

Hope that helps,
Miranda
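[If a stale NFS mount is a suspect, one rough sketch is to probe each NFS mount point with a bounded stat, since a stale mount will simply hang the probe. The 5-second limit is arbitrary, and the timeout utility is not in CentOS 5's stock coreutils, so it may need to be installed or replaced with a backgrounded probe:]

    # Probe every NFS mount; one that never answers is likely stale and is
    # what leaves df/stat processes stuck in state D.
    awk '$3 ~ /^nfs/ {print $2}' /proc/mounts | while read -r mnt; do
        timeout 5 stat -t "$mnt" > /dev/null 2>&1 || echo "possible stale NFS mount: $mnt"
    done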
On Thu, 27 Mar 2014 17:20:22 -0500
Matt Garman <matthew.garman at gmail.com> wrote:

> Anyone seen anything like this? Any thoughts or ideas?

Post some data. Is this box public facing? Are you getting sprayed down by packets? Is there an array? Soft or hard? Does someone have screens lying around? Write a trap to catch a process list when the load spikes? Look at the crontab(s)? User accounts? Malicious shells? Any guest containers around? The possibilities are sort of endless here.

--
People often find it easier to be a result of the past than a cause of the future.
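[A minimal sketch of the "trap to catch a process list" idea above, assuming a bash watcher left running under screen or nohup; the threshold, interval, and log path are placeholders:]

    #!/bin/bash
    # Dump a process snapshot whenever the 1-minute load average exceeds a
    # threshold, so the next spike records what was actually running.
    THRESHOLD=5
    LOG=/var/tmp/load-spike.log
    while true; do
        load1=$(awk '{print $1}' /proc/loadavg)
        # awk exits 0 (true) when the load is above the threshold
        if awk -v l="$load1" -v t="$THRESHOLD" 'BEGIN {exit !(l > t)}'; then
            {
                echo "=== $(date) load=$load1 ==="
                # Look for clusters of D-state entries in the snapshot.
                ps -eo state,pid,ppid,user,wchan:30,args
            } >> "$LOG"
        fi
        sleep 10
    done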
On Thu, 27 Mar 2014 17:20:22 -0500
Matt Garman <matthew.garman at gmail.com> wrote:

> Any thoughts or ideas?

Start digging into your array. Perhaps you're starting to lose a drive and it's running daily integrity checks or something, i.e. a drive dropping in and out of the array or the like. /var/log/messages might have some clues (read it newest-first with tac rather than cat):

tac /var/log/messages | less

Don't forget about the cron jobs in /etc/cron*.

--
You know you're a little fat if you have stretch marks on your car.
                -- Cyrus, Chicago Reader 1/22/82
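[A rough sketch of the checks suggested above, assuming Linux software RAID (md) and the stock CentOS cron layout; a hardware controller would need its vendor's CLI instead:]

    # Any resync/check/recovery in progress shows up here (md software RAID only).
    cat /proc/mdstat
    # Scan the log newest-first for disk, controller, or RAID noise.
    tac /var/log/messages | grep -i -E 'md[0-9]|raid|ata|i/o error|sense' | head -50
    # Review every cron source, not just the user crontabs.
    cat /etc/crontab
    ls -l /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly /etc/cron.monthly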