Benjamin Smith
2009-Nov-17 03:56 UTC
[CentOS] High load averages with latest kernel and USB drives?
I'm having a server report a high load average when backing up Postgres database files to an external USB drive. This is driving my loadbalancers all out of kilter and causing a large volume of network monitor alerts.

I have a 1TB USB drive plugged into a USB2 port that I use to back up the production drives (which are SCSI). It's working fine, but while doing backups (hourly) the load average on the server shoots up from the normal 0.5 - 1.5 or so up to a high between 10 and 30. Strangely, even though the "load is high" the server is completely responsive, even the USB drives being accessed are!

Backup script is really simple, run via cron, pretty much just:

#! /bin/sh
hour=`date +%k`;
pg_dump <options> mydatabase > /media/backups/mydatabase.$hour.pgsql;

where /media/backups is the mount point for the USB drive.

Using top to diagnose, nothing seems to be particularly high! IoWait seems reasonable (10-30%) and CPUs are 0.5%, Idle is 70-90%. Even accessing the USB partition while the load is "high" is responsive!

I'm guessing that something changed in how load average is counted?

Server Stats:
Late model 8-way Xeon, SuperMicro brand.
CentOS 4.x / 64 (all updates applied, booted after last kernel update)
Kernel 2.6.9-89.0.16.ELsmp
4 GB ECC RAM
300 GB SCSI HDD.
Standard Apache/PHP, Postgres 8.4.

Any idea how to revert to the old load average tracking behavior short of using a stale and potentially insecure kernel?
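One way to check the guess about how load is counted: on Linux, the load average includes tasks in uninterruptible sleep (state D, typically waiting on I/O) as well as runnable ones, so a high load with idle CPUs can come from tasks blocked on a slow device. The sketch below prints the current load averages next to the number of D-state tasks. The function names are made up for illustration, and nothing here shows that this is what happens on the server described above.

```shell
#!/bin/sh
# Hypothetical helper names; only ps, awk, cut and /proc/loadavg are assumed.

# Count lines on stdin whose first field starts with the given state letter.
count_state() {
    awk -v s="$1" 'substr($1, 1, 1) == s { n++ } END { print n + 0 }'
}

# Print the 1, 5 and 15 minute load averages, then the number of tasks
# currently in uninterruptible sleep (state D).
show_blocked() {
    cut -d' ' -f1-3 /proc/loadavg
    ps -eo state= | count_state D
}
```

Running `show_blocked` a few times while the hourly backup is in progress would show whether the D-state count tracks the reported load; `ps -eo state=,pid=,comm= | awk '$1 == "D"'` lists the tasks themselves.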
Todd Denniston
2009-Nov-17 15:52 UTC
[CentOS] High load averages with latest kernel and USB drives?
Benjamin Smith wrote, On 11/16/2009 10:56 PM:
> I have a 1TB USB drive plugged into a USB2 port that I use to back up the
> production drives (which are SCSI). It's working fine, but while doing backups
> (hourly) the load average on the server shoots up from the normal 0.5 - 1.5 or
> so up to a high between 10 and 30. Strangely, even though the "load is high"
> the server is completely responsive, even the USB drives being accessed are!
>
> Backup script is really simple, run via cron, pretty much just:
>
> #! /bin/sh
> hour=`date +%k`;
> pg_dump <options> mydatabase > /media/backups/mydatabase.$hour.pgsql;
>
> where /media/backups is the mount point for the USB drive.

Note: although I have a couple of ideas, I am answering/questioning more out of curiosity than experience. Salt appropriately.

Are you saying that when you were running a previous kernel, the same operations with the same devices did not produce the high load? Which specific kernels worked as desired (if someone is going to bisect the problem, they need a starting point)?

Are there other processes on the machine that are waiting to use the db while the dump is occurring? How many postgres processes are waiting for the dump to finish (it has been a while since I ran postgres, so I don't recall how it deals with queries during a dump)?
As workarounds, perhaps asking the kernel to schedule in a specific way might help, i.e.:

#1 set the backup on a particular set of processors,
# replace the pg_dump line above with
taskset -c 3-4 pg_dump <options> mydatabase > \
/media/backups/mydatabase.$hour.pgsql;

#2 set the usb-storage on a particular set of processors,
# Note USBSTORPID= line prototyped on CentOS 5 machine not 4.
USBSTORPID=`ps aux |grep usb-storage|head -1 |awk '{print $2}'`
taskset -p -c 3-4 $USBSTORPID
#you might even go back and reduce the processor list
#to just 3 or 4 instead of both.

#3 don't update atime
# (should at worst be a minor thing, and you say that
# the usb mounted file system is responsive,
# but perhaps it would help some.)
mount -oremount,noatime /media/backups/

I have not had the taskset of the USB driver cause faults when used on a dual processor Xeon, but if any of the above breaks your system you get to keep the chunky bits. :0

--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter
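Workaround #1 could be folded into the cron script roughly as follows. This is only a sketch: the function names `dump_path` and `run_backup` and the `BACKUP_DIR` and `CPUS` variables are invented for illustration, the pg_dump options elided in the original post are left out, and `date +%H` (zero-padded hour) is used instead of `%k` so the file name never contains a space, which is a deviation from the original script.

```shell
#!/bin/sh
# Hypothetical names throughout; the CPU list 3-4 comes from the suggestion above.
BACKUP_DIR=${BACKUP_DIR:-/media/backups}
CPUS=${CPUS:-3-4}

# Print the dump file path for a database name and an hour string.
dump_path() {
    printf '%s/%s.%s.pgsql\n' "$BACKUP_DIR" "$1" "$2"
}

# Dump one database, pinned to $CPUS when taskset is available,
# otherwise run pg_dump without any CPU affinity.
run_backup() {
    db=$1
    hour=$(date +%H)
    out=$(dump_path "$db" "$hour")
    if command -v taskset >/dev/null 2>&1; then
        taskset -c "$CPUS" pg_dump "$db" > "$out"
    else
        pg_dump "$db" > "$out"
    fi
}
```

The cron job would then end with `run_backup mydatabase`. Pinning only changes where the dump runs, so whether it affects the reported load average is something to measure rather than assume.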
Amos Shapira
2009-Nov-19 07:20 UTC
[CentOS] High load averages with latest kernel and USB drives?
Sorry, I can't suggest much about the USB issue, but for such frequent backups, and to enable point-in-time recovery (PITR), you should consider log archiving. It should also save you heaps of load on the CPU, disk, network and PostgreSQL server.

-Amos
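As a rough illustration of the log archiving suggestion: PostgreSQL 8.4 has `archive_mode` and `archive_command` settings in postgresql.conf, and the server runs the command once per completed WAL segment, substituting `%p` (path of the segment) and `%f` (its file name). A minimal helper for that command might look like the sketch below. The function name, the archive directory and the install path in the comment are assumptions for illustration, and a base backup is still needed before archived segments can be used for recovery.

```shell
#!/bin/sh
# Hypothetical archive location on the USB drive; any directory would do.
ARCHIVE_DIR=${ARCHIVE_DIR:-/media/backups/wal}

# Copy one WAL segment into the archive.
#   $1 = path to the segment (%p), $2 = segment file name (%f)
# Returns non-zero if the target already exists or the copy fails,
# so an existing archived segment is never overwritten.
archive_wal() {
    src=$1
    dest=$ARCHIVE_DIR/$2
    [ -f "$dest" ] && return 1
    cp "$src" "$dest"
}

# Installed as a standalone script (path is an assumption), it would end with
#   archive_wal "$@"
# and be referenced from postgresql.conf as
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
```

PostgreSQL treats a non-zero exit status from the archive command as a failure and retries the segment later, which is why the helper refuses to overwrite instead of silently succeeding.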