Hi All,

I have a Xeon server with 16 GB RAM and no swap. I am running a Cassandra server on two nodes in a cluster. When there is high load on the server, kswapd0 kicks in, takes 100% CPU, and makes the machine very slow, and we have to restart our Cassandra server. I have the latest kernel, 2.6.18-238.9.1.el5. Please let me know how I can fix this issue. It is hurting us badly; this is our production server, so any quick help will be appreciated.

--
S.Ali Ahsan
Peter Kjellström
2011-May-09 13:01 UTC
[CentOS] kswapd taking 100% cpu with no swap on system
On Saturday, May 07, 2011 09:35:48 PM Ali Ahsan wrote:
> Hi All,
>
> I have a Xeon server with 16 GB RAM and no swap. I am running a
> Cassandra server on two nodes in a cluster. When there is high load on
> the server, kswapd0 kicks in, takes 100% CPU, and makes the machine
> very slow, and we have to restart our Cassandra server. I have the
> latest kernel, 2.6.18-238.9.1.el5. Please let me know how I can fix
> this issue. It is hurting us badly; this is our production server, so
> any quick help will be appreciated.

There is more than one bug that causes this behaviour. A few related memory management situations (possibly responsible) can actually be avoided if you add some swap, even if it is never used.

My suggestion would be to add some swap, set swappiness to 0, and see what happens.

/Peter
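A minimal sketch of that suggestion, assuming you have a few GB free on an existing filesystem and would rather use a swap file than repartition for a swap partition (the path and size below are placeholders, adjust to taste):

    # create a 4 GB swap file (placeholder path/size; a swap partition works too)
    dd if=/dev/zero of=/swapfile bs=1M count=4096
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # tell the VM to avoid swapping unless it really has to
    sysctl -w vm.swappiness=0
    echo "vm.swappiness = 0" >> /etc/sysctl.conf

    # keep the swap file across reboots
    echo "/swapfile swap swap defaults 0 0" >> /etc/fstab

Check the result with 'free -m' or 'swapon -s'; the point is simply for some swap to exist, not for it to be used heavily.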
On Monday, May 09, 2011 11:49:26 AM Ali Ahsan wrote:
> Hmmm, nice points. I am using SATA with LVM on two 1 TB HDs.

What sort of SATA drives are you using? There are known issues with some SATA drives in certain configurations and on some controllers. It shouldn't cause kswapd to hit high CPU, but it is worth checking out.
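A quick way to see exactly which drive models you have (a sketch; hdparm is in the base install, smartctl needs the smartmontools package, and the device name is just an example):

    # drive models as the kernel reported them at boot
    dmesg | grep -i "model"

    # or query a drive directly
    hdparm -I /dev/sda | grep -i "model"
    smartctl -i /dev/sda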
On Monday, May 09, 2011 01:11:17 PM Ali Ahsan wrote:
> sd 0:0:0:0: Attached scsi disk sda
>   Vendor: ATA       Model: WDC WD10EARS-003  Rev: 80.0
>   Type:   Direct-Access                      ANSI SCSI revision: 05

Are your two drives in a RAID? cat /proc/mdstat

The WD10EARS drives are known to have some issues, having to do with both powersave and TLER (google for WDTLER), as well as the 4K sector size. I have seen these issues, but that was with Fedora, and it showed up as high system load rather than kswapd at a high CPU percentage. Also google for WD10EARS linux; the top result is a Western Digital community forum post titled 'WD10EARS slow, slow, slow, slow - Western Digital Community'.

If you install the sysstat package, you can use iostat to see whether iowaits are your problem. I've used 'iostat -x 1' (and then set my konsole wider) and monitored the await column for each device; that is how I pinned down a WD15EADS drive that was creating iowaits.

Hope that helps.
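A sketch of those two checks (sysstat is in the base repos):

    # is there a software RAID on the box?
    cat /proc/mdstat

    # install sysstat for iostat, then watch per-device latency once a second
    yum install sysstat
    iostat -x 1

The 'await' column is the average time, in milliseconds, that requests spend waiting on each device; sustained values in the hundreds or thousands point at the disks rather than at memory.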
On Monday, May 09, 2011 01:57:54 PM Ali Ahsan wrote:
> On 05/09/2011 10:51 PM, Lamar Owen wrote:
> > Are your two drives in a RAID? cat /proc/mdstat
>
> No, I am doing LVM with 2x 1 TB HDs.

Can you give the output of pvdisplay, vgdisplay, and lvdisplay? Also, did you align the PVs to 4K sectors when you partitioned? What does iostat -x tell you?

The particular drives (hardware) you are using have known performance issues; there are a number of reports in Western Digital's forums confirming this, for more than just Linux.
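A sketch of those checks; the device below is an example, and note that older partitioning tools default to starting the first partition at sector 63, which is mis-aligned for 4K-sector drives like the WD10EARS:

    # LVM layout
    pvdisplay
    vgdisplay
    lvdisplay

    # show partition start sectors; a start sector divisible by 8
    # means the partition begins on a 4K boundary
    fdisk -lu /dev/sdb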
On Monday, May 09, 2011 02:06:54 PM Lamar Owen wrote:
> The particular drives (hardware) you are using have known performance issues; there are a number of reports in Western Digital's forums confirming this, for more than just Linux.

For reference:

http://community.wdc.com/t5/Desktop/WD10EARS-slow-slow-slow-slow/td-p/7581
http://www.hv23.net/2010/02/wd10ears-performance-larger-block-size-issues4k/
http://b1mmer.com/linux/wdhdd/
On Monday, May 09, 2011 02:02:08 PM Ali Ahsan wrote:
> On 05/09/2011 10:51 PM, Lamar Owen wrote:
> > iostat -x 1
>
> I am a little new to iostat, please guide me on this
[snip]
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           34.79    0.00    1.25    6.11    0.00   57.86
>
> Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda1       0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda2       0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb       14.00    0.00  49.00   0.00  2768.00     0.00    56.49     2.63   79.80   6.98  34.20
> sdb1      14.00    0.00  49.00   0.00  2768.00     0.00    56.49     2.63   79.80   6.98  34.20
> dm-0       0.00    0.00   0.00   0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-1       0.00    0.00  53.00   0.00  2040.00     0.00    38.49     4.63  111.42   6.45  34.20

Ok, on this particular frame of data you have a 6% iowait (which isn't bad for a busy server), and the awaits of 79.8 milliseconds for sdb and 111.42 milliseconds for the device-mapper device (this is the LVM piece) aren't too terrible. That's slower than some busy servers I've seen.

In my case with the WD15EADS drive (in the same family as the WD10EARS), I had seen awaits in the 27,000 millisecond (27 seconds!) range during intensive I/O operations; an svn update on a copy of the Plone collective, which is a really good stress test if you want to bring a box to its knees, would take ten to fifteen times longer than it should have.

Watch the output, in particular the await column (you'll want to widen your terminal to get it all on single lines), for spikes to see if this is the issue that is affecting you. It may not be the problem; but then again, my box with the WD15EADS drive would run for hours, then slow to an absolute crawl for minutes at a time, and then run smoothly for hours again.
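One way to catch those spikes without staring at the terminal is to filter the iostat output. A rough sketch, assuming gawk and the column layout shown above where await is the 10th field; adjust the field number and threshold if your sysstat version differs:

    # print a timestamped line whenever an sd* or dm-* device shows an
    # average wait above 500 ms in a one-second sample
    iostat -x 1 | awk '/^(sd|dm-)/ { if ($10 + 0 > 500) print strftime("%H:%M:%S"), $1, "await =", $10 }'

Leaving that running in a screen session while the load builds up should show whether the kswapd CPU spikes line up with disk await spikes.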
On Monday, May 09, 2011 02:03:54 PM Brunner, Brian T. wrote:
> ...a pursuit after feral aquatic fowl.

For those for whom English is not their first language, this translates to 'wild goose chase'; see that term on wikipedia.org.