Hi all,

My system is powered by an Intel Core 2 Duo (E6600) with 8GB of RAM. I'm
running into some very heavy CPU usage.

First, a copy from one zpool to another (cp -aRv /oldtank/documents*
/tank/documents/*), both in the same system. Load averages are around ~4.8.
I think I used lockstat correctly, and found the following:

movax at megatron:/tank# lockstat -kIW -D 20 sleep 30

Profiling interrupt: 2960 events in 30.516 seconds (97 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
 1518  51%  51% 0.00     1800 cpu[0]          SHA256TransformBlocks
  334  11%  63% 0.00     2820 cpu[0]          vdev_raidz_generate_parity_pq
  261   9%  71% 0.00     3493 cpu[0]          bcopy_altentry
  119   4%  75% 0.00     3033 cpu[0]          mutex_enter
   73   2%  78% 0.00     2818 cpu[0]          i86_mwait
<snip>

So, obviously, it seems checksum calculation is, to put it mildly, eating up
CPU cycles like none other. I believe it's bad(TM) to turn off checksums?
(The zfs property just has checksum=on; I guess it has defaulted to SHA256
checksums?)

Second, a copy from my desktop PC to my new zpool (a 5900rpm drive over GigE
to two 6-drive RAID-Z2s). Load averages are around ~3. Again, with lockstat:

movax at megatron:/tank# lockstat -kIW -D 20 sleep 30

Profiling interrupt: 2919 events in 30.089 seconds (97 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
 1298  44%  44% 0.00     1853 cpu[0]          i86_mwait
  301  10%  55% 0.00     2700 cpu[0]          vdev_raidz_generate_parity_pq
  144   5%  60% 0.00     3569 cpu[0]          bcopy_altentry
  103   4%  63% 0.00     3933 cpu[0]          ddi_getl
   83   3%  66% 0.00     2465 cpu[0]          mutex_enter
<snip>

Here it seems as if 'i86_mwait' is occupying the top spot (is this because I
have power management set to poll my CPU?). Is something odd happening
drive-buffer-wise? (i.e. data coming in on the NIC, getting buffered in the
HBA somehow, and then flushed to disk?)

In either case, it seems I'm hitting a ceiling of around 65MB/s. I assume the
CPU is the bottleneck, since bonnie++ benches resulted in much better
performance for the vdev. In the latter case, though, it may just be a
limitation of the source drive (if it can't read data faster than 65MB/s, I
can't write faster than that...).

e: The E6600 is a first-generation 65nm LGA775 CPU, clocked at 2.40GHz.
Dual-core, no hyper-threading.

--
--khd
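For what it's worth, the checksum algorithm a dataset is actually using can be
confirmed with zfs get; a minimal check, assuming the pool names from the post
(oldtank, tank), where "on" means the default algorithm rather than an
explicit sha256:

# Show the effective checksum setting for both pools; a value of "on"
# normally maps to fletcher4 for data blocks.
zfs get checksum oldtank tank

# Dedup is what forces sha256 data checksums, so confirm it is off.
zfs get dedup tank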
On 2/8/2011 8:41 AM, Krunal Desai wrote:
> Hi all,
>
> My system is powered by an Intel Core 2 Duo (E6600) with 8GB of RAM.
> Running into some very heavy CPU usage.
> <snip>
> In either case, it seems I'm hitting a ceiling of around 65MB/s. I assume
> the CPU is the bottleneck, since bonnie++ benches resulted in much better
> performance for the vdev.
>
> e: The E6600 is a first-generation 65nm LGA775 CPU, clocked at 2.40GHz.
> Dual-core, no hyper-threading.

In the second case, you're almost certainly hitting a network bottleneck. If
you don't have Jumbo Frames turned on at both ends of the connection, and a
switch that understands JF, then the CPU is spending a horrible amount of
time doing interrupts, trying to re-assemble small packets.

I also think you might be running into an erratum around old Xeons, CR
6588054. This was fixed in kernel patch 127128-11, included in s10u5 or
later. Otherwise, it might be an issue with the powersave functionality of
certain Intel CPUs. In either case, try putting this in your /etc/system:

set idle_cpu_prefer_mwait = 0

If that fix causes an issue (and there are reports that it occasionally
does), you'll need to boot without that /etc/system: append the '-a' flag to
the end of the GRUB menu entry you boot from. This will put you into an
interactive boot where, when it asks for a /etc/system to use, you can
specify /dev/null.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
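In case it is useful, the current value of that tunable can be read from the
running kernel with mdb, and the jumbo-frame state of the NIC checked with
dladm; a rough sketch, assuming an OpenSolaris-era dladm and an e1000g0
interface (the interface name is only an example):

# Print the current value of idle_cpu_prefer_mwait from the live kernel
# (a non-zero value means mwait-based idling is preferred).
echo "idle_cpu_prefer_mwait/X" | mdb -k

# Check whether jumbo frames are actually in effect on the GigE link;
# ~1500 means standard frames, ~9000 means jumbo.
dladm show-linkprop -p mtu e1000g0

The /etc/system change itself only takes effect after a reboot.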
On Feb 8, 2011, at 8:41 AM, Krunal Desai wrote:
> Hi all,
>
> My system is powered by an Intel Core 2 Duo (E6600) with 8GB of RAM.
> Running into some very heavy CPU usage.

The data below does not show heavy CPU usage. Do you have data that does show
heavy CPU usage? mpstat would be a good start.

> First, a copy from one zpool to another (cp -aRv /oldtank/documents*
> /tank/documents/*), both in the same system. Load averages are around
> ~4.8. I think I used lockstat correctly, and found the following:
>
> movax at megatron:/tank# lockstat -kIW -D 20 sleep 30
>
> Profiling interrupt: 2960 events in 30.516 seconds (97 events/sec)

97 events/sec is quite a small load.

> Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
> -------------------------------------------------------------------------------
>  1518  51%  51% 0.00     1800 cpu[0]          SHA256TransformBlocks

This says that 51% of the events were SHA256TransformBlocks and that they
consumed about 1800 nanoseconds (!) of CPU time each, for a cumulative
2,732,400 nanoseconds, or 0.009% of one CPU during the sampling interval. To
put this in perspective, a write to a HDD can easily take more than 4,000,000
nanoseconds.

>   334  11%  63% 0.00     2820 cpu[0]          vdev_raidz_generate_parity_pq
>   261   9%  71% 0.00     3493 cpu[0]          bcopy_altentry
>   119   4%  75% 0.00     3033 cpu[0]          mutex_enter
>    73   2%  78% 0.00     2818 cpu[0]          i86_mwait
> <snip>
>
> So, obviously, it seems checksum calculation is, to put it mildly, eating
> up CPU cycles like none other. I believe it's bad(TM) to turn off
> checksums? (The zfs property just has checksum=on; I guess it has
> defaulted to SHA256 checksums?)

Some ZFS checksums are always SHA-256. By default, data checksums are
Fletcher4 on most modern ZFS implementations, unless dedup is enabled.

> Second, a copy from my desktop PC to my new zpool (a 5900rpm drive over
> GigE to two 6-drive RAID-Z2s). Load averages are around ~3.

Lockstat won't provide direct insight into the run queue (which is used to
calculate load average). Perhaps you'd be better off starting with prstat.
 -- richard

> Again, with lockstat:
>
> movax at megatron:/tank# lockstat -kIW -D 20 sleep 30
>
> Profiling interrupt: 2919 events in 30.089 seconds (97 events/sec)
> <snip>
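For completeness, the kind of data being asked for here can be gathered while
the copy is running; a minimal sketch (the 5-second interval and the flags
are just one reasonable choice):

# Per-CPU utilisation every 5 seconds; usr/sys/idl show whether a core is
# actually pegged, and intr/ithr show interrupt load.
mpstat 5

# Microstate accounting per thread, sorted by CPU usage; shows which
# process (the zpool-tank worker, smbd, and so on) is consuming the cycles.
prstat -mL 5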
On Wed, Feb 9, 2011 at 12:02 AM, Richard Elling <richard.elling at gmail.com> wrote:
> The data below does not show heavy CPU usage. Do you have data that does
> show heavy CPU usage? mpstat would be a good start.

Here is mpstat output during a network copy; I think one of the CPUs
disappeared due to an L2 cache error.

movax at megatron:~# mpstat -p
CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl set
  1  333   0    6  4057 3830 19467  140   27  265    0  1561    1  48   0  51   0

> Some ZFS checksums are always SHA-256. By default, data checksums are
> Fletcher4 on most modern ZFS implementations, unless dedup is enabled.

I see, thanks for the info.

>> Second, a copy from my desktop PC to my new zpool (a 5900rpm drive over
>> GigE to two 6-drive RAID-Z2s). Load averages are around ~3.
>
> Lockstat won't provide direct insight into the run queue (which is used to
> calculate load average). Perhaps you'd be better off starting with prstat.

Ah, gotcha. I ran prstat, which is more of what I wanted:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  1434 root        0K    0K run      0  -20   0:01:54  23% zpool-tank/136
  1515 root     9804K 3260K cpu1    59    0   0:00:00 0.1% prstat/1
  1438 root       14M 9056K run     59    0   0:00:00 0.0% smbd/16

The zpool thread is near the top of the usage list, which I suppose is what
you would expect.
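As an aside, whether that second core really has been taken offline, and why,
can be checked directly; a small sketch, assuming FMA handled the fault as on
a stock install:

# List each CPU and its state (on-line, off-line, faulted) and when that
# state last changed.
psrinfo -v

# Show any faults FMA has diagnosed; a cache fault against the missing CPU
# would be listed here with its message ID.
fmadm faulty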