Hi all, we are using the following setup as file server: --- # uname -a SunOS troubadix 5.10 Generic_120011-14 sun4u sparc SUNW,Sun-Fire-280R # prtconf -D System Configuration: Sun Microsystems sun4u Memory size: 2048 Megabytes System Peripherals (Software Nodes): SUNW,Sun-Fire-280R (driver name: rootnex) scsi_vhci, instance #0 (driver name: scsi_vhci) packages SUNW,builtin-drivers deblocker disk-label terminal-emulator obp-tftp SUNW,debug dropins kbd-translator ufs-file-system chosen openprom client-services options, instance #0 (driver name: options) aliases memory virtual-memory SUNW,UltraSPARC-III+ memory-controller, instance #0 (driver name: mc-us3) SUNW,UltraSPARC-III+ memory-controller, instance #1 (driver name: mc-us3) pci, instance #0 (driver name: pcisch) ebus, instance #0 (driver name: ebus) flashprom bbc power, instance #0 (driver name: power) i2c, instance #0 (driver name: pcf8584) dimm-fru, instance #0 (driver name: seeprom) dimm-fru, instance #1 (driver name: seeprom) dimm-fru, instance #2 (driver name: seeprom) dimm-fru, instance #3 (driver name: seeprom) nvram, instance #4 (driver name: seeprom) idprom i2c, instance #1 (driver name: pcf8584) cpu-fru, instance #5 (driver name: seeprom) temperature, instance #0 (driver name: max1617) cpu-fru, instance #6 (driver name: seeprom) temperature, instance #1 (driver name: max1617) fan-control, instance #0 (driver name: tda8444) motherboard-fru, instance #7 (driver name: seeprom) ioexp, instance #0 (driver name: pcf8574) ioexp, instance #1 (driver name: pcf8574) ioexp, instance #2 (driver name: pcf8574) fcal-backplane, instance #8 (driver name: seeprom) remote-system-console, instance #9 (driver name: seeprom) power-distribution-board, instance #10 (driver name: seeprom) power-supply, instance #11 (driver name: seeprom) power-supply, instance #12 (driver name: seeprom) rscrtc beep, instance #0 (driver name: bbc_beep) rtc, instance #0 (driver name: todds1287) gpio, instance #0 (driver name: gpio_87317) pmc, instance #0 (driver name: pmc) parallel, instance #0 (driver name: ecpp) rsc-control, instance #0 (driver name: su) rsc-console, instance #1 (driver name: su) serial, instance #0 (driver name: se) network, instance #0 (driver name: eri) usb, instance #0 (driver name: ohci) scsi, instance #0 (driver name: glm) disk (driver name: sd) tape (driver name: st) sd, instance #12 (driver name: sd) ... ses, instance #29 (driver name: ses) ses, instance #30 (driver name: ses) scsi, instance #1 (driver name: glm) disk (driver name: sd) tape (driver name: st) sd, instance #31 (driver name: sd) sd, instance #32 (driver name: sd) ... ses, instance #46 (driver name: ses) ses, instance #47 (driver name: ses) network, instance #0 (driver name: ce) pci, instance #1 (driver name: pcisch) SUNW,qlc, instance #0 (driver name: qlc) fp (driver name: fp) disk (driver name: ssd) fp, instance #1 (driver name: fp) ssd, instance #1 (driver name: ssd) ssd, instance #0 (driver name: ssd) scsi, instance #0 (driver name: mpt) disk (driver name: sd) tape (driver name: st) sd, instance #0 (driver name: sd) sd, instance #1 (driver name: sd) ... ses, instance #14 (driver name: ses) ses, instance #31 (driver name: ses) os-io iscsi, instance #0 (driver name: iscsi) pseudo, instance #0 (driver name: pseudo) --- The disks reside in a StoreEdge3320 expansion unit connected to the machine''s SCSI controller card (LSI1030 U320). 
We''ve created a raidz2 pool: --- # zpool status pool: storage_array state: ONLINE scrub: scrub completed with 0 errors on Wed Dec 12 23:38:36 2007 config: NAME STATE READ WRITE CKSUM storage_array ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 c2t9d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 errors: No known data errors --- The throughput when writing from a local disk to the zpool is around 30MB/s, when writing from a client machine, the throughput drops to ~9MB/s (NFS mounts over dedicated gigabit switch). When copying data to the pool throughput drops every few seconds to almost zero regardless of the source (NFS or local). # zpool iostat 1 capacity operations bandwidth pool used avail read write read write ------------- ----- ----- ----- ----- ----- ----- ... storage_array 138G 202G 0 0 0 0 storage_array 138G 202G 0 11 0 123K storage_array 138G 202G 2 30 3.96K 3.01M storage_array 138G 202G 0 96 0 4.14M storage_array 138G 202G 0 136 0 4.36M storage_array 138G 202G 0 73 0 4.09M storage_array 138G 202G 2 77 254K 9.19M storage_array 138G 202G 0 64 127K 6.05M storage_array 138G 202G 0 75 0 8.70M storage_array 138G 202G 0 101 0 3.98M storage_array 138G 202G 5 154 2.97K 6.19M storage_array 138G 202G 0 74 0 8.06M storage_array 138G 202G 0 121 0 2.77M storage_array 138G 202G 0 64 0 4.95M storage_array 138G 202G 0 63 0 7.73M storage_array 138G 202G 0 75 0 9.41M storage_array 138G 202G 1 128 235K 4.00M storage_array 138G 202G 0 97 0 4.16M storage_array 138G 202G 0 72 0 9.08M storage_array 138G 202G 0 70 0 8.68M storage_array 138G 202G 0 70 0 8.79M storage_array 138G 202G 2 102 13.4K 8.01M storage_array 138G 202G 0 178 0 599K storage_array 138G 202G 0 37 0 3.39M storage_array 138G 202G 0 79 0 9.92M storage_array 138G 202G 0 72 0 9.10M storage_array 138G 202G 0 79 0 9.93M storage_array 138G 202G 0 69 0 8.67M storage_array 138G 202G 0 76 0 9.53M storage_array 138G 202G 0 116 0 8.50M storage_array 138G 202G 0 112 0 2.76M storage_array 138G 202G 0 0 0 0 storage_array 138G 202G 0 55 0 6.95M storage_array 138G 202G 0 0 0 0 storage_array 138G 202G 0 12 0 1.61M storage_array 138G 202G 0 70 0 8.79M storage_array 138G 202G 0 88 0 11.0M storage_array 138G 202G 0 79 0 9.90M ... The performance is slightly disappointing. Does anyone have a similar setup and can anyone share some figures? Any pointers to possible improvements are greatly appreciated. Cheers, Frank
> The throughput when writing from a local disk to the
> zpool is around 30MB/s, when writing from a client

Err... sorry, the internal storage would be good old 1Gbit FC-AL disks @ 10K rpm. Still, not the fastest around ;)
Frank Penczek wrote:
> The performance is slightly disappointing. Does anyone have
> a similar setup and can anyone share some figures?
> Any pointers to possible improvements are greatly appreciated.

Use a faster processor or change to a mirrored configuration. raidz2 can become processor bound in the Reed-Solomon calculations for the 2nd parity set. You should be able to see this in mpstat, and at a coarser grain in vmstat.
 -- richard
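To check for this while a write load is running, the standard commands Richard mentions (plus an optional kernel profile) would look roughly like the following; the lockstat invocation is a common Solaris idiom rather than something taken from this thread:

# mpstat 1 10                      (per-CPU view: usr+sys pegged with idl near 0 would point at the CPU)
# vmstat 1 10                      (coarser view: watch the "id" column)
# lockstat -kIW -D 10 sleep 10     (top kernel functions by CPU time during the load)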
> Use a faster processor or change to a mirrored configuration.
> raidz2 can become processor bound in the Reed-Solomon calculations
> for the 2nd parity set. You should be able to see this in mpstat, and to
> a coarser grain in vmstat.

Hmm. Is the OP's hardware *that* slow? (I don't know enough about the Sun hardware models.)

I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 3800+ (i.e., one of the earlier AMD dual-core offerings). Here's me dd:ing to a file on ZFS under FreeBSD running on that hardware:

promraid     741G   387G      0    380      0  47.2M
promraid     741G   387G      0    336      0  41.8M
promraid     741G   387G      0    424    510  51.0M
promraid     741G   387G      0    441      0  54.5M
promraid     741G   387G      0    514      0  19.2M
promraid     741G   387G     34    192  4.12M  24.1M
promraid     741G   387G      0    341      0  42.7M
promraid     741G   387G      0    361      0  45.2M
promraid     741G   387G      0    350      0  43.9M
promraid     741G   387G      0    370      0  46.3M
promraid     741G   387G      1    423   134K  51.7M
promraid     742G   386G     22    329  2.39M  10.3M
promraid     742G   386G     28    214  3.49M  26.8M
promraid     742G   386G      0    347      0  43.5M
promraid     742G   386G      0    349      0  43.7M
promraid     742G   386G      0    354      0  44.3M
promraid     742G   386G      0    365      0  45.7M
promraid     742G   386G      2    460  7.49K  55.5M

At this point the bottleneck looks architectural rather than CPU. None of the cores are saturated, and the CPU usage of the ZFS kernel threads is pretty low.

I say architectural because writes to the underlying devices are not sustained; they drop to almost zero for certain periods (this is more visible in iostat -x than it is in the zpool statistics). What I think is happening is that ZFS is too late to evict data in the cache, thus blocking the writing process. Once a transaction group with a bunch of data gets committed the application unblocks, but presumably ZFS waits for a little while before resuming writes.

Note that this is also being run on plain hardware; it's not even PCI Express. During throughput peaks, but not constantly, the bottleneck is probably the PCI bus.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Hello Peter, Saturday, December 15, 2007, 7:45:50 AM, you wrote:>> Use a faster processor or change to a mirrored configuration. >> raidz2 can become processor bound in the Reed-Soloman calculations >> for the 2nd parity set. You should be able to see this in mpstat, and to >> a coarser grain in vmstat.PS> Hmm. Is the OP''s hardware *that* slow? (I don''t know enough about the Sun PS> hardware models) PS> I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 PS> 3800+ (i.e., one of the earlier AMD dual-core offerings). Here''s me dd:ing to PS> a file on FreeBSD on ZFS running on that hardware: PS> promraid 741G 387G 0 380 0 47.2M PS> promraid 741G 387G 0 336 0 41.8M PS> promraid 741G 387G 0 424 510 51.0M PS> promraid 741G 387G 0 441 0 54.5M PS> promraid 741G 387G 0 514 0 19.2M PS> promraid 741G 387G 34 192 4.12M 24.1M PS> promraid 741G 387G 0 341 0 42.7M PS> promraid 741G 387G 0 361 0 45.2M PS> promraid 741G 387G 0 350 0 43.9M PS> promraid 741G 387G 0 370 0 46.3M PS> promraid 741G 387G 1 423 134K 51.7M PS> promraid 742G 386G 22 329 2.39M 10.3M PS> promraid 742G 386G 28 214 3.49M 26.8M PS> promraid 742G 386G 0 347 0 43.5M PS> promraid 742G 386G 0 349 0 43.7M PS> promraid 742G 386G 0 354 0 44.3M PS> promraid 742G 386G 0 365 0 45.7M PS> promraid 742G 386G 2 460 7.49K 55.5M PS> At this point the bottleneck looks architectural rather than CPU. None of the PS> cores are saturated, and the CPU usage of the ZFS kernel threads is pretty PS> low. PS> I say architectural because writes to the underlying devices are not PS> sustained; it drops to almost zero for certain periods (this is more visible PS> in iostat -x than it is in the zpool statistics). What I think is happening PS> is that ZFS is too late to evict data in the cache, thus blocking the writing PS> process. Once a transaction group with a bunch of data gets committed the PS> application unblocks, but presumably ZFS waits for a little while before PS> resuming writes. PS> Note that this is also being run on plain hardware; it''s not even PCI Express. PS> During throughput peaks, but not constantly, the bottleneck is probably the PS> PCI bus. Sequential writing problem with process throttling - there''s an open bug for it for quite a while. Try to lower txg_time to 1s - should help a little bit. Can you also post iostat -xnz 1 while you''re doing dd? and zpool status -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
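For reference, the txg_time change Robert suggests can be made on a live Solaris 10 system with mdb, or persistently via /etc/system; the tunable name below is the one commonly used around that release and should be treated as version-dependent (a sketch, not something posted in the thread):

# echo "txg_time/W 0t1" | mdb -kw        (set the transaction group timer to 1 second on the running kernel)

or, persistently, in /etc/system:

set zfs:txg_time = 1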
Hi,

On Dec 14, 2007 7:50 PM, Louwtjie Burger <burgerw at zaber.org.za> wrote:
[...]
> I would have said ... to be expected, since the 280 came with a
> 100Mbit interface. So a 9-12 MB/s peak would be acceptable. You did
> mention a "gigabit switch"... did you install a gigabit HBA? If
> that's the case then yes, performance sucks.

Yes, sorry, I forgot to mention that we're using a GigaSwift NIC in that machine.

> fsstat?

What 'fsstat' output are you interested in? For a start, here's 'fsstat -F':

# fsstat -F
  new  name   name  attr  attr lookup rddir  read  read write write
 file remov   chng   get   set    ops   ops   ops bytes   ops bytes
 660K 69.8K  16.0K 11.3M  170K  60.2M  154K 25.6M 14.0G 6.05M 58.9G ufs
    0     0      0 43.6K     0  78.8K 14.0K 27.8K 10.0M     0     0 proc
    0     0      0    21     0      0     0     0     0     0     0 nfs
30.7K 7.67K  11.2K  348M 56.9K   122M 4.61M 1.65M 25.1G 1.05M 58.2G zfs
    0     0      0  574K     0      0     0     0     0     0     0 lofs
 162K 17.3K   120K  273K 15.9K  1.48M 4.60K  418K 1.24G 1.10M 5.93G tmpfs
    0     0      0 6.18K     0      0     0    51 9.31K     0     0 mntfs
    0     0      0     0     0      0     0     0     0     0     0 nfs3
    0     0      0     0     0      0     0     0     0     0     0 nfs4
    0     0      0    43     0      0     0     0     0     0     0 autofs

Thanks,
Frank
Hi, sorry for the lengthy post ... On Dec 15, 2007 1:56 PM, Robert Milkowski <rmilkowski at task.gda.pl> wrote: [...]> Sequential writing problem with process throttling - there''s an open > bug for it for quite a while. Try to lower txg_time to 1s - should > help a little bit.Since setting txg_time to 1 the periodic drop in bandwidth seems to have gone. That''s great. Unfortunately the performance is still not amazing - 10MB/s over the network and not more...> Can you also post iostat -xnz 1 while you''re doing dd? > and zpool status--- # zpool status pool: storage_array state: ONLINE scrub: scrub completed with 0 errors on Wed Dec 12 23:38:36 2007 config: NAME STATE READ WRITE CKSUM storage_array ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 c2t9d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 errors: No known data errors --- dd''ing to NFS mount: fpz at obelix://tmp> dd if=./file.tmp of=/home/fpz/file.tmp 200000+0 records in 200000+0 records out 102400000 bytes (102 MB) copied, 11.3959 seconds, 9.0 MB/s # iostat -xnz 1 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 2.8 17.3 149.4 127.6 0.0 1.3 0.0 66.0 0 12 c2t8d0 2.8 17.3 149.4 127.6 0.0 1.3 0.0 65.9 0 13 c2t9d0 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.1 0 13 c2t10d0 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.4 0 13 c2t11d0 2.8 17.3 149.5 127.6 0.0 1.3 0.0 66.5 0 13 c2t12d0 0.3 1.0 5.4 133.9 0.0 0.0 0.1 27.2 0 1 c1t1d0 0.5 0.3 26.8 16.5 0.0 0.0 0.1 11.1 0 0 c1t0d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.9 0 1 c1t1d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.5 0 0 c2t8d0 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.5 0 1 c2t9d0 0.0 10.0 0.0 6.0 0.0 0.0 0.0 0.7 0 1 c2t10d0 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.3 0 0 c2t11d0 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.3 0 0 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 67.6 0.0 1298.6 0.0 9.8 0.2 145.2 1 71 c2t8d0 0.0 64.8 0.0 1139.4 0.0 9.2 0.0 141.8 0 69 c2t9d0 0.0 59.2 0.0 898.9 0.0 8.6 0.0 144.9 0 68 c2t10d0 0.0 67.6 0.0 1379.4 0.0 9.5 0.0 140.0 0 68 c2t11d0 0.0 70.4 0.0 1257.3 0.0 11.4 0.0 162.1 0 73 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 43.8 0.0 3068.5 0.0 34.9 0.0 796.0 0 100 c2t8d0 0.0 55.6 0.0 3891.9 0.0 34.7 0.0 624.9 0 100 c2t9d0 0.0 58.8 0.0 4211.9 0.0 33.4 0.0 568.2 0 100 c2t10d0 0.0 49.2 0.0 3388.6 0.0 34.5 0.0 702.3 0 100 c2t11d0 0.0 57.7 0.0 3805.3 0.0 34.3 0.0 594.0 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 60.0 0.0 4279.6 0.0 35.0 0.0 583.2 0 100 c2t8d0 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t9d0 0.0 41.0 0.0 2910.3 0.0 35.0 0.0 853.6 0 100 c2t10d0 0.0 50.0 0.0 3552.2 0.0 35.0 0.0 699.9 0 100 c2t11d0 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t8d0 0.0 60.0 0.0 4280.8 0.0 35.0 0.0 583.1 0 100 c2t9d0 0.0 55.0 0.0 3938.2 0.0 35.0 0.0 636.1 0 100 c2t10d0 0.0 56.0 0.0 4024.3 0.0 35.0 0.0 624.7 0 100 c2t11d0 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 52.0 0.0 3723.5 0.0 35.0 0.0 672.9 0 100 c2t8d0 0.0 43.0 0.0 3081.5 0.0 35.0 0.0 813.8 0 100 c2t9d0 0.0 46.0 0.0 3296.0 0.0 35.0 0.0 760.7 0 100 c2t10d0 0.0 48.0 0.0 
3424.0 0.0 35.0 0.0 729.0 0 100 c2t11d0 0.0 62.0 0.0 4408.1 0.0 35.0 0.0 564.4 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 60.0 0.0 4279.8 0.0 35.0 0.0 583.2 0 100 c2t8d0 0.0 57.0 0.0 4065.8 0.0 35.0 0.0 613.9 0 100 c2t9d0 0.0 59.0 0.0 4194.3 0.0 35.0 0.0 593.1 0 100 c2t10d0 0.0 56.0 0.0 4023.3 0.0 35.0 0.0 624.9 0 100 c2t11d0 0.0 48.0 0.0 3424.3 0.0 35.0 0.0 729.1 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 65.7 0.0 1385.0 0.0 14.5 0.0 220.8 0 90 c2t8d0 0.9 68.4 39.8 1623.6 0.0 13.0 0.0 187.8 0 87 c2t9d0 0.9 74.9 39.3 2054.6 0.0 16.7 0.0 219.6 0 94 c2t10d0 0.9 70.3 39.3 1662.9 0.0 15.4 0.0 216.1 0 95 c2t11d0 0.0 68.4 0.0 1736.0 0.0 14.9 0.0 217.9 0 87 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 65.3 0.0 3287.1 0.0 29.2 0.0 447.8 0 99 c2t8d0 0.0 55.5 0.0 2642.4 0.0 28.2 0.0 508.9 0 99 c2t9d0 0.0 47.9 0.0 2130.0 0.0 26.7 0.0 558.2 0 100 c2t10d0 0.0 66.4 0.0 3336.1 0.0 29.3 0.0 441.2 0 100 c2t11d0 0.0 65.3 0.0 3103.3 0.0 29.7 0.0 454.7 0 99 c2t12d0 0.0 1.1 0.0 2.2 0.0 0.0 0.0 10.0 0 1 c1t1d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 44.0 0.0 3125.2 0.0 35.0 0.0 795.1 0 100 c2t8d0 0.0 50.0 0.0 3553.8 0.0 35.0 0.0 699.7 0 100 c2t9d0 0.0 55.0 0.0 3895.8 0.0 35.0 0.0 636.1 0 100 c2t10d0 0.0 44.0 0.0 3081.7 0.0 35.0 0.0 795.1 0 100 c2t11d0 0.0 48.0 0.0 3424.7 0.0 35.0 0.0 728.8 0 100 c2t12d0 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.7 0 1 c1t1d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 52.0 0.0 3724.6 0.0 35.0 0.0 672.8 0 100 c2t8d0 0.0 46.0 0.0 3253.1 0.0 35.0 0.0 760.6 0 100 c2t9d0 0.0 38.0 0.0 2697.0 0.0 35.0 0.0 920.7 0 100 c2t10d0 0.0 51.0 0.0 3638.6 0.0 35.0 0.0 686.0 0 100 c2t11d0 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 44.0 0.0 2915.0 0.0 22.9 0.0 521.2 0 100 c2t8d0 0.0 47.0 0.0 3382.1 0.0 24.1 0.0 512.0 0 100 c2t9d0 0.0 56.0 0.0 4024.2 0.0 25.7 0.0 459.3 0 100 c2t10d0 0.0 41.0 0.0 2954.1 0.0 22.7 0.0 552.4 0 100 c2t11d0 0.0 46.0 0.0 3083.6 0.0 22.8 0.0 494.7 0 98 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 103.0 0.0 207.5 0.0 5.4 0.0 52.4 0 89 c2t8d0 0.0 101.0 0.0 381.4 0.0 4.8 0.0 47.2 0 89 c2t9d0 0.0 102.0 0.0 432.9 0.0 4.0 0.0 39.5 0 79 c2t10d0 0.0 112.0 0.0 257.5 0.0 5.9 0.0 52.4 0 95 c2t11d0 0.0 111.0 0.0 206.5 0.0 6.1 0.0 54.8 0 92 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 102.0 0.0 213.0 0.0 4.7 0.0 46.3 0 78 c2t8d0 0.0 106.0 0.0 214.5 0.0 5.0 0.0 47.6 0 82 c2t9d0 0.0 95.0 0.0 214.5 0.0 4.3 0.0 45.5 0 71 c2t10d0 0.0 97.0 0.0 214.0 0.0 4.7 0.0 48.9 0 80 c2t11d0 0.0 99.0 0.0 216.5 0.0 5.2 0.0 52.7 0 90 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 67.0 0.0 104.0 0.0 3.2 0.0 47.3 0 55 c2t8d0 0.0 68.0 0.0 106.5 0.0 3.4 0.0 49.6 0 58 c2t9d0 0.0 66.0 0.0 101.5 0.0 3.2 0.0 48.6 0 60 c2t10d0 0.0 64.0 0.0 103.0 0.0 3.1 0.0 48.0 0 57 c2t11d0 0.0 69.0 0.0 103.5 0.0 3.1 0.0 45.4 0 62 c2t12d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1.0 0.0 1.5 0.0 0.0 0.0 10.2 0 1 c1t0d0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1.0 0.0 0.5 0.0 0.0 0.0 10.3 0 1 
c1t1d0 Thanks for your time! Cheers, Frank
Hi, On Dec 14, 2007 8:24 PM, Richard Elling <Richard.Elling at sun.com> wrote:> Frank Penczek wrote: > > > > The performance is slightly disappointing. Does anyone have > > a similar setup and can anyone share some figures? > > Any pointers to possible improvements are greatly appreciated. > > > > > > Use a faster processor or change to a mirrored configuration. > raidz2 can become processor bound in the Reed-Soloman calculations > for the 2nd parity set. You should be able to see this in mpstat, and to > a coarser grain in vmstat. > -- richardThanks for the hint. When dd''ing to the pool, mpstat tells me: # mpstat 1 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 46 0 67 518 413 509 4 46 26 0 272 1 3 0 96 1 44 0 64 1765 141 576 5 46 26 0 256 1 3 0 96 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 10 804 698 1130 4 69 52 0 218 0 5 0 95 1 7 0 10 3301 390 1189 5 79 64 0 106 0 4 0 96 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 4 294 188 601 3 56 59 0 85 0 2 0 98 1 0 0 4 1029 319 593 1 57 61 0 78 0 2 0 98 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 2 303 198 238 2 33 10 0 145 0 0 0 100 1 0 0 4 283 74 261 3 30 8 0 159 0 1 0 99 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 2 0 113 1055 885 813 137 120 20 9 32833 7 57 0 36 1 90 2 74 3622 220 1328 74 118 49 18 19956 5 45 0 50 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 14 0 115 438 191 876 181 127 40 0 32425 7 54 0 39 1 14 0 197 1513 453 671 118 132 54 0 23929 5 53 0 42 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 98 901 679 1726 163 189 43 0 26718 5 48 0 47 1 0 0 121 3722 508 843 171 194 32 0 29947 6 53 0 41 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 50 418 185 905 162 109 21 0 31884 7 54 0 39 1 0 0 135 1772 550 670 102 107 37 0 23882 5 55 0 40 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 58 353 184 330 106 105 17 0 30134 28 47 0 26 1 1 0 74 862 250 604 128 106 27 0 26312 6 45 0 50 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 60 1087 851 1542 231 174 31 1 33800 7 61 0 32 1 0 0 136 4191 444 1072 125 165 39 1 20273 4 52 0 44 ... Based on the ''idl'' column I interpret these numbers as "there are resources left" or is it me being naive? Cheers, Frank
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0   48.0    0.0 3424.6  0.0 35.0    0.0  728.9   0 100 c2t8d0
>     0.0   60.0    0.0 4280.8  0.0 35.0    0.0  583.1   0 100 c2t9d0
>     0.0   55.0    0.0 3938.2  0.0 35.0    0.0  636.1   0 100 c2t10d0
>     0.0   56.0    0.0 4024.3  0.0 35.0    0.0  624.7   0 100 c2t11d0
>     0.0   48.0    0.0 3424.6  0.0 35.0    0.0  728.9   0 100 c2t12d0

That service time is just terrible!
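For scale: with actv pinned at 35 outstanding commands per disk and roughly 50 writes/s completing, Little's law predicts an average latency of about 35 / 50 = 0.7 s, which is essentially the asvc_t reported above. The latency is therefore largely a consequence of the queue depth being kept on each device rather than, by itself, proof of a sick disk.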
hi Frank, there is an interesting pattern here (at least, to my untrained eyes) - your %b starts off quite low: Frank Penczek wrote: ....> --- > dd''ing to NFS mount: > fpz at obelix://tmp> dd if=./file.tmp of=/home/fpz/file.tmp > 200000+0 records in > 200000+0 records out > 102400000 bytes (102 MB) copied, 11.3959 seconds, 9.0 MB/s > > # iostat -xnz 1 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 2.8 17.3 149.4 127.6 0.0 1.3 0.0 66.0 0 12 c2t8d0 > 2.8 17.3 149.4 127.6 0.0 1.3 0.0 65.9 0 13 c2t9d0 > 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.1 0 13 c2t10d0 > 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.4 0 13 c2t11d0 > 2.8 17.3 149.5 127.6 0.0 1.3 0.0 66.5 0 13 c2t12d0 > 0.3 1.0 5.4 133.9 0.0 0.0 0.1 27.2 0 1 c1t1d0 > 0.5 0.3 26.8 16.5 0.0 0.0 0.1 11.1 0 0 c1t0d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.9 0 1 c1t1d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.5 0 0 c2t8d0 > 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.5 0 1 c2t9d0 > 0.0 10.0 0.0 6.0 0.0 0.0 0.0 0.7 0 1 c2t10d0 > 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.3 0 0 c2t11d0 > 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.3 0 0 c2t12d0then it jumps - roughly, quadrupling> extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 67.6 0.0 1298.6 0.0 9.8 0.2 145.2 1 71 c2t8d0 > 0.0 64.8 0.0 1139.4 0.0 9.2 0.0 141.8 0 69 c2t9d0 > 0.0 59.2 0.0 898.9 0.0 8.6 0.0 144.9 0 68 c2t10d0 > 0.0 67.6 0.0 1379.4 0.0 9.5 0.0 140.0 0 68 c2t11d0 > 0.0 70.4 0.0 1257.3 0.0 11.4 0.0 162.1 0 73 c2t12d0then it maxes out and stays that way> extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 43.8 0.0 3068.5 0.0 34.9 0.0 796.0 0 100 c2t8d0 > 0.0 55.6 0.0 3891.9 0.0 34.7 0.0 624.9 0 100 c2t9d0 > 0.0 58.8 0.0 4211.9 0.0 33.4 0.0 568.2 0 100 c2t10d0 > 0.0 49.2 0.0 3388.6 0.0 34.5 0.0 702.3 0 100 c2t11d0 > 0.0 57.7 0.0 3805.3 0.0 34.3 0.0 594.0 0 100 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 60.0 0.0 4279.6 0.0 35.0 0.0 583.2 0 100 c2t8d0 > 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t9d0 > 0.0 41.0 0.0 2910.3 0.0 35.0 0.0 853.6 0 100 c2t10d0 > 0.0 50.0 0.0 3552.2 0.0 35.0 0.0 699.9 0 100 c2t11d0 > 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t8d0 > 0.0 60.0 0.0 4280.8 0.0 35.0 0.0 583.1 0 100 c2t9d0 > 0.0 55.0 0.0 3938.2 0.0 35.0 0.0 636.1 0 100 c2t10d0 > 0.0 56.0 0.0 4024.3 0.0 35.0 0.0 624.7 0 100 c2t11d0 > 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 52.0 0.0 3723.5 0.0 35.0 0.0 672.9 0 100 c2t8d0 > 0.0 43.0 0.0 3081.5 0.0 35.0 0.0 813.8 0 100 c2t9d0 > 0.0 46.0 0.0 3296.0 0.0 35.0 0.0 760.7 0 100 c2t10d0 > 0.0 48.0 0.0 3424.0 0.0 35.0 0.0 729.0 0 100 c2t11d0 > 0.0 62.0 0.0 4408.1 0.0 35.0 0.0 564.4 0 100 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 60.0 0.0 4279.8 0.0 35.0 0.0 583.2 0 100 c2t8d0 > 0.0 57.0 0.0 4065.8 0.0 35.0 0.0 613.9 0 100 c2t9d0 > 0.0 59.0 0.0 4194.3 0.0 35.0 0.0 593.1 0 100 c2t10d0 > 0.0 56.0 0.0 4023.3 0.0 35.0 0.0 624.9 0 100 c2t11d0 > 0.0 48.0 0.0 3424.3 0.0 35.0 0.0 729.1 0 100 c2t12d0drops back a fraction> extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t 
asvc_t %w %b device > 0.0 65.7 0.0 1385.0 0.0 14.5 0.0 220.8 0 90 c2t8d0 > 0.9 68.4 39.8 1623.6 0.0 13.0 0.0 187.8 0 87 c2t9d0 > 0.9 74.9 39.3 2054.6 0.0 16.7 0.0 219.6 0 94 c2t10d0 > 0.9 70.3 39.3 1662.9 0.0 15.4 0.0 216.1 0 95 c2t11d0 > 0.0 68.4 0.0 1736.0 0.0 14.9 0.0 217.9 0 87 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 65.3 0.0 3287.1 0.0 29.2 0.0 447.8 0 99 c2t8d0 > 0.0 55.5 0.0 2642.4 0.0 28.2 0.0 508.9 0 99 c2t9d0 > 0.0 47.9 0.0 2130.0 0.0 26.7 0.0 558.2 0 100 c2t10d0 > 0.0 66.4 0.0 3336.1 0.0 29.3 0.0 441.2 0 100 c2t11d0 > 0.0 65.3 0.0 3103.3 0.0 29.7 0.0 454.7 0 99 c2t12d0 > 0.0 1.1 0.0 2.2 0.0 0.0 0.0 10.0 0 1 c1t1d0but quickly reverts to 100%:> extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 44.0 0.0 3125.2 0.0 35.0 0.0 795.1 0 100 c2t8d0 > 0.0 50.0 0.0 3553.8 0.0 35.0 0.0 699.7 0 100 c2t9d0 > 0.0 55.0 0.0 3895.8 0.0 35.0 0.0 636.1 0 100 c2t10d0 > 0.0 44.0 0.0 3081.7 0.0 35.0 0.0 795.1 0 100 c2t11d0 > 0.0 48.0 0.0 3424.7 0.0 35.0 0.0 728.8 0 100 c2t12d0 > 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.7 0 1 c1t1d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 52.0 0.0 3724.6 0.0 35.0 0.0 672.8 0 100 c2t8d0 > 0.0 46.0 0.0 3253.1 0.0 35.0 0.0 760.6 0 100 c2t9d0 > 0.0 38.0 0.0 2697.0 0.0 35.0 0.0 920.7 0 100 c2t10d0 > 0.0 51.0 0.0 3638.6 0.0 35.0 0.0 686.0 0 100 c2t11d0 > 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 44.0 0.0 2915.0 0.0 22.9 0.0 521.2 0 100 c2t8d0 > 0.0 47.0 0.0 3382.1 0.0 24.1 0.0 512.0 0 100 c2t9d0 > 0.0 56.0 0.0 4024.2 0.0 25.7 0.0 459.3 0 100 c2t10d0 > 0.0 41.0 0.0 2954.1 0.0 22.7 0.0 552.4 0 100 c2t11d0 > 0.0 46.0 0.0 3083.6 0.0 22.8 0.0 494.7 0 98 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 103.0 0.0 207.5 0.0 5.4 0.0 52.4 0 89 c2t8d0 > 0.0 101.0 0.0 381.4 0.0 4.8 0.0 47.2 0 89 c2t9d0 > 0.0 102.0 0.0 432.9 0.0 4.0 0.0 39.5 0 79 c2t10d0 > 0.0 112.0 0.0 257.5 0.0 5.9 0.0 52.4 0 95 c2t11d0 > 0.0 111.0 0.0 206.5 0.0 6.1 0.0 54.8 0 92 c2t12d0and then tails off> extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 102.0 0.0 213.0 0.0 4.7 0.0 46.3 0 78 c2t8d0 > 0.0 106.0 0.0 214.5 0.0 5.0 0.0 47.6 0 82 c2t9d0 > 0.0 95.0 0.0 214.5 0.0 4.3 0.0 45.5 0 71 c2t10d0 > 0.0 97.0 0.0 214.0 0.0 4.7 0.0 48.9 0 80 c2t11d0 > 0.0 99.0 0.0 216.5 0.0 5.2 0.0 52.7 0 90 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 67.0 0.0 104.0 0.0 3.2 0.0 47.3 0 55 c2t8d0 > 0.0 68.0 0.0 106.5 0.0 3.4 0.0 49.6 0 58 c2t9d0 > 0.0 66.0 0.0 101.5 0.0 3.2 0.0 48.6 0 60 c2t10d0 > 0.0 64.0 0.0 103.0 0.0 3.1 0.0 48.0 0 57 c2t11d0 > 0.0 69.0 0.0 103.5 0.0 3.1 0.0 45.4 0 62 c2t12d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 1.0 0.0 1.5 0.0 0.0 0.0 10.2 0 1 c1t0d0 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 1.0 0.0 0.5 0.0 0.0 0.0 10.3 0 1 c1t1d0All of which, to me, look like you''re filling a buffer or two. I don''t recall the config of your zpool, but if the devices are disks that are direct or san-attached, I would be wondering about their outstanding queue depths. 
I think it's time to break out some D to find out where in the stack the bottleneck(s) really are.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
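As a starting point for that D, a sketch using the stock io provider could time each write at the device level and break the latency out per disk (an illustrative one-liner under those assumptions, not something posted in the thread):

# dtrace -n '
  io:::start /!(args[0]->b_flags & B_READ)/ { start[arg0] = timestamp; }
  io:::done  /start[arg0]/ {
      @ms[args[1]->dev_statname] = quantize((timestamp - start[arg0]) / 1000000);
      start[arg0] = 0;
  }'

Leaving it running during a dd and pressing Ctrl-C prints a per-device histogram of write latency in milliseconds, which should make it obvious whether one disk or the whole set is slow.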
Hello James, Sunday, December 16, 2007, 9:54:18 PM, you wrote: JCM> hi Frank, JCM> there is an interesting pattern here (at least, to my JCM> untrained eyes) - your %b starts off quite low: JCM> Frank Penczek wrote: JCM> ....>> --- >> dd''ing to NFS mount: >> fpz at obelix://tmp> dd if=./file.tmp of=/home/fpz/file.tmp >> 200000+0 records in >> 200000+0 records out >> 102400000 bytes (102 MB) copied, 11.3959 seconds, 9.0 MB/s >> >> # iostat -xnz 1 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 2.8 17.3 149.4 127.6 0.0 1.3 0.0 66.0 0 12 c2t8d0 >> 2.8 17.3 149.4 127.6 0.0 1.3 0.0 65.9 0 13 c2t9d0 >> 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.1 0 13 c2t10d0 >> 2.8 17.3 149.3 127.6 0.0 1.3 0.0 66.4 0 13 c2t11d0 >> 2.8 17.3 149.5 127.6 0.0 1.3 0.0 66.5 0 13 c2t12d0 >> 0.3 1.0 5.4 133.9 0.0 0.0 0.1 27.2 0 1 c1t1d0 >> 0.5 0.3 26.8 16.5 0.0 0.0 0.1 11.1 0 0 c1t0d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.9 0 1 c1t1d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.5 0 0 c2t8d0 >> 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.5 0 1 c2t9d0 >> 0.0 10.0 0.0 6.0 0.0 0.0 0.0 0.7 0 1 c2t10d0 >> 0.0 10.0 0.0 7.0 0.0 0.0 0.0 0.3 0 0 c2t11d0 >> 0.0 10.0 0.0 7.5 0.0 0.0 0.0 0.3 0 0 c2t12d0JCM> then it jumps - roughly, quadrupling>> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 67.6 0.0 1298.6 0.0 9.8 0.2 145.2 1 71 c2t8d0 >> 0.0 64.8 0.0 1139.4 0.0 9.2 0.0 141.8 0 69 c2t9d0 >> 0.0 59.2 0.0 898.9 0.0 8.6 0.0 144.9 0 68 c2t10d0 >> 0.0 67.6 0.0 1379.4 0.0 9.5 0.0 140.0 0 68 c2t11d0 >> 0.0 70.4 0.0 1257.3 0.0 11.4 0.0 162.1 0 73 c2t12d0JCM> then it maxes out and stays that way>> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 43.8 0.0 3068.5 0.0 34.9 0.0 796.0 0 100 c2t8d0 >> 0.0 55.6 0.0 3891.9 0.0 34.7 0.0 624.9 0 100 c2t9d0 >> 0.0 58.8 0.0 4211.9 0.0 33.4 0.0 568.2 0 100 c2t10d0 >> 0.0 49.2 0.0 3388.6 0.0 34.5 0.0 702.3 0 100 c2t11d0 >> 0.0 57.7 0.0 3805.3 0.0 34.3 0.0 594.0 0 100 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 60.0 0.0 4279.6 0.0 35.0 0.0 583.2 0 100 c2t8d0 >> 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t9d0 >> 0.0 41.0 0.0 2910.3 0.0 35.0 0.0 853.6 0 100 c2t10d0 >> 0.0 50.0 0.0 3552.2 0.0 35.0 0.0 699.9 0 100 c2t11d0 >> 0.0 48.0 0.0 3423.7 0.0 35.0 0.0 729.1 0 100 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t8d0 >> 0.0 60.0 0.0 4280.8 0.0 35.0 0.0 583.1 0 100 c2t9d0 >> 0.0 55.0 0.0 3938.2 0.0 35.0 0.0 636.1 0 100 c2t10d0 >> 0.0 56.0 0.0 4024.3 0.0 35.0 0.0 624.7 0 100 c2t11d0 >> 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 52.0 0.0 3723.5 0.0 35.0 0.0 672.9 0 100 c2t8d0 >> 0.0 43.0 0.0 3081.5 0.0 35.0 0.0 813.8 0 100 c2t9d0 >> 0.0 46.0 0.0 3296.0 0.0 35.0 0.0 760.7 0 100 c2t10d0 >> 0.0 48.0 0.0 3424.0 0.0 35.0 0.0 729.0 0 100 c2t11d0 >> 0.0 62.0 0.0 4408.1 0.0 35.0 0.0 564.4 0 100 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 60.0 0.0 4279.8 0.0 35.0 0.0 583.2 0 100 c2t8d0 >> 0.0 57.0 0.0 4065.8 0.0 35.0 0.0 613.9 0 100 c2t9d0 >> 0.0 59.0 0.0 4194.3 0.0 35.0 0.0 593.1 0 100 c2t10d0 >> 0.0 56.0 0.0 4023.3 0.0 35.0 0.0 
624.9 0 100 c2t11d0 >> 0.0 48.0 0.0 3424.3 0.0 35.0 0.0 729.1 0 100 c2t12d0JCM> drops back a fraction>> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 65.7 0.0 1385.0 0.0 14.5 0.0 220.8 0 90 c2t8d0 >> 0.9 68.4 39.8 1623.6 0.0 13.0 0.0 187.8 0 87 c2t9d0 >> 0.9 74.9 39.3 2054.6 0.0 16.7 0.0 219.6 0 94 c2t10d0 >> 0.9 70.3 39.3 1662.9 0.0 15.4 0.0 216.1 0 95 c2t11d0 >> 0.0 68.4 0.0 1736.0 0.0 14.9 0.0 217.9 0 87 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 65.3 0.0 3287.1 0.0 29.2 0.0 447.8 0 99 c2t8d0 >> 0.0 55.5 0.0 2642.4 0.0 28.2 0.0 508.9 0 99 c2t9d0 >> 0.0 47.9 0.0 2130.0 0.0 26.7 0.0 558.2 0 100 c2t10d0 >> 0.0 66.4 0.0 3336.1 0.0 29.3 0.0 441.2 0 100 c2t11d0 >> 0.0 65.3 0.0 3103.3 0.0 29.7 0.0 454.7 0 99 c2t12d0 >> 0.0 1.1 0.0 2.2 0.0 0.0 0.0 10.0 0 1 c1t1d0JCM> but quickly reverts to 100%:>> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 44.0 0.0 3125.2 0.0 35.0 0.0 795.1 0 100 c2t8d0 >> 0.0 50.0 0.0 3553.8 0.0 35.0 0.0 699.7 0 100 c2t9d0 >> 0.0 55.0 0.0 3895.8 0.0 35.0 0.0 636.1 0 100 c2t10d0 >> 0.0 44.0 0.0 3081.7 0.0 35.0 0.0 795.1 0 100 c2t11d0 >> 0.0 48.0 0.0 3424.7 0.0 35.0 0.0 728.8 0 100 c2t12d0 >> 0.0 1.0 0.0 8.0 0.0 0.0 0.0 8.7 0 1 c1t1d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 52.0 0.0 3724.6 0.0 35.0 0.0 672.8 0 100 c2t8d0 >> 0.0 46.0 0.0 3253.1 0.0 35.0 0.0 760.6 0 100 c2t9d0 >> 0.0 38.0 0.0 2697.0 0.0 35.0 0.0 920.7 0 100 c2t10d0 >> 0.0 51.0 0.0 3638.6 0.0 35.0 0.0 686.0 0 100 c2t11d0 >> 0.0 48.0 0.0 3424.6 0.0 35.0 0.0 728.9 0 100 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 44.0 0.0 2915.0 0.0 22.9 0.0 521.2 0 100 c2t8d0 >> 0.0 47.0 0.0 3382.1 0.0 24.1 0.0 512.0 0 100 c2t9d0 >> 0.0 56.0 0.0 4024.2 0.0 25.7 0.0 459.3 0 100 c2t10d0 >> 0.0 41.0 0.0 2954.1 0.0 22.7 0.0 552.4 0 100 c2t11d0 >> 0.0 46.0 0.0 3083.6 0.0 22.8 0.0 494.7 0 98 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 103.0 0.0 207.5 0.0 5.4 0.0 52.4 0 89 c2t8d0 >> 0.0 101.0 0.0 381.4 0.0 4.8 0.0 47.2 0 89 c2t9d0 >> 0.0 102.0 0.0 432.9 0.0 4.0 0.0 39.5 0 79 c2t10d0 >> 0.0 112.0 0.0 257.5 0.0 5.9 0.0 52.4 0 95 c2t11d0 >> 0.0 111.0 0.0 206.5 0.0 6.1 0.0 54.8 0 92 c2t12d0JCM> and then tails off>> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 102.0 0.0 213.0 0.0 4.7 0.0 46.3 0 78 c2t8d0 >> 0.0 106.0 0.0 214.5 0.0 5.0 0.0 47.6 0 82 c2t9d0 >> 0.0 95.0 0.0 214.5 0.0 4.3 0.0 45.5 0 71 c2t10d0 >> 0.0 97.0 0.0 214.0 0.0 4.7 0.0 48.9 0 80 c2t11d0 >> 0.0 99.0 0.0 216.5 0.0 5.2 0.0 52.7 0 90 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 67.0 0.0 104.0 0.0 3.2 0.0 47.3 0 55 c2t8d0 >> 0.0 68.0 0.0 106.5 0.0 3.4 0.0 49.6 0 58 c2t9d0 >> 0.0 66.0 0.0 101.5 0.0 3.2 0.0 48.6 0 60 c2t10d0 >> 0.0 64.0 0.0 103.0 0.0 3.1 0.0 48.0 0 57 c2t11d0 >> 0.0 69.0 0.0 103.5 0.0 3.1 0.0 45.4 0 62 c2t12d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 1.0 0.0 1.5 0.0 0.0 0.0 10.2 0 1 c1t0d0 >> extended device statistics >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device >> 0.0 1.0 0.0 0.5 0.0 0.0 0.0 10.3 0 1 c1t1d0JCM> All of which, to me, look like you''re filling a buffer JCM> or two. 
JCM> I don't recall the config of your zpool, but if the
JCM> devices are disks that are direct or san-attached, I
JCM> would be wondering about their outstanding queue depths.

JCM> I think it's time to break out some D to find out where
JCM> in the stack the bottleneck(s) really are.

Maybe he could try to limit the number of queued requests per disk in ZFS to something smaller than the default of 35 (maybe even down to 1?).

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
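The per-disk queue Robert refers to shows up in the earlier iostat output as actv stuck at 35; on ZFS builds of that era it was governed by a tunable commonly named zfs_vdev_max_pending (name and default are version-dependent, so treat this as a sketch):

# echo "zfs_vdev_max_pending/W 0t10" | mdb -kw     (live change; try 10, or lower)

set zfs:zfs_vdev_max_pending = 10                  (persistent variant, in /etc/system)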
dd uses a default block size of 512B. Does this map to your expected usage?

When I quickly tested the CPU cost of small reads from cache, I did see that ZFS was more costly than UFS up to a crossover between 8K and 16K. We might need a more comprehensive study of that (data in/out of cache, different recordsize and alignment constraints). But for small syscalls, I think we might need some work in ZFS to make it CPU efficient.

So first: do small sequential writes to a large file match an interesting use case for you?

-r
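One quick way to separate the syscall-size effect from everything else is to repeat the same copy with an explicit large block size and compare the two rates (the paths below simply match the earlier test):

$ dd if=./file.tmp of=/home/fpz/file.tmp bs=512    (the implicit default)
$ dd if=./file.tmp of=/home/fpz/file.tmp bs=128k   (matches the default ZFS recordsize)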
Robert Milkowski wrote:
> Hello James,
>
> Sunday, December 16, 2007, 9:54:18 PM, you wrote:
>
> JCM> hi Frank,
>
> JCM> there is an interesting pattern here (at least, to my
> JCM> untrained eyes) - your %b starts off quite low:
....
> JCM> All of which, to me, look like you're filling a buffer
> JCM> or two.
>
> JCM> I don't recall the config of your zpool, but if the
> JCM> devices are disks that are direct or san-attached, I
> JCM> would be wondering about their outstanding queue depths.
>
> JCM> I think it's time to break out some D to find out where
> JCM> in the stack the bottleneck(s) really are.
>
> Maybe he could try to limit the number of queued requests per disk in ZFS to
> something smaller than the default of 35 (maybe even down to 1?)

Hi Robert,
yup, that's on my list of things for Frank to try. I've asked for a bit more config information, though, so we can get some clarity on that front first.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Hi,

On Dec 17, 2007 10:37 AM, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
>
> dd uses a default block size of 512B. Does this map to your
> expected usage? When I quickly tested the CPU cost of small
> reads from cache, I did see that ZFS was more costly than UFS
> up to a crossover between 8K and 16K. We might need a more
> comprehensive study of that (data in/out of cache, different
> recordsize and alignment constraints). But for small
> syscalls, I think we might need some work in ZFS to make it
> CPU efficient.
>
> So first: do small sequential writes to a large file
> match an interesting use case?

The pool holds home directories, so small sequential writes to one large file are one of a few interesting use cases. The performance is equally disappointing for many (small) files, e.g. compiling projects checked out from svn repositories.

Cheers,
Frank
Frank Penczek writes:
 > Hi,
 >
 > On Dec 17, 2007 10:37 AM, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
 > >
 > > dd uses a default block size of 512B. Does this map to your
 > > expected usage? When I quickly tested the CPU cost of small
 > > reads from cache, I did see that ZFS was more costly than UFS
 > > up to a crossover between 8K and 16K. We might need a more
 > > comprehensive study of that (data in/out of cache, different
 > > recordsize and alignment constraints). But for small
 > > syscalls, I think we might need some work in ZFS to make it
 > > CPU efficient.
 > >
 > > So first: do small sequential writes to a large file
 > > match an interesting use case?
 >
 > The pool holds home directories, so small sequential writes to one
 > large file are one of a few interesting use cases.

Can you be more specific here? Do you have a body of applications that would do small sequential writes, or one in particular? Another interesting piece of information is whether we expect those to be allocating writes or overwrites (beware that some apps move the old file out, then run allocating writes, then unlink the original file).

 > The performance is equally disappointing for many (small) files
 > like compiling projects in svn repositories.

???

-r
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>     0.0   48.0    0.0 3424.6  0.0 35.0    0.0  728.9   0 100 c2t8d0

> That service time is just terrible!

Yeah, that service time is unreasonable. Almost a second for each command? And 35 more commands queued? (reorder = faster)

I had a server with similar service times, so I repaired a replacement blade, and when I went to slide it in I noticed a loud noise coming from the blade below it. I notified the Windows person who owned it; it had been "broken" for some time, and they turned it off. Things were much better after that.

Vibration... check vibration.

Rob
Hi,

On Dec 17, 2007 4:18 PM, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
>
> > The pool holds home directories, so small sequential writes to one
> > large file are one of a few interesting use cases.
>
> Can you be more specific here?
>
> Do you have a body of applications that would do small
> sequential writes, or one in particular? Another
> interesting piece of information is whether we expect those to be allocating
> writes or overwrites (beware that some apps move the old file
> out, then run allocating writes, then unlink the original
> file).

Sorry, I'll try to be more specific.

The zpool contains home directories that are exported to client machines. It is hard to predict exactly what users are doing, but one thing users certainly do is check out software projects from our subversion server. The projects typically contain many source code files (thousands), and in the worst case a build process accesses all of them. That is what I meant by "many (small) files like compiling projects" in my previous post. The performance for this case is ... hopefully improvable.

Now for sequential writes: we don't have a specific application issuing sequential writes, but I can think of at least a few cases where such writes may occur, e.g. dumps of substantial amounts of measurement data, or growing application log files. In either case these would be mainly allocating writes.

Does this provide the information you're interested in?

Cheers,
Frank
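To put a number on the many-small-files case, one comparable test would be timing the extraction of the same source tree locally on the server and from an NFS client; the tarball name and target directories below are placeholders chosen only for illustration:

server# ptime tar xf /var/tmp/project.tar -C /export/home/fpz/test-local
client$ /usr/bin/time tar xf /var/tmp/project.tar -C /home/fpz/test-nfs

Comparing the two wall-clock times (and rerunning the local case to see cache effects) should show whether the small-file pain is mostly in ZFS itself or in the NFS path on top of it.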
Frank Penczek writes:
 > Hi,
 >
 > On Dec 17, 2007 4:18 PM, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
 > >
 > > > The pool holds home directories, so small sequential writes to one
 > > > large file are one of a few interesting use cases.
 > >
 > > Can you be more specific here?
 > >
 > > Do you have a body of applications that would do small
 > > sequential writes, or one in particular? Another
 > > interesting piece of information is whether we expect those to be allocating
 > > writes or overwrites (beware that some apps move the old file
 > > out, then run allocating writes, then unlink the original
 > > file).
 >
 > Sorry, I'll try to be more specific.
 > The zpool contains home directories that are exported to client machines.
 > It is hard to predict exactly what users are doing, but one thing users
 > certainly do is check out software projects from our subversion server. The
 > projects typically contain many source code files (thousands), and in the
 > worst case a build process accesses all of them. That is what I meant by
 > "many (small) files like compiling projects" in my previous post. The
 > performance for this case is ... hopefully improvable.
 >

This we'll have to work on. But first, if this is to storage with NVRAM, I assume you checked that the storage does not flush its caches:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

If that is not your problem and ZFS underperforms another FS on the back end of NFS, then this needs investigation. If ZFS/NFS underperforms a direct-attached FS, that might just be an NFS issue not related to ZFS. Again, that needs investigation. Performance gains won't happen unless we find out what doesn't work.

 > Now for sequential writes:
 > We don't have a specific application issuing sequential writes, but I
 > can think of at least a few cases where such writes may occur, e.g.
 > dumps of substantial amounts of measurement data, or growing application
 > log files. In either case these would be mainly allocating writes.
 >

Right, but I'd hope the application would issue substantially larger writes, especially if it needs to dump data at a high rate. If the data rate is more modest, then the CPU lost to this effect will itself be modest.

 > Does this provide the information you're interested in?
 >

I get a sense that it's more important we find out what your build issue is. But the small writes will have to be improved one day also.

-r
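For completeness, the cache-flush check Roch links to matters mainly when the pool sits on an array with battery-backed cache; with the 3320 behind the LSI1030 acting as a plain JBOD it is probably not the issue here. The tuning described in the Evil Tuning Guide of that era was roughly the following, and it is only safe when the write cache really is non-volatile (the tunable name is release-dependent):

set zfs:zfs_nocacheflush = 1          (/etc/system; only with NVRAM-backed storage)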
> Sequential writing problem with process throttling - there's an open
> bug for it for quite a while. Try to lower txg_time to 1s - should
> help a little bit.

Yeah, my post was mostly to emphasize that on commodity hardware raidz2 does not even come close to being a CPU bottleneck. It wasn't a poke at the streaming performance. Very interesting to hear there's a bug open for it, though.

> Can you also post iostat -xnz 1 while you're doing dd?
> and zpool status

This was FreeBSD, but I can provide iostat -x if you still want it for some reason.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Hello Peter,

Tuesday, December 18, 2007, 5:12:48 PM, you wrote:

>> Sequential writing problem with process throttling - there's an open
>> bug for it for quite a while. Try to lower txg_time to 1s - should
>> help a little bit.

PS> Yeah, my post was mostly to emphasize that on commodity hardware raidz2 does
PS> not even come close to being a CPU bottleneck. It wasn't a poke at the
PS> streaming performance. Very interesting to hear there's a bug open for it
PS> though.

>> Can you also post iostat -xnz 1 while you're doing dd?
>> and zpool status

PS> This was FreeBSD, but I can provide iostat -x if you still want it for some
PS> reason.

I was just wondering whether maybe there's a problem with just one disk...

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
> I was just wondering whether maybe there's a problem with just one
> disk...

No, this is something I have observed on at least four different systems with vastly varying hardware. Probably just the effects of the known problem.

Thanks,

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org