Hi,

I have a Netra X1 server with 512MB of RAM and two ATA disks, model ST340016A. The processor is an UltraSPARC-IIe 500MHz. The Solaris version is:

Solaris 10 10/09 s10s_u8wos_08a SPARC

I jumpstarted the server with a ZFS root, with the two disks as a mirror:

=======================================================
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s0  ONLINE       0     0     0
            c0t2d0s0  ONLINE       0     0     0

errors: No known data errors
=======================================================

Here is the dataset layout (zfs list):

=======================================================
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             28.9G  7.80G    98K  /rpool
rpool/ROOT        4.06G  7.80G    21K  legacy
rpool/ROOT/newbe  4.06G  7.80G  4.06G  /
rpool/dump         512M  7.80G   512M  -
rpool/opt         23.8G  7.80G  4.11G  /opt
rpool/opt/export  19.7G  7.80G  19.7G  /opt/export
rpool/swap         512M  8.12G   187M  -
=======================================================

When I run this command:

# digest -a md5 /opt/export/BIGFILE    (4.6GB file)

it takes around 1 hour and 45 minutes to finish. I can understand that a Netra X1 with ATA disks can be slow, but not this slow.

Also, there are some inconsistencies between the "zpool iostat 1" and DTrace outputs. Here is a small snippet of zpool iostat 1:

=======================================================
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       28.5G  8.70G     33      1  3.28M  5.78K
rpool       28.5G  8.70G    124      0  15.5M      0
rpool       28.5G  8.70G    150      0  18.8M      0
rpool       28.5G  8.70G    134      0  16.8M      0
rpool       28.5G  8.70G    135      0  16.7M      0
=======================================================

Not so bad, I would say, for ATA/IDE drives. But if I dig deeper, for example with "iostat -x 1":

=======================================================
                  extended device statistics
device    r/s    w/s    kr/s   kw/s wait actv  svc_t  %w  %b
dad0     63.4    0.0  8115.1    0.0 32.7  2.0  547.5 100 100
dad1     63.4    0.0  8115.1    0.0 33.0  2.0  551.9 100 100
                  extended device statistics
device    r/s    w/s    kr/s   kw/s wait actv  svc_t  %w  %b
dad0     52.0    0.0  6152.3    0.0  9.3  1.6  210.9  75  84
dad1     70.0    0.0  8714.7    0.0 16.0  2.0  256.5  93  99
                  extended device statistics
device    r/s    w/s    kr/s   kw/s wait actv  svc_t  %w  %b
dad0     75.3    0.0  9634.7    0.0 23.1  1.9  332.7  96  97
dad1     71.3    0.0  9121.0    0.0 22.3  1.9  339.4  91  95
=======================================================

zpool iostat says about 15MB/s of data is being transferred, but iostat shows only about 8MB/s per disk. We can also clearly see that the bus and the drives are saturated (%w and %b). Let's look at the I/O breakdown with iotop from the DTraceToolkit:

=======================================================
2009 Nov 13 09:06:03,  load: 0.37,  disk_r:  93697 KB,  disk_w:      0 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
    0   2250   1859 digest           dad1    136   8 R      3
    0   2250   1859 digest           dad0    136   0 R      4
    0      0      0 sched            dad0    136   0 R     88
    0      0      0 sched            dad1    136   8 R     89
=======================================================

sched accounts for up to 90% of the I/O. I tried to trace it:

=======================================================
[x]: ps -efl | grep sche
 1 T     root     0     0  0   0 SY     ?      0 21:31:44 ?        0:48 sched
[x]: truss -f -p 1
1:      pollsys(0x0002B9DC, 1, 0xFFBFF588, 0x00000000) (sleeping...)
[x]:
=======================================================

Only one line of output from truss. Weird: it is as if sched is doing nothing, yet it accounts for up to 90% of the I/O.
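To double-check iotop's numbers, here is a rough io-provider one-liner that just sums physical read/write bytes per execname (essentially what iotop does under the hood; nothing in it is specific to this box):

=======================================================
# dtrace -n '
  io:::start { @[execname] = sum(args[0]->b_bcount); }
  tick-10sec { normalize(@, 10); printa("%-16s %@d bytes/s\n", @); exit(0); }'
=======================================================

Note that with this approach, anything issued from kernel context is charged to PID 0 and therefore shows up as "sched", exactly as in the iotop output above.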
prstat is showing this (I'll only show the first process at the top):

======================================================
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2250 root     2464K 1872K run      0    0   0:01:57  15% digest/1
======================================================

The digest command is using only 15% of the CPU, so the I/O is clearly not keeping it busy. pfilestat from the DTraceToolkit reports this:

======================================================
     STATE   FDNUM      Time Filename
   running       0        6%
   waitcpu       0        7%
      read       4       13% /opt/export/BIGFILE
   sleep-r       0       69%

     STATE   FDNUM      KB/s Filename
      read       4       614 /opt/export/BIGFILE
======================================================

So only around 615KB/s is actually read each second. That is very slow! This does not seem normal to me, even for ATA drives.

So my questions are the following:

1. Why is zpool iostat reporting 15MB/s of reads when in reality only 615KB/s is read?
2. Why is sched doing so much I/O?
3. What can I do to improve I/O performance? I find it hard to believe that this is the best performance the current hardware can provide...

Thank you!
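P.S. In case it helps anyone reproduce the gap, here is a rough DTrace sketch (plain syscall and io providers, nothing exotic) that compares, second by second, what digest reads logically against what is actually issued to the disks:

======================================================
# dtrace -qn '
  syscall::read:return /execname == "digest" && arg0 > 0/
    { @["logical (read syscalls)"] = sum(arg0); }
  io:::start /args[0]->b_flags & B_READ/
    { @["physical (disk reads)"]   = sum(args[0]->b_bcount); }
  tick-1sec { printa("%-26s %@d bytes/s\n", @); clear(@); }'
======================================================

Roughly speaking, the difference between the two lines is whatever ZFS reads beyond what the application actually consumes (metadata, prefetch, re-reads).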
On Fri, 13 Nov 2009, inouk wrote:
>
> So my questions are the following:
>
> 1. Why is zpool iostat reporting 15MB/s of reads when in reality only 615KB/s is read?
> 2. Why is sched doing so much I/O?
> 3. What can I do to improve I/O performance? I find it hard to believe that this is the best performance the current hardware can provide...

Your system has very little RAM (512MB). That is less than is recommended for Solaris 10 or for zfs, and if it were a PC it would be barely enough to run Windows XP. Since zfs likes to use RAM and expects that sufficient RAM will be available, it seems likely that this system is both paging badly and failing to cache enough data to operate efficiently. Zfs is re-reading from disk data that would normally be cached.

The simple solution is to install a lot more RAM. 2GB is a good starting point.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
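One quick way to check for memory pressure on Solaris 10 (a generic sketch using stock tools, not specific to this system) is to compare the ARC's current size against its limits and watch the page scanner:

=======================================================
# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
# echo ::memstat | mdb -k
# vmstat 5 3
=======================================================

If arcstats:size is pinned well below c_max and the "sr" column in vmstat is non-zero, the ARC is being squeezed and data will keep being re-read from disk.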
Agreed, but still: why does zpool iostat report 15MB/s while iostat reports 615KB/s?

Regards,
Jeff

________________________________________
From: zfs-discuss-bounces at opensolaris.org [zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn [bfriesen at simple.dallas.tx.us]
Sent: Friday, November 13, 2009 4:05 PM
To: inouk
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs/io performance on Netra X1

On Fri, 13 Nov 2009, inouk wrote:
>
> So my questions are the following:
>
> 1. Why is zpool iostat reporting 15MB/s of reads when in reality only 615KB/s is read?
> 2. Why is sched doing so much I/O?
> 3. What can I do to improve I/O performance? I find it hard to believe that this is the best performance the current hardware can provide...

Your system has very little RAM (512MB). That is less than is recommended for Solaris 10 or for zfs, and if it were a PC it would be barely enough to run Windows XP. Since zfs likes to use RAM and expects that sufficient RAM will be available, it seems likely that this system is both paging badly and failing to cache enough data to operate efficiently. Zfs is re-reading from disk data that would normally be cached.

The simple solution is to install a lot more RAM. 2GB is a good starting point.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> On Fri, 13 Nov 2009, inouk wrote:
> Your system has very little RAM (512MB). That is less than is
> recommended for Solaris 10 or for zfs, and if it were a PC it would
> be barely enough to run Windows XP. Since zfs likes to use RAM and
> expects that sufficient RAM will be available, it seems likely that
> this system is both paging badly and failing to cache enough data to
> operate efficiently. Zfs is re-reading from disk data that would
> normally be cached.
>
> The simple solution is to install a lot more RAM. 2GB is a good
> starting point.

I don't agree, especially with the Windows XP comparison: XP has a windowing system and all sorts of other fancy stuff. The server I'm talking about has nothing on it except the background system processes (sendmail, kernel threads, and so on). Finally, swap isn't used at all, so I would say almost 90% of the RAM is available for zfs operations.

Anyway, I discovered something interesting: while investigating, I offlined the second disk of the mirror pool:

===========================================================
  pool: rpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Online the device using 'zpool online' or replace the device
        with 'zpool replace'.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            c0t0d0s0  ONLINE       0     0     0
            c0t2d0s0  OFFLINE      0     0     0

errors: No known data errors
===========================================================

The read rate went from 650KB/s to 1200KB/s (1.2MB/s) according to pfilestat:

===========================================================
     STATE   FDNUM      Time Filename
   running       0        5%
   waitcpu       0       12%
      read       0       16% /opt/export/flash_recovery/OVO_2008-02-20.fl
   sleep-r       0       65%

     STATE   FDNUM      KB/s Filename
      read       0      1200 /opt/export/flash_recovery/OVO_2008-02-20.fl

Total event time (ms): 4999   Total Mbytes/sec: 1
===========================================================

Also, the read service time dropped to between 80ms and 100ms:

===========================================================
device    r/s    w/s     kr/s   kw/s wait actv  svc_t  %w  %b
dad0      0.0    0.0      0.0    0.0  0.0  0.0    0.0   0   0
dad1    168.8    0.0  21608.9    0.0 13.5  1.7   89.9  78  88
===========================================================

This sounds like a bus bottleneck, as if the two drives can't use the same bus for data transfers. I don't know the hardware specifications of the Netra X1, though...
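A crude way to test the shared-bus theory directly (a sketch only; it reads 1GB from each disk's raw device, which is read-only and harmless but does add load) is to run a sequential raw read on each disk alone, then on both at once, and watch "iostat -x 1" in another window. Running "ls -l /dev/dsk/c0t0d0s0 /dev/dsk/c0t2d0s0" also shows the physical device paths behind the two targets.

===========================================================
# dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=128k count=8192 &
# dd if=/dev/rdsk/c0t2d0s0 of=/dev/null bs=128k count=8192 &
# wait
===========================================================

If the combined throughput with both dd's running is roughly the same as a single disk alone, the two drives really are sharing one channel.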
On Fri, 13 Nov 2009, inouk wrote:
>
> Sounds like a bus bottleneck, as if the two drives can't use the same
> bus for data transfers. I don't know the hardware specifications of
> the Netra X1, though.

Maybe it uses Ultra-160 SCSI like my Sun Blade 2500? This does constrain performance, but due to simultaneous writes (to each side of the mirror) rather than reads.

If it is using parallel SCSI, perhaps there is a problem with the SCSI bus termination, or a bad cable?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, Nov 13, 2009 at 9:53 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Fri, 13 Nov 2009, inouk wrote:
>>
>> Sounds like a bus bottleneck, as if the two drives can't use the same
>> bus for data transfers. I don't know the hardware specifications of
>> the Netra X1, though.
>
> Maybe it uses Ultra-160 SCSI like my Sun Blade 2500? This does constrain
> performance, but due to simultaneous writes (to each side of the mirror)
> rather than reads.
>
> If it is using parallel SCSI, perhaps there is a problem with the SCSI
> bus termination, or a bad cable?
>
> Bob

SCSI? Try PATA ;)

--Tim
On Fri, 13 Nov 2009, Tim Cook wrote:
>
> If it is using parallel SCSI, perhaps there is a problem with the SCSI
> bus termination, or a bad cable?
>
> SCSI? Try PATA ;)

Is that good? I don't recall ever selecting that option when purchasing a computer. It seemed safer to stick with SCSI than to try exotic technologies.

Does PATA daisy-chain disks onto the same cable and controller?

If this PATA bus and its drives are becoming overwhelmed, maybe it will help to tune zfs:zfs_vdev_max_pending down to a very small value in the kernel.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
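For reference, here is how that tunable can be set, either on the fly with mdb (takes effect immediately) or persistently in /etc/system (needs a reboot). The value 4 is purely illustrative:

=======================================================
# echo zfs_vdev_max_pending/W0t4 | mdb -kw

* /etc/system entry (illustrative value only):
set zfs:zfs_vdev_max_pending = 4
=======================================================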
Bob Friesenhahn wrote:
> On Fri, 13 Nov 2009, Tim Cook wrote:
>>
>>> If it is using parallel SCSI, perhaps there is a problem with the
>>> SCSI bus termination, or a bad cable?
>>
>> SCSI? Try PATA ;)
>
> Is that good? I don't recall ever selecting that option when
> purchasing a computer. It seemed safer to stick with SCSI than to try
> exotic technologies.

I hope you're being facetious. :-)

http://en.wikipedia.org/wiki/Parallel_ATA

The Netra X1 has two IDE channels, so it should be able to handle 2 disks without contention so long as only one disk is on each channel. OTOH, that machine is basically a desktop machine in a rack-mount case (similar to a Blade 100) and is also vintage 2001. I wouldn't expect much performance out of it regardless.

-Brian
The Netra X1 has one ATA bus for both internal drives.
No way to get high perf out of a snail.
 -- richard

On Nov 13, 2009, at 8:08 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Fri, 13 Nov 2009, Tim Cook wrote:
>> If it is using parallel SCSI, perhaps there is a problem with the
>> SCSI bus termination, or a bad cable?
>> SCSI? Try PATA ;)
>
> Is that good? I don't recall ever selecting that option when
> purchasing a computer. It seemed safer to stick with SCSI than to try
> exotic technologies.
>
> Does PATA daisy-chain disks onto the same cable and controller?
>
> If this PATA bus and its drives are becoming overwhelmed, maybe it will
> help to tune zfs:zfs_vdev_max_pending down to a very small value in the
> kernel.
>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
There is also a long-standing bug in the ALi chipset used on these servers which ZFS tickles. I don't think a work-around for this bug was ever implemented, and it's still present in Solaris 10.

On Nov 13, 2009, at 11:29 AM, Richard Elling wrote:

> The Netra X1 has one ATA bus for both internal drives.
> No way to get high perf out of a snail.
>  -- richard
>
> On Nov 13, 2009, at 8:08 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>
>> On Fri, 13 Nov 2009, Tim Cook wrote:
>>> If it is using parallel SCSI, perhaps there is a problem with the
>>> SCSI bus termination, or a bad cable?
>>> SCSI? Try PATA ;)
>>
>> Is that good? I don't recall ever selecting that option when
>> purchasing a computer. It seemed safer to stick with SCSI than to try
>> exotic technologies.
>>
>> Does PATA daisy-chain disks onto the same cable and controller?
>>
>> If this PATA bus and its drives are becoming overwhelmed, maybe it will
>> help to tune zfs:zfs_vdev_max_pending down to a very small value in the
>> kernel.
>>
>> Bob
>> --
>> Bob Friesenhahn
>> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/