Bob Friesenhahn
2009-Jul-04 04:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS performs
so terribly on my system. I blew a good bit of personal life savings on this
set-up but am not seeing performance anywhere near what is expected. Testing
with iozone shows that bulk I/O performance is good. Testing with Jeff
Bonwick's 'diskqual.sh' shows expected disk performance. The problem is that
actual observed application performance sucks, and could often be satisfied by
portable USB drives rather than high-end SAS drives. It could be satisfied by
just one SAS disk drive. Behavior is as if zfs is very slow to read data,
since disks are read at only 2 or 3 MB/second followed by an intermittent
write on a long cycle. Drive lights blink slowly. It is as if ZFS does no
successful sequential read-ahead on the files (see Prefetch Data hit rate of
0% and Prefetch Data cache miss of 60% below), or there is a semaphore
bottleneck somewhere (but CPU use is very low).

Observed behavior is very program dependent.

# zpool status Sun_2540
  pool: Sun_2540
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h46m with 0 errors on Mon Jun 29 05:06:33 2009
config:

        NAME                                       STATE     READ WRITE CKSUM
        Sun_2540                                   ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096A47B4559Ed0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA047B4529Bd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096E47B456DAd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA447B4544Fd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096147B451BEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA847B45605d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096647B453CEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AAC47B45739d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000097347B457D4d0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB047B457ADd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B800039C9B500000A9C47B4522Dd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB447B4595Fd0  ONLINE       0     0     0

errors: No known data errors

% ./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 130 MB/sec
c2t202400A0B83A8A0Bd31 13422 MB/sec
c3t202500A0B83A8A0Bd31 13422 MB/sec
c4t600A0B80003A8A0B0000096A47B4559Ed0 191 MB/sec
c4t600A0B80003A8A0B0000096E47B456DAd0 192 MB/sec
c4t600A0B80003A8A0B0000096147B451BEd0 192 MB/sec
c4t600A0B80003A8A0B0000096647B453CEd0 192 MB/sec
c4t600A0B80003A8A0B0000097347B457D4d0 212 MB/sec
c4t600A0B800039C9B500000A9C47B4522Dd0 191 MB/sec
c4t600A0B800039C9B500000AA047B4529Bd0 192 MB/sec
c4t600A0B800039C9B500000AA447B4544Fd0 192 MB/sec
c4t600A0B800039C9B500000AA847B45605d0 191 MB/sec
c4t600A0B800039C9B500000AAC47B45739d0 191 MB/sec
c4t600A0B800039C9B500000AB047B457ADd0 191 MB/sec
c4t600A0B800039C9B500000AB447B4595Fd0 191 MB/sec

% arc_summary.pl
System Memory:
        Physical RAM:  20470 MB
        Free Memory :  2371 MB
        LotsFree:      312 MB

ZFS Tunables (/etc/system):
        * set zfs:zfs_arc_max = 0x300000000
        set zfs:zfs_arc_max = 0x280000000
        * set zfs:zfs_arc_max = 0x200000000

ARC Size:
        Current Size:             9383 MB (arcsize)
        Target Size (Adaptive):   10240 MB (c)
        Min Size (Hard Limit):    1280 MB (zfs_arc_min)
        Max Size (Hard Limit):    10240 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    6%  644 MB (p)
        Most Frequently Used Cache Size: 93%  9595 MB (c-p)

ARC Efficency:
        Cache Access Total:        674638362
        Cache Hit Ratio:      91%  615586988  [Defined State for buffer]
        Cache Miss Ratio:      8%  59051374   [Undefined State for Buffer]
        REAL Hit Ratio:       87%  590314508  [MRU/MFU Hits Only]

        Data Demand   Efficiency:  96%
        Data Prefetch Efficiency:   7%

        CACHE HITS BY CACHE LIST:
          Anon:                        2%  13626529               [ New Customer, First Cache Hit ]
          Most Recently Used:         78%  480379752 (mru)        [ Return Customer ]
          Most Frequently Used:       17%  109934756 (mfu)        [ Frequent Customer ]
          Most Recently Used Ghost:    0%  5180256 (mru_ghost)    [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  1%  6465695 (mfu_ghost)    [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                78%  485431759
          Prefetch Data:               0%  3045442
          Demand Metadata:            16%  103900170
          Prefetch Metadata:           3%  23209617
        CACHE MISSES BY DATA TYPE:
          Demand Data:                30%  18109355
          Prefetch Data:              60%  35633374
          Demand Metadata:             6%  3806177
          Prefetch Metadata:           2%  1502468
---------------------------------------------

Prefetch seems to be performing badly. Ben Rockwood's blog entry at
http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch.
The sample DTrace script on that page only shows cache misses:

vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: MISS
vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: MISS

Unfortunately, the file-level prefetch DTrace sample script from the same page
seems to have a syntax error.

I tried disabling file-level prefetch (zfs_prefetch_disable=1) but did not
observe any change in behavior.

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   130.61298275
zfs:0:vdev_cache_stats:delegations      754287
zfs:0:vdev_cache_stats:hits     3973496
zfs:0:vdev_cache_stats:misses   2154959
zfs:0:vdev_cache_stats:snaptime 451955.55419545

Performance when copying 236 GB of files (each file is 5537792 bytes, with
20001 files per directory) from one directory to another:

Copy Method                             Data Rate
====================================    =================
cpio -pdum                              75 MB/s
cp -r                                   32 MB/s
tar -cf - . | (cd dest && tar -xf -)    26 MB/s

I would expect data copy rates approaching 200 MB/s.

I have not seen a peep from a zfs developer on this list for a month or two.
It would be useful if they would turn up to explain possible causes for this
level of performance. If I am encountering this problem, then it is likely
that many others are as well.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
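A minimal way to watch the raw prefetch counters directly while a copy runs
(a sketch, assuming this kernel exposes the usual zfs:0:arcstats kstat fields
that arc_summary.pl itself reads):

    # kstat -p zfs:0:arcstats | egrep 'prefetch_(data|metadata)_(hits|misses)'

Running this a couple of times during the slow reads and comparing the deltas
shows whether prefetch_data_misses is climbing much faster than
prefetch_data_hits, which would match the 0% hit / 60% miss figures reported
above.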
Bob Friesenhahn
2009-Jul-04 04:26 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s

It seems that the above should be amended. Running the cpio-based copy again
results in zpool iostat reporting a read bandwidth of only 33 MB/second. The
system seems to get slower and slower as it runs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Phil Harman
2009-Jul-04 07:48 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of
the Solaris page cache, but mmap() uses the latter. So if anyone maps a file,
ZFS has to keep the two caches in sync.

cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies
into the Solaris page cache. As long as they remain there, ZFS will be slow
for those files, even if you subsequently use read(2) to access them.

If you reboot, your cpio(1) tests will probably go fast again, until someone
uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod
I can't be sure. It would be interesting to see how tar(1) performs if you run
that test before cp(1) on a freshly rebooted system.

I have done some work with the ZFS team towards a fix, but it is currently
only in OpenSolaris.

The other thing that slows you down is that ZFS only flushes to disk every 5
seconds if there are no synchronous writes. It would be interesting to see
iostat -xnz 1 while you are running your tests. You may find the disks are
writing very efficiently for one second in every five.

Hope this helps,
Phil

blogs.sun.com/pgdh

Sent from my iPod

On 4 Jul 2009, at 05:26, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>>
>> Copy Method                             Data Rate
>> ====================================    =================
>> cpio -pdum                              75 MB/s
>> cp -r                                   32 MB/s
>> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> It seems that the above should be amended. Running the cpio-based
> copy again results in zpool iostat reporting a read bandwidth of
> only 33 MB/second. The system seems to get slower and slower as it
> runs.
>
> Bob
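A quick way to confirm which of the copy tools above actually map their input,
rather than guessing, is a DTrace one-liner on the syscall provider (a rough
sketch; run as root while the copy is in progress, and adjust execname to the
tool under test):

    # dtrace -n 'syscall::mmap*:entry /execname == "cp"/ { @[execname, probefunc] = count(); }'
    # dtrace -n 'syscall::read:entry /execname == "tar"/ { @[execname, probefunc] = count(); }'

If the first aggregation stays empty for a given tool, that tool is not
bringing the files into the Solaris page cache via mmap(2).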
Mattias Pantzare
2009-Jul-04 08:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 4, 2009 at 06:03, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS
> performs so terribly on my system. [...]
>
> Performance when copying 236 GB of files (each file is 5537792 bytes, with
> 20001 files per directory) from one directory to another:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> I would expect data copy rates approaching 200 MB/s.

What happens if you run two copies at the same time (on different data)?

Your test is very bad at using striping, as the reads are done sequentially.
Prefetch can only help within a file, and your files are only about 5 MB.
Joerg Schilling
2009-Jul-04 10:27 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mattias Pantzare <pantzer at ludd.ltu.se> wrote:

> > Performance when copying 236 GB of files (each file is 5537792 bytes, with
> > 20001 files per directory) from one directory to another:
> >
> > Copy Method                             Data Rate
> > ====================================    =================
> > cpio -pdum                              75 MB/s
> > cp -r                                   32 MB/s
> > tar -cf - . | (cd dest && tar -xf -)    26 MB/s
> >
> > I would expect data copy rates approaching 200 MB/s.
>
> What happens if you run two copies at the same time (on different data)?

Before you do things like this, you should first start using tests that may
give you useful results. None of the programs above has been written for
decent performance. I know that "cp" on Solaris is a partial exception for
single-file copies, but that does not help us if we would like to compare
_apparent_ performance.

Let me first introduce other programs:

sdd     A dd(1) replacement that was first written in 1984 and that has
        included built-in speed metering since July 1988.

star    A tar(1) replacement that was first written in 1982 and that supports
        much better performance by using a shared-memory-based FIFO.

Note that most speed tests that are run on Linux do not give useful values, as
you don't know what's happening during the observation time.

If you would like to measure read performance, I recommend using a filesystem
that was mounted directly before the test, or using files that are big enough
not to fit into memory. Use e.g.:

        sdd if=file-name bs=64k -onull -time

If you would like to measure write performance, I recommend writing files that
are big enough to avoid wrong numbers as a result of caching. Use e.g.:

        sdd -inull bs=64k count=some-number of=file-name -time

Use an appropriate value for "some-number".

For copying files, I recommend using:

        star -copy bs=1m fs=128m -time -C from-dir . to-dir

It makes sense to run another test adding the option -no-fsync. On Solaris
with UFS, using -no-fsync speeds things up by approx. 10%. On Linux with a
local filesystem, using -no-fsync speeds things up by approx. 400%. This is
why you get uselessly high numbers from using GNU tar for copy tests on Linux.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jonathan Edwards
2009-Jul-04 12:50 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 12:03 AM, Bob Friesenhahn wrote:

> % ./diskqual.sh
> c1t0d0 130 MB/sec
> c1t1d0 130 MB/sec
> c2t202400A0B83A8A0Bd31 13422 MB/sec
> c3t202500A0B83A8A0Bd31 13422 MB/sec
> c4t600A0B80003A8A0B0000096A47B4559Ed0 191 MB/sec
> [...]
> c4t600A0B800039C9B500000AB447B4595Fd0 191 MB/sec

Somehow I don't think that reading the first 64MB off (presumably) a raw disk
device 3 times and picking the middle value is going to give you much useful
information on the overall state of the disks. I believe this was more of a
quick hack to just validate that there's nothing too far out of the norm.
With that said, what are the c2 and c3 devices above? You've got to be caching
the heck out of those to get that unbelievable 13 GB/s, so you're really only
seeing memory speeds there.

More useful information would be something more like the old taz or some of
the disk I/O latency tools while you're driving a workload.

> % arc_summary.pl
> [...]
>        CACHE HITS BY DATA TYPE:
>          Demand Data:                78%  485431759
>          Prefetch Data:               0%  3045442
>          Demand Metadata:            16%  103900170
>          Prefetch Metadata:           3%  23209617
>        CACHE MISSES BY DATA TYPE:
>          Demand Data:                30%  18109355
>          Prefetch Data:              60%  35633374
>          Demand Metadata:             6%  3806177
>          Prefetch Metadata:           2%  1502468
>
> Prefetch seems to be performing badly. Ben Rockwood's blog entry at
> http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch.
> The sample DTrace script on that page only shows cache misses:
>
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: MISS
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: MISS
>
> Unfortunately, the file-level prefetch DTrace sample script from the
> same page seems to have a syntax error.

If you're using LUNs off an array, this might be another case of
zfs_vdev_max_pending being tuned more for direct-attach drives. You could be
trying to queue up too much I/O against the RAID controller, particularly if
the RAID controller is also trying to prefetch out of its cache.

> I tried disabling file-level prefetch (zfs_prefetch_disable=1) but
> did not observe any change in behavior.

This is only going to help if you've got problems in zfetch. You'd probably
see this better by looking for high lock contention in zfetch with lockstat.

> # kstat -p zfs:0:vdev_cache_stats
> zfs:0:vdev_cache_stats:class    misc
> zfs:0:vdev_cache_stats:crtime   130.61298275
> zfs:0:vdev_cache_stats:delegations      754287
> zfs:0:vdev_cache_stats:hits     3973496
> zfs:0:vdev_cache_stats:misses   2154959
> zfs:0:vdev_cache_stats:snaptime 451955.55419545
>
> Performance when copying 236 GB of files (each file is 5537792 bytes,
> with 20001 files per directory) from one directory to another:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> I would expect data copy rates approaching 200 MB/s.

You might want to dtrace this to break down where the latency is occurring,
e.g. is this a DNLC caching problem, an ARC problem, or a device-level
problem?

Also, is this really coming off a 2540? If so, you should probably investigate
the array throughput numbers and what's happening on the RAID controller. I
typically find it helpful to understand what the raw hardware is capable of
(hence tools like vdbench to drive an anticipated load before I configure
anything), and then attempt to configure the various tunables to match after
that.

For now you're pretty much just at the FS/VOP layers and playing with caching,
when the real culprit might be more on the vdev interface layer or below.

---
.je
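If lowering the per-LUN queue depth suggested above is worth trying, it can be
done live with mdb and made persistent in /etc/system (a sketch; the value 10
is purely an example, and whether it suits a 2540 can only be judged by
watching actv and svc_t in iostat while varying it):

    # echo zfs_vdev_max_pending/W0t10 | mdb -kw          # takes effect immediately
    # echo 'set zfs:zfs_vdev_max_pending = 10' >> /etc/system   # applies at next boot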
David Magda
2009-Jul-04 13:39 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 03:48, Phil Harman wrote:

> The other thing that slows you down is that ZFS only flushes to disk
> every 5 seconds if there are no synchronous writes. It would be
> interesting to see iostat -xnz 1 while you are running your tests.
> You may find the disks are writing very efficiently for one second
> in every five.

The value of 5 seconds has not been a hard stop since snv_87. Since snv_87
(and S10u6) the interval can stretch to as much as 30 seconds, although it
still aims for 5 seconds:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

See the 20-Mar-2008 change for txg.c for details.
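One way to see the actual transaction-group cadence on a given box, rather
than assuming 5 or 30 seconds, is to timestamp the sync entry point (a sketch
using the fbt provider; it assumes spa_sync is visible to fbt on this kernel):

    # dtrace -qn 'fbt::spa_sync:entry { printf("%Y  txg sync started\n", walltimestamp); }'

Correlating these timestamps with the write bursts seen in iostat -xnz 1 shows
whether the read stalls line up with txg sync.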
Joerg Schilling
2009-Jul-04 13:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
> copies into the Solaris page cache. As long as they remain there ZFS
> will be slow for those files, even if you subsequently use read(2) to
> access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until

Do you believe that a reboot is the only way to reset this?

> someone uses mmap(2) on the files again. I think tar(1) uses read(2),
> but from my iPod I can't be sure. It would be interesting to see how
> tar(1) performs if you run that test before cp(1) on a freshly
> rebooted system.

There are many tar implementations. The oldest is the UNIX tar implementation
from around 1978, the next was star from 1982, then there is GNU tar from
1987.

Star forks into two processes that are connected via shared memory in order to
speed things up. If you compare the copy speed of star and cp on UFS, and if
you tell star to be as unreliable as cp (by specifying the star option
-no-fsync), star will do the job about 30% faster than cp does, even though
star does not use mmap.

Copying with Sun's tar is a tick faster than using cp, and it is a bit more
accurate. GNU tar is not better than Sun's tar.

If you are looking for the best speed, use:

        star -copy -no-fsync -C from-dir . to-dir

and set e.g. bs=1m fs=128m.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-04 14:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote:
>
> Somehow I don't think that reading the first 64MB off (presumably) a raw
> disk device 3 times and picking the middle value is going to give you much
> useful information on the overall state of the disks. I believe this was
> more of a quick hack to just validate that there's nothing too far out of
> the norm. With that said, what are the c2 and c3 devices above? You've got
> to be caching the heck out of those to get that unbelievable 13 GB/s, so
> you're really only seeing memory speeds there.

Agreed. It is just a quick sanity check. I think that the c2 and c3 devices
are speedy USB drives.

> More useful information would be something more like the old taz or some of
> the disk I/O latency tools while you're driving a workload.

What I see from 'iostat -cx' is low latency (<= 4 ms) and a low workload while
the data is being read, and then (periodically) a burst of write data with
much higher latency (40-64 ms svc_t). The write burst does not take long, so
it is clear that reading is the bottleneck.

> If you're using LUNs off an array, this might be another case of
> zfs_vdev_max_pending being tuned more for direct-attach drives. You could be
> trying to queue up too much I/O against the RAID controller, particularly if
> the RAID controller is also trying to prefetch out of its cache.

I have played with zfs_vdev_max_pending before. It does dial down the latency
pretty linearly during the write phase (e.g. 35 queued I/Os results in 64 ms
svc_t).

> You might want to dtrace this to break down where the latency is occurring,
> e.g. is this a DNLC caching problem, an ARC problem, or a device-level
> problem?
>
> Also, is this really coming off a 2540? If so, you should probably
> investigate the array throughput numbers and what's happening on the RAID
> controller. I typically find it helpful to understand what the raw hardware
> is capable of (hence tools like vdbench to drive an anticipated load before
> I configure anything), and then attempt to configure the various tunables to
> match after that.

Yes, this comes off of a 2540. I used iozone for testing and see that, through
zfs, the hardware is able to write a 64GB file at 380 MB/s and read it at 551
MB/s. Unfortunately, this does not seem to translate well to the actual task.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 14:33 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead
> of the Solaris page cache. But mmap() uses the latter. So if anyone maps a
> file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
> copies into the Solaris page cache. As long as they remain there ZFS will be
> slow for those files, even if you subsequently use read(2) to access them.

This is very interesting information and certainly can explain a lot. My
application has a choice of using mmap or traditional I/O. I often use mmap.
From what you are saying, using mmap is poison to subsequent performance.

On June 29th I tested my application (which was set to use mmap) shortly after
a reboot and got this overall initial runtime:

        real    2:24:25.675
        user    4:38:57.837
        sys       14:30.823

By June 30th (with no intermediate reboot) the overall runtime had increased
to

        real    3:08:58.941
        user    4:38:38.192
        sys       15:44.197

which seems like quite a large change.

> If you reboot, your cpio(1) tests will probably go fast again, until someone
> uses mmap(2) on the files again. I think tar(1) uses read(2), but from my

I will test.

> The other thing that slows you down is that ZFS only flushes to disk every 5
> seconds if there are no synchronous writes. It would be interesting to see
> iostat -xnz 1 while you are running your tests. You may find the disks are
> writing very efficiently for one second in every five.

Actually, I found that the disks were writing flat out for five seconds at a
time, which stalled all other pool I/O (and dependent CPU) for at least three
seconds (see earlier discussion). So at the moment I have
zfs_write_limit_override set to 2684354560 so that the write cycle is more on
the order of one second in every five.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 15:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>
> If you reboot, your cpio(1) tests will probably go fast again, until someone
> uses mmap(2) on the files again. I think tar(1) uses read(2), but from my
> iPod I can't be sure. It would be interesting to see how tar(1) performs if
> you run that test before cp(1) on a freshly rebooted system.

Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows that the
cpio read rate has increased from (the most recently observed) 33 MB/second to
as much as 132 MB/second. To some this may not seem significant, but to me it
looks a whole lot different. ;-)

> I have done some work with the ZFS team towards a fix, but it is only
> currently in OpenSolaris.

Hopefully the fix is very, very good. It is difficult to displace the many
years of SunOS training that using mmap is the path to best performance. Mmap
provides many tools to improve application performance which are just not
available via traditional I/O.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gary Mills
2009-Jul-04 15:39 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.

That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP
with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I
understand that Solaris has an excellent implementation of mmap(2). ZFS has
many advantages, snapshots for example, for mailbox storage. Is there anything
that we can do to optimize the two caches in this environment? Will mmap(2)
one day play nicely with ZFS?

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2009-Jul-04 15:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
A tar pipeline still provides terrible file copy performance. Read bandwidth
is only 26 MB/s. So I stopped the tar copy and re-tried the cpio copy.

A second copy with cpio results in a read/write data rate of only 54.9 MB/s
(vs the just-experienced 132 MB/s). Performance is reduced by more than half.
Based on yesterday's experience, that may diminish to only 33 MB/s.

The amount of data being copied is much larger than any cache, yet somehow
reading a file a second time is less than half as fast. This brings me to the
absurd conclusion that the system must be rebooted immediately prior to each
use.

/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x280000000
set zfs:zfs_write_limit_override = 0xea600000
set zfs:zfs_vdev_max_pending = 5

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-04 17:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> A tar pipeline still provides terrible file copy performance. Read
> bandwidth is only 26 MB/s. So I stopped the tar copy and re-tried the
> cpio copy.
>
> A second copy with cpio results in a read/write data rate of only
> 54.9 MB/s (vs the just-experienced 132 MB/s). Performance is reduced
> by more than half. Based on yesterday's experience, that may diminish
> to only 33 MB/s.

"star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"

is nearly 40% faster than

"find . | cpio -pdum to-dir"

Did you try to use highly performant software like star?

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Phil Harman
2009-Jul-04 17:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Joerg Schilling wrote:
> Phil Harman <Phil.Harman at Sun.COM> wrote:
>
>> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
>> instead of the Solaris page cache. But mmap() uses the latter. So if
>> anyone maps a file, ZFS has to keep the two caches in sync.
>>
>> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
>> copies into the Solaris page cache. As long as they remain there ZFS
>> will be slow for those files, even if you subsequently use read(2) to
>> access them.
>>
>> If you reboot, your cpio(1) tests will probably go fast again, until
>
> Do you believe that a reboot is the only way to reset this?

No, but from my iPod I didn't have the patience to write a fuller
explanation :)

See ...

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_vnops.c#514

We take the long path if the vnode has any pages cached in the page cache. So
instead of a reboot, you should also be able to export/import the pool or
unmount/mount the filesystem. Also, if you didn't touch the file for a long
time, and had lots of other page cache churn, the file might eventually get
expunged from the page cache.

Phil
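For anyone who wants to try the non-reboot reset described above, a minimal
sketch (the pool name Sun_2540 is taken from this thread; a dataset name such
as Sun_2540/data is purely an example, and the pool must be idle while it is
exported):

    # zpool export Sun_2540 && zpool import Sun_2540          # whole pool
    # zfs unmount Sun_2540/data && zfs mount Sun_2540/data    # or just one filesystem

Either of these should discard the cached pages that force read(2) onto the
slow path, without the cost of a full reboot.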
Bob Friesenhahn
2009-Jul-04 18:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Joerg Schilling wrote:

>> by more than half. Based on yesterday's experience, that may diminish
>> to only 33 MB/s.
>
> "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
>
> is nearly 40% faster than
>
> "find . | cpio -pdum to-dir"
>
> Did you try to use highly performant software like star?

No, because I don't want to tarnish your software's stellar reputation. I am
focusing on Solaris 10 bugs today.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Phil Harman
2009-Jul-04 18:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> If you reboot, your cpio(1) tests will probably go fast again, until
>> someone uses mmap(2) on the files again. I think tar(1) uses read(2),
>> but from my iPod I can't be sure. It would be interesting to see how
>> tar(1) performs if you run that test before cp(1) on a freshly
>> rebooted system.
>
> Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows
> that the cpio read rate has increased from (the most recently
> observed) 33 MB/second to as much as 132 MB/second. To some this may
> not seem significant, but to me it looks a whole lot different. ;-)

Thanks, that's really useful data. I wasn't near a machine at the time, so I
couldn't do it for myself. I answered your initial question based on what I
understood of the implementation, and it's very satisfying to have the data to
back it up.

>> I have done some work with the ZFS team towards a fix, but it is only
>> currently in OpenSolaris.
>
> Hopefully the fix is very, very good. It is difficult to displace the
> many years of SunOS training that using mmap is the path to best
> performance. Mmap provides many tools to improve application
> performance which are just not available via traditional I/O.

The part of the problem I highlighted was ...

6699438 zfs induces crosscall storm under heavy mapped sequential read

This has been fixed in OpenSolaris, and should be fixed in Solaris 10 update
8. However, this is only part of the problem. The fundamental issue is that
ZFS has its own ARC apart from the Solaris page cache, so whenever mmap() is
used, all I/O to that file has to make sure that the two caches are in sync.
Hence, a read(2) on a file which has at some time been mapped will be
impacted, even if the file is no longer mapped.

I'm sure the data and interest from this thread will be useful to the ZFS team
in prioritising further performance enhancements. So thanks again. And if
there's any more useful data you can add, please do so. If you have a support
contract, you might also consider logging a call and even raising an
escalation request.

Cheers,
Phil
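A cheap way to check whether the crosscall storm from bug 6699438 is in play
on a given box is to watch the cross-call column in mpstat while a mapped
sequential read is running (a sketch; xcal is the per-CPU cross-calls-per-
second column):

    # mpstat 5

A sustained, very high xcal count on many CPUs during the mapped read is the
signature of that bug; low xcal numbers point the finger elsewhere.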
Joerg Schilling
2009-Jul-04 18:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
>>> by more than half. Based on yesterday's experience, that may diminish
>>> to only 33 MB/s.
>>
>> "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
>>
>> is nearly 40% faster than
>>
>> "find . | cpio -pdum to-dir"
>>
>> Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar
> reputation. I am focusing on Solaris 10 bugs today.

I've seen more professional replies. In the end it is your decision to ignore
helpful advice.

BTW: if star on ZFS were not faster than cpio, that would just be a hint of a
problem in ZFS that needs to be fixed.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jonathan Edwards
2009-Jul-04 18:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

> This brings me to the absurd conclusion that the system must be
> rebooted immediately prior to each use.

See Phil's later email: an export/import of the pool or a remount of the
filesystem should clear the page cache. With mmap'd files you're essentially
holding them both in the page cache and in the ARC, so invalidations in the
page cache are going to have effects on dirty data in the cache.

> /etc/system tunables are currently:
>
> set zfs:zfs_arc_max = 0x280000000
> set zfs:zfs_write_limit_override = 0xea600000
> set zfs:zfs_vdev_max_pending = 5

If you're on x86, I'd also increase maxphys to 128K. We still have a 56KB
default value in there, which is still a bad thing (IMO).

---
.je
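A sketch of what that change might look like (0x20000 is 131072 bytes, i.e.
128K; the mdb line simply prints the current value and assumes maxphys is a
32-bit variable on this kernel):

    # echo 'maxphys/D' | mdb -k                       # check the current value
    # echo 'set maxphys = 0x20000' >> /etc/system     # takes effect at next boot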
Phil Harman
2009-Jul-04 18:18 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Gary Mills wrote:
> On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
>
>> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
>> instead of the Solaris page cache. But mmap() uses the latter. So if
>> anyone maps a file, ZFS has to keep the two caches in sync.
>
> That's the first I've heard of this issue. Our e-mail server runs
> Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2)
> extensively. I understand that Solaris has an excellent
> implementation of mmap(2). ZFS has many advantages, snapshots for
> example, for mailbox storage. Is there anything that we can do to
> optimize the two caches in this environment? Will mmap(2) one day
> play nicely with ZFS?

I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0) was
the first UNIX to get a working implementation of mmap(2) for files (if I
recall correctly, BSD 4.3 had a manpage but no implementation for files). From
that we got a whole lot of cool stuff, not least dynamic linking with ld.so
(which has made it just about everywhere).

The Solaris implementation of mmap(2) is functionally correct, but the wait
for a 64 bit address space rather moved the attention of performance tuning
elsewhere. I must admit I was surprised to see so much code out there that
still uses mmap(2) for general I/O (rather than just to support dynamic
linking).

Software engineering is always about prioritising resource. Nothing
prioritises performance tuning attention quite like compelling competitive
data. When Bart Smaalders and I wrote libMicro we generated a lot of very
compelling data. I also coined the phrase "If Linux is faster, it's a Solaris
bug". You will find quite a few (mostly fixed) bugs with the synopsis "linux
is faster than solaris at ...".

So, if mmap(2) playing nicely with ZFS is important to you, probably the best
thing you can do to help that along is to provide data that will help build
the business case for spending engineering resource on the issue.

Cheers,
Phil
Bob Friesenhahn
2009-Jul-04 18:25 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) performance
bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 with the latest
firmware. I rebooted the system, used cpio to send the input files to
/dev/null, and then immediately used cpio a second time to send the input
files to /dev/null. Note that the amount of file data (243 GB) is plenty
sufficient to purge any file data from the ARC (which has a cap of 10 GB).

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt     0.00s user   0.00s system  0% cpu 1.573 total
cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt     0.00s user   0.00s system  0% cpu 0.198 total
cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total

zpool iostat averaged over 60 seconds reported that the first run through the
files read the data at 251 MB/s and the second run only achieved 68 MB/s. It
seems clear that there is something really bad about Solaris 10 zfs's file
caching code which is causing it to go into the weeds.

I don't think that the results mean much, but I have attached output from
'hotkernel' taken while a subsequent cpio copy was in progress. It shows that
the kernel is mostly sleeping.

This is not a new problem. It seems that I have been banging my head against
this from the time I started using zfs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Sampling... Hit Ctrl-C to end.
FUNCTION                                                COUNT   PCNT
[... several hundred kernel functions, each sampled at 0.0%, omitted ...]
genunix`syscall_entry                                     149    0.1%
specfs`spec_write                                         163    0.1%
genunix`write                                             219    0.1%
unix`tsc_gethrtime_delta                                  278    0.1%
genunix`syscall_mstate                                    360    0.1%
unix`tsc_gethrtimeunscaled_delta                          395    0.1%
unix`sys_syscall32                                        477    0.2%
genunix`fsflush_do_pages                                  582    0.2%
unix`mutex_enter                                          709    0.2%
unix`kcopy                                               1580    0.5%
zfs`fletcher_2_native                                    1965    0.7%
unix`cpu_halt                                          276440   95.9%
Bob Friesenhahn
2009-Jul-04 18:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
> However, this is only part of the problem. The fundamental issue is that ZFS
> has its own ARC apart from the Solaris page cache, so whenever mmap() is
> used, all I/O to that file has to make sure that the two caches are in sync.
> Hence, a read(2) on a file which has at some time been mapped will be
> impacted, even if the file is no longer mapped.

However, it seems that memory mapping is not responsible for the problem I am
seeing here. Memory mapping may make the problem seem worse, but it is
clearly not the cause.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-04 18:46 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> wrote:
> I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0)
> was the first UNIX to get a working implementation of mmap(2) for files
> (if I recall correctly, BSD 4.3 had a manpage but no implementation for
> files). From that we got a whole lot of cool stuff, not least dynamic
> linking with ld.so (which has made it just about everywhere).

Well, on BSD you could mmap() devices, but as a result of the fact that there
was no useful address space management, you first had to malloc() the amount
of space, forcing you to have the same amount of memory available as swap.
Later, the device was mapped on top of the allocated memory, which made the
underlying swap space inaccessible. At Berthold AG we had to add expensive
amounts of swap at the time in order to be able to mmap the 256 MB of RAM
from our image processor.

> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather than
> just to support dynamic linking).

When the new memory management architecture was introduced with SunOS-4.0,
things became better, although the now unified and partially anonymous
address space made it hard to implement "limit memoryuse" (rlimit with
RLIMIT_RSS). I made a working implementation for SunOS-4.0, but this did not
make it into SunOS.

There are still related performance issues. If you e.g. store a CD/DVD/BluRay
image in /tmp that is bigger than the amount of RAM in the machine, you will
observe a buffer underrun while writing with cdrecord unless you use
driveropts=burnfree, because paging in is slow on tmpfs.

> Software engineering is always about prioritising resource. Nothing
> prioritises performance tuning attention quite like compelling
> competitive data. When Bart Smaalders and I wrote libMicro we generated
> a lot of very compelling data. I also coined the phrase "If Linux is
> faster, it's a Solaris bug". You will find quite a few (mostly fixed)
> bugs with the synopsis "linux is faster than solaris at ...".

Fortunately, Linux is slower with most tasks ;-)

In 1988, the effect of mmap() was much more visible than it is now. 20 years
ago, CPU speed limited copy operations, making pipes, copyout() and similar
slow. This changed with modern CPUs, and for this reason the demand for
using mmap() is lower than it was 20 years ago.

> So, if mmap(2) playing nicely with ZFS is important to you, probably the
> best thing you can do to help that along is to provide data that will
> help build the business case for spending engineering resource on the issue.

I would be interested to see an open(2) flag that tells the system that I
will read a file that I opened exactly once in native order. This could tell
the system to do read ahead and to later mark the pages as immediately
reusable. This would make star even faster than it is now.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
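For illustration only - this is not code from star - a minimal sketch of how
an application could express that "read once, in order" intent today with
posix_fadvise(3C), assuming the filesystem honours POSIX_FADV_SEQUENTIAL and
POSIX_FADV_DONTNEED (later posts in this thread suggest ZFS may simply ignore
these hints); the file name and chunk size are arbitrary:

/* Hedged sketch: read a file exactly once, front to back, while hinting
 * the kernel about the access pattern.  The hints are advisory only. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char    buf[128 * 1024];    /* arbitrary chunk size for the example */
    off_t   done = 0;
    ssize_t n;
    int     fd;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return (1);
    }

    /* We will read the whole file sequentially: ask for read-ahead. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof (buf))) > 0) {
        /* ... consume buf here ... */

        /* The range just read will not be needed again, so the cache
         * may drop it instead of evicting other, hotter data. */
        (void) posix_fadvise(fd, done, (off_t)n, POSIX_FADV_DONTNEED);
        done += n;
    }
    (void) close(fd);
    return (0);
}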
Phil Harman
2009-Jul-04 19:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>> However, this is only part of the problem. The fundamental issue is
>> that ZFS has its own ARC apart from the Solaris page cache, so
>> whenever mmap() is used, all I/O to that file has to make sure that
>> the two caches are in sync. Hence, a read(2) on a file which has
>> sometime been mapped, will be impacted, even if the file is no longer
>> mapped.
>
> However, it seems that memory mapping is not responsible for the
> problem I am seeing here. Memory mapping may make the problem seem
> worse, but it is clearly not the cause.

mmap(2) is what brings ZFS files into the page cache. I think you've shown us
that once you've copied files with cp(1) - which does use mmap(2) - anything
that uses read(2) on the same files is impacted.

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 19:41 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> However, it seems that memory mapping is not responsible for the problem I
>> am seeing here. Memory mapping may make the problem seem worse, but it is
>> clearly not the cause.
>
> mmap(2) is what brings ZFS files into the page cache. I think you've shown
> us that once you've copied files with cp(1) - which does use mmap(2) - that
> anything that uses read(2) on the same files is impacted.

The problem is observed with cpio, which does not use mmap. This is
immediately after a reboot or unmount/mount of the filesystem.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 19:49 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote:> > this is only going to help if you''ve got problems in zfetch .. you''d probably > see this better by looking for high lock contention in zfetch with lockstatThis is what lockstat says when performance is poor: Adaptive mutex spin: 477 events in 30.019 seconds (16 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 47 10% 10% 0.00 5813 0xffffffff80256000 untimeout+0x24 46 10% 19% 0.00 2223 0xffffffffb0a2f200 taskq_thread+0xe3 38 8% 27% 0.00 2252 0xffffffffb0a2f200 cv_wait+0x70 29 6% 34% 0.00 1115 0xffffffff80256000 callout_execute+0xeb 26 5% 39% 0.00 3006 0xffffffffb0a2f200 taskq_dispatch+0x1b8 22 5% 44% 0.00 1200 0xffffffffa06158c0 post_syscall+0x206 18 4% 47% 0.00 3858 arc_eviction_mtx arc_do_user_evicts+0x76 16 3% 51% 0.00 1352 arc_eviction_mtx arc_buf_add_ref+0x2d 15 3% 54% 0.00 5376 0xffffffffb1adac28 taskq_thread+0xe3 11 2% 56% 0.00 2520 0xffffffffb1adac28 taskq_dispatch+0x1b8 9 2% 58% 0.00 2158 0xffffffffbb909e20 pollwakeup+0x116 9 2% 60% 0.00 2431 0xffffffffb1adac28 cv_wait+0x70 8 2% 62% 0.00 3912 0xffffffff80259000 untimeout+0x24 7 1% 63% 0.00 3679 0xffffffffb10dfbc0 polllock+0x3f 7 1% 65% 0.00 2171 0xffffffffb0a2f2d8 cv_wait+0x70 6 1% 66% 0.00 771 0xffffffffb3f23708 pcache_delete_fd+0xac 6 1% 67% 0.00 4679 0xffffffffb0a2f2d8 taskq_dispatch+0x1b8 5 1% 68% 0.00 500 0xffffffffbe555040 fifo_read+0xf8 5 1% 69% 0.00 15838 0xffffffff8025c000 untimeout+0x24 4 1% 70% 0.00 1213 0xffffffffac44b558 sd_initpkt_for_buf+0x110 4 1% 71% 0.00 638 0xffffffffa28722a0 polllock+0x3f 4 1% 72% 0.00 610 0xffffffff80259000 timeout_common+0x39 4 1% 73% 0.00 10691 0xffffffff80256000 timeout_common+0x39 3 1% 73% 0.00 1559 htable_mutex+0x78 htable_release+0x8a 3 1% 74% 0.00 3610 0xffffffffbb909e20 cv_timedwait_sig+0x1c1 3 1% 74% 0.00 1636 0xffffffffa240d410 ohci_allocate_periodic_in_resource+0x71 2 0% 75% 0.00 5959 0xffffffffbe555040 fifo_read+0x5c 2 0% 75% 0.00 3744 0xffffffffbe555040 polllock+0x3f 2 0% 76% 0.00 635 0xffffffffb3f23708 pollwakeup+0x116 2 0% 76% 0.00 709 0xffffffffb3f23708 cv_timedwait_sig+0x1c1 2 0% 77% 0.00 831 0xffffffffb3dd2070 pcache_insert+0x13d 2 0% 77% 0.00 5976 0xffffffffb3dd2070 pollwakeup+0x116 2 0% 77% 0.00 1339 0xffffffffb1eb9b80 metaslab_group_alloc+0x136 2 0% 78% 0.00 1514 0xffffffffb0a2f2d8 taskq_thread+0xe3 2 0% 78% 0.00 4042 0xffffffffb0a22988 vdev_queue_io_done+0xc3 2 0% 79% 0.00 3428 0xffffffffb0a21f08 vdev_queue_io_done+0xc3 2 0% 79% 0.00 1002 0xffffffffac44b558 sd_core_iostart+0x37 2 0% 79% 0.00 1387 0xffffffffa8c56d80 xbuf_iostart+0x7d 2 0% 80% 0.00 698 0xffffffffa58a3318 sd_return_command+0x11b 2 0% 80% 0.00 385 0xffffffffa58a3318 sd_start_cmds+0x115 2 0% 81% 0.00 562 0xffffffffa5647800 ssfcp_scsi_start+0x30 2 0% 81% 0.00 1620 0xffffffffa4162d58 ssfcp_scsi_init_pkt+0x1be 2 0% 82% 0.00 897 0xffffffffa4162d58 ssfcp_scsi_start+0x42 2 0% 82% 0.00 475 0xffffffffa4162b78 ssfcp_scsi_start+0x42 2 0% 82% 0.00 697 0xffffffffa40fb158 sd_start_cmds+0x115 2 0% 83% 0.00 10901 0xffffffffa28722a0 fifo_write+0x5b 2 0% 83% 0.00 4379 0xffffffffa28722a0 fifo_read+0xf8 2 0% 84% 0.00 1534 0xffffffffa2638390 emlxs_tx_get+0x38 2 0% 84% 0.00 1601 0xffffffffa2638350 emlxs_issue_iocb_cmd+0xc1 2 0% 84% 0.00 6697 0xffffffffa2503f08 vdev_queue_io_done+0x7b 2 0% 85% 0.00 4113 0xffffffffa24040b0 gcpu_ntv_mca_poll_wrapper+0x64 2 0% 85% 0.00 928 0xfffffe85dc140658 pollwakeup+0x116 1 0% 86% 0.00 404 iommulib_lock lookup_cache+0x2c 1 0% 86% 0.00 4867 pidlock thread_exit+0x6f 1 0% 86% 
0.00 1245 plocks+0x3c0 pollhead_delete+0x23 1 0% 86% 0.00 2452 plocks+0x3c0 pollhead_insert+0x35 1 0% 86% 0.00 882 htable_mutex+0x3c0 htable_lookup+0x83 1 0% 87% 0.00 28547 htable_mutex+0x3c0 htable_create+0xe3 1 0% 87% 0.00 21173 htable_mutex+0x3c0 htable_release+0x8a 1 0% 87% 0.00 1235 htable_mutex+0x370 htable_lookup+0x83 1 0% 87% 0.00 3212 htable_mutex+0x370 htable_release+0x8a 1 0% 87% 0.00 793 htable_mutex+0x78 htable_lookup+0x83 1 0% 88% 0.00 981 buf_hash_table+0x1210 arc_buf_add_ref+0x7c 1 0% 88% 0.00 1222 buf_hash_table+0x1c50 arc_buf_add_ref+0x7c 1 0% 88% 0.00 1585 buf_hash_table+0x2490 arc_buf_remove_ref+0x6d 1 0% 88% 0.00 1545158 ARC_mru+0x58 remove_reference+0x56 1 0% 88% 0.00 564 0xffffffffbcad4a00 strrput+0x19a 1 0% 89% 0.00 1033 0xffffffffbcad4a00 polllock+0x3f 1 0% 89% 0.00 587 0xffffffffbd328098 putnext+0x6c 1 0% 89% 0.00 11576 0xffffffffbd328098 strrput+0x19a 1 0% 89% 0.00 847 0xffffffffb3f23708 pcache_insert+0x13d 1 0% 90% 0.00 703 0xffffffffbb909e20 poll_common+0x258 1 0% 90% 0.00 1286 0xffffffffbcad4870 kstrgetmsg+0x79 1 0% 90% 0.00 1528 0xffffffffb1012e00 cv_wait+0x70 1 0% 90% 0.00 404 0xffffffffb1011de0 zio_notify_parent+0x37 1 0% 90% 0.00 764 0xffffffffb1011de0 zio_create+0x29f 1 0% 91% 0.00 5887 0xffffffffb0a2f3b0 cv_wait+0x70 1 0% 91% 0.00 883 0xffffffffb1ad3de0 metaslab_group_alloc+0x7e 1 0% 91% 0.00 555 0xffffffffb10dfbc0 fifo_write+0x5b 1 0% 91% 0.00 692 0xffffffffb3dd2070 pollrelock+0x36 1 0% 91% 0.00 4390 0xffffffffb0a22988 vdev_queue_io+0x6e 1 0% 92% 0.00 1449 0xffffffffb0a21f60 vdev_cache_write+0x64 1 0% 92% 0.00 859 0xffffffffb0a21a20 vdev_cache_write+0x64 1 0% 92% 0.00 446 0xffffffffb0a20ec8 vdev_queue_io+0x6e 1 0% 92% 0.00 1987 0xffffffffb0a21f08 vdev_queue_io+0x6e 1 0% 92% 0.00 5968 0xffffffffb0a21f08 vdev_queue_io_done+0x3a 1 0% 93% 0.00 280 0xffffffffac44b558 sd_start_cmds+0x115 1 0% 93% 0.00 527 0xffffffffac44b118 sd_core_iostart+0x37 1 0% 93% 0.00 380 0xffffffffac51aed8 sd_initpkt_for_buf+0x110 1 0% 93% 0.00 742 0xffffffffac51aed8 sdstrategy+0x53 1 0% 94% 0.00 696 0xffffffffac51aed8 sd_start_cmds+0x115 1 0% 94% 0.00 5398 0xffffffffb0a1c988 vdev_queue_io_done+0x3a 1 0% 94% 0.00 6102 0xffffffffb0a1c988 vdev_queue_io_done+0x7b 1 0% 94% 0.00 988 0xffffffffa40fbcd8 sd_return_command+0x11b 1 0% 94% 0.00 298 0xffffffffa4101460 ssfcp_scsi_init_pkt+0x3b4 1 0% 95% 0.00 302 0xffffffffa40fbcd8 sdstrategy+0x53 1 0% 95% 0.00 1436 0xffffffffa40fb158 sdintr+0x3a 1 0% 95% 0.00 764 0xffffffffa40fbcd8 sd_initpkt_for_buf+0x110 1 0% 95% 0.00 846 0xffffffffa40fbcd8 sd_start_cmds+0x115 1 0% 95% 0.00 1172 0xffffffffa5644f60 vdev_cache_write+0x64 1 0% 96% 0.00 8401 0xffffffffa5644f08 vdev_queue_io_done+0xc3 1 0% 96% 0.00 417 0xffffffffa5644f08 vdev_queue_io+0x6e 1 0% 96% 0.00 3419 0xffffffffa5644f08 vdev_queue_io_done+0x7b 1 0% 96% 0.00 1341 0xffffffffa58a3318 sd_core_iostart+0x37 1 0% 96% 0.00 431 0xffffffffa22cc840 fc_ulp_init_packet+0x31 1 0% 97% 0.00 569 0xffffffff807a4000 callout_execute+0xeb 1 0% 97% 0.00 695 0xffffffff8025c000 callout_execute+0xeb 1 0% 97% 0.00 500 0xffffffff80244000 callout_execute+0xeb 1 0% 97% 0.00 855 0xfffffe85dc140658 pcache_insert+0x13d 1 0% 97% 0.00 13339 0xfffffe85d67ddc48 cv_wait+0x70 1 0% 98% 0.00 5377 0xffffffff80253000 untimeout+0x24 1 0% 98% 0.00 5104 0xffffffff80253000 timeout_common+0x39 1 0% 98% 0.00 508 0xffffffff80253000 callout_execute+0xeb 1 0% 98% 0.00 260 0xffffffffa2638420 emlxs_register_pkt+0x30 1 0% 99% 0.00 1059 0xffffffffa2638390 emlxs_tx_put+0x79 1 0% 99% 0.00 411 0xffffffffa3e3a298 sdstrategy+0x53 1 0% 99% 0.00 336 
0xffffffffa3d6e380 fc_ulp_init_packet+0x31 1 0% 99% 0.00 926 0xffffffffa3cdfc58 sd_start_cmds+0x115 1 0% 99% 0.00 894 0xffffffffa3e3a298 sd_core_iostart+0x37 1 0% 100% 0.00 766 0xffffffffa3e3a298 sd_buf_iodone+0x23 1 0% 100% 0.00 340 0xffffffffa3e58420 emlxs_register_pkt+0x30 1 0% 100% 0.00 1516 0xffffffffa3e58420 emlxs_unregister_pkt+0x53 ------------------------------------------------------------------------------- Adaptive mutex block: 7 events in 30.019 seconds (0 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 1 14% 14% 0.00 39004 0xffffffffbe555040 fifo_read+0x5c 1 14% 29% 0.00 78624 0xffffffffb3dd2070 pollwakeup+0x116 1 14% 43% 0.00 6668 0xffffffffb0a2f200 taskq_dispatch+0x1b8 1 14% 57% 0.00 35694 0xffffffffb0a2f3b0 cv_wait+0x70 1 14% 71% 0.00 22697 0xffffffffb0a21f08 vdev_queue_io_done+0xc3 1 14% 86% 0.00 20174 0xffffffffb0a1c988 vdev_queue_io_done+0x3a 1 14% 100% 0.00 7365 0xfffffe85d67ddc48 cv_wait+0x70 ------------------------------------------------------------------------------- Spin lock spin: 478 events in 30.019 seconds (16 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 131 27% 27% 0.00 1383 0xffffffffa21a89f8 disp_lock_enter+0x1e 93 19% 47% 0.00 962 0xffffffffa250e9c0 disp_lock_enter+0x1e 84 18% 64% 0.00 3085 0xffffffffa21a8a28 disp_lock_enter+0x1e 73 15% 80% 0.00 2105 cpu0_disp disp_lock_enter+0x1e 33 7% 87% 0.00 3548 0xffffffffa21a8a28 disp_lock_enter_high+0x9 22 5% 91% 0.00 6011 cpu0_disp disp_lock_enter_high+0x9 21 4% 96% 0.00 2222 hres_lock hr_clock_lock+0x1d 9 2% 97% 0.00 3869 0xffffffffa21a89f8 disp_lock_enter_high+0x9 8 2% 99% 0.00 1649 0xffffffffa250e9c0 disp_lock_enter_high+0x9 4 1% 100% 0.00 624 cp_default disp_lock_enter+0x1e ------------------------------------------------------------------------------- Thread lock spin: 6 events in 30.019 seconds (0 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 2 33% 33% 0.00 698 transition_lock ts_update_list+0x5c 2 33% 67% 0.00 779 cpu[3]+0xf8 cv_wait+0x3e 1 17% 83% 0.00 452 cpu[3]+0xf8 cv_timedwait_sig+0xe1 1 17% 100% 0.00 324 cpu[2]+0xf8 cv_timedwait_sig+0xe1 ------------------------------------------------------------------------------- -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
dick hoogendijk
2009-Jul-04 20:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009 13:03:52 -0500 (CDT)
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
>> Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar
> reputation. I am focusing on Solaris 10 bugs today.

Blunt.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2009.06 release
+ All that's really worth doing is what we do for others (Lewis Carroll)
Phil Harman
2009-Jul-04 20:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>>>
>>> However, it seems that memory mapping is not responsible for the
>>> problem I am seeing here. Memory mapping may make the problem seem
>>> worse, but it is clearly not the cause.
>>
>> mmap(2) is what brings ZFS files into the page cache. I think you've
>> shown us that once you've copied files with cp(1) - which does use
>> mmap(2) - that anything that uses read(2) on the same files is impacted.
>
> The problem is observed with cpio, which does not use mmap. This is
> immediately after a reboot or unmount/mount of the filesystem.

Sorry, I didn't get to your other post ...

> Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03)
> performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540
> with latest firmware. I rebooted the system, used cpio to send the
> input files to /dev/null, and then immediately used cpio a second time
> to send the input files to /dev/null. Note that the amount of file
> data (243 GB) is plenty sufficient to purge any file data from the ARC
> (which has a cap of 10 GB).
>
> % time cat dpx-files.txt | cpio -o > /dev/null
> 495713288 blocks
> cat dpx-files.txt  0.00s user 0.00s system 0% cpu 1.573 total
> cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total
>
> % time cat dpx-files.txt | cpio -o > /dev/null
> 495713288 blocks
> cat dpx-files.txt  0.00s user 0.00s system 0% cpu 0.198 total
> cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total
>
> zpool iostat averaged over 60 seconds reported that the first run
> through the files read the data at 251 MB/s and the second run only
> achieved 68 MB/s. It seems clear that there is something really bad
> about Solaris 10 zfs's file caching code which is causing it to go
> into the weeds.
>
> I don't think that the results mean much, but I have attached output
> from 'hotkernel' while a subsequent cpio copy is taking place. It
> shows that the kernel is mostly sleeping.
>
> This is not a new problem. It seems that I have been banging my head
> against this from the time I started using zfs.

I'd like to see mpstat 1 for each case, on an otherwise idle system, but then
there's probably a whole lot of dtrace I'd like to do ... but I'm just off on
vacation for a week, and this will probably have to be my last post on this
thread until I'm back.

Cheers,
Phil
Bob Friesenhahn
2009-Jul-04 20:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> This is not a new problem. It seems that I have been banging my head
>> against this from the time I started using zfs.
>
> I'd like to see mpstat 1 for each case, on an otherwise idle system,
> but then there's probably a whole lot of dtrace I'd like to do ...
> but I'm just off on vacation for a week, and this will probably have
> to be my last post on this thread until I'm back.

Shame on you for taking well-earned vacation in my time of need. :-)

'mpstat 1' output when I/O is good:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  1700  247 2187   11  214   11    0  10270   2   5   0  93
  1    0   0    0  1478    5 2812   18  241   10    0  18424   2   4   0  94
  2    0   0    1  1210    0 2392   60  185   19    0 301927   5  28   0  67
  3    0   0    0  3242 2320 2028   60  181    9    0 222500   3  24   0  73
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  1862  244 2554    9  231    6    0   2880   2   3   0  95
  1    0   0    0  1158    1 2055   17  221    7    0   4479   1   3   0  96
  2    0   0    0  1037    0 2051   65  186   14    0 250211   4  24   0  73
  3    0   0    0  3037 2167 2101   62  186   11    0 251393   4  25   0  71

'mpstat 1' output when I/O is bad:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   859  243 1006    5  106    0    0  20733   2   3   0  95
  1    0   0    0   504   15  942   12   84    6    0  74009   3   6   0  91
  2    0   0    0   192    0  338    0   48    0    0     38   0   1   0  99
  3    0   0    0   549  376  522    1   36    0    0    135   0   2   0  98

Notice how intensely unbusy the CPU cores are when I/O is bad.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Boyd Adamson
2009-Jul-06 13:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> writes:
> Gary Mills wrote:
> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather
> than just to support dynamic linking).

Probably this is encouraged by documentation like this:

> The memory mapping interface is described in Memory Management
> Interfaces. Mapping files is the most efficient form of file I/O for
> most applications run under the SunOS platform.

Found at:

http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view

Boyd.
Bob Friesenhahn
2009-Jul-06 14:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 6 Jul 2009, Boyd Adamson wrote:
>
> Probably this is encouraged by documentation like this:
>
>> The memory mapping interface is described in Memory Management
>> Interfaces. Mapping files is the most efficient form of file I/O for
>> most applications run under the SunOS platform.
>
> Found at:
>
> http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view

People often think of the main benefit of mmap() as reducing CPU consumption
and buffer copies, but the mmap() family of programming interfaces is much
richer than low-level read/write, pread/pwrite, or stdio, because madvise()
gives the application a say in I/O scheduling and lets it flush stale data
from memory. In recent Solaris, it also includes provisions which allow
applications to improve their performance on NUMA systems.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
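As a rough, hypothetical sketch of the kind of hinting described above - not
a claim that ZFS acts on it, which is exactly what is in dispute in this
thread - a read-once scan over a mapped file might advise the kernel like
this (the checksum loop is just a stand-in for real work):

/* Hedged sketch: scan a mapped file once and advise the VM system.
 * madvise() is only a hint; nothing forces the filesystem to act on it. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat   st;
    unsigned long sum = 0;
    char          *p;
    off_t         i;
    int           fd;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) != 0) {
        perror(argv[1]);
        return (1);
    }
    if (st.st_size == 0)
        return (0);

    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    /* Hint: the mapping will be walked from start to finish exactly once. */
    (void) madvise(p, st.st_size, MADV_SEQUENTIAL);

    for (i = 0; i < st.st_size; i++)
        sum += (unsigned char)p[i];    /* stand-in for real work */

    /* Hint: the pages are no longer needed and may be reclaimed. */
    (void) madvise(p, st.st_size, MADV_DONTNEED);

    (void) munmap(p, st.st_size);
    (void) close(fd);
    (void) printf("sum %lu\n", sum);
    return (0);
}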
Gary Mills
2009-Jul-06 15:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 07:18:45PM +0100, Phil Harman wrote:
> Gary Mills wrote:
> >On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
> >
> >>ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> >>instead of the Solaris page cache. But mmap() uses the latter. So if
> >>anyone maps a file, ZFS has to keep the two caches in sync.
> >
> >That's the first I've heard of this issue. Our e-mail server runs
> >Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2)
> >extensively. I understand that Solaris has an excellent
> >implementation of mmap(2). ZFS has many advantages, snapshots for
> >example, for mailbox storage. Is there anything that we can do to
> >optimize the two caches in this environment? Will mmap(2) one day
> >play nicely with ZFS?
[..]
> Software engineering is always about prioritising resource. Nothing
> prioritises performance tuning attention quite like compelling
> competitive data. When Bart Smaalders and I wrote libMicro we generated
> a lot of very compelling data. I also coined the phrase "If Linux is
> faster, it's a Solaris bug". You will find quite a few (mostly fixed)
> bugs with the synopsis "linux is faster than solaris at ...".
>
> So, if mmap(2) playing nicely with ZFS is important to you, probably the
> best thing you can do to help that along is to provide data that will
> help build the business case for spending engineering resource on the issue.

First of all, how significant is the double caching in terms of performance?
If the effect is small, I won't worry about it anymore.

What sort of data do you need? Would a list of software products that
utilize mmap(2) extensively and could benefit from ZFS be suitable?

As for a business case, we just had an extended and catastrophic performance
degradation that was the result of two ZFS bugs. If we have another one like
that, our director is likely to instruct us to throw away all our Solaris
toys and convert to Microsoft products.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Andre van Eyssen
2009-Jul-06 15:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 6 Jul 2009, Gary Mills wrote:

> As for a business case, we just had an extended and catastrophic
> performance degradation that was the result of two ZFS bugs. If we
> have another one like that, our director is likely to instruct us to
> throw away all our Solaris toys and convert to Microsoft products.

If you change platform every time you get two bugs in a product, you must
cycle platforms on a pretty regular basis!

--
Andre van Eyssen.
mail: andre at purplecow.org           jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses   http://www2.purplecow.org
purplecow.org: PCOWpix               http://pix.purplecow.org
Bryan Allen
2009-Jul-06 15:44 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
+------------------------------------------------------------------------------
| On 2009-07-07 01:29:11, Andre van Eyssen wrote:
|
| On Mon, 6 Jul 2009, Gary Mills wrote:
|
| >As for a business case, we just had an extended and catastrophic
| >performance degradation that was the result of two ZFS bugs. If we
| >have another one like that, our director is likely to instruct us to
| >throw away all our Solaris toys and convert to Microsoft products.
|
| If you change platform every time you get two bugs in a product, you must
| cycle platforms on a pretty regular basis!

Given that policy, I don't imagine Windows will last very long anyway.
--
bda
cyberpunk is dead. long live cyberpunk.
Andrew Gabriel
2009-Jul-06 15:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Andre van Eyssen wrote:
> On Mon, 6 Jul 2009, Gary Mills wrote:
>
>> As for a business case, we just had an extended and catastrophic
>> performance degradation that was the result of two ZFS bugs. If we
>> have another one like that, our director is likely to instruct us to
>> throw away all our Solaris toys and convert to Microsoft products.
>
> If you change platform every time you get two bugs in a product, you
> must cycle platforms on a pretty regular basis!

You often find the change is towards Windows. That very rarely has the
same rules applied, so things then stick there.

--
Andrew
Sanjeev
2009-Jul-07 03:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob,

Catching up late on this thread. Would it be possible for you to collect the
following data:

- /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5

Or, if you have access to the GUDs tool, please collect data using that.
We need to understand how ARC plays a role here.

Thanks and regards,
Sanjeev.

On Sat, Jul 04, 2009 at 02:49:05PM -0500, Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Jonathan Edwards wrote:
>>
>> this is only going to help if you've got problems in zfetch .. you'd
>> probably see this better by looking for high lock contention in zfetch
>> with lockstat
>
> [Bob's full lockstat output, quoted in the original message, appears
> earlier in this thread and is trimmed here.]
--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
Lejun Zhu
2009-Jul-07 06:43 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
If the CPU seems to be idle, the tool latencytop can probably give you some
clue. It's developed for OpenSolaris, but Solaris 10 should work too (with
glib 2.14 installed). You can get a copy of v0.1 at
http://opensolaris.org/os/project/latencytop/

To use latencytop, open a terminal and start "latencytop -s -k 2". The tool
will show a window with activities that are being blocked in the system.
Then you can launch your application in another terminal to reproduce the
performance problem, switch back to the latencytop window, and use "<" and
">" to find your process. The list will tell you which function is causing
the delay.

After a couple of minutes you may press "q" to exit from latencytop. When it
ends, a log file /var/log/latencytop.log will be created. It includes the
stack traces of waits for I/O, semaphores etc. while latencytop was running.
If you post the log here, I can probably extract a list of the worst delays
in the ZFS source code, and other experts may comment.

--
This message posted from opensolaris.org
James Andrewartha
2009-Jul-07 09:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Joerg Schilling wrote:
> I would be interested to see an open(2) flag that tells the system that I
> will read a file that I opened exactly once in native order. This could
> tell the system to do read ahead and to later mark the pages as immediately
> reusable. This would make star even faster than it is now.

Are you aware of posix_fadvise(2) and madvise(2)?

--
James Andrewartha
Joerg Schilling
2009-Jul-07 14:05 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
James Andrewartha <jamesa at daa.com.au> wrote:
> Joerg Schilling wrote:
>> I would be interested to see an open(2) flag that tells the system that I
>> will read a file that I opened exactly once in native order. This could
>> tell the system to do read ahead and to later mark the pages as
>> immediately reusable. This would make star even faster than it is now.
>
> Are you aware of posix_fadvise(2) and madvise(2)?

I have of course been aware of madvise since December 1987, but this is an
interface that does not play nicely with a highly portable program like star.

posix_fadvise seems to be _very_ new for Solaris, and even though I am
frequently reading/writing the POSIX standards mailing list, I was not aware
of it. From my tests with star, I cannot see a significant performance
increase, but it may have a 3% effect.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Gary Mills
2009-Jul-07 15:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 06, 2009 at 04:54:16PM +0100, Andrew Gabriel wrote:
> Andre van Eyssen wrote:
> >On Mon, 6 Jul 2009, Gary Mills wrote:
> >
> >>As for a business case, we just had an extended and catastrophic
> >>performance degradation that was the result of two ZFS bugs. If we
> >>have another one like that, our director is likely to instruct us to
> >>throw away all our Solaris toys and convert to Microsoft products.
> >
> >If you change platform every time you get two bugs in a product, you
> >must cycle platforms on a pretty regular basis!
>
> You often find the change is towards Windows. That very rarely has the
> same rules applied, so things then stick there.

There's a more general principle in operation here. Organizations do
sometimes change platforms for peculiar reasons, but once they do that
they're not going to do it again for a long time. That's why they disregard
problems with the new platform.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2009-Jul-07 16:18 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Joerg Schilling wrote:
>
> posix_fadvise seems to be _very_ new for Solaris and even though I am
> frequently reading/writing the POSIX standards mailing list, I was not
> aware of it.
>
> From my tests with star, I cannot see a significant performance increase
> but it may have a 3% effect.

Based on the prior discussions of using mmap() with ZFS and the way ZFS
likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at all and
POSIX_FADV_DONTNEED probably does not work either. These are pretty
straightforward to implement with UFS since UFS benefits from the existing
working madvise() functionality.

ZFS seems to want to cache all read data in the ARC, period.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-07 16:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 7 Jul 2009, Joerg Schilling wrote:
>>
>> posix_fadvise seems to be _very_ new for Solaris and even though I am
>> frequently reading/writing the POSIX standards mailing list, I was not
>> aware of it.
>>
>> From my tests with star, I cannot see a significant performance increase
>> but it may have a 3% effect.
>
> Based on the prior discussions of using mmap() with ZFS and the way
> ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at
> all and POSIX_FADV_DONTNEED probably does not work either. These are
> pretty straightforward to implement with UFS since UFS benefits from
> the existing working madvise() functionality.

I did run my tests on UFS...

> ZFS seems to want to cache all read data in the ARC, period.

And this is definitely a conceptual mistake, as there are applications like
star that would like to benefit from read ahead but that don't want to trash
the caches.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-07 17:24 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Joerg Schilling wrote:
>>
>> Based on the prior discussions of using mmap() with ZFS and the way
>> ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at
>> all and POSIX_FADV_DONTNEED probably does not work either. These are
>> pretty straightforward to implement with UFS since UFS benefits from
>> the existing working madvise() functionality.
>
> I did run my tests on UFS...

To clarify, you are not likely to see benefits until the system becomes
starved for memory resources, or there is contention from multiple processes
for read cache. Solaris UFS is very well tuned, so it is likely that a
single process won't see much benefit.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-07 21:56 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Sanjeev wrote:> Bob, > > Catching up late on this thread. > > Would it be possible for you to collect the following data : > - /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5 > - /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5 > - /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5Here is the output of those commands. The start of each command is prefixed with a ''+'': + /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5 Adaptive mutex spin: 4 events in 5.023 seconds (1 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 96% 96% 0.00 803888 0xffffffffb0a2f3b0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 taskq_thread+0x14f thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 3% 99% 0.00 24784 0xfffffe85fc605af0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 zio_wait+0x53 dmu_buf_hold_array_by_dnode+0x108 dmu_buf_hold_array+0x81 dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 1% 100% 0.00 9699 pidlock[8] thread_exit+0x6f nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 proc_exit+0x927 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 100% 0.00 669 0xffffffff80253000 untimeout+0x24 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 cv_timedwait+0xb1 taskq_d_thread+0xc5 thread_start+0x8 ------------------------------------------------------------------------------- Adaptive mutex block: 2 events in 5.023 seconds (0 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 72% 72% 0.00 24546 0xffffffffb0a2f3b0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 taskq_thread+0x14f thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 28% 100% 0.00 9431 0xfffffe85fc605af0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 zio_wait+0x53 dmu_buf_hold_array_by_dnode+0x108 dmu_buf_hold_array+0x81 dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Spin lock spin: 223 events in 5.023 seconds (44 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 119 62% 62% 0.00 77453 cpu0_disp[48] disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 | 1 disp+0x7a 512 |@@@@@@@@@@@@ 49 swtch+0xa0 1024 |@@@@@@@@@@@@@ 54 cv_wait+0x68 2048 |@@ 8 taskq_thread+0x14f 4096 | 0 thread_start+0x8 8192 | 0 16384 | 0 32768 | 1 65536 | 0 131072 | 0 262144 | 0 524288 | 1 1048576 | 1 2097152 | 2 4194304 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest 
Caller 13 35% 97% 0.00 399130 0xffffffffa21a89f8 disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 |@@ 1 disp+0x7a 512 |@@@@@@ 3 swtch+0xa0 1024 |@@@@@@ 3 idle+0xdb 2048 | 0 thread_start+0x8 4096 | 0 8192 |@@ 1 16384 | 0 32768 |@@ 1 65536 | 0 131072 | 0 262144 | 0 524288 | 0 1048576 |@@@@ 2 2097152 |@@@@ 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 77 3% 100% 0.00 6402 0xffffffffa21a8a28 disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 | 1 disp+0x7a 512 |@@@@@@@@@@@@@@@ 40 swtch+0xa0 1024 |@@@@@@@@@@@@@ 35 cv_wait+0x68 2048 | 0 taskq_thread+0x14f 4096 | 0 thread_start+0x8 8192 | 0 16384 | 0 32768 | 0 65536 | 0 131072 | 0 262144 | 0 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 8 0% 100% 0.00 3096 hres_lock[4] hr_clock_lock+0x1d nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@ 3 gethrestime_lasttick+0x11 1024 |@@@@@@@@@@@@@@@ 4 timeout_common+0x31 2048 | 0 realtime_timeout+0x21 4096 | 0 cv_timedwait_sig+0xc5 8192 | 0 cv_waituntil_sig+0x113 16384 | 0 poll_common+0x3f4 32768 |@@@ 1 pollsys+0xbe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5 0% 100% 0.00 594 0xffffffffa250e9c0 disp_lock_enter_high+0x9 nsec ------ Time Distribution ------ count Stack 512 |@@@@@@ 1 setfrontdq+0xc7 1024 |@@@@@@@@@@@@@@@@@@@@@@@@ 4 ts_setrun+0x118 cv_unsleep+0x78 setrun_locked+0x7a setrun+0x19 callout_execute+0xdb softint+0x146 softlevel1+0x9 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 100% 0.00 1248 cp_default[320] disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 disp_getbest+0x15 disp+0x48 swtch+0xa0 idle+0xdb thread_start+0x8 ------------------------------------------------------------------------------- Thread lock spin: 1 events in 5.023 seconds (0 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 100% 100% 0.00 423 cpu[1][1512] ts_tick+0x2a nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 clock_tick+0x48 clock_tick_process+0x14f clock_tick_execute_common+0x73 clock_tick_schedule+0x74 clock+0x2d0 cyclic_softint+0xba cbe_softclock+0x17 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- + /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5 lockstat: warning: 45087 aggregation drops on CPU 0 lockstat: warning: 107548 aggregation drops on CPU 1 lockstat: warning: 15170 aggregation drops on CPU 2 lockstat: warning: 351494 aggregation drops on CPU 3 lockstat: warning: 12585 aggregation drops on CPU 0 lockstat: warning: 32687 aggregation drops on CPU 2 lockstat: warning: 72441 aggregation drops on CPU 3 lockstat: warning: 9933 aggregation drops on CPU 2 lockstat: warning: ran out of data records (use -n for more) Adaptive mutex hold: 1160267 events in 5.765 seconds (201263 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 295278 11% 11% 0.00 1082 0xfffffe85dd53e540 releasef+0x87 nsec ------ Time Distribution ------ 
count Stack 1024 |@@@@@@@@@@@ 115095 write+0x95 2048 |@@@@@@@@@@@@@@@@@@ 180125 write32+0xe 4096 | 29 sys_syscall32+0x101 8192 | 15 16384 | 1 32768 | 0 65536 | 0 131072 | 0 262144 | 6 524288 | 7 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 351 7% 18% 0.00 529819 0xfffffe85cce06290 poll_common+0x2a7 nsec ------ Time Distribution ------ count Stack 32768 | 2 pollsys+0xbe 65536 | 2 sys_syscall32+0x101 131072 | 1 262144 | 0 524288 |@@@@@@@@@@@@@@ 172 1048576 |@@@@@@@@@@@@@@ 172 2097152 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 17412 5% 22% 0.00 7540 0xffffffffa0616a40 brk_internal+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 12 brk+0x44 2048 | 106 sys_syscall+0x17b 4096 | 4 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16645 16384 | 162 32768 | 457 65536 | 11 131072 | 15 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 14724 4% 26% 0.00 7143 buf_hash_table[16400] arc_buf_remove_ref+0x8e nsec ------ Time Distribution ------ count Stack 2048 |@ 653 dbuf_rele+0x11d 4096 | 223 dmu_buf_rele_array+0x51 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13521 dmu_read_uio+0xbb 16384 | 301 zfs_read+0x15c 32768 | 15 zfs_shim_read+0xc 65536 | 0 fop_read+0x31 131072 | 0 read+0x188 262144 | 5 read32+0xe 524288 | 5 sys_syscall32+0x101 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 24009 3% 30% 0.00 4022 ph_mutex[65536] page_lookup_create+0x27c nsec ------ Time Distribution ------ count Stack 1024 | 20 page_lookup+0x11 2048 |@@@@@@@@@@@@@@ 11602 swap_getapage+0x6d 4096 | 23 swap_getpage+0x46 8192 |@@@@@@@@@@@@@@ 11577 fop_getpage+0x47 16384 | 780 anon_zero+0xa4 32768 | 0 segvn_faultpage+0x46b 65536 | 3 segvn_fault+0x9a6 131072 | 4 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 27696 3% 33% 0.00 3065 0xfffffe8769e3f298 dbuf_hold_impl+0x168 nsec ------ Time Distribution ------ count Stack 1024 |@@@ 3218 dbuf_hold+0x1b 2048 |@@@@@@@@@@@@@@@@ 15243 dnode_hold_impl+0x7e 4096 | 2 dnode_hold+0x14 8192 |@@@@@@@@@ 9205 dmu_buf_hold_array+0x3b 16384 | 19 dmu_read_uio+0x49 32768 | 0 zfs_read+0x15c 65536 | 0 zfs_shim_read+0xc 131072 | 0 fop_read+0x31 262144 | 7 read+0x188 524288 | 2 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 4333 2% 35% 0.00 13537 0xfffffe85f4277558 pcache_insert+0x164 nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 654 pcacheset_resolve+0x2d8 2048 |@@@@@@@@@@@@@@@@@@@@@ 3153 poll_common+0x565 4096 | 0 pollsys+0xbe 8192 | 37 sys_syscall32+0x101 16384 | 137 32768 | 89 65536 | 89 131072 | 1 262144 | 55 524288 | 117 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 47343 2% 37% 0.00 1216 htable_mutex[1024] htable_release+0x12a nsec ------ Time Distribution ------ count Stack 1024 |@@ 4231 hat_getpfnum+0x151 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 42799 rootnex_get_sgl+0x34f 4096 | 272 rootnex_coredma_bindhdl+0x12c 8192 | 3 rootnex_dma_bindhdl+0x1a 16384 | 1 ddi_dma_buf_bind_handle+0xb0 32768 | 0 ssfcp_prepare_pkt+0x1c7 65536 | 0 
ssfcp_scsi_init_pkt+0x3a0 131072 | 2 scsi_init_pkt+0x44 262144 | 0 vhci_bind_transport+0x124 524288 | 4 vhci_scsi_init_pkt+0xbf scsi_init_pkt+0x44 sd_setup_rw_pkt+0xe5 sd_initpkt_for_buf+0xa3 sd_start_cmds+0xa5 sd_core_iostart+0x87 sd_mapblockaddr_iostart+0x11a sd_xbuf_strategy+0x46 xbuf_iostart+0x75 ddi_xbuf_qstrategy+0x4a sdstrategy+0xbb bdev_strategy+0x54 ldi_strategy+0x4e vdev_disk_io_start+0x139 zio_vdev_io_start+0xba zio_execute+0x60 zio_nowait+0x9 vdev_mirror_io_start+0xa9 zio_vdev_io_start+0xba zio_execute+0x60 zio_nowait+0x9 vdev_mirror_io_start+0xa9 zio_vdev_io_start+0x147 zio_execute+0x60 zio_nowait+0x9 arc_read+0x487 dbuf_read_impl+0x1a0 dbuf_read+0x95 dmu_buf_hold_array_by_dnode+0x217 dmu_buf_hold_array+0x81 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 44245 2% 39% 0.00 1231 anon_array_lock[8192] anon_array_exit+0x2e nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 6971 segvn_faultpage+0x56c 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 37061 segvn_fault+0x9a6 4096 | 157 as_fault+0x205 8192 | 7 pagefault+0x8b 16384 | 6 trap+0x3d7 32768 | 40 cmntrap+0x140 65536 | 2 131072 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 113 2% 40% 0.00 385142 0xffffffffbca03610 poll_common+0x2a7 nsec ------ Time Distribution ------ count Stack 32768 |@@ 8 pollsys+0xbe 65536 | 0 sys_syscall32+0x101 131072 |@@@@@@@@@ 36 262144 |@@ 10 524288 |@@@@@@@ 29 1048576 |@@@@@@@ 30 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 31183 1% 42% 0.00 1226 anonhash_lock[512] anon_alloc+0x93 nsec ------ Time Distribution ------ count Stack 1024 |@@ 2843 anon_zero+0x65 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28280 segvn_faultpage+0x46b 4096 | 53 segvn_fault+0x9a6 8192 | 2 as_fault+0x205 16384 | 0 pagefault+0x8b 32768 | 1 trap+0x3d7 65536 | 1 cmntrap+0x140 131072 | 1 262144 | 1 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 30898 1% 43% 0.00 1140 pio_mutex[1024] page_io_unlock+0x44 nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 4537 pvn_plist_init+0x9c 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 26317 swap_getapage+0x1aa 4096 | 34 swap_getpage+0x46 8192 | 1 fop_getpage+0x47 16384 | 0 anon_zero+0xa4 32768 | 0 segvn_faultpage+0x46b 65536 | 3 segvn_fault+0x9a6 131072 | 2 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 27390 1% 44% 0.00 1111 0xffffffffbe667a88 as_rangeunlock+0x26 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@ 7622 brk+0x50 2048 |@@@@@@@@@@@@@@@@@@@@@ 19718 sys_syscall+0x17b 4096 | 48 8192 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 26982 1% 45% 0.00 1076 ani_free_pool[8192] anon_alloc+0xc6 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@ 8793 anon_zero+0x65 2048 |@@@@@@@@@@@@@@@@@@@@ 18178 segvn_faultpage+0x46b 4096 | 9 segvn_fault+0x9a6 8192 | 1 as_fault+0x205 16384 | 0 pagefault+0x8b 32768 | 0 trap+0x3d7 65536 | 0 cmntrap+0x140 131072 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 16883 1% 46% 0.00 1699 dbuf_hash_table[2064] dbuf_find+0xdc nsec ------ 
Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16299 dbuf_hold_impl+0x42 4096 |@ 582 dbuf_hold+0x1b 8192 | 2 dnode_hold_impl+0x7e dnode_hold+0x14 dmu_buf_hold_array+0x3b dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5 1% 47% 0.00 4756677 dtrace_lock[8] dtrace_ioctl+0xd7b nsec ------ Time Distribution ------ count Stack 8192 |@@@@@@ 1 cdev_ioctl+0x1d 16384 | 0 spec_ioctl+0x50 32768 |@@@@@@ 1 fop_ioctl+0x25 65536 | 0 ioctl+0xac 131072 | 0 sys_syscall+0x17b 262144 | 0 524288 | 0 1048576 | 0 2097152 | 0 4194304 | 0 8388608 |@@@@@@ 1 16777216 |@@@@@@@@@@@@ 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 19703 1% 48% 0.00 1195 0xffffffff807b22c0 kmem_cache_alloc+0x4d nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@ 4381 anon_alloc+0x21 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 15169 anon_zero+0x65 4096 | 12 segvn_faultpage+0x46b 8192 | 0 segvn_fault+0x9a6 16384 | 137 as_fault+0x205 32768 | 1 pagefault+0x8b 65536 | 3 trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18555 1% 48% 0.00 1108 0xffffffffb0e8e958 rrw_exit+0x69 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@ 5997 zfs_read+0x199 2048 |@@@@@@@@@@@@@@@@@@@@ 12554 zfs_shim_read+0xc 4096 | 3 fop_read+0x31 8192 | 1 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1579 1% 49% 0.00 12903 0xfffffffffb5db180 page_get_mnode_freelist+0x33d nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1561 page_get_freelist+0x1a4 32768 | 16 page_create_va+0x256 65536 | 2 swap_getapage+0xfd swap_getpage+0x46 fop_getpage+0x47 anon_zero+0xa4 segvn_faultpage+0x46b segvn_fault+0x9a6 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1576 1% 50% 0.00 12890 0xfffffffffb5db1a0 page_get_mnode_freelist+0x33d nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1561 page_get_freelist+0x1a4 32768 | 14 page_create_va+0x256 65536 | 1 swap_getapage+0xfd swap_getpage+0x46 fop_getpage+0x47 anon_zero+0xa4 segvn_faultpage+0x46b segvn_fault+0x9a6 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Spin lock hold: 36973 events in 5.765 seconds (6413 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 6135 38% 38% 0.00 7365 sleepq_head[32768] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@ 1445 ts_update_list+0x161 4096 | 173 ts_update+0x36 8192 |@@@@@@@@@@@ 2260 callout_execute+0xdb 16384 |@@@@@@@@@ 2036 taskq_thread+0xbc 32768 |@ 218 thread_start+0x8 65536 | 0 131072 | 0 262144 | 2 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2832 19% 57% 0.00 8091 0xffffffffa1d54059 mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 8192 |@@@@@@@@@@@@@@@@@@@@@@ 2108 
pci_peekpoke_check+0xbb 16384 |@@@@@@@ 723 pepb_ctlops+0x2be 32768 | 0 ddi_ctlops+0x3b 65536 | 0 i_ddi_caut_getput_ctlops+0x36 131072 | 0 i_ddi_caut_get32+0x29 262144 | 1 pci_config_get32+0x2b nvidia_pci_check_config_space+0xb8 nv_intr+0x6f av_dispatch_autovect+0x78 intr_thread+0x5f ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1146 7% 64% 0.00 7746 xc_mbox_lock[24] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 4096 |@@@@ 172 xc_do_call+0x9b 8192 |@@@@@@@@@@@@@@@ 593 xc_sync+0x36 16384 |@@@@@@@@@ 377 dtrace_xcall+0x97 32768 | 2 dtrace_sync+0x17 65536 | 1 dtrace_dynvar_clean+0xe6 131072 | 0 dtrace_state_clean+0x29 262144 | 0 cyclic_softint+0xba 524288 | 0 cbe_low_level+0x14 1048576 | 1 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5664 6% 70% 0.00 1239 0xffffffffa1d54051 mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 1024 | 10 pci_peekpoke_check+0xe6 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5625 pepb_ctlops+0x2be 4096 | 21 ddi_ctlops+0x3b 8192 | 8 i_ddi_caut_getput_ctlops+0x36 i_ddi_caut_get32+0x29 pci_config_get32+0x2b nvidia_pci_check_config_space+0xb8 nv_intr+0x6f av_dispatch_autovect+0x78 intr_thread+0x5f ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3634 6% 76% 0.00 1842 cpu0_disp[48] disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 | 35 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 2882 swtch+0xa0 4096 |@@@@ 564 idle+0xdb 8192 |@ 133 thread_start+0x8 16384 | 20 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3231 4% 80% 0.00 1657 0xffffffffa21a8a28 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 | 60 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 2693 swtch+0xa0 4096 |@@@@ 451 idle+0xdb 8192 | 19 thread_start+0x8 16384 | 8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3211 4% 84% 0.00 1534 hres_lock[4] dtrace_hres_tick+0x69 nsec ------ Time Distribution ------ count Stack 1024 | 5 cbe_hres_tick+0xe 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 2884 cyclic_expire+0xbc 4096 |@@ 306 cyclic_fire+0x5b 8192 | 15 cbe_fire+0x39 16384 | 1 av_dispatch_autovect+0x78 _interrupt+0x15a cpu_halt+0x121 idle+0x89 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2265 4% 88% 0.00 1916 cpu[0][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 13 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@@@ 1436 syscall_exit+0x59 4096 |@@@@@@@@@@ 798 sys_syscall32+0x1a0 8192 | 15 16384 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2127 3% 91% 0.00 1951 cpu[1][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 21 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@ 1237 syscall_exit+0x59 4096 |@@@@@@@@@@@ 834 sys_syscall32+0x1a0 8192 | 32 16384 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1377 2% 93% 0.00 1617 0xffffffffa21a89f8 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count 
Stack 1024 |@ 59 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@@ 1146 swtch+0xa0 4096 |@@@ 164 idle+0xdb 8192 | 6 thread_start+0x8 16384 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1613 2% 95% 0.00 1315 softcall_lock[8] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 1024 | 10 softint+0x13e 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1602 softlevel1+0x9 4096 | 1 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 948 1% 96% 0.00 1884 cpu[2][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 9 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@@@@ 659 syscall_exit+0x59 4096 |@@@@@@@ 248 sys_syscall32+0x1a0 8192 |@ 32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1050 1% 98% 0.00 1599 cpu[3][1512] disp_lock_exit_nopreempt+0x3a nsec ------ Time Distribution ------ count Stack 1024 | 6 ts_tick+0x74 2048 |@@@@@@@@@@@@@@@@@@@@@@@@ 858 clock_tick+0x48 4096 |@@@@ 169 clock_tick_process+0x14f 8192 | 17 clock_tick_execute_common+0x73 clock_tick_schedule+0x74 clock+0x2d0 cyclic_softint+0xba cbe_softclock+0x17 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 683 1% 99% 0.00 1609 0xffffffffa250e9c0 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 |@ 27 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 534 swtch+0xa0 4096 |@@@@@ 121 idle+0xdb 8192 | 0 thread_start+0x8 16384 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 788 1% 99% 0.00 1242 shuttle_lock[1] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 788 ts_update_list+0x161 ts_update+0x36 callout_execute+0xdb taskq_thread+0xbc thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 212 0% 100% 0.00 1884 lwpsleepq[32768] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 191 ts_update_list+0x161 4096 | 5 ts_update+0x36 8192 |@@ 16 callout_execute+0xdb taskq_thread+0xbc thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 23 0% 100% 0.00 4978 turnstile_table[4096] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@ 17 turnstile_exit+0x50 4096 | 0 mutex_vector_enter+0x14d 8192 |@ 1 cv_wait+0x70 16384 |@@ 2 taskq_thread+0x14f 32768 |@@@ 3 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 9 0% 100% 0.00 5938 cp_default[320] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 3 sigtoproc+0x446 4096 |@@@ 1 sigaddqa+0x4a 8192 | 0 timer_fire+0xb9 16384 |@@@@@@@@@@@@@@@@ 5 clock_realtime_fire+0x2d callout_execute+0xdb softint+0x146 softlevel1+0x9 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18 0% 100% 0.00 1292 0xffffffffa06fc6c9 mutex_vector_exit+0xad nsec ------ 
Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18 ndi_fmc_insert+0x9d rootnex_coredma_bindhdl+0x490 rootnex_dma_bindhdl+0x1a ddi_dma_buf_bind_handle+0xb0 mpt_scsi_init_pkt+0x146 scsi_init_pkt+0x44 sd_setup_rw_pkt+0xe5 sd_initpkt_for_buf+0xa3 sd_start_cmds+0xa5 sd_core_iostart+0x87 sd_mapblockaddr_iostart+0x11a sd_xbuf_strategy+0x46 xbuf_iostart+0x75 ddi_xbuf_qstrategy+0x4a sdstrategy+0xbb bdev_strategy+0x54 log_roll_write_crb+0x59 log_roll_write+0x85 trans_roll+0x1e0 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2 0% 100% 0.00 5570 reaplock[8] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 4096 |@@@@@@@@@@@@@@@ 1 lwp_create+0x1e0 8192 | 0 forklwp+0x8a 16384 |@@@@@@@@@@@@@@@ 1 cfork+0x7a6 fork1+0x10 sys_syscall+0x17b ------------------------------------------------------------------------------- R/W writer hold: 12279 events in 5.765 seconds (2130 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5479 64% 64% 0.00 65450 0xffffffffbe667ab8 as_map_locked+0x144 nsec ------ Time Distribution ------ count Stack 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5407 as_map+0x4a 131072 | 63 brk_internal+0x28f 262144 | 8 brk+0x44 524288 | 0 sys_syscall+0x17b 1048576 | 0 2097152 | 0 4194304 | 0 8388608 | 0 16777216 | 0 33554432 | 0 67108864 | 0 134217728 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 364 30% 94% 0.00 457691 0xfffffe85eb376ab0 as_unmap+0x10f nsec ------ Time Distribution ------ count Stack 65536 | 1 munmap+0x85 131072 |@@@@@@@@@@@@@@@@@@@ 242 sys_syscall+0x17b 262144 |@@@@@@@@ 100 524288 |@ 17 1048576 | 1 2097152 | 1 4194304 | 0 8388608 | 0 16777216 | 1 33554432 | 0 67108864 | 0 134217728 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5477 2% 96% 0.00 1766 0xffffffffbf781348 segvn_extend_prev+0x15a nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4964 segvn_create+0x8f5 4096 |@@ 475 as_map_locked+0x102 8192 | 37 as_map+0x4a 16384 | 0 brk_internal+0x28f 32768 | 0 brk+0x44 65536 | 1 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 96% 0.00 1210410 0xfffffe87611807c8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 1071082 0xfffffe85dea77ac0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 503423 0xfffffe8754e01580 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 
------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 499633 0xfffffe87598e2e40 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 387715 0xfffffe87611156c0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 383589 0xfffffe87611158c0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 322684 0xfffffe87611189c8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 314745 0xfffffe8761115440 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 312264 0xfffffe8761183680 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 305929 0xfffffe8761180988 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 246750 0xfffffe8764cbeec8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 217796 0xfffffe8761115340 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 211458 0xfffffe87598e2cc0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 
exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 207617 0xfffffe8761183200 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 206543 0xfffffe85d8f2a680 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 205967 0xfffffe87598e2d40 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 205288 0xfffffe85dea7cdc8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- R/W reader hold: 139192 events in 5.765 seconds (24145 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25275 28% 28% 0.00 123303 0xffffffffbe667ab8 as_fault+0x488 nsec ------ Time Distribution ------ count Stack 8192 | 1 pagefault+0x8b 16384 | 0 trap+0x3d7 32768 | 0 cmntrap+0x140 65536 | 7 131072 |@@@@@@@@@@@@@@@@@@@ 16753 262144 |@@@@@@@@@@ 8479 524288 | 34 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25253 27% 55% 0.00 116596 0xffffffffb94097c8 segvn_fault+0xcbd nsec ------ Time Distribution ------ count Stack 65536 | 33 as_fault+0x205 131072 |@@@@@@@@@@@@@@@@@@@@@@@ 19403 pagefault+0x8b 262144 |@@@@@@ 5787 trap+0x3d7 524288 | 29 cmntrap+0x140 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25253 25% 81% 0.00 109842 0xffffffffbf781348 segvn_fault+0xceb nsec ------ Time Distribution ------ count Stack 65536 | 47 as_fault+0x205 131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 24350 pagefault+0x8b 262144 | 832 trap+0x3d7 524288 | 23 cmntrap+0x140 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1690 3% 83% 0.00 172137 0xfffffe85eb376ab0 as_fault+0x488 nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@ 723 pagefault+0x8b 262144 |@@@@@@@@@@@@@@@ 865 trap+0x3d7 524288 |@ 86 cmntrap+0x140 1048576 | 16 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18480 2% 86% 0.00 14461 0xffffffffb2d31830 dbuf_read+0x215 nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18273 dnode_hold_impl+0xa7 32768 | 177 dnode_hold+0x14 65536 | 0 
dmu_buf_hold_array+0x3b 131072 | 1 dmu_read_uio+0x49 262144 | 16 zfs_read+0x15c 524288 | 13 zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1509 2% 88% 0.00 155379 0xfffffe876684c4f8 segvn_fault+0xcbd nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@@ 654 as_fault+0x205 262144 |@@@@@@@@@@@@@@@@ 829 pagefault+0x8b 524288 | 22 trap+0x3d7 1048576 | 4 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1509 2% 90% 0.00 148662 0xfffffe85dea77d40 segvn_fault+0xceb nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@@ 655 as_fault+0x205 262144 |@@@@@@@@@@@@@@@@ 830 pagefault+0x8b 524288 | 22 trap+0x3d7 1048576 | 2 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 528 1% 91% 0.00 247232 0xffffffffb0a2f3b8 taskq_thread+0xdb nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@ 457 thread_start+0x8 524288 |@@@ 69 1048576 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 92% 0.00 41220 0xfffffe87ba3ea5c0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 628 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 628 read+0x188 131072 | 44 read32+0xe 262144 | 4 sys_syscall32+0x101 524288 | 2 1048576 | 40 2097152 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2031 1% 93% 0.00 40995 0xfffffe8686311cf0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 677 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 1 zfs_read+0x15c 16384 |@@@@@@@@@ 632 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 631 read+0x188 131072 | 41 read32+0xe 262144 | 4 sys_syscall32+0x101 524288 | 3 1048576 | 36 2097152 | 5 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2025 1% 93% 0.00 40914 0xfffffe87b9955a78 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@ 673 dmu_buf_hold_array+0x81 4096 | 1 dmu_read_uio+0x49 8192 | 1 zfs_read+0x15c 16384 |@@@@@@@@@ 631 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 630 read+0x188 131072 | 41 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 4 1048576 | 35 2097152 | 5 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 94% 0.00 40006 0xfffffe8683148058 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 629 zfs_shim_read+0xc 32768 | 6 fop_read+0x31 65536 |@@@@@@@@@ 633 read+0x188 131072 | 42 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 1 1048576 | 40 2097152 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 95% 0.00 39638 0xfffffe875b9ef578 dmu_buf_hold_array_by_dnode+0x208 
nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 633 zfs_shim_read+0xc 32768 | 0 fop_read+0x31 65536 |@@@@@@@@@ 627 read+0x188 131072 | 40 read32+0xe 262144 | 3 sys_syscall32+0x101 524288 | 8 1048576 | 38 2097152 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1944 1% 95% 0.00 40258 0xfffffe8724b592c8 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 648 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 607 zfs_shim_read+0xc 32768 | 3 fop_read+0x31 65536 |@@@@@@@@@ 607 read+0x188 131072 | 37 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 35 2097152 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 96% 0.00 38110 0xfffffe8628c965c0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 632 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 633 read+0x188 131072 | 41 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 3 1048576 | 41 2097152 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1692 1% 97% 0.00 40668 0xfffffe873f426a90 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@ 563 dmu_buf_hold_array+0x81 4096 | 1 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 526 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 529 read+0x188 131072 | 34 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 1 1048576 | 30 2097152 | 4 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1433 1% 97% 0.00 41323 0xfffffe861e20acd8 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 478 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 445 zfs_shim_read+0xc 32768 | 2 fop_read+0x31 65536 |@@@@@@@@@ 444 read+0x188 131072 | 31 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 25 2097152 | 4 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1326 0% 98% 0.00 39036 0xfffffe87b98b5060 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 442 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 413 zfs_shim_read+0xc 32768 | 3 fop_read+0x31 65536 |@@@@@@@@@ 416 read+0x188 131072 | 26 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 0 1048576 | 26 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1140 0% 98% 0.00 41647 0xfffffe878b029cf0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 380 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 351 zfs_shim_read+0xc 32768 | 5 fop_read+0x31 65536 |@@@@@@@@@ 354 read+0x188 131072 | 24 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 2 1048576 | 22 2097152 | 2 
------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 987 0% 99% 0.00 42129 0xfffffe873752e520 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 329 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 306 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 306 read+0x188 131072 | 20 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 19 2097152 | 2 ------------------------------------------------------------------------------- + /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5 Profiling interrupt: 19652 events in 5.028 seconds (3909 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 18681 95% 95% 0.00 464 cpu[2] cpu_halt nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@@@@@@@@@@@ 13724 idle 1024 |@@@@@@@ 4905 thread_start 2048 | 50 4096 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 108 1% 96% 0.00 656 cpu[0] kcopy nsec ------ Time Distribution ------ count Stack 512 | 2 uiomove 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 104 dmu_read_uio 2048 | 2 zfs_read zfs_shim_read fop_read read read32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 102 1% 96% 0.00 719 cpu[1] (usermode) nsec ------ Time Distribution ------ count Stack 512 |@ 4 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 91 2048 |@@ 7 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 100 1% 97% 0.00 638 cpu[2] fletcher_2_native nsec ------ Time Distribution ------ count Stack 512 |@@@ 12 zio_checksum_verify 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@ 85 zio_execute 2048 | 3 taskq_thread thread_start ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 55 0% 97% 0.00 2909 cpu[3] fsflush_do_pages nsec ------ Time Distribution ------ count Stack 1024 | 1 fsflush 2048 |@ 3 thread_start 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 50 8192 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 38 0% 97% 0.00 617 cpu[1] sys_syscall32 nsec ------ Time Distribution ------ count Stack 512 |@@@ 4 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 34 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 30 0% 97% 0.00 582 cpu[1] syscall_mstate nsec ------ Time Distribution ------ count Stack 512 |@@@@ 4 sys_syscall32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 26 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 20 0% 97% 0.00 629 cpu[0] write nsec ------ Time Distribution ------ count Stack 512 |@ 1 write32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 19 0% 97% 0.00 645 cpu[1] tsc_gethrtimeunscaled_delta nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19 gethrtime_unscaled syscall_mstate sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec 
Hottest CPU+PIL Caller 15 0% 98% 0.00 641 cpu[0] pc_gethrestime nsec ------ Time Distribution ------ count Stack 512 |@@@@ 2 gethrestime 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 13 gethrestime_sec smark spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 14 0% 98% 0.00 653 cpu[1] tsc_gethrtime_delta nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 gethrtime pc_gethrestime gethrestime gethrestime_sec smark spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 13 0% 98% 0.00 590 cpu[0] copyin_nowatch nsec ------ Time Distribution ------ count Stack 512 |@@ 1 copyin_args32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 13 0% 98% 0.00 624 cpu[0] copyin_args32 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 12 0% 98% 0.00 627 cpu[1] mutex_enter nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12 write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 9 0% 98% 0.00 595 cpu[1] spec_write nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9 fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 596 cpu[1] gethrestime_sec nsec ------ Time Distribution ------ count Stack 512 |@@@@ 1 smark 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@ 6 spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 683 cpu[1] gethrestime_sec nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7 spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 585 cpu[1] kcopy nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7 copyin_nowatch copyin_args32 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 590 cpu[1] fop_write nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@ 3 write 1024 |@@@@@@@@@@@@@@@@@ 4 write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 6 0% 98% 0.00 729 cpu[1] spec_maxoffset nsec ------ Time Distribution ------ count Stack 512 |@@@@@ 1 fop_write 1024 |@@@@@@@@@@@@@@@@@@@@ 4 write 2048 |@@@@@ 1 write32 sys_syscall32 ------------------------------------------------------------------------------- -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
William Bauer
2009-Jul-09 18:13 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have a much more generic question regarding this thread. I have a Sun T5120 (T2 quad core, 1.4GHz) with two 10K RPM SAS drives in a mirrored pool running Solaris 10 u7. The disk performance seems horrible. I have the same apps running on a Sun X2100M2 (dual core 1.8GHz AMD), also running Solaris 10u7, with an old, really poor performing SATA drive (also with ZFS), and its disk performance seems at least 5x better. I'm not offering much detail here, but I had been attributing this to what I've always observed: Solaris on x86 performs far better than on sparc for any app I've ever used. I guess the real question is whether ZFS is ready for production in Solaris 10, or should I flar this bugger up and rebuild with UFS? This thread concerns me, and I really want to keep ZFS on this system for its many features. Sorry if this is off-topic, but you guys got me wondering.
--
This message posted from opensolaris.org
William Bauer
2009-Jul-09 18:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I don't swear. The word it bleeped was not a bad word....
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-12 21:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware.

It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC), which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously.

In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data.

There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to:

# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ...
Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 2m54.17s
user 0m7.65s
sys 0m36.59s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 11m54.65s
user 0m7.70s
sys 0m35.06s

Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'.

And here is a similar run on my Blade 2500 using the default rpool:

# ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 13m3.91s
user 2m43.04s
sys 9m28.73s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 23m50.27s
user 2m41.81s
sys 9m46.76s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

I am interested to hear about systems which do not suffer from this bug.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
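For readers who cannot fetch the script, the following is a minimal sketch of the kind of test it appears to perform, reconstructed only from the output shown above. The pool argument, file count, and file size match that output; how the files are built and the exact cpio invocation are assumptions, so the script at the URL above remains the authoritative version.

#!/bin/ksh
# Minimal sketch of the cache test described above; reconstructed from the
# printed output only, so details here may differ from the real script.
POOL=${1:-rpool}            # pool under test, e.g. ./zfs-cache-test.ksh Sun_2540
FS=$POOL/zfscachetest
NFILES=3000                 # total data should exceed RAM so the ARC cannot hold it
FILESIZE=8192000            # bytes per file

zfs create $FS
echo "Creating data file set ($NFILES files of $FILESIZE bytes) under /$FS ..."
i=0
while [ $i -lt $NFILES ]; do
    # The real script may build the files differently; dd is just one way.
    dd if=/dev/urandom of=/$FS/file$i bs=$FILESIZE count=1 2>/dev/null
    i=$((i + 1))
done
echo "Done!"

# Remount so the first read pass starts with a cold ARC.
zfs unmount $FS
zfs mount $FS

cd /$FS
echo "Doing initial (unmount/mount) 'cpio -o > /dev/null'"
time ls | cpio -o > /dev/null

echo "Doing second 'cpio -o > /dev/null'"
time ls | cpio -o > /dev/null

echo "Feel free to clean up with 'zfs destroy $FS'."

Run it as root against a pool with more than about 24 GB free and compare the two elapsed times; on an unaffected system they should be close.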
Scott Lawson
2009-Jul-12 23:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [root at xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [root at xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [root at xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [root at xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 blocks real 4m48.94s user 0m21.58s sys 0m44.91s Doing second ''cpio -o > /dev/null'' 48000247 blocks real 6m39.87s user 0m21.62s sys 0m46.20s Feel free to clean up with ''zfs destroy test1/zfscachetest''. Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60M/''s sustained on the 2nd. /Scott. Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue > for a week now. A 4X reduction in file read performance due to having > read the file before is terrible, and of course the situation is > considerably worse if the file was previously mmapped as well. Many > of us have sent a lot of money to Sun and were not aware that ZFS is > sucking the life out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. > For example, I reproduced it on my Blade 2500 (SPARC) which uses a > simple mirrored rpool. On that system there is a 1.8X read slowdown > from the file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 2m54.17s > user 0m7.65s > sys 0m36.59s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 11m54.65s > user 0m7.70s > sys 0m35.06s > > Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. 
> > And here is a similar run on my Blade 2500 using the default rpool: > > # ./zfs-cache-test.ksh > zfs create rpool/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /rpool/zfscachetest ... > Done! > zfs unmount rpool/zfscachetest > zfs mount rpool/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 13m3.91s > user 2m43.04s > sys 9m28.73s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 23m50.27s > user 2m41.81s > sys 9m46.76s > > Feel free to clean up with ''zfs destroy rpool/zfscachetest''. > > I am interested to hear about systems which do not suffer from this bug. > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, > http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
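(As a sanity check on the MB/s figures Scott quotes, the cpio block counts convert directly to throughput. The arithmetic below assumes cpio's customary 512-byte blocks, which is an assumption rather than something stated in the thread.)

# Rough throughput for Scott's run, assuming 512-byte cpio blocks.
BLOCKS=48000247
MB=$((BLOCKS / 1000 * 512 / 1000))       # roughly 24576 MB read per pass
echo "first pass : $((MB / 289)) MB/s"   # 4m48.94s is about 289 s, so ~85 MB/s
echo "second pass: $((MB / 400)) MB/s"   # 6m39.87s is about 400 s, so ~61 MB/s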
Gaëtan Lehmann
2009-Jul-13 08:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi, Here is the result on a Dell Precision T5500 with 24 GB of RAM and two HD in a mirror (SATA, 7200 rpm, NCQ). [glehmann at marvin2 tmp]$ uname -a SunOS marvin2 5.11 snv_117 i86pc i386 i86pc Solaris [glehmann at marvin2 tmp]$ pfexec ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/ zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 blocks real 8m19,74s user 0m6,47s sys 0m25,32s Doing second ''cpio -o > /dev/null'' 48000247 blocks real 10m42,68s user 0m8,35s sys 0m30,93s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. HTH, Ga?tan Le 13 juil. 09 ? 01:15, Scott Lawson a ?crit :> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 > zpool called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under /test1/ > zfscachetest ... > Done! > zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance >> issue for a week now. A 4X reduction in file read performance due >> to having read the file before is terrible, and of course the >> situation is considerably worse if the file was previously mmapped >> as well. Many of us have sent a lot of money to Sun and were not >> aware that ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple >> machines. For example, I reproduced it on my Blade 2500 (SPARC) >> which uses a simple mirrored rpool. On that system there is a 1.8X >> read slowdown from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see >> if they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> Implements a simple test. It requires a fair amount of disk space >> to run, but the main requirement is that the disk space consumed be >> more than available memory so that file data gets purged from the >> ARC. 
The script needs to run as root since it creates a filesystem >> and uses mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under / >> Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under /rpool/ >> zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this >> bug. >> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- Ga?tan Lehmann Biologie du D?veloppement et de la Reproduction INRA de Jouy-en-Josas (France) tel: +33 1 34 65 29 66 fax: 01 34 65 29 09 http://voxel.jouy.inra.fr http://www.itk.org http://www.mandriva.org http://www.bepo.fr -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: Ceci est une signature ?lectronique PGP URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090713/11a75945/attachment.bin>
Alexander Skwar
2009-Jul-13 09:30 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, On Sun, Jul 12, 2009 at 23:38, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> There has been no forward progress on the ZFS read performance issue for a > week now. ?A 4X reduction in file read performance due to having read the > file before is terrible, and of course the situation is considerably worse > if the file was previously mmapped as well. ?Many of us have sent a lot of > money to Sun and were not aware that ZFS is sucking the life out of our > expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. ?On that system there is a 1.8X read slowdown from the file > being accessed previously. > > In order to raise visibility of this issue, I invite others to see if they > can reproduce it in their ZFS pools. ?The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test.--($ ~)-- time sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 Bl?cke real 4m7.70s user 0m24.10s sys 1m5.99s Doing second ''cpio -o > /dev/null'' 48000247 Bl?cke real 1m44.88s user 0m22.26s sys 0m51.56s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. real 10m47.747s user 0m54.189s sys 3m22.039s This is a M4000 mit 32 GB RAM and two HDs in a mirror. Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo ''CLICK!''
Hey Bob,

Here are my results on a dual 2.2GHz Opteron, 8GB of RAM, and 16 SATA disks connected via a Supermicro AOC-SAT2-MV8 (albeit with one dead drive). Looks like a 5x slowdown to me:

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 4m46.45s
user 0m10.29s
sys 0m58.27s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 15m50.62s
user 0m10.54s
sys 1m11.86s

Ross
--
This message posted from opensolaris.org
Daniel Rock
2009-Jul-13 11:52 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi,

Solaris 10U7, patched with the latest released patches two weeks ago. Four ST31000340NS drives attached to two SI3132 SATA controllers, RAIDZ1. Self-built system with 2GB RAM and an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (chipid 0x0 AuthenticAMD family 15 model 35 step 2 clock 2210 MHz). On the first run throughput was ~110MB/s, on the second run only 80MB/s.

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 Blöcke

real 3m37.17s
user 0m11.15s
sys 0m47.74s

Doing second 'cpio -o > /dev/null'
48000247 Blöcke

real 4m55.69s
user 0m10.69s
sys 0m47.57s

Daniel
Jorgen Lundman
2009-Jul-13 12:31 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
x4540 running snv_117

# ./zfs-cache-test.ksh zpool1
zfs create zpool1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ...
Done!
zfs unmount zpool1/zfscachetest
zfs mount zpool1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 4m7.13s
user 0m9.27s
sys 0m49.09s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 4m52.52s
user 0m9.13s
sys 0m47.51s
Alexander Skwar
2009-Jul-13 12:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Here's more useful output, having set the number of files to 6000 so that the data set is larger than the amount of RAM.

--($ ~)-- time sudo ksh zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (6000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
96000493 Blöcke

real 8m44.82s
user 0m46.85s
sys 2m15.01s

Doing second 'cpio -o > /dev/null'
96000493 Blöcke

real 29m15.81s
user 0m45.31s
sys 3m2.36s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

real 48m40.890s
user 1m47.192s
sys 8m2.165s

Still on S10 U7 SPARC M4000. So I'm now in line with the other results - the 2nd run is WAY slower, 4x as slow.

Alexander
--
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Bob Friesenhahn
2009-Jul-13 14:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote:
>
> This is a M4000 with 32 GB RAM and two HDs in a mirror.

I think that you should edit the script to increase the file count, since your RAM size is big enough to cache most of the data.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
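One way to pick a big enough file count is sketched below. The prtconf parsing is a standard Solaris idiom; the name of the count variable inside zfs-cache-test.ksh is not shown in this thread, so treat this as a value to plug into whatever knob the script defines near its top.

# Size the data set to roughly 2x physical RAM (each test file is about 8 MB).
RAM_MB=$(prtconf | awk '/^Memory size/ {print $3}')   # e.g. 32768 on the M4000 above
NFILES=$((RAM_MB * 2 / 8))
echo "suggested file count: $NFILES"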
Bob Friesenhahn
2009-Jul-13 14:34 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote:
>
> Still on S10 U7 SPARC M4000.
>
> So I'm now in line with the other results - the 2nd run is WAY slower, 4x as slow.

It would be good to see results from a few OpenSolaris users running a recent 64-bit kernel and fast storage, to see whether this is an OpenSolaris issue as well. It seems likely to be more evident with fast SAS disks or SAN devices than with a few SATA disks, since the SATA disks have more access latency. Pools composed of mirrors should offer less read latency as well.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Interesting, I repeated the test on a few other machines running newer builds. First impressions are good:

snv_114, virtual machine, 1GB RAM, 30GB disk - 16% slowdown. (Only 9GB free, so I ran an 8GB test.)

Doing initial (unmount/mount) 'cpio -o > /dev/null'
16000083 blocks

real 3m4.85s
user 0m16.74s
sys 0m41.69s

Doing second 'cpio -o > /dev/null'
16000083 blocks

real 3m34.58s
user 0m18.85s
sys 0m45.40s

And again on snv_117, Sun x2200, 40GB RAM, single 500GB SATA disk. First run (with the default 24GB set):

real 6m25.15s
user 0m11.93s
sys 0m54.93s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 1m9.97s
user 0m12.17s
sys 0m57.80s

... d'oh! At least I know the ARC is working :-)

The second run, with a 98GB test, is running now; I'll post the results in the morning.
--
This message posted from opensolaris.org
Brad Diggs
2009-Jul-13 16:35 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps.

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail Bradley.Diggs at Sun.COM
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 4, 2009, at 2:48 AM, Phil Harman wrote:
> ZFS doesn''t mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files
> it copies into the Solaris page cache. As long as they remain there
> ZFS will be slow for those files, even if you subsequently use
> read(2) to access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until
> someone uses mmap(2) on the files again. I think tar(1) uses
> read(2), but from my iPod I can''t be sure. It would be interesting
> to see how tar(1) performs if you run that test before cp(1) on a
> freshly rebooted system.
>
> I have done some work with the ZFS team towards a fix, but it is
> only currently in OpenSolaris.
>
> The other thing that slows you down is that ZFS only flushes to disk
> every 5 seconds if there are no synchronous writes. It would be
> interesting to see iostat -xnz 1 while you are running your tests.
> You may find the disks are writing very efficiently for one second
> in every five.
>
> Hope this helps,
> Phil
>
> blogs.sun.com/pgdh
>
> Sent from my iPod
>
> On 4 Jul 2009, at 05:26, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>
>> On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>>>
>>> Copy Method                           Data Rate
>>> ====================================  =================
>>> cpio -pdum                            75 MB/s
>>> cp -r                                 32 MB/s
>>> tar -cf - . | (cd dest && tar -xf -)  26 MB/s
>>
>> It seems that the above should be amended. Running the cpio based
>> copy again results in zpool iostat only reporting a read bandwidth
>> of 33 MB/second. The system seems to get slower and slower as it
>> runs.
>>
>> Bob
>> -- 
>> Bob Friesenhahn
>> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
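Whether a given copy utility drags file pages into the Solaris page cache via mmap(2) can be checked directly with truss(1). This is a minimal sketch, not taken from the thread; the file and destination names are placeholders:

# count mmap vs. read calls made by cp(1) while copying one test file
truss -c -t mmap,read cp /Sun_2540/zfscachetest/somefile /tmp/copytest

# cpio is expected to show read(2) calls only, no mmap
find /Sun_2540/zfscachetest -type f | truss -c -t mmap,read cpio -o > /dev/null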
Bob Friesenhahn
2009-Jul-13 18:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Brad Diggs wrote:
> You might want to have a look at my blog on filesystem cache tuning... It
> will probably help you to avoid memory contention between the ARC and your apps.
>
> http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear whether the same bug is shared by current OpenSolaris, since it seems that it has not been tested.

Solaris 10 U7 reads files that it has not seen before at a constant rate, regardless of the amount of file data it has already read. When a file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC.

It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup, because of ZFS. This is making me madder and madder by the minute.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
sean walmsley
2009-Jul-13 18:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Sun X4500 (thumper) with 16Gb of memory running Solaris 10 U6 with patches current to the end of Feb 2009. Current ARC size is ~6Gb. ZFS filesystem created in a ~3.2 Tb pool consisting of 7 sets of mirrored 500Gb SATA drives. I used 4000 8Mb files for a total of 32Gb. run 1: ~140M/s average according to zpool iostat real 4m1.11s user 0m10.44s sys 0m50.76s run 2: ~37M/s average according to zpool iostat real 13m53.43s user 0m10.62s sys 0m55.80s A zfs unmount followed by a mount of the filesystem returned the performance to the run 1 case. real 3m58.16s user 0m11.54s sys 0m51.95s In summary, the second run performance drops to about 30% of the original run. -- This message posted from opensolaris.org
Mike Gerdts
2009-Jul-13 19:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 9:34 AM, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Alexander Skwar wrote: >> >> Still on S10 U7 Sparc M4000. >> >> So I''m now inline with the other results - the 2nd run is WAY slower. 4x >> as slow. > > It would be good to see results from a few OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris > issue as well.Indeed it is. Using ldoms with tmpfs as the backing store for virtual disks, I see: With S10u7: # ./zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m30.35s user 0m9.90s sys 0m19.81s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m43.95s user 0m9.67s sys 0m17.96s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. # ./zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m31.14s user 0m10.09s sys 0m20.47s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m40.24s user 0m9.68s sys 0m17.86s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. When I move the zpool to a 2009.06 ldom, # /var/tmp/zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m30.09s user 0m9.58s sys 0m19.83s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m44.21s user 0m9.47s sys 0m18.18s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m29.89s user 0m9.58s sys 0m19.72s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m44.40s user 0m9.59s sys 0m18.24s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. Notice in these runs that each time the usr+sys time of the first run adds up to the elapsed time - the rate was choked by CPU. This is verified by "prstat -mL". The second run seemed to be slow due to a lock as we had just demonstrated that the IO path can do more (not an IO bottleneck) and "prstat -mL shows cpio at in sleep for a significant amount of time. FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 4m21.57s user 0m9.72s sys 0m36.30s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 4m21.56s user 0m9.72s sys 0m36.19s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. This bug report contains more detail of the configuration. One thing not covered in that bug report is that the S10u7 ldom has 2048 MB of RAM and the 2009.06 ldom has 2024 MB of RAM. -- Mike Gerdts http://mgerdts.blogspot.com/
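For reference, the primarycache experiment described above uses the ZFS dataset property of that name; on releases that have the property it can be reproduced roughly as follows. A hedged sketch with placeholder dataset names, not the exact commands from the bug report:

zfs get primarycache testpool/zfscachetest           # default is all
zfs set primarycache=none testpool/zfscachetest      # stop caching this dataset in the ARC
# ... rerun the cpio read pass here to reproduce the slowdown ...
zfs inherit primarycache testpool/zfscachetest       # restore the default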
Bob Friesenhahn
2009-Jul-13 19:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:
> FWIW, I hit another bug if I turn off primarycache.
>
> http://defect.opensolaris.org/bz/show_bug.cgi?id=10004
>
> This causes really abysmal performance - but equally so for repeat runs!

It is quite fascinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C).

It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris, although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then ''zpool iostat'' would start showing lower read performance once the ARC becomes full, but that is not the case.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-13 20:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Mike Gerdts wrote:
> >
> > FWIW, I hit another bug if I turn off primarycache.
> >
> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004
> >
> > This causes really abysmal performance - but equally so for repeat runs!
>
> It is quite fascinating seeing the huge difference in I/O performance
> from these various reports. The bug you reported seems likely to be
> that without at least a little bit of caching, it is necessary to
> re-request the underlying 128K ZFS block several times as the program
> does numerous smaller I/Os (cpio uses 10240 bytes?) across it.

cpio reads/writes in 8192 byte chunks from the filesystem.

BTW: star by default creates a shared memory based FIFO of 8 MB size and reads in the biggest possible size that would currently fit into the FIFO.

Jörg
-- 
EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jim Mauro
2009-Jul-13 20:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. Thanks, /jim Bob Friesenhahn wrote:> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> >> FWIW, I hit another bug if I turn off primarycache. >> >> http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> >> This causes really abysmal performance - but equally so for repeat runs! > > It is quite facinating seeing the huge difference in I/O performance > from these various reports. The bug you reported seems likely to be > that without at least a little bit of caching, it is necessary to > re-request the underlying 128K ZFS block several times as the program > does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally > disabling data caching seems best reserved for block-oriented > databases which are looking for a substitute for directio(3C). > > It is easily demonstrated that the problem seen in Solaris 10 (jury > still out on OpenSolaris although one report has been posted) is due > to some sort of confusion. It is not due to delays caused by purging > old data from the ARC. If these delays were caused by purging data > from the ARC, then ''zfs iostat'' would start showing lower read > performance once the ARC becomes full, but that is not the case. > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, > http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Bob Friesenhahn
2009-Jul-13 20:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote:
> cpio reads/writes in 8192 byte chunks from the filesystem.

Yes, I was just reading the cpio manual page and see that. I think that re-reading the 128K zfs block 16 times to satisfy each request for 8192 bytes explains the 16X performance loss when caching is disabled. I don''t think that this is strictly a bug, since it is what the database folks are looking for.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Mike Gerdts
2009-Jul-13 20:27 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:16 PM, Joerg Schilling<Joerg.Schilling at fokus.fraunhofer.de> wrote:> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote: > >> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> > >> > FWIW, I hit another bug if I turn off primarycache. >> > >> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> > >> > This causes really abysmal performance - but equally so for repeat runs! >> >> It is quite facinating seeing the huge difference in I/O performance >> from these various reports. ?The bug you reported seems likely to be >> that without at least a little bit of caching, it is necessary to >> re-request the underlying 128K ZFS block several times as the program >> does numerous smaller I/Os (cpio uses 10240 bytes?) across it. > > cpio reads/writes in 8192 byte chunks from the filesystem. > > BTW: star by default creates a shared memory based FIFO of 8 MB size and > reads in the biggest possible size that would currently fit into the FIFO. > > J?rgUsing cpio''s -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn''t modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024))>/dev/null". At this point cpio was spending roughly 13% usr and 87%sys. I haven''t tried star, but I did see that I could also reproduce with "cat $file | cat > /dev/null". This seems like a worthless use of cat, but it forces cat to actually copy data from input to output unlike when cat can mmap input and output. When it does that and output is /dev/null Solaris is smart enough to avoid any reads. -- Mike Gerdts http://mgerdts.blogspot.com/
Mike Gerdts
2009-Jul-13 20:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:23 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Joerg Schilling wrote:
>> cpio reads/writes in 8192 byte chunks from the filesystem.
>
> Yes, I was just reading the cpio manual page and see that. I think that
> re-reading the 128K zfs block 16 times to satisfy each request for 8192
> bytes explains the 16X performance loss when caching is disabled. I don''t
> think that this is strictly a bug since it is what the database folks are
> looking for.
>
> Bob

I did other tests with "dd bs=128k" and verified via truss that each read(2) was returning 128K. I thought I had seen excessive reads there too, but now I can''t reproduce that. Creating another fs with recordsize=8k seems to make this behavior go away - things seem to be working as designed. I''ll go update the (nota-)bug.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
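A hedged sketch of the recordsize experiment described above, with invented dataset and file names: writing the files into a dataset whose recordsize matches cpio''s 8 KB reads removes the need to re-fetch a 128 KB block for every 8 KB request when caching is disabled.

# newly written data in this dataset gets 8 KB records
zfs create -o recordsize=8k -o primarycache=none testpool/rec8k
cp /testpool/zfscachetest/somefile /testpool/rec8k/
# cpio reads in 8192-byte chunks, which now map 1:1 onto 8 KB blocks
find /testpool/rec8k -type f | cpio -o > /dev/null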
Ross Walker
2009-Jul-13 20:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 13, 2009, at 2:54 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us > wrote:> On Mon, 13 Jul 2009, Brad Diggs wrote: > >> You might want to have a look at my blog on filesystem cache >> tuning... It will probably help >> you to avoid memory contention between the ARC and your apps. >> >> http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html > > Your post makes it sound like there is not a bug in the operating > system. It does not take long to see that there is a bug in the > Solaris 10 operating system. It is not clear if the same bug is > shared by current OpenSolaris since it seems like it has not been > tested. > > Solaris 10 U7 reads files that it has not seen before at a constant > rate regardless of the amount of file data it has already read. > When the file is read a second time, the read is 4X or more slower. > If reads were slowing down because the ARC was slow to expunge stale > data, then that would be apparent on the first read pass. However, > the reads are not slowing down in the first read pass. ZFS goes > into the weeds if it has seen a file before but none of the file > data is resident in the ARC. > > It is pathetic that a Sun RAID array that I paid $21K for out of my > own life savings is not able to perform better than the cheapo > portable USB drives that I use for backup because of ZFS. This is > making me madder and madder by the minute.Have you tried limiting the ARC so it doesn''t squash the page cache? Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe. -Ross
Bob Friesenhahn
2009-Jul-13 20:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:> > Using cpio''s -C option seems to not change the behavior for this bug, > but I did see a performance difference with the case where I hadn''t > modified the zfs caching behavior. That is, the performance of the > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > sys.Interesting. I just updated zfs-cache-test.ksh on my web site so that it uses 131072 byte blocks. I see a tiny improvement in performance from doing this, but I do see a bit less CPU consumption so the CPU consumption is essentially zero. The bug remains. It seems best to use ZFS''s ideal block size so that issues don''t get confused. Using an ARC monitoring script called ''arcstat.pl'' I see a huge number of ''dmis'' events when performance is poor. The ARC size is 7GB, which is less than its prescribed cap of 10GB. Better: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:39:37 20K 1K 6 58 0 1K 100 19 100 7G 10G 15:39:38 19K 1K 5 57 0 1K 100 19 100 7G 10G 15:39:39 19K 1K 6 54 0 1K 100 18 100 7G 10G 15:39:40 17K 1K 6 51 0 1K 100 17 100 7G 10G Worse: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:43:24 4K 280 6 280 6 0 0 4 100 9G 10G 15:43:25 4K 277 6 277 6 0 0 4 100 9G 10G 15:43:26 4K 268 6 268 6 0 0 5 100 9G 10G 15:43:27 4K 259 6 259 6 0 0 4 100 9G 10G An ARC stats summary from a tool called ''arc_summary.pl'' is appended to this message. Operation is quite consistent across the full span of files. Since ''dmis'' is still low when things are "good" (and even when the ARC has surely cycled already) this leads me to believe that prefetch is mostly working and is usually satisfying read requests. When things go bad I see that ''dmiss'' becomes 100% of the misses. A hypothesis is that if zfs thinks that the data might be in the ARC (due to having seen the file before) that it disables file prefetch entirely, assuming that it can retrieve the data from its cache. Then once it finally determines that there is no cached data after all, it issues a read request. Even the "better" read performance is 1/2 of what I would expect from my hardware and based on prior test results from ''iozone''. More prefetch would surely help. 
Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ System Memory: Physical RAM: 20470 MB Free Memory : 2511 MB LotsFree: 312 MB ZFS Tunables (/etc/system): * set zfs:zfs_arc_max = 0x300000000 set zfs:zfs_arc_max = 0x280000000 * set zfs:zfs_arc_max = 0x200000000 set zfs:zfs_write_limit_override = 0xea600000 * set zfs:zfs_write_limit_override = 0xa0000000 set zfs:zfs_vdev_max_pending = 5 ARC Size: Current Size: 8735 MB (arcsize) Target Size (Adaptive): 10240 MB (c) Min Size (Hard Limit): 1280 MB (zfs_arc_min) Max Size (Hard Limit): 10240 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 95% 9791 MB (p) Most Frequently Used Cache Size: 4% 448 MB (c-p) ARC Efficency: Cache Access Total: 827767314 Cache Hit Ratio: 96% 800123657 [Defined State for buffer] Cache Miss Ratio: 3% 27643657 [Undefined State for Buffer] REAL Hit Ratio: 89% 743665046 [MRU/MFU Hits Only] Data Demand Efficiency: 99% Data Prefetch Efficiency: 61% CACHE HITS BY CACHE LIST: Anon: 5% 47497010 [ New Customer, First Cache Hit ] Most Recently Used: 33% 271365449 (mru) [ Return Customer ] Most Frequently Used: 59% 472299597 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 0% 1700764 (mru_ghost) [ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 0% 7260837 (mfu_ghost) [ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data: 73% 589582518 Prefetch Data: 2% 20424879 Demand Metadata: 17% 139111510 Prefetch Metadata: 6% 51004750 CACHE MISSES BY DATA TYPE: Demand Data: 21% 5814459 Prefetch Data: 46% 12788265 Demand Metadata: 27% 7700169 Prefetch Metadata: 4% 1340764 ---------------------------------------------
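For readers without arcstat.pl at hand: the dmis/pmis columns above are derived from the ZFS arcstats kstats, so the same counters can be sampled directly. The statistic names below are the standard arcstats fields; the 5-second interval is arbitrary:

kstat -p zfs:0:arcstats:demand_data_misses zfs:0:arcstats:prefetch_data_hits \
    zfs:0:arcstats:prefetch_data_misses zfs:0:arcstats:size zfs:0:arcstats:c 5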
Bob Friesenhahn
2009-Jul-13 21:06 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Ross Walker wrote:
> Have you tried limiting the ARC so it doesn''t squash the page cache?

Yes, the ARC is limited to 10GB, leaving another 10GB for the OS and applications. Resource limits are not the problem. There is a ton of memory and CPU to go around. Current /etc/system tunables:

set maxphys = 0x20000
set zfs:zfs_arc_max = 0x280000000
set zfs:zfs_write_limit_override = 0xea600000
set zfs:zfs_vdev_max_pending = 5

> Make sure page cache has enough for mmap plus buffers for bouncing between it
> and the ARC. I would say 1GB minimum, 2 to be safe.

In this testing mmap is not being used (cpio does not use mmap), so the page cache is not an issue. It does become an issue for ''cp -r'', though, where we see the I/O be substantially (and essentially permanently) reduced even further for impacted files until the filesystem is unmounted.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
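A quick, hedged way to confirm that /etc/system settings like the above actually took effect on the running kernel (standard mdb -k usage, run as root; the ::arc dcmd is present on recent Solaris 10 and OpenSolaris kernels):

echo "zfs_arc_max/E" | mdb -k      # live value of the ARC size cap, in bytes (decimal)
echo "::arc" | mdb -k              # summary of current ARC size, target and limits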
Mark Shellenbaum
2009-Jul-13 21:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue for > a week now. A 4X reduction in file read performance due to having read > the file before is terrible, and of course the situation is considerably > worse if the file was previously mmapped as well. Many of us have sent > a lot of money to Sun and were not aware that ZFS is sucking the life > out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. On that system there is a 1.8X read slowdown from the > file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest >I''ve opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. -Mark
Joerg Schilling
2009-Jul-13 21:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Joerg Schilling wrote:
> > cpio reads/writes in 8192 byte chunks from the filesystem.
>
> Yes, I was just reading the cpio manual page and see that. I think
> that re-reading the 128K zfs block 16 times to satisfy each request
> for 8192 bytes explains the 16X performance loss when caching is
> disabled. I don''t think that this is strictly a bug since it is what
> the database folks are looking for.

cpio spends 1.6x more SYStem CPU time than star. This may mainly be a result of the fact that cpio (when using the cpio archive format) reads/writes 512 byte blocks from/to the archive file.

cpio by default spends 19x more USER CPU time than star. This seems to be a result of the inappropriate header structure of the cpio archive format and the reblocking it requires, and cannot be easily changed (well, you could use "scpio" - in other words the "cpio" CLI personality of star - but this reduces the USER CPU time only by 10%-50% compared to Sun cpio).

cpio is a program from the past that does not fit well in our current world. The internal limits cannot be lifted without creating a new incompatible archive format. In other words: if you use cpio for your work, you have to live with its problems ;-)

If you like to play with different parameter values (e.g. read sizes), cpio is unsuitable for tests. Star allows you to set big filesystem read sizes by using the FIFO and playing with the FIFO size, and small filesystem read sizes by switching off the FIFO and playing with the archive block size.

Jörg
-- 
EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling
2009-Jul-13 21:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mike Gerdts <mgerdts at gmail.com> wrote:> Using cpio''s -C option seems to not change the behavior for this bug, > but I did see a performance difference with the case where I hadn''t > modified the zfs caching behavior. That is, the performance of the > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >/dev/null". At this point cpio was spending roughly 13% usr and 87% > sys.As mentioned before, a lot of the user CPU time from cpio is spend to create cpio archive headers or caused by the fact that cpio archives copy the file content to unaligned archive locations while the "tar" archive format starts each new file on a modulo 512 offset in the archive. This requires a lot of unneeded copying of file data. You can of course slightly modify parameters even with cpio. I am not sure what you mean with "13% usr and 87%" as star typically spends 6% of the wall clock time in user+sys CPU where the user CPU time is typically only 1.5% of the system CPU time. In the "cached" case, it is obviously ZFS that''s responsible for the slow down, regardless what cpio did in the other case. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling
2009-Jul-13 21:32 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > Using cpio''s -C option seems to not change the behavior for this bug, > > but I did see a performance difference with the case where I hadn''t > > modified the zfs caching behavior. That is, the performance of the > > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > > sys. > > Interesting. I just updated zfs-cache-test.ksh on my web site so that > it uses 131072 byte blocks. I see a tiny improvement in performance > from doing this, but I do see a bit less CPU consumption so the CPU > consumption is essentially zero. The bug remains. It seems best to > use ZFS''s ideal block size so that issues don''t get confused.If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-13 21:41 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Jim Mauro wrote:
> Bob - Have you filed a bug on this issue? I am not up to speed on
> this thread, so I can not comment on whether or not there is a bug
> here, but you seem to have a test case and supporting data. Filing a
> bug will get the attention of ZFS engineering.

No, I have not filed a bug report yet. Any problem report to Sun''s Service department seems to require at least one day''s time.

I was curious to see if recent OpenSolaris suffers from the same problem, but posted results (thus far) are not as conclusive as they are for Solaris 10.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Mike Gerdts
2009-Jul-13 22:02 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 4:41 PM, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Jim Mauro wrote: > >> Bob - Have you filed a bug on this issue? I am not up to speed on this >> thread, so I can not comment on whether or not there is a bug here, but you >> seem to have a test case and supporting data. Filing a bug will get the >> attention of ZFS engineering. > > No, I have not filed a bug report yet. ?Any problem report to Sun''s Service > department seems to require at least one day''s time. > > I was curious to see if recent OpenSolaris suffers from the same problem, > but posted results (thus far) are not as conclusive as they are for Solaris > 10.It doesn''t seem to be quite as bad as S10, but there is certainly a hit. # /var/tmp/zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (400 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 6400033 blocks real 1m26.16s user 0m12.83s sys 0m25.88s Doing second ''cpio -o > /dev/null'' 6400033 blocks real 2m44.46s user 0m12.59s sys 0m24.34s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. # cat /etc/release OpenSolaris 2009.06 snv_111b SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 07 May 2009 # uname -srvp SunOS 5.11 snv_111b sparc -- Mike Gerdts http://mgerdts.blogspot.com/
Bob Friesenhahn
2009-Jul-13 22:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mark Shellenbaum wrote:
> I''ve opened the following bug to track this issue:
>
> 6859997 zfs caching performance problem
>
> We need to track down if/when this problem was introduced or if it
> has always been there.

I think that it has always been there, for as long as I have been using ZFS (1-3/4 years). Sometimes it takes a while for me to wake up and smell the coffee.

Meanwhile I have opened a formal service request (IBIS 71326296) with Sun Support.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-13 22:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote:
> If you continue to use cpio and the cpio archive format, you force copying a
> lot of data as the cpio archive format does use odd header sizes and starts
> new files "unaligned" directly after the archive header.

Note that the output of cpio is sent to /dev/null in this test, so it is only the reading part which is significant as long as cpio''s CPU use is low.

Sun Service won''t have a clue about ''star'' since it is not part of Solaris 10. It is best to stick with what they know so the problem report won''t be rejected. If star is truly more efficient than cpio, it may make the difference even more obvious. What did you discover when you modified my test script to use ''star'' instead?

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
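Nobody has posted a star run in this thread yet; a hedged sketch of how the read pass might look with star, using its FIFO so that filesystem reads stay large. The fs= (FIFO size) and bs= (archive block size) option spellings are per star(1) and the directory is the one from the earlier test runs:

cd /Sun_2540/zfscachetest
time star -c f=/dev/null fs=32m bs=128k .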
Randy Jones
2009-Jul-14 01:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob: Sun v490, 4x1.35 processors, 32GB ram, Solaris 10u7 working with a raidz1 zpool made up of 6x146 sas drives on a j4200. Results of your running your script: # zfs-cache-test.ksh pool2 zfs create pool2/zfscachetest Creating data file set (6000 files of 8192000 bytes) under /pool2/zfscachetest ... Done! zfs unmount pool2/zfscachetest zfs mount pool2/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 96000512 blocks real 5m32.58s user 0m12.75s sys 2m56.58s Doing second ''cpio -C 131072 -o > /dev/null'' 96000512 blocks real 17m26.68s user 0m12.97s sys 4m34.33s Feel free to clean up with ''zfs destroy pool2/zfscachetest''. # Same results as you are seeing. Thanks Randy -- This message posted from opensolaris.org
Ok, build 117 does seem a lot better. The second run is slower, but not by such a huge margin. This was the end of the 98GB test:

Creating data file set (12000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) ''cpio -o > /dev/null''
192000985 blocks

real 26m17.80s
user 0m47.55s
sys 3m56.94s

Doing second ''cpio -o > /dev/null''
192000985 blocks

real 27m14.35s
user 0m46.84s
sys 4m39.85s
Jorgen, Am I right in thinking the numbers here don''t quite work. 48M blocks is just 9,000 files isn''t it, not 93,000? I''m asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I''m wondering if your test has done a similar thing. Ross> > x4540 running svn117 > > # ./zfs-cache-test.ksh zpool1 > zfs create zpool1/zfscachetest > creating data file set 93000 files of 8192000 bytes0 > under > /zpool1/zfscachetest ... > done1 > zfs unmount zpool1/zfscachetest > zfs mount zpool1/zfscachetest > > doing initial (unmount/mount) ''cpio -o . /dev/null'' > 48000247 blocks > > real 4m7.13s > user 0m9.27s > sys 0m49.09s > > doing second ''cpio -o . /dev/null'' > 48000247 blocks > > real 4m52.52s > user 0m9.13s > sys 0m47.51s > > > > > > > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss-- This message posted from opensolaris.org
Jorgen Lundman
2009-Jul-14 07:10 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? Lund Ross wrote:> Jorgen, > > Am I right in thinking the numbers here don''t quite work. 48M blocks is just 9,000 files isn''t it, not 93,000? > > I''m asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I''m wondering if your test has done a similar thing. > > Ross > > >> x4540 running svn117 >> >> # ./zfs-cache-test.ksh zpool1 >> zfs create zpool1/zfscachetest >> creating data file set 93000 files of 8192000 bytes0 >> under >> /zpool1/zfscachetest ... >> done1 >> zfs unmount zpool1/zfscachetest >> zfs mount zpool1/zfscachetest >> >> doing initial (unmount/mount) ''cpio -o . /dev/null'' >> 48000247 blocks >> >> real 4m7.13s >> user 0m9.27s >> sys 0m49.09s >> >> doing second ''cpio -o . /dev/null'' >> 48000247 blocks >> >> real 4m52.52s >> user 0m9.13s >> sys 0m47.51s >> >> >> >> >> >> >> >> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discu >> ss-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Aaah, nevermind, it looks like there''s just a rogue 9 that appeared in your output. It was just a standard run of 3,000 files.
Jorgen Lundman
2009-Jul-14 08:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ah yes, my apologies! I haven''t quite worked out why OsX VNC server can''t handle keyboard mappings. I have to copy''paste "@" even. As I pasted the output into my mail over VNC, it would have destroyed the (not very) "unusual" characters. Ross wrote:> Aaah, nevermind, it looks like there''s just a rogue 9 appeared in your output. It was just a standard run of 3,000 files.-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Kurt Schreiner
2009-Jul-14 10:33 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 08:54:36AM +0200, Ross wrote:> Ok, build 117 does seem a lot better. The second run is slower, > but not by such a huge margin.Hm, I can''t support this: SunOS fred 5.11 snv_117 sun4u sparc SUNW,Sun-Fire-V440 The system has 16GB of Ram, pool is mirrored over two FUJITSU-MBA3147NC.>-1007: sudo ksh zfs-cache-test.kshzfs create rpool/zfscachetest Creating data file set (4000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''tar to /dev/null'' real 5m12.61s user 0m0.30s sys 1m28.36s Doing second ''tar to /dev/null'' real 11m13.93s user 0m0.22s sys 1m37.41s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. user=2.32 sec, sys=343.41 sec, elapsed=23:39.41 min, cpu use=24.3% And here''s what arcstat.pl has to say when starting the second read: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 11:53:26 11K 895 7 41 0 854 100 13 100 13G 13G 11:53:27 12K 832 6 39 0 793 100 13 100 13G 13G 11:53:28 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:29 11K 832 7 39 0 793 100 13 76 13G 13G 11:53:30 12K 896 7 42 0 854 100 14 100 13G 13G 11:53:31 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:32 11K 768 6 36 0 732 100 12 100 13G 13G 11:53:33 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:34 7K 497 7 253 3 244 99 4 11 13G 13G 11:53:35 5K 385 7 385 7 0 0 0 0 13G 13G 11:53:36 5K 374 7 374 7 0 0 0 0 13G 13G 11:53:37 5K 368 7 368 7 0 0 0 0 13G 13G 11:53:38 4K 340 7 340 7 0 0 0 0 13G 13G 11:53:39 5K 383 7 383 7 0 0 0 0 13G 13G 11:53:40 5K 406 7 406 7 0 0 0 0 13G 13G 11:53:41 4K 360 7 360 7 0 0 0 0 13G 13G 11:53:42 4K 328 7 328 7 0 0 0 0 13G 13G 11:53:43 4K 346 7 346 7 0 0 0 0 13G 13G 11:53:44 4K 346 7 346 7 0 0 0 0 13G 13G 11:53:45 4K 319 7 319 7 0 0 0 0 13G 13G 11:53:47 4K 337 7 337 7 0 0 0 0 13G 13G I used tar in this run instead of cpio, just to give it a try... [time (find . -type f | xargs -i tar cf /dev/null {} )] Another run with Bob''s new script: (rpool/zfscachetest not destroyed before this run, so wall clock time below is lower)>-1008: sudo ksh zfs-cache-test.ksh.1zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 64000512 blocks real 4m40.25s user 0m7.96s sys 1m28.62s Doing second ''cpio -C 131072 -o > /dev/null'' 64000512 blocks real 11m0.08s user 0m7.37s sys 1m38.58s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. user=15.35 sec, sys=187.87 sec, elapsed=15:43.65 min, cpu use=21.5% Not much difference to the "tar"-run... Kurt
Jorgen Lundman
2009-Jul-14 12:06 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I also ran this on my future RAID/NAS. Intel Atom 330 (D945GCLF2) dual core 1.6ghz, on a single HDD pool. svn_114, 64 bit, 2GB RAM. bash-3.23 ./zfs-cache-test.ksh zboot zfs create zboot/zfscachetest creating data file set (3000 files of 8192000 bytes) under /zboot/zfscachetest ... done1 zfs unmount zboot/zfscachetest zfs mount zboot/zfscachetest doing initial (unmount/mount) ''cpio -c 131072 -o . /dev/null'' 48000256 blocks real 7m45.96s user 0m6.55s sys 1m20.85s doing second ''cpio -c 131072 -o . /dev/null'' 48000256 blocks real 7m50.35s user 0m6.76s sys 1m32.91s feel free to clean up with ''zfs destroy zboot/zfscachetest''. Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue for > a week now. A 4X reduction in file read performance due to having read > the file before is terrible, and of course the situation is considerably > worse if the file was previously mmapped as well. Many of us have sent > a lot of money to Sun and were not aware that ZFS is sucking the life > out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. On that system there is a 1.8X read slowdown from the > file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 2m54.17s > user 0m7.65s > sys 0m36.59s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 11m54.65s > user 0m7.70s > sys 0m35.06s > > Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. > > And here is a similar run on my Blade 2500 using the default rpool: > > # ./zfs-cache-test.ksh > zfs create rpool/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /rpool/zfscachetest ... > Done! > zfs unmount rpool/zfscachetest > zfs mount rpool/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 13m3.91s > user 2m43.04s > sys 9m28.73s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 23m50.27s > user 2m41.81s > sys 9m46.76s > > Feel free to clean up with ''zfs destroy rpool/zfscachetest''. > > I am interested to hear about systems which do not suffer from this bug. 
> > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
For what it''s worth, I just repeated that test. The timings are suspiciously similar. This is very definitely a reproducible bug:

zfs unmount rc-pool/zfscachetest
zfs mount rc-pool/zfscachetest

Doing initial (unmount/mount) ''cpio -o > /dev/null''
48000247 blocks

real 4m45.69s
user 0m10.22s
sys 0m53.29s

Doing second ''cpio -o > /dev/null''
48000247 blocks

real 15m47.48s
user 0m10.58s
sys 1m10.96s
Bob Friesenhahn
2009-Jul-14 15:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ross, Please refresh your test script from the source. The current script tells cpio to use 128k blocks and mentions the proper command in its progress message. I have now updated it to display useful information about the system being tested, and to dump the pool configuration. It is really interesting seeing the various posted numbers. This is as close as it comes to a common benchmark. A sort of sanity check. What is most interesting to me is the reported performance for those who paid for really fast storage hardware and are using what should be really fast storage configurations. The reason why it is interesting is that there seems to be a hardware-independent cap on maximum read performance. It seems that ZFS''s read algorithm is rate-limiting the read so that regardless of how nice the hardware is, there is a peak read limit. There can be no other explanation as to why an ideal configuration of "Thumper II" SAS type hardware is neck and neck with my own setup, and quite similar to another fast system as well. My own setup is delivering less than 1/2 the performance that I would expect for the initial read (iozone says it can read 540MB/second from a huge file). Do the math and see if you think that zfs is giving you the read performance you expect based on your hardware. I think that we are encountering several bugs here. We also have a general read bottleneck. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
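One way to put numbers on the "do the math" comparison is to time a plain sequential read of a single large file through ZFS and set it against the cpio figures. A hedged sketch; the big file path is a placeholder, and cpio''s reported blocks are 512-byte units (48000247 blocks is roughly 24.6 GB):

time dd if=/Sun_2540/bigfile of=/dev/null bs=131072
# MB/s = bytes read / elapsed seconds; compare with 24.6 GB / the cpio elapsed time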
Bob Friesenhahn
2009-Jul-14 16:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Jorgen Lundman wrote:
> I have no idea. I downloaded the script from Bob without modifications and
> ran it specifying only the name of our pool. Should I have changed something
> to run the test?

If your system has quite a lot of memory, the number of files should be increased to at least match the amount of memory.

> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running
> svn117 for ZFS quotas. Worth trying on both?

It is useful to test as much as possible in order to fully understand the situation.

Since results often get posted without system details, the script is updated to dump some system info and the pool configuration. Refresh from

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
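A hedged helper for sizing a run on a given machine, assuming the goal stated above (total file data comfortably larger than RAM). The 1.5x margin and the variable names are my own, and the resulting count still has to be edited into the script by hand:

ram_mb=$(prtconf | awk '/^Memory size/ {print $3}')
# target 1.5x RAM of data; each file is 8192000 bytes, i.e. exactly 7.8125 MB
target_mb=$(( ram_mb * 3 / 2 ))
files=$(( target_mb * 10000 / 78125 ))
echo "RAM ${ram_mb} MB -> use at least ${files} files of 8192000 bytes"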
lists+zfs at xinu.tv
2009-Jul-14 16:32 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications >> and ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 >> running svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand > the situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh > from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussWhitebox Quad-core Phenom, 8G RAM, RAID-Z (3x1TB + 3x1.5TB) SATA drives via an AOC-USAS-L8i: System Configuration: Gigabyte Technology Co., Ltd. GA-MA770-DS3 System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: pool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c3t6d0 ONLINE 0 0 0 c3t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c3t1d0 ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 errors: No known data errors zfs create pool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /pool/zfscachetest ... Done! zfs unmount pool/zfscachetest zfs mount pool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 4m59.33s user 0m21.83s sys 2m56.05s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m28.11s user 0m22.66s sys 3m13.26s Feel free to clean up with ''zfs destroy pool/zfscachetest''.
Angelo Rajadurai
2009-Jul-14 16:47 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Just FYI. I ran a slightly different version of the test. I used SSD (for log & cache)! 3 x 32GB SSDs. 2 mirrored for log and one for cache. The systems is a 4150 with 12 GB of RAM. Here are the results $ pfexec ./zfs-cache-test.ksh sdpool System Configuration: System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: sdpool state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Fri Jul 10 11:33:01 2009 config: NAME STATE READ WRITE CKSUM sdpool ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 logs ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 cache c8t4d0 ONLINE 0 0 0 errors: No known data errors zfs unmount sdpool/zfscachetest zfs mount sdpool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m27.06s user 0m2.05s sys 0m30.14s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 2m47.32s user 0m2.09s sys 0m32.32s Feel free to clean up with ''zfs destroy sdpool/zfscachetest''. -Angelo On Jul 14, 2009, at 12:09 PM, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without >> modifications and ran it specifying only the name of our pool. >> Should I have changed something to run the test? > > If your system has quite a lot of memory, the number of files should > be increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 >> running svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully > understand the situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. > Refresh from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Hi Bob, My guess is something like it''s single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. A quick summary of some of the figures, with times normalized for 3000 files: Sun x2200, single 500GB sata: 6m25.15s Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s Taking my single SATA drive as a base, a pool of mirrored SATA is almost exactly twice as quick which makes sense if ZFS is reading the file off both drives at once. The raid pool of SAS drives is quicker again, but for a single threaded request that also seems about right. The random read benefits of the mirror aren''t going to take effect unless you run multiple reads in parallel. What I suspect is helping here are the slightly better seek times of the SAS drives, along with slightly higher throughput due to the raid. What might be interesting would be to see the results off a ramdisk or SSD drive. Ross -- This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-14 18:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote:
> My guess is something like it''s single threaded, with each file
> dealt with in order and requests being serviced by just one or two
> disks at a time. With that being the case, an x4500 is essentially
> just running off 7200 rpm SATA drives, which really is nothing
> special.

Keep in mind that there is supposed to be file level read-ahead. As an example, ZFS is able to read from my array at up to 551 MB/second when reading from a huge (64GB) file, yet it is only managing 145MB/second or so for these 8MB files sequentially accessed by cpio. This suggests that even in the initial read case zfs is not applying enough file level read-ahead (or not applying it soon enough) to keep the disks busy. 8MB is still pretty big in the world of files. Perhaps it takes zfs a long time to decide that read-ahead is required.

I have yet to find a tunable for file level read-ahead. There are tunables for vdev-level read-ahead, but vdev read-ahead is pretty minor and increasing it may cause more harm than help.

> A quick summary of some of the figures, with times normalized for 3000 files:
>
> Sun x2200, single 500GB sata: 6m25.15s
> Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s
> Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s
> Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s

And mine:

Ultra 40-M2 / StorageTek 2540, 6 sets of mirrored 300GB SAS: 2m44.20s

I think that Jorgen implied that his system is using SAN storage with a mirror across two jumbo LUNs.

> The raid pool of SAS drives is quicker again, but for a single
> threaded request that also seems about right. The random read
> benefits of the mirror aren''t going to take effect unless you run
> multiple reads in parallel. What I suspect is helping here are the
> slightly better seek times of the SAS drives, along with slightly
> higher throughput due to the raid.

Once ZFS decides to apply file level read-ahead then it can issue many reads in parallel. It should be able to keep at least six disks busy at once, leading to much better performance than we are seeing.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
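There may be no documented per-file read-ahead tunable, but whether file-level prefetch (zfetch) is engaged can at least be observed. A hedged sketch; the kernel variable and kstat names below are the ones these kernels normally carry and are assumptions to verify on the release under test:

echo "zfs_prefetch_disable/D" | mdb -k    # 0 means file-level prefetch is enabled
kstat -p zfs:0:zfetchstats 5              # sample the zfetch hit/miss counters during a run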
Gaëtan Lehmann
2009-Jul-14 19:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 14 Jul 2009, at 18:09, Bob Friesenhahn wrote:

> On Tue, 14 Jul 2009, Jorgen Lundman wrote:
>
>> I have no idea. I downloaded the script from Bob without
>> modifications and ran it specifying only the name of our pool.
>> Should I have changed something to run the test?
>
> If your system has quite a lot of memory, the number of files should
> be increased to at least match the amount of memory.
>
>> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2
>> running svn117 for ZFS quotas. Worth trying on both?
>
> It is useful to test as much as possible in order to fully
> understand the situation.
>
> Since results often get posted without system details, the script is
> updated to dump some system info and the pool configuration.
> Refresh from
>
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

Here is the result on another host with faster drives (SAS 10000 rpm) and Solaris 10u7.

System Configuration: Sun Microsystems SUN FIRE X4150
System architecture: i386
System release level: 5.10 Generic_139556-08
CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86

Pool configuration:
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real     4m56.84s
user     0m1.72s
sys      0m28.48s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    13m48.19s
user     0m2.07s
sys      0m44.45s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
Halldor Runar Haflidason
2009-Jul-14 20:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications and >> ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running >> svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand the > situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussAnd mine: dori at pax:1512 $ pfexec ./zfs-cache-test.ksh tank System Configuration: MICRO-STAR INTERNATIONAL CO.,LTD MS-7365 System architecture: i386 System release level: 5.11 snv_101b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: tank state: ONLINE scrub: scrub completed after 3h30m with 0 errors on Tue Jul 7 19:38:45 2009 config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c4d0 ONLINE 0 0 0 c5d0 ONLINE 0 0 0 c7d0 ONLINE 0 0 0 errors: No known data errors zfs create tank/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /tank/zfscachetest ... Done! zfs unmount tank/zfscachetest zfs mount tank/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m19.62s user 0m2.07s sys 0m30.18s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 5m4.59s user 0m1.86s sys 0m34.06s Feel free to clean up with ''zfs destroy tank/zfscachetest''. -- Regards, D?ri
Richard Elling
2009-Jul-14 21:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Ross wrote: > >> My guess is something like it''s single threaded, with each file dealt >> with in order and requests being serviced by just one or two disks at >> a time. With that being the case, an x4500 is essentially just >> running off 7200 rpm SATA drives, which really is nothing special. > > Keep in mind that there is supposed to be file level read-ahead. As > an example, ZFS is able to read from my array at up to 551 MB/second > when reading from a huge (64GB) file yet it is only managing > 145MB/second or so for these 8MB files sequentially accessed by cpio. > This suggests that even for the initial read case that zfs is not > applying enough file level read-ahead (or applying it soon enough) to > keep the disks busy. 8MB is still pretty big in the world of files. > Perhaps it takes zfs a long time to decide that read-ahead is required. > > I have yet to find a tunable for file level read-ahead. There are > tunables for vdev-level read-ahead but vdev read-ahead pretty minor > read-ahead and increasing it may cause more harm than help.That is because file prefetch is dynamic. benr wrote a good blog on the subject and includes a DTrace script to monitor DMU prefetches. http://www.cuddletech.com/blog/pivot/entry.php?id=1040 -- richard
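For those without access to benr's script, a minimal substitute sketch is a one-liner that simply counts DMU prefetch calls per interval; it assumes the fbt provider can see a dmu_zfetch() entry point under that name in the running kernel, which may not hold on every release:

    # dtrace -qn 'fbt::dmu_zfetch:entry { @ = count(); } tick-5s { printa("dmu_zfetch() calls in last 5s: %@d\n", @); trunc(@); }'

Run it in one window while the cpio passes run in another; a large drop in the call rate on the second pass would support the hypothesis that prefetch is being skipped.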
Bob Friesenhahn
2009-Jul-14 21:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Richard Elling wrote:> > That is because file prefetch is dynamic. benr wrote a good blog on the > subject and includes a DTrace script to monitor DMU prefetches. > http://www.cuddletech.com/blog/pivot/entry.php?id=1040Apparently not dynamic enough. The provided DTrace script has a syntax error when used for Solaris 10 U7. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jakov Sosic
2009-Jul-14 22:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi!

Do you think this issue will also be seen on ZVOLs that are exported as iSCSI targets?

--
This message posted from opensolaris.org
Jorgen Lundman
2009-Jul-15 00:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
3 servers contained within. Both x4500 and x4540 are setup the way Sun shipped to us. With minor changes (nfsservers=1024 etc). I was a little disappointed that they were identical in speed on round one, but the x4540 looked better part 2. Which I suspect is probably just OS version? x4500 Sol 10 100% idle, but with 3.86T existing data. 16GB memory, 4 core. x4500-03:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4500 System architecture: i386 System release level: 5.10 on10-public-x:s10idr_ldi:03/27/2009 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t0d0 ONLINE 0 0 0 c1t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c7t0d0 ONLINE 0 0 0 c8t0d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t1d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c8t1d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t2d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c8t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t6d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c8t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m1.58s user 0m1.92s sys 0m56.67s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m7.76s user 0m1.77s sys 1m6.82s Feel free to clean up with ''zfs destroy zpool1/zfscachetest''. x4540 Sol svn 117, 100% idle, completely empty, 32GB memory, 8 core. 
x4500-07:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4540 System architecture: i386 System release level: 5.11 snv_117 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c4t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c2t1d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 c4t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t1d0 ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c4t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t4d0 ONLINE 0 0 0 c4t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t5d0 ONLINE 0 0 0 c4t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 spares c3t6d0 AVAIL c4t6d0 AVAIL c5t6d0 AVAIL c6t6d0 AVAIL errors: No known data errors zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m5.51s user 0m1.70s sys 0m29.53s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 4m7.63s user 0m1.67s sys 0m26.66s Feel free to clean up with ''zfs destroy zpool1/zfscachetest''. Intel Atom: bash-3.2# ./zfs-cache-test.ksh zboot System Configuration: System architecture: i386 System release level: 5.11 snv_114 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zboot state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zboot ONLINE 0 0 0 c1d0s0 ONLINE 0 0 0 errors: No known data errors zfs create zboot/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zboot/zfscachetest ... Done! zfs unmount zboot/zfscachetest zfs mount zboot/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m27.87s user 0m6.51s sys 1m20.28s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m25.34s user 0m6.63s sys 1m32.04s Feel free to clean up with ''zfs destroy zboot/zfscachetest''. -- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Scott Lawson
2009-Jul-15 00:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I added a second Lun identical in size as a mirror and reran test. Results are more in line with yours now. ./zfs-cache-test.ksh test1 System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System architecture: sparc System release level: 5.10 Generic_139555-08 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: test1 state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Wed Jul 15 11:38:54 2009 config: NAME STATE READ WRITE CKSUM test1 ONLINE 0 0 0 mirror ONLINE 0 0 0 c3t600A0B80005622640000039B4A257E11d0 ONLINE 0 0 0 c3t600A0B8000336DE2000004394A258B93d0 ONLINE 0 0 0 errors: No known data errors zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m25.13s user 0m2.67s sys 0m28.40s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m53.05s user 0m2.69s sys 0m32.83s Feel free to clean up with ''zfs destroy test1/zfscachetest''. Scott Lawson wrote:> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool > called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /test1/zfscachetest ... > Done! > zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance issue >> for a week now. A 4X reduction in file read performance due to >> having read the file before is terrible, and of course the situation >> is considerably worse if the file was previously mmapped as well. >> Many of us have sent a lot of money to Sun and were not aware that >> ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple machines. >> For example, I reproduced it on my Blade 2500 (SPARC) which uses a >> simple mirrored rpool. 
On that system there is a 1.8X read slowdown >> from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see if >> they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> >> Implements a simple test. It requires a fair amount of disk space to >> run, but the main requirement is that the disk space consumed be more >> than available memory so that file data gets purged from the ARC. The >> script needs to run as root since it creates a filesystem and uses >> mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /rpool/zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this bug. >> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, >> http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- _________________________________________________________________________ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax : +64 09 968 7641 Mobile : +64 27 568 7611 mailto:scott at manukau.ac.nz http://www.manukau.ac.nz __________________________________________________________________________ perl -e ''print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'' __________________________________________________________________________
Bob Friesenhahn
2009-Jul-15 01:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote:> > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m1.58s > user 0m1.92s > sys 0m56.67s > > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m5.51s > user 0m1.70s > sys 0m29.53s >You have some mighty pools there. Something I find quite interesting is that those who have "mighty pools" generally obtain about the same data rate regardless of their relative degree of excessive "might". This causes me to believe that the Solaris kernel is throttling the read rate so that throwing more and faster hardware at the problem does not help. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jorgen Lundman
2009-Jul-15 02:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
> You have some mighty pools there. Something I find quite interesting is
> that those who have "mighty pools" generally obtain about the same data
> rate regardless of their relative degree of excessive "might". This
> causes me to believe that the Solaris kernel is throttling the read rate
> so that throwing more and faster hardware at the problem does not help.

Are you saying the X4500s we have are set up incorrectly, or in a way which will make them run poorly?

The servers came with no documentation nor advice. I have yet to find a good place that suggests configurations for dedicated x4500 NFS servers. We had to find out about NFSD_SERVERS when the first trouble came in (followed by 5 other tweaks and limits-reached troubles).

If Sun really wants to compete with NetApp, you'd think they would ship us hardware configured for NFS servers, not x4500s configured for desktops :( They are cheap though! Nothing like being the Wal-Mart of storage! That is how the pools were created as well. Admittedly it may be down to our vendor again.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
Bob Friesenhahn
2009-Jul-15 02:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Scott Lawson wrote:> > NAME STATE READ WRITE CKSUM > test1 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c3t600A0B80005622640000039B4A257E11d0 ONLINE 0 0 0 > c3t600A0B8000336DE2000004394A258B93d0 ONLINE 0 0 0 > > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m25.13s > user 0m2.67s > sys 0m28.40sIt is quite impressive that your little two disk mirror reads as fast as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible! Does this have something to do with your well-managed power and cooling? :-) Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-15 02:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote:> Hi Bob, > > My guess is something like it''s single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. > > A quick summary of some of the figures, with times normalized for 3000 files: > > Sun x2200, single 500GB sata: 6m25.15s > Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s > Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s > Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13sThis new one from Scott Lawson is incredible (but technically quite possible): SPARC Enterprise M3000, single SAS mirror pair: 3m25.13s Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-15 02:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote:>> You have some mighty pools there. Something I find quite interesting is >> that those who have "mighty pools" generally obtain about the same data >> rate regardless of their relative degree of excessive "might". This causes >> me to believe that the Solaris kernel is throttling the read rate so that >> throwing more and faster hardware at the problem does not help. > > Are you saying the X4500s we have are set up incorrectly, or done in a way > which will make them run poorly?No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) currently have a software-imposed read bottleneck which places a limit on how well systems will perform on this simple sequential read benchmark. After a certain point (which is unfortunately not very high), throwing more hardware at the problem does not result in any speed improvement. This is demonstrated by Scott Lawson''s little two disk mirror almost producing the same performance as our much more exotic setups. Evidence suggests that SPARC systems are doing better than x86. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Scott Lawson
2009-Jul-15 03:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Scott Lawson wrote:
>>
>>         NAME                                       STATE     READ WRITE CKSUM
>>         test1                                      ONLINE       0     0     0
>>           mirror                                   ONLINE       0     0     0
>>             c3t600A0B80005622640000039B4A257E11d0  ONLINE       0     0     0
>>             c3t600A0B8000336DE2000004394A258B93d0  ONLINE       0     0     0

Each of these LUNs is a pair of 146GB 15K drives in a RAID1, on Crystal firmware on a 6140. The LUNs are 2km apart in different data centres: one LUN where the server is, one remote.

Interestingly, by creating the mirror vdev the first run got faster, and the second much, much slower. The second cpio took an extra 2 minutes by virtue of it being a mirror. I ran the script once again prior to adding the mirror and the results were pretty much the same as the first run posted (plus or minus a couple of seconds, which is to be expected as these LUNs are on prod arrays feeding other servers as well).

I will try these tests on some of my J4500s when I get a chance shortly. My interest is now piqued.

>>
>> Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
>> 48000256 blocks
>>
>> real     3m25.13s
>> user     0m2.67s
>> sys      0m28.40s
>
> It is quite impressive that your little two disk mirror reads as fast
> as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible!
>
> Does this have something to do with your well-managed power and
> cooling? :-)

Maybe it is Bob, maybe it is. ;) haha.

>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Scott Lawson
2009-Jul-15 04:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
This system has 32 GB of RAM so I will probbaly need to increase the data set size. [root at xxxxx tmp]#> ./zfs-cache-test.ksh nbupool System Configuration: Sun Microsystems sun4v SPARC Enterprise T5220 System architecture: sparc System release level: 5.10 Generic_141414-02 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: nbupool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nbupool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 c2t9d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 c2t13d0 ONLINE 0 0 0 c2t14d0 ONLINE 0 0 0 c2t15d0 ONLINE 0 0 0 c2t16d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t17d0 ONLINE 0 0 0 c2t18d0 ONLINE 0 0 0 c2t19d0 ONLINE 0 0 0 c2t20d0 ONLINE 0 0 0 c2t21d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t22d0 ONLINE 0 0 0 c2t23d0 ONLINE 0 0 0 c2t24d0 ONLINE 0 0 0 c2t25d0 ONLINE 0 0 0 c2t26d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t27d0 ONLINE 0 0 0 c2t28d0 ONLINE 0 0 0 c2t29d0 ONLINE 0 0 0 c2t30d0 ONLINE 0 0 0 c2t31d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t32d0 ONLINE 0 0 0 c2t33d0 ONLINE 0 0 0 c2t34d0 ONLINE 0 0 0 c2t35d0 ONLINE 0 0 0 c2t36d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t37d0 ONLINE 0 0 0 c2t38d0 ONLINE 0 0 0 c2t39d0 ONLINE 0 0 0 c2t40d0 ONLINE 0 0 0 c2t41d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t42d0 ONLINE 0 0 0 c2t43d0 ONLINE 0 0 0 c2t44d0 ONLINE 0 0 0 c2t45d0 ONLINE 0 0 0 c2t46d0 ONLINE 0 0 0 spares c2t47d0 AVAIL c2t48d0 AVAIL c2t49d0 AVAIL errors: No known data errors zfs create nbupool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /nbupool/zfscachetest ... Done! zfs unmount nbupool/zfscachetest zfs mount nbupool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m37.24s user 0m9.87s sys 1m54.08s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 1m59.11s user 0m9.93s sys 1m49.15s Feel free to clean up with ''zfs destroy nbupool/zfscachetest''. Scott Lawson wrote:> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool > called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /test1/zfscachetest ... > Done! 
> zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance issue >> for a week now. A 4X reduction in file read performance due to >> having read the file before is terrible, and of course the situation >> is considerably worse if the file was previously mmapped as well. >> Many of us have sent a lot of money to Sun and were not aware that >> ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple machines. >> For example, I reproduced it on my Blade 2500 (SPARC) which uses a >> simple mirrored rpool. On that system there is a 1.8X read slowdown >> from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see if >> they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> >> Implements a simple test. It requires a fair amount of disk space to >> run, but the main requirement is that the disk space consumed be more >> than available memory so that file data gets purged from the ARC. The >> script needs to run as root since it creates a filesystem and uses >> mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /rpool/zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this bug. 
>> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, >> http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- _________________________________________________________________________ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax : +64 09 968 7641 Mobile : +64 27 568 7611 mailto:scott at manukau.ac.nz http://www.manukau.ac.nz __________________________________________________________________________ perl -e ''print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'' __________________________________________________________________________
Richard Elling
2009-Jul-15 05:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I think a picture is emerging that if you have enough RAM, the ARC is working very well. Which means that the ARC management is suspect. I propose the hypothesis that ARC misses are not prefetched. The first time through, prefetching works. For the second pass, ARC misses are not prefetched, so sequential reads go slower. For JBODs, the effect will be worse than for LUNs on a storage array with lots of cache. benr''s prefetch script will help shed light on this, but apparently doesn''t work for Solaris 10. Since the Solaris 10 source is not publicly available, someone with source access might need to adjust it to match the Solaris 10 source. -- richard
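One way to probe that hypothesis without source access is to sample the ARC prefetch counters between the two cpio passes; a minimal sketch using the same kstat statistics that arc_summary.pl reports (run it before and after each pass and diff the numbers):

    # kstat -p zfs:0:arcstats:prefetch_data_hits \
             zfs:0:arcstats:prefetch_data_misses \
             zfs:0:arcstats:prefetch_metadata_hits \
             zfs:0:arcstats:prefetch_metadata_misses

If the prefetch counters barely move during the second pass while the plain hit/miss counters climb, that would be consistent with prefetch not being applied to ARC misses.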
Joerg Schilling
2009-Jul-15 07:24 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Richard Elling <richard.elling at gmail.com> wrote:

> I think a picture is emerging that if you have enough RAM, the
> ARC is working very well. Which means that the ARC management
> is suspect.
>
> I propose the hypothesis that ARC misses are not prefetched. The
> first time through, prefetching works. For the second pass, ARC
> misses are not prefetched, so sequential reads go slower.

You may be right: it may be that the cache is never filled with the new, important data because it is already 100% full of unimportant data.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. On the second run, however, the ARC is already full of the data that we just read, and I'm guessing that the prefetch code is less aggressive when there is already data in the ARC. For normal use that may be what you want - it's trying to keep things in the ARC in case they are needed. However, it does mean that ZFS prefetch is always going to suffer performance degradation on a live system, although early signs are that this might not be so severe in snv_117.

I wonder if there is any tuning that can be done to counteract this? Is there any way to tell ZFS to bias towards prefetching rather than preserving data in the ARC? That may provide better performance for scripts like this, or for random access workloads.

Also, could there be any generic algorithm improvements that could help? Why should ZFS keep data in the ARC if it hasn't been used? This script uses 8MB files, but the ARC should be using at least 1GB of RAM. That's a minimum of 128 files in memory, none of which would have been read more than once. If we're reading a new file now, prefetching should be able to displace any old object in the ARC that hasn't been used - in this case all 127 previous files should be candidates for replacement.

I wonder how that would interact with an L2ARC. If that was fast enough I'd certainly want to allocate more of the ARC to prefetching.

Finally, would it make sense for the ARC to always allow a certain percentage for prefetching, possibly with that percentage being tunable, allowing us to balance the needs of the two systems according to the expected usage?

Ross
--
This message posted from opensolaris.org
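For anyone who wants to try the L2ARC angle raised above, attaching an SSD as a cache device is a one-liner on builds recent enough to support it; the pool and device names below are placeholders, not from any system in this thread:

    # zpool add Sun_2540 cache c5t0d0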
My D. Truong
2009-Jul-15 14:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
> It would be good to see results from a few > OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if > this is an > OpenSolaris issue as well.Bob, Here''s an example of an OpenSolaris machine, 2008.11 upgraded to the 117 devel release. X4540, 32GB RAM. The file count was bumped up to 9000 to be a little over double the RAM. root at deviant:~# ./zfs-cache-test.ksh gauss System Configuration: Sun Microsystems Sun Fire X4540 System architecture: i386 System release level: 5.11 snv_117 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: gauss state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM gauss ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c8t1d0 ONLINE 0 0 0 c9t1d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t2d0 ONLINE 0 0 0 c9t2d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 c9t3d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c8t4d0 ONLINE 0 0 0 c9t4d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 c9t5d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c8t6d0 ONLINE 0 0 0 c9t6d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 c9t7d0 ONLINE 0 0 0 errors: No known data errors zfs create gauss/zfscachetest Creating data file set (9000 files of 8192000 bytes) under /gauss/zfscachetest ... Done! zfs unmount gauss/zfscachetest zfs mount gauss/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 144000768 blocks real 9m15.87s user 0m5.16s sys 1m29.32s Doing second ''cpio -C 131072 -o > /dev/null'' 144000768 blocks real 28m57.54s user 0m5.47s sys 1m50.32s Feel free to clean up with ''zfs destroy gauss/zfscachetest''. -- This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-15 14:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Ross wrote:> Yes, that makes sense. For the first run, the pool has only just > been mounted, so the ARC will be empty, with plenty of space for > prefetching.I don''t think that this hypothesis is quite correct. If you use ''zpool iostat'' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read-ahead algorithm. You would not want to do read-ahead on data that you already have in a cache. Recent OpenSolaris seems to take a 2X performance hit rather than the 4X hit that Solaris 10 takes. This may be due to improvement of existing algorithm function performance (optimizations) rather than a related design improvement.> I wonder if there is any tuning that can be done to counteract this? > Is there any way to tell ZFS to bias towards prefetching rather than > preserving data in the ARC? That may provide better performance for > scripts like this, or for random access workloads.Recent zfs development focus has been on how to keep prefetch from damaging applications like database where prefetch causes more data to be read than is needed. Since OpenSolaris now apparently includes an option setting which blocks file data caching and prefetch, this seems to open the door for use of more aggressive prefetch in the normal mode. In summary, I agree with Richard Elling''s hypothesis (which is the same as my own). Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
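The "option setting which blocks file data caching" mentioned above is probably the primarycache dataset property available in recent OpenSolaris builds (not in Solaris 10 at this point). A hedged sketch, reusing the test dataset name from earlier posts purely as an example:

    # zfs get primarycache gauss/zfscachetest
    # zfs set primarycache=metadata gauss/zfscachetest

Setting primarycache=metadata keeps only metadata in the ARC for that dataset, which is the sort of knob a database workload might want; primarycache=all restores the default.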
Bob Friesenhahn
2009-Jul-15 15:08 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, My D. Truong wrote:> > Here''s an example of an OpenSolaris machine, 2008.11 upgraded to the > 117 devel release. X4540, 32GB RAM. The file count was bumped up > to 9000 to be a little over double the RAM.Your timings show a 3.1X hit so it appears that the OpenSolaris improvement is not as much as was assumed. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2009-Jul-15 16:47 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Wed, 15 Jul 2009, Ross wrote: > >> Yes, that makes sense. For the first run, the pool has only just >> been mounted, so the ARC will be empty, with plenty of space for >> prefetching. > > I don''t think that this hypothesis is quite correct. If you use > ''zpool iostat'' to monitor the read rate while reading a large > collection of files with total size far larger than the ARC, you will > see that there is no fall-off in read performance once the ARC becomes > full.Unfortunately, "zpool iostat" doesn''t really tell you anything about performance. All it shows is bandwidth. Latency is what you need to understand performance, so use iostat.> The performance problem occurs when there is still metadata cached for > a file but the file data has since been expunged from the cache. The > implication here is that zfs speculates that the file data will be in > the cache if the metadata is cached, and this results in a cache miss > as well as disabling the file read-ahead algorithm. You would not > want to do read-ahead on data that you already have in a cache.I realized this morning that what I posted last night might be misleading to the casual reader. Clearly the first time through the data is prefetched and misses the cache. On the second pass, it should also miss the cache, if it were a simple cache. But the ARC tries to be more clever and has ghosts -- where the data is no longer in cache, but the metadata is. I suspect the prefetching is not being used for the ghosts. The arcstats will show this. As benr blogs, "These Ghosts lists are magic. If you get a lot of hits to the ghost lists, it means that ARC is WAY too small and that you desperately need either more RAM or an L2 ARC device (likely, SSD). Please note, if you are considering investing in L2 ARC, check this FIRST." http://www.cuddletech.com/blog/pivot/entry.php?id=979 This is the explicit case presented by your test. This also explains why the entry from the system with an L2ARC did not have the performance "problem." Also, another test would be to have two large files. Read from one, then the other, then from the first again. Capture arcstats from between the reads and see if the haunting stops ;-) -- richard
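A quick way to see whether the second pass is hitting the ghost lists described above is to sample the corresponding arcstats counters before and after the second cpio run and compare:

    # kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits

A large jump in these counters during the second pass would mean the data was recently evicted but its ghost entries remain, which is exactly the case being described.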
Bob Friesenhahn
2009-Jul-15 16:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote:> > Unfortunately, "zpool iostat" doesn''t really tell you anything about > performance. All it shows is bandwidth. Latency is what you need > to understand performance, so use iostat.You are still thinking about this as if it was a hardware-related problem when it is clearly not. Iostat is useful for analyzing hardware-related problems in the case where the workload is too much for the hardware, or the hardware is non-responsive. Anyone who runs this crude benchmark will discover that iostat shows hardly any disk utilization at all, latencies are low, and read I/O rates are low enough that they could be satisfied by a portable USB drive. You can even observe the blinking lights on the front of the drive array and see that it is lightly loaded. This explains why a two disk mirror is almost able to keep up with a system with 40 fast SAS drives. This is the opposite situation from the zfs writes which periodically push the hardware to its limits. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2009-Jul-15 17:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Wed, 15 Jul 2009, Richard Elling wrote: >> >> Unfortunately, "zpool iostat" doesn''t really tell you anything about >> performance. All it shows is bandwidth. Latency is what you need >> to understand performance, so use iostat. > > You are still thinking about this as if it was a hardware-related > problem when it is clearly not. Iostat is useful for analyzing > hardware-related problems in the case where the workload is too much > for the hardware, or the hardware is non-responsive. Anyone who runs > this crude benchmark will discover that iostat shows hardly any disk > utilization at all, latencies are low, and read I/O rates are low > enough that they could be satisfied by a portable USB drive. You can > even observe the blinking lights on the front of the drive array and > see that it is lightly loaded. This explains why a two disk mirror is > almost able to keep up with a system with 40 fast SAS drives.heh. What you would be looking for is evidence of prefetching. If there is a lot of prefetching, the actv will tend to be high and latencies relatively low. If there is no prefetching, actv will be low and latencies may be higher. This also implies that if you use IDE disks, which cannot handle multiple outstanding I/Os, the performance will look similar for both runs. Or, you could get more sophisticated and use a dtrace script to look at the I/O behavior to determine the latency between contiguous I/O requests. Something like iopattern is a good start, though it doesn''t try to measure the time between requests, it would be easy to add. http://www.richardelling.com/Home/scripts-and-programs-1/iopattern -- richard
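For anyone repeating the test, the behaviour described above should be visible with plain iostat while cpio runs; watch the actv and asvc_t columns for the pool disks:

    # iostat -xnz 5

High actv with modest service times during the first pass, and near-zero actv during the second, would be consistent with prefetch being issued on the first pass and skipped on the second.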
Bob Friesenhahn
2009-Jul-15 17:49 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote:> > heh. What you would be looking for is evidence of prefetching. If > there is a lot of prefetching, the actv will tend to be high and > latencies relatively low. If there is no prefetching, actv will be > low and latencies may be higher. This also implies that if you use > IDE disks, which cannot handle multiple outstanding I/Os, the > performance will look similar for both runs.Ok, here are some stats for the "poor" (initial "USB" rates) and "terrible" (sub-"USB" rates) cases. "poor" (29% busy) iostat: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0 0.0 1.2 0.0 11.4 0.0 0.0 0.0 4.5 0 0 c1t1d0 91.2 0.0 11654.7 0.0 0.0 0.8 0.0 9.2 0 27 c4t600A0B80003A8A0B0000096147B451BEd0 95.0 0.0 12160.3 0.0 0.0 0.9 0.0 9.9 0 29 c4t600A0B800039C9B500000A9C47B4522Dd0 96.4 0.0 12333.1 0.0 0.0 0.9 0.0 9.5 0 29 c4t600A0B800039C9B500000AA047B4529Bd0 96.8 0.0 12377.9 0.0 0.0 0.9 0.0 9.5 0 30 c4t600A0B80003A8A0B0000096647B453CEd0 100.4 0.0 12845.1 0.0 0.0 1.0 0.0 9.5 0 29 c4t600A0B800039C9B500000AA447B4544Fd0 93.4 0.0 11949.1 0.0 0.0 0.8 0.0 9.0 0 28 c4t600A0B80003A8A0B0000096A47B4559Ed0 91.5 0.0 11705.9 0.0 0.0 0.9 0.0 9.7 0 28 c4t600A0B800039C9B500000AA847B45605d0 91.4 0.0 11680.3 0.0 0.0 0.9 0.0 10.1 0 29 c4t600A0B80003A8A0B0000096E47B456DAd0 88.9 0.0 11366.7 0.0 0.0 0.9 0.0 9.7 0 27 c4t600A0B800039C9B500000AAC47B45739d0 94.3 0.0 12045.5 0.0 0.0 0.9 0.0 9.9 0 29 c4t600A0B800039C9B500000AB047B457ADd0 96.5 0.0 12339.5 0.0 0.0 0.9 0.0 9.3 0 28 c4t600A0B80003A8A0B0000097347B457D4d0 87.9 0.0 11232.7 0.0 0.0 0.9 0.0 10.4 0 29 c4t600A0B800039C9B500000AB447B4595Fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c5t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t202400A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3t202500A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 freddy:vold(pid508) "terrible" (8% busy) iostat: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0 0.0 1.8 0.0 1.0 0.0 0.0 0.0 26.6 0 1 c1t1d0 26.8 0.0 3430.4 0.0 0.0 0.1 0.0 2.9 0 8 c4t600A0B80003A8A0B0000096147B451BEd0 21.0 0.0 2688.0 0.0 0.0 0.1 0.0 3.9 0 8 c4t600A0B800039C9B500000A9C47B4522Dd0 24.0 0.0 3059.6 0.0 0.0 0.1 0.0 3.4 0 8 c4t600A0B800039C9B500000AA047B4529Bd0 27.6 0.0 3532.8 0.0 0.0 0.1 0.0 3.2 0 9 c4t600A0B80003A8A0B0000096647B453CEd0 20.8 0.0 2662.4 0.0 0.0 0.1 0.0 3.1 0 6 c4t600A0B800039C9B500000AA447B4544Fd0 26.5 0.0 3392.0 0.0 0.0 0.1 0.0 2.6 0 7 c4t600A0B80003A8A0B0000096A47B4559Ed0 20.6 0.0 2636.8 0.0 0.0 0.1 0.0 3.0 0 6 c4t600A0B800039C9B500000AA847B45605d0 22.9 0.0 2931.2 0.0 0.0 0.1 0.0 3.8 0 9 c4t600A0B80003A8A0B0000096E47B456DAd0 21.4 0.0 2739.2 0.0 0.0 0.1 0.0 3.5 0 7 c4t600A0B800039C9B500000AAC47B45739d0 23.1 0.0 2944.4 0.0 0.0 0.1 0.0 3.7 0 9 c4t600A0B800039C9B500000AB047B457ADd0 24.9 0.0 3187.2 0.0 0.0 0.1 0.0 3.4 0 8 c4t600A0B80003A8A0B0000097347B457D4d0 28.3 0.0 3622.4 0.0 0.0 0.1 0.0 2.8 0 8 c4t600A0B800039C9B500000AB447B4595Fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c5t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t202400A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3t202500A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 freddy:vold(pid508) Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, 
http://www.GraphicsMagick.org/
Aaah, ok, I think I understand now. Thanks Richard. I''ll grab the updated test and have a look at the ARC ghost results when I get back to work tomorrow. -- This message posted from opensolaris.org
James Andrewartha
2009-Jul-16 14:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sun, 2009-07-12 at 16:38 -0500, Bob Friesenhahn wrote:> In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.kshHere''s the results from two machines, the first has 12x400MHz US-II CPUs, 11GB of RAM and the disks are 18GB 10krpm SCSI in a split D1000: System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise 4000/5000 System architecture: sparc System release level: 5.11 snv_101 CPU ISA list: sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: space state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using ''zpool clear'' or replace the device with ''zpool replace''. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h22m with 0 errors on Mon Jul 13 17:18:55 2009 config: NAME STATE READ WRITE CKSUM space ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c2t13d0 ONLINE 1 0 0 128K repaired errors: No known data errors zfs create space/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ... Done! zfs unmount space/zfscachetest zfs mount space/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 11m40.67s user 0m20.32s sys 5m27.16s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 31m29.42s user 0m19.31s sys 6m46.39s Feel free to clean up with ''zfs destroy space/zfscachetest''. The second has 2x1.2GHz US-III+, 4GB RAM and 10krpm FC disks on a single loop. System Configuration: Sun Microsystems sun4u Sun Fire 480R System architecture: sparc System release level: 5.11 snv_97 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: space state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM space ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t34d0 ONLINE 0 0 0 c1t48d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t35d0 ONLINE 0 0 0 c1t49d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t36d0 ONLINE 0 0 0 c1t51d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t33d0 ONLINE 0 0 0 c1t52d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t38d0 ONLINE 0 0 0 c1t53d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t39d0 ONLINE 0 0 0 c1t54d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t40d0 ONLINE 0 0 0 c1t55d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t41d0 ONLINE 0 0 0 c1t56d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t42d0 ONLINE 0 0 0 c1t57d0 ONLINE 0 0 0 logs ONLINE 0 0 0 c1t50d0 ONLINE 0 0 0 errors: No known data errors zfs create space/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ... Done! zfs unmount space/zfscachetest zfs mount space/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 5m45.66s user 0m5.63s sys 1m14.66s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 15m29.42s user 0m5.65s sys 1m37.83s Feel free to clean up with ''zfs destroy space/zfscachetest''. James Andrewartha
Bob Friesenhahn
2009-Jul-16 19:44 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have received email that Sun CR numbers 6861397 & 6859997 have been created to get this performance problem fixed. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Marion Hakanson
2009-Jul-21 01:45 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
bfriesen at simple.dallas.tx.us said:> No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) > currently have a software-imposed read bottleneck which places a limit on > how well systems will perform on this simple sequential read benchmark. > After a certain point (which is unfortunately not very high), throwing more > hardware at the problem does not result in any speed improvement. This is > demonstrated by Scott Lawson''s little two disk mirror almost producing the > same performance as our much more exotic setups.Apologies for reawakening this thread -- I was away last week. Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I''ve definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. I''m not suggesting that the bug you''re demonstrating is not real. It''s clear that subsequent runs on the same system show the degradation, and that points out a problem. Rather, I''m thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. Regards, Marion
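A rough sketch of the multi-reader variant suggested above, assuming the file set created by zfs-cache-test.ksh is still mounted (the path follows Bob's Sun_2540 example; adjust for your pool). Four cpio readers each take every fourth file:

    cd /Sun_2540/zfscachetest
    for i in 0 1 2 3; do
        ls | awk -v n=$i 'NR % 4 == n' | cpio -C 131072 -o > /dev/null &
    done
    wait

Wrapping the loop in time would give a figure roughly comparable to the single-reader runs, at the cost of turning the workload into four interleaved sequential streams.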
Bob Friesenhahn
2009-Jul-21 02:52 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 20 Jul 2009, Marion Hakanson wrote:
>
> Bob, have you tried changing your benchmark to be multithreaded? It
> occurs to me that maybe a single cpio invocation is another bottleneck.
> I've definitely experienced the case where a single bonnie++ process was
> not enough to max out the storage system.

It is likely that adding more cpios would cause more data to be read, but it would also thrash the disks with many more conflicting IOPS.

> I'm not suggesting that the bug you're demonstrating is not real. It's

It is definitely real. Sun has opened internal CR 6859997. It is now in Dispatched state at High priority.

> that points out a problem. Rather, I'm thinking that maybe the timing
> comparisons between low-end and high-end storage systems on this particular
> test are not revealing the whole story.

The similarity of performance between the low-end and high-end storage systems is a sign that the rotating rust is not a whole lot faster on the high-end storage systems. Since zfs is failing to use pre-fetch, only one (or maybe two) disks are accessed at a time. If more read I/Os are issued in parallel, then the data read rate will be vastly higher on the higher-end systems. With my 12 disk array and a large sequential read, zfs can issue 12 requests for 128K at once and since it can also queue pending I/Os, it can request many more than that. Care is required since over-reading will penalize the system. It is not an easy thing to get right.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Brent Jones
2009-Jul-21 04:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 20 Jul 2009, Marion Hakanson wrote:
>
> It is definitely real. Sun has opened internal CR 6859997. It is now in
> Dispatched state at High priority.
>

Is there a way we can get a Sun person on this list to supply a little bit more info on that CR? Seems there's a lot of people bitten by this, from low end to extremely high end hardware.

--
Brent Jones
brent at servuhome.net
Brad Diggs
2009-Jul-22 20:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS pre-fetching disabled altogether to see if the results are consistent between runs?

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail Bradley.Diggs at Sun.COM
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Ross wrote:
>
>> Yes, that makes sense. For the first run, the pool has only just
>> been mounted, so the ARC will be empty, with plenty of space for
>> prefetching.
>
> I don't think that this hypothesis is quite correct. If you use
> 'zpool iostat' to monitor the read rate while reading a large
> collection of files with total size far larger than the ARC, you
> will see that there is no fall-off in read performance once the ARC
> becomes full. The performance problem occurs when there is still
> metadata cached for a file but the file data has since been expunged
> from the cache. The implication here is that zfs speculates that
> the file data will be in the cache if the metadata is cached, and
> this results in a cache miss as well as disabling the file read-ahead
> algorithm. You would not want to do read-ahead on data that
> you already have in a cache.
>
> Recent OpenSolaris seems to take a 2X performance hit rather than
> the 4X hit that Solaris 10 takes. This may be due to improvement of
> existing algorithm function performance (optimizations) rather than
> a related design improvement.
>
>> I wonder if there is any tuning that can be done to counteract
>> this? Is there any way to tell ZFS to bias towards prefetching
>> rather than preserving data in the ARC? That may provide better
>> performance for scripts like this, or for random access workloads.
>
> Recent zfs development focus has been on how to keep prefetch from
> damaging applications like database where prefetch causes more data
> to be read than is needed. Since OpenSolaris now apparently
> includes an option setting which blocks file data caching and
> prefetch, this seems to open the door for use of more aggressive
> prefetch in the normal mode.
>
> In summary, I agree with Richard Elling's hypothesis (which is the
> same as my own).
>
> Bob
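As a side note on the measurement Bob describes in the quoted message, the read rate and prefetch behaviour can be watched while the test runs. The commands below are a hedged sketch: the pool name Sun_2540 is taken from the original post, and the prefetch counters are assumed to be present in the standard zfs arcstats kstat.

# Pool-wide and per-vdev read rates at 10-second intervals while a
# cpio pass runs (pool name from the original post):
zpool iostat -v Sun_2540 10

# ARC prefetch hit/miss counters, compared before and after a pass:
kstat -m zfs -n arcstats | grep -i prefetch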
Bob Friesenhahn
2009-Jul-22 23:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 22 Jul 2009, Roch wrote:
>
> HI Bob did you consider running the 2 runs with
>
>   echo zfs_prefetch_disable/W0t1 | mdb -kw
>
> and see if performance is constant between the 2 runs (and low).
> That would help clear the cause a bit. Sorry, I'd do it for
> you but since you have the setup etc...
>
> Revert with :
>
>   echo zfs_prefetch_disable/W0t0 | mdb -kw
>
> -r

I see that if I update my test script so that prefetch is disabled before the first cpio is executed, the read performance of the first cpio reported by 'zpool iostat' is similar to what has been normal for the second cpio case (i.e. 32MB/second). This seems to indicate that prefetch is entirely disabled if the file has ever been read before.

However, there is a new wrinkle in that the second cpio completes twice as fast with prefetch disabled even though 'zpool iostat' indicates the same consistent throughput. The difference goes away if I triple the number of files.

With 3000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    3m41.61s
user    0m0.44s
sys     0m8.12s

Doing second 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    1m50.12s
user    0m0.42s
sys     0m7.21s

Now if I increase the number of files to 9000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m51.47s
user    0m4.46s
sys     1m20.11s

Doing second 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m22.41s
user    0m4.40s
sys     1m14.22s

Notice that with 3X the files, the throughput is dramatically reduced and the time is the same for both cases.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
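For anyone repeating this experiment, the mdb writes quoted above only change the running kernel. A hedged sketch of inspecting the current value and of the boot-persistent form (the /etc/system setting is the one described in the ZFS Evil Tuning Guide referenced later in this thread):

# Read the current value of the tunable (0 = prefetch enabled):
echo zfs_prefetch_disable/D | mdb -k

# Boot-persistent equivalent in /etc/system (remove the line and reboot
# to restore the default):
set zfs:zfs_prefetch_disable = 1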
Rich Morris
2009-Jul-28 21:13 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
> Sun has opened internal CR 6859997. It is now in Dispatched state at High priority.

CR 6859997 has been accepted and is actively being worked on. The following info has been added to that CR:

This is a problem with the ZFS file prefetch code (zfetch) in dmu_zfetch.c. The test script provided by the submitter (thanks Bob!) does no file prefetching the second time through each file. This problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris.

This test script creates 3000 files each 8M long so the amount of data (24G) is greater than the amount of memory (16G on a Thumper). With the default blocksize of 128k, each of the 3000 files has 63 blocks. The first time through, zfetch ramps up a single prefetch stream normally. But the second time through, dmu_zfetch() calls dmu_zfetch_find() which thinks that the data has already been prefetched so no additional prefetching is started.

This problem is not seen with 500 files each 48M in length (still 24G of data). In that case there's still only one prefetch stream but it is reclaimed when one of the requested offsets is not found. The reason it is not found is that the stream "strided" the first time through after reaching the zfetch cap, which is 256 blocks. Files with no more than 256 blocks don't require a stride. So this problem will only be seen when the data from a file with no more than 256 blocks is accessed after being tossed from the ARC.

The fix for this problem may be more feedback between the ARC and the zfetch code. Or it may make sense to restart the prefetch stream after some time has passed or perhaps whenever there's a miss on a block that was expected to have already been prefetched?

On a Thumper running Nevada build 118, the first pass of this test takes 2 minutes 50 seconds and the second pass takes 5 minutes 22 seconds. If dmu_zfetch_find() is modified to restart the prefetch stream when the requested offset is 0 and more than 2 seconds has passed since the stream was last accessed then the time needed for the second pass is reduced to 2 minutes 24 seconds.

Additional investigation is currently taking place to determine if another solution makes more sense. And more testing will be needed to see what effect this change has on other prefetch patterns.

6412053 is a related CR which mentions that the zfetch code may not be issuing I/O at a sufficient pace. This behavior is also seen on a Thumper running the test script in CR 6859997 since, even when prefetch is ramping up as expected, less than half of the available I/O bandwidth is being used. Although more aggressive file prefetching could increase memory pressure as described in CRs 6258102 and 6469558.

-- Rich
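The block counts in Rich's analysis follow from simple arithmetic on the file sizes and the 128K recordsize. A quick illustration (these numbers are just the arithmetic, not output taken from the CR):

# 8192000-byte files: 62 full 128K blocks plus one partial block = 63,
# which is under the 256-block zfetch cap, so these streams never stride.
echo $((8192000 / 131072)) $((8192000 % 131072))   # prints: 62 65536

# A 48M file spans 384 128K blocks, over the 256-block cap, so that
# stream strides and is later reclaimed when an offset is not found.
echo $((48 * 1024 * 1024 / 131072))                # prints: 384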
Bob Friesenhahn
2009-Jul-28 22:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote:
>
> 6412053 is a related CR which mentions that the zfetch code may not be
> issuing I/O at a sufficient pace. This behavior is also seen on a Thumper
> running the test script in CR 6859997 since, even when prefetch is ramping up
> as expected, less than half of the available I/O bandwidth is being used.
> Although more aggressive file prefetching could increase memory pressure as
> described in CRs 6258102 and 6469558.

It is good to see this analysis. Certainly the optimum prefetching required for an Internet video streaming server (with maybe 300 kilobits/second per stream) is radically different than what is required for uncompressed 2K preview (8MB/frame) of motion picture frames (320 megabytes/second per stream) but zfs should be able to support both.

Besides real-time analysis based on current stream behavior and memory, it would be useful to maintain some recent history for the whole pool so that a pool which is usually used for 1000 slow-speed video streams behaves differently by default than one used for one or two high-speed video streams. With this bit of hint information, files belonging to a pool recently producing high-speed streams can be ramped up quickly while files belonging to a pool which has recently fed low-speed streams can be ramped up more conservatively (until proven otherwise) in order to not flood memory and starve the I/O needed by other streams.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-28 23:08 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote:
>
> The fix for this problem may be more feedback between the ARC and the zfetch
> code. Or it may make sense to restart the prefetch stream after some time
> has passed or perhaps whenever there's a miss on a block that was expected to
> have already been prefetched?

Regarding this approach of waiting for a prefetch miss, this seems like it would produce an uneven flow of data to the application and not ensure that data is always available when the application goes to read it. A stutter is likely to produce at least a 10ms gap (and possibly far greater) while the application is blocked in read() waiting for data. Since zfs blocks are large, stuttering becomes expensive, and if the application itself needs to read ahead 128K in order to avoid the stutter, then it consumes memory in an expensive non-sharable way. In the ideal case, zfs will always stay one 128K block ahead of the application's requirement and the unconsumed data will be cached in the ARC where it can be shared with other processes.

For an application with real-time data requirements, it is definitely desirable not to stutter at all if possible.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Rich Morris
2009-Sep-10 19:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 07/28/09 17:13, Rich Morris wrote:
> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>
>> Sun has opened internal CR 6859997. It is now in Dispatched state at
>> High priority.

CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9.

This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching.

-- Rich

> CR 6859997 has been accepted and is actively being worked on. The
> following info has been added to that CR:
>
> This is a problem with the ZFS file prefetch code (zfetch) in
> dmu_zfetch.c. The test script provided by the submitter (thanks Bob!)
> does no file prefetching the second time through each file. This
> problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris.
>
> This test script creates 3000 files each 8M long so the amount of data
> (24G) is greater than the amount of memory (16G on a Thumper). With
> the default blocksize of 128k, each of the 3000 files has 63 blocks.
> The first time through, zfetch ramps up a single prefetch stream
> normally. But the second time through, dmu_zfetch() calls
> dmu_zfetch_find() which thinks that the data has already been
> prefetched so no additional prefetching is started.
>
> This problem is not seen with 500 files each 48M in length (still 24G
> of data). In that case there's still only one prefetch stream but it
> is reclaimed when one of the requested offsets is not found. The
> reason it is not found is that stream "strided" the first time through
> after reaching the zfetch cap, which is 256 blocks. Files with no
> more than 256 blocks don't require a stride. So this problem will
> only be seen when the data from a file with no more than 256 blocks is
> accessed after being tossed from the ARC.
>
> The fix for this problem may be more feedback between the ARC and the
> zfetch code. Or it may make sense to restart the prefetch stream
> after some time has passed or perhaps whenever there's a miss on a
> block that was expected to have already been prefetched?
>
> On a Thumper running Nevada build 118, the first pass of this test
> takes 2 minutes 50 seconds and the second pass takes 5 minutes 22
> seconds. If dmu_zfetch_find() is modified to restart the refetch
> stream when the requested offset is 0 and more than 2 seconds has
> passed since the stream was last accessed then the time needed for the
> second pass is reduced to 2 minutes 24 seconds.
>
> Additional investigation is currently taking place to determine if
> another solution makes more sense. And more testing will be needed to
> see what affect this change has on other prefetch patterns.
>
> 6412053 is a related CR which mentions that the zfetch code may not be
> issuing I/O at a sufficient pace. This behavior is also seen on a
> Thumper running the test script in CR 6859997 since, even when
> prefetch is ramping up as expected, less than half of the available
> I/O bandwidth is being used. Although more aggressive file
> prefetching could increase memory pressure as described in CRs 6258102
> and 6469558.
>
> -- Rich
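Rich does not name the new kstats in this message. On builds that carry them, the file-prefetch counters would be expected to show up under the zfs kstat module; the kstat name zfetchstats below is an assumption based on later OpenSolaris builds, not something stated in this thread.

# Dump the ZFS file-prefetch counters once (kstat name is assumed):
kstat -m zfs -n zfetchstats

# Or sample them every 5 seconds while a test run is in progress:
kstat -m zfs -n zfetchstats 5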
Bob Friesenhahn
2009-Sep-10 20:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state at High
>>> priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be in
> Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this CR
> without slowing down other prefetch patterns. Some kstats have also been
> added to help improve the observability of ZFS file prefetching.

Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch?

Thanks,

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
eneal at businessgrade.com
2009-Sep-10 20:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:

> On Thu, 10 Sep 2009, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>> state at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also
>> be in Solaris 10 Update 9. This fix speeds up the sequential
>> prefetch pattern described in this CR without slowing down other
>> prefetch patterns. Some kstats have also been added to help
>> improve the observability of ZFS file prefetching.
>
> Excellent. What level of read improvement are you seeing? Is the
> prefetch rate improved, or does the fix simply avoid losing the
> prefetch?
>
> Thanks,
>
> Bob

Is this fixed in snv_122 or something else?
Henrik Johansson
2009-Sep-10 20:26 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hello Rich,

On Sep 10, 2009, at 9:12 PM, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>> at High priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be
> in Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this
> CR without slowing down other prefetch patterns. Some kstats have
> also been added to help improve the observability of ZFS file
> prefetching.

Nice work, do you know if it will be released as a patch for s10u8 or will it only be part of the update 9 KUP?

Regards

Henrik
http://sparcv9.blogspot.com
Rich Morris
2009-Sep-10 20:35 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:17, Bob Friesenhahn wrote:
> On Thu, 10 Sep 2009, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>>> at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also be
>> in Solaris 10 Update 9. This fix speeds up the sequential prefetch
>> pattern described in this CR without slowing down other prefetch
>> patterns. Some kstats have also been added to help improve the
>> observability of ZFS file prefetching.
>
> Excellent. What level of read improvement are you seeing? Is the
> prefetch rate improved, or does the fix simply avoid losing the prefetch?

This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS prefetch appears to work well for most prefetch patterns. But this CR found a pattern that should have worked well but did not.

-- Rich
Bob Friesenhahn
2009-Sep-10 21:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote:
>>
>> Excellent. What level of read improvement are you seeing? Is the prefetch
>> rate improved, or does the fix simply avoid losing the prefetch?
>
> This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS
> prefetch appears to work well for most prefetch patterns. But this CR found
> a pattern that should have worked well but did not.

It seems that after doing a fresh mount, the zfs prefetch is not quite enough to keep my hungry highly-tuned application sufficiently well fed. I will have to wait and see though.

In the mean time, I need to investigate why recent Solaris 10 kernel patches (141415-10) cause my Sun Ultra-40M2 system to panic five minutes into 'zpool scrub' with a fault being reported against the motherboard. Maybe a few more motherboard swaps will solve it (on 4th motherboard now). 141415-3 seems less likely to panic since it survives a full scrub (unless VirtualBox is running a Linux instance).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Rich Morris
2009-Sep-11 14:02 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:22, eneal at businessgrade.com wrote:
> Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>
>> On Thu, 10 Sep 2009, Rich Morris wrote:
>>
>>> On 07/28/09 17:13, Rich Morris wrote:
>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>
>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>> state at High priority.
>>>
>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>> be in Solaris 10 Update 9. This fix speeds up the sequential
>>> prefetch pattern described in this CR without slowing down other
>>> prefetch patterns. Some kstats have also been added to help
>>> improve the observability of ZFS file prefetching.
>>
>> Excellent. What level of read improvement are you seeing? Is the
>> prefetch rate improved, or does the fix simply avoid losing the
>> prefetch?
>>
>> Thanks,
>>
>> Bob
>
> Is this fixed in snv_122 or something else?

snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997
Christian Kendi
2009-Sep-13 15:40 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Is a diff for the source already available?

On Sep 11, 2009, at 4:02 PM, Rich Morris wrote:
> On 09/10/09 16:22, eneal at businessgrade.com wrote:
>> Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>
>>> On Thu, 10 Sep 2009, Rich Morris wrote:
>>>
>>>> On 07/28/09 17:13, Rich Morris wrote:
>>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>>
>>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>>> state at High priority.
>>>>
>>>> CR 6859997 has recently been fixed in Nevada. This fix will
>>>> also be in Solaris 10 Update 9. This fix speeds up the
>>>> sequential prefetch pattern described in this CR without slowing
>>>> down other prefetch patterns. Some kstats have also been added
>>>> to help improve the observability of ZFS file prefetching.
>>>
>>> Excellent. What level of read improvement are you seeing? Is the
>>> prefetch rate improved, or does the fix simply avoid losing the
>>> prefetch?
>>>
>>> Thanks,
>>>
>>> Bob
>>
>> Is this fixed in snv_122 or something else?
>
> snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997
Dale Ghent
2009-Sep-15 20:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>> at High priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be
> in Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this
> CR without slowing down other prefetch patterns. Some kstats have
> also been added to help improve the observability of ZFS file
> prefetching.

Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1.

Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs.

Am I the only one that thinks this is way too conservative? It's just maddening to know that a highly beneficial fix is out there, but its release is based on time rather than need. Sustaining really needs to be more proactive when it comes to this stuff.

/dale
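For reference, the tunable Dale mentions is applied like any other ZFS tunable; the sketch below is hedged (the live-change syntax mirrors the mdb commands shown earlier in the thread, and the claim that 1 effectively disables the device-level read-ahead is my understanding, not confirmed here):

# Live change, using the same mdb mechanism shown earlier for
# zfs_prefetch_disable:
echo zfs_vdev_cache_max/W0t1 | mdb -kw

# Boot-persistent form in /etc/system:
set zfs:zfs_vdev_cache_max = 1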
Richard Elling
2009-Sep-15 21:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>> state at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also
>> be in Solaris 10 Update 9.
>> This fix speeds up the sequential prefetch pattern described in
>> this CR without slowing down other prefetch patterns. Some kstats
>> have also been added to help improve the observability of ZFS file
>> prefetching.
>
> Awesome that the fix exists. I've been having a hell of a time with
> device-level prefetch on my iscsi clients causing tons of ultimately
> useless IO and have resorted to setting zfs_vdev_cache_max=1.

This only affects metadata. Wouldn't it be better to disable prefetching for data?
 -- richard

> Question though... why is bug fix that can be a watershed for
> performance be held back for so long? s10u9 won't be available for
> at least 6 months from now, and with a huge environment, I try hard
> not to live off of IDRs.
>
> Am I the only one that thinks this is way too conservative? It's
> just maddening to know that a highly beneficial fix is out there,
> but its release is based on time rather than need. Sustaining really
> needs to be more proactive when it comes to this stuff.
>
> /dale
Dale Ghent
2009-Sep-15 21:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 5:21 PM, Richard Elling wrote:
>
> On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
>
>> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>>
>>> On 07/28/09 17:13, Rich Morris wrote:
>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>
>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>> state at High priority.
>>>
>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>> be in Solaris 10 Update 9.
>>> This fix speeds up the sequential prefetch pattern described in
>>> this CR without slowing down other prefetch patterns. Some kstats
>>> have also been added to help improve the observability of ZFS file
>>> prefetching.
>>
>> Awesome that the fix exists. I've been having a hell of a time with
>> device-level prefetch on my iscsi clients causing tons of
>> ultimately useless IO and have resorted to setting
>> zfs_vdev_cache_max=1.
>
> This only affects metadata. Wouldn't it be better to disable
> prefetching for data?

Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did provide relief.

Just a general description of my environment:

My setup consists of several s10uX iscsi clients which get LUNs from pairs of thumpers. Each thumper pair exports identical LUNs to each iscsi client, and the client in turn mirrors each LUN pair inside a local zpool. As more space is needed on a client, a new LUN is created on the pair of thumpers, exported to the iscsi client, which then picks it up and we add a new mirrored vdev to the client's existing zpool.

This is so we have data redundancy across chassis, so if one thumper were to fail or need patching, etc, the iscsi clients just see one side of their mirrors drop out.

The problem that we observed on the iscsi clients was that, when viewing things through 'zpool iostat -v', far more IO was being requested from the LUs than was being registered for the vdev those LUs were a member of.

Being that this was an iscsi setup with stock thumpers (no SSD ZIL, L2ARC) serving the LUs, this apparent overhead caused far more unnecessary disk IO on the thumpers, thus starving out IO for data that was actually needed.

The working set is lots of small-ish files, entirely random IO.

If zfs_vdev_cache_max only affects metadata prefetches, which parameter affects data prefetches? I have to admit that disabling device-level prefetching was a shot in the dark, but it did result in drastically reduced contention on the thumpers.

/dale

>> Question though... why is bug fix that can be a watershed for
>> performance be held back for so long? s10u9 won't be available for
>> at least 6 months from now, and with a huge environment, I try hard
>> not to live off of IDRs.
>>
>> Am I the only one that thinks this is way too conservative? It's
>> just maddening to know that a highly beneficial fix is out there,
>> but its release is based on time rather than need. Sustaining
>> really needs to be more proactive when it comes to this stuff.
>>
>> /dale
Richard Elling
2009-Sep-15 22:10 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Reference below...

On Sep 15, 2009, at 2:38 PM, Dale Ghent wrote:
> On Sep 15, 2009, at 5:21 PM, Richard Elling wrote:
>
>> On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
>>
>>> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>>>
>>>> On 07/28/09 17:13, Rich Morris wrote:
>>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>>
>>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>>> state at High priority.
>>>>
>>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>>> be in Solaris 10 Update 9.
>>>> This fix speeds up the sequential prefetch pattern described in
>>>> this CR without slowing down other prefetch patterns. Some
>>>> kstats have also been added to help improve the observability of
>>>> ZFS file prefetching.
>>>
>>> Awesome that the fix exists. I've been having a hell of a time
>>> with device-level prefetch on my iscsi clients causing tons of
>>> ultimately useless IO and have resorted to setting
>>> zfs_vdev_cache_max=1.
>>
>> This only affects metadata. Wouldn't it be better to disable
>> prefetching for data?
>
> Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did
> provide relief.
>
> Just a general description of my environment:
>
> My setup consists of several s10uX iscsi clients which get LUNs from
> a pairs of thumpers. Each thumper pair exports identical LUNs to
> each iscsi client, and the client in turn mirrors each LUN pair
> inside a local zpool. As more space is needed on a client, a new LUN
> is created on the pair of thumpers, exported to the iscsi client,
> which then picks it up and we add a new mirrored vdev to the
> client's existing zpool.
>
> This is so we have data redundancy across chassis, so if one thumper
> were to fail or need patching, etc, the iscsi clients just see one
> side of their mirrors drop out.
>
> The problem that we observed on the iscsi clients was that, when
> viewing things through 'zpool iostat -v', far more IO was being
> requested from the LUs than was being registered for the vdev those
> LUs were a member of.
>
> Being that this was an iscsi setup with stock thumpers (no SSD ZIL,
> L2ARC) serving the LUs, this apparent overhead caused far more
> unnecessary disk IO on the thumpers, thus starving out IO for data
> that was actually needed.
>
> The working set is lots of small-ish files, entirely random IO.
>
> If zfs_vdev_cache_max only affects metadata prefetches, which
> parameter affects data prefetches?

There are two main areas for prefetch: at the transactional object layer (DMU) and the pooled storage level (VDEV). zfs_vdev_cache_max works at the VDEV level, obviously. The DMU knows more about the context of the data and is where the intelligent prefetching algorithm works.

You can easily observe the VDEV cache statistics with kstat:

# kstat -n vdev_cache_stats
module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          38.83342625
        delegations                     14030
        hits                            105169
        misses                          59452
        snaptime                        4564628.18130739

This represents a 59% cache hit rate, which is pretty decent. But you will notice far fewer delegations+hits+misses than real IOPS because it is only caching metadata.

Unfortunately, there is not a kstat for showing the DMU cache stats. But a DTrace script can be written or, even easier, lockstat will show if you are spending much time in the zfetch_* functions.

More details are in the Evil Tuning Guide, including how to set zfs_prefetch_disable:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

> I have to admit that disabling device-level prefetching was a shot
> in the dark, but it did result in drastically reduced contention on
> the thumpers.

That is a little bit surprising. I would expect little metadata activity for iscsi service. It would not be surprising for older Solaris 10 releases, though. It was fixed in NV b70, circa July 2007.
 -- richard
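Richard's lockstat suggestion can be tried with the kernel profiling mode; the sketch below is one common invocation (the sampling window and the -D cutoff are arbitrary choices, not taken from this thread):

# Profile the kernel for 30 seconds and report the top 20 entries; time
# spent in zfetch_* functions shows up in the caller column.
lockstat -kIW -D 20 sleep 30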
Bob Friesenhahn
2009-Sep-15 22:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote:
>
> Question though... why is bug fix that can be a watershed for
> performance be held back for so long? s10u9 won't be available for
> at least 6 months from now, and with a huge environment, I try hard
> not to live off of IDRs.

As someone who currently faces kernel panics with recent U7+ kernel patches (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take the time to make sure that the implementation is as good as it can be and is thoroughly tested before release.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Dale Ghent
2009-Sep-15 23:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 6:28 PM, Bob Friesenhahn wrote:
> On Tue, 15 Sep 2009, Dale Ghent wrote:
>>
>> Question though... why is bug fix that can be a watershed for
>> performance be held back for so long? s10u9 won't be available for
>> at least 6 months from now, and with a huge environment, I try hard
>> not to live off of IDRs.
>
> As someone who currently faces kernel panics with recent U7+ kernel
> patches (on AMD64 and SPARC) related to PCI bus upset, I expect that
> Sun will take the time to make sure that the implementation is as
> good as it can be and is thoroughly tested before release.

Are you referring to the same testing that gained you this PCI panic feature in s10u7?

Testing is a no-brainer, and I would expect that there already exists some level of assurance that a CR fix is correct at the point of putback. But I've dealt with many bugs both very recently and long in the past where a fix has existed in nevada for months, even a year, before I got bit by the same bug in s10 and then had to go through the support channels to A) convince whomever I'm talking to that, yes, I'm hitting this bug, B) yes, there is a fix, and then C) pretty please can I have an IDR.

Just this week I'm wrapping up testing of an IDR which addresses an e1000g hardware errata that was fixed in onnv earlier this year in February. For something that addresses a hardware issue on an Intel chipset used on shipping Sun servers, one would think that Sustaining would be on the ball and get that integrated ASAP.

But the current mode of operation appears to be "no CR, no backport", which leaves us customers needlessly running into bugs and then begging for their fixes... or hearing the dreaded "oh that fix will be available two updates from now." Not cool.

/dale
Bob Friesenhahn
2009-Sep-16 03:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote:
>>
>> As someone who currently faces kernel panics with recent U7+ kernel patches
>> (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take
>> the time to make sure that the implementation is as good as it can be and
>> is thoroughly tested before release.
>
> Are you referring the the same testing that gained you this PCI panic feature
> in s10u7?

No. The system worked with the kernel patch corresponding to baseline S10U7. Problems started with later kernel patches (which seem to be much less tested). Of course there could actually be a real hardware problem.

Regardless, when the integrity of our data is involved, I prefer to wait for more testing rather than to potentially have to recover the pool from backup.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/