Bob Friesenhahn
2009-Jul-04 04:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS performs
so terribly on my system. I blew a good bit of personal life savings on this
set-up but am not seeing performance anywhere near what is expected. Testing
with iozone shows that bulk I/O performance is good. Testing with Jeff
Bonwick's 'diskqual.sh' shows expected disk performance. The problem is that
actual observed application performance sucks, and could often be satisfied by
portable USB drives rather than high-end SAS drives. It could be satisfied by
just one SAS disk drive. Behavior is as if zfs is very slow to read data,
since disks are read at only 2 or 3 MB/second followed by an intermittent
write on a long cycle. Drive lights blink slowly. It is as if ZFS does no
successful sequential read-ahead on the files (see Prefetch Data hit rate of
0% and Prefetch Data cache miss of 60% below), or there is a semaphore
bottleneck somewhere (but CPU use is very low).

Observed behavior is very program dependent.

# zpool status Sun_2540
  pool: Sun_2540
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h46m with 0 errors on Mon Jun 29 05:06:33 2009
config:

        NAME                                       STATE     READ WRITE CKSUM
        Sun_2540                                   ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096A47B4559Ed0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA047B4529Bd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096E47B456DAd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA447B4544Fd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096147B451BEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA847B45605d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096647B453CEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AAC47B45739d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000097347B457D4d0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB047B457ADd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B800039C9B500000A9C47B4522Dd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB447B4595Fd0  ONLINE       0     0     0

errors: No known data errors

% ./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 130 MB/sec
c2t202400A0B83A8A0Bd31 13422 MB/sec
c3t202500A0B83A8A0Bd31 13422 MB/sec
c4t600A0B80003A8A0B0000096A47B4559Ed0 191 MB/sec
c4t600A0B80003A8A0B0000096E47B456DAd0 192 MB/sec
c4t600A0B80003A8A0B0000096147B451BEd0 192 MB/sec
c4t600A0B80003A8A0B0000096647B453CEd0 192 MB/sec
c4t600A0B80003A8A0B0000097347B457D4d0 212 MB/sec
c4t600A0B800039C9B500000A9C47B4522Dd0 191 MB/sec
c4t600A0B800039C9B500000AA047B4529Bd0 192 MB/sec
c4t600A0B800039C9B500000AA447B4544Fd0 192 MB/sec
c4t600A0B800039C9B500000AA847B45605d0 191 MB/sec
c4t600A0B800039C9B500000AAC47B45739d0 191 MB/sec
c4t600A0B800039C9B500000AB047B457ADd0 191 MB/sec
c4t600A0B800039C9B500000AB447B4595Fd0 191 MB/sec

% arc_summary.pl
System Memory:
        Physical RAM:  20470 MB
        Free Memory :  2371 MB
        LotsFree:      312 MB

ZFS Tunables (/etc/system):
        * set zfs:zfs_arc_max = 0x300000000
        set zfs:zfs_arc_max = 0x280000000
        * set zfs:zfs_arc_max = 0x200000000

ARC Size:
        Current Size:             9383 MB (arcsize)
        Target Size (Adaptive):   10240 MB (c)
        Min Size (Hard Limit):    1280 MB (zfs_arc_min)
        Max Size (Hard Limit):    10240 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    6%  644 MB (p)
        Most Frequently Used Cache Size: 93%  9595 MB (c-p)

ARC Efficency:
        Cache Access Total:        674638362
        Cache Hit Ratio:      91%  615586988  [Defined State for buffer]
        Cache Miss Ratio:      8%  59051374   [Undefined State for Buffer]
        REAL Hit Ratio:       87%  590314508  [MRU/MFU Hits Only]

        Data Demand   Efficiency:  96%
        Data Prefetch Efficiency:   7%

        CACHE HITS BY CACHE LIST:
          Anon:                        2%  13626529               [ New Customer, First Cache Hit ]
          Most Recently Used:         78%  480379752 (mru)        [ Return Customer ]
          Most Frequently Used:       17%  109934756 (mfu)        [ Frequent Customer ]
          Most Recently Used Ghost:    0%  5180256 (mru_ghost)    [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  1%  6465695 (mfu_ghost)    [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                78%  485431759
          Prefetch Data:               0%  3045442
          Demand Metadata:            16%  103900170
          Prefetch Metadata:           3%  23209617
        CACHE MISSES BY DATA TYPE:
          Demand Data:                30%  18109355
          Prefetch Data:              60%  35633374
          Demand Metadata:             6%  3806177
          Prefetch Metadata:           2%  1502468
---------------------------------------------

Prefetch seems to be performing badly. Ben Rockwood's blog entry at
http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch.
The sample DTrace script on that page only shows cache misses:

vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: MISS
vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: MISS

Unfortunately, the file-level prefetch DTrace sample script from the same page
seems to have a syntax error.

I tried disabling file-level prefetch (zfs_prefetch_disable=1) but did not
observe any change in behavior.

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   130.61298275
zfs:0:vdev_cache_stats:delegations      754287
zfs:0:vdev_cache_stats:hits     3973496
zfs:0:vdev_cache_stats:misses   2154959
zfs:0:vdev_cache_stats:snaptime 451955.55419545

Performance when copying 236 GB of files (each file is 5537792 bytes, with
20001 files per directory) from one directory to another:

Copy Method                             Data Rate
====================================    =================
cpio -pdum                              75 MB/s
cp -r                                   32 MB/s
tar -cf - . | (cd dest && tar -xf -)    26 MB/s

I would expect data copy rates approaching 200 MB/s.

I have not seen a peep from a zfs developer on this list for a month or two.
It would be useful if they would turn up to explain possible causes for this
level of performance. If I am encountering this problem, then it is likely
that many others are as well.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
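A minimal way to watch the raw prefetch counters directly while a copy runs
(a sketch, assuming this kernel exposes the usual zfs:0:arcstats kstat fields
that arc_summary.pl itself reads):

    # kstat -p zfs:0:arcstats | egrep 'prefetch_(data|metadata)_(hits|misses)'

Running this a couple of times during the slow reads and comparing the deltas
shows whether prefetch_data_misses is climbing much faster than
prefetch_data_hits, which would match the 0% hit / 60% miss figures reported
above.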
Bob Friesenhahn
2009-Jul-04 04:26 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s

It seems that the above should be amended. Running the cpio-based copy again
results in zpool iostat reporting a read bandwidth of only 33 MB/second. The
system seems to get slower and slower as it runs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Phil Harman
2009-Jul-04 07:48 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of
the Solaris page cache, but mmap() uses the latter. So if anyone maps a file,
ZFS has to keep the two caches in sync.

cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it copies
into the Solaris page cache. As long as they remain there, ZFS will be slow
for those files, even if you subsequently use read(2) to access them.

If you reboot, your cpio(1) tests will probably go fast again, until someone
uses mmap(2) on the files again. I think tar(1) uses read(2), but from my iPod
I can't be sure. It would be interesting to see how tar(1) performs if you run
that test before cp(1) on a freshly rebooted system.

I have done some work with the ZFS team towards a fix, but it is currently
only in OpenSolaris.

The other thing that slows you down is that ZFS only flushes to disk every 5
seconds if there are no synchronous writes. It would be interesting to see
iostat -xnz 1 while you are running your tests. You may find the disks are
writing very efficiently for one second in every five.

Hope this helps,
Phil

blogs.sun.com/pgdh

Sent from my iPod

On 4 Jul 2009, at 05:26, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>>
>> Copy Method                             Data Rate
>> ====================================    =================
>> cpio -pdum                              75 MB/s
>> cp -r                                   32 MB/s
>> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> It seems that the above should be amended. Running the cpio-based
> copy again results in zpool iostat reporting a read bandwidth of
> only 33 MB/second. The system seems to get slower and slower as it
> runs.
>
> Bob
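A quick way to confirm which of the copy tools above actually map their input,
rather than guessing, is a DTrace one-liner on the syscall provider (a rough
sketch; run as root while the copy is in progress, and adjust execname to the
tool under test):

    # dtrace -n 'syscall::mmap*:entry /execname == "cp"/ { @[execname, probefunc] = count(); }'
    # dtrace -n 'syscall::read:entry /execname == "tar"/ { @[execname, probefunc] = count(); }'

If the first aggregation stays empty for a given tool, that tool is not
bringing the files into the Solaris page cache via mmap(2).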
Mattias Pantzare
2009-Jul-04 08:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 4, 2009 at 06:03, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> I am still trying to determine why Solaris 10 (Generic_141415-03) ZFS
> performs so terribly on my system. [...]
>
> Performance when copying 236 GB of files (each file is 5537792 bytes, with
> 20001 files per directory) from one directory to another:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> I would expect data copy rates approaching 200 MB/s.

What happens if you run two copies at the same time (on different data)?

Your test is very bad at using striping, as the reads are done sequentially.
Prefetch can only help within a file, and your files are only about 5 MB.
Joerg Schilling
2009-Jul-04 10:27 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mattias Pantzare <pantzer at ludd.ltu.se> wrote:

> > Performance when copying 236 GB of files (each file is 5537792 bytes, with
> > 20001 files per directory) from one directory to another:
> >
> > Copy Method                             Data Rate
> > ====================================    =================
> > cpio -pdum                              75 MB/s
> > cp -r                                   32 MB/s
> > tar -cf - . | (cd dest && tar -xf -)    26 MB/s
> >
> > I would expect data copy rates approaching 200 MB/s.
>
> What happens if you run two copies at the same time (on different data)?

Before you do things like this, you should first start using tests that may
give you useful results. None of the programs above has been written for
decent performance. I know that "cp" on Solaris is a partial exception for
single-file copies, but that does not help us if we would like to compare
_apparent_ performance.

Let me first introduce other programs:

sdd     A dd(1) replacement that was first written in 1984 and that has
        included built-in speed metering since July 1988.

star    A tar(1) replacement that was first written in 1982 and that supports
        much better performance by using a shared-memory-based FIFO.

Note that most speed tests that are run on Linux do not give useful values, as
you don't know what's happening during the observation time.

If you would like to measure read performance, I recommend using a filesystem
that was mounted directly before the test, or using files that are big enough
not to fit into memory. Use e.g.:

        sdd if=file-name bs=64k -onull -time

If you would like to measure write performance, I recommend writing files that
are big enough to avoid wrong numbers as a result of caching. Use e.g.:

        sdd -inull bs=64k count=some-number of=file-name -time

Use an appropriate value for "some-number".

For copying files, I recommend using:

        star -copy bs=1m fs=128m -time -C from-dir . to-dir

It makes sense to run another test adding the option -no-fsync. On Solaris
with UFS, using -no-fsync speeds things up by approx. 10%. On Linux with a
local filesystem, using -no-fsync speeds things up by approx. 400%. This is
why you get uselessly high numbers from using GNU tar for copy tests on Linux.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jonathan Edwards
2009-Jul-04 12:50 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 12:03 AM, Bob Friesenhahn wrote:

> % ./diskqual.sh
> c1t0d0 130 MB/sec
> c1t1d0 130 MB/sec
> c2t202400A0B83A8A0Bd31 13422 MB/sec
> c3t202500A0B83A8A0Bd31 13422 MB/sec
> c4t600A0B80003A8A0B0000096A47B4559Ed0 191 MB/sec
> [...]
> c4t600A0B800039C9B500000AB447B4595Fd0 191 MB/sec

Somehow I don't think that reading the first 64MB off (presumably) a raw disk
device 3 times and picking the middle value is going to give you much useful
information on the overall state of the disks. I believe this was more of a
quick hack to just validate that there's nothing too far out of the norm.
With that said, what are the c2 and c3 devices above? You've got to be caching
the heck out of those to get that unbelievable 13 GB/s, so you're really only
seeing memory speeds there.

More useful information would be something more like the old taz or some of
the disk I/O latency tools while you're driving a workload.

> % arc_summary.pl
> [...]
>        CACHE HITS BY DATA TYPE:
>          Demand Data:                78%  485431759
>          Prefetch Data:               0%  3045442
>          Demand Metadata:            16%  103900170
>          Prefetch Metadata:           3%  23209617
>        CACHE MISSES BY DATA TYPE:
>          Demand Data:                30%  18109355
>          Prefetch Data:              60%  35633374
>          Demand Metadata:             6%  3806177
>          Prefetch Metadata:           2%  1502468
>
> Prefetch seems to be performing badly. Ben Rockwood's blog entry at
> http://www.cuddletech.com/blog/pivot/entry.php?id=1040 discusses prefetch.
> The sample DTrace script on that page only shows cache misses:
>
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774849536: MISS
> vdev_cache_read: 6507827833451031357 read 131072 bytes at offset 6774980608: MISS
>
> Unfortunately, the file-level prefetch DTrace sample script from the
> same page seems to have a syntax error.

If you're using LUNs off an array, this might be another case of
zfs_vdev_max_pending being tuned more for direct-attach drives. You could be
trying to queue up too much I/O against the RAID controller, particularly if
the RAID controller is also trying to prefetch out of its cache.

> I tried disabling file-level prefetch (zfs_prefetch_disable=1) but
> did not observe any change in behavior.

This is only going to help if you've got problems in zfetch. You'd probably
see this better by looking for high lock contention in zfetch with lockstat.

> # kstat -p zfs:0:vdev_cache_stats
> zfs:0:vdev_cache_stats:class    misc
> zfs:0:vdev_cache_stats:crtime   130.61298275
> zfs:0:vdev_cache_stats:delegations      754287
> zfs:0:vdev_cache_stats:hits     3973496
> zfs:0:vdev_cache_stats:misses   2154959
> zfs:0:vdev_cache_stats:snaptime 451955.55419545
>
> Performance when copying 236 GB of files (each file is 5537792 bytes,
> with 20001 files per directory) from one directory to another:
>
> Copy Method                             Data Rate
> ====================================    =================
> cpio -pdum                              75 MB/s
> cp -r                                   32 MB/s
> tar -cf - . | (cd dest && tar -xf -)    26 MB/s
>
> I would expect data copy rates approaching 200 MB/s.

You might want to dtrace this to break down where the latency is occurring,
e.g. is this a DNLC caching problem, an ARC problem, or a device-level
problem?

Also, is this really coming off a 2540? If so, you should probably investigate
the array throughput numbers and what's happening on the RAID controller. I
typically find it helpful to understand what the raw hardware is capable of
(hence tools like vdbench to drive an anticipated load before I configure
anything), and then attempt to configure the various tunables to match after
that.

For now you're pretty much just at the FS/VOP layers and playing with caching,
when the real culprit might be more on the vdev interface layer or below.

---
.je
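If lowering the per-LUN queue depth suggested above is worth trying, it can be
done live with mdb and made persistent in /etc/system (a sketch; the value 10
is purely an example, and whether it suits a 2540 can only be judged by
watching actv and svc_t in iostat while varying it):

    # echo zfs_vdev_max_pending/W0t10 | mdb -kw          # takes effect immediately
    # echo 'set zfs:zfs_vdev_max_pending = 10' >> /etc/system   # applies at next boot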
David Magda
2009-Jul-04 13:39 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 03:48, Phil Harman wrote:

> The other thing that slows you down is that ZFS only flushes to disk
> every 5 seconds if there are no synchronous writes. It would be
> interesting to see iostat -xnz 1 while you are running your tests.
> You may find the disks are writing very efficiently for one second
> in every five.

The value of 5 seconds has not been a hard stop since snv_87. Since snv_87
(and S10u6) the interval can stretch to as much as 30 seconds, although it
still aims for 5 seconds:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

See the 20-Mar-2008 change for txg.c for details.
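One way to see the actual transaction-group cadence on a given box, rather
than assuming 5 or 30 seconds, is to timestamp the sync entry point (a sketch
using the fbt provider; it assumes spa_sync is visible to fbt on this kernel):

    # dtrace -qn 'fbt::spa_sync:entry { printf("%Y  txg sync started\n", walltimestamp); }'

Correlating these timestamps with the write bursts seen in iostat -xnz 1 shows
whether the read stalls line up with txg sync.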
Joerg Schilling
2009-Jul-04 13:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
> copies into the Solaris page cache. As long as they remain there ZFS
> will be slow for those files, even if you subsequently use read(2) to
> access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until

Do you believe that a reboot is the only way to reset this?

> someone uses mmap(2) on the files again. I think tar(1) uses read(2),
> but from my iPod I can't be sure. It would be interesting to see how
> tar(1) performs if you run that test before cp(1) on a freshly
> rebooted system.

There are many tar implementations. The oldest is the UNIX tar implementation
from around 1978, the next was star from 1982, then there is GNU tar from
1987.

Star forks into two processes that are connected via shared memory in order to
speed things up. If you compare the copy speed of star and cp on UFS, and if
you tell star to be as unreliable as cp (by specifying the star option
-no-fsync), star will do the job about 30% faster than cp does, even though
star does not use mmap.

Copying with Sun's tar is a tick faster than using cp, and it is a bit more
accurate. GNU tar is not better than Sun's tar.

If you are looking for the best speed, use:

        star -copy -no-fsync -C from-dir . to-dir

and set e.g. bs=1m fs=128m.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-04 14:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote:
>
> Somehow I don't think that reading the first 64MB off (presumably) a raw
> disk device 3 times and picking the middle value is going to give you much
> useful information on the overall state of the disks. I believe this was
> more of a quick hack to just validate that there's nothing too far out of
> the norm. With that said, what are the c2 and c3 devices above? You've got
> to be caching the heck out of those to get that unbelievable 13 GB/s, so
> you're really only seeing memory speeds there.

Agreed. It is just a quick sanity check. I think that the c2 and c3 devices
are speedy USB drives.

> More useful information would be something more like the old taz or some of
> the disk I/O latency tools while you're driving a workload.

What I see from 'iostat -cx' is low latency (<= 4 ms) and a low workload while
the data is being read, and then (periodically) a burst of write data with
much higher latency (40-64 ms svc_t). The write burst does not take long, so
it is clear that reading is the bottleneck.

> If you're using LUNs off an array, this might be another case of
> zfs_vdev_max_pending being tuned more for direct-attach drives. You could be
> trying to queue up too much I/O against the RAID controller, particularly if
> the RAID controller is also trying to prefetch out of its cache.

I have played with zfs_vdev_max_pending before. It does dial down the latency
pretty linearly during the write phase (e.g. 35 queued I/Os results in 64 ms
svc_t).

> You might want to dtrace this to break down where the latency is occurring,
> e.g. is this a DNLC caching problem, an ARC problem, or a device-level
> problem?
>
> Also, is this really coming off a 2540? If so, you should probably
> investigate the array throughput numbers and what's happening on the RAID
> controller. I typically find it helpful to understand what the raw hardware
> is capable of (hence tools like vdbench to drive an anticipated load before
> I configure anything), and then attempt to configure the various tunables to
> match after that.

Yes, this comes off of a 2540. I used iozone for testing and see that, through
zfs, the hardware is able to write a 64GB file at 380 MB/s and read it at 551
MB/s. Unfortunately, this does not seem to translate well to the actual task.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 14:33 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead
> of the Solaris page cache. But mmap() uses the latter. So if anyone maps a
> file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
> copies into the Solaris page cache. As long as they remain there ZFS will be
> slow for those files, even if you subsequently use read(2) to access them.

This is very interesting information and certainly can explain a lot. My
application has a choice of using mmap or traditional I/O. I often use mmap.
From what you are saying, using mmap is poison to subsequent performance.

On June 29th I tested my application (which was set to use mmap) shortly after
a reboot and got this overall initial runtime:

        real    2:24:25.675
        user    4:38:57.837
        sys       14:30.823

By June 30th (with no intermediate reboot) the overall runtime had increased
to

        real    3:08:58.941
        user    4:38:38.192
        sys       15:44.197

which seems like quite a large change.

> If you reboot, your cpio(1) tests will probably go fast again, until someone
> uses mmap(2) on the files again. I think tar(1) uses read(2), but from my

I will test.

> The other thing that slows you down is that ZFS only flushes to disk every 5
> seconds if there are no synchronous writes. It would be interesting to see
> iostat -xnz 1 while you are running your tests. You may find the disks are
> writing very efficiently for one second in every five.

Actually, I found that the disks were writing flat out for five seconds at a
time, which stalled all other pool I/O (and dependent CPU) for at least three
seconds (see earlier discussion). So at the moment I have
zfs_write_limit_override set to 2684354560 so that the write cycle is more on
the order of one second in every five.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 15:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>
> If you reboot, your cpio(1) tests will probably go fast again, until someone
> uses mmap(2) on the files again. I think tar(1) uses read(2), but from my
> iPod I can't be sure. It would be interesting to see how tar(1) performs if
> you run that test before cp(1) on a freshly rebooted system.

Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows that the
cpio read rate has increased from (the most recently observed) 33 MB/second to
as much as 132 MB/second. To some this may not seem significant, but to me it
looks a whole lot different. ;-)

> I have done some work with the ZFS team towards a fix, but it is only
> currently in OpenSolaris.

Hopefully the fix is very, very good. It is difficult to displace the many
years of SunOS training that using mmap is the path to best performance. Mmap
provides many tools to improve application performance which are just not
available via traditional I/O.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gary Mills
2009-Jul-04 15:39 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:

> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.

That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP
with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I
understand that Solaris has an excellent implementation of mmap(2). ZFS has
many advantages, snapshots for example, for mailbox storage. Is there anything
that we can do to optimize the two caches in this environment? Will mmap(2)
one day play nicely with ZFS?

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2009-Jul-04 15:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
A tar pipeline still provides terrible file copy performance. Read bandwidth
is only 26 MB/s. So I stopped the tar copy and re-tried the cpio copy.

A second copy with cpio results in a read/write data rate of only 54.9 MB/s
(vs the just-experienced 132 MB/s). Performance is reduced by more than half.
Based on yesterday's experience, that may diminish to only 33 MB/s.

The amount of data being copied is much larger than any cache, yet somehow
reading a file a second time is less than half as fast. This brings me to the
absurd conclusion that the system must be rebooted immediately prior to each
use.

/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x280000000
set zfs:zfs_write_limit_override = 0xea600000
set zfs:zfs_vdev_max_pending = 5

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-04 17:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> A tar pipeline still provides terrible file copy performance. Read
> bandwidth is only 26 MB/s. So I stopped the tar copy and re-tried the
> cpio copy.
>
> A second copy with cpio results in a read/write data rate of only
> 54.9 MB/s (vs the just-experienced 132 MB/s). Performance is reduced
> by more than half. Based on yesterday's experience, that may diminish
> to only 33 MB/s.

"star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"

is nearly 40% faster than

"find . | cpio -pdum to-dir"

Did you try to use highly performant software like star?

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Phil Harman
2009-Jul-04 17:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Joerg Schilling wrote:
> Phil Harman <Phil.Harman at Sun.COM> wrote:
>
>> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
>> instead of the Solaris page cache. But mmap() uses the latter. So if
>> anyone maps a file, ZFS has to keep the two caches in sync.
>>
>> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files it
>> copies into the Solaris page cache. As long as they remain there ZFS
>> will be slow for those files, even if you subsequently use read(2) to
>> access them.
>>
>> If you reboot, your cpio(1) tests will probably go fast again, until
>
> Do you believe that a reboot is the only way to reset this?

No, but from my iPod I didn't have the patience to write a fuller
explanation :)

See ...

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_vnops.c#514

We take the long path if the vnode has any pages cached in the page cache. So
instead of a reboot, you should also be able to export/import the pool or
unmount/mount the filesystem. Also, if you didn't touch the file for a long
time, and had lots of other page cache churn, the file might eventually get
expunged from the page cache.

Phil
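For anyone who wants to try the non-reboot reset described above, a minimal
sketch (the pool name Sun_2540 is taken from this thread; a dataset name such
as Sun_2540/data is purely an example, and the pool must be idle while it is
exported):

    # zpool export Sun_2540 && zpool import Sun_2540          # whole pool
    # zfs unmount Sun_2540/data && zfs mount Sun_2540/data    # or just one filesystem

Either of these should discard the cached pages that force read(2) onto the
slow path, without the cost of a full reboot.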
Bob Friesenhahn
2009-Jul-04 18:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Joerg Schilling wrote:

>> by more than half. Based on yesterday's experience, that may diminish
>> to only 33 MB/s.
>
> "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
>
> is nearly 40% faster than
>
> "find . | cpio -pdum to-dir"
>
> Did you try to use highly performant software like star?

No, because I don't want to tarnish your software's stellar reputation. I am
focusing on Solaris 10 bugs today.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Phil Harman
2009-Jul-04 18:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> If you reboot, your cpio(1) tests will probably go fast again, until
>> someone uses mmap(2) on the files again. I think tar(1) uses read(2),
>> but from my iPod I can't be sure. It would be interesting to see how
>> tar(1) performs if you run that test before cp(1) on a freshly
>> rebooted system.
>
> Ok, I just rebooted the system. Now 'zpool iostat Sun_2540 60' shows
> that the cpio read rate has increased from (the most recently
> observed) 33 MB/second to as much as 132 MB/second. To some this may
> not seem significant, but to me it looks a whole lot different. ;-)

Thanks, that's really useful data. I wasn't near a machine at the time, so I
couldn't do it for myself. I answered your initial question based on what I
understood of the implementation, and it's very satisfying to have the data to
back it up.

>> I have done some work with the ZFS team towards a fix, but it is only
>> currently in OpenSolaris.
>
> Hopefully the fix is very, very good. It is difficult to displace the
> many years of SunOS training that using mmap is the path to best
> performance. Mmap provides many tools to improve application
> performance which are just not available via traditional I/O.

The part of the problem I highlighted was ...

6699438 zfs induces crosscall storm under heavy mapped sequential read

This has been fixed in OpenSolaris, and should be fixed in Solaris 10 update
8. However, this is only part of the problem. The fundamental issue is that
ZFS has its own ARC apart from the Solaris page cache, so whenever mmap() is
used, all I/O to that file has to make sure that the two caches are in sync.
Hence, a read(2) on a file which has at some time been mapped will be
impacted, even if the file is no longer mapped.

I'm sure the data and interest from this thread will be useful to the ZFS team
in prioritising further performance enhancements. So thanks again. And if
there's any more useful data you can add, please do so. If you have a support
contract, you might also consider logging a call and even raising an
escalation request.

Cheers,
Phil
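A cheap way to check whether the crosscall storm from bug 6699438 is in play
on a given box is to watch the cross-call column in mpstat while a mapped
sequential read is running (a sketch; xcal is the per-CPU cross-calls-per-
second column):

    # mpstat 5

A sustained, very high xcal count on many CPUs during the mapped read is the
signature of that bug; low xcal numbers point the finger elsewhere.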
Joerg Schilling
2009-Jul-04 18:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
>>> by more than half. Based on yesterday's experience, that may diminish
>>> to only 33 MB/s.
>>
>> "star -copy -no-fsync bs=8m fs=256m -C from-dir . to-dir"
>>
>> is nearly 40% faster than
>>
>> "find . | cpio -pdum to-dir"
>>
>> Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar
> reputation. I am focusing on Solaris 10 bugs today.

I've seen more professional replies. In the end it is your decision to ignore
helpful advice.

BTW: if star on ZFS were not faster than cpio, that would just be a hint of a
problem in ZFS that needs to be fixed.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jonathan Edwards
2009-Jul-04 18:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

> This brings me to the absurd conclusion that the system must be
> rebooted immediately prior to each use.

See Phil's later email: an export/import of the pool or a remount of the
filesystem should clear the page cache. With mmap'd files you're essentially
holding them both in the page cache and in the ARC, so invalidations in the
page cache are going to have effects on dirty data in the cache.

> /etc/system tunables are currently:
>
> set zfs:zfs_arc_max = 0x280000000
> set zfs:zfs_write_limit_override = 0xea600000
> set zfs:zfs_vdev_max_pending = 5

If you're on x86, I'd also increase maxphys to 128K. We still have a 56KB
default value in there, which is still a bad thing (IMO).

---
.je
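A sketch of what that change might look like (0x20000 is 131072 bytes, i.e.
128K; the mdb line simply prints the current value and assumes maxphys is a
32-bit variable on this kernel):

    # echo 'maxphys/D' | mdb -k                       # check the current value
    # echo 'set maxphys = 0x20000' >> /etc/system     # takes effect at next boot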
Phil Harman
2009-Jul-04 18:18 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Gary Mills wrote:
> On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
>
>> ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
>> instead of the Solaris page cache. But mmap() uses the latter. So if
>> anyone maps a file, ZFS has to keep the two caches in sync.
>
> That's the first I've heard of this issue. Our e-mail server runs
> Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2)
> extensively. I understand that Solaris has an excellent
> implementation of mmap(2). ZFS has many advantages, snapshots for
> example, for mailbox storage. Is there anything that we can do to
> optimize the two caches in this environment? Will mmap(2) one day
> play nicely with ZFS?

I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0) was
the first UNIX to get a working implementation of mmap(2) for files (if I
recall correctly, BSD 4.3 had a manpage but no implementation for files). From
that we got a whole lot of cool stuff, not least dynamic linking with ld.so
(which has made it just about everywhere).

The Solaris implementation of mmap(2) is functionally correct, but the wait
for a 64 bit address space rather moved the attention of performance tuning
elsewhere. I must admit I was surprised to see so much code out there that
still uses mmap(2) for general I/O (rather than just to support dynamic
linking).

Software engineering is always about prioritising resource. Nothing
prioritises performance tuning attention quite like compelling competitive
data. When Bart Smaalders and I wrote libMicro we generated a lot of very
compelling data. I also coined the phrase "If Linux is faster, it's a Solaris
bug". You will find quite a few (mostly fixed) bugs with the synopsis "linux
is faster than solaris at ...".

So, if mmap(2) playing nicely with ZFS is important to you, probably the best
thing you can do to help that along is to provide data that will help build
the business case for spending engineering resource on the issue.

Cheers,
Phil
Bob Friesenhahn
2009-Jul-04 18:25 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03) performance
bug on my Sun Ultra 40-M2 attached to a StorageTek 2540 with the latest
firmware. I rebooted the system, used cpio to send the input files to
/dev/null, and then immediately used cpio a second time to send the input
files to /dev/null. Note that the amount of file data (243 GB) is plenty
sufficient to purge any file data from the ARC (which has a cap of 10 GB).

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt     0.00s user   0.00s system  0% cpu 1.573 total
cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total

% time cat dpx-files.txt | cpio -o > /dev/null
495713288 blocks
cat dpx-files.txt     0.00s user   0.00s system  0% cpu 0.198 total
cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total

zpool iostat averaged over 60 seconds reported that the first run through the
files read the data at 251 MB/s and the second run only achieved 68 MB/s. It
seems clear that there is something really bad about Solaris 10 zfs's file
caching code which is causing it to go into the weeds.

I don't think that the results mean much, but I have attached output from
'hotkernel' taken while a subsequent cpio copy was in progress. It shows that
the kernel is mostly sleeping.

This is not a new problem. It seems that I have been banging my head against
this from the time I started using zfs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Sampling... Hit Ctrl-C to end.
FUNCTION                                                COUNT   PCNT
[... several hundred kernel functions, each sampled at 0.0%, omitted ...]
genunix`syscall_entry                                     149    0.1%
specfs`spec_write                                         163    0.1%
genunix`write                                             219    0.1%
unix`tsc_gethrtime_delta                                  278    0.1%
genunix`syscall_mstate                                    360    0.1%
unix`tsc_gethrtimeunscaled_delta                          395    0.1%
unix`sys_syscall32                                        477    0.2%
genunix`fsflush_do_pages                                  582    0.2%
unix`mutex_enter                                          709    0.2%
unix`kcopy                                               1580    0.5%
zfs`fletcher_2_native                                    1965    0.7%
unix`cpu_halt                                          276440   95.9%
Bob Friesenhahn
2009-Jul-04 18:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
> However, this is only part of the problem. The fundamental issue is that ZFS
> has its own ARC apart from the Solaris page cache, so whenever mmap() is
> used, all I/O to that file has to make sure that the two caches are in sync.
> Hence, a read(2) on a file which has at some time been mapped will be
> impacted, even if the file is no longer mapped.

However, it seems that memory mapping is not responsible for the problem I am
seeing here. Memory mapping may make the problem seem worse, but it is
clearly not the cause.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-04 18:46 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> wrote:
> I think Solaris (if you count SunOS 4.0, which was part of Solaris 1.0)
> was the first UNIX to get a working implementation of mmap(2) for files
> (if I recall correctly, BSD 4.3 had a manpage but no implementation for
> files). From that we got a whole lot of cool stuff, not least dynamic
> linking with ld.so (which has made it just about everywhere).

Well, on BSD you could mmap() devices, but as a result of the fact that there
was no useful address space management, you first had to malloc() the amount
of space, forcing you to have the same amount of memory available as swap.
Later, the device was mapped on top of the allocated memory, which made the
underlying swap space inaccessible. At Berthold AG we had to add expensive
amounts of swap at the time in order to be able to mmap the 256 MB of RAM
from our image processor.

> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather than
> just to support dynamic linking).

When the new memory management architecture was introduced with SunOS-4.0,
things became better, although the now unified and partially anonymous
address space made it hard to implement "limit memoryuse" (rlimit with
RLIMIT_RSS). I made a working implementation for SunOS-4.0, but this did not
make it into SunOS.

There are still related performance issues. If you e.g. store a CD/DVD/BluRay
image in /tmp that is bigger than the amount of RAM in the machine, you will
observe a buffer underrun while writing with cdrecord unless you use
driveropts=burnfree, because paging in is slow on tmpfs.

> Software engineering is always about prioritising resource. Nothing
> prioritises performance tuning attention quite like compelling
> competitive data. When Bart Smaalders and I wrote libMicro we generated
> a lot of very compelling data. I also coined the phrase "If Linux is
> faster, it's a Solaris bug". You will find quite a few (mostly fixed)
> bugs with the synopsis "linux is faster than solaris at ...".

Fortunately, Linux is slower with most tasks ;-)

In 1988, the effect of mmap() was much more visible than it is now. 20 years
ago, CPU speed limited copy operations, making pipes, copyout() and similar
slow. This changed with modern CPUs, and for this reason the demand for
using mmap() is lower than it was 20 years ago.

> So, if mmap(2) playing nicely with ZFS is important to you, probably the
> best thing you can do to help that along is to provide data that will
> help build the business case for spending engineering resource on the issue.

I would be interested to see an open(2) flag that tells the system that I
will read a file that I opened exactly once in native order. This could tell
the system to do read ahead and to later mark the pages as immediately
reusable. This would make star even faster than it is now.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
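For illustration only - this is not code from star - a minimal sketch of how
an application could express that "read once, in order" intent today with
posix_fadvise(3C), assuming the filesystem honours POSIX_FADV_SEQUENTIAL and
POSIX_FADV_DONTNEED (later posts in this thread suggest ZFS may simply ignore
these hints); the file name and chunk size are arbitrary:

/* Hedged sketch: read a file exactly once, front to back, while hinting
 * the kernel about the access pattern.  The hints are advisory only. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char    buf[128 * 1024];    /* arbitrary chunk size for the example */
    off_t   done = 0;
    ssize_t n;
    int     fd;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return (1);
    }

    /* We will read the whole file sequentially: ask for read-ahead. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof (buf))) > 0) {
        /* ... consume buf here ... */

        /* The range just read will not be needed again, so the cache
         * may drop it instead of evicting other, hotter data. */
        (void) posix_fadvise(fd, done, (off_t)n, POSIX_FADV_DONTNEED);
        done += n;
    }
    (void) close(fd);
    return (0);
}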
Phil Harman
2009-Jul-04 19:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>> However, this is only part of the problem. The fundamental issue is
>> that ZFS has its own ARC apart from the Solaris page cache, so
>> whenever mmap() is used, all I/O to that file has to make sure that
>> the two caches are in sync. Hence, a read(2) on a file which has
>> sometime been mapped, will be impacted, even if the file is no longer
>> mapped.
>
> However, it seems that memory mapping is not responsible for the
> problem I am seeing here. Memory mapping may make the problem seem
> worse, but it is clearly not the cause.

mmap(2) is what brings ZFS files into the page cache. I think you've shown us
that once you've copied files with cp(1) - which does use mmap(2) - anything
that uses read(2) on the same files is impacted.

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 19:41 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> However, it seems that memory mapping is not responsible for the problem I
>> am seeing here. Memory mapping may make the problem seem worse, but it is
>> clearly not the cause.
>
> mmap(2) is what brings ZFS files into the page cache. I think you've shown
> us that once you've copied files with cp(1) - which does use mmap(2) - that
> anything that uses read(2) on the same files is impacted.

The problem is observed with cpio, which does not use mmap. This is
immediately after a reboot or unmount/mount of the filesystem.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-04 19:49 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Jonathan Edwards wrote:> > this is only going to help if you''ve got problems in zfetch .. you''d probably > see this better by looking for high lock contention in zfetch with lockstatThis is what lockstat says when performance is poor: Adaptive mutex spin: 477 events in 30.019 seconds (16 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 47 10% 10% 0.00 5813 0xffffffff80256000 untimeout+0x24 46 10% 19% 0.00 2223 0xffffffffb0a2f200 taskq_thread+0xe3 38 8% 27% 0.00 2252 0xffffffffb0a2f200 cv_wait+0x70 29 6% 34% 0.00 1115 0xffffffff80256000 callout_execute+0xeb 26 5% 39% 0.00 3006 0xffffffffb0a2f200 taskq_dispatch+0x1b8 22 5% 44% 0.00 1200 0xffffffffa06158c0 post_syscall+0x206 18 4% 47% 0.00 3858 arc_eviction_mtx arc_do_user_evicts+0x76 16 3% 51% 0.00 1352 arc_eviction_mtx arc_buf_add_ref+0x2d 15 3% 54% 0.00 5376 0xffffffffb1adac28 taskq_thread+0xe3 11 2% 56% 0.00 2520 0xffffffffb1adac28 taskq_dispatch+0x1b8 9 2% 58% 0.00 2158 0xffffffffbb909e20 pollwakeup+0x116 9 2% 60% 0.00 2431 0xffffffffb1adac28 cv_wait+0x70 8 2% 62% 0.00 3912 0xffffffff80259000 untimeout+0x24 7 1% 63% 0.00 3679 0xffffffffb10dfbc0 polllock+0x3f 7 1% 65% 0.00 2171 0xffffffffb0a2f2d8 cv_wait+0x70 6 1% 66% 0.00 771 0xffffffffb3f23708 pcache_delete_fd+0xac 6 1% 67% 0.00 4679 0xffffffffb0a2f2d8 taskq_dispatch+0x1b8 5 1% 68% 0.00 500 0xffffffffbe555040 fifo_read+0xf8 5 1% 69% 0.00 15838 0xffffffff8025c000 untimeout+0x24 4 1% 70% 0.00 1213 0xffffffffac44b558 sd_initpkt_for_buf+0x110 4 1% 71% 0.00 638 0xffffffffa28722a0 polllock+0x3f 4 1% 72% 0.00 610 0xffffffff80259000 timeout_common+0x39 4 1% 73% 0.00 10691 0xffffffff80256000 timeout_common+0x39 3 1% 73% 0.00 1559 htable_mutex+0x78 htable_release+0x8a 3 1% 74% 0.00 3610 0xffffffffbb909e20 cv_timedwait_sig+0x1c1 3 1% 74% 0.00 1636 0xffffffffa240d410 ohci_allocate_periodic_in_resource+0x71 2 0% 75% 0.00 5959 0xffffffffbe555040 fifo_read+0x5c 2 0% 75% 0.00 3744 0xffffffffbe555040 polllock+0x3f 2 0% 76% 0.00 635 0xffffffffb3f23708 pollwakeup+0x116 2 0% 76% 0.00 709 0xffffffffb3f23708 cv_timedwait_sig+0x1c1 2 0% 77% 0.00 831 0xffffffffb3dd2070 pcache_insert+0x13d 2 0% 77% 0.00 5976 0xffffffffb3dd2070 pollwakeup+0x116 2 0% 77% 0.00 1339 0xffffffffb1eb9b80 metaslab_group_alloc+0x136 2 0% 78% 0.00 1514 0xffffffffb0a2f2d8 taskq_thread+0xe3 2 0% 78% 0.00 4042 0xffffffffb0a22988 vdev_queue_io_done+0xc3 2 0% 79% 0.00 3428 0xffffffffb0a21f08 vdev_queue_io_done+0xc3 2 0% 79% 0.00 1002 0xffffffffac44b558 sd_core_iostart+0x37 2 0% 79% 0.00 1387 0xffffffffa8c56d80 xbuf_iostart+0x7d 2 0% 80% 0.00 698 0xffffffffa58a3318 sd_return_command+0x11b 2 0% 80% 0.00 385 0xffffffffa58a3318 sd_start_cmds+0x115 2 0% 81% 0.00 562 0xffffffffa5647800 ssfcp_scsi_start+0x30 2 0% 81% 0.00 1620 0xffffffffa4162d58 ssfcp_scsi_init_pkt+0x1be 2 0% 82% 0.00 897 0xffffffffa4162d58 ssfcp_scsi_start+0x42 2 0% 82% 0.00 475 0xffffffffa4162b78 ssfcp_scsi_start+0x42 2 0% 82% 0.00 697 0xffffffffa40fb158 sd_start_cmds+0x115 2 0% 83% 0.00 10901 0xffffffffa28722a0 fifo_write+0x5b 2 0% 83% 0.00 4379 0xffffffffa28722a0 fifo_read+0xf8 2 0% 84% 0.00 1534 0xffffffffa2638390 emlxs_tx_get+0x38 2 0% 84% 0.00 1601 0xffffffffa2638350 emlxs_issue_iocb_cmd+0xc1 2 0% 84% 0.00 6697 0xffffffffa2503f08 vdev_queue_io_done+0x7b 2 0% 85% 0.00 4113 0xffffffffa24040b0 gcpu_ntv_mca_poll_wrapper+0x64 2 0% 85% 0.00 928 0xfffffe85dc140658 pollwakeup+0x116 1 0% 86% 0.00 404 iommulib_lock lookup_cache+0x2c 1 0% 86% 0.00 4867 pidlock thread_exit+0x6f 1 0% 86% 
0.00 1245 plocks+0x3c0 pollhead_delete+0x23 1 0% 86% 0.00 2452 plocks+0x3c0 pollhead_insert+0x35 1 0% 86% 0.00 882 htable_mutex+0x3c0 htable_lookup+0x83 1 0% 87% 0.00 28547 htable_mutex+0x3c0 htable_create+0xe3 1 0% 87% 0.00 21173 htable_mutex+0x3c0 htable_release+0x8a 1 0% 87% 0.00 1235 htable_mutex+0x370 htable_lookup+0x83 1 0% 87% 0.00 3212 htable_mutex+0x370 htable_release+0x8a 1 0% 87% 0.00 793 htable_mutex+0x78 htable_lookup+0x83 1 0% 88% 0.00 981 buf_hash_table+0x1210 arc_buf_add_ref+0x7c 1 0% 88% 0.00 1222 buf_hash_table+0x1c50 arc_buf_add_ref+0x7c 1 0% 88% 0.00 1585 buf_hash_table+0x2490 arc_buf_remove_ref+0x6d 1 0% 88% 0.00 1545158 ARC_mru+0x58 remove_reference+0x56 1 0% 88% 0.00 564 0xffffffffbcad4a00 strrput+0x19a 1 0% 89% 0.00 1033 0xffffffffbcad4a00 polllock+0x3f 1 0% 89% 0.00 587 0xffffffffbd328098 putnext+0x6c 1 0% 89% 0.00 11576 0xffffffffbd328098 strrput+0x19a 1 0% 89% 0.00 847 0xffffffffb3f23708 pcache_insert+0x13d 1 0% 90% 0.00 703 0xffffffffbb909e20 poll_common+0x258 1 0% 90% 0.00 1286 0xffffffffbcad4870 kstrgetmsg+0x79 1 0% 90% 0.00 1528 0xffffffffb1012e00 cv_wait+0x70 1 0% 90% 0.00 404 0xffffffffb1011de0 zio_notify_parent+0x37 1 0% 90% 0.00 764 0xffffffffb1011de0 zio_create+0x29f 1 0% 91% 0.00 5887 0xffffffffb0a2f3b0 cv_wait+0x70 1 0% 91% 0.00 883 0xffffffffb1ad3de0 metaslab_group_alloc+0x7e 1 0% 91% 0.00 555 0xffffffffb10dfbc0 fifo_write+0x5b 1 0% 91% 0.00 692 0xffffffffb3dd2070 pollrelock+0x36 1 0% 91% 0.00 4390 0xffffffffb0a22988 vdev_queue_io+0x6e 1 0% 92% 0.00 1449 0xffffffffb0a21f60 vdev_cache_write+0x64 1 0% 92% 0.00 859 0xffffffffb0a21a20 vdev_cache_write+0x64 1 0% 92% 0.00 446 0xffffffffb0a20ec8 vdev_queue_io+0x6e 1 0% 92% 0.00 1987 0xffffffffb0a21f08 vdev_queue_io+0x6e 1 0% 92% 0.00 5968 0xffffffffb0a21f08 vdev_queue_io_done+0x3a 1 0% 93% 0.00 280 0xffffffffac44b558 sd_start_cmds+0x115 1 0% 93% 0.00 527 0xffffffffac44b118 sd_core_iostart+0x37 1 0% 93% 0.00 380 0xffffffffac51aed8 sd_initpkt_for_buf+0x110 1 0% 93% 0.00 742 0xffffffffac51aed8 sdstrategy+0x53 1 0% 94% 0.00 696 0xffffffffac51aed8 sd_start_cmds+0x115 1 0% 94% 0.00 5398 0xffffffffb0a1c988 vdev_queue_io_done+0x3a 1 0% 94% 0.00 6102 0xffffffffb0a1c988 vdev_queue_io_done+0x7b 1 0% 94% 0.00 988 0xffffffffa40fbcd8 sd_return_command+0x11b 1 0% 94% 0.00 298 0xffffffffa4101460 ssfcp_scsi_init_pkt+0x3b4 1 0% 95% 0.00 302 0xffffffffa40fbcd8 sdstrategy+0x53 1 0% 95% 0.00 1436 0xffffffffa40fb158 sdintr+0x3a 1 0% 95% 0.00 764 0xffffffffa40fbcd8 sd_initpkt_for_buf+0x110 1 0% 95% 0.00 846 0xffffffffa40fbcd8 sd_start_cmds+0x115 1 0% 95% 0.00 1172 0xffffffffa5644f60 vdev_cache_write+0x64 1 0% 96% 0.00 8401 0xffffffffa5644f08 vdev_queue_io_done+0xc3 1 0% 96% 0.00 417 0xffffffffa5644f08 vdev_queue_io+0x6e 1 0% 96% 0.00 3419 0xffffffffa5644f08 vdev_queue_io_done+0x7b 1 0% 96% 0.00 1341 0xffffffffa58a3318 sd_core_iostart+0x37 1 0% 96% 0.00 431 0xffffffffa22cc840 fc_ulp_init_packet+0x31 1 0% 97% 0.00 569 0xffffffff807a4000 callout_execute+0xeb 1 0% 97% 0.00 695 0xffffffff8025c000 callout_execute+0xeb 1 0% 97% 0.00 500 0xffffffff80244000 callout_execute+0xeb 1 0% 97% 0.00 855 0xfffffe85dc140658 pcache_insert+0x13d 1 0% 97% 0.00 13339 0xfffffe85d67ddc48 cv_wait+0x70 1 0% 98% 0.00 5377 0xffffffff80253000 untimeout+0x24 1 0% 98% 0.00 5104 0xffffffff80253000 timeout_common+0x39 1 0% 98% 0.00 508 0xffffffff80253000 callout_execute+0xeb 1 0% 98% 0.00 260 0xffffffffa2638420 emlxs_register_pkt+0x30 1 0% 99% 0.00 1059 0xffffffffa2638390 emlxs_tx_put+0x79 1 0% 99% 0.00 411 0xffffffffa3e3a298 sdstrategy+0x53 1 0% 99% 0.00 336 
0xffffffffa3d6e380 fc_ulp_init_packet+0x31 1 0% 99% 0.00 926 0xffffffffa3cdfc58 sd_start_cmds+0x115 1 0% 99% 0.00 894 0xffffffffa3e3a298 sd_core_iostart+0x37 1 0% 100% 0.00 766 0xffffffffa3e3a298 sd_buf_iodone+0x23 1 0% 100% 0.00 340 0xffffffffa3e58420 emlxs_register_pkt+0x30 1 0% 100% 0.00 1516 0xffffffffa3e58420 emlxs_unregister_pkt+0x53 ------------------------------------------------------------------------------- Adaptive mutex block: 7 events in 30.019 seconds (0 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 1 14% 14% 0.00 39004 0xffffffffbe555040 fifo_read+0x5c 1 14% 29% 0.00 78624 0xffffffffb3dd2070 pollwakeup+0x116 1 14% 43% 0.00 6668 0xffffffffb0a2f200 taskq_dispatch+0x1b8 1 14% 57% 0.00 35694 0xffffffffb0a2f3b0 cv_wait+0x70 1 14% 71% 0.00 22697 0xffffffffb0a21f08 vdev_queue_io_done+0xc3 1 14% 86% 0.00 20174 0xffffffffb0a1c988 vdev_queue_io_done+0x3a 1 14% 100% 0.00 7365 0xfffffe85d67ddc48 cv_wait+0x70 ------------------------------------------------------------------------------- Spin lock spin: 478 events in 30.019 seconds (16 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 131 27% 27% 0.00 1383 0xffffffffa21a89f8 disp_lock_enter+0x1e 93 19% 47% 0.00 962 0xffffffffa250e9c0 disp_lock_enter+0x1e 84 18% 64% 0.00 3085 0xffffffffa21a8a28 disp_lock_enter+0x1e 73 15% 80% 0.00 2105 cpu0_disp disp_lock_enter+0x1e 33 7% 87% 0.00 3548 0xffffffffa21a8a28 disp_lock_enter_high+0x9 22 5% 91% 0.00 6011 cpu0_disp disp_lock_enter_high+0x9 21 4% 96% 0.00 2222 hres_lock hr_clock_lock+0x1d 9 2% 97% 0.00 3869 0xffffffffa21a89f8 disp_lock_enter_high+0x9 8 2% 99% 0.00 1649 0xffffffffa250e9c0 disp_lock_enter_high+0x9 4 1% 100% 0.00 624 cp_default disp_lock_enter+0x1e ------------------------------------------------------------------------------- Thread lock spin: 6 events in 30.019 seconds (0 events/sec) Count indv cuml rcnt nsec Lock Caller ------------------------------------------------------------------------------- 2 33% 33% 0.00 698 transition_lock ts_update_list+0x5c 2 33% 67% 0.00 779 cpu[3]+0xf8 cv_wait+0x3e 1 17% 83% 0.00 452 cpu[3]+0xf8 cv_timedwait_sig+0xe1 1 17% 100% 0.00 324 cpu[2]+0xf8 cv_timedwait_sig+0xe1 ------------------------------------------------------------------------------- -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
dick hoogendijk
2009-Jul-04 20:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009 13:03:52 -0500 (CDT)
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Sat, 4 Jul 2009, Joerg Schilling wrote:
>> Did you try to use highly performant software like star?
>
> No, because I don't want to tarnish your software's stellar
> reputation. I am focusing on Solaris 10 bugs today.

Blunt.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2009.06 release
+ All that's really worth doing is what we do for others (Lewis Carroll)
Phil Harman
2009-Jul-04 20:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Phil Harman wrote:
>>>
>>> However, it seems that memory mapping is not responsible for the
>>> problem I am seeing here. Memory mapping may make the problem seem
>>> worse, but it is clearly not the cause.
>>
>> mmap(2) is what brings ZFS files into the page cache. I think you've
>> shown us that once you've copied files with cp(1) - which does use
>> mmap(2) - that anything that uses read(2) on the same files is impacted.
>
> The problem is observed with cpio, which does not use mmap. This is
> immediately after a reboot or unmount/mount of the filesystem.

Sorry, I didn't get to your other post ...

> Ok, here is the scoop on the dire Solaris 10 (Generic_141415-03)
> performance bug on my Sun Ultra 40-M2 attached to a StorageTek 2540
> with latest firmware. I rebooted the system, used cpio to send the
> input files to /dev/null, and then immediately used cpio a second time
> to send the input files to /dev/null. Note that the amount of file
> data (243 GB) is plenty sufficient to purge any file data from the ARC
> (which has a cap of 10 GB).
>
> % time cat dpx-files.txt | cpio -o > /dev/null
> 495713288 blocks
> cat dpx-files.txt  0.00s user 0.00s system 0% cpu 1.573 total
> cpio -o > /dev/null  78.92s user 360.55s system 43% cpu 16:59.48 total
>
> % time cat dpx-files.txt | cpio -o > /dev/null
> 495713288 blocks
> cat dpx-files.txt  0.00s user 0.00s system 0% cpu 0.198 total
> cpio -o > /dev/null  79.92s user 358.75s system 11% cpu 1:01:05.88 total
>
> zpool iostat averaged over 60 seconds reported that the first run
> through the files read the data at 251 MB/s and the second run only
> achieved 68 MB/s. It seems clear that there is something really bad
> about Solaris 10 zfs's file caching code which is causing it to go
> into the weeds.
>
> I don't think that the results mean much, but I have attached output
> from 'hotkernel' while a subsequent cpio copy is taking place. It
> shows that the kernel is mostly sleeping.
>
> This is not a new problem. It seems that I have been banging my head
> against this from the time I started using zfs.

I'd like to see mpstat 1 for each case, on an otherwise idle system, but then
there's probably a whole lot of dtrace I'd like to do ... but I'm just off on
vacation for a week, and this will probably have to be my last post on this
thread until I'm back.

Cheers,
Phil
Bob Friesenhahn
2009-Jul-04 20:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, 4 Jul 2009, Phil Harman wrote:
>>
>> This is not a new problem. It seems that I have been banging my head
>> against this from the time I started using zfs.
>
> I'd like to see mpstat 1 for each case, on an otherwise idle system,
> but then there's probably a whole lot of dtrace I'd like to do ...
> but I'm just off on vacation for a week, and this will probably have
> to be my last post on this thread until I'm back.

Shame on you for taking well-earned vacation in my time of need. :-)

'mpstat 1' output when I/O is good:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  1700  247 2187   11  214   11    0  10270   2   5   0  93
  1    0   0    0  1478    5 2812   18  241   10    0  18424   2   4   0  94
  2    0   0    1  1210    0 2392   60  185   19    0 301927   5  28   0  67
  3    0   0    0  3242 2320 2028   60  181    9    0 222500   3  24   0  73
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  1862  244 2554    9  231    6    0   2880   2   3   0  95
  1    0   0    0  1158    1 2055   17  221    7    0   4479   1   3   0  96
  2    0   0    0  1037    0 2051   65  186   14    0 250211   4  24   0  73
  3    0   0    0  3037 2167 2101   62  186   11    0 251393   4  25   0  71

'mpstat 1' output when I/O is bad:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   859  243 1006    5  106    0    0  20733   2   3   0  95
  1    0   0    0   504   15  942   12   84    6    0  74009   3   6   0  91
  2    0   0    0   192    0  338    0   48    0    0     38   0   1   0  99
  3    0   0    0   549  376  522    1   36    0    0    135   0   2   0  98

Notice how intensely unbusy the CPU cores are when I/O is bad.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Boyd Adamson
2009-Jul-06 13:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Phil Harman <Phil.Harman at Sun.COM> writes:
> Gary Mills wrote:
> The Solaris implementation of mmap(2) is functionally correct, but the
> wait for a 64 bit address space rather moved the attention of
> performance tuning elsewhere. I must admit I was surprised to see so
> much code out there that still uses mmap(2) for general I/O (rather
> than just to support dynamic linking).

Probably this is encouraged by documentation like this:

> The memory mapping interface is described in Memory Management
> Interfaces. Mapping files is the most efficient form of file I/O for
> most applications run under the SunOS platform.

Found at:

http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view

Boyd.
Bob Friesenhahn
2009-Jul-06 14:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 6 Jul 2009, Boyd Adamson wrote:
>
> Probably this is encouraged by documentation like this:
>
>> The memory mapping interface is described in Memory Management
>> Interfaces. Mapping files is the most efficient form of file I/O for
>> most applications run under the SunOS platform.
>
> Found at:
>
> http://docs.sun.com/app/docs/doc/817-4415/fileio-2?l=en&a=view

People often think of the main benefit of mmap() as reducing CPU consumption
and buffer copies, but the mmap() family of programming interfaces is much
richer than low-level read/write, pread/pwrite, or stdio, because madvise()
gives the application a say in I/O scheduling and lets it flush stale data
from memory. In recent Solaris, it also includes provisions which allow
applications to improve their performance on NUMA systems.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
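As a rough, hypothetical sketch of the kind of hinting described above - not
a claim that ZFS acts on it, which is exactly what is in dispute in this
thread - a read-once scan over a mapped file might advise the kernel like
this (the checksum loop is just a stand-in for real work):

/* Hedged sketch: scan a mapped file once and advise the VM system.
 * madvise() is only a hint; nothing forces the filesystem to act on it. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat   st;
    unsigned long sum = 0;
    char          *p;
    off_t         i;
    int           fd;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) != 0) {
        perror(argv[1]);
        return (1);
    }
    if (st.st_size == 0)
        return (0);

    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    /* Hint: the mapping will be walked from start to finish exactly once. */
    (void) madvise(p, st.st_size, MADV_SEQUENTIAL);

    for (i = 0; i < st.st_size; i++)
        sum += (unsigned char)p[i];    /* stand-in for real work */

    /* Hint: the pages are no longer needed and may be reclaimed. */
    (void) madvise(p, st.st_size, MADV_DONTNEED);

    (void) munmap(p, st.st_size);
    (void) close(fd);
    (void) printf("sum %lu\n", sum);
    return (0);
}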
Gary Mills
2009-Jul-06 15:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 07:18:45PM +0100, Phil Harman wrote:
> Gary Mills wrote:
> >On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote:
> >
> >>ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC
> >>instead of the Solaris page cache. But mmap() uses the latter. So if
> >>anyone maps a file, ZFS has to keep the two caches in sync.
> >
> >That's the first I've heard of this issue. Our e-mail server runs
> >Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2)
> >extensively. I understand that Solaris has an excellent
> >implementation of mmap(2). ZFS has many advantages, snapshots for
> >example, for mailbox storage. Is there anything that we can do to
> >optimize the two caches in this environment? Will mmap(2) one day
> >play nicely with ZFS?
[..]
> Software engineering is always about prioritising resource. Nothing
> prioritises performance tuning attention quite like compelling
> competitive data. When Bart Smaalders and I wrote libMicro we generated
> a lot of very compelling data. I also coined the phrase "If Linux is
> faster, it's a Solaris bug". You will find quite a few (mostly fixed)
> bugs with the synopsis "linux is faster than solaris at ...".
>
> So, if mmap(2) playing nicely with ZFS is important to you, probably the
> best thing you can do to help that along is to provide data that will
> help build the business case for spending engineering resource on the issue.

First of all, how significant is the double caching in terms of performance?
If the effect is small, I won't worry about it anymore.

What sort of data do you need? Would a list of software products that
utilize mmap(2) extensively and could benefit from ZFS be suitable?

As for a business case, we just had an extended and catastrophic performance
degradation that was the result of two ZFS bugs. If we have another one like
that, our director is likely to instruct us to throw away all our Solaris
toys and convert to Microsoft products.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Andre van Eyssen
2009-Jul-06 15:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 6 Jul 2009, Gary Mills wrote:

> As for a business case, we just had an extended and catastrophic
> performance degradation that was the result of two ZFS bugs. If we
> have another one like that, our director is likely to instruct us to
> throw away all our Solaris toys and convert to Microsoft products.

If you change platform every time you get two bugs in a product, you must
cycle platforms on a pretty regular basis!

--
Andre van Eyssen.
mail: andre at purplecow.org           jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses   http://www2.purplecow.org
purplecow.org: PCOWpix               http://pix.purplecow.org
Bryan Allen
2009-Jul-06 15:44 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
+------------------------------------------------------------------------------
| On 2009-07-07 01:29:11, Andre van Eyssen wrote:
|
| On Mon, 6 Jul 2009, Gary Mills wrote:
|
| >As for a business case, we just had an extended and catastrophic
| >performance degradation that was the result of two ZFS bugs. If we
| >have another one like that, our director is likely to instruct us to
| >throw away all our Solaris toys and convert to Microsoft products.
|
| If you change platform every time you get two bugs in a product, you must
| cycle platforms on a pretty regular basis!

Given that policy, I don't imagine Windows will last very long anyway.
--
bda
cyberpunk is dead. long live cyberpunk.
Andrew Gabriel
2009-Jul-06 15:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Andre van Eyssen wrote:
> On Mon, 6 Jul 2009, Gary Mills wrote:
>
>> As for a business case, we just had an extended and catastrophic
>> performance degradation that was the result of two ZFS bugs. If we
>> have another one like that, our director is likely to instruct us to
>> throw away all our Solaris toys and convert to Microsoft products.
>
> If you change platform every time you get two bugs in a product, you
> must cycle platforms on a pretty regular basis!

You often find the change is towards Windows. That very rarely has the
same rules applied, so things then stick there.

--
Andrew
Sanjeev
2009-Jul-07 03:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob,

Catching up late on this thread. Would it be possible for you to collect the
following data:

- /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5
- /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5

Or, if you have access to the GUDs tool, please collect data using that.
We need to understand how ARC plays a role here.

Thanks and regards,
Sanjeev.

On Sat, Jul 04, 2009 at 02:49:05PM -0500, Bob Friesenhahn wrote:
> On Sat, 4 Jul 2009, Jonathan Edwards wrote:
>>
>> this is only going to help if you've got problems in zfetch .. you'd
>> probably see this better by looking for high lock contention in zfetch
>> with lockstat
>
> [Bob's full lockstat output, quoted in the original message, appears
> earlier in this thread and is trimmed here.]
--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
Lejun Zhu
2009-Jul-07 06:43 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
If the CPU seems to be idle, the tool latencytop can probably give you some
clue. It's developed for OpenSolaris, but Solaris 10 should work too (with
glib 2.14 installed). You can get a copy of v0.1 at
http://opensolaris.org/os/project/latencytop/

To use latencytop, open a terminal and start "latencytop -s -k 2". The tool
will show a window with activities that are being blocked in the system.
Then you can launch your application in another terminal to reproduce the
performance problem, switch back to the latencytop window, and use "<" and
">" to find your process. The list will tell you which function is causing
the delay.

After a couple of minutes you may press "q" to exit from latencytop. When it
ends, a log file /var/log/latencytop.log will be created. It includes the
stack traces of waits for I/O, semaphores etc. while latencytop was running.
If you post the log here, I can probably extract a list of the worst delays
in the ZFS source code, and other experts may comment.

--
This message posted from opensolaris.org
James Andrewartha
2009-Jul-07 09:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Joerg Schilling wrote:
> I would be interested to see an open(2) flag that tells the system that I
> will read a file that I opened exactly once in native order. This could
> tell the system to do read ahead and to later mark the pages as immediately
> reusable. This would make star even faster than it is now.

Are you aware of posix_fadvise(2) and madvise(2)?

--
James Andrewartha
Joerg Schilling
2009-Jul-07 14:05 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
James Andrewartha <jamesa at daa.com.au> wrote:
> Joerg Schilling wrote:
>> I would be interested to see an open(2) flag that tells the system that I
>> will read a file that I opened exactly once in native order. This could
>> tell the system to do read ahead and to later mark the pages as
>> immediately reusable. This would make star even faster than it is now.
>
> Are you aware of posix_fadvise(2) and madvise(2)?

I have of course been aware of madvise since December 1987, but this is an
interface that does not play nicely with a highly portable program like star.

posix_fadvise seems to be _very_ new for Solaris, and even though I am
frequently reading/writing the POSIX standards mailing list, I was not aware
of it. From my tests with star, I cannot see a significant performance
increase, but it may have a 3% effect.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Gary Mills
2009-Jul-07 15:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 06, 2009 at 04:54:16PM +0100, Andrew Gabriel wrote:
> Andre van Eyssen wrote:
> >On Mon, 6 Jul 2009, Gary Mills wrote:
> >
> >>As for a business case, we just had an extended and catastrophic
> >>performance degradation that was the result of two ZFS bugs. If we
> >>have another one like that, our director is likely to instruct us to
> >>throw away all our Solaris toys and convert to Microsoft products.
> >
> >If you change platform every time you get two bugs in a product, you
> >must cycle platforms on a pretty regular basis!
>
> You often find the change is towards Windows. That very rarely has the
> same rules applied, so things then stick there.

There's a more general principle in operation here. Organizations do
sometimes change platforms for peculiar reasons, but once they do that
they're not going to do it again for a long time. That's why they disregard
problems with the new platform.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2009-Jul-07 16:18 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Joerg Schilling wrote:
>
> posix_fadvise seems to be _very_ new for Solaris and even though I am
> frequently reading/writing the POSIX standards mailing list, I was not
> aware of it.
>
> From my tests with star, I cannot see a significant performance increase
> but it may have a 3% effect.

Based on the prior discussions of using mmap() with ZFS and the way ZFS
likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at all and
POSIX_FADV_DONTNEED probably does not work either. These are pretty
straightforward to implement with UFS since UFS benefits from the existing
working madvise() functionality.

ZFS seems to want to cache all read data in the ARC, period.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-07 16:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 7 Jul 2009, Joerg Schilling wrote:
>>
>> posix_fadvise seems to be _very_ new for Solaris and even though I am
>> frequently reading/writing the POSIX standards mailing list, I was not
>> aware of it.
>>
>> From my tests with star, I cannot see a significant performance increase
>> but it may have a 3% effect.
>
> Based on the prior discussions of using mmap() with ZFS and the way
> ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at
> all and POSIX_FADV_DONTNEED probably does not work either. These are
> pretty straightforward to implement with UFS since UFS benefits from
> the existing working madvise() functionality.

I did run my tests on UFS...

> ZFS seems to want to cache all read data in the ARC, period.

And this is definitely a conceptual mistake, as there are applications like
star that would like to benefit from read ahead but that don't want to trash
the caches.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-07 17:24 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Joerg Schilling wrote:
>>
>> Based on the prior discussions of using mmap() with ZFS and the way
>> ZFS likes to work, my guess is that POSIX_FADV_NOREUSE does nothing at
>> all and POSIX_FADV_DONTNEED probably does not work either. These are
>> pretty straightforward to implement with UFS since UFS benefits from
>> the existing working madvise() functionality.
>
> I did run my tests on UFS...

To clarify, you are not likely to see benefits until the system becomes
starved for memory resources, or there is contention from multiple processes
for read cache. Solaris UFS is very well tuned, so it is likely that a
single process won't see much benefit.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-07 21:56 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 7 Jul 2009, Sanjeev wrote:> Bob, > > Catching up late on this thread. > > Would it be possible for you to collect the following data : > - /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5 > - /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5 > - /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5Here is the output of those commands. The start of each command is prefixed with a ''+'': + /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 5 Adaptive mutex spin: 4 events in 5.023 seconds (1 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 96% 96% 0.00 803888 0xffffffffb0a2f3b0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 taskq_thread+0x14f thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 3% 99% 0.00 24784 0xfffffe85fc605af0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 zio_wait+0x53 dmu_buf_hold_array_by_dnode+0x108 dmu_buf_hold_array+0x81 dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 1% 100% 0.00 9699 pidlock[8] thread_exit+0x6f nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 proc_exit+0x927 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 100% 0.00 669 0xffffffff80253000 untimeout+0x24 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 cv_timedwait+0xb1 taskq_d_thread+0xc5 thread_start+0x8 ------------------------------------------------------------------------------- Adaptive mutex block: 2 events in 5.023 seconds (0 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 72% 72% 0.00 24546 0xffffffffb0a2f3b0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 taskq_thread+0x14f thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 28% 100% 0.00 9431 0xfffffe85fc605af0 cv_wait+0x70 nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 zio_wait+0x53 dmu_buf_hold_array_by_dnode+0x108 dmu_buf_hold_array+0x81 dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Spin lock spin: 223 events in 5.023 seconds (44 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 119 62% 62% 0.00 77453 cpu0_disp[48] disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 | 1 disp+0x7a 512 |@@@@@@@@@@@@ 49 swtch+0xa0 1024 |@@@@@@@@@@@@@ 54 cv_wait+0x68 2048 |@@ 8 taskq_thread+0x14f 4096 | 0 thread_start+0x8 8192 | 0 16384 | 0 32768 | 1 65536 | 0 131072 | 0 262144 | 0 524288 | 1 1048576 | 1 2097152 | 2 4194304 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest 
Caller 13 35% 97% 0.00 399130 0xffffffffa21a89f8 disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 |@@ 1 disp+0x7a 512 |@@@@@@ 3 swtch+0xa0 1024 |@@@@@@ 3 idle+0xdb 2048 | 0 thread_start+0x8 4096 | 0 8192 |@@ 1 16384 | 0 32768 |@@ 1 65536 | 0 131072 | 0 262144 | 0 524288 | 0 1048576 |@@@@ 2 2097152 |@@@@ 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 77 3% 100% 0.00 6402 0xffffffffa21a8a28 disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 256 | 1 disp+0x7a 512 |@@@@@@@@@@@@@@@ 40 swtch+0xa0 1024 |@@@@@@@@@@@@@ 35 cv_wait+0x68 2048 | 0 taskq_thread+0x14f 4096 | 0 thread_start+0x8 8192 | 0 16384 | 0 32768 | 0 65536 | 0 131072 | 0 262144 | 0 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 8 0% 100% 0.00 3096 hres_lock[4] hr_clock_lock+0x1d nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@ 3 gethrestime_lasttick+0x11 1024 |@@@@@@@@@@@@@@@ 4 timeout_common+0x31 2048 | 0 realtime_timeout+0x21 4096 | 0 cv_timedwait_sig+0xc5 8192 | 0 cv_waituntil_sig+0x113 16384 | 0 poll_common+0x3f4 32768 |@@@ 1 pollsys+0xbe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5 0% 100% 0.00 594 0xffffffffa250e9c0 disp_lock_enter_high+0x9 nsec ------ Time Distribution ------ count Stack 512 |@@@@@@ 1 setfrontdq+0xc7 1024 |@@@@@@@@@@@@@@@@@@@@@@@@ 4 ts_setrun+0x118 cv_unsleep+0x78 setrun_locked+0x7a setrun+0x19 callout_execute+0xdb softint+0x146 softlevel1+0x9 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 100% 0.00 1248 cp_default[320] disp_lock_enter+0x1e nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 disp_getbest+0x15 disp+0x48 swtch+0xa0 idle+0xdb thread_start+0x8 ------------------------------------------------------------------------------- Thread lock spin: 1 events in 5.023 seconds (0 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 100% 100% 0.00 423 cpu[1][1512] ts_tick+0x2a nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 clock_tick+0x48 clock_tick_process+0x14f clock_tick_execute_common+0x73 clock_tick_schedule+0x74 clock+0x2d0 cyclic_softint+0xba cbe_softclock+0x17 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- + /usr/sbin/lockstat -HcwP -n 50000 -D 20 -s 40 sleep 5 lockstat: warning: 45087 aggregation drops on CPU 0 lockstat: warning: 107548 aggregation drops on CPU 1 lockstat: warning: 15170 aggregation drops on CPU 2 lockstat: warning: 351494 aggregation drops on CPU 3 lockstat: warning: 12585 aggregation drops on CPU 0 lockstat: warning: 32687 aggregation drops on CPU 2 lockstat: warning: 72441 aggregation drops on CPU 3 lockstat: warning: 9933 aggregation drops on CPU 2 lockstat: warning: ran out of data records (use -n for more) Adaptive mutex hold: 1160267 events in 5.765 seconds (201263 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 295278 11% 11% 0.00 1082 0xfffffe85dd53e540 releasef+0x87 nsec ------ Time Distribution ------ 
count Stack 1024 |@@@@@@@@@@@ 115095 write+0x95 2048 |@@@@@@@@@@@@@@@@@@ 180125 write32+0xe 4096 | 29 sys_syscall32+0x101 8192 | 15 16384 | 1 32768 | 0 65536 | 0 131072 | 0 262144 | 6 524288 | 7 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 351 7% 18% 0.00 529819 0xfffffe85cce06290 poll_common+0x2a7 nsec ------ Time Distribution ------ count Stack 32768 | 2 pollsys+0xbe 65536 | 2 sys_syscall32+0x101 131072 | 1 262144 | 0 524288 |@@@@@@@@@@@@@@ 172 1048576 |@@@@@@@@@@@@@@ 172 2097152 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 17412 5% 22% 0.00 7540 0xffffffffa0616a40 brk_internal+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 12 brk+0x44 2048 | 106 sys_syscall+0x17b 4096 | 4 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16645 16384 | 162 32768 | 457 65536 | 11 131072 | 15 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 14724 4% 26% 0.00 7143 buf_hash_table[16400] arc_buf_remove_ref+0x8e nsec ------ Time Distribution ------ count Stack 2048 |@ 653 dbuf_rele+0x11d 4096 | 223 dmu_buf_rele_array+0x51 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13521 dmu_read_uio+0xbb 16384 | 301 zfs_read+0x15c 32768 | 15 zfs_shim_read+0xc 65536 | 0 fop_read+0x31 131072 | 0 read+0x188 262144 | 5 read32+0xe 524288 | 5 sys_syscall32+0x101 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 24009 3% 30% 0.00 4022 ph_mutex[65536] page_lookup_create+0x27c nsec ------ Time Distribution ------ count Stack 1024 | 20 page_lookup+0x11 2048 |@@@@@@@@@@@@@@ 11602 swap_getapage+0x6d 4096 | 23 swap_getpage+0x46 8192 |@@@@@@@@@@@@@@ 11577 fop_getpage+0x47 16384 | 780 anon_zero+0xa4 32768 | 0 segvn_faultpage+0x46b 65536 | 3 segvn_fault+0x9a6 131072 | 4 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 27696 3% 33% 0.00 3065 0xfffffe8769e3f298 dbuf_hold_impl+0x168 nsec ------ Time Distribution ------ count Stack 1024 |@@@ 3218 dbuf_hold+0x1b 2048 |@@@@@@@@@@@@@@@@ 15243 dnode_hold_impl+0x7e 4096 | 2 dnode_hold+0x14 8192 |@@@@@@@@@ 9205 dmu_buf_hold_array+0x3b 16384 | 19 dmu_read_uio+0x49 32768 | 0 zfs_read+0x15c 65536 | 0 zfs_shim_read+0xc 131072 | 0 fop_read+0x31 262144 | 7 read+0x188 524288 | 2 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 4333 2% 35% 0.00 13537 0xfffffe85f4277558 pcache_insert+0x164 nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 654 pcacheset_resolve+0x2d8 2048 |@@@@@@@@@@@@@@@@@@@@@ 3153 poll_common+0x565 4096 | 0 pollsys+0xbe 8192 | 37 sys_syscall32+0x101 16384 | 137 32768 | 89 65536 | 89 131072 | 1 262144 | 55 524288 | 117 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 47343 2% 37% 0.00 1216 htable_mutex[1024] htable_release+0x12a nsec ------ Time Distribution ------ count Stack 1024 |@@ 4231 hat_getpfnum+0x151 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 42799 rootnex_get_sgl+0x34f 4096 | 272 rootnex_coredma_bindhdl+0x12c 8192 | 3 rootnex_dma_bindhdl+0x1a 16384 | 1 ddi_dma_buf_bind_handle+0xb0 32768 | 0 ssfcp_prepare_pkt+0x1c7 65536 | 0 
ssfcp_scsi_init_pkt+0x3a0 131072 | 2 scsi_init_pkt+0x44 262144 | 0 vhci_bind_transport+0x124 524288 | 4 vhci_scsi_init_pkt+0xbf scsi_init_pkt+0x44 sd_setup_rw_pkt+0xe5 sd_initpkt_for_buf+0xa3 sd_start_cmds+0xa5 sd_core_iostart+0x87 sd_mapblockaddr_iostart+0x11a sd_xbuf_strategy+0x46 xbuf_iostart+0x75 ddi_xbuf_qstrategy+0x4a sdstrategy+0xbb bdev_strategy+0x54 ldi_strategy+0x4e vdev_disk_io_start+0x139 zio_vdev_io_start+0xba zio_execute+0x60 zio_nowait+0x9 vdev_mirror_io_start+0xa9 zio_vdev_io_start+0xba zio_execute+0x60 zio_nowait+0x9 vdev_mirror_io_start+0xa9 zio_vdev_io_start+0x147 zio_execute+0x60 zio_nowait+0x9 arc_read+0x487 dbuf_read_impl+0x1a0 dbuf_read+0x95 dmu_buf_hold_array_by_dnode+0x217 dmu_buf_hold_array+0x81 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 44245 2% 39% 0.00 1231 anon_array_lock[8192] anon_array_exit+0x2e nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 6971 segvn_faultpage+0x56c 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 37061 segvn_fault+0x9a6 4096 | 157 as_fault+0x205 8192 | 7 pagefault+0x8b 16384 | 6 trap+0x3d7 32768 | 40 cmntrap+0x140 65536 | 2 131072 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 113 2% 40% 0.00 385142 0xffffffffbca03610 poll_common+0x2a7 nsec ------ Time Distribution ------ count Stack 32768 |@@ 8 pollsys+0xbe 65536 | 0 sys_syscall32+0x101 131072 |@@@@@@@@@ 36 262144 |@@ 10 524288 |@@@@@@@ 29 1048576 |@@@@@@@ 30 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 31183 1% 42% 0.00 1226 anonhash_lock[512] anon_alloc+0x93 nsec ------ Time Distribution ------ count Stack 1024 |@@ 2843 anon_zero+0x65 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28280 segvn_faultpage+0x46b 4096 | 53 segvn_fault+0x9a6 8192 | 2 as_fault+0x205 16384 | 0 pagefault+0x8b 32768 | 1 trap+0x3d7 65536 | 1 cmntrap+0x140 131072 | 1 262144 | 1 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 30898 1% 43% 0.00 1140 pio_mutex[1024] page_io_unlock+0x44 nsec ------ Time Distribution ------ count Stack 1024 |@@@@ 4537 pvn_plist_init+0x9c 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 26317 swap_getapage+0x1aa 4096 | 34 swap_getpage+0x46 8192 | 1 fop_getpage+0x47 16384 | 0 anon_zero+0xa4 32768 | 0 segvn_faultpage+0x46b 65536 | 3 segvn_fault+0x9a6 131072 | 2 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 27390 1% 44% 0.00 1111 0xffffffffbe667a88 as_rangeunlock+0x26 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@ 7622 brk+0x50 2048 |@@@@@@@@@@@@@@@@@@@@@ 19718 sys_syscall+0x17b 4096 | 48 8192 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 26982 1% 45% 0.00 1076 ani_free_pool[8192] anon_alloc+0xc6 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@ 8793 anon_zero+0x65 2048 |@@@@@@@@@@@@@@@@@@@@ 18178 segvn_faultpage+0x46b 4096 | 9 segvn_fault+0x9a6 8192 | 1 as_fault+0x205 16384 | 0 pagefault+0x8b 32768 | 0 trap+0x3d7 65536 | 0 cmntrap+0x140 131072 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 16883 1% 46% 0.00 1699 dbuf_hash_table[2064] dbuf_find+0xdc nsec ------ 
Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16299 dbuf_hold_impl+0x42 4096 |@ 582 dbuf_hold+0x1b 8192 | 2 dnode_hold_impl+0x7e dnode_hold+0x14 dmu_buf_hold_array+0x3b dmu_read_uio+0x49 zfs_read+0x15c zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5 1% 47% 0.00 4756677 dtrace_lock[8] dtrace_ioctl+0xd7b nsec ------ Time Distribution ------ count Stack 8192 |@@@@@@ 1 cdev_ioctl+0x1d 16384 | 0 spec_ioctl+0x50 32768 |@@@@@@ 1 fop_ioctl+0x25 65536 | 0 ioctl+0xac 131072 | 0 sys_syscall+0x17b 262144 | 0 524288 | 0 1048576 | 0 2097152 | 0 4194304 | 0 8388608 |@@@@@@ 1 16777216 |@@@@@@@@@@@@ 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 19703 1% 48% 0.00 1195 0xffffffff807b22c0 kmem_cache_alloc+0x4d nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@ 4381 anon_alloc+0x21 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 15169 anon_zero+0x65 4096 | 12 segvn_faultpage+0x46b 8192 | 0 segvn_fault+0x9a6 16384 | 137 as_fault+0x205 32768 | 1 pagefault+0x8b 65536 | 3 trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18555 1% 48% 0.00 1108 0xffffffffb0e8e958 rrw_exit+0x69 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@ 5997 zfs_read+0x199 2048 |@@@@@@@@@@@@@@@@@@@@ 12554 zfs_shim_read+0xc 4096 | 3 fop_read+0x31 8192 | 1 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1579 1% 49% 0.00 12903 0xfffffffffb5db180 page_get_mnode_freelist+0x33d nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1561 page_get_freelist+0x1a4 32768 | 16 page_create_va+0x256 65536 | 2 swap_getapage+0xfd swap_getpage+0x46 fop_getpage+0x47 anon_zero+0xa4 segvn_faultpage+0x46b segvn_fault+0x9a6 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1576 1% 50% 0.00 12890 0xfffffffffb5db1a0 page_get_mnode_freelist+0x33d nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1561 page_get_freelist+0x1a4 32768 | 14 page_create_va+0x256 65536 | 1 swap_getapage+0xfd swap_getpage+0x46 fop_getpage+0x47 anon_zero+0xa4 segvn_faultpage+0x46b segvn_fault+0x9a6 as_fault+0x205 pagefault+0x8b trap+0x3d7 cmntrap+0x140 ------------------------------------------------------------------------------- Spin lock hold: 36973 events in 5.765 seconds (6413 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 6135 38% 38% 0.00 7365 sleepq_head[32768] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@ 1445 ts_update_list+0x161 4096 | 173 ts_update+0x36 8192 |@@@@@@@@@@@ 2260 callout_execute+0xdb 16384 |@@@@@@@@@ 2036 taskq_thread+0xbc 32768 |@ 218 thread_start+0x8 65536 | 0 131072 | 0 262144 | 2 524288 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2832 19% 57% 0.00 8091 0xffffffffa1d54059 mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 8192 |@@@@@@@@@@@@@@@@@@@@@@ 2108 
pci_peekpoke_check+0xbb 16384 |@@@@@@@ 723 pepb_ctlops+0x2be 32768 | 0 ddi_ctlops+0x3b 65536 | 0 i_ddi_caut_getput_ctlops+0x36 131072 | 0 i_ddi_caut_get32+0x29 262144 | 1 pci_config_get32+0x2b nvidia_pci_check_config_space+0xb8 nv_intr+0x6f av_dispatch_autovect+0x78 intr_thread+0x5f ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1146 7% 64% 0.00 7746 xc_mbox_lock[24] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 4096 |@@@@ 172 xc_do_call+0x9b 8192 |@@@@@@@@@@@@@@@ 593 xc_sync+0x36 16384 |@@@@@@@@@ 377 dtrace_xcall+0x97 32768 | 2 dtrace_sync+0x17 65536 | 1 dtrace_dynvar_clean+0xe6 131072 | 0 dtrace_state_clean+0x29 262144 | 0 cyclic_softint+0xba 524288 | 0 cbe_low_level+0x14 1048576 | 1 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5664 6% 70% 0.00 1239 0xffffffffa1d54051 mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 1024 | 10 pci_peekpoke_check+0xe6 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5625 pepb_ctlops+0x2be 4096 | 21 ddi_ctlops+0x3b 8192 | 8 i_ddi_caut_getput_ctlops+0x36 i_ddi_caut_get32+0x29 pci_config_get32+0x2b nvidia_pci_check_config_space+0xb8 nv_intr+0x6f av_dispatch_autovect+0x78 intr_thread+0x5f ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3634 6% 76% 0.00 1842 cpu0_disp[48] disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 | 35 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 2882 swtch+0xa0 4096 |@@@@ 564 idle+0xdb 8192 |@ 133 thread_start+0x8 16384 | 20 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3231 4% 80% 0.00 1657 0xffffffffa21a8a28 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 | 60 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 2693 swtch+0xa0 4096 |@@@@ 451 idle+0xdb 8192 | 19 thread_start+0x8 16384 | 8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 3211 4% 84% 0.00 1534 hres_lock[4] dtrace_hres_tick+0x69 nsec ------ Time Distribution ------ count Stack 1024 | 5 cbe_hres_tick+0xe 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 2884 cyclic_expire+0xbc 4096 |@@ 306 cyclic_fire+0x5b 8192 | 15 cbe_fire+0x39 16384 | 1 av_dispatch_autovect+0x78 _interrupt+0x15a cpu_halt+0x121 idle+0x89 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2265 4% 88% 0.00 1916 cpu[0][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 13 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@@@ 1436 syscall_exit+0x59 4096 |@@@@@@@@@@ 798 sys_syscall32+0x1a0 8192 | 15 16384 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2127 3% 91% 0.00 1951 cpu[1][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 21 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@ 1237 syscall_exit+0x59 4096 |@@@@@@@@@@@ 834 sys_syscall32+0x1a0 8192 | 32 16384 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1377 2% 93% 0.00 1617 0xffffffffa21a89f8 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count 
Stack 1024 |@ 59 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@@ 1146 swtch+0xa0 4096 |@@@ 164 idle+0xdb 8192 | 6 thread_start+0x8 16384 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1613 2% 95% 0.00 1315 softcall_lock[8] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 1024 | 10 softint+0x13e 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1602 softlevel1+0x9 4096 | 1 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 948 1% 96% 0.00 1884 cpu[2][1512] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 1024 | 9 post_syscall+0x3ec 2048 |@@@@@@@@@@@@@@@@@@@@ 659 syscall_exit+0x59 4096 |@@@@@@@ 248 sys_syscall32+0x1a0 8192 |@ 32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1050 1% 98% 0.00 1599 cpu[3][1512] disp_lock_exit_nopreempt+0x3a nsec ------ Time Distribution ------ count Stack 1024 | 6 ts_tick+0x74 2048 |@@@@@@@@@@@@@@@@@@@@@@@@ 858 clock_tick+0x48 4096 |@@@@ 169 clock_tick_process+0x14f 8192 | 17 clock_tick_execute_common+0x73 clock_tick_schedule+0x74 clock+0x2d0 cyclic_softint+0xba cbe_softclock+0x17 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 683 1% 99% 0.00 1609 0xffffffffa250e9c0 disp_lock_exit_high+0x2a nsec ------ Time Distribution ------ count Stack 1024 |@ 27 disp+0x137 2048 |@@@@@@@@@@@@@@@@@@@@@@@ 534 swtch+0xa0 4096 |@@@@@ 121 idle+0xdb 8192 | 0 thread_start+0x8 16384 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 788 1% 99% 0.00 1242 shuttle_lock[1] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 788 ts_update_list+0x161 ts_update+0x36 callout_execute+0xdb taskq_thread+0xbc thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 212 0% 100% 0.00 1884 lwpsleepq[32768] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 191 ts_update_list+0x161 4096 | 5 ts_update+0x36 8192 |@@ 16 callout_execute+0xdb taskq_thread+0xbc thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 23 0% 100% 0.00 4978 turnstile_table[4096] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@ 17 turnstile_exit+0x50 4096 | 0 mutex_vector_enter+0x14d 8192 |@ 1 cv_wait+0x70 16384 |@@ 2 taskq_thread+0x14f 32768 |@@@ 3 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 9 0% 100% 0.00 5938 cp_default[320] disp_lock_exit+0x78 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 3 sigtoproc+0x446 4096 |@@@ 1 sigaddqa+0x4a 8192 | 0 timer_fire+0xb9 16384 |@@@@@@@@@@@@@@@@ 5 clock_realtime_fire+0x2d callout_execute+0xdb softint+0x146 softlevel1+0x9 av_dispatch_softvect+0x62 dosoftint+0x32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18 0% 100% 0.00 1292 0xffffffffa06fc6c9 mutex_vector_exit+0xad nsec ------ 
Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18 ndi_fmc_insert+0x9d rootnex_coredma_bindhdl+0x490 rootnex_dma_bindhdl+0x1a ddi_dma_buf_bind_handle+0xb0 mpt_scsi_init_pkt+0x146 scsi_init_pkt+0x44 sd_setup_rw_pkt+0xe5 sd_initpkt_for_buf+0xa3 sd_start_cmds+0xa5 sd_core_iostart+0x87 sd_mapblockaddr_iostart+0x11a sd_xbuf_strategy+0x46 xbuf_iostart+0x75 ddi_xbuf_qstrategy+0x4a sdstrategy+0xbb bdev_strategy+0x54 log_roll_write_crb+0x59 log_roll_write+0x85 trans_roll+0x1e0 thread_start+0x8 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2 0% 100% 0.00 5570 reaplock[8] mutex_vector_exit+0xad nsec ------ Time Distribution ------ count Stack 4096 |@@@@@@@@@@@@@@@ 1 lwp_create+0x1e0 8192 | 0 forklwp+0x8a 16384 |@@@@@@@@@@@@@@@ 1 cfork+0x7a6 fork1+0x10 sys_syscall+0x17b ------------------------------------------------------------------------------- R/W writer hold: 12279 events in 5.765 seconds (2130 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5479 64% 64% 0.00 65450 0xffffffffbe667ab8 as_map_locked+0x144 nsec ------ Time Distribution ------ count Stack 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5407 as_map+0x4a 131072 | 63 brk_internal+0x28f 262144 | 8 brk+0x44 524288 | 0 sys_syscall+0x17b 1048576 | 0 2097152 | 0 4194304 | 0 8388608 | 0 16777216 | 0 33554432 | 0 67108864 | 0 134217728 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 364 30% 94% 0.00 457691 0xfffffe85eb376ab0 as_unmap+0x10f nsec ------ Time Distribution ------ count Stack 65536 | 1 munmap+0x85 131072 |@@@@@@@@@@@@@@@@@@@ 242 sys_syscall+0x17b 262144 |@@@@@@@@ 100 524288 |@ 17 1048576 | 1 2097152 | 1 4194304 | 0 8388608 | 0 16777216 | 1 33554432 | 0 67108864 | 0 134217728 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 5477 2% 96% 0.00 1766 0xffffffffbf781348 segvn_extend_prev+0x15a nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4964 segvn_create+0x8f5 4096 |@@ 475 as_map_locked+0x102 8192 | 37 as_map+0x4a 16384 | 0 brk_internal+0x28f 32768 | 0 brk+0x44 65536 | 1 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 96% 0.00 1210410 0xfffffe87611807c8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 1071082 0xfffffe85dea77ac0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 503423 0xfffffe8754e01580 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 
------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 499633 0xfffffe87598e2e40 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 387715 0xfffffe87611156c0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 383589 0xfffffe87611158c0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 322684 0xfffffe87611189c8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 314745 0xfffffe8761115440 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 312264 0xfffffe8761183680 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 305929 0xfffffe8761180988 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 246750 0xfffffe8764cbeec8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 217796 0xfffffe8761115340 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 211458 0xfffffe87598e2cc0 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 
exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 207617 0xfffffe8761183200 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_unmap+0xdb munmap+0x85 sys_syscall+0x17b ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 206543 0xfffffe85d8f2a680 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 205967 0xfffffe87598e2d40 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1 0% 97% 0.00 205288 0xfffffe85dea7cdc8 segvn_free+0xc0 nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 seg_free+0x3f segvn_unmap+0x60e as_free+0xac relvm+0x1f7 proc_exit+0x3a1 exit+0x9 rexit+0x10 sys_syscall32+0x101 ------------------------------------------------------------------------------- R/W reader hold: 139192 events in 5.765 seconds (24145 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25275 28% 28% 0.00 123303 0xffffffffbe667ab8 as_fault+0x488 nsec ------ Time Distribution ------ count Stack 8192 | 1 pagefault+0x8b 16384 | 0 trap+0x3d7 32768 | 0 cmntrap+0x140 65536 | 7 131072 |@@@@@@@@@@@@@@@@@@@ 16753 262144 |@@@@@@@@@@ 8479 524288 | 34 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25253 27% 55% 0.00 116596 0xffffffffb94097c8 segvn_fault+0xcbd nsec ------ Time Distribution ------ count Stack 65536 | 33 as_fault+0x205 131072 |@@@@@@@@@@@@@@@@@@@@@@@ 19403 pagefault+0x8b 262144 |@@@@@@ 5787 trap+0x3d7 524288 | 29 cmntrap+0x140 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 25253 25% 81% 0.00 109842 0xffffffffbf781348 segvn_fault+0xceb nsec ------ Time Distribution ------ count Stack 65536 | 47 as_fault+0x205 131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 24350 pagefault+0x8b 262144 | 832 trap+0x3d7 524288 | 23 cmntrap+0x140 1048576 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1690 3% 83% 0.00 172137 0xfffffe85eb376ab0 as_fault+0x488 nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@ 723 pagefault+0x8b 262144 |@@@@@@@@@@@@@@@ 865 trap+0x3d7 524288 |@ 86 cmntrap+0x140 1048576 | 16 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 18480 2% 86% 0.00 14461 0xffffffffb2d31830 dbuf_read+0x215 nsec ------ Time Distribution ------ count Stack 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18273 dnode_hold_impl+0xa7 32768 | 177 dnode_hold+0x14 65536 | 0 
dmu_buf_hold_array+0x3b 131072 | 1 dmu_read_uio+0x49 262144 | 16 zfs_read+0x15c 524288 | 13 zfs_shim_read+0xc fop_read+0x31 read+0x188 read32+0xe sys_syscall32+0x101 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1509 2% 88% 0.00 155379 0xfffffe876684c4f8 segvn_fault+0xcbd nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@@ 654 as_fault+0x205 262144 |@@@@@@@@@@@@@@@@ 829 pagefault+0x8b 524288 | 22 trap+0x3d7 1048576 | 4 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1509 2% 90% 0.00 148662 0xfffffe85dea77d40 segvn_fault+0xceb nsec ------ Time Distribution ------ count Stack 131072 |@@@@@@@@@@@@@ 655 as_fault+0x205 262144 |@@@@@@@@@@@@@@@@ 830 pagefault+0x8b 524288 | 22 trap+0x3d7 1048576 | 2 cmntrap+0x140 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 528 1% 91% 0.00 247232 0xffffffffb0a2f3b8 taskq_thread+0xdb nsec ------ Time Distribution ------ count Stack 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@ 457 thread_start+0x8 524288 |@@@ 69 1048576 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 92% 0.00 41220 0xfffffe87ba3ea5c0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 628 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 628 read+0x188 131072 | 44 read32+0xe 262144 | 4 sys_syscall32+0x101 524288 | 2 1048576 | 40 2097152 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2031 1% 93% 0.00 40995 0xfffffe8686311cf0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 677 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 1 zfs_read+0x15c 16384 |@@@@@@@@@ 632 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 631 read+0x188 131072 | 41 read32+0xe 262144 | 4 sys_syscall32+0x101 524288 | 3 1048576 | 36 2097152 | 5 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2025 1% 93% 0.00 40914 0xfffffe87b9955a78 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@ 673 dmu_buf_hold_array+0x81 4096 | 1 dmu_read_uio+0x49 8192 | 1 zfs_read+0x15c 16384 |@@@@@@@@@ 631 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 630 read+0x188 131072 | 41 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 4 1048576 | 35 2097152 | 5 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 94% 0.00 40006 0xfffffe8683148058 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 629 zfs_shim_read+0xc 32768 | 6 fop_read+0x31 65536 |@@@@@@@@@ 633 read+0x188 131072 | 42 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 1 1048576 | 40 2097152 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 95% 0.00 39638 0xfffffe875b9ef578 dmu_buf_hold_array_by_dnode+0x208 
nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 633 zfs_shim_read+0xc 32768 | 0 fop_read+0x31 65536 |@@@@@@@@@ 627 read+0x188 131072 | 40 read32+0xe 262144 | 3 sys_syscall32+0x101 524288 | 8 1048576 | 38 2097152 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1944 1% 95% 0.00 40258 0xfffffe8724b592c8 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 648 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 607 zfs_shim_read+0xc 32768 | 3 fop_read+0x31 65536 |@@@@@@@@@ 607 read+0x188 131072 | 37 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 35 2097152 | 3 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 2028 1% 96% 0.00 38110 0xfffffe8628c965c0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 676 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 632 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 633 read+0x188 131072 | 41 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 3 1048576 | 41 2097152 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1692 1% 97% 0.00 40668 0xfffffe873f426a90 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@ 563 dmu_buf_hold_array+0x81 4096 | 1 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 526 zfs_shim_read+0xc 32768 | 4 fop_read+0x31 65536 |@@@@@@@@@ 529 read+0x188 131072 | 34 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 1 1048576 | 30 2097152 | 4 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1433 1% 97% 0.00 41323 0xfffffe861e20acd8 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 478 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 445 zfs_shim_read+0xc 32768 | 2 fop_read+0x31 65536 |@@@@@@@@@ 444 read+0x188 131072 | 31 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 25 2097152 | 4 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1326 0% 98% 0.00 39036 0xfffffe87b98b5060 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 442 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 413 zfs_shim_read+0xc 32768 | 3 fop_read+0x31 65536 |@@@@@@@@@ 416 read+0x188 131072 | 26 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 0 1048576 | 26 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 1140 0% 98% 0.00 41647 0xfffffe878b029cf0 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 380 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 351 zfs_shim_read+0xc 32768 | 5 fop_read+0x31 65536 |@@@@@@@@@ 354 read+0x188 131072 | 24 read32+0xe 262144 | 0 sys_syscall32+0x101 524288 | 2 1048576 | 22 2097152 | 2 
------------------------------------------------------------------------------- Count indv cuml rcnt nsec Lock Hottest Caller 987 0% 99% 0.00 42129 0xfffffe873752e520 dmu_buf_hold_array_by_dnode+0x208 nsec ------ Time Distribution ------ count Stack 2048 |@@@@@@@@@@ 329 dmu_buf_hold_array+0x81 4096 | 0 dmu_read_uio+0x49 8192 | 0 zfs_read+0x15c 16384 |@@@@@@@@@ 306 zfs_shim_read+0xc 32768 | 1 fop_read+0x31 65536 |@@@@@@@@@ 306 read+0x188 131072 | 20 read32+0xe 262144 | 1 sys_syscall32+0x101 524288 | 3 1048576 | 19 2097152 | 2 ------------------------------------------------------------------------------- + /usr/sbin/lockstat -kIW -i 977 -D 20 -s 40 sleep 5 Profiling interrupt: 19652 events in 5.028 seconds (3909 events/sec) ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 18681 95% 95% 0.00 464 cpu[2] cpu_halt nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@@@@@@@@@@@ 13724 idle 1024 |@@@@@@@ 4905 thread_start 2048 | 50 4096 | 2 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 108 1% 96% 0.00 656 cpu[0] kcopy nsec ------ Time Distribution ------ count Stack 512 | 2 uiomove 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 104 dmu_read_uio 2048 | 2 zfs_read zfs_shim_read fop_read read read32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 102 1% 96% 0.00 719 cpu[1] (usermode) nsec ------ Time Distribution ------ count Stack 512 |@ 4 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 91 2048 |@@ 7 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 100 1% 97% 0.00 638 cpu[2] fletcher_2_native nsec ------ Time Distribution ------ count Stack 512 |@@@ 12 zio_checksum_verify 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@ 85 zio_execute 2048 | 3 taskq_thread thread_start ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 55 0% 97% 0.00 2909 cpu[3] fsflush_do_pages nsec ------ Time Distribution ------ count Stack 1024 | 1 fsflush 2048 |@ 3 thread_start 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 50 8192 | 1 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 38 0% 97% 0.00 617 cpu[1] sys_syscall32 nsec ------ Time Distribution ------ count Stack 512 |@@@ 4 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 34 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 30 0% 97% 0.00 582 cpu[1] syscall_mstate nsec ------ Time Distribution ------ count Stack 512 |@@@@ 4 sys_syscall32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 26 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 20 0% 97% 0.00 629 cpu[0] write nsec ------ Time Distribution ------ count Stack 512 |@ 1 write32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 19 0% 97% 0.00 645 cpu[1] tsc_gethrtimeunscaled_delta nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19 gethrtime_unscaled syscall_mstate sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec 
Hottest CPU+PIL Caller 15 0% 98% 0.00 641 cpu[0] pc_gethrestime nsec ------ Time Distribution ------ count Stack 512 |@@@@ 2 gethrestime 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 13 gethrestime_sec smark spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 14 0% 98% 0.00 653 cpu[1] tsc_gethrtime_delta nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 gethrtime pc_gethrestime gethrestime gethrestime_sec smark spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 13 0% 98% 0.00 590 cpu[0] copyin_nowatch nsec ------ Time Distribution ------ count Stack 512 |@@ 1 copyin_args32 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 13 0% 98% 0.00 624 cpu[0] copyin_args32 nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 12 0% 98% 0.00 627 cpu[1] mutex_enter nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12 write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 9 0% 98% 0.00 595 cpu[1] spec_write nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9 fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 596 cpu[1] gethrestime_sec nsec ------ Time Distribution ------ count Stack 512 |@@@@ 1 smark 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@ 6 spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 683 cpu[1] gethrestime_sec nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7 spec_write fop_write write write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 585 cpu[1] kcopy nsec ------ Time Distribution ------ count Stack 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7 copyin_nowatch copyin_args32 syscall_entry sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 7 0% 98% 0.00 590 cpu[1] fop_write nsec ------ Time Distribution ------ count Stack 512 |@@@@@@@@@@@@ 3 write 1024 |@@@@@@@@@@@@@@@@@ 4 write32 sys_syscall32 ------------------------------------------------------------------------------- Count indv cuml rcnt nsec Hottest CPU+PIL Caller 6 0% 98% 0.00 729 cpu[1] spec_maxoffset nsec ------ Time Distribution ------ count Stack 512 |@@@@@ 1 fop_write 1024 |@@@@@@@@@@@@@@@@@@@@ 4 write 2048 |@@@@@ 1 write32 sys_syscall32 ------------------------------------------------------------------------------- -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
William Bauer
2009-Jul-09 18:13 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have a much more generic question regarding this thread. I have a Sun T5120 (T2 quad core, 1.4GHz) with two 10K RPM SAS drives in a mirrored pool running Solaris 10 u7. The disk performance seems horrible. I have the same apps running on a Sun X2100M2 (dual core 1.8GHz AMD), also running Solaris 10u7, with an old, really poor performing SATA drive (also with ZFS), and its disk performance seems at least 5x better. I'm not offering much detail here, but I had been attributing this to what I've always observed: Solaris on x86 performs far better than on sparc for any app I've ever used. I guess the real question is whether ZFS is ready for production in Solaris 10, or should I flar this bugger up and rebuild with UFS? This thread concerns me, and I really want to keep ZFS on this system for its many features. Sorry if this is off-topic, but you guys got me wondering.
--
This message posted from opensolaris.org
William Bauer
2009-Jul-09 18:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I don't swear. The word it bleeped was not a bad word....
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-12 21:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware.

It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC), which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously.

In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data.

There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to:

# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ...
Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 2m54.17s
user 0m7.65s
sys 0m36.59s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 11m54.65s
user 0m7.70s
sys 0m35.06s

Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'.

And here is a similar run on my Blade 2500 using the default rpool:

# ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 13m3.91s
user 2m43.04s
sys 9m28.73s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 23m50.27s
user 2m41.81s
sys 9m46.76s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

I am interested to hear about systems which do not suffer from this bug.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
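For readers who cannot fetch the script, the following is a minimal sketch of the kind of test it appears to perform, reconstructed only from the output shown above. The pool argument, file count, and file size match that output; how the files are built and the exact cpio invocation are assumptions, so the script at the URL above remains the authoritative version.

#!/bin/ksh
# Minimal sketch of the cache test described above; reconstructed from the
# printed output only, so details here may differ from the real script.
POOL=${1:-rpool}            # pool under test, e.g. ./zfs-cache-test.ksh Sun_2540
FS=$POOL/zfscachetest
NFILES=3000                 # total data should exceed RAM so the ARC cannot hold it
FILESIZE=8192000            # bytes per file

zfs create $FS
echo "Creating data file set ($NFILES files of $FILESIZE bytes) under /$FS ..."
i=0
while [ $i -lt $NFILES ]; do
    # The real script may build the files differently; dd is just one way.
    dd if=/dev/urandom of=/$FS/file$i bs=$FILESIZE count=1 2>/dev/null
    i=$((i + 1))
done
echo "Done!"

# Remount so the first read pass starts with a cold ARC.
zfs unmount $FS
zfs mount $FS

cd /$FS
echo "Doing initial (unmount/mount) 'cpio -o > /dev/null'"
time ls | cpio -o > /dev/null

echo "Doing second 'cpio -o > /dev/null'"
time ls | cpio -o > /dev/null

echo "Feel free to clean up with 'zfs destroy $FS'."

Run it as root against a pool with more than about 24 GB free and compare the two elapsed times; on an unaffected system they should be close.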
Scott Lawson
2009-Jul-12 23:15 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool called test1 which is contained on a raid 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become a Oracle 10 server with ZFS filesystems for zones and DB volumes. [root at xxx /]#> uname -a SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise [root at xxx /]#> cat /etc/release Solaris 10 5/09 s10s_u7wos_08 SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 March 2009 [root at xxx /]#> prtdiag -v | more System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System clock frequency: 1064 MHz Memory size: 16384 Megabytes Here is the run output for you. [root at xxx tmp]#> ./zfs-cache-test.ksh test1 zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 blocks real 4m48.94s user 0m21.58s sys 0m44.91s Doing second ''cpio -o > /dev/null'' 48000247 blocks real 6m39.87s user 0m21.62s sys 0m46.20s Feel free to clean up with ''zfs destroy test1/zfscachetest''. Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60M/''s sustained on the 2nd. /Scott. Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue > for a week now. A 4X reduction in file read performance due to having > read the file before is terrible, and of course the situation is > considerably worse if the file was previously mmapped as well. Many > of us have sent a lot of money to Sun and were not aware that ZFS is > sucking the life out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. > For example, I reproduced it on my Blade 2500 (SPARC) which uses a > simple mirrored rpool. On that system there is a 1.8X read slowdown > from the file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 2m54.17s > user 0m7.65s > sys 0m36.59s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 11m54.65s > user 0m7.70s > sys 0m35.06s > > Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. 
> > And here is a similar run on my Blade 2500 using the default rpool: > > # ./zfs-cache-test.ksh > zfs create rpool/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /rpool/zfscachetest ... > Done! > zfs unmount rpool/zfscachetest > zfs mount rpool/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 13m3.91s > user 2m43.04s > sys 9m28.73s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 23m50.27s > user 2m41.81s > sys 9m46.76s > > Feel free to clean up with ''zfs destroy rpool/zfscachetest''. > > I am interested to hear about systems which do not suffer from this bug. > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, > http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
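(As a sanity check on the MB/s figures Scott quotes, the cpio block counts convert directly to throughput. The arithmetic below assumes cpio's customary 512-byte blocks, which is an assumption rather than something stated in the thread.)

# Rough throughput for Scott's run, assuming 512-byte cpio blocks.
BLOCKS=48000247
MB=$((BLOCKS / 1000 * 512 / 1000))       # roughly 24576 MB read per pass
echo "first pass : $((MB / 289)) MB/s"   # 4m48.94s is about 289 s, so ~85 MB/s
echo "second pass: $((MB / 400)) MB/s"   # 6m39.87s is about 400 s, so ~61 MB/s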
Gaëtan Lehmann
2009-Jul-13 08:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi, Here is the result on a Dell Precision T5500 with 24 GB of RAM and two HD in a mirror (SATA, 7200 rpm, NCQ). [glehmann at marvin2 tmp]$ uname -a SunOS marvin2 5.11 snv_117 i86pc i386 i86pc Solaris [glehmann at marvin2 tmp]$ pfexec ./zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/ zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 blocks real 8m19,74s user 0m6,47s sys 0m25,32s Doing second ''cpio -o > /dev/null'' 48000247 blocks real 10m42,68s user 0m8,35s sys 0m30,93s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. HTH, Ga?tan Le 13 juil. 09 ? 01:15, Scott Lawson a ?crit :> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 > zpool called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under /test1/ > zfscachetest ... > Done! > zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance >> issue for a week now. A 4X reduction in file read performance due >> to having read the file before is terrible, and of course the >> situation is considerably worse if the file was previously mmapped >> as well. Many of us have sent a lot of money to Sun and were not >> aware that ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple >> machines. For example, I reproduced it on my Blade 2500 (SPARC) >> which uses a simple mirrored rpool. On that system there is a 1.8X >> read slowdown from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see >> if they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> Implements a simple test. It requires a fair amount of disk space >> to run, but the main requirement is that the disk space consumed be >> more than available memory so that file data gets purged from the >> ARC. 
The script needs to run as root since it creates a filesystem >> and uses mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under / >> Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under /rpool/ >> zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this >> bug. >> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- Ga?tan Lehmann Biologie du D?veloppement et de la Reproduction INRA de Jouy-en-Josas (France) tel: +33 1 34 65 29 66 fax: 01 34 65 29 09 http://voxel.jouy.inra.fr http://www.itk.org http://www.mandriva.org http://www.bepo.fr -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: Ceci est une signature ?lectronique PGP URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090713/11a75945/attachment.bin>
Alexander Skwar
2009-Jul-13 09:30 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob, On Sun, Jul 12, 2009 at 23:38, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> There has been no forward progress on the ZFS read performance issue for a > week now. ?A 4X reduction in file read performance due to having read the > file before is terrible, and of course the situation is considerably worse > if the file was previously mmapped as well. ?Many of us have sent a lot of > money to Sun and were not aware that ZFS is sucking the life out of our > expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. ?On that system there is a 1.8X read slowdown from the file > being accessed previously. > > In order to raise visibility of this issue, I invite others to see if they > can reproduce it in their ZFS pools. ?The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test.--($ ~)-- time sudo ksh zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 48000247 Bl?cke real 4m7.70s user 0m24.10s sys 1m5.99s Doing second ''cpio -o > /dev/null'' 48000247 Bl?cke real 1m44.88s user 0m22.26s sys 0m51.56s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. real 10m47.747s user 0m54.189s sys 3m22.039s This is a M4000 mit 32 GB RAM and two HDs in a mirror. Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo ''CLICK!''
Hey Bob,

Here are my results on a dual 2.2GHz Opteron, 8GB of RAM, and 16 SATA disks connected via a Supermicro AOC-SAT2-MV8 (albeit with one dead drive). Looks like a 5x slowdown to me:

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 4m46.45s
user 0m10.29s
sys 0m58.27s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 15m50.62s
user 0m10.54s
sys 1m11.86s

Ross
--
This message posted from opensolaris.org
Daniel Rock
2009-Jul-13 11:52 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi,

Solaris 10U7, patched with the latest released patches two weeks ago. Four ST31000340NS drives attached to two SI3132 SATA controllers, RAIDZ1. Self-built system with 2GB RAM and an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (chipid 0x0 AuthenticAMD family 15 model 35 step 2 clock 2210 MHz). On the first run throughput was ~110MB/s, on the second run only 80MB/s.

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 Blöcke

real 3m37.17s
user 0m11.15s
sys 0m47.74s

Doing second 'cpio -o > /dev/null'
48000247 Blöcke

real 4m55.69s
user 0m10.69s
sys 0m47.57s

Daniel
Jorgen Lundman
2009-Jul-13 12:31 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
x4540 running snv_117

# ./zfs-cache-test.ksh zpool1
zfs create zpool1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ...
Done!
zfs unmount zpool1/zfscachetest
zfs mount zpool1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 4m7.13s
user 0m9.27s
sys 0m49.09s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 4m52.52s
user 0m9.13s
sys 0m47.51s
Alexander Skwar
2009-Jul-13 12:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Here's more useful output, having set the number of files to 6000 so that the data set is larger than the amount of RAM.

--($ ~)-- time sudo ksh zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (6000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
96000493 Blöcke

real 8m44.82s
user 0m46.85s
sys 2m15.01s

Doing second 'cpio -o > /dev/null'
96000493 Blöcke

real 29m15.81s
user 0m45.31s
sys 3m2.36s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

real 48m40.890s
user 1m47.192s
sys 8m2.165s

Still on S10 U7 SPARC M4000. So I'm now in line with the other results - the 2nd run is WAY slower, 4x as slow.

Alexander
--
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Bob Friesenhahn
2009-Jul-13 14:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote:
>
> This is a M4000 with 32 GB RAM and two HDs in a mirror.

I think that you should edit the script to increase the file count, since your RAM size is big enough to cache most of the data.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
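One way to pick a big enough file count is sketched below. The prtconf parsing is a standard Solaris idiom; the name of the count variable inside zfs-cache-test.ksh is not shown in this thread, so treat this as a value to plug into whatever knob the script defines near its top.

# Size the data set to roughly 2x physical RAM (each test file is about 8 MB).
RAM_MB=$(prtconf | awk '/^Memory size/ {print $3}')   # e.g. 32768 on the M4000 above
NFILES=$((RAM_MB * 2 / 8))
echo "suggested file count: $NFILES"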
Bob Friesenhahn
2009-Jul-13 14:34 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Alexander Skwar wrote:
>
> Still on S10 U7 SPARC M4000.
>
> So I'm now in line with the other results - the 2nd run is WAY slower, 4x as slow.

It would be good to see results from a few OpenSolaris users running a recent 64-bit kernel and fast storage, to see whether this is an OpenSolaris issue as well. It seems likely to be more evident with fast SAS disks or SAN devices than with a few SATA disks, since the SATA disks have more access latency. Pools composed of mirrors should offer less read latency as well.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Interesting, I repeated the test on a few other machines running newer builds. First impressions are good:

snv_114, virtual machine, 1GB RAM, 30GB disk - 16% slowdown. (Only 9GB free, so I ran an 8GB test.)

Doing initial (unmount/mount) 'cpio -o > /dev/null'
16000083 blocks

real 3m4.85s
user 0m16.74s
sys 0m41.69s

Doing second 'cpio -o > /dev/null'
16000083 blocks

real 3m34.58s
user 0m18.85s
sys 0m45.40s

And again on snv_117, Sun x2200, 40GB RAM, single 500GB SATA disk. First run (with the default 24GB set):

real 6m25.15s
user 0m11.93s
sys 0m54.93s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 1m9.97s
user 0m12.17s
sys 0m57.80s

... d'oh! At least I know the ARC is working :-)

The second run, with a 98GB test, is running now; I'll post the results in the morning.
--
This message posted from opensolaris.org
Brad Diggs
2009-Jul-13 16:35 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
You might want to have a look at my blog on filesystem cache tuning... It will probably help you to avoid memory contention between the ARC and your apps.

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail Bradley.Diggs at Sun.COM
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 4, 2009, at 2:48 AM, Phil Harman wrote:
> ZFS doesn''t mix well with mmap(2). This is because ZFS uses the ARC
> instead of the Solaris page cache. But mmap() uses the latter. So if
> anyone maps a file, ZFS has to keep the two caches in sync.
>
> cp(1) uses mmap(2). When you use cp(1) it brings pages of the files
> it copies into the Solaris page cache. As long as they remain there
> ZFS will be slow for those files, even if you subsequently use
> read(2) to access them.
>
> If you reboot, your cpio(1) tests will probably go fast again, until
> someone uses mmap(2) on the files again. I think tar(1) uses
> read(2), but from my iPod I can''t be sure. It would be interesting
> to see how tar(1) performs if you run that test before cp(1) on a
> freshly rebooted system.
>
> I have done some work with the ZFS team towards a fix, but it is
> only currently in OpenSolaris.
>
> The other thing that slows you down is that ZFS only flushes to disk
> every 5 seconds if there are no synchronous writes. It would be
> interesting to see iostat -xnz 1 while you are running your tests.
> You may find the disks are writing very efficiently for one second
> in every five.
>
> Hope this helps,
> Phil
>
> blogs.sun.com/pgdh
>
> Sent from my iPod
>
> On 4 Jul 2009, at 05:26, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>
>> On Fri, 3 Jul 2009, Bob Friesenhahn wrote:
>>>
>>> Copy Method                           Data Rate
>>> ====================================  =================
>>> cpio -pdum                            75 MB/s
>>> cp -r                                 32 MB/s
>>> tar -cf - . | (cd dest && tar -xf -)  26 MB/s
>>
>> It seems that the above should be amended. Running the cpio based
>> copy again results in zpool iostat only reporting a read bandwidth
>> of 33 MB/second. The system seems to get slower and slower as it
>> runs.
>>
>> Bob
>> -- 
>> Bob Friesenhahn
>> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
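Whether a given copy utility drags file pages into the Solaris page cache via mmap(2) can be checked directly with truss(1). This is a minimal sketch, not taken from the thread; the file and destination names are placeholders:

# count mmap vs. read calls made by cp(1) while copying one test file
truss -c -t mmap,read cp /Sun_2540/zfscachetest/somefile /tmp/copytest

# cpio is expected to show read(2) calls only, no mmap
find /Sun_2540/zfscachetest -type f | truss -c -t mmap,read cpio -o > /dev/null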
Bob Friesenhahn
2009-Jul-13 18:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Brad Diggs wrote:
> You might want to have a look at my blog on filesystem cache tuning... It
> will probably help you to avoid memory contention between the ARC and your apps.
>
> http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Your post makes it sound like there is not a bug in the operating system. It does not take long to see that there is a bug in the Solaris 10 operating system. It is not clear whether the same bug is shared by current OpenSolaris, since it seems that it has not been tested.

Solaris 10 U7 reads files that it has not seen before at a constant rate, regardless of the amount of file data it has already read. When a file is read a second time, the read is 4X or more slower. If reads were slowing down because the ARC was slow to expunge stale data, that would be apparent on the first read pass. However, the reads are not slowing down in the first read pass. ZFS goes into the weeds if it has seen a file before but none of the file data is resident in the ARC.

It is pathetic that a Sun RAID array that I paid $21K for out of my own life savings is not able to perform better than the cheapo portable USB drives that I use for backup, because of ZFS. This is making me madder and madder by the minute.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
sean walmsley
2009-Jul-13 18:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Sun X4500 (thumper) with 16Gb of memory running Solaris 10 U6 with patches current to the end of Feb 2009. Current ARC size is ~6Gb. ZFS filesystem created in a ~3.2 Tb pool consisting of 7 sets of mirrored 500Gb SATA drives. I used 4000 8Mb files for a total of 32Gb. run 1: ~140M/s average according to zpool iostat real 4m1.11s user 0m10.44s sys 0m50.76s run 2: ~37M/s average according to zpool iostat real 13m53.43s user 0m10.62s sys 0m55.80s A zfs unmount followed by a mount of the filesystem returned the performance to the run 1 case. real 3m58.16s user 0m11.54s sys 0m51.95s In summary, the second run performance drops to about 30% of the original run. -- This message posted from opensolaris.org
Mike Gerdts
2009-Jul-13 19:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 9:34 AM, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Alexander Skwar wrote: >> >> Still on S10 U7 Sparc M4000. >> >> So I''m now inline with the other results - the 2nd run is WAY slower. 4x >> as slow. > > It would be good to see results from a few OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if this is an OpenSolaris > issue as well.Indeed it is. Using ldoms with tmpfs as the backing store for virtual disks, I see: With S10u7: # ./zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m30.35s user 0m9.90s sys 0m19.81s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m43.95s user 0m9.67s sys 0m17.96s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. # ./zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m31.14s user 0m10.09s sys 0m20.47s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m40.24s user 0m9.68s sys 0m17.86s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. When I move the zpool to a 2009.06 ldom, # /var/tmp/zfs-cache-test.ksh testpool zfs create testpool/zfscachetest Creating data file set (300 files of 8192000 bytes) under /testpool/zfscachetest ... Done! zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m30.09s user 0m9.58s sys 0m19.83s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m44.21s user 0m9.47s sys 0m18.18s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 0m29.89s user 0m9.58s sys 0m19.72s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 0m44.40s user 0m9.59s sys 0m18.24s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. Notice in these runs that each time the usr+sys time of the first run adds up to the elapsed time - the rate was choked by CPU. This is verified by "prstat -mL". The second run seemed to be slow due to a lock as we had just demonstrated that the IO path can do more (not an IO bottleneck) and "prstat -mL shows cpio at in sleep for a significant amount of time. FWIW, I hit another bug if I turn off primarycache. http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 This causes really abysmal performance - but equally so for repeat runs! # /var/tmp/zfs-cache-test.ksh testpool zfs unmount testpool/zfscachetest zfs mount testpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 4800025 blocks real 4m21.57s user 0m9.72s sys 0m36.30s Doing second ''cpio -o > /dev/null'' 4800025 blocks real 4m21.56s user 0m9.72s sys 0m36.19s Feel free to clean up with ''zfs destroy testpool/zfscachetest''. This bug report contains more detail of the configuration. One thing not covered in that bug report is that the S10u7 ldom has 2048 MB of RAM and the 2009.06 ldom has 2024 MB of RAM. -- Mike Gerdts http://mgerdts.blogspot.com/
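For reference, the primarycache experiment described above uses the ZFS dataset property of that name; on releases that have the property it can be reproduced roughly as follows. A hedged sketch with placeholder dataset names, not the exact commands from the bug report:

zfs get primarycache testpool/zfscachetest           # default is all
zfs set primarycache=none testpool/zfscachetest      # stop caching this dataset in the ARC
# ... rerun the cpio read pass here to reproduce the slowdown ...
zfs inherit primarycache testpool/zfscachetest       # restore the default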
Bob Friesenhahn
2009-Jul-13 19:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:
> FWIW, I hit another bug if I turn off primarycache.
>
> http://defect.opensolaris.org/bz/show_bug.cgi?id=10004
>
> This causes really abysmal performance - but equally so for repeat runs!

It is quite fascinating seeing the huge difference in I/O performance from these various reports. The bug you reported seems likely to be that without at least a little bit of caching, it is necessary to re-request the underlying 128K ZFS block several times as the program does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally disabling data caching seems best reserved for block-oriented databases which are looking for a substitute for directio(3C).

It is easily demonstrated that the problem seen in Solaris 10 (jury still out on OpenSolaris, although one report has been posted) is due to some sort of confusion. It is not due to delays caused by purging old data from the ARC. If these delays were caused by purging data from the ARC, then ''zpool iostat'' would start showing lower read performance once the ARC becomes full, but that is not the case.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Joerg Schilling
2009-Jul-13 20:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Mike Gerdts wrote:
> >
> > FWIW, I hit another bug if I turn off primarycache.
> >
> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004
> >
> > This causes really abysmal performance - but equally so for repeat runs!
>
> It is quite fascinating seeing the huge difference in I/O performance
> from these various reports. The bug you reported seems likely to be
> that without at least a little bit of caching, it is necessary to
> re-request the underlying 128K ZFS block several times as the program
> does numerous smaller I/Os (cpio uses 10240 bytes?) across it.

cpio reads/writes in 8192 byte chunks from the filesystem.

BTW: star by default creates a shared memory based FIFO of 8 MB size and reads in the biggest possible size that would currently fit into the FIFO.

Jörg
-- 
EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Jim Mauro
2009-Jul-13 20:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob - Have you filed a bug on this issue? I am not up to speed on this thread, so I can not comment on whether or not there is a bug here, but you seem to have a test case and supporting data. Filing a bug will get the attention of ZFS engineering. Thanks, /jim Bob Friesenhahn wrote:> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> >> FWIW, I hit another bug if I turn off primarycache. >> >> http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> >> This causes really abysmal performance - but equally so for repeat runs! > > It is quite facinating seeing the huge difference in I/O performance > from these various reports. The bug you reported seems likely to be > that without at least a little bit of caching, it is necessary to > re-request the underlying 128K ZFS block several times as the program > does numerous smaller I/Os (cpio uses 10240 bytes?) across it. Totally > disabling data caching seems best reserved for block-oriented > databases which are looking for a substitute for directio(3C). > > It is easily demonstrated that the problem seen in Solaris 10 (jury > still out on OpenSolaris although one report has been posted) is due > to some sort of confusion. It is not due to delays caused by purging > old data from the ARC. If these delays were caused by purging data > from the ARC, then ''zfs iostat'' would start showing lower read > performance once the ARC becomes full, but that is not the case. > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, > http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Bob Friesenhahn
2009-Jul-13 20:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote:
> cpio reads/writes in 8192 byte chunks from the filesystem.

Yes, I was just reading the cpio manual page and see that. I think that re-reading the 128K zfs block 16 times to satisfy each request for 8192 bytes explains the 16X performance loss when caching is disabled. I don''t think that this is strictly a bug, since it is what the database folks are looking for.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Mike Gerdts
2009-Jul-13 20:27 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:16 PM, Joerg Schilling<Joerg.Schilling at fokus.fraunhofer.de> wrote:> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote: > >> On Mon, 13 Jul 2009, Mike Gerdts wrote: >> > >> > FWIW, I hit another bug if I turn off primarycache. >> > >> > http://defect.opensolaris.org/bz/show_bug.cgi?id=10004 >> > >> > This causes really abysmal performance - but equally so for repeat runs! >> >> It is quite facinating seeing the huge difference in I/O performance >> from these various reports. ?The bug you reported seems likely to be >> that without at least a little bit of caching, it is necessary to >> re-request the underlying 128K ZFS block several times as the program >> does numerous smaller I/Os (cpio uses 10240 bytes?) across it. > > cpio reads/writes in 8192 byte chunks from the filesystem. > > BTW: star by default creates a shared memory based FIFO of 8 MB size and > reads in the biggest possible size that would currently fit into the FIFO. > > J?rgUsing cpio''s -C option seems to not change the behavior for this bug, but I did see a performance difference with the case where I hadn''t modified the zfs caching behavior. That is, the performance of the tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024))>/dev/null". At this point cpio was spending roughly 13% usr and 87%sys. I haven''t tried star, but I did see that I could also reproduce with "cat $file | cat > /dev/null". This seems like a worthless use of cat, but it forces cat to actually copy data from input to output unlike when cat can mmap input and output. When it does that and output is /dev/null Solaris is smart enough to avoid any reads. -- Mike Gerdts http://mgerdts.blogspot.com/
Mike Gerdts
2009-Jul-13 20:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 3:23 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Joerg Schilling wrote:
>> cpio reads/writes in 8192 byte chunks from the filesystem.
>
> Yes, I was just reading the cpio manual page and see that. I think that
> re-reading the 128K zfs block 16 times to satisfy each request for 8192
> bytes explains the 16X performance loss when caching is disabled. I don''t
> think that this is strictly a bug since it is what the database folks are
> looking for.
>
> Bob

I did other tests with "dd bs=128k" and verified via truss that each read(2) was returning 128K. I thought I had seen excessive reads there too, but now I can''t reproduce that. Creating another fs with recordsize=8k seems to make this behavior go away - things seem to be working as designed. I''ll go update the (nota-)bug.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
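A hedged sketch of the recordsize experiment described above, with invented dataset and file names: writing the files into a dataset whose recordsize matches cpio''s 8 KB reads removes the need to re-fetch a 128 KB block for every 8 KB request when caching is disabled.

# newly written data in this dataset gets 8 KB records
zfs create -o recordsize=8k -o primarycache=none testpool/rec8k
cp /testpool/zfscachetest/somefile /testpool/rec8k/
# cpio reads in 8192-byte chunks, which now map 1:1 onto 8 KB blocks
find /testpool/rec8k -type f | cpio -o > /dev/null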
Ross Walker
2009-Jul-13 20:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 13, 2009, at 2:54 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us > wrote:> On Mon, 13 Jul 2009, Brad Diggs wrote: > >> You might want to have a look at my blog on filesystem cache >> tuning... It will probably help >> you to avoid memory contention between the ARC and your apps. >> >> http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html > > Your post makes it sound like there is not a bug in the operating > system. It does not take long to see that there is a bug in the > Solaris 10 operating system. It is not clear if the same bug is > shared by current OpenSolaris since it seems like it has not been > tested. > > Solaris 10 U7 reads files that it has not seen before at a constant > rate regardless of the amount of file data it has already read. > When the file is read a second time, the read is 4X or more slower. > If reads were slowing down because the ARC was slow to expunge stale > data, then that would be apparent on the first read pass. However, > the reads are not slowing down in the first read pass. ZFS goes > into the weeds if it has seen a file before but none of the file > data is resident in the ARC. > > It is pathetic that a Sun RAID array that I paid $21K for out of my > own life savings is not able to perform better than the cheapo > portable USB drives that I use for backup because of ZFS. This is > making me madder and madder by the minute.Have you tried limiting the ARC so it doesn''t squash the page cache? Make sure page cache has enough for mmap plus buffers for bouncing between it and the ARC. I would say 1GB minimum, 2 to be safe. -Ross
Bob Friesenhahn
2009-Jul-13 20:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mike Gerdts wrote:> > Using cpio''s -C option seems to not change the behavior for this bug, > but I did see a performance difference with the case where I hadn''t > modified the zfs caching behavior. That is, the performance of the > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > sys.Interesting. I just updated zfs-cache-test.ksh on my web site so that it uses 131072 byte blocks. I see a tiny improvement in performance from doing this, but I do see a bit less CPU consumption so the CPU consumption is essentially zero. The bug remains. It seems best to use ZFS''s ideal block size so that issues don''t get confused. Using an ARC monitoring script called ''arcstat.pl'' I see a huge number of ''dmis'' events when performance is poor. The ARC size is 7GB, which is less than its prescribed cap of 10GB. Better: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:39:37 20K 1K 6 58 0 1K 100 19 100 7G 10G 15:39:38 19K 1K 5 57 0 1K 100 19 100 7G 10G 15:39:39 19K 1K 6 54 0 1K 100 18 100 7G 10G 15:39:40 17K 1K 6 51 0 1K 100 17 100 7G 10G Worse: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 15:43:24 4K 280 6 280 6 0 0 4 100 9G 10G 15:43:25 4K 277 6 277 6 0 0 4 100 9G 10G 15:43:26 4K 268 6 268 6 0 0 5 100 9G 10G 15:43:27 4K 259 6 259 6 0 0 4 100 9G 10G An ARC stats summary from a tool called ''arc_summary.pl'' is appended to this message. Operation is quite consistent across the full span of files. Since ''dmis'' is still low when things are "good" (and even when the ARC has surely cycled already) this leads me to believe that prefetch is mostly working and is usually satisfying read requests. When things go bad I see that ''dmiss'' becomes 100% of the misses. A hypothesis is that if zfs thinks that the data might be in the ARC (due to having seen the file before) that it disables file prefetch entirely, assuming that it can retrieve the data from its cache. Then once it finally determines that there is no cached data after all, it issues a read request. Even the "better" read performance is 1/2 of what I would expect from my hardware and based on prior test results from ''iozone''. More prefetch would surely help. 
Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ System Memory: Physical RAM: 20470 MB Free Memory : 2511 MB LotsFree: 312 MB ZFS Tunables (/etc/system): * set zfs:zfs_arc_max = 0x300000000 set zfs:zfs_arc_max = 0x280000000 * set zfs:zfs_arc_max = 0x200000000 set zfs:zfs_write_limit_override = 0xea600000 * set zfs:zfs_write_limit_override = 0xa0000000 set zfs:zfs_vdev_max_pending = 5 ARC Size: Current Size: 8735 MB (arcsize) Target Size (Adaptive): 10240 MB (c) Min Size (Hard Limit): 1280 MB (zfs_arc_min) Max Size (Hard Limit): 10240 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 95% 9791 MB (p) Most Frequently Used Cache Size: 4% 448 MB (c-p) ARC Efficency: Cache Access Total: 827767314 Cache Hit Ratio: 96% 800123657 [Defined State for buffer] Cache Miss Ratio: 3% 27643657 [Undefined State for Buffer] REAL Hit Ratio: 89% 743665046 [MRU/MFU Hits Only] Data Demand Efficiency: 99% Data Prefetch Efficiency: 61% CACHE HITS BY CACHE LIST: Anon: 5% 47497010 [ New Customer, First Cache Hit ] Most Recently Used: 33% 271365449 (mru) [ Return Customer ] Most Frequently Used: 59% 472299597 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 0% 1700764 (mru_ghost) [ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 0% 7260837 (mfu_ghost) [ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data: 73% 589582518 Prefetch Data: 2% 20424879 Demand Metadata: 17% 139111510 Prefetch Metadata: 6% 51004750 CACHE MISSES BY DATA TYPE: Demand Data: 21% 5814459 Prefetch Data: 46% 12788265 Demand Metadata: 27% 7700169 Prefetch Metadata: 4% 1340764 ---------------------------------------------
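For readers without arcstat.pl at hand: the dmis/pmis columns above are derived from the ZFS arcstats kstats, so the same counters can be sampled directly. The statistic names below are the standard arcstats fields; the 5-second interval is arbitrary:

kstat -p zfs:0:arcstats:demand_data_misses zfs:0:arcstats:prefetch_data_hits \
    zfs:0:arcstats:prefetch_data_misses zfs:0:arcstats:size zfs:0:arcstats:c 5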
Bob Friesenhahn
2009-Jul-13 21:06 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Ross Walker wrote:
> Have you tried limiting the ARC so it doesn''t squash the page cache?

Yes, the ARC is limited to 10GB, leaving another 10GB for the OS and applications. Resource limits are not the problem. There is a ton of memory and CPU to go around. Current /etc/system tunables:

set maxphys = 0x20000
set zfs:zfs_arc_max = 0x280000000
set zfs:zfs_write_limit_override = 0xea600000
set zfs:zfs_vdev_max_pending = 5

> Make sure page cache has enough for mmap plus buffers for bouncing between it
> and the ARC. I would say 1GB minimum, 2 to be safe.

In this testing mmap is not being used (cpio does not use mmap), so the page cache is not an issue. It does become an issue for ''cp -r'', though, where we see the I/O be substantially (and essentially permanently) reduced even further for impacted files until the filesystem is unmounted.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
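A quick, hedged way to confirm that /etc/system settings like the above actually took effect on the running kernel (standard mdb -k usage, run as root; the ::arc dcmd is present on recent Solaris 10 and OpenSolaris kernels):

echo "zfs_arc_max/E" | mdb -k      # live value of the ARC size cap, in bytes (decimal)
echo "::arc" | mdb -k              # summary of current ARC size, target and limits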
Mark Shellenbaum
2009-Jul-13 21:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue for > a week now. A 4X reduction in file read performance due to having read > the file before is terrible, and of course the situation is considerably > worse if the file was previously mmapped as well. Many of us have sent > a lot of money to Sun and were not aware that ZFS is sucking the life > out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. On that system there is a 1.8X read slowdown from the > file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest >I''ve opened the following bug to track this issue: 6859997 zfs caching performance problem We need to track down if/when this problem was introduced or if it has always been there. -Mark
Joerg Schilling
2009-Jul-13 21:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 13 Jul 2009, Joerg Schilling wrote:
> > cpio reads/writes in 8192 byte chunks from the filesystem.
>
> Yes, I was just reading the cpio manual page and see that. I think
> that re-reading the 128K zfs block 16 times to satisfy each request
> for 8192 bytes explains the 16X performance loss when caching is
> disabled. I don''t think that this is strictly a bug since it is what
> the database folks are looking for.

cpio spends 1.6x more SYStem CPU time than star. This may mainly be a result of the fact that cpio (when using the cpio archive format) reads/writes 512 byte blocks from/to the archive file.

cpio by default spends 19x more USER CPU time than star. This seems to be a result of the inappropriate header structure of the cpio archive format and the reblocking it requires, and cannot be easily changed (well, you could use "scpio" - in other words the "cpio" CLI personality of star - but this reduces the USER CPU time only by 10%-50% compared to Sun cpio).

cpio is a program from the past that does not fit well in our current world. The internal limits cannot be lifted without creating a new incompatible archive format. In other words: if you use cpio for your work, you have to live with its problems ;-)

If you like to play with different parameter values (e.g. read sizes), cpio is unsuitable for tests. Star allows you to set big filesystem read sizes by using the FIFO and playing with the FIFO size, and small filesystem read sizes by switching off the FIFO and playing with the archive block size.

Jörg
-- 
EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling
2009-Jul-13 21:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Mike Gerdts <mgerdts at gmail.com> wrote:> Using cpio''s -C option seems to not change the behavior for this bug, > but I did see a performance difference with the case where I hadn''t > modified the zfs caching behavior. That is, the performance of the > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >/dev/null". At this point cpio was spending roughly 13% usr and 87% > sys.As mentioned before, a lot of the user CPU time from cpio is spend to create cpio archive headers or caused by the fact that cpio archives copy the file content to unaligned archive locations while the "tar" archive format starts each new file on a modulo 512 offset in the archive. This requires a lot of unneeded copying of file data. You can of course slightly modify parameters even with cpio. I am not sure what you mean with "13% usr and 87%" as star typically spends 6% of the wall clock time in user+sys CPU where the user CPU time is typically only 1.5% of the system CPU time. In the "cached" case, it is obviously ZFS that''s responsible for the slow down, regardless what cpio did in the other case. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling
2009-Jul-13 21:32 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Mike Gerdts wrote: > > > > Using cpio''s -C option seems to not change the behavior for this bug, > > but I did see a performance difference with the case where I hadn''t > > modified the zfs caching behavior. That is, the performance of the > > tmpfs backed vdisk more than doubled with "cpio -o -C $((1024 * 1024)) > >> /dev/null". At this point cpio was spending roughly 13% usr and 87% > > sys. > > Interesting. I just updated zfs-cache-test.ksh on my web site so that > it uses 131072 byte blocks. I see a tiny improvement in performance > from doing this, but I do see a bit less CPU consumption so the CPU > consumption is essentially zero. The bug remains. It seems best to > use ZFS''s ideal block size so that issues don''t get confused.If you continue to use cpio and the cpio archive format, you force copying a lot of data as the cpio archive format does use odd header sizes and starts new files "unaligned" directly after the archive header. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Bob Friesenhahn
2009-Jul-13 21:41 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Jim Mauro wrote:
> Bob - Have you filed a bug on this issue? I am not up to speed on
> this thread, so I can not comment on whether or not there is a bug
> here, but you seem to have a test case and supporting data. Filing a
> bug will get the attention of ZFS engineering.

No, I have not filed a bug report yet. Any problem report to Sun''s Service department seems to require at least one day''s time.

I was curious to see if recent OpenSolaris suffers from the same problem, but posted results (thus far) are not as conclusive as they are for Solaris 10.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Mike Gerdts
2009-Jul-13 22:02 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 13, 2009 at 4:41 PM, Bob Friesenhahn<bfriesen at simple.dallas.tx.us> wrote:> On Mon, 13 Jul 2009, Jim Mauro wrote: > >> Bob - Have you filed a bug on this issue? I am not up to speed on this >> thread, so I can not comment on whether or not there is a bug here, but you >> seem to have a test case and supporting data. Filing a bug will get the >> attention of ZFS engineering. > > No, I have not filed a bug report yet. ?Any problem report to Sun''s Service > department seems to require at least one day''s time. > > I was curious to see if recent OpenSolaris suffers from the same problem, > but posted results (thus far) are not as conclusive as they are for Solaris > 10.It doesn''t seem to be quite as bad as S10, but there is certainly a hit. # /var/tmp/zfs-cache-test.ksh zfs create rpool/zfscachetest Creating data file set (400 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -o > /dev/null'' 6400033 blocks real 1m26.16s user 0m12.83s sys 0m25.88s Doing second ''cpio -o > /dev/null'' 6400033 blocks real 2m44.46s user 0m12.59s sys 0m24.34s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. # cat /etc/release OpenSolaris 2009.06 snv_111b SPARC Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 07 May 2009 # uname -srvp SunOS 5.11 snv_111b sparc -- Mike Gerdts http://mgerdts.blogspot.com/
Bob Friesenhahn
2009-Jul-13 22:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Mark Shellenbaum wrote:
> I''ve opened the following bug to track this issue:
>
> 6859997 zfs caching performance problem
>
> We need to track down if/when this problem was introduced or if it
> has always been there.

I think that it has always been there, for as long as I have been using ZFS (1-3/4 years). Sometimes it takes a while for me to wake up and smell the coffee.

Meanwhile I have opened a formal service request (IBIS 71326296) with Sun Support.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-13 22:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 13 Jul 2009, Joerg Schilling wrote:
> If you continue to use cpio and the cpio archive format, you force copying a
> lot of data as the cpio archive format does use odd header sizes and starts
> new files "unaligned" directly after the archive header.

Note that the output of cpio is sent to /dev/null in this test, so it is only the reading part which is significant as long as cpio''s CPU use is low.

Sun Service won''t have a clue about ''star'' since it is not part of Solaris 10. It is best to stick with what they know so the problem report won''t be rejected. If star is truly more efficient than cpio, it may make the difference even more obvious. What did you discover when you modified my test script to use ''star'' instead?

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
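Nobody has posted a star run in this thread yet; a hedged sketch of how the read pass might look with star, using its FIFO so that filesystem reads stay large. The fs= (FIFO size) and bs= (archive block size) option spellings are per star(1) and the directory is the one from the earlier test runs:

cd /Sun_2540/zfscachetest
time star -c f=/dev/null fs=32m bs=128k .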
Randy Jones
2009-Jul-14 01:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob: Sun v490, 4x1.35 processors, 32GB ram, Solaris 10u7 working with a raidz1 zpool made up of 6x146 sas drives on a j4200. Results of your running your script: # zfs-cache-test.ksh pool2 zfs create pool2/zfscachetest Creating data file set (6000 files of 8192000 bytes) under /pool2/zfscachetest ... Done! zfs unmount pool2/zfscachetest zfs mount pool2/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 96000512 blocks real 5m32.58s user 0m12.75s sys 2m56.58s Doing second ''cpio -C 131072 -o > /dev/null'' 96000512 blocks real 17m26.68s user 0m12.97s sys 4m34.33s Feel free to clean up with ''zfs destroy pool2/zfscachetest''. # Same results as you are seeing. Thanks Randy -- This message posted from opensolaris.org
Ok, build 117 does seem a lot better. The second run is slower, but not by such a huge margin. This was the end of the 98GB test:

Creating data file set (12000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) ''cpio -o > /dev/null''
192000985 blocks

real 26m17.80s
user 0m47.55s
sys 3m56.94s

Doing second ''cpio -o > /dev/null''
192000985 blocks

real 27m14.35s
user 0m46.84s
sys 4m39.85s
Jorgen, Am I right in thinking the numbers here don''t quite work. 48M blocks is just 9,000 files isn''t it, not 93,000? I''m asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I''m wondering if your test has done a similar thing. Ross> > x4540 running svn117 > > # ./zfs-cache-test.ksh zpool1 > zfs create zpool1/zfscachetest > creating data file set 93000 files of 8192000 bytes0 > under > /zpool1/zfscachetest ... > done1 > zfs unmount zpool1/zfscachetest > zfs mount zpool1/zfscachetest > > doing initial (unmount/mount) ''cpio -o . /dev/null'' > 48000247 blocks > > real 4m7.13s > user 0m9.27s > sys 0m49.09s > > doing second ''cpio -o . /dev/null'' > 48000247 blocks > > real 4m52.52s > user 0m9.13s > sys 0m47.51s > > > > > > > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss-- This message posted from opensolaris.org
Jorgen Lundman
2009-Jul-14 07:10 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have no idea. I downloaded the script from Bob without modifications and ran it specifying only the name of our pool. Should I have changed something to run the test? We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running svn117 for ZFS quotas. Worth trying on both? Lund Ross wrote:> Jorgen, > > Am I right in thinking the numbers here don''t quite work. 48M blocks is just 9,000 files isn''t it, not 93,000? > > I''m asking because I had to repeat a test earlier - I edited the script with vi, but when I ran it, it was still using the old parameters. I ignored it as a one off, but I''m wondering if your test has done a similar thing. > > Ross > > >> x4540 running svn117 >> >> # ./zfs-cache-test.ksh zpool1 >> zfs create zpool1/zfscachetest >> creating data file set 93000 files of 8192000 bytes0 >> under >> /zpool1/zfscachetest ... >> done1 >> zfs unmount zpool1/zfscachetest >> zfs mount zpool1/zfscachetest >> >> doing initial (unmount/mount) ''cpio -o . /dev/null'' >> 48000247 blocks >> >> real 4m7.13s >> user 0m9.27s >> sys 0m49.09s >> >> doing second ''cpio -o . /dev/null'' >> 48000247 blocks >> >> real 4m52.52s >> user 0m9.13s >> sys 0m47.51s >> >> >> >> >> >> >> >> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discu >> ss-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Aaah, nevermind, it looks like there''s just a rogue 9 that appeared in your output. It was just a standard run of 3,000 files.
Jorgen Lundman
2009-Jul-14 08:54 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ah yes, my apologies! I haven''t quite worked out why OsX VNC server can''t handle keyboard mappings. I have to copy''paste "@" even. As I pasted the output into my mail over VNC, it would have destroyed the (not very) "unusual" characters. Ross wrote:> Aaah, nevermind, it looks like there''s just a rogue 9 appeared in your output. It was just a standard run of 3,000 files.-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Kurt Schreiner
2009-Jul-14 10:33 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 08:54:36AM +0200, Ross wrote:> Ok, build 117 does seem a lot better. The second run is slower, > but not by such a huge margin.Hm, I can''t support this: SunOS fred 5.11 snv_117 sun4u sparc SUNW,Sun-Fire-V440 The system has 16GB of Ram, pool is mirrored over two FUJITSU-MBA3147NC.>-1007: sudo ksh zfs-cache-test.kshzfs create rpool/zfscachetest Creating data file set (4000 files of 8192000 bytes) under /rpool/zfscachetest ... Done! zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''tar to /dev/null'' real 5m12.61s user 0m0.30s sys 1m28.36s Doing second ''tar to /dev/null'' real 11m13.93s user 0m0.22s sys 1m37.41s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. user=2.32 sec, sys=343.41 sec, elapsed=23:39.41 min, cpu use=24.3% And here''s what arcstat.pl has to say when starting the second read: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 11:53:26 11K 895 7 41 0 854 100 13 100 13G 13G 11:53:27 12K 832 6 39 0 793 100 13 100 13G 13G 11:53:28 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:29 11K 832 7 39 0 793 100 13 76 13G 13G 11:53:30 12K 896 7 42 0 854 100 14 100 13G 13G 11:53:31 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:32 11K 768 6 36 0 732 100 12 100 13G 13G 11:53:33 11K 832 7 39 0 793 100 13 100 13G 13G 11:53:34 7K 497 7 253 3 244 99 4 11 13G 13G 11:53:35 5K 385 7 385 7 0 0 0 0 13G 13G 11:53:36 5K 374 7 374 7 0 0 0 0 13G 13G 11:53:37 5K 368 7 368 7 0 0 0 0 13G 13G 11:53:38 4K 340 7 340 7 0 0 0 0 13G 13G 11:53:39 5K 383 7 383 7 0 0 0 0 13G 13G 11:53:40 5K 406 7 406 7 0 0 0 0 13G 13G 11:53:41 4K 360 7 360 7 0 0 0 0 13G 13G 11:53:42 4K 328 7 328 7 0 0 0 0 13G 13G 11:53:43 4K 346 7 346 7 0 0 0 0 13G 13G 11:53:44 4K 346 7 346 7 0 0 0 0 13G 13G 11:53:45 4K 319 7 319 7 0 0 0 0 13G 13G 11:53:47 4K 337 7 337 7 0 0 0 0 13G 13G I used tar in this run instead of cpio, just to give it a try... [time (find . -type f | xargs -i tar cf /dev/null {} )] Another run with Bob''s new script: (rpool/zfscachetest not destroyed before this run, so wall clock time below is lower)>-1008: sudo ksh zfs-cache-test.ksh.1zfs unmount rpool/zfscachetest zfs mount rpool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 64000512 blocks real 4m40.25s user 0m7.96s sys 1m28.62s Doing second ''cpio -C 131072 -o > /dev/null'' 64000512 blocks real 11m0.08s user 0m7.37s sys 1m38.58s Feel free to clean up with ''zfs destroy rpool/zfscachetest''. user=15.35 sec, sys=187.87 sec, elapsed=15:43.65 min, cpu use=21.5% Not much difference to the "tar"-run... Kurt
Jorgen Lundman
2009-Jul-14 12:06 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I also ran this on my future RAID/NAS. Intel Atom 330 (D945GCLF2) dual core 1.6ghz, on a single HDD pool. svn_114, 64 bit, 2GB RAM. bash-3.23 ./zfs-cache-test.ksh zboot zfs create zboot/zfscachetest creating data file set (3000 files of 8192000 bytes) under /zboot/zfscachetest ... done1 zfs unmount zboot/zfscachetest zfs mount zboot/zfscachetest doing initial (unmount/mount) ''cpio -c 131072 -o . /dev/null'' 48000256 blocks real 7m45.96s user 0m6.55s sys 1m20.85s doing second ''cpio -c 131072 -o . /dev/null'' 48000256 blocks real 7m50.35s user 0m6.76s sys 1m32.91s feel free to clean up with ''zfs destroy zboot/zfscachetest''. Bob Friesenhahn wrote:> There has been no forward progress on the ZFS read performance issue for > a week now. A 4X reduction in file read performance due to having read > the file before is terrible, and of course the situation is considerably > worse if the file was previously mmapped as well. Many of us have sent > a lot of money to Sun and were not aware that ZFS is sucking the life > out of our expensive Sun hardware. > > It is trivially easy to reproduce this problem on multiple machines. For > example, I reproduced it on my Blade 2500 (SPARC) which uses a simple > mirrored rpool. On that system there is a 1.8X read slowdown from the > file being accessed previously. > > In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Implements a simple test. It requires a fair amount of disk space to > run, but the main requirement is that the disk space consumed be more > than available memory so that file data gets purged from the ARC. The > script needs to run as root since it creates a filesystem and uses > mount/umount. The script does not destroy any data. > > There are several adjustments which may be made at the front of the > script. The pool ''rpool'' is used by default, but the name of the pool > to test may be supplied via an argument similar to: > > # ./zfs-cache-test.ksh Sun_2540 > zfs create Sun_2540/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /Sun_2540/zfscachetest ... > Done! > zfs unmount Sun_2540/zfscachetest > zfs mount Sun_2540/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 2m54.17s > user 0m7.65s > sys 0m36.59s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 11m54.65s > user 0m7.70s > sys 0m35.06s > > Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. > > And here is a similar run on my Blade 2500 using the default rpool: > > # ./zfs-cache-test.ksh > zfs create rpool/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /rpool/zfscachetest ... > Done! > zfs unmount rpool/zfscachetest > zfs mount rpool/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 13m3.91s > user 2m43.04s > sys 9m28.73s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 23m50.27s > user 2m41.81s > sys 9m46.76s > > Feel free to clean up with ''zfs destroy rpool/zfscachetest''. > > I am interested to hear about systems which do not suffer from this bug. 
> > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
For what it''s worth, I just repeated that test. The timings are suspiciously similar. This is very definitely a reproducible bug:

zfs unmount rc-pool/zfscachetest
zfs mount rc-pool/zfscachetest

Doing initial (unmount/mount) ''cpio -o > /dev/null''
48000247 blocks

real 4m45.69s
user 0m10.22s
sys 0m53.29s

Doing second ''cpio -o > /dev/null''
48000247 blocks

real 15m47.48s
user 0m10.58s
sys 1m10.96s
Bob Friesenhahn
2009-Jul-14 15:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Ross, Please refresh your test script from the source. The current script tells cpio to use 128k blocks and mentions the proper command in its progress message. I have now updated it to display useful information about the system being tested, and to dump the pool configuration. It is really interesting seeing the various posted numbers. This is as close as it comes to a common benchmark. A sort of sanity check. What is most interesting to me is the reported performance for those who paid for really fast storage hardware and are using what should be really fast storage configurations. The reason why it is interesting is that there seems to be a hardware-independent cap on maximum read performance. It seems that ZFS''s read algorithm is rate-limiting the read so that regardless of how nice the hardware is, there is a peak read limit. There can be no other explanation as to why an ideal configuration of "Thumper II" SAS type hardware is neck and neck with my own setup, and quite similar to another fast system as well. My own setup is delivering less than 1/2 the performance that I would expect for the initial read (iozone says it can read 540MB/second from a huge file). Do the math and see if you think that zfs is giving you the read performance you expect based on your hardware. I think that we are encountering several bugs here. We also have a general read bottleneck. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
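One way to put numbers on the "do the math" comparison is to time a plain sequential read of a single large file through ZFS and set it against the cpio figures. A hedged sketch; the big file path is a placeholder, and cpio''s reported blocks are 512-byte units (48000247 blocks is roughly 24.6 GB):

time dd if=/Sun_2540/bigfile of=/dev/null bs=131072
# MB/s = bytes read / elapsed seconds; compare with 24.6 GB / the cpio elapsed time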
Bob Friesenhahn
2009-Jul-14 16:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Jorgen Lundman wrote:
> I have no idea. I downloaded the script from Bob without modifications and
> ran it specifying only the name of our pool. Should I have changed something
> to run the test?

If your system has quite a lot of memory, the number of files should be increased to at least match the amount of memory.

> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running
> svn117 for ZFS quotas. Worth trying on both?

It is useful to test as much as possible in order to fully understand the situation.

Since results often get posted without system details, the script is updated to dump some system info and the pool configuration. Refresh from

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
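A hedged helper for sizing a run on a given machine, assuming the goal stated above (total file data comfortably larger than RAM). The 1.5x margin and the variable names are my own, and the resulting count still has to be edited into the script by hand:

ram_mb=$(prtconf | awk '/^Memory size/ {print $3}')
# target 1.5x RAM of data; each file is 8192000 bytes, i.e. exactly 7.8125 MB
target_mb=$(( ram_mb * 3 / 2 ))
files=$(( target_mb * 10000 / 78125 ))
echo "RAM ${ram_mb} MB -> use at least ${files} files of 8192000 bytes"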
lists+zfs at xinu.tv
2009-Jul-14 16:32 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications >> and ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 >> running svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand > the situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh > from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussWhitebox Quad-core Phenom, 8G RAM, RAID-Z (3x1TB + 3x1.5TB) SATA drives via an AOC-USAS-L8i: System Configuration: Gigabyte Technology Co., Ltd. GA-MA770-DS3 System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: pool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM pool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c3t6d0 ONLINE 0 0 0 c3t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c3t1d0 ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 errors: No known data errors zfs create pool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /pool/zfscachetest ... Done! zfs unmount pool/zfscachetest zfs mount pool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 4m59.33s user 0m21.83s sys 2m56.05s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m28.11s user 0m22.66s sys 3m13.26s Feel free to clean up with ''zfs destroy pool/zfscachetest''.
Angelo Rajadurai
2009-Jul-14 16:47 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Just FYI. I ran a slightly different version of the test. I used SSD (for log & cache)! 3 x 32GB SSDs. 2 mirrored for log and one for cache. The systems is a 4150 with 12 GB of RAM. Here are the results $ pfexec ./zfs-cache-test.ksh sdpool System Configuration: System architecture: i386 System release level: 5.11 snv_111b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: sdpool state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Fri Jul 10 11:33:01 2009 config: NAME STATE READ WRITE CKSUM sdpool ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 logs ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 cache c8t4d0 ONLINE 0 0 0 errors: No known data errors zfs unmount sdpool/zfscachetest zfs mount sdpool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m27.06s user 0m2.05s sys 0m30.14s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 2m47.32s user 0m2.09s sys 0m32.32s Feel free to clean up with ''zfs destroy sdpool/zfscachetest''. -Angelo On Jul 14, 2009, at 12:09 PM, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without >> modifications and ran it specifying only the name of our pool. >> Should I have changed something to run the test? > > If your system has quite a lot of memory, the number of files should > be increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 >> running svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully > understand the situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. > Refresh from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Hi Bob, My guess is something like it''s single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. A quick summary of some of the figures, with times normalized for 3000 files: Sun x2200, single 500GB sata: 6m25.15s Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s Taking my single SATA drive as a base, a pool of mirrored SATA is almost exactly twice as quick which makes sense if ZFS is reading the file off both drives at once. The raid pool of SAS drives is quicker again, but for a single threaded request that also seems about right. The random read benefits of the mirror aren''t going to take effect unless you run multiple reads in parallel. What I suspect is helping here are the slightly better seek times of the SAS drives, along with slightly higher throughput due to the raid. What might be interesting would be to see the results off a ramdisk or SSD drive. Ross -- This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-14 18:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote:
> My guess is something like it''s single threaded, with each file
> dealt with in order and requests being serviced by just one or two
> disks at a time. With that being the case, an x4500 is essentially
> just running off 7200 rpm SATA drives, which really is nothing
> special.

Keep in mind that there is supposed to be file level read-ahead. As an example, ZFS is able to read from my array at up to 551 MB/second when reading from a huge (64GB) file, yet it is only managing 145MB/second or so for these 8MB files sequentially accessed by cpio. This suggests that even in the initial read case zfs is not applying enough file level read-ahead (or not applying it soon enough) to keep the disks busy. 8MB is still pretty big in the world of files. Perhaps it takes zfs a long time to decide that read-ahead is required.

I have yet to find a tunable for file level read-ahead. There are tunables for vdev-level read-ahead, but vdev read-ahead is pretty minor and increasing it may cause more harm than help.

> A quick summary of some of the figures, with times normalized for 3000 files:
>
> Sun x2200, single 500GB sata: 6m25.15s
> Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s
> Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s
> Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13s

And mine:

Ultra 40-M2 / StorageTek 2540, 6 sets of mirrored 300GB SAS: 2m44.20s

I think that Jorgen implied that his system is using SAN storage with a mirror across two jumbo LUNs.

> The raid pool of SAS drives is quicker again, but for a single
> threaded request that also seems about right. The random read
> benefits of the mirror aren''t going to take effect unless you run
> multiple reads in parallel. What I suspect is helping here are the
> slightly better seek times of the SAS drives, along with slightly
> higher throughput due to the raid.

Once ZFS decides to apply file level read-ahead then it can issue many reads in parallel. It should be able to keep at least six disks busy at once, leading to much better performance than we are seeing.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
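There may be no documented per-file read-ahead tunable, but whether file-level prefetch (zfetch) is engaged can at least be observed. A hedged sketch; the kernel variable and kstat names below are the ones these kernels normally carry and are assumptions to verify on the release under test:

echo "zfs_prefetch_disable/D" | mdb -k    # 0 means file-level prefetch is enabled
kstat -p zfs:0:zfetchstats 5              # sample the zfetch hit/miss counters during a run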
Gaëtan Lehmann
2009-Jul-14 19:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 14 Jul 2009, at 18:09, Bob Friesenhahn wrote:

> On Tue, 14 Jul 2009, Jorgen Lundman wrote:
>
>> I have no idea. I downloaded the script from Bob without
>> modifications and ran it specifying only the name of our pool.
>> Should I have changed something to run the test?
>
> If your system has quite a lot of memory, the number of files should
> be increased to at least match the amount of memory.
>
>> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2
>> running svn117 for ZFS quotas. Worth trying on both?
>
> It is useful to test as much as possible in order to fully
> understand the situation.
>
> Since results often get posted without system details, the script is
> updated to dump some system info and the pool configuration.
> Refresh from
>
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

Here is the result on another host with faster drives (SAS 10000 rpm) and Solaris 10u7.

System Configuration: Sun Microsystems SUN FIRE X4150
System architecture: i386
System release level: 5.10 Generic_139556-08
CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86

Pool configuration:
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real     4m56.84s
user     0m1.72s
sys      0m28.48s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    13m48.19s
user     0m2.07s
sys      0m44.45s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
Halldor Runar Haflidason
2009-Jul-14 20:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue Jul 14, 2009 at 11:09:32AM -0500, Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Jorgen Lundman wrote: > >> I have no idea. I downloaded the script from Bob without modifications and >> ran it specifying only the name of our pool. Should I have changed >> something to run the test? > > If your system has quite a lot of memory, the number of files should be > increased to at least match the amount of memory. > >> We have two kinds of x4500/x4540, those with Sol 10 10/08, and 2 running >> svn117 for ZFS quotas. Worth trying on both? > > It is useful to test as much as possible in order to fully understand the > situation. > > Since results often get posted without system details, the script is > updated to dump some system info and the pool configuration. Refresh from > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh > > Bob > -- > Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussAnd mine: dori at pax:1512 $ pfexec ./zfs-cache-test.ksh tank System Configuration: MICRO-STAR INTERNATIONAL CO.,LTD MS-7365 System architecture: i386 System release level: 5.11 snv_101b CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: tank state: ONLINE scrub: scrub completed after 3h30m with 0 errors on Tue Jul 7 19:38:45 2009 config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c4d0 ONLINE 0 0 0 c5d0 ONLINE 0 0 0 c7d0 ONLINE 0 0 0 errors: No known data errors zfs create tank/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /tank/zfscachetest ... Done! zfs unmount tank/zfscachetest zfs mount tank/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m19.62s user 0m2.07s sys 0m30.18s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 5m4.59s user 0m1.86s sys 0m34.06s Feel free to clean up with ''zfs destroy tank/zfscachetest''. -- Regards, D?ri
Richard Elling
2009-Jul-14 21:04 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Tue, 14 Jul 2009, Ross wrote: > >> My guess is something like it''s single threaded, with each file dealt >> with in order and requests being serviced by just one or two disks at >> a time. With that being the case, an x4500 is essentially just >> running off 7200 rpm SATA drives, which really is nothing special. > > Keep in mind that there is supposed to be file level read-ahead. As > an example, ZFS is able to read from my array at up to 551 MB/second > when reading from a huge (64GB) file yet it is only managing > 145MB/second or so for these 8MB files sequentially accessed by cpio. > This suggests that even for the initial read case that zfs is not > applying enough file level read-ahead (or applying it soon enough) to > keep the disks busy. 8MB is still pretty big in the world of files. > Perhaps it takes zfs a long time to decide that read-ahead is required. > > I have yet to find a tunable for file level read-ahead. There are > tunables for vdev-level read-ahead but vdev read-ahead pretty minor > read-ahead and increasing it may cause more harm than help.That is because file prefetch is dynamic. benr wrote a good blog on the subject and includes a DTrace script to monitor DMU prefetches. http://www.cuddletech.com/blog/pivot/entry.php?id=1040 -- richard
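For those without access to benr's script, a minimal substitute sketch is a one-liner that simply counts DMU prefetch calls per interval; it assumes the fbt provider can see a dmu_zfetch() entry point under that name in the running kernel, which may not hold on every release:

    # dtrace -qn 'fbt::dmu_zfetch:entry { @ = count(); } tick-5s { printa("dmu_zfetch() calls in last 5s: %@d\n", @); trunc(@); }'

Run it in one window while the cpio passes run in another; a large drop in the call rate on the second pass would support the hypothesis that prefetch is being skipped.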
Bob Friesenhahn
2009-Jul-14 21:36 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Richard Elling wrote:> > That is because file prefetch is dynamic. benr wrote a good blog on the > subject and includes a DTrace script to monitor DMU prefetches. > http://www.cuddletech.com/blog/pivot/entry.php?id=1040Apparently not dynamic enough. The provided DTrace script has a syntax error when used for Solaris 10 U7. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jakov Sosic
2009-Jul-14 22:23 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hi!

Do you think this issue will also be seen on ZVOLs that are exported as iSCSI targets?

--
This message posted from opensolaris.org
Jorgen Lundman
2009-Jul-15 00:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
3 servers contained within. Both x4500 and x4540 are setup the way Sun shipped to us. With minor changes (nfsservers=1024 etc). I was a little disappointed that they were identical in speed on round one, but the x4540 looked better part 2. Which I suspect is probably just OS version? x4500 Sol 10 100% idle, but with 3.86T existing data. 16GB memory, 4 core. x4500-03:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4500 System architecture: i386 System release level: 5.10 on10-public-x:s10idr_ldi:03/27/2009 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t0d0 ONLINE 0 0 0 c1t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c7t0d0 ONLINE 0 0 0 c8t0d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t1d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c8t1d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t2d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c8t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t6d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c8t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m1.58s user 0m1.92s sys 0m56.67s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m7.76s user 0m1.77s sys 1m6.82s Feel free to clean up with ''zfs destroy zpool1/zfscachetest''. x4540 Sol svn 117, 100% idle, completely empty, 32GB memory, 8 core. 
x4500-07:/var/tmp# ./zfs-cache-test.ksh zpool1 System Configuration: Sun Microsystems Sun Fire X4540 System architecture: i386 System release level: 5.11 snv_117 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zpool1 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zpool1 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t7d0 ONLINE 0 0 0 c4t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c2t1d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t0d0 ONLINE 0 0 0 c4t0d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t1d0 ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c4t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t4d0 ONLINE 0 0 0 c4t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c3t5d0 ONLINE 0 0 0 c4t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 spares c3t6d0 AVAIL c4t6d0 AVAIL c5t6d0 AVAIL c6t6d0 AVAIL errors: No known data errors zfs create zpool1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zpool1/zfscachetest ... Done! zfs unmount zpool1/zfscachetest zfs mount zpool1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m5.51s user 0m1.70s sys 0m29.53s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 4m7.63s user 0m1.67s sys 0m26.66s Feel free to clean up with ''zfs destroy zpool1/zfscachetest''. Intel Atom: bash-3.2# ./zfs-cache-test.ksh zboot System Configuration: System architecture: i386 System release level: 5.11 snv_114 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: zboot state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zboot ONLINE 0 0 0 c1d0s0 ONLINE 0 0 0 errors: No known data errors zfs create zboot/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /zboot/zfscachetest ... Done! zfs unmount zboot/zfscachetest zfs mount zboot/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m27.87s user 0m6.51s sys 1m20.28s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 7m25.34s user 0m6.63s sys 1m32.04s Feel free to clean up with ''zfs destroy zboot/zfscachetest''. -- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)
Scott Lawson
2009-Jul-15 00:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I added a second Lun identical in size as a mirror and reran test. Results are more in line with yours now. ./zfs-cache-test.ksh test1 System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server System architecture: sparc System release level: 5.10 Generic_139555-08 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: test1 state: ONLINE scrub: resilver completed after 0h0m with 0 errors on Wed Jul 15 11:38:54 2009 config: NAME STATE READ WRITE CKSUM test1 ONLINE 0 0 0 mirror ONLINE 0 0 0 c3t600A0B80005622640000039B4A257E11d0 ONLINE 0 0 0 c3t600A0B8000336DE2000004394A258B93d0 ONLINE 0 0 0 errors: No known data errors zfs create test1/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ... Done! zfs unmount test1/zfscachetest zfs mount test1/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m25.13s user 0m2.67s sys 0m28.40s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 8m53.05s user 0m2.69s sys 0m32.83s Feel free to clean up with ''zfs destroy test1/zfscachetest''. Scott Lawson wrote:> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool > called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /test1/zfscachetest ... > Done! > zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance issue >> for a week now. A 4X reduction in file read performance due to >> having read the file before is terrible, and of course the situation >> is considerably worse if the file was previously mmapped as well. >> Many of us have sent a lot of money to Sun and were not aware that >> ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple machines. >> For example, I reproduced it on my Blade 2500 (SPARC) which uses a >> simple mirrored rpool. 
On that system there is a 1.8X read slowdown >> from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see if >> they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> >> Implements a simple test. It requires a fair amount of disk space to >> run, but the main requirement is that the disk space consumed be more >> than available memory so that file data gets purged from the ARC. The >> script needs to run as root since it creates a filesystem and uses >> mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /rpool/zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this bug. >> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, >> http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- _________________________________________________________________________ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax : +64 09 968 7641 Mobile : +64 27 568 7611 mailto:scott at manukau.ac.nz http://www.manukau.ac.nz __________________________________________________________________________ perl -e ''print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'' __________________________________________________________________________
Bob Friesenhahn
2009-Jul-15 01:16 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote:> > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m1.58s > user 0m1.92s > sys 0m56.67s > > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m5.51s > user 0m1.70s > sys 0m29.53s >You have some mighty pools there. Something I find quite interesting is that those who have "mighty pools" generally obtain about the same data rate regardless of their relative degree of excessive "might". This causes me to believe that the Solaris kernel is throttling the read rate so that throwing more and faster hardware at the problem does not help. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jorgen Lundman
2009-Jul-15 02:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
> You have some mighty pools there. Something I find quite interesting is
> that those who have "mighty pools" generally obtain about the same data
> rate regardless of their relative degree of excessive "might". This
> causes me to believe that the Solaris kernel is throttling the read rate
> so that throwing more and faster hardware at the problem does not help.

Are you saying the X4500s we have are set up incorrectly, or in a way which will make them run poorly?

The servers came with no documentation nor advice. I have yet to find a good place that suggests configurations for dedicated x4500 NFS servers. We had to find out about NFSD_SERVERS when the first trouble came in (followed by 5 other tweaks and limits-reached troubles).

If Sun really wants to compete with NetApp, you'd think they would ship us hardware configured for NFS servers, not x4500s configured for desktops :( They are cheap though! Nothing like being the Wal-Mart of storage! That is how the pools were created as well. Admittedly it may be down to our vendor again.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
Bob Friesenhahn
2009-Jul-15 02:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Scott Lawson wrote:> > NAME STATE READ WRITE CKSUM > test1 ONLINE 0 0 0 > mirror ONLINE 0 0 0 > c3t600A0B80005622640000039B4A257E11d0 ONLINE 0 0 0 > c3t600A0B8000336DE2000004394A258B93d0 ONLINE 0 0 0 > > Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' > 48000256 blocks > > real 3m25.13s > user 0m2.67s > sys 0m28.40sIt is quite impressive that your little two disk mirror reads as fast as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible! Does this have something to do with your well-managed power and cooling? :-) Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-15 02:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 14 Jul 2009, Ross wrote:> Hi Bob, > > My guess is something like it''s single threaded, with each file dealt with in order and requests being serviced by just one or two disks at a time. With that being the case, an x4500 is essentially just running off 7200 rpm SATA drives, which really is nothing special. > > A quick summary of some of the figures, with times normalized for 3000 files: > > Sun x2200, single 500GB sata: 6m25.15s > Sun v490, raidz1 zpool of 6x146 sas drives on a j4200: 2m46.29s > Sun X4500, 7 sets of mirrored 500Gb SATA: 3m0.83s > Sun x4540, (unknown pool - Jorgen, what are you running?): 4m7.13sThis new one from Scott Lawson is incredible (but technically quite possible): SPARC Enterprise M3000, single SAS mirror pair: 3m25.13s Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-15 02:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Jorgen Lundman wrote:>> You have some mighty pools there. Something I find quite interesting is >> that those who have "mighty pools" generally obtain about the same data >> rate regardless of their relative degree of excessive "might". This causes >> me to believe that the Solaris kernel is throttling the read rate so that >> throwing more and faster hardware at the problem does not help. > > Are you saying the X4500s we have are set up incorrectly, or done in a way > which will make them run poorly?No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) currently have a software-imposed read bottleneck which places a limit on how well systems will perform on this simple sequential read benchmark. After a certain point (which is unfortunately not very high), throwing more hardware at the problem does not result in any speed improvement. This is demonstrated by Scott Lawson''s little two disk mirror almost producing the same performance as our much more exotic setups. Evidence suggests that SPARC systems are doing better than x86. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Scott Lawson
2009-Jul-15 03:55 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Scott Lawson wrote:
>>
>>         NAME                                       STATE     READ WRITE CKSUM
>>         test1                                      ONLINE       0     0     0
>>           mirror                                   ONLINE       0     0     0
>>             c3t600A0B80005622640000039B4A257E11d0  ONLINE       0     0     0
>>             c3t600A0B8000336DE2000004394A258B93d0  ONLINE       0     0     0

Each of these LUNs is a pair of 146GB 15K drives in a RAID1, on Crystal firmware on a 6140. The LUNs are 2km apart in different data centres: one LUN where the server is, one remote.

Interestingly, by creating the mirror vdev the first run got faster, and the second much, much slower. The second cpio took an extra 2 minutes by virtue of it being a mirror. I ran the script once again prior to adding the mirror and the results were pretty much the same as the first run posted (plus or minus a couple of seconds, which is to be expected as these LUNs are on prod arrays feeding other servers as well).

I will try these tests on some of my J4500s when I get a chance shortly. My interest is now piqued.

>>
>> Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
>> 48000256 blocks
>>
>> real     3m25.13s
>> user     0m2.67s
>> sys      0m28.40s
>
> It is quite impressive that your little two disk mirror reads as fast
> as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible!
>
> Does this have something to do with your well-managed power and
> cooling? :-)

Maybe it is Bob, maybe it is. ;) haha.

>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Scott Lawson
2009-Jul-15 04:11 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
This system has 32 GB of RAM so I will probbaly need to increase the data set size. [root at xxxxx tmp]#> ./zfs-cache-test.ksh nbupool System Configuration: Sun Microsystems sun4v SPARC Enterprise T5220 System architecture: sparc System release level: 5.10 Generic_141414-02 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: nbupool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nbupool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 c2t9d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 c2t13d0 ONLINE 0 0 0 c2t14d0 ONLINE 0 0 0 c2t15d0 ONLINE 0 0 0 c2t16d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t17d0 ONLINE 0 0 0 c2t18d0 ONLINE 0 0 0 c2t19d0 ONLINE 0 0 0 c2t20d0 ONLINE 0 0 0 c2t21d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t22d0 ONLINE 0 0 0 c2t23d0 ONLINE 0 0 0 c2t24d0 ONLINE 0 0 0 c2t25d0 ONLINE 0 0 0 c2t26d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t27d0 ONLINE 0 0 0 c2t28d0 ONLINE 0 0 0 c2t29d0 ONLINE 0 0 0 c2t30d0 ONLINE 0 0 0 c2t31d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t32d0 ONLINE 0 0 0 c2t33d0 ONLINE 0 0 0 c2t34d0 ONLINE 0 0 0 c2t35d0 ONLINE 0 0 0 c2t36d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t37d0 ONLINE 0 0 0 c2t38d0 ONLINE 0 0 0 c2t39d0 ONLINE 0 0 0 c2t40d0 ONLINE 0 0 0 c2t41d0 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c2t42d0 ONLINE 0 0 0 c2t43d0 ONLINE 0 0 0 c2t44d0 ONLINE 0 0 0 c2t45d0 ONLINE 0 0 0 c2t46d0 ONLINE 0 0 0 spares c2t47d0 AVAIL c2t48d0 AVAIL c2t49d0 AVAIL errors: No known data errors zfs create nbupool/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /nbupool/zfscachetest ... Done! zfs unmount nbupool/zfscachetest zfs mount nbupool/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 3m37.24s user 0m9.87s sys 1m54.08s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 1m59.11s user 0m9.93s sys 1m49.15s Feel free to clean up with ''zfs destroy nbupool/zfscachetest''. Scott Lawson wrote:> Bob, > > Output of my run for you. System is a M3000 with 16 GB RAM and 1 zpool > called test1 > which is contained on a raid 1 volume on a 6140 with 7.50.13.10 > firmware on > the RAID controllers. RAid 1 is made up of two 146GB 15K FC disks. > > This machine is brand new with a clean install of S10 05/09. It is > destined to become a Oracle 10 server with > ZFS filesystems for zones and DB volumes. > > [root at xxx /]#> uname -a > SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise > [root at xxx /]#> cat /etc/release > Solaris 10 5/09 s10s_u7wos_08 SPARC > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > Use is subject to license terms. > Assembled 30 March 2009 > > [root at xxx /]#> prtdiag -v | more > System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise > M3000 Server > System clock frequency: 1064 MHz > Memory size: 16384 Megabytes > > > Here is the run output for you. > > [root at xxx tmp]#> ./zfs-cache-test.ksh test1 > zfs create test1/zfscachetest > Creating data file set (3000 files of 8192000 bytes) under > /test1/zfscachetest ... > Done! 
> zfs unmount test1/zfscachetest > zfs mount test1/zfscachetest > > Doing initial (unmount/mount) ''cpio -o > /dev/null'' > 48000247 blocks > > real 4m48.94s > user 0m21.58s > sys 0m44.91s > > Doing second ''cpio -o > /dev/null'' > 48000247 blocks > > real 6m39.87s > user 0m21.62s > sys 0m46.20s > > Feel free to clean up with ''zfs destroy test1/zfscachetest''. > > Looks like a 25% performance loss for me. I was seeing around 80MB/s > sustained > on the first run and around 60M/''s sustained on the 2nd. > > /Scott. > > > Bob Friesenhahn wrote: >> There has been no forward progress on the ZFS read performance issue >> for a week now. A 4X reduction in file read performance due to >> having read the file before is terrible, and of course the situation >> is considerably worse if the file was previously mmapped as well. >> Many of us have sent a lot of money to Sun and were not aware that >> ZFS is sucking the life out of our expensive Sun hardware. >> >> It is trivially easy to reproduce this problem on multiple machines. >> For example, I reproduced it on my Blade 2500 (SPARC) which uses a >> simple mirrored rpool. On that system there is a 1.8X read slowdown >> from the file being accessed previously. >> >> In order to raise visibility of this issue, I invite others to see if >> they can reproduce it in their ZFS pools. The script at >> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh >> >> >> Implements a simple test. It requires a fair amount of disk space to >> run, but the main requirement is that the disk space consumed be more >> than available memory so that file data gets purged from the ARC. The >> script needs to run as root since it creates a filesystem and uses >> mount/umount. The script does not destroy any data. >> >> There are several adjustments which may be made at the front of the >> script. The pool ''rpool'' is used by default, but the name of the >> pool to test may be supplied via an argument similar to: >> >> # ./zfs-cache-test.ksh Sun_2540 >> zfs create Sun_2540/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /Sun_2540/zfscachetest ... >> Done! >> zfs unmount Sun_2540/zfscachetest >> zfs mount Sun_2540/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 2m54.17s >> user 0m7.65s >> sys 0m36.59s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 11m54.65s >> user 0m7.70s >> sys 0m35.06s >> >> Feel free to clean up with ''zfs destroy Sun_2540/zfscachetest''. >> >> And here is a similar run on my Blade 2500 using the default rpool: >> >> # ./zfs-cache-test.ksh >> zfs create rpool/zfscachetest >> Creating data file set (3000 files of 8192000 bytes) under >> /rpool/zfscachetest ... >> Done! >> zfs unmount rpool/zfscachetest >> zfs mount rpool/zfscachetest >> >> Doing initial (unmount/mount) ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 13m3.91s >> user 2m43.04s >> sys 9m28.73s >> >> Doing second ''cpio -o > /dev/null'' >> 48000247 blocks >> >> real 23m50.27s >> user 2m41.81s >> sys 9m46.76s >> >> Feel free to clean up with ''zfs destroy rpool/zfscachetest''. >> >> I am interested to hear about systems which do not suffer from this bug. 
>> >> Bob >> -- >> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, >> http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- _________________________________________________________________________ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax : +64 09 968 7641 Mobile : +64 27 568 7611 mailto:scott at manukau.ac.nz http://www.manukau.ac.nz __________________________________________________________________________ perl -e ''print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'' __________________________________________________________________________
Richard Elling
2009-Jul-15 05:37 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I think a picture is emerging that if you have enough RAM, the ARC is working very well. Which means that the ARC management is suspect. I propose the hypothesis that ARC misses are not prefetched. The first time through, prefetching works. For the second pass, ARC misses are not prefetched, so sequential reads go slower. For JBODs, the effect will be worse than for LUNs on a storage array with lots of cache. benr''s prefetch script will help shed light on this, but apparently doesn''t work for Solaris 10. Since the Solaris 10 source is not publicly available, someone with source access might need to adjust it to match the Solaris 10 source. -- richard
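One way to probe that hypothesis without source access is to sample the ARC prefetch counters between the two cpio passes; a minimal sketch using the same kstat statistics that arc_summary.pl reports (run it before and after each pass and diff the numbers):

    # kstat -p zfs:0:arcstats:prefetch_data_hits \
             zfs:0:arcstats:prefetch_data_misses \
             zfs:0:arcstats:prefetch_metadata_hits \
             zfs:0:arcstats:prefetch_metadata_misses

If the prefetch counters barely move during the second pass while the plain hit/miss counters climb, that would be consistent with prefetch not being applied to ARC misses.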
Joerg Schilling
2009-Jul-15 07:24 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Richard Elling <richard.elling at gmail.com> wrote:

> I think a picture is emerging that if you have enough RAM, the
> ARC is working very well. Which means that the ARC management
> is suspect.
>
> I propose the hypothesis that ARC misses are not prefetched. The
> first time through, prefetching works. For the second pass, ARC
> misses are not prefetched, so sequential reads go slower.

You may be right: it may be that the cache is never filled with the new, important data because it is already 100% full of unimportant data.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Yes, that makes sense. For the first run, the pool has only just been mounted, so the ARC will be empty, with plenty of space for prefetching. On the second run, however, the ARC is already full of the data that we just read, and I'm guessing that the prefetch code is less aggressive when there is already data in the ARC. For normal use that may be what you want - it's trying to keep things in the ARC in case they are needed. However, it does mean that ZFS prefetch is always going to suffer performance degradation on a live system, although early signs are that this might not be so severe in snv_117.

I wonder if there is any tuning that can be done to counteract this? Is there any way to tell ZFS to bias towards prefetching rather than preserving data in the ARC? That may provide better performance for scripts like this, or for random access workloads.

Also, could there be any generic algorithm improvements that could help? Why should ZFS keep data in the ARC if it hasn't been used? This script uses 8MB files, but the ARC should be using at least 1GB of RAM. That's a minimum of 128 files in memory, none of which would have been read more than once. If we're reading a new file now, prefetching should be able to displace any old object in the ARC that hasn't been used - in this case all 127 previous files should be candidates for replacement.

I wonder how that would interact with an L2ARC. If that was fast enough I'd certainly want to allocate more of the ARC to prefetching.

Finally, would it make sense for the ARC to always allow a certain percentage for prefetching, possibly with that percentage being tunable, allowing us to balance the needs of the two systems according to the expected usage?

Ross
--
This message posted from opensolaris.org
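For anyone who wants to try the L2ARC angle raised above, attaching an SSD as a cache device is a one-liner on builds recent enough to support it; the pool and device names below are placeholders, not from any system in this thread:

    # zpool add Sun_2540 cache c5t0d0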
My D. Truong
2009-Jul-15 14:07 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
> It would be good to see results from a few > OpenSolaris users running a > recent 64-bit kernel, and with fast storage to see if > this is an > OpenSolaris issue as well.Bob, Here''s an example of an OpenSolaris machine, 2008.11 upgraded to the 117 devel release. X4540, 32GB RAM. The file count was bumped up to 9000 to be a little over double the RAM. root at deviant:~# ./zfs-cache-test.ksh gauss System Configuration: Sun Microsystems Sun Fire X4540 System architecture: i386 System release level: 5.11 snv_117 CPU ISA list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86 Pool configuration: pool: gauss state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM gauss ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c8t1d0 ONLINE 0 0 0 c9t1d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c8t2d0 ONLINE 0 0 0 c9t2d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t3d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 c9t3d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t4d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 c8t4d0 ONLINE 0 0 0 c9t4d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c8t5d0 ONLINE 0 0 0 c9t5d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 c8t6d0 ONLINE 0 0 0 c9t6d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c4t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 c9t7d0 ONLINE 0 0 0 errors: No known data errors zfs create gauss/zfscachetest Creating data file set (9000 files of 8192000 bytes) under /gauss/zfscachetest ... Done! zfs unmount gauss/zfscachetest zfs mount gauss/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 144000768 blocks real 9m15.87s user 0m5.16s sys 1m29.32s Doing second ''cpio -C 131072 -o > /dev/null'' 144000768 blocks real 28m57.54s user 0m5.47s sys 1m50.32s Feel free to clean up with ''zfs destroy gauss/zfscachetest''. -- This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-15 14:59 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Ross wrote:> Yes, that makes sense. For the first run, the pool has only just > been mounted, so the ARC will be empty, with plenty of space for > prefetching.I don''t think that this hypothesis is quite correct. If you use ''zpool iostat'' to monitor the read rate while reading a large collection of files with total size far larger than the ARC, you will see that there is no fall-off in read performance once the ARC becomes full. The performance problem occurs when there is still metadata cached for a file but the file data has since been expunged from the cache. The implication here is that zfs speculates that the file data will be in the cache if the metadata is cached, and this results in a cache miss as well as disabling the file read-ahead algorithm. You would not want to do read-ahead on data that you already have in a cache. Recent OpenSolaris seems to take a 2X performance hit rather than the 4X hit that Solaris 10 takes. This may be due to improvement of existing algorithm function performance (optimizations) rather than a related design improvement.> I wonder if there is any tuning that can be done to counteract this? > Is there any way to tell ZFS to bias towards prefetching rather than > preserving data in the ARC? That may provide better performance for > scripts like this, or for random access workloads.Recent zfs development focus has been on how to keep prefetch from damaging applications like database where prefetch causes more data to be read than is needed. Since OpenSolaris now apparently includes an option setting which blocks file data caching and prefetch, this seems to open the door for use of more aggressive prefetch in the normal mode. In summary, I agree with Richard Elling''s hypothesis (which is the same as my own). Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
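The "option setting which blocks file data caching" mentioned above is probably the primarycache dataset property available in recent OpenSolaris builds (not in Solaris 10 at this point). A hedged sketch, reusing the test dataset name from earlier posts purely as an example:

    # zfs get primarycache gauss/zfscachetest
    # zfs set primarycache=metadata gauss/zfscachetest

Setting primarycache=metadata keeps only metadata in the ARC for that dataset, which is the sort of knob a database workload might want; primarycache=all restores the default.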
Bob Friesenhahn
2009-Jul-15 15:08 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, My D. Truong wrote:> > Here''s an example of an OpenSolaris machine, 2008.11 upgraded to the > 117 devel release. X4540, 32GB RAM. The file count was bumped up > to 9000 to be a little over double the RAM.Your timings show a 3.1X hit so it appears that the OpenSolaris improvement is not as much as was assumed. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2009-Jul-15 16:47 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Wed, 15 Jul 2009, Ross wrote: > >> Yes, that makes sense. For the first run, the pool has only just >> been mounted, so the ARC will be empty, with plenty of space for >> prefetching. > > I don''t think that this hypothesis is quite correct. If you use > ''zpool iostat'' to monitor the read rate while reading a large > collection of files with total size far larger than the ARC, you will > see that there is no fall-off in read performance once the ARC becomes > full.Unfortunately, "zpool iostat" doesn''t really tell you anything about performance. All it shows is bandwidth. Latency is what you need to understand performance, so use iostat.> The performance problem occurs when there is still metadata cached for > a file but the file data has since been expunged from the cache. The > implication here is that zfs speculates that the file data will be in > the cache if the metadata is cached, and this results in a cache miss > as well as disabling the file read-ahead algorithm. You would not > want to do read-ahead on data that you already have in a cache.I realized this morning that what I posted last night might be misleading to the casual reader. Clearly the first time through the data is prefetched and misses the cache. On the second pass, it should also miss the cache, if it were a simple cache. But the ARC tries to be more clever and has ghosts -- where the data is no longer in cache, but the metadata is. I suspect the prefetching is not being used for the ghosts. The arcstats will show this. As benr blogs, "These Ghosts lists are magic. If you get a lot of hits to the ghost lists, it means that ARC is WAY too small and that you desperately need either more RAM or an L2 ARC device (likely, SSD). Please note, if you are considering investing in L2 ARC, check this FIRST." http://www.cuddletech.com/blog/pivot/entry.php?id=979 This is the explicit case presented by your test. This also explains why the entry from the system with an L2ARC did not have the performance "problem." Also, another test would be to have two large files. Read from one, then the other, then from the first again. Capture arcstats from between the reads and see if the haunting stops ;-) -- richard
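A quick way to see whether the second pass is hitting the ghost lists described above is to sample the corresponding arcstats counters before and after the second cpio run and compare:

    # kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits

A large jump in these counters during the second pass would mean the data was recently evicted but its ghost entries remain, which is exactly the case being described.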
Bob Friesenhahn
2009-Jul-15 16:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote:> > Unfortunately, "zpool iostat" doesn''t really tell you anything about > performance. All it shows is bandwidth. Latency is what you need > to understand performance, so use iostat.You are still thinking about this as if it was a hardware-related problem when it is clearly not. Iostat is useful for analyzing hardware-related problems in the case where the workload is too much for the hardware, or the hardware is non-responsive. Anyone who runs this crude benchmark will discover that iostat shows hardly any disk utilization at all, latencies are low, and read I/O rates are low enough that they could be satisfied by a portable USB drive. You can even observe the blinking lights on the front of the drive array and see that it is lightly loaded. This explains why a two disk mirror is almost able to keep up with a system with 40 fast SAS drives. This is the opposite situation from the zfs writes which periodically push the hardware to its limits. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2009-Jul-15 17:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:> On Wed, 15 Jul 2009, Richard Elling wrote: >> >> Unfortunately, "zpool iostat" doesn''t really tell you anything about >> performance. All it shows is bandwidth. Latency is what you need >> to understand performance, so use iostat. > > You are still thinking about this as if it was a hardware-related > problem when it is clearly not. Iostat is useful for analyzing > hardware-related problems in the case where the workload is too much > for the hardware, or the hardware is non-responsive. Anyone who runs > this crude benchmark will discover that iostat shows hardly any disk > utilization at all, latencies are low, and read I/O rates are low > enough that they could be satisfied by a portable USB drive. You can > even observe the blinking lights on the front of the drive array and > see that it is lightly loaded. This explains why a two disk mirror is > almost able to keep up with a system with 40 fast SAS drives.heh. What you would be looking for is evidence of prefetching. If there is a lot of prefetching, the actv will tend to be high and latencies relatively low. If there is no prefetching, actv will be low and latencies may be higher. This also implies that if you use IDE disks, which cannot handle multiple outstanding I/Os, the performance will look similar for both runs. Or, you could get more sophisticated and use a dtrace script to look at the I/O behavior to determine the latency between contiguous I/O requests. Something like iopattern is a good start, though it doesn''t try to measure the time between requests, it would be easy to add. http://www.richardelling.com/Home/scripts-and-programs-1/iopattern -- richard
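For anyone repeating the test, the behaviour described above should be visible with plain iostat while cpio runs; watch the actv and asvc_t columns for the pool disks:

    # iostat -xnz 5

High actv with modest service times during the first pass, and near-zero actv during the second, would be consistent with prefetch being issued on the first pass and skipped on the second.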
Bob Friesenhahn
2009-Jul-15 17:49 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 15 Jul 2009, Richard Elling wrote:> > heh. What you would be looking for is evidence of prefetching. If > there is a lot of prefetching, the actv will tend to be high and > latencies relatively low. If there is no prefetching, actv will be > low and latencies may be higher. This also implies that if you use > IDE disks, which cannot handle multiple outstanding I/Os, the > performance will look similar for both runs.Ok, here are some stats for the "poor" (initial "USB" rates) and "terrible" (sub-"USB" rates) cases. "poor" (29% busy) iostat: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0 0.0 1.2 0.0 11.4 0.0 0.0 0.0 4.5 0 0 c1t1d0 91.2 0.0 11654.7 0.0 0.0 0.8 0.0 9.2 0 27 c4t600A0B80003A8A0B0000096147B451BEd0 95.0 0.0 12160.3 0.0 0.0 0.9 0.0 9.9 0 29 c4t600A0B800039C9B500000A9C47B4522Dd0 96.4 0.0 12333.1 0.0 0.0 0.9 0.0 9.5 0 29 c4t600A0B800039C9B500000AA047B4529Bd0 96.8 0.0 12377.9 0.0 0.0 0.9 0.0 9.5 0 30 c4t600A0B80003A8A0B0000096647B453CEd0 100.4 0.0 12845.1 0.0 0.0 1.0 0.0 9.5 0 29 c4t600A0B800039C9B500000AA447B4544Fd0 93.4 0.0 11949.1 0.0 0.0 0.8 0.0 9.0 0 28 c4t600A0B80003A8A0B0000096A47B4559Ed0 91.5 0.0 11705.9 0.0 0.0 0.9 0.0 9.7 0 28 c4t600A0B800039C9B500000AA847B45605d0 91.4 0.0 11680.3 0.0 0.0 0.9 0.0 10.1 0 29 c4t600A0B80003A8A0B0000096E47B456DAd0 88.9 0.0 11366.7 0.0 0.0 0.9 0.0 9.7 0 27 c4t600A0B800039C9B500000AAC47B45739d0 94.3 0.0 12045.5 0.0 0.0 0.9 0.0 9.9 0 29 c4t600A0B800039C9B500000AB047B457ADd0 96.5 0.0 12339.5 0.0 0.0 0.9 0.0 9.3 0 28 c4t600A0B80003A8A0B0000097347B457D4d0 87.9 0.0 11232.7 0.0 0.0 0.9 0.0 10.4 0 29 c4t600A0B800039C9B500000AB447B4595Fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c5t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t202400A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3t202500A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 freddy:vold(pid508) "terrible" (8% busy) iostat: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0 0.0 1.8 0.0 1.0 0.0 0.0 0.0 26.6 0 1 c1t1d0 26.8 0.0 3430.4 0.0 0.0 0.1 0.0 2.9 0 8 c4t600A0B80003A8A0B0000096147B451BEd0 21.0 0.0 2688.0 0.0 0.0 0.1 0.0 3.9 0 8 c4t600A0B800039C9B500000A9C47B4522Dd0 24.0 0.0 3059.6 0.0 0.0 0.1 0.0 3.4 0 8 c4t600A0B800039C9B500000AA047B4529Bd0 27.6 0.0 3532.8 0.0 0.0 0.1 0.0 3.2 0 9 c4t600A0B80003A8A0B0000096647B453CEd0 20.8 0.0 2662.4 0.0 0.0 0.1 0.0 3.1 0 6 c4t600A0B800039C9B500000AA447B4544Fd0 26.5 0.0 3392.0 0.0 0.0 0.1 0.0 2.6 0 7 c4t600A0B80003A8A0B0000096A47B4559Ed0 20.6 0.0 2636.8 0.0 0.0 0.1 0.0 3.0 0 6 c4t600A0B800039C9B500000AA847B45605d0 22.9 0.0 2931.2 0.0 0.0 0.1 0.0 3.8 0 9 c4t600A0B80003A8A0B0000096E47B456DAd0 21.4 0.0 2739.2 0.0 0.0 0.1 0.0 3.5 0 7 c4t600A0B800039C9B500000AAC47B45739d0 23.1 0.0 2944.4 0.0 0.0 0.1 0.0 3.7 0 9 c4t600A0B800039C9B500000AB047B457ADd0 24.9 0.0 3187.2 0.0 0.0 0.1 0.0 3.4 0 8 c4t600A0B80003A8A0B0000097347B457D4d0 28.3 0.0 3622.4 0.0 0.0 0.1 0.0 2.8 0 8 c4t600A0B800039C9B500000AB447B4595Fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c5t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c6t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t202400A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3t202500A0B83A8A0Bd31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 freddy:vold(pid508) Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, 
http://www.GraphicsMagick.org/
Aaah, ok, I think I understand now. Thanks Richard. I''ll grab the updated test and have a look at the ARC ghost results when I get back to work tomorrow. -- This message posted from opensolaris.org
James Andrewartha
2009-Jul-16 14:58 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sun, 2009-07-12 at 16:38 -0500, Bob Friesenhahn wrote:> In order to raise visibility of this issue, I invite others to see if > they can reproduce it in their ZFS pools. The script at > > http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.kshHere''s the results from two machines, the first has 12x400MHz US-II CPUs, 11GB of RAM and the disks are 18GB 10krpm SCSI in a split D1000: System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise 4000/5000 System architecture: sparc System release level: 5.11 snv_101 CPU ISA list: sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: space state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using ''zpool clear'' or replace the device with ''zpool replace''. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h22m with 0 errors on Mon Jul 13 17:18:55 2009 config: NAME STATE READ WRITE CKSUM space ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 c2t11d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 c2t10d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 c2t12d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c2t13d0 ONLINE 1 0 0 128K repaired errors: No known data errors zfs create space/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ... Done! zfs unmount space/zfscachetest zfs mount space/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 11m40.67s user 0m20.32s sys 5m27.16s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 31m29.42s user 0m19.31s sys 6m46.39s Feel free to clean up with ''zfs destroy space/zfscachetest''. The second has 2x1.2GHz US-III+, 4GB RAM and 10krpm FC disks on a single loop. System Configuration: Sun Microsystems sun4u Sun Fire 480R System architecture: sparc System release level: 5.11 snv_97 CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc Pool configuration: pool: space state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM space ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t34d0 ONLINE 0 0 0 c1t48d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t35d0 ONLINE 0 0 0 c1t49d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t36d0 ONLINE 0 0 0 c1t51d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t33d0 ONLINE 0 0 0 c1t52d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t38d0 ONLINE 0 0 0 c1t53d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t39d0 ONLINE 0 0 0 c1t54d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t40d0 ONLINE 0 0 0 c1t55d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t41d0 ONLINE 0 0 0 c1t56d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t42d0 ONLINE 0 0 0 c1t57d0 ONLINE 0 0 0 logs ONLINE 0 0 0 c1t50d0 ONLINE 0 0 0 errors: No known data errors zfs create space/zfscachetest Creating data file set (3000 files of 8192000 bytes) under /space/zfscachetest ... Done! zfs unmount space/zfscachetest zfs mount space/zfscachetest Doing initial (unmount/mount) ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 5m45.66s user 0m5.63s sys 1m14.66s Doing second ''cpio -C 131072 -o > /dev/null'' 48000256 blocks real 15m29.42s user 0m5.65s sys 1m37.83s Feel free to clean up with ''zfs destroy space/zfscachetest''. James Andrewartha
Bob Friesenhahn
2009-Jul-16 19:44 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I have received email that Sun CR numbers 6861397 & 6859997 have been created to get this performance problem fixed. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Marion Hakanson
2009-Jul-21 01:45 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
bfriesen at simple.dallas.tx.us said:> No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) > currently have a software-imposed read bottleneck which places a limit on > how well systems will perform on this simple sequential read benchmark. > After a certain point (which is unfortunately not very high), throwing more > hardware at the problem does not result in any speed improvement. This is > demonstrated by Scott Lawson''s little two disk mirror almost producing the > same performance as our much more exotic setups.Apologies for reawakening this thread -- I was away last week. Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I''ve definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. I''m not suggesting that the bug you''re demonstrating is not real. It''s clear that subsequent runs on the same system show the degradation, and that points out a problem. Rather, I''m thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. Regards, Marion
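A rough sketch of the multi-reader variant suggested above, assuming the file set created by zfs-cache-test.ksh is still mounted (the path follows Bob's Sun_2540 example; adjust for your pool). Four cpio readers each take every fourth file:

    cd /Sun_2540/zfscachetest
    for i in 0 1 2 3; do
        ls | awk -v n=$i 'NR % 4 == n' | cpio -C 131072 -o > /dev/null &
    done
    wait

Wrapping the loop in time would give a figure roughly comparable to the single-reader runs, at the cost of turning the workload into four interleaved sequential streams.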
Bob Friesenhahn
2009-Jul-21 02:52 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, 20 Jul 2009, Marion Hakanson wrote:
>
> Bob, have you tried changing your benchmark to be multithreaded? It
> occurs to me that maybe a single cpio invocation is another bottleneck.
> I've definitely experienced the case where a single bonnie++ process was
> not enough to max out the storage system.

It is likely that adding more cpios would cause more data to be read, but it would also thrash the disks with many more conflicting IOPS.

> I'm not suggesting that the bug you're demonstrating is not real. It's

It is definitely real. Sun has opened internal CR 6859997. It is now in Dispatched state at High priority.

> that points out a problem. Rather, I'm thinking that maybe the timing
> comparisons between low-end and high-end storage systems on this particular
> test are not revealing the whole story.

The similarity of performance between the low-end and high-end storage systems is a sign that the rotating rust is not a whole lot faster on the high-end storage systems. Since zfs is failing to use pre-fetch, only one (or maybe two) disks are accessed at a time. If more read I/Os are issued in parallel, then the data read rate will be vastly higher on the higher-end systems. With my 12 disk array and a large sequential read, zfs can issue 12 requests for 128K at once and since it can also queue pending I/Os, it can request many more than that. Care is required since over-reading will penalize the system. It is not an easy thing to get right.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Brent Jones
2009-Jul-21 04:14 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 20 Jul 2009, Marion Hakanson wrote:
>
> It is definitely real. Sun has opened internal CR 6859997. It is now in
> Dispatched state at High priority.
>

Is there a way we can get a Sun person on this list to supply a little bit more info on that CR? Seems there's a lot of people bitten by this, from low end to extremely high end hardware.

--
Brent Jones
brent at servuhome.net
Brad Diggs
2009-Jul-22 20:09 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS pre-fetching disabled altogether to see if the results are consistent between runs?

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail Bradley.Diggs at Sun.COM
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Ross wrote:
>
>> Yes, that makes sense. For the first run, the pool has only just
>> been mounted, so the ARC will be empty, with plenty of space for
>> prefetching.
>
> I don't think that this hypothesis is quite correct. If you use
> 'zpool iostat' to monitor the read rate while reading a large
> collection of files with total size far larger than the ARC, you
> will see that there is no fall-off in read performance once the ARC
> becomes full. The performance problem occurs when there is still
> metadata cached for a file but the file data has since been expunged
> from the cache. The implication here is that zfs speculates that
> the file data will be in the cache if the metadata is cached, and
> this results in a cache miss as well as disabling the file read-ahead
> algorithm. You would not want to do read-ahead on data that
> you already have in a cache.
>
> Recent OpenSolaris seems to take a 2X performance hit rather than
> the 4X hit that Solaris 10 takes. This may be due to improvement of
> existing algorithm function performance (optimizations) rather than
> a related design improvement.
>
>> I wonder if there is any tuning that can be done to counteract
>> this? Is there any way to tell ZFS to bias towards prefetching
>> rather than preserving data in the ARC? That may provide better
>> performance for scripts like this, or for random access workloads.
>
> Recent zfs development focus has been on how to keep prefetch from
> damaging applications like database where prefetch causes more data
> to be read than is needed. Since OpenSolaris now apparently
> includes an option setting which blocks file data caching and
> prefetch, this seems to open the door for use of more aggressive
> prefetch in the normal mode.
>
> In summary, I agree with Richard Elling's hypothesis (which is the
> same as my own).
>
> Bob
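As a side note on the measurement Bob describes in the quoted message, the read rate and prefetch behaviour can be watched while the test runs. The commands below are a hedged sketch: the pool name Sun_2540 is taken from the original post, and the prefetch counters are assumed to be present in the standard zfs arcstats kstat.

# Pool-wide and per-vdev read rates at 10-second intervals while a
# cpio pass runs (pool name from the original post):
zpool iostat -v Sun_2540 10

# ARC prefetch hit/miss counters, compared before and after a pass:
kstat -m zfs -n arcstats | grep -i prefetch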
Bob Friesenhahn
2009-Jul-22 23:51 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 22 Jul 2009, Roch wrote:
>
> HI Bob did you consider running the 2 runs with
>
>   echo zfs_prefetch_disable/W0t1 | mdb -kw
>
> and see if performance is constant between the 2 runs (and low).
> That would help clear the cause a bit. Sorry, I'd do it for
> you but since you have the setup etc...
>
> Revert with :
>
>   echo zfs_prefetch_disable/W0t0 | mdb -kw
>
> -r

I see that if I update my test script so that prefetch is disabled before the first cpio is executed, the read performance of the first cpio reported by 'zpool iostat' is similar to what has been normal for the second cpio case (i.e. 32MB/second). This seems to indicate that prefetch is entirely disabled if the file has ever been read before.

However, there is a new wrinkle in that the second cpio completes twice as fast with prefetch disabled even though 'zpool iostat' indicates the same consistent throughput. The difference goes away if I triple the number of files.

With 3000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    3m41.61s
user    0m0.44s
sys     0m8.12s

Doing second 'cpio -C 131072 -o > /dev/null'
14443520 blocks

real    1m50.12s
user    0m0.42s
sys     0m7.21s

Now if I increase the number of files to 9000 8.2MB files:

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m51.47s
user    0m4.46s
sys     1m20.11s

Doing second 'cpio -C 131072 -o > /dev/null'
144000768 blocks

real    35m22.41s
user    0m4.40s
sys     1m14.22s

Notice that with 3X the files, the throughput is dramatically reduced and the time is the same for both cases.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
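For anyone repeating this experiment, the mdb writes quoted above only change the running kernel. A hedged sketch of inspecting the current value and of the boot-persistent form (the /etc/system setting is the one described in the ZFS Evil Tuning Guide referenced later in this thread):

# Read the current value of the tunable (0 = prefetch enabled):
echo zfs_prefetch_disable/D | mdb -k

# Boot-persistent equivalent in /etc/system (remove the line and reboot
# to restore the default):
set zfs:zfs_prefetch_disable = 1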
Rich Morris
2009-Jul-28 21:13 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
> Sun has opened internal CR 6859997. It is now in Dispatched state at High priority.

CR 6859997 has been accepted and is actively being worked on. The following info has been added to that CR:

This is a problem with the ZFS file prefetch code (zfetch) in dmu_zfetch.c. The test script provided by the submitter (thanks Bob!) does no file prefetching the second time through each file. This problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris.

This test script creates 3000 files each 8M long so the amount of data (24G) is greater than the amount of memory (16G on a Thumper). With the default blocksize of 128k, each of the 3000 files has 63 blocks. The first time through, zfetch ramps up a single prefetch stream normally. But the second time through, dmu_zfetch() calls dmu_zfetch_find() which thinks that the data has already been prefetched so no additional prefetching is started.

This problem is not seen with 500 files each 48M in length (still 24G of data). In that case there's still only one prefetch stream but it is reclaimed when one of the requested offsets is not found. The reason it is not found is that the stream "strided" the first time through after reaching the zfetch cap, which is 256 blocks. Files with no more than 256 blocks don't require a stride. So this problem will only be seen when the data from a file with no more than 256 blocks is accessed after being tossed from the ARC.

The fix for this problem may be more feedback between the ARC and the zfetch code. Or it may make sense to restart the prefetch stream after some time has passed or perhaps whenever there's a miss on a block that was expected to have already been prefetched?

On a Thumper running Nevada build 118, the first pass of this test takes 2 minutes 50 seconds and the second pass takes 5 minutes 22 seconds. If dmu_zfetch_find() is modified to restart the prefetch stream when the requested offset is 0 and more than 2 seconds has passed since the stream was last accessed then the time needed for the second pass is reduced to 2 minutes 24 seconds.

Additional investigation is currently taking place to determine if another solution makes more sense. And more testing will be needed to see what effect this change has on other prefetch patterns.

6412053 is a related CR which mentions that the zfetch code may not be issuing I/O at a sufficient pace. This behavior is also seen on a Thumper running the test script in CR 6859997 since, even when prefetch is ramping up as expected, less than half of the available I/O bandwidth is being used. Although more aggressive file prefetching could increase memory pressure as described in CRs 6258102 and 6469558.

-- Rich
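The block counts in Rich's analysis follow from simple arithmetic on the file sizes and the 128K recordsize. A quick illustration (these numbers are just the arithmetic, not output taken from the CR):

# 8192000-byte files: 62 full 128K blocks plus one partial block = 63,
# which is under the 256-block zfetch cap, so these streams never stride.
echo $((8192000 / 131072)) $((8192000 % 131072))   # prints: 62 65536

# A 48M file spans 384 128K blocks, over the 256-block cap, so that
# stream strides and is later reclaimed when an offset is not found.
echo $((48 * 1024 * 1024 / 131072))                # prints: 384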
Bob Friesenhahn
2009-Jul-28 22:57 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote:
>
> 6412053 is a related CR which mentions that the zfetch code may not be
> issuing I/O at a sufficient pace. This behavior is also seen on a Thumper
> running the test script in CR 6859997 since, even when prefetch is ramping up
> as expected, less than half of the available I/O bandwidth is being used.
> Although more aggressive file prefetching could increase memory pressure as
> described in CRs 6258102 and 6469558.

It is good to see this analysis. Certainly the optimum prefetching required for an Internet video streaming server (with maybe 300 kilobits/second per stream) is radically different than what is required for uncompressed 2K preview (8MB/frame) of motion picture frames (320 megabytes/second per stream) but zfs should be able to support both.

Besides real-time analysis based on current stream behavior and memory, it would be useful to maintain some recent history for the whole pool so that a pool which is usually used for 1000 slow-speed video streams behaves differently by default than one used for one or two high-speed video streams. With this bit of hint information, files belonging to a pool recently producing high-speed streams can be ramped up quickly while files belonging to a pool which has recently fed low-speed streams can be ramped up more conservatively (until proven otherwise) in order to not flood memory and starve the I/O needed by other streams.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Jul-28 23:08 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 28 Jul 2009, Rich Morris wrote:
>
> The fix for this problem may be more feedback between the ARC and the zfetch
> code. Or it may make sense to restart the prefetch stream after some time
> has passed or perhaps whenever there's a miss on a block that was expected to
> have already been prefetched?

Regarding this approach of waiting for a prefetch miss, this seems like it would produce an uneven flow of data to the application and not ensure that data is always available when the application goes to read it. A stutter is likely to produce at least a 10ms gap (and possibly far greater) while the application is blocked in read() waiting for data. Since zfs blocks are large, stuttering becomes expensive, and if the application itself needs to read ahead 128K in order to avoid the stutter, then it consumes memory in an expensive non-sharable way. In the ideal case, zfs will always stay one 128K block ahead of the application's requirement and the unconsumed data will be cached in the ARC where it can be shared with other processes.

For an application with real-time data requirements, it is definitely desirable not to stutter at all if possible.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Rich Morris
2009-Sep-10 19:12 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 07/28/09 17:13, Rich Morris wrote:
> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>
>> Sun has opened internal CR 6859997. It is now in Dispatched state at
>> High priority.

CR 6859997 has recently been fixed in Nevada. This fix will also be in Solaris 10 Update 9.

This fix speeds up the sequential prefetch pattern described in this CR without slowing down other prefetch patterns. Some kstats have also been added to help improve the observability of ZFS file prefetching.

-- Rich

> CR 6859997 has been accepted and is actively being worked on. The
> following info has been added to that CR:
>
> This is a problem with the ZFS file prefetch code (zfetch) in
> dmu_zfetch.c. The test script provided by the submitter (thanks Bob!)
> does no file prefetching the second time through each file. This
> problem exists in ZFS in Solaris 10, Nevada, and OpenSolaris.
>
> This test script creates 3000 files each 8M long so the amount of data
> (24G) is greater than the amount of memory (16G on a Thumper). With
> the default blocksize of 128k, each of the 3000 files has 63 blocks.
> The first time through, zfetch ramps up a single prefetch stream
> normally. But the second time through, dmu_zfetch() calls
> dmu_zfetch_find() which thinks that the data has already been
> prefetched so no additional prefetching is started.
>
> This problem is not seen with 500 files each 48M in length (still 24G
> of data). In that case there's still only one prefetch stream but it
> is reclaimed when one of the requested offsets is not found. The
> reason it is not found is that stream "strided" the first time through
> after reaching the zfetch cap, which is 256 blocks. Files with no
> more than 256 blocks don't require a stride. So this problem will
> only be seen when the data from a file with no more than 256 blocks is
> accessed after being tossed from the ARC.
>
> The fix for this problem may be more feedback between the ARC and the
> zfetch code. Or it may make sense to restart the prefetch stream
> after some time has passed or perhaps whenever there's a miss on a
> block that was expected to have already been prefetched?
>
> On a Thumper running Nevada build 118, the first pass of this test
> takes 2 minutes 50 seconds and the second pass takes 5 minutes 22
> seconds. If dmu_zfetch_find() is modified to restart the refetch
> stream when the requested offset is 0 and more than 2 seconds has
> passed since the stream was last accessed then the time needed for the
> second pass is reduced to 2 minutes 24 seconds.
>
> Additional investigation is currently taking place to determine if
> another solution makes more sense. And more testing will be needed to
> see what affect this change has on other prefetch patterns.
>
> 6412053 is a related CR which mentions that the zfetch code may not be
> issuing I/O at a sufficient pace. This behavior is also seen on a
> Thumper running the test script in CR 6859997 since, even when
> prefetch is ramping up as expected, less than half of the available
> I/O bandwidth is being used. Although more aggressive file
> prefetching could increase memory pressure as described in CRs 6258102
> and 6469558.
>
> -- Rich
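Rich does not name the new kstats in this message. On builds that carry them, the file-prefetch counters would be expected to show up under the zfs kstat module; the kstat name zfetchstats below is an assumption based on later OpenSolaris builds, not something stated in this thread.

# Dump the ZFS file-prefetch counters once (kstat name is assumed):
kstat -m zfs -n zfetchstats

# Or sample them every 5 seconds while a test run is in progress:
kstat -m zfs -n zfetchstats 5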
Bob Friesenhahn
2009-Sep-10 20:17 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state at High
>>> priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be in
> Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this CR
> without slowing down other prefetch patterns. Some kstats have also been
> added to help improve the observability of ZFS file prefetching.

Excellent. What level of read improvement are you seeing? Is the prefetch rate improved, or does the fix simply avoid losing the prefetch?

Thanks,

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
eneal at businessgrade.com
2009-Sep-10 20:22 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:

> On Thu, 10 Sep 2009, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>> state at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also
>> be in Solaris 10 Update 9. This fix speeds up the sequential
>> prefetch pattern described in this CR without slowing down other
>> prefetch patterns. Some kstats have also been added to help
>> improve the observability of ZFS file prefetching.
>
> Excellent. What level of read improvement are you seeing? Is the
> prefetch rate improved, or does the fix simply avoid losing the
> prefetch?
>
> Thanks,
>
> Bob

Is this fixed in snv_122 or something else?
Henrik Johansson
2009-Sep-10 20:26 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Hello Rich,

On Sep 10, 2009, at 9:12 PM, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>> at High priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be
> in Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this
> CR without slowing down other prefetch patterns. Some kstats have
> also been added to help improve the observability of ZFS file
> prefetching.

Nice work, do you know if it will be released as a patch for s10u8 or will it only be part of the update 9 KUP?

Regards

Henrik
http://sparcv9.blogspot.com
Rich Morris
2009-Sep-10 20:35 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:17, Bob Friesenhahn wrote:
> On Thu, 10 Sep 2009, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>>> at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also be
>> in Solaris 10 Update 9. This fix speeds up the sequential prefetch
>> pattern described in this CR without slowing down other prefetch
>> patterns. Some kstats have also been added to help improve the
>> observability of ZFS file prefetching.
>
> Excellent. What level of read improvement are you seeing? Is the
> prefetch rate improved, or does the fix simply avoid losing the prefetch?

This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS prefetch appears to work well for most prefetch patterns. But this CR found a pattern that should have worked well but did not.

-- Rich
Bob Friesenhahn
2009-Sep-10 21:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Thu, 10 Sep 2009, Rich Morris wrote:
>>
>> Excellent. What level of read improvement are you seeing? Is the prefetch
>> rate improved, or does the fix simply avoid losing the prefetch?
>
> This fix avoids using a prefetch stream when it is no longer valid. BTW, ZFS
> prefetch appears to work well for most prefetch patterns. But this CR found
> a pattern that should have worked well but did not.

It seems that after doing a fresh mount, the zfs prefetch is not quite enough to keep my hungry highly-tuned application sufficiently well fed. I will have to wait and see though.

In the mean time, I need to investigate why recent Solaris 10 kernel patches (141415-10) cause my Sun Ultra-40M2 system to panic five minutes into 'zpool scrub' with a fault being reported against the motherboard. Maybe a few more motherboard swaps will solve it (on 4th motherboard now). 141415-3 seems less likely to panic since it survives a full scrub (unless VirtualBox is running a Linux instance).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Rich Morris
2009-Sep-11 14:02 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On 09/10/09 16:22, eneal at businessgrade.com wrote:
> Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>
>> On Thu, 10 Sep 2009, Rich Morris wrote:
>>
>>> On 07/28/09 17:13, Rich Morris wrote:
>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>
>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>> state at High priority.
>>>
>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>> be in Solaris 10 Update 9. This fix speeds up the sequential
>>> prefetch pattern described in this CR without slowing down other
>>> prefetch patterns. Some kstats have also been added to help
>>> improve the observability of ZFS file prefetching.
>>
>> Excellent. What level of read improvement are you seeing? Is the
>> prefetch rate improved, or does the fix simply avoid losing the
>> prefetch?
>>
>> Thanks,
>>
>> Bob
>
> Is this fixed in snv_122 or something else?

snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997
Christian Kendi
2009-Sep-13 15:40 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Is a diff for the source already available?

On Sep 11, 2009, at 4:02 PM, Rich Morris wrote:
> On 09/10/09 16:22, eneal at businessgrade.com wrote:
>> Quoting Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>
>>> On Thu, 10 Sep 2009, Rich Morris wrote:
>>>
>>>> On 07/28/09 17:13, Rich Morris wrote:
>>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>>
>>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>>> state at High priority.
>>>>
>>>> CR 6859997 has recently been fixed in Nevada. This fix will
>>>> also be in Solaris 10 Update 9. This fix speeds up the
>>>> sequential prefetch pattern described in this CR without slowing
>>>> down other prefetch patterns. Some kstats have also been added
>>>> to help improve the observability of ZFS file prefetching.
>>>
>>> Excellent. What level of read improvement are you seeing? Is the
>>> prefetch rate improved, or does the fix simply avoid losing the
>>> prefetch?
>>>
>>> Thanks,
>>>
>>> Bob
>>
>> Is this fixed in snv_122 or something else?
>
> snv_124. See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6859997
Dale Ghent
2009-Sep-15 20:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
> On 07/28/09 17:13, Rich Morris wrote:
>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>
>>> Sun has opened internal CR 6859997. It is now in Dispatched state
>>> at High priority.
>
> CR 6859997 has recently been fixed in Nevada. This fix will also be
> in Solaris 10 Update 9.
> This fix speeds up the sequential prefetch pattern described in this
> CR without slowing down other prefetch patterns. Some kstats have
> also been added to help improve the observability of ZFS file
> prefetching.

Awesome that the fix exists. I've been having a hell of a time with device-level prefetch on my iscsi clients causing tons of ultimately useless IO and have resorted to setting zfs_vdev_cache_max=1.

Question though... why is a bug fix that can be a watershed for performance held back for so long? s10u9 won't be available for at least 6 months from now, and with a huge environment, I try hard not to live off of IDRs.

Am I the only one that thinks this is way too conservative? It's just maddening to know that a highly beneficial fix is out there, but its release is based on time rather than need. Sustaining really needs to be more proactive when it comes to this stuff.

/dale
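For reference, the tunable Dale mentions is applied like any other ZFS tunable; the sketch below is hedged (the live-change syntax mirrors the mdb commands shown earlier in the thread, and the claim that 1 effectively disables the device-level read-ahead is my understanding, not confirmed here):

# Live change, using the same mdb mechanism shown earlier for
# zfs_prefetch_disable:
echo zfs_vdev_cache_max/W0t1 | mdb -kw

# Boot-persistent form in /etc/system:
set zfs:zfs_vdev_cache_max = 1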
Richard Elling
2009-Sep-15 21:21 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>
>> On 07/28/09 17:13, Rich Morris wrote:
>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>
>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>> state at High priority.
>>
>> CR 6859997 has recently been fixed in Nevada. This fix will also
>> be in Solaris 10 Update 9.
>> This fix speeds up the sequential prefetch pattern described in
>> this CR without slowing down other prefetch patterns. Some kstats
>> have also been added to help improve the observability of ZFS file
>> prefetching.
>
> Awesome that the fix exists. I've been having a hell of a time with
> device-level prefetch on my iscsi clients causing tons of ultimately
> useless IO and have resorted to setting zfs_vdev_cache_max=1.

This only affects metadata. Wouldn't it be better to disable prefetching for data?
 -- richard

> Question though... why is bug fix that can be a watershed for
> performance be held back for so long? s10u9 won't be available for
> at least 6 months from now, and with a huge environment, I try hard
> not to live off of IDRs.
>
> Am I the only one that thinks this is way too conservative? It's
> just maddening to know that a highly beneficial fix is out there,
> but its release is based on time rather than need. Sustaining really
> needs to be more proactive when it comes to this stuff.
>
> /dale
Dale Ghent
2009-Sep-15 21:38 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 5:21 PM, Richard Elling wrote:
>
> On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
>
>> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>>
>>> On 07/28/09 17:13, Rich Morris wrote:
>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>
>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>> state at High priority.
>>>
>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>> be in Solaris 10 Update 9.
>>> This fix speeds up the sequential prefetch pattern described in
>>> this CR without slowing down other prefetch patterns. Some kstats
>>> have also been added to help improve the observability of ZFS file
>>> prefetching.
>>
>> Awesome that the fix exists. I've been having a hell of a time with
>> device-level prefetch on my iscsi clients causing tons of
>> ultimately useless IO and have resorted to setting
>> zfs_vdev_cache_max=1.
>
> This only affects metadata. Wouldn't it be better to disable
> prefetching for data?

Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did provide relief.

Just a general description of my environment:

My setup consists of several s10uX iscsi clients which get LUNs from pairs of thumpers. Each thumper pair exports identical LUNs to each iscsi client, and the client in turn mirrors each LUN pair inside a local zpool. As more space is needed on a client, a new LUN is created on the pair of thumpers, exported to the iscsi client, which then picks it up and we add a new mirrored vdev to the client's existing zpool.

This is so we have data redundancy across chassis, so if one thumper were to fail or need patching, etc, the iscsi clients just see one side of their mirrors drop out.

The problem that we observed on the iscsi clients was that, when viewing things through 'zpool iostat -v', far more IO was being requested from the LUs than was being registered for the vdev those LUs were a member of.

Being that this was an iscsi setup with stock thumpers (no SSD ZIL, L2ARC) serving the LUs, this apparent overhead caused far more unnecessary disk IO on the thumpers, thus starving out IO for data that was actually needed.

The working set is lots of small-ish files, entirely random IO.

If zfs_vdev_cache_max only affects metadata prefetches, which parameter affects data prefetches? I have to admit that disabling device-level prefetching was a shot in the dark, but it did result in drastically reduced contention on the thumpers.

/dale

>> Question though... why is bug fix that can be a watershed for
>> performance be held back for so long? s10u9 won't be available for
>> at least 6 months from now, and with a huge environment, I try hard
>> not to live off of IDRs.
>>
>> Am I the only one that thinks this is way too conservative? It's
>> just maddening to know that a highly beneficial fix is out there,
>> but its release is based on time rather than need. Sustaining
>> really needs to be more proactive when it comes to this stuff.
>>
>> /dale
Richard Elling
2009-Sep-15 22:10 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Reference below...

On Sep 15, 2009, at 2:38 PM, Dale Ghent wrote:
> On Sep 15, 2009, at 5:21 PM, Richard Elling wrote:
>
>> On Sep 15, 2009, at 1:03 PM, Dale Ghent wrote:
>>
>>> On Sep 10, 2009, at 3:12 PM, Rich Morris wrote:
>>>
>>>> On 07/28/09 17:13, Rich Morris wrote:
>>>>> On Mon, Jul 20, 2009 at 7:52 PM, Bob Friesenhahn wrote:
>>>>>
>>>>>> Sun has opened internal CR 6859997. It is now in Dispatched
>>>>>> state at High priority.
>>>>
>>>> CR 6859997 has recently been fixed in Nevada. This fix will also
>>>> be in Solaris 10 Update 9.
>>>> This fix speeds up the sequential prefetch pattern described in
>>>> this CR without slowing down other prefetch patterns. Some
>>>> kstats have also been added to help improve the observability of
>>>> ZFS file prefetching.
>>>
>>> Awesome that the fix exists. I've been having a hell of a time
>>> with device-level prefetch on my iscsi clients causing tons of
>>> ultimately useless IO and have resorted to setting
>>> zfs_vdev_cache_max=1.
>>
>> This only affects metadata. Wouldn't it be better to disable
>> prefetching for data?
>
> Well, that's a surprise to me, but the zfs_vdev_cache_max=1 did
> provide relief.
>
> Just a general description of my environment:
>
> My setup consists of several s10uX iscsi clients which get LUNs from
> a pairs of thumpers. Each thumper pair exports identical LUNs to
> each iscsi client, and the client in turn mirrors each LUN pair
> inside a local zpool. As more space is needed on a client, a new LUN
> is created on the pair of thumpers, exported to the iscsi client,
> which then picks it up and we add a new mirrored vdev to the
> client's existing zpool.
>
> This is so we have data redundancy across chassis, so if one thumper
> were to fail or need patching, etc, the iscsi clients just see one
> side of their mirrors drop out.
>
> The problem that we observed on the iscsi clients was that, when
> viewing things through 'zpool iostat -v', far more IO was being
> requested from the LUs than was being registered for the vdev those
> LUs were a member of.
>
> Being that this was an iscsi setup with stock thumpers (no SSD ZIL,
> L2ARC) serving the LUs, this apparent overhead caused far more
> unnecessary disk IO on the thumpers, thus starving out IO for data
> that was actually needed.
>
> The working set is lots of small-ish files, entirely random IO.
>
> If zfs_vdev_cache_max only affects metadata prefetches, which
> parameter affects data prefetches?

There are two main areas for prefetch: at the transactional object layer (DMU) and the pooled storage level (VDEV). zfs_vdev_cache_max works at the VDEV level, obviously. The DMU knows more about the context of the data and is where the intelligent prefetching algorithm works.

You can easily observe the VDEV cache statistics with kstat:

# kstat -n vdev_cache_stats
module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          38.83342625
        delegations                     14030
        hits                            105169
        misses                          59452
        snaptime                        4564628.18130739

This represents a 59% cache hit rate, which is pretty decent. But you will notice far fewer delegations+hits+misses than real IOPS because it is only caching metadata.

Unfortunately, there is not a kstat for showing the DMU cache stats. But a DTrace script can be written or, even easier, lockstat will show if you are spending much time in the zfetch_* functions.

More details are in the Evil Tuning Guide, including how to set zfs_prefetch_disable:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

> I have to admit that disabling device-level prefetching was a shot
> in the dark, but it did result in drastically reduced contention on
> the thumpers.

That is a little bit surprising. I would expect little metadata activity for iscsi service. It would not be surprising for older Solaris 10 releases, though. It was fixed in NV b70, circa July 2007.
 -- richard
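Richard's lockstat suggestion can be tried with the kernel profiling mode; the sketch below is one common invocation (the sampling window and the -D cutoff are arbitrary choices, not taken from this thread):

# Profile the kernel for 30 seconds and report the top 20 entries; time
# spent in zfetch_* functions shows up in the caller column.
lockstat -kIW -D 20 sleep 30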
Bob Friesenhahn
2009-Sep-15 22:28 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote:
>
> Question though... why is bug fix that can be a watershed for
> performance be held back for so long? s10u9 won't be available for
> at least 6 months from now, and with a huge environment, I try hard
> not to live off of IDRs.

As someone who currently faces kernel panics with recent U7+ kernel patches (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take the time to make sure that the implementation is as good as it can be and is thoroughly tested before release.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Dale Ghent
2009-Sep-15 23:03 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sep 15, 2009, at 6:28 PM, Bob Friesenhahn wrote:
> On Tue, 15 Sep 2009, Dale Ghent wrote:
>>
>> Question though... why is bug fix that can be a watershed for
>> performance be held back for so long? s10u9 won't be available for
>> at least 6 months from now, and with a huge environment, I try hard
>> not to live off of IDRs.
>
> As someone who currently faces kernel panics with recent U7+ kernel
> patches (on AMD64 and SPARC) related to PCI bus upset, I expect that
> Sun will take the time to make sure that the implementation is as
> good as it can be and is thoroughly tested before release.

Are you referring to the same testing that gained you this PCI panic feature in s10u7?

Testing is a no-brainer, and I would expect that there already exists some level of assurance that a CR fix is correct at the point of putback. But I've dealt with many bugs both very recently and long in the past where a fix has existed in nevada for months, even a year, before I got bit by the same bug in s10 and then had to go through the support channels to A) convince whomever I'm talking to that, yes, I'm hitting this bug, B) yes, there is a fix, and then C) pretty please can I have an IDR.

Just this week I'm wrapping up testing of an IDR which addresses an e1000g hardware errata that was fixed in onnv earlier this year in February. For something that addresses a hardware issue on an Intel chipset used on shipping Sun servers, one would think that Sustaining would be on the ball and get that integrated ASAP.

But the current mode of operation appears to be "no CR, no backport", which leaves us customers needlessly running into bugs and then begging for their fixes... or hearing the dreaded "oh that fix will be available two updates from now." Not cool.

/dale
Bob Friesenhahn
2009-Sep-16 03:29 UTC
[zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Tue, 15 Sep 2009, Dale Ghent wrote:
>>
>> As someone who currently faces kernel panics with recent U7+ kernel patches
>> (on AMD64 and SPARC) related to PCI bus upset, I expect that Sun will take
>> the time to make sure that the implementation is as good as it can be and
>> is thoroughly tested before release.
>
> Are you referring the the same testing that gained you this PCI panic feature
> in s10u7?

No. The system worked with the kernel patch corresponding to baseline S10U7. Problems started with later kernel patches (which seem to be much less tested). Of course there could actually be a real hardware problem.

Regardless, when the integrity of our data is involved, I prefer to wait for more testing rather than to potentially have to recover the pool from backup.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/