repost - Sorry for cc'ing the other forums.

I'm running into an issue where there seems to be a high number of read IOPS hitting the disks, and physical free memory is fluctuating between 200MB and 450MB out of 16GB total. We have the L2ARC configured on a 32GB Intel X25-E SSD and the slog on another 32GB X25-E SSD. According to our tester, Oracle writes are extremely slow (high latency). Below is a snippet of iostat:

    r/s    w/s  Mr/s  Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 4898.3   34.2  23.2   1.4   0.1 385.3     0.0    78.1   0 1246  c1
    0.0    0.8   0.0   0.0   0.0   0.0     0.0    16.0   0    1  c1t0d0
  401.7    0.0   1.9   0.0   0.0  31.5     0.0    78.5   1  100  c1t1d0
  421.2    0.0   2.0   0.0   0.0  30.4     0.0    72.3   1   98  c1t2d0
  403.9    0.0   1.9   0.0   0.0  32.0     0.0    79.2   1  100  c1t3d0
  406.7    0.0   2.0   0.0   0.0  33.0     0.0    81.3   1  100  c1t4d0
  414.2    0.0   1.9   0.0   0.0  28.6     0.0    69.1   1   98  c1t5d0
  406.3    0.0   1.8   0.0   0.0  32.1     0.0    79.0   1  100  c1t6d0
  404.3    0.0   1.9   0.0   0.0  31.9     0.0    78.8   1  100  c1t7d0
  404.1    0.0   1.9   0.0   0.0  34.0     0.0    84.1   1  100  c1t8d0
  407.1    0.0   1.9   0.0   0.0  31.2     0.0    76.6   1  100  c1t9d0
  407.5    0.0   2.0   0.0   0.0  33.2     0.0    81.4   1  100  c1t10d0
  402.8    0.0   2.0   0.0   0.0  33.5     0.0    83.2   1  100  c1t11d0
  408.9    0.0   2.0   0.0   0.0  32.8     0.0    80.3   1  100  c1t12d0
    9.6   10.8   0.1   0.9   0.0   0.4     0.0    20.1   0   17  c1t13d0
    0.0   22.7   0.0   0.5   0.0   0.5     0.0    22.8   0   33  c1t14d0

Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order in which a read request is satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.

CACHE HITS BY DATA TYPE:
    Demand Data:        22%  158853174
    Prefetch Data:      17%  123009991  <--- not helping???
    Demand Metadata:    60%  437439104
    Prefetch Metadata:   0%    2446824

The write IOPS then started to kick in more and latency dropped on the spinning disks:

    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 1629.0  968.0  17.4   7.3   0.0  35.9     0.0    13.8   0 1088  c1
    0.0    1.9   0.0   0.0   0.0   0.0     0.0     1.7   0    0  c1t0d0
  126.7   67.3   1.4   0.2   0.0   2.9     0.0    14.8   0   90  c1t1d0
  129.7   76.1   1.4   0.2   0.0   2.8     0.0    13.7   0   90  c1t2d0
  128.0   73.9   1.4   0.2   0.0   3.2     0.0    16.0   0   91  c1t3d0
  128.3   79.1   1.3   0.2   0.0   3.6     0.0    17.2   0   92  c1t4d0
  125.8   69.7   1.3   0.2   0.0   2.9     0.0    14.9   0   89  c1t5d0
  128.3   81.9   1.4   0.2   0.0   2.8     0.0    13.1   0   89  c1t6d0
  128.1   69.2   1.4   0.2   0.0   3.1     0.0    15.7   0   93  c1t7d0
  128.3   80.3   1.4   0.2   0.0   3.1     0.0    14.7   0   91  c1t8d0
  129.2   69.3   1.4   0.2   0.0   3.0     0.0    15.2   0   90  c1t9d0
  130.1   80.0   1.4   0.2   0.0   2.9     0.0    13.6   0   89  c1t10d0
  126.2   72.6   1.3   0.2   0.0   2.8     0.0    14.2   0   89  c1t11d0
  129.7   81.0   1.4   0.2   0.0   2.7     0.0    12.9   0   88  c1t12d0
   90.4   41.3   1.0   4.0   0.0   0.2     0.0     1.2   0    6  c1t13d0
    0.0   24.3   0.0   1.2   0.0   0.0     0.0     0.2   0    0  c1t14d0

Is it true that if your MFU stats start to go over 50%, more memory is needed?

CACHE HITS BY CACHE LIST:
    Anon:                        10%   74845266              [ New Customer, First Cache Hit ]
    Most Recently Used:          19%  140478087 (mru)        [ Return Customer ]
    Most Frequently Used:        65%  475719362 (mfu)        [ Frequent Customer ]
    Most Recently Used Ghost:     2%   20785604 (mru_ghost)  [ Return Customer Evicted, Now Back ]
    Most Frequently Used Ghost:   1%    9920089 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
    Demand Data:        22%  158852935
    Prefetch Data:      17%  123009991
    Demand Metadata:    60%  437438658
    Prefetch Metadata:   0%    2446824

My theory is that since there is not enough memory for the ARC to cache data, reads hit the L2ARC, miss there as well, and have to go to the disks. This causes contention between reads and writes, which inflates the service times.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc: Intel X25-E
slog:  Intel X25-E

Thoughts?
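[For reference: low "free" memory on a ZFS host is often just the ARC consuming otherwise-idle pages rather than a true shortage. A minimal way to see where the memory actually sits, assuming only the stock Solaris mdb and kstat interfaces (nothing site-specific):

  # kernel memory breakdown; recent Solaris 10 updates break out "ZFS File Data"
  echo ::memstat | mdb -k

  # current ARC size versus its adaptive target and configured max
  kstat -p zfs:0:arcstats:size
  kstat -p zfs:0:arcstats:c
  kstat -p zfs:0:arcstats:c_max

If the ARC is pinned at its max and the miss counters keep climbing, that points the same way as the iostat data above.]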
OK, I'll take a stab at it...

On Dec 26, 2009, at 9:52 PM, Brad wrote:

> repost - Sorry for cc'ing the other forums.
>
> I'm running into an issue where there seems to be a high number of read IOPS hitting the disks, and physical free memory is fluctuating between 200MB and 450MB out of 16GB total. We have the l2arc configured on a 32GB Intel X25-E ssd and slog on another 32GB X25-E ssd.

OK, this shows that memory is being used... a good thing.

> According to our tester, Oracle writes are extremely slow (high latency).

OK, this is a workable problem statement... another good thing.

> Below is a snippet of iostat:
> [first iostat listing quoted in the previous message]

You are getting 400+ IOPS @ 4 KB out of HDDs. Count your lucky stars. Don't expect that kind of performance as normal; it is much better than normal.

> Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order in which a read request is satisfied is:

0) Oracle SGA

> 1) ARC
> 2) vdev cache of L2ARC devices
> 3) L2ARC devices
> 4) vdev cache of disks
> 5) disks
>
> Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.
>
> [arc_summary cache-hit breakdown quoted in the previous message]
>
> The write IOPS then started to kick in more and latency dropped on the spinning disks:
> [second iostat listing quoted in the previous message]

Latency is reduced, but you are also now only seeing 200 IOPS, not 400+ IOPS. This is closer to what you would see as a max for HDDs.

I cannot tell which device is the cache device. I would expect to see one disk with significantly more reads than the others. What do the l2arc stats show?

> Is it true that if your MFU stats start to go over 50%, more memory is needed?

That is a good indicator. It means that most of the cache entries are frequently used. Grow your SGA and you should see this go down.

> [CACHE HITS BY CACHE LIST / BY DATA TYPE breakdown quoted in the previous message]
>
> My theory is that since there is not enough memory for the ARC to cache data, reads hit the L2ARC, miss there as well, and have to go to the disks. This causes contention between reads and writes, which inflates the service times.

If you have a choice of where to use memory, always choose closer to the application. Try a larger SGA first. Be aware of large page stealing -- consider increasing the SGA immediately after a reboot, before the database or applications are started.
 -- richard

> uname: 5.10 Generic_141445-09 i86pc i386 i86pc
> Sun Fire X4270: 11+1 raidz (SAS)
> l2arc: Intel X25-E
> slog:  Intel X25-E
> Thoughts?
Richard - the l2arc is c1t13d0. What tools can be used to show the l2arc stats?

  raidz1    2.68T   580G   543   453  4.22M  3.70M
    c1t1d0      -      -   258   102   689K   358K
    c1t2d0      -      -   256   103   684K   354K
    c1t3d0      -      -   258   102   690K   359K
    c1t4d0      -      -   260   103   687K   354K
    c1t5d0      -      -   255   101   686K   358K
    c1t6d0      -      -   263   103   685K   354K
    c1t7d0      -      -   259   101   689K   358K
    c1t8d0      -      -   259   103   687K   354K
    c1t9d0      -      -   260   102   689K   358K
    c1t10d0     -      -   263   103   686K   354K
    c1t11d0     -      -   260   102   687K   359K
    c1t12d0     -      -   263   104   684K   354K
  c1t14d0     396K  29.5G     0    65      7  3.61M
  cache           -      -     -     -      -      -
    c1t13d0   29.7G  11.1M   157    84  3.93M  6.45M

We've added 16GB to the box, bringing the overall total to 32GB. arc_max is set to 8GB:

  set zfs:zfs_arc_max = 8589934592

arc_summary output:

ARC Size:
    Current Size:             8192 MB (arcsize)
    Target Size (Adaptive):   8192 MB (c)
    Min Size (Hard Limit):    1024 MB (zfs_arc_min)
    Max Size (Hard Limit):    8192 MB (zfs_arc_max)

ARC Size Breakdown:
    Most Recently Used Cache Size:    39%  3243 MB (p)
    Most Frequently Used Cache Size:  60%  4948 MB (c-p)

ARC Efficency:
    Cache Access Total:      154663786
    Cache Hit Ratio:   41%    64221251  [Defined State for buffer]
    Cache Miss Ratio:  58%    90442535  [Undefined State for Buffer]
    REAL Hit Ratio:    41%    64221251  [MRU/MFU Hits Only]

    Data Demand Efficiency:    38%
    Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

    CACHE HITS BY CACHE LIST:
        Anon:                        --%  Counter Rolled.
        Most Recently Used:          17%  11118906 (mru)        [ Return Customer ]
        Most Frequently Used:        82%  53102345 (mfu)        [ Frequent Customer ]
        Most Recently Used Ghost:    14%   9427708 (mru_ghost)  [ Return Customer Evicted, Now Back ]
        Most Frequently Used Ghost:   6%   4344287 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
    CACHE HITS BY DATA TYPE:
        Demand Data:        84%  54444108
        Prefetch Data:       0%         0
        Demand Metadata:    15%   9777143
        Prefetch Metadata:   0%         0
    CACHE MISSES BY DATA TYPE:
        Demand Data:        96%  87542292
        Prefetch Data:       0%         0
        Demand Metadata:     3%   2900243
        Prefetch Metadata:   0%         0

Also disabled file-level prefetch and the vdev cache max:

  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_vdev_cache_max = 0x1

After reading about some issues with concurrent I/Os, I tweaked the setting down from 35 to 1 and it reduced the response times greatly (to 2-8 ms):

  set zfs:zfs_vdev_max_pending = 1

It did increase the actv... I'm still unsure about the side effects here:

    r/s    w/s  Mr/s  Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 2295.2  398.7   4.2   7.2   0.0  18.6     0.0     6.9   0 1084  c1
    0.0    0.8   0.0   0.0   0.0   0.0     0.0     0.1   0    0  c1t0d0
  190.3   22.9   0.4   0.0   0.0   1.5     0.0     7.0   0   87  c1t1d0
  180.9   20.6   0.3   0.0   0.0   1.7     0.0     8.5   0   95  c1t2d0
  195.0   43.0   0.3   0.2   0.0   1.6     0.0     6.8   0   93  c1t3d0
  193.2   21.7   0.4   0.0   0.0   1.5     0.0     6.8   0   88  c1t4d0
  195.7   34.8   0.3   0.1   0.0   1.7     0.0     7.5   0   97  c1t5d0
  186.8   20.6   0.3   0.0   0.0   1.5     0.0     7.3   0   88  c1t6d0
  188.4   21.0   0.4   0.0   0.0   1.6     0.0     7.7   0   91  c1t7d0
  189.6   21.2   0.3   0.0   0.0   1.6     0.0     7.4   0   91  c1t8d0
  193.8   22.6   0.4   0.0   0.0   1.5     0.0     7.1   0   91  c1t9d0
  192.6   20.8   0.3   0.0   0.0   1.4     0.0     6.8   0   88  c1t10d0
  195.7   22.2   0.3   0.0   0.0   1.5     0.0     6.7   0   88  c1t11d0
  184.7   20.3   0.3   0.0   0.0   1.4     0.0     6.8   0   84  c1t12d0
    7.3   82.4   0.1   5.5   0.0   0.0     0.0     0.2   0    1  c1t13d0
    1.3   23.9   0.0   1.3   0.0   0.0     0.0     0.2   0    0  c1t14d0

I'm still in talks with the DBA about raising the SGA from 4GB to 6GB to see if it helps.

The changes that showed a lot of improvement were disabling file/device-level prefetch and reducing concurrent I/Os from 35 to 1 (tried 10, but it didn't help much). Is there anything else that could be tweaked to increase write performance? Record sizes are set accordingly: 8K, and 128K for the redo logs.
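[On the "what tools" question: the L2ARC feed and hit/miss counters live in the same arcstats kstat as the ARC counters, and zpool iostat -v (as shown above) gives the per-device view of the cache disk. A minimal check, assuming only the standard kstat names; the pool name is a placeholder:

  # L2ARC counters (l2_hits, l2_misses, l2_size, l2_read_bytes, l2_write_bytes, ...)
  kstat -p zfs:0:arcstats | grep l2_

  # per-vdev activity, including the cache and log devices, sampled every 5 seconds
  zpool iostat -v <pool> 5
]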
Hi Brad, comments below...

On Dec 27, 2009, at 10:24 PM, Brad wrote:

> Richard - the l2arc is c1t13d0. What tools can be used to show the l2arc stats?
>
> [zpool iostat -v listing quoted in the previous message]
>
> We've added 16GB to the box, bringing the overall total to 32GB.

In general, this is always a good idea.

> arc_max is set to 8GB:
>   set zfs:zfs_arc_max = 8589934592

You will be well served to add much more memory to the SGA and reduce that to the ARC. More below...

> [arc_summary output quoted in the previous message]
>
> Also disabled file-level prefetch and the vdev cache max:
>   set zfs:zfs_prefetch_disable = 1
>   set zfs:zfs_vdev_cache_max = 0x1

I think this is a waste of time. The database will prefetch, by default, so you might as well start that work ahead of time. Note that ZFS uses an intelligent prefetch algorithm, so if it detects that the accesses are purely random, it won't prefetch.

> After reading about some issues with concurrent I/Os, I tweaked the setting down from 35 to 1 and it reduced the response times greatly (to 2-8 ms):
>   set zfs:zfs_vdev_max_pending = 1

This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (e.g. most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies.

> It did increase the actv... I'm still unsure about the side effects here:
> [iostat listing quoted in the previous message]
>
> I'm still in talks with the DBA about raising the SGA from 4GB to 6GB to see if it helps.

Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don't have to make.
 -- richard

> The changes that showed a lot of improvement were disabling file/device-level prefetch and reducing concurrent I/Os from 35 to 1 (tried 10, but it didn't help much). Is there anything else that could be tweaked to increase write performance? Record sizes are set accordingly: 8K, and 128K for the redo logs.
"Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don''t have to make." We''ll be turning up the SGA size from 4GB to 16GB. The arc size will be set from 8GB to 4GB. "This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies." Here''s another sample of the data taken at another time after the number of concurrent ios change from 10 to 1. We''re using Seagate Savio 10K SAS drives...I could not pull up info if the drives support NCQ or not. What''s the recommended value to set concurrent IOs to? r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1 10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0 117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0 116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0 116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0 116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0 113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0 116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0 116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0 115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0 115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0 114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0 114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0 115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0 1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0 1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0 -- This message posted from opensolaris.org
On Dec 28, 2009, at 12:40 PM, Brad wrote:

> "Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don't have to make."
>
> We'll be turning the SGA size up from 4GB to 16GB. The arc size will be reduced from 8GB to 4GB.

This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money.

> "This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (e.g. most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies."
>
> Here's another sample of the data, taken at another time after the number of concurrent I/Os was changed from 10 to 1. We're using Seagate Savvio 10K SAS drives... I could not find information on whether the drives support NCQ or not. What's the recommended value to set concurrent I/Os to?
>
> [iostat listing quoted in the previous message]

SAS will be CTQ, basically the same thing as NCQ for SATA disks. You can see here that you're averaging 4.6 I/Os queued at the disks (actv column) and the response time is quite good. Meanwhile, the disks are handling more than 700 IOPS with less than 10 ms response time. Not bad at all for HDDs, but not a level that can be expected, either.

Here we see more than 600 small write IOPS. These will be sequential (as in contiguous blocks, not sequential as in large blocks), so they get buffered and efficiently written by the disk. When your workload returns to the read-mostly random activity, the IOPS will go down.

As to what is the magic number? It is hard to say. In this case, more than 4 is good. Remember, the default of 35 is as much of a guess as anything. For HDDs, 35 might be a little bit too much, but for a RAID array, something more like 1,000 might be optimal. Keeping an eye on the actv column of iostat can help you make that decision.
 -- richard
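[If you want to experiment with the queue depth while watching the actv column, the tunable can also be poked on a live system rather than only via /etc/system. A sketch using the usual mdb syntax; the value 10 is just an example, not a recommendation:

  # per-device queue depth (actv) and service times, 5-second samples, non-idle devices only
  iostat -xnz 5

  # read the current value, then set it to 10 on the fly (0t prefix = decimal)
  echo "zfs_vdev_max_pending/D" | mdb -k
  echo "zfs_vdev_max_pending/W0t10" | mdb -kw

The /etc/system entry is still needed to make the setting survive a reboot.]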
"This doesn''t make sense to me. You''ve got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money." I''m having a hard time convincing the dbas to increase the size of the SGA to 20GB because their philosophy is, no matter what eventually you''ll have to hit disk to pick up data thats not stored in cache (arc or l2arc). The typical database server in our environment holds over 3TB of data. If the performance does not improve then we''ll possibly have to change the raid layout from raidz to a raid10. -- This message posted from opensolaris.org
On Mon, 28 Dec 2009, Brad wrote:

> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

But if the working set is 25GB, then things will be magically better. If it is 50GB or 500GB, then performance may still suck.

> If the performance does not improve, then we'll possibly have to change the raid layout from raidz to a raid10.

Mirror vdevs are what is definitely recommended for use with databases.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 28, 2009, at 1:40 PM, Brad wrote:

> "This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money."
>
> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

Wow! Where did you find DBAs who didn't want more resources? :-) If that is the case, then you might need many more disks to keep the (hungry) database fed.

> If the performance does not improve, then we'll possibly have to change the raid layout from raidz to a raid10.

Yes, the notion of adding more disks and using them as mirrors are closely aligned. However, you know that the data in the ARC is more than 50% frequently used, which makes the argument that a larger SGA (or ARC) should benefit the workload.
 -- richard
On Mon, Dec 28, 2009 at 01:40:03PM -0800, Brad wrote:

> "This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money."
>
> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

Brad, are your DBAs aware that if you increase your SGA (currently 4 GB):

- to 8 GB, you get 100% more memory for the SGA
- to 16 GB, you get 300% more memory for the SGA
- to 20 GB, you get 400% ...

If they are not aware, well... But try to be patient - I had a similar situation. It took quite a long time to convince our DBAs to increase the SGA from 16 GB to 20 GB. Finally they did :-) You can always use the "stronger" argument that not using already-bought memory is a waste of _money_.

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/
Thanks for the suggestion!

I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?

We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.
On Dec 29, 2009, at 7:55 AM, Brad <beneri3 at yahoo.com> wrote:

> Thanks for the suggestion!
>
> I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?

A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost. Because each write to a raidz is striped across the disks, the effective IOPS of the vdev is equal to that of a single disk. This can be improved by utilizing multiple (smaller) raidz vdevs which are striped, but not by mirroring them.

With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).

> We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.

How many LUNs are you working with now? 15? Is the storage direct-attached, or is it coming from a storage server that may have the physical disks in a raid configuration already? If direct-attached, create a pool of mirrors. If it's coming from a storage server where the disks are in a raid already, just create a striped pool and set copies=2.

-Ross
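[A rough sketch of the two layouts Ross describes; the pool and device names are placeholders, and note that copies=2 only applies to data written after the property is set:

  # direct-attached disks: a pool of 2-way mirrors
  zpool create tank mirror c1t1d0 c1t2d0 mirror c1t3d0 c1t4d0 mirror c1t5d0 c1t6d0

  # LUNs already RAID-protected by the array: plain stripe plus extra ZFS copies
  zpool create tank c2t0d0 c2t1d0 c2t2d0
  zfs set copies=2 tank
]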
On Tue, Dec 29 at 4:55, Brad wrote:

> Thanks for the suggestion!
>
> I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?
>
> We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.

As a general rule of thumb, each vdev has random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the same performance as a stripe of six bare drives, for random I/O.

When your workload is expected to be bounded by random I/O performance, in ZFS you want to increase the number of vdevs to be as large as possible, which distributes the random work across all of your disks. Building a pool out of 2-disk mirrors, then, is the preferred layout for random performance, since it's the highest ratio of disks to vdevs you can achieve (short of non-fault-tolerant configurations).

This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks, each of which consists of a mirror, though the checksumming rules are different. Performance should also be similar, though it's possible RAID10 may give slightly better random read performance at the expense of some data quality guarantees, since I don't believe RAID10 normally validates checksums on returned data if the device didn't return an error. In normal practice, RAID10 and a pool of mirrored vdevs should benchmark against each other within your margin of error.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
@ross "Because each write of a raidz is striped across the disks the effective IOPS of the vdev is equal to that of a single disk. This can be improved by utilizing multiple (smaller) raidz vdevs which are striped, but not by mirroring them." So with random reads, would it perform better on a raid5 layout since the FS blocks are written to each disk instead of a stripe? With zfs''s implementation of raid10, would we still get data protection and checksumming? "How many luns are you working with now? 15? Is the storage direct attached or is it coming from a storage server that may have the physical disks in a raid configuration already? If direct attached, create a pool of mirrors. If it''s coming from a storage server where the disks are in a raid already, just create a striped pool and set copies=2." We''re not using a SAN but a Sun X4270 with sixteen SAS drives (two dedicated to OS, two for ssd, raid 11+1. There''s a total of seven datasets from a single pool. -- This message posted from opensolaris.org
@eric "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO." It sounds like we''ll need 16 vdevs striped in a pool to at least get the performance of 15 drives plus another 16 mirrored for redundancy. If we are bounded in iops by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev? "This winds up looking similar to RAID10 in layout, in that you''re striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. Performance should also be similar, though it''s possible RAID10 may give slightly better random read performance at the expense of some data quality guarantees, since I don''t believe RAID10 normally validates checksums on returned data if the device didn''t return an error. In normal practice, RAID10 and a pool of mirrored vdevs should benchmark against each other within your margin of error." That''s interesting to know that with ZFS''s implementation of raid10 it doesn''t have checksumming built-in. -- This message posted from opensolaris.org
On Tue, 29 Dec 2009, Ross Walker wrote:

> A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost.

I am not sure what a "mirrored raidz" is. I have never heard of such a thing before.

> With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).

This is another case where using a term like "raid10" does not make sense when discussing zfs. ZFS does not support "raid10". ZFS does not support RAID 0 or RAID 1, so it can't support RAID 1+0 (RAID 10).

Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other. However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.

Zfs load-shares across vdevs, so it will load-share across mirror vdevs rather than striping (as RAID 10 would require).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
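[Bob's point about a detached mirror side catching up can be seen with a quick experiment; pool and device names here are placeholders, and only the transactions written while the device was away get resilvered:

  zpool offline tank c1t2d0    # take one side of a mirror out of service
  # ... writes continue against the remaining side ...
  zpool online tank c1t2d0     # ZFS resilvers only the missing transaction groups
  zpool status tank            # shows the (usually very short) resilver
]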
On Tue, Dec 29, 2009 at 18:16, Brad <beneri3 at yahoo.com> wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."
>
> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.
>
> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?

The minimum is 1 drive per vdev. The minimum with redundancy is 2 if you use mirroring. You should use mirroring to get the best performance.

> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

He was talking about RAID10, not mirroring in ZFS. ZFS will always use checksums.
On Tue, Dec 29 at 9:16, Brad wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."
>
> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.

If you were striping across 16 devices before, you will achieve similar random I/O performance by striping across 16 vdevs, regardless of their type. Sequential throughput is more a function of the number of devices, not the number of vdevs, in that a 3-disk RAIDZ will have (roughly) the sequential write throughput of a pair of drives.

You still get checksumming, but if a device fails or you get a corruption in your non-redundant stripe, zfs may not have enough information to repair your data. For a read-only data reference, maybe a restore from backup in these situations is okay, but for most installations that is unacceptable.

The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.

> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?

ZFS supports non-redundant vdev layouts, but they're generally not recommended. The smallest mirror you can build is 2 devices, and the smallest raidz is 3 devices per vdev.

> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

I don't believe I said this. I am reasonably certain that all zpool/zfs layouts validate checksums, even if built with no redundancy. The "RAID10-similar" layout in ZFS is an array of mirrors, such that you build a bunch of 2-device mirrored vdevs and add them all into a single pool. You wind up with a layout like:

  Pool0
    mirror-0
      disk0
      disk1
    mirror-1
      disk2
      disk3
    mirror-2
      disk4
      disk5
    ...
    mirror-N
      disk-2N
      disk-2N+1

This will give you the best random I/O performance possible with ZFS, independent of the type of disks used. (Obviously some of the same rules may not apply with ramdisks or SSDs, but those are special cases for most.)

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
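[For reference, the layout Eric sketches above falls out of a single zpool create with one "mirror" clause per vdev, and the stripe of mirrors can be grown later; the disk names here are placeholders for real device paths, and how zpool status labels each mirror vdev depends on the zpool version:

  zpool create Pool0 mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5
  zpool add Pool0 mirror disk6 disk7   # add another mirror pair to the stripe later
  zpool status Pool0                   # shows one mirror vdev per pair, as in the layout above
]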
On Dec 29, 2009, at 9:16 AM, Brad wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."

This model begins to break down with raidz2 and further breaks down with raidz3. Since I wrote about this simple model here:
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
we've refined it a bit to take into account the number of parity devices. For small, random read IOPS, the performance of a single, top-level vdev is:

    performance = performance of a disk * (N / (N - P))

where
    N = number of disks in the vdev
    P = number of parity devices in the vdev

For example, using 5 disks @ 100 IOPS we get something like:

    2-disk mirror: 200 IOPS
    4+1 raidz:     125 IOPS
    3+2 raidz2:    167 IOPS
    2+3 raidz3:    250 IOPS

Once again, it is clear that mirroring will offer the best small, random read IOPS.

> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.
>
> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?
>
> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

ZFS always checksums everything unless you explicitly disable checksumming for data. Metadata is always checksummed.
 -- richard
On Tue, Dec 29, 2009 at 12:07 PM, Richard Elling <richard.elling at gmail.com> wrote:

> This model begins to break down with raidz2 and further breaks down with raidz3. [...] For small, random read IOPS, the performance of a single, top-level vdev is:
>     performance = performance of a disk * (N / (N - P))
> [...] Once again, it is clear that mirroring will offer the best small, random read IOPS.
>
> ZFS always checksums everything unless you explicitly disable checksumming for data. Metadata is always checksummed.
> -- richard

I imagine he's referring to the fact that it cannot fix any checksum errors it finds.

<flamesuit>Let me open the can of worms by saying this is nearly as bad as not doing checksumming at all. Knowing the data is bad when you can't do anything to fix it doesn't really help if you have no way to regenerate it.</flamesuit>

--
--Tim
Eric D. Mudama wrote:

> On Tue, Dec 29 at 9:16, Brad wrote:
> The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz. Members of a mirror or raidz[123] must be a fundamental device (i.e. a file or a drive).

>> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>>
>> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.
>
> I don't believe I said this. I am reasonably certain that all zpool/zfs layouts validate checksums, even if built with no redundancy. The "RAID10-similar" layout in ZFS is an array of mirrors, such that you build a bunch of 2-device mirrored vdevs and add them all into a single pool. You wind up with a layout like:

Yes. PLEASE be careful - checksumming and redundancy are DIFFERENT concepts.

In ZFS, EVERYTHING is checksummed - the data blocks and the metadata. This is separate from redundancy. Regardless of the zpool layout (mirrors, raidz, or no redundancy), ZFS stores a checksum of all objects, and this checksum is used to determine whether an object has been corrupted. This check is done on any /read/.

Should the checksum determine that the object is corrupt, one of two things happens. If your zpool has some form of redundancy for that object, ZFS will reread the object from the redundant side of the mirror, or reconstruct the data using parity. It will then re-write the object to another place in the zpool and eliminate the "bad" object. Otherwise, if there is no redundancy, it will fail to return the data and log an error message to the syslog. In the case of metadata, even in a non-redundant zpool, some of that metadata is stored multiple times, so there is the possibility that you will be able to recover/reconstruct some metadata which fails checksumming.

In short, checksumming is how ZFS /determines/ data corruption, and redundancy is how ZFS /fixes/ it. Checksumming is /always/ present, while redundancy depends on the pool layout and options (cf. the "copies" property).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
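[A concrete way to see the detect-versus-repair split Erik describes: a scrub walks every allocated block, verifies checksums, and repairs from whatever redundancy exists. The pool name is a placeholder:

  zpool scrub tank
  zpool status -v tank   # CKSUM column counts detected errors; anything that could not be repaired is listed
]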
@relling "For small, random read IOPS the performance of a single, top-level vdev is performance = performance of a disk * (N / (N - P)) 133 * 12/(12-1) 133 * 12/11 where, N = number of disks in the vdev P = number of parity devices in the vdev" performance of a disk => Is this a rough estimate of the disk''s IOP? "For example, using 5 disks @ 100 IOPS we get something like: 2-disk mirror: 200 IOPS 4+1 raidz: 125 IOPS 3+2 raidz2: 167 IOPS 2+3 raidz3: 250 IOPS" So if the rated iops on our disks is @133 iops 133 * 12/(12-1) = 145 11+1 raidz: 145 IOPS????? If that''s the rate for a 11+1 raidz vdev, then why is iostat showing about 700 combined IOPS (reads/writes) per disk? r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1 10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0 117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0 116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0 116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0 116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0 113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0 116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0 116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0 115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0 115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0 114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0 114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0 115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0 1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0 1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0 -- This message posted from opensolaris.org
On Dec 29, 2009, at 11:26 AM, Brad wrote:

> @relling
> "For small, random read IOPS the performance of a single, top-level vdev is
>     performance = performance of a disk * (N / (N - P))
> where,
>     N = number of disks in the vdev
>     P = number of parity devices in the vdev"
>
> performance of a disk => Is this a rough estimate of the disk's IOPS?
>
> "For example, using 5 disks @ 100 IOPS we get something like:
>     2-disk mirror: 200 IOPS
>     4+1 raidz:     125 IOPS
>     3+2 raidz2:    167 IOPS
>     2+3 raidz3:    250 IOPS"
>
> So if the rated IOPS of our disks is 133:
>     133 * 12/(12-1) = 133 * 12/11 = 145
> 11+1 raidz: 145 IOPS?????
>
> If that's the rate for an 11+1 raidz vdev, then why is iostat showing about 700 combined IOPS (reads/writes) per disk?

Because the model is for small, random read IOPS over the full size of the disk. What you are seeing is caching and seek optimization at work (a good thing). But, AFAIK, there are no decent performance models which take caching into account. In most cases, storage is sized based on empirical studies.
 -- richard

> [iostat listing quoted in the previous message]
On Tue, Dec 29 at 11:14, Erik Trimble wrote:

> Eric D. Mudama wrote:
>> On Tue, Dec 29 at 9:16, Brad wrote:
>> The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.
>
> ZFS can't do a raidz of mirrors or a mirror of raidz. Members of a mirror or raidz[123] must be a fundamental device (i.e. a file or a drive).

Sorry, typo/thinko... I meant to say a zpool of mirrors, not a raidz pool of mirrors.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>> A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost.
>
> I am not sure what a "mirrored raidz" is. I have never heard of such a thing before.
>
>> With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).
>
> This is another case where using a term like "raid10" does not make sense when discussing zfs. ZFS does not support "raid10". ZFS does not support RAID 0 or RAID 1, so it can't support RAID 1+0 (RAID 10).

Did it again... I understand the difference. I hope it didn't confuse the OP by throwing that out there. What I meant to say was a zpool of mirror vdevs.

> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.

I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.

> However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.

That's not completely true these days, as a lot of raid implementations use bitmaps to track changed blocks and a raid1 continues to function when the other side disappears. The real difference is that the mirror implementation in ZFS is in the file system and not at an abstracted block-I/O layer, so it is more intelligent in its use and layout.

> Zfs load-shares across vdevs, so it will load-share across mirror vdevs rather than striping (as RAID 10 would require).

Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?

-Ross
On 29-Dec-09, at 11:53 PM, Ross Walker wrote:

> On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> ...
>> However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.
>
> That's not completely true these days, as a lot of raid implementations use bitmaps to track changed blocks and a raid1 continues to function when the other side disappears. The real difference is that the mirror implementation in ZFS is in the file system and not at an abstracted block-I/O layer, so it is more intelligent in its use and layout.

Another important difference is that ZFS has the means to know which side of a mirror returned valid data.

--Toby
On Tue, 29 Dec 2009, Ross Walker wrote:

>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>
> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.

I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

> Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?

Previously we were told that zfs uses a semi-random algorithm to select which mirror side (or copy) to read from. Presumably more (double) performance would be available if zfs was able to precisely schedule and interleave reads from the mirror devices, but perfecting that could be quite challenging. With mirrors, we do see more read performance than one device can provide, but we don't see double.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, Dec 30, 2009 at 12:35 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>>
>> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.
>
> I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

OK, that makes sense, as long as the metadata is committed, which can happen for mirrors as soon as one side is written, but not for raidz until the whole stripe is written. So I was accurate about the increased latency for raidz, but for the wrong reason.

>> Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?
>
> Previously we were told that zfs uses a semi-random algorithm to select which mirror side (or copy) to read from. Presumably more (double) performance would be available if zfs was able to precisely schedule and interleave reads from the mirror devices, but perfecting that could be quite challenging. With mirrors, we do see more read performance than one device can provide, but we don't see double.

That isn't quite what I was getting at. Say one has a pool of mirrors and then sets copies=2 on the pool, which in theory should create a second copy of the data on another vdev in the pool. When servicing a read request, is the ZFS scheduler smart enough to read from the second copy if the vdev where the first copy lies is busy with another read/write request? If so, this would increase the read performance of the pool of mirrors at the cost of some write performance.

-Ross
On Dec 30, 2009, at 9:35 AM, Bob Friesenhahn wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>>
>> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.
>
> I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

Yes, though I wouldn't call it "sloppy" ;-) With traditional software RAID, you have to make sure both sides of the mirror are written because you also assume that you can later read either side. For ZFS, if only one side of the mirror is written, you know the bad side is bad because of the checksum. The checksum is owned by the parent, which is an important design decision that applies here, too.

Methinks it might be a good idea to start a comparison wiki to share some of the details...
 -- richard
On Wed, 30 Dec 2009, Richard Elling wrote:

> are written because you also assume that you can later read either side. For ZFS, if only one side of the mirror is written, you know the bad side is bad because of the checksum. The checksum is owned by the parent, which is an important design decision that applies here, too.

In a mirror vdev, each mirror contains a complete copy of the vdev, and the last committed TXG can be used to know whether a mirror is up to date with its peers, or whether it needs a little bit of resilvering to catch up. This differs from a raidz/raidz2 vdev, where most of the disks (N-1 or N-2) need to be available in order to have a complete copy of the vdev.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/