why is the sum of the disks' bandwidth from `zpool iostat -v 1`
greater than the pool total while watching `du /zfs`
on opensol-20060605 bits?
                capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    147      0   573K      0
   raidz1    1.17T  1.16T    147      0   573K      0
     c2d0p0      -      -     29      0  1.93M      0
     c4d0p0      -      -     30      0  2.07M      0
     c6d0p0      -      -     45      0  3.05M      0
     c8d0p0      -      -     24      0  1.50M      0
     c3d0p0      -      -     39      0  2.44M      0
     c5d0p0      -      -     35      0  2.30M      0
     c7d0p0      -      -     50      0  3.24M      0
     c9d0p0      -      -     22      0  1.36M      0
----------  -----  -----  -----  -----  -----  -----
                capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    129      0   565K      0
   raidz1    1.17T  1.16T    129      0   565K      0
     c2d0p0      -      -     38      0  2.42M      0
     c4d0p0      -      -     34      0  2.18M      0
     c6d0p0      -      -     35      0  2.18M      0
     c8d0p0      -      -     33      0  2.10M      0
     c3d0p0      -      -     39      0  2.50M      0
     c5d0p0      -      -     38      0  2.38M      0
     c7d0p0      -      -     36      0  2.26M      0
     c9d0p0      -      -     32      0  2.06M      0
----------  -----  -----  -----  -----  -----  -----
if `zpool upgrade` shows all pools as "ZFS version 3"
and I never rewrite any of my root dirs, does the old
"raidz1" pool ever make any ditto blocks?
Hello Rob,
Friday, June 9, 2006, 7:36:58 AM, you wrote:
RL> why is the sum of the disks' bandwidth from `zpool iostat -v 1`
RL> greater than the pool total while watching `du /zfs`
RL> on opensol-20060605 bits?
RL>                 capacity     operations    bandwidth
RL> pool         used  avail   read  write   read  write
RL> ----------  -----  -----  -----  -----  -----  -----
RL> zfs         1.17T  1.16T    147      0   573K      0
RL>    raidz1    1.17T  1.16T    147      0   573K      0
RL>      c2d0p0      -      -     29      0  1.93M      0
RL>      c4d0p0      -      -     30      0  2.07M      0
RL>      c6d0p0      -      -     45      0  3.05M      0
RL>      c8d0p0      -      -     24      0  1.50M      0
RL>      c3d0p0      -      -     39      0  2.44M      0
RL>      c5d0p0      -      -     35      0  2.30M      0
RL>      c7d0p0      -      -     50      0  3.24M      0
RL>      c9d0p0      -      -     22      0  1.36M      0
RL> ----------  -----  -----  -----  -----  -----  -----
Due to the raid-z implementation. See the last discussion on raid-z
performance, etc.
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
> RL> why is the sum of the disks' bandwidth from `zpool iostat -v 1`
> RL> greater than the pool total while watching `du /zfs`
> RL> on opensol-20060605 bits?
>
> Due to the raid-z implementation. See the last discussion on raid-z
> performance, etc.

It's an artifact of the way raidz and the vdev read cache interact.

Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk.  For some workloads, this is a huge win.  For others,
it's a net lose.  More tuning is needed, certainly.

Both the good and the bad aspects of vdev caching are amplified by
RAID-Z.  When you write a 2k block to a 5-disk raidz vdev, it will be
stored as a single 512-byte sector on each disk (4 data + 1 parity).
When you read it back, we'll issue 4 reads (to the data disks); each of
those will become a 64k cache-fill read, so you're reading a total of
4*64k = 256k to fetch a 2k block.

If that block is the first in a series, you're golden: the next 127
reads will be free (no disk I/O).  On the other hand, if it's an
isolated random read, we just did 128 times more I/O than was actually
useful.  This is a rather extreme case, but it's real.

I'm hoping that by making the higher-level prefetch logic in ZFS a
little smarter, we can eliminate the need for vdev-level caching
altogether.  If not, we'll need to make the vdev cache policy smarter.
I've filed this bug to track the issue:

6437054 vdev_cache: wise up or die

Jeff
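To put rough numbers on the amplification Jeff describes (illustrative
arithmetic only, combining his 5-disk raidz1 example with the 8K fill
size discussed further down this thread):

  2k logical block on 5-disk raidz1 -> 4 data sectors + 1 parity sector
  reads issued to get it back       -> 4 (data disks only)

  default fill, bshift=16 (64k):  4 * 64k = 256k read for 2k of data  (128x)
  reduced fill, bshift=13 ( 8k):  4 *  8k =  32k read for 2k of data  ( 16x)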
> a total of 4*64k = 256k to fetch a 2k block.
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054

Perhaps a quick win would be to tell vdev_cache about the DMU_OT_* type
so it can read ahead appropriately.  It seems the largest losses are on
metadata (du, find, scrub/resilver).
What is the status of bug 6437054?  The bug tracker still shows it open.

Ron
On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
> What is the status of bug 6437054? The bug tracker still shows it open.
>
> Ron

Do you mean:

6437054 vdev_cache: wise up or die

This bug is still under investigation.  A bunch of investigation has
been done, but no definitive action has been taken, yet.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Apr-23  21:06 UTC
[zfs-discuss] Re: opensol-20060605 # zpool iostat -v 1
Hello Eric,

Monday, April 23, 2007, 7:13:26 PM, you wrote:

ES> On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
>> What is the status of bug 6437054? The bug tracker still shows it open.
>>
>> Ron

ES> Do you mean:

ES> 6437054 vdev_cache: wise up or die

ES> This bug is still under investigation.  A bunch of investigation has
ES> been done, but no definitive action has been taken, yet.

I set bshift to ^13 (8K) on almost every server with zfs.
I wish I could put it in /etc/system on S10U3...

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Ron Halstead
2007-Apr-24  14:20 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Robert,

How do you set bshift to ^13 (8K)?  Is there a document describing the
procedure?

Ron
Robert Milkowski
2007-Apr-24  14:42 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,
Tuesday, April 24, 2007, 4:20:41 PM, you wrote:
RH> Robert,
RH> How do you set bshift to ^13 (8K)? Is there a document describing the
procedure?
In latest nevada you can set it via /etc/system like:
  set zfs:zfs_vdev_cache_bshift=13
(2^13 is 8K).
In older releases or in S10U2/U3 you can set it via mdb or use Roch's
script - http://blogs.sun.com/roch/entry/tuning_the_knobs
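For the mdb route, the usual pattern looks roughly like this (a sketch
only -- double-check that zfs_vdev_cache_bshift exists on your release
before writing into a live kernel):

  # read the current value in decimal
  echo 'zfs_vdev_cache_bshift/D' | mdb -k

  # set it to 13 (2^13 = 8K) on the running kernel
  echo 'zfs_vdev_cache_bshift/W 0t13' | mdb -kw

Note this only changes the running kernel and does not survive a
reboot, which is why the /etc/system line above is preferable where
it's supported.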
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Ron Halstead
2007-Apr-24  14:54 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Thanks Robert.  This will be put to use.

Ron
Robert Milkowski
2007-Apr-26  09:56 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,
Tuesday, April 24, 2007, 4:54:52 PM, you wrote:
RH> Thanks Robert. This will be put to use.
Please let us know about the results.
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com