why is the pool total bandwidth from `zpool iostat -v 1`
so much less than the sum of the individual disks' bandwidth
while watching `du /zfs` on opensol-20060605 bits?
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    147      0   573K      0
  raidz1    1.17T  1.16T    147      0   573K      0
    c2d0p0      -      -     29      0  1.93M      0
    c4d0p0      -      -     30      0  2.07M      0
    c6d0p0      -      -     45      0  3.05M      0
    c8d0p0      -      -     24      0  1.50M      0
    c3d0p0      -      -     39      0  2.44M      0
    c5d0p0      -      -     35      0  2.30M      0
    c7d0p0      -      -     50      0  3.24M      0
    c9d0p0      -      -     22      0  1.36M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    129      0   565K      0
  raidz1    1.17T  1.16T    129      0   565K      0
    c2d0p0      -      -     38      0  2.42M      0
    c4d0p0      -      -     34      0  2.18M      0
    c6d0p0      -      -     35      0  2.18M      0
    c8d0p0      -      -     33      0  2.10M      0
    c3d0p0      -      -     39      0  2.50M      0
    c5d0p0      -      -     38      0  2.38M      0
    c7d0p0      -      -     36      0  2.26M      0
    c9d0p0      -      -     32      0  2.06M      0
----------  -----  -----  -----  -----  -----  -----
if `zpool upgrade` shows all pools as "ZFS version 3"
and I never rewrite any of my root dirs, does the old
"raidz1" pool ever make any ditto blocks?
Hello Rob,
Friday, June 9, 2006, 7:36:58 AM, you wrote:
RL> why is the pool total bandwidth from `zpool iostat -v 1`
RL> so much less than the sum of the individual disks' bandwidth
RL> while watching `du /zfs` on opensol-20060605 bits?
Due to the RAID-Z implementation. See the recent discussion on RAID-Z
performance, etc.
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
> RL> why is the pool total bandwidth from `zpool iostat -v 1`
> RL> so much less than the sum of the individual disks' bandwidth
> RL> while watching `du /zfs` on opensol-20060605 bits?
>
> Due to the RAID-Z implementation. See the recent discussion on RAID-Z
> performance, etc.

It's an artifact of the way raidz and the vdev read cache interact.
Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk. For some workloads, this is a huge win. For others, it's
a net loss. More tuning is needed, certainly.

Both the good and the bad aspects of vdev caching are amplified by RAID-Z.
When you write a 2k block to a 5-disk raidz vdev, it will be stored as a
single 512-byte sector on each disk (4 data + 1 parity). When you read it
back, we'll issue 4 reads (to the data disks); each of those will become a
64k cache-fill read, so you're reading a total of 4*64k = 256k to fetch a
2k block. If that block is the first in a series, you're golden: the next
127 reads will be free (no disk I/O). On the other hand, if it's an
isolated random read, we just did 128 times more I/O than was actually
useful. This is a rather extreme case, but it's real.

I'm hoping that by making the higher-level prefetch logic in ZFS a little
smarter, we can eliminate the need for vdev-level caching altogether. If
not, we'll need to make the vdev cache policy smarter. I've filed this bug
to track the issue:

6437054 vdev_cache: wise up or die

Jeff
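
To make the arithmetic above concrete, here is a minimal shell sketch of
the worst case Jeff describes. The 2k block, 4 data disks, and 64k
cache-fill size are taken straight from his example, not measured from a
live pool:

    #!/bin/sh
    # worst-case read amplification: one isolated 2k block on a
    # 5-disk raidz1 with the default 64k vdev cache fill
    block=2048        # logical block size in bytes
    data_disks=4      # 5-disk raidz1 = 4 data + 1 parity
    fill=65536        # vdev cache fill size (64k)

    raw=$((data_disks * fill))                 # bytes actually read from disk
    echo "raw bytes read: $raw"                # 262144 = 256k
    echo "amplification:  $((raw / block))x"   # 128x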
> a total of 4*64k = 256k to fetch a 2k block.
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054

perhaps a quick win would be to tell vdev_cache about the DMU_OT_* type so
it can read ahead appropriately. it seems the largest losses are metadata
(du, find, scrub/resilver).
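
One rough way to see how hard the vdev cache is being driven during a
metadata-heavy walk (du, find) is to count entries into its read path with
DTrace. This is only a sketch: it assumes the fbt provider and a
vdev_cache_read() function as in the vdev_cache.c of that era, and the
name may differ on other bits:

    # count vdev cache read-path entries while du/find runs;
    # Ctrl-C prints the total
    dtrace -n 'fbt::vdev_cache_read:entry { @reads = count(); }'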
What is the status of bug 6437054? The bug tracker still shows it open.

Ron
On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
> What is the status of bug 6437054? The bug tracker still shows it open.
>
> Ron

Do you mean:

6437054 vdev_cache: wise up or die

This bug is still under investigation. A bunch of investigation has been
done, but no definitive action has been taken, yet.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
Robert Milkowski
2007-Apr-23 21:06 UTC
[zfs-discuss] Re: opensol-20060605 # zpool iostat -v 1
Hello Eric,

Monday, April 23, 2007, 7:13:26 PM, you wrote:

ES> On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
>> What is the status of bug 6437054? The bug tracker still shows it open.
>>
>> Ron

ES> Do you mean:

ES> 6437054 vdev_cache: wise up or die

ES> This bug is still under investigation. A bunch of investigation has
ES> been done, but no definitive action has been taken, yet.

I set bshift to ^13 (8K) on almost every server with ZFS. I wish I could
put it in /etc/system on S10U3...

--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Ron Halstead
2007-Apr-24 14:20 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Robert,

How do you set bshift to ^13 (8K)? Is there a document describing the
procedure?

Ron
Robert Milkowski
2007-Apr-24 14:42 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,
Tuesday, April 24, 2007, 4:20:41 PM, you wrote:
RH> Robert,
RH> How do you set bshift to ^13 (8K)? Is there a document describing the
procedure?
In the latest Nevada builds you can set it via /etc/system like:

set zfs:zfs_vdev_cache_bshift=13

(2^13 is 8K).

In older releases, or on S10U2/U3, you can set it via mdb or use Roch's
script: http://blogs.sun.com/roch/entry/tuning_the_knobs
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
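
For the older-release / S10U2-U3 case Robert mentions, a minimal sketch of
doing it by hand with mdb looks like the following. It assumes the running
kernel exposes the same zfs_vdev_cache_bshift variable; Roch's script is
the safer, better-tested route:

    # read the current value (printed in decimal)
    echo "zfs_vdev_cache_bshift/D" | mdb -k

    # set it to 13, i.e. 8K cache fills; 0t13 is mdb notation for decimal 13
    echo "zfs_vdev_cache_bshift/W 0t13" | mdb -kw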
Ron Halstead
2007-Apr-24 14:54 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Thanks Robert. This will be put to use.

Ron
Robert Milkowski
2007-Apr-26 09:56 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,
Tuesday, April 24, 2007, 4:54:52 PM, you wrote:
RH> Thanks Robert. This will be put to use.
Please let us know about the results.
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com