why is the pool total bandwidth from `zpool iostat -v 1`
less than the sum of the per-disk bandwidth while watching `du /zfs`
on opensol-20060605 bits?

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    147      0   573K      0
  raidz1    1.17T  1.16T    147      0   573K      0
    c2d0p0      -      -     29      0  1.93M      0
    c4d0p0      -      -     30      0  2.07M      0
    c6d0p0      -      -     45      0  3.05M      0
    c8d0p0      -      -     24      0  1.50M      0
    c3d0p0      -      -     39      0  2.44M      0
    c5d0p0      -      -     35      0  2.30M      0
    c7d0p0      -      -     50      0  3.24M      0
    c9d0p0      -      -     22      0  1.36M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         1.17T  1.16T    129      0   565K      0
  raidz1    1.17T  1.16T    129      0   565K      0
    c2d0p0      -      -     38      0  2.42M      0
    c4d0p0      -      -     34      0  2.18M      0
    c6d0p0      -      -     35      0  2.18M      0
    c8d0p0      -      -     33      0  2.10M      0
    c3d0p0      -      -     39      0  2.50M      0
    c5d0p0      -      -     38      0  2.38M      0
    c7d0p0      -      -     36      0  2.26M      0
    c9d0p0      -      -     32      0  2.06M      0
----------  -----  -----  -----  -----  -----  -----

if `zpool upgrade` shows all pools as "ZFS version 3" and I never rewrite
any of my root dirs, does the old "raidz1" pool ever make any ditto blocks?
Hello Rob,

Friday, June 9, 2006, 7:36:58 AM, you wrote:

RL> why is the pool total bandwidth from `zpool iostat -v 1`
RL> less than the sum of the per-disk bandwidth while watching `du /zfs`
RL> on opensol-20060605 bits?

RL>                capacity     operations    bandwidth
RL> pool         used  avail   read  write   read  write
RL> ----------  -----  -----  -----  -----  -----  -----
RL> zfs         1.17T  1.16T    147      0   573K      0
RL>   raidz1    1.17T  1.16T    147      0   573K      0
RL>     c2d0p0      -      -     29      0  1.93M      0
RL>     c4d0p0      -      -     30      0  2.07M      0
RL>     c6d0p0      -      -     45      0  3.05M      0
RL>     c8d0p0      -      -     24      0  1.50M      0
RL>     c3d0p0      -      -     39      0  2.44M      0
RL>     c5d0p0      -      -     35      0  2.30M      0
RL>     c7d0p0      -      -     50      0  3.24M      0
RL>     c9d0p0      -      -     22      0  1.36M      0
RL> ----------  -----  -----  -----  -----  -----  -----

Due to the raid-z implementation. See the last discussion on raid-z
performance, etc.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                   http://milek.blogspot.com
> RL> why is the pool total bandwidth from `zpool iostat -v 1`
> RL> less than the sum of the per-disk bandwidth while watching `du /zfs`
> RL> on opensol-20060605 bits?
>
> Due to the raid-z implementation. See the last discussion on raid-z
> performance, etc.

It's an artifact of the way raidz and the vdev read cache interact.

Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk.  For some workloads, this is a huge win.  For others,
it's a net loss.  More tuning is needed, certainly.

Both the good and the bad aspects of vdev caching are amplified by RAID-Z.
When you write a 2k block to a 5-disk raidz vdev, it will be stored as a
single 512-byte sector on each disk (4 data + 1 parity).  When you read it
back, we'll issue 4 reads (to the data disks); each of those will become a
64k cache-fill read, so you're reading a total of 4*64k = 256k to fetch a
2k block.

If that block is the first in a series, you're golden: the next 127 reads
will be free (no disk I/O).  On the other hand, if it's an isolated random
read, we just did 128 times more I/O than was actually useful.  This is a
rather extreme case, but it's real.

I'm hoping that by making the higher-level prefetch logic in ZFS a little
smarter, we can eliminate the need for vdev-level caching altogether.  If
not, we'll need to make the vdev cache policy smarter.  I've filed this
bug to track the issue:

6437054 vdev_cache: wise up or die

Jeff
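One rough way to see the cache-fill behaviour Jeff describes is to watch
the physical read sizes hitting the disks with the standard DTrace io
provider while the `du` is running.  This is only a sketch (nothing here
is ZFS-specific, and the exact distribution depends on the workload), but
on a raidz pool doing lots of small metadata reads you would expect the
per-disk histogram to cluster around 64k even though the pool-level
bandwidth stays small:

  # histogram of physical I/O sizes per device while `du /zfs` runs
  dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'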
> a total of 4*64k = 256k to fetch a 2k block.

> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054

perhaps a quick win would be to tell vdev_cache about the DMU_OT_* type so
it can read ahead appropriately.  it seems the largest losses are metadata.
(du, find, scrub/resilver)
What is the status of bug 6437054?  The bug tracker still shows it open.

Ron
On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
> What is the status of bug 6437054?  The bug tracker still shows it open.
>
> Ron

Do you mean:

6437054 vdev_cache: wise up or die

This bug is still under investigation.  A bunch of investigation has been
done, but no definitive action has been taken, yet.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Apr-23 21:06 UTC
[zfs-discuss] Re: opensol-20060605 # zpool iostat -v 1
Hello Eric,

Monday, April 23, 2007, 7:13:26 PM, you wrote:

ES> On Mon, Apr 23, 2007 at 10:10:23AM -0700, Ron Halstead wrote:
>> What is the status of bug 6437054?  The bug tracker still shows it open.
>>
>> Ron

ES> Do you mean:

ES> 6437054 vdev_cache: wise up or die

ES> This bug is still under investigation.  A bunch of investigation has
ES> been done, but no definitive action has been taken, yet.

I set bshift to 2^13 (8K) on almost every server with zfs.
I wish I could put it in /etc/system on S10U3...

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                   http://milek.blogspot.com
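If your build exposes the vdev cache kstats (not all of the releases
discussed in this thread do), you can get a rough idea of whether the
cache is earning its keep before and after a change like this.  Treat the
name as an assumption to verify on your own bits:

  # delegations/hits/misses for the vdev cache, where the kstat exists
  kstat -m zfs -n vdev_cache_stats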
Ron Halstead
2007-Apr-24 14:20 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Robert,

How do you set bshift to 2^13 (8K)?  Is there a document describing the
procedure?

Ron
Robert Milkowski
2007-Apr-24 14:42 UTC
[zfs-discuss] Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,

Tuesday, April 24, 2007, 4:20:41 PM, you wrote:

RH> Robert,

RH> How do you set bshift to 2^13 (8K)?  Is there a document describing the
RH> procedure?

In the latest Nevada builds you can set it via /etc/system like:

set zfs:zfs_vdev_cache_bshift=13

(2^13 is 8K).

In older releases or in S10U2/U3 you can set it via mdb or use Roch's
script -- http://blogs.sun.com/roch/entry/tuning_the_knobs

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                   http://milek.blogspot.com
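For the mdb route on the older bits, something along these lines should do
it.  Treat it as a sketch rather than Roch's exact script: it changes only
the running kernel (nothing persists across a reboot), and the variable
name is the one from the /etc/system line above:

  # check the current value (the default is 16, i.e. 2^16 = 64K)
  echo 'zfs_vdev_cache_bshift/D' | mdb -k

  # set it to 13 (2^13 = 8K) on the live kernel
  echo 'zfs_vdev_cache_bshift/W 0t13' | mdb -kw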
Ron Halstead
2007-Apr-24 14:54 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Thanks Robert.  This will be put to use.

Ron
Robert Milkowski
2007-Apr-26 09:56 UTC
[zfs-discuss] Re: Re: Re[2]: Re: opensol-20060605 # zpool iostat -v 1
Hello Ron,

Tuesday, April 24, 2007, 4:54:52 PM, you wrote:

RH> Thanks Robert.  This will be put to use.

Please let us know about the results.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                   http://milek.blogspot.com