I believe I have tracked down the problem discussed in the "low
disk performance" thread. It seems that an alignment issue will
cause small file/block performance to be abysmal on a RAID-Z.

metaslab_ff_alloc() seems to naturally align all allocations, and
so all blocks will be aligned to asize on a RAID-Z. At certain
block sizes which do not produce full-width writes, contiguous
writes will leave holes of dead space in the RAID-Z.

What I have observed with the iosnoop dtrace script is that the
first disks aggregate the single-block writes, while the last disk(s)
are forced to do numerous writes every other sector. If you would
like to reproduce this, simply copy a large file to a recordsize=4k
filesystem on a 4-disk RAID-Z.

It would probably fix the problem if this dead space were explicitly
zeroed to allow the writes to be aggregated, but that would be
an egregious hack. If the alignment constraints could be relaxed,
though, that should improve the parity distribution, as well as get
rid of the dead space and the associated problem.

Chris
Chris Csanady wrote:
> I believe I have tracked down the problem discussed in the "low
> disk performance" thread. It seems that an alignment issue will
> cause small file/block performance to be abysmal on a RAID-Z.
>
> metaslab_ff_alloc() seems to naturally align all allocations, and
> so all blocks will be aligned to asize on a RAID-Z. At certain
> block sizes which do not produce full-width writes, contiguous
> writes will leave holes of dead space in the RAID-Z.
>
> What I have observed with the iosnoop dtrace script is that the
> first disks aggregate the single-block writes, while the last disk(s)
> are forced to do numerous writes every other sector. If you would
> like to reproduce this, simply copy a large file to a recordsize=4k
> filesystem on a 4-disk RAID-Z.

Why would I want to set recordsize=4k if I'm using large files?
For that matter, why would I ever want to use a recordsize=4k? Is
there a database which needs 4k record sizes?

> It would probably fix the problem if this dead space were explicitly
> zeroed to allow the writes to be aggregated, but that would be
> an egregious hack. If the alignment constraints could be relaxed,
> though, that should improve the parity distribution, as well as get
> rid of the dead space and the associated problem.

This is one of those things I wanted to look at in my copious spare
time. Has anyone else done similar analysis?
 -- richard
On 9/26/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
> Chris Csanady wrote:
> > What I have observed with the iosnoop dtrace script is that the
> > first disks aggregate the single-block writes, while the last disk(s)
> > are forced to do numerous writes every other sector. If you would
> > like to reproduce this, simply copy a large file to a recordsize=4k
> > filesystem on a 4-disk RAID-Z.
>
> Why would I want to set recordsize=4k if I'm using large files?
> For that matter, why would I ever want to use a recordsize=4k? Is
> there a database which needs 4k record sizes?

Sorry, I wasn't very clear about the reasoning for this. It is not
something that you would normally do, but it generates just the right
combination of block size and stripe width to make the problem very
apparent.

It is also possible to encounter this on a filesystem with the default
recordsize, and I have observed the effect while extracting a large
archive of sources. Still, it was never bad enough for my uses to be
anything more than a curiosity. However, while trying to rsync 100M
~1k files onto a 4-disk RAID-Z, Gino Ruopolo seemingly stumbled upon
this worst-case performance scenario. (Though, unlike my example, it
is also possible to end up with holes in the second column.)

Also, while it may be a small error, could these stranded sectors
throw off the space accounting enough to cause problems when a pool
is nearly full?

Chris
Thanks, Chris, for digging into this and sharing your results. These
seemingly stranded sectors are actually properly accounted for in terms
of space utilization, since they are actually unusable while maintaining
integrity in the face of a single drive failure.

The way the RAID-Z space accounting works is this:

  1) Take the size of your data block (4k in your example) and figure
     out how much parity you need to protect it. This turns out to be
     3 sectors, for a total of 11 (5.5k). See vdev_raidz_asize() for
     details.

  2) For single-parity RAID-Z, round up to a multiple of 2 sectors,
     and for double-parity RAID-Z, round up to a multiple of 3 sectors.
     This becomes ASIZE (6k in your case). The reason for this is a
     bit complicated, but without this roundup, you can end up with
     stranded sectors that are unallocated and unusable, leading to
     the question, "I still have free space, why can't I write a
     file?" We simply account for these roundup sectors as part of
     the allocation that caused them.

  3) Allocate space for ASIZE bytes from the RAID-Z space map. With
     the first-fit allocator, this aligns the write to the greatest
     power of 2 that evenly divides ASIZE (2k in this case).

With all this in mind, what winds up happening is exactly what Chris
surmised. In this illustration, "A" represents a single sector of data
and "A." indicates its parity.

          Disk    A    B    C    D
          ------------------------
    LBA 0         A.   A    A    A
        1         A.   A    A    A
        2         A.   A    A    X
        3         B.   B    B    B
        4         B.   B    B    B
        5         B.   B    B    X

And so forth. In this scenario, you wind up with the described
situation of non-contiguous writes on one of the disks, which will
kill the performance. Sorry about that.

Jeff and I had actually talked at one point about how we could fix
this. Basically, you could represent the "X" dead sector as an
opportunistic write that would only get sent to disk if it got
aggregated, and would get dropped on the floor otherwise. I think it
wouldn't be too bad with some pipeline tricks. If anyone is
interested enough to pick this up, let me know and we can discuss the
details.


--Bill
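The accounting above is easy to check with a small standalone program.
The sketch below is illustrative only: it assumes 512-byte sectors,
ignores the parity rotation the real code performs, and is not the
actual vdev_raidz_asize() or allocator code (the file name and
command-line arguments are made up for the example). It simply walks
through steps 1 to 3 and prints an occupancy grid in the same style as
the diagram, marking the round-up sectors with "X".

/*
 * raidz_sketch.c -- illustrative RAID-Z space accounting, following the
 * description in this thread.  Assumes 512-byte sectors and ignores
 * parity rotation.  This is NOT the actual vdev_raidz_asize() or
 * metaslab code, just the arithmetic.
 *
 *   cc -o raidz_sketch raidz_sketch.c
 *   ./raidz_sketch 4096 4 1     (4k block, 4 disks, single parity)
 */
#include <stdio.h>
#include <stdlib.h>

#define	SECTOR	512

int
main(int argc, char **argv)
{
	int psize   = (argc > 1) ? atoi(argv[1]) : 4096; /* block size, bytes  */
	int ndisks  = (argc > 2) ? atoi(argv[2]) : 4;    /* disks in the group */
	int nparity = (argc > 3) ? atoi(argv[3]) : 1;    /* 1=raidz, 2=raidz2  */
	int dcols   = ndisks - nparity;                  /* data columns       */

	if (psize < 1 || nparity < 1 || dcols < 1) {
		fprintf(stderr, "usage: %s [bytes] [ndisks] [nparity]\n", argv[0]);
		return (1);
	}

	/* Step 1: data sectors, plus enough parity sectors to cover them. */
	int data   = (psize + SECTOR - 1) / SECTOR;
	int parity = nparity * ((data + dcols - 1) / dcols);
	int total  = data + parity;

	/* Step 2: round the allocation up to a multiple of (nparity + 1). */
	int asize  = ((total + nparity) / (nparity + 1)) * (nparity + 1);

	printf("%d data + %d parity = %d sectors, asize %d sectors, %d round-up\n",
	    data, parity, total, asize, asize - total);

	/*
	 * Step 3 (occupancy only): the allocation interleaves across the
	 * disks row by row.  Parity is drawn in the first column(s), data
	 * in the rest; the round-up sectors, which are allocated but never
	 * written, appear as "X" at the end.  These are the holes.
	 */
	for (int s = 0; s < asize; s++) {
		if (s % ndisks == 0)
			printf("LBA %2d ", s / ndisks);
		if (s >= total)
			printf("  X ");
		else if (s % ndisks < nparity)
			printf("  P.");
		else
			printf("  D ");
		if (s % ndisks == ndisks - 1 || s == asize - 1)
			printf("\n");
	}
	return (0);
}

Run as "./raidz_sketch 4096 4 1" it reproduces the 4k-on-4-disks grid
above; run as "./raidz_sketch 4096 5 1" or "./raidz_sketch 4096 3 1"
it reports no round-up sector, which matches two of the alternatives
suggested later in the thread.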
Thank you Bill for your clear description.

Now I have to find a way to justify to my head office that after
spending 100k+ on hardware and migrating to "the most advanced OS"
we are running about 8 times slower :)

Anyway, I have a problem much more serious than the rsync process
speed. I hope you'll help me sort it out!

Our situation:

/data/a
/data/b
/data/zones/ZONEX  (whole root zone)

As you know, I have a process running "rsync -ax /data/a/* /data/b"
for about 14 hrs. The problem is that, while that rsync process is
running, ZONEX is completely unusable because of the rsync I/O load.
Even though we're using FSS, Solaris seems unable to give a small
amount of I/O resource to ZONEX's activity ...

I know that FSS doesn't deal with I/O, but I think Solaris should be
smarter ...

To draw a comparison, FreeBSD Jail doesn't suffer from this problem ...

thanks,
Gino
observations below...

Bill Moore wrote:
> With all this in mind, what winds up happening is exactly what Chris
> surmised. In this illustration, "A" represents a single sector of data
> and "A." indicates its parity.
>
>           Disk    A    B    C    D
>           ------------------------
>     LBA 0         A.   A    A    A
>         1         A.   A    A    A
>         2         A.   A    A    X
>         3         B.   B    B    B
>         4         B.   B    B    B
>         5         B.   B    B    X

In the interim, does it make sense for a simple rule of thumb?
For example, in the above case, I would not have the hole if I did
any of the following:
  1. add one disk
  2. remove one disk
  3. use raidz2 instead of raidz

More generally, I could suggest that we use an odd number of vdevs
for raidz and an even number for mirrors and raidz2.
Thoughts?
 -- richard
Richard Elling - PAE wrote:
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.
> Thoughts?

Sounds good to me. I'd make sure it's in the same section of the BP
guide as the "Align the block size with your app..." type notes.
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.
> Thoughts?

uhm ... we found serious performance problems also using a RAID-Z of
3 LUNs ...

Gino
On Wed, 27 Sep 2006, Gino Ruopolo wrote:

> Thank you Bill for your clear description.
>
> Now I have to find a way to justify to my head office that after
> spending 100k+ on hardware and migrating to "the most advanced OS"
> we are running about 8 times slower :)
>
> Anyway, I have a problem much more serious than the rsync process
> speed. I hope you'll help me sort it out!
>
> Our situation:
>
> /data/a
> /data/b
> /data/zones/ZONEX  (whole root zone)
>
> As you know, I have a process running "rsync -ax /data/a/* /data/b"
> for about 14 hrs. The problem is that, while that rsync process is
> running, ZONEX is completely unusable because of the rsync I/O load.
> Even though we're using FSS, Solaris seems unable to give a small
> amount of I/O resource to ZONEX's activity ...
>
> I know that FSS doesn't deal with I/O, but I think Solaris should be
> smarter ...

What about using ipqos (man ipqos)?

> To draw a comparison, FreeBSD Jail doesn't suffer from this problem ...
>
> thanks,
> Gino

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
> > Even though we're using FSS, Solaris seems unable to give a small
> > amount of I/O resource to ZONEX's activity ...
> >
> > I know that FSS doesn't deal with I/O, but I think Solaris should
> > be smarter ...
>
> What about using ipqos (man ipqos)?

I'm not referring to "network I/O" but to "storage I/O" ...

thanks,
Gino
Gino Ruopolo writes:
 > The problem is that, while that rsync process is running, ZONEX is
 > completely unusable because of the rsync I/O load. Even though we're
 > using FSS, Solaris seems unable to give a small amount of I/O
 > resource to ZONEX's activity ...

Under a streaming write load, we kind of overwhelm the devices and
reads are few and far between. To alleviate this we need to throttle
writers somewhat more:

	6429205 each zpool needs to monitor its throughput
		and throttle heavy writers

This is in a state of "fix in progress". At the same time, the notion
of reserved slots for reads is being investigated. That should do
wonders for your issue.

I don't know how to work around this for now (apart from starving the
rsync process of CPU access).

-r
On September 27, 2006 11:27:16 AM -0700 Richard Elling - PAE
<Richard.Elling at Sun.COM> wrote:
> In the interim, does it make sense for a simple rule of thumb?
> For example, in the above case, I would not have the hole if I did
> any of the following:
>   1. add one disk
>   2. remove one disk
>   3. use raidz2 instead of raidz
>
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.

I have a 12-drive array (JBOD). I was going to carve out a 5-way
raidz and a 6-way raidz, both in one pool. Should I do 5-3-3 instead?

-frank
Frank Cusack wrote:
> On September 27, 2006 11:27:16 AM -0700 Richard Elling - PAE
> <Richard.Elling at Sun.COM> wrote:
>> In the interim, does it make sense for a simple rule of thumb?
>> For example, in the above case, I would not have the hole if I did
>> any of the following:
>>   1. add one disk
>>   2. remove one disk
>>   3. use raidz2 instead of raidz
>>
>> More generally, I could suggest that we use an odd number of vdevs
>> for raidz and an even number for mirrors and raidz2.

These rules are not generally accurate. To eliminate the blank "round
up" sectors for power-of-two blocksizes of 8k or larger, you should use
a power-of-two plus 1 number of disks in your raid-z group -- that is,
3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that
is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use
3 or 5 disks (for double parity, use 4 or 6), and for 2k, use 3 disks
(for double parity, use 4).

> I have a 12-drive array (JBOD). I was going to carve out a 5-way
> raidz and a 6-way raidz, both in one pool. Should I do 5-3-3 instead?

If you know you'll be using lots of small, same-size blocks (e.g., a
database where you're changing the recordsize property), AND you need
the best possible performance, AND you can't afford to use mirroring,
then you should do a performance comparison on both and see how it
works. Otherwise (i.e., in the common case), don't worry about it.

--matt
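To sanity-check a particular layout against these rules (for example,
the 5-way and 6-way groups mentioned above), the same per-block
arithmetic from the earlier sketch can be tabulated across group
widths. As before, this assumes 512-byte sectors, uses single parity
by default, and is only the accounting described in this thread, not
the actual ZFS code; the file name is made up for the example.

/*
 * raidz_table.c -- tabulate round-up ("dead") sectors per block across
 * RAID-Z group widths, using the space accounting described earlier in
 * this thread.  Assumes 512-byte sectors; illustrative only, not the
 * actual ZFS code.
 */
#include <stdio.h>

int
main(void)
{
	const int nparity = 1;		/* change to 2 for raidz2 */
	int sizes[] = { 2048, 4096, 8192, 16384, 32768, 65536, 131072 };
	int nsizes = sizeof (sizes) / sizeof (sizes[0]);

	printf("width");
	for (int i = 0; i < nsizes; i++)
		printf("%7dk", sizes[i] / 1024);
	printf("\n");

	for (int ndisks = nparity + 2; ndisks <= 12; ndisks++) {
		printf("%5d", ndisks);
		for (int i = 0; i < nsizes; i++) {
			int data   = sizes[i] / 512;
			int dcols  = ndisks - nparity;
			int parity = nparity * ((data + dcols - 1) / dcols);
			int total  = data + parity;
			int asize  = ((total + nparity) / (nparity + 1)) *
			    (nparity + 1);
			printf("%8d", asize - total);	/* round-up sectors */
		}
		printf("\n");
	}
	return (0);
}

With these defaults it prints, for widths 3 through 12 and recordsizes
2k through 128k, how many round-up sectors each block leaves behind;
the power-of-two-plus-one widths (3, 5, and 9) come out at zero in the
8k-and-larger columns, matching the single-parity rule above, and the
table gives a quick first look at a proposed split such as 5+6 versus
5+3+3 before running an actual performance comparison.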