I believe I have tracked down the problem discussed in the "low
disk performance" thread. It seems that an alignment issue will
cause small file/block performance to be abysmal on a RAID-Z.

metaslab_ff_alloc() seems to naturally align all allocations, and
so all blocks will be aligned to asize on a RAID-Z. At certain
block sizes which do not produce full-width writes, contiguous
writes will leave holes of dead space in the RAID-Z.

What I have observed with the iosnoop dtrace script is that the
first disks aggregate the single-block writes, while the last disk(s)
are forced to do numerous writes every other sector. If you would
like to reproduce this, simply copy a large file to a recordsize=4k
filesystem on a 4-disk RAID-Z.

It would probably fix the problem if this dead space were explicitly
zeroed to allow the writes to be aggregated, but that would be
an egregious hack. If the alignment constraints could be relaxed,
though, that should improve the parity distribution, as well as get
rid of the dead space and the associated problem.

Chris
Chris Csanady wrote:
> I believe I have tracked down the problem discussed in the "low
> disk performance" thread. It seems that an alignment issue will
> cause small file/block performance to be abysmal on a RAID-Z.
>
> metaslab_ff_alloc() seems to naturally align all allocations, and
> so all blocks will be aligned to asize on a RAID-Z. At certain
> block sizes which do not produce full-width writes, contiguous
> writes will leave holes of dead space in the RAID-Z.
>
> What I have observed with the iosnoop dtrace script is that the
> first disks aggregate the single-block writes, while the last disk(s)
> are forced to do numerous writes every other sector. If you would
> like to reproduce this, simply copy a large file to a recordsize=4k
> filesystem on a 4-disk RAID-Z.

Why would I want to set recordsize=4k if I'm using large files?
For that matter, why would I ever want to use a recordsize=4k? Is
there a database which needs 4k record sizes?

> It would probably fix the problem if this dead space were explicitly
> zeroed to allow the writes to be aggregated, but that would be
> an egregious hack. If the alignment constraints could be relaxed,
> though, that should improve the parity distribution, as well as get
> rid of the dead space and the associated problem.

This is one of those things I wanted to look at in my copious spare
time. Has anyone else done similar analysis?
 -- richard
On 9/26/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
> Chris Csanady wrote:
> > What I have observed with the iosnoop dtrace script is that the
> > first disks aggregate the single-block writes, while the last disk(s)
> > are forced to do numerous writes every other sector. If you would
> > like to reproduce this, simply copy a large file to a recordsize=4k
> > filesystem on a 4-disk RAID-Z.
>
> Why would I want to set recordsize=4k if I'm using large files?
> For that matter, why would I ever want to use a recordsize=4k? Is
> there a database which needs 4k record sizes?

Sorry, I wasn't very clear about the reasoning for this. It is not
something that you would normally do, but it generates just the right
combination of block size and stripe width to make the problem very
apparent.

It is also possible to encounter this on a filesystem with the default
recordsize, and I have observed the effect while extracting a large
archive of sources. Still, it was never bad enough for my uses to be
anything more than a curiosity. However, while trying to rsync 100M
~1k files onto a 4-disk RAID-Z, Gino Ruopolo seemingly stumbled upon
this worst-case performance scenario. (Though, unlike my example, it
is also possible to end up with holes in the second column.)

Also, while it may be a small error, could these stranded sectors
throw off the space accounting enough to cause problems when a pool
is nearly full?

Chris
Thanks, Chris, for digging into this and sharing your results. These
seemingly stranded sectors are actually properly accounted for in terms
of space utilization, since they are actually unusable while maintaining
integrity in the face of a single drive failure.

The way the RAID-Z space accounting works is this:

  1) Take the size of your data block (4k in your example) and figure
     out how much parity you need to protect it. This turns out to be
     3 sectors, for a total of 11 (5.5k). See vdev_raidz_asize() for
     details.

  2) For single-parity RAID-Z, round up to a multiple of 2 sectors,
     and for double-parity RAID-Z, round up to a multiple of 3 sectors.
     This becomes ASIZE (6k in your case). The reason for this is a
     bit complicated, but without this roundup, you can end up with
     stranded sectors that are unallocated and unusable, leading to
     the question, "I still have free space, why can't I write a
     file?" We simply account for these roundup sectors as part of
     the allocation that caused them.

  3) Allocate space for ASIZE bytes from the RAID-Z space map. With
     the first-fit allocator, this aligns the write to the greatest
     power of 2 that evenly divides ASIZE (2k in this case).

With all this in mind, what winds up happening is exactly what Chris
surmised. In this illustration, "A" represents a single sector of data
and "A." indicates its parity.

          Disk    A    B    C    D
          ------------------------
    LBA 0         A.   A    A    A
        1         A.   A    A    A
        2         A.   A    A    X
        3         B.   B    B    B
        4         B.   B    B    B
        5         B.   B    B    X

And so forth. In this scenario, you wind up with the described
situation of non-contiguous writes on one of the disks, which will
kill the performance. Sorry about that.

Jeff and I had actually talked at one point about how we could fix
this. Basically, you could represent the "X" dead sector as an
opportunistic write that would only get sent to disk if it got
aggregated, and would get dropped on the floor otherwise. I think it
wouldn't be too bad with some pipeline tricks. If anyone is
interested enough to pick this up, let me know and we can discuss the
details.


--Bill
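The accounting above is easy to check with a small standalone program.
The sketch below is illustrative only: it assumes 512-byte sectors,
ignores the parity rotation the real code performs, and is not the
actual vdev_raidz_asize() or allocator code (the file name and
command-line arguments are made up for the example). It simply walks
through steps 1 to 3 and prints an occupancy grid in the same style as
the diagram, marking the round-up sectors with "X".

/*
 * raidz_sketch.c -- illustrative RAID-Z space accounting, following the
 * description in this thread.  Assumes 512-byte sectors and ignores
 * parity rotation.  This is NOT the actual vdev_raidz_asize() or
 * metaslab code, just the arithmetic.
 *
 *   cc -o raidz_sketch raidz_sketch.c
 *   ./raidz_sketch 4096 4 1     (4k block, 4 disks, single parity)
 */
#include <stdio.h>
#include <stdlib.h>

#define	SECTOR	512

int
main(int argc, char **argv)
{
	int psize   = (argc > 1) ? atoi(argv[1]) : 4096; /* block size, bytes  */
	int ndisks  = (argc > 2) ? atoi(argv[2]) : 4;    /* disks in the group */
	int nparity = (argc > 3) ? atoi(argv[3]) : 1;    /* 1=raidz, 2=raidz2  */
	int dcols   = ndisks - nparity;                  /* data columns       */

	if (psize < 1 || nparity < 1 || dcols < 1) {
		fprintf(stderr, "usage: %s [bytes] [ndisks] [nparity]\n", argv[0]);
		return (1);
	}

	/* Step 1: data sectors, plus enough parity sectors to cover them. */
	int data   = (psize + SECTOR - 1) / SECTOR;
	int parity = nparity * ((data + dcols - 1) / dcols);
	int total  = data + parity;

	/* Step 2: round the allocation up to a multiple of (nparity + 1). */
	int asize  = ((total + nparity) / (nparity + 1)) * (nparity + 1);

	printf("%d data + %d parity = %d sectors, asize %d sectors, %d round-up\n",
	    data, parity, total, asize, asize - total);

	/*
	 * Step 3 (occupancy only): the allocation interleaves across the
	 * disks row by row.  Parity is drawn in the first column(s), data
	 * in the rest; the round-up sectors, which are allocated but never
	 * written, appear as "X" at the end.  These are the holes.
	 */
	for (int s = 0; s < asize; s++) {
		if (s % ndisks == 0)
			printf("LBA %2d ", s / ndisks);
		if (s >= total)
			printf("  X ");
		else if (s % ndisks < nparity)
			printf("  P.");
		else
			printf("  D ");
		if (s % ndisks == ndisks - 1 || s == asize - 1)
			printf("\n");
	}
	return (0);
}

Run as "./raidz_sketch 4096 4 1" it reproduces the 4k-on-4-disks grid
above; run as "./raidz_sketch 4096 5 1" or "./raidz_sketch 4096 3 1"
it reports no round-up sector, which matches two of the alternatives
suggested later in the thread.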
Thank you Bill for your clear description.

Now I have to find a way to justify to my head office that after
spending 100k+ on hardware and migrating to "the most advanced OS"
we are running about 8 times slower :)

Anyway, I have a problem much more serious than the rsync process
speed. I hope you'll help me sort it out!

Our situation:

/data/a
/data/b
/data/zones/ZONEX  (whole root zone)

As you know, I have a process running "rsync -ax /data/a/* /data/b"
for about 14 hrs. The problem is that, while that rsync process is
running, ZONEX is completely unusable because of the rsync I/O load.
Even though we're using FSS, Solaris seems unable to give a small
amount of I/O resource to ZONEX's activity ...

I know that FSS doesn't deal with I/O, but I think Solaris should be
smarter ...

To draw a comparison, FreeBSD Jail doesn't suffer from this problem ...

thanks,
Gino
observations below...

Bill Moore wrote:
> With all this in mind, what winds up happening is exactly what Chris
> surmised. In this illustration, "A" represents a single sector of data
> and "A." indicates its parity.
>
>           Disk    A    B    C    D
>           ------------------------
>     LBA 0         A.   A    A    A
>         1         A.   A    A    A
>         2         A.   A    A    X
>         3         B.   B    B    B
>         4         B.   B    B    B
>         5         B.   B    B    X

In the interim, does it make sense for a simple rule of thumb?
For example, in the above case, I would not have the hole if I did
any of the following:
  1. add one disk
  2. remove one disk
  3. use raidz2 instead of raidz

More generally, I could suggest that we use an odd number of vdevs
for raidz and an even number for mirrors and raidz2.
Thoughts?
 -- richard
Richard Elling - PAE wrote:
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.
> Thoughts?

Sounds good to me. I'd make sure it's in the same section of the BP
guide as the "Align the block size with your app..." type notes.
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.
> Thoughts?

uhm ... we found serious performance problems also using a RAID-Z of
3 LUNs ...

Gino
On Wed, 27 Sep 2006, Gino Ruopolo wrote:

> Thank you Bill for your clear description.
>
> Now I have to find a way to justify to my head office that after
> spending 100k+ on hardware and migrating to "the most advanced OS"
> we are running about 8 times slower :)
>
> Anyway, I have a problem much more serious than the rsync process
> speed. I hope you'll help me sort it out!
>
> Our situation:
>
> /data/a
> /data/b
> /data/zones/ZONEX  (whole root zone)
>
> As you know, I have a process running "rsync -ax /data/a/* /data/b"
> for about 14 hrs. The problem is that, while that rsync process is
> running, ZONEX is completely unusable because of the rsync I/O load.
> Even though we're using FSS, Solaris seems unable to give a small
> amount of I/O resource to ZONEX's activity ...
>
> I know that FSS doesn't deal with I/O, but I think Solaris should be
> smarter ...

What about using ipqos (man ipqos)?

> To draw a comparison, FreeBSD Jail doesn't suffer from this problem ...
>
> thanks,
> Gino

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
> > Even though we're using FSS, Solaris seems unable to give a small
> > amount of I/O resource to ZONEX's activity ...
> >
> > I know that FSS doesn't deal with I/O, but I think Solaris should
> > be smarter ...
>
> What about using ipqos (man ipqos)?

I'm not referring to "network I/O" but to "storage I/O" ...

thanks,
Gino
Gino Ruopolo writes:
 > The problem is that, while that rsync process is running, ZONEX is
 > completely unusable because of the rsync I/O load. Even though we're
 > using FSS, Solaris seems unable to give a small amount of I/O
 > resource to ZONEX's activity ...

Under a streaming write load, we kind of overwhelm the devices and
reads are few and far between. To alleviate this we need to throttle
writers somewhat more:

	6429205 each zpool needs to monitor its throughput
		and throttle heavy writers

This is in a state of "fix in progress". At the same time, the notion
of reserved slots for reads is being investigated. That should do
wonders for your issue.

I don't know how to work around this for now (apart from starving the
rsync process of CPU access).

-r
On September 27, 2006 11:27:16 AM -0700 Richard Elling - PAE
<Richard.Elling at Sun.COM> wrote:
> In the interim, does it make sense for a simple rule of thumb?
> For example, in the above case, I would not have the hole if I did
> any of the following:
>   1. add one disk
>   2. remove one disk
>   3. use raidz2 instead of raidz
>
> More generally, I could suggest that we use an odd number of vdevs
> for raidz and an even number for mirrors and raidz2.

I have a 12-drive array (JBOD). I was going to carve out a 5-way
raidz and a 6-way raidz, both in one pool. Should I do 5-3-3 instead?

-frank
Frank Cusack wrote:
> On September 27, 2006 11:27:16 AM -0700 Richard Elling - PAE
> <Richard.Elling at Sun.COM> wrote:
>> In the interim, does it make sense for a simple rule of thumb?
>> For example, in the above case, I would not have the hole if I did
>> any of the following:
>>   1. add one disk
>>   2. remove one disk
>>   3. use raidz2 instead of raidz
>>
>> More generally, I could suggest that we use an odd number of vdevs
>> for raidz and an even number for mirrors and raidz2.

These rules are not generally accurate. To eliminate the blank "round
up" sectors for power-of-two blocksizes of 8k or larger, you should use
a power-of-two plus 1 number of disks in your raid-z group -- that is,
3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that
is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use
3 or 5 disks (for double parity, use 4 or 6), and for 2k, use 3 disks
(for double parity, use 4).

> I have a 12-drive array (JBOD). I was going to carve out a 5-way
> raidz and a 6-way raidz, both in one pool. Should I do 5-3-3 instead?

If you know you'll be using lots of small, same-size blocks (e.g., a
database where you're changing the recordsize property), AND you need
the best possible performance, AND you can't afford to use mirroring,
then you should do a performance comparison on both and see how it
works. Otherwise (i.e., in the common case), don't worry about it.

--matt
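To sanity-check a particular layout against these rules (for example,
the 5-way and 6-way groups mentioned above), the same per-block
arithmetic from the earlier sketch can be tabulated across group
widths. As before, this assumes 512-byte sectors, uses single parity
by default, and is only the accounting described in this thread, not
the actual ZFS code; the file name is made up for the example.

/*
 * raidz_table.c -- tabulate round-up ("dead") sectors per block across
 * RAID-Z group widths, using the space accounting described earlier in
 * this thread.  Assumes 512-byte sectors; illustrative only, not the
 * actual ZFS code.
 */
#include <stdio.h>

int
main(void)
{
	const int nparity = 1;		/* change to 2 for raidz2 */
	int sizes[] = { 2048, 4096, 8192, 16384, 32768, 65536, 131072 };
	int nsizes = sizeof (sizes) / sizeof (sizes[0]);

	printf("width");
	for (int i = 0; i < nsizes; i++)
		printf("%7dk", sizes[i] / 1024);
	printf("\n");

	for (int ndisks = nparity + 2; ndisks <= 12; ndisks++) {
		printf("%5d", ndisks);
		for (int i = 0; i < nsizes; i++) {
			int data   = sizes[i] / 512;
			int dcols  = ndisks - nparity;
			int parity = nparity * ((data + dcols - 1) / dcols);
			int total  = data + parity;
			int asize  = ((total + nparity) / (nparity + 1)) *
			    (nparity + 1);
			printf("%8d", asize - total);	/* round-up sectors */
		}
		printf("\n");
	}
	return (0);
}

With these defaults it prints, for widths 3 through 12 and recordsizes
2k through 128k, how many round-up sectors each block leaves behind;
the power-of-two-plus-one widths (3, 5, and 9) come out at zero in the
8k-and-larger columns, matching the single-parity rule above, and the
table gives a quick first look at a proposed split such as 5+6 versus
5+3+3 before running an actual performance comparison.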