Daniel Smedegaard Buus
2010-Mar-03 10:54 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi all :) I've been wanting to make the switch from XFS over RAID5 to ZFS/RAIDZ2 for some time now, ever since I first read about ZFS. Absolutely amazing beast! I've built my own little hobby server at home with a boatload of disks in different sizes, which I've been combining into a RAID5 array on Linux using mdadm in two layers: the first layer is JBODs pooling smaller disks together to match the size of the largest disks, and on top of that a RAID5 layer joining everything into one big block device. A simplified example: a 2TB disk (raw device) + a 2TB JBOD mdadm device created from two 1TB raw devices + a 2TB JBOD mdadm device created from four 500GB raw devices = three 2TB mixed (physical and logical) devices forming the final RAID5 mdadm device.

So, migrating to ZFS, I first looked into logically doing the same, except throwing away the "intermediate JBOD layer"; that is, I thought it'd be nice if ZFS could do that part itself, making intermediate vdevs of smaller disks to use in the final vdev. As I found out, though, this isn't possible. The choices I've come down to are two:

1) Use SVM to create the intermediate logical 2TB devices from smaller raw devices, then create a RAIDZ2 vdev from a mix of physical and logical devices and zpool that.

2) Divide all disks larger than 500GB into 500GB slices, then create four individual RAIDZ2 vdevs directly on the raw devices and slices, and combine them into the final zpool, thus eliminating the need for SVM and maintaining portability between Linux- and Solaris-based systems.

I really prefer the second choice. I do realize this isn't best practice, but considering the drawbacks mentioned, I really don't mind the extra maintenance (it's my hobby ;) ), and I can live with ZFS not being able to utilize the disk cache. As for the mentioned bad idea of UFS and ZFS living on the same drive, that wouldn't be the case here anyway; all slices would be ZFS.
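For concreteness, option 2 might be sketched roughly like this. This is only an illustration under my assumptions: device names are invented, each 2TB disk is cut into four 500GB slices (s0-s3), and each 1TB disk into two (s0-s1); it is not a tested command line.

```shell
# Hypothetical sketch of option 2 (invented device names).
# c1* = 2 TB disks, c2*/c3* = 1 TB disks, c4* = 500 GB disks.
# Each vdev gets one slice from each 2 TB disk, three slices from
# 1 TB disks, and two whole 500 GB disks: 7 members per RAIDZ2 vdev.
zpool create tank \
  raidz2 c1t0d0s0 c1t1d0s0 c2t0d0s0 c2t1d0s0 c2t2d0s0 c4t0d0 c4t1d0 \
  raidz2 c1t0d0s1 c1t1d0s1 c2t0d0s1 c2t1d0s1 c2t2d0s1 c4t2d0 c4t3d0 \
  raidz2 c1t0d0s2 c1t1d0s2 c3t0d0s0 c3t1d0s0 c3t2d0s0 c4t4d0 c4t5d0 \
  raidz2 c1t0d0s3 c1t1d0s3 c3t0d0s1 c3t1d0s1 c3t2d0s1 c4t6d0 c4t7d0
```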
However, what I'm concerned about is that with this setup there'd be four RAIDZ vdevs, of which the 2TB disk would be part of all of them, each 1TB disk would be part of half of them, while each 500GB disk would be part of only one.

The final question, then (sorry for the long-winded buildup ;) ), is: when ZFS pools these four vdevs together, will it be able to detect that the vdevs exist partly on the same disks and act accordingly? And by accordingly I mean: if ZFS just says "hey, there are four vdevs here, better distribute reads and writes as much as possible to maximize throughput and response time", that would be absolutely the right call in all cases where the vdevs utilize separate hardware. But the exact opposite is the case here, where all four vdevs are (partly) on the one 2TB drive. With that approach, the 2TB drive would on the contrary suffer heavy head thrashing as ZFS distributed accesses to four slices on the disk simultaneously. In this particular case, the better approach would be to fill the four vdevs "JBOD style" rather than "RAID style". Does anyone have enough insight into the inner workings of ZFS to help me answer this question? Thanks in advance, Daniel :) -- This message posted from opensolaris.org
Tonmaus
2010-Mar-03 17:18 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi, following the ZFS best practices guide, my understanding is that neither choice is very good. There is maybe a third choice, that is:

pool
------ vdev 1
-------------- disk
-------------- disk
   ...
-------------- disk
...
------ vdev n
-------------- disk
-------------- disk
   ...
-------------- disk

whereas the vdevs will add up in capacity. As far as I understand, the option to use a parity-protected stripe set (i.e. raidz) would be on the vdev layer. As far as I understand, the smallest disk will limit the capacity of the vdev, not of the pool, so the disk size should be constant within a vdev. Potential hot spares would be universally usable for any vdev if they match the size of the largest member of any vdev (i.e. 2 TB). The benefits of that solution are that a physical disk device failure will not affect more than one vdev, and that IO will scale across vdevs along with capacity. The drawback is that the per-vdev redundancy has a price in capacity. I hope I am correct - I am a newbie like you. Regards, Tonmaus -- This message posted from opensolaris.org
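In zpool terms, the layout above might be sketched like this: one vdev per disk size class, so a single disk failure never touches more than one vdev. Device names are invented for illustration; this assumes the six 1 TB and eight 500 GB disks from the original post.

```shell
# Hypothetical sketch: one raidz vdev per disk size class.
# First vdev: six 1 TB disks; second vdev: eight 500 GB disks.
zpool create tank \
  raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
  raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0

# A hot spare sized to the largest vdev member could serve any vdev:
zpool add tank spare c1t0d0
```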
Daniel Smedegaard Buus
2010-Mar-04 08:41 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi tonmaus, thanks for your reply :) I do know that this isn't best practice, and I've also considered the approach you're hinting at of distributing each vdev over different disks. However, this yields a massive loss in capacity if I want double-parity RAIDZ2 (which I do ;) ), and I'd be unable to use my two 2TB disks since I don't have a third one for RAIDZ2 (which, even if I did, would result in a 2TB vdev from 6TB of raw storage space :-O ). So I've boiled it down to one of the two aforementioned solutions, and like I wrote, I'm aware it's not best practice. I'm just wondering whether my 2nd solution will cause head thrashing or not, in which case I'd go for solution no. 1. So if you or anyone else has any insight on the original question, I'd be very happy to hear it :) Thanks :) -- This message posted from opensolaris.org
Tonmaus
2010-Mar-04 11:51 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi, the recommendations I am basing my previous idea on you can find here: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAIDZ_Configuration_Requirements_and_Recommendations

I can confirm some of the recommendations already from personal practice. First and foremost this sentence: "The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups."

One example: I am running 11+1 disks in a single group now. I recently changed the configuration from raidz to raidz2, and the scrub performance dropped from 500 MB/s to approx. 200 MB/s by the imposition of the second parity. I am sure that if I had chosen two raidz groups, the performance would have been even better than the original config, while I could still lose two drives in the pool, as long as both losses didn't occur within a single group. The bottom line is that as you increase the number of stripes in a group, the performance, especially random I/O, will converge toward the performance of a single group member. The only reason why I am sticking with the single-group configuration myself is that performance is "good enough" for what I am doing for now, and that "11 is not so far from 9".

In your case, there are two other aspects:
- if you pool small devices as JBODs below a vdev member, no parity will help you when you lose a member of the underlying JBOD.
- if you use slices as vdev members, performance will drop dramatically.

Regards, tonmaus -- This message posted from opensolaris.org
Daniel Smedegaard Buus
2010-Mar-04 12:33 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
> Hi,

Hi tonmaus :) (btw, isn't that German for Audio Mouse?)

> the corners I am basing my previous idea on you can find here:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAIDZ_Configuration_Requirements_and_Recommendations

Yep, me too :)

> I can confirm some of the recommendations already from personal
> practise. First and foremost this sentence: "The recommended number
> of disks per group is between 3 and 9. If you have more disks, use
> multiple groups."
> One example:
> I am running 11+1 disks in a single group now. I have recently
> changed the configuration from raidz to raidz2, and the performance
> while scrub dropped from 500 MB/s to app. 200 MB/s by the imposition
> of the second parity. I am sure that if I had chosen two groups in
> raidz, the performance would have been even better than the original
> config while I could still loose two drives in the pool unless the
> loss wouldn't occur within a single group.

Except that that "if" is the one that effectively brings the pool down to single-parity redundancy. You'd be counting on luck that the second disk wouldn't fail in the same vdev. And we've probably all heard ever so often that the second disk to fail often fails in the same array, once you add a replacement disk to the degraded array and start rebuilding. So actually, the odds would count against you being lucky. I would've stuck with the larger configuration like you, too :)

> The bottom line is that while increasing the number of stripes in a
> group the performance, especially random I/O, will converge against
> the performance of a single group member.

But in neither case would this apply to my two options. The stripe/device count would be the same for the vdev or vdevs. In option two, the vdev count would be quadrupled, but the device count would be the same. (FYI, the actual raw device count I'm trying to assemble is 2x2TB, 6x1TB, 8x500GB, in both my options resulting in 7 members for the RAIDZ2 vdev(s).)

> The only reason why I am sticking with the single group configuration
> myself is that performance is "good enough" for what I am doing for
> now, and that "11 is not so far from 9".

This is also why I don't mind deviating a bit from the best practices. Performance is less important to me than effective storage space, which again is less important than security through redundancy.

> In your case, there are two other aspects:
> - if you pool small devices as JBODS below a vdev member, no
> superordinate parity will help you when you loose a member of the
> underlying JBOD. The whole pool will just be broken, and you will
> loose a good part of your data.

No, that's not correct. The first option of pooling smaller disks into larger, logical devices via SVM would allow me to theoretically lose up to [b]eight[/b] disks while still having a live zpool (in the case where I lose two logical devices comprised of four 500GB drives each; this would only kill two actual RAIDZ2 members). Using slices, I'd be able to lose up to [b]five[/b] disks (in the case where I'd lose one 2TB disk (affecting all four vdevs) and four 500GB disks, one from each vdev). I'd have to be extremely "lucky", though, for any of these scenarios to actually play out ;) But in any case, both of my options are, redundancy-wise, in the worst case always [b]at least[/b] as robust as distributing the RAIDZ2 vdevs over similar disks, while potentially being even more robust.

> - If you use slices as vdev members, performance will drop
> dramatically.

And this is what I'm asking about, [b]aside[/b] from the issue with ZFS not being able to utilize the sliced drives' caches. Because performance is priority 3 for me. But if head thrashing occurs, the slice-n-dice method is clearly not the way ahead for me ;) So I'm still very open to any knowledge on this particular question.

> I can't see that raidz2 would be a good choice unless on the group
> layer, and raidz is probably good enough with comparably small disks
> and pool size.
> On the other side I am very curious what your findings are trying
> what you have in mind... :-)

I'm already planning to do a blog post on this once I'm done :) It'll even include pictures of my modded computer case (just drilled 221 air holes in the front the other day ;) ). Cheers, Daniel :) -- This message posted from opensolaris.org
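The capacity and redundancy figures being traded back and forth can be checked with a bit of arithmetic. This is my own back-of-the-envelope sketch, assuming the stated inventory (2x2TB, 6x1TB, 8x500GB) and 7-member RAIDZ2 groups in both options; it ignores ZFS metadata overhead.

```shell
# Sizes in GB. Raw inventory: 2x2TB + 6x1TB + 8x500GB = 14 TB.
raw_total=$(( 2 * 2000 + 6 * 1000 + 8 * 500 ))

# Option 1: SVM concats give 7 RAIDZ2 members of 2 TB each
# (2 raw 2TB disks + 3 concats of 2x1TB + 2 concats of 4x500GB).
opt1_usable=$(( (7 - 2) * 2000 ))

# Option 2: 28 slices of 500 GB -> four 7-member RAIDZ2 vdevs.
opt2_usable=$(( 4 * (7 - 2) * 500 ))

echo "option 1 usable: ${opt1_usable} GB"                 # 10000
echo "option 2 usable: ${opt2_usable} GB"                 # 10000
echo "parity payload:  $(( raw_total - opt1_usable )) GB" # 4000
```

Both layouts end up at 10 TB usable out of 14 TB raw, i.e. the same 4 TB parity payload; the difference between them is purely in failure patterns and I/O behaviour.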
Tonmaus
2010-Mar-05 22:34 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi,

> > In your case, there are two other aspects:
> > - if you pool small devices as JBODS below a vdev member, no
> > superordinate parity will help you when you loose a member of the
> > underlying JBOD. The whole pool will just be broken, and you will
> > loose a good part of your data.
>
> No, that's not correct. The first option of pooling smaller disks
> into larger, logical devices via SVM would allow me to theoretically
> lose up to [b]eight[/b] disks while still having a live zpool (in the
> case where I lose 2 logical devices comprised of four 500GB drives
> each; this would only kill two actual RAIDZ2 members).

You are right, I was wrong with the JBOD observation. In the worst case, though, the array still can't tolerate more than 2 disk failures, if the disk failures are spread across different 2 TB building blocks.

> Using slices, I'd be able to lose up to [b]five[/b] disks (in the
> case where I'd lose one 2TB disk (affecting all four vdevs) and four
> 500GB disks, one from each vdev).

As a single 2 TB disk causes a failure in each group in scenario 2, the worst case here is likewise "3 disks and you are out". This circumstance limits the options for playing with grouping to not less than 4 groups with that setup. The payload for redundancy in both scenarios is consequently 4 TB (with no hot spare).

Doesn't all that point at option 1 as the better choice, as the performance will be much better? Slicing the 2 TB drives will obviously leave you with basically un-cached IO for those members, dominating the rest of the array.

One more thing with SVM is unclear to me: if one of the smaller disks goes, from the ZFS perspective the whole JBOD has to be resilvered. But what will be the interactions between fixing the JBOD in SVM and resilvering in ZFS? Regards, Tonmaus -- This message posted from opensolaris.org