Daniel Smedegaard Buus
2010-Mar-03  10:54 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi all :)

I've been wanting to make the switch from XFS over RAID5 to ZFS/RAIDZ2 for some time now, ever since I first read about ZFS. Absolutely amazing beast!

I've built my own little hobby server at home and have a boatload of disks in different sizes that I've been using together to build a RAID5 array on Linux using mdadm in two layers: the first layer is JBODs pooling together smaller disks to match the size of the largest disks, and on top of that, a RAID5 layer joins everything into one big block device.

A simplified example is this: a 2TB disk (raw device) + a 2TB JBOD mdadm device created from two 1TB raw devices + a 2TB JBOD mdadm device created from four 500GB raw devices = 3x2TB mixed (physical and logical) devices forming the final RAID5 mdadm device.

So, migrating to ZFS, I first examined the possibility of logically doing the same, except throwing away the "intermediate JBOD layer". That is, I thought it'd be nice if ZFS could do that part, i.e. make intermediate vdevs of smaller disks to use in the final vdev. As I found out, this isn't possible, though. I've come down to two choices:

1) Use SVM to create the intermediate logical 2TB devices from smaller raw devices, then create a RAIDZ2 vdev using a mix of physical and logical devices and zpool that.

2) Divide all disks larger than 500GB into 500GB slices, then create 4 individual RAIDZ2 vdevs directly on the raw devices, and combine them into the final zpool, thus eliminating the need for SVM and maintaining portability between Linux and Solaris based systems.

I really prefer the second choice. I do realize this isn't best practice, but considering the drawbacks mentioned, I really don't mind the extra maintenance (it's my hobby ;) ), and I can live with ZFS not being able to utilize the disk cache. There's also the often-mentioned bad idea of UFS and ZFS living on the same drive, but that wouldn't be the case here anyway. All slices would be all ZFS.
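The slicing in choice 2 can be sketched in a few lines of Python. This is only an illustration for the simplified example above (one 2TB, two 1TB and four 500GB disks); the disk names and the greedy assignment rule are assumptions made for the sketch, not something ZFS does for you.

```python
from collections import defaultdict

SLICE_GB = 500

# Simplified example from the post. The disk names are made up.
disks = {"big0": 2000, "mid0": 1000, "mid1": 1000,
         "small0": 500, "small1": 500, "small2": 500, "small3": 500}

def slice_layout(disks, n_vdevs, slice_gb=SLICE_GB):
    """Assign 500 GB slices to vdevs so no vdev holds two slices of one disk."""
    vdevs = [[] for _ in range(n_vdevs)]
    # Largest disks first; each slice goes to the emptiest vdev that does
    # not already contain a slice of the same disk. (Raises ValueError if a
    # disk has more slices than there are vdevs.)
    for name, size in sorted(disks.items(), key=lambda kv: -kv[1]):
        for s in range(size // slice_gb):
            candidates = [v for v in vdevs if all(d != name for d, _ in v)]
            min(candidates, key=len).append((name, s))
    return vdevs

def vdevs_per_disk(vdevs):
    """Count how many vdevs each physical disk takes part in."""
    counts = defaultdict(int)
    for members in vdevs:
        for disk, _ in members:
            counts[disk] += 1
    return dict(counts)

layout = slice_layout(disks, 4)
for i, members in enumerate(layout):
    print(f"vdev{i}:", [f"{disk}:s{s}" for disk, s in members])
print(vdevs_per_disk(layout))
```

With this layout every vdev gets three members, the 2TB disk appears in all four vdevs, each 1TB disk in two, and each 500GB disk in one.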
However, what I'm concerned about is that with this setup, there'd be 4 RAIDZ vdevs, of which the 2TB disk would be part of all of them, the 1TB disks would each be part of half of them, and the 500GB disks would each be part of only one of them.

The final question, then (sorry for the long-winded buildup ;) ), is: when ZFS pools together these four vdevs, will it be able to detect that these vdevs exist partly on the same disks and act accordingly?

By accordingly, I mean this: if you just say "hey, there are four vdevs for me, better distribute reads and writes as much as possible to maximize throughput and response time", then that would be absolutely right in all cases where the vdevs utilize separate hardware. But the exact opposite is the case here, where all four vdevs are (partly) on the one 2TB drive. If this approach is used here, the 2TB drive would instead suffer from heavy head thrashing when ZFS distributes accesses to four slices on the disk simultaneously. In this particular case, the best approach would be to combine the four vdevs in a "JBOD style" rather than a "RAID style".

Does anyone have enough insight into the inner workings of ZFS to help me answer this question?

Thanks in advance,
Daniel :)
-- This message posted from opensolaris.org
Tonmaus
2010-Mar-03  17:18 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi,

following the ZFS best practice guide, my understanding is that neither choice is very good. There is maybe a third choice, that is:

pool
------vdev1
--------------disk
--------------disk
.....
--------------disk
...
------vdev n
--------------disk
--------------disk
.....
--------------disk

where the vdevs add up in capacity. As far as I understand, the option to use a parity-protected stripe set (i.e. raidz) would be on the vdev layer. As far as I understand, the smallest disk will limit the capacity of the vdev, not of the pool, so the disk size should be constant within a vdev. Potential hot spares would be universally usable for any vdev if they match the size of the largest member of any vdev (i.e. 2 TB).

The benefits of that solution are that a physical disk failure will not affect more than one vdev, and that I/O will scale across vdevs as much as capacity does. The drawback is that the per-vdev redundancy has a price in capacity.

I hope I am correct - I am a newbie like you.

Regards,
Tonmaus
Daniel Smedegaard Buus
2010-Mar-04  08:41 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi tonmaus, thanks for your reply :)

I do know that this isn't best practice, and I've also considered the approach you're hinting at of distributing each vdev over different disks. However, this yields a massive loss in capacity if I want double-parity RAIDZ2 (which I do ;) ), and I'd be unable to use my two 2TB disks since I don't have a third one for RAIDZ2 (which, even if I did, would result in a 2TB vdev from 6TB of raw storage space :-O ).

So I've boiled it down to one of the two aforementioned solutions, and like I wrote, I'm aware of it not being best practice. I'm just wondering whether my 2nd solution will cause head thrashing or not, in which case I'd be going for solution no. 1.

So if you or anyone else has any insight on the original question, I'd be very happy to hear it :)

Thanks :)
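The "2TB vdev from 6TB of raw storage" figure follows from the usual rule of thumb for raidz capacity. A minimal sketch, assuming usable space is roughly (members - parity) x smallest member and ignoring metadata overhead; the function name is made up for illustration:

```python
def raidz_usable_tb(member_sizes_tb, parity):
    """Approximate usable capacity of one raidz vdev (no metadata overhead)."""
    if len(member_sizes_tb) <= parity:
        raise ValueError("a raidz vdev needs more members than parity devices")
    # The smallest member limits how much of every other member is used.
    return (len(member_sizes_tb) - parity) * min(member_sizes_tb)

three_big = [2.0, 2.0, 2.0]
print(sum(three_big), raidz_usable_tb(three_big, parity=2))  # 6.0 2.0

# Two 2TB disks alone cannot form a RAIDZ2 vdev at all.
try:
    raidz_usable_tb([2.0, 2.0], parity=2)
except ValueError as err:
    print("not possible:", err)
```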
Tonmaus
2010-Mar-04  11:51 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi,

the cornerstones my previous idea is based on can be found here:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAIDZ_Configuration_Requirements_and_Recommendations

I can already confirm some of the recommendations from personal practice. First and foremost this sentence: "The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups."

One example: I am running 11+1 disks in a single group now. I have recently changed the configuration from raidz to raidz2, and the performance during a scrub dropped from 500 MB/s to approx. 200 MB/s with the addition of the second parity. I am sure that if I had chosen two groups in raidz, the performance would have been even better than the original config, while I could still lose two drives in the pool as long as both losses didn't occur within a single group.

The bottom line is that as you increase the number of stripes in a group, the performance, especially random I/O, will converge toward the performance of a single group member. The only reason why I am sticking with the single group configuration myself is that performance is "good enough" for what I am doing for now, and that "11 is not so far from 9".

In your case, there are two other aspects:
- If you pool small devices as JBODs below a vdev member, no parity will help you when you lose a member of the underlying JBOD.
- If you use slices as vdev members, performance will drop dramatically.

Regards,
tonmaus
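How often "two losses within a single group" would happen can be estimated by enumeration. The sketch below assumes 12 disks in total, split into two 6-disk raidz groups, and treats every pair of failed disks as equally likely; all three are assumptions for illustration, and real failures during a rebuild may well be correlated.

```python
from fractions import Fraction
from itertools import combinations

DISKS = range(12)  # assumed total disk count

def survives(failed, groups, parity):
    """A pool survives if no group has lost more disks than its parity level."""
    return all(len(failed & group) <= parity for group in groups)

def fatal_fraction(groups, parity, n_failures=2):
    """Fraction of all n_failures-disk combinations that destroy the pool."""
    combos = [set(c) for c in combinations(DISKS, n_failures)]
    fatal = sum(not survives(c, groups, parity) for c in combos)
    return Fraction(fatal, len(combos))

one_raidz2 = [set(DISKS)]                             # single 12-disk raidz2 group
two_raidz1 = [set(range(0, 6)), set(range(6, 12))]    # two 6-disk raidz groups

print(fatal_fraction(one_raidz2, parity=2))  # 0
print(fatal_fraction(two_raidz1, parity=1))  # 5/11
```

Under these assumptions a double failure never kills the single raidz2 group, while it kills the two-group raidz pool in 5 out of 11 cases.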
Daniel Smedegaard Buus
2010-Mar-04  12:33 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
> Hi,

Hi tonmaus :) (btw, isn't that German for Audio Mouse?)

> the corners I am basing my previous idea on you can find here:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAIDZ_Configuration_Requirements_and_Recommendations

Yep, me too :)

> I can confirm some of the recommendations already from personal practise.
> First and foremost this sentence: "The recommended number of disks per group
> is between 3 and 9. If you have more disks, use multiple groups."
> One example:
> I am running 11+1 disks in a single group now. I have recently changed the
> configuration from raidz to raidz2, and the performance while scrub dropped
> from 500 MB/s to app. 200 MB/s by the imposition of the second parity. I am
> sure that if I had chosen two groups in raidz, the performance would have
> been even better than the original config while I could still lose two
> drives in the pool unless the loss wouldn't occur within a single group.

Except that that "if" is the one that effectively brings the pool down to single-parity redundancy. You'd be counting on luck that the second disk wouldn't fail in the same vdev. And we've probably all heard ever so often that the second disk to fail often fails in the same array, once you add a replacement disk to the degraded array and start rebuilding. So actually, the odds would be against you being lucky. I would've stuck with the larger configuration like you, too :)

> The bottom line is that while increasing the number of stripes in a group
> the performance, especially random I/O, will converge against the
> performance of a single group member.

But in neither case would this apply to my two options. The stripe/device count would be the same for the vdev or vdevs. In option two, the vdev count would be quadrupled, but the device count would be the same.
(FYI, the actual raw device count I'm trying to assemble is 2x2TB, 6x1TB, 8x500GB, in both my options resulting in 7 members for the RAIDZ-2 vdev(s)).

> The only reason why I am sticking with the single group configuration
> myself is that performance is "good enough" for what I am doing for now,
> and that "11 is not so far from 9".

This is also why I don't mind deviating a bit from the best practices. Performance is less important to me than effective storage space, which again is less important than security through redundancy.

> In your case, there are two other aspects:
> - if you pool small devices as JBODS below a vdev member, no superordinate
> parity will help you when you lose a member of the underlying JBOD. The
> whole pool will just be broken, and you will lose a good part of your data.

No, that's not correct. The first option of pooling smaller disks into larger, logical devices via SVM would allow me to theoretically lose up to *eight* disks while still having a live zpool (in the case where I lose 2 logical devices comprised of four 500GB drives each; this would only kill two actual RAIDZ2 members). Using slices, I'd be able to lose up to *five* disks (in the case where I'd lose one 2TB disk (affecting all four vdevs) and four 500GB disks, one from each vdev). I'd have to be extremely "lucky", though, for any of these scenarios to actually play out ;) But in any case, both of my options, redundancy-wise, are in the worst-case scenario always *at least* as robust as distributing the RAIDZ2 vdevs over similar disks, while potentially being even more robust.

> - If you use slices as vdev members, performance will drop dramatically.

And this is what I'm asking, *aside* from the issue with ZFS not being able to utilize the sliced drives' caches. Because performance is priority 3 for me.
But if head thrashing occurs, the slice-n-dice method is clearly not the way ahead for me ;) So I'm still very open to any knowledge on this particular question.

> I can't see that raidz2 would be a good choice unless on the group layer,
> and raidz is probably good enough with comparably small disks and pool size.
> On the other side I am very curious what your findings are trying what you
> have in mind... :-)

I'm already planning to do a blog post on this once I'm done :) It'll even include pictures of my modded computer case (just drilled 221 air holes in the front the other day ;) ).

Cheers,
Daniel :)
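The "7 members in both options" figure for the 2x2TB, 6x1TB, 8x500GB inventory can be checked with a little arithmetic. A rough sketch, assuming decimal sizes and the same simplified capacity rule, (members - parity) x member size, with no metadata overhead:

```python
disk_counts = {2000: 2, 1000: 6, 500: 8}  # disk size in GB -> number of disks
PARITY = 2                                # RAIDZ2

# Option 1: concatenate smaller disks into 2 TB logical devices, one RAIDZ2 vdev.
opt1_members = sum(n * size // 2000 for size, n in disk_counts.items())
opt1_usable_gb = (opt1_members - PARITY) * 2000

# Option 2: cut every disk into 500 GB slices and spread them over four RAIDZ2 vdevs.
total_slices = sum(n * (size // 500) for size, n in disk_counts.items())
opt2_members_per_vdev = total_slices // 4
opt2_usable_gb = 4 * (opt2_members_per_vdev - PARITY) * 500

print(opt1_members, opt1_usable_gb)           # 7 10000
print(opt2_members_per_vdev, opt2_usable_gb)  # 7 10000
```

Both options come out at 7 members per vdev and roughly 10 TB usable out of 14 TB raw, so they differ in layout and failure behaviour rather than in capacity.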
Tonmaus
2010-Mar-05  22:34 UTC
[zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi,

> > In your case, there are two other aspects:
> > - if you pool small devices as JBODS below a vdev member, no
> > superordinate parity will help you when you lose a member of the
> > underlying JBOD. The whole pool will just be broken, and you will lose a
> > good part of your data.
>
> No, that's not correct. The first option of pooling smaller disks into
> larger, logical devices via SVM would allow me to theoretically lose up to
> *eight* disks while still having a live zpool (in the case where I lose 2
> logical devices comprised of four 500GB drives each; this would only kill
> two actual RAIDZ2 members).

You are right. I was wrong with the JBOD observation. In the worst case, the array still can't tolerate more than 2 disk failures, if all disk failures are across different 2 TB building blocks.

> Using slices, I'd be able to lose up to *five* disks (in the case where I'd
> lose one 2TB disk (affecting all four vdevs) and four 500GB disks, one from
> each vdev).

As a single 2 TB disk causes a failure in each group in scenario 2, the worst case here is likewise "3 disks and you are out". This circumstance reduces the options to play with grouping to not less than 4 groups with that setup. Consequently, the payload for redundancy in both scenarios is 4 TB (with no hot spare).

Doesn't that all point to option 1 as the better choice? Performance will obviously be much better, since slicing the 2 TB drives will leave you with basically un-cached I/O for these members, which will dominate the rest of the array.

One more thing about SVM is unclear to me: if one of the smaller disks goes, from the ZFS perspective the whole JBOD has to be resilvered. But what will the interactions be between fixing the JBOD in SVM and resilvering in ZFS?

Regards,
Tonmaus
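The failure scenarios discussed in this exchange can be checked with a small model. This is a sketch under stated assumptions: the disk names are hypothetical (B = 2TB, M = 1TB, S = 500GB), the slice layout for option 2 is one plausible assignment rather than the only one, and a concatenated member is treated as lost as soon as any of its disks fails.

```python
from itertools import combinations

ALL_DISKS = ([f"B{i}" for i in range(2)] + [f"M{i}" for i in range(6)]
             + [f"S{i}" for i in range(8)])

# Option 1: one RAIDZ2 vdev with 7 members built from whole or concatenated disks.
option1 = [[
    {"B0"}, {"B1"},
    {"M0", "M1"}, {"M2", "M3"}, {"M4", "M5"},
    {"S0", "S1", "S2", "S3"}, {"S4", "S5", "S6", "S7"},
]]

# Option 2: four RAIDZ2 vdevs of seven 500 GB slices each (assumed layout).
option2 = [
    [{"B0"}, {"B1"}, {"M0"}, {"M1"}, {"M2"}, {"S0"}, {"S1"}],
    [{"B0"}, {"B1"}, {"M0"}, {"M1"}, {"M2"}, {"S2"}, {"S3"}],
    [{"B0"}, {"B1"}, {"M3"}, {"M4"}, {"M5"}, {"S4"}, {"S5"}],
    [{"B0"}, {"B1"}, {"M3"}, {"M4"}, {"M5"}, {"S6"}, {"S7"}],
]

def pool_survives(pool, failed, parity=2):
    """True if every vdev has at most `parity` members hit by a failed disk."""
    failed = set(failed)
    return all(sum(bool(m & failed) for m in vdev) <= parity for vdev in pool)

def min_fatal_failures(pool):
    """Smallest number of simultaneous disk failures that can destroy the pool."""
    for k in range(1, len(ALL_DISKS) + 1):
        if any(not pool_survives(pool, c) for c in combinations(ALL_DISKS, k)):
            return k

print(min_fatal_failures(option1), min_fatal_failures(option2))  # 3 3
# The two "lucky" scenarios quoted above:
print(pool_survives(option1, [f"S{i}" for i in range(8)]))       # True
print(pool_survives(option2, ["B0", "S0", "S2", "S4", "S6"]))    # True
```

In this model both options share the same worst case of three disks, and both quoted best-case scenarios leave the pool alive.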