Don't hear about triple-parity RAID that often:

> Author: Adam Leventhal
> Repository: /hg/onnv/onnv-gate
> Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
> Total changesets: 1
> Log message:
> 6854612 triple-parity RAID-Z

http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612

(Via Blog O' Matty.)

Would be curious to see performance characteristics.
> Don't hear about triple-parity RAID that often:

I agree completely. In fact, I have wondered (probably in these forums) why we don't bite the bullet and make a generic raidzN, where N is any number >= 0.

In fact, get rid of mirroring, because it clearly is a variant of raidz with two devices. Want three-way mirroring? Call that raidz2 with three devices. The truth is that a generic raidzN would roll up everything: striping, mirroring, parity RAID, double parity, etc. into a single format with one parameter. If memory serves, the second parity is calculated using Reed-Solomon, which implies that any number of parity devices is possible.

Let's not stop there, though. Once we have any number of parity devices, why can't I add a parity device to an array? That should be simple enough with a scrub to set the parity. In fact, what is to stop me from removing a parity device? Once again, I think the code would make this rather easy.

Once we can add and remove parity devices at will, it might not be a stretch to convert a parity device to data and vice versa. If you have four data drives and two parity drives but need more space, in a pinch just convert one parity drive to data and get more storage. The flip side would work as well: if I have six data drives and a single parity drive but have, over the years, replaced them all with vastly larger drives and have space to burn, I might want to convert a data drive to parity. I may sleep better at night.

If we had a generic raidzN, the ability to add/remove parity devices, and the ability to convert a data device from/to a parity device, then what happens? Total freedom. Add devices to the array, or take them away. Choose the blend of performance and redundancy that meets YOUR needs, then change it later when the technology and business needs change, all without interruption.

Ok, back to the real world. The one downside to triple parity is that I recall the code discovered the corrupt block by excluding it from the stripe, reconstructing the stripe and comparing that with the checksum. In other words, for a given cost of X to compute a stripe and a number P of corrupt blocks, the cost of reading a stripe is approximately X^P. More corrupt blocks would radically slow down the system. With raidz2, the maximum number of corrupt blocks would be two, putting a cap on how costly the read can be.

Standard disclaimers apply: I could be wrong, I am often wrong, etc.
--
This message posted from opensolaris.org
On Sat, 18 Jul 2009, Martin wrote:

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three way mirroring? Call that raidz2

I don't see much similarity between mirroring and raidz other than that they both support redundancy.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should
> be simple enough with a scrub to set the parity. In fact, what is
> to stop me from removing a parity device? Once again, I think the
> code would make this rather easy.

A RAID system with distributed parity (like raidz) does not have a "parity device". Instead, all disks are treated as equal. Without distributed parity you have a bottleneck and it becomes difficult to scale the array to different stripe sizes.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> I don't see much similarity between mirroring and raidz other than
> that they both support redundancy.

A single parity device against a single data device is, in essence, mirroring. For all intents and purposes, raid and mirroring with this configuration are one and the same.

> A RAID system with distributed parity (like raidz) does not have a
> "parity device". Instead, all disks are treated as equal. Without
> distributed parity you have a bottleneck and it becomes difficult to
> scale the array to different stripe sizes.

Agreed. Distributed parity is the way to go. Nonetheless, if I have an array with a single parity, then I still have one device dedicated to parity, even if the actual device which holds the parity information will vary from stripe to stripe.

The point simply was that it might be straightforward to add a device and convert a raidz array into a raidz2 array, which effectively would be adding a parity device. An extension of that is to convert a raidz2 array back into a raidz array and increase its size without adding a device.
--
This message posted from opensolaris.org
On Sun, 19 Jul 2009, Martin wrote:

>> I don't see much similarity between mirroring and raidz other than
>> that they both support redundancy.
>
> A single parity device against a single data device is, in essence,
> mirroring. For all intents and purposes, raid and mirroring with
> this configuration are one and the same.

Try creating a raidz pool with two drives (or files), pull one of the drives, and see what happens. Then try the same with mirroring. Do they behave the same? I expect not ... I am not sure if raidz even allows you to create a pool with just two drives.

> The point simply was that it might be straightforward to add a
> device and convert a raidz array into a raidz2 array, which
> effectively would be adding a parity device. An extension of that
> is to convert a raidz2 array back into a raidz array and increase
> its size without adding a device.

That would be nice. Before developers worry about such exotic features, I would rather that they attend to the gross performance issues so that zfs performs at least as well as Windows NTFS or Linux XFS in all common cases.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
In response to:

>> I don't see much similarity between mirroring and raidz other than
>> that they both support redundancy.

Martin wrote:

> A single parity device against a single data device is, in essence, mirroring.
> For all intents and purposes, raid and mirroring with this configuration are
> one and the same.

I would have to disagree with this. Mirrored data will have multiple copies of the actual data. Any copy is a valid source for data access. Lose one disk and the other is a complete "original". A RAID 3/4/5/6/z/z2 configuration will generate a mathematical value to restore a portion of the lost data from one of the storage units in the stripe. A 2-disk raidz will have 1/2 of each disk's used space holding primary data interlaced with the other 1/2 holding a parity "reflection" of the data. Any time we access the parity representation, some computation will be needed to render the live data. This would have to add *some* overhead to the I/O.

Craig Cory
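To make that distinction concrete, here is a minimal sketch in plain C -- not ZFS source; the stripe width and "sector" size are invented for illustration -- of single-parity reconstruction by XOR. In the degenerate one-data-plus-one-parity case the parity column is a byte-for-byte copy of the data, which is why a two-disk raidz resembles a mirror, yet a degraded read still runs the reconstruction loop below. The second and third parity columns of raidz2/raidz3 use Reed-Solomon-style coefficients rather than plain XOR, so this only models the single-parity case.

    /*
     * Minimal sketch, not ZFS code: single-parity reconstruction by XOR.
     * Stripe width and sector size are made up for illustration.
     */
    #include <stdio.h>
    #include <string.h>

    #define COLS    4       /* 3 data columns + 1 parity column */
    #define SECT    8       /* tiny "sector" for the example */

    int
    main(void)
    {
            unsigned char col[COLS][SECT] = { "dataAAA", "dataBBB", "dataCCC" };
            unsigned char lost[SECT];
            int c, b;

            /* Parity is the XOR of all data columns. */
            for (c = 0; c < COLS - 1; c++)
                    for (b = 0; b < SECT; b++)
                            col[COLS - 1][b] ^= col[c][b];

            /* Simulate losing data column 1, then rebuild it from the rest. */
            memcpy(lost, col[1], SECT);
            memset(col[1], 0, SECT);
            for (c = 0; c < COLS; c++)
                    if (c != 1)
                            for (b = 0; b < SECT; b++)
                                    col[1][b] ^= col[c][b];

            printf("rebuilt column matches original: %s\n",
                memcmp(lost, col[1], SECT) == 0 ? "yes" : "no");
            return (0);
    }

With COLS set to 2 (one data column, one parity column), the parity loop simply copies the data verbatim, which is Martin's point; the XOR work on the degraded read path is Craig's.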
which gap?
--
This message posted from opensolaris.org
> which gap?
>
> 'RAID-Z should mind the gap on writes'?
>
> Message was edited by: thometal

I believe this is in reference to the RAID-5 write hole, described here:

http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

RAIDZ should avoid this via its copy-on-write model:

http://en.wikipedia.org/wiki/Zfs#Copy-on-write_transactional_model

So I'm not sure what the 'RAID-Z should mind the gap on writes' comment is getting at either. Clarification?

-Scott
--
This message posted from opensolaris.org
http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html

The second bug; it's the same link as in the first post.
--
This message posted from opensolaris.org
> That would be nice. Before developers worry about such exotic
> features, I would rather that they attend to the gross performance
> issues so that zfs performs at least as well as Windows NTFS or Linux
> XFS in all common cases.

To each their own.

A FS that calculates and writes parity onto disks will have difficulty being as fast as a FS that just dumps data. A FS that verifies the parity of the data it reads will have difficulty being as fast as a FS that just returns whatever it reads. I cannot see how this can happen. That's no reason not to aim for low overhead, but one has to make choices here. Mine is data safety and ease of use, so I'd love the "elastic" zpool idea.

Of course, others will have different needs. Enterprises will not care about ease so much, as they have dedicated professionals to pamper their arrays. They can also address speed issues with more spindles.

ZFS+RAIDZ provides data integrity that no RAID level can match, thanks to its checksumming. That's worth a speed sacrifice in my book.

Anything I missed?
--
This message posted from opensolaris.org
On Mon, 20 Jul 2009, chris wrote:

>> That would be nice. Before developers worry about such exotic
>> features, I would rather that they attend to the gross performance
>> issues so that zfs performs at least as well as Windows NTFS or Linux
>> XFS in all common cases.
>
> To each their own.

I was referring to gripes about performance in another discussion thread, and not due to RAID-Z3. I don't think that adding another parity disk will make much difference to performance. Adding another parity disk has a similar performance impact as making the stripe one disk wider.

MTTDL analysis shows that given normal environmental conditions, the MTTDL of RAID-Z2 is already much longer than the life of the computer or the attendant human. Of course sometimes one encounters unusual conditions where additional redundancy is desired.

I do think that it is worthwhile to be able to add another parity disk to an existing raidz vdev, but I don't know how much work that entails.

Zfs development seems to be overwhelmed with marketing-driven requirements lately and it is time to get back to brass tacks and make sure that the parts already developed are truly enterprise-grade.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> Enterprises will not care about ease so much as they
> have dedicated professionals to pamper their arrays.

Enterprises can afford the professionals. I work for a fairly large bank which can, and does, afford a dedicated storage team. On the other hand, no enterprise can afford downtime. Where I work, a planned outage is a major event and any solution which allows flexibility without an outage is most welcome. While I am unfamiliar with the innards of VxFS, I have seen several critical production VxFS mount points expanded with little or no interruption.

ZFS is so close on so many levels.
--
This message posted from opensolaris.org
Hey Bob,

> MTTDL analysis shows that given normal environmental conditions, the
> MTTDL of RAID-Z2 is already much longer than the life of the
> computer or the attendant human. Of course sometimes one encounters
> unusual conditions where additional redundancy is desired.

To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives, meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver?

> I do think that it is worthwhile to be able to add another parity
> disk to an existing raidz vdev but I don't know how much work that
> entails.

It entails a bunch of work:

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Matt Ahrens is working on a key component after which it should all be possible.

> Zfs development seems to be overwhelmed with marketing-driven
> requirements lately and it is time to get back to brass tacks and
> make sure that the parts already developed are truly enterprise-
> grade.

While I don't disagree that the focus for ZFS should be ensuring enterprise-class reliability and performance, let me assure you that requirements are driven by the market and not by marketing.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
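For what it's worth, those resilver estimates fall out of simple throughput arithmetic. A back-of-the-envelope sketch in C, with the ~70 MB/s and ~35 MB/s figures chosen as assumptions to match the 4-hour and one-day estimates rather than taken from measurements:

    /*
     * Back-of-the-envelope resilver time, assuming the resilver is limited
     * only by sustained drive throughput.  The MB/s figures are assumptions.
     */
    #include <stdio.h>

    static double
    resilver_hours(double capacity_tb, double mb_per_sec)
    {
            return (capacity_tb * 1e12 / (mb_per_sec * 1e6) / 3600.0);
    }

    int
    main(void)
    {
            /* ~70 MB/s sustained gives the "about 4 hours" best case for 1TB. */
            printf("1TB @ 70 MB/s: %.1f hours\n", resilver_hours(1.0, 70.0));
            /* Half that rate on a 3TB drive is roughly a full day. */
            printf("3TB @ 35 MB/s: %.1f hours\n", resilver_hours(3.0, 35.0));
            return (0);
    }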
>> which gap?
>>
>> 'RAID-Z should mind the gap on writes'?
>>
>> Message was edited by: thometal
>
> I believe this is in reference to the raid 5 write hole, described
> here:
> http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

It's not.

> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?

I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.

I hope that's clear; if it's not, stay tuned for the aforementioned blog post.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
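A rough illustration of the rounding described above, as arithmetic only -- this is not the ZFS allocation code, and it assumes the block fits in a single row so the parity cost is simply nparity sectors:

    /*
     * Sketch of RAID-Z sector rounding: the allocation (data + parity) is
     * rounded up to a multiple of nparity + 1, and the difference appears
     * as skipped "gap" sectors.  Assumes a single-row write for simplicity.
     */
    #include <stdio.h>

    int
    main(void)
    {
            int nparity = 1;                        /* raidz1 */
            int sizes[] = { 1, 2, 3, 5, 8 };        /* data sectors in a block */
            int i;

            for (i = 0; i < 5; i++) {
                    int total = sizes[i] + nparity;
                    int unit = nparity + 1;
                    int rounded = ((total + unit - 1) / unit) * unit;

                    printf("%d data sectors: allocate %d, skip %d\n",
                        sizes[i], rounded, rounded - total);
            }
            return (0);
    }

For raidz1 this pads every odd-sized allocation by one sector; those padding sectors are the gaps that the 'optional' I/Os let the aggregation layer write over.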
> Don't hear about triple-parity RAID that often:
>
>> Author: Adam Leventhal
>> Repository: /hg/onnv/onnv-gate
>> Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
>> Total changesets: 1
>> Log message:
>> 6854612 triple-parity RAID-Z
>
> http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612
>
> (Via Blog O' Matty.)
>
> Would be curious to see performance characteristics.

I just blogged about triple-parity RAID-Z (raidz3):

http://blogs.sun.com/ahl/entry/triple_parity_raid_z

As for performance, on the system I was using (a max config Sun Storage 7410), I saw about a 25% improvement to 1GB/s for a streaming write workload. YMMV, but I'd be interested in hearing your results.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
>> Don't hear about triple-parity RAID that often:
>
> I agree completely. In fact, I have wondered (probably in these
> forums), why we don't bite the bullet and make a generic raidzN,
> where N is any number >= 0.

I agree, but raidzN isn't simple to implement and it's potentially difficult to get it to perform well. That said, it's something I intend to bring to ZFS in the next year or so.

> If memory serves, the second parity is calculated using Reed-Solomon
> which implies that any number of parity devices is possible.

True; it's a degenerate case.

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three way mirroring? Call that raidz2
> with three devices. The truth is that a generic raidzN would roll
> up everything: striping, mirroring, parity raid, double parity, etc.
> into a single format with one parameter.

That's an interesting thought, but there are some advantages to calling out mirroring, for example, as its own vdev type. As has been pointed out, reading from either side of a mirror involves no computation, whereas reading from a RAID-Z 1+2, for example, would involve more computation. This would complicate the calculus of balancing read operations over the mirror devices.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should
> be simple enough with a scrub to set the parity. In fact, what is
> to stop me from removing a parity device? Once again, I think the
> code would make this rather easy.

With RAID-Z, stripes can be of variable width, meaning that, say, a single row in a 4+2 configuration might have two stripes of 1+2. In other words, there might not be enough space in the new parity device. I did write up the steps that would be needed to support RAID-Z expansion; you can find it here:

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

> Ok, back to the real world. The one downside to triple parity is
> that I recall the code discovered the corrupt block by excluding it
> from the stripe, reconstructing the stripe and comparing that with
> the checksum. In other words, for a given cost of X to compute a
> stripe and a number P of corrupt blocks, the cost of reading a
> stripe is approximately X^P. More corrupt blocks would radically
> slow down the system. With raidz2, the maximum number of corrupt
> blocks would be two, putting a cap on how costly the read can be.

Computing the additional parity of triple-parity RAID-Z is slightly more expensive, but not much -- it's just bitwise operations. Recovering from a read failure is identical (and performs identically) to raidz1 or raidz2 until you actually have sustained three failures. In that case, performance is slower as more computation is involved -- but aren't you just happy to get your data back?

If there is silent data corruption, then and only then can you encounter the O(n^3) algorithm that you alluded to, but only as a last resort. If we don't know which drives failed, we try to reconstruct your data by assuming that one drive, then two drives, then three drives are returning bad data. For raidz1, this was a linear operation; raidz2, quadratic; now raidz3 is N-cubed. There's really no way around it. Fortunately, with proper scrubbing, encountering data corruption in one stripe on three different drives is highly unlikely.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
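To put rough numbers on the linear/quadratic/cubic growth described above, here is a small sketch that counts how many candidate drive combinations a combinatorial reconstruction might have to test after silent corruption. The stripe width of 12 is an assumption for illustration; the real reconstruction logic lives in ZFS's vdev_raidz.c.

    /*
     * Sketch only: count the combinations of suspect drives that
     * combinatorial reconstruction may need to test when no drive reports
     * an error.  For nparity = 1, 2, 3 this grows roughly as n, n^2, n^3.
     */
    #include <stdio.h>

    static unsigned long
    choose(unsigned n, unsigned k)
    {
            unsigned long r = 1;
            unsigned i;

            for (i = 1; i <= k; i++)
                    r = r * (n - k + i) / i;        /* exact at each step */
            return (r);
    }

    int
    main(void)
    {
            unsigned width = 12;    /* assumed number of children in the vdev */
            unsigned long total = 0;
            unsigned p;

            for (p = 1; p <= 3; p++) {
                    total += choose(width, p);
                    printf("raidz%u: up to %lu combinations to test\n",
                        p, total);
            }
            return (0);
    }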
On 22.07.09 10:45, Adam Leventhal wrote:

>>> which gap?
>>>
>>> 'RAID-Z should mind the gap on writes'?
>>>
>>> Message was edited by: thometal
>>
>> I believe this is in reference to the raid 5 write hole, described here:
>> http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance
>
> It's not.
>
>> So I'm not sure what the 'RAID-Z should mind the gap on writes'
>> comment is getting at either.
>>
>> Clarification?
>
> I'm planning to write a blog post describing this, but the basic problem
> is that RAID-Z, by virtue of supporting variable stripe writes (the
> insight that allows us to avoid the RAID-5 write hole), must round the
> number of sectors up to a multiple of nparity+1. This means that we may
> have sectors that are effectively skipped. ZFS generally lays down data
> in large contiguous streams, but these skipped sectors can stymie both
> ZFS's write aggregation as well as the hard drive's ability to group
> I/Os and write them quickly.
>
> Jeff Bonwick added some code to mind these gaps on reads. The key
> insight there is that if we're going to read 64K, say, with a 512 byte
> hole in the middle, we might as well do one big read rather than two
> smaller reads and just throw out the data that we don't care about.
>
> Of course, doing this for writes is a bit trickier since we can't just
> blithely write over gaps as those might contain live data on the disk.
> To solve this we push the knowledge of those skipped sectors down to the
> I/O aggregation layer in the form of 'optional' I/Os purely for the
> purpose of coalescing writes into larger chunks.

This exact issue was discussed here almost three years ago:

http://www.opensolaris.org/jive/thread.jspa?messageID=60241
Adam Leventhal wrote:

> Hey Bob,
>
>> MTTDL analysis shows that given normal environmental conditions, the
>> MTTDL of RAID-Z2 is already much longer than the life of the computer
>> or the attendant human. Of course sometimes one encounters unusual
>> conditions where additional redundancy is desired.
>
> To what analysis are you referring? Today the absolute fastest you can
> resilver a 1TB drive is about 4 hours. Real-world speeds might be half
> that. In 2010 we'll have 3TB drives meaning it may take a full day to
> resilver. The odds of hitting a latent bit error are already
> reasonably high, especially with a large pool that's infrequently
> scrubbed. What then are the odds of a second drive failing in
> the 24 hours it takes to resilver?

I wish it was so good with raid-zN. In real life, at least from my experience, it can take several days to resilver a disk for vdevs in raid-z2 made of 11x SATA disk drives with real data. While the way zfs synchronizes data is way faster under some circumstances, it is also much slower under others. IIRC some builds ago there were some fixes integrated, so maybe it is different now.

>> I do think that it is worthwhile to be able to add another parity
>> disk to an existing raidz vdev but I don't know how much work that
>> entails.
>
> It entails a bunch of work:
>
> http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z
>
> Matt Ahrens is working on a key component after which it should all be
> possible.

A lot of people are waiting for it! :) :) :)

ps. thank you for raid-z3!

--
Robert Milkowski
http://milek.blogspot.com
Adam Leventhal wrote:

> I just blogged about triple-parity RAID-Z (raidz3):
>
> http://blogs.sun.com/ahl/entry/triple_parity_raid_z
>
> As for performance, on the system I was using (a max config Sun Storage
> 7410), I saw about a 25% improvement to 1GB/s for a streaming write
> workload. YMMV, but I'd be interested in hearing your results.

A 25% improvement when comparing what exactly to what?

--
Robert Milkowski
http://milek.blogspot.com
Robert,

On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote:

>> To what analysis are you referring? Today the absolute fastest you can
>> resilver a 1TB drive is about 4 hours. Real-world speeds might be half
>> that. In 2010 we'll have 3TB drives meaning it may take a full day to
>> resilver. The odds of hitting a latent bit error are already reasonably
>> high, especially with a large pool that's infrequently scrubbed.
>> What then are the odds of a second drive failing in the 24 hours it
>> takes to resilver?
>
> I wish it was so good with raid-zN.
> In real life, at least from my experience, it can take several days to
> resilver a disk for vdevs in raid-z2 made of 11x SATA disk drives with
> real data.
> While the way zfs synchronizes data is way faster under some
> circumstances, it is also much slower under others.
> IIRC some builds ago there were some fixes integrated, so maybe it is
> different now.

Absolutely. I was talking more or less about optimal timing. I realize that due to the priorities within ZFS and real-world loads it can take far longer.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
Interesting, so the more drive failures you have, the slower the array gets? Would I be right in assuming that the slowdown only lasts up to the point where FMA / ZFS marks the drive as faulted?
--
This message posted from opensolaris.org
> With RAID-Z stripes can be of variable width meaning that, say, a
> single row in a 4+2 configuration might have two stripes of 1+2.
> In other words, there might not be enough space in the new parity
> device.

Wow -- I totally missed that scenario. Excellent point.

> I did write up the steps that would be needed to support RAID-Z
> expansion

Good write-up. If I understand it, the basic approach is to add the device to each row and leave the unusable fragments there. New stripes will take advantage of the wider row but old stripes will not.

It would seem that the mythical bp_rewrite() that I see mentioned here and there could relocate a stripe to another set of rows without altering the transaction_id (or whatever it's called), critical for tracking snapshots. I suspect this function would allow background defrag/coalesce (a needed feature IMHO) and deduplication. With background defrag, the extra space on existing stripes would not immediately be usable, but would appear over time.

Many thanks for the insight and thoughts. Bluntly, how can I help? I have cut a lifetime of C code in a past life.

Cheers,
Marty
--
This message posted from opensolaris.org