Hi,

I am going through understanding the fundamentals of raidz. From the man pages, a raidz configuration of P disks and N parity provides (P-N)*X storage space, where X is the size of each disk. For example, if I have 3 disks of 10G each and I configure them with raidz1, I will have 20G of usable storage. In addition, the pool continues to work even if 1 disk fails.

First, I don't understand why parity takes so much space. From what I know about parity, there is typically one parity bit per byte. Therefore, parity should take 1/8 of the storage, not 1/3. What am I missing?

Second, if one disk fails, how is my lost data reconstructed? There is no duplicate data, as this is not a mirrored configuration. Somehow there must be enough information in the parity to reconstruct the lost data. How is this possible?

Thank you in advance for your help.

Regards,
Peter
--
This message posted from opensolaris.org
Am 11.08.10 00:40, schrieb Peter Taps:
> First, I don't understand why parity takes so much space. From what I
> know about parity, there is typically one parity bit per byte.
> Therefore, the parity should be taking 1/8 of storage, not 1/3 of
> storage. What am I missing?

Nah, it is more like: disk3 is disk2 XOR disk1. You can read about it under RAID5 (raidz is more complicated, but the basic idea stays the same). The parity you describe is only for error checking, more like a ZFS checksum, which also takes very little additional space.

Arne
On Tue, Aug 10 at 15:40, Peter Taps wrote:
> First, I don't understand why parity takes so much space. From what
> I know about parity, there is typically one parity bit per
> byte. Therefore, the parity should be taking 1/8 of storage, not 1/3
> of storage. What am I missing?

Think of it as 1 bit of parity per N-wide RAID'd bit stored on your data drives, which is why it occupies 1/N of the raw capacity. With 3 disks it's 1/3, with 8 disks it's 1/8, and with 10983 disks it would be 1/10983, because you're generating parity across the "width" of your stripe, not as a tail appended to each stored byte on individual devices.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
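The 1/N overhead described above can be sketched in a couple of lines. This is a minimal illustration of the capacity arithmetic only, assuming equal-sized disks; the function name is made up for the example and is not a real ZFS API.

```python
# Illustrative sketch (not ZFS code): in an N-disk raidz1 vdev, one
# disk's worth of each stripe holds parity, so parity overhead is 1/N
# of raw capacity regardless of disk size.

def raidz1_usable(num_disks: int, disk_size_gb: float) -> float:
    """Approximate usable capacity of raidz1: (N - 1) data disks' worth."""
    if num_disks < 2:
        raise ValueError("raidz1 needs at least 2 disks")
    return (num_disks - 1) * disk_size_gb

print(raidz1_usable(3, 10))   # 3 x 10G disks -> 20.0 usable, 1/3 overhead
print(raidz1_usable(8, 10))   # 8 x 10G disks -> 70.0 usable, 1/8 overhead
```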
Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk fails.

Consider my earlier example of a 3-disk zpool configured for raidz1. To keep it simple, let's not consider block sizes.

Let's say I write the value "abcdef" to the zpool. As the data gets striped, we will have 2 characters per disk:

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lose "cd". How will I ever recover this? The parity info may tell me that something is bad, but I don't see how my data will get recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter
On 8/10/2010 9:57 PM, Peter Taps wrote:
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The
> parity info may tell me that something is bad but I don't see how my
> data will get recovered.

Parity is not intended to tell you *if* something is bad (well, it's not *designed* for that). It tells you how to RECONSTRUCT something should it be bad. ZFS uses checksums of the data (which are stored as data themselves) to tell if some data is bad and thus needs to be re-written, which is what virtually no other filesystem does now. Parity is used at a lower level to reconstruct data on devices after a device failure; it is not directly used to determine whether a device (or block of data) is bad.

To simplify, let's assume we're talking about raidz1 (the principles generally apply to raidz2 and raidz3, but the details differ slightly). Parity is constructed using the mathematical XOR, which has the following property:

    if       A XOR B = C
    then     A XOR C = B
    and also B XOR C = A

(XOR is also fully commutative, so A XOR B = B XOR A.)

So, in your case, we have some data "abcdef" and three disks. Assuming the stripe is set up so that 1 byte (i.e. one character) gets stored on each device, what you have is this:

    Stripe    Device 1    Device 2    Device 3
      1          A           B         A XOR B
      2        C XOR D       C           D
      3          E         E XOR F       F

(where X XOR Y means the binary value computed by XOR-ing X with Y)

If I lose one of the devices above, I simply XOR the corresponding values from the other two devices to reconstruct what I need. For raidz[23], there are 2 or 3 parity calculations (it's not a straight XOR; I forget the algorithm), but the process is the same: you use the data from the remaining devices to recompute the lost device or devices.

As the parity block for a stripe is stored in a balanced manner across all devices (there is no dedicated parity-only device), it becomes simpler to recover data while retaining performance.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
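The XOR reconstruction described above can be shown in a short sketch. This is a hypothetical illustration, not ZFS code: single byte values stand in for whole disk sectors, and the helper name is invented for the example.

```python
# Hypothetical illustration of single-parity (XOR) reconstruction:
# parity is the XOR of the data blocks, and XOR-ing the survivors
# with the parity yields the missing block.

def xor_parity(*blocks: int) -> int:
    """XOR all data blocks in the stripe together."""
    p = 0
    for blk in blocks:
        p ^= blk
    return p

a, b = ord("A"), ord("B")     # data on device 1 and device 2
p = xor_parity(a, b)          # parity stored on device 3

# Device 2 fails: XOR the survivors to rebuild its contents.
rebuilt = xor_parity(a, p)
print(chr(rebuilt))           # prints "B"
```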
Peter Taps wrote:
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The
> parity info may tell me that something is bad but I don't see how my
> data will get recovered.

It's done via math and numbers. :) In a computer, everything is numbers, stored in base 2 (binary); there are no letters or other symbols. Your sample value "abcdef" will be represented as a sequence of numbers, probably the ASCII equivalents, which are in turn represented as binary sequences.

A simplified view of how you can protect multiple independent pieces of information with one piece of parity is as follows. (Note: this simplified view is not exactly how RAID5 or RAIDZ work, as they actually use XOR at the bitwise level.)

Consider an equation with variables (unrelated to your sample value) A, B, and P, where A + B = P. P is the generated parity value. A and B are numbers representing your data; they were indirectly chosen by you when you created your data. If A=97 and B=98, then P=97+98=195. Each of the three variables is stored on a different disk. If any one variable is lost (the disk failed), the missing variable can be recalculated by rearranging the formula and using the known values. Assuming A was lost:

    A = P - B = 195 - 98 = 97

Data recovered. In this simplified example, one piece of parity data P is generated for every pair of A and B values that is written. (Special cases, such as zero padding, handle things when only one value needs to be written.) For more than 3 disks, the formula expands to variations of A+B+C+D+E+F = P, where P is the parity. Additional levels of parity require more complex techniques to generate the needed parity values.

There are lots of other explanations online that might help you out as well:
http://www.google.com/#hl=en&q=how+raid+works
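The additive-parity example above can be written out as code. This is a sketch of the simplified scheme only (real RAID5/raidz1 uses bitwise XOR, but the solve-for-the-unknown idea is the same); the helper names are hypothetical.

```python
# Sketch of the simplified additive-parity scheme: P = A + B, and any
# single lost value can be recovered by rearranging the equation.
# Not how RAID5/raidz actually computes parity (they use bitwise XOR).

def make_parity(values):
    """P = A + B + C + ... for one stripe of data values."""
    return sum(values)

def recover_missing(survivors, parity):
    """Recover the single lost value from the survivors and P."""
    return parity - sum(survivors)

a, b = 97, 98                  # ASCII "a" and "b" on disks 1 and 2
p = make_parity([a, b])        # 195, stored on disk 3

# Disk 1 fails; rearrange A + B = P into A = P - B.
print(recover_missing([b], p))   # prints 97
```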
On Wed, Aug 11, 2010 at 12:57 AM, Peter Taps <ptrtap at yahoo.com> wrote:
> Perhaps I am missing some fundamental concept about raidz.

I find the best way to understand how parity works is to think back to your algebra class, when you'd have something like 1x + 2 = 3 and you could solve for x. It's not EXACTLY like that, but solving the parity stuff is similar to solving for x.
Erik Trimble wrote:
> Parity is not intended to tell you *if* something is bad (well, it's
> not *designed* for that). It tells you how to RECONSTRUCT something
> should it be bad. ZFS uses checksums of the data (which are stored as
> data themselves) to tell if some data is bad, and thus needs to be
> re-written.

To follow up Erik's post: parity is used both to detect and correct errors in a string of equal-sized numbers, where each parity value is equal in size to each of the numbers. In the old serial protocols, one bit was used to detect an error in a string of 7 bits, so each "number" in the string was one bit. In the case of ZFS, each "number" in the string is a disk block. The length of the string of numbers is completely arbitrary. I am rusty on parity math, but Reed-Solomon coding is used (of which XOR is a degenerate case) such that each parity is independent of the other parities. RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a string, or, if it is known which number is corrupt, correct that number. Traditional RAID5 assumes that it knows which number (i.e. block) is bad because the disk failed, and therefore it can use the parity block to reconstruct it. RAID5 cannot reconstruct a random bit-flip.

RAIDZ takes a different approach, where the checksum for the number string (i.e. stripe) exists in a different, already-validated stripe. With that checksum in hand, ZFS knows when a stripe is corrupt but not which block. ZFS will then reconstruct each data block in the stripe using the parity block, one data block at a time, until the checksum matches. At that point ZFS knows which block was bad and can rebuild it and write it to disk. A scrub does this for all stripes and all parities in each stripe.

Using the example above, the disk layout would look more like the following for a single stripe (and, as Erik mentioned, the location of the data and parity blocks will change from stripe to stripe):

disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk2 fails, or even stays online but produces bad data, the information can be reconstructed from disk3. The beauty of ZFS is that it does not depend on parity to detect errors; your stripes can be as wide as you want (up to 100-ish devices), and you can choose 1, 2, or 3 parity devices.

Hope that makes sense,
Marty
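The try-each-block recovery loop described above can be sketched in miniature. This is a hypothetical illustration of the idea, not ZFS internals: sha256 stands in for the ZFS block checksum, two-byte strings stand in for disk blocks, and all names are invented for the example.

```python
# Hypothetical sketch: given a trusted per-stripe checksum, single
# (XOR) parity can find *and* fix one silently corrupted block by
# trying each candidate reconstruction until the checksum matches.
import hashlib

def xor_all(blocks):
    """Bitwise XOR of equal-sized byte blocks."""
    acc = bytes(len(blocks[0]))
    for blk in blocks:
        acc = bytes(x ^ y for x, y in zip(acc, blk))
    return acc

def checksum(blocks):
    return hashlib.sha256(b"".join(blocks)).digest()

def repair(blocks, parity, expected):
    """Return a stripe whose checksum matches, rebuilding one bad block."""
    if checksum(blocks) == expected:
        return blocks                        # nothing wrong
    for i in range(len(blocks)):
        survivors = blocks[:i] + blocks[i + 1:]
        candidate = xor_all(survivors + [parity])  # rebuild block i
        trial = blocks[:i] + [candidate] + blocks[i + 1:]
        if checksum(trial) == expected:
            return trial                     # block i was the bad one
    raise IOError("more than one block damaged; single parity can't fix")

good = [b"ab", b"cd"]
parity = xor_all(good)
expected = checksum(good)

damaged = [b"ab", b"XX"]         # disk 2 silently returned bad data
print(repair(damaged, parity, expected))   # prints [b'ab', b'cd']
```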
On Tue, Aug 10 at 21:57, Peter Taps wrote:
> I still am confused about how the system is still functional after
> one disk fails.

The data for any given sector striped across all drives can be thought of as:

    A + B + C = P

where A..C represent the contents of sector N on devices a..c, and P is the parity located on device p. From that, you can do some simple algebra to convert it to:

    A + B + C - P = 0

If any of A, B, C, or P are unreadable (assume B), from simple algebra you can solve for the single unknown (x) to recreate it:

    A + x + C = P
    A + x + C - A - C = P - A - C
    x = P - A - C

and voila, you now have your original B contents, since B = x.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Thank you all for your help. It appears my understanding of parity was rather limited. I kept thinking about parity in memory, where the extra bit is used to ensure that the total of all 9 bits is always even. In the case of zfs, that type of checking has actually moved into the checksum. What zfs calls parity is much more than a simple check; no wonder it takes more space.

One question, though. Marty mentioned that raidz parity is limited to 3. But in my experiment, it seems I can set the parity to any level.

You create a raidz zpool as:

# zpool create mypool raidzx disk1 disk2 ...

Here, x in raidzx is a numeric value indicating the desired parity. In my experiment, the following command seems to work:

# zpool create mypool raidz10 disk1 disk2 ...

In my case, it gives an error that I need at least 11 disks (which I don't have), but the point is that raidz parity does not seem to be limited to 3. Is this not true?

Thank you once again for your help.

Regards,
Peter
Peter wrote:
> In my experiment, the following command seems to work:
>
> # zpool create mypool raidz10 disk1 disk2 ...
>
> In my case, it gives an error that I need at least 11 disks (which I
> don't) but the point is that raidz parity does not seem to be limited
> to 3. Is this not true?

You have my curiosity. I was asking for that feature in these forums last year. What OS, version, and ZFS version are you running?
> In my case, it gives an error that I need at least 11 disks (which I
> don't) but the point is that raidz parity does not seem to be limited
> to 3. Is this not true?

RAID-Z is limited to 3 parity disks. The error message is giving you false hope, and that's a bug. If you had plugged in 11 or more disks in the example you provided, you would have simply gotten a different error.

- ahl
I am running ZFS file system version 5 on Nexenta.

Peter