Hi,

I am going through understanding the fundamentals of raidz. From the man pages, a raidz configuration of P disks and N parity provides (P-N)*X storage space, where X is the size of each disk. For example, if I have 3 disks of 10G each and I configure them with raidz1, I will have 20G of usable storage. In addition, the pool continues to work even if 1 disk fails.

First, I don't understand why parity takes so much space. From what I know about parity, there is typically one parity bit per byte. Therefore, parity should take 1/8 of the storage, not 1/3. What am I missing?

Second, if one disk fails, how is my lost data reconstructed? There is no duplicate data, as this is not a mirrored configuration. Somehow there must be enough information in the parity to reconstruct the lost data. How is this possible?

Thank you in advance for your help.

Regards,
Peter
--
This message posted from opensolaris.org
Am 11.08.10 00:40, schrieb Peter Taps:
> First, I don't understand why parity takes so much space. From what I
> know about parity, there is typically one parity bit per byte.
> Therefore, the parity should be taking 1/8 of storage, not 1/3 of
> storage. What am I missing?

Nah, it is more like: disk3 is disk2 XOR disk1. You can read about it under RAID5 (raidz is more complicated, but the basic idea stays the same). The parity you describe is only for error checking, more like a ZFS checksum, which also takes very little additional space.

Arne
On Tue, Aug 10 at 15:40, Peter Taps wrote:
> First, I don't understand why parity takes so much space. From what
> I know about parity, there is typically one parity bit per
> byte. Therefore, the parity should be taking 1/8 of storage, not 1/3
> of storage. What am I missing?

Think of it as 1 bit of parity per N-wide RAID'd bit stored on your data drives, which is why it occupies 1/N of the raw capacity. With 3 disks it's 1/3, with 8 disks it's 1/8, and with 10983 disks it would be 1/10983, because you're generating parity across the "width" of your stripe, not as a tail appended to each stored byte on individual devices.

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
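The 1/N overhead described above can be sketched in a couple of lines. This is a minimal illustration of the capacity arithmetic only, assuming equal-sized disks; the function name is made up for the example and is not a real ZFS API.

```python
# Illustrative sketch (not ZFS code): in an N-disk raidz1 vdev, one
# disk's worth of each stripe holds parity, so parity overhead is 1/N
# of raw capacity regardless of disk size.

def raidz1_usable(num_disks: int, disk_size_gb: float) -> float:
    """Approximate usable capacity of raidz1: (N - 1) data disks' worth."""
    if num_disks < 2:
        raise ValueError("raidz1 needs at least 2 disks")
    return (num_disks - 1) * disk_size_gb

print(raidz1_usable(3, 10))   # 3 x 10G disks -> 20.0 usable, 1/3 overhead
print(raidz1_usable(8, 10))   # 8 x 10G disks -> 70.0 usable, 1/8 overhead
```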
Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk fails.

Consider my earlier example of a 3-disk zpool configured for raidz1. To keep it simple, let's not consider block sizes.

Let's say I write the value "abcdef" to the zpool. As the data gets striped, we will have 2 characters per disk:

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lose "cd". How will I ever recover this? The parity info may tell me that something is bad, but I don't see how my data will get recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter
On 8/10/2010 9:57 PM, Peter Taps wrote:
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The
> parity info may tell me that something is bad but I don't see how my
> data will get recovered.

Parity is not intended to tell you *if* something is bad (well, it's not *designed* for that). It tells you how to RECONSTRUCT something should it be bad. ZFS uses checksums of the data (which are stored as data themselves) to tell if some data is bad and thus needs to be re-written, which is what virtually no other filesystem does now. Parity is used at a lower level to reconstruct data on devices after a device failure; it is not directly used to determine whether a device (or block of data) is bad.

To simplify, let's assume we're talking about raidz1 (the principles generally apply to raidz2 and raidz3, but the details differ slightly). Parity is constructed using the mathematical XOR, which has the following property:

    if       A XOR B = C
    then     A XOR C = B
    and also B XOR C = A

(XOR is also fully commutative, so A XOR B = B XOR A.)

So, in your case, we have some data "abcdef" and three disks. Assuming the stripe is set up so that 1 byte (i.e. one character) gets stored on each device, what you have is this:

    Stripe    Device 1    Device 2    Device 3
      1          A           B         A XOR B
      2        C XOR D       C           D
      3          E         E XOR F       F

(where X XOR Y means the binary value computed by XOR-ing X with Y)

If I lose one of the devices above, I simply XOR the corresponding values from the other two devices to reconstruct what I need. For raidz[23], there are 2 or 3 parity calculations (it's not a straight XOR; I forget the algorithm), but the process is the same: you use the data from the remaining devices to recompute the lost device or devices.

As the parity block for a stripe is stored in a balanced manner across all devices (there is no dedicated parity-only device), it becomes simpler to recover data while retaining performance.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
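The XOR reconstruction described above can be shown in a short sketch. This is a hypothetical illustration, not ZFS code: single byte values stand in for whole disk sectors, and the helper name is invented for the example.

```python
# Hypothetical illustration of single-parity (XOR) reconstruction:
# parity is the XOR of the data blocks, and XOR-ing the survivors
# with the parity yields the missing block.

def xor_parity(*blocks: int) -> int:
    """XOR all data blocks in the stripe together."""
    p = 0
    for blk in blocks:
        p ^= blk
    return p

a, b = ord("A"), ord("B")     # data on device 1 and device 2
p = xor_parity(a, b)          # parity stored on device 3

# Device 2 fails: XOR the survivors to rebuild its contents.
rebuilt = xor_parity(a, p)
print(chr(rebuilt))           # prints "B"
```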
Peter Taps wrote:
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The
> parity info may tell me that something is bad but I don't see how my
> data will get recovered.

It's done via math and numbers. :) In a computer, everything is numbers, stored in base 2 (binary); there are no letters or other symbols. Your sample value "abcdef" will be represented as a sequence of numbers, probably the ASCII equivalents, which are in turn represented as binary sequences.

A simplified view of how you can protect multiple independent pieces of information with one piece of parity is as follows. (Note: this simplified view is not exactly how RAID5 or RAIDZ work, as they actually use XOR at the bitwise level.)

Consider an equation with variables (unrelated to your sample value) A, B, and P, where A + B = P. P is the generated parity value. A and B are numbers representing your data; they were indirectly chosen by you when you created your data. If A=97 and B=98, then P=97+98=195. Each of the three variables is stored on a different disk. If any one variable is lost (the disk failed), the missing variable can be recalculated by rearranging the formula and using the known values. Assuming A was lost:

    A = P - B = 195 - 98 = 97

Data recovered. In this simplified example, one piece of parity data P is generated for every pair of A and B values that is written. (Special cases, such as zero padding, handle things when only one value needs to be written.) For more than 3 disks, the formula expands to variations of A+B+C+D+E+F = P, where P is the parity. Additional levels of parity require more complex techniques to generate the needed parity values.

There are lots of other explanations online that might help you out as well:
http://www.google.com/#hl=en&q=how+raid+works
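The additive-parity example above can be written out as code. This is a sketch of the simplified scheme only (real RAID5/raidz1 uses bitwise XOR, but the solve-for-the-unknown idea is the same); the helper names are hypothetical.

```python
# Sketch of the simplified additive-parity scheme: P = A + B, and any
# single lost value can be recovered by rearranging the equation.
# Not how RAID5/raidz actually computes parity (they use bitwise XOR).

def make_parity(values):
    """P = A + B + C + ... for one stripe of data values."""
    return sum(values)

def recover_missing(survivors, parity):
    """Recover the single lost value from the survivors and P."""
    return parity - sum(survivors)

a, b = 97, 98                  # ASCII "a" and "b" on disks 1 and 2
p = make_parity([a, b])        # 195, stored on disk 3

# Disk 1 fails; rearrange A + B = P into A = P - B.
print(recover_missing([b], p))   # prints 97
```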
On Wed, Aug 11, 2010 at 12:57 AM, Peter Taps <ptrtap at yahoo.com> wrote:
> Perhaps I am missing some fundamental concept about raidz.

I find the best way to understand how parity works is to think back to your algebra class, when you'd have something like 1x + 2 = 3 and you could solve for x. It's not EXACTLY like that, but solving the parity stuff is similar to solving for x.
Erik Trimble wrote:
> Parity is not intended to tell you *if* something is bad (well, it's
> not *designed* for that). It tells you how to RECONSTRUCT something
> should it be bad. ZFS uses checksums of the data (which are stored as
> data themselves) to tell if some data is bad, and thus needs to be
> re-written.

To follow up Erik's post: parity is used both to detect and correct errors in a string of equal-sized numbers, where each parity value is equal in size to each of the numbers. In the old serial protocols, one bit was used to detect an error in a string of 7 bits, so each "number" in the string was one bit. In the case of ZFS, each "number" in the string is a disk block. The length of the string of numbers is completely arbitrary. I am rusty on parity math, but Reed-Solomon coding is used (of which XOR is a degenerate case) such that each parity is independent of the other parities. RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a string, or, if it is known which number is corrupt, correct that number. Traditional RAID5 assumes that it knows which number (i.e. block) is bad because the disk failed, and therefore it can use the parity block to reconstruct it. RAID5 cannot reconstruct a random bit-flip.

RAIDZ takes a different approach, where the checksum for the number string (i.e. stripe) exists in a different, already-validated stripe. With that checksum in hand, ZFS knows when a stripe is corrupt but not which block. ZFS will then reconstruct each data block in the stripe using the parity block, one data block at a time, until the checksum matches. At that point ZFS knows which block was bad and can rebuild it and write it to disk. A scrub does this for all stripes and all parities in each stripe.

Using the example above, the disk layout would look more like the following for a single stripe (and, as Erik mentioned, the location of the data and parity blocks will change from stripe to stripe):

disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk2 fails, or even stays online but produces bad data, the information can be reconstructed from disk3. The beauty of ZFS is that it does not depend on parity to detect errors; your stripes can be as wide as you want (up to 100-ish devices), and you can choose 1, 2, or 3 parity devices.

Hope that makes sense,
Marty
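The try-each-block recovery loop described above can be sketched in miniature. This is a hypothetical illustration of the idea, not ZFS internals: sha256 stands in for the ZFS block checksum, two-byte strings stand in for disk blocks, and all names are invented for the example.

```python
# Hypothetical sketch: given a trusted per-stripe checksum, single
# (XOR) parity can find *and* fix one silently corrupted block by
# trying each candidate reconstruction until the checksum matches.
import hashlib

def xor_all(blocks):
    """Bitwise XOR of equal-sized byte blocks."""
    acc = bytes(len(blocks[0]))
    for blk in blocks:
        acc = bytes(x ^ y for x, y in zip(acc, blk))
    return acc

def checksum(blocks):
    return hashlib.sha256(b"".join(blocks)).digest()

def repair(blocks, parity, expected):
    """Return a stripe whose checksum matches, rebuilding one bad block."""
    if checksum(blocks) == expected:
        return blocks                        # nothing wrong
    for i in range(len(blocks)):
        survivors = blocks[:i] + blocks[i + 1:]
        candidate = xor_all(survivors + [parity])  # rebuild block i
        trial = blocks[:i] + [candidate] + blocks[i + 1:]
        if checksum(trial) == expected:
            return trial                     # block i was the bad one
    raise IOError("more than one block damaged; single parity can't fix")

good = [b"ab", b"cd"]
parity = xor_all(good)
expected = checksum(good)

damaged = [b"ab", b"XX"]         # disk 2 silently returned bad data
print(repair(damaged, parity, expected))   # prints [b'ab', b'cd']
```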
On Tue, Aug 10 at 21:57, Peter Taps wrote:
> I still am confused about how the system is still functional after
> one disk fails.

The data for any given sector striped across all drives can be thought of as:

    A + B + C = P

where A..C represent the contents of sector N on devices a..c, and P is the parity located on device p. From that, you can do some simple algebra to convert it to:

    A + B + C - P = 0

If any of A, B, C, or P are unreadable (assume B), from simple algebra you can solve for the single unknown (x) to recreate it:

    A + x + C = P
    A + x + C - A - C = P - A - C
    x = P - A - C

and voila, you now have your original B contents, since B = x.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Thank you all for your help. It appears my understanding of parity was rather limited. I kept thinking about parity in memory, where the extra bit is used to ensure that the total of all 9 bits is always even. In the case of zfs, that type of checking has actually moved into the checksum. What zfs calls parity is much more than a simple check; no wonder it takes more space.

One question, though. Marty mentioned that raidz parity is limited to 3. But in my experiment, it seems I can set the parity to any level.

You create a raidz zpool as:

# zpool create mypool raidzx disk1 disk2 ...

Here, x in raidzx is a numeric value indicating the desired parity. In my experiment, the following command seems to work:

# zpool create mypool raidz10 disk1 disk2 ...

In my case, it gives an error that I need at least 11 disks (which I don't have), but the point is that raidz parity does not seem to be limited to 3. Is this not true?

Thank you once again for your help.

Regards,
Peter
Peter wrote:
> In my experiment, the following command seems to work:
>
> # zpool create mypool raidz10 disk1 disk2 ...
>
> In my case, it gives an error that I need at least 11 disks (which I
> don't) but the point is that raidz parity does not seem to be limited
> to 3. Is this not true?

You have my curiosity. I was asking for that feature in these forums last year. What OS, version, and ZFS version are you running?
> In my case, it gives an error that I need at least 11 disks (which I
> don't) but the point is that raidz parity does not seem to be limited
> to 3. Is this not true?

RAID-Z is limited to 3 parity disks. The error message is giving you false hope, and that's a bug. If you had plugged in 11 or more disks in the example you provided, you would have simply gotten a different error.

- ahl
I am running ZFS file system version 5 on Nexenta.

Peter