The ZFS best practices page (and the experts in general) talk about MTTDL, how raidz2 is better than raidz, and so on. Has anyone here actually experienced data loss in a raidz that has a hot spare? I mean from disk failure, of course, not from bugs or admin error, etc.

-frank
If you are asking whether anyone has experienced two drive failures simultaneously, the answer is yes. It has happened to me (at home) and to at least one client that I can remember. In both cases I was able to dd off one of the failed disks (the one with only bad sectors, or fewer of them), force the RAID-5 back online, and then copy the data off onto new drives.

Personally, I think mirroring (and 3-way mirroring) is safer than raidz/raidz2/RAID-5. All my "boot from zfs" systems have 3-way mirrored root/usr/var disks (using 9 disks), but all my data partitions are 2-way mirrors (usually 8 disks or more, plus a spare).

-- This message posted from opensolaris.org
Yes, a coworker lost a second disk during a rebuild of a raid5 and lost all data.

I have not had a failure myself; however, while migrating EqualLogic arrays in and out of pools, I lost a disk on one array. No data loss, but it concerns me, because during the moves you are essentially reading and writing all of the data on the disk. Did I have a latent problem on that particular disk that only exposed itself when doing such a large read/write? What if another disk had failed, and during the rebuild this latent problem was exposed? Trouble, trouble.

They say security is an onion. So is data protection.

Scott

-- This message posted from opensolaris.org
Anyone who's lost data this way: were you doing weekly scrubs, or did you find out about the simultaneous failures after not touching the bits for months?

mike
Hey James,

> Personally, I think mirroring (and 3-way mirroring) is safer than raidz/raidz2/RAID-5. All my "boot from zfs" systems have 3-way mirrored root/usr/var disks (using 9 disks), but all my data partitions are 2-way mirrors (usually 8 disks or more, plus a spare).

Double-parity (or triple-parity) RAID is certainly more resilient against some failure modes than 2-way mirroring. For example, bit errors arise from disks at a certain rate. In the case of a disk failure in a mirror, it's possible to encounter a bit error such that data is lost.

I recently wrote an article for ACM Queue that examines recent trends in hard drives and makes the case for triple-parity RAID. It's at least peripherally relevant to this conversation: http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
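[Adam's point about hitting a bit error while rebuilding a mirror can be put in rough numbers. A minimal sketch in Python, assuming an unrecoverable-error rate of one bit in 10^14 (a typical consumer SATA spec) and independent bit errors; both assumptions are illustrative, not figures from this thread.]

# Back-of-the-envelope estimate: probability of at least one
# unrecoverable read error (URE) while reading the entire surviving
# half of a mirror during resilver.  Assumes 1 URE per 1e14 bits and
# independent bit errors.
import math

def p_resilver_hits_ure(drive_bytes, bits_per_ure=1e14):
    bits_read = drive_bytes * 8
    # P(all bits read cleanly) = (1 - 1/bits_per_ure) ** bits_read
    return 1.0 - math.exp(bits_read * math.log1p(-1.0 / bits_per_ure))

for size_tb in (0.5, 1.0, 2.0):
    p = p_resilver_hits_ure(size_tb * 1e12)
    print(f"{size_tb:3.1f} TB mirror half: ~{p:.0%} chance of a bad bit "
          f"during resilver")

Under those assumptions the chance climbs from a few percent for a 500 GB drive to roughly 15% for a 2 TB drive, which is the trend the article is concerned with.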
On Dec 21, 2009, at 4:09 PM, Michael Herf <mbherf at gmail.com> wrote:

> Anyone who's lost data this way: were you doing weekly scrubs, or did you find out about the simultaneous failures after not touching the bits for months?

Scrubbing on a routine basis is good for detecting problems early, but it doesn't solve the problem of a double failure during resilver. As disks become huge, the chance of a double failure during resilvering grows into a real possibility, due to the amount of data, the bit error rate of the medium, and the prolonged stress of resilvering these monsters.

For drives up to 1TB use nothing less than raidz2; for drives over 1TB use raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have multiple vdevs, both for performance and for reliability.

With a 24-bay 2.5" enclosure you can easily create three 7-drive raidz3s or four 5-drive raidz2s, with a spare for each vdev, or two spares and 1-2 SSDs. Both options give 12 of 24 disks as usable space. Four raidz2s give more performance; three raidz3s give more reliability.

-Ross
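[For what it's worth, the space arithmetic behind those two 24-bay layouts checks out. A quick sketch, using the rough rule that each raidz vdev yields (width - parity) data disks and ignoring metadata and allocation overhead:]

# Quick arithmetic check of the two 24-bay layouts described above.
# Usable space per raidz vdev is approximated as (width - parity)
# data disks; metadata and allocation overhead are ignored.

def layout(vdevs, width, parity, bays=24):
    used = vdevs * width
    data = vdevs * (width - parity)
    return data, bays - used          # data disks, bays left for spares/SSD

for name, cfg in {"3 x 7-disk raidz3": (3, 7, 3),
                  "4 x 5-disk raidz2": (4, 5, 2)}.items():
    data, leftover = layout(*cfg)
    print(f"{name}: {data}/24 disks usable, {leftover} bays for spares/SSD")

Both print 12/24 usable, with 3 or 4 bays left over for spares and SSDs.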
> On Dec 21, 2009, at 4:09 PM, Michael Herf <mbherf at gmail.com> wrote:
>
>> Anyone who's lost data this way: were you doing weekly scrubs, or did you find out about the simultaneous failures after not touching the bits for months?
>
> Scrubbing on a routine basis is good for detecting problems early, but it doesn't solve the problem of a double failure during resilver. As disks become huge, the chance of a double failure during resilvering grows into a real possibility, due to the amount of data, the bit error rate of the medium, and the prolonged stress of resilvering these monsters.
>
> For drives up to 1TB use nothing less than raidz2; for drives over 1TB use raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have multiple vdevs, both for performance and for reliability.
>
> With a 24-bay 2.5" enclosure you can easily create three 7-drive raidz3s or four 5-drive raidz2s, with a spare for each vdev, or two spares and 1-2 SSDs. Both options give 12 of 24 disks as usable space. Four raidz2s give more performance; three raidz3s give more reliability.
>
> -Ross

Hi Ross,

What about good old RAID-10? It's a pretty reasonable choice for heavily loaded storage, isn't it?

I remember when I migrated a raidz2 to an 8-drive RAID-10: the application administrators were really happy with the new access speed. (We didn't use striped raidz2s as you are suggesting, though.)

-- 
Roman

-- This message posted from opensolaris.org
On Dec 21, 2009, at 11:56 PM, Roman Naumenko <roman at naumenko.ca> wrote:

>> [...]
>>
>> For drives up to 1TB use nothing less than raidz2; for drives over 1TB use raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have multiple vdevs, both for performance and for reliability.
>>
>> -Ross
>
> Hi Ross,
>
> What about good old RAID-10? It's a pretty reasonable choice for heavily loaded storage, isn't it?
>
> I remember when I migrated a raidz2 to an 8-drive RAID-10: the application administrators were really happy with the new access speed. (We didn't use striped raidz2s as you are suggesting, though.)

RAID-10 provides excellent performance, and if performance is the priority then I recommend it, but I was under the impression that resiliency was the priority here; raidz2/raidz3 provide greater resiliency for a sacrifice in performance.

-Ross
>> Hi Ross,
>>
>> What about good old RAID-10? It's a pretty reasonable choice for heavily loaded storage, isn't it?
>>
>> I remember when I migrated a raidz2 to an 8-drive RAID-10: the application administrators were really happy with the new access speed. (We didn't use striped raidz2s as you are suggesting, though.)
>
> RAID-10 provides excellent performance, and if performance is the priority then I recommend it, but I was under the impression that resiliency was the priority here; raidz2/raidz3 provide greater resiliency for a sacrifice in performance.

My experience is in line with Ross's comments. There is no question that more independent vdevs will improve IOPS, e.g. RAID-10 or even a pile of RAIDZ vdevs.

I have been burnt too many times to let an array go critical (no redundancy). Never, ever, ever again. With RAID-1 or RAID-10, one disk loss puts the whole pool critical, just one bad sector away from disaster; one prays the hot spare can be rebuilt in time. With RAIDZ, the same is true. I think of triple (or even quad) mirroring the same way as I think of RAIDZ3: it's like having prebuilt hot spares.

I suspect that the IOPS problems of wide stripes are being mitigated by L2ARC/ZIL, and that the trend will be toward wide stripes with ever higher parity counts. Sun's recent storage offerings tend to confirm this: slower, cheaper, bigger SATA drives fronted by SSD L2ARC and ZIL.

-- This message posted from opensolaris.org
On Tue, 22 Dec 2009, Ross Walker wrote:

> RAID-10 provides excellent performance, and if performance is the priority then I recommend it, but I was under the impression that resiliency was the priority here; raidz2/raidz3 provide greater resiliency for a sacrifice in performance.

Why are people talking about "RAID-5", "RAID-6", and "RAID-10" on this list? This is the zfs-discuss list, and zfs does not do "RAID-5", "RAID-6", or "RAID-10".

Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.

Bob

-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> Why are people talking about "RAID-5", "RAID-6", and "RAID-10" on this list? This is the zfs-discuss list, and zfs does not do "RAID-5", "RAID-6", or "RAID-10".
>
> Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.

That's not entirely true, is it?

* RAIDZ is RAID5 + checksum + COW
* RAIDZ2 is RAID6 + checksum + COW
* A stack of mirror vdevs is RAID10 + checksum + COW

While there isn't an actual one-to-one mapping, many traditional RAID concepts do seem to apply to ZFS discussions, don't they?

Marty

-- This message posted from opensolaris.org
> On Tue, 22 Dec 2009, Ross Walker wrote:
>
> Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.
>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,

I wouldn't agree. Sun just introduced more marketing names for well-known things, even if it added some new functionality.

raid6 is raid6, no matter how you name it: raidz2, RAID-DP, RAID ADG or something else. Sounds nice, but it's just buzzwords.

-- 
Roman
roman at naumenko.ca

-- This message posted from opensolaris.org
On Tue, 22 Dec 2009, Marty Scholes wrote:

> That's not entirely true, is it?
> * RAIDZ is RAID5 + checksum + COW
> * RAIDZ2 is RAID6 + checksum + COW
> * A stack of mirror vdevs is RAID10 + checksum + COW

These are layman's simplifications that no one here should be comfortable with.

Zfs borrows proven data recovery technologies from classic RAID, but the data layout on disk is not classic RAID, or even close to it. Metadata and file data are handled differently. Metadata is always duplicated, with the most critical metadata being strewn across multiple disks. Even "mirror" disks are not really mirrors of each other.

Earlier in this discussion thread someone claimed that if a raidz disk was lost then the pool was just one data error away from total disaster, but that is not normally true, due to the many other things that zfs does.

Bob

-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 22.12.09 18:42, Roman Naumenko wrote:

>> On Tue, 22 Dec 2009, Ross Walker wrote:
>> Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.
>
> I wouldn't agree. Sun just introduced more marketing names for well-known things, even if it added some new functionality.
>
> raid6 is raid6, no matter how you name it: raidz2, RAID-DP, RAID ADG or something else. Sounds nice, but it's just buzzwords.

Sorry, but that isn't correct. Or, to be precise: it depends on your definition. If you consider RAID5 to be just "a stripe set with interleaved parity", then you may be right. But the differences between RAID5 and RAIDZ (and likewise between RAID6 and RAIDZ2) are vast enough to justify their own names; just look at the different parity handling. Otherwise this would be like denying diesel and gasoline engines different names just because they are both internal combustion piston engines ...
On Tue, 22 Dec 2009, Roman Naumenko wrote:

> raid6 is raid6, no matter how you name it: raidz2, RAID-DP, RAID ADG or something else. Sounds nice, but it's just buzzwords.

It is true that many vendors like to make their storage array seem special, but references to RAID6 when describing raidz2 are only used in order to help assist with your understanding. They are a form of analogy.

Bob

-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Interesting discussion. I know the bias here is generally toward enterprise users. I was wondering if the same recommendations hold for home users, who are generally more price sensitive.

I'm currently running OpenSolaris on a system with 12 drives. I had split them into three 4-disk raidz1 vdevs. This made some sense at the time, as I can upgrade 4 disks at a time as new sizes come out. However, with 8 of the disks now being 1.5TB, I'm getting concerned about this strategy. While the important data is backed up, losing the server's data would be very irritating.

My next thought was to get more drives and run a single raidz3 vdev of 12x1.5TB: more space than I need for quite a while (since I can't add just a few drives later), and triple parity for protection. I'd need a few extra drives to hold the data while I rebuild the main array, so I'd have cold spares available that I would use for backing up critical data from the server; they would see use and scrubs, not just sit on the shelf.

Access is over a gigE network, so I don't need more performance than that. I have read that the overall speed of a vdev is approximately the speed of a single device in the vdev, and in this case that is more than fast enough.

I'm curious what the experts here think of this new plan. I'm pretty sure I know what you all think of the old one. :)

Do you recommend swapping spare drives into the array periodically? It seems like it wouldn't really be any better than running a scrub over the same period, but I've heard of people doing it on hardware RAID controllers.

-- This message posted from opensolaris.org
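[To make the trade-off concrete, here is a rough comparison of the layouts being weighed, assuming all twelve drives are 1.5 TB; the real pool mixes sizes, so treat the numbers as approximate.]

# Rough usable-space / fault-tolerance comparison for 12 drives,
# assuming every drive is 1.5 TB and ignoring metadata overhead.

DRIVE_TB = 1.5

configs = {
    "3 x 4-disk raidz1 (current)": (3, 4, 1),
    "2 x 6-disk raidz2":           (2, 6, 2),
    "1 x 12-disk raidz3 (plan)":   (1, 12, 3),
}

for name, (vdevs, width, parity) in configs.items():
    usable = vdevs * (width - parity) * DRIVE_TB
    print(f"{name}: ~{usable:4.1f} TB usable, "
          f"tolerates {parity} failure(s) per vdev")

The single raidz3 and the three raidz1s both come out around 13.5 TB usable; the raidz3 trades nothing in space for the ability to lose any three drives, while two raidz2s give up about 1.5 TB for two independent vdevs.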
Bob Friesenhahn wrote:

> On Tue, 22 Dec 2009, Marty Scholes wrote:
>> That's not entirely true, is it?
>> * RAIDZ is RAID5 + checksum + COW
>> * RAIDZ2 is RAID6 + checksum + COW
>> * A stack of mirror vdevs is RAID10 + checksum + COW
>
> These are layman's simplifications that no one here should be comfortable with.

Well, ok. They do seem to capture the essence of what the different flavors of ZFS protection do, but I'll take you at your word. We do seem to be spinning off on a tangent, though.

> Zfs borrows proven data recovery technologies from classic RAID, but the data layout on disk is not classic RAID, or even close to it. Metadata and file data are handled differently. Metadata is always duplicated, with the most critical metadata being strewn across multiple disks. Even "mirror" disks are not really mirrors of each other.

I am having a little trouble reconciling the above statements, but again, ok. I haven't read the official RAID spec, so again, I'll take you at your word. Honestly, those seem like important nuances, but nuances nonetheless.

> Earlier in this discussion thread someone claimed that if a raidz disk was lost then the pool was just one data error away from total disaster

That would be me. Let me substitute the phrase "user data loss in some way, shape or form which disrupts availability" for the words "total disaster." Honestly, I think we are splitting hairs here. Everyone agrees that RAIDZ takes RAID5 to a new level.

-- This message posted from opensolaris.org
On 22-Dec-09, at 12:42 PM, Roman Naumenko wrote:

>> On Tue, 22 Dec 2009, Ross Walker wrote:
>> Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.
>>
>> Bob
>
> I wouldn't agree. Sun just introduced more marketing names for well-known things, even if it added some new functionality.
>
> raid6 is raid6, no matter how you name it: raidz2, RAID-DP, RAID ADG or something else. Sounds nice, but it's just buzzwords.

The implied equivalence is wrong and confusing. That's the kind of mislabelling that Bob was complaining about.

--Toby
On Dec 22, 2009, at 11:49 AM, Toby Thain wrote:

> On 22-Dec-09, at 12:42 PM, Roman Naumenko wrote:
>
>> I wouldn't agree. Sun just introduced more marketing names for well-known things, even if it added some new functionality.
>>
>> raid6 is raid6, no matter how you name it: raidz2, RAID-DP, RAID ADG or something else. Sounds nice, but it's just buzzwords.
>
> The implied equivalence is wrong and confusing. That's the kind of mislabelling that Bob was complaining about.

Yes. Also note that the RAID levels have rather strict definitions: http://www.snia.org/education/dictionary/r

IMHO the biggest difference is the dynamic nature of ZFS. For example, the definition of RAID-0 (data striping) is:

    A disk array data mapping technique in which fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern.

ZFS implements dynamic striping, which is different in that the "fixed-length sequences" aren't really fixed and the "regular rotating pattern" is biased towards allocations on devices which have more free space. The upshot is that the space available to a dynamic stripe is the sum of the space of the vdevs, whereas for RAID-0 it is N * min(vdev size).

-- richard
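[The free-space bias Richard describes can be illustrated with a toy weighted-choice model. This is emphatically not ZFS's actual metaslab allocator, just a sketch of the idea that vdevs with more free space receive proportionally more new allocations; the vdev names and free-space figures are made up.]

# Toy illustration of free-space-biased placement.  NOT ZFS's metaslab
# allocator; it only shows emptier vdevs receiving proportionally more
# of the new allocations.
import random

free_gb = {"vdev0": 800, "vdev1": 200, "vdev2": 500}   # made-up figures

def pick_vdev(free):
    total = sum(free.values())
    r = random.uniform(0, total)
    for name, space in free.items():
        r -= space
        if r <= 0:
            return name
    return name                      # guard against floating-point rounding

counts = {name: 0 for name in free_gb}
for _ in range(10_000):
    counts[pick_vdev(free_gb)] += 1
print(counts)   # roughly proportional to 800:200:500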
ttabbal:
If I understand correctly, raidz{1} gives 1 drive of protection and (drives - 1) of space, raidz2 gives 2 drives of protection and (drives - 2) of space, and likewise raidz3 gives 3 drives of protection. Everything I've seen says you should stay around 6-9 drives for raidz, so don't do a raidz3 with 12 drives; instead make two raidz3 vdevs of 6 drives each (which is (6-3) * 1.5 * 2 = 9 TB of space).

As for whether or not to do raidz, for me the issue is performance: I can't handle the raidz write penalty. If I needed triple-drive protection, a 3-way mirror setup would be the only way I would go. I don't yet quite understand why a 4+ drive raidz3 vdev is better than a 3-drive mirror vdev, other than that a 6-drive raidz3 gives 3 drives of space while 6 drives in 3-way mirrors give only 2 drives of space.

Adam Leventhal:
If we can compare apples and oranges, would your recommendation ("use raidz2 and/or raidz3") stay the same when comparing against mirrors with the same number of parity drives? In other words, a 2-drive mirror compares to raidz1 the way a 3-drive mirror compares to raidz2 and a 4-drive mirror compares to raidz3? If you were an enterprise (in other words, cared about performance), why would you ever use raidz instead of throwing more drives at the problem and doing mirroring with identical parity?

Joerg Moellenkamp:
I do consider RAID5 to be a "stripe set with interleaved parity", so I don't agree with the strong objection by many in this thread to using RAID5 to describe what raidz does. I don't think many people particularly care about the nuanced differences between hardware-card RAID5 and raidz, other than knowing they would rather have raidz over RAID5.

-- This message posted from opensolaris.org
On Tue, 22 Dec 2009, James Risner wrote:

> I do consider RAID5 to be a "stripe set with interleaved parity", so I don't agree with the strong objection by many in this thread to using RAID5 to describe what raidz does. I don't think many people particularly care about the nuanced differences between hardware-card RAID5 and raidz, other than knowing they would rather have raidz over RAID5.

One of the "nuanced differences" is that raidz supports more data recovery mechanisms than RAID5 does, since it redundantly stores its metadata and provides the option to redundantly store user data as well, in addition to what is provided by "RAID5". The COW mechanism also provides some measure of protection: if the corrupted data was recently written, a somewhat older version may still be available by rolling back a transaction group. Valid older data may also be available in a snapshot. It is not uncommon to see postings from people who report that their single-disk pool said that some data corruption was encountered, the problem was automatically corrected, and user data was not impacted.

Bob

-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
risner wrote:

> If I understand correctly, raidz{1} gives 1 drive of protection and (drives - 1) of space, raidz2 gives 2 drives of protection and (drives - 2) of space, and likewise raidz3 gives 3 drives of protection.

Yes.

> Everything I've seen says you should stay around 6-9 drives for raidz, so don't do a raidz3 with 12 drives; instead make two raidz3 vdevs of 6 drives each (which is (6-3) * 1.5 * 2 = 9 TB of space).

From what I can tell, this is purely a function of needed IOPS. Wider stripe = better storage/bandwidth utilization = fewer IOPS. For home usage I run a 14-drive RAIDZ3 array.

> As for whether or not to do raidz, for me the issue is performance: I can't handle the raidz write penalty.

If there is a RAIDZ write penalty over mirroring, I am unaware of it. In fact, sequential writes are faster under RAIDZ.

> If I needed triple-drive protection, a 3-way mirror setup would be the only way I would go.

That will give high IOPS with 33% storage utilization and 33% bandwidth utilization. In other words, for every MB of data read/written by an application, 3 MB is read/written from/to the array and stored. Multiply all storage and bandwidth needs by three.

> I don't yet quite understand why a 4+ drive raidz3 vdev is better than a 3-drive mirror vdev, other than that a 6-drive raidz3 gives 3 drives of space while 6 drives in 3-way mirrors give only 2 drives of space.

Part of the question you answered yourself. The other part is that with a 6-drive RAIDZ3, I can lose ANY three drives and still be running. With three mirrors, I can lose the pool if the wrong two drives die.

-- This message posted from opensolaris.org
> Everything I've seen says you should stay around 6-9 drives for raidz, so don't do a raidz3 with 12 drives; instead make two raidz3 vdevs of 6 drives each (which is (6-3) * 1.5 * 2 = 9 TB of space).

So the question becomes, why? If it's performance, I can live with lower IOPS and max throughput. If it's reliability, I'd like to hear why. I would think that the number of acceptable devices in a raidz would scale somewhat with the number of drives used for parity, so I would expect to see a sliding scale somewhat like the one mentioned before regarding disk size vs. raidz level. For example:

3-4 drives: raidz1
4-8 drives: raidz2
8+ drives: raidz3

In practice, I would expect to see some kind of chart using the number of devices and the size of the devices together to determine the proper raidz level. Perhaps I'm way off base, though.

Note that I don't really have a problem doing 2 arrays, but I would think that perhaps raidz2 would be acceptable in that configuration. The benefit of that config for me would be that I could create a parallel array of 6 to copy my existing data to, then add the second array after the initial file copy/scrub. I would need fewer disks to complete the transition.

> As for whether or not to do raidz, for me the issue is performance: I can't handle the raidz write penalty. If I needed triple-drive protection, a 3-way mirror setup would be the only way I would go. I don't yet quite understand why a 4+ drive raidz3 vdev is better than a 3-drive mirror vdev, other than that a 6-drive raidz3 gives 3 drives of space while 6 drives in 3-way mirrors give only 2 drives of space.

I've already stipulated that performance is not the primary concern. 100 MB/sec with reasonable random I/O for a max of 5 clients is more than enough. My existing raidz is more than fast enough for my needs, and I have 5400 RPM drives in there.

I'd be very interested to hear an expert opinion on this. Given, say, 6 disks: what advantage in reliability, if any, would a raidz3 have vs. a striped pair of 3-way mirrors? Obviously the raidz3 has 1 disk worth of extra space, but we're talking about reliability here. I would guess performance would be higher with the mirrors.

With all of my comments, please keep in mind that I am not a huge enterprise customer with loads of money to spend on this. If I were, I'd just buy Thumpers. I'm a home user with a decent fileserver.

-- This message posted from opensolaris.org
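[One narrow way to look at the reliability half of that question is pure combinatorics: of all the ways k drives out of 6 can fail, what fraction does each layout survive? A sketch; it ignores resilver windows, correlated failures and unrecoverable read errors, so it is only part of the picture.]

# Fraction of k-drive failure combinations each 6-disk layout survives.
# Pure combinatorics: resilver time, correlated failures and read
# errors are ignored.
from itertools import combinations

def survives_raidz(parity, failed):
    return len(failed) <= parity

def survives_mirrors(way, failed, width=6):
    # drives 0..way-1 form mirror 0, the next `way` drives mirror 1, etc.
    mirrors = [set(range(i, i + way)) for i in range(0, width, way)]
    return all(len(m - set(failed)) >= 1 for m in mirrors)

layouts = {
    "6-disk raidz3":     lambda f: survives_raidz(3, f),
    "3 x 2-way mirrors": lambda f: survives_mirrors(2, f),
    "2 x 3-way mirrors": lambda f: survives_mirrors(3, f),
}

for k in (1, 2, 3):
    combos = list(combinations(range(6), k))
    for name, ok in layouts.items():
        frac = sum(ok(f) for f in combos) / len(combos)
        print(f"{k} failure(s), {name:18s}: survives {frac:4.0%} of combinations")

By this measure the 6-disk raidz3 survives every single, double and triple failure; the pair of 3-way mirrors survives all double failures and 90% of triple failures; three 2-way mirrors survive only 80% of double and 40% of triple failures.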
On 22-Dec-09, at 3:33 PM, James Risner wrote:

> ...
> Joerg Moellenkamp:
> I do consider RAID5 to be a "stripe set with interleaved parity", so I don't agree with the strong objection by many in this thread to using RAID5 to describe what raidz does. I don't think many people particularly care about the nuanced differences between hardware-card RAID5 and raidz, other than knowing they would rather have raidz over RAID5.

These are hardly "nuanced differences". The most powerful capabilities of ZFS simply aren't available in RAID.

* Because ZFS is labelled a "filesystem", people assume it is analogous to a conventional filesystem and then make misleading comparisons which fail to expose the profound differences;
* or people think it's a RAID or volume manager, assume it's just RAID relabelled, and fail to see where it goes beyond.

Of course it is neither, exactly, but a synthesis of the two which is far more capable than the two conventionally discrete layers in combination. (I know most of the list knows this. :)

--Toby
On Dec 22, 2009, at 11:46 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 22 Dec 2009, Ross Walker wrote:
>> RAID-10 provides excellent performance, and if performance is the priority then I recommend it, but I was under the impression that resiliency was the priority here; raidz2/raidz3 provide greater resiliency for a sacrifice in performance.
>
> Why are people talking about "RAID-5", "RAID-6", and "RAID-10" on this list? This is the zfs-discuss list, and zfs does not do "RAID-5", "RAID-6", or "RAID-10".

Because raid10 is shorter to type than "pool of mirrors", and for newcomers those terms are easier to grasp. Notice how I refer to raidz/2/3 and not raid5/6? Because it's roughly the same number of characters.

-Ross

PS Really, Bob?
On December 21, 2009 10:45:29 PM -0500 Ross Walker <rswwalker at gmail.com> wrote:

> Scrubbing on a routine basis is good for detecting problems early, but it doesn't solve the problem of a double failure during resilver. As disks become huge, the chance of a double failure during resilvering grows into a real possibility, due to the amount of data, the bit error rate of the medium, and the prolonged stress of resilvering these monsters.
>
> For drives up to 1TB use nothing less than raidz2; for drives over 1TB use raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have multiple vdevs, both for performance and for reliability.

Would be good fodder for the best practices doc, if you have the math to back it up.

-frank
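[The math usually cited for this is the textbook MTTDL model. A minimal sketch, assuming independent, exponentially distributed drive failures, ignoring unrecoverable read errors, and using made-up MTBF and resilver figures; none of these numbers come from the thread, so treat the output as relative, not absolute.]

# Textbook MTTDL model for an N-disk, P-parity vdev.  Assumes
# independent exponential failures and ignores unrecoverable read
# errors; the MTBF and resilver times below are illustrative only.

HOURS_PER_YEAR = 24 * 365

def mttdl(n_disks, parity, mtbf_h, mttr_h):
    num = mtbf_h ** (parity + 1)
    den = mttr_h ** parity
    for i in range(parity + 1):
        den *= (n_disks - i)
    return num / den

MTBF = 500_000                      # hours per drive (assumed)
for parity in (1, 2, 3):
    for resilver_h in (12, 48):     # small/fast vs. large/slow drives
        years = mttdl(7, parity, MTBF, resilver_h) / HOURS_PER_YEAR
        print(f"7-disk raidz{parity}, {resilver_h:2d} h resilver: "
              f"MTTDL ~ {years:,.0f} years")

The absolute numbers are only as good as the assumptions, but the model does show the two trends argued above: each extra parity level buys orders of magnitude, and longer resilver times (bigger drives) erode it.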
Ross Walker <rswwalker <at> gmail.com> writes:

> Scrubbing on a routine basis is good for detecting problems early, but it doesn't solve the problem of a double failure during resilver.

Scrubbing doesn't solve double failures, but it significantly decreases their likelihood. The assumption here is that the most common type of second failure in a double-failure scenario is an uncorrectable error. Not only do scrubs detect and fix uncorrectable errors, they also stress the drives as much as a resilver would. Personally I have had my share of single-drive failures, but never any double failure. I scrub on a weekly or monthly basis.

-mrb
On Tue, 22 Dec 2009, Marty Scholes wrote:

> If there is a RAIDZ write penalty over mirroring, I am unaware of it. In fact, sequential writes are faster under RAIDZ.

There is always an IOPS penalty for raidz when writing or reading, given a particular zfs block size. There may be a write penalty for mirroring, but this depends heavily on whether the I/O paths are saturated or operate in parallel. It is true that a mirror requires a write for each mirror device, but if the I/O subsystem has the bandwidth for it, the cost of this can be astonishingly insignificant. It becomes significant when the I/O path is shared with limited bandwidth and the writes are large.

As to whether sequential writes are faster under raidz, I have yet to see any actual evidence of that. Perhaps someone can provide some actual evidence?

Bob

-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
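[The rule of thumb behind the raidz IOPS point can be put in rough numbers. This is the common simplified model, not a benchmark: a small random read on a raidz vdev touches every data disk, so each vdev delivers roughly one disk's worth of random IOPS, while each side of a mirror can serve a different read. The per-disk IOPS figure is an assumption.]

# Common rule-of-thumb IOPS model, not a benchmark.

DISK_IOPS = 120                      # ~7200 rpm drive, assumed
TOTAL_DISKS = 12

def raidz_random_read_iops(vdevs):
    return vdevs * DISK_IOPS         # ~1 disk's worth per raidz vdev

def mirror_random_read_iops(disks):
    return disks * DISK_IOPS         # every disk can serve a different read

def mirror_random_write_iops(disks, way=2):
    return disks // way * DISK_IOPS  # each write hits all sides of one mirror

print("1 x 12-disk raidz3:", raidz_random_read_iops(1), "random read IOPS (approx)")
print("2 x 6-disk raidz2 :", raidz_random_read_iops(2), "random read IOPS (approx)")
print("6 x 2-way mirrors :", mirror_random_read_iops(TOTAL_DISKS), "read /",
      mirror_random_write_iops(TOTAL_DISKS), "write IOPS (approx)")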
On Tue, Dec 22 at 12:33, James Risner wrote:

> As for whether or not to do raidz, for me the issue is performance: I can't handle the raidz write penalty. If I needed triple-drive protection, a 3-way mirror setup would be the only way I would go. I don't yet quite understand why a 4+ drive raidz3 vdev is better than a 3-drive mirror vdev, other than that a 6-drive raidz3 gives 3 drives of space while 6 drives in 3-way mirrors give only 2 drives of space.

That's a pretty big "other than", since the difference is 50% more space for the raidz3 in your case, and the difference grows as the number of drives increases.

I concur with some of the other thoughts here, that the recommended configurations are migrating toward L2ARC plus big, slow SATA pools. The recent automated pool recovery and ZIL removal improvements make that design much more practical.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
>> Applying classic RAID terms to zfs is just plain wrong and misleading since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.
>
> That's not entirely true, is it?
> * RAIDZ is RAID5 + checksum + COW
> * RAIDZ2 is RAID6 + checksum + COW
> * A stack of mirror vdevs is RAID10 + checksum + COW

Others have noted that RAID-Z isn't really the same as RAID-5, and RAID-Z2 isn't the same as RAID-6, because RAID-5 and RAID-6 define not just the number of parity disks (which would have made far more sense in my mind) but also a notion of how the data and parity are laid out. The RAID levels were used to describe groupings of existing implementations, and they conflate things like the number of parity devices with, say, how parity is distributed across devices. For example, RAID-Z1 lays out data most like RAID-3 (a single block is carved up and spread across many disks), but distributes parity as RAID-5 requires, though in a different manner. It's an unfortunate state of affairs, which is why further RAID levels should identify only the most salient aspect (the number of parity devices), or we should use unambiguous terms like single-parity and double-parity RAID.

> If we can compare apples and oranges, would your recommendation ("use raidz2 and/or raidz3") stay the same when comparing against mirrors with the same number of parity drives? In other words, a 2-drive mirror compares to raidz1 the way a 3-drive mirror compares to raidz2 and a 4-drive mirror compares to raidz3? If you were an enterprise (in other words, cared about performance), why would you ever use raidz instead of throwing more drives at the problem and doing mirroring with identical parity?

You're right that a mirror is a degenerate form of raidz1, for example, but mirrors allow for specific optimizations. While the redundancy would be the same, the performance would not.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
I know I'm a bit late to contribute to this thread, but I'd still like to add my $0.02.

My "gut feel" is that we (generally) don't yet understand the subtleties of disk drive failure modes as they relate to 1.5 or 2TB+ drives. Why? Because those large drives have not been widely available until relatively recently. There's a tendency to extrapolate one's existing knowledge base and understanding of how and why drives fail (or degrade), basing the expected outcome on some "extension" of that knowledge. In the case of the current generation of high capacity drives, that may or may not be appropriate. We simply don't know! Mainly because the hard drive manufacturers, those engineering gods and providers of ever increasing storage density, don't communicate their acquired and evolving knowledge as it relates to disk reliability (or failure) mechanisms.

In this case I feel, as a user, it's best to take a very conservative approach and err on the side of safety by using raidz3 when high capacity drives are being deployed. Over time, a consensus-based understanding of the failure modes will emerge, and then, from a user perspective, we can have a clearer understanding of the risks of data loss and its relation to different ZFS pool configurations.

Personally, I was surprised at how easily I was able to "take out" a 1TB WD Caviar Black drive by moving a 1U server with the drives spinning. Earlier drive generations (500GB or smaller) tolerated this abuse with no signs of degradation. So I know that high capacity drives are a lot more sensitive to mechanical "abuse". I can only assume that 2TB drives are probably even more sensitive, and that shock mounting, to reduce the vibration induced by a bunch of similar drives operating in the same "box", is probably a smart move.

Likewise, my previous experience has seen how a given percentage of disk drives would fail in the 2 or 3 week period following a temperature "excursion" in a data center environment. Sometimes everyone knows about that event, and sometimes the folks doing A/C work over a holiday weekend will "forget" to publish the details of what went wrong! :) Again, the same doubts continue to nag me: are the current 1.5TB+ drives more likely to suffer degradation due to a temperature excursion over a relatively short time period? If the drive firmware does its job and remaps damaged sectors or tracks transparently, we, as the users, won't know - until it happens one time too many!!

Regards,

-- 
Al Hopper
Logical Approach Inc, Plano, TX
al at logical-approach.com
Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/