Hi all,

I'm setting up a couple of 110TB servers and I just want some feedback in case I have forgotten something.

The servers (two of them) will, as of current plans, each use 11 vdevs of seven 2TB WD Blacks, with a couple of Crucial RealSSD 256GB SSDs for the L2ARC and another couple of 100GB OCZ Vertex 2 Pros for the SLOG (I know, it's way too much, but they will wear out more slowly, and there are no fast SSDs around that are small). Each box will have 48GB of RAM on recent Xeon CPUs.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
> The servers (two of them) will, as of current plans, each use 11 vdevs
> of seven 2TB WD Blacks, with a couple of Crucial RealSSD 256GB SSDs for
> the L2ARC and another couple of 100GB OCZ Vertex 2 Pros for the SLOG.

What configuration are you proposing for the vdevs? Don't forget you will have very long resilver times with those drives.

--
Ian.
----- Original Message -----
> On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
> [...]
> What configuration are you proposing for the vdevs? Don't forget you
> will have very long resilver times with those drives.

RAIDz2 on each vdev. I'm aware that the resilver time will be worse than with 10k or 15k drives, but then, those 2TB drives aren't available at anything faster than 7,200 rpm.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
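[Editor's note: for reference, a layout like the one proposed would be built along these lines. This is only a sketch; the pool name "tank" and the cXtYd0 device names are hypothetical placeholders, not the actual controller paths.]

# 11 raidz2 vdevs of 7 disks each, a mirrored SLOG, and two L2ARC devices
zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0
# ...add the remaining nine 7-disk groups the same way:
zpool add tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
# separate intent-log (SLOG) and cache (L2ARC) devices:
zpool add tank log mirror c12t0d0 c12t1d0
zpool add tank cache c13t0d0 c13t1d0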
On 10/ 8/10 11:06 AM, Roy Sigurd Karlsbakk wrote:
>> What configuration are you proposing for the vdevs? Don't forget you
>> will have very long resilver times with those drives.
>
> RAIDz2 on each vdev. I'm aware that the resilver time will be worse
> than with 10k or 15k drives, but then, those 2TB drives aren't
> available at anything faster than 7,200 rpm.

I would seriously consider raidz3, given I typically see 80-100 hour resilver times for 500G drives in raidz2 vdevs. If you haven't already, read Adam Leventhal's paper:

http://queue.acm.org/detail.cfm?id=1670144

--
Ian.
Those must be pretty busy drives. I had a recent failure of a 1.5T disk in a 7-disk raidz2 vdev that took about 16 hours to resilver. There was very little IO on the array, and it had maybe 3.5T of data to resilver.

On Oct 7, 2010, at 3:17 PM, Ian Collins wrote:
> I would seriously consider raidz3, given I typically see 80-100 hour
> resilver times for 500G drives in raidz2 vdevs. If you haven't already,
> read Adam Leventhal's paper:
>
> http://queue.acm.org/detail.cfm?id=1670144

Scott Meilicke
On 10/ 8/10 11:22 AM, Scott Meilicke wrote:
> Those must be pretty busy drives. I had a recent failure of a 1.5T disk
> in a 7-disk raidz2 vdev that took about 16 hours to resilver. There was
> very little IO on the array, and it had maybe 3.5T of data to resilver.

It's a backup staging server (a Thumper), so it's receiving a steady stream of snapshots and rsyncs (from Windows). That's why it typically gets to 100% complete half way through the actual resilver!

--
Ian.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Ian Collins
>
> I would seriously consider raidz3, given I typically see 80-100 hour
> resilver times for 500G drives in raidz2 vdevs.

If you're going raidz3 with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. Mirrors resilver enormously faster than raidzN. At least for now, until maybe one day the raidz resilver code is rewritten.
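[Editor's note: for comparison, the same 77 data disks set up as two-way mirrors would look roughly like this - again a sketch with hypothetical device names. Note the capacity cost: 38 pairs give about 76TB usable, versus about 110TB for the 11 x 7-disk raidz2 layout.]

# two-way mirrors from the same disks (placeholder device names)
zpool create tank \
    mirror c1t0d0 c1t1d0 \
    mirror c1t2d0 c1t3d0 \
    mirror c1t4d0 c1t5d0
# ...repeat "mirror <diskA> <diskB>" for the remaining pairs,
# and keep the odd disk out as a hot spare:
zpool add tank spare c12t6d0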
On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey <shill at nedharvey.com> wrote:
> If you're going raidz3 with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

There is a difference in reliability: raidzN means _any_ N disks can fail, whereas a mirror means one disk in each mirror pair can fail. With a mirror, Murphy's Law says that the second disk to fail will be the pair of the first disk :-).

--
Peter Jeremy
> From: Peter Jeremy [mailto:peter.jeremy at alcatel-lucent.com]
> Sent: Thursday, October 07, 2010 10:02 PM
>
> There is a difference in reliability: raidzN means _any_ N disks can
> fail, whereas a mirror means one disk in each mirror pair can fail.
> With a mirror, Murphy's Law says that the second disk to fail will be
> the pair of the first disk :-).

Maybe. But in reality, you're just guessing the probability of a single failure, the probability of multiple failures, and the probability of multiple failures within the critical time window and critical redundancy set.

The probability of a 2nd failure within the critical time window is smaller whenever the critical time window is decreased, and the probability of that failure being within the critical redundancy set is smaller whenever your critical redundancy set is smaller. So if raidz2 takes twice as long to resilver as a mirror, and has a larger critical redundancy set, then you haven't gained any probable resiliency over a mirror.

Although it's true that with mirrors it's possible for 2 disks to fail and result in loss of the pool, I think the probability of that happening is smaller than the probability of a 3-disk failure in the raidz2.

How much longer does a 7-disk raidz2 take to resilver compared to a mirror? According to my calculations, it's in the vicinity of 10x longer.
On Thu, 7 Oct 2010, Edward Ned Harvey wrote:
> If you're going raidz3 with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

While the math supports using raidz3, practicality (other than storage space) supports using mirrors. Mirrors are just much more agile and easier to maintain. Having one or two hot spares that zfs can resilver to right away will help improve mirrored pool reliability.

> Mirrors resilver enormously faster than raidzN. At least for now, until
> maybe one day the raidz resilver code might be rewritten.

The resilver algorithm is closely aligned with the zfs data storage model, so it is unlikely to improve dramatically.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote:
> Although it's true that with mirrors it's possible for 2 disks to fail
> and result in loss of the pool, I think the probability of that
> happening is smaller than the probability of a 3-disk failure in the
> raidz2.
>
> How much longer does a 7-disk raidz2 take to resilver compared to a
> mirror? According to my calculations, it's in the vicinity of 10x
> longer.

This article has been posted elsewhere, is about 10 months old, but is a good read:

http://queue.acm.org/detail.cfm?id=1670144

Really, there should be a ballpark / back-of-the-napkin formula to be able to calculate this. I've been curious about this too, so here goes a first cut...

DR = disk reliability, in terms of the chance of the disk dying in any given time period, say any given hour
DFW = disk full write - time to write every sector on the disk. This will vary depending on system load, but is still an input item that can be determined by some testing.
RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a 7-disk raidz1 vdev of the given disks
RSZ2 = resilver time for a 7-disk raidz2 vdev of the given disks

Chance of losing all data in a mirror: DLM = RSM * DR
Chance of losing all data in a raidz1: DLRZ1 = RSZ1 * (DR * 6) - x6 because there are six more drives in the vdev, and any one of them could fail
Chance of losing all data in a raidz2: DLRZ2 = RSZ2 * (DR * 6) * (DR * 5)

Now, for the above, I'll make some other assumptions. Let's guess at a 1-year MTBF for our disks and, for purposes here, flat-line that as a constant per-hour failure chance throughout the year. Let's presume rebuilding a mirror takes one hour. Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk than a mirror; I think this is a 'safe' ratio, to the benefit of the mirror. Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk than a mirror; this should again be 'safe' and benefit the mirror.

DR for a one-hour period = 1 / (24 hours * 365 days) = .000114 - the chance a disk might die in any given hour.

DLM = 1 hour * DR = .000114
DLRZ1 = 24 hours * (.000114 * 6) = .0164
DLRZ2 = 72 hours * (.000114 * 6 disks) * (.000114 * 5 disks) = .000028 - a much tinier chance of losing all that data.
A better way to think about it, maybe... Based on our 1-year flat-line MTBF for disks, figure out how much faster the mirror must rebuild for the reliability to be the same as raidz2:

DLM = DLRZ2
.000114 * 1 hour = X hours * (.000114 * 6 disks) * (.000114 * 5 disks)
X = 1 / ((.000114 * 6) * 5) = 1 / .00342 = approx. 292 hours

So the mirror would have to resilver roughly three hundred times faster than the raidz2 in order for it to offer the same level of reliability with regard to the chances of losing the entire vdev to additional disk failures during a resilver.

The governing thing here is that raidz2 gives second-order reliability (two further disks must fail) versus first-order for mirrors and raidz1 (one further disk suffices) - second- and first-order because we are working on the assumption that we have already lost one disk. With raidz3, we would gain another factor of 1 / (.000114 * 4 disks remaining in the vdev), or roughly another 2,000 times more reliability.

Now, the above does not include proper statistics: the chances of that 2nd and 3rd disk failing may be correlated, and thus higher than our flat-line %/hr based on a 1-year MTBF - for instance, if all the disks were purchased in the same lot at the same time, their chances of failing around the same time are higher, etc.
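[Editor's note: a quick sanity check of the arithmetic above, under the same assumptions (flat 1-year MTBF, 1-hour mirror resilver, 72-hour raidz2 resilver):]

$ awk 'BEGIN {
    dr = 1 / (24 * 365)                 # per-hour failure chance
    printf "DR     = %.6f\n", dr
    printf "DLM    = %.6f\n", 1 * dr    # mirror, 1-hour resilver
    printf "DLRZ2  = %.6f\n", 72 * (dr * 6) * (dr * 5)
    printf "break-even raidz2 resilver = %.0f hours\n", 1 / (dr * 30)
}'
DR     = 0.000114
DLM    = 0.000114
DLRZ2  = 0.000028
break-even raidz2 resilver = 292 hours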
On Fri, 8 Oct 2010, Michael DeMan wrote:
> Now, the above does not include proper statistics: the chances of that
> 2nd and 3rd disk failing may be correlated, and thus higher than our
> flat-line %/hr based on a 1-year MTBF - for instance, if all the disks
> were purchased in the same lot at the same time, their chances of
> failing around the same time are higher, etc.

It also does not include the "human factor", which is still the most significant contributor to data loss. This is the most difficult factor to diminish. If the humans have difficulty understanding the system or the hardware, then they are more likely to do something wrong which damages the data.

It also does not account for an OS kernel which caches quite a lot of data in memory (relying on ECC for reliability), and which may have bugs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 8, 2010, at 8:25 AM, Bob Friesenhahn wrote:
> It also does not include the "human factor", which is still the most
> significant contributor to data loss. This is the most difficult factor
> to diminish. If the humans have difficulty understanding the system or
> the hardware, then they are more likely to do something wrong which
> damages the data.

This is often overlooked during system design. It is very easy to lose your head during a high-stress moment and pull the wrong drive (I, of course, have never done that... <ahem>). Having raidz2 (or raidz3) / triple mirrors, graphical pictures of which disk has failed, working LED failure lights, and letting a hot spare finish resilvering before replacing a disk are all good countermeasures.

> It also does not account for an OS kernel which caches quite a lot of
> data in memory (relying on ECC for reliability), and which may have
> bugs.

At some point you have to rely on your backups for the unexpected and unforeseen. Make sure they are good!

Michael, nice reliability write-up!

--
Scott Meilicke
> Now, the above does not include proper statistics: the chances of that
> 2nd and 3rd disk failing may be correlated, and thus higher than our
> flat-line %/hr based on a 1-year MTBF, or stuff like if all the disks
> were purchased in the same lot at the same time, etc.

In addition to this comes another aspect: what if one drive fails and you find bad data on another drive in the same vdev while resilvering? This is quite common these days, and for mirrors that will mean data loss, unless you mirror 3-way or more, which will be rather costly.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
On Fri, 8 Oct 2010, Roy Sigurd Karlsbakk wrote:
> In addition to this comes another aspect: what if one drive fails and
> you find bad data on another drive in the same vdev while resilvering?
> This is quite common these days, and for mirrors that will mean data
> loss, unless you mirror 3-way or more, which will be rather costly.

The "answer" to this is to schedule a periodic scrub. It is of course not a complete answer, since the drive may degrade after the previous scrub and you might still lose some (or even all!) data. If you use mirrors or raidz1, you should definitely include a periodic scrub in the plan. The good news is that mirrors scrub quickly, with far fewer I/Os and less system impact than raidz?.

Regardless, nothing beats raidz3 based on computable statistics.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
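[Editor's note: a periodic scrub is usually just a cron job. For example, to scrub every Sunday at 02:00 from root's crontab - the pool name "tank" is a placeholder:]

# minute hour day-of-month month day-of-week command
0 2 * * 0 /usr/sbin/zpool scrub tank

Progress and any errors found can then be checked with "zpool status tank".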
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> In addition to this comes another aspect: what if one drive fails and
> you find bad data on another drive in the same vdev while resilvering?

Like the resilver, a scrub goes faster with mirrors. Scrub regularly.
On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:
> Regardless, nothing beats raidz3 based on computable statistics.

Well, no, not really. It all depends on the number of sets and the MTTR. Consider the case where you have 1 set of raidz3 and 2 sets of 3-way mirrors. The raidz3 set can only stand to lose 3 disks, where the mirrored sets can stand to lose 4 disks. The answer is not immediately intuitive, because it does depend on the MTTR for practical cases.

-- richard
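[Editor's note: to make the comparison concrete, here is a worked count for 6 disks, taken either as one 6-disk raidz3 set or as two 3-way mirror sets. This is an illustration that assumes simultaneous failures, i.e. it ignores the MTTR Richard mentions:

- 3 failed disks: C(6,3) = 20 possible patterns. The raidz3 survives all 20; the mirrors die in the 2 patterns where one whole set fails, surviving 18 of 20.
- 4 failed disks: C(6,4) = 15 possible patterns. The raidz3 survives none; the mirrors survive the C(3,2) * C(3,2) = 9 patterns with exactly two failures in each set.

So neither layout strictly dominates: the MTTR determines how likely each pattern is in practice, which is Richard's point.]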
On Sat, 9 Oct 2010, Richard Elling wrote:
> On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:
>> Regardless, nothing beats raidz3 based on computable statistics.
>
> Well, no, not really. It all depends on the number of sets and the
> MTTR.

Well, ok. I should have appended "except for 3-way mirrors". :-)

3-way mirrors seem like an expensive solution for bulk data backup, except that if the current data fits (with plenty of headroom) on the 3-way mirror solution, zfs snapshots (with compression enabled) are an excellent way to capture the incremental changes over time. This requires care in how updates are applied to the backup pool, so that unchanged data blocks are not overwritten. Usually backed-up data does not change rapidly over time, so the incremental snapshots don't require much space.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
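[Editor's note: the incremental capture Bob describes usually looks something like the sketch below. The pool and dataset names, the host "backuphost", and the snapshot names are placeholders:]

# enable compression on the backup dataset once
zfs set compression=on backup/data

# on the source: take a snapshot, then send only the delta
# since the previous snapshot to the backup pool
zfs snapshot tank/data@2010-10-09
zfs send -i tank/data@2010-10-08 tank/data@2010-10-09 | \
    ssh backuphost zfs recv -F backup/data
# (-F first rolls the target back to the last received snapshot)

[Using incremental sends (zfs send -i) is what keeps unchanged blocks from being rewritten on the backup pool, which is the care Bob refers to.]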