Hi all,

I'm setting up a couple of 110TB servers and I just want some feedback in case I have forgotten something.

The servers (two of them) will, as of current plans, each use 11 vdevs of seven 2TB WD Blacks, with a couple of Crucial RealSSD 256GB SSDs for the L2ARC and another couple of 100GB OCZ Vertex 2 Pros for the SLOG (I know, it's way too much, but they will wear out more slowly, and there are no fast SSDs around that are small). Each box will have 48GB of RAM on recent Xeon CPUs.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
> The servers (two of them) will, as of current plans, each use 11 vdevs
> of seven 2TB WD Blacks, with a couple of Crucial RealSSD 256GB SSDs for
> the L2ARC and another couple of 100GB OCZ Vertex 2 Pros for the SLOG.

What configuration are you proposing for the vdevs? Don't forget you will have very long resilver times with those drives.

--
Ian.
----- Original Message -----
> On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
> [...]
> What configuration are you proposing for the vdevs? Don't forget you
> will have very long resilver times with those drives.

RAIDz2 on each vdev. I'm aware that the resilver time will be worse than with 10k or 15k drives, but then, those 2TB drives aren't available at anything faster than 7,200 rpm.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
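[Editor's note: for reference, a layout like the one proposed would be built along these lines. This is only a sketch; the pool name "tank" and the cXtYd0 device names are hypothetical placeholders, not the actual controller paths.]

# 11 raidz2 vdevs of 7 disks each, a mirrored SLOG, and two L2ARC devices
zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0
# ...add the remaining nine 7-disk groups the same way:
zpool add tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
# separate intent-log (SLOG) and cache (L2ARC) devices:
zpool add tank log mirror c12t0d0 c12t1d0
zpool add tank cache c13t0d0 c13t1d0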
On 10/ 8/10 11:06 AM, Roy Sigurd Karlsbakk wrote:
>> What configuration are you proposing for the vdevs? Don't forget you
>> will have very long resilver times with those drives.
>
> RAIDz2 on each vdev. I'm aware that the resilver time will be worse
> than with 10k or 15k drives, but then, those 2TB drives aren't
> available at anything faster than 7,200 rpm.

I would seriously consider raidz3, given I typically see 80-100 hour resilver times for 500G drives in raidz2 vdevs. If you haven't already, read Adam Leventhal's paper:

http://queue.acm.org/detail.cfm?id=1670144

--
Ian.
Those must be pretty busy drives. I had a recent failure of a 1.5T disk in a 7-disk raidz2 vdev that took about 16 hours to resilver. There was very little IO on the array, and it had maybe 3.5T of data to resilver.

On Oct 7, 2010, at 3:17 PM, Ian Collins wrote:
> I would seriously consider raidz3, given I typically see 80-100 hour
> resilver times for 500G drives in raidz2 vdevs. If you haven't already,
> read Adam Leventhal's paper:
>
> http://queue.acm.org/detail.cfm?id=1670144

Scott Meilicke
On 10/ 8/10 11:22 AM, Scott Meilicke wrote:
> Those must be pretty busy drives. I had a recent failure of a 1.5T disk
> in a 7-disk raidz2 vdev that took about 16 hours to resilver. There was
> very little IO on the array, and it had maybe 3.5T of data to resilver.

It's a backup staging server (a Thumper), so it's receiving a steady stream of snapshots and rsyncs (from Windows). That's why it typically gets to 100% complete half way through the actual resilver!

--
Ian.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Ian Collins
>
> I would seriously consider raidz3, given I typically see 80-100 hour
> resilver times for 500G drives in raidz2 vdevs.

If you're going raidz3 with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. Mirrors resilver enormously faster than raidzN. At least for now, until maybe one day the raidz resilver code is rewritten.
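[Editor's note: for comparison, the same 77 data disks set up as two-way mirrors would look roughly like this - again a sketch with hypothetical device names. Note the capacity cost: 38 pairs give about 76TB usable, versus about 110TB for the 11 x 7-disk raidz2 layout.]

# two-way mirrors from the same disks (placeholder device names)
zpool create tank \
    mirror c1t0d0 c1t1d0 \
    mirror c1t2d0 c1t3d0 \
    mirror c1t4d0 c1t5d0
# ...repeat "mirror <diskA> <diskB>" for the remaining pairs,
# and keep the odd disk out as a hot spare:
zpool add tank spare c12t6d0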
On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey <shill at nedharvey.com> wrote:
> If you're going raidz3 with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

There is a difference in reliability: raidzN means _any_ N disks can fail, whereas a mirror means one disk in each mirror pair can fail. With a mirror, Murphy's Law says that the second disk to fail will be the pair of the first disk :-).

--
Peter Jeremy
> From: Peter Jeremy [mailto:peter.jeremy at alcatel-lucent.com]
> Sent: Thursday, October 07, 2010 10:02 PM
>
> There is a difference in reliability: raidzN means _any_ N disks can
> fail, whereas a mirror means one disk in each mirror pair can fail.
> With a mirror, Murphy's Law says that the second disk to fail will be
> the pair of the first disk :-).

Maybe. But in reality, you're just guessing the probability of a single failure, the probability of multiple failures, and the probability of multiple failures within the critical time window and critical redundancy set.

The probability of a 2nd failure within the critical time window is smaller whenever the critical time window is decreased, and the probability of that failure being within the critical redundancy set is smaller whenever your critical redundancy set is smaller. So if raidz2 takes twice as long to resilver as a mirror, and has a larger critical redundancy set, then you haven't gained any probable resiliency over a mirror.

Although it's true that with mirrors it's possible for 2 disks to fail and result in loss of the pool, I think the probability of that happening is smaller than the probability of a 3-disk failure in the raidz2.

How much longer does a 7-disk raidz2 take to resilver compared to a mirror? According to my calculations, it's in the vicinity of 10x longer.
On Thu, 7 Oct 2010, Edward Ned Harvey wrote:
> If you're going raidz3 with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

While the math supports using raidz3, practicality (other than storage space) supports using mirrors. Mirrors are just much more agile and easier to maintain. Having one or two hot spares that zfs can resilver to right away will help improve mirrored pool reliability.

> Mirrors resilver enormously faster than raidzN. At least for now, until
> maybe one day the raidz resilver code might be rewritten.

The resilver algorithm is closely aligned with the zfs data storage model, so it is unlikely to improve dramatically.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote:
> Although it's true that with mirrors it's possible for 2 disks to fail
> and result in loss of the pool, I think the probability of that
> happening is smaller than the probability of a 3-disk failure in the
> raidz2.
>
> How much longer does a 7-disk raidz2 take to resilver compared to a
> mirror? According to my calculations, it's in the vicinity of 10x
> longer.

This article has been posted elsewhere, is about 10 months old, but is a good read:

http://queue.acm.org/detail.cfm?id=1670144

Really, there should be a ballpark / back-of-the-napkin formula to be able to calculate this. I've been curious about this too, so here goes a first cut...

DR = disk reliability, in terms of the chance of the disk dying in any given time period, say any given hour
DFW = disk full write - time to write every sector on the disk. This will vary depending on system load, but is still an input item that can be determined by some testing.
RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a 7-disk raidz1 vdev of the given disks
RSZ2 = resilver time for a 7-disk raidz2 vdev of the given disks

Chance of losing all data in a mirror: DLM = RSM * DR
Chance of losing all data in a raidz1: DLRZ1 = RSZ1 * (DR * 6) - x6 because there are six more drives in the vdev, and any one of them could fail
Chance of losing all data in a raidz2: DLRZ2 = RSZ2 * (DR * 6) * (DR * 5)

Now, for the above, I'll make some other assumptions. Let's guess at a 1-year MTBF for our disks and, for purposes here, flat-line that as a constant per-hour failure chance throughout the year. Let's presume rebuilding a mirror takes one hour. Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk than a mirror; I think this is a 'safe' ratio, to the benefit of the mirror. Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk than a mirror; this should again be 'safe' and benefit the mirror.

DR for a one-hour period = 1 / (24 hours * 365 days) = .000114 - the chance a disk might die in any given hour.

DLM = 1 hour * DR = .000114
DLRZ1 = 24 hours * (.000114 * 6) = .0164
DLRZ2 = 72 hours * (.000114 * 6 disks) * (.000114 * 5 disks) = .000028 - a much tinier chance of losing all that data.
A better way to think about it, maybe... Based on our 1-year flat-line MTBF for disks, figure out how much faster the mirror must rebuild for the reliability to be the same as raidz2:

DLM = DLRZ2
.000114 * 1 hour = X hours * (.000114 * 6 disks) * (.000114 * 5 disks)
X = 1 / ((.000114 * 6) * 5) = 1 / .00342 = approx. 292 hours

So the mirror would have to resilver roughly three hundred times faster than the raidz2 in order for it to offer the same level of reliability with regard to the chances of losing the entire vdev to additional disk failures during a resilver.

The governing thing here is that raidz2 gives second-order reliability (two further disks must fail) versus first-order for mirrors and raidz1 (one further disk suffices) - second- and first-order because we are working on the assumption that we have already lost one disk. With raidz3, we would gain another factor of 1 / (.000114 * 4 disks remaining in the vdev), or roughly another 2,000 times more reliability.

Now, the above does not include proper statistics: the chances of that 2nd and 3rd disk failing may be correlated, and thus higher than our flat-line %/hr based on a 1-year MTBF - for instance, if all the disks were purchased in the same lot at the same time, their chances of failing around the same time are higher, etc.
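[Editor's note: a quick sanity check of the arithmetic above, under the same assumptions (flat 1-year MTBF, 1-hour mirror resilver, 72-hour raidz2 resilver):]

$ awk 'BEGIN {
    dr = 1 / (24 * 365)                 # per-hour failure chance
    printf "DR     = %.6f\n", dr
    printf "DLM    = %.6f\n", 1 * dr    # mirror, 1-hour resilver
    printf "DLRZ2  = %.6f\n", 72 * (dr * 6) * (dr * 5)
    printf "break-even raidz2 resilver = %.0f hours\n", 1 / (dr * 30)
}'
DR     = 0.000114
DLM    = 0.000114
DLRZ2  = 0.000028
break-even raidz2 resilver = 292 hours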
On Fri, 8 Oct 2010, Michael DeMan wrote:
> Now, the above does not include proper statistics: the chances of that
> 2nd and 3rd disk failing may be correlated, and thus higher than our
> flat-line %/hr based on a 1-year MTBF - for instance, if all the disks
> were purchased in the same lot at the same time, their chances of
> failing around the same time are higher, etc.

It also does not include the "human factor", which is still the most significant contributor to data loss. This is the most difficult factor to diminish. If the humans have difficulty understanding the system or the hardware, then they are more likely to do something wrong which damages the data.

It also does not account for an OS kernel which caches quite a lot of data in memory (relying on ECC for reliability), and which may have bugs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 8, 2010, at 8:25 AM, Bob Friesenhahn wrote:
> It also does not include the "human factor", which is still the most
> significant contributor to data loss. This is the most difficult factor
> to diminish. If the humans have difficulty understanding the system or
> the hardware, then they are more likely to do something wrong which
> damages the data.

This is often overlooked during system design. It is very easy to lose your head during a high-stress moment and pull the wrong drive (I, of course, have never done that... <ahem>). Having raidz2 (or raidz3) / triple mirrors, graphical pictures of which disk has failed, working LED failure lights, and letting a hot spare finish resilvering before replacing a disk are all good countermeasures.

> It also does not account for an OS kernel which caches quite a lot of
> data in memory (relying on ECC for reliability), and which may have
> bugs.

At some point you have to rely on your backups for the unexpected and unforeseen. Make sure they are good!

Michael, nice reliability write-up!

--
Scott Meilicke
> Now, the above does not include proper statistics: the chances of that
> 2nd and 3rd disk failing may be correlated, and thus higher than our
> flat-line %/hr based on a 1-year MTBF, or stuff like if all the disks
> were purchased in the same lot at the same time, etc.

In addition to this comes another aspect: what if one drive fails and you find bad data on another drive in the same vdev while resilvering? This is quite common these days, and for mirrors that will mean data loss, unless you mirror 3-way or more, which will be rather costly.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
On Fri, 8 Oct 2010, Roy Sigurd Karlsbakk wrote:
> In addition to this comes another aspect: what if one drive fails and
> you find bad data on another drive in the same vdev while resilvering?
> This is quite common these days, and for mirrors that will mean data
> loss, unless you mirror 3-way or more, which will be rather costly.

The "answer" to this is to schedule a periodic scrub. It is of course not a complete answer, since the drive may degrade after the previous scrub and you might still lose some (or even all!) data. If you use mirrors or raidz1, you should definitely include a periodic scrub in the plan. The good news is that mirrors scrub quickly, with far fewer I/Os and less system impact than raidz?.

Regardless, nothing beats raidz3 based on computable statistics.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
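[Editor's note: a periodic scrub is usually just a cron job. For example, to scrub every Sunday at 02:00 from root's crontab - the pool name "tank" is a placeholder:]

# minute hour day-of-month month day-of-week command
0 2 * * 0 /usr/sbin/zpool scrub tank

Progress and any errors found can then be checked with "zpool status tank".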
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> In addition to this comes another aspect: what if one drive fails and
> you find bad data on another drive in the same vdev while resilvering?

Like the resilver, a scrub goes faster with mirrors. Scrub regularly.
On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:
> Regardless, nothing beats raidz3 based on computable statistics.

Well, no, not really. It all depends on the number of sets and the MTTR. Consider the case where you have 1 set of raidz3 and 2 sets of 3-way mirrors. The raidz3 set can only stand to lose 3 disks, where the mirrored sets can stand to lose 4 disks. The answer is not immediately intuitive, because it does depend on the MTTR for practical cases.

-- richard
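[Editor's note: to make the comparison concrete, here is a worked count for 6 disks, taken either as one 6-disk raidz3 set or as two 3-way mirror sets. This is an illustration that assumes simultaneous failures, i.e. it ignores the MTTR Richard mentions:

- 3 failed disks: C(6,3) = 20 possible patterns. The raidz3 survives all 20; the mirrors die in the 2 patterns where one whole set fails, surviving 18 of 20.
- 4 failed disks: C(6,4) = 15 possible patterns. The raidz3 survives none; the mirrors survive the C(3,2) * C(3,2) = 9 patterns with exactly two failures in each set.

So neither layout strictly dominates: the MTTR determines how likely each pattern is in practice, which is Richard's point.]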
On Sat, 9 Oct 2010, Richard Elling wrote:
> On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:
>> Regardless, nothing beats raidz3 based on computable statistics.
>
> Well, no, not really. It all depends on the number of sets and the
> MTTR.

Well, ok. I should have appended "except for 3-way mirrors". :-)

3-way mirrors seem like an expensive solution for bulk data backup, except that if the current data fits (with plenty of headroom) on the 3-way mirror solution, zfs snapshots (with compression enabled) are an excellent way to capture the incremental changes over time. This requires care in how updates are applied to the backup pool, so that unchanged data blocks are not overwritten. Usually backed-up data does not change rapidly over time, so the incremental snapshots don't require much space.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
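[Editor's note: the incremental capture Bob describes usually looks something like the sketch below. The pool and dataset names, the host "backuphost", and the snapshot names are placeholders:]

# enable compression on the backup dataset once
zfs set compression=on backup/data

# on the source: take a snapshot, then send only the delta
# since the previous snapshot to the backup pool
zfs snapshot tank/data@2010-10-09
zfs send -i tank/data@2010-10-08 tank/data@2010-10-09 | \
    ssh backuphost zfs recv -F backup/data
# (-F first rolls the target back to the last received snapshot)

[Using incremental sends (zfs send -i) is what keeps unchanged blocks from being rewritten on the backup pool, which is the care Bob refers to.]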