Eugen Leitl
2011-May-26 16:34 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
How bad would raidz2 do on mostly sequential writes and reads (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?

The best way to go is striping mirrored pools, right? I'm worried about losing the two "wrong" drives out of 8. These are all 7200.11 Seagates, refurbished. I'd scrub once a week, and that'd probably suck on raidz2, too?

Thanks.

--
Eugen* Leitl <leitl> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Roy Sigurd Karlsbakk
2011-May-26 16:40 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
>
> The best way to go is striping mirrored pools, right?
> I'm worried about losing the two "wrong" drives out of 8.
> These are all 7200.11 Seagates, refurbished. I'd scrub
> once a week, that'd probably suck on raidz2, too?

I see no problems with that. I've had rather large pools in production with 'commodity' drives without many issues. Every now and then a drive fails, but so far, no pool failures.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Brandon High
2011-May-26 17:15 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Thu, May 26, 2011 at 9:34 AM, Eugen Leitl <eugen at leitl.org> wrote:

> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?

I was using a similar but slightly higher-spec setup (quad-core CPU & 8 GB RAM) at home and didn't have any problems with an 8-drive raidz2, though my usage is fairly light. The system is more than fast enough to saturate gigabit Ethernet for sequential reads and writes. My drives were WD10EADS "Green" drives.

-B

--
Brandon High : bhigh at freaks.com
Ian Collins
2011-May-26 20:10 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On 05/27/11 04:34 AM, Eugen Leitl wrote:

> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
>
> The best way to go is striping mirrored pools, right?
> I'm worried about losing the two "wrong" drives out of 8.
> These are all 7200.11 Seagates, refurbished. I'd scrub
> once a week, that'd probably suck on raidz2, too?

That's the same configuration (stripe of mirrors) I've been using for my server pool since March 2007. No problems to report so far! Weekly scrubs are somewhat over the top.

--
Ian.
Frank Van Damme
2011-May-27 09:50 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
2011/5/26 Eugen Leitl <eugen at leitl.org>:

> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
>
> The best way to go is striping mirrored pools, right?
> I'm worried about losing the two "wrong" drives out of 8.
> These are all 7200.11 Seagates, refurbished. I'd scrub
> once a week, that'd probably suck on raidz2, too?
>
> Thanks.

Sequential? Let's suppose no spares.

4 mirrors of 2 = sustained bandwidth of 4 disks
raidz2 with 8 disks = sustained bandwidth of 6 disks

So :)

--
Frank Van Damme
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Jim Klimov
2011-May-27 12:38 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
2011-05-27 13:50, Frank Van Damme wrote:

> Sequential? Let's suppose no spares.
> 4 mirrors of 2 = sustained bandwidth of 4 disks
> raidz2 with 8 disks = sustained bandwidth of 6 disks

Well, technically, for reads the mirrors might get parallelized to read different portions of data for separate users (processes). In the case of ZFS that might *theoretically* be metadata vs. data blocks, if they were *theoretically* co-located in different parts of the platters.

More realistically, if you have some other workload besides the one sequential read (which, due to COW and/or dedup and the resulting fragmentation, would not really be sequential anyway), this statement would be closer to the mark: "the 4*2 mirror may give you up to 8 disks of bandwidth (in reads)".

And if ZFS is supposedly smart enough to use request coalescing to minimize mechanical seek times, then it might actually be possible that your disks end up, on average, serving requests from different parts of the platter, i.e. middle-inside and middle-outside, and this might even average out to more than 2x faster than a single drive (due to non-zero track-to-track seek times).

This is purely my speculation, but now that I thought about it, I can't get rid of the idea ;) ...

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular) mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
() ascii ribbon campaign - against html mail
/\ - against microsoft attachments
Eugen Leitl
2011-May-27 12:52 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Fri, May 27, 2011 at 04:38:15PM +0400, Jim Klimov wrote:

> And if ZFS is supposedly smart enough to use request coalescing
> to minimize mechanical seek times, then it might actually be
> possible that your disks end up, on average, serving requests
> from different parts of the platter, i.e. middle-inside and middle-outside,
> and this might even average out to more than 2x faster than a single
> drive (due to non-zero track-to-track seek times).

In practice I've just found out I'm completely CPU-bound. Load goes to >11 during scrub, dd'ing a large file causes ssh to crap out, etc. Completely unusable, in other words.

So I think I'll try to go with a mirrored pool, and see whether the CPU load will go down. Maybe it's FreeBSD (FreeNAS 8.0) brain damage, and things would have been better with OpenSolaris. I'll have to try the HP N36L setup to see what the CPU load with either raidz2 or mirrored pools will be.

> This is purely my speculation, but now that I thought about it, I can't get
> rid of the idea ;) ...

--
Eugen* Leitl <leitl> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Marty Scholes
2011-May-27 18:49 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> 2011/5/26 Eugen Leitl <eugen at leitl.org>:
> > How bad would raidz2 do on mostly sequential writes and reads
> > (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
> >
> > The best way to go is striping mirrored pools, right?
> > I'm worried about losing the two "wrong" drives out of 8.
> > These are all 7200.11 Seagates, refurbished. I'd scrub
> > once a week, that'd probably suck on raidz2, too?
> >
> > Thanks.
>
> Sequential? Let's suppose no spares.
>
> 4 mirrors of 2 = sustained bandwidth of 4 disks
> raidz2 with 8 disks = sustained bandwidth of 6 disks
>
> So :)

Turn it around and discuss writes. Reads may or may not give 8x throughput with mirrors. In either setup, writes will require 8x storage bandwidth since all drives will be written to. Mirrors will deliver 4x throughput and RAIDZ2 will deliver 6x throughput.

For what it's worth, I ran a 22-disk home array as a single RAIDZ3 vdev (19+3) for several months and it was fine. These days I run a 32-disk array laid out as four vdevs, each an 8-disk RAIDZ2, i.e. 4x 6+2.

The best advice is simply to test your workload against different configurations. ZFS lets you pick what works for you.
Edward Ned Harvey
2011-May-28 14:49 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Frank Van Damme
>
> 4 mirrors of 2 = sustained bandwidth of 4 disks
> raidz2 with 8 disks = sustained bandwidth of 6 disks

Correction:

4 mirrors of 2 = sustained read bandwidth of 8 disks, sustained write bandwidth of 4 disks.
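A rough back-of-the-envelope model explains the numbers being traded here: writes land on every drive (a mirror stores each block twice, raidz2 spends two drives per stripe on parity), while mirror reads can be spread across both sides of each mirror. The sketch below is only that simple model in Python; the per-disk throughput figure is an assumed placeholder, not a measurement, and it ignores controller, CPU and fragmentation effects.

    # Back-of-the-envelope sequential-bandwidth model for 8 disks.
    # DISK_MBPS is an assumed per-disk streaming rate, not a measured value.
    DISK_MBPS = 100  # MB/s

    def mirror_stripe(disks, ways=2):
        vdevs = disks // ways
        read = disks * DISK_MBPS   # reads can use both sides of each mirror
        write = vdevs * DISK_MBPS  # each block is written to every mirror side
        return read, write

    def raidz(disks, parity):
        data = disks - parity      # parity drives add no user-data bandwidth
        return data * DISK_MBPS, data * DISK_MBPS

    for name, (r, w) in {"4 x 2-way mirror": mirror_stripe(8),
                         "8-disk raidz2": raidz(8, 2)}.items():
        print("%s: ~%d MB/s read, ~%d MB/s write" % (name, r, w))

With a 100 MB/s placeholder this prints roughly 800/400 MB/s for the mirror stripe and 600/600 MB/s for raidz2, i.e. the "8 disks read, 4 disks write" vs. "6 disks" figures above.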
Edward Ned Harvey
2011-May-28 14:51 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Eugen Leitl
>
> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
>
> The best way to go is striping mirrored pools, right?

As far as performance is concerned, the big difference between these two configurations is random access. The stripe of mirrors will perform much better for random access. Since you said you're doing mostly sequential work (like the home movie theater situation), the raidz2 will be fine for you.
Roy Sigurd Karlsbakk
2011-May-29 10:55 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> > And if ZFS is supposedly smart enough to use request coalescing
> > to minimize mechanical seek times, then it might actually be
> > possible that your disks end up, on average, serving requests
> > from different parts of the platter, i.e. middle-inside and middle-outside,
> > and this might even average out to more than 2x faster than a single
> > drive (due to non-zero track-to-track seek times).
>
> In practice I've just found out I'm completely CPU-bound.
> Load goes to >11 during scrub, dd'ing a large file causes ssh
> to crap out, etc. Completely unusable, in other words.

That's I/O-bound, not CPU-bound. The CPU load is likely to be rather low during scrub, but the I/O load will be high. IIRC this is tunable, at least in Solaris and friends. Also, adding a SLOG/L2ARC will help a lot to make the system more responsive during scrub/resilver, again, at least on Solaris and friends.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Paul Kraus
2011-May-31 12:28 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Fri, May 27, 2011 at 2:49 PM, Marty Scholes <martyscholes at yahoo.com> wrote:

> For what it's worth, I ran a 22-disk home array as a single RAIDZ3 vdev (19+3) for several
> months and it was fine. These days I run a 32-disk array laid out as four vdevs, each an
> 8-disk RAIDZ2, i.e. 4x 6+2.

I tested 40 drives in various configurations and determined that for random read workloads, the I/O scaled linearly with the number of vdevs, NOT the number of drives. See https://spreadsheets.google.com/a/kraus-haus.org/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html for results using raidz2 vdevs. I did not test sequential read performance here as our workload does not include any.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
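The rule of thumb behind that measurement is that a raidz vdev delivers roughly one drive's worth of small random-read IOPS, because every drive in the stripe has to seek for each block, so pool IOPS track the vdev count rather than the drive count (mirrors can do somewhat better, since each side can serve independent reads). A minimal sketch of that rule, with an assumed per-disk IOPS figure chosen only for illustration:

    # Rule-of-thumb random-read IOPS: each raidz vdev behaves roughly
    # like a single drive, so pool IOPS scale with the number of vdevs.
    DISK_IOPS = 80  # assumed figure for a 7200 rpm SATA drive

    def pool_iops(vdevs, mirror_ways=0):
        if mirror_ways:
            # both sides of a mirror can serve independent reads
            return vdevs * mirror_ways * DISK_IOPS
        return vdevs * DISK_IOPS

    print("8 disks, 1 x raidz2:        ~%d IOPS" % pool_iops(1))
    print("8 disks, 2 x 4-disk raidz1: ~%d IOPS" % pool_iops(2))
    print("8 disks, 4 x 2-way mirror:  ~%d IOPS" % pool_iops(4, mirror_ways=2))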
Jim Klimov
2011-May-31 12:48 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
Interesting, although it makes sense ;)

Now, I wonder about reliability (with large 2-3 TB drives and long scrub/resilver/replace times): say I have 12 drives in my box.

I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1 or a 1*12-disk raidz3 with nearly the same capacity (8-9 data disks plus parity). I see that with more vdevs the IOPS will grow - does this translate to better resilver and scrub times as well? Smaller raidz sets can be more easily spread over different controllers and JBOD boxes, which is also an interesting factor...

How good or bad is the expected reliability of 3*4-disk raidz1 vs 1*12-disk raidz3, so which of the tradeoffs is better - more vdevs, or more parity to survive loss of ANY 3 disks vs. the "right" 3 disks?

Thanks,
//Jim Klimov
Dimitar Hadjiev
2011-Jun-12 12:08 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1
> or a 1*12-disk raidz3 with nearly the same capacity (8-9
> data disks plus parity). I see that with more vdevs the
> IOPS will grow - does this translate to better resilver
> and scrub times as well?

Yes, it would translate into better resilver times, as any failure will affect only one of the vdevs, leading to a shorter parity restore time as opposed to rebuilding the whole raidz3. As for scrubbing, it would be as fast as the scrub of each vdev, since the whole pool does not have parity data to synchronize.

> How good or bad is the expected reliability of
> 3*4-disk raidz1 vs 1*12-disk raidz3, so which
> of the tradeoffs is better - more vdevs or more
> parity to survive loss of ANY 3 disks vs. "right"
> 3 disks?

I'd say the chances of losing a whole vdev in a 4*3 configuration equal the chances of losing 4 drives in a 1*12 raidz3 configuration - it might happen, nothing is foolproof.
Erik Trimble
2011-Jun-13 22:01 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On 6/12/2011 5:08 AM, Dimitar Hadjiev wrote:

>> I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1
>> or a 1*12-disk raidz3 with nearly the same capacity (8-9
>> data disks plus parity). I see that with more vdevs the
>> IOPS will grow - does this translate to better resilver
>> and scrub times as well?
>
> Yes, it would translate into better resilver times, as any failure will affect only one of the vdevs, leading to a shorter parity restore time as opposed to rebuilding the whole raidz3. As for scrubbing, it would be as fast as the scrub of each vdev, since the whole pool does not have parity data to synchronize.

Go look through the mail archives, and there are at least a couple of posts from me and Richard Elling (amongst others) about the workload that a resilver requires on a raidz* vdev. Essentially, "typical" usage of a vdev will result in resilver times degrading linearly with each additional DATA disk in the raidz*, as a resilver is IOPS-bound on the single replaced disk. So, a 3-disk raidz1 (2 data disks) should, on average, resilver 4.5 times faster than a 12-disk raidz3 (9 data disks).

>> How good or bad is the expected reliability of
>> 3*4-disk raidz1 vs 1*12-disk raidz3, so which
>> of the tradeoffs is better - more vdevs or more
>> parity to survive loss of ANY 3 disks vs. "right"
>> 3 disks?
>
> I'd say the chances of losing a whole vdev in a 4*3 configuration equal the chances of losing 4 drives in a 1*12 raidz3 configuration - it might happen, nothing is foolproof.

No, the reliability of a 1x12 raidz3 is *significantly* better than that of 4x3 raidz1 (or, frankly, ANY raidz1 configuration using 12 disks). Richard has some stats around here somewhere... basically, the math (singular, you damn Brits! :-) says that while a 3-disk raidz1 will certainly take less time to resilver after a loss than a 12-disk raidz3, this is more than counterbalanced by the ability of a 12-disk raidz3 to handle additional disk losses, where the 4x3 config is only *probabilistically* likely to handle a 2nd or 3rd drive failure.

I'd have to re-look at the exact numbers, but I'd generally say that 2x6 raidz2 vdevs would be better than either 1x12 raidz3 or 4x3 raidz1 (or 3x4 raidz1) for a home server not looking for super-critical protection (in which case, you should be using mirrors with spares, not raidz*).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Paul Kraus
2011-Jun-14 12:04 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Mon, Jun 13, 2011 at 6:01 PM, Erik Trimble <erik.trimble at oracle.com> wrote:

> I'd have to re-look at the exact numbers, but I'd generally say that
> 2x6 raidz2 vdevs would be better than either 1x12 raidz3 or 4x3 raidz1 (or
> 3x4 raidz1) for a home server not looking for super-critical protection (in
> which case, you should be using mirrors with spares, not raidz*).

I saw some stats a year or more ago that indicated the MTTDL for raidZ2 was better than for a 2-way mirror. In order of best to worst I remember the rankings as:

raidZ3 (least likely to lose data)
3-way mirror
raidZ2
2-way mirror
raidZ1 (most likely to lose data)

This is for Mean Time To Data Loss, or essentially the odds of losing _data_ due to one (or more) drive failures. I do not know if this took the number of devices per vdev and time to resilver into account. Non-redundant configurations were not even discussed. This information came out of Sun (pre-Oracle) and _may_ have been traceable back to Brendan Gregg.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
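That ranking falls out of a simple MTTDL model of the kind in Richard Elling's posts linked in the next message: a vdev with P parity drives loses data only when P+1 drives fail within overlapping repair windows, giving roughly MTTDL = MTBF^(P+1) / (N*(N-1)*...*(N-P) * MTTR^P) per vdev, divided by the number of vdevs in the pool. The sketch below is only a rough rendition of that style of model, applied to the 12-drive layouts discussed above; the MTBF and MTTR figures are assumptions chosen to show relative ordering, not vendor numbers, and the model ignores the fact that resilver time itself grows with the number of data drives.

    # Simplified MTTDL model: data is lost when P+1 drives in one vdev
    # fail within overlapping repair windows.  Figures are illustrative.
    from math import prod

    MTBF_H = 100_000   # assumed per-drive mean time between failures, hours
    MTTR_H = 24        # assumed mean time to resilver a replacement, hours

    def vdev_mttdl(ndisks, parity):
        # MTTDL ~ MTBF^(P+1) / (N * (N-1) * ... * (N-P) * MTTR^P)
        combos = prod(ndisks - i for i in range(parity + 1))
        return MTBF_H ** (parity + 1) / (combos * MTTR_H ** parity)

    def pool_mttdl(nvdevs, ndisks, parity):
        return vdev_mttdl(ndisks, parity) / nvdevs  # any vdev loss kills the pool

    layouts = {"4 x 3-disk raidz1":  (4, 3, 1),
               "3 x 4-disk raidz1":  (3, 4, 1),
               "2 x 6-disk raidz2":  (2, 6, 2),
               "6 x 2-way mirror":   (6, 2, 1),
               "1 x 12-disk raidz3": (1, 12, 3)}
    for name, (v, n, p) in layouts.items():
        print("%-20s ~%.2g hours MTTDL" % (name, pool_mttdl(v, n, p)))

Under those assumptions the ordering comes out the same as the list above: raidz3 far ahead, raidz2 well ahead of a 2-way mirror, and raidz1 last.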
Eric D. Mudama
2011-Jun-14 15:48 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Tue, Jun 14 at 8:04, Paul Kraus wrote:

> On Mon, Jun 13, 2011 at 6:01 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
>
>> I'd have to re-look at the exact numbers, but I'd generally say that
>> 2x6 raidz2 vdevs would be better than either 1x12 raidz3 or 4x3 raidz1 (or
>> 3x4 raidz1) for a home server not looking for super-critical protection (in
>> which case, you should be using mirrors with spares, not raidz*).
>
> I saw some stats a year or more ago that indicated the MTTDL for raidZ2
> was better than for a 2-way mirror. In order of best to worst I
> remember the rankings as:
>
> raidZ3 (least likely to lose data)
> 3-way mirror
> raidZ2
> 2-way mirror
> raidZ1 (most likely to lose data)
>
> This is for Mean Time To Data Loss, or essentially the odds of losing
> _data_ due to one (or more) drive failures. I do not know if this took
> the number of devices per vdev and time to resilver into account.
> Non-redundant configurations were not even discussed. This information
> came out of Sun (pre-Oracle) and _may_ have been traceable back to
> Brendan Gregg.

Google "mttdl raidz zfs" digs up:

http://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
http://blogs.oracle.com/relling/entry/raid_recommendations_space_vs_mttdl
http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

I think the second picture is the one you were thinking of. The third link adds raidz3 data to the charts.

--
Eric D. Mudama
edmudama at bounceswoosh.org
Paul Kraus
2011-Jun-14 16:30 UTC
[zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Tue, Jun 14, 2011 at 11:48 AM, Eric D. Mudama <edmudama at bounceswoosh.org> wrote:

> On Tue, Jun 14 at 8:04, Paul Kraus wrote:
>
>> I saw some stats a year or more ago that indicated the MTTDL for raidZ2
>> was better than for a 2-way mirror. In order of best to worst I
>> remember the rankings as:
>>
>> raidZ3 (least likely to lose data)
>> 3-way mirror
>> raidZ2
>> 2-way mirror
>> raidZ1 (most likely to lose data)
>
> Google "mttdl raidz zfs" digs up:
>
> http://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
> http://blogs.oracle.com/relling/entry/raid_recommendations_space_vs_mttdl
> http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
>
> I think the second picture is the one you were thinking of. The third
> link adds raidz3 data to the charts.

Yup, that was it, although I got the information via a Sun FE and not directly.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players