Suppose I want to build a 100-drive storage system. Are there any disadvantages to setting up 20 arrays of HW RAID0 (5 drives each), then setting up a ZFS file system on these 20 virtual drives and configuring them as RAIDZ?

I understand people always say ZFS doesn't prefer HW RAID. In this case, the HW RAID0 is only for striping (allows a higher data transfer rate), while the actual RAID5 (i.e. RAIDZ) is done via ZFS, which takes care of all the checksumming/error detection/auto-repair. I guess this will not affect any of the advantages of using ZFS, while I could get a higher data transfer rate. Is that the case?

Any suggestion or comment? Please kindly advise. Thanks!
-- 
This message posted from opensolaris.org
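For concreteness, the proposed layout amounts to a single 20-wide raidz1 vdev whose members are the hardware RAID0 LUNs. A minimal sketch follows, with purely hypothetical device names (each cXtYd0 standing in for one 5-drive HW RAID0 volume); it only prints the command line, to show the shape of the configuration being asked about:

```python
# Hypothetical device names: one per 5-drive hardware RAID0 LUN.
luns = [f"c2t{i}d0" for i in range(20)]

# A single 20-wide raidz1 vdev built on top of the stripes.
print("zpool create tank raidz " + " ".join(luns))
```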
D'oh. I shouldn't answer questions first thing Monday morning.

I think you should test this configuration with and without the underlying hardware RAID. If RAIDZ is the right redundancy level for your workload, you might be pleasantly surprised with a RAIDZ configuration built on the h/w RAID array in JBOD mode.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

cs

On 08/15/11 08:41, Cindy Swearingen wrote:
> Hi Tom,
>
> I think you should test this configuration with and without the
> underlying hardware RAID.
>
> If RAIDZ is the right redundancy level for your workload,
> you might be pleasantly surprised.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> Thanks,
>
> Cindy
On Fri, 12 Aug 2011, Tom Tang wrote:
> Suppose I want to build a 100-drive storage system, wondering if
> there is any disadvantages for me to setup 20 arrays of HW RAID0 (5
> drives each), then setup ZFS file system on these 20 virtual drives
> and configure them as RAIDZ?

The main concern would be resilver times if a drive in one of the HW RAID0s fails. The resilver time would be similar to one huge disk drive, since there would not be any useful concurrency.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
IMHO, not a good idea: if any two HDDs in different RAID0 arrays fail, the zpool is dead. If possible, just make each HDD its own single-drive RAID0 (effectively JBOD) and then use ZFS to do the mirroring; raidz or raidz2 would be the last choice.

Sent from my iPad

Hung-Sheng Tsao (LaoTsao) Ph.D

On Aug 12, 2011, at 21:34, Tom Tang <thompsont at supermicro.com> wrote:

> Suppose I want to build a 100-drive storage system, wondering if there is any disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?
>
> I understand people always say ZFS doesn't prefer HW RAID. Under this case, the HW RAID0 is only for striping (allows higher data transfer rate), while the actual RAID5 (i.e. RAIDZ) is done via ZFS which takes care all the checksum/error detection/auto-repair. I guess this will not affect any advantages of using ZFS, while I could get higher data transfer rate. Wondering if it's the case?
>
> Any suggestion or comment? Please kindly advise. Thanks!
> -- 
> This message posted from opensolaris.org
On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang <thompsont at supermicro.com> wrote:
> Suppose I want to build a 100-drive storage system, wondering if there is any disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?

A 20-device-wide raidz is a bad idea. Making those devices from stripes just compounds the issue. The biggest problem is that resilvering would be a nightmare, and you're practically guaranteed to have additional failures or read errors while degraded.

You would achieve better performance, error detection, and recovery by using several top-level raidz vdevs. 20 x 5-disk raidz would give you very good read and write performance with decent resilver times and 20% overhead for redundancy. 10 x 10-disk raidz2 would give more protection, but a little less performance and higher resilver times.

-B
-- 
Brandon High : bhigh at freaks.com
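To make those two options concrete, here is a minimal sketch with hypothetical device names; it only prints the corresponding zpool create command lines rather than running anything:

```python
# Hypothetical device names for 100 individual disks (no HW RAID).
disks = [f"c3t{i}d0" for i in range(100)]

# Option A: 20 top-level 5-disk raidz1 vdevs (20% parity overhead).
vdevs_a = ["raidz " + " ".join(disks[i:i + 5]) for i in range(0, 100, 5)]
print("zpool create tank " + " ".join(vdevs_a))

# Option B: 10 top-level 10-disk raidz2 vdevs (double parity, also 20% overhead).
vdevs_b = ["raidz2 " + " ".join(disks[i:i + 10]) for i in range(0, 100, 10)]
print("zpool create tank " + " ".join(vdevs_b))
```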
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Tom Tang
>
> I understand people always say ZFS doesn't prefer HW RAID.

A better way to say it is this: ZFS does RAID better than HW RAID. It's both faster and more reliable. The only exception is that sometimes HW can resilver faster than ZFS, depending on how full your disks are and how the data is laid out on disk.

> HW RAID0 is only for striping (allows higher data transfer rate),

There is no reason to do this. ZFS does striping better and faster than HW.

> while the actual RAID5 (i.e. RAIDZ) is done via ZFS which takes care all the
> checksum/error detection/auto-repair.

You're talking about a 20-way raidz. Definitely don't do that. A resilver will probably never complete on that pool.

If you're trying to achieve 1/20 redundancy... I don't know any good way to do that. Most people would probably advise you to make a bunch of 8-way raidz's or something like that. I know 1/8 is not the same as 1/20, but it's the closest I think you can come reasonably safely.
On Aug 26, 2011, at 4:02 PM, Brandon High <bhigh at freaks.com> wrote:
> On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang <thompsont at supermicro.com> wrote:
>> Suppose I want to build a 100-drive storage system, wondering if there is any disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?
>
> A 20-device wide raidz is a bad idea. Making those devices from
> stripes just compounds the issue.

Yes... you need to think in reverse. Instead of making highly dependable solutions out of unreliable components, you should make judicious use of reliable components. In other words, RAID-10 is much better than RAID-01, and in this case RAID-z0 (a stripe of raidz vdevs) is much better than RAID-0z (a raidz of stripes).

> The biggest problem is that resilvering would be a nightmare, and
> you're practically guaranteed to have additional failures or read
> errors while degraded.

I'm getting a bit tired of people designing for fast resilvering. This is akin to buying a car based on how easy it is to change a flat tire. It is a better idea to base your decision on cost, fuel economy, safety, or even color.

> You would achieve better performance, error detection and recovery by
> using several top-level raidz. 20 x 5-disk raidz would give you very
> good read and write performance with decent resilver times and 20%
> overhead for redundancy. 10 x 10-disk raidz2 would give more
> protection, but a little less performance, and higher resilver times.

A 20 x 5-disk raidz (RAID-z0) is a superior design in every way. Using the simple Mean Time To Data Loss (MTTDL) model, for disks with a rated Mean Time Between Failures (MTBF) of 1 million hours and a Mean Time To Repair (MTTR) of 10 days:

  a 5-disk RAID-0 has an MTTDL of 199,728 hours
  a 20-way raidz of those stripes has an MTTDL of 437,124 hours

as compared to:

  a 20 x 5-disk raidz, which has an MTTDL of 12,395,400 hours

For MTTDL, 12,395,400 hours is better than 437,124 hours. QED
 -- richard
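For readers who want to check the arithmetic, here is a rough sketch of the MTTDL comparison using the standard textbook approximations. Richard's exact figures come from a slightly more detailed model variant, so the numbers below land close to, but not exactly on, his; the ordering and orders of magnitude are the same:

```python
# Back-of-the-envelope MTTDL (hours) under the common approximations.
MTBF = 1_000_000.0   # rated drive MTBF, hours
MTTR = 10 * 24.0     # 10-day mean time to repair, hours

def mttdl_stripe(n, mtbf=MTBF):
    # RAID-0: any single member failure loses the data.
    return mtbf / n

def mttdl_single_parity(n, mtbf=MTBF, mttr=MTTR):
    # raidz1/RAID-5: data loss needs a second failure within the repair window.
    return mtbf ** 2 / (n * (n - 1) * mttr)

five_disk_raid0   = mttdl_stripe(5)                                 # ~200,000 h
raidz_of_stripes  = mttdl_single_parity(20, mtbf=five_disk_raid0)   # ~439,000 h
one_raidz_vdev    = mttdl_single_parity(5)                          # one 5-disk raidz
twenty_raidz_pool = one_raidz_vdev / 20                             # ~10,400,000 h

print(five_disk_raid0, raidz_of_stripes, twenty_raidz_pool)
```

Either way, the pool of twenty 5-disk raidz vdevs comes out more than an order of magnitude ahead of a raidz built on 5-disk stripes.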
On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> I'm getting a bit tired of people designing for fast resilvering.

It is a design consideration regardless, though your point is valid that it shouldn't be the overriding consideration.

To the original question and poster: this often arises out of another type of consideration, that of the size of a "failure unit". When plugging together systems at almost any scale beyond a handful of disks, there are many kinds of groupings of disks whereby the whole group may disappear if a certain component fails: controllers, power supplies, backplanes, cables, network/fabric switches, etc. The probabilities of each of these vary, often greatly, but they can shape and constrain a design.

I'm going to choose a deliberately exaggerated example, to illustrate the discussion and recommendations in the thread, using the OP's numbers. Let's say that I have 20 little 5-disk NAS boxes, each with its own single power supply and NIC. Each is an iSCSI target, and can serve up either 5 bare-disk LUNs, or a single LUN for the whole box backed by internal RAID. Internal RAID can be 0 or 5. Clearly, a box of 5 disks is an independent failure unit, at non-trivial probability via a variety of possible causes. I had better plan my pool accordingly.

The first option is to "simplify" the configuration, representing the obvious failure unit as a single LUN, just a big disk. There is merit in simplicity, especially for the humans involved if they're not sophisticated and experienced ZFS users (or else why would they be asking these questions?). This may prevent confusion and possible mistakes (at 3am under pressure, even experienced admins make those). This gives us 20 "disks" to make a pool, of whatever layout suits our performance and resiliency needs. Regardless of what disks are used, a 20-way RAIDZ is unlikely to be a good answer. 2x 10-way raidz2, 4x 5-way raidz1, or 2-way and 3-way mirrors might all be useful depending on circumstances. (As an aside, mirrors might be the layout of choice if switch failures are also to be taken into consideration, for practical network topologies.)

The second option is to give ZFS all the disks individually. We will embed our knowledge of the failure domains into the pool structure, choosing which disks go in which vdev accordingly. The simplest expression of this is to take the same layout we chose above for 20 big disks, and make 5 copies of it, one top-level vdev in the same pattern for each of the 5 individual disk slots. Think of making 5 separate pools with the same layout as the previous case, and stacking them together into one. (As another aside, in previous discussions I've also recommended considering multiple pools vs multiple vdevs; that still applies, but I won't reiterate it here.)

If our pool had enough redundancy for our needs before, we will now have 5 times as many top-level vdevs, which will experience tolerable failures in groups of 5 if a disk box dies, for the same overall result.

ZFS generally does better this way. We will have more direct concurrency, because ZFS's device tree maps to spindles, rather than to a more complex interaction of underlying components. Physical disk failures can now be seen by ZFS as such, and don't get amplified into whole-LUN failures (RAID0) or performance degradation during internal reconstruction (RAID5). ZFS will prefer not to allocate new data on a degraded vdev until it is repaired, but it needs to know about the degradation in the first place.
Even before we talk about recovery, ZFS can likely report errors better than the internal RAID, which may just hide an issue long enough for it to become a real problem during another, later event.

If we can (e.g.) assign the WWNs of the exported LUNs according to a scheme that makes disk location obvious, we're less likely to get confused because of all the extra disks. The structure is still apparent. (There are more layouts we can now create using the extra disks, but we lose the simplicity, and they don't really enhance this example for the general case. Very careful analysis would be required, and errors under pressure might result in a situation where the system works but later resiliency is compromised. This is especially true if hot spares are involved.)

So, the ZFS preference is definitely for individual disks. What might override this preference, and cause us to use LUNs over the internal RAID, other than the perception of simplicity due to inexperience? Some possibilities are below.

Because local reconstructions within a box may be much faster than over the network. Remember, though, that we trust ZFS more than RAID5 (even before any specific implementation has a chance to add its own bugs and wrinkles). So, effectively, after such a local RAID5 reconstruction we'd want to run a scrub anyway, at which point we might as well just have let ZFS resilver. If we have more than one top-level vdev, which we certainly will in this example, a scrub is a lot more work (and network traffic) than letting the vdev resilver.

Because our iSCSI boxes don't support hot-swap, or the dumb firmware hangs all LUNs when one drive is timing out on errors, so a drive replacement winds up amplifying to a 5-way failure regardless. Or we want to use hot spares, or some other operational consideration basically says we don't ever want to deal with hardware failures below this level of granularity. Remember, though, that recovery from a 5-way offline to fix a single failure is still a lot faster, and thus safer, because each disk only needs to be resilvered for what it missed while offline.

Because the internal RAID in the NAS boxes is really good and fast (a fancy controller with battery-backed NVRAM we've already paid for), such that presenting them as a single fast unit performs much better. Maybe the "fancy controller" is actually ZFS and the NVRAM is an SSD, or maybe it's something else that we trust pretty well. In that case, clustering them into common logical storage with cross-device redundancy, using a method other than ZFS, may be more appropriate.

Our example is contrived, but these considerations apply for other connectivity types, and especially where geographical separation is involved.

-- 
Dan.
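To make the "second option" above concrete: assuming a purely hypothetical naming scheme where box b, slot s shows up on the host as c{b}t{s}d0, the 2x 10-way raidz2 layout replicated once per disk slot could be generated roughly as below. Each box then contributes at most one member to any given vdev, so losing a whole box degrades five raidz2 vdevs by one disk each instead of taking out a vdev:

```python
# Hypothetical naming: box b, slot s -> "c{b}t{s}d0". 20 boxes x 5 slots.
BOXES, SLOTS = 20, 5

vdevs = []
for slot in range(SLOTS):                        # one copy of the layout per slot
    for group in (range(0, 10), range(10, 20)):  # two 10-way raidz2 vdevs per copy
        members = [f"c{box}t{slot}d0" for box in group]
        vdevs.append("raidz2 " + " ".join(members))

# 10 top-level raidz2 vdevs, each with one disk from 10 different boxes.
print("zpool create tank " + " ".join(vdevs))
```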
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Daniel Carosone
>
> On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> > I'm getting a bit tired of people designing for fast resilvering.
>
> It is a design consideration, regardless, though your point is valid
> that it shouldn't be the overriding consideration.

I disagree. I think if you build a system that will literally never complete a resilver, or if the resilver requires weeks or months to complete, then you've fundamentally misconfigured your system. Avoiding such situations should be a top priority. Such a misconfiguration is sometimes the case with people building 21-disk raidz3 and similar configurations...
On Mon, Aug 29, 2011 at 11:40:34PM -0400, Edward Ned Harvey wrote:
> > On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> > > I'm getting a bit tired of people designing for fast resilvering.
> >
> > It is a design consideration, regardless, though your point is valid
> > that it shouldn't be the overriding consideration.
>
> I disagree. I think if you build a system that will literally never
> complete a resilver, or if the resilver requires weeks or months to
> complete, then you've fundamentally misconfigured your system. Avoiding
> such situations should be a top priority. Such a misconfiguration is
> sometimes the case with people building 21-disk raidz3 and similar
> configurations...

OK, yes, for these extreme cases, any of the considerations gets a veto for "pool is unserviceable". Beyond that, though, Richard's point is that optimising for resilver time to the exclusion of other requirements will produce bad designs. In my extended example, I mentioned resilver and recovery times and impacts, but only amongst other factors.

Another way of putting it is that pool configs that are pessimal for resilver will likely also be pessimal for other considerations (general IOPS performance being the obvious closely-linked case).

-- 
Dan.