Vincent Fox
2007-Dec-01 05:15 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
We will be using Cyrus to store mail on 2540 arrays.

We have chosen to build 5-disk RAID-5 LUNs in two arrays, both connected to the same host, and to mirror and stripe the LUNs: a ZFS RAID-10 set composed of 4 LUNs. Multipathing is also in use for redundancy.

My question is whether there is any guidance on the best choice in CAM for stripe size in the LUNs. The default is 128K right now and can go up to 512K; should we go higher?

Cyrus stores mail messages as many small files, not big mbox files, but there are so many layers in action here that it's hard to know what the best choice is.
Louwtjie Burger
2007-Dec-01 06:43 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
On Dec 1, 2007 7:15 AM, Vincent Fox <vincent_b_fox at yahoo.com> wrote:
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.

Any reason why you are using a mirror of RAID-5 LUNs? I can understand that perhaps you want ZFS to be in control of rebuilding broken vdevs if anything should go wrong ... but rebuilding RAID-5s seems a little over the top.

How about running a ZFS mirror over RAID-0 LUNs? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)

PS: It would be nice to know what the LSI firmware does (after 15 years of evolution) to writes into the controller... it might have been better to buy JBODs ... I see Sun will be releasing some soon (rumour?)
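For reference, the all-mirror layout suggested above (ZFS mirroring across the two arrays, each array presenting plain RAID-0 or raw-disk LUNs) would be built with something along these lines - a sketch only, and the pool and device names are placeholders, not real ones:

  # mirror each LUN in array 1 against its counterpart in array 2;
  # ZFS stripes across the resulting mirror vdevs automatically
  zpool create mailpool \
      mirror c6t<array1-lun0>d0 c6t<array2-lun0>d0 \
      mirror c6t<array1-lun1>d0 c6t<array2-lun1>d0

ZFS then handles resilvering at the vdev level; the trade-off, as noted, is that a failed disk inside a RAID-0 LUN takes the whole LUN with it until someone intervenes and rebuilds it.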
can you guess?
2007-Dec-01 10:59 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.

Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated. ZFS might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even synchronous small writes could perform well if you mirror the ZFS small-update log: flash - at least the kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on the disks, but probably not worth the management overhead unless the array makes it easy to do so) should still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs.

> My question is any guidance on best choice in CAM for stripe size in the LUNs?
>
> Default is 128K right now, can go up to 512K, should we go higher?

By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB parity segment)?

If the former, by all means increase it to 512 KB: this will keep the largest ZFS block on a single disk (assuming that ZFS aligns them on 'natural' boundaries) and help read-access parallelism significantly in large-block cases (I'm guessing that ZFS would use small blocks for small files but still quite possibly use large blocks for its metadata). Given ZFS's attitude toward multi-block on-disk contiguity there might not be much benefit in going to even larger stripe sizes, though it probably wouldn't hurt noticeably either as long as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the above numbers assume the 4 + 1 stripe configuration that you described).

In general, having less than 1 MB per-disk stripe segments doesn't make sense for *any* workload: it only takes 10 - 20 milliseconds to transfer 1 MB from a contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives is similar, since both bandwidth and latency performance improve), which is comparable to the 12 - 13 ms that it takes on average just to position to it - and you can still stream data at high bandwidths in parallel from the disks in an array as long as you have a client buffer as large in MB as the number of disks you need to stream from to reach the required bandwidth (you want 1 GB/sec? no problem: just use a 10 - 20 MB buffer and stream from 10 - 20 disks in parallel). Of course, this assumes that higher software layers organize data storage to provide that level of contiguity to leverage...

- bill
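To make the two possible readings of the "128K" setting concrete for the 4 + 1 RAID-5 groups described (this is just the arithmetic spelled out, not a recommendation):

  If 128 KB is the full stripe width:    32 KB of data per disk + 32 KB parity  = 160 KB per full stripe
  If 128 KB is the per-disk segment:    128 KB of data per disk -> 512 KB data + 128 KB parity = 640 KB per full stripe

Under the first reading a 128 KB ZFS block spans all four data disks; under the second it fits on a single disk, which is the property the message above argues for (and which raising the setting to 512 KB achieves if the first reading is the correct one).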
can you guess?
2007-Dec-01 11:31 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Any reason why you are using a mirror of raid-5 lun's?

Some people aren't willing to run the risk of a double failure - especially when recovery from a single failure may take a long time. E.g., if you've created a disaster-tolerant configuration that separates your two arrays and a fire completely destroys one of them, you'd really like to be able to run the survivor without worrying too much until you can replace its twin (hence each must be robust in its own right).

The above situation is probably one reason why 'RAID-6' and similar approaches (like 'RAID-Z2') haven't generated more interest: if continuous on-line access to your data is sufficiently critical to consider them, then it's also probably sufficiently critical to require such a disaster-tolerant approach (which dual-parity RAIDs can't address).

It would still be nice to be able to recover from a bad sector on the single surviving site, of course, but you don't necessarily need full-blown RAID-6 for that: you can quite probably get by with using large blocks and appending a private parity sector to them (maybe two private sectors, just to accommodate a situation where a defect hits both the last sector in the block and the parity sector that immediately follows it; it would also be nice to know that the block size is significantly smaller than a disk track size, for similar reasons). This would, however, tend to require file-system involvement such that all data was organized into such large blocks: otherwise, all writes for smaller blocks would turn into read/modify/writes.

Panasas (I always tend to put an extra 's' into that name, and to judge from Google so do a hell of a lot of other people: is it because of the resemblance to 'parnassas'?) has been crowing about something that it calls 'tiered parity' recently, and it may be something like the above.

...

> How about running a ZFS mirror over RAID-0 luns? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)

'Wasting' 20% of your disk space (in the current example) doesn't seem all that alarming - especially since you're getting more for that expense than just faster and more automated recovery if a disk (or even just a sector) fails.

- bill
max at bruningsystems.com
2007-Dec-01 14:34 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Hi Bill,

can you guess? wrote:
>> We will be using Cyrus to store mail on 2540 arrays.
>>
>> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.
>
> Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter,

Why does "lots of small files in a largish system with presumably significant access parallelism" make RAID-Z a non-starter?

thanks,
max
can you guess?
2007-Dec-01 14:53 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Hi Bill,
...
> > lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter,
>
> Why does "lots of small files in a largish system with presumably significant access parallelism" make RAID-Z a non-starter?
>
> thanks,
> max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe - so not only do N + 1 disks get written for every block update, but N disks get *read* on every block *read*. Normally, small files can be read in a single I/O request to one disk (even in conventional parity-RAID implementations). RAID-Z requires N I/O requests spread across N disks, so for parallel-access reads to small files RAID-Z provides only about 1/Nth the throughput of conventional implementations, unless the disks are sufficiently lightly loaded that they can absorb the additional load that RAID-Z places on them without reducing throughput commensurately.

- bill
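A rough back-of-the-envelope illustration of the point above, assuming the 4 + 1 groups from this thread and small mail files of around 8 KB (the message sizes are my assumption, not from the thread):

  RAID-5 / mirrored LUNs:  one 8 KB read = 1 I/O to 1 disk, so roughly 5 such reads can proceed in parallel per group
  RAID-Z (4 + 1):          the same 8 KB block is split into ~2 KB pieces on the 4 data disks,
                           so one read = 4 I/Os to 4 disks, and those same 5 concurrent reads need ~20 I/Os

The latency of any single request is similar in both cases; what drops by roughly the factor N is the number of independent small reads the same spindles can service concurrently.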
Vincent Fox
2007-Dec-01 17:57 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> On Dec 1, 2007 7:15 AM, Vincent Fox
>
> Any reason why you are using a mirror of raid-5 lun's?
>
> I can understand that perhaps you want ZFS to be in control of rebuilding broken vdevs, if anything should go wrong ... but rebuilding RAID-5s seems a little over the top.

Because the decision of our technical leads was that a straight ZFS RAID-10 set made up of individual disks from the 2540 was more risky. A double-disk failure in a mirror pair would hose the pool, and when the pool contains email for >10K people this was not acceptable. Another possibility is that one of the arrays goes offline, so you are now running on a RAID-0 stripe set, and then a single disk fails - again you are dead.

The setup we have can survive multiple failures, and we have seen enough weird events in our careers that we decided to do this. YMMV. Let's move on; I just wanted to describe our setup, not start an argument about it.

> How about running a ZFS mirror over RAID-0 luns? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)
>
> PS: It would be nice to know what the LSI firmware does (after 15 years of evolution) to writes into the controller... it might have been better to buy JBODs ... I see Sun will be releasing some soon (rumour?)

A guy in our group exported the disks as LUNs, by the way, and ran Bonnie++; the results were a little better for a straight RAID-10 set of all disks, but not hugely better, not enough to tip the balance towards it. Not perhaps the best test, but what we had time to do.
Vincent Fox
2007-Dec-01 19:00 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated. ZFS might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even synchronous small writes could perform well if you mirror the ZFS small-update log: flash - at least the kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on the disks, but probably not worth the management overhead unless the array makes it easy to do so) should still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs.

I'm not sure I understand you here. I suppose I need to read up on the ZIL option. We are running Solaris 10u4, not OpenSolaris.

Can I set up a disk in each 2540 array for this ZIL, and then mirror them such that if one array goes down I'm not dead? If this ZIL disk also goes dead, what is the failure mode and recovery option then?

We did get the 2540 fully populated. With 12 disks, wanting to have at least ONE hot global spare in each array, and needing to keep LUNs the same size, you end up doing two 5-disk RAID-5 LUNs and 2 hot spares in each array. Not that I really need 2 spares; I just didn't see any way to make good use of an extra disk in each array. If we wanted to dedicate them instead to this ZIL need, what is the best way to go about that?

Our current setup, to be specific:

{cyrus3-1:vf5:136} zpool status
  pool: ms11
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        ms11                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600A0B800038ACA0000002AB47504368d0  ONLINE       0     0     0
            c6t600A0B800038A04400000251475045D1d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600A0B800038A1CF000002994750442Fd0  ONLINE       0     0     0
            c6t600A0B800038A3C40000028447504628d0  ONLINE       0     0     0

errors: No known data errors

> By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB parity segment)?

I'm going from the pulldown menu choices in CAM 6.0 for the 2540 arrays, which are currently at 128K and only go up to 512K. I'll have to pull up the interface again when I am at work, but I think it was called stripe size, and it referred to values the 2540 firmware was assigning to the 5-disk RAID-5 sets.

> If the former, by all means increase it to 512 KB: this will keep the largest ZFS block on a single disk (assuming that ZFS aligns them on 'natural' boundaries) and help read-access parallelism significantly in large-block cases (I'm guessing that ZFS would use small blocks for small files but still quite possibly use large blocks for its metadata). Given ZFS's attitude toward multi-block on-disk contiguity there might not be much benefit in going to even larger stripe sizes, though it probably wouldn't hurt noticeably either as long as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the above numbers assume the 4 + 1 stripe configuration that you described).
>
> In general, having less than 1 MB per-disk stripe segments doesn't make sense for *any* workload: it only takes 10 - 20 milliseconds to transfer 1 MB from a contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives is similar, since both bandwidth and latency performance improve), which is comparable to the 12 - 13 ms that it takes on average just to position to it - and you can still stream data at high bandwidths in parallel from the disks in an array as long as you have a client buffer as large in MB as the number of disks you need to stream from to reach the required bandwidth (you want 1 GB/sec? no problem: just use a 10 - 20 MB buffer and stream from 10 - 20 disks in parallel). Of course, this assumes that higher software layers organize data storage to provide that level of contiguity to leverage...

Hundreds of POP and IMAP user processes come and go from users reading their mail, with hundreds more LMTP processes from mail being delivered to the Cyrus mail-store. Sometimes writes predominate over reads; it depends on the time of day, whether backups are running, etc.
can you guess?
2007-Dec-02 00:36 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> We are running Solaris 10u4 - is the log option in there?

Someone more familiar with the specifics of the ZFS releases will have to answer that.

> If this ZIL disk also goes dead, what is the failure mode and recovery option then?

The ZIL should at a minimum be mirrored. But since that won't give you as much redundancy as your main pool has, perhaps you should create a small 5-disk RAID-0 LUN sharing the disks of each RAID-5 LUN and mirror the log to all four of them: even if one entire array box is lost, the other will still have a mirrored ZIL, and all the RAID-5 LUNs will be the same size (not that I'd expect a small variation in size between the two pairs of LUNs to be a problem that ZFS couldn't handle: can't it handle multiple disk sizes in a mirrored pool as long as each individual *pair* of disks matches?).

Having 4 copies of the ZIL on disks shared with the RAID-5 activity will compromise the log's performance, since each log write won't complete until the slowest copy finishes (i.e., congestion in either of the RAID-5 pairs could delay it). It still should usually be faster than just throwing the log in with the rest of the RAID-5 data, though.

Then again, I see from your later comment that you have the same questions that I had about whether the results reported in http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on suggest that having a separate ZIL may not help much anyway (at least for your specific workload: I can imagine circumstances in which performance of small, synchronous writes might be more critical than other performance, in which case separating them out could be useful).

> We did get the 2540 fully populated with 15K 146-gig drives. With 12 disks, and wanting to have at least ONE hot global spare in each array, and needing to keep LUNs the same size, you end up doing 2 5-disk RAID-5 LUNs and 2 hot spares in each array. Not that I really need 2 spares I just didn't see any way to make good use of an extra disk in each array. If we wanted to dedicate them instead to this ZIL need, what is best way to go about that?

As I noted above, you might not want to have less redundancy in the ZIL than you have in the main pool: while the data in the ZIL is only temporary (until it gets written back to the main pool), there's a good chance that there will *always* be *some* data in it, so if you lost one array box entirely, at least that small amount of data would be at the mercy of any failure on the log disk that made any portion of the log unreadable.

Now, if you could dedicate all four spare disks to the log (mirroring it 4 ways) and make each box understand that it was OK to steal one of them to use as a hot spare should the need arise, that might give you reasonable protection (since then any increased exposure would only exist until the failed disk was manually replaced - and normally the other box would still hold two copies as well). But I have no idea whether the box provides anything like that level of configurability.

...

> Hundreds of POP and IMAP user processes coming and going from users reading their mail. Hundreds more LMTP processes from mail being delivered to the Cyrus mail-store.

And with 10K or more users a *lot* of parallelism in the workload - which is what I assumed, given that you had over 1 TB of net email storage space (but I probably should have made that assumption more explicit, just in case it was incorrect).

> Sometimes writes predominate over reads, depends on time of day whether backups are running, etc. The servers are T2000 with 16 gigs RAM so no shortage of room for ARC cache. I have turned off cache flush also pursuing performance.

From Neil's comment in the blog entry that you referenced, that sounds *very* dicey (at least by comparison with the level of redundancy that you've built into the rest of your system) - even if you have rock-solid UPSs (which have still been known to fail). Allowing a disk to lie to higher levels of the system (if indeed that's what you did by 'turning off cache flush') by saying that it's completed a write when it really hasn't is usually a very bad idea, because those higher levels really *do* make important assumptions based on that information.

- bill
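For what it's worth, on a ZFS release that supports separate log devices (it is an open question above whether Solaris 10u4 does), the four-way mirrored log described in the preceding message would be added with something along these lines, using whatever small LUNs the arrays end up presenting - the device names here are placeholders, not real ones:

  zpool add ms11 log mirror c6t<array1-logA>d0 c6t<array1-logB>d0 \
                            c6t<array2-logA>d0 c6t<array2-logB>d0

If the release in use lacks separate-log support, the ZIL simply remains inside the main pool and there is nothing to configure.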
Vincent Fox
2007-Dec-02 02:11 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> From Neil's comment in the blog entry that you referenced, that sounds *very* dicey (at least by comparison with the level of redundancy that you've built into the rest of your system) - even if you have rock-solid UPSs (which have still been known to fail). Allowing a disk to lie to higher levels of the system (if indeed that's what you did by 'turning off cache flush') by saying that it's completed a write when it really hasn't is usually a very bad idea, because those higher levels really *do* make important assumptions based on that information.

I think the point of dual battery-backed controllers is that data should never be lost. Perhaps I don't know enough. Is it that bad?
can you guess?
2007-Dec-02 03:51 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> I think the point of dual battery-backed controllers is that data should never be lost. Am I wrong?

That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.

If all data is still sent to the controllers as 'normal' disk writes, and they have no concept of, say, using *volatile* RAM to store stuff when higher levels enable the "disk's" write-back cache, nor any inclination to pass such requests along blithely to their underlying disks (which of course would subvert any controller-level guarantees, since the disks can evict data from their own write-back caches as soon as the disk write request completes), then presumably as long as they get the data they guarantee that it will eventually get to the platters, and the ZFS cache-flush mechanism is a no-op.

Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers "Done" as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled. Conversely, if you found that disabling it *did* improve performance, then it probably opened up a significant reliability hole.

- bill
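For context, and as an assumption on my part since the exact knob used hasn't been named in the thread: on recent Solaris/ZFS builds "turning off cache flush" usually refers to the zfs_nocacheflush tunable, while older setups sometimes disabled the ZIL outright with zil_disable, which is a much more drastic step. A sketch of the former:

  * in /etc/system (takes effect at next boot): stop ZFS from issuing
  * cache-flush (SYNCHRONIZE CACHE) requests - only safe when every
  * device in the pool sits behind genuinely non-volatile cache
  set zfs:zfs_nocacheflush = 1

Disabling the ZIL is not a safe substitute for this; it changes the synchronous-write semantics themselves rather than just the flush behaviour.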
Vincent Fox
2007-Dec-02 04:54 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Bill, you have a long-winded way of saying "I don't know". But thanks for elucidating the possibilities.
Anton B. Rang
2007-Dec-02 05:01 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.

The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the end of a transaction group (or ZIL write). It doesn't change the actual read or write commands (which are always sent as ordinary writes -- for the ZIL, I suspect that setting the FUA bit on writes rather than flushing the whole cache might provide better performance in some cases, but I'm not sure, since it probably depends what other I/O might be outstanding).

> Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers "Done" as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled.

The problem with SYNCHRONIZE CACHE is that its semantics aren't quite defined as precisely as one would want (until a fairly recent update). Some controllers interpret it as "push all data to disk" even if they have battery-backed NVRAM. In this case, you lose quite a lot of performance, and you gain only a modicum of reliability (at least in the case of larger RAID systems, which will generally use their battery to flush NVRAM to disk if power is lost).

There's a bit defined now that can be used to say "only flush volatile caches; it's OK if data is in non-volatile cache." But not many controllers support this yet, and Solaris didn't as of last year -- not sure if it's been added yet.

-- Anton
can you guess?
2007-Dec-02 05:04 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Bill, you have a long-winded way of saying "I don't know". But thanks for elucidating the possibilities.

Hmmm - I didn't mean to be *quite* as noncommittal as that suggests: I was trying to say (without intending to offend) "FOR GOD'S SAKE, MAN: TURN IT BACK ON!", and explaining why (i.e., that either disabling it made no difference, and thus it might as well be enabled, or that if it did make a difference, that indicated it was very likely dangerous).

- bill
can you guess?
2007-Dec-02 05:18 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> > That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.
>
> The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the end of a transaction group (or ZIL write). It doesn't change the actual read or write commands (which are always sent as ordinary writes -- for the ZIL, I suspect that setting the FUA bit on writes rather than flushing the whole cache might provide better performance in some cases, but I'm not sure, since it probably depends what other I/O might be outstanding.)

It's a bit difficult to imagine a situation where flushing the entire cache unnecessarily just to force the ZIL would be preferable - especially if ZFS makes any attempt to cluster small transaction groups together into larger aggregates (in which case you'd like to let them continue to accumulate until the aggregate is large enough to be worth forcing to disk in a single I/O).

> The problem with SYNCHRONIZE CACHE is that its semantics aren't quite defined as precisely as one would want (until a fairly recent update). Some controllers interpret it as "push all data to disk" even if they have battery-backed NVRAM.

That seems silly, given that in most other situations they consider data in NVRAM to be equivalent to data on the platter. But silly or not, if that's the way some arrays interpret the command, then it does have performance implications (and the other reply I just wrote would be unduly alarmist in such cases). Thanks for adding some actual experience with the hardware to what had been a purely theoretical discussion.

- bill
Al Hopper
2007-Dec-03 02:14 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
On Fri, 30 Nov 2007, Vincent Fox wrote:

... reformatted ...

> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.
>
> My question is any guidance on best choice in CAM for stripe size in the LUNs?

[after reading the entire thread, where details of the storage-related application are presented piecemeal, and piecing together the details]

I can't give you an answer or a recommendation, because the question does not make sense IMHO. IOW, this is like saying: "I want to get from Dallas to LA as quickly as possible and have already decided that a bicycle would be the best mode of transport; can you tell me how I should configure the bicycle." The problem is that it's very unlikely that the bicycle is the correct solution, so recommending a bicycle config is likely to provide very bad advice... and also to validate the supposition that the bicycle-based solution is, indeed, the correct one.

> Default is 128K right now, can go up to 512K, should we go higher?
>
> Cyrus stores mail messages as many small files, not big mbox files. But there are so many layers in action here it's hard to know what is best choice.

[again based on reading the entire thread and not an answer to the above paragraph]

It appears that the chosen solution is to use a stripe of two hardware RAID-5 LUNs presented by a 2540 (please correct me if this is incorrect). There are several issues with this proposal:

a) You're mixing solutions: hardware RAID-5 and ZFS. Why? All this does is introduce needless complexity and make it very difficult to troubleshoot issues with the storage subsystem - especially if the issue is performance-related. Also - how do you localize a fault condition that is caused by a 2540 RAID firmware bug? How do you isolate performance issues caused by the interaction between the hardware RAID-5 LUNs and ZFS?

b) You've chosen a stripe - despite Richard Elling's best advice (something like "friends don't let friends use stripes"). See Richard's blogs for a comparison of the reliability rates of different storage configurations.

c) For a mail storage subsystem a stripe seems totally wrong. Generally speaking, email stores consist of many small files - with occasional medium-sized files (due to attachments) and, less commonly, some large files - usually limited by the max message size defined by the MTA (a typical value is 10Mb - what is it in your case?).

d) ZFS, with its built-in volume manager, relies on having direct access to individual disks (JBOD). Placing a hardware RAID engine between ZFS and the actual disks is a "black box" in terms of the ZFS volume manager - and it can't possibly "understand" how various storage providers' "black boxes" will behave... especially when ZFS tells the "disk" to do something and the hardware RAID LUN lies to ZFS (example: sync writes).

e) You've presented no data in terms of typical iostat -xcnz 5 output - generalized over various times of the day where particular user data access patterns are known. This information would allow us to give you some basic recommendations. IOW, we need to know the basic requirements in terms of IOPS and average I/O transfer sizes. BTW: Brendan Gregg's DTrace scripts will allow you to gather very detailed I/O usage data on the production system with no risk. (See the sketch appended below for a minimal starting point.)

f) You have not provided any details of the 2540 config - except for the fact that it is "fully loaded", IIRC. SAS disks? 10,000 RPM or 15k RPM drives? Disk drive size?

g) You've provided no details of how the host is configured. If you decide to deploy a ZFS-based system, the amount of installed RAM on the mailserver will have a *huge* impact on the actual load placed on the I/O subsystem. In this regard, ZFS is your friend, as it'll cache almost _everything_, given enough RAM. And DDR2 RAM is (arguably) less than $40 a gigabyte today - with 2Gb SIMMs having reached price parity with the equivalent pricing of 2 * 1Gb DIMMs. For example: if an end-user MUA is configured to poll the mailserver every 30 seconds to check if new mail has arrived, and the mailserver has sufficient (cache) memory, then only the first request will require disk access and a large number of subsequent requests will be handled out of (cache) memory.

h) Another observation: you've commented on the importance of system reliability because there are 10k users on the mailserver. Whether you have 10 users or 10k users or 100k users is of no importance if you are considering system reliability (aka failure rates). IOW, a system that is configured to a certain reliability requirement will be the same regardless of the number of end users that rely on it. The number of concurrent users is important only in terms of system performance and response time.

i) I don't know what the overall storage requirement is (someone said 1Tb, IIRC) and how this relates to the number/size of the available disk drives (in the 2540).

Observations:

1) Any striped config seems inherently wrong - given the available information.

2) Mixing RAID-5 LUNs (backend) with ZFS introduces unnecessary system complexity.

3) Designing a system when no requirements have been presented in terms of:

   i)   I/O access patterns
   ii)  IOPS (I/O ops per second)
   iii) required response time
   iv)  number of concurrent requests
   v)   application host config (CPUs/cores, RAM, I/O bus, disk ctrls)
   vi)  backup methodology and frequency
   vii) storage subsystem config

   ... is very unlikely to result in a correctly configured system that will meet the owner/operator's expectations.

Please don't frame this response as completely negative. That is not my intention - what I'm trying to do is present you with a list of questions that must be answered before a technically correct storage subsystem can be designed and implemented. IOW, before a storage subsystem can be correctly *engineered*. Also - please don't be discouraged by this response. If you are willing to fill in the blanks, I'm willing to help provide a meaningful recommendation.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
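A minimal, non-intrusive starting point for collecting the numbers asked for in (e), using stock Solaris tools on the existing mailservers during representative busy periods (the 5-second interval is just a convention, not a requirement):

  iostat -xcnz 5      # per-device IOPS, average transfer size (kr/s+kw/s vs r/s+w/s), service times, %busy
  fsstat zfs 5        # read/write operation mix as seen by ZFS
  vmstat 5            # confirm whether the box is memory- or CPU-bound rather than I/O-bound

DTrace (for example the scripts in Brendan Gregg's DTraceToolkit) can then break the I/O down further by process and size, but the commands above are usually enough to characterize IOPS and average I/O transfer sizes.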
Vincent Fox
2007-Dec-03 18:02 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Thanks for your observations. HOWEVER, I didn't pose the question "How do I architect the HA and storage and everything for an email system?" Our site, like many other data centers, has HA standards and politics and all this other baggage that may lead a design to a certain point. Thus our answer will be different from yours. You can poke holes in my designs, I can poke holes in yours; this could go on all day. Considering I am adding a new server to a group of existing servers of similar design, we are not going to make radical ground-up redesign decisions at this time. I can fiddle around in the margins with things like stripe size.

I will point out, AS I HAVE BEFORE, that ZFS is not yet completely enterprise-ready in our view. For example, in one commonly proposed, amateurish (IMO) scenario, we would have 2 big JBOD units and mirror the drives between arrays. This works fine if a drive fails or even if an array goes down. BUT you are then left with a storage pool which must be immediately serviced, or a single additional drive failure will destroy the pool. Or take a simple drive failure: which spare rolls in - the one from the same array, or one from the other? Seems a coin toss. When it's a terabyte of email and 10K+ users, that's a big deal for some people, and we did our HA design such that multiple failures can occur with no service impact.

The performance may not be ideal, and the design may not seem ELEGANT to everyone. Mixing HW controller RAID and ZFS mirroring is admittedly an odd hybrid design. Our answer works for us, and that is all that matters. So if someone has an idea of what stripe size will work best for us, that would be helpful. Thanks!