Folks,

We are in the process of purchasing new SANs that our mail server (JES3) runs on. We have moved our mailstores to ZFS and continue to see checksum errors -- they are corrected automatically, which is an improvement over the UFS inode errors that required a system shutdown and fsck.

So, I am recommending that we buy small JBODs, do raidz2, and let ZFS handle the RAID on these boxes. As we need more storage, we can add boxes and place them in a pool. This would give us more controllers and more spindles, which I would think would add reliability and performance. I am thinking SATA II drives.

Any recommendations and/or advice is welcome.

thanks,
keith
Keith Clay wrote:
> So, I am recommending that we buy small jbods, do raidz2 and let zfs
> handle the raiding of these boxes. [...] I am thinking SATA II drives.
>
> Any recommendations and/or advice is welcome.

I would take a look at the Hitachi Enterprise-class SATA drives. Also, try to keep them cool.

-- richard
On Thu, 2006-09-28 at 10:51 -0700, Richard Elling - PAE wrote:
> Keith Clay wrote:
> > So, I am recommending that we buy small jbods, do raidz2 and let zfs
> > handle the raiding of these boxes. [...] I am thinking SATA II drives.

Also, I can't remember how JES3 does its mailstore, but lots of little writes to a RAIDZ volume aren't good for performance, even though ZFS is better about waiting for sufficient write data to do a full-stripe-width write (vs. RAID-5).

That is, using RAIDZ on SATA isn't a good performance idea for the small-write usage pattern, so I'd be careful and get a demo unit first to check out the actual numbers.

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Erik Trimble writes:
> Also, I can't remember how JES3 does its mailstore, but lots of little
> writes to a RAIDZ volume aren't good for performance [...]
> That is, using RAIDZ on SATA isn't a good performance idea for the small
> write usage pattern, so I'd be careful and get a demo unit first to
> check out the actual numbers.

IMO, RAIDZn should perform admirably on the write loads. The random-read side is more limited. The simple rule of thumb is that a RAIDZ group will deliver random-read IOPS with the performance characteristic of a single device. That rule does not apply to streaming reads or writes, only to small random-read patterns.

If that means you need to construct small RAIDZ groups, then do consider mirroring as an alternative.
-r

____________________________________________________________________________________
Performance, Availability & Architecture Engineering

Roch Bourbonnais             Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst   180, Avenue De L'Europe, 38330,
                             Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon   http://blogs.sun.com/roch
Roch.Bourbonnais at Sun.Com       (+33).4.76.18.83.20
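Roch's rule of thumb above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes each SATA drive delivers about 80 random-read IOPS (an illustrative figure, not a measurement), that a raidz group reads like a single drive, and that a mirrored pool can service independent reads from every spindle.

```python
# Rough random-read IOPS model for a 12-disk pool, per Roch's rule of thumb.
# The 80 IOPS/drive figure is an assumption for illustration only.

DRIVE_IOPS = 80        # assumed small-random-read IOPS of one SATA drive
TOTAL_DISKS = 12

def raidz_read_iops(disks, group_width):
    """Each raidz group reads like a single drive, so random-read IOPS
    scale with the number of groups, not the number of disks."""
    groups = disks // group_width
    return groups * DRIVE_IOPS

def mirror_read_iops(disks):
    """A pool of mirrors can service independent reads from every spindle."""
    return disks * DRIVE_IOPS

print(raidz_read_iops(TOTAL_DISKS, 12))  # one wide raidz group  -> 80
print(raidz_read_iops(TOTAL_DISKS, 4))   # three 4-disk groups   -> 240
print(mirror_read_iops(TOTAL_DISKS))     # six 2-way mirrors     -> 960
```

This is why the advice is to use several small raidz groups, or mirrors, when the workload is dominated by small random reads, as a mailstore's typically is.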
On Sep 29, 2006, at 2:41 AM, Roch wrote:
> The simple rule of thumb is to consider that a RAIDZ group will deliver
> random read IOPS with the performance characteristic of a single device. [...]
> If that means you need to construct small RAIDZ groups
> then do consider mirroring as an alternative.

So, mirroring on JBODs would give me the same or better performance than raidzn? Would that apply only to FC drives, or also to SATA?

keith
Keith Clay writes:
> So, mirroring on JBODs would give me the same or better performance
> than raidzn? Would that apply only to FC drives, or also to SATA?

On small random read loads? It's much better to use mirroring, and that's independent of device type.
On Sep 29, 2006, at 6:24 AM, Roch wrote:
> On small random read loads? It's much better to use mirroring,
> and that's independent of device type.

What about the case of an iSCSI LUN? Does this change? I get that while local to the system a read from a mirror versus a RAIDZ pool is desirable, but would an IP network introduce enough latency that the difference is negligible? And wouldn't I get better aggregate performance across more spindles for multiple iSCSI LUNs than trying to create a mirror pair for each individual iSCSI LUN?

Thanks,
--Randy
On Fri, 2006-09-29 at 09:41 +0200, Roch wrote:
> IMO, RAIDZn should perform admirably on the write loads.
> The random reads aspects is more limited. [...]
> If that means you need to construct small RAIDZ groups
> then do consider mirroring as an alternative.
I'd like to see benchmarking for RAIDZn vs. striping or a single disk for random writes.

Random (re)write is a problem for RAID-5 (and relatives), as it requires a full stripe-width read, parity calculation, then a full stripe-width write to do any data change on a stripe. For new data, RAID-5 isn't so bad, since it can skip the initial stripe read. And, as pointed out, random read for RAID-5 is good, equal to striping in most cases.

Now, RAIDZn should beat RAID-5 since it tends to queue up writes until it can write a full stripe at once (right?), so you will get _fewer_ writes required, but it still has the same problem for sparse writes (i.e. small writes spaced far apart on the disk layout, where writes to the same area are infrequent).

For the original question, of a mail store backend, I think the best compromise between cost and performance is to use RAIDZn across SATA JBODs for mail archives, and RAID-10 (striped mirrors) on SCSI or FC drives for the primary mail spool/user directories. Assuming, of course, this is a system handling a minimum of 100,000 messages/day (i.e. mid-size business and larger).

--
Erik Trimble
Java System Support
Actually, random writes on a RAID-5, while not performing that well because of the pre-read, don't require a full stripe read (or write). They only require reading the old data and parity, then writing the new data and parity. This is quite a bit better than a full stripe, since only two actuators are involved; thus you can simultaneously process up to N/2 write operations on an N-wide array. (If most of the stripe is being written, it's cheaper to read the remaining data and use that to compute the parity instead; many implementations use this optimization.)

RAID-Z requires the full stripe access to verify the checksum. (Though some checksum algorithms could be recomputed with only a partial read, as above, allowing parallel writes.)

RAID-10 is almost certainly the best choice for a random-write-heavy workload, while RAID-Z gives the best storage utilization. Perhaps ZFS will provide migration facilities someday to automatically place blocks on the optimal storage type. :-) Actually, there's no inherent reason not to allow both mirrored and RAID-Z blocks to be intermixed on the same devices in the future....

This message posted from opensolaris.org
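The small-write parity update described above relies on the XOR identity new_parity = old_parity XOR old_data XOR new_data, which needs only two reads and two writes. A minimal sketch, using hypothetical one-byte "blocks" to stand in for disk blocks:

```python
import functools
import operator

def stripe_parity(blocks):
    """Parity of a stripe is the XOR of all its data blocks."""
    return functools.reduce(operator.xor, blocks)

# A hypothetical 4-data-disk stripe of one-byte blocks.
stripe = [0x11, 0x22, 0x33, 0x44]
parity = stripe_parity(stripe)

# Small-write update: read old data + old parity, write new data +
# new parity (2 reads, 2 writes) -- no full-stripe read needed.
old_data, new_data = stripe[1], 0x99
new_parity = parity ^ old_data ^ new_data
stripe[1] = new_data

# The shortcut agrees with recomputing parity from the whole stripe.
assert new_parity == stripe_parity(stripe)
print(hex(new_parity))  # -> 0xff
```

Since each such write busies only two actuators (the data disk and the parity disk), an N-wide array can, in the best case, service about N/2 independent small writes concurrently, which is the point made above.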
Randy Bias wrote:
> What about the case of an iSCSI LUN? Does this change? [...] And
> wouldn't I get better aggregate performance across more spindles for
> multiple iSCSI LUNs than trying to create a mirror pair for each
> individual iSCSI LUN?

I hate to say it but ... it depends. How fast is the iSCSI connection? There are too many variables in the stack -- network, array, backend target, iSCSI stack, etc. -- to make a generalization.
On Sep 30, 2006, at 1:07 PM, Torrey McMahon wrote:
> I hate to say it but ... it depends. How fast is the iSCSI connection?
> There are too many variables in the stack -- network, array, backend
> target, iSCSI stack, etc. -- to make a generalization.

Hrm... I hate to overload this particular thread with details on this question. Perhaps I should lob in another e-mail? I'm essentially intending to build a backend iSCSI target using Solaris x86 w/ZFS on commodity hardware. The front end is a heterogeneous group of VMware instances running on Linux physical hosts. I'll open up a separate thread, but essentially I'm trying to get scalable performance out of the backend. It doesn't have to be blazing fast, but it would be great if I could get close to a single spindle of speed for each VM, up to say 10 or even 20 VMs.

--Randy
Folks,

Would it be wise to buy 2 JBOD boxes and place one side of the mirror on each one? Would that make sense?

Also, we are looking at SATA-to-FC to hook into our SAN. Any comments/admonitions/advice?

keith
On Oct 3, 2006, at 11:15 AM, Keith Clay wrote:
> Would it be wise to buy 2 JBOD boxes and place one side of the mirror
> on each one? Would that make sense?

Of course that makes sense. Doing so will give you chassis-level redundancy. If one JBOD were to, say, lose power, or in some way experience a failure which affects every drive in it, you will definitely want that other side of the mirror on your (hopefully) still-operable second JBOD.

/dale
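The layout described above can be made concrete. The sketch below just assembles a hypothetical `zpool create` command that pairs each drive in one JBOD with its counterpart in the other, so every mirror vdev spans both chassis. The device names (c2t0d0, c3t0d0, etc.) are made up for illustration; real names depend on your controllers.

```python
def crossed_mirror_cmd(pool, jbod_a, jbod_b):
    """Build a zpool create command whose mirror vdevs each take one
    disk from each JBOD, giving chassis-level redundancy."""
    vdevs = [f"mirror {a} {b}" for a, b in zip(jbod_a, jbod_b)]
    return f"zpool create {pool} " + " ".join(vdevs)

# Hypothetical device names: controller 2 = JBOD A, controller 3 = JBOD B.
jbod_a = [f"c2t{i}d0" for i in range(3)]
jbod_b = [f"c3t{i}d0" for i in range(3)]
print(crossed_mirror_cmd("mailstore", jbod_a, jbod_b))
# -> zpool create mailstore mirror c2t0d0 c3t0d0
#                           mirror c2t1d0 c3t1d0 mirror c2t2d0 c3t2d0
```

With this arrangement the pool stays up through the loss of either entire JBOD, since each mirror still has one surviving side.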
> Now, RAIDZn should beat RAID-5 since it tends to queue up writes until
> it can write a full stripe at once (right?)

Correct.

> so you will get _less_ writes required, but it still has the same
> problem for sparse writes (i.e. small writes spaced far apart on the
> disk layout, where writes to the same area are infrequent).

Where writes end up on disk is not controlled by application behavior but by free-block availability. Writes should bunch up either into contiguous free blocks or at least into blocks that have physical proximity.

-r
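The point about write placement can be illustrated with a toy copy-on-write allocator: logical blocks scattered all over the volume are rewritten, yet the new copies all land in one contiguous run of free space. This is a deliberately simplified sketch of the idea, not of ZFS's actual allocator.

```python
class ToyCOWAllocator:
    """Minimal copy-on-write allocator: every (re)write goes to the next
    free block, so scattered logical writes become one sequential run on
    disk. A rough sketch of the idea above, not ZFS itself."""

    def __init__(self):
        self.block_map = {}   # logical block -> current physical block
        self.next_free = 0    # head of the free region

    def write(self, logical):
        # Never overwrite in place; allocate a fresh physical block.
        phys = self.next_free
        self.next_free += 1
        self.block_map[logical] = phys
        return phys

alloc = ToyCOWAllocator()
# The application writes to widely scattered logical addresses...
placements = [alloc.write(lb) for lb in (7, 512, 33, 900, 120)]
# ...but the allocator placed them back to back on disk.
print(placements)  # -> [0, 1, 2, 3, 4]
```

This is why small, application-random writes can still go out as one large sequential I/O: placement is decided by free-block availability, not by the logical addresses the application touched.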