Vincent Fox
2007-Dec-01 05:15 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
We will be using Cyrus to store mail on 2540 arrays.

We have chosen to build 5-disk RAID-5 LUNs in two arrays, both connected to the same host, and to mirror and stripe the LUNs: a ZFS RAID-10 set composed of 4 LUNs. Multipathing is also in use for redundancy.

My question is whether there is any guidance on the best choice in CAM for stripe size in the LUNs. The default is 128K right now and can go up to 512K; should we go higher?

Cyrus stores mail messages as many small files, not big mbox files, but there are so many layers in action here that it's hard to know what the best choice is.
Louwtjie Burger
2007-Dec-01 06:43 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
On Dec 1, 2007 7:15 AM, Vincent Fox <vincent_b_fox at yahoo.com> wrote:
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.

Any reason why you are using a mirror of RAID-5 LUNs? I can understand that perhaps you want ZFS to be in control of rebuilding broken vdevs if anything should go wrong ... but rebuilding RAID-5s seems a little over the top.

How about running a ZFS mirror over RAID-0 LUNs? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)

PS: It would be nice to know what the LSI firmware does (after 15 years of evolution) to writes into the controller... it might have been better to buy JBODs ... I see Sun will be releasing some soon (rumour?)
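For reference, the all-mirror layout suggested above (ZFS mirroring across the two arrays, each array presenting plain RAID-0 or raw-disk LUNs) would be built with something along these lines - a sketch only, and the pool and device names are placeholders, not real ones:

  # mirror each LUN in array 1 against its counterpart in array 2;
  # ZFS stripes across the resulting mirror vdevs automatically
  zpool create mailpool \
      mirror c6t<array1-lun0>d0 c6t<array2-lun0>d0 \
      mirror c6t<array1-lun1>d0 c6t<array2-lun1>d0

ZFS then handles resilvering at the vdev level; the trade-off, as noted, is that a failed disk inside a RAID-0 LUN takes the whole LUN with it until someone intervenes and rebuilds it.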
can you guess?
2007-Dec-01 10:59 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.

Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated. ZFS might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even synchronous small writes could perform well if you mirror the ZFS small-update log: flash - at least the kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on the disks, but probably not worth the management overhead unless the array makes it easy to do so) should still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs.

> My question is any guidance on best choice in CAM for stripe size in the LUNs?
>
> Default is 128K right now, can go up to 512K, should we go higher?

By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB parity segment)?

If the former, by all means increase it to 512 KB: this will keep the largest ZFS block on a single disk (assuming that ZFS aligns them on 'natural' boundaries) and help read-access parallelism significantly in large-block cases (I'm guessing that ZFS would use small blocks for small files but still quite possibly use large blocks for its metadata). Given ZFS's attitude toward multi-block on-disk contiguity there might not be much benefit in going to even larger stripe sizes, though it probably wouldn't hurt noticeably either as long as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the above numbers assume the 4 + 1 stripe configuration that you described).

In general, having less than 1 MB per-disk stripe segments doesn't make sense for *any* workload: it only takes 10 - 20 milliseconds to transfer 1 MB from a contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives is similar, since both bandwidth and latency performance improve), which is comparable to the 12 - 13 ms that it takes on average just to position to it - and you can still stream data at high bandwidths in parallel from the disks in an array as long as you have a client buffer as large in MB as the number of disks you need to stream from to reach the required bandwidth (you want 1 GB/sec? no problem: just use a 10 - 20 MB buffer and stream from 10 - 20 disks in parallel). Of course, this assumes that higher software layers organize data storage to provide that level of contiguity to leverage...

- bill
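To make the two possible readings of the "128K" setting concrete for the 4 + 1 RAID-5 groups described (this is just the arithmetic spelled out, not a recommendation):

  If 128 KB is the full stripe width:    32 KB of data per disk + 32 KB parity  = 160 KB per full stripe
  If 128 KB is the per-disk segment:    128 KB of data per disk -> 512 KB data + 128 KB parity = 640 KB per full stripe

Under the first reading a 128 KB ZFS block spans all four data disks; under the second it fits on a single disk, which is the property the message above argues for (and which raising the setting to 512 KB achieves if the first reading is the correct one).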
can you guess?
2007-Dec-01 11:31 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Any reason why you are using a mirror of raid-5 lun's?

Some people aren't willing to run the risk of a double failure - especially when recovery from a single failure may take a long time. E.g., if you've created a disaster-tolerant configuration that separates your two arrays and a fire completely destroys one of them, you'd really like to be able to run the survivor without worrying too much until you can replace its twin (hence each must be robust in its own right).

The above situation is probably one reason why 'RAID-6' and similar approaches (like 'RAID-Z2') haven't generated more interest: if continuous on-line access to your data is sufficiently critical to consider them, then it's also probably sufficiently critical to require such a disaster-tolerant approach (which dual-parity RAIDs can't address).

It would still be nice to be able to recover from a bad sector on the single surviving site, of course, but you don't necessarily need full-blown RAID-6 for that: you can quite probably get by with using large blocks and appending a private parity sector to them (maybe two private sectors, just to accommodate a situation where a defect hits both the last sector in the block and the parity sector that immediately follows it; it would also be nice to know that the block size is significantly smaller than a disk track size, for similar reasons). This would, however, tend to require file-system involvement such that all data was organized into such large blocks: otherwise, all writes for smaller blocks would turn into read/modify/writes.

Panasas (I always tend to put an extra 's' into that name, and to judge from Google so do a hell of a lot of other people: is it because of the resemblance to 'parnassas'?) has been crowing about something that it calls 'tiered parity' recently, and it may be something like the above.

...

> How about running a ZFS mirror over RAID-0 luns? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)

'Wasting' 20% of your disk space (in the current example) doesn't seem all that alarming - especially since you're getting more for that expense than just faster and more automated recovery if a disk (or even just a sector) fails.

- bill
max at bruningsystems.com
2007-Dec-01 14:34 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Hi Bill,

can you guess? wrote:
>> We will be using Cyrus to store mail on 2540 arrays.
>>
>> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.
>
> Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter,

Why does "lots of small files in a largish system with presumably significant access parallelism" make RAID-Z a non-starter?

thanks,
max
can you guess?
2007-Dec-01 14:53 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Hi Bill,
...
> > lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter,
>
> Why does "lots of small files in a largish system with presumably significant access parallelism" make RAID-Z a non-starter?
>
> thanks,
> max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe - so not only do N + 1 disks get written for every block update, but N disks get *read* on every block *read*. Normally, small files can be read in a single I/O request to one disk (even in conventional parity-RAID implementations). RAID-Z requires N I/O requests spread across N disks, so for parallel-access reads to small files RAID-Z provides only about 1/Nth the throughput of conventional implementations, unless the disks are sufficiently lightly loaded that they can absorb the additional load that RAID-Z places on them without reducing throughput commensurately.

- bill
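A rough back-of-the-envelope illustration of the point above, assuming the 4 + 1 groups from this thread and small mail files of around 8 KB (the message sizes are my assumption, not from the thread):

  RAID-5 / mirrored LUNs:  one 8 KB read = 1 I/O to 1 disk, so roughly 5 such reads can proceed in parallel per group
  RAID-Z (4 + 1):          the same 8 KB block is split into ~2 KB pieces on the 4 data disks,
                           so one read = 4 I/Os to 4 disks, and those same 5 concurrent reads need ~20 I/Os

The latency of any single request is similar in both cases; what drops by roughly the factor N is the number of independent small reads the same spindles can service concurrently.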
Vincent Fox
2007-Dec-01 17:57 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> On Dec 1, 2007 7:15 AM, Vincent Fox
>
> Any reason why you are using a mirror of raid-5 lun's?
>
> I can understand that perhaps you want ZFS to be in control of rebuilding broken vdevs, if anything should go wrong ... but rebuilding RAID-5s seems a little over the top.

Because the decision of our technical leads was that a straight ZFS RAID-10 set made up of individual disks from the 2540 was more risky. A double-disk failure in a mirror pair would hose the pool, and when the pool contains email for >10K people this was not acceptable. Another possibility is that one of the arrays goes offline, so you are now running on a RAID-0 stripe set, and then a single disk fails - again you are dead.

The setup we have can survive multiple failures, and we have seen enough weird events in our careers that we decided to do this. YMMV. Let's move on; I just wanted to describe our setup, not start an argument about it.

> How about running a ZFS mirror over RAID-0 luns? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :)
>
> PS: It would be nice to know what the LSI firmware does (after 15 years of evolution) to writes into the controller... it might have been better to buy JBODs ... I see Sun will be releasing some soon (rumour?)

A guy in our group exported the disks as LUNs, by the way, and ran Bonnie++; the results were a little better for a straight RAID-10 set of all disks, but not hugely better, not enough to tip the balance towards it. Not perhaps the best test, but what we had time to do.
Vincent Fox
2007-Dec-01 19:00 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated. ZFS might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even synchronous small writes could perform well if you mirror the ZFS small-update log: flash - at least the kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on the disks, but probably not worth the management overhead unless the array makes it easy to do so) should still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs.

I'm not sure I understand you here. I suppose I need to read up on the ZIL option. We are running Solaris 10u4, not OpenSolaris.

Can I set up a disk in each 2540 array for this ZIL, and then mirror them such that if one array goes down I'm not dead? If this ZIL disk also goes dead, what is the failure mode and recovery option then?

We did get the 2540 fully populated. With 12 disks, wanting to have at least ONE hot global spare in each array, and needing to keep LUNs the same size, you end up doing two 5-disk RAID-5 LUNs and 2 hot spares in each array. Not that I really need 2 spares; I just didn't see any way to make good use of an extra disk in each array. If we wanted to dedicate them instead to this ZIL need, what is the best way to go about that?

Our current setup, to be specific:

{cyrus3-1:vf5:136} zpool status
  pool: ms11
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        ms11                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600A0B800038ACA0000002AB47504368d0  ONLINE       0     0     0
            c6t600A0B800038A04400000251475045D1d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600A0B800038A1CF000002994750442Fd0  ONLINE       0     0     0
            c6t600A0B800038A3C40000028447504628d0  ONLINE       0     0     0

errors: No known data errors

> By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB parity segment)?

I'm going from the pulldown menu choices in CAM 6.0 for the 2540 arrays, which are currently at 128K and only go up to 512K. I'll have to pull up the interface again when I am at work, but I think it was called stripe size, and it referred to values the 2540 firmware was assigning to the 5-disk RAID-5 sets.

> If the former, by all means increase it to 512 KB: this will keep the largest ZFS block on a single disk (assuming that ZFS aligns them on 'natural' boundaries) and help read-access parallelism significantly in large-block cases (I'm guessing that ZFS would use small blocks for small files but still quite possibly use large blocks for its metadata). Given ZFS's attitude toward multi-block on-disk contiguity there might not be much benefit in going to even larger stripe sizes, though it probably wouldn't hurt noticeably either as long as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the above numbers assume the 4 + 1 stripe configuration that you described).
>
> In general, having less than 1 MB per-disk stripe segments doesn't make sense for *any* workload: it only takes 10 - 20 milliseconds to transfer 1 MB from a contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives is similar, since both bandwidth and latency performance improve), which is comparable to the 12 - 13 ms that it takes on average just to position to it - and you can still stream data at high bandwidths in parallel from the disks in an array as long as you have a client buffer as large in MB as the number of disks you need to stream from to reach the required bandwidth (you want 1 GB/sec? no problem: just use a 10 - 20 MB buffer and stream from 10 - 20 disks in parallel). Of course, this assumes that higher software layers organize data storage to provide that level of contiguity to leverage...

Hundreds of POP and IMAP user processes come and go from users reading their mail, with hundreds more LMTP processes from mail being delivered to the Cyrus mail-store. Sometimes writes predominate over reads; it depends on the time of day, whether backups are running, etc.
can you guess?
2007-Dec-02 00:36 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> We are running Solaris 10u4 - is the log option in there?

Someone more familiar with the specifics of the ZFS releases will have to answer that.

> If this ZIL disk also goes dead, what is the failure mode and recovery option then?

The ZIL should at a minimum be mirrored. But since that won't give you as much redundancy as your main pool has, perhaps you should create a small 5-disk RAID-0 LUN sharing the disks of each RAID-5 LUN and mirror the log to all four of them: even if one entire array box is lost, the other will still have a mirrored ZIL, and all the RAID-5 LUNs will be the same size (not that I'd expect a small variation in size between the two pairs of LUNs to be a problem that ZFS couldn't handle: can't it handle multiple disk sizes in a mirrored pool as long as each individual *pair* of disks matches?).

Having 4 copies of the ZIL on disks shared with the RAID-5 activity will compromise the log's performance, since each log write won't complete until the slowest copy finishes (i.e., congestion in either of the RAID-5 pairs could delay it). It still should usually be faster than just throwing the log in with the rest of the RAID-5 data, though.

Then again, I see from your later comment that you have the same questions that I had about whether the results reported in http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on suggest that having a separate ZIL may not help much anyway (at least for your specific workload: I can imagine circumstances in which performance of small, synchronous writes might be more critical than other performance, in which case separating them out could be useful).

> We did get the 2540 fully populated with 15K 146-gig drives. With 12 disks, and wanting to have at least ONE hot global spare in each array, and needing to keep LUNs the same size, you end up doing 2 5-disk RAID-5 LUNs and 2 hot spares in each array. Not that I really need 2 spares I just didn't see any way to make good use of an extra disk in each array. If we wanted to dedicate them instead to this ZIL need, what is best way to go about that?

As I noted above, you might not want to have less redundancy in the ZIL than you have in the main pool: while the data in the ZIL is only temporary (until it gets written back to the main pool), there's a good chance that there will *always* be *some* data in it, so if you lost one array box entirely, at least that small amount of data would be at the mercy of any failure on the log disk that made any portion of the log unreadable.

Now, if you could dedicate all four spare disks to the log (mirroring it 4 ways) and make each box understand that it was OK to steal one of them to use as a hot spare should the need arise, that might give you reasonable protection (since then any increased exposure would only exist until the failed disk was manually replaced - and normally the other box would still hold two copies as well). But I have no idea whether the box provides anything like that level of configurability.

...

> Hundreds of POP and IMAP user processes coming and going from users reading their mail. Hundreds more LMTP processes from mail being delivered to the Cyrus mail-store.

And with 10K or more users a *lot* of parallelism in the workload - which is what I assumed, given that you had over 1 TB of net email storage space (but I probably should have made that assumption more explicit, just in case it was incorrect).

> Sometimes writes predominate over reads, depends on time of day whether backups are running, etc. The servers are T2000 with 16 gigs RAM so no shortage of room for ARC cache. I have turned off cache flush also pursuing performance.

From Neil's comment in the blog entry that you referenced, that sounds *very* dicey (at least by comparison with the level of redundancy that you've built into the rest of your system) - even if you have rock-solid UPSs (which have still been known to fail). Allowing a disk to lie to higher levels of the system (if indeed that's what you did by 'turning off cache flush') by saying that it's completed a write when it really hasn't is usually a very bad idea, because those higher levels really *do* make important assumptions based on that information.

- bill
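For what it's worth, on a ZFS release that supports separate log devices (it is an open question above whether Solaris 10u4 does), the four-way mirrored log described in the preceding message would be added with something along these lines, using whatever small LUNs the arrays end up presenting - the device names here are placeholders, not real ones:

  zpool add ms11 log mirror c6t<array1-logA>d0 c6t<array1-logB>d0 \
                            c6t<array2-logA>d0 c6t<array2-logB>d0

If the release in use lacks separate-log support, the ZIL simply remains inside the main pool and there is nothing to configure.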
Vincent Fox
2007-Dec-02 02:11 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> From Neil's comment in the blog entry that you referenced, that sounds *very* dicey (at least by comparison with the level of redundancy that you've built into the rest of your system) - even if you have rock-solid UPSs (which have still been known to fail). Allowing a disk to lie to higher levels of the system (if indeed that's what you did by 'turning off cache flush') by saying that it's completed a write when it really hasn't is usually a very bad idea, because those higher levels really *do* make important assumptions based on that information.

I think the point of dual battery-backed controllers is that data should never be lost. Perhaps I don't know enough. Is it that bad?
can you guess?
2007-Dec-02 03:51 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> I think the point of dual battery-backed controllers is that data should never be lost. Am I wrong?

That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.

If all data is still sent to the controllers as 'normal' disk writes, and they have no concept of, say, using *volatile* RAM to store stuff when higher levels enable the "disk's" write-back cache, nor any inclination to pass such requests along blithely to their underlying disks (which of course would subvert any controller-level guarantees, since the disks can evict data from their own write-back caches as soon as the disk write request completes), then presumably as long as they get the data they guarantee that it will eventually get to the platters, and the ZFS cache-flush mechanism is a no-op.

Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers "Done" as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled. Conversely, if you found that disabling it *did* improve performance, then it probably opened up a significant reliability hole.

- bill
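For context, and as an assumption on my part since the exact knob used hasn't been named in the thread: on recent Solaris/ZFS builds "turning off cache flush" usually refers to the zfs_nocacheflush tunable, while older setups sometimes disabled the ZIL outright with zil_disable, which is a much more drastic step. A sketch of the former:

  * in /etc/system (takes effect at next boot): stop ZFS from issuing
  * cache-flush (SYNCHRONIZE CACHE) requests - only safe when every
  * device in the pool sits behind genuinely non-volatile cache
  set zfs:zfs_nocacheflush = 1

Disabling the ZIL is not a safe substitute for this; it changes the synchronous-write semantics themselves rather than just the flush behaviour.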
Vincent Fox
2007-Dec-02 04:54 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Bill, you have a long-winded way of saying "I don't know". But thanks for elucidating the possibilities.
Anton B. Rang
2007-Dec-02 05:01 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.

The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the end of a transaction group (or ZIL write). It doesn't change the actual read or write commands (which are always sent as ordinary writes -- for the ZIL, I suspect that setting the FUA bit on writes rather than flushing the whole cache might provide better performance in some cases, but I'm not sure, since it probably depends what other I/O might be outstanding).

> Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers "Done" as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled.

The problem with SYNCHRONIZE CACHE is that its semantics aren't quite defined as precisely as one would want (until a fairly recent update). Some controllers interpret it as "push all data to disk" even if they have battery-backed NVRAM. In this case, you lose quite a lot of performance, and you gain only a modicum of reliability (at least in the case of larger RAID systems, which will generally use their battery to flush NVRAM to disk if power is lost).

There's a bit defined now that can be used to say "only flush volatile caches; it's OK if data is in non-volatile cache." But not many controllers support this yet, and Solaris didn't as of last year -- not sure if it's been added yet.

-- Anton
can you guess?
2007-Dec-02 05:04 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> Bill, you have a long-winded way of saying "I don't know". But thanks for elucidating the possibilities.

Hmmm - I didn't mean to be *quite* as noncommittal as that suggests: I was trying to say (without intending to offend) "FOR GOD'S SAKE, MAN: TURN IT BACK ON!", and explaining why (i.e., that either disabling it made no difference, and thus it might as well be enabled, or that if it did make a difference, that indicated it was very likely dangerous).

- bill
can you guess?
2007-Dec-02 05:18 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
> > That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.
>
> The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the end of a transaction group (or ZIL write). It doesn't change the actual read or write commands (which are always sent as ordinary writes -- for the ZIL, I suspect that setting the FUA bit on writes rather than flushing the whole cache might provide better performance in some cases, but I'm not sure, since it probably depends what other I/O might be outstanding.)

It's a bit difficult to imagine a situation where flushing the entire cache unnecessarily just to force the ZIL would be preferable - especially if ZFS makes any attempt to cluster small transaction groups together into larger aggregates (in which case you'd like to let them continue to accumulate until the aggregate is large enough to be worth forcing to disk in a single I/O).

> The problem with SYNCHRONIZE CACHE is that its semantics aren't quite defined as precisely as one would want (until a fairly recent update). Some controllers interpret it as "push all data to disk" even if they have battery-backed NVRAM.

That seems silly, given that in most other situations they consider data in NVRAM to be equivalent to data on the platter. But silly or not, if that's the way some arrays interpret the command, then it does have performance implications (and the other reply I just wrote would be unduly alarmist in such cases). Thanks for adding some actual experience with the hardware to what had been a purely theoretical discussion.

- bill
Al Hopper
2007-Dec-03 02:14 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
On Fri, 30 Nov 2007, Vincent Fox wrote:

... reformatted ...

> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy.
>
> My question is any guidance on best choice in CAM for stripe size in the LUNs?

[after reading the entire thread, where details of the storage-related application are presented piecemeal, and piecing together the details]

I can't give you an answer or a recommendation, because the question does not make sense IMHO. IOW, this is like saying: "I want to get from Dallas to LA as quickly as possible and have already decided that a bicycle would be the best mode of transport; can you tell me how I should configure the bicycle." The problem is that it's very unlikely that the bicycle is the correct solution, so recommending a bicycle config is likely to provide very bad advice... and also to validate the supposition that the bicycle-based solution is, indeed, the correct one.

> Default is 128K right now, can go up to 512K, should we go higher?
>
> Cyrus stores mail messages as many small files, not big mbox files. But there are so many layers in action here it's hard to know what is best choice.

[again based on reading the entire thread and not an answer to the above paragraph]

It appears that the chosen solution is to use a stripe of two hardware RAID-5 LUNs presented by a 2540 (please correct me if this is incorrect). There are several issues with this proposal:

a) You're mixing solutions: hardware RAID-5 and ZFS. Why? All this does is introduce needless complexity and make it very difficult to troubleshoot issues with the storage subsystem - especially if the issue is performance-related. Also - how do you localize a fault condition that is caused by a 2540 RAID firmware bug? How do you isolate performance issues caused by the interaction between the hardware RAID-5 LUNs and ZFS?

b) You've chosen a stripe - despite Richard Elling's best advice (something like "friends don't let friends use stripes"). See Richard's blogs for a comparison of the reliability rates of different storage configurations.

c) For a mail storage subsystem a stripe seems totally wrong. Generally speaking, email stores consist of many small files - with occasional medium-sized files (due to attachments) and, less commonly, some large files - usually limited by the max message size defined by the MTA (a typical value is 10Mb - what is it in your case?).

d) ZFS, with its built-in volume manager, relies on having direct access to individual disks (JBOD). Placing a hardware RAID engine between ZFS and the actual disks is a "black box" in terms of the ZFS volume manager - and it can't possibly "understand" how various storage providers' "black boxes" will behave... especially when ZFS tells the "disk" to do something and the hardware RAID LUN lies to ZFS (example: sync writes).

e) You've presented no data in terms of typical iostat -xcnz 5 output - generalized over various times of the day where particular user data access patterns are known. This information would allow us to give you some basic recommendations. IOW, we need to know the basic requirements in terms of IOPS and average I/O transfer sizes. BTW: Brendan Gregg's DTrace scripts will allow you to gather very detailed I/O usage data on the production system with no risk. (See the sketch appended below for a minimal starting point.)

f) You have not provided any details of the 2540 config - except for the fact that it is "fully loaded", IIRC. SAS disks? 10,000 RPM or 15k RPM drives? Disk drive size?

g) You've provided no details of how the host is configured. If you decide to deploy a ZFS-based system, the amount of installed RAM on the mailserver will have a *huge* impact on the actual load placed on the I/O subsystem. In this regard, ZFS is your friend, as it'll cache almost _everything_, given enough RAM. And DDR2 RAM is (arguably) less than $40 a gigabyte today - with 2Gb SIMMs having reached price parity with the equivalent pricing of 2 * 1Gb DIMMs. For example: if an end-user MUA is configured to poll the mailserver every 30 seconds to check if new mail has arrived, and the mailserver has sufficient (cache) memory, then only the first request will require disk access and a large number of subsequent requests will be handled out of (cache) memory.

h) Another observation: you've commented on the importance of system reliability because there are 10k users on the mailserver. Whether you have 10 users or 10k users or 100k users is of no importance if you are considering system reliability (aka failure rates). IOW, a system that is configured to a certain reliability requirement will be the same regardless of the number of end users that rely on it. The number of concurrent users is important only in terms of system performance and response time.

i) I don't know what the overall storage requirement is (someone said 1Tb, IIRC) and how this relates to the number/size of the available disk drives (in the 2540).

Observations:

1) Any striped config seems inherently wrong - given the available information.

2) Mixing RAID-5 LUNs (backend) with ZFS introduces unnecessary system complexity.

3) Designing a system when no requirements have been presented in terms of:

   i)   I/O access patterns
   ii)  IOPS (I/O ops per second)
   iii) required response time
   iv)  number of concurrent requests
   v)   application host config (CPUs/cores, RAM, I/O bus, disk ctrls)
   vi)  backup methodology and frequency
   vii) storage subsystem config

   ... is very unlikely to result in a correctly configured system that will meet the owner/operator's expectations.

Please don't frame this response as completely negative. That is not my intention - what I'm trying to do is present you with a list of questions that must be answered before a technically correct storage subsystem can be designed and implemented. IOW, before a storage subsystem can be correctly *engineered*. Also - please don't be discouraged by this response. If you are willing to fill in the blanks, I'm willing to help provide a meaningful recommendation.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
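A minimal, non-intrusive starting point for collecting the numbers asked for in (e), using stock Solaris tools on the existing mailservers during representative busy periods (the 5-second interval is just a convention, not a requirement):

  iostat -xcnz 5      # per-device IOPS, average transfer size (kr/s+kw/s vs r/s+w/s), service times, %busy
  fsstat zfs 5        # read/write operation mix as seen by ZFS
  vmstat 5            # confirm whether the box is memory- or CPU-bound rather than I/O-bound

DTrace (for example the scripts in Brendan Gregg's DTraceToolkit) can then break the I/O down further by process and size, but the commands above are usually enough to characterize IOPS and average I/O transfer sizes.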
Vincent Fox
2007-Dec-03 18:02 UTC
[zfs-discuss] Best stripe-size in array for ZFS mail storage?
Thanks for your observations. HOWEVER, I didn't pose the question "How do I architect the HA and storage and everything for an email system?" Our site, like many other data centers, has HA standards and politics and all this other baggage that may lead a design to a certain point. Thus our answer will be different from yours. You can poke holes in my designs, I can poke holes in yours; this could go on all day. Considering I am adding a new server to a group of existing servers of similar design, we are not going to make radical ground-up redesign decisions at this time. I can fiddle around in the margins with things like stripe size.

I will point out, AS I HAVE BEFORE, that ZFS is not yet completely enterprise-ready in our view. For example, in one commonly proposed, amateurish (IMO) scenario, we would have 2 big JBOD units and mirror the drives between arrays. This works fine if a drive fails or even if an array goes down. BUT you are then left with a storage pool which must be immediately serviced, or a single additional drive failure will destroy the pool. Or take a simple drive failure: which spare rolls in - the one from the same array, or one from the other? Seems a coin toss. When it's a terabyte of email and 10K+ users, that's a big deal for some people, and we did our HA design such that multiple failures can occur with no service impact.

The performance may not be ideal, and the design may not seem ELEGANT to everyone. Mixing HW controller RAID and ZFS mirroring is admittedly an odd hybrid design. Our answer works for us, and that is all that matters. So if someone has an idea of what stripe size will work best for us, that would be helpful. Thanks!