With ZFS on a Solaris server using storage on a SAN device, is it
reasonable to configure the storage device to present one LUN for each
RAID group?  I'm assuming that the SAN and storage device are
sufficiently reliable that no additional redundancy is necessary on the
Solaris ZFS server.  I'm also assuming that all disk management is done
on the storage device.

I realize that it is possible to configure more than one LUN per RAID
group on the storage device, but doesn't ZFS assume that each LUN
represents an independent disk, and schedule I/O accordingly?  In that
case, wouldn't ZFS I/O scheduling interfere with the I/O scheduling
already done by the storage device?

Is there any reason not to use one LUN per RAID group?

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
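To make the question concrete, the two layouts under consideration would
look roughly like this (the device names are placeholders, not real
LUNs):

    # Layout A: one LUN per RAID group (two RAID groups -> two top-level vdevs)
    zpool create tank c4t0d0 c4t1d0

    # Layout B: two LUNs carved from each of the same two RAID groups
    zpool create tank c4t0d0 c4t1d0 c4t2d0 c4t3d0

In both cases ZFS stripes across the top-level vdevs and treats each one
as an independent device, which is exactly the scheduling question above.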
On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:

> I realize that it is possible to configure more than one LUN per RAID
> group on the storage device, but doesn't ZFS assume that each LUN
> represents an independent disk, and schedule I/O accordingly?  In that
> case, wouldn't ZFS I/O scheduling interfere with I/O scheduling
> already done by the storage device?
>
> Is there any reason not to use one LUN per RAID group?

My empirical testing confirms both the claim that ZFS random read I/O
(at the very least) scales linearly with the NUMBER of vdevs and NOT the
number of spindles, and the recommendation (I believe from an Oracle
white paper on using ZFS for Oracle databases) that if you are using a
"hardware" RAID device (with NVRAM write cache), you should configure
one LUN per spindle in the backend RAID set.

In other words, if you build a zpool with one vdev of 10GB and another
with two vdevs each of 5GB (both coming from the same array and RAID
set), you get almost exactly twice the random read performance from the
2x5 zpool vs. the 1x10 zpool.

Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
ZFS with the number of vdevs and not "spindles".  I suggest performing
your own testing to ensure you have the performance to handle your
specific application load.

Now, as to reliability, the hardware RAID array cannot detect silent
corruption of data the way the end-to-end ZFS checksum can.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
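If you want to reproduce the comparison, the setup was conceptually like
this (the LUN names are placeholders; use whatever random-read load
generator you normally trust, e.g. filebench or vdbench):

    # Pool 1: a single 10GB LUN from the RAID set
    zpool create p1x10 c3t0d0

    # Pool 2: two 5GB LUNs carved from the same RAID set
    zpool create p2x5 c3t1d0 c3t2d0

    # Run the identical random-read workload against a test file on each
    # pool and compare IOPS; the two-vdev pool came out close to 2x the
    # single-vdev pool in my tests.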
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:

> On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> >
> > Is there any reason not to use one LUN per RAID group?
[...]
> In other words, if you build a zpool with one vdev of 10GB and another
> with two vdevs each of 5GB (both coming from the same array and RAID
> set), you get almost exactly twice the random read performance from the
> 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it simply
that you get twice as many outstanding I/O requests with two LUNs?  Is
it limited by the default I/O queue depth in ZFS?  After all, all of the
I/O requests must be handled by the same RAID group once they reach the
storage device.

> Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
> spares), you get substantially better random read performance using 10
> LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
> ZFS with the number of vdevs and not "spindles".

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
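If the queue depth is the limiting factor, that should be testable
directly.  For reference, the per-vdev limit in question is the
zfs_vdev_max_pending tunable; something along these lines would show it
(Solaris 10 / OpenSolaris era; exact syntax may vary by release):

    # Show the current per-vdev limit on outstanding I/Os
    echo "zfs_vdev_max_pending/D" | mdb -k

    # Raise it on a live system, e.g. to 35, to see whether a single
    # LUN then behaves more like the multi-LUN pool
    echo "zfs_vdev_max_pending/W0t35" | mdb -kw

    # Or set it persistently in /etc/system:
    #   set zfs:zfs_vdev_max_pending = 35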
On 2/14/2011 3:52 PM, Gary Mills wrote:

> On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:
>> On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:
>>> Is there any reason not to use one LUN per RAID group?
> [...]
>> In other words, if you build a zpool with one vdev of 10GB and
>> another with two vdevs each of 5GB (both coming from the same array
>> and RAID set), you get almost exactly twice the random read performance
>> from the 2x5 zpool vs. the 1x10 zpool.
>
> This finding is surprising to me.  How do you explain it?  Is it
> simply that you get twice as many outstanding I/O requests with two
> LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
> all, all of the I/O requests must be handled by the same RAID group
> once they reach the storage device.
>
>> Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
>> spares), you get substantially better random read performance using 10
>> LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
>> ZFS with the number of vdevs and not "spindles".

I'm going to go out on a limb here and say that you get the extra
performance under one condition: you don't overwhelm the NVRAM write
cache on the SAN device head.

So long as the SAN's NVRAM cache can acknowledge the write immediately
(i.e. it isn't full with pending commits to backing store), then, yes,
having multiple write commits coming from different ZFS vdevs will
obviously give more performance than a single ZFS vdev.

That said, given that SAN NVRAM caches are true write caches (and not a
ZIL-like thing), it should be relatively simple to swamp one with write
requests (most SANs have little more than 1GB of cache), at which point
the SAN will be blocking on flushing its cache to disk.

So, if you can arrange your workload to stay under the maximum write
throughput of the SAN's RAID array over a defined period, then, yes, go
with the multiple-LUNs-per-array setup.  In particular, I would think
this would be excellent for small-write, latency-sensitive applications,
where the total amount of data written (over several seconds) isn't
large but where latency is critical.  For larger I/O requests (or for
consistent, sustained I/O of more than small amounts), all bets are off
as far as any possible advantage of multiple LUNs per array.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
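The transition is usually easy to spot from the host side: while the
cache is absorbing writes the LUN service times stay sub-millisecond,
and once the array falls back to flushing they jump to spindle
latencies.  Standard Solaris iostat makes it visible (the interval here
is arbitrary):

    # Watch per-LUN service times; a sustained jump in asvc_t from <1ms
    # to tens of ms during heavy writes is the array cache saturating.
    iostat -xnz 5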
On 2/14/2011 10:37 PM, Erik Trimble wrote:

> That said, given that SAN NVRAM caches are true write caches (and not
> a ZIL-like thing), it should be relatively simple to swamp one with
> write requests (most SANs have little more than 1GB of cache), at
> which point the SAN will be blocking on flushing its cache to disk.

Actually, most array controllers now have 10s if not 100s of GB of
cache.  The 6780 has 32GB, and the DMX-4 has - if I remember correctly -
256GB.  The latest HDS box is probably close to that, if not more.

Of course you still have to flush to disk, and the cache flush
algorithms of the boxes themselves come into play, but 1GB was a long
time ago.
On 2/15/2011 1:37 PM, Torrey McMahon wrote:

> On 2/14/2011 10:37 PM, Erik Trimble wrote:
>> That said, given that SAN NVRAM caches are true write caches (and not
>> a ZIL-like thing), it should be relatively simple to swamp one with
>> write requests (most SANs have little more than 1GB of cache), at
>> which point the SAN will be blocking on flushing its cache to disk.
>
> Actually, most array controllers now have 10s if not 100s of GB of
> cache.  The 6780 has 32GB, and the DMX-4 has - if I remember correctly -
> 256GB.  The latest HDS box is probably close to that, if not more.
>
> Of course you still have to flush to disk, and the cache flush
> algorithms of the boxes themselves come into play, but 1GB was a long
> time ago.

The STK2540 and STK6140 have at most 1GB; the STK6180 has 4GB.  The move
to large caches is only recent - only large setups (i.e. big array
configurations with a dedicated SAN head) have had multi-GB NVRAM cache
for any length of time.  In particular, pretty much all base arrays
still have 4GB or less on the enclosure controller; only in the SAN
heads do you find big multi-GB caches.

And lots (I'm going to be brave and say the vast majority) of ZFS
deployments use direct-attach arrays or internal storage, rather than
large SAN configs.  Lots of places with older SAN heads are also going
to have much smaller caches.  Given the price tag of most large SANs,
I'm thinking that there are still huge numbers of 5+ year-old SANs out
there, and practically all of them have only a dozen or fewer GB of
cache.

So, yes, big modern SAN configurations have lots of cache.  But they're
also the ones most likely to be hammered with huge amounts of I/O from
multiple machines, all of which makes it relatively easy to blow through
the cache capacity and slow I/O back down to disk speed.  Once you get
back down to raw disk speed, having multiple LUNs per RAID array is
almost certainly going to perform worse than a single LUN, due to
thrashing.  That is, it would certainly be better (i.e. faster) for an
array to commit one 128k slab than four 32k slabs.

So, the original recommendation is interesting, but needs the caveat
that you'd really only use it if you can either limit the amount of
sustained I/O you have, or are using very-large-cache disk setups.  I
would think the idea might also apply (i.e. be useful) to something like
the F5100 or similar RAM/flash arrays.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA