With ZFS on a Solaris server using storage on a SAN device, is it
reasonable to configure the storage device to present one LUN for each
RAID group?  I'm assuming that the SAN and storage device are
sufficiently reliable that no additional redundancy is necessary on the
Solaris ZFS server.  I'm also assuming that all disk management is done
on the storage device.

I realize that it is possible to configure more than one LUN per RAID
group on the storage device, but doesn't ZFS assume that each LUN
represents an independent disk, and schedule I/O accordingly?  In that
case, wouldn't ZFS I/O scheduling interfere with the I/O scheduling
already done by the storage device?

Is there any reason not to use one LUN per RAID group?

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
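To make the question concrete, the two layouts under consideration would
look roughly like this (the device names are placeholders, not real
LUNs):

    # Layout A: one LUN per RAID group (two RAID groups -> two top-level vdevs)
    zpool create tank c4t0d0 c4t1d0

    # Layout B: two LUNs carved from each of the same two RAID groups
    zpool create tank c4t0d0 c4t1d0 c4t2d0 c4t3d0

In both cases ZFS stripes across the top-level vdevs and treats each one
as an independent device, which is exactly the scheduling question above.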
On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:

> I realize that it is possible to configure more than one LUN per RAID
> group on the storage device, but doesn't ZFS assume that each LUN
> represents an independent disk, and schedule I/O accordingly?  In that
> case, wouldn't ZFS I/O scheduling interfere with I/O scheduling
> already done by the storage device?
>
> Is there any reason not to use one LUN per RAID group?

My empirical testing confirms both the claim that ZFS random read I/O
(at the very least) scales linearly with the NUMBER of vdevs and NOT the
number of spindles, and the recommendation (I believe from an Oracle
white paper on using ZFS for Oracle databases) that if you are using a
"hardware" RAID device (with NVRAM write cache), you should configure
one LUN per spindle in the backend RAID set.

In other words, if you build a zpool with one vdev of 10GB and another
with two vdevs each of 5GB (both coming from the same array and RAID
set), you get almost exactly twice the random read performance from the
2x5 zpool vs. the 1x10 zpool.

Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
ZFS with the number of vdevs and not "spindles".  I suggest performing
your own testing to ensure you have the performance to handle your
specific application load.

Now, as to reliability, the hardware RAID array cannot detect silent
corruption of data the way the end-to-end ZFS checksum can.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
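If you want to reproduce the comparison, the setup was conceptually like
this (the LUN names are placeholders; use whatever random-read load
generator you normally trust, e.g. filebench or vdbench):

    # Pool 1: a single 10GB LUN from the RAID set
    zpool create p1x10 c3t0d0

    # Pool 2: two 5GB LUNs carved from the same RAID set
    zpool create p2x5 c3t1d0 c3t2d0

    # Run the identical random-read workload against a test file on each
    # pool and compare IOPS; the two-vdev pool came out close to 2x the
    # single-vdev pool in my tests.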
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:

> On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:
> >
> > Is there any reason not to use one LUN per RAID group?
[...]
> In other words, if you build a zpool with one vdev of 10GB and another
> with two vdevs each of 5GB (both coming from the same array and RAID
> set), you get almost exactly twice the random read performance from the
> 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it simply
that you get twice as many outstanding I/O requests with two LUNs?  Is
it limited by the default I/O queue depth in ZFS?  After all, all of the
I/O requests must be handled by the same RAID group once they reach the
storage device.

> Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
> spares), you get substantially better random read performance using 10
> LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
> ZFS with the number of vdevs and not "spindles".

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
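If the queue depth is the limiting factor, that should be testable
directly.  For reference, the per-vdev limit in question is the
zfs_vdev_max_pending tunable; something along these lines would show it
(Solaris 10 / OpenSolaris era; exact syntax may vary by release):

    # Show the current per-vdev limit on outstanding I/Os
    echo "zfs_vdev_max_pending/D" | mdb -k

    # Raise it on a live system, e.g. to 35, to see whether a single
    # LUN then behaves more like the multi-LUN pool
    echo "zfs_vdev_max_pending/W0t35" | mdb -kw

    # Or set it persistently in /etc/system:
    #   set zfs:zfs_vdev_max_pending = 35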
On 2/14/2011 3:52 PM, Gary Mills wrote:

> On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:
>> On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:
>>> Is there any reason not to use one LUN per RAID group?
> [...]
>> In other words, if you build a zpool with one vdev of 10GB and
>> another with two vdevs each of 5GB (both coming from the same array
>> and RAID set), you get almost exactly twice the random read performance
>> from the 2x5 zpool vs. the 1x10 zpool.
>
> This finding is surprising to me.  How do you explain it?  Is it
> simply that you get twice as many outstanding I/O requests with two
> LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
> all, all of the I/O requests must be handled by the same RAID group
> once they reach the storage device.
>
>> Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
>> spares), you get substantially better random read performance using 10
>> LUNs vs. 1 LUN.  While inconvenient, this just reflects the scaling of
>> ZFS with the number of vdevs and not "spindles".

I'm going to go out on a limb here and say that you get the extra
performance under one condition: you don't overwhelm the NVRAM write
cache on the SAN device head.

So long as the SAN's NVRAM cache can acknowledge the write immediately
(i.e. it isn't full with pending commits to backing store), then, yes,
having multiple write commits coming from different ZFS vdevs will
obviously give more performance than a single ZFS vdev.

That said, given that SAN NVRAM caches are true write caches (and not a
ZIL-like thing), it should be relatively simple to swamp one with write
requests (most SANs have little more than 1GB of cache), at which point
the SAN will be blocking on flushing its cache to disk.

So, if you can arrange your workload to stay under the maximum write
throughput of the SAN's RAID array over a defined period, then, yes, go
with the multiple-LUNs-per-array setup.  In particular, I would think
this would be excellent for small-write, latency-sensitive applications,
where the total amount of data written (over several seconds) isn't
large but where latency is critical.  For larger I/O requests (or for
consistent, sustained I/O of more than small amounts), all bets are off
as far as any possible advantage of multiple LUNs per array.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
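The transition is usually easy to spot from the host side: while the
cache is absorbing writes the LUN service times stay sub-millisecond,
and once the array falls back to flushing they jump to spindle
latencies.  Standard Solaris iostat makes it visible (the interval here
is arbitrary):

    # Watch per-LUN service times; a sustained jump in asvc_t from <1ms
    # to tens of ms during heavy writes is the array cache saturating.
    iostat -xnz 5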
On 2/14/2011 10:37 PM, Erik Trimble wrote:

> That said, given that SAN NVRAM caches are true write caches (and not
> a ZIL-like thing), it should be relatively simple to swamp one with
> write requests (most SANs have little more than 1GB of cache), at
> which point the SAN will be blocking on flushing its cache to disk.

Actually, most array controllers now have 10s if not 100s of GB of
cache.  The 6780 has 32GB, and the DMX-4 has - if I remember correctly -
256GB.  The latest HDS box is probably close to that, if not more.

Of course you still have to flush to disk, and the cache flush
algorithms of the boxes themselves come into play, but 1GB was a long
time ago.
On 2/15/2011 1:37 PM, Torrey McMahon wrote:

> On 2/14/2011 10:37 PM, Erik Trimble wrote:
>> That said, given that SAN NVRAM caches are true write caches (and not
>> a ZIL-like thing), it should be relatively simple to swamp one with
>> write requests (most SANs have little more than 1GB of cache), at
>> which point the SAN will be blocking on flushing its cache to disk.
>
> Actually, most array controllers now have 10s if not 100s of GB of
> cache.  The 6780 has 32GB, and the DMX-4 has - if I remember correctly -
> 256GB.  The latest HDS box is probably close to that, if not more.
>
> Of course you still have to flush to disk, and the cache flush
> algorithms of the boxes themselves come into play, but 1GB was a long
> time ago.

The STK2540 and STK6140 have at most 1GB; the STK6180 has 4GB.  The move
to large caches is only recent - only large setups (i.e. big array
configurations with a dedicated SAN head) have had multi-GB NVRAM cache
for any length of time.  In particular, pretty much all base arrays
still have 4GB or less on the enclosure controller; only in the SAN
heads do you find big multi-GB caches.

And lots (I'm going to be brave and say the vast majority) of ZFS
deployments use direct-attach arrays or internal storage, rather than
large SAN configs.  Lots of places with older SAN heads are also going
to have much smaller caches.  Given the price tag of most large SANs,
I'm thinking that there are still huge numbers of 5+ year-old SANs out
there, and practically all of them have only a dozen or fewer GB of
cache.

So, yes, big modern SAN configurations have lots of cache.  But they're
also the ones most likely to be hammered with huge amounts of I/O from
multiple machines, all of which makes it relatively easy to blow through
the cache capacity and slow I/O back down to disk speed.  Once you get
back down to raw disk speed, having multiple LUNs per RAID array is
almost certainly going to perform worse than a single LUN, due to
thrashing.  That is, it would certainly be better (i.e. faster) for an
array to commit one 128k slab than four 32k slabs.

So, the original recommendation is interesting, but needs the caveat
that you'd really only use it if you can either limit the amount of
sustained I/O you have, or are using very-large-cache disk setups.  I
would think the idea might also apply (i.e. be useful) to something like
the F5100 or similar RAM/flash arrays.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA