Hi all,

My company will be acquiring the Sun SE6920 for our storage virtualization project, and we intend to use quite a bit of ZFS as well. The two technologies seem somewhat at odds, since the 6920 means layers of hardware abstraction while ZFS seems to prefer more direct access to disk.

I tried to search around but couldn't find any performance numbers for ZFS on the SE6920, nor any recommendations on where to start or what the considerations might be. I'd appreciate any hints in this area.

Thanks.

--
Just me,
Wire ...
Wee Yeh Tan wrote:
> My company will be acquiring the Sun SE6920 for our storage
> virtualization project, and we intend to use quite a bit of ZFS as
> well. The two technologies seem somewhat at odds, since the 6920 means
> layers of hardware abstraction while ZFS seems to prefer more direct
> access to disk.

Yes, and then again, no. What you have with the SE6920 is a rack which provides you with hardware redundancy and a whopping great cache. For my money, as long as you configure multiple paths from your attached hosts and ensure that each LUN the SE6920 presents has at least two paths to your ZFS host, you should be just fine.

> I tried to search around but couldn't find any performance numbers for
> ZFS on the SE6920, nor any recommendations on where to start or what the
> considerations might be. I'd appreciate any hints in this area.

Ah, well, that really depends on what your target usage mode is and what you really want to achieve. With a hardware RAID array you've got a lot of knobs to tweak (so to speak), and you can also decide between mirrors and raidz/raidz2 with ZFS. I suggest that you contact Roch Bourbonnais or Richard Elling, since this is really their area. (They're both @Sun.COM, btw.)

best regards,
James C. McPherson
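A minimal sketch of the multipath-plus-ZFS setup James describes, assuming MPxIO is used to collapse the two paths per LUN into a single device node (the pool name and device names below are invented examples, not taken from the thread):

    # Enable Solaris I/O multipathing so each 6920 LUN appears once
    # under /dev/dsk even though it has two physical paths.
    stmsboot -e        # prompts for the reboot it needs

    # Build the pool on the multipathed LUNs; ZFS sees one device per
    # LUN and MPxIO handles path failover underneath.
    zpool create tank \
        mirror c6t60003BA27D2120d0 c6t60003BA27D2121d0 \
        mirror c6t60003BA27D2122d0 c6t60003BA27D2123d0

    # Verify the pool layout and its health.
    zpool status tank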
On Wednesday 16 August 2006 11:55, Wee Yeh Tan wrote:
> Hi all,
>
> My company will be acquiring the Sun SE6920 for our storage
> virtualization project, and we intend to use quite a bit of ZFS as
> well. The two technologies seem somewhat at odds, since the 6920 means
> layers of hardware abstraction while ZFS seems to prefer more direct
> access to disk.

I suggest reading the Best Practices for the Sun StorEdge 6920 System - 819-0122-10 (v3.0):
http://docs.sun.com/app/docs?q=819-0122-10

Regards,
Jerome

> I tried to search around but couldn't find any performance numbers for
> ZFS on the SE6920, nor any recommendations on where to start or what the
> considerations might be. I'd appreciate any hints in this area.
>
> Thanks.

--
Jerome Haynes-Smith
Sun PTS Storage EMEA
Wee Yeh Tan wrote:
> Hi all,
>
> My company will be acquiring the Sun SE6920 for our storage
> virtualization project, and we intend to use quite a bit of ZFS as
> well. The two technologies seem somewhat at odds, since the 6920 means
> layers of hardware abstraction while ZFS seems to prefer more direct
> access to disk.

Not at odds. I would say overlapping in some areas and, in some cases, complementary, but not at odds unless misconfigured.

First, ZFS prefers access to LUNs. In many cases it prefers access to lots of LUNs, for file system integrity as well as performance. The composition of those LUNs can be just about anything, as long as ZFS can read and write blocks in a sane manner. Sure, the simpler the config the better for overall manageability, but in many cases storage configurations increase in complexity as they increase in functionality.

Second, as you noted above, the 6920 lets you, among other things, virtualize the underlying storage. It lets you do this for multiple hosts and OSes at the block level. Today, ZFS is a Solaris 10 single-host filesystem. Placing ZFS on the LUNs exported from a 6920 to your Solaris 10 hosts is one option you should definitely look into. Unfortunately, you might not be able to use it for all of the hosts you intend to hook up to the 6920.

Third, you need to look at the overall system configuration and architecture when making storage decisions. Does it make sense to use RAID-Z on the host on top of a RAID-1 volume from the 6920 that sits on a RAID-5 volume in the T3B you have in a rack across the room? Probably not. (In fact, it never does, but I digress....) Does it make sense to use a base RAID level within the 6920 that exports LUNs to all of your data center, perhaps with the assumption that you'll then use a simple ZFS configuration on top of it for the hosts that run ZFS? Again, it depends on the config and what you're trying to accomplish overall, as well as the applications and host specifics.
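As a rough illustration of the two sane ends of that spectrum -- a sketch with invented pool and LUN names, not a recommendation from the thread:

    # Option A: the 6920 volumes are already RAID-protected, so the
    # pool is a plain stripe of those LUNs (no parity stacked on parity;
    # ZFS still detects corruption but cannot self-heal it).
    zpool create datapool \
        c4t60003BA000A1d0 c4t60003BA000A2d0 c4t60003BA000A3d0

    # Option B: the 6920 exports unprotected volumes and ZFS supplies
    # the one level of redundancy by mirroring across them.
    zpool create datapool \
        mirror c4t60003BA000B1d0 c5t60003BA000B2d0 \
        mirror c4t60003BA000B3d0 c5t60003BA000B4d0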
WYT said:
> Hi all,
>
> My company will be acquiring the Sun SE6920 for our storage
> virtualization project, and we intend to use quite a bit of ZFS as
> well. The two technologies seem somewhat at odds, since the 6920 means
> layers of hardware abstraction while ZFS seems to prefer more direct
> access to disk.
>
> I tried to search around but couldn't find any performance numbers for
> ZFS on the SE6920, nor any recommendations on where to start or what the
> considerations might be. I'd appreciate any hints in this area.
>
> Thanks.
> --
> Just me,
> Wire ...

My general principles are:

  If you can, to improve your 'Availability' metrics,
  let ZFS handle one level of redundancy.

  For random-read performance, prefer mirrors over
  raid-z. If you use raid-z, group together a smallish
  number of volumes.

  Set up volumes that correspond to a small number of
  drives (the smallest you can bear) with a volume
  interlace that is in the [1M-4M] range.

And next, a very, very important thing that we will have to pursue with storage manufacturers, including ourselves:

  In cases where the storage cache is to be considered
  "stable storage" in the face of power failure, we
  have to be able to configure the storage to ignore
  the "flush write cache" commands that ZFS issues.

Some storage does ignore the flush out of the box; other storage doesn't. It should be easy to verify the latency of a small O_DSYNC write. On a quiet system, I expect sub-millisecond response. 5ms to a battery-protected cache should be red-flagged.

This was just filed to track the issue:

  6460889 zil shouldn't send write-cache-flush command to <some> devices

Note also that S10U2 has already been greatly improved performance-wise; tracking releases is very important.

-r

____________________________________________________________________________________
Performance, Availability & Architecture Engineering

Roch Bourbonnais                 Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst       180, Avenue De L'Europe, 38330,
                                 Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon  http://blogs.sun.com/roller/page/roch
Roch.Bourbonnais at Sun.Com       (+33).4.76.18.83.20
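One rough way to verify the small-O_DSYNC-write latency Roch mentions above, sketched as a DTrace one-liner (not from the thread; run it on an otherwise quiet system while a small synchronous writer targets the pool, then interrupt it with Ctrl-C to print the histogram):

    # Histogram of write(2) latency while the O_DSYNC test writer runs.
    # Sub-millisecond buckets suggest the array is acknowledging from
    # cache; ~5ms or worse to battery-backed cache deserves a red flag.
    dtrace -n '
        syscall::write:entry  { self->ts = timestamp; }
        syscall::write:return /self->ts/ {
            @lat["write(2) latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
        }'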
Hello Roch,

Thursday, August 17, 2006, 11:08:37 AM, you wrote:

R> My general principles are:
R>
R>   If you can, to improve your 'Availability' metrics,
R>   let ZFS handle one level of redundancy.
R>
R>   For random-read performance, prefer mirrors over
R>   raid-z. If you use raid-z, group together a smallish
R>   number of volumes.
R>
R>   Set up volumes that correspond to a small number of
R>   drives (the smallest you can bear) with a volume
R>   interlace that is in the [1M-4M] range.

Why that big an interlace? With lots of small reads it could actually introduce large overhead, right? I can understand something like 960KB, but 4M?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert Milkowski writes:
> Hello Roch,
>
> Thursday, August 17, 2006, 11:08:37 AM, you wrote:
>
> R> My general principles are:
> R>
> R>   If you can, to improve your 'Availability' metrics,
> R>   let ZFS handle one level of redundancy.
> R>
> R>   For random-read performance, prefer mirrors over
> R>   raid-z. If you use raid-z, group together a smallish
> R>   number of volumes.
> R>
> R>   Set up volumes that correspond to a small number of
> R>   drives (the smallest you can bear) with a volume
> R>   interlace that is in the [1M-4M] range.
>
> Why that big an interlace? With lots of small reads it could actually
> introduce large overhead, right? I can understand something like
> 960KB, but 4M?

I also think we should be fine with 1M. I'm not sure what overhead we're talking about here -- did you mean large skew? During a pool sync, at least one of interest, we expect to have lots of data to sync; even if it's just 1GB, a 4M interlace still spreads it across 256 disks.

-r
Thanks to all who have responded. I spent two weekends working through the best practices that Jerome recommended -- it's quite a mouthful.

On 8/17/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> My general principles are:
>
>   If you can, to improve your 'Availability' metrics,
>   let ZFS handle one level of redundancy.

Cool. This is a good way to take advantage of the error-detection/correction features in ZFS. We will definitely take this suggestion!

>   For random-read performance, prefer mirrors over
>   raid-z. If you use raid-z, group together a smallish
>   number of volumes.
>
>   Set up volumes that correspond to a small number of
>   drives (the smallest you can bear) with a volume
>   interlace that is in the [1M-4M] range.

I have a hard time picturing this with respect to the 6920 storage pool. The internal disks in the 6920 present up to 2 VDs per array (6-7 disks each?). The storage pool will be built from a bunch of these VDs and may be further partitioned into several volumes, with each volume presented to a ZFS host. What should the storage profile look like? I can probably do a stripe profile, since I can leave the redundancy to ZFS.

To complicate matters, we are likely going to attach all our 3510s to the 6920 and use some of these for the ZFS volumes, so further restrictions may apply. Are we better off doing a direct attach?

> And next, a very, very important thing that we will have to pursue
> with storage manufacturers, including ourselves:
>
>   In cases where the storage cache is to be considered
>   "stable storage" in the face of power failure, we
>   have to be able to configure the storage to ignore
>   the "flush write cache" commands that ZFS issues.
>
> Some storage does ignore the flush out of the box; other storage
> doesn't. It should be easy to verify the latency of a small O_DSYNC
> write. On a quiet system, I expect sub-millisecond response. 5ms to a
> battery-protected cache should be red-flagged.
>
> This was just filed to track the issue:
>   6460889 zil shouldn't send write-cache-flush command to <some> devices

Noted.

> Note also that S10U2 has already been greatly improved
> performance-wise; tracking releases is very important.
>
> -r

--
Just me,
Wire ...
Hello Wee,

Saturday, August 26, 2006, 6:43:05 PM, you wrote:

WYT> Thanks to all who have responded. I spent two weekends working through
WYT> the best practices that Jerome recommended -- it's quite a mouthful.
WYT>
WYT> On 8/17/06, Roch <Roch.Bourbonnais at sun.com> wrote:
>> My general principles are:
>>
>>   If you can, to improve your 'Availability' metrics,
>>   let ZFS handle one level of redundancy.
WYT>
WYT> Cool. This is a good way to take advantage of the
WYT> error-detection/correction features in ZFS. We will definitely take
WYT> this suggestion!
WYT>
>>   For random-read performance, prefer mirrors over
>>   raid-z. If you use raid-z, group together a smallish
>>   number of volumes.
>>
>>   Set up volumes that correspond to a small number of
>>   drives (the smallest you can bear) with a volume
>>   interlace that is in the [1M-4M] range.
WYT>
WYT> I have a hard time picturing this with respect to the 6920 storage pool.
WYT> The internal disks in the 6920 present up to 2 VDs per array (6-7 disks
WYT> each?). The storage pool will be built from a bunch of these VDs and
WYT> may be further partitioned into several volumes, with each volume
WYT> presented to a ZFS host. What should the storage profile look like?
WYT> I can probably do a stripe profile, since I can leave the redundancy to
WYT> ZFS.

IMHO, if you have a VD, make just one partition and present it as a LUN to ZFS. Do not present several partitions from the same disks to ZFS as different LUNs.

WYT> To complicate matters, we are likely going to attach all our 3510s to
WYT> the 6920 and use some of these for the ZFS volumes, so further
WYT> restrictions may apply. Are we better off doing a direct attach?

You can attach 3510 JBODs (I guess) directly -- but currently there are restrictions: only one host and no MPxIO. If that's OK, it looks like you'll get better performance than going with a 3510 head unit.

ps. I did try with MPxIO and two hosts connected, with several JBODs -- and I did see FC loop logout/login, etc.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
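A sketch of how Robert's one-LUN-per-VD advice might look from the ZFS side (device names are invented; the raid-z width just follows Roch's earlier "smallish group" suggestion):

    # One LUN per 6920 virtual disk, so no two pool devices share
    # spindles; ZFS then builds its redundancy across whole VDs.
    zpool create tank \
        raidz c6t60020F20000C1d0 c6t60020F20000C2d0 \
              c6t60020F20000C3d0 c6t60020F20000C4d0 c6t60020F20000C5d0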
On 8/28/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Saturday, August 26, 2006, 6:43:05 PM, you wrote:
>
> WYT> I have a hard time picturing this with respect to the 6920 storage pool.
> WYT> The internal disks in the 6920 present up to 2 VDs per array (6-7 disks
> WYT> each?). The storage pool will be built from a bunch of these VDs and
> WYT> may be further partitioned into several volumes, with each volume
> WYT> presented to a ZFS host. What should the storage profile look like?
> WYT> I can probably do a stripe profile, since I can leave the redundancy to
> WYT> ZFS.
>
> IMHO, if you have a VD, make just one partition and present it as a LUN to
> ZFS. Do not present several partitions from the same disks to ZFS as
> different LUNs.

I'm a real newbie here, as you can probably tell, and this is one of the aspects I'm struggling with. The challenge seems to be putting in more spindles without increasing the volume stripe size. ZFS on simple disks manages itself nicely.

When constructing the VDs from the 3510, we will likely stripe across 2-3 disks. For the virtualization strategy on the 6920, we will probably go with concat; I don't imagine that striping here will go well with ZFS. I just have to be careful not to present volumes from the same VDs. Another alternative would be to present the VDs directly, without virtualization. Cost is the primary concern for this project.

> WYT> To complicate matters, we are likely going to attach all our 3510s to
> WYT> the 6920 and use some of these for the ZFS volumes, so further
> WYT> restrictions may apply. Are we better off doing a direct attach?
>
> You can attach 3510 JBODs (I guess) directly -- but currently there are
> restrictions: only one host and no MPxIO. If that's OK, it looks like
> you'll get better performance than going with a 3510 head unit.
>
> ps. I did try with MPxIO and two hosts connected, with several JBODs --
> and I did see FC loop logout/login, etc.

I saw your benchmarking results. Great work there.

--
Just me,
Wire ...