Hello,

I have a Dell 2950 with a Perc 5/i and two 300GB 15K SAS drives in a RAID0 array. I am considering moving to ZFS and would like feedback on which configuration would yield the highest performance: using the Perc 5/i to provide a hardware RAID0 that is presented as a single volume to OpenSolaris, or using the drives separately and creating the RAID0 with OpenSolaris and ZFS? Or maybe just adding the hardware RAID0 to a ZFS pool? Can anyone suggest articles or FAQs on implementing ZFS RAID?

Which configuration would provide the highest read and write throughput?

Thanks in advance
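For reference, a rough sketch of the two layouts in question (the c0t0d0/c0t1d0 device names here are placeholders for whatever the Perc actually presents, not the real ones):

Option 1 - hardware RAID0, single LUN handed to ZFS:
#zpool create tank c0t0d0

Option 2 - ZFS stripe across the two bare disks:
#zpool create tank c0t0d0 c0t1d0

Neither layout has any redundancy, so losing one drive loses the whole pool in both cases.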
Gregory Perry wrote:
> I have a Dell 2950 with a Perc 5/i and two 300GB 15K SAS drives in a RAID0 array. [...] Which configuration would provide the highest read and write throughput?

I'm not sure which will perform the best, but giving ZFS the job of doing your redundancy (which, with RAID0, it sounds like you're not planning to do) would be better than having the HW do it (if you have enough equipment, having both do it is OK).

That said, there was a recent discussion on this list about an IBM HW RAID controller with battery-backed cache, and the net result seemed to be that, with the cache, making each drive into a single-drive RAID0 LUN on the HW RAID (to gain the use of the write cache) and then letting ZFS combine the disks in whatever RAID manner you want would give performance benefits, especially if serving the filesystems over NFS is your goal.

-Kyle
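A rough sketch of what Kyle describes, assuming each physical disk has already been exported from the controller as its own single-drive RAID0 LUN (device names are again placeholders):

#zpool create tank mirror c0t0d0 c0t1d0

The battery-backed controller cache still accelerates writes to each LUN, while ZFS owns the redundancy and can repair a bad block from the other side of the mirror.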
I package up 5 or 6 disks into a RAID-5 LUN on our Sun 3510 and 2540 arrays, then use ZFS to RAID-10 these volumes. Safety first! Quite frankly I've had ENOUGH of rebuilding trashed filesystems. I am tired of chasing performance like it's the Holy Grail and shoving other considerations aside. A ZFS mirror pair can know when there's a bad block and get it from the other side, which is something you will not get with HW RAID.

When Sun starts selling good SAS JBOD boxes equipped with appropriate redundancies and a flash drive or two for the ZIL, I will definitely go that route. For now I have a bunch of existing Sun HW RAID arrays, so I use them mainly to package LUNs and to make sure that assigned hot spares are used to rebuild the LUNs when needed.
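As an illustration, that layout would look something like the following, where each device is a RAID-5 LUN exported by a 3510 or 2540 (the device names are made up):

#zpool create safepool mirror c4t0d0 c5t0d0 mirror c4t1d0 c5t1d0

ZFS then stripes across the mirrored pairs (the "RAID-10" part), and each half of every mirror is itself a RAID-5 LUN on the array.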
Vincent Fox wrote:
> When Sun starts selling good SAS JBOD boxes equipped with appropriate redundancies and a flash drive or two for the ZIL, I will definitely go that route. For now I have a bunch of existing Sun HW RAID arrays, so I use them mainly to package LUNs and to make sure that assigned hot spares are used to rebuild the LUNs when needed.

Why not a RAID box with enough battery-backed RAM, and one that supports enough LUNs that you can make one RAID0 LUN for every drive, and so gain the benefit of the battery-backed write cache on every write, not just the ZIL writes? Is the benefit of a fast ZIL enough? Is it so close that the rest of what I describe is a waste? Granted, JBOD plus a flash ZIL might be cheaper. Not having been able to do any testing of my own yet, I'm still struggling to understand the likely performance differences between the two approaches (or a third: the RAID0 idea above plus the flash ZIL).

-Kyle
So the point is, a JBOD with a flash drive in one of the slots (or two, to mirror the ZIL) would be a lot SIMPLER.

We've all spent the last decade or two offloading functions into specialized hardware, and that has turned into these massive, unnecessarily complex things.

I don't want to go to a new training class every time we buy a new model of storage unit. I don't want to have to set up a new server on my private network to run the Java GUI management software for that array and all the other BS that array vendors put us through.

I just want storage.
Vincent Fox wrote:
> So the point is, a JBOD with a flash drive in one of the slots (or two, to mirror the ZIL) would be a lot SIMPLER.

I guess a USB pendrive would be slower than a hard disk. Bad performance for the ZIL.

-- Jesus Cea Avion (jcea at argo.es, http://www.argo.es/~jcea/)
Vincent Fox wrote:
> So the point is, a JBOD with a flash drive in one of the slots (or two, to mirror the ZIL) would be a lot SIMPLER.
> [...]
> I just want storage.

Good point.

-Kyle
Kyle McDonald wrote:
> Vincent Fox wrote:
>> I just want storage.
>
> Good point.

You still need interfaces of some kind to manage the device. Temp sensors? Drive FRU information? All that information has to go out, and some in, over an interface of some sort.
tmcmahon2 at yahoo.com said:
> You still need interfaces of some kind to manage the device. Temp sensors? Drive FRU information? All that information has to go out, and some in, over an interface of some sort.

Looks like the Sun 2530 array recently added in-band management over the SAS (data) interface. Maybe the SAS/SATA JBOD products will be able to do the same.

So, how long before we can buy an NVRAM cache card or SSD that we can put in a Thumper to make NFS go fast? That, plus the Solaris 10 bits to use them, are what we're in need of.

Regards,
Marion
> Vincent Fox wrote:
>> So the point is, a JBOD with a flash drive in one of the slots (or two, to mirror the ZIL) would be a lot SIMPLER.
>
> I guess a USB pendrive would be slower than a hard disk. Bad performance for the ZIL.

Does anyone have any data on this?
John-Paul Drawneek wrote:
>> I guess a USB pendrive would be slower than a hard disk. Bad performance for the ZIL.
>
> Does anyone have any data on this?

+1. Inquiring minds want to know. :)

-Kyle
John-Paul Drawneek wrote:
> I guess a USB pendrive would be slower than a hard disk. Bad performance for the ZIL.

A "decent" pendrive of mine writes at 3-5 MB/s. Sure, there are faster ones, but any desktop hard disk can write at 50 MB/s. If you are *not* talking about consumer-grade pendrives, I can't comment.

-- Jesus Cea Avion (jcea at argo.es, http://www.argo.es/~jcea/)
Much of the complexity in hardware RAID is in the fault detection, isolation, and management. The fun part is trying to architect a fault-tolerant system when the suppliers of the components cannot come close to enumerating most of the possible failure modes.

What happens when a drive's performance slows down because it is having to go through internal retries more than the others? What layer gets to declare a drive dead? What happens when you start declaring drives dead one by one because they all seemed to stop responding, but the problem is not really the drives?

Hardware RAID systems attempt to deal with problems that are not always straightforward. Hopefully we will eventually get similar functionality in Solaris.

Understand that I am a proponent of ZFS, but everything has its use.

-Joel
With my (COTS) LSI 1068- and 1078-based controllers I get consistently better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0). I even went through all the loops and hoops with 6120s, 6130s and even some SGI storage, and the result was always the same: better performance exporting single disks than even the "ZFS" profiles within CAM.

'pool0':
#zpool create pool0 mirror c2t0d0 c2t1d0
#zpool add pool0 mirror c2t2d0 c2t3d0
#zpool add pool0 mirror c2t4d0 c2t5d0
#zpool add pool0 mirror c2t6d0 c2t7d0

'pool2':
#zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0
#zpool add pool2 raidz c3t12d0 c3t13d0 c3t14d0 c3t15d0

I have really learned not to do it this way with raidz and raidz2:

#zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0

So when is Thumper going to have an all-SAS option? :)

-Andy

On Feb 7, 2008, at 2:28 PM, Joel Miller wrote:
> Much of the complexity in hardware RAID is in the fault detection, isolation, and management. [...]
> You just made a great case for doing it in software.
>
> --Toby

Actually, the point is that there are situations in which the typical software stack will make the wrong decision, because it has no concept of the underlying hardware and no fault-management structure that factors in more than just a single failed I/O at a time.

Hardware RAID controllers are, for the most part, just an embedded system. The firmware is software; yes, there might be a few ASICs or FPGAs to help out with XOR calculations, but most use CPUs that have XOR engines nowadays. The difference is that the firmware is designed with intimate knowledge of how things are connected, silicon bugs, etc.

Running a Solaris box with ZFS to JBODs and exporting a filesystem to clients is approaching the same structure, with the exception that the Solaris box is usually a general-purpose box instead of an embedded controller. Once you add a second Solaris box and cluster them, you start rebuilding a redundant RAID array, but still without the level of fault handling that is put into the typical hardware array.

I am sure this will be done at some point with the blade servers; it only makes sense.

-Joel
Andy Lubel wrote:
> With my (COTS) LSI 1068- and 1078-based controllers I get consistently better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0).

Is that really 'all disks as JBOD', or is it 'each disk as a single-drive RAID0'? It may not sound different on the surface, but I asked in another thread, and others confirmed, that if your RAID card has a battery-backed cache, giving ZFS many single-drive RAID0s is much better than JBOD (using the 'nocacheflush' option may improve it even more).

My understanding is that it's kind of the best of both worlds. You get the higher number of spindles and vdevs for ZFS to manage, ZFS gets to do the redundancy, and the HW RAID cache gives virtually instant acknowledgement of writes, so that ZFS can be on its way.

So I think many RAID0s is not always the same as JBOD. That's not to say that even true JBOD doesn't still have an advantage over HW RAID; I don't know that for sure. But I think there is a use for HW RAID in ZFS configs, which wasn't always the theory I've heard.

> I have really learned not to do it this way with raidz and raidz2:
>
> #zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0

Why? I know to avoid creating raidz's with more than 9-12 devices, but that doesn't cross that threshold. Is there a reason you'd split 8 disks up into 2 groups of 4? What experience led you to this? (Just so I don't have to repeat it. ;) )

-Kyle
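For anyone wanting to try the 'nocacheflush' option mentioned above, a minimal sketch: on Solaris/OpenSolaris it is an /etc/system tunable, and it is only appropriate when every vdev sits behind non-volatile (battery-backed) write cache, since it stops ZFS from asking the devices to flush their caches. Check that your build actually has the tunable before relying on it.

* add to /etc/system and reboot; only safe with non-volatile write cache
set zfs:zfs_nocacheflush = 1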
>> With my (COTS) LSI 1068- and 1078-based controllers I get consistently better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0).
>
> Is that really 'all disks as JBOD', or is it 'each disk as a single-drive RAID0'?

Single-disk RAID0: ./MegaCli -CfgEachDskRaid0 Direct -a0

> It may not sound different on the surface, but I asked in another thread, and others confirmed, that if your RAID card has a battery-backed cache, giving ZFS many single-drive RAID0s is much better than JBOD (using the 'nocacheflush' option may improve it even more).
>
> My understanding is that it's kind of the best of both worlds. You get the higher number of spindles and vdevs for ZFS to manage, ZFS gets to do the redundancy, and the HW RAID cache gives virtually instant acknowledgement of writes, so that ZFS can be on its way.
>
> So I think many RAID0s is not always the same as JBOD. That's not to say that even true JBOD doesn't still have an advantage over HW RAID; I don't know that for sure.

I have tried mixing hardware and ZFS RAID, but from a performance or redundancy standpoint it just doesn't make sense to add those layers of complexity. In this case I'm building nearline, so there isn't even a battery attached, and I have disabled any caching on the controller. I have a Sun SAS HBA on the way, which is what I would ultimately use for my JBOD attachment.

> But I think there is a use for HW RAID in ZFS configs, which wasn't always the theory I've heard.
>
>> I have really learned not to do it this way with raidz and raidz2:
>>
>> #zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0
>
> Why? I know to avoid creating raidz's with more than 9-12 devices, but that doesn't cross that threshold. Is there a reason you'd split 8 disks up into 2 groups of 4? What experience led you to this?

I don't know why, but with most setups I have tested (8- and 16-drive configs), dividing raidz into 4 disks per vdev (and 5 per vdev for raidz2) performs better. Take a look at my simple dd test (filebench results as soon as I can figure out how to get it working properly with Solaris 10).
==== 8 SATA 500GB disk system with LSI 1068 (MegaRAID 8888ELP) - no BBU ====

--------- single 8-disk raidz ---------

bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.16:38:13 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

bash-3.00# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   117K  3.10T  42.6K  /pool0-raidz

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192 count=131072; time sync
131072+0 records in
131072+0 records out
real 0m1.768s   user 0m0.080s   sys 0m1.688s    (dd)
real 0m3.495s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m6.994s   user 0m0.097s   sys 0m2.827s    (dd)
real 0m1.043s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192 count=655360; time sync
655360+0 records in
655360+0 records out
real 0m24.064s  user 0m0.402s   sys 0m8.974s    (dd)
real 0m1.629s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/pool0-raidz/rw-test.lo1 bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m40.542s  user 0m0.476s   sys 0m16.077s   (dd)
real 0m0.617s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/dev/null bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m3.443s   user 0m0.084s   sys 0m1.327s    (dd)
real 0m0.013s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/dev/null bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m15.972s  user 0m0.413s   sys 0m6.589s    (dd)
real 0m0.013s   user 0m0.001s   sys 0m0.012s    (sync)

--------- two 4-disk raidz vdevs ---------

bash-3.00# zpool history
History for 'pool0-raidz':
2008-02-11.17:02:16 zpool create pool0-raidz raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
2008-02-11.17:02:51 zpool add pool0-raidz raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0

bash-3.00# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
pool0-raidz   110K  2.67T  36.7K  /pool0-raidz

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo0 bs=8192 count=131072; time sync
131072+0 records in
131072+0 records out
real 0m1.835s   user 0m0.079s   sys 0m1.687s    (dd)
real 0m2.521s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/pool0-raidz/rw-test.lo0 bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m2.376s   user 0m0.084s   sys 0m2.291s    (dd)
real 0m2.578s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/dev/zero of=/pool0-raidz/w-test.lo1 bs=8192 count=655360; time sync
655360+0 records in
655360+0 records out
real 0m19.531s  user 0m0.404s   sys 0m8.731s    (dd)
real 0m2.255s   user 0m0.001s   sys 0m0.013s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/pool0-raidz/rw-test.lo1 bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m34.698s  user 0m0.484s   sys 0m13.868s   (dd)
real 0m0.741s   user 0m0.001s   sys 0m0.016s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo0 of=/dev/null bs=8192; time sync
131072+0 records in
131072+0 records out
real 0m3.372s   user 0m0.088s   sys 0m1.209s    (dd)
real 0m0.015s   user 0m0.001s   sys 0m0.012s    (sync)

bash-3.00# time dd if=/pool0-raidz/w-test.lo1 of=/dev/null bs=8192; time sync
655360+0 records in
655360+0 records out
real 0m15.863s  user 0m0.431s   sys 0m6.077s    (dd)
real 0m0.013s   user 0m0.001s   sys 0m0.013s    (sync)

====

The latter is my 4-disk split and, as you can see, it performs pretty well. Maybe someone can help us understand why it appears this way?

-Andy
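In case anyone wants to repeat the comparison, here is a small sketch (not the exact script used above) of how the same sequence of dd write/rewrite/read timings could be wrapped in a loop; the mountpoint and sizes match the runs above, so adjust to taste:

#!/usr/bin/bash
# crude sequential write / rewrite / read timing with dd
P=/pool0-raidz                       # pool mountpoint
for n in 131072 655360; do
    time dd if=/dev/zero of=$P/w-test.$n bs=8192 count=$n; time sync
    time dd if=$P/w-test.$n of=$P/rw-test.$n bs=8192;      time sync
    time dd if=$P/w-test.$n of=/dev/null bs=8192;          time sync
done

Destroying and recreating the pool with the other vdev layout between runs keeps the comparison apples-to-apples.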