I'm trying to set up a Nexsan SATABeast (an external disk array with 14 disks available; read more: http://www.nexsan.com/satabeast.php) with a Sun Fire X4100 M2 server running Solaris 10 u6 (connected via fiber), and I have a couple of questions. (My motivation is the corrupted ZFS volume discussion I had earlier, which ended with no result; this time I'm trying to build a more robust implementation.)

1. On the external disk array, I am not able to configure JBOD, or RAID 0 or 1 with just one disk. I can't find any option that lets my Solaris server access the disks directly, so I have to configure some RAID sets on the SATABeast. I was thinking of striping two disks in each RAID set and then adding all 7 sets to one zpool as a raidz. The problem with this is that if one disk breaks down, I'll lose a whole RAID 0 volume, but maybe ZFS can handle this? Should I rather build RAID 5 volumes on the SATABeast and then export them to the Solaris machine? 14 disks would give me 4 RAID 5 volumes and 2 spare disks, but I'd lose a lot of disk space. What about creating larger RAID volumes on the SATABeast, such as 3 RAID volumes, two with 5 disks and one with 4 disks? I'm really not sure what to choose. At the moment I've striped two disks in one RAID volume.

2. After reading about cache flushes in the ZFS Evil Tuning Guide (http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes), I checked the cache configuration on the SATABeast. I can change these settings:

System Admin
  Configure Cache
    Cache Configuration

    Current write cache state: Enabled, FUA ignored - 495 MB
    Manually override current write cache status: [ ] Force write cache to Disabled
    Desired write cache state: [X] Enabled [ ] Disabled
    Allow attached host to override write cache configuration: [ ]
    Ignore force unit access (FUA) bit: [X]
    Write cache streaming mode: [ ]
    Cache optimization setting:
      [ ] Random access
      [X] Mixed sequential/random
      [ ] Sequential access

And from the help section:

> Write cache will normally speed up host writes, data is buffered in
> the RAID controllers memory when the installed disk drives are not
> ready to accept the write data. The RAID controller write cache
> memory is battery backed, this allows any unwritten array data to be
> kept intact during a power failure situation. When power is restored
> this battery backed data will be flushed out to the RAID array.
>
> Current write cache state - This is the current state of the write
> cache that the RAID system is using.
>
> Manually override current write cache status - This allows the write
> caching to be forced on or off by the user, this change will take
> effect immediately.
>
> Desired write cache state - This is the state of the write cache the
> user wishes to have after boot up.
>
> Allow attached host to override write cache configuration - This
> allows the host system software to issue commands to the RAID system
> via the host interface that will either turn off or on the write
> caching.
>
> Ignore force unit access (FUA) bit - When the force unit access
> (FUA) bit is set by a host system on a per command basis data is
> written / read directly to / from the disks without using the
> onboard cache. This will incur a time overhead, but guarantees the
> data is on the media. Set this option to force the controller to
> ignore the FUA bit such that command execution times are more
> consistent.
>
> Write cache streaming mode - When the write cache is configured in
> streaming mode (check box ticked), the system continuously flushes
> the cache (it runs empty). This provides maximum cache buffering to
> protect against raid system delays adversely affecting command
> response times to the host.
> When the write cache operates in non-streaming mode (check box not
> ticked) the system runs with a full write cache to maximise cache
> hits and maximise random IO performance.
>
> Cache optimization setting - The cache optimization setting adjusts
> the cache behaviour to maximize performance for the expected host
> I/O pattern.
>
> Note that the write cache will be flushed 5 seconds after the last
> host write. It is recommended that all host activity is stopped 30
> seconds before powering the system off.

Any thoughts about this?

Regards,
Lars-Gunnar Persson

On Mon, 9 Mar 2009, Lars-Gunnar Persson wrote:
>
> 1. On the external disk array, I not able to configure JBOD or RAID 0 or 1
> with just one disk. I can't find any options for my Solaris server to access
> the disk directly so I have to configure some raids on the SATABeast. I was
> thinking of striping two disks in each raid and then add all 7 raids to one
> zpool as a raidz. The problem with this is if one disk breaks down, I'll
> lose one RAID 0 disk but maybe ZFS can handle this? Should I rather
> implement RAID5 disks on the SATABeast and then export them to the Solaris
> machine? 14 disks would give me 4 RAID5 volumes and 2 spare disks? I'll lose
> a lot of disk space. What about create larger RAID volumes on the SATABeast?
> Like 3 RAID volumes with 5 disks in 2 RAIDS and 4 disks in one RAID? I'm
> really not sure what to choose ... At the moment I've striped two disks in
> one RAID volume.

Your idea to stripe two disks per LUN should work. Make sure to use raidz2 rather than plain raidz for the extra reliability. This solution is optimized for high data throughput from one user.

An alternative is to create individual "RAID 0" LUNs which actually only contain a single disk. Then implement the pool as two raidz2s with six LUNs each, and two hot spares. That would be my own preference. Due to ZFS's load sharing this should provide better performance (perhaps 2X) for multi-user loads. Some testing may be required to make sure that your hardware is happy with this.

Avoid RAID5 if you can because it is not as reliable with today's large disks, and the resulting huge LUN size can take a long time to resilver if the RAID5 should fail (or be considered to have failed). There is also the issue that a RAID array bug might cause transient wrong data to be returned, and this could confuse ZFS's own diagnostics/repair and result in useless "repairs". If ZFS reports a problem but the RAID array says that the data is fine, then there is confusion, finger-pointing, and likely a post to this list. If you are already using ZFS, then you might as well use ZFS for most of the error detection/correction as well.

These are my own opinions and others will surely differ.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
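
The space trade-off between the two layouts discussed above can be sketched as simple disk counting. This is only an illustration: the `Layout` class and its field names are made up for this example, and it counts whole disks without modelling ZFS metadata or array overhead.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    """Hypothetical helper: counts disks for a pool built from raidz vdevs."""
    name: str
    vdevs: int          # number of raidz/raidz2 vdevs in the pool
    luns_per_vdev: int  # LUNs exported by the array to each vdev
    parity: int         # 1 for raidz, 2 for raidz2
    disks_per_lun: int  # physical disks striped behind each LUN
    spares: int = 0     # hot spares, counted as single disks

    def disks_used(self) -> int:
        return self.vdevs * self.luns_per_vdev * self.disks_per_lun + self.spares

    def data_disks(self) -> int:
        # Each vdev gives up `parity` LUNs' worth of space to redundancy.
        return self.vdevs * (self.luns_per_vdev - self.parity) * self.disks_per_lun


# Layout from the original question: 7 two-disk RAID 0 LUNs in one raidz2.
striped = Layout("7 x 2-disk LUNs, one raidz2", 1, 7, 2, 2)
# Alternative layout: two raidz2 vdevs of six single-disk LUNs plus two spares.
single = Layout("2 x raidz2 of 6 single-disk LUNs + 2 spares", 2, 6, 2, 1, spares=2)

for layout in (striped, single):
    print(f"{layout.name}: {layout.data_disks()} of {layout.disks_used()} disks hold data")
```

Under these assumptions the striped layout keeps 10 of 14 disks for data and the single-disk layout keeps 8 of 14, so the second layout trades two disks of capacity for smaller failure units and hot spares.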

On March 9, 2009 12:06:40 PM +0100 Lars-Gunnar Persson <lars-gunnar.persson at nersc.no> wrote:
> I'm trying to implement a Nexsan SATABeast...
> 1. On the external disk array, I not able to configure JBOD or RAID 0 or
> 1 with just one disk.

Exactly why I didn't buy this product.

On Mon, 9 Mar 2009 12:06:40 +0100, Lars-Gunnar Persson <lars-gunnar.persson at nersc.no> wrote:

>1. On the external disk array, I not able to configure JBOD or RAID 0
>or 1 with just one disk.

In some arrays it seems to be possible to configure separate disks by offering the array just one disk in one slot at a time and, very important, leaving all other slots empty(!).

Repeat for as many disks as you have, seating each disk in its own slot, with all other slots empty.

(OK, it's just hearsay, but it might be worth a try with the first 4 disks or so.)
--
Kees Nuyt

How about this configuration?

On the Nexsan SATABeast, add all disks to one RAID 5 or 6 group. Then define several smaller volumes on the Nexsan and add those volumes to a raidz2/raidz zpool.

Could that be a useful configuration? Maybe I'll lose too much space with a "double" RAID 5 or 6 configuration? What about performance?

Regards,
Lars-Gunnar Persson

On 10 March 2009, at 00:26, Kees Nuyt wrote:
> On Mon, 9 Mar 2009 12:06:40 +0100, Lars-Gunnar Persson
> <lars-gunnar.persson at nersc.no> wrote:
>
>> 1. On the external disk array, I not able to configure JBOD or RAID 0
>> or 1 with just one disk.
>
> In some arrays it seems to be possible to configure separate
> disks by offering the array just one disk in one slot at a
> time, and, very important, leaving all other slots empty(!).
>
> Repeat for as many disks as you have, seating each disk in
> its own slot, and all other slots empty.
>
> (ok, it's just hear-say, but it might be worth a try with
> the first 4 disks or so).

I realized that I'll lose too much disk space with the "double" RAID configuration suggested below. Agree?

I've done some performance testing with raidz/raidz1 vs raidz2:

bash-3.00# zpool status -v raid5
  pool: raid5
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        raid5                                      ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c7t6000402001FC442C609DC5A300000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA4A00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA2200000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCABF00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCADB00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCAF800000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609F029100000000d0  ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
raid5  12.6T   141K  12.6T     0%  ONLINE  -

bash-3.00# df -h /raid5
Filesystem             size   used  avail capacity  Mounted on
raid5                   11T    41K    11T     1%    /raid5

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               0
bash-3.00# ./filesync-1 /raid5 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 9.871197

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:               0               =       0x1
bash-3.00# ./filesync-1 /raid5 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 7.363303

Then I destroyed the raid5 pool and created a raid6 pool:

bash-3.00# zpool status -v raid6
  pool: raid6
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        raid6                                      ONLINE       0     0     0
          raidz2                                   ONLINE       0     0     0
            c7t6000402001FC442C609DC5A300000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA4A00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCA2200000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCABF00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCADB00000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609DCAF800000000d0  ONLINE       0     0     0
            c7t6000402001FC442C609F029100000000d0  ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
raid6  12.6T   195K  12.6T     0%  ONLINE  -

bash-3.00# df -h /raid6
Filesystem             size   used  avail capacity  Mounted on
raid6                  8.8T    52K   8.8T     1%    /raid6

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               0
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 9.879219

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:               0               =       0x1
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 7.560435

My conclusion on raidz1 vs raidz2 would be: no difference in performance and a big difference in available disk space.

On 10 March 2009, at 09:13, Lars-Gunnar Persson wrote:
> How about this configuration?
>
> On the Nexsan SATABeast, add all disk to one RAID 5 or 6 group. Then
> on the Nexsan define several smaller volumes and then add those
> volumes to a raidz2/raidz zpool?
>
> Could that be an useful configuration? Maybe I'll lose too much
> space with "double" raid 5 or 6 configuration? What about performance?
>
> Regards,
>
> Lars-Gunnar Persson

--
Lars-Gunnar Persson
IT-sjef (IT manager)
Nansen senteret for miljø og fjernmåling
Address: Thormøhlensgate 47, 5006 Bergen
Direct: 55 20 58 31, switchboard: 55 20 58 00, fax: 55 20 58 01
Web: http://www.nersc.no, e-mail: lars-gunnar.persson at nersc.no

Test 1:

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               0
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 292.223081

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:               0               =       0x1
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 288.099066

Test 2:

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               0
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 13.092332

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:               0               =       0x1
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 9.591622

Test 3:

bash-3.00# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               0
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 9.879219

bash-3.00# echo zfs_nocacheflush/W1 | mdb -kw
zfs_nocacheflush:               0               =       0x1
bash-3.00# ./filesync-1 /raid6 10000
Time in seconds to create and unlink 10000 files with O_DSYNC: 7.560435

Thank you for your reply. If I make a raidz or a raidz2 on the Solaris box, do I get enough redundancy?

The Nexsan can't do block-level snapshots.

On 9 March 2009, at 18:27, Miles Nordin wrote:
>>>>>> "lp" == Lars-Gunnar Persson <lars-gunnar.persson at nersc.no> writes:
>
>     lp> Ignore force unit access (FUA) bit: [X]
>
>     lp> Any thoughts about this?
>
> run three tests
>
>  (1) write cache disabled
>
>  (2) write cache enabled, ignore FUA off
>
>  (3) write cache enabled, ignore FUA [X]
>
> if all three are the same, either the test is broken or something is
> wrong.
>
> If (2) and (3) are the same, then ZFS is working as expected. run
> with either (2) or (3).
>
> If (1) and (2) are the same, then ZFS doesn't seem to have the
> cache-flush changes implemented, or else they aren't sufficient for
> your array. You could look into it more, file a bug, something like
> that.
>
> If you're not interested in investigating more and filing a bug in the
> last case, then you could just set (3), and do no testing at all.
>
> It might be good to do some of the testing just to catch wtf cases
> like, ``oh sorry, we didn't sell you a cache.'' or ``the battery is
> dead? how'd that happen?'' but there are so many places for wtf cases
> maybe this isn't the one to worry about.
>
> but I'm not sure why you are switching to one big single-disk vdev
> with FC if you are having ZFS corruption problems. I think everyone's
> been saying you are more likely to have problems by using FC instead
> of direct attached, and also by using FC without vdev redundancy (the
> redundancy seems to work around some bugs). At least the people
> reporting lots of lost pools are the ones using FC and iSCSI, who lose
> pools during target reboots or SAN outages.
>
> I suppose if the NexSAN can do block-level snapshots, you could
> snapshot exported copies of your pool from time to time, and roll back
> to a snapshot if ZFS refuses to import your ``corrupt'' pool. In that
> way it could help you?
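
The three-test decision rule quoted above can be written down as a small function. This is a sketch only: it assumes that Test 1, 2 and 3 correspond to cases (1), (2) and (3) in that order, and the 10% tolerance used to decide that two timings are "the same" is an arbitrary choice made for this example, not something stated in the thread.

```python
def same(a: float, b: float, tol: float = 0.10) -> bool:
    """Treat two timings as 'the same' if they differ by at most tol (relative)."""
    return abs(a - b) <= tol * max(a, b)


def classify(t1: float, t2: float, t3: float, tol: float = 0.10) -> str:
    """Apply the quoted rule to timings for cases (1), (2) and (3)."""
    if same(t1, t2, tol) and same(t2, t3, tol) and same(t1, t3, tol):
        return "all three the same: test is broken or something is wrong"
    if same(t2, t3, tol):
        return "(2) and (3) the same: ZFS working as expected, run with either"
    if same(t1, t2, tol):
        return "(1) and (2) the same: cache-flush handling missing or insufficient"
    return "none of the listed cases applies"


# Timings in seconds with zfs_nocacheflush = 0, taken from the three tests above.
print(classify(292.223081, 13.092332, 9.879219))
```

With this tolerance, the measured timings fall into none of the three listed cases: (2) and (3) differ by roughly a quarter, which a looser tolerance would count as "the same".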

On Tue, Mar 10, 2009 at 3:13 AM, Lars-Gunnar Persson <lars-gunnar.persson at nersc.no> wrote:
> How about this configuration?
>
> On the Nexsan SATABeast, add all disk to one RAID 5 or 6 group. Then on the
> Nexsan define several smaller volumes and then add those volumes to a
> raidz2/raidz zpool?
>
> Could that be an useful configuration? Maybe I'll lose too much space with
> "double" raid 5 or 6 configuration? What about performance?
>

Bad idea. The probability of a double disk failure in this scenario is relatively high (48 disks, one/two parity drives), and when that happens, you'll lose ALL data. I don't quite follow the reasoning behind putting software RAID on top of your hardware RAID either. Either you trust that you have a solid RAID solution on the back end or you don't. If you don't, sell it and buy something else. If you do, put the ZFS filesystem directly on top without software RAID and be done with it.

--Tim

Bob Friesenhahn wrote:
> Your idea to stripe two disks per LUN should work. Make sure to use
> raidz2 rather than plain raidz for the extra reliability. This
> solution is optimized for high data throughput from one user.

Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of redundancy (either mirror or raidz[2]) would be an efficient use of space. There would be no additional space overhead caused by running that way.

Note, however, that if you do this, ZFS must resilver the larger LUN in the event of a single disk failure on the backend. This means a longer time to rebuild, and a lot of "extra" work on the other (non-failed) half of the RAID0 stripe.

>
> An alternative is to create individual "RAID 0" LUNs which actually
> only contain a single disk.

This is certainly preferable, since the unit of failure at the hardware level corresponds to the unit of resilvering at the ZFS level. And at least on my Nexsan SATAboy(2f) this configuration is possible.

> Then implement the pool as two raidz2s
> with six LUNs each, and two hot spares. That would be my own
> preference. Due to ZFS's load share this should provide better
> performance (perhaps 2X) for multi-user loads. Some testing may be
> required to make sure that your hardware is happy with this.

I disagree with this suggestion. With this config, you only get 8 disks' worth of storage out of the 14, which is a ~42% overhead. In order to lose data in this scenario, 3 disks would have to fail out of a single 6-disk group before ZFS is able to resilver any of them to the hot spares. That seems (to me) a lot more redundancy than is needed.

As far as workload goes, any time you use RAIDZ[2], ZFS must read the entire stripe (across all of the disks) in order to verify the checksum for that data block. This means that a 128k read (the default ZFS blocksize) requires a 32kb read from each of 6 disks, which may include a relatively slow seek to the relevant part of the spinning rust. So for random I/O, even though the data is striped across all the disks, you will see only a single disk's worth of throughput. For sequential I/O, you'll see the full RAID set's worth of throughput. If you are expecting a non-sequential workload, you would be better off taking the 50% storage overhead to do ZFS mirroring.

>
> Avoid RAID5 if you can because it is not as reliable with today's
> large disks and the resulting huge LUN size can take a long time to
> resilver if the RAID5 should fail (or be considered to have failed).

Here's a place where ZFS shines: it doesn't resilver the whole disk, just the data blocks. So it doesn't have to read the full array to rebuild a failed disk, and it's less likely to cause a subsequent failure during the parity rebuild.

My $.02.
--Joe
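
The two figures in the message above (the ~42% overhead and the 32 KB per-disk read) can be checked with a few lines of arithmetic. This is a rough sketch: the function names are invented for the example, and it assumes a full 128 KB block is split evenly across the data columns of the vdev (6 disks minus 2 parity), ignoring sector rounding and metadata.

```python
def overhead_pct(total_disks: int, data_disks: int) -> float:
    """Share of the purchased disks not available for data, in percent."""
    return 100.0 * (total_disks - data_disks) / total_disks


def per_data_disk_read_kib(block_kib: float, disks: int, parity: int) -> float:
    """Size of the piece of one block stored on each data column of a raidz vdev."""
    return block_kib / (disks - parity)


# Two raidz2 vdevs of six LUNs plus two spares: 8 data disks out of 14.
print(f"overhead: {overhead_pct(14, 8):.1f}%")
# One 128 KiB block on a six-disk raidz2 vdev (four data columns).
print(f"per-disk read: {per_data_disk_read_kib(128, 6, 2):.0f} KiB")
```

Under these assumptions the overhead is closer to 43% than 42%, and the 32 KB figure corresponds to the four data columns of a six-disk raidz2 rather than to all six disks.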

On Tue, 10 Mar 2009, Lars-Gunnar Persson wrote:
>
> My conclusion on raidz1 vs raidz2 would be no difference in performance and
> big difference in disk space available.

I am not so sure about the "big difference" in disk space available. Disk capacity is cheap, but failure is not.

If you need to make up the difference in disk space, then use raidz2 and don't allocate any spare disks. Just make sure that you have a spare disk drive handy, or will be able to purchase one in a reasonable amount of time.

With raidz1 or RAID5 you are left feeling naked and exposed as soon as one disk fails, and you realize that the remaining disk drives will need to work perfectly in order to preserve your data. With raidz2, losing a single disk drive does not leave you naked and exposed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

[amplification of Joe's point below...]

Moore, Joe wrote:
> Bob Friesenhahn wrote:
>
>> Your idea to stripe two disks per LUN should work. Make sure to use
>> raidz2 rather than plain raidz for the extra reliability. This
>> solution is optimized for high data throughput from one user.
>>
>
> Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of redundancy (either mirror or raidz[2]) would be an efficient use of space. There would be no additional space overhead caused by running that way.
>

It would also reduce your per-vdev MTBF by half. In general, better reliability at the vdev level is a good thing. For example, consider the case where we have 6 same-sized disks. We can configure them in two different ways using 2+1 RAID-5 sets:

    configuration    MTTDL[1]
    --------------------------
    RAID-5+0         188,297
    RAID-0+5          94,149

The MTTDL[1] model does consider MTTR, which is a combination of the logistical response time and reconstruction time. Unless you have zero response time and reconstruction time, RAID-0+5 is not as good as RAID-5+0.

> Note, however, that if you do this, ZFS must resilver the larger LUN in the event of a single disk failure on the backend. This means a longer time to rebuild, and a lot of "extra" work on the other (non-failed) half of the RAID0 stripe.
>

ZFS resilver tends to be I/O bound in one of two ways: bandwidth on the resilvering vdev and IOPS on the surviving vdevs. You might consider this when you use "hardware" RAID vdevs.

-- richard
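
The factor of two in the table above can be reproduced with a simple single-parity model. This is a sketch under stated assumptions: the formula MTTDL = MTBF^2 / (N * (N-1) * MTTR) is the commonly quoted single-parity approximation and is not spelled out in the thread, and the `MTBF_HOURS` and `MTTR_HOURS` values are hypothetical, so the absolute results will not match the table. Only the ratio is of interest.

```python
def mttdl_single_parity(mtbf: float, mttr: float, n: int) -> float:
    """Assumed model: MTBF^2 / (N * (N - 1) * MTTR) for one single-parity set."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


MTBF_HOURS = 100_000.0  # hypothetical per-disk MTBF
MTTR_HOURS = 24.0       # hypothetical response plus reconstruction time

# RAID-5+0: two independent 2+1 RAID-5 sets striped together.
# Losing either set loses the pool, so the pool MTTDL is half that of one set.
raid_5_0 = mttdl_single_parity(MTBF_HOURS, MTTR_HOURS, 3) / 2

# RAID-0+5: one 2+1 RAID-5 set whose members are 2-disk stripes.
# A 2-disk stripe fails when either disk fails, so its MTBF is halved.
raid_0_5 = mttdl_single_parity(MTBF_HOURS / 2, MTTR_HOURS, 3)

ratio = raid_5_0 / raid_0_5
print(f"RAID-5+0 / RAID-0+5 MTTDL ratio: {ratio:.2f}")
```

In this model the ratio is 2 regardless of the MTBF and MTTR chosen, which matches the 188,297 versus 94,149 relationship in the table.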

On Tue, 10 Mar 2009, Moore, Joe wrote:
> As far as workload, any time you use RAIDZ[2], ZFS must read the
> entire stripe (across all of the disks) in order to verify the
> checksum for that data block. This means that a 128k read (the
> default zfs blocksize) requires a 32kb read from each of 6 disks,
> which may include a relatively slow seek to the relevant part of the
> spinning rust. So for random I/O, even though the data is striped

This is not quite true. Raidz2 is not the same as RAID6. ZFS has an independent checksum for its data blocks. The traditional RAID type technology is used to repair in case data corruption is detected.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

On Tue, Mar 10, 2009 at 05:57:16PM -0500, Bob Friesenhahn wrote:
> On Tue, 10 Mar 2009, Moore, Joe wrote:
>
> >As far as workload, any time you use RAIDZ[2], ZFS must read the
> >entire stripe (across all of the disks) in order to verify the
> >checksum for that data block. This means that a 128k read (the
> >default zfs blocksize) requires a 32kb read from each of 6 disks,
> >which may include a relatively slow seek to the relevant part of the
> >spinning rust. So for random I/O, even though the data is striped
>
> This is not quite true. Raidz2 is not the same as RAID6. ZFS has an
> independent checksum for its data blocks. The traditional RAID type
> technology is used to repair in case data corruption is detected.

What part isn't true? ZFS has an independent checksum for the data block. But if the data block is spread over multiple disks, then each of the disks has to be read to verify the checksum.

--
Darren

On Tue, Mar 10, 2009 at 23:57, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 10 Mar 2009, Moore, Joe wrote:
>
>> As far as workload, any time you use RAIDZ[2], ZFS must read the entire
>> stripe (across all of the disks) in order to verify the checksum for that
>> data block. This means that a 128k read (the default zfs blocksize)
>> requires a 32kb read from each of 6 disks, which may include a relatively
>> slow seek to the relevant part of the spinning rust. So for random I/O,
>> even though the data is striped
>
> This is not quite true. Raidz2 is not the same as RAID6. ZFS has an
> independent checksum for its data blocks. The traditional RAID type
> technology is used to repair in case data corruption is detected.

What he is saying is true. RAIDZ spreads each block across all disks, and therefore requires full-stripe reads to read the block. The good thing is that it always does full-stripe writes, so writes are fast.

RAID6 has no blocks, so you can read any sector by reading from 1 disk; you only have to read from the other disks in the stripe in case of a fault.

On Tue, 10 Mar 2009, A Darren Dunham wrote:
>
> What part isn't true? ZFS has a independent checksum for the data
> block. But if the data block is spread over multiple disks, then each
> of the disks have to be read to verify the checksum.

I interpreted what you said to imply that RAID6 type algorithms were being used to validate the data, rather than to correct wrong data. I agree that it is necessary to read a full ZFS block in order to use the ZFS block checksum. I also agree that a raidz2 vdev has IOPS behavior which is similar to a single disk.

From what I understand, a raidz2 with a very large number of disks won't use all of the disks to store one ZFS block. There is a maximum number of disks in a stripe which can be supported by the ZFS block size.

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

I would like to go back to my question for a second:

I checked with my Nexsan supplier and they confirmed that access to every single disk in the SATABeast is not possible. The smallest entities I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll lose too much disk space, and I believe that leaves me with RAID 0 as the only reasonable option. But with this insecure RAID format I'll need higher redundancy in the ZFS configuration. I think I'll go with the following configuration:

On the Nexsan SATABeast:
* 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is 1 TB, which gives me a total of 14 TB raw disk space).
* Each RAID 0 array configured as one volume.

On the Sun Fire X4100 M2 with Solaris 10:
* Add all 7 volumes to one zpool configured as one raidz2 (gives me approx. 8.8 TB available disk space).

Any comments or suggestions?

Best regards,
Lars-Gunnar Persson

On 11 March 2009, at 02:39, Bob Friesenhahn wrote:
> On Tue, 10 Mar 2009, A Darren Dunham wrote:
>>
>> What part isn't true? ZFS has a independent checksum for the data
>> block. But if the data block is spread over multiple disks, then
>> each of the disks have to be read to verify the checksum.
>
> I interpreted what you said to imply that RAID6 type algorithms were
> being used to validate the data, rather than to correct wrong data.
> I agree that it is necessary to read a full ZFS block in order to
> use the ZFS block checksum. I also agree that a raidz2 vdev has
> IOPS behavior which is similar to a single disk.
>
> From what I understand, a raidz2 with a very large number of disks
> won't use all of the disks to store one ZFS block. There is a
> maximum number of disks in a stripe which can be supported by the
> ZFS block size.

Lars-Gunnar Persson wrote:
> I would like to go back to my question for a second:
>
> I checked with my Nexsan supplier and they confirmed that access to
> every single disk in SATABeast is not possible. The smallest entities
> I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll
> lose too much disk space and I believe that leaves me with RAID 0 as
> the only reasonable option. But with this unsecure RAID format I'll
> need higher redundancy in the ZFS configuration. I think I'll go with
> the following configuration:
>
> On the Nexsan SATABeast:
> * 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is
> 1 TB which gives me a total of 14 TB raw disk space).
> * Each RAID 0 array configured as one volume.

So what the front end will see is 7 disks, 2TB each.

>
> On the Sun Fire X4100 M2 with Solaris 10:
> * Add all 7 volumes to one zpool configured in on raidz2 (gives me
> approx. 8,8 TB available disk space)

You'll get 5 LUNs' worth of space in this config, or 10TB of usable space.

>
> Any comments or suggestions?

Given the hardware constraints (no single-disk volumes allowed) this is a good configuration for most purposes.

The advantages/disadvantages are:
* 10TB of usable disk space, out of 14TB purchased.
* At least three hard disk failures are required to lose the ZFS pool.
* Random non-cached read performance will be about 300 IO/sec.
* Sequential reads and writes of the whole ZFS blocksize will be fast (up to 2000 IO/sec).
* One hard drive failure will cause the used blocks of the 2TB LUN (raid0 pair) to be resilvered, even though the other half of the pair is not damaged. The other half of the pair is more likely to fail during the ZFS resilvering operation because of increased load.

You'll want to pay special attention to the cache settings on the Nexsan. You earlier showed that the write cache is enabled, but IIRC the array doesn't have a nonvolatile (battery-backed) cache. If that's the case, MAKE SURE it's hooked up to a UPS that can support it for the 30 second cache flush timeout on the array. And make sure you don't power it down hard.

I think you want to uncheck the "ignore FUA" setting, so that FUA requests are respected. My guess is that this will cause the array to properly handle the cache_flush requests that ZFS uses to ensure data consistency.

--Joe
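
The capacity figures in this message can be checked with a short calculation. This is a sketch that rests on two assumptions not stated in the thread: that the 1 TB disks are sized in decimal terabytes (10^12 bytes), and that the sizes printed by `df -h` and `zpool list` are in binary units (2^40 bytes). The function names are made up for the example.

```python
def raidz_usable_tb(luns: int, parity: int, lun_tb: float) -> float:
    """Usable decimal terabytes of one raidz vdev, ignoring filesystem overhead."""
    return (luns - parity) * lun_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


usable = raidz_usable_tb(luns=7, parity=2, lun_tb=2.0)
print(f"raidz2 over 7 x 2 TB LUNs: {usable:.0f} TB = {tb_to_tib(usable):.2f} TiB")
print(f"raw 14 TB = {tb_to_tib(14.0):.2f} TiB")
```

If those assumptions hold, the unit conversion alone accounts for much of the gap between the 10TB estimate and the 8.8T reported earlier by `df -h`; the remainder would have to come from other overhead that this sketch does not model.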