Hi

Now that Solaris 10 06/06 is finally downloadable, I have some questions about ZFS.

- We have a big storage system supporting RAID5 and RAID1. At the moment, we only use RAID5 (for non-Solaris systems as well). We are thinking about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5 seems like overkill, an option would be to use RAID1 with RAID-Z. Then again, this is a waste of space, as it needs more disks due to the mirroring. Later on, we might be using asynchronous replication to another storage system over the SAN, which wastes even more space. It looks as if ZFS and storage virtualization, as of today, just don't work nicely together. What we would need is the ability to use JBODs.

- Does ZFS in the current version support LUN extension? With UFS, we have to zero the VTOC and then adjust the new disk geometry. How does it look with ZFS?

- I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of copy-on-write) impact read performance?

- Does anybody have any experience with database cloning using the ZFS clone mechanism? What factors influence performance when running the cloned database in parallel?

- I really like the idea of keeping all needed database files together, to allow fast and consistent cloning.

Thanks

Mika

# mv Disclaimer.txt /dev/null
About:

> I've read the threads about ZFS and databases. Still, I'm not 100% convinced
> about read performance. Doesn't the fragmentation of large database files
> (because of copy-on-write) impact read performance?

I do need to get back to this thread. The way I am currently looking at it is this: ZFS will perform great at the transaction component (say, the small 8K O_DSYNC writes) because the ZIL will aggregate them into fewer, larger I/Os and the block allocation will stream them to the disk surface. On the other hand, streaming reads will require good prefetch code (under review) to get the read performance we want.

If the requirement balances random writes and streaming reads, then ZFS should be right there with the best filesystems. If the critical requirement focuses exclusively on streaming reads of a file that was written randomly and, in addition, the number of spindles is limited, then that is not the sweet spot of ZFS. Read performance should still scale with the number of spindles. And, if the load can accommodate a reorder, a cp(1) of the file should do wonders for the layout and give top per-spindle read-streaming performance.

-r
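A minimal sketch of the cp(1) reorder Roch describes, assuming the database can be taken offline briefly, the filesystem has enough free space for a second copy, and that the file and path names here are hypothetical:

    # rewrite the file so ZFS allocates its blocks in fresh, mostly contiguous runs
    cp /tank/db/datafile.dbf /tank/db/datafile.dbf.new
    mv /tank/db/datafile.dbf.new /tank/db/datafile.dbf

The copy goes through the normal ZFS allocator, so the new blocks tend to land in larger contiguous regions than the randomly overwritten original, which is what helps subsequent sequential reads.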
On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:

> -We have a big storage system supporting RAID5 and RAID1. [...] This looks
> somehow like storage virtualization as of today just doesn't work nicely
> together. What we need, would be the feature to use JBODs.

If you've got hardware RAID-5, why not just run regular (non-raid) pools on top of the RAID-5?

I wouldn't go back to JBOD. Hardware arrays offer a number of advantages over JBOD:
    - disk microcode management
    - optimized access to storage
    - large write caches
    - RAID computation can be done in specialized hardware
    - SAN-based hardware products allow sharing of storage among multiple hosts. This allows storage to be utilized more effectively.

> -Does ZFS in the current version support LUN extension? With UFS, we
> have to zero the VTOC, and then adjust the new disk geometry. How does
> it look like with ZFS?

I don't understand what you're asking. What problem is solved by zeroing the VTOC?

> -I've read the threads about zfs and databases. Still I'm not 100%
> convinced about read performance. Doesn't the fragmentation of the
> large database files (because of the concept of COW) impact
> read-performance?

This is discussed elsewhere in the zfs discussion group.

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
> > -Does ZFS in the current version support LUN extension? With UFS, we
> > have to zero the VTOC, and then adjust the new disk geometry. How does
> > it look like with ZFS?
>
> I don't understand what you're asking. What problem is solved by
> zeroing the vtoc?

It matters when the underlying storage increases the size of the LUN. The old size is still on the label and the 'sd' driver doesn't recognize the increase. This doesn't appear to be a problem at the ZFS level, but the interactions with the EFI label may be interesting.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
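For context, a rough sketch of the traditional UFS-side procedure being referred to (zero the stale label, relabel, then grow the filesystem). This is only an illustration under stated assumptions - the device name c2t0d0 and mount point /data are hypothetical, the array is assumed to have already grown the LUN, and zeroing a disk label is destructive, so do not treat this as a procedure to run as-is:

    # 1. wipe the stale label so the new capacity can be discovered
    dd if=/dev/zero of=/dev/rdsk/c2t0d0s2 bs=512 count=16
    # 2. in format(1M), select the disk, choose "type" -> "Auto configure",
    #    then "label" to write a label matching the new geometry
    format c2t0d0
    # 3. grow the UFS filesystem into the new space
    growfs -M /data /dev/rdsk/c2t0d0s0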
On Tue, 2006-06-27 at 02:27, Gregory Shaw wrote:
> On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:
> > <snip> What we need, would be the feature to use JBODs.
>
> If you've got hardware raid-5, why not just run regular (non-raid)
> pools on top of the raid-5?
>
> I wouldn't go back to JBOD. Hardware arrays offer a number of
> advantages to JBOD:
>     - disk microcode management
>     - optimized access to storage
>     - large write caches
>     - RAID computation can be done in specialized hardware
>     - SAN-based hardware products allow sharing of storage among
>       multiple hosts. This allows storage to be utilized more effectively.

How would ZFS self-heal in this case?

Nathan.
On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
> On Tue, 2006-06-27 at 02:27, Gregory Shaw wrote:
> > If you've got hardware raid-5, why not just run regular (non-raid)
> > pools on top of the raid-5?
> >
> > I wouldn't go back to JBOD. Hardware arrays offer a number of
> > advantages to JBOD: [...]
>
> How would ZFS self heal in this case?
>
> Nathan.

You're using hardware RAID. The hardware RAID controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software RAID.

If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software RAID (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime.
Roch wrote:
> And, if the load can accommodate a reorder, to get top per-spindle
> read-streaming performance, a cp(1) of the file should do wonders on
> the layout.

...but there may not be filesystem space for double the data.
Sounds like there is a need for a zfs-defragment-file utility, perhaps?

Or if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)
> If you've got hardware raid-5, why not just run regular (non-raid)
> pools on top of the raid-5?
>
> I wouldn't go back to JBOD. Hardware arrays offer a number of
> advantages to JBOD:
>     - disk microcode management
>     - optimized access to storage
>     - large write caches
>     - RAID computation can be done in specialized hardware
>     - SAN-based hardware products allow sharing of storage among
>       multiple hosts. This allows storage to be utilized more effectively.

I'm a little confused by the first poster's message as well, but you lose some benefits of ZFS if you don't create your pools with either RAID1 or RAID-Z, such as data corruption detection. The array isn't going to detect that because all it knows about are blocks.

-Nate
On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
>
> You're using hardware raid. The hardware raid controller will rebuild
> the volume in the event of a single drive failure. You'd need to keep
> on top of it, but that's a given in the case of either hardware or
> software raid.

True for total drive failure, but there are more failure modes than that. With hardware RAID, there is no way for the RAID controller to know which block was bad, and therefore it cannot repair the block. With RAID-Z, we have the integrated checksum and can do combinatorial analysis to know not only which drive was bad, but what the data _should_ be, and can repair it to prevent more corruption in the future.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock wrote:
> On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
>> You're using hardware raid. [...]
>
> True for total drive failure, but there are more failure modes
> than that. With hardware RAID, there is no way for the RAID controller
> to know which block was bad, and therefore cannot repair the block.
> With RAID-Z, we have the integrated checksum and can do combinatorial
> analysis to know not only which drive was bad, but what the data
> _should_ be, and can repair it to prevent more corruption in the future.

Keep in mind that each disk data block is accompanied by a pretty long error correction code (ECC) which allows for (a) verification of data integrity and (b) repair of lost/misread bits (typically up to about 10% of the block data).

Therefore, in the case of single block errors there are several possible situations:

- non-recoverable errors - the number of correct bits in the combined data + ECC is insufficient. Such errors are visible to the RAID controller, the controller can use a redundant copy of the data, and the controller can perform the repair.

- recoverable errors - some bits can't be read correctly but they can be reconstructed using ECC. These errors are not directly visible to either the RAID controller or ZFS. However, the disks keep a count of recoverable errors, so disk scrubbers can identify disk areas with rotten blocks and force block relocation.

- silent data corruption - it can happen in memory before the data was written to disk, it can occur in the disk cache, or it can be caused by a bug in disk firmware. Here the disk controller can't do anything, and the end-to-end checksums which ZFS offers are the only solution.

-- Olaf
Gregory Shaw wrote:
> On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
>> How would ZFS self heal in this case?
>
> You're using hardware raid. The hardware raid controller will rebuild
> the volume in the event of a single drive failure. You'd need to keep
> on top of it, but that's a given in the case of either hardware or
> software raid.
>
> If you've got requirements for surviving an array failure, the
> recommended solution in that case is to mirror between volumes on
> multiple arrays. I've always liked software raid (mirroring) in that
> case, as no manual intervention is needed in the event of an array
> failure. Mirroring between discrete arrays is usually reserved for
> mission-critical applications that cost thousands of dollars per hour in
> downtime.

In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com         http://blogs.sun.com/barts
> -Does ZFS in the current version support LUN extension? With UFS, we
> have to zero the VTOC, and then adjust the new disk geometry. How does
> it look like with ZFS?

The vdev can handle dynamic LUN growth, but the underlying VTOC or EFI label may need to be zeroed and reapplied if you set up the initial vdev on a slice. If you introduced the entire disk to the pool you should be fine, but I believe you'll still need to offline/online the pool.

.je
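A minimal sketch of the "offline/online the pool" step described above, assuming it maps to an export/import cycle, a pool named tank built on a whole-disk vdev, and a LUN the array has just grown (the pool name is hypothetical):

    zpool export tank     # release the devices so the label can be re-read
    zpool import tank     # re-open the pool with the refreshed device size
    zpool list tank       # check whether the extra capacity is now visible

Whether the extra space actually appears depends on the label and driver picking up the new LUN size, per the discussion above.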
Olaf Manczak wrote:
> Keep in mind that each disk data block is accompanied by a pretty
> long error correction code (ECC) which allows for (a) verification
> of data integrity (b) repair of lost/misread bits (typically up to
> about 10% of the block data).

AFAIK, typical disk ECC will correct 8 bytes. I'd love for it to be 10% (51 bytes). Do you have a pointer to such information?

> Therefore, in case of single block errors there are several possible
> situations: [...]
>
> - silent data corruption - it can happen in memory before the data
>   was written to disk, it can occur in the disk cache, it can be caused
>   by a bug in disk firmware. Here the disk controller can't do
>   anything and the end-to-end checksums, which ZFS offers,
>   are the only solution.

Another mode occurs when you use a format(1M)-like utility to scan and repair disks. For such utilities, if the data cannot be reconstructed it is zero-filled. If there was real data stored there, then ZFS will detect it and the majority of other file systems will not. For an array, one should not be able to readily access such utilities and cause such corrective actions, but I would not bet the farm on it -- end-to-end error detection will always prevail.

-- richard
> The vdev can handle dynamic LUN growth, but the underlying VTOC or
> EFI label may need to be zeroed and reapplied if you set up the initial
> vdev on a slice. If you introduced the entire disk to the pool you
> should be fine, but I believe you'll still need to offline/online the
> pool.

Fine, at least the vdev can handle this... I asked about this feature in October and hoped that it would be implemented when integrating ZFS into Sol10U2:

http://www.opensolaris.org/jive/thread.jspa?messageID=11646

Does anybody know when this feature is finally coming? It would keep the number of LUNs on the host low, especially as device names can get really ugly (long!).

//Mika

# mv Disclaimer.txt /dev/null
> I'm a little confused by the first poster's message as well, but you
> lose some benefits of ZFS if you don't create your pools with either
> RAID1 or RAIDZ, such as data corruption detection. The array isn't
> going to detect that because all it knows about are blocks.

That's the dilemma: the array provides nice features like RAID1 and RAID5, but those are of no real use when using ZFS.

The advantages of using ZFS on such an array are, e.g., the sometimes huge write cache available, the use of consolidated storage and, in SAN configurations, cloning and sharing storage between hosts.

The price comes of course in additional administrative overhead (lots of microcode updates, more components that can fail in between, etc.).

Also, in bigger companies there usually is a team of storage specialists who mostly do not know about the applications running on top of the storage, or do not care... (like: "here you have your bunch of gigabytes...")

//Mika

# mv Disclaimer.txt /dev/null
> but there may not be filesystem space for double the data.
> Sounds like there is a need for a zfs-defragment-file utility perhaps?
> Or if you want to be politically cagey about naming choice, perhaps,
> zfs-seq-read-optimize-file ? :-)

For data warehouse and streaming applications, a seq-read optimization could bring additional performance. For "normal" databases this should be benchmarked...

This brings me back to another question. We have a production database that is cloned at every end of month for end-of-month processing (currently with a feature of our storage array). I'm thinking about a ZFS version of this task; see the sketch after this message. The requirement is that the production database should not suffer performance degradation while the clone runs in parallel. As ZFS does not copy all the blocks when cloning, I wonder how much the production database will suffer from sharing most of the data with the clone (concurrent access vs. caching).

Maybe we need a feature in ZFS to do a full clone (that is, copy all blocks) inside the pool, if performance is an issue... just like the "Quick Copy" vs. "Shadow Image" features on HDS arrays...
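A minimal sketch of the ZFS version of that month-end task, assuming the database lives in a filesystem called tank/proddb (all dataset names here are hypothetical) and that the database is quiesced or in hot-backup mode when the snapshot is taken:

    zfs snapshot tank/proddb@eom-2006-06                  # point-in-time, space-efficient snapshot
    zfs clone tank/proddb@eom-2006-06 tank/proddb-eom     # writable clone sharing unchanged blocks
    # ... run the end-of-month processing against /tank/proddb-eom ...
    zfs destroy tank/proddb-eom                           # discard the clone when done
    zfs destroy tank/proddb@eom-2006-06

The clone initially consumes almost no extra space; only blocks modified by either copy diverge, which is exactly why the two instances end up sharing spindles, as discussed below.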
> That's the dilemma, the array provides nice features like RAID1 and
> RAID5, but those are of no real use when using ZFS.

RAID5 is not a "nice" feature when it breaks. A RAID controller cannot guarantee that all bits of a RAID5 stripe are written when power fails; then you have data corruption and no one can tell you what data was corrupted. ZFS RAID-Z can.

> The advantages to use ZFS on such array are e.g. the sometimes huge
> write cache available, use of consolidated storage and in SAN
> configurations, cloning and sharing storage between hosts.

Are huge write caches really an advantage? Or are you talking about huge write caches with non-volatile storage?

> The price comes of course in additional administrative overhead (lots
> of microcode updates, more components that can fail in between, etc).
>
> Also, in bigger companies there usually is a team of storage
> specialists, that mostly do not know about the applications running on
> top of it, or do not care... (like: "here you have your bunch of
> gigabytes...")

True enough ....

Casper
> RAID5 is not a "nice" feature when it breaks.

Let me correct myself... RAID5 is a "nice" feature for systems without ZFS...

> Are huge write caches really an advantage? Or are you talking about huge
> write caches with non-volatile storage?

Yes, you are right. The huge cache is needed mostly because of the poor write performance of RAID5 (and it is of course battery-backed)...

// Mika

# mv Disclaimer.txt /dev/null
Hello Nathanael,

NB> I'm a little confused by the first poster's message as well, but
NB> you lose some benefits of ZFS if you don't create your pools with
NB> either RAID1 or RAIDZ, such as data corruption detection. The
NB> array isn't going to detect that because all it knows about are blocks.

Actually, ZFS will detect data corruption even if the pool is not redundant, but it won't repair the data (metadata is protected with 2 and/or 3 copies anyway).

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
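A minimal sketch of how that detection surfaces in practice, assuming a non-redundant pool named tank (the pool name is hypothetical):

    zpool scrub tank       # read every block and verify it against its checksum
    zpool status -v tank   # per-vdev READ/WRITE/CKSUM error counters appear here;
                           # without redundancy ZFS reports the damage but cannot
                           # rewrite the affected blocks from a good copy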
Hello Mika,

Tuesday, June 27, 2006, 10:19:05 AM, you wrote:

MB> This brings me back to another question. We have a production database
MB> that is cloned at every end of month for end-of-month processing
MB> (currently with a feature of our storage array).

MB> I'm thinking about a ZFS version of this task. Requirements: the
MB> production database should not suffer from performance degradation,
MB> whilst running the clone in parallel. As ZFS does not clone all the
MB> blocks, I wonder how much the production database will suffer from
MB> sharing most of the data with the clone (concurrent access vs. caching)

MB> Maybe we need a feature in ZFS to do a full clone (speak: copy all
MB> blocks) inside the pool, if performance is an issue.... just like the
MB> "Quick Copy" vs. "Shadow Image" features on HDS arrays...

I believe you want a clone in a different pool (so on different disks); that way you get separation. The most important problem with two DBs after the current style of clone would be the shared spindles.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Philip Brown writes:
 > Roch wrote:
 > > And, if the load can accommodate a
 > > reorder, to get top per-spindle read-streaming performance,
 > > a cp(1) of the file should do wonders on the layout.
 >
 > but there may not be filesystem space for double the data.
 > Sounds like there is a need for a zfs-defragment-file utility perhaps?
 >
 > Or if you want to be politically cagey about naming choice, perhaps,
 > zfs-seq-read-optimize-file ? :-)

Possibly, or maybe using fcntl? Now the goal is to take a file with scattered blocks and order them into contiguous chunks. So this is contingent on the existence of regions of free contiguous disk space. This will get more difficult as we get close to full on the storage.

-r
Mika Borner writes:
 > > RAID5 is not a "nice" feature when it breaks.
 >
 > Let me correct myself... RAID5 is a "nice" feature for systems without
 > ZFS...
 >
 > > Are huge write caches really an advantage? Or are you talking about
 > > huge write caches with non-volatile storage?
 >
 > Yes, you are right. The huge cache is needed mostly because of poor
 > write performance for RAID5 (of course battery backed)...

Having a certain amount of non-volatile cache is great to speed up the latency of ZIL operations, which directly impacts some application performance.

-r
Does it make sense to solve these problems piece-meal:

* Performance: ZFS algorithms and NVRAM
* Error detection: ZFS checksums
* Error correction: ZFS RAID1 or RAIDZ

Nathanael Burton wrote:
>> If you've got hardware raid-5, why not just run regular (non-raid) pools on
>> top of the raid-5?
>>
>> I wouldn't go back to JBOD. Hardware arrays offer a number of advantages to
>> JBOD: disk microcode management; optimized access to storage; large write
>> caches; RAID computation can be done in specialized hardware; SAN-based
>> hardware products allow sharing of storage among multiple hosts. This
>> allows storage to be utilized more effectively.
>
> I'm a little confused by the first poster's message as well, but you lose some
> benefits of ZFS if you don't create your pools with either RAID1 or RAIDZ, such
> as data corruption detection. The array isn't going to detect that because all
> it knows about are blocks.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
On Tue, 2006-06-27 at 04:19, Mika Borner wrote:
> I'm thinking about a ZFS version of this task. Requirements: the
> production database should not suffer from performance degradation,
> whilst running the clone in parallel. As ZFS does not clone all the
> blocks, I wonder how much the production database will suffer from
> sharing most of the data with the clone (concurrent access vs. caching)

Given that ZFS always does copy-on-write for any updates, it's not clear why this would necessarily degrade performance.

> Maybe we need a feature in ZFS to do a full clone (speak: copy all
> blocks) inside the pool, if performance is an issue.... just like the
> "Quick Copy" vs. "Shadow Image" features on HDS arrays...

It seems to me that the main reason you'd need to do a full copy would be to get the clone and production on different sets of disks so their access patterns don't end up fighting. For ZFS that requires having separate pools; if they're in the same pool, sharing the unchanged blocks should only help performance.

If you want a full copy you can use zfs send / zfs receive -- either within the same pool or between two different pools.
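A minimal sketch of that full-copy approach, assuming a second pool named tank2 on separate spindles (the pool and dataset names here are hypothetical):

    zfs snapshot tank/proddb@eom
    zfs send tank/proddb@eom | zfs receive tank2/proddb-eom   # full, physically separate copy
    zfs destroy tank/proddb@eom                               # optionally drop the source snapshot afterwards

Unlike zfs clone, the received copy shares no blocks with the original, so the two databases no longer compete for the same spindles - at the cost of the space and the copy time.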
> given that zfs always does copy-on-write for any updates, it's not clear
> why this would necessarily degrade performance..

Writing should be no problem, as it is serialized... but when both database instances are reading a lot of different blocks at the same time, the spindles might "heat up".

> If you want a full copy you can use zfs send/zfs receive -- either
> within the same pool or between two different pools.

OK. But then again, it might be necessary to throttle zfs send/receive replication between pools. Otherwise the replication process might influence the production environment's performance too much. Or is there already some kind of prioritization that I have overlooked?

//Mika

# mv Disclaimer.txt /dev/null
Mika Borner writes:
 > Writing should be no problem, as it is serialized... but when both
 > database instances are reading a lot of different blocks at the same
 > time, the spindles might "heat up".
 >
 > OK. But then again, it might be necessary to throttle zfs send/receive
 > replication between pools. Otherwise the replication process might
 > influence the production environment's performance too much. Or is there
 > already some kind of prioritization that I have overlooked?

I think this is heading toward 'quotas and reservations' for IOPS. Sounds like something that would be very useful. I don't know if this is planned.

-r
Yes, but the idea of using software RAID on a large server doesn't make sense on modern systems. If you've got a large database server that runs a large Oracle instance, using CPU cycles for RAID is counterproductive. Add to that the need to manage the hardware directly (drive microcode, drive brownouts/restarts, etc.) and the idea of using JBOD on modern systems starts to lose value in a big way.

You will detect any corruption when doing a scrub. It's not end-to-end, but it's no worse than today with VxVM.

On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote:

> I'm a little confused by the first poster's message as well, but
> you lose some benefits of ZFS if you don't create your pools with
> either RAID1 or RAIDZ, such as data corruption detection. The
> array isn't going to detect that because all it knows about are
> blocks.
>
> -Nate

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad block issue in most cases.

It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged.

On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:

> True for total drive failure, but there are more failure modes
> than that. With hardware RAID, there is no way for the RAID controller
> to know which block was bad, and therefore cannot repair the block.
> With RAID-Z, we have the integrated checksum and can do combinatorial
> analysis to know not only which drive was bad, but what the data
> _should_ be, and can repair it to prevent more corruption in the future.
>
> - Eric

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
So everything you are saying seems to suggest you think ZFS was a waste of engineering time, since hardware RAID solves all the problems?

I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware RAID systems I've dealt with.

--
Darren J Moffat
I don't like to top-post, but there's no better way right now. This issue has recurred several times and there have been no answers to it that cover the bases. The question is: say I as a customer have a database, let's say it's around 8 TB, all built on a series of high-end storage arrays that _don't_ support the JBOD everyone seems to want - what is the preferred configuration for my storage arrays to present LUNs to the OS for ZFS to consume?

Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - that spans the breadth of about as good as it gets. What should I as a customer do? Should I create RAID0 sets and let ZFS self-heal via its own mirroring or RAID-Z when a disk blows in the set? Should I use RAID1 and eat the disk space used? RAID5 and be thankful I have a large write cache - and then which type of ZFS pool should I create over it?

See, telling folks "you should just use JBOD" when they don't have JBOD and have invested millions to get to the state they're in, where they're efficiently utilizing their storage via a SAN infrastructure, is just plain one big waste of everyone's time. Shouting down the advantages of storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So, I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage arrays? Let's at least get that conversation moving...

 - Pete

Gregory Shaw wrote:
> Yes, but the idea of using software raid on a large server doesn't make
> sense in modern systems. If you've got a large database server that
> runs a large oracle instance, using CPU cycles for RAID is counter
> productive. Add to that the need to manage the hardware directly (drive
> microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in
> modern systems starts to lose value in a big way.
>
> You will detect any corruption when doing a scrub. It's not end-to-end,
> but it's no worse than today with VxVM.
Bart Smaalders wrote:
> Gregory Shaw wrote:
>> If you've got requirements for surviving an array failure, the
>> recommended solution in that case is to mirror between volumes on
>> multiple arrays. I've always liked software raid (mirroring) in that
>> case, as no manual intervention is needed in the event of an array
>> failure. Mirroring between discrete arrays is usually reserved for
>> mission-critical applications that cost thousands of dollars per hour in
>> downtime.
>
> In other words, it won't. You've spent the disk space, but
> because you're mirroring in the wrong place (the raid array)
> all ZFS can do is tell you that your data is gone. With luck,
> subsequent reads _might_ get the right data, but maybe not.

Careful here when you say "wrong place". There are many scenarios where mirroring in the hardware is the correct way to go, even when running ZFS on top of it.
Peter Rival wrote:> storage arrays with the same arguments over and over without providing > an answer to the customer problem doesn''t do anyone any good. So. I''ll > restate the question. I have a 10TB database that''s spread over 20 > storage arrays that I''d like to migrate to ZFS. How should I configure > the storage array? Let''s at least get that conversation moving...I''ll answer your question with more questions: What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ? What of that doesn''t work for you ? What functionality of ZFS is it that you want to leverage ? -- Darren J Moffat
Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums.

But ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection system (RAIDZ or RAID1) in order to correct errors it detects.

Gregory Shaw wrote:
> Most controllers support a background-scrub that will read a volume and
> repair any bad stripes. This addresses the bad block issue in most cases.
>
> It still doesn't help when a double-failure occurs. Luckily, that's
> very rare. Usually, in that case, you need to evacuate the volume and
> try to restore what was damaged.

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Peter Rival wrote:
> See, telling folks "you should just use JBOD" when they don't have JBOD
> and have invested millions to get to the state they're in where they're
> efficiently utilizing their storage via a SAN infrastructure is just
> plain one big waste of everyone's time. [...] So. I'll
> restate the question. I have a 10TB database that's spread over 20
> storage arrays that I'd like to migrate to ZFS. How should I configure
> the storage array? Let's at least get that conversation moving...

In general, I'd say that if the storage has battery-backed cache, use RAID5 on the storage device - limit the amount of redundant data, but improve performance and achieve some data protection in fast special-purpose hardware.

Just my $.02.

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Peter Rival wrote:
> I don't like to top-post, but there's no better way right now. This
> issue has recurred several times and there have been no answers to it
> that cover the bases. The question is, say I as a customer have a
> database, let's say it's around 8 TB, all built on a series of high-end
> storage arrays that _don't_ support the JBOD everyone seems to want -
> what is the preferred configuration for my storage arrays to present
> LUNs to the OS for ZFS to consume?
>
> Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 -
> that spans the breadth of about as good as it gets. What should I as a
> customer do? Should I create RAID0 sets and let ZFS self-heal via its
> own mirroring or RAIDZ when a disk blows in the set? Should I use RAID1
> and eat the disk space used? RAID5 and be thankful I have a large write
> cache - and then which type of ZFS pool should I create over it?

The only use I see for RAID-0 is when you are configuring your competitor's systems. Real friends don't let friends use RAID-0.

For most modern arrays, RAID-5 works pretty well wrt performance. While not quite as good as RAID-1+0, most people are OK with RAID-5. s/-5/-6/g

> See, telling folks "you should just use JBOD" when they don't have JBOD
> [...] I have a 10TB database that's spread over 20
> storage arrays that I'd like to migrate to ZFS. How should I configure
> the storage array? Let's at least get that conversation moving...

It almost always boils down to how much money you have to spend. Since I'm a RAS guy, I prefer multiple ZFS RAID-1 mirrors over RAID-1 LUNs with hot spares and multiple-kilometer separation with multiple data paths between them. After I win the lottery, I might be able to afford that :-).

More applicable guidance would be to use the best redundancy closest to the context of the data first, and work down the stack from there. This philosophy will give you the best fault detection and recovery. Having the applications themselves provide such redundancy is best, but very uncommon. Next in the stack is the file system, where RAID-1 and RAID-Z[2] can help. Finally, the hardware RAID. This begs for a performability analysis [*], which is on my plate once things settle down a bit.

[*] Does anyone know what performability analysis is? I'd be happy to post some info on how we do that at Sun.

-- richard
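A minimal sketch of that "redundancy closest to the data" guidance applied to array LUNs, assuming two RAID-protected LUNs exported from two different arrays and visible to the host as c2t0d0 and c3t0d0 (the device and pool names are hypothetical):

    zpool create dbpool mirror c2t0d0 c3t0d0   # ZFS mirror spanning the two arrays
    zpool status dbpool                        # ZFS can now detect and repair bad blocks

With this layout, ZFS holds redundancy at the filesystem level, so a checksum failure on one array's LUN can be healed from the copy on the other, while the arrays still provide their cache and rebuild behavior underneath.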
Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality.

However, I don't see a lot of use for RAID-Z (or Z2) in large enterprise customer situations. For instance, does ZFS enable Sun to walk into an account and say "You can now replace all of your high-end (EMC) disk with JBOD."? I don't think many customers would bite on that.

RAID-Z is an excellent feature; however, it doesn't address many of the reasons for using high-end arrays:

- Exporting snapshots to alternate systems (for live database or backup purposes)
- Remote replication
- Sharing of storage among multiple systems (LUN masking and equivalent)
- Storage management (migration between tiers of storage)
- No-downtime failure replacement (the system doesn't even know)
- Clustering

I know that ZFS is still a work in progress, so some of the above may arrive in future versions of the product.

I see the RAID-Z[2] value in small-to-mid size systems where the storage is relatively small and you don't have high availability requirements.

On Jun 27, 2006, at 8:48 AM, Darren J Moffat wrote:

> So everything you are saying seems to suggest you think ZFS was a
> waste of engineering time since hardware raid solves all the
> problems?
>
> I don't believe it does but I'm no storage expert and maybe I've
> drunk too much Kool-Aid. I'm a software person and for me ZFS is
> brilliant; it is so much easier than managing any of the hardware
> raid systems I've dealt with.
>
> --
> Darren J Moffat

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store the data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question, you can't anticipate all issues, and when a brownout or other ugly loss-of-service occurs, you may or may not be intact, ZFS or no.

I've never seen a product that can deal with all possible situations.

On Jun 27, 2006, at 9:01 AM, Jeff Victor wrote:

> Unfortunately, a storage-based RAID controller cannot detect errors
> which occurred between the filesystem layer and the RAID controller,
> in either direction - in or out. ZFS will detect them through its use
> of checksums.
>
> But ZFS can only fix them if it can access redundant bits. It can't
> tell a storage device to provide the redundant bits, so it must use
> its own data protection system (RAIDZ or RAID1) in order to correct
> errors it detects.

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
> This is getting pretty picky. You're saying that ZFS will detect any
> errors introduced after ZFS has gotten the data. However, as stated
> in a previous post, that doesn't guarantee that the data given to ZFS
> wasn't already corrupted.

But there's a big difference between the time ZFS gets the data and the time your typical storage system gets it. And your typical storage system does not store any information which allows it to detect all but the most simple errors.

Storage systems are complicated and have many failure modes at many different levels:

- disks not writing data, or writing data to an incorrect location
- disks not reporting failures when they occur
- bit errors in disk write buffers causing data corruption
- storage array software with bugs
- storage arrays with undetected hardware errors
- data corruption in the path (such as switches which mangle packets but keep the TCP checksum working)

> If you don't trust your storage subsystem, you're going to encounter
> issues regardless of the software used to store the data. We'll have to
> see if ZFS can 'save' customers in this situation. I've found that
> regardless of the storage solution in question you can't anticipate
> all issues and when a brownout or other ugly loss-of-service occurs,
> you may or may not be intact, ZFS or no.
>
> I've never seen a product that can deal with all possible situations.

ZFS attempts to deal with more problems than any of the currently existing solutions by giving end-to-end verification of the data.

One of the reasons why ZFS was created was a particular large customer who had data corruption which occurred two years (!) before it was detected. The bad data had migrated and the corruption had spread; the good data was no longer available on backups (which weren't very relevant anyway after such a long time).

ZFS tries to give one important guarantee: if the data is bad, we will not return it.

One case in point is the person in MPK with a SATA controller which corrupts memory; he didn't discover this using UFS (except for perhaps a few strange events he noticed). After switching to ZFS he started to find corruption, so now he uses a self-healing ZFS mirror (or RAIDZ). ZFS helps at the low end as much as it does at the high end.

I'll bet that ZFS will generate more calls about broken hardware, and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along.

Casper
Gregory Shaw wrote:
> Yes, but the idea of using software raid on a large server doesn't make
> sense in modern systems. If you've got a large database server that
> runs a large oracle instance, using CPU cycles for RAID is counter
> productive. Add to that the need to manage the hardware directly (drive
> microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in
> modern systems starts to lose value in a big way.
>
> You will detect any corruption when doing a scrub. It's not end-to-end,
> but it's no worse than today with VxVM.

Yes, but we're trying to be better than VxVM. The end-to-end guarantee that
ZFS offers is one of, if not the, primary attractions to using it in the
first place.

CPU cycles are cheap these days. In the era of sub-1GHz single-core/chip
systems, yes, those XOR calculations for software RAID were expensive. Now,
not so much, I think, as that problem has been solved by brute force.

When using ZFS in a storage network, I'm envisioning the arrays being a
hybrid between a JBOD and a full-fledged hardware RAID5 with tons o' cache.
As far as I'm concerned, the traditional RAID features on an array do not
offer me much when using those LUNs with ZFS; I lose that end-to-end
guarantee. But the arrays are still useful from a performance perspective
because of their caching, LUN management, and FC-related abilities, which
are things a JBOD largely lacks.

/dale
Gregory Shaw wrote:
> Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality.

Agreed.

> However, I don't see a lot of use for RAID-Z (or Z2) in large
> enterprise customers' situations. For instance, does ZFS enable Sun to
> walk into an account and say "You can now replace all of your high-end
> (EMC) disk with JBOD."? I don't think many customers would bite on that.

I don't see this happening, for organizational reasons more than technical
reasons -- the folks who manage storage are usually different than the
folks who specify OSes. Rather, I think they complement each other.

More interesting is the entrepreneur who builds a storage array using ZFS
in the back end. Leveraging ZFS could save a lot of feature development
work.

> RAID-Z is an excellent feature, however, it doesn't address many of the
> reasons for using high-end arrays:
>
> - Exporting snapshots to alternate systems (for live database or backup
>   purposes)
> - Remote replication
> - Sharing of storage among multiple systems (LUN masking and equivalent)
> - Storage management (migration between tiers of storage)
> - No-downtime failure replacement (the system doesn't even know)
> - Clustering

This list is beyond the scope of ZFS itself. I could see ZFS playing a part
in such solutions; Sun Cluster will support it, for example.

> I know that ZFS is still a work in progress, so some of the above may
> arrive in future versions of the product.
>
> I see the RAID-Z[2] value in small-to-mid size systems where the storage
> is relatively small and you don't have high availability requirements.

I see it being very applicable to any high availability requirement. If you
think of availability as a continuum, it gets you a little bit closer to
perfect availability no matter what else is in the system.

 -- richard
Darren J Moffat wrote:
> Peter Rival wrote:
>
>> storage arrays with the same arguments over and over without
>> providing an answer to the customer problem doesn't do anyone any
>> good. So. I'll restate the question. I have a 10TB database that's
>> spread over 20 storage arrays that I'd like to migrate to ZFS. How
>> should I configure the storage array? Let's at least get that
>> conversation moving...
>
> I'll answer your question with more questions:
>
> What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ?
>
> What of that doesn't work for you ?
>
> What functionality of ZFS is it that you want to leverage ?

It seems that the big thing we all want (relative to the discussion of
moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to
reliably detect that a given block is bad, and have ZFS compensate). Now,
how do we get this when using HW arrays, without just treating them like
JBODs (which is impractical for large SAN and similar arrays that are
already configured)?

Since the best way to get this is to use a mirror or RAIDZ vdev, I'm
assuming that the proper way to get benefits from both ZFS and HW RAID is
the following:

(1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror hwStripe1
    hwStripe2"
(2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1
    hwMirror2"
(3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1
    hwStripe2"

Mirrors of mirrors and RAIDZ of RAID5 are also possible, but I'm pretty
sure they're considerably less useful than the 3 above.

Personally, I can't think of a good reason to use ZFS with HW RAID5; case
(3) above seems to me to provide better performance with roughly the same
amount of redundancy (not quite true, but close).

I'd vote for (1) if you need high performance, at the cost of disk space,
(2) for maximum redundancy, and (3) as maximum space with reasonable
performance.

I'm making a couple of assumptions here:

(a) you have the spare cycles on your hosts to allow for using ZFS RAIDZ,
    which is a non-trivial cost (though not that big, folks).
(b) your HW RAID controller uses NVRAM (or battery-backed cache), which
    you'd like to be able to use to speed up writes
(c) your HW RAID's NVRAM speeds up ALL writes, regardless of the
    configuration of arrays in the HW
(d) having your HW controller present individual disks to the machines is
    a royal pain (way too many, the HW does other nice things with arrays,
    etc.)

Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
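For concreteness, a rough sketch of the three layouts above, using the
placeholder LUN names from the message (hwStripe*, hwMirror*) rather than
real device paths; pick one of the three, each creates a pool named "tank":

  # (1) ZFS mirror across two hardware stripes
  zpool create tank mirror hwStripe1 hwStripe2

  # (2) ZFS RAID-Z across hardware mirrors (three or more is typical)
  zpool create tank raidz hwMirror1 hwMirror2 hwMirror3

  # (3) ZFS RAID-Z across hardware stripes
  zpool create tank raidz hwStripe1 hwStripe2 hwStripe3

  # sanity-check the resulting layout
  zpool status tank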
One of the key points here is that people seem focused on two types of
errors:

1. Total drive failure
2. Bit rot

Traditional RAID solves #1. Reed-Solomon ECC found in all modern drives
solves #2 for all but the most extreme cases. The real problem is the
rising complexity of firmware in modern drives and the reality of software
bugs. Misdirected reads and writes and phantom writes are all real
phenomena, and while more prevalent in SATA and commodity drives, are by no
means restricted to the low end. This type of corruption happens
everywhere, and results in corruption that is undetectable by drive
firmware. We've seen these failures in SCSI, FC, and SATA drives. At a
large storage company, a common story related to us was that they would see
approximately one silently corrupted block per 9 TB of storage (on high-end
FC drives).

As mentioned previously, traditional RAID can detect these failures, but
cannot repair the damaged data. Also, as pointed out previously, ZFS can
detect failures in the entire data path, up to the point where it reaches
main memory (at which point FMA takes over). Once again, bad switches,
cables, and drivers are a reality of life.

There will always be a tradeoff between hardware RAID and RAID-Z. But
saying that RAID-Z provides no discernible benefit over hardware RAID is a
lie, and has been disproven time and again by its ability to detect and
correct otherwise silent data corruption, even on top of hardware RAID. You
are welcome to argue that people will make a judgement call and choose
performance/familiarity over RAID-Z in the datacenter, but that is a matter
of opinion that can only be settled by watching the evolution of ZFS
deployment over the next five years.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote:
> This is getting pretty picky. You're saying that ZFS will detect any
> errors introduced after ZFS has gotten the data. However, as stated
> in a previous post, that doesn't guarantee that the data given to ZFS
> wasn't already corrupted.

There will always be some place where errors can be introduced and go
undetected. But some parts of the system are more error prone than others,
and ZFS targets the most error prone of them: rotating rust.

For the rest, make sure you have ECC memory and that you're using secure
NFS (with krb5i or krb5p), and the probability of undetectable data
corruption errors should be much closer to zero than what you'd get with
other systems.

That said, there's a proposal to add end-to-end data checksumming to NFSv4
(see the IETF NFSv4 WG list archives). That proposal can't protect
meta-data, and it doesn't remove any one type of data corruption error on
the client side, but it does on the server side.

Nico
--
On 6/27/06, Erik Trimble <Erik.Trimble at sun.com> wrote:> Darren J Moffat wrote: > > > Peter Rival wrote: > > > >> storage arrays with the same arguments over and over without > >> providing an answer to the customer problem doesn''t do anyone any > >> good. So. I''ll restate the question. I have a 10TB database that''s > >> spread over 20 storage arrays that I''d like to migrate to ZFS. How > >> should I configure the storage array? Let''s at least get that > >> conversation moving... > > > > > > I''ll answer your question with more questions: > > > > What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ? > > > > What of that doesn''t work for you ? > > > > What functionality of ZFS is it that you want to leverage ? > > > It seems that the big thing we all want (relative to the discussion of > moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to > reliabily detect that a given block is bad, and have ZFS compensate). > Now, how do we get things when using HW arrays, and not just treat them > like JBODs (which is impractical for large SAN and similar arrays that > are already configured). > > Since the best way to get this is to use a Mirror or RAIDZ vdev, I''m > assuming that the proper way to get benefits from both ZFS and HW RAID > is the following: > > (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror > hwStripe1 hwStripe2" > (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1, > hwMirror2" > (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1, > hwStripe2" > > mirrors of mirrors and raidz of raid5 is also possible, but I''m pretty > sure they''re considerably less useful than the 3 above. > > Personally, I can''t think of a good reason to use ZFS with HW RAID5; > case (3) above seems to me to provide better performance with roughly > the same amount of redundancy (not quite true, but close). > > I''d vote for (1) if you need high performance, at the cost of disk > space, (2) for maximum redundancy, and (3) as maximum space with > reasonable performance. > > > I''m making a couple of assumptions here: > > (a) you have the spare cycles on your hosts to allow for using ZFS > RAIDZ, which is a non-trivial cost (though not that big, folks). > (b) your HW RAID controller uses NVRAM (or battery-backed cache), which > you''d like to be able to use to speed up writes > (c) you HW RAID''s NVRAM speeds up ALL writes, regardless of the > configuration of arrays in the HW > (d) having your HW controller present individual disks to the machines > is a royal pain (way too many, the HW does other nice things with > arrays, etc) > >The case for HW RAID 5 with ZFS is easy: when you use iscsi. You get major performance degradation over iscsi when trying to coordinate writes and reads serially over iscsi using RAIDZ. The sweet spot in the iscsi world is let your targets do RAID5 or whatnot (RAID10, RAID50, RAID6), and combine those into ZFS pools, mirrored or not. There are other benefits to ZFS, including snapshots, easily managed storage pools, and with iscsi, ease of switching head nodes with a simple export/import.> > Erik Trimble > Java System Support > Mailstop: usca14-102 > Phone: x17195 > Santa Clara, CA > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
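A rough sketch of the iSCSI approach described above, assuming two
hypothetical LUNs (the c2t0d0/c3t0d0 names are placeholders for whatever
the initiator presents) that are each a hardware RAID-5 volume on the
target side; the pool mirrors them, and export/import is what moves the
pool between head nodes:

  # mirror two hardware-RAID5 iSCSI LUNs into one pool
  zpool create tank mirror c2t0d0 c3t0d0

  # to switch head nodes: export on the old host ...
  zpool export tank

  # ... and import on the new host
  zpool import tank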
Casper.Dik at Sun.COM wrote:
>> That's the dilemma, the array provides nice features like RAID1 and
>> RAID5, but those are of no real use when using ZFS.
>
> RAID5 is not a "nice" feature when it breaks.
>
> A RAID controller cannot guarantee that all bits of a RAID5 stripe
> are written when power fails; then you have data corruption and no
> one can tell you what data was corrupted. ZFS RAIDZ can.

That depends on the RAID controller. Some implementations use a log *and* a
battery backup. In some cases the battery is an embedded UPS of sorts, to
make sure the power stays up long enough to take the writes from the host
and get them to disk.
Darren J Moffat wrote:
> So everything you are saying seems to suggest you think ZFS was a
> waste of engineering time since hardware raid solves all the problems ?
>
> I don't believe it does but I'm no storage expert and maybe I've drank
> too much cool aid. I'm a software person and for me ZFS is brilliant; it
> is so much easier than managing any of the hardware raid systems I've
> dealt with.

ZFS is great....for the systems that can run it. However, any enterprise
datacenter is going to be made up of many, many hosts running many
different OSes. In that world you're going to consolidate on large arrays
and use the features of those arrays where they cover the most ground. For
example, if I've got 100 hosts all running different OSes and apps, and I
can perform my data replication and redundancy algorithms (in most cases
RAID) in one spot, then it will be much more cost efficient to do it there.
Your example would prove more effective if you added, "I''ve got ten databases. Five on AIX, Five on Solaris 8...." Peter Rival wrote:> I don''t like to top-post, but there''s no better way right now. This > issue has recurred several times and there have been no answers to it > that cover the bases. The question is, say I as a customer have a > database, let''s say it''s around 8 TB, all built on a series of high > end storage arrays that _don''t_ support the JBOD everyone seems to > want - what is the preferred configuration for my storage arrays to > present LUNs to the OS for ZFS to consume? > > Let''s say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - > that spans the breadth of about as good as it gets. What should I as > a customer do? Should I create RAID0 sets and let ZFS self-heal via > its own mirroring or RAIDZ when a disk blows in the set? Should I use > RAID1 and eat the disk space used? RAID5 and be thankful I have a > large write cache - and then which type of ZFS pool should I create > over it? > > See, telling folks "you should just use JBOD" when they don''t have > JBOD and have invested millions to get to state they''re in where > they''re efficiently utilizing their storage via a SAN infrastructure > is just plain one big waste of everyone''s time. Shouting down the > advantages of storage arrays with the same arguments over and over > without providing an answer to the customer problem doesn''t do anyone > any good. So. I''ll restate the question. I have a 10TB database > that''s spread over 20 storage arrays that I''d like to migrate to ZFS. > How should I configure the storage array? Let''s at least get that > conversation moving... > > - Pete > > Gregory Shaw wrote: >> Yes, but the idea of using software raid on a large server doesn''t >> make sense in modern systems. If you''ve got a large database server >> that runs a large oracle instance, using CPU cycles for RAID is >> counter productive. Add to that the need to manage the hardware >> directly (drive microcode, drive brownouts/restarts, etc.) and the >> idea of using JBOD in modern systems starts to lose value in a big way. >> >> You will detect any corruption when doing a scrub. It''s not >> end-to-end, but it''s no worse than today with VxVM. >> >> On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote: >> >>>> If you''ve got hardware raid-5, why not just run >>>> regular (non-raid) >>>> pools on top of the raid-5? >>>> >>>> I wouldn''t go back to JBOD. Hardware arrays offer a >>>> number of >>>> advantages to JBOD: >>>> - disk microcode management >>>> - optimized access to storage >>>> - large write caches >>>> - RAID computation can be done in specialized >>>> d hardware >>>> - SAN-based hardware products allow sharing of >>>> f storage among >>>> multiple hosts. This allows storage to be utilized >>>> more effectively. >>>> >>> >>> I''m a little confused by the first poster''s message as well, but you >>> lose some benefits of ZFS if you don''t create your pools with either >>> RAID1 or RAIDZ, such as data corruption detection. The array isn''t >>> going to detect that because all it knows about are blocks. >>> >>> -Nate >>> >>> >
Casper.Dik at Sun.COM wrote:
>
> I'll bet that ZFS will generate more calls about broken hardware
> and fingers will be pointed at ZFS at first because it's the new
> kid; it will be some time before people realize that the data was
> rotting all along.

Ehhh....I don't think so. Most of our customers have HW arrays that have
been scrubbing data for years and years, as well as apps on top that have
been verifying the data (Oracle, for example). Not to mention there will be
a bit of time before people move over to ZFS in the high end.
Torrey McMahon wrote:
> ZFS is great....for the systems that can run it. However, any enterprise
> datacenter is going to be made up of many many hosts running many many
> OS. In that world you're going to consolidate on large arrays and use
> the features of those arrays where they cover the most ground. For
> example, if I've 100 hosts all running different OS and apps and I can
> perform my data replication and redundancy algorithms, in most cases
> Raid, in one spot then it will be much more cost efficient to do it there.

Exactly what I'm pondering. In the near to mid term, Solaris with ZFS can
be seen as a sort of storage virtualizer: it takes disks into ZFS pools and
volumes and then presents them to other hosts and OSes via iSCSI, NFS, SMB
and so on. At that point, those other OSes can enjoy the benefits of ZFS.

In the long term, it would be nice to see ZFS (or its concepts) integrated
as the LUN provisioning and backing-store mechanism on hardware RAID arrays
themselves, supplanting the traditional RAID paradigms that have been in
use for years.

/dale
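A minimal sketch of that near-term "storage virtualizer" role, assuming a
pool named tank and hypothetical dataset names; the NFS side uses the
standard sharenfs property, while exporting the volume to other hosts
(e.g. over iSCSI) would depend on whatever target software sits in front
of it:

  # a filesystem served to other hosts over NFS
  zfs create tank/export
  zfs set sharenfs=on tank/export

  # a raw volume (zvol) that could back another host's storage
  zfs create -V 100g tank/vol01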
Torrey McMahon wrote:
> Casper.Dik at Sun.COM wrote:
>>
>> I'll bet that ZFS will generate more calls about broken hardware
>> and fingers will be pointed at ZFS at first because it's the new
>> kid; it will be some time before people realize that the data was
>> rotting all along.
>
> Ehhh....I don't think so. Most of our customers have HW arrays that
> have been scrubbing data for years and years as well as apps on the
> top that have been verifying the data. (Oracle for example.) Not to
> mention there will be a bit of time before people move over to ZFS in
> the high end.

Ahh... but there is the rub. Today, you/we don't *really* know, do we?
Maybe there are bad juju blocks, maybe not. Running ZFS, whether in a
redundant vdev or not, will certainly turn the big spotlight on and give us
the data that checksums matched, or they didn't. And if we are in redundant
vdevs, hey - we'll fix it. If not, well, we are certainly no worse off than
with today's filesystems, but at least we'll know the bad juju is there.

How do the numbers of checksum mismatches compare across different
types/vendors/costs of storage subsystems? SLAs based on the number of bad
checksums? Price cuts on storage that routinely gives back data with bad
checksums? Now, that is what will be interesting to me to see....

ZFS, the DTrace of storage - no more guessing, just data.

/jason
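For reference, a minimal sketch of the commands that surface those mismatch
counts, assuming a pool named tank; the per-device CKSUM column in the
status output is where the mismatches show up:

  # kick off a scrub of the whole pool
  zpool scrub tank

  # check progress and the per-vdev READ/WRITE/CKSUM error counters
  zpool status -v tank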
Jason Schroeder wrote:> Torrey McMahon wrote: > >> Casper.Dik at Sun.COM wrote: >> >>> >>> I''ll bet that ZFS will generate more calls about broken hardware >>> and fingers will be pointed at ZFS at first because it''s the new >>> kid; it will be some time before people realize that the data was >>> rotting all along. >> >> >> >> Ehhh....I don''t think so. Most of our customers have HW arrays that >> have been scrubbing data for years and years as well as apps on the >> top that have been verifying the data. (Oracle for example.) Not to >> mention there will be a bit of time before people move over to ZFS in >> the high end. >> > > Ahh... but there is the rub. Today - you/we don''t *really* know, do > we? Maybe there''s bad juju blocks, maybe not. Running ZFS, whether > in a redundant vdev or not, will certainly turn the big spotlight on > and give us the data that checksums matched, or they didn''t.A spotlight on what? How is that data going to get into ZFS? The more I think about this more I realize it''s going to do little for existing data sets. You''re going to have to migrate that data from "filesystem X" into ZFS first. From that point on ZFS has no idea if the data was bad to begin with. If you can do an in place migration then you might be able to weed out some bad physical blocks/drives over time but I assert that the current disk scrubbing methodologies catch most of those. Yes, it''s great for new data sets where you started with ZFS. Sorry if I sound like I''m raining on the parade here folks. That''s not the case, really, and I''m all for the great new features and EAU ZFS gives where applicable.
Nicolas Williams wrote:> On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote: >> This is getting pretty picky. You''re saying that ZFS will detect any >> errors introduced after ZFS has gotten the data. However, as stated >> in a previous post, that doesn''t guarantee that the data given to ZFS >> wasn''t already corrupted. > > There will always be some place where errors can be introduced and go on > undetected. But some parts of the system are more error prone than > others, and ZFS targets the most error prone of them: rotating rust. > > For the rest, make sure you have ECC memory, that you''re using secure > NFS (with krb5i or krb5p), and the probability of undetectable data > corruption errors should be much closer to zero than what you''d get with > other systems.Another alternative is using IPsec with just AH. For the benefit of those outside of Sun MPK17 both krb5i and IPsec AH were used to diagnose and prove that we have a faulty router in a lab that was causing very strange build errors. TCP/IP alone didn''t catch the problems and sometimes they showed up with SCCS simple checksums and sometimes we had compile errors. -- Darren J Moffat
Torrey McMahon wrote:> Darren J Moffat wrote: >> So everything you are saying seems to suggest you think ZFS was a >> waste of engineering time since hardware raid solves all the problems ? >> >> I don''t believe it does but I''m no storage expert and maybe I''ve drank >> too much cool aid. I''m software person and for me ZFS is brilliant it >> is so much easier than managing any of the hardware raid systems I''ve >> dealt with. > > > ZFS is great....for the systems that can run it. However, any enterprise > datacenter is going to be made up of many many hosts running many many > OS. In that world you''re going to consolidate on large arrays and use > the features of those arrays where they cover the most ground. For > example, if I''ve 100 hosts all running different OS and apps and I can > perform my data replication and redundancy algorithms, in most cases > Raid, in one spot then it will be much more cost efficient to do it there.but you still need a local file system on those systems in many cases. So back to where we started I guess, how to effectively use ZFS to benefit Solaris (and the other platforms it gets ported to) while still using Hardware RAID because you have no choice but to use it. -- Darren J Moffat
On Tue, 27 Jun 2006 Casper.Dik at sun.com wrote:
>
> >This is getting pretty picky. You're saying that ZFS will detect any
> >errors introduced after ZFS has gotten the data. However, as stated
> >in a previous post, that doesn't guarantee that the data given to ZFS
> >wasn't already corrupted.
>
> But there's a big difference between the time ZFS gets the data
> and the time your typical storage system gets it.
>
> And your typical storage system does not store any information which
> allows it to detect all but the most simple errors.
>
> Storage systems are complicated and have many failure modes at many
> different levels.
>
> - disks not writing data or writing data in incorrect location
> - disks not reporting failures when they occur
> - bit errors in disk write buffers causing data corruption
> - storage array software with bugs

Case in point: there was a gentleman who posted on the Yahoo Groups solx86
list and described how faulty firmware on a Hitachi HDS system damaged a
bunch of data. The HDS system moves disk blocks around, between one disk
and another, in the background, to optimize the filesystem layout. Long
after he had written data, blocks from one data set were intermingled with
blocks from other data sets/files, causing extensive data corruption. I
know this is a simplistic explanation (and perhaps technically inaccurate)
of the exact failure mode - but the effect was that a lot of data was
silently corrupted and went undiscovered for several days.

.... snip .....

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
             OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, 27 Jun 2006, Gregory Shaw wrote:> Yes, but the idea of using software raid on a large server doesn''t > make sense in modern systems. If you''ve got a large database server > that runs a large oracle instance, using CPU cycles for RAID is > counter productive. Add to that the need to manage the hardware > directly (drive microcode, drive brownouts/restarts, etc.) and the > idea of using JBOD in modern systems starts to lose value in a big way. > > You will detect any corruption when doing a scrub. It''s not end-to- > end, but it''s no worse than today with VxVM.The initial impression I got, after reading the original post, is that its author[1] does not grok something fundamental about ZFS and/or how it works! Or does not understand that there are many CPU cycles in a modern Unix box that are never taken advantage of. It''s clear to me that ZFS provides considerable, never before available, features and facilities, and that any impact that ZFS may have on CPU or memory utilization will become meaningless over time, as the # of CPU cores increase - along with their performance. And that average system memory size will continue to increase over time. Perhaps the author is taking a narrow view that ZFS will *replace* existing systems. I don''t think that this will be the general case. Especially in a large organization where politics and turf wars will dominate any "technical" discussions and implementation decisions will be made by senior management who are 100% buzzword compliant (and have questionable technical/engineering skills). Rather it will provide the system designer with a hugely powerful *new* tool to apply in system design. And will challenge the designer to use it creatively and effectively. There is no such thing as the universal screw-driver. Every toolbox has tens of screwdrivers and tool designers will continue to dream about replacing them all with _one_ tool. [1] Sorry Gregory. Regards, Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Al Hopper wrote:
> On Tue, 27 Jun 2006, Gregory Shaw wrote:
>
>> Yes, but the idea of using software raid on a large server doesn't
>> make sense in modern systems. If you've got a large database server
>> that runs a large oracle instance, using CPU cycles for RAID is
>> counter productive. Add to that the need to manage the hardware
>> directly (drive microcode, drive brownouts/restarts, etc.) and the
>> idea of using JBOD in modern systems starts to lose value in a big way.
>>
>> You will detect any corruption when doing a scrub. It's not end-to-
>> end, but it's no worse than today with VxVM.
>
> The initial impression I got, after reading the original post, is that its
> author[1] does not grok something fundamental about ZFS and/or how it
> works! Or does not understand that there are many CPU cycles in a modern
> Unix box that are never taken advantage of.

Just because there are idle CPU cycles does not mean it is OK for the
operating system to use them. If there is a valid reason for the OS to
consume those cycles then that is fine. But every cycle that the OS
consumes is one less cycle that is available for the customer apps (be it
Oracle or whatever, and I spend a lot of my time trying to squeeze those
cycles out of high-end systems). The job of the operating system is to get
the hell out of the way as quickly as possible so the user apps can do
their work. That can mean offloading some of the work onto smart arrays. As
someone once said to me, a customer does not buy hardware to run an OS on;
they buy it to accomplish some given piece of work.

> It's clear to me that ZFS provides considerable, never before available,
> features and facilities, and that any impact that ZFS may have on CPU or
> memory utilization will become meaningless over time, as the # of CPU
> cores increase - along with their performance. And that average system
> memory size will continue to increase over time.

This is true, will probably be true forever, and has been going on ever
since the first chip. There has always been demand for more power by the
end users. However, just because we have available cycles does not mean the
OS should consume them.

> Perhaps the author is taking a narrow view that ZFS will *replace*
> existing systems. I don't think that this will be the general case.
> Especially in a large organization where politics and turf wars will
> dominate any "technical" discussions and implementation decisions will be
> made by senior management who are 100% buzzword compliant (and have
> questionable technical/engineering skills). Rather it will provide the
> system designer with a hugely powerful *new* tool to apply in system
> design. And will challenge the designer to use it creatively and
> effectively.

It all depends on your needs. The idea of ZFS providing RAID capabilities
is very appealing for those systems that are desktop units or small
servers. But where we are talking petabyte+ storage with 30+ gig/sec of I/O
bandwidth capacity, I believe we will find the CPUs are going to consume
way too much to handle the I/O rate in such an environment, at which time
the work needs to be offloaded to smart arrays (I have yet to do that
experimentation). You do not buy an 18-wheel tractor trailer to simply move
a lawnmower from job site to job site; you buy an SUV, pickup truck or
trailer. Vice versa, you do not buy a pickup truck to move a tracked
excavator; you use a tractor trailer.

> There is no such thing as the universal screw-driver. Every toolbox has
> tens of screwdrivers and tool designers will continue to dream about
> replacing them all with _one_ tool.

How true. ZFS is one of many tools available. However, the impression I
have been picking up here at various times is that a lot of people view ZFS
as the only tool in the toolbox, so everything looks like a nail because
all you have is a hammer.

If ZFS is providing better data integrity than the current storage arrays,
that sounds to me like an opportunity for the next generation of
intelligent arrays to become better.

Dave Valin

> [1] Sorry Gregory.
>
> Regards,
>
> Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
>            Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
> OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
> OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:> Since the best way to get this is to use a Mirror or RAIDZ vdev, I''m > assuming that the proper way to get benefits from both ZFS and HW RAID > is the following: > > (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror > hwStripe1 hwStripe2" > (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1, > hwMirror2" > (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1, > hwStripe2" > > mirrors of mirrors and raidz of raid5 is also possible, but I''m pretty > sure they''re considerably less useful than the 3 above. > > Personally, I can''t think of a good reason to use ZFS with HW RAID5; > case (3) above seems to me to provide better performance with roughly > the same amount of redundancy (not quite true, but close).You really need some level of redundancy if you''re using HW raid. Using plain stripes is downright dangerous. 0+1 vs 1+0 and all that. Seems to me that the simplest way to go is to use zfs to mirror HW raid5, preferably with the HW raid5 LUNs being completely independent disks attached to completely independent controllers with no components or datapaths in common. -- -Peter Tribble L.I.S., University of Hertfordshire - http://www.herts.ac.uk/ http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
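As a concrete sketch of that suggestion, assuming two hypothetical RAID-5
LUNs, one from each of two independent arrays/controllers (arrayA_r5 and
arrayB_r5 are placeholder names, not real device paths), the ZFS side is a
single mirrored pool:

  # ZFS mirror across two independent hardware RAID-5 LUNs
  zpool create tank mirror arrayA_r5 arrayB_r5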
Darren J Moffat wrote:> Torrey McMahon wrote: >> Darren J Moffat wrote: >>> So everything you are saying seems to suggest you think ZFS was a >>> waste of engineering time since hardware raid solves all the problems ? >>> >>> I don''t believe it does but I''m no storage expert and maybe I''ve >>> drank too much cool aid. I''m software person and for me ZFS is >>> brilliant it is so much easier than managing any of the hardware >>> raid systems I''ve dealt with. >> >> >> ZFS is great....for the systems that can run it. However, any >> enterprise datacenter is going to be made up of many many hosts >> running many many OS. In that world you''re going to consolidate on >> large arrays and use the features of those arrays where they cover >> the most ground. For example, if I''ve 100 hosts all running different >> OS and apps and I can perform my data replication and redundancy >> algorithms, in most cases Raid, in one spot then it will be much more >> cost efficient to do it there. > > but you still need a local file system on those systems in many cases. > > So back to where we started I guess, how to effectively use ZFS to > benefit Solaris (and the other platforms it gets ported to) while > still using Hardware RAID because you have no choice but to use it. >Too many variables in an overall storage environment. This is why I always jump on people that say, "Dude! You''ve got ZFS. Just use JBODs". They''re not based in a reality outside of the ones that constitute a brand new workstation or SMB server....and we don''t really target that market these days. You need to clearly define what the environment is, what the data growth will look like, what apps are going to be deployed, replication requirements, etc. It''s the way things have been for years. ZFS just changes a couple of variables. It doesn''t eliminate them or turn the equation into anything easier to solve.
On Jun 27, 2006, at 3:30 PM, Al Hopper wrote:> On Tue, 27 Jun 2006, Gregory Shaw wrote: > >> Yes, but the idea of using software raid on a large server doesn''t >> make sense in modern systems. If you''ve got a large database server >> that runs a large oracle instance, using CPU cycles for RAID is >> counter productive. Add to that the need to manage the hardware >> directly (drive microcode, drive brownouts/restarts, etc.) and the >> idea of using JBOD in modern systems starts to lose value in a big >> way. >> >> You will detect any corruption when doing a scrub. It''s not end-to- >> end, but it''s no worse than today with VxVM. > > The initial impression I got, after reading the original post, is > that its > author[1] does not grok something fundamental about ZFS and/or how it > works! Or does not understand that there are many CPU cycles in a > modern > Unix box that are never taken advantage of. > > It''s clear to me that ZFS provides considerable, never before > available, > features and facilities, and that any impact that ZFS may have on > CPU or > memory utilization will become meaningless over time, as the # of CPU > cores increase - along with their performance. And that average > system > memory size will continue to increase over time. > > Perhaps the author is taking a narrow view that ZFS will *replace* > existing systems. I don''t think that this will be the general case. > Especially in a large organization where politics and turf wars will > dominate any "technical" discussions and implementation decisions > will be > made by senior management who are 100% buzzword compliant (and have > questionable technical/engineering skills). Rather it will provide > the > system designer with a hugely powerful *new* tool to apply in system > design. And will challenge the designer to use it creatively and > effectively. > > There is no such thing as the universal screw-driver. Every > toolbox has > tens of screwdrivers and tool designers will continue to dream about > replacing them all with _one_ tool. > > [1] Sorry Gregory. > > Regards, > > Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com > Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT > OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 > OpenSolaris Governing Board (OGB) Member - Feb 2006 > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussNo insult taken. I was trying to point out that many customers don''t have ''free'' cpu cycles, and that every little bit you take from their machine for subsystem control is that much real work the system will not be doing. I think of the statement of "many cpu cycles in modern unix boxes that are never taken advantage of" in the similar vein of monitoring vendors: "It''s just another agent. It won''t take more than 5% of the box." I think we''ll let the customer decide on the above. I''ve encountered both situations: customers with large boxes with plenty of headroom, and customers that run 100% all day, every day and have no cycles that aren''t dedicated to real work. When I read as a ex-customer (e.g. not with Sun) that I''ve got to sacrifice cpu cycles in a software upgrade, it says to me that the upgrade will result in a slower system. ----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote:
> Case in point, there was a gentleman who posted on the Yahoo Groups solx86
> list and described how faulty firmware on a Hitach HDS system damaged a
> bunch of data. The HDS system moves disk blocks around, between one disk
> and another, in the background, to optimized the filesystem layout. Long
> after he had written data, blocks from one data set were intermingled with
> blocks for other data sets/files causing extensive data corruption.

Al,

the problem you described probably comes from failures in the firmware
code, not a failure of the disk surface. Sun's engineers can also make some
mistakes in ZFS code, right ?

przemol
Hello David, Wednesday, June 28, 2006, 12:30:54 AM, you wrote: DV> If ZFS is providing better data integrity then the current storage DV> arrays, that sounds like to me an opportunity for the next generation DV> of intelligent arrays to become better. Actually they can''t. If you want end-to-end data integrity it has to be checked on a server. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Hello Erik,

Tuesday, June 27, 2006, 6:50:52 PM, you wrote:

ET> Personally, I can't think of a good reason to use ZFS with HW RAID5;
ET> case (3) above seems to me to provide better performance with roughly
ET> the same amount of redundancy (not quite true, but close).

I can see a reason. In our environment it looks like HW raid-5 could
actually be faster than raid-z. We have lots of small random I/Os - enough
that caching doesn't help. I don't have actual data, as it's production and
unfortunately there wasn't time to test it - just the "feeling". I believe
raid-z will offer much better write performance in most scenarios, but not
necessarily better read performance.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
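For anyone who wants to measure rather than guess, a minimal sketch of the
two layouts to compare, assuming a hypothetical RAID-5 LUN exported by the
array (hwR5lun) versus the same spindles presented individually (d1..d5 are
placeholder names); run the small-random-I/O workload against each in turn:

  # pool on top of one hardware RAID-5 LUN
  zpool create tankhw hwR5lun

  # the same disks as a software raidz group
  zpool create tankz raidz d1 d2 d3 d4 d5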
Hello Peter,

Wednesday, June 28, 2006, 1:11:29 AM, you wrote:

PT> On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:

PT> You really need some level of redundancy if you're using HW raid.
PT> Using plain stripes is downright dangerous. 0+1 vs 1+0 and all
PT> that. Seems to me that the simplest way to go is to use zfs to mirror
PT> HW raid5, preferably with the HW raid5 LUNs being completely
PT> independent disks attached to completely independent controllers
PT> with no components or datapaths in common.

Well, it will give you less than half your raw storage. Due to cost, I
believe in most cases it won't be acceptable. People use raid-5 mostly
because of cost, and you are proposing something worse (in terms of
available logical storage) than mirroring.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
On Wed, 28 Jun 2006 przemolicc at poczta.fm wrote:> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: > > Case in point, there was a gentleman who posted on the Yahoo Groups solx86 > > list and described how faulty firmware on a Hitach HDS system damaged a > > bunch of data. The HDS system moves disk blocks around, between one disk > > and another, in the background, to optimized the filesystem layout. Long > > after he had written data, blocks from one data set were intermingled with > > blocks for other data sets/files causing extensive data corruption. > > Al, > > the problem you described comes probably from failures in code of firmware > not the failure of disk surface. Sun''s engineers can also do some mistakes > in ZFS code, right ?Yes! Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Hello przemolicc, Wednesday, June 28, 2006, 10:57:17 AM, you wrote: ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote:>> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 >> list and described how faulty firmware on a Hitach HDS system damaged a >> bunch of data. The HDS system moves disk blocks around, between one disk >> and another, in the background, to optimized the filesystem layout. Long >> after he had written data, blocks from one data set were intermingled with >> blocks for other data sets/files causing extensive data corruption.ppf> Al, ppf> the problem you described comes probably from failures in code of firmware ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes ppf> in ZFS code, right ? But the point is that ZFS should detect also such errors and take proper actions. Other filesystems can''t. And of course there are bugs in ZFS :P -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:> Hello przemolicc, > > Wednesday, June 28, 2006, 10:57:17 AM, you wrote: > > ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: > >> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 > >> list and described how faulty firmware on a Hitach HDS system damaged a > >> bunch of data. The HDS system moves disk blocks around, between one disk > >> and another, in the background, to optimized the filesystem layout. Long > >> after he had written data, blocks from one data set were intermingled with > >> blocks for other data sets/files causing extensive data corruption. > > ppf> Al, > > ppf> the problem you described comes probably from failures in code of firmware > ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes > ppf> in ZFS code, right ? > > But the point is that ZFS should detect also such errors and take > proper actions. Other filesystems can''t.Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-) What I wanted to point out is the Al''s example: he wrote about damaged data. Data were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair errors in its (ZFS) code. I am comparing firmware code to ZFS code. przemol
Hello,> What I wanted to point out is the Al''s example: he wrote about damaged data. Data > were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can > detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair > errors in its (ZFS) code. > > I am comparing firmware code to ZFS code. >Firmware doesn''t do end to end checksumming. If ZFS code is buggy, the checksums won''t match up anyway, so you still detect errors. Plus it is a lot easier to debug ZFS code than firmware. -- Regards, Jeremy
przemolicc at poczta.fm wrote:
> On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
>
> What I wanted to point out is the Al's example: he wrote about damaged
> data. Data were damaged by firmware _not_ disk surface ! In such case ZFS
> doesn't help. ZFS can detect (and repair) errors on disk surface, bad
> cables, etc. But cannot detect and repair errors in its (ZFS) code.

If you mean "ZFS doesn't help with firmware problems", that is not true.
For example, if ZFS is mirroring a pool across two different storage
arrays, a firmware error in one of them will cause problems that ZFS will
detect when it tries to read the data. Further, ZFS would be able to
correct the error by reading from the other mirror, unless the second array
also suffered from a firmware error.

There are categories of problems that ZFS cannot handle, mostly regarding
data availability after catastrophes (as Richard E described), but ZFS can
help with many firmware problems.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ:    http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
Hello przemolicc, Wednesday, June 28, 2006, 3:05:42 PM, you wrote: ppf> On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:>> Hello przemolicc, >> >> Wednesday, June 28, 2006, 10:57:17 AM, you wrote: >> >> ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: >> >> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 >> >> list and described how faulty firmware on a Hitach HDS system damaged a >> >> bunch of data. The HDS system moves disk blocks around, between one disk >> >> and another, in the background, to optimized the filesystem layout. Long >> >> after he had written data, blocks from one data set were intermingled with >> >> blocks for other data sets/files causing extensive data corruption. >> >> ppf> Al, >> >> ppf> the problem you described comes probably from failures in code of firmware >> ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes >> ppf> in ZFS code, right ? >> >> But the point is that ZFS should detect also such errors and take >> proper actions. Other filesystems can''t.ppf> Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-) ppf> What I wanted to point out is the Al''s example: he wrote about damaged data. Data ppf> were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can ppf> detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair ppf> errors in its (ZFS) code. Not in its code but definitely in a firmware code in a controller. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Robert Milkowski wrote:
> Hello David,
>
> Wednesday, June 28, 2006, 12:30:54 AM, you wrote:
>
> DV> If ZFS is providing better data integrity then the current storage
> DV> arrays, that sounds like to me an opportunity for the next generation
> DV> of intelligent arrays to become better.
>
> Actually they can't.
> If you want end-to-end data integrity it has to be checked on a
> server.

But the checking could be done by a cooperating ZFS module and support in
the hardware array. That would make some of ZFS pluggable, in a way that
parts of it can be delegated to hardware.

--
Darren J Moffat
Jeremy Teo wrote:
> Hello,
>
>> What I wanted to point out is the Al's example: he wrote about
>> damaged data. Data were damaged by firmware _not_ disk surface !
>> In such case ZFS doesn't help. ZFS can detect (and repair) errors
>> on disk surface, bad cables, etc. But cannot detect and repair
>> errors in its (ZFS) code.
>>
>> I am comparing firmware code to ZFS code.
>
> Firmware doesn't do end to end checksumming. If ZFS code is buggy, the
> checksums won't match up anyway, so you still detect errors.
>
> Plus it is a lot easier to debug ZFS code than firmware.

Depends on your definition of firmware. In higher-end arrays the data is
checksummed when it comes in, and a hash is written when it gets to disk.
Of course this is nowhere near end to end, but it is better than nothing.

... and code is code. "Easier to debug" is a context-sensitive term.
On Wed, 2006-06-28 at 09:05, przemolicc at poczta.fm wrote:> > But the point is that ZFS should detect also such errors and take > > proper actions. Other filesystems can''t. > > Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-)In many cases, yes. As a hypothetical: Consider a bug in the file system''s block allocator which causes an allocated on-disk block to be prematurely reused by another file. With UFS, you''re doomed -- one file or the other (or both) will be corrupted and you''ll have no way to tell which one has correct data; all you can do is take the filesystem offline and run fsck on it to prune out the damaged area. With ZFS''s design, because block checksums are an integral part of the block pointer, the checksum error received when reading one or the other file will most likely indicate that something is wrong and these errors will be flagged; with an error of this form, the filesystem will either deliver the correct data to the app or will know that it can''t. - Bill
>Depends on your definition of firmware. In higher end arrays the data is
>checksummed when it comes in and a hash is written when it gets to disk.
>Of course this is no where near end to end but it is better then nothing.

The checksum is often stored with the data (so if the data is not written,
or is written in the wrong location, the checksum is still "valid"). ZFS
stores the checksum with the block pointer, so it knows more about the data
and whether it was proper. ZFS also checksums before the data travels over
the fabric.

>... and code is code. Easier to debug is a context sensitive term.

Uhm, well, firmware, in production systems?

Casper
> Depends on your definition of firmware. In higher end arrays the data
> is checksummed when it comes in and a hash is written when it gets to
> disk. Of course this is no where near end to end but it is better then
> nothing.
>
> ... and code is code. Easier to debug is a context sensitive term.

It's unfortunate that so many posts hinge on the code. It's the design that
protects your data, and with ZFS you have a better design for data
integrity. If the code is faulty, that's a bug, and the design should still
protect you unless your error detection and correction logic itself is
faulty. (I mean, this is like the anti-corruption bureau being corrupt
:-)). There is a huge difference between the ability to detect corruption
versus not knowing that the data is corrupted at all.

Whether the code lives up to the design is what real-world testing shows;
in most cases ZFS should help.

Kiran
Robert Milkowski wrote:
> Hello Peter,
>
> Wednesday, June 28, 2006, 1:11:29 AM, you wrote:
>
> PT> On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:
>
> PT> You really need some level of redundancy if you're using HW raid.
> PT> Using plain stripes is downright dangerous. 0+1 vs 1+0 and all
> PT> that. Seems to me that the simplest way to go is to use zfs to mirror
> PT> HW raid5, preferably with the HW raid5 LUNs being completely
> PT> independent disks attached to completely independent controllers
> PT> with no components or datapaths in common.
>
> well, it will give you less than half your raw storage.
> Due to costs I belive in most cases it won't be acceptable.
> People are using raid-5 mostly due to costs and you are proposing
> something worse (in terms of available logical storage) than
> mirroring.

The main reason I don't see ZFS mirror / HW RAID5 as useful is this:

ZFS mirror / HW RAID5:   capacity = (N / 2) - 1
                         speed << (N / 2) - 1
                         minimum # disks to lose before loss of data: 4
                         maximum # disks to lose before loss of data: (N / 2) + 2

ZFS mirror / HW stripe:  capacity = N / 2
                         speed >= N / 2
                         minimum # disks to lose before loss of data: 2
                         maximum # disks to lose before loss of data: (N / 2) + 1

Given a reasonable number of hot spares, I simply can't see the (very)
marginal increase in safety given by using HW RAID5 as outweighing the
considerable speed hit using RAID5 takes.

Robert - I would definitely like to see the difference between reads on HW
RAID5 vs reads on RAIDZ. Naturally, one of the big concerns I would have is
how much RAM is needed to avoid any cache starvation on the ZFS machine.
I'd discount the NVRAM on the RAID controller, since I'd assume that it
would be dedicated to write acceleration, and not for reads.

My big problem right now is that I only have an old A3500FC to do testing
on, as all my other HW RAID controllers are IBM ServeRAIDs, for which the
Solaris driver isn't really the best.

-Erik
Erik Trimble wrote:> The main reason I don''t see ZFS mirror / HW RAID5 as useful is this: > > ZFS mirror/ RAID5: capacity = (N / 2) -1 > speed << N / 2 -1 > minimum # disks to lose before loss > of data: 4 > maximum # disks to lose before loss > of data: (N / 2) + 2 > > ZFS mirror / HW Stripe capacity = (N / 2) > speed >= N / 2 > minimum # disks to lose before loss > of data: 2 > maximum # disks to lose before loss > of data: (N / 2) + 1 > > Given a reasonable number of hot-spares, I simply can''t see the (very) > marginal increase in safety give by using HW RAID5 as out balancing the > considerable speed hit using RAID5 takes.Eric, Your analysis lacks some very important views of the problem. 0. Probability of failure is not constant across the components involved. 1. Disks don''t tend to fail completely as often as they fail partially. For partial failures, the recovery method is very different for the various hardware RAID types and ZFS. 2. Analysis for data availability is different than analysis for data loss and performance. Typically, we do a performability analysis which shows the relationship between availability and performance. Data loss analysis is handled differently, as it is often measured in years (perhaps tens of thousands of years) and is highly dependent upon maintenance activity. 3. For most hardware RAID arrays, RAID-5 performance is similar to RAID-1+0. In order to assign a value to the performance envelope, something must be known about the workload. RAID-6 or raidz2 performs ??? 4. Scrubbing methods are also different between ZFS and RAID arrays. This does impact latent fault detection which in turn impacts data loss. Depending on requirements, we might recommend something fast, but risky, or something designed to never forget. Saying that some configuration has little value only applies to a specific set of requirements. -- richard
On Jun 28, 2006, at 12:32, Erik Trimble wrote:
> The main reason I don't see ZFS mirror / HW RAID5 as useful is this:
>
> ZFS mirror / RAID5:      capacity = (N / 2) - 1
>                          speed << (N / 2) - 1
>                          minimum # disks to lose before loss of data: 4
>                          maximum # disks to lose before loss of data: (N / 2) + 2

shouldn't that be capacity = ((N - 1) / 2) ?

Loss of a single disk would cause a rebuild on the R5 stripe, which could
affect performance on that side of the mirror. Generally speaking, good RAID
controllers will dedicate processors and channels to calculate the parity and
write it out, so you're not impacted from the host-access point of view. There
is a similar sort of CoW behaviour that can happen between the array cache and
the drives, but in the ideal case you're dealing with this in dedicated hw
instead of shared hw.

> ZFS mirror / HW Stripe:  capacity = N / 2
>                          speed >= N / 2
>                          minimum # disks to lose before loss of data: 2
>                          maximum # disks to lose before loss of data: (N / 2) + 1
>
> Given a reasonable number of hot spares, I simply can't see the (very)
> marginal increase in safety given by using HW RAID5 as outweighing the
> considerable speed hit RAID5 takes.

I think you're comparing this to software R5, or at least badly implemented
array code, and divining that there is a considerable speed hit when using R5.
In practice this is not always the case, provided that the response time and
interaction between the array cache and drives is sufficient for the incoming
stream. By moving your operation to software you're now introducing more
layers between the CPU, L1/L2 cache, memory bus, and system bus before you get
to the interconnect, plus further latencies on the storage port and underlying
device (virtualized or not). Ideally it would be nice to see ZFS-style
improvements in array firmware, but given the state of embedded Solaris and
the predominance of 32-bit controllers, I think we're going to have some
issues. We'd also need some sort of client mechanism to interact with the
array if we're talking about moving the filesystem layer out there .. just a
thought

Jon E
On Wed, Jun 28, 2006 at 11:15:34AM +0200, Robert Milkowski wrote:
> DV> If ZFS is providing better data integrity than the current storage
> DV> arrays, that sounds to me like an opportunity for the next generation
> DV> of intelligent arrays to become better.
>
RM> Actually they can't.
RM> If you want end-to-end data integrity it has to be checked on a
RM> server.

But Joe makes a good point about RAID-Z and iSCSI.

It'd be nice if RAID HW could assist RAID-Z, and it wouldn't take much to do
that: parity computation on write, checksum verification on read and, if the
checksum verification fails, combinatorial reconstruction on read. The ZFS
system (iSCSI client) would still have to verify the checksum on read...

...but leaving parity computation/reconstruction to the iSCSI server would
greatly cut down the amount of I/O needed for RAID-Z to something similar to
that needed for HW RAID-5.

Sure, I don't expect HW-assisted RAID-Z anytime soon, nor iSCSI extensions
for server-assisted RAID-Z. But at least iSCSI protocol extensions could be
pursued now.

Nico
--
On Wed, 2006-06-28 at 17:32, Erik Trimble wrote:
> <snip>
> Given a reasonable number of hot spares, I simply can't see the (very)
> marginal increase in safety given by using HW RAID5 as outweighing the
> considerable speed hit RAID5 takes.

That's not quite right. There's no significant difference in performance,
and the question is whether you're prepared to give up a small amount of
space for an orders-of-magnitude increase in safety.

Each extra disk failure you can survive leads to a massive increase in
safety: something like (just considering isolated random disk failures) the
ratio of the MTBF of a disk to the time it takes to get a spare back in and
repair the LUN. That's something like 100,000 - which isn't a marginal
increase in safety!

Yes, I know it's not that simple. The point to take from this is simply that
being able to survive 2 failures instead of 1 doesn't double your safety, it
increases it by a very large number. And by a very large number again for
the next failure you can survive.

In the stripe case, a single disk loss (pretty common) loses all your
redundancy straight off. A second disk failure (not particularly rare) and
all your data's toast. Hot spares don't really help in this case.

At this point, having HW raid-5 underneath means you're still humming along
safely.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Robert,

> PT> <snip>
>
> well, it will give you less than half your raw storage.
> Due to costs I believe in most cases it won't be acceptable.
> People are using raid-5 mostly due to costs, and you are proposing
> something worse (in terms of available logical storage) than
> mirroring.

I realise that, but the question was about what combination of ZFS redundancy
and HW-raid redundancy made sense. My point was that putting no redundancy at
all at the HW-raid layer was a really bad idea, and the self-healing
capability of zfs means that you want a level of redundancy within zfs. So
you are inevitably going to lose some extra capacity. Which is better -
zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

I wouldn't rule out raidz (or even raidz2) across multiple arrays that are
HW-raid5 internally. My real concern there is the small random read
performance issue.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Wed, 2006-06-28 at 13:24 -0400, Jonathan Edwards wrote:
> On Jun 28, 2006, at 12:32, Erik Trimble wrote:
> > <snip>
>
> shouldn't that be capacity = ((N - 1) / 2) ?

Nope. For instance, 12 drives: 2 mirrors of 6-drive RAID5, which actually has
5 drives of capacity. N=12, so (12 / 2) - 1 = 6 - 1 = 5.

> loss of a single disk would cause a rebuild on the R5 stripe, which could
> affect performance on that side of the mirror. Generally speaking, good
> RAID controllers will dedicate processors and channels to calculate the
> parity and write it out, so you're not impacted from the host-access point
> of view. There is a similar sort of CoW behaviour that can happen between
> the array cache and the drives, but in the ideal case you're dealing with
> this in dedicated hw instead of shared hw.

But, in all cases I've ever observed, even with hardware assist, writing to
an N-drive RAID5 array is slower than writing to an (N-1)-drive HW striped
array. NVRAM can of course mitigate this somewhat, but it comes down to the
fact that RAID 5/6 always requires more work than simple striping. And an
N-drive striped array will always outperform an N-drive RAID5/6 array.
Always.

I agree that there is some latitude for array design/cache performance/
workload variance in this, but I've compared what would be the generally
optimal RAID-5 workload (large streaming writes/streaming reads) against an
identical number of striped drives, and you are looking at, BEST CASE, the
RAID5 performing at (N-1)/N of the stripe.

[ In reality, that isn't quite the best case. The best case is that RAID-5
matches striping, in the case of reads of size <= (stripe size) * (N-1). ]

> I think you're comparing this to software R5, or at least badly implemented
> array code, and divining that there is a considerable speed hit when using
> R5.
> <snip>

What I was trying to provide was the case for those using HW arrays AND ZFS,
and what the best configuration would be to do so. I'm not saying either/or;
the discussion centered around what the best way to do BOTH is.

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
> Which is better -
> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

The latter. With a mirror of RAID-5 arrays, you get:

  (1)  Self-healing data.

  (2)  Tolerance of whole-array failure.

  (3)  Tolerance of *at least* three disk failures.

  (4)  More IOPs than raidz of hardware mirrors (see Roch's blog entry).

  (5)  More convenient FRUs (the whole array becomes a FRU).

Jeff
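For anyone who wants to try that layout, here is a minimal host-side sketch.
The device names are invented; substitute whatever LUNs your arrays actually
present (ideally one RAID-5 LUN from each array, on separate controllers and
paths):

  # each cXtYdZ is assumed to be a hardware RAID-5 LUN from a separate array
  zpool create tank mirror c2t0d0 c3t0d0

  # confirm the layout and watch for checksum errors being repaired
  zpool status tank

  # walk the pool periodically so latent errors get found and self-healed
  zpool scrub tank

With this layout a failed disk is rebuilt inside the array from parity, while
ZFS's checksums catch - and repair from the other half of the mirror -
anything the array hands back incorrectly.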
Hello Peter,

Wednesday, June 28, 2006, 11:24:32 PM, you wrote:

PT> <snip>
PT> I realise that, but the question was about what combination of
PT> ZFS redundancy and HW-raid redundancy made sense. My point was
PT> that putting no redundancy at all at the HW-raid layer was a
PT> really bad idea, and the self-healing capability of zfs means
PT> that you want a level of redundancy within zfs. So you are
PT> inevitably going to lose some extra capacity. Which is better -
PT> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

PT> I wouldn't rule out raidz (or even raidz2) across multiple
PT> arrays that are HW-raid5 internally. My real concern there is
PT> the small random read performance issue.

I hit that problem (raidz on hw-raid5) with lots of small random reads (and
many small writes). The performance was not acceptable here (nor were more
raid-z groups an option, due to too much logical storage consumed for
redundancy). I believe that in many cases mirroring hw-raid-5 luns would
actually perform better.

And why exactly do you think that non-redundant luns on hw arrays are a bad
idea (other than the lack of hot spare support in zfs)? You would still
benefit from the caches in the array.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Hello Erik,

Wednesday, June 28, 2006, 6:32:38 PM, you wrote:

ET> I would definitely like to see the difference between reads on HW RAID5
ET> vs reads on RAID-Z. Naturally, one of the big concerns I would have is
ET> how much RAM is needed to avoid any cache starvation on the ZFS
ET> machine. I'd discount the NVRAM on the RAID controller, since I'd
ET> assume it is dedicated to write acceleration, not reads. My big problem
ET> right now is that I only have an old A3500FC to do testing on, as all
ET> my other HW RAID controllers are IBM ServeRAIDs, for which the Solaris
ET> driver isn't really the best.

I believe the problem here was mostly due to the 64kB reads from each disk
in raid-z, while the dataset was many TBs of data with small random reads
from many threads (nfsd). It meant that during peak hours I probably wasn't
far from saturating the FC links (there was over 200MB/s of read throughput
at times) while nfsd was actually reading something like 10x less. I believe
that most of that "cached" data wasn't used.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
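As an aside, anyone who wants to check for this kind of read inflation on
their own pool can simply compare what the pool pulls from disk against what
the application actually consumes. A rough sketch (the pool name is just a
placeholder):

  # per-vdev bandwidth, refreshed every 5 seconds; compare the pool-level
  # read column against what nfsd (or the database) reports it is serving
  zpool iostat -v tank 5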
On Wed, 2006-06-28 at 22:13 +0100, Peter Tribble wrote:
> On Wed, 2006-06-28 at 17:32, Erik Trimble wrote:
> > Given a reasonable number of hot spares, I simply can't see the (very)
> > marginal increase in safety given by using HW RAID5 as outweighing the
> > considerable speed hit RAID5 takes.
>
> That's not quite right. There's no significant difference in performance,
> and the question is whether you're prepared to give up a small amount of
> space for an orders-of-magnitude increase in safety.

As indicated by previous posts, even with HW assist, RAID5/6 on N disks will
be noticeably slower than a stripe of N disks. Theoretical read performance
for RAID5 is at best (in a limited number of cases) equal to striping, and in
the general read case runs at (N-1)/N of the stripe. Even assuming no
performance hit at all for the parity calculation on writes, writes to a
RAID5 are at best equal to a stripe (assuming a full stripe has to be
written), and usually run at (N-1)/N of the stripe, as the parity must be
written in addition to the normal data (i.e. N/(N-1) times as much data must
be written).

> Each extra disk failure you can survive leads to a massive increase in
> safety: something like (just considering isolated random disk failures)
> the ratio of the MTBF of a disk to the time it takes to get a spare back
> in and repair the LUN. That's something like 100,000 - which isn't a
> marginal increase in safety!
>
> <snip>
>
> At this point, having HW raid-5 underneath means you're still humming
> along safely.

Agreed. However, part of the issue is that you have to take into account the
possibility of another drive failing before the hot spare can be resilvered
after a drive failure. In general, this is why I don't see stripes being more
than 6-8 drives wide. It takes somewhere between 30 minutes and 2 hours to
resilver a drive in a mirrored stripe of that size (depending on capacity and
activity). So a tradeoff has to be made. And note that the RAID5 resync will
reduce your performance for considerably longer than it takes to resilver the
stripe.

Put another way (assuming a hot spare is automatically put in place after a
drive failure): with mirrored stripes, I'm vulnerable to complete data loss
while the 2nd stripe resyncs - say 2 hours or so, worst case. By percentages,
a 2nd drive loss in a mirrored stripe has a 50% chance of causing data loss.
With mirrored RAID5, I'm invulnerable to a second drive loss. However, my
resync time is vastly greater than for a stripe, increasing my window for
more drive failures by at least double, if not more.

It's all a tradeoff. In general, though, I haven't seen a 2nd drive fail
within a stripe resync time, UNLESS I saw many drives fail (that is, 30-40%
of a group failing close together), in which case this overwhelms any RAID's
ability to compensate. It's certainly possible (which is why striped mirrors
are preferred to mirrored stripes). In the end, it comes down to the local
needs.

Speaking in generalizations, my opinion is that a mirror of RAID5 doesn't
significantly increase your safety enough to warrant a 15-20% reduction in
space, and at least that in performance, vs a mirror of stripes.

And, of course, backups help. :-)

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Wed, 2006-06-28 at 14:55 -0700, Jeff Bonwick wrote:
> > Which is better -
> > zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?
>
> The latter. With a mirror of RAID-5 arrays, you get:
>
> (1) Self-healing data.
>
> (2) Tolerance of whole-array failure.
>
> (3) Tolerance of *at least* three disk failures.
>
> (4) More IOPs than raidz of hardware mirrors (see Roch's blog entry).
>
> (5) More convenient FRUs (the whole array becomes a FRU).
>
> Jeff

Not that I disagree with the initial assessment, but a couple of corrections:

(1) Both give you this.

(2) ZFS RAIDZ on HW mirrors can also survive a complete HW mirror array
failure.

(3) Both configs can survive AT LEAST 3 drive failures. RAIDZ of HW mirrors
is slightly better at being able to survive 4+ drive failures, statistically
speaking.

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Jun 28, 2006, at 17:25, Erik Trimble wrote:
> On Wed, 2006-06-28 at 13:24 -0400, Jonathan Edwards wrote:
>> shouldn't that be capacity = ((N - 1) / 2) ?
>
> Nope. For instance, 12 drives: 2 mirrors of 6-drive RAID5, which actually
> has 5 drives of capacity. N=12, so (12 / 2) - 1 = 6 - 1 = 5.

right, sorry - was thinking of the case where i've got 2 luns built out of a
single R5 parity group .. but there's not much point to using a mirror there
since disk failure is typically much more common than LDEV failure.

If you're really concerned with reliability (the only reason you should be
thinking about doing both R5 and R1), you'd be better off mirroring each
component of a RAID stripe before you construct the parity group. This will
still give you an effective capacity of (N-2)/2, or (N/2) - 1, but now you
would have to lose 2 complete mirrors before you would fail. To me this says
that the best case for reliability here should be to go with HW mirrored
drives and RAID-Z on top. Of course, you're not going to be able to split
mirrors very easily if you ever have that intention.

<snip>

> And an N-drive striped array will always outperform an N-drive RAID5/6
> array. Always.

true - but with some modern hardware, I think you'll find that it's pretty
negligible.

> I agree that there is some latitude for array design/cache performance/
> workload variance in this, but I've compared what would be the generally
> optimal RAID-5 workload (large streaming writes/streaming reads) against
> an identical number of striped drives, and you are looking at, BEST CASE,
> the RAID5 performing at (N-1)/N of the stripe.

right, and you'll also have a read/<modify>/write penalty that will happen
somewhere, which can degrade performance particularly when you blow your
cache in a large streaming write. Realistically you'll typically give up the
performance addition of a drive or two for parity to get basic redundancy,
and then realign your stripe width for your filesystem allocation unit or
block commit based on the number of data drives in your RAID set (N-1 for R5
or N-2 for R6). You'll probably be giving up another couple of drives for hot
spares .. you're never doing R5 for performance - it's the recoverability
aspect with less capacity overhead than a full mirror.

.je
On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> But Joe makes a good point about RAID-Z and iSCSI.
>
> It'd be nice if RAID HW could assist RAID-Z, and it wouldn't take much
> to do that: parity computation on write, checksum verification on read
> and, if the checksum verification fails, combinatorial reconstruction on
> read. The ZFS system (iSCSI client) would still have to verify the
> checksum on read...
>
> ...but leaving parity computation/reconstruction to the iSCSI server
> would greatly cut down the amount of I/O needed for RAID-Z to something
> similar to that needed for HW RAID-5.
>
> Sure, I don't expect HW-assisted RAID-Z anytime soon, nor iSCSI
> extensions for server-assisted RAID-Z. But at least iSCSI protocol
> extensions could be pursued now.

But - this still fails to address the design concept of ZFS's end-to-end
checksumming, and fails to address things going bad over the system's
hardware bus, the IO card, and the Fibre...

So - it's not nearly the same level of protection, IMO.
[hit send too soon...]

Richard Elling wrote:
> Erik Trimble wrote:
>> <snip>
>
> Erik,
> Your analysis lacks some very important views of the problem.
> <snip>
> 4. Scrubbing methods are also different between ZFS and RAID arrays. This
>    does impact latent fault detection, which in turn impacts data loss.

5. Excepting recovery from tape, the availability of a ZFS volume is a
   function of the amount of space used. This is different from LVMs or HW
   RAID arrays, where the availability is a function of the size of the disk.

> Depending on requirements, we might recommend something fast but risky, or
> something designed to never forget. Saying that some configuration has
> little value only applies to a specific set of requirements.
>  -- richard
Roch wrote:
> Philip Brown writes:
> > but there may not be filesystem space for double the data.
> > Sounds like there is a need for a zfs-defragment-file utility perhaps?
> >
> > Or if you want to be politically cagey about naming choice, perhaps,
> >
> > zfs-seq-read-optimize-file ? :-)
>
> Possibly, or maybe using fcntl?
>
> Now the goal is to take a file with scattered blocks and order
> them in contiguous chunks. So this is contingent on the
> existence of regions of free contiguous disk space. This
> will get more difficult as we get close to full on the
> storage.

Quite so. It should be reasonable to require some minimum level of free space
on the filesystem, or the operation cannot continue. But even with some
relatively 'small' amount of free space, it should still be possible; it will
just take significantly longer.

CF: any "defrag your hard drive" algorithm. Same algorithm, just applied to a
file instead of a drive.
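Until such a utility exists, the copy-based reorder mentioned earlier in the
thread can be done by hand. A rough sketch, assuming the file is closed while
it is copied and that the path below is just an example:

  # rewrite the file sequentially so ZFS allocates fresh, mostly
  # contiguous blocks for it
  cp -p /tank/db/datafile /tank/db/datafile.new
  mv /tank/db/datafile.new /tank/db/datafile

The new copy is laid down as a streaming write, so subsequent sequential
reads see a much friendlier layout - at the cost of temporarily needing space
for a second copy of the file.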
Erik Trimble wrote:
> Since the best way to get this is to use a Mirror or RAIDZ vdev, I'm
> assuming that the proper way to get benefits from both ZFS and HW RAID
> is the following:
>
> (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror
>     hwStripe1 hwStripe2"
> (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1
>     hwMirror2"
> (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1
>     hwStripe2"
>
> mirrors of mirrors and raidz of raid5 are also possible, but I'm pretty
> sure they're considerably less useful than the 3 above.
>
> Personally, I can't think of a good reason to use ZFS with HW RAID5;
> case (3) above seems to me to provide better performance with roughly
> the same amount of redundancy (not quite true, but close).

I almost regret extending this thread more :-) but I haven't seen anyone
spell out one thing in simple language, so I'll attempt to do that now.

#2 is incredibly wasteful of space, so I'm not going to address it. It is
highly redundant, and that's great; if you need it, do it. I'm more concerned
with the concept of

  zfs on two hardware raid boxes that have internal disk redundancy
vs
  zfs on two hardware raid boxes that are pure stripes (raid 0)

(it doesn't matter to me whether you use a zfs mirror or raidz, for this
aspect of things)

The point that I think people should remember is that if you lose a drive in
a pure raid0 configuration... your time to recover that hwraid unit and bring
it back to full operation in the filesystem.. is HUGE. It will most likely be
unacceptably long: hours if not days, for a decent sized raid box.

So, you can choose to throw away half your disk space in that hwraid box for
redundancy, or use raid5. raid5 IS useful in zfs+hwraid boxes, for "Mean Time
To Recover" purposes.
On Thu, Jun 29, 2006 at 09:25:21AM +1000, Nathan Kroenert wrote:
> On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> > <snip>
> > ...but leaving parity computation/reconstruction to the iSCSI server
> > would greatly cut down the amount of I/O needed for RAID-Z to something
> > similar to that needed for HW RAID-5.
>
> But - this still fails to address the design concept of ZFS's end-to-end
> checksumming, and fails to address things going bad over the system's
> hardware bus, the IO card and the Fibre...

No it doesn't. As I'd have it (and as I wrote), ZFS would compute the
checksum on both reads and writes, but on reads the iSCSI target would also
compute the checksum, so it could do combinatorial reconstruction if a block
is bad.

Nico
--
Philip Brown wrote:
> raid5 IS useful in zfs+hwraid boxes, for "Mean Time To Recover" purposes.

Or, and people haven't really mentioned this yet, if you're using R5 for the
raid set and carving LUNs out of it to multiple hosts.
On 6/28/06, Nathan Kroenert <Nathan.Kroenert at sun.com> wrote:
> On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> > <snip>
>
> But - this still fails to address the design concept of ZFS's end-to-end
> checksumming, and fails to address things going bad over the system's
> hardware bus, the IO card and the Fibre...
>
> So - it's not nearly the same level of protection, IMO.

It's not the same level of protection, but you get location independence,
multi-pathing throughout your device tree (all components redundant,
including the head node potentially), and, if Solaris ever got around to it,
ERL2 support would do wonders to ensure data integrity. Sure, you can still
have specific components fail partially and not know that the data is
corrupted, but again, mirroring your iSCSI-based storage allows the error
correction/checksumming route to work its wonders.

For exceptionally large data pools you'll need many systems (perhaps beyond
the scope of even FC). Just exposing each drive as a naked lun and doing
layers of raidz/mirrors will show poor performance and become an overly
centralized management nightmare. Segmenting off the workload solves some
performance ills (again, referring to Roch's work on raidz's failings for a
large number of luns) and provides its own level of compartmentalization,
redundancy, and manageability (you gain some and you lose some, I agree).
Let ZFS integrate as best it can based on the environment at hand. My target
use involves both tier1 (a la NetApp) and tier2 (very large multi-location
storage pools).
On Jun 28, 2006, at 18:25, Erik Trimble wrote:
> On Wed, 2006-06-28 at 14:55 -0700, Jeff Bonwick wrote:
>>> Which is better -
>>> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?
>>
>> The latter. With a mirror of RAID-5 arrays, you get:
>> <snip>
>
> Not that I disagree with the initial assessment, but a couple of
> corrections:
>
> (1) Both give you this.
>
> (2) ZFS RAIDZ on HW mirrors can also survive a complete HW mirror array
> failure.
>
> (3) Both configs can survive AT LEAST 3 drive failures. RAIDZ of HW
> mirrors is slightly better at being able to survive 4+ drive failures,
> statistically speaking.

Here are 10 options I can think of to summarize combinations of zfs with hw
redundancy:

 #   ZFS   ARRAY HW     CAPACITY   COMMENTS
 --  ---   --------     --------   --------
 1   R0    R1           N/2        hw mirror - no zfs healing (XXX)
 2   R0    R5           N-1        hw R5 - no zfs healing (XXX)
 3   R1    2 x R0       N/2        flexible, redundant, good perf
 4   R1    2 x R5       (N/2)-1    flexible, more redundant, decent perf
 5   R1    1 x R5       (N-1)/2    parity and mirror on same drives (XXX)
 6   RZ    R0           N-1        standard RAID-Z - no array RAID (XXX)
 7   RZ    R1 (tray)    (N/2)-1    RAIDZ+1
 8   RZ    R1 (drives)  (N/2)-1    RAID1+Z (highest redundancy)
 9   RZ    2 x R5       N-3        triple parity calculations (XXX)
 10  RZ    1 x R5       N-2        double parity calculations (XXX)

If we eliminate the configs with no zfs healing, and the configs with double
parity calculations (overworking the drives), I believe that configs 3 and 4
on a decent RAID array will probably perform similarly for most workloads.
Config 4 (as Jeff pointed out) will probably get you the best performance and
redundancy, utilizing both the arrays' strengths and zfs' strengths.

If we optimize for performance we'd probably shy away from the RAID-Z
options, since we can't really dedicate channels and resources for the parity
calculations and writes (Roch's explanation is much better). But if we
optimize for reliability, config 8 would get you the highest overall
redundancy.

Other options not considered:
- Double mirroring - capacity loss is too high for too little gain
- RAID2/3/4/6/S - not commonly used - have their own flaw areas

Jonathan Edwards
(generic storage and filesystem engineer)
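To make config 8 concrete, the host side might look roughly like the
following (device names are invented; each LUN is assumed to be a
hardware-mirrored pair of drives exported by the array). Config 4 would look
like the two-LUN mirror sketched earlier in the thread.

  # config 8 (RAID1+Z): RAID-Z across five LUNs, each LUN a HW mirror pair
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
  zpool status tank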
On Wed, Jun 28, 2006 at 09:30:25AM -0400, Jeff Victor wrote:
> przemolicc at poczta.fm wrote:
> > On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
> >
> > What I wanted to point out is Al's example: he wrote about damaged data.
> > The data were damaged by firmware, _not_ the disk surface! In such a case
> > ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface,
> > bad cables, etc. But it cannot detect and repair errors in its (ZFS)
> > code.
>
> If you mean "ZFS doesn't help with firmware problems" that is not true.

No, I don't mean that. :-)

> For example, if ZFS is mirroring a pool across two different storage
> arrays, a firmware error in one of them will cause problems that ZFS will
> detect when it tries to read the data. Further, ZFS would be able to
> correct the error by reading from the other mirror, unless the second
> array also suffered from a firmware error.

In this case ZFS is going to help, I agree. But how often do you meet such a
solution (a mirror of two different storage arrays)?

> There are categories of problems that ZFS cannot handle, mostly regarding
> data availability after catastrophes (as Richard E described), but ZFS can
> help with many firmware problems.

Indeed.

przemol
On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
> ppf> What I wanted to point out is Al's example: he wrote about damaged
> ppf> data. The data were damaged by firmware, _not_ the disk surface! In
> ppf> such a case ZFS doesn't help. ZFS can detect (and repair) errors on
> ppf> the disk surface, bad cables, etc. But it cannot detect and repair
> ppf> errors in its (ZFS) code.
>
> Not in its code, but definitely in the firmware code in a controller.

As Jeff pointed out: if you mirror two different storage arrays.

przemol
Hello Philip,

Thursday, June 29, 2006, 2:58:41 AM, you wrote:

PB> <snip>
PB> The point that I think people should remember is that if you lose a
PB> drive in a pure raid0 configuration... your time to recover that hwraid
PB> unit and bring it back to full operation in the filesystem.. is HUGE.
PB> It will most likely be unacceptably long: hours if not days, for a
PB> decent sized raid box.

Not really. You can create many smaller raid-0 luns in one array and then do
raid-10 in zfs. That should also expand your available queue depth and
minimize the impact of resilvering. Storage capacity would still be the same.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
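A sketch of that layout, assuming each array exports four small RAID-0 LUNs
(the device names are made up), with every ZFS mirror pair spanning the two
arrays:

  # raid-10 in zfs across two arrays: c2* from array A, c3* from array B
  zpool create tank mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0 \
      mirror c2t2d0 c3t2d0 mirror c2t3d0 c3t3d0

Losing one LUN then only resilvers that LUN's slice of the pool rather than a
whole box worth of data, and the extra top-level vdevs give ZFS more
outstanding I/Os to spread around.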
Hello przemolicc,

Thursday, June 29, 2006, 8:01:26 AM, you wrote:

ppf> On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
>> <snip>
>> Not in its code, but definitely in the firmware code in a controller.

ppf> As Jeff pointed out: if you mirror two different storage arrays.

Not only then, I believe. There are some classes of firmware problems where
ZFS could help even within one array (with many controllers in an
active-active config, like a Symmetrix).

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
On Thu, Jun 29, 2006 at 10:01:15AM +0200, Robert Milkowski wrote:
> <snip>
> Not only then, I believe. There are some classes of firmware problems
> where ZFS could help even within one array (with many controllers in an
> active-active config, like a Symmetrix).

Any real example?

przemol
Hello przemolicc,

Thursday, June 29, 2006, 10:08:23 AM, you wrote:

ppf> <snip>
ppf> Any real example?

I wouldn't say such problems are common. The issue is that we don't know.
From time to time some files are bad, and sometimes fsck is needed for no
apparent reason. I think only the future will tell how and when ZFS will
protect us. All I can say is that there's big potential in ZFS.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
przemolicc at poczta.fm wrote:
> On Wed, Jun 28, 2006 at 09:30:25AM -0400, Jeff Victor wrote:
>
>> For example, if ZFS is mirroring a pool across two different storage
>> arrays, a firmware error in one of them will cause problems that ZFS will
>> detect when it tries to read the data. Further, ZFS would be able to
>> correct the error by reading from the other mirror, unless the second
>> array also suffered from a firmware error.
>
> In this case ZFS is going to help, I agree. But how often do you meet such
> a solution (a mirror of two different storage arrays)?

I have never seen this for cabinet-sized storage systems, because they offer
the ability to perform on-line maintenance. But I do see mirroring across two
arrays for small arrays, which typically do not offer on-line maintenance.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
1) We installed ZFS onto our Solaris 10 T2000 3 months ago. I have been told our ZFS code is downrev. What is the recommended way to upgrade ZFS on a production system (we want minimum downtime)? Can it safely be done without affecting our 3.5 million files? 2) We did not turn on compression as most of our 3+ million files are already gzipped. What is the performance penalty of having compression on (both read and write numbers)? Is there advantage to compressing already gzipped files? Should compression be the default when installing ZFS? Nearly all our files are ASCII. here is some info on our machine itsm-mpk-2% showrev Hostname: itsm-mpk-2 Hostid: 83d8d784 Release: 5.10 Kernel architecture: sun4v Application architecture: sparc Hardware provider: Sun_Microsystems Domain: Kernel version: SunOS 5.10 Generic_118833-08 T2000 32x1000mhz, 16gigs RAM. # zpool status pool: canary state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM canary ONLINE 0 0 0 c1t0d0s3 ONLINE 0 0 0 errors: No known data errors # zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- canary 42.0G 12.0G 169 223 8.92M 1.39M canary 42.0G 12.0G 0 732 0 3.05M canary 42.0G 12.0G 0 573 0 2.47M canary 42.0G 12.0G 0 515 0 2.22M canary 42.0G 12.0G 0 680 0 3.11M canary 42.0G 12.0G 0 620 0 2.80M canary 42.0G 12.0G 0 687 0 2.85M canary 42.0G 12.0G 0 568 0 2.40M canary 42.0G 12.0G 0 688 0 2.91M canary 42.0G 12.0G 0 634 0 2.75M canary 42.0G 12.0G 0 625 0 2.61M canary 42.0G 12.0G 0 700 0 2.96M canary 42.0G 12.0G 0 733 0 3.19M canary 42.0G 12.0G 0 639 0 2.76M canary 42.0G 12.0G 1 573 127K 2.89M canary 42.0G 12.0G 0 652 0 2.48M canary 42.0G 12.0G 0 713 63.4K 3.55M canary 42.0G 12.0G 117 355 7.83M 782K canary 42.0G 12.0G 43 616 2.97M 1.11M canary 42.0G 12.0G 128 424 8.60M 1.57M canary 42.0G 12.0G 288 151 18.9M 795K canary 42.0G 12.0G 364 0 23.9M 0 canary 42.0G 12.0G 387 0 25.6M 0 thanks sean
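Taking a stab at both questions (hedged - verify against the docs for your
exact release): the ZFS code itself is updated through the normal Solaris
patch/upgrade path, and the on-disk pool format can then be brought forward
in place with zpool upgrade, which does not rewrite the files in the pool.
Compression won't buy anything for data that is already gzipped, so it is
only worth enabling on datasets holding uncompressed data. A rough sketch,
assuming the zpool upgrade subcommand is present in your bits:

  # after patching, see whether the pool's on-disk version is older than
  # what the kernel now supports, and which versions are available
  zpool upgrade
  zpool upgrade -v

  # upgrade the pool format in place (existing files are not rewritten)
  zpool upgrade canary

  # turn on (lzjb) compression where the data is not already gzipped;
  # it only affects blocks written after the property is set
  zfs set compression=on canary
  zfs get compression canary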