Gurus;

I am exceedingly impressed by ZFS, although in my humble opinion Sun is
not doing enough evangelizing for it. But that's beside the point. I am
writing to seek help in understanding the RAID-Z concept.

Jeff Bonwick's weblog states the following:

"RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
width. Every block is its own RAID-Z stripe, regardless of blocksize.
This means that every RAID-Z write is a full-stripe write. This, when
combined with the copy-on-write transactional semantics of ZFS,
completely eliminates the RAID write hole. RAID-Z is also faster than
traditional RAID because it never has to do read-modify-write."

I am unable to relate the above statement to the diagram shown in the
PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE SYSTEMS"
(also by Jeff Bonwick), on page 11.

I was wondering whether Jeff or someone knowledgeable would elaborate
further on the above and also answer the following questions:

* The green and blue "blocks" shown in the diagram on page 11: do they
  represent actual physical blocks on individual disks, or a single
  RAID-Z stripe write across multiple disks?

* The parity for RAID-Z, where is it? Surely not the checksum stored
  together in the upper-level direct and indirect block pointer? And if
  not, and it is written as a separate block on another disk, then...
  I am afraid I do not understand.

* Could someone please elaborate more on the statement "Every block is
  its own RAID-Z stripe"? Is the block being referred to a single block
  across multiple disks, or on a single disk?

My sincere apologies if the above questions seem trivial, but I am
really struggling to reconcile the statement and the diagram.

Warmest Regards
Steven Sim
Fujitsu Asia Pte. Ltd.
> RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
> width. Every block is its own RAID-Z stripe, regardless of blocksize.
> This means that every RAID-Z write is a full-stripe write. This, when
> combined with the copy-on-write transactional semantics of ZFS,
> completely eliminates the RAID write hole. RAID-Z is also faster than
> traditional RAID because it never has to do read-modify-write."
>
> I am unable to relate the above statement to the diagram shown in the
> PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE SYSTEMS"
> (also by Jeff Bonwick), on page 11.

On the copy I'm looking at, page 11 is "Copy-On-Write Transactions".
Note that this page is showing only the "copy on write" stuff (which is
used on all pools) and doesn't show anything about raidz.

> I was wondering whether Jeff or someone knowledgeable would elaborate
> further on the above and also answer the following questions:
>
> * The green and blue "blocks" shown in the diagram on page 11: do they
>   represent actual physical blocks on individual disks or a single
>   RAID-Z stripe write across multiple disks?

They represent filesystem "data" (the leaves) and filesystem "metadata"
(the blocks above the leaves). They're written to a pool that may have
some form of redundancy (mirroring, raidz), but that level is not
presented in this slide.

> * The parity for RAID-Z, where is it?

Mentioned on page 17, but no diagram on this link.

Bill Moore presented this talk to BayLisa several months ago, and used a
very similar presentation, but it had a diagram on the "RAID-Z" slide
(the one on page 17) showing the data and parity blocks used by a raidz
pool. Looking through google, I see many links to similar ZFS
presentations, but none with the diagram on that page.

Ah, found one...
http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf
Page 13 in that stack.

> Surely not the checksum stored together in the upper-level direct and
> indirect block pointer? And if not, and it is written as a separate
> block on another disk, then... I am afraid I do not understand.

The parity is written as a separate block on a separate disk. It's very
similar to how RAID4/RAID5 would write parity on another disk.

It's just that for R4/R5, any given data block on disk can be
immediately calculated to be part of a particular stripe on the storage
(which has a particular parity block). In the case of raidz, the stripe
has a maximum length set by the raidz columns, but it may be smaller
than that.

> * Could someone please elaborate more on the statement "Every block is
>   its own RAID-Z stripe"? Is the block being referred to a single
>   block across multiple disks or a single disk?

Every filesystem block (not disk block). So a single filesystem block
that spans multiple disks.

--
Darren Dunham                                       ddunham at taos.com
Senior Technical Consultant       TAOS          http://www.taos.com/
Got some Dr Pepper?                       San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
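To make Darren's last point concrete ("the stripe has a maximum length
set by the raidz columns, but it may be smaller than that"), here is a
small Python sketch. It is purely my own illustration, not ZFS source:
the 512-byte sector size is an assumption, and the real allocator also
does rounding and padding that this ignores. It just shows how many
columns a single logical block ends up occupying on an N-disk raidz1.

    SECTOR = 512  # assumed sector size

    def raidz_columns(block_size, ndisks, nparity=1):
        """Return (data_columns, parity_columns) used by one logical block."""
        data_sectors = -(-block_size // SECTOR)      # ceiling division
        # A block never spreads over more data columns than ndisks - nparity,
        # but a small block uses fewer columns than the vdev is wide.
        data_cols = min(data_sectors, ndisks - nparity)
        return data_cols, nparity

    # A 2 KB block on a 5-disk raidz1 uses 4 data columns + 1 parity column,
    # while a 512-byte block uses only 1 data column + 1 parity column.
    print(raidz_columns(2048, 5))   # -> (4, 1)
    print(raidz_columns(512, 5))    # -> (1, 1)

So the "stripe" for a given block is exactly as wide as that block needs,
up to the width of the vdev.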
I'll make an attempt to keep it simple, and tell what is true in 'most'
cases. For some values of 'most' ;-)

The words used are at times confusing. "Block" mostly refers to a
logical filesystem block, which can be variable in size. There's also
"checksum" and "parity", which are completely independent.

> * The green and blue "blocks" shown in the diagram on page 11: do they
>   represent actual physical blocks on individual disks or a single
>   RAID-Z stripe write across multiple disks?

See page 17: these are logical blocks, and can be variable in size.

> * The parity for RAID-Z, where is it? Surely not the checksum stored
>   together in the upper-level direct and indirect block pointer? And
>   if not, and it is written as a separate block on another disk,
>   then... I am afraid I do not understand.

RAID-Z parity vs zfs checksum:

The parity is just a chunk of xor-ed data written for redundancy, and is
part of the same I/O transaction. The checksum is a much smaller digest
of the data used for detecting the various modes of data corruption.
This is what goes into the metadata (logical) blocks above. A zfs file
system always has checksums and can function without parity.

> * Could someone please elaborate more on the statement "Every block is
>   its own RAID-Z stripe"? Is the block being referred to a single
>   block across multiple disks or a single disk?

If the storage pool uses an n-way raid-z configuration, the (logical)
block is first split into n-1 chunks, and an nth (parity) chunk is added
before any actual I/O takes place. Each chunk goes to a separate disk.

This goes hand in hand with the answer to question 2. Because it's
Copy-on-Write, we only worry about new data when computing parity.

> My sincere apologies if the above questions seem trivial, but I am
> really struggling to reconcile the statement and the diagram.

Example, on a 4-disk raid-z:

  Logical block (one 6k block of fs data; could be any size <= 128k):

    |00|01|02|03|04|05|06|07|08|09|10|11|  (12 x 512b blocks)  --> ::checksum::

  This is split into a single 4x 2k stripe:

  3 chunks of 2k:
    |00|01|02|03|  --> disk1 (4 sectors)
    |04|05|06|07|  --> disk2 (4 sectors)
    |08|09|10|11|  --> disk3 (4 sectors)
  1 chunk of parity:
    |12|13|14|15|  --> disk4 (4 sectors)

::checksum:: is then recorded in the metadata, which gets written in a
separate stripe. This is recursed for the metadata checksum, until we
reach the ueberblock, for which I won't explain the redundancy and
replication here.

Cheers, Henk
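As a companion to Henk's example, here is a small Python sketch (my own
illustration, not ZFS source) of the same split: one 6 KB logical block
becomes three 2 KB data chunks plus one 2 KB XOR parity chunk on a
4-disk raid-z. The byte-wise XOR is the only real content; the names and
sizes simply restate the example above.

    import os

    def split_with_parity(block, ndisks):
        """Split one logical block into ndisks-1 data chunks + 1 XOR parity chunk."""
        ndata = ndisks - 1
        chunk_len = len(block) // ndata          # 6 KB / 3 = 2 KB per column
        chunks = [block[i * chunk_len:(i + 1) * chunk_len] for i in range(ndata)]
        parity = bytearray(chunk_len)
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b                   # column-wise XOR, as in RAID-5
        return chunks, bytes(parity)

    block = os.urandom(6 * 1024)                 # one 6 KB filesystem block
    data_chunks, parity = split_with_parity(block, ndisks=4)

    # Any single lost data chunk can be rebuilt by XOR-ing the parity with
    # the surviving chunks:
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(parity, data_chunks[1], data_chunks[2]))
    assert rebuilt == data_chunks[0]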
Darren and Henk;

Firstly, thank you very much for both of your replies. I am very
grateful indeed to you both for taking time off to answer my questions.

I understand RAID-5 quite well, and from both of your RAID-Z
descriptions I see that the RAID-Z parity is also a separate block on a
separate disk. Very well. This is just like RAID-5.

My confusion is simple. Would this not then also give rise to the
write-hole vulnerability of RAID-5?

Jeff Bonwick states "that there's no way to update two or more disks
atomically, so RAID stripes can become damaged during a crash or power
outage."

If I understand correctly, the data and parity blocks for RAID-Z are
also written in two separate atomic operations, as per RAID-5 (the only
difference being that each stripe can be of a different size).

How then does it fit into Jeff's statement that "Every block is its own
RAID-Z stripe"? (Perhaps I misunderstood, but I now think this statement
refers to the fact that RAID-Z has a variable stripe size, rather than
meaning that each block holds its own RAID-Z parity within itself.)

If the write-hole power outage situation as described by Jeff Bonwick
occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?
Through each block's independent checksum held one level above in the
metadata block? Is this correct? Or am I completely off course?

Henk Langeveld's wonderful character-based diagrams describe what is
basically a standard RAID-5 layout on 4 disks. How is RAID-Z any
different from RAID-5? (Except for the ability to use stripes of
different sizes, which allows RAID-Z to never have to do a
read-modify-write. This increases performance very significantly, but I
am unable to relate it to the write-hole vulnerability issue.)

Warmest Regards
Steven Sim
Fujitsu Asia Pte. Ltd.
Steven Sim wrote:
> Darren and Henk;
>
> Firstly, thank you very much for both of your replies. I am very
> grateful indeed to you both for taking time off to answer my questions.
>
> I understand RAID-5 quite well, and from both of your RAID-Z
> descriptions I see that the RAID-Z parity is also a separate block on a
> separate disk. Very well. This is just like RAID-5.
>
> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?
>
> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."
>
> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).
>
> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)
>
> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

Recall that no written blocks are actually part of the file system until
all the metadata blocks above them are also written. This includes the
uberblock, whose write is atomic. So although the physical writing of
the blocks on different physical disks is not atomic, if a crash occurs
between such writes it is as if none of the writes ever occurred.

> Through each block's independent checksum held one level above in the
> metadata block? Is this correct? Or am I completely off course?

Yes, that is correct.

HTH,
Fred

--
Fred Zlotnick                      Director, Solaris Data Technology
Sun Microsystems, Inc.             fred.zlotnick at sun.com
                                   x85006/+1 650 786 5006
Hi Steven,

Steven Sim wrote:
> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?
>
> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."
>
> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).

Yes, this is correct, writes to RAID-Z member disks are not atomic. But
for the new block to become part of the filesystem you need to update
all the indirect blocks up to the uber-block. The uber-block is exactly
one sector in size, so it can be written atomically.

> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)

Block here is a logical filesystem block. When it is written to a RAID-Z
vdev of N disks it is split into no more than N-1 parts, and one more
part with parity information is computed from these parts. The parts are
then written to the disks in the RAID-Z vdev, one part per disk.

> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

ZFS never writes new data over old data. New data is always written into
unoccupied space. Regular RAID-5 writes new data over old data, and
since it is possible for an outage to occur between two writes (you
cannot write two sectors to two drives atomically), after that your
stripe will be corrupted and there will be no way to tell what is
correct: the parity or the data sector.

Hope this helps,
victor
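Victor's ordering argument can be compressed into a toy model. The code
below is entirely a sketch of the idea (the class and its methods are
made up for illustration, not ZFS interfaces): every write, data and
metadata alike, lands in previously free space, and nothing becomes live
until the final pointer update, which stands in for the single-sector,
atomic uberblock write.

    class ToyPool:
        """Toy model of COW commit ordering; not ZFS code, just the idea."""
        def __init__(self):
            self.blocks = {}          # address -> payload ("on-disk" contents)
            self.next_free = 0        # trivially simple free-space allocator
            self.uberblock = None     # address of the current root, or None

        def write_to_free_space(self, payload):
            addr = self.next_free     # COW: never reuse a live address
            self.next_free += 1
            self.blocks[addr] = payload
            return addr

        def commit(self, dirty_blocks):
            # 1. Write new data (and, on raidz, its parity) into free space.
            data_addrs = [self.write_to_free_space(b) for b in dirty_blocks]
            # 2. Write a new metadata/indirect block tree that points at them.
            new_root = self.write_to_free_space(("root", tuple(data_addrs)))
            # 3. Only now flip the uberblock. This is the single atomic step:
            #    a crash anywhere before this line leaves the old tree intact.
            self.uberblock = new_root

    pool = ToyPool()
    pool.commit([b"old data"])
    old_root = pool.uberblock
    pool.commit([b"new data"])        # the old blocks are still present
    assert old_root in pool.blocks and pool.blocks[old_root][0] == "root"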
Steven Sim wrote:
> I understand RAID-5 quite well, and from both of your RAID-Z
> descriptions I see that the RAID-Z parity is also a separate block on a
> separate disk. Very well. This is just like RAID-5.

Yup. But there's a little bit of magic, which I'll try to explain below.
With more ascii art! ;-)

> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?

Copy-on-Write.

> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."

Traditional RAID writes in place. Each disk requires a separate write
action. Once the first partial block (chunk) is written, the original
data and parity are effectively lost.

> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).

As with RAID-5 on a four-disk stripe, there are four independent writes,
and they don't need to be atomic, as Copy-on-Write implies that the new
blocks are written elsewhere on disk, while maintaining the original
data. Only after all four writes return and are flushed to disk can you
proceed and update the metadata.

> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)

I think what is meant here is that the stripe is generated from
filesystem data. Parity can be computed from file system data long
before you go down to the device level.

Of course, if you're writing into the middle of a file, you will need
that portion in the file system cache, but when an application is busy
on a file, we may assume we have (the relevant portions of) that file in
the cache, so no additional disk reads are needed.

> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

Because it writes somewhere else, due to Copy-on-Write. The original
data is still on disk, untouched.

> Through each block's independent checksum held one level above in the
> metadata block? Is this correct? Or am I completely off course?

Correct.

> Henk Langeveld's wonderful character-based diagrams describe what is
> basically a standard RAID-5 layout on 4 disks. How is RAID-Z any
> different from RAID-5? (Except for the ability to use stripes of
> different sizes, which allows RAID-Z to never have to do a
> read-modify-write. This increases performance very significantly, but I
> am unable to relate it to the write-hole vulnerability issue.)

On-disk layout before the write:

  disk 1     disk 2     disk 3     disk 4     checksum
   _ _ _ _    _ _ _ _    _ _ _ _    _ _ _ _
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|  |X|X|X|X|  cccccccc1
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|

Filesystem cache:

  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|              cccccccc1

Application writes:

   y.y.y.    .y.y.y.y   .y.

Filesystem cache updated:

  |x|y|y|y|  |y|y|y|y|  |y|x|x|x|              cccccccc2

Logical write results (computes parity):

  |x|y|y|y|  |y|y|y|y|  |y|x|x|x|  |Y|Y|Y|Y|   cccccccc2

with four independent physical writes:

  disk 1     disk 2     disk 3     disk 4     checksum
   _ _ _ _    _ _ _ _    _ _ _ _    _ _ _ _
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|  |X|X|X|X|  cccccccc1
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|y|y|y|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |Y|Y|Y|Y|
  |_|_|_|_|  |y|y|y|y|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |y|x|x|x|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|

Note that after these four writes, the original data is still on disk,
and allocated. If no snapshots are taken, those blocks can be
reallocated after the metadata and the uberblock are written.

If the system crashes before the uberblock goes down, the newly written
data is effectively lost, as if it was never written, but neither do we
lose the original data (and parity). If a disk dies during a crash, we
still have the original parity.

Cheers, Henk
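To see the same contrast in miniature, here is a toy simulation (my own
illustration; the "disks" are just Python byte strings) of a crash that
lands after the data write but before the parity write, first for an
in-place RAID-5 update and then for the copy-on-write layout Henk draws
above.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Two toy data columns and one parity column.
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)

    # --- Traditional RAID-5: update d0 in place, crash before parity write ---
    d0 = b"CCCC"                     # new data overwrites old data
    # <crash here>                   # parity still reflects the OLD d0
    assert xor(d1, parity) != d0     # rebuilding d0 from d1+parity now gives
                                     # the old value; the stripe is inconsistent

    # --- RAID-Z / COW: write the new stripe somewhere else ---
    old_stripe = (b"AAAA", b"BBBB")
    old_parity = xor(*old_stripe)
    new_stripe = (b"CCCC", b"BBBB")
    new_parity = xor(*new_stripe)    # written to fresh blocks
    # <crash here>                   # the uberblock still points at old_stripe,
    assert xor(old_stripe[1], old_parity) == old_stripe[0]  # which is consistent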
Hello Henk,

Friday, May 18, 2007, 12:09:40 AM, you wrote:

>> If I understand correctly, the data and parity blocks for RAID-Z are
>> also written in two separate atomic operations, as per RAID-5 (the
>> only difference being that each stripe can be of a different size).

HL> As with RAID-5 on a four-disk stripe, there are four independent
HL> writes, and they don't need to be atomic, as Copy-on-Write implies
HL> that the new blocks are written elsewhere on disk, while maintaining
HL> the original data. Only after all four writes return and are flushed
HL> to disk can you proceed and update the metadata.

And to make things clear: metadata is also updated in the spirit of COW,
so the metadata is written to new locations and then the uber block is
atomically updated to point to the new metadata.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
>>> If I understand correctly, the data and parity blocks for RAID-Z are
>>> also written in two separate atomic operations, as per RAID-5 (the
>>> only difference being that each stripe can be of a different size).
>
> HL> As with RAID-5 on a four-disk stripe, there are four independent
> HL> writes, and they don't need to be atomic, as Copy-on-Write implies
> HL> that the new blocks are written elsewhere on disk, while
> HL> maintaining the original data. Only after all four writes return
> HL> and are flushed to disk can you proceed and update the metadata.
>
> And to make things clear: metadata is also updated in the spirit of
> COW, so the metadata is written to new locations and then the uber
> block is atomically updated to point to the new metadata.

Well, to add to this, uber-blocks are also updated in COW fashion: there
is a circular array of 128 uber-blocks, and the new uber-block is
written to the slot next to the current one.

victor
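A sketch of how such a circular array can behave on recovery (my own
simplification; the class and field names are invented, not ZFS
structures): each commit writes the new uberblock into another of the
128 slots, and on open the pool takes the valid slot with the highest
transaction-group number, so a torn write of the newest slot simply
means falling back to the previous one.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Uberblock:
        txg: int          # transaction group number
        root: object      # pointer to the root of the block tree
        valid: bool       # stands in for "checksum verifies"

    class UberblockRing:
        SLOTS = 128

        def __init__(self):
            self.slots: list[Optional[Uberblock]] = [None] * self.SLOTS

        def commit(self, ub: Uberblock):
            # The new uberblock goes into a different slot than the current
            # one, so the previous uberblock is never overwritten by a commit.
            self.slots[ub.txg % self.SLOTS] = ub

        def active(self) -> Optional[Uberblock]:
            # On import, use the valid uberblock with the highest txg.
            candidates = [u for u in self.slots if u is not None and u.valid]
            return max(candidates, key=lambda u: u.txg, default=None)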
HL>> And to make things clear: metadata is also updated in the spirit of
HL>> COW, so the metadata is written to new locations and then the uber
HL>> block is atomically updated to point to the new metadata.

Victor Latushkin wrote:
> Well, to add to this, uber-blocks are also updated in COW fashion:
> there is a circular array of 128 uber-blocks, and the new uber-block
> is written to the slot next to the current one.

Correct, I left it out because there's more detail involved with the
uberblock. We can deal with it when we get there.

Cheers, Henk
Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
> Gurus;
> I am exceedingly impressed by ZFS, although in my humble opinion Sun
> is not doing enough evangelizing for it.

What else do you think we should be doing?


David
David Bustos wrote:
> Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
>> Gurus;
>> I am exceedingly impressed by ZFS, although in my humble opinion Sun
>> is not doing enough evangelizing for it.
>
> What else do you think we should be doing?

Send Thumpers to every respectable journal for a review!

That's probably a problem for marketing: how to target the publications
that the people with the checkbooks read, to broaden the awareness of
ZFS.

Just about every x86 server manufacturer provides and promotes the
features of hardware RAID solutions. Maybe Sun should make more of the
cost savings in storage that ZFS offers to gain a cost advantage over
the competition, or even "save $ on HP servers by running Solaris and
removing the RAID".

How about some JBOD-only storage products? Or at least make hardware
RAID an add-on option, to cater for a broader market.

Trying to break (especially Windows) administrators and CIOs out of the
"hardware RAID is best" or even "hardware RAID is essential" mindset is
a tough ask. As hardware RAID drops in price and moves into
consumer-grade products, ZFS will lose the cost advantage (just try to
get a JBOD-only SATA card; I only know of one).

Ian
On 18-May-07, at 4:39 PM, Ian Collins wrote:
> David Bustos wrote:
> ... maybe Sun should make more of the cost savings in storage that ZFS
> offers to gain a cost advantage over the competition,

Cheaper AND more robust+featureful is hard to beat.

--T
> Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
>> Gurus;
>> I am exceedingly impressed by ZFS, although in my humble opinion Sun
>> is not doing enough evangelizing for it.
>
> What else do you think we should be doing?
>
> David

I'll jump in here. I am a huge fan of ZFS. At the same time, I know
about some of its warts.

ZFS hints at adding agility to data management and is a wonderful
system. At the same time, it operates on some assumptions which are
antithetical to data agility, including:

* inability to online restripe: add/remove data/parity disks
* inability to make effective use of varying-sized disks

In one breath ZFS says, "Look how well you can dynamically alter
filesystem storage." In another breath ZFS says, "Make sure that your
pools have identical spindles and that you have accurately predicted
future bandwidth, access time, vdev size, and parity disks. Because you
can't change any of that later."

I know, down the road you can tack new vdevs onto the pool, but that
really misses the point. Even so, if I accidentally add a vdev to a pool
and then realize my mistake, I am sunk. Once a vdev is added to a pool,
it is attached to the pool forever.

Ideally I could provision a vdev, later decide that I need a disk/LUN
from that vdev, and simply remove the disk/LUN, decreasing the vdev
capacity. I should have the ability to decide that current redundancy
needs are insufficient and allocate *any* number of new parity disks. I
should be able to have a pool from a rack of 15x250GB disks and then
later add a rack of 11x750GB disks *to the vdev*, not by making another
vdev. I should have the luxury of deciding to put hot Oracle indexes on
their own vdev, deallocate spindles from an existing vdev, and put those
indexes on the new vdev. I should be able to change my mind later and
put it all back.

Most important is the access time issue. Since there are no
partial-stripe reads in ZFS, the access time for a RAIDZ vdev is the
same as single-disk access time, no matter how wide the stripe is.

How to evangelize better? Get rid of the glaring "you can't change it
later" problems.

Another thought is that flash storage has all of the indicators of being
a disruptive technology described in "The Innovator's Dilemma". What
this means is that flash storage *will* take over hard disks. It is
inevitable. ZFS has a weakness with access times but handles
single-block corruption very nicely. ZFS also has the ability to do very
wide RAIDZ stripes, up to 256(?) devices, providing mind-numbing
throughput. Flash has near-zero access times and relatively low
throughput. Flash is also prone to single-block failures once the erase
limit has been reached for a block. ZFS + Flash = near-zero access time,
very high throughput, and high data integrity.

To answer the question: get rid of the limitations and build a
Thumper-like device using flash. Market it for Oracle redo logs, temp
space, swap space (flash is now cheaper than RAM), anything that needs
massive throughput and ridiculous iops numbers, but not necessarily huge
storage. Each month, the cost of flash will fall 4% anyway, so get ahead
of the curve now.

My 2 cents, at least.

Marty

This message posted from opensolaris.org
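Marty's access-time point is easiest to see with some rule-of-thumb
arithmetic. The model and the numbers below are assumptions for
illustration only (real pools have caching, multiple vdevs, and mixed
workloads), but they show why a wide raidz vdev delivers roughly
single-disk IOPS for small random reads while mirrors scale with spindle
count.

    DISK_IOPS = 150  # assumed small-random-read IOPS of one spindle

    def random_read_iops(ndisks, layout):
        """Rule-of-thumb model, not a benchmark."""
        if layout == "mirror":
            # Each disk can serve independent reads, so IOPS scale with spindles.
            return ndisks * DISK_IOPS
        if layout == "raidz":
            # Every block is one full stripe, so a read touches all data columns
            # and the vdev behaves like roughly one disk for random reads.
            return DISK_IOPS
        raise ValueError(layout)

    print(random_read_iops(8, "mirror"))  # ~1200
    print(random_read_iops(8, "raidz"))   # ~150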