Hi,

I've been trying to understand how transactional writes work in
RAID-Z.  I think I understand the ZFS system for transactional writes
in general (the only place I could find that info was Wikipedia;
someone should fix that!).  For RAID-Z it seems to me that the only
way to make it transactional is to have a single tree of blocks
spanning all disks, so that, e.g., the blocks in a stripe across three
disks will all share a single parent block.  Is that how it works?

Oh, also, doesn't the model (non-RAID, even) imply that every write
generates at least log(N) writes?  (The parent node changes because it
stores child pointers, and the change propagates up the tree.)  I
understand that transaction groups would limit that effect, of course.

Thanks,

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
Hello David,

Tuesday, July 11, 2006, 8:34:10 PM, you wrote:

DA> For RAID-Z it seems to me that the only way to make it
DA> transactional is to have a single tree of blocks spanning all
DA> disks, so that, e.g., the blocks in a stripe across three disks
DA> will all share a single parent block.  Is that how it works?

With RAID-Z, one FS block is spread onto N-1 disks plus one parity
disk.  So you have one FS block which is divided into N-1 smaller
"device" blocks.  It's a clever trick with its own pros and cons.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
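Robert's split can be sketched in a few lines.  This is a toy
illustration under simplifying assumptions (fixed-width chunks, single
XOR parity; the function names are invented, and real RAID-Z uses
variable stripe width and dynamic parity placement, not this layout):

```python
# Toy sketch, NOT the real ZFS on-disk format: one logical FS block is
# divided into n_disks-1 "device" chunks plus one XOR parity chunk.

def split_with_parity(fs_block: bytes, n_disks: int) -> list[bytes]:
    """Divide fs_block across n_disks-1 data chunks and append XOR parity."""
    n_data = n_disks - 1
    chunk_len = -(-len(fs_block) // n_data)              # ceiling division
    padded = fs_block.ljust(n_data * chunk_len, b"\0")   # pad the tail chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    parity = bytearray(chunk_len)
    for chunk in chunks:                                 # parity = XOR of all data
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks + [bytes(parity)]

def rebuild(pieces: list, lost: int) -> bytes:
    """Recover one missing piece (data or parity) by XOR-ing the survivors."""
    length = next(len(p) for j, p in enumerate(pieces) if j != lost)
    out = bytearray(length)
    for j, piece in enumerate(pieces):
        if j == lost:
            continue
        for i, byte in enumerate(piece):
            out[i] ^= byte
    return bytes(out)

pieces = split_with_parity(b"hello raid-z world", 3)   # 2 data chunks + parity
saved = pieces[0]
pieces[0] = None                                       # one disk dies
assert rebuild(pieces, 0) == saved                     # XOR recovers the chunk
```

The same XOR trick recovers the parity chunk itself if the parity disk
is the one that fails.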
Robert Milkowski <rmilkowski at task.gda.pl> writes:

> With RAID-Z, one FS block is spread onto N-1 disks plus one parity
> disk.  So you have one FS block which is divided into N-1 smaller
> "device" blocks.  It's a clever trick with its own pros and cons.

It's not unfamiliar to me; that's standard RAID5-ish striping, AFAICT.
But you're not answering my question:

  How can RAID-Z preserve transactional semantics when a single
  FS block write requires writing to multiple physical devices?

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
On Tue, Jul 11, 2006 at 11:03:17PM -0400, David Abrahams wrote:

> How can RAID-Z preserve transactional semantics when a single
> FS block write requires writing to multiple physical devices?

ZFS uses a technique that's been used in databases for years: phase
trees.  First you write all the subtrees you're updating to disk (to
currently free space; this is the COW part), wait for them to sync,
then update the tree's root (the uberblock) in a two-phase commit.  It
doesn't matter whether you're writing to multiple independent disks or
to multiple disks in a RAID-Z stripe: the individual writes don't need
to be atomic, just the update to the root of the tree.

The other trick is that with RAID-Z, every logical filesystem block
(512B - 128KB) is its own stripe with its own parity.  So by writing a
new block, you're not disturbing the parity of any old blocks.

See Jeff Bonwick's blog post on RAID-Z to learn more:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

--Bill
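Bill's ordering (write new subtrees to free space first, publish the
root last) can be sketched as follows.  This is a hypothetical model,
not ZFS code; the fake block store and all names are invented for
illustration.  It also shows the log(N) write amplification David
asked about: rewriting one leaf rewrites every ancestor up to the
root:

```python
# Sketch of a copy-on-write tree update (assumed/simplified model, not
# ZFS code).  New versions of a leaf and its ancestors go to fresh
# "disk" addresses; the old tree stays fully valid until the root
# pointer is published, which is what makes the update transactional.

disk = {}           # fake block store: address -> payload
next_addr = 0

def write_block(payload):
    """Allocate a fresh address (never overwrite in place) and store payload."""
    global next_addr
    addr = next_addr
    next_addr += 1
    disk[addr] = payload
    return addr

def cow_update(root_addr, path, new_data):
    """Rewrite the leaf at `path` (a list of child indices from the root
    down), returning the address of a brand-new root."""
    if not path:
        return write_block(new_data)        # new leaf, fresh location
    children = list(disk[root_addr])        # interior block = tuple of child addrs
    children[path[0]] = cow_update(children[path[0]], path[1:], new_data)
    return write_block(tuple(children))     # parent rewritten too: log(N) writes

# Build a tiny two-level tree: root -> two leaves.
leaf_a = write_block(b"old A")
leaf_b = write_block(b"old B")
root = write_block((leaf_a, leaf_b))

new_root = cow_update(root, [0], b"new A")
# Both trees now coexist on "disk"; committing is just publishing new_root.
assert disk[disk[root][0]] == b"old A"      # old tree is untouched
assert disk[disk[new_root][0]] == b"new A"  # new tree is complete
```

A crash before the root pointer is published simply leaves the old
tree in place, with the new blocks sitting in what is still free
space.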
> But you're not answering my question:
>
>   How can RAID-Z preserve transactional semantics when a single
>   FS block write requires writing to multiple physical devices?

Since transactions in ZFS aren't committed until the ueberblock is
written, this boils down to:

  "How is the ueberblock committed atomically in a RAID-Z
  configuration?"

Correct?

Casper
> Since transactions in ZFS aren't committed until the ueberblock is
> written, this boils down to:
>
>   "How is the ueberblock committed atomically in a RAID-Z
>   configuration?"

RAID-Z isn't even necessary to have this issue; all you need is a disk
that doesn't actually guarantee atomicity of single-sector writes.
Which, of course, we have to cope with.

The key is that there's actually a ring of 128 uberblocks, indexed by
transaction group number (mod 128).  When we open a storage pool, we
read every uberblock; among those that have a valid SHA-256 checksum,
we take the one with the highest transaction group (txg).  That is, by
definition, the uberblock for the last txg that successfully committed
to disk.

If we lose power in the middle of writing an uberblock, that uberblock
won't checksum, so we'll use the one from the previous txg, i.e. the
last one that synced completely.

Jeff
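Jeff's ring can be sketched like this.  The record layout and every
name here are invented; real uberblocks live in the vdev labels with
many more fields, but the select-the-highest-valid-txg logic is the
same idea:

```python
# Sketch (assumed layout, heavily simplified): a ring of 128 uberblock
# slots, each self-checksummed with SHA-256.  On pool open, scan all
# slots and take the valid one with the highest txg; a torn write just
# fails its checksum and loses to the previous txg's slot.
import hashlib
import struct

RING = 128

def make_uberblock(txg: int, root_ptr: int) -> bytes:
    body = struct.pack("<QQ", txg, root_ptr)       # txg + root block pointer
    return body + hashlib.sha256(body).digest()    # self-checksummed record

def write_uberblock(ring: list, txg: int, root_ptr: int) -> None:
    ring[txg % RING] = make_uberblock(txg, root_ptr)

def active_uberblock(ring: list):
    """Return (txg, root_ptr) of the newest slot whose checksum verifies."""
    best = None
    for slot in ring:
        if slot is None:
            continue
        body, digest = slot[:16], slot[16:]
        if hashlib.sha256(body).digest() != digest:
            continue                               # torn or corrupt: skip it
        txg, root_ptr = struct.unpack("<QQ", body)
        if best is None or txg > best[0]:
            best = (txg, root_ptr)
    return best

ring = [None] * RING
for txg in range(1, 6):
    write_uberblock(ring, txg, root_ptr=100 + txg)

# Simulate a torn write of txg 5: flip one bit of its stored record.
ring[5] = ring[5][:-1] + bytes([ring[5][-1] ^ 1])
assert active_uberblock(ring) == (4, 104)          # falls back to last full txg
```

Note how the never-written slots (`None` here, birth-time-zero blocks
in the real thing) are simply skipped by the scan.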
> The key is that there's actually a ring of 128 uberblocks, indexed
> by transaction group number (mod 128).  When we open a storage pool,
> we read every uberblock; among those that have a valid SHA-256
> checksum, we take the one with the highest transaction group (txg).
> That is, by definition, the uberblock for the last txg that
> successfully committed to disk.
>
> If we lose power in the middle of writing an uberblock, that
> uberblock won't checksum, so we'll use the one from the previous
> txg, i.e. the last one that synced completely.

Thanks for providing this last bit of my mental ZFS picture.

Does ZFS keep statistics on how many ueberblocks are bad when it
imports a pool?  Or is it the case that when fewer than 128
ueberblocks have ever been committed, the remainder will be bogus?

Casper
> Thanks for providing this last bit of my mental ZFS picture.
>
> Does ZFS keep statistics on how many ueberblocks are bad when
> it imports a pool?

No.  We could, of course, but I'm not sure how it would be useful.

> Or is it the case that when fewer than 128 ueberblocks have ever
> been committed, the remainder will be bogus?

Right.  We actually initialize them all during pool creation, so that
they're not just random garbage; but they all have birth times of
zero, indicating that they're not valid.

Jeff
>> Does ZFS keep statistics on how many ueberblocks are bad when
>> it imports a pool?
>
> No.  We could, of course, but I'm not sure how it would be useful.

It would probably be one of those "nothing is wrong" call generators
induced by strange error messages and bogus statistics.

>> Or is it the case that when fewer than 128 ueberblocks have ever
>> been committed, the remainder will be bogus?
>
> Right.  We actually initialize them all during pool creation, so
> that they're not just random garbage; but they all have birth times
> of zero, indicating that they're not valid.

You'd probably have to initialize them anyway, for fear of them being
valid leftovers from old pools.

Casper
Casper.Dik at Sun.COM writes:

>> But you're not answering my question:
>>
>>   How can RAID-Z preserve transactional semantics when a single
>>   FS block write requires writing to multiple physical devices?
>
> Since transactions in ZFS aren't committed until the ueberblock is
> written, this boils down to:
>
>   "How is the ueberblock committed atomically in a RAID-Z
>   configuration?"
>
> Correct?

Correct.  And thanks to Jeff for the cogent answer.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
Jeff Bonwick <bonwick at zion.eng.sun.com> writes:

> The key is that there's actually a ring of 128 uberblocks, indexed
> by transaction group number (mod 128).  When we open a storage pool,
> we read every uberblock; among those that have a valid SHA-256
> checksum, we take the one with the highest transaction group (txg).
> That is, by definition, the uberblock for the last txg that
> successfully committed to disk.
>
> If we lose power in the middle of writing an uberblock, that
> uberblock won't checksum, so we'll use the one from the previous
> txg, i.e. the last one that synced completely.

I *love* information!  I'm not a DB or filesystem engineer, so much of
this is not stuff I work with every day, and I've always wondered how
people got around issues like power failures and other uncontrollable
shutdowns really reliably and cleanly.  This is a way I haven't read
about before, and it makes perfect sense and seems fairly cheap (you
only have to look at all 128 at startup).

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://dd-b.lighthunters.net/>
<http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>