Will a write(2) to a ZFS file be made durable atomically?

Under the hood in ZFS, writes are committed using either shadow paging or logging, as I understand it. So I believe that I mean to ask whether a write(2), pushed to ZPL, and pushed on down the stack, can be split into multiple transactions? Or, instead, is it guaranteed to be committed in a single transaction, and so committed atomically?

Thanks!

--
Chris Frost
http://www.frostnet.net/chris/
On Nov 30, 2009, at 8:30 PM, Chris Frost wrote:

> Will a write(2) to a ZFS file be made durable atomically?

Yes or no, as specified by the options set at open(2). Note: it is worthwhile to know whether your hardware honors cache flush requests; otherwise all bets are off.

> Under the hood in ZFS, writes are committed using either shadow paging or
> logging, as I understand it. So I believe that I mean to ask whether a
> write(2), pushed to ZPL, and pushed on down the stack, can be split into
> multiple transactions? Or, instead, is it guaranteed to be committed in a
> single transaction, and so committed atomically?

ZPL is the ZFS POSIX Layer. I believe you meant to say the ZFS intent log (ZIL) instead. There is a lot of material online about the ZIL and how it works. Neil's blog is often cited:
http://blogs.sun.com/perrin/entry/the_lumberjack

-- richard
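A minimal sketch of what "options set at open(2)" might look like in practice, assuming Richard is referring to the POSIX synchronous-I/O flags O_SYNC and O_DSYNC. The helper name write_synchronously is hypothetical, and the fallback to fsync() is an assumption for platforms that lack both flags. Note that, as far as POSIX specifies, these flags control when a write is reported durable; they do not by themselves promise that a large write is all-or-nothing across a crash.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write data so each os.write() returns only once the OS reports it stable.

    Hypothetical helper for illustration; not ZFS-specific.
    """
    # Prefer O_DSYNC (data integrity), then O_SYNC; 0 means neither exists here.
    sync_flag = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        view = memoryview(data)
        written = 0
        # write(2) may accept fewer bytes than requested, so loop until done.
        while written < len(view):
            written += os.write(fd, view[written:])
        if sync_flag == 0:
            # Assumed fallback: flush explicitly when no sync flag is available.
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    print(write_synchronously(demo_path, b"x" * (512 * 1024)))
```

Whether the data really is on stable storage when the call returns still depends on the hardware honoring cache flush requests, as noted above.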
>> Under the hood in ZFS, writes are committed using either shadow paging or
>> logging, as I understand it. So I believe that I mean to ask whether a
>> write(2), pushed to ZPL, and pushed on down the stack, can be split into
>> multiple transactions? Or, instead, is it guaranteed to be committed in a
>> single transaction, and so committed atomically?

A write made through the ZPL (zfs_write()) will be broken into transactions that contain at most 128KB user data. So a large write could well be split across transaction groups, and thus committed separately.

Neil.
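The arithmetic behind the 128KB limit can be sketched as follows. This is a simplified model, not the actual zfs_write() code: it assumes the write is cut sequentially into pieces of at most 128KB and ignores any block-alignment rules the real implementation may apply. The names MAX_TX_DATA and split_write are hypothetical.

```python
MAX_TX_DATA = 128 * 1024  # the 128KB per-transaction limit quoted above


def split_write(offset, length, max_chunk=MAX_TX_DATA):
    """Return (offset, length) pairs, one per modeled transaction.

    Simplified model: sequential cuts, no block alignment (an assumption).
    """
    chunks = []
    end = offset + length
    while offset < end:
        n = min(max_chunk, end - offset)  # last piece may be shorter
        chunks.append((offset, n))
        offset += n
    return chunks


if __name__ == "__main__":
    for chunk in split_write(0, 512 * 1024):
        print(chunk)
```

Under this model a 512KB write becomes four pieces, each of which could land in a different transaction group.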
On Mon, Nov 30, 2009 at 11:03:07PM -0700, Neil Perrin wrote:
> A write made through the ZPL (zfs_write()) will be broken into transactions
> that contain at most 128KB user data. So a large write could well be split
> across transaction groups, and thus committed separately.

That answers my exact question; thanks!

And Richard, thanks, too. Sorry that my question wasn't stated clearly enough to avoid causing confusion about whether I asked about the timing of durability vs. the atomicity of writes with respect to failures.

--
Chris Frost
http://www.frostnet.net/chris/
On Mon, Nov 30, 2009 at 10:23:06PM -0800, Chris Frost wrote:
> On Mon, Nov 30, 2009 at 11:03:07PM -0700, Neil Perrin wrote:
> > A write made through the ZPL (zfs_write()) will be broken into transactions
> > that contain at most 128KB user data. So a large write could well be split
> > across transaction groups, and thus committed separately.
>
> That answers my exact question; thanks!

For my PhD thesis I am working on file systems that build on shadow paging and am interested in the design choices behind ZFS. Off the top of your head, how fundamental would you say it is for the system to split each zfs_write() into transactions <=128KB in size? That is, could the system support far larger transactions easily and efficiently? Could the system be made to support transactions that are bounded in size only by free pool space?

thanks again,

--
Chris Frost
http://www.frostnet.net/chris/
Neil Perrin wrote:
>>> Under the hood in ZFS, writes are committed using either shadow paging or
>>> logging, as I understand it. So I believe that I mean to ask whether a
>>> write(2), pushed to ZPL, and pushed on down the stack, can be split into
>>> multiple transactions? Or, instead, is it guaranteed to be committed in a
>>> single transaction, and so committed atomically?
>
> A write made through the ZPL (zfs_write()) will be broken into transactions
> that contain at most 128KB user data. So a large write could well be split
> across transaction groups, and thus committed separately.

So what happens if an application is doing a synchronous write of, let's say, 512KB? The write will be split into at least 4 separate transactions, and the write will be confirmed to the application only after all 512KB has been written. But is it possible that, if the system crashes after the first transaction has been committed, 128KB of the write has been committed to disk even though the write was never confirmed to the application? Or will it be rolled back?

Basically, for synchronous writes of more than 128KB: is it guaranteed that either all of the data in a given write is committed or none of it is?

--
Robert Milkowski
http://milek.blogspot.com