Hello,

Often fsync() is used not because one cares that some piece of data is on
stable storage, but because one wants to ensure that subsequent I/O
operations are performed only after previous I/O operations are on stable
storage. In these cases the latency introduced by an fsync() is completely
unnecessary. An fbarrier() or similar would be extremely useful to get the
proper semantics while still allowing for better performance than what you
get with fsync().

My assumption has been that this has not traditionally been implemented for
reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost trivial
to implement fbarrier()? Basically just choose to wrap up the transaction at
the point of fbarrier() and that's it.

Am I missing something?

(I do not actually have a use case for this on ZFS, since my experience with
ZFS is thus far limited to my home storage server. But I have wished for an
fbarrier() many times over the past few years...)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
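[For illustration, a minimal sketch of the pattern described above, assuming
a hypothetical fbarrier() syscall that does not actually exist; the file
names and record contents are made up.]

/* Sketch: fsync() used only for ordering, not durability. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const char rec[] = "intent: update record 42\n";
        const char dat[] = "record 42: new contents\n";
        int logfd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int datfd = open("data.db", O_WRONLY | O_CREAT, 0644);

        if (logfd == -1 || datfd == -1)
                return 1;

        (void) write(logfd, rec, strlen(rec));

        /*
         * All the application needs here is ordering: the data write must
         * not reach stable storage before the journal record does.  fsync()
         * provides that, but also forces a synchronous wait for the disk.
         */
        (void) fsync(logfd);

        /* A hypothetical fbarrier(logfd) would give the same ordering
         * guarantee without blocking until the record is on stable storage. */

        (void) write(datfd, dat, strlen(dat));

        close(logfd);
        close(datfd);
        return 0;
}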
On Mon, 12 Feb 2007, Peter Schuller wrote:
[ ... ]
> Given ZFS's copy-on-write transactional model, would it not be almost trivial
> to implement fbarrier()? Basically just choose to wrap up the transaction at
> the point of fbarrier() and that's it.
>
> Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes?

The only guarantee of in-order writes, at the actual storage level, is to
complete the outstanding ones before issuing new ones.

Or am _I_ now missing something :)

FrankH.
2007/2/12, Frank Hofmann <Frank.Hofmann at sun.com>:
[ ... ]
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes?
>
> The only guarantee of in-order writes, at the actual storage level, is to
> complete the outstanding ones before issuing new ones.

This is true for NCQ with SATA, but SCSI also supports ordered tags, so it
should not be necessary.

At least, that is my understanding.

Chris
On Mon, 12 Feb 2007, Chris Csanady wrote:
[ ... ]
> This is true for NCQ with SATA, but SCSI also supports ordered tags, so it
> should not be necessary.
>
> At least, that is my understanding.

Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
one may or may not treat async I/O requests dispatched via its strategy()
entry point as strictly ordered / non-coalescible / non-cancellable.

See e.g. disksort(9F).

FrankH.
Peter Schuller wrote:
[ ... ]
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure that subsequent I/O
> operations are performed only after previous I/O operations are on stable
> storage. In these cases the latency introduced by an fsync() is completely
> unnecessary.
[ ... ]

Hmmm... is store ordering what you're looking for? E.g. make sure that in
the case of power failure, all previous writes will be visible after reboot
if any subsequent writes are visible.

- Bart

--
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
On 12-Feb-07, at 5:55 PM, Frank Hofmann wrote:
[ ... ]
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes?
>
> The only guarantee of in-order writes, at the actual storage level, is to
> complete the outstanding ones before issuing new ones.
>
> Or am _I_ now missing something :)

I'm no guru, but would not ZFS already require strict ordering for its
transactions ... which property Peter was exploiting to get "fbarrier()"
for free?

--Toby
On Mon, 12 Feb 2007, Toby Thain wrote:
[ ... ]
> I'm no guru, but would not ZFS already require strict ordering for its
> transactions ... which property Peter was exploiting to get "fbarrier()"
> for free?

It achieves this by flushing the disk write cache when there is a need to
barrier, which completes the outstanding writes. A "perfect fsync()" for
ZFS shouldn't need to do much more; that it currently does is, as I
understand it, being worked on.

FrankH.
2007/2/12, Frank Hofmann <Frank.Hofmann at sun.com>:
[ ... ]
> Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
> one may or may not treat async I/O requests dispatched via its strategy()
> entry point as strictly ordered / non-coalescible / non-cancellable.
>
> See e.g. disksort(9F).

Yes, however, this functionality could be exposed through the target
driver. While the implementation does not (yet) take full advantage of
ordered tags, Linux does provide an interface to do this:

http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

From a correctness standpoint, the interface seems worthwhile, even if
the mechanisms are never implemented. It just feels wrong to execute a
synchronize cache command from ZFS, when often that is not the intention.
The changes to ZFS itself would be very minor.

That said, actually implementing the underlying mechanisms may not be
worth the trouble. It is only a matter of time before disks have fast
non-volatile memory like PRAM or MRAM, and then the need to do explicit
cache management basically disappears.

Chris
Toby Thain wrote:
> I'm no guru, but would not ZFS already require strict ordering for its
> transactions ... which property Peter was exploiting to get "fbarrier()"
> for free?

Exactly. Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering. Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group. That entire group will
either commit or not. ZFS writes all the new data to new locations,
then flushes all disk write caches, then writes the new uberblock,
then flushes the caches again. Thus you can lose power at any point
in the middle of committing transaction group N, and you're guaranteed
that upon reboot, everything will either be at state N or state N-1.

I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
thing is that on ZFS, fbarrier() is a no-op. It's implicit after
every system call.

Jeff
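[One way to picture the sequence Jeff describes; this is a sketch with
made-up names (txg_t, write_new_blocks, flush_write_caches, write_uberblock),
not the actual ZFS code.]

#include <stdio.h>

typedef struct txg {
        int txg_number;         /* transaction group N */
} txg_t;

/* Stub: write all new data/metadata of this group to unused locations. */
static void write_new_blocks(txg_t *txg)  { printf("write blocks for txg %d\n", txg->txg_number); }
/* Stub: issue a cache flush to every disk in the pool and wait for it. */
static void flush_write_caches(void)      { printf("flush write caches\n"); }
/* Stub: write the new uberblock, making the new tree the live one. */
static void write_uberblock(txg_t *txg)   { printf("write uberblock for txg %d\n", txg->txg_number); }

static void
txg_commit(txg_t *txg)
{
        write_new_blocks(txg);   /* nothing in the live tree points here yet */
        flush_write_caches();    /* new blocks are now on stable storage     */
        write_uberblock(txg);    /* atomic switch from state N-1 to state N  */
        flush_write_caches();    /* the uberblock itself is now stable       */
}

int
main(void)
{
        txg_t txg = { .txg_number = 42 };
        txg_commit(&txg);        /* power loss anywhere leaves N-1 or N */
        return 0;
}

[The key property is that the uberblock write is a single atomic switch:
until it lands, every block written earlier in the group is unreferenced,
so a crash at any point leaves the pool at state N-1 or state N.]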
Jeff Bonwick,

Do you agree that there is a major tradeoff in "builds up a wad of
transactions in memory"? We lose the changes if we have an unstable
environment.

Thus, I don't quite understand why a 2-phase approach to commits isn't
done. First, take the transactions as they come and do a minimal amount
of delayed writing. If the number of transactions builds up, then convert
to the delayed write scheme.

The assumption is that not all ZFS environments are write heavy versus
write-once, read-many type accesses. My assumption is that attribute/meta
reading outweighs all other accesses.

Wouldn't this approach allow minimal outstanding transactions and favor
read access? Yes, the assumption is that once the "wad" is started, the
amount of writing could be substantial and thus the amount of available
bandwidth for reading is reduced. This would then allow for more N states
to be available. Right?

Second, there are multiple uses of "then" (then pushes, then flushes all
disk..., then writes the new uberblock, then flushes the caches again),
which seems to have some level of possible parallelism that should reduce
the latency from the start to the final write. Or did you just say that
for simplicity's sake?

Mitchell Erblich
-------------------

Jeff Bonwick wrote:
[ ... ]
> ZFS writes all the new data to new locations,
> then flushes all disk write caches, then writes the new uberblock,
> then flushes the caches again. Thus you can lose power at any point
> in the middle of committing transaction group N, and you're guaranteed
> that upon reboot, everything will either be at state N or state N-1.
> Do you agree that there is a major tradeoff in
> "builds up a wad of transactions in memory"?

I don't think so. We trigger a transaction group commit when we have
lots of dirty data, or 5 seconds elapse, whichever comes first.
In other words, we don't let updates get stale.

Jeff
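[A rough sketch of that commit policy; the threshold value and names below
are illustrative, not ZFS's actual tunables.]

#include <stdbool.h>
#include <stdint.h>

#define TXG_TIMEOUT_SECS   5
#define DIRTY_DATA_LIMIT   (64ULL * 1024 * 1024)   /* hypothetical threshold */

struct txg_state {
        uint64_t dirty_bytes;      /* data buffered in the open txg */
        uint64_t seconds_open;     /* time since the txg was opened */
};

/* Decide whether the currently open transaction group should be committed. */
static bool
txg_should_commit(const struct txg_state *ts)
{
        if (ts->dirty_bytes >= DIRTY_DATA_LIMIT)
                return true;            /* lots of dirty data */
        if (ts->seconds_open >= TXG_TIMEOUT_SECS)
                return true;            /* don't let updates get stale */
        return false;
}

int
main(void)
{
        struct txg_state ts = { .dirty_bytes = 1024, .seconds_open = 6 };
        return txg_should_commit(&ts) ? 0 : 1;   /* commits: timer expired */
}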
> That said, actually implementing the underlying mechanisms may not be
> worth the trouble. It is only a matter of time before disks have fast
> non-volatile memory like PRAM or MRAM, and then the need to do explicit
> cache management basically disappears.

I meant fbarrier() as a syscall exposed to userland, like fsync(), so that
userland applications can achieve ordered semantics without synchronous
writes. Whether or not ZFS in turn manages to eliminate synchronous writes
by some feature of the underlying storage mechanism is a separate issue.
But even if not, an fbarrier() exposes an asynchronous method of ensuring
relative order of I/O operations to userland, which is often useful.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> every system call.

That is interesting. Could this account for disproportionate kernel CPU
usage for applications that perform I/O one byte at a time, as compared to
other filesystems? (Never mind that the application shouldn't do that to
begin with.)

But the fact that you effectively have an fbarrier() is extremely nice.
Guess that is yet another reason to prefer ZFS for certain (granted, very
specific) cases.

I still would love to see something like fbarrier() defined by some
standard (de facto or otherwise) to make the distinction between ordered
writes and guaranteed persistence more easily exploited in the general
case for applications, and to encourage filesystems/storage systems to
optimize for that case (i.e., not have fbarrier() simply be fsync()).

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> That is interesting. Could this account for disproportionate kernel CPU
> usage for applications that perform I/O one byte at a time, as compared to
> other filesystems? (Never mind that the application shouldn't do that to
> begin with.)

No, this is entirely a matter of CPU efficiency in the current code.
There are two issues; we know what they are; and we're fixing them.

The first is that as we translate from znode to dnode, we throw away
information along the way -- we go from znode to object number (fast),
but then we have to do an object lookup to get from object number to
dnode (slow, by comparison -- or more to the point, slow relative to
the cost of writing a single byte). But this is just stupid, since we
already have a dnode pointer sitting right there in the znode. We just
need to fix our internal interfaces to expose it.

The second problem is that we're not very fast at partial-block updates.
Again, this is entirely a matter of code efficiency, not anything
fundamental.

> I still would love to see something like fbarrier() defined by some
> standard (de facto or otherwise) to make the distinction between ordered
> writes and guaranteed persistence more easily exploited in the general
> case for applications, and to encourage filesystems/storage systems to
> optimize for that case (i.e., not have fbarrier() simply be fsync()).

Totally agree.

Jeff
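[To illustrate the first issue in generic terms: the types and fields below
are made up and do not reflect the real znode/dnode layout; the point is
only the difference between re-looking-up by object number and using a
pointer that is already at hand.]

#include <stddef.h>
#include <stdint.h>

struct dnode;                           /* stand-in for the per-object state */

struct znode {
        uint64_t      z_object;         /* object number */
        struct dnode *z_dnode;          /* already known when the znode exists */
};

/* Stub for illustration: a real lookup would hash into the object set. */
static struct dnode *
objset_lookup_dnode(uint64_t object)
{
        (void) object;
        return NULL;
}

/* Current path as described: drop the pointer, look it up again by number. */
static struct dnode *
znode_to_dnode_slow(const struct znode *zp)
{
        return objset_lookup_dnode(zp->z_object);
}

/* The fix: expose the pointer the znode already carries. */
static struct dnode *
znode_to_dnode_fast(const struct znode *zp)
{
        return zp->z_dnode;
}

int
main(void)
{
        struct znode zn = { .z_object = 7, .z_dnode = NULL };
        /* Both paths reach the same dnode; only the cost differs. */
        return (znode_to_dnode_slow(&zn) == znode_to_dnode_fast(&zn)) ? 0 : 1;
}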
Erblichs writes:

> Jeff Bonwick,
>
> Do you agree that there is a major tradeoff in "builds up a wad of
> transactions in memory"? We lose the changes if we have an unstable
> environment.
>
> Thus, I don't quite understand why a 2-phase approach to commits isn't
> done. First, take the transactions as they come and do a minimal amount
> of delayed writing. If the number of transactions builds up, then convert
> to the delayed write scheme.

I probably don't understand the proposition. It seems that this is about
making all writes synchronous and initially go through the ZIL, then
converting to the pool sync when load builds up? The problem is that if we
make all writes go through the synchronous ZIL, this will limit the load
greatly, in a way that we'll never build a backlog (unless we scale to
1000s of threads). So is this about an option to enable O_DSYNC for all
files?

> The assumption is that not all ZFS environments are write heavy versus
> write-once, read-many type accesses. My assumption is that attribute/meta
> reading outweighs all other accesses.
>
> Wouldn't this approach allow minimal outstanding transactions and favor
> read access? Yes, the assumption is that once the "wad" is started, the
> amount of writing could be substantial and thus the amount of available
> bandwidth for reading is reduced. This would then allow for more N states
> to be available. Right?

So the reads _are_ prioritized over pool writes by the I/O scheduler. But
it is correct that the pool sync does impact the read latency, at least on
JBOD. There already are suggestions on reducing the impact (reserved read
slots, throttling writers, ...). Also, for the next build the overhead of
the pool sync is reduced, which opens up the possibility of testing with
smaller txg_time. I would be interested to know the problems you have
observed, to see if we're covered.

> Second, there are multiple uses of "then" (then pushes, then flushes all
> disk..., then writes the new uberblock, then flushes the caches again),
> which seems to have some level of possible parallelism that should reduce
> the latency from the start to the final write. Or did you just say that
> for simplicity's sake?

The parallelism level of those operations seems very high to me, and it was
improved last week (for the tail end of the pool sync). But note that the
pool sync does not commonly hold up a write or a ZIL commit. It does so
only when the storage is saturated for 10s of seconds. Given that memory is
finite, we have to throttle applications at some point.

-r
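[For reference, a minimal example of what enabling O_DSYNC looks like from
an application's point of view; the file name is arbitrary.]

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const char buf[] = "synchronously written record\n";
        int fd = open("example.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);

        if (fd == -1)
                return 1;

        /* With O_DSYNC, write() returns only once the data is on stable
         * storage, rather than being buffered for a later transaction
         * group commit. */
        (void) write(fd, buf, strlen(buf));

        close(fd);
        return 0;
}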
Peter Schuller writes:

> > I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> > thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> > every system call.
>
> That is interesting. Could this account for disproportionate kernel CPU
> usage for applications that perform I/O one byte at a time, as compared to
> other filesystems? (Never mind that the application shouldn't do that to
> begin with.)

I just quickly measured this (overwriting files in CHUNKS); this is a
software benchmark (I/O is a non-factor).

	CHUNK	ZFS vs UFS

	1B	4X slower
	1K	2X slower
	8K	25% slower
	32K	equal
	64K	30% faster

Quick and dirty, but I think it paints a picture. I can't really answer
your question though.

-r
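[The benchmark itself is not shown in the thread; a guess at what such a
chunked-overwrite test might look like, with a made-up file size, path, and
chunk list.]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE   (8 * 1024 * 1024)   /* 8 MB test file, hypothetical */

static double
overwrite_in_chunks(const char *path, size_t chunk)
{
        char *buf = calloc(1, chunk);
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        clock_t start = clock();

        if (buf == NULL || fd == -1)
                return -1.0;

        /* Overwrite the whole file, chunk bytes per write(). */
        for (size_t off = 0; off + chunk <= FILE_SIZE; off += chunk)
                (void) pwrite(fd, buf, chunk, (off_t)off);

        close(fd);
        free(buf);
        return (double)(clock() - start) / CLOCKS_PER_SEC;   /* CPU seconds */
}

int
main(void)
{
        size_t chunks[] = { 1, 1024, 8192, 32768, 65536 };

        for (size_t i = 0; i < sizeof (chunks) / sizeof (chunks[0]); i++)
                printf("%6zu bytes/write: %.2f s CPU\n",
                    chunks[i], overwrite_in_chunks("bench.dat", chunks[i]));
        return 0;
}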
> > That is interesting. Could this account for disproportionate kernel CPU
> > usage for applications that perform I/O one byte at a time, as compared
> > to other filesystems? (Never mind that the application shouldn't do that
> > to begin with.)
>
> I just quickly measured this (overwriting files in CHUNKS); this is a
> software benchmark (I/O is a non-factor).
>
> 	CHUNK	ZFS vs UFS
>
> 	1B	4X slower
> 	1K	2X slower
> 	8K	25% slower
> 	32K	equal
> 	64K	30% faster
>
> Quick and dirty, but I think it paints a picture. I can't really answer
> your question though.

I should probably have said "other filesystems on other platforms"; I did
not really compare properly on the Solaris box. In this case it was
actually BitTorrent (the official Python client) that was completely CPU
bound in kernel space, and tracing showed single-byte I/O.

Regardless, the above stats are interesting, and I suppose consistent with
what one might expect from previous discussion on this list.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> > Given ZFS's copy-on-write transactional model, would it not be almost trivial
> > to implement fbarrier()? Basically just choose to wrap up the transaction at
> > the point of fbarrier() and that's it.
> >
> > Am I missing something?
>
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes?
>
> The only guarantee of in-order writes, at the actual storage level, is to
> complete the outstanding ones before issuing new ones.
>
> Or am _I_ now missing something :)
>
> FrankH.

As Jeff said, ZFS guarantees that write(2)s are ordered, in the sense that
they either show up in the order supplied or they don't show up at all. So
as the transaction group closes, we can issue all the I/Os we want in
whatever order we choose (more or less), then flush the caches. Up to this
point none of the I/O would actually be visible upon a reboot. But then we
update the uberblock, flush the cache, and we're done. All writes
associated with a transaction group show up at once in the main tree.

-r