Some applications expect non-zero errno on close() for any errors that may happen during flushing dirty cached data/metadata even though linux manual page for close(2) suggests that fsync(2) should be used prior to close(2) in order to detect problems like those. Since syncing may degrade performance to a large extent, what do you think is the best/most convenient/least intrusive way to switch to that behaviour? Should it be a mount option for the client or anything else? Andrew.
Hello! On Apr 19, 2010, at 2:30 PM, Andrew Perepechko wrote:> Since syncing may degrade performance to a large extent, what do you think is > the best/most convenient/least intrusive way to switch to that behaviour? > Should it be a mount option for the client or anything else?Personally I think this is a non-issue for us. There is no mad dash in other filesystems to implement this, aside from NFS where this is what is guaranteed. As long as we don''t lose data after close (which we usually don''t anyway) we are in the clear. If we do (eviction, double failure) then there was some system exception that is kind of similar to disk failure for regular fs. Apps doing this "close after each write" should be fixed anyway because now they incur additional overhead of two extra unneeded sync RPCs (open + close) and adding more overhead to this is not going to help even if made optional. Bye, Oleg
One thing we can do to improve this situation a bit is to return any previous write error codes at close time. Cheers, Andreas On 2010-04-19, at 12:30, Andrew Perepechko <Andrew.Perepechko at Sun.COM> wrote:> Some applications expect non-zero errno on close() for any errors > that may > happen during flushing dirty cached data/metadata even though linux > manual > page for close(2) suggests that fsync(2) should be used prior to > close(2) in > order to detect problems like those. > > Since syncing may degrade performance to a large extent, what do you > think is > the best/most convenient/least intrusive way to switch to that > behaviour? > Should it be a mount option for the client or anything else? > > Andrew. > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
On 2010-04-19, at 12:30, Andrew Perepechko wrote:> Some applications expect non-zero errno on close() for any errors > that may > happen during flushing dirty cached data/metadata even though linux > manual > page for close(2) suggests that fsync(2) should be used prior to > close(2) in > order to detect problems like those. > > Since syncing may degrade performance to a large extent, what do you > think is > the best/most convenient/least intrusive way to switch to that > behaviour? > Should it be a mount option for the client or anything else?If the application is expecting the "flush all outstanding data" semantics of fsync() on close, it should just call fsync() and use the return code of that instead of overloading close() to do it. There will be no worse performance by waiting for fsync() to complete, than waiting for a close that is flushing all of the data. In contrast, if an application is NOT expecting synchronous flushing on close, then making every close flush all data WILL be a major performance impact. Better to have applications do what they mean, and system calls do what they are supposed to, instead of adding in extra semantics that not every application wants. Also, as Oleg wrote, if an application is doing close/open when it is really just trying to flush the data to disk (which I saw some traces of recently) then this is even WORSE than just doing an fsync(). Cheers, Andreas -- Andreas Dilger Principal Engineer, Lustre Group Oracle Corporation Canada Inc.
Actually this is already being done. We do set AS_error (or something like that). There is only one exception where on eviction we forgot to implement this, I think. On Apr 19, 2010, at 10:34 PM, Andreas Dilger wrote:> One thing we can do to improve this situation a bit is to return any > previous write error codes at close time. > > Cheers, Andreas > > On 2010-04-19, at 12:30, Andrew Perepechko <Andrew.Perepechko at Sun.COM> > wrote: > >> Some applications expect non-zero errno on close() for any errors >> that may >> happen during flushing dirty cached data/metadata even though linux >> manual >> page for close(2) suggests that fsync(2) should be used prior to >> close(2) in >> order to detect problems like those. >> >> Since syncing may degrade performance to a large extent, what do you >> think is >> the best/most convenient/least intrusive way to switch to that >> behaviour? >> Should it be a mount option for the client or anything else? >> >> Andrew. >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Possible Suggestion, NFS allows async writes and then commits. I would almost suggest this.. As yes, close''s are not expected to fail. What would you do / should you do if a close failed? Mitchell Erblich On Apr 19, 2010, at 10:16 PM, Oleg Drokin wrote:> Actually this is already being done. We do set AS_error (or something like that). > There is only one exception where on eviction we forgot to implement this, I think. > > On Apr 19, 2010, at 10:34 PM, Andreas Dilger wrote: > >> One thing we can do to improve this situation a bit is to return any >> previous write error codes at close time. >> >> Cheers, Andreas >> >> On 2010-04-19, at 12:30, Andrew Perepechko <Andrew.Perepechko at Sun.COM> >> wrote: >> >>> Some applications expect non-zero errno on close() for any errors >>> that may >>> happen during flushing dirty cached data/metadata even though linux >>> manual >>> page for close(2) suggests that fsync(2) should be used prior to >>> close(2) in >>> order to detect problems like those. >>> >>> Since syncing may degrade performance to a large extent, what do you >>> think is >>> the best/most convenient/least intrusive way to switch to that >>> behaviour? >>> Should it be a mount option for the client or anything else? >>> >>> Andrew. >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Hello! On Apr 20, 2010, at 3:56 AM, Mitchell Erblich wrote:> NFS allows async writes and then commits.It''s all internal within NFS. There is no way for the client to say "these writes are async", "now do commit" aside from doing manual syncs for commits and using some sort of sync io open flags to force "sync" writes. Of course additional level of confusion is the "async writes" you reference have nothing to do with real write system calls. In fact write system calls put data in memory and data might stay there for who knows how long (or until close) and only when NFS client decides (or VFS forces it to) to write the data to the server is when these "async writes" are used, Lustre has this too.> I would almost suggest this..It''s already implemented sort of NFS-style except the close-to-open.> As yes, close''s are not expected to fail.Huh?!> What would you do / should you do if a close failed?Well, you assume the file content is no longer correct and write it again, I guess. Or if you are like 99% of people writing programs out there, you just bail out with an error ;) Bye, Oleg
> It''s all internal within NFS. There is no way for the client to say > "these writes are async", "now do commit" aside fromI meant to say "no way for the application" here.
POSIX behavior is that close(2) can fail, so if you care, you should check its error status. POSIX behavior is also that close(2) doesn''t imply an fsync(2) (much less a sync(2)). NFS clients implement fsync(2)-on-close, but that''s another story. However, when writes deferred at close(2) time fail on a local filesystem... chances are that subsequent I/O will just fail. Or at least that''s probably what many users will expect. But does POSIX require that? I don''t have it handy, but I''m pretty sure the answer is "no". With Lustre we could also have a close(2) whose deferred writes fail long after the process that could handle the failure is gone. Databases and so on should always fsync(2) when they need to ensure consistent on-disk state. Many apps that write(2) don''t need to fsync(2) (think of writes to log files). Nico --
On Tue, Apr 20, 2010 at 03:27:40PM -0500, Nicolas Williams wrote:> However, when writes deferred at close(2) time fail on a local > filesystem... chances are that subsequent I/O will just fail. Or at > least that''s probably what many users will expect. But does POSIX > require that? I don''t have it handy, but I''m pretty sure the answer is > "no". With Lustre we could also have a close(2) whose deferred writes > fail long after the process that could handle the failure is gone.To go out on a complete limb :) what we really need is a variant of close(2) where eventual failure can be caught, even when the process that called it has exit()ed. Something like: int close_or_spawn(int fd, <posix_spawn()-like arguments>); or int close_xid(int fd, uint64_t xid); /* * Where some daemon(s) reads a * log of sucessful/failed close * XIDs and takes action as * necessary. */ I prefer something along the lines of close_xid(). Adoption of new APIs in the context of Lustre is probably more realistic than in the general case, but it''d still be slow. So we''re still left with: you''d better fsync(2) explicitly before close()ing if you want to make sure that you don''t lose data. Nico --
On 2010-04-19, at 23:16, Oleg Drokin wrote:> Actually this is already being done. We do set AS_error (or something like that). There is only one exception where on eviction we forgot to implement this, I think.One thing I see in both the 1.x and 2.x client code is that even though we track the errors from async file IO on the file descriptor, we don''t actually return them to the application at close time: ll_file_release() { if (lsm) lov_test_and_clear_async_rc(lsm); lli->lli_async_rc = 0; It looks like this should be something like: if (lsm) rc2 = lov_test_and_clear_async_rc(lsm); if (rc2 == 0) rc2 = lli->lli_async_rc; lli->lli_async_rc = 0;> On Apr 19, 2010, at 10:34 PM, Andreas Dilger wrote: > >> One thing we can do to improve this situation a bit is to return any >> previous write error codes at close time. >> >> Cheers, Andreas >> >> On 2010-04-19, at 12:30, Andrew Perepechko <Andrew.Perepechko at Sun.COM> >> wrote: >> >>> Some applications expect non-zero errno on close() for any errors >>> that may >>> happen during flushing dirty cached data/metadata even though linux >>> manual >>> page for close(2) suggests that fsync(2) should be used prior to >>> close(2) in >>> order to detect problems like those. >>> >>> Since syncing may degrade performance to a large extent, what do you >>> think is >>> the best/most convenient/least intrusive way to switch to that >>> behaviour? >>> Should it be a mount option for the client or anything else? >>> >>> Andrew. >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >> _______________________________________________ >> Lustre-devel mailing list >> Lustre-devel at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-devel >Cheers, Andreas -- Andreas Dilger Principal Engineer, Lustre Group Oracle Corporation Canada Inc.
Oleg, say about code: lock_page(page); wait_on_page_writeback(page); if (rc < 0) { CERROR("writepage inode %lu(%p) of page %p " "failed: %d\n", mapping->host->i_ino, mapping->host, page, rc); if (rc == -ENOSPC) set_bit(AS_ENOSPC, &mapping->flags); else set_bit(AS_EIO, &mapping->flags); } in lock cancel handler and IO completion. This better way to tell linux kernel about hit a error, because lov async error isn''t help in case O_SYNC write (and some other - as i remember). int wait_on_page_writeback_range(struct address_space *mapping, pgoff_t start, pgoff_t end) ... /* Check for outstanding write errors */ if (test_and_clear_bit(AS_ENOSPC, &mapping->flags)) ret = -ENOSPC; if (test_and_clear_bit(AS_EIO, &mapping->flags)) ret = -EIO; return ret; On Apr 22, 2010, at 00:04, Andreas Dilger wrote:> On 2010-04-19, at 23:16, Oleg Drokin wrote: >> Actually this is already being done. We do set AS_error (or something like that). There is only one exception where on eviction we forgot to implement this, I think. > > One thing I see in both the 1.x and 2.x client code is that even though we track the errors from async file IO on the file descriptor, we don''t actually return them to the application at close time: > > ll_file_release() > { > if (lsm) > lov_test_and_clear_async_rc(lsm); > lli->lli_async_rc = 0; > > It looks like this should be something like: > > if (lsm) > rc2 = lov_test_and_clear_async_rc(lsm); > if (rc2 == 0) > rc2 = lli->lli_async_rc; > lli->lli_async_rc = 0; > >> On Apr 19, 2010, at 10:34 PM, Andreas Dilger wrote: >> >>> One thing we can do to improve this situation a bit is to return any >>> previous write error codes at close time. >>> >>> Cheers, Andreas >>> >>> On 2010-04-19, at 12:30, Andrew Perepechko <Andrew.Perepechko at Sun.COM> >>> wrote: >>> >>>> Some applications expect non-zero errno on close() for any errors >>>> that may >>>> happen during flushing dirty cached data/metadata even though linux >>>> manual >>>> page for close(2) suggests that fsync(2) should be used prior to >>>> close(2) in >>>> order to detect problems like those. >>>> >>>> Since syncing may degrade performance to a large extent, what do you >>>> think is >>>> the best/most convenient/least intrusive way to switch to that >>>> behaviour? >>>> Should it be a mount option for the client or anything else? >>>> >>>> Andrew. >>>> _______________________________________________ >>>> Lustre-devel mailing list >>>> Lustre-devel at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel >>> _______________________________________________ >>> Lustre-devel mailing list >>> Lustre-devel at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-devel >> > > > Cheers, Andreas > -- > Andreas Dilger > Principal Engineer, Lustre Group > Oracle Corporation Canada Inc. > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel