Vitaly,

1. Clients must replay opens on the MDS if "done writing" is still
   pending, to notify the new MDS that this file is volatile. Does it
   matter whether the client already sent "close" to the previous MDS
   instance? Does it have to send "close" again?

2. I assume "done writing" is only sent after stripe updates have been
   committed, not just executed, so that cached SOM attributes are not
   dependent on the client still being around to participate in
   recovery if an OST fails. Is this correct?

--
Cheers,
Eric
Eric Barton wrote:
> Vitaly,
>
> 1. Clients must replay opens on the MDS if "done writing" is still
>    pending, to notify the new MDS that this file is volatile. Does it
>    matter whether the client already sent "close" to the previous MDS
>    instance? Does it have to send "close" again?
>
> 2. I assume "done writing" is only sent after stripe updates have been
>    committed, not just executed, so that cached SOM attributes are not
>    dependent on the client still being around to participate in
>    recovery if an OST fails. Is this correct?

And this interacts with the async data commit feature I brought up
during the SC09 meetings, i.e. the client can no longer assume that OST
IO is synchronous.

-Alex
On Jan 5, 2010, at 9:01 PM, Eric Barton wrote:
> Vitaly,
>
> 1. Clients must replay opens on the MDS if "done writing" is still
>    pending, to notify the new MDS that this file is volatile. Does it
>    matter whether the client already sent "close" to the previous MDS
>    instance? Does it have to send "close" again?

The idea was to get rid of these long chains of requests on replay
(open-close-DW-setattr): DW and setattr are replayed independently,
without requiring a committed open to be replayed.

Due to bug 3633, we do not even replay a committed open if the close has
already been sent; requiring the open to be replayed because of a
pending DW would bring this problem back.

The MDS, in its turn, just ignores DW and setattr for files that are not
re-opened and relies on synchronisation with the OSTs -- once a file is
closed, the data are under extent lock and under control here. Thus we
can invalidate the SOM attributes on the MDS by an llog record, and the
following SOM recovery will ensure in some way that the data are flushed
and committed on the OST (alternatively, we can just ask the clients to
flush and the OST to commit before the synchronisation).

SOM recovery may try to happen late enough that the data would already
be committed on the OST, with some checks that they really are
committed; or it will have to take a conflicting extent lock and wait
for the commit by itself.

> 2. I assume "done writing" is only sent after stripe updates have been
>    committed, not just executed, so that cached SOM attributes are not
>    dependent on the client still being around to participate in
>    recovery if an OST fails. Is this correct?

It is correct; DW can be postponed until commit.

However, as we cannot get the proper attribute update (in particular
i_blocks) right in DW, there was an idea to separate the SOM
invalidation mechanism from the SOM revalidation mechanism, i.e. to not
try to rebuild the SOM cache on the MDS immediately once the file has
been modified.

In this case DW can just indicate that this client is not going to
modify the file anymore, and we probably do not have to wait until
commit: the revalidation will occur late enough that the commit will
have occurred (again, with some checks that it really occurred).

In the case of an OST failure, SOM is disabled while the OST is down or
not yet re-synchronised with the MDS; the SOM revalidation will occur
late enough after the MDS-OST synchronisation completes...

--
Vitaly
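[A minimal C sketch of the invalidate/revalidate split described above.
All names here (som_llog_write_invalidate(), som_state, and so on) are
hypothetical illustrations of the flow, not actual Lustre symbols:]

    /* Hypothetical MDS-side state and helpers. */
    enum som_state { SOM_INVALID, SOM_VALID };

    struct som_inode {
            enum som_state som_state;
            /* cached size/blocks would live here */
    };

    void som_llog_write_invalidate(struct som_inode *si);
    void som_llog_cancel_invalidate(struct som_inode *si);
    int  ost_data_committed(struct som_inode *si);
    void refetch_attrs_from_osts(struct som_inode *si);

    /* First modifying open: write a persistent llog record saying
     * "SOM unknown for this file" before the cached attributes can go
     * stale, so a crash never exposes them. */
    static void mds_open_for_write(struct som_inode *si)
    {
            if (si->som_state == SOM_VALID) {
                    som_llog_write_invalidate(si);
                    si->som_state = SOM_INVALID;
            }
    }

    /* Lazy revalidation: rebuild the cache only once the OST data are
     * known to be flushed and committed (or after taking a conflicting
     * extent lock and waiting for the commit). */
    static int mds_som_revalidate(struct som_inode *si)
    {
            if (!ost_data_committed(si))
                    return -1;      /* try again later */
            refetch_attrs_from_osts(si);
            si->som_state = SOM_VALID;
            som_llog_cancel_invalidate(si);
            return 0;
    }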
I'd say we don't need DW at all.

It's the OST who knows whether the attributes are stable (no PW locks,
and flush/commit is done, so i_blocks won't change until the next
open/write). I think in general the procedure to refresh the SOM
attributes could look like the following (a rough sketch follows below
the quoted message):

1) the MDS gets a GETATTR and finds the file hasn't been open for a
   period; it sets a special flag in the GETATTR reply - say,
   REFRESH_SOM
2) the client does a regular enqueue/glimpse to get the attributes from
   the OST
3) if the OST finds the inode is stable (VBR version >= last_committed),
   it sets another special flag in the reply - say, ATTR_STABLE
4) now, if the client has REFRESH_SOM, ATTR_STABLE for all objects
   *and* the locks granted, then it can send the aggregated attributes
   to the MDS to refresh the SOM attributes
5) if the file hasn't been open since that REFRESH_SOM, the attributes
   can be set

It looks quite simple, and with very minimal changes to the existing
protocol logic. I also think that following this we don't need a
dedicated IO epoch notion and can use the regular VBR version,
increasing on each open.

thanks, Alex

On 1/11/10 5:10 PM, Vitaly Fertman wrote:
> On Jan 5, 2010, at 9:01 PM, Eric Barton wrote:
>> 1. Clients must replay opens on the MDS if "done writing" is still
>>    pending, to notify the new MDS that this file is volatile. Does it
>>    matter whether the client already sent "close" to the previous MDS
>>    instance? Does it have to send "close" again?
>
> The idea was to get rid of these long chains of requests on replay
> (open-close-DW-setattr): DW and setattr are replayed independently,
> without requiring a committed open to be replayed.
>
> Due to bug 3633, we do not even replay a committed open if the close
> has already been sent; requiring the open to be replayed because of a
> pending DW would bring this problem back.
>
> The MDS, in its turn, just ignores DW and setattr for files that are
> not re-opened and relies on synchronisation with the OSTs -- once a
> file is closed, the data are under extent lock and under control here.
> Thus we can invalidate the SOM attributes on the MDS by an llog
> record, and the following SOM recovery will ensure in some way that
> the data are flushed and committed on the OST (alternatively, we can
> just ask the clients to flush and the OST to commit before the
> synchronisation).
>
> SOM recovery may try to happen late enough that the data would already
> be committed on the OST, with some checks that they really are
> committed; or it will have to take a conflicting extent lock and wait
> for the commit by itself.
>
>> 2. I assume "done writing" is only sent after stripe updates have
>>    been committed, not just executed, so that cached SOM attributes
>>    are not dependent on the client still being around to participate
>>    in recovery if an OST fails. Is this correct?
>
> It is correct; DW can be postponed until commit.
>
> However, as we cannot get the proper attribute update (in particular
> i_blocks) right in DW, there was an idea to separate the SOM
> invalidation mechanism from the SOM revalidation mechanism, i.e. to
> not try to rebuild the SOM cache on the MDS immediately once the file
> has been modified.
>
> In this case DW can just indicate that this client is not going to
> modify the file anymore, and we probably do not have to wait until
> commit: the revalidation will occur late enough that the commit will
> have occurred (again, with some checks that it really occurred).
>
> In the case of an OST failure, SOM is disabled while the OST is down
> or not yet re-synchronised with the MDS; the SOM revalidation will
> occur late enough after the MDS-OST synchronisation completes...
>
> --
> Vitaly
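[A rough client-side sketch of steps 2-4 above, in C. Everything here
is hypothetical -- glimpse_object(), ATTR_STABLE as a reply flag,
send_som_update(), and the struct fields only illustrate the proposed
procedure, they are not existing Lustre symbols:]

    #include <stdbool.h>
    #include <stdint.h>

    #define ATTR_STABLE 0x1         /* hypothetical reply flag */

    struct obd_attrs { uint64_t size, blocks, mtime; };

    struct glimpse_reply {
            struct obd_attrs attrs;
            unsigned int     flags;
            bool             lock_granted;
    };

    /* Hypothetical helpers standing in for the real enqueue/glimpse
     * path and the MDS update RPC. */
    int  glimpse_object(void *obj, struct glimpse_reply *rep);
    void aggregate_attrs(struct obd_attrs *agg,
                         const struct obd_attrs *one);
    int  send_som_update(const void *fid, const struct obd_attrs *agg);

    struct file_md {
            const void  *fid;
            void       **objects;
            int          stripe_count;
    };

    /* Client: after a GETATTR reply carrying REFRESH_SOM, glimpse every
     * stripe object; only if every OST reports ATTR_STABLE and the lock
     * is granted do we push the aggregated attributes back to the MDS. */
    static int client_refresh_som(struct file_md *md)
    {
            struct obd_attrs agg = { 0 };
            int i;

            for (i = 0; i < md->stripe_count; i++) {
                    struct glimpse_reply rep;

                    if (glimpse_object(md->objects[i], &rep) != 0 ||
                        !(rep.flags & ATTR_STABLE) || !rep.lock_granted)
                            return -1;      /* not stable, give up */
                    aggregate_attrs(&agg, &rep.attrs);
            }
            /* Step 5 is enforced on the MDS: it applies the update only
             * if the file has not been opened since REFRESH_SOM was
             * sent. */
            return send_som_update(md->fid, &agg);
    }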
On 1/11/10 6:47 PM, Alex Zhuravlev wrote:
> It looks quite simple, and with very minimal changes to the existing
> protocol logic. I also think that following this we don't need a
> dedicated IO epoch notion and can use the regular VBR version,
> increasing on each open.

Sorry, I forgot to add that all this essentially means no SOM-specific
in-core state on the MDS at all, and no need to maintain that state
over reboots.

thanks, Alex
On Jan 11, 2010, at 6:47 PM, Alex Zhuravlev wrote:
> I'd say we don't need DW at all.
>
> It's the OST who knows whether the attributes are stable (no PW locks,
> and flush/commit is done, so i_blocks won't change until the next
> open/write).

What do you mean by stable? If you mean they are not going to change,
this is exactly what the OST doesn't know, because it doesn't know
whether the file is opened.

> I think in general the procedure to refresh the SOM attributes could
> look like the following:
>
> 1) the MDS gets a GETATTR and finds the file hasn't been open for a
>    period; it sets a special flag in the GETATTR reply - say,
>    REFRESH_SOM
> 2) the client does a regular enqueue/glimpse to get the attributes
>    from the OST
> 3) if the OST finds the inode is stable (VBR version >=
>    last_committed), it sets another special flag in the reply - say,
>    ATTR_STABLE
> 4) now, if the client has REFRESH_SOM, ATTR_STABLE for all objects
>    *and* the locks granted, then it can send the aggregated attributes
>    to the MDS to refresh the SOM attributes
> 5) if the file hasn't been open since that REFRESH_SOM, the attributes
>    can be set
>
> It looks quite simple, and with very minimal changes to the existing
> protocol logic. I also think that following this we don't need a
> dedicated IO epoch notion and can use the regular VBR version,
> increasing on each open.

Could you clarify how you can block IO from "lost" clients without
writing some id (VBR id?) to the OST objects, and without waiting for
this change to commit before updating the MDS with new attributes?

--
Vitaly
On 1/12/10 12:59 AM, Vitaly Fertman wrote:
> What do you mean by stable? If you mean they are not going to change,
> this is exactly what the OST doesn't know, because it doesn't know
> whether the file is opened.

Let me rephrase it: stable means there is no pending (cached) IO on the
clients, and all locally cached (on the OST) changes are committed to
disk.

> Could you clarify how you can block IO from "lost" clients without
> writing some id (VBR id?) to the OST objects, and without waiting for
> this change to commit before updating the MDS with new attributes?

This is a related but different issue, discussed in the "SOM safety"
thread :) We are considering at least 3 options for that.

thanks, Alex
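[For completeness, the stability test from step 3 could be a small
predicate like the one below. The names are hypothetical, and the
comparison reads the thread's "VBR version vs last_committed" check as
"the transaction that last changed the object has already committed":]

    #include <stdbool.h>
    #include <stdint.h>

    struct ost_object {
            uint64_t vbr_version;   /* transno of the last change */
            int      pw_lock_count; /* granted PW extent locks */
    };

    /* OST side: an object's attributes are "stable" when no client
     * holds a PW extent lock (so no cached dirty data can still land)
     * and the object's last change is already on disk, i.e. its VBR
     * version (a transaction number) has been committed. */
    static bool ost_object_is_stable(const struct ost_object *obj,
                                     uint64_t last_committed)
    {
            return obj->pw_lock_count == 0 &&
                   obj->vbr_version <= last_committed;
    }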