Vitaly Fertman wrote:
> Oleg told me yesterday about a feature which seems to destroy SOM
> completely. If a client is evicted and re-connects, we do not re-open
> its files, so the client thinks the files are open whereas the MDS
> thinks they are closed.

Right.  This issue has been around for a long time.  Bug 971 deals with
it: it proposes changing open-file recovery to work by generating new
"open file" requests instead of saving the original RPCs and handling
them at the ptlrpc level.  This is (AFAIK) being done for the simplified
interoperability fixes already.

> Thus the MDS has no control over open files, whereas clients may still
> write to them. To fix this we need at least to disable file
> modification on clients until the files are re-opened.

This is also going to be handled by the LOV EA lock that CEA is working
on for HSM and migration.  If the client is evicted from the MDS it will
have its LOV EA lock cancelled, and all IO will block until a new LOV EA
lock is obtained.

> The re-opening itself could be done by the application or by us. In
> the latter case, the recovery mechanism is involved...

This is definitely not an application-level problem; it needs to be
fixed within Lustre.

> It was missed for recovery, but it is a problem for interoperability
> as well. I remember Eric said that we will evict clients on downgrade,
> and therefore all the files get closed. However, it seems this does
> not hold on the clients unless we take some extra actions.

Even on upgrade, simplified interoperability will now have the server
requesting that all clients flush their state before the server is shut
down, so that the amount of interoperability needed is minimal.  The
only state that a client cannot completely remove is the open file
handles, so the "replay" of file open will now be driven by the file
handles themselves instead of the "saved RPC" mechanism we use today.
That would also avoid bugs like 3632, 3633, etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hello!

On Jan 30, 2009, at 6:32 PM, Andreas Dilger wrote:
> Vitaly Fertman wrote:
>> Oleg told me yesterday about a feature which seems to destroy SOM
>> completely. If a client is evicted and re-connects, we do not re-open
>> its files, so the client thinks the files are open whereas the MDS
>> thinks they are closed.
> Right.  This issue has been around for a long time.  Bug 971 deals
> with it: it proposes changing open-file recovery to work by generating
> new "open file" requests instead of saving the original RPCs and
> handling them at the ptlrpc level.  This is (AFAIK) being done for the
> simplified interoperability fixes already.

But the problem is that the client might be evicted before such a
command is issued, and the MDS's knowledge of those open files would
disappear (though not the OSTs', where the client is still connected).

>> Thus the MDS has no control over open files, whereas clients may
>> still write to them. To fix this we need at least to disable file
>> modification on clients until the files are re-opened.
> This is also going to be handled by the LOV EA lock that CEA is
> working on for HSM and migration.  If the client is evicted from the
> MDS it will have its LOV EA lock cancelled, and all IO will block
> until a new LOV EA lock is obtained.

The LOV EA lock won't help.  It does not (with the current design,
anyway) prevent dirty data from being flushed out of the client cache;
only new writes would be blocked.  Even then, since there is no re-open
when obtaining the EA lock, the MDS would still have no idea that there
is an open file handle somewhere.

>> The re-opening itself could be done by the application or by us. In
>> the latter case, the recovery mechanism is involved...
> This is definitely not an application-level problem; it needs to be
> fixed within Lustre.

Right.  But there is no straightforward fix.  It is not going to be easy
to reopen a file after eviction.  Of course we can just invalidate the
local fd, so that the application starts getting something like ESTALE,
but this approach is not very desirable either.

>> It was missed for recovery, but it is a problem for interoperability
>> as well. I remember Eric said that we will evict clients on
>> downgrade, and therefore all the files get closed. However, it seems
>> this does not hold on the clients unless we take some extra actions.
> Even on upgrade, simplified interoperability will now have the server
> requesting that all clients flush their state before the server is
> shut down, so that the amount of interoperability needed is minimal.

Except in this case the client is evicted from e.g. the MDS, so it does
not participate in recovery anyway.

Bye,
    Oleg
On Jan 31, 2009, at 3:51 AM, Oleg Drokin wrote:
> Hello!
>
> On Jan 30, 2009, at 6:32 PM, Andreas Dilger wrote:
>> Vitaly Fertman wrote:
>>> Oleg told me yesterday about a feature which seems to destroy SOM
>>> completely. If a client is evicted and re-connects, we do not
>>> re-open its files, so the client thinks the files are open whereas
>>> the MDS thinks they are closed.
>> Right.  This issue has been around for a long time.  Bug 971 deals
>> with it: it proposes changing open-file recovery to work by
>> generating new "open file" requests instead of saving the original
>> RPCs and handling them at the ptlrpc level.  This is (AFAIK) being
>> done for the simplified interoperability fixes already.
>
> But the problem is that the client might be evicted before such a
> command is issued, and the MDS's knowledge of those open files would
> disappear (though not the OSTs', where the client is still connected).

Right.  Besides that, the problem exists even without interoperability
being involved, i.e. when the MDS does not reboot at all and only an
eviction happens.

>>> Thus the MDS has no control over open files, whereas clients may
>>> still write to them. To fix this we need at least to disable file
>>> modification on clients until the files are re-opened.
>> This is also going to be handled by the LOV EA lock that CEA is
>> working on for HSM and migration.  If the client is evicted from the
>> MDS it will have its LOV EA lock cancelled, and all IO will block
>> until a new LOV EA lock is obtained.
>
> The LOV EA lock won't help.  It does not (with the current design,
> anyway) prevent dirty data from being flushed out of the client cache;
> only new writes would be blocked.  Even then, since there is no
> re-open when obtaining the EA lock, the MDS would still have no idea
> that there is an open file handle somewhere.

The dirty cache on the client is not such a big problem for SOM.  First
of all, client eviction leads to the files being closed on the MDS, at
which point the MDS removes the SOM cache.  Besides that, if an MDS
failover happens, then during the MDS-OST synchronization the OST may
ask the clients to flush their data and tell the MDS about the existing
llog record, so the MDS will be able to clear the SOM cache as well.
Once the MDS wants to obtain the SOM cache again and sees that the cache
does not exist, it asks a client to gather the attributes under extent
locks, forcing other clients to flush their data to the OSTs.

Thus the only problem here is a stale fh on a client, which may let the
client write to the file after the SOM cache has been re-obtained on the
MDS.  This consists of 2 parts:

- the ability of a client to write to an opened file without a
  connection to the MDS;
- the absence of file re-opening on re-connection.

>>> The re-opening itself could be done by the application or by us. In
>>> the latter case, the recovery mechanism is involved...
>> This is definitely not an application-level problem; it needs to be
>> fixed within Lustre.
>
> Right.  But there is no straightforward fix.  It is not going to be
> easy to reopen a file after eviction.  Of course we can just
> invalidate the local fd, so that the application starts getting
> something like ESTALE, but this approach is not very desirable either.
>
>>> It was missed for recovery, but it is a problem for interoperability
>>> as well. I remember Eric said that we will evict clients on
>>> downgrade, and therefore all the files get closed. However, it seems
>>> this does not hold on the clients unless we take some extra actions.
>> Even on upgrade, simplified interoperability will now have the server
>> requesting that all clients flush their state before the server is
>> shut down, so that the amount of interoperability needed is minimal.
>> The only state that a client cannot completely remove is the open
>> file handles,

The only state needed for SOM ;)  IIRC, what was discussed in Beijing
was failover for upgrade and eviction of all clients for downgrade.
The failover is not a problem here, as opens will simply be replayed,
but eviction is.

>> so the "replay" of file open will now be driven by the file handles
>> themselves instead of the "saved RPC" mechanism we use today.

Hopefully not only for replay, but for re-connection as well.

> Except in this case the client is evicted from e.g. the MDS, so it
> does not participate in recovery anyway.

Right.

--
Vitaly
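Here is a minimal compile-only sketch (plain C; every name is invented
for illustration, this is not Lustre code) of the SOM cache transitions
described above: client eviction closes the files on the MDS and
invalidates the cached attributes, and the cache only becomes
trustworthy again after attributes are re-gathered under extent locks.

```c
#include <stdbool.h>

enum som_state {
        SOM_VALID,      /* the MDS-cached size/blocks may be trusted */
        SOM_INVALID,    /* cache dropped: eviction, open handles, failover */
};

struct som_cache {
        enum som_state state;
        unsigned long long size;  /* cached size, meaningful only if SOM_VALID */
};

/* Client eviction closes its open files on the MDS, so the cache is dropped. */
static void som_on_client_eviction(struct som_cache *c)
{
        c->state = SOM_INVALID;
}

/*
 * Re-validation: the MDS asks one client to gather attributes under extent
 * locks (forcing other clients to flush dirty data first), then trusts the
 * result again.
 */
static void som_revalidate(struct som_cache *c, unsigned long long gathered)
{
        c->size = gathered;
        c->state = SOM_VALID;
}

static bool som_is_usable(const struct som_cache *c)
{
        return c->state == SOM_VALID;
}
```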
On Feb 1, 2009, at 5:45 PM, Vitaly Fertman wrote:
> Thus the only problem here is a stale fh on a client, which may let
> the client write to the file after the SOM cache has been re-obtained
> on the MDS.  This consists of 2 parts:
>
> - the ability of a client to write to an opened file without a
>   connection to the MDS;
> - the absence of file re-opening on re-connection.

I forgot to mention truncate (locked & lockless) and lockless IO.

The MDS must be aware of an open IOEpoch for truncate as well; otherwise
obd_punches must be blocked.  The situation is pretty rare, as we do not
cache punches on clients and they go away right after md_setattr
completes.  But consider this: at the time of the client eviction from
the MDS, the connection between this client and an OST is unstable, so
the punches hang in the re-send list for a while -- long enough for
another client to modify the file.  The MDS gets a new SOM cache, and
the late punch then modifies the file.  The same holds for lockless IO.

Locked truncate is involved as well, because it could hang in the
re-send list together with the lock enqueue, so that the enqueue+punch
happens after the MDS re-validates the SOM cache.

Thus:
- block truncate and lockless IO;
- "re-open" truncate on re-connection, as well as regularly opened
  files.

This must happen even if SOM is disabled but the client already supports
it (clients are upgraded first).  Otherwise, interoperability will be
broken.

--
Vitaly
On Feb 01, 2009  20:24 +0300, Vitaly Fertman wrote:
> On Feb 1, 2009, at 5:45 PM, Vitaly Fertman wrote:
>> Thus the only problem here is a stale fh on a client, which may let
>> the client write to the file after the SOM cache has been re-obtained
>> on the MDS.  This consists of 2 parts:
>>
>> - the ability of a client to write to an opened file without a
>>   connection to the MDS;

With the layout lock this would not be possible.  The client would be
required to hold the layout lock (hence be connected to the MDS) in
order to generate a new write.

>> - the absence of file re-opening on re-connection.
>
> I forgot to mention truncate (locked & lockless) and lockless IO.
>
> The MDS must be aware of an open IOEpoch for truncate as well;
> otherwise obd_punches must be blocked.  The situation is pretty rare,
> as we do not cache punches on clients and they go away right after
> md_setattr completes.  But consider this: at the time of the client
> eviction from the MDS, the connection between this client and an OST
> is unstable, so the punches hang in the re-send list for a while --
> long enough for another client to modify the file.

If a second client is trying to modify the file while the first one is
having OST connection problems, then the first client will either
succeed in flushing its cache, or be evicted by the OST before the
second client can get the extent locks needed to truncate the file.

The same is true whether the truncate is from a remote client (with a
client lock) or a lockless truncate (OST holds the lock).

> The MDS gets a new SOM cache, and the late punch then modifies the
> file.  The same holds for lockless IO.
>
> Locked truncate is involved as well, because it could hang in the
> re-send list together with the lock enqueue, so that the enqueue+punch
> happens after the MDS re-validates the SOM cache.

In this case the client will not even begin to send the truncate RPC
until the lock enqueue has succeeded.

> Thus:
> - block truncate and lockless IO;
> - "re-open" truncate on re-connection, as well as regularly opened
>   files.
>
> This must happen even if SOM is disabled but the client already
> supports it (clients are upgraded first).  Otherwise, interoperability
> will be broken.

It isn't clear to me why the done_writing RPC needs to be sent
separately for each truncate.  The client already sends an RPC to the
MDS for each truncate to update the size there, if the file is not open
(and currently has no objects), and to verify file write permission (to
avoid truncating in-use executables).

Now, if this only happens on recovery I don't have a huge objection.
If the "done_writing" RPC needs to be sent to the MDS for every single
truncate, then that is a major performance concern.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Please also consider the security implications.  Can all client actions
be checked without extra message passing?  Are any special capabilities
required?  To what extent must clients be trusted?  What will go wrong
if this trust is abused, etc...

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org
>       [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas Dilger
> Sent: 21 February 2009 12:21 AM
> To: Vitaly Fertman
> Cc: Oleg Drokin; Lustre Development Mailing List
> Subject: Re: [Lustre-devel] SOM Recovery of open files
On Mar 06, 2009  10:01 -0800, Nathaniel Rutman wrote:
> I think we need to explicitly list the extent / layout lock
> interactions so we don't miss anything in the implementation:
>
> 1. Create
>    * MDT generates new layout lock at open
>    * client gets Common Reader (CR) layout lock
>    * client can get new extent read/write locks as long as it holds
>      the CR layout lock
>
> 2. Layout change
>    * MDT takes PW layout lock, revoking all client CR locks
>    * in parallel, MDT takes PW locks on all extents on all OSTs for
>      this file
>    * clients drop the layout lock and requeue
>    * clients flush their cache and drop their extent locks
>    * MDT changes the layout
>    * MDT releases the layout lock and extent locks
>    * clients get the CR layout lock and can now requeue their extent
>      locks
>
> 3. Client / MDT network partition
>    * client can continue reading/writing to currently held extents
>    * when the client determines it has been disconnected from the MDT
>      it drops the layout lock
>    * client can't get new extent locks, but can continue writing to
>      currently held extents
>    * if the MDT changes the layout, it first PW-locks all extents,
>      causing the OSTs to revoke the client's extent locks
>    * client must requeue the layout lock before requeueing extent
>      locks
>
> What if the client hasn't noticed it has been disconnected from the
> MDT by the time it tries to requeue extent locks?  It doesn't know
> that the layout lock it is holding is invalid...

That is a thorny problem.  I'll go through several partial solutions and
why they do not work, then hopefully arrive at a safe solution at the
end.

One possibility is that the AST sent to the clients during the extent
lock revocation would contain a flag that indicates "the layout is
changing" (similar to the truncate/discard-data flag), so the clients
get notified even if they are disconnected from the MDS.  That still
isn't enough, however, because a client will only get this AST if it
currently holds an extent lock, and that isn't always true.

A second option: if a client holding a layout lock is evicted AND the
layout is being changed, then the MDS can't release the extent locks
until at least one ping interval has passed (assuming any still-alive
client would have detected the eviction and tried to reconnect).  This
is also not 100% safe, because the client might have been evicted
moments earlier due to some other lock, and the "wait for one ping
interval" heuristic would no longer apply.

We cannot depend on the layout change being drastic enough that the
objects would no longer exist to be written to (CROW issues aside).  If
we are changing the layout to add a mirror, that wouldn't help, and we
would end up with inconsistent data on each half of the mirror.

Another option is something like "imperative eviction", where clients
being evicted are actively told they are being evicted.  The problem is
that the "you are evicted" RPC will normally be sent to a node which is
already dead, and would slow down the MDS and/or consume all of its LNET
credits, so this isn't really a usable option.

A safe option (AFAICS) is to have MDS eviction force OST eviction (via
obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
recovery problems, but might be overly drastic if e.g. the client is
being evicted from the MDS due to a router failure or a simple network
partition.  Having a proper health network and server-side RPC resending
would help avoid such problems.

This is one of the main reasons why having DLM servers on one node
controlling resources on another node is a bad idea.  We had similar
issues in the past when we locked all objects via the OST only on stripe
index 0, and we might have similar problems with subtree locks in the
future with CMD or any SNS RAID that locks only a subset of objects.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
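As an aside, the first partial solution above (a "layout is changing"
flag carried in the blocking AST) could look roughly like the following
on the client side.  This is a hypothetical sketch: the flag, structures,
and handler are invented names rather than real Lustre identifiers, and
as noted it only helps a client that happens to hold an extent lock when
the change starts.

```c
#include <stdbool.h>

/* Hypothetical AST flag, analogous in spirit to the discard-data flag. */
#define AST_FL_LAYOUT_CHANGING  0x1

struct blocking_ast {
        unsigned int ba_flags;
};

struct client_inode {
        bool ci_layout_valid;   /* cached layout may still be used for IO */
};

/*
 * Client-side blocking AST handler for an extent lock.  If the server set
 * the "layout is changing" flag, invalidate the cached layout so that any
 * new IO must first re-fetch the layout (and layout lock) from the MDS,
 * even if the client has not yet noticed it was evicted.
 */
static void extent_blocking_ast(struct client_inode *ci,
                                const struct blocking_ast *ast)
{
        if (ast->ba_flags & AST_FL_LAYOUT_CHANGING)
                ci->ci_layout_valid = false;
}
```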
Nathaniel Rutman
2009-Mar-06 22:59 UTC
[Lustre-devel] layout lock / extent lock interaction
Andreas Dilger wrote:
> On Mar 06, 2009  10:01 -0800, Nathaniel Rutman wrote:
>> [...]
>> What if the client hasn't noticed it has been disconnected from the
>> MDT by the time it tries to requeue extent locks?  It doesn't know
>> that the layout lock it is holding is invalid...
>
> That is a thorny problem.  I'll go through several partial solutions
> and why they do not work, then hopefully arrive at a safe solution at
> the end.
>
> One possibility is that the AST sent to the clients during the extent
> lock revocation would contain a flag that indicates "the layout is
> changing" (similar to the truncate/discard-data flag), so the clients
> get notified even if they are disconnected from the MDS.  That still
> isn't enough, however, because a client will only get this AST if it
> currently holds an extent lock, and that isn't always true.

How about if we introduce the concept of a layout generation?  The
generation is stored in the layout and also with each OST object.  When
the MDT takes the extent locks it sends the new generation to the OSTs.
Clients send the layout generation along with any extent lock enqueue,
and the OSTs only grant extents to clients that match the current
generation -- or maybe "match or exceed", in case the OST dies before
the new generation can be recorded.  The OST also raises its generation
to the latest seen whenever any (MDT or client) extent lock is enqueued.
(A rough sketch of this check follows this message.)

> A second option: if a client holding a layout lock is evicted AND the
> layout is being changed, then the MDS can't release the extent locks
> until at least one ping interval has passed (assuming any still-alive
> client would have detected the eviction and tried to reconnect).  This
> is also not 100% safe, because the client might have been evicted
> moments earlier due to some other lock, and the "wait for one ping
> interval" heuristic would no longer apply.
>
> We cannot depend on the layout change being drastic enough that the
> objects would no longer exist to be written to (CROW issues aside).
> If we are changing the layout to add a mirror, that wouldn't help, and
> we would end up with inconsistent data on each half of the mirror.
>
> Another option is something like "imperative eviction", where clients
> being evicted are actively told they are being evicted.  The problem
> is that the "you are evicted" RPC will normally be sent to a node
> which is already dead, and would slow down the MDS and/or consume all
> of its LNET credits, so this isn't really a usable option.
>
> A safe option (AFAICS) is to have MDS eviction force OST eviction (via
> obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
> recovery problems, but might be overly drastic if e.g. the client is
> being evicted from the MDS due to a router failure or a simple network
> partition.  Having a proper health network and server-side RPC
> resending would help avoid such problems.

This is drastic, but on the other hand we only need to do this if the
layout is being changed.  Of course, since the eviction would happen
before the layout change, we would need to remember who was evicted and
hasn't reconnected...

> This is one of the main reasons why having DLM servers on one node
> controlling resources on another node is a bad idea.  We had similar
> issues in the past when we locked all objects via the OST only on
> stripe index 0, and we might have similar problems with subtree locks
> in the future with CMD or any SNS RAID that locks only a subset of
> objects.
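A minimal sketch of the OST-side check that this proposal implies
(hypothetical names, not actual Lustre code): the enqueue is refused if
the client's layout generation is older than the one stored with the
object, and the stored value is ratcheted up to the newest generation
seen from either the MDT or a client.

```c
#include <stdbool.h>

struct ost_object {
        unsigned int oo_layout_gen;     /* highest layout generation seen */
};

/*
 * Called for every extent lock enqueue on this object.  @enqueue_gen is the
 * layout generation carried in the enqueue request (from the MDT when it
 * locks the extents for a layout change, or from a client's cached layout).
 * Returns true if the lock may be granted.
 */
static bool ost_layout_gen_check(struct ost_object *obj,
                                 unsigned int enqueue_gen)
{
        if (enqueue_gen < obj->oo_layout_gen)
                return false;           /* stale layout: client must refetch */

        /*
         * "Match or exceed": accept newer generations and remember them, in
         * case the OST restarted before the MDT's update was recorded.
         */
        obj->oo_layout_gen = enqueue_gen;
        return true;
}
```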
On Mar 06, 2009  14:59 -0800, Nathaniel Rutman wrote:
> How about if we introduce the concept of a layout generation?  The
> generation is stored in the layout and also with each OST object.
> When the MDT takes the extent locks it sends the new generation to the
> OSTs.  Clients send the layout generation along with any extent lock
> enqueue, and the OSTs only grant extents to clients that match the
> current generation -- or maybe "match or exceed", in case the OST dies
> before the new generation can be recorded.  The OST also raises its
> generation to the latest seen whenever any (MDT or client) extent lock
> is enqueued.

I like this idea.  We would need some place to store this information in
the LOV EA on the MDT, and to pass it to the client and to/on the OST.
We already have:
- inode versions (VBR; change on each file modification)
- IO epochs (SOM; change slowly as files are written, not persistent)
- recovery epochs (CMD/WBC; change frequently as global epochs are
  committed)

We could conceivably use the space in "l_ost_gen" in the first stripe,
as we have never implemented OST generations.  Those were intended for
OST replacement and/or OST snapshots, but have never been implemented.
That approach has the drawback that l_ost_gen is per-stripe, so we would
be wasting the additional l_ost_gen values in the later stripes, in
addition to breaking their intended use.  Maybe we just bite the bullet
and add another LOV EA type?

>> A safe option (AFAICS) is to have MDS eviction force OST eviction
>> (via obd_set_info_async(EVICT_BY_NID)).  That would also resolve some
>> other recovery problems, but might be overly drastic if e.g. the
>> client is being evicted from the MDS due to a router failure or a
>> simple network partition.  Having a proper health network and
>> server-side RPC resending would help avoid such problems.
>
> This is drastic, but on the other hand we only need to do this if the
> layout is being changed.  Of course, since the eviction would happen
> before the layout change, we would need to remember who was evicted
> and hasn't reconnected...

No, I don't think we need to remember recently-evicted clients, since
the MDS would also evict them from all OSTs immediately.  The goal, in
order to avoid this drastic action, would be to avoid evicting the
client from the MDS in the first place (e.g. by request resend or a
health net), which is a double win.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
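For the "add another LOV EA type" option, the on-disk addition might
look roughly like the sketch below.  The existing lov_mds_md/lov_ost_data
layout is only approximated here, and the single per-file lmm_layout_gen
field is the hypothetical addition; this is an illustration of the idea,
not a proposal for the actual wire/disk format.

```c
#include <stdint.h>

/* Per-stripe object descriptor, approximating the existing lov_ost_data. */
struct lov_ost_data_sketch {
        uint64_t l_object_id;   /* OST object id for this stripe */
        uint64_t l_object_gr;   /* OST object group */
        uint32_t l_ost_gen;     /* per-stripe OST generation (unused today) */
        uint32_t l_ost_idx;     /* OST index in the LOV */
};

/* A hypothetical new LOV EA variant carrying one layout generation per file. */
struct lov_mds_md_sketch {
        uint32_t lmm_magic;         /* new magic value for this EA type */
        uint32_t lmm_pattern;       /* striping pattern */
        uint64_t lmm_object_id;
        uint64_t lmm_object_gr;
        uint32_t lmm_stripe_size;
        uint32_t lmm_stripe_count;
        uint32_t lmm_layout_gen;    /* bumped by the MDT on every layout
                                     * change, instead of overloading
                                     * l_ost_gen in stripe 0 */
        struct lov_ost_data_sketch lmm_objects[];
};
```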