We're working on adding replay RPC signatures, so that clients may only
replay RPCs that have been seen by the server (thus signed).

Currently clients recover open file state by replaying the open RPCs.
Because files can stay open forever, this means that replay RPC signatures
must either remain valid forever (keys never deleted) or be renewed. But if
we add a PTLRPC replay signature renewal feature then we'll be causing
MDSes to do redundant work (since FID capabilities used in opens will also
have to be renewed). Since MDSes are typically CPU-bound as it is, adding
yet another cryptographic burden to them seems undesirable.

Therefore a way to recover open state that does not depend on replaying
RPCs with valid replay signatures is appealing. I've been researching this
(and talking to Eric B. and Oleg about this). Several possible solutions
are evident. I'll describe the one that seems most elegant to me (and, I
think, Oleg), namely separating open state recovery from transaction
recovery.

Server-side high-level description:

 - during recovery the MDS will first process anonymous open by FID RPCs
   from new clients (these open RPCs will not have transaction IDs assigned
   to them as they imply no actual transactions)

 - then the MDS will accept replays from all clients, new and old

 - followed by lock recovery as usual

Client-side high-level description:

 - open processing will begin by sending an RPC as usual...,

 - ... but on commit the md_open_data will be added to a doubly-linked list
   of opens and the RPC will be removed from the PTLRPC replay queue

 - during recovery the client will begin by traversing the list of
   md_open_data (open state), reconstructing an anonymous open by FID RPC
   for each and sending it to the MDS (see the sketch at the end of this
   message); after that the client will replay outstanding transactions'
   RPCs, followed by lock recovery

Old clients would recover as usual.

Security is provided by the capabilities used in the anonymous open by FID
RPCs and by transport security.

The general principle then would be:

  RPC replaying is to be used only for recovering _transactions_ that
  should not be outstanding for very long.

Where "very long" is relative to the replay signature crypto key lifecycle,
which will be on the order of days.

Since opens are not transactions[*] and can stay "outstanding" forever,
opens would not be suitable for recovery by replay under that principle.
Open state is much more similar to DLM locks than to transactions.

Open recovery must precede uncommitted transaction recovery so as to ensure
that open state is re-established before unlinks can be replayed that would
cause the file to be destroyed.

There are, of course, other ways to achieve the desired effect, that is, to
avoid having to renew replay signatures.

Comments? Advice?

Nico

[*] Any filesystem object creation implied by an open, such as when O_CREAT
is used, would be a transaction, but the open aspect of it wouldn't be.
Think of an open that creates as a filesystem transaction and an open
happening atomically.
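To make the client-side bookkeeping above concrete, here is a minimal
sketch. This is not actual Lustre code: md_open_data is the existing client
structure, but the field, list and helper names below (mod_open_list,
client_open_list, drop_from_replay_queue(), mdc_open_by_fid() and so on)
are illustrative assumptions.

    /* Illustrative sketch only -- not the actual Lustre client code. */
    struct md_open_data {
            struct list_head       mod_open_list;   /* links committed opens  */
            struct lu_fid          mod_fid;         /* FID of the open file   */
            __u64                  mod_open_flags;  /* flags of original open */
            struct ptlrpc_request *mod_open_req;    /* kept only until commit */
    };

    static LIST_HEAD(client_open_list);  /* committed, still-open files */

    /* Called when the open's transaction commits on the MDS: stop keeping
     * the original RPC for replay and track the open as pure in-core state. */
    static void open_committed(struct md_open_data *mod)
    {
            /* assumed helper: drop the request from the PTLRPC replay queue */
            drop_from_replay_queue(mod->mod_open_req);
            list_add_tail(&mod->mod_open_list, &client_open_list);
    }

    /* Recovery, step 1: re-establish open state with new (non-replay) RPCs. */
    static int recover_open_state(struct obd_export *exp)
    {
            struct md_open_data *mod;
            int rc = 0;

            list_for_each_entry(mod, &client_open_list, mod_open_list) {
                    /* assumed helper: builds and sends a fresh anonymous
                     * open-by-FID request -- no transno, no replay bit,
                     * authorized by the FID capability */
                    rc = mdc_open_by_fid(exp, &mod->mod_fid,
                                         mod->mod_open_flags);
                    if (rc)
                            break;
            }
            /* steps 2 and 3 (uncommitted transaction replay, lock recovery)
             * would follow as they do today */
            return rc;
    }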
On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:

> We're working on adding replay RPC signatures, so that clients may only
> replay RPCs that have been seen by the server (thus signed).

Could you explain that more? All replays have been seen by the server just
by definition, because the client got a reply from the server, so what is
the purpose of such signing?

> I've been researching this (and talking to Eric B. and Oleg about this).
> Several possible solutions are evident. I'll describe the one that
> seems most elegant to me (and, I think, Oleg), namely separating open
> state recovery from transaction recovery.
>
> Server-side high-level description:
>
>  - during recovery the MDS will first process anonymous open by FID RPCs
>    from new clients (these open RPCs will not have transaction IDs
>    assigned to them as they imply no actual transactions)
>
>  - then the MDS will accept replays from all clients, new and old

It is not clear what 'new' and 'old' mean here. If both 'new' and 'old'
clients have requests to replay, then both were active during the previous
server boot, so what is the difference between them?

>  - followed by lock recovery as usual
>
> Client-side high-level description:
>
>  - open processing will begin by sending an RPC as usual...,
>
>  - ... but on commit the md_open_data will be added to a doubly-linked
>    list of opens and the RPC will be removed from the PTLRPC replay
>    queue
>
>  - during recovery the client will begin by traversing the list of
>    md_open_data (open state), reconstructing an anonymous open by FID
>    RPC for each and sending it to the MDS; after that the client will
>    replay outstanding transactions' RPCs, followed by lock recovery

Hmm, but currently it works exactly like this: the committed open replays
are sent first, followed by normal replays. So you propose to separate
them just because they are not 'pure' replays, as you described below?

> Old clients would recover as usual.
>
> Security is provided by the capabilities used in the anonymous open by
> FID RPCs and by transport security.
>
> The general principle then would be:
>
>   RPC replaying is to be used only for recovering _transactions_ that
>   should not be outstanding for very long.
>
> Where "very long" is relative to the replay signature crypto key
> lifecycle, which will be on the order of days.
>
> Since opens are not transactions[*] and can stay "outstanding" forever,
> opens would not be suitable for recovery by replay under that principle.
> Open state is much more similar to DLM locks than to transactions.
>
> Open recovery must precede uncommitted transaction recovery so as to
> ensure that open state is re-established before unlinks can be replayed
> that would cause the file to be destroyed.

That requires that the server not start replays from any client until
'open recovery' is finished for all of them. In fact there is another
solution for the open-unlink problem that was implemented in 1.8. During
recovery the unlink replay doesn't delete the file but makes it an orphan
even if the open count is 0. After recovery the orphans are cleaned up, so
an open replay after the unlink will find the orphan and open it.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > We're working on adding replay RPC signatures, so that clients may only
> > replay RPCs that have been seen by the server (thus signed).
>
> Could you explain that more? All replays have been seen by the server
> just by definition, because the client got a reply from the server, so
> what is the purpose of such signing?

They've been seen, indeed, but when replayed not all the same permissions
checks may be done, so the server needs to know that the replay is safe to
process. There are two ways to do that: never skip any permissions checks
when processing replayed RPCs, or have the server sign replayable RPCs so
that the server can validate any replays. I've not looked at a complete
list of checks that are skipped on replays -- perhaps we should have such
a list before we go down the replay signature path.

> > [...]
> > - then the MDS will accept replays from all clients, new and old
>
> It is not clear what 'new' and 'old' mean here. If both 'new' and 'old'
> clients have requests to replay, then both were active during the
> previous server boot, so what is the difference between them?

Old clients would be clients that don't implement this new form of open
state recovery (e.g., 1.6, 1.8 clients). New clients would be clients that
do implement this new form of open state recovery (2.x).

> > - followed by lock recovery as usual
> >
> > Client-side high-level description:
> >
> > [...]
>
> Hmm, but currently it works exactly like this: the committed open
> replays are sent first, followed by normal replays. So you propose to
> separate them just because they are not 'pure' replays, as you described
> below?

It doesn't work as I proposed: opens are currently recovered by
_replaying_ RPCs (which potentially had side-effects besides creating open
state). Or at least that's my understanding.

In my proposal open state recovery for opens associated with completed
transactions would always be done by generating new anonymous open by FID
RPCs (not replayed ones).

> > The general principle then would be:
> >
> >   RPC replaying is to be used only for recovering _transactions_ that
> >   should not be outstanding for very long.
> >
> > Where "very long" is relative to the replay signature crypto key
> > lifecycle, which will be on the order of days.
> >
> > Since opens are not transactions[*] and can stay "outstanding"
> > forever, opens would not be suitable for recovery by replay under that
> > principle. Open state is much more similar to DLM locks than to
> > transactions.
> >
> > Open recovery must precede uncommitted transaction recovery so as to
> > ensure that open state is re-established before unlinks can be
> > replayed that would cause the file to be destroyed.
>
> That requires that the server not start replays from any client until
> 'open recovery' is finished for all of them. In fact there is another

Correct.

> solution for the open-unlink problem that was implemented in 1.8. During
> recovery the unlink replay doesn't delete the file but makes it an
> orphan even if the open count is 0. After recovery the orphans are
> cleaned up, so an open replay after the unlink will find the orphan and
> open it.

That idea did cross my mind. The MDS would have to keep a list of such
unlinks so it can drop their open count if they truly aren't open. That
seems like extra work that the MDS shouldn't have to do.

Nico
--
On Fri, Jul 03, 2009 at 04:55:28PM -0500, Nicolas Williams wrote:
> On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> > On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> > <Nicolas.Williams at sun.com> wrote:
> > > We're working on adding replay RPC signatures, so that clients may
> > > only replay RPCs that have been seen by the server (thus signed).
> >
> > Could you explain that more? All replays have been seen by the server
> > just by definition, because the client got a reply from the server, so
> > what is the purpose of such signing?
>
> They've been seen, indeed, but when replayed not all the same
> permissions checks may be done, so the server needs to know that the
> replay is safe to process. There are two ways to do that: never skip any
> permissions checks when processing replayed RPCs, or have the server
> sign replayable RPCs so that the server can validate any replays. I've
> not looked at a complete list of checks that are skipped on replays --
> perhaps we should have such a list before we go down the replay
> signature path.

Oh, I forgot for a moment, but the other point of replay signatures is to
prevent clients from causing other clients to be evicted.
On Sat, 04 Jul 2009 01:55:28 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
> On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
>
> They've been seen, indeed, but when replayed not all the same
> permissions checks may be done, so the server needs to know that the
> replay is safe to process. There are two ways to do that: never skip any
> permissions checks when processing replayed RPCs, or have the server
> sign replayable RPCs so that the server can validate any replays. I've
> not looked at a complete list of checks that are skipped on replays --
> perhaps we should have such a list before we go down the replay
> signature path.

OK, so it is not about fake/malformed clients only; that is interesting.
Is there any preliminary arch/HLD document describing that? I am
interested in more background, if there is any.

> In my proposal open state recovery for opens associated with completed
> transactions would always be done by generating new anonymous open by
> FID RPCs (not replayed ones).

Well, I see no difference yet. Currently all open 'replays' are passed
straight to open_by_fid(), which opens the file and creates the mfd
structure for it, so it is the same on the server side at least. Did I
miss something?

> > > Open recovery must precede uncommitted transaction recovery so as to
> > > ensure that open state is re-established before unlinks can be
> > > replayed that would cause the file to be destroyed.
> >
> > That requires that the server not start replays from any client until
> > 'open recovery' is finished for all of them. In fact there is another
>
> Correct.

That is more regression than benefit: having such a 'barrier' during
recovery leads to longer recovery with an unbalanced server load. There
are a couple of improvements on the way already to make the recovery of
each client more independent from the others where possible, e.g. the
transaction-based recovery can be replaced with version-based recovery
only. So adding new barriers is not good in these terms.

> > solution for the open-unlink problem that was implemented in 1.8.
> > During recovery the unlink replay doesn't delete the file but makes it
> > an orphan even if the open count is 0. After recovery the orphans are
> > cleaned up, so an open replay after the unlink will find the orphan
> > and open it.
>
> That idea did cross my mind. The MDS would have to keep a list of such
> unlinks so it can drop their open count if they truly aren't open. That
> seems like extra work that the MDS shouldn't have to do.

There is already such a mechanism on the MDS to handle open-unlink cases.
The MDS keeps orphaned files while they are open and deletes all
non-reopened ones after recovery. We can just use this mechanism during
recovery, moving unlinked files to orphans. It works this way already in
1.8 and should be even simpler in 2.0 due to FIDs. There are only extra
checks, no need to keep an extra list or anything. I think this is the
preferable way to go because we avoid the 'barriers' in recovery mentioned
above.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Sat, 04 Jul 2009 04:48:51 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
> On Fri, Jul 03, 2009 at 04:55:28PM -0500, Nicolas Williams wrote:
> > On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> > > On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> > > <Nicolas.Williams at sun.com> wrote:
> > > > We're working on adding replay RPC signatures, so that clients may
> > > > only replay RPCs that have been seen by the server (thus signed).
> > >
> > > Could you explain that more? All replays have been seen by the
> > > server just by definition, because the client got a reply from the
> > > server, so what is the purpose of such signing?
> >
> > They've been seen, indeed, but when replayed not all the same
> > permissions checks may be done, so the server needs to know that the
> > replay is safe to process. There are two ways to do that: never skip
> > any permissions checks when processing replayed RPCs, or have the
> > server sign replayable RPCs so that the server can validate any
> > replays. I've not looked at a complete list of checks that are skipped
> > on replays -- perhaps we should have such a list before we go down the
> > replay signature path.
>
> Oh, I forgot for a moment, but the other point of replay signatures is
> to prevent clients from causing other clients to be evicted.

Ah, I am even more interested in some background about this idea. I
thought this was needed only as protection against malformed clients, but
it looks to be more functional than that.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
Note too that recovering opens by reconstruction and before outstanding
transactions commits us to always using capabilities. The reason is that
an open might not be permitted prior to replaying, say, a chmod, but with
capabilities the open would be permitted because of the capability issued
earlier, before the MDS restarted.

If we want an option to disable capabilities, then we should recover opens
in order at RPC replay time, but the opens should still not be replays
(unless O_CREAT is involved and that transaction hasn't committed).

Nico
--
On Sat, Jul 04, 2009 at 11:10:41AM +0400, Mikhail Pershin wrote:
> On Sat, 04 Jul 2009 01:55:28 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
>
> OK, so it is not about fake/malformed clients only; that is interesting.
> Is there any preliminary arch/HLD document describing that? I am
> interested in more background, if there is any.

See bug #18657.

> > In my proposal open state recovery for opens associated with completed
> > transactions would always be done by generating new anonymous open by
> > FID RPCs (not replayed ones).
>
> Well, I see no difference yet. Currently all open 'replays' are passed
> straight to open_by_fid(), which opens the file and creates the mfd
> structure for it, so it is the same on the server side at least. Did I
> miss something?

The difference is on the wire. Currently open state recovery replays RPCs.
This has a very specific meaning: the original RPC is sent again with a
bit set in the ptlrpc header to indicate that it is a replay. When the
transaction had already been committed this replay is processed on the
server side as an anonymous open by FID, but on the wire the open may have
been something other than an anon open by FID.

In my proposal what would happen is that opens would only be recovered by
_replay_ when the transaction had not yet been committed; otherwise the
opens will be recovered by making a _new_ (non-replay) open RPC (a rough
sketch follows at the end of this message).

> > > > Open recovery must precede uncommitted transaction recovery so as
> > > > to ensure that open state is re-established before unlinks can be
> > > > replayed that would cause the file to be destroyed.
> > >
> > > That requires that the server not start replays from any client
> > > until 'open recovery' is finished for all of them. In fact there is
> > > another
> >
> > Correct.
>
> That is more regression than benefit: having such a 'barrier' during
> recovery leads to longer recovery with an unbalanced server load. There
> are a couple of improvements on the way already to make the recovery of
> each client more independent from the others where possible, e.g. the
> transaction-based recovery can be replaced with version-based recovery
> only. So adding new barriers is not good in these terms.

I'm not sure why a new stage would necessarily slow recovery in a
significant way. The new stage would not involve any writes to disk
(though it would involve reads, reads which could then be cached and would
benefit the transaction recovery phase).

There is an alternative: recover opens during transaction recovery in
transaction order, but for committed opens (or opens that had no
filesystem transaction to commit, i.e., opens without O_CREAT) use new
RPCs instead of replay RPCs. The amount of work should be the same as with
the proposed solution, but with better cache locality of reference.

Also, recovering opens before transactions would bind us to always having
capabilities enabled (see my other post just now), whereas the above
alternative would not.

> > > solution for the open-unlink problem that was implemented in 1.8.
> > > During recovery the unlink replay doesn't delete the file but makes
> > > it an orphan even if the open count is 0. After recovery the orphans
> > > are cleaned up, so an open replay after the unlink will find the
> > > orphan and open it.
> >
> > That idea did cross my mind. The MDS would have to keep a list of such
> > unlinks so it can drop their open count if they truly aren't open.
> > That seems like extra work that the MDS shouldn't have to do.
>
> There is already such a mechanism on the MDS to handle open-unlink
> cases. The MDS keeps orphaned files while they are open and deletes all
> non-reopened ones after recovery. We can just use this mechanism during
> recovery, moving unlinked files to orphans. It works this way already in
> 1.8 and should be even simpler in 2.0 due to FIDs. There are only extra
> checks, no need to keep an extra list or anything. I think this is the
> preferable way to go because we avoid the 'barriers' in recovery
> mentioned above.

Suppose we recovered opens after transactions: we'd still have additional
costs for the last unlinks, since we'd have to put the object on an
on-disk queue of orphans until all open state is recovered. See above.

Nico
--
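To illustrate the on-the-wire distinction described above (replay bit vs.
fresh RPC), here is a rough client-side sketch. MSG_REPLAY,
lustre_msg_add_flags() and ptlrpc_queue_wait() are existing ptlrpc names;
the md_open_data fields and the two helpers are the same illustrative
assumptions as in the earlier sketch, not actual Lustre code.

    /* Sketch only: per-open decision during client recovery. */
    static int resend_one_open(struct obd_import *imp,
                               struct md_open_data *mod)
    {
            struct ptlrpc_request *req;

            if (!open_transno_committed(mod)) {       /* assumed helper */
                    /* Uncommitted open (e.g. O_CREAT whose create has not
                     * yet hit disk): resend the original RPC as a true
                     * replay, carrying its replay signature. */
                    req = mod->mod_open_req;
                    lustre_msg_add_flags(req->rq_reqmsg, MSG_REPLAY);
            } else {
                    /* Committed open: build a brand-new anonymous
                     * open-by-FID request from in-core state -- no
                     * MSG_REPLAY, no transno, authorized only by the FID
                     * capability it carries. */
                    req = build_anon_open_by_fid(imp, &mod->mod_fid,
                                                 mod->mod_open_flags);
            }
            return ptlrpc_queue_wait(req);            /* send and wait */
    }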
On Mon, Jul 06, 2009 at 12:20:09PM -0500, Nicolas Williams wrote:
> Note too that recovering opens by reconstruction and before outstanding
> transactions commits us to always using capabilities. The reason is that
> an open might not be permitted prior to replaying, say, a chmod, but
> with capabilities the open would be permitted because of the capability
> issued earlier, before the MDS restarted.
>
> If we want an option to disable capabilities, then we should recover
> opens in order at RPC replay time, but the opens should still not be
> replays (unless O_CREAT is involved and that transaction hasn't
> committed).

The above is all wrong, fortunately :)  Oleg explained it to me.
Transactions are committed linearly in time, so if a chmod that an open
depends on has not yet committed then that open's state must be recovered
by replay rather than by reconstruction.

Nico
--
On Mon, Jul 06, 2009 at 12:34:41PM -0500, Nicolas Williams wrote:
> On Sat, Jul 04, 2009 at 11:10:41AM +0400, Mikhail Pershin wrote:
> > That is more regression than benefit: having such a 'barrier' during
> > recovery leads to longer recovery with an unbalanced server load.
> > There are a couple of improvements on the way already to make the
> > recovery of each client more independent from the others where
> > possible, e.g. the transaction-based recovery can be replaced with
> > version-based recovery only. So adding new barriers is not good in
> > these terms.
>
> I'm not sure why a new stage would necessarily slow recovery in a
> significant way. The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> would benefit the transaction recovery phase).

Also, as Oleg explained to me, most open state is for files whose opens
committed long ago, so most open state is recovered before other
transactions. Which means we already have a separate open state recovery
phase -- it just isn't explicit. So the only thing that changes in my
proposal is that all committed open state will be recovered by anonymous
open by FID reconstruction instead of by replay, while all other
transactions (including as-yet uncommitted opens) will be recovered by
replay.

There would be no new timeouts, and there should be no other negative
impact on recovery time/performance. Recovery performance should actually
be improved, when replay signatures are enabled, since there would be no
need to verify replay signatures for most open state recovery.

Nico
--
Nicolas Williams wrote:
> Also, as Oleg explained to me, most open state is for files whose opens
> committed long ago, so most open state is recovered before other
> transactions. Which means we already have a separate open state
> recovery phase -- it just isn't explicit. So the only thing that
> changes in my proposal is that all committed open state will be
> recovered by anonymous open by FID reconstruction instead of by replay,
> while all other transactions (including as-yet uncommitted opens) will
> be recovered by replay.

I think it'd be slightly easier to introduce two notions of replay:

1) on-disk replay -- we try to recover some on-disk state from the
   client's cache: regular requests like mkdir, unlink, rename, setattr,
   etc.

2) in-core replay -- we try to recover some in-core state from the
   client's cache: ldlm locks, open files

The thing is that open(2) is quite interesting in this regard because it
does (1) *and* (2). I believe this is why we used (1) for (2).

My old thought was that instead of introducing a special new open-by-fid
RPC we should try to implement open in terms of LDLM locks, because it's
in-core state (though with specific tracking of unlinked files). Given
this we'd automatically get a single mechanism for all in-core state and
we'd get rid of the special paths for open replays.

thanks, Alex
On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
>
> In my proposal what would happen is that opens would only be recovered
> by _replay_ when the transaction had not yet been committed; otherwise
> the opens will be recovered by making a _new_ (non-replay) open RPC.
>

Yes, I understood that, and I agree that this looks like a cleaner
implementation, but I see the following problems so far:

 - two kinds of client, new and old, that must be handled somehow
 - the client code has to be changed a lot
 - the server needs to understand and handle this too

What will we get for this? Sorry for my persistence, but it looks to me
like it can be solved in simpler ways. E.g. you can add an MGS_OPEN_REPLAY
flag to such requests, so they will also be different on the wire from
transaction replays. Or we could re-use the lock replay functionality
somehow. The locks are not kept as saved RPCs either, but are enqueued as
new requests. Open is very close to this; I agree with the idea that the
open handle has all the needed info and there is no need to keep the
original RPC in this case.

I mean that the proposed solution looks overcomplicated just to solve the
signature problem, though it makes sense in general. If we are going to
reorganize open recovery and have time for this, it would be better to
move it out of the replay signature context into a separate task, as it is
quite complex.

> I'm not sure why a new stage would necessarily slow recovery in a
> significant way. The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> would benefit the transaction recovery phase).

Not necessarily, but it can. It is not about the open stage only, it is
about the whole approach of doing recovery in stages, where all clients
must wait for every other client at each stage before they can continue
recovery. We already have this in HEAD and it extends the recovery window.
Lustre 1.8 had only a single timer for recovery; Lustre 2.0 has 3 stages
and the timer must be re-set after each one. If all clients are alive then
the recovery time will be mostly the same, but if clients go away during
recovery then the Lustre 2.0 recovery time can already be three times
longer. Just imagine that at each stage one client is gone: then at each
stage all clients will wait until the timer expires. And the bigger the
cluster, the more clients can be lost during recovery, so recovery time
may differ significantly. Also this means that the server load is not well
distributed over the recovery time: it waits, then starts processing all
requests at once, then waits again at the next stage, etc.

Another point here is the possible use of version-based recovery instead
of transaction-based recovery. This makes recovery based on object
versions, and then it makes no sense to wait for all clients at each
recovery stage, because all dependencies should be clear from the versions
and clients may finish recovery independently. Currently requests can
already be recovered by version, and there is work on lock replay using
versions too.

> Suppose we recovered opens after transactions: we'd still have
> additional costs for the last unlinks, since we'd have to put the object
> on an on-disk queue of orphans until all open state is recovered. See
> above.

There is no additional cost for an open-unlink pair, because the orphan is
needed after the unlink anyway. The only exception is the replay of a pure
unlink. But we need to keep orphans after unlinks for other cases anyway,
e.g. delayed recovery, and such overhead is nothing compared with the time
that can be lost waiting for everyone as described above.

In fact this is already slightly out of the scope of the original idea
about open replay organization. It is more related to server recovery
handling, version recovery, and delayed recovery, and can be discussed
later once the open replay changes on the client are settled; it will be
clearer at that time.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Jul 07, 2009 13:56 +0400, Alex Zhuravlev wrote:
> Nicolas Williams wrote:
> > Also, as Oleg explained to me, most open state is for files whose
> > opens committed long ago, so most open state is recovered before other
> > transactions. Which means we already have a separate open state
> > recovery phase -- it just isn't explicit. So the only thing that
> > changes in my proposal is that all committed open state will be
> > recovered by anonymous open by FID reconstruction instead of by
> > replay, while all other transactions (including as-yet uncommitted
> > opens) will be recovered by replay.
>
> I think it'd be slightly easier to introduce two notions of replay:
>
> 1) on-disk replay -- we try to recover some on-disk state from the
>    client's cache: regular requests like mkdir, unlink, rename, setattr,
>    etc.
>
> 2) in-core replay -- we try to recover some in-core state from the
>    client's cache: ldlm locks, open files
>
> The thing is that open(2) is quite interesting in this regard because it
> does (1) *and* (2). I believe this is why we used (1) for (2).
>
> My old thought was that instead of introducing a special new open-by-fid
> RPC we should try to implement open in terms of LDLM locks, because it's
> in-core state (though with specific tracking of unlinked files). Given
> this we'd automatically get a single mechanism for all in-core state and
> we'd get rid of the special paths for open replays.

One problem with this is that the ordering needs to be preserved. Opens
that have committed need to be replayed before any other replay
operations, because those replayed operations may depend on the file being
open. However, "normal" lock replay should happen after (or conceivably
during) operation replay so that the objects being locked actually exist
and the server can (hopefully soon) verify the lock version number during
recovery.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Jul 07, 2009 17:56 +0400, Mike Pershin wrote:
> On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > In my proposal what would happen is that opens would only be recovered
> > by _replay_ when the transaction had not yet been committed; otherwise
> > the opens will be recovered by making a _new_ (non-replay) open RPC.
>
> Yes, I understood that, and I agree that this looks like a cleaner
> implementation, but I see the following problems so far:
>  - two kinds of client, new and old, that must be handled somehow
>  - the client code has to be changed a lot
>  - the server needs to understand and handle this too
>
> What will we get for this? Sorry for my persistence, but it looks to me
> like it can be solved in simpler ways. E.g. you can add an
> MGS_OPEN_REPLAY flag to such requests, so they will also be different on
> the wire from transaction replays. Or we could re-use the lock replay
> functionality somehow. The locks are not kept as saved RPCs either, but
> are enqueued as new requests. Open is very close to this; I agree with
> the idea that the open handle has all the needed info and there is no
> need to keep the original RPC in this case.

There are actually multiple benefits from this change:

- we can remove the awkward handling of open RPCs that are saved even
  after they have been committed to disk. That code has had so many bugs
  in it (and probably still has some) that I will be happy when it is
  gone.

- we don't have RPCs saved for replay that cannot be flushed during a
  server upgrade. For the Simplified Interoperability feature we need to
  be able to clear all of the saved RPCs from memory so that it is
  possible to change the RPC format over an upgrade. Regenerating the
  _new_ RPCs from the open file handles allows this to happen.

> I mean that the proposed solution looks overcomplicated just to solve
> the signature problem, though it makes sense in general. If we are going
> to reorganize open recovery and have time for this, it would be better
> to move it out of the replay signature context into a separate task, as
> it is quite complex.

To my thinking, I don't know that we need to introduce a new RPC _type_
for the open; AFAIK the old open replay RPC will already do open-by-FID.
The core change here is that the open RPCs will be newly generated at
recovery time instead of being kept in memory.

This actually has a second benefit in that we don't have to keep huge
lists of open RPCs in the replay list that will be skipped each time we
are trying to cancel committed RPCs. For HPCS we need to handle 100k opens
on a single client, and cancelling RPCs from the replay list is an O(n^2)
operation since it does a list walk to find just-committed RPCs.

> > I'm not sure why a new stage would necessarily slow recovery in a
> > significant way. The new stage would not involve any writes to disk
> > (though it would involve reads, reads which could then be cached and
> > would benefit the transaction recovery phase).
>
> Not necessarily, but it can. It is not about the open stage only, it is
> about the whole approach of doing recovery in stages, where all clients
> must wait for every other client at each stage before they can continue
> recovery. We already have this in HEAD and it extends the recovery
> window. Lustre 1.8 had only a single timer for recovery; Lustre 2.0 has
> 3 stages and the timer must be re-set after each one.

Actually, the separate recovery stages in HEAD are no longer needed. The
addition of extra replay stages was a result of fixing a bug in recovery
where open file handles were not being replayed before another client
unlinked the file. However, this has to be fixed for VBR delayed recovery
anyways, so we may as well fix this with a single mechanism instead of
adding a separate recovery stage that requires waiting for all clients to
join or be evicted before any recovery can start.

[details for the above]

INITIAL ORDER
=============
client 1                    client 2                 MDS
--------                    --------                 ---
open A (transno N)
{use A}                                              ***commit >= N***
                            unlink A (transno X)
{continue to use A}
                                                     ***crash***

REPLAY ORDER
============
client 1                    client 2                 MDS
--------                    --------                 ---
{slow reconnect}                                     ***last committed < X***
                            unlink A (transno X)
open A (transno N) = -ENOENT
{A can no longer be used}

The proper solution, as also needed by delayed recovery, is to move A to
the PENDING list during replay and remove it at the end of replay. With
1.x we would have to also remove the inode from PENDING if some other node
reuses that inode number, but since this extra recovery stage is only
present in 2.0 and we will not implement delayed recovery for 1.x we can
simply remove all unreferenced inodes from PENDING at the end of recovery
(until delayed recovery is completed).

It would be possible to flag the unlink RPCs with a special flag (maybe
just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between unlinks that
also destroy the objects and unlinks that cause open-unlinked files. For
replayed unlinks that cause objects to be destroyed, we know that there
are no other clients holding the file open after that point and we don't
have to put the inode into PENDING at all (a rough sketch follows at the
end of this message).

> If all clients are alive then the recovery time will be mostly the same,
> but if clients go away during recovery then the Lustre 2.0 recovery time
> can already be three times longer. Just imagine that at each stage one
> client is gone: then at each stage all clients will wait until the timer
> expires. And the bigger the cluster, the more clients can be lost during
> recovery, so recovery time may differ significantly. Also this means
> that the server load is not well distributed over the recovery time: it
> waits, then starts processing all requests at once, then waits again at
> the next stage, etc.
>
> Another point here is the possible use of version-based recovery instead
> of transaction-based recovery. This makes recovery based on object
> versions, and then it makes no sense to wait for all clients at each
> recovery stage, because all dependencies should be clear from the
> versions and clients may finish recovery independently. Currently
> requests can already be recovered by version, and there is work on lock
> replay using versions too.

I fully agree - it would be ideal if recovery started immediately without
any waiting for other clients.

> > Suppose we recovered opens after transactions: we'd still have
> > additional costs for the last unlinks, since we'd have to put the
> > object on an on-disk queue of orphans until all open state is
> > recovered. See above.
>
> There is no additional cost for an open-unlink pair, because the orphan
> is needed after the unlink anyway. The only exception is the replay of a
> pure unlink. But we need to keep orphans after unlinks for other cases
> anyway, e.g. delayed recovery, and such overhead is nothing compared
> with the time that can be lost waiting for everyone as described above.

Agreed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
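A rough server-side sketch of the unlink-replay handling described above.
OBD_MD_FLEASIZE, OBD_MD_FLCOOKIE and the PENDING directory are existing
names; the handler shape and the helpers (unlink_last_name(),
destroy_inode_and_objects(), move_inode_to_pending()) are illustrative
assumptions only, not the actual MDS code.

    /* Sketch only: handling a *replayed* unlink that removes the last link.
     * 'ctx' is a placeholder for whatever per-request state the handler
     * already has; 'reply_valid' is the valid mask from the saved reply. */
    static int replay_unlink(struct unlink_replay_ctx *ctx, __u64 reply_valid)
    {
            int rc = unlink_last_name(ctx);              /* assumed helper */
            if (rc)
                    return rc;

            if (reply_valid & (OBD_MD_FLEASIZE | OBD_MD_FLCOOKIE)) {
                    /* The original reply carried the EA/llog cookies, i.e.
                     * the objects really were destroyed the first time:
                     * no client can still hold the file open, so destroy
                     * it again right away. */
                    rc = destroy_inode_and_objects(ctx); /* assumed helper */
            } else {
                    /* The file may be open-unlinked: keep the inode as an
                     * orphan (PENDING) for the rest of recovery so a later
                     * open replay / open-by-FID can still find it. */
                    rc = move_inode_to_pending(ctx);     /* assumed helper */
            }
            return rc;
    }

    /* At the end of recovery, any PENDING inode that was not re-opened is
     * removed, as orphan cleanup after recovery already does today. */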
On Tue, Jul 07, 2009 at 01:56:52PM +0400, Alex Zhuravlev wrote:
> I think it'd be slightly easier to introduce two notions of replay:

As I understand it, 'replay' has a very specific meaning: re-send an RPC
with the 'replay' bit set in the ptlrpc header.

> [...]
> My old thought was that instead of introducing a special new open-by-fid
> RPC we should try to implement open in terms of LDLM locks, because it's
> in-core state (though with specific tracking of unlinked files). Given
> this we'd automatically get a single mechanism for all in-core state and
> we'd get rid of the special paths for open replays.

Hmmm, but open by FID gives the MDS a chance to check capabilities. Yes,
that's probably not terribly important as long as the OSSes also check
capabilities.

Also, there's the unlink issue to worry about. Mikhail's proposal for that
is to defer unlinks until after open state recovery (in this case: until
after DLM recovery). That would work, I think. Also, you could have the
kind of DLM locks used for open state tracking recovered first, then
transactions, then all other types of locks.

Here's a question: what consumes more memory on the MDS: open state or a
DLM lock?

Nico
--
On Tue, Jul 07, 2009 at 05:56:36PM +0400, Mikhail Pershin wrote:
> What will we get for this? Sorry for my persistence, but it looks to me
> like it can be solved in simpler ways. E.g. you can add an
> MGS_OPEN_REPLAY flag to such requests, so they will also be different on
> the wire from transaction replays. Or we could re-use the lock replay
> functionality somehow.

Making the open replays look different on the wire is exactly what this is
about. They'll look different from other replays in that they will not
have a replay signature. But replay signatures are a PTLRPC layer feature,
so how should PTLRPC know whether to allow such a replay to pass through?
One way is to let it pass through replays with valid signatures and
non-replays, and then let the MDT have non-replay handlers only for anon
open by FID during recovery (a rough sketch follows at the end of this
message). Then the client might as well not bother caching open RPCs
forever, just until they commit -- it can reconstruct open RPCs from
in-core state (vnode, ...) anytime it needs to.

Using DLM locks to represent open state is interesting. It would require
either recovering those first or deferring final unlinks at transaction
recovery time.

Another problem with using locks for open state is that establishing the
lock atomically with an open w/ create won't be easy. The MDT would have
to enqueue a lock for itself atomically with the create, then the client
would have to enqueue its lock, then the MDT would have to drop its lock.
Would this not be much more complex than open RPC reconstruction?

> The locks are not kept as saved RPCs either, but are enqueued as new
> requests. Open is very close to this; I agree with the idea that the
> open handle has all the needed info and there is no need to keep the
> original RPC in this case.

Yes.

Nico
--
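As a minimal sketch of that gating, assuming a per-request admission hook
during the recovery window: MSG_REPLAY and lustre_msg_get_flags() are
existing ptlrpc names, while the two check helpers are placeholders, not
real Lustre functions.

    /* Sketch only: deciding what the target accepts while in recovery. */
    static int recovery_admit_request(struct ptlrpc_request *req)
    {
            if (lustre_msg_get_flags(req->rq_reqmsg) & MSG_REPLAY)
                    /* True replays must carry a valid replay signature. */
                    return verify_replay_signature(req);   /* placeholder */

            /* Non-replay traffic during recovery: only the anonymous
             * open-by-FID used for open state recovery is let through; it
             * is authorized by its FID capability, not by a signature. */
            if (is_anon_open_by_fid(req))                  /* placeholder */
                    return 0;

            return -EBUSY;  /* everything else waits until recovery ends */
    }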
On Tue, 07 Jul 2009 19:21:05 +0400, Andreas Dilger <adilger at sun.com> wrote:
> This actually has a second benefit in that we don't have to keep huge
> lists of open RPCs in the replay list that will be skipped each time we
> are trying to cancel committed RPCs. For HPCS we need to handle 100k
> opens on a single client, and cancelling RPCs from the replay list is
> an O(n^2) operation since it does a list walk to find just-committed
> RPCs.

Absolutely, all the benefits are clear and I fully agree, but none of them
are in the replay signature context. I was just afraid that making such
big changes inside the replay signature task would delay the replay
signature work itself. But if we have time to do it the right way then it
is good.

> Actually, the separate recovery stages in HEAD are no longer needed. The
> addition of extra replay stages was a result of fixing a bug in recovery
> where open file handles were not being replayed before another client
> unlinked the file. However, this has to be fixed for VBR delayed
> recovery anyways, so we may as well fix this with a single mechanism
> instead of adding a separate recovery stage that requires waiting for
> all clients to join or be evicted before any recovery can start.
>
> The proper solution, as also needed by delayed recovery, is to move A
> to the PENDING list during replay and remove it at the end of replay.
> With 1.x we would have to also remove the inode from PENDING if some
> other node reuses that inode number, but since this extra recovery
> stage is only present in 2.0 and we will not implement delayed recovery
> for 1.x we can simply remove all unreferenced inodes from PENDING at
> the end of recovery (until delayed recovery is completed).

Exactly, that is what I meant and that is why I don't like another strict
'stage'.

> It would be possible to flag the unlink RPCs with a special flag (maybe
> just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between unlinks
> that also destroy the objects and unlinks that cause open-unlinked
> files. For replayed unlinks that cause objects to be destroyed, we know
> that there are no other clients holding the file open after that point
> and we don't have to put the inode into PENDING at all.

I've just thought about the same thing; it is quite an obvious solution
here.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Tue, Jul 07, 2009 at 08:42:52PM +0400, Mikhail Pershin wrote:
> On Tue, 07 Jul 2009 19:21:05 +0400, Andreas Dilger <adilger at sun.com> wrote:
> > This actually has a second benefit in that we don't have to keep huge
> > lists of open RPCs in the replay list that will be skipped each time
> > we are trying to cancel committed RPCs. For HPCS we need to handle
> > 100k opens on a single client, and cancelling RPCs from the replay
> > list is an O(n^2) operation since it does a list walk to find
> > just-committed RPCs.

This seems like a problem that could be solved anyways, but then, having
to cache these RPCs forever is a waste of resources.

> Absolutely, all the benefits are clear and I fully agree, but none of
> them are in the replay signature context. I was just afraid that making
> such big changes inside the replay signature task would delay the replay
> signature work itself. But if we have time to do it the right way then
> it is good.

Adding replay signature renewal just to avoid this restructuring seems
equally bad. Not adding replay signature renewal and not bothering with
rekeying is OK in the short term, but eventually it'd become a problem.
Given all the other benefits of doing committed open state recovery by
reconstruction, it seems like a good idea to just do it.

> > It would be possible to flag the unlink RPCs with a special flag
> > (maybe just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between
> > unlinks that also destroy the objects and unlinks that cause
> > open-unlinked files. For replayed unlinks that cause objects to be
> > destroyed, we know that there are no other clients holding the file
> > open after that point and we don't have to put the inode into PENDING
> > at all.
>
> I've just thought about the same thing; it is quite an obvious solution
> here.

An excellent idea. Replay signatures would have to cover that bit. I'll
add that to the HLD.

Nico
--
Andreas Dilger wrote:
> > my old thought was that instead of introducing a special new
> > open-by-fid RPC we should try to implement open in terms of LDLM
> > locks, because it's in-core state (though with specific tracking of
> > unlinked files). Given this we'd automatically get a single mechanism
> > for all in-core state and we'd get rid of the special paths for open
> > replays.
>
> One problem with this is that the ordering needs to be preserved. Opens
> that have committed need to be replayed before any other replay
> operations, because those replayed operations may depend on the file
> being open. However, "normal" lock replay should happen after (or
> conceivably during) operation replay so that the objects being locked
> actually exist and the server can (hopefully soon) verify the lock
> version number during recovery.

Well, that ordering is already "dead" due to VBR? I think the semantics of
unlink is just to unlink the name; everything else is up to the MDS (when
to destroy the inode and objects). Also notice that inode destroy is a
different transaction in general (due to a possibly multi-transaction
truncate).

If we decouple unlink and object destroy, then the following sequence
should work:

1) replay on-disk state (unlink just puts the inode onto the orphan list)
2) replay in-core state (including open locks)
... at some point ...
3) the MDS goes over the orphan list and destroys the selected objects
   (depending on VBR policy, etc.)

thanks, Alex
Nicolas Williams wrote:
> Another problem with using locks for open state is that establishing the
> lock atomically with an open w/ create won't be easy. The MDT would
> have to enqueue a lock for itself atomically with the create, then the
> client would have to enqueue its lock, then the MDT would have to drop
> its lock. Would this not be much more complex than open RPC
> reconstruction?

We already have open locks, which are taken atomically with the create.

thanks, Alex