We're working on adding replay RPC signatures, so that clients may only
replay RPCs that have been seen by the server (thus signed).

Currently clients recover open file state by replaying the open RPCs.
Because files can stay open forever, this means that replay RPC signatures
must either remain valid forever (keys never deleted) or be renewed. But if
we add a PTLRPC replay signature renewal feature then we'll be causing
MDSes to do redundant work (since FID capabilities used in opens will also
have to be renewed). Since MDSes are typically CPU-bound as it is, adding
yet another cryptographic burden to them seems undesirable.

Therefore a way to recover open state that does not depend on replaying
RPCs with valid replay signatures is appealing. I've been researching this
(and talking to Eric B. and Oleg about this). Several possible solutions
are evident. I'll describe the one that seems most elegant to me (and, I
think, Oleg), namely separating open state recovery from transaction
recovery.

Server-side high-level description:

 - during recovery the MDS will first process anonymous open by FID RPCs
   from new clients (these open RPCs will not have transaction IDs assigned
   to them as they imply no actual transactions)

 - then the MDS will accept replays from all clients, new and old

 - followed by lock recovery as usual

Client-side high-level description:

 - open processing will begin by sending an RPC as usual...,

 - ... but on commit the md_open_data will be added to a doubly-linked list
   of opens and the RPC will be removed from the PTLRPC replay queue

 - during recovery the client will begin by traversing the list of
   md_open_data (open state), reconstructing an anonymous open by FID RPC
   for each and sending it to the MDS (see the sketch at the end of this
   message); after that the client will replay outstanding transactions'
   RPCs, followed by lock recovery

Old clients would recover as usual.

Security is provided by the capabilities used in the anonymous open by FID
RPCs and by transport security.

The general principle then would be:

  RPC replaying is to be used only for recovering _transactions_ that
  should not be outstanding for very long.

Where "very long" is relative to the replay signature crypto key lifecycle,
which will be on the order of days.

Since opens are not transactions[*] and can stay "outstanding" forever,
opens would not be suitable for recovery by replay under that principle.
Open state is much more similar to DLM locks than to transactions.

Open recovery must precede uncommitted transaction recovery so as to ensure
that open state is re-established before unlinks can be replayed that would
cause the file to be destroyed.

There are, of course, other ways to achieve the desired effect, that is, to
avoid having to renew replay signatures.

Comments? Advice?

Nico

[*] Any filesystem object creation implied by an open, such as when O_CREAT
is used, would be a transaction, but the open aspect of it wouldn't be.
Think of an open that creates as a filesystem transaction and an open
happening atomically.
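To make the client-side bookkeeping above concrete, here is a minimal
sketch. This is not actual Lustre code: md_open_data is the existing client
structure, but the field, list and helper names below (mod_open_list,
client_open_list, drop_from_replay_queue(), mdc_open_by_fid() and so on)
are illustrative assumptions.

    /* Illustrative sketch only -- not the actual Lustre client code. */
    struct md_open_data {
            struct list_head       mod_open_list;   /* links committed opens  */
            struct lu_fid          mod_fid;         /* FID of the open file   */
            __u64                  mod_open_flags;  /* flags of original open */
            struct ptlrpc_request *mod_open_req;    /* kept only until commit */
    };

    static LIST_HEAD(client_open_list);  /* committed, still-open files */

    /* Called when the open's transaction commits on the MDS: stop keeping
     * the original RPC for replay and track the open as pure in-core state. */
    static void open_committed(struct md_open_data *mod)
    {
            /* assumed helper: drop the request from the PTLRPC replay queue */
            drop_from_replay_queue(mod->mod_open_req);
            list_add_tail(&mod->mod_open_list, &client_open_list);
    }

    /* Recovery, step 1: re-establish open state with new (non-replay) RPCs. */
    static int recover_open_state(struct obd_export *exp)
    {
            struct md_open_data *mod;
            int rc = 0;

            list_for_each_entry(mod, &client_open_list, mod_open_list) {
                    /* assumed helper: builds and sends a fresh anonymous
                     * open-by-FID request -- no transno, no replay bit,
                     * authorized by the FID capability */
                    rc = mdc_open_by_fid(exp, &mod->mod_fid,
                                         mod->mod_open_flags);
                    if (rc)
                            break;
            }
            /* steps 2 and 3 (uncommitted transaction replay, lock recovery)
             * would follow as they do today */
            return rc;
    }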
On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:

> We're working on adding replay RPC signatures, so that clients may only
> replay RPCs that have been seen by the server (thus signed).

Could you explain that more? All replays have been seen by the server just
by definition, because the client got a reply from the server, so what is
the purpose of such signing?

> I've been researching this (and talking to Eric B. and Oleg about this).
> Several possible solutions are evident. I'll describe the one that
> seems most elegant to me (and, I think, Oleg), namely separating open
> state recovery from transaction recovery.
>
> Server-side high-level description:
>
>  - during recovery the MDS will first process anonymous open by FID RPCs
>    from new clients (these open RPCs will not have transaction IDs
>    assigned to them as they imply no actual transactions)
>
>  - then the MDS will accept replays from all clients, new and old

It is not clear what 'new' and 'old' mean here. If both 'new' and 'old'
clients have requests to replay, then both were active during the previous
server boot, so what is the difference between them?

>  - followed by lock recovery as usual
>
> Client-side high-level description:
>
>  - open processing will begin by sending an RPC as usual...,
>
>  - ... but on commit the md_open_data will be added to a doubly-linked
>    list of opens and the RPC will be removed from the PTLRPC replay
>    queue
>
>  - during recovery the client will begin by traversing the list of
>    md_open_data (open state), reconstructing an anonymous open by FID
>    RPC for each and sending it to the MDS; after that the client will
>    replay outstanding transactions' RPCs, followed by lock recovery

Hmm, but currently it works exactly like this: the committed open replays
are sent first, followed by normal replays. So you propose to separate
them just because they are not 'pure' replays, as you described below?

> Old clients would recover as usual.
>
> Security is provided by the capabilities used in the anonymous open by
> FID RPCs and by transport security.
>
> The general principle then would be:
>
>   RPC replaying is to be used only for recovering _transactions_ that
>   should not be outstanding for very long.
>
> Where "very long" is relative to the replay signature crypto key
> lifecycle, which will be on the order of days.
>
> Since opens are not transactions[*] and can stay "outstanding" forever,
> opens would not be suitable for recovery by replay under that principle.
> Open state is much more similar to DLM locks than to transactions.
>
> Open recovery must precede uncommitted transaction recovery so as to
> ensure that open state is re-established before unlinks can be replayed
> that would cause the file to be destroyed.

That requires that the server not start replays from any client until
'open recovery' is finished for all of them. In fact there is another
solution for the open-unlink problem that was implemented in 1.8. During
recovery the unlink replay doesn't delete the file but makes it an orphan
even if the open count is 0. After recovery the orphans are cleaned up, so
an open replay after the unlink will find the orphan and open it.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > We're working on adding replay RPC signatures, so that clients may only
> > replay RPCs that have been seen by the server (thus signed).
>
> Could you explain that more? All replays have been seen by the server
> just by definition, because the client got a reply from the server, so
> what is the purpose of such signing?

They've been seen, indeed, but when replayed not all the same permissions
checks may be done, so the server needs to know that the replay is safe to
process. There are two ways to do that: never skip any permissions checks
when processing replayed RPCs, or have the server sign replayable RPCs so
that the server can validate any replays. I've not looked at a complete
list of checks that are skipped on replays -- perhaps we should have such
a list before we go down the replay signature path.

> > [...]
> > - then the MDS will accept replays from all clients, new and old
>
> It is not clear what 'new' and 'old' mean here. If both 'new' and 'old'
> clients have requests to replay, then both were active during the
> previous server boot, so what is the difference between them?

Old clients would be clients that don't implement this new form of open
state recovery (e.g., 1.6, 1.8 clients). New clients would be clients that
do implement this new form of open state recovery (2.x).

> > - followed by lock recovery as usual
> >
> > Client-side high-level description:
> >
> > [...]
>
> Hmm, but currently it works exactly like this: the committed open
> replays are sent first, followed by normal replays. So you propose to
> separate them just because they are not 'pure' replays, as you described
> below?

It doesn't work as I proposed: opens are currently recovered by
_replaying_ RPCs (which potentially had side-effects besides creating open
state). Or at least that's my understanding.

In my proposal open state recovery for opens associated with completed
transactions would always be done by generating new anonymous open by FID
RPCs (not replayed ones).

> > The general principle then would be:
> >
> >   RPC replaying is to be used only for recovering _transactions_ that
> >   should not be outstanding for very long.
> >
> > Where "very long" is relative to the replay signature crypto key
> > lifecycle, which will be on the order of days.
> >
> > Since opens are not transactions[*] and can stay "outstanding"
> > forever, opens would not be suitable for recovery by replay under that
> > principle. Open state is much more similar to DLM locks than to
> > transactions.
> >
> > Open recovery must precede uncommitted transaction recovery so as to
> > ensure that open state is re-established before unlinks can be
> > replayed that would cause the file to be destroyed.
>
> That requires that the server not start replays from any client until
> 'open recovery' is finished for all of them. In fact there is another

Correct.

> solution for the open-unlink problem that was implemented in 1.8. During
> recovery the unlink replay doesn't delete the file but makes it an
> orphan even if the open count is 0. After recovery the orphans are
> cleaned up, so an open replay after the unlink will find the orphan and
> open it.

That idea did cross my mind. The MDS would have to keep a list of such
unlinks so it can drop their open count if they truly aren't open. That
seems like extra work that the MDS shouldn't have to do.

Nico
--
On Fri, Jul 03, 2009 at 04:55:28PM -0500, Nicolas Williams wrote:
> On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> > On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> > <Nicolas.Williams at sun.com> wrote:
> > > We're working on adding replay RPC signatures, so that clients may
> > > only replay RPCs that have been seen by the server (thus signed).
> >
> > Could you explain that more? All replays have been seen by the server
> > just by definition, because the client got a reply from the server, so
> > what is the purpose of such signing?
>
> They've been seen, indeed, but when replayed not all the same
> permissions checks may be done, so the server needs to know that the
> replay is safe to process. There are two ways to do that: never skip any
> permissions checks when processing replayed RPCs, or have the server
> sign replayable RPCs so that the server can validate any replays. I've
> not looked at a complete list of checks that are skipped on replays --
> perhaps we should have such a list before we go down the replay
> signature path.

Oh, I forgot for a moment, but the other point of replay signatures is to
prevent clients from causing other clients to be evicted.
On Sat, 04 Jul 2009 01:55:28 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
> On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
>
> They've been seen, indeed, but when replayed not all the same
> permissions checks may be done, so the server needs to know that the
> replay is safe to process. There are two ways to do that: never skip any
> permissions checks when processing replayed RPCs, or have the server
> sign replayable RPCs so that the server can validate any replays. I've
> not looked at a complete list of checks that are skipped on replays --
> perhaps we should have such a list before we go down the replay
> signature path.

OK, so it is not about fake/malformed clients only; that is interesting.
Is there any preliminary arch/HLD document describing that? I am
interested in more background, if there is any.

> In my proposal open state recovery for opens associated with completed
> transactions would always be done by generating new anonymous open by
> FID RPCs (not replayed ones).

Well, I see no difference yet. Currently all open 'replays' are passed
straight to open_by_fid(), which opens the file and creates the mfd
structure for it, so it is the same on the server side at least. Did I
miss something?

> > > Open recovery must precede uncommitted transaction recovery so as to
> > > ensure that open state is re-established before unlinks can be
> > > replayed that would cause the file to be destroyed.
> >
> > That requires that the server not start replays from any client until
> > 'open recovery' is finished for all of them. In fact there is another
>
> Correct.

That is more regression than benefit: having such a 'barrier' during
recovery leads to longer recovery with an unbalanced server load. There
are a couple of improvements on the way already to make the recovery of
each client more independent from the others where possible, e.g. the
transaction-based recovery can be replaced with version-based recovery
only. So adding new barriers is not good in these terms.

> > solution for the open-unlink problem that was implemented in 1.8.
> > During recovery the unlink replay doesn't delete the file but makes it
> > an orphan even if the open count is 0. After recovery the orphans are
> > cleaned up, so an open replay after the unlink will find the orphan
> > and open it.
>
> That idea did cross my mind. The MDS would have to keep a list of such
> unlinks so it can drop their open count if they truly aren't open. That
> seems like extra work that the MDS shouldn't have to do.

There is already such a mechanism on the MDS to handle open-unlink cases.
The MDS keeps orphaned files while they are open and deletes all
non-reopened ones after recovery. We can just use this mechanism during
recovery, moving unlinked files to orphans. It works this way already in
1.8 and should be even simpler in 2.0 due to FIDs. There are only extra
checks, no need to keep an extra list or anything. I think this is the
preferable way to go because we avoid the 'barriers' in recovery mentioned
above.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Sat, 04 Jul 2009 04:48:51 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
> On Fri, Jul 03, 2009 at 04:55:28PM -0500, Nicolas Williams wrote:
> > On Fri, Jul 03, 2009 at 11:02:16PM +0400, Mikhail Pershin wrote:
> > > On Fri, 03 Jul 2009 02:39:45 +0400, Nicolas Williams
> > > <Nicolas.Williams at sun.com> wrote:
> > > > We're working on adding replay RPC signatures, so that clients may
> > > > only replay RPCs that have been seen by the server (thus signed).
> > >
> > > Could you explain that more? All replays have been seen by the
> > > server just by definition, because the client got a reply from the
> > > server, so what is the purpose of such signing?
> >
> > They've been seen, indeed, but when replayed not all the same
> > permissions checks may be done, so the server needs to know that the
> > replay is safe to process. There are two ways to do that: never skip
> > any permissions checks when processing replayed RPCs, or have the
> > server sign replayable RPCs so that the server can validate any
> > replays. I've not looked at a complete list of checks that are skipped
> > on replays -- perhaps we should have such a list before we go down the
> > replay signature path.
>
> Oh, I forgot for a moment, but the other point of replay signatures is
> to prevent clients from causing other clients to be evicted.

Ah, I am even more interested in some background about this idea. I
thought this was needed only as protection against malformed clients, but
it looks to be more functional than that.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
Note too that recovering opens by reconstruction and before outstanding
transactions commits us to always using capabilities. The reason is that
an open might not be permitted prior to replaying, say, a chmod, but with
capabilities the open would be permitted because of the capability issued
earlier, before the MDS restarted.

If we want an option to disable capabilities, then we should recover opens
in order at RPC replay time, but the opens should still not be replays
(unless O_CREAT is involved and that transaction hasn't committed).

Nico
--
On Sat, Jul 04, 2009 at 11:10:41AM +0400, Mikhail Pershin wrote:
> On Sat, 04 Jul 2009 01:55:28 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
>
> OK, so it is not about fake/malformed clients only; that is interesting.
> Is there any preliminary arch/HLD document describing that? I am
> interested in more background, if there is any.

See bug #18657.

> > In my proposal open state recovery for opens associated with completed
> > transactions would always be done by generating new anonymous open by
> > FID RPCs (not replayed ones).
>
> Well, I see no difference yet. Currently all open 'replays' are passed
> straight to open_by_fid(), which opens the file and creates the mfd
> structure for it, so it is the same on the server side at least. Did I
> miss something?

The difference is on the wire. Currently open state recovery replays RPCs.
This has a very specific meaning: the original RPC is sent again with a
bit set in the ptlrpc header to indicate that it is a replay. When the
transaction had already been committed this replay is processed on the
server side as an anonymous open by FID, but on the wire the open may have
been something other than an anon open by FID.

In my proposal what would happen is that opens would only be recovered by
_replay_ when the transaction had not yet been committed; otherwise the
opens will be recovered by making a _new_ (non-replay) open RPC (a rough
sketch follows at the end of this message).

> > > > Open recovery must precede uncommitted transaction recovery so as
> > > > to ensure that open state is re-established before unlinks can be
> > > > replayed that would cause the file to be destroyed.
> > >
> > > That requires that the server not start replays from any client
> > > until 'open recovery' is finished for all of them. In fact there is
> > > another
> >
> > Correct.
>
> That is more regression than benefit: having such a 'barrier' during
> recovery leads to longer recovery with an unbalanced server load. There
> are a couple of improvements on the way already to make the recovery of
> each client more independent from the others where possible, e.g. the
> transaction-based recovery can be replaced with version-based recovery
> only. So adding new barriers is not good in these terms.

I'm not sure why a new stage would necessarily slow recovery in a
significant way. The new stage would not involve any writes to disk
(though it would involve reads, reads which could then be cached and would
benefit the transaction recovery phase).

There is an alternative: recover opens during transaction recovery in
transaction order, but for committed opens (or opens that had no
filesystem transaction to commit, i.e., opens without O_CREAT) use new
RPCs instead of replay RPCs. The amount of work should be the same as with
the proposed solution, but with better cache locality of reference.

Also, recovering opens before transactions would bind us to always having
capabilities enabled (see my other post just now), whereas the above
alternative would not.

> > > solution for the open-unlink problem that was implemented in 1.8.
> > > During recovery the unlink replay doesn't delete the file but makes
> > > it an orphan even if the open count is 0. After recovery the orphans
> > > are cleaned up, so an open replay after the unlink will find the
> > > orphan and open it.
> >
> > That idea did cross my mind. The MDS would have to keep a list of such
> > unlinks so it can drop their open count if they truly aren't open.
> > That seems like extra work that the MDS shouldn't have to do.
>
> There is already such a mechanism on the MDS to handle open-unlink
> cases. The MDS keeps orphaned files while they are open and deletes all
> non-reopened ones after recovery. We can just use this mechanism during
> recovery, moving unlinked files to orphans. It works this way already in
> 1.8 and should be even simpler in 2.0 due to FIDs. There are only extra
> checks, no need to keep an extra list or anything. I think this is the
> preferable way to go because we avoid the 'barriers' in recovery
> mentioned above.

Suppose we recovered opens after transactions: we'd still have additional
costs for the last unlinks, since we'd have to put the object on an
on-disk queue of orphans until all open state is recovered. See above.

Nico
--
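To illustrate the on-the-wire distinction described above (replay bit vs.
fresh RPC), here is a rough client-side sketch. MSG_REPLAY,
lustre_msg_add_flags() and ptlrpc_queue_wait() are existing ptlrpc names;
the md_open_data fields and the two helpers are the same illustrative
assumptions as in the earlier sketch, not actual Lustre code.

    /* Sketch only: per-open decision during client recovery. */
    static int resend_one_open(struct obd_import *imp,
                               struct md_open_data *mod)
    {
            struct ptlrpc_request *req;

            if (!open_transno_committed(mod)) {       /* assumed helper */
                    /* Uncommitted open (e.g. O_CREAT whose create has not
                     * yet hit disk): resend the original RPC as a true
                     * replay, carrying its replay signature. */
                    req = mod->mod_open_req;
                    lustre_msg_add_flags(req->rq_reqmsg, MSG_REPLAY);
            } else {
                    /* Committed open: build a brand-new anonymous
                     * open-by-FID request from in-core state -- no
                     * MSG_REPLAY, no transno, authorized only by the FID
                     * capability it carries. */
                    req = build_anon_open_by_fid(imp, &mod->mod_fid,
                                                 mod->mod_open_flags);
            }
            return ptlrpc_queue_wait(req);            /* send and wait */
    }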
On Mon, Jul 06, 2009 at 12:20:09PM -0500, Nicolas Williams wrote:
> Note too that recovering opens by reconstruction and before outstanding
> transactions commits us to always using capabilities. The reason is that
> an open might not be permitted prior to replaying, say, a chmod, but
> with capabilities the open would be permitted because of the capability
> issued earlier, before the MDS restarted.
>
> If we want an option to disable capabilities, then we should recover
> opens in order at RPC replay time, but the opens should still not be
> replays (unless O_CREAT is involved and that transaction hasn't
> committed).

The above is all wrong, fortunately :)  Oleg explained it to me.
Transactions are committed linearly in time, so if a chmod that an open
depends on has not yet committed then that open's state must be recovered
by replay rather than by reconstruction.

Nico
--
On Mon, Jul 06, 2009 at 12:34:41PM -0500, Nicolas Williams wrote:
> On Sat, Jul 04, 2009 at 11:10:41AM +0400, Mikhail Pershin wrote:
> > That is more regression than benefit: having such a 'barrier' during
> > recovery leads to longer recovery with an unbalanced server load.
> > There are a couple of improvements on the way already to make the
> > recovery of each client more independent from the others where
> > possible, e.g. the transaction-based recovery can be replaced with
> > version-based recovery only. So adding new barriers is not good in
> > these terms.
>
> I'm not sure why a new stage would necessarily slow recovery in a
> significant way. The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> would benefit the transaction recovery phase).

Also, as Oleg explained to me, most open state is for files whose opens
committed long ago, so most open state is recovered before other
transactions. Which means we already have a separate open state recovery
phase -- it just isn't explicit. So the only thing that changes in my
proposal is that all committed open state will be recovered by anonymous
open by FID reconstruction instead of by replay, while all other
transactions (including as-yet uncommitted opens) will be recovered by
replay.

There would be no new timeouts, and there should be no other negative
impact on recovery time/performance. Recovery performance should actually
be improved, when replay signatures are enabled, since there would be no
need to verify replay signatures for most open state recovery.

Nico
--
Nicolas Williams wrote:
> Also, as Oleg explained to me, most open state is for files whose opens
> committed long ago, so most open state is recovered before other
> transactions. Which means we already have a separate open state
> recovery phase -- it just isn't explicit. So the only thing that
> changes in my proposal is that all committed open state will be
> recovered by anonymous open by FID reconstruction instead of by replay,
> while all other transactions (including as-yet uncommitted opens) will
> be recovered by replay.

I think it'd be slightly easier to introduce two notions of replay:

1) on-disk replay -- we try to recover some on-disk state from the
   client's cache: regular requests like mkdir, unlink, rename, setattr,
   etc.

2) in-core replay -- we try to recover some in-core state from the
   client's cache: ldlm locks, open files

The thing is that open(2) is quite interesting in this regard because it
does (1) *and* (2). I believe this is why we used (1) for (2).

My old thought was that instead of introducing a special new open-by-fid
RPC we should try to implement open in terms of LDLM locks, because it's
in-core state (though with specific tracking of unlinked files). Given
this we'd automatically get a single mechanism for all in-core state and
we'd get rid of the special paths for open replays.

thanks, Alex
On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
>
> In my proposal what would happen is that opens would only be recovered
> by _replay_ when the transaction had not yet been committed; otherwise
> the opens will be recovered by making a _new_ (non-replay) open RPC.
>

Yes, I understood that, and I agree that this looks like a cleaner
implementation, but I see the following problems so far:

 - two kinds of client, new and old, that must be handled somehow
 - the client code has to be changed a lot
 - the server needs to understand and handle this too

What will we get for this? Sorry for my persistence, but it looks to me
like it can be solved in simpler ways. E.g. you can add an MGS_OPEN_REPLAY
flag to such requests, so they will also be different on the wire from
transaction replays. Or we could re-use the lock replay functionality
somehow. The locks are not kept as saved RPCs either, but are enqueued as
new requests. Open is very close to this; I agree with the idea that the
open handle has all the needed info and there is no need to keep the
original RPC in this case.

I mean that the proposed solution looks overcomplicated just to solve the
signature problem, though it makes sense in general. If we are going to
reorganize open recovery and have time for this, it would be better to
move it out of the replay signature context into a separate task, as it is
quite complex.

> I'm not sure why a new stage would necessarily slow recovery in a
> significant way. The new stage would not involve any writes to disk
> (though it would involve reads, reads which could then be cached and
> would benefit the transaction recovery phase).

Not necessarily, but it can. It is not about the open stage only, it is
about the whole approach of doing recovery in stages, where all clients
must wait for every other client at each stage before they can continue
recovery. We already have this in HEAD and it extends the recovery window.
Lustre 1.8 had only a single timer for recovery; Lustre 2.0 has 3 stages
and the timer must be re-set after each one. If all clients are alive then
the recovery time will be mostly the same, but if clients go away during
recovery then the Lustre 2.0 recovery time can already be three times
longer. Just imagine that at each stage one client is gone: then at each
stage all clients will wait until the timer expires. And the bigger the
cluster, the more clients can be lost during recovery, so recovery time
may differ significantly. Also this means that the server load is not well
distributed over the recovery time: it waits, then starts processing all
requests at once, then waits again at the next stage, etc.

Another point here is the possible use of version-based recovery instead
of transaction-based recovery. This makes recovery based on object
versions, and then it makes no sense to wait for all clients at each
recovery stage, because all dependencies should be clear from the versions
and clients may finish recovery independently. Currently requests can
already be recovered by version, and there is work on lock replay using
versions too.

> Suppose we recovered opens after transactions: we'd still have
> additional costs for the last unlinks, since we'd have to put the object
> on an on-disk queue of orphans until all open state is recovered. See
> above.

There is no additional cost for an open-unlink pair, because the orphan is
needed after the unlink anyway. The only exception is the replay of a pure
unlink. But we need to keep orphans after unlinks for other cases anyway,
e.g. delayed recovery, and such overhead is nothing compared with the time
that can be lost waiting for everyone as described above.

In fact this is already slightly out of the scope of the original idea
about open replay organization. It is more related to server recovery
handling, version recovery, and delayed recovery, and can be discussed
later once the open replay changes on the client are settled; it will be
clearer at that time.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Jul 07, 2009 13:56 +0400, Alex Zhuravlev wrote:
> Nicolas Williams wrote:
> > Also, as Oleg explained to me, most open state is for files whose
> > opens committed long ago, so most open state is recovered before other
> > transactions. Which means we already have a separate open state
> > recovery phase -- it just isn't explicit. So the only thing that
> > changes in my proposal is that all committed open state will be
> > recovered by anonymous open by FID reconstruction instead of by
> > replay, while all other transactions (including as-yet uncommitted
> > opens) will be recovered by replay.
>
> I think it'd be slightly easier to introduce two notions of replay:
>
> 1) on-disk replay -- we try to recover some on-disk state from the
>    client's cache: regular requests like mkdir, unlink, rename, setattr,
>    etc.
>
> 2) in-core replay -- we try to recover some in-core state from the
>    client's cache: ldlm locks, open files
>
> The thing is that open(2) is quite interesting in this regard because it
> does (1) *and* (2). I believe this is why we used (1) for (2).
>
> My old thought was that instead of introducing a special new open-by-fid
> RPC we should try to implement open in terms of LDLM locks, because it's
> in-core state (though with specific tracking of unlinked files). Given
> this we'd automatically get a single mechanism for all in-core state and
> we'd get rid of the special paths for open replays.

One problem with this is that the ordering needs to be preserved. Opens
that have committed need to be replayed before any other replay
operations, because those replayed operations may depend on the file being
open. However, "normal" lock replay should happen after (or conceivably
during) operation replay so that the objects being locked actually exist
and the server can (hopefully soon) verify the lock version number during
recovery.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Jul 07, 2009 17:56 +0400, Mike Pershin wrote:
> On Mon, 06 Jul 2009 21:34:41 +0400, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > In my proposal what would happen is that opens would only be recovered
> > by _replay_ when the transaction had not yet been committed; otherwise
> > the opens will be recovered by making a _new_ (non-replay) open RPC.
>
> Yes, I understood that, and I agree that this looks like a cleaner
> implementation, but I see the following problems so far:
>  - two kinds of client, new and old, that must be handled somehow
>  - the client code has to be changed a lot
>  - the server needs to understand and handle this too
>
> What will we get for this? Sorry for my persistence, but it looks to me
> like it can be solved in simpler ways. E.g. you can add an
> MGS_OPEN_REPLAY flag to such requests, so they will also be different on
> the wire from transaction replays. Or we could re-use the lock replay
> functionality somehow. The locks are not kept as saved RPCs either, but
> are enqueued as new requests. Open is very close to this; I agree with
> the idea that the open handle has all the needed info and there is no
> need to keep the original RPC in this case.

There are actually multiple benefits from this change:

- we can remove the awkward handling of open RPCs that are saved even
  after they have been committed to disk. That code has had so many bugs
  in it (and probably still has some) that I will be happy when it is
  gone.

- we don't have RPCs saved for replay that cannot be flushed during a
  server upgrade. For the Simplified Interoperability feature we need to
  be able to clear all of the saved RPCs from memory so that it is
  possible to change the RPC format over an upgrade. Regenerating the
  _new_ RPCs from the open file handles allows this to happen.

> I mean that the proposed solution looks overcomplicated just to solve
> the signature problem, though it makes sense in general. If we are going
> to reorganize open recovery and have time for this, it would be better
> to move it out of the replay signature context into a separate task, as
> it is quite complex.

To my thinking, I don't know that we need to introduce a new RPC _type_
for the open; AFAIK the old open replay RPC will already do open-by-FID.
The core change here is that the open RPCs will be newly generated at
recovery time instead of being kept in memory.

This actually has a second benefit in that we don't have to keep huge
lists of open RPCs in the replay list that will be skipped each time we
are trying to cancel committed RPCs. For HPCS we need to handle 100k opens
on a single client, and cancelling RPCs from the replay list is an O(n^2)
operation since it does a list walk to find just-committed RPCs.

> > I'm not sure why a new stage would necessarily slow recovery in a
> > significant way. The new stage would not involve any writes to disk
> > (though it would involve reads, reads which could then be cached and
> > would benefit the transaction recovery phase).
>
> Not necessarily, but it can. It is not about the open stage only, it is
> about the whole approach of doing recovery in stages, where all clients
> must wait for every other client at each stage before they can continue
> recovery. We already have this in HEAD and it extends the recovery
> window. Lustre 1.8 had only a single timer for recovery; Lustre 2.0 has
> 3 stages and the timer must be re-set after each one.

Actually, the separate recovery stages in HEAD are no longer needed. The
addition of extra replay stages was a result of fixing a bug in recovery
where open file handles were not being replayed before another client
unlinked the file. However, this has to be fixed for VBR delayed recovery
anyways, so we may as well fix this with a single mechanism instead of
adding a separate recovery stage that requires waiting for all clients to
join or be evicted before any recovery can start.

[details for the above]

INITIAL ORDER
=============
client 1                    client 2                 MDS
--------                    --------                 ---
open A (transno N)
{use A}                                              ***commit >= N***
                            unlink A (transno X)
{continue to use A}
                                                     ***crash***

REPLAY ORDER
============
client 1                    client 2                 MDS
--------                    --------                 ---
{slow reconnect}                                     ***last committed < X***
                            unlink A (transno X)
open A (transno N) = -ENOENT
{A can no longer be used}

The proper solution, as also needed by delayed recovery, is to move A to
the PENDING list during replay and remove it at the end of replay. With
1.x we would have to also remove the inode from PENDING if some other node
reuses that inode number, but since this extra recovery stage is only
present in 2.0 and we will not implement delayed recovery for 1.x we can
simply remove all unreferenced inodes from PENDING at the end of recovery
(until delayed recovery is completed).

It would be possible to flag the unlink RPCs with a special flag (maybe
just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between unlinks that
also destroy the objects and unlinks that cause open-unlinked files. For
replayed unlinks that cause objects to be destroyed, we know that there
are no other clients holding the file open after that point and we don't
have to put the inode into PENDING at all (a rough sketch follows at the
end of this message).

> If all clients are alive then the recovery time will be mostly the same,
> but if clients go away during recovery then the Lustre 2.0 recovery time
> can already be three times longer. Just imagine that at each stage one
> client is gone: then at each stage all clients will wait until the timer
> expires. And the bigger the cluster, the more clients can be lost during
> recovery, so recovery time may differ significantly. Also this means
> that the server load is not well distributed over the recovery time: it
> waits, then starts processing all requests at once, then waits again at
> the next stage, etc.
>
> Another point here is the possible use of version-based recovery instead
> of transaction-based recovery. This makes recovery based on object
> versions, and then it makes no sense to wait for all clients at each
> recovery stage, because all dependencies should be clear from the
> versions and clients may finish recovery independently. Currently
> requests can already be recovered by version, and there is work on lock
> replay using versions too.

I fully agree - it would be ideal if recovery started immediately without
any waiting for other clients.

> > Suppose we recovered opens after transactions: we'd still have
> > additional costs for the last unlinks, since we'd have to put the
> > object on an on-disk queue of orphans until all open state is
> > recovered. See above.
>
> There is no additional cost for an open-unlink pair, because the orphan
> is needed after the unlink anyway. The only exception is the replay of a
> pure unlink. But we need to keep orphans after unlinks for other cases
> anyway, e.g. delayed recovery, and such overhead is nothing compared
> with the time that can be lost waiting for everyone as described above.

Agreed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
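A rough server-side sketch of the unlink-replay handling described above.
OBD_MD_FLEASIZE, OBD_MD_FLCOOKIE and the PENDING directory are existing
names; the handler shape and the helpers (unlink_last_name(),
destroy_inode_and_objects(), move_inode_to_pending()) are illustrative
assumptions only, not the actual MDS code.

    /* Sketch only: handling a *replayed* unlink that removes the last link.
     * 'ctx' is a placeholder for whatever per-request state the handler
     * already has; 'reply_valid' is the valid mask from the saved reply. */
    static int replay_unlink(struct unlink_replay_ctx *ctx, __u64 reply_valid)
    {
            int rc = unlink_last_name(ctx);              /* assumed helper */
            if (rc)
                    return rc;

            if (reply_valid & (OBD_MD_FLEASIZE | OBD_MD_FLCOOKIE)) {
                    /* The original reply carried the EA/llog cookies, i.e.
                     * the objects really were destroyed the first time:
                     * no client can still hold the file open, so destroy
                     * it again right away. */
                    rc = destroy_inode_and_objects(ctx); /* assumed helper */
            } else {
                    /* The file may be open-unlinked: keep the inode as an
                     * orphan (PENDING) for the rest of recovery so a later
                     * open replay / open-by-FID can still find it. */
                    rc = move_inode_to_pending(ctx);     /* assumed helper */
            }
            return rc;
    }

    /* At the end of recovery, any PENDING inode that was not re-opened is
     * removed, as orphan cleanup after recovery already does today. */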
On Tue, Jul 07, 2009 at 01:56:52PM +0400, Alex Zhuravlev wrote:
> I think it'd be slightly easier to introduce two notions of replay:

As I understand it, 'replay' has a very specific meaning: re-send an RPC
with the 'replay' bit set in the ptlrpc header.

> [...]
> My old thought was that instead of introducing a special new open-by-fid
> RPC we should try to implement open in terms of LDLM locks, because it's
> in-core state (though with specific tracking of unlinked files). Given
> this we'd automatically get a single mechanism for all in-core state and
> we'd get rid of the special paths for open replays.

Hmmm, but open by FID gives the MDS a chance to check capabilities. Yes,
that's probably not terribly important as long as the OSSes also check
capabilities.

Also, there's the unlink issue to worry about. Mikhail's proposal for that
is to defer unlinks until after open state recovery (in this case: until
after DLM recovery). That would work, I think. Also, you could have the
kind of DLM locks used for open state tracking recovered first, then
transactions, then all other types of locks.

Here's a question: what consumes more memory on the MDS: open state or a
DLM lock?

Nico
--
On Tue, Jul 07, 2009 at 05:56:36PM +0400, Mikhail Pershin wrote:
> What will we get for this? Sorry for my persistence, but it looks to me
> like it can be solved in simpler ways. E.g. you can add an
> MGS_OPEN_REPLAY flag to such requests, so they will also be different on
> the wire from transaction replays. Or we could re-use the lock replay
> functionality somehow.

Making the open replays look different on the wire is exactly what this is
about. They'll look different from other replays in that they will not
have a replay signature. But replay signatures are a PTLRPC layer feature,
so how should PTLRPC know whether to allow such a replay to pass through?
One way is to let it pass through replays with valid signatures and
non-replays, and then let the MDT have non-replay handlers only for anon
open by FID during recovery (a rough sketch follows at the end of this
message). Then the client might as well not bother caching open RPCs
forever, just until they commit -- it can reconstruct open RPCs from
in-core state (vnode, ...) anytime it needs to.

Using DLM locks to represent open state is interesting. It would require
either recovering those first or deferring final unlinks at transaction
recovery time.

Another problem with using locks for open state is that establishing the
lock atomically with an open w/ create won't be easy. The MDT would have
to enqueue a lock for itself atomically with the create, then the client
would have to enqueue its lock, then the MDT would have to drop its lock.
Would this not be much more complex than open RPC reconstruction?

> The locks are not kept as saved RPCs either, but are enqueued as new
> requests. Open is very close to this; I agree with the idea that the
> open handle has all the needed info and there is no need to keep the
> original RPC in this case.

Yes.

Nico
--
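As a minimal sketch of that gating, assuming a per-request admission hook
during the recovery window: MSG_REPLAY and lustre_msg_get_flags() are
existing ptlrpc names, while the two check helpers are placeholders, not
real Lustre functions.

    /* Sketch only: deciding what the target accepts while in recovery. */
    static int recovery_admit_request(struct ptlrpc_request *req)
    {
            if (lustre_msg_get_flags(req->rq_reqmsg) & MSG_REPLAY)
                    /* True replays must carry a valid replay signature. */
                    return verify_replay_signature(req);   /* placeholder */

            /* Non-replay traffic during recovery: only the anonymous
             * open-by-FID used for open state recovery is let through; it
             * is authorized by its FID capability, not by a signature. */
            if (is_anon_open_by_fid(req))                  /* placeholder */
                    return 0;

            return -EBUSY;  /* everything else waits until recovery ends */
    }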
On Tue, 07 Jul 2009 19:21:05 +0400, Andreas Dilger <adilger at sun.com> wrote:
> This actually has a second benefit in that we don't have to keep huge
> lists of open RPCs in the replay list that will be skipped each time we
> are trying to cancel committed RPCs. For HPCS we need to handle 100k
> opens on a single client, and cancelling RPCs from the replay list is
> an O(n^2) operation since it does a list walk to find just-committed
> RPCs.

Absolutely, all the benefits are clear and I fully agree, but none of them
are in the replay signature context. I was just afraid that making such
big changes inside the replay signature task would delay the replay
signature work itself. But if we have time to do it the right way then it
is good.

> Actually, the separate recovery stages in HEAD are no longer needed. The
> addition of extra replay stages was a result of fixing a bug in recovery
> where open file handles were not being replayed before another client
> unlinked the file. However, this has to be fixed for VBR delayed
> recovery anyways, so we may as well fix this with a single mechanism
> instead of adding a separate recovery stage that requires waiting for
> all clients to join or be evicted before any recovery can start.
>
> The proper solution, as also needed by delayed recovery, is to move A
> to the PENDING list during replay and remove it at the end of replay.
> With 1.x we would have to also remove the inode from PENDING if some
> other node reuses that inode number, but since this extra recovery
> stage is only present in 2.0 and we will not implement delayed recovery
> for 1.x we can simply remove all unreferenced inodes from PENDING at
> the end of recovery (until delayed recovery is completed).

Exactly, that is what I meant and that is why I don't like another strict
'stage'.

> It would be possible to flag the unlink RPCs with a special flag (maybe
> just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between unlinks
> that also destroy the objects and unlinks that cause open-unlinked
> files. For replayed unlinks that cause objects to be destroyed, we know
> that there are no other clients holding the file open after that point
> and we don't have to put the inode into PENDING at all.

I've just thought about the same thing; it is quite an obvious solution
here.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
On Tue, Jul 07, 2009 at 08:42:52PM +0400, Mikhail Pershin wrote:
> On Tue, 07 Jul 2009 19:21:05 +0400, Andreas Dilger <adilger at sun.com> wrote:
> > This actually has a second benefit in that we don't have to keep huge
> > lists of open RPCs in the replay list that will be skipped each time
> > we are trying to cancel committed RPCs. For HPCS we need to handle
> > 100k opens on a single client, and cancelling RPCs from the replay
> > list is an O(n^2) operation since it does a list walk to find
> > just-committed RPCs.

This seems like a problem that could be solved anyways, but then, having
to cache these RPCs forever is a waste of resources.

> Absolutely, all the benefits are clear and I fully agree, but none of
> them are in the replay signature context. I was just afraid that making
> such big changes inside the replay signature task would delay the replay
> signature work itself. But if we have time to do it the right way then
> it is good.

Adding replay signature renewal just to avoid this restructuring seems
equally bad. Not adding replay signature renewal and not bothering with
rekeying is OK in the short term, but eventually it'd become a problem.
Given all the other benefits of doing committed open state recovery by
reconstruction, it seems like a good idea to just do it.

> > It would be possible to flag the unlink RPCs with a special flag
> > (maybe just OBD_MD_FLEASIZE/OBD_MD_FLCOOKIE) to distinguish between
> > unlinks that also destroy the objects and unlinks that cause
> > open-unlinked files. For replayed unlinks that cause objects to be
> > destroyed, we know that there are no other clients holding the file
> > open after that point and we don't have to put the inode into PENDING
> > at all.
>
> I've just thought about the same thing; it is quite an obvious solution
> here.

An excellent idea. Replay signatures would have to cover that bit. I'll
add that to the HLD.

Nico
--
Andreas Dilger wrote:
> > my old thought was that instead of introducing a special new
> > open-by-fid RPC we should try to implement open in terms of LDLM
> > locks, because it's in-core state (though with specific tracking of
> > unlinked files). Given this we'd automatically get a single mechanism
> > for all in-core state and we'd get rid of the special paths for open
> > replays.
>
> One problem with this is that the ordering needs to be preserved. Opens
> that have committed need to be replayed before any other replay
> operations, because those replayed operations may depend on the file
> being open. However, "normal" lock replay should happen after (or
> conceivably during) operation replay so that the objects being locked
> actually exist and the server can (hopefully soon) verify the lock
> version number during recovery.

Well, that ordering is already "dead" due to VBR? I think the semantics of
unlink is just to unlink the name; everything else is up to the MDS (when
to destroy the inode and objects). Also notice that inode destroy is a
different transaction in general (due to a possibly multi-transaction
truncate).

If we decouple unlink and object destroy, then the following sequence
should work:

1) replay on-disk state (unlink just puts the inode onto the orphan list)
2) replay in-core state (including open locks)
... at some point ...
3) the MDS goes over the orphan list and destroys the selected objects
   (depending on VBR policy, etc.)

thanks, Alex
Nicolas Williams wrote:
> Another problem with using locks for open state is that establishing the
> lock atomically with an open w/ create won't be easy. The MDT would
> have to enqueue a lock for itself atomically with the create, then the
> client would have to enqueue its lock, then the MDT would have to drop
> its lock. Would this not be much more complex than open RPC
> reconstruction?

We already have open locks, which are taken atomically with the create.

thanks, Alex