I quickly reviewed the HLD and read Mike's response. Here are a few questions:

1. Why do you wait for timeout+x after seeing a gap? Why not x, timeout, or y?
2. How do you avoid infinite accumulation of new exports?
3. If a VBR recovery operation happens, what transaction number is assigned to it?
4. Please discuss what happens if multiple gaps are encountered.
5. Can we draw some pictures of the original transaction sequence and how its slots are refilled (in what order, with what new transaction number, etc.) if multiple clients are involved?

I believe that you might have the right algorithms, but the explanations in the HLD are too short to be confident.

- Peter
Thanks for the review. I put short answers below and will update the HLD with more details about the questions you asked.

On Tue, 10 Jun 2008 21:51:23 +0400, Peter Braam <Peter.Braam at Sun.COM> wrote:

> I quickly reviewed the HLD and read Mike's response. Here are a few
> questions:
>
> 1. Why do you wait for timeout+x after seeing a gap? Why not x, timeout,
> or y?

That sentence is wrong. The server waits for RECOVERY_TIMEOUT seconds since the last reconnect.

> 2. How do you avoid infinite accumulation of new exports?

New clients are not allowed to connect during recovery, and the number of existing exports is finite.

> 3. If a VBR recovery operation happens, what transaction number is
> assigned to it?

The same as in the original operation, i.e. the transno from the replay request. Since we introduce the per-export last_committed value (section 2.2.3 of the HLD), the transno may stay the same as the old one.

> 4. Please discuss what happens if multiple gaps are encountered.

When the first gap is encountered (a client misses recovery), the server starts using version checking for replays, and all clients that have not reconnected are marked as 'delayed'. The number of recoverable clients decreases, so check_for_next_transno will not stop on a gap, because the number of queued requests is equal to the number of clients in recovery. You are right, this use case is missing from the HLD.

> 5. Can we draw some pictures of the original transaction sequence and how
> its slots are refilled (in what order, with what new transaction number,
> etc.) if multiple clients are involved?

I will do that, sure.

> I believe that you might have the right algorithms, but the explanations
> in the HLD are too short to be confident.
>
> - Peter
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.
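The rule described in the answer to question 4 can be sketched in C as below. This is a minimal illustration, not the Lustre code: struct recovery_state, recovery_mark_delayed() and can_replay_next() are hypothetical names, and the sketch assumes each client has at most one replay request queued at a time, which is what makes "queued == recoverable" mean that no client still in recovery can fill the gap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the server-side recovery state. */
struct recovery_state {
    uint64_t next_transno; /* transno expected next in the server sequence */
    uint64_t head_transno; /* lowest transno currently in the replay queue */
    int      queued;       /* queued replays, assumed at most one per client */
    int      recoverable;  /* clients still taking part in ordered recovery */
    bool     vbr;          /* version checking enabled after the first gap */
};

/* A client that missed recovery is marked 'delayed': it no longer counts
 * as recoverable, and replays from now on are version-checked. */
void recovery_mark_delayed(struct recovery_state *rs)
{
    assert(rs->recoverable > 0);
    rs->recoverable--;
    rs->vbr = true;
}

/* Decide whether the request at the head of the queue may be replayed now. */
bool can_replay_next(const struct recovery_state *rs)
{
    if (rs->queued == 0)
        return false;
    /* Normal case: the head of the queue is exactly the next transno. */
    if (rs->head_transno == rs->next_transno)
        return true;
    /* Gap: every remaining recoverable client already has a request
     * queued, so nobody can supply the missing transno; do not stop. */
    return rs->queued == rs->recoverable;
}
```

For example, with three clients in recovery, two queued replays and a missing transno, can_replay_next() returns false; once the third client is marked delayed, the two queued requests match the two recoverable clients and replay continues past the gap, now under version checking.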
Mike -

This deserves some pretty serious thinking, and I should not be the only one to discuss it with you, because it is so complex.

OK - so the thing to be very clear about is whether, after encountering a gap, the recovery can ever switch back to non-VBR recovery.

Also, it isn't clear what happens if the servers saw a few gaps but then powered down and back up. It is possible that even when clients reconnect, they no longer know that there were gaps, yet the gaps can still affect sequence-number recovery.

If all clients and servers power off, you can restart normal recovery.

This raises the question whether there is a point in keeping sequence recovery if we have version recovery, because one missing client appears to kick you into VBR mode forever. If you want to retain it, you'd have to record the gaps and track how they get filled by VBR operations and eventually close.

Regards,

Peter

On 6/11/08 8:05 AM, "Mikhail Pershin" <Mikhail.Pershin at Sun.COM> wrote:

> Thanks for review. I put short answers below and will update HLD with more
> details about questions you asked.
> [...]
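One way to picture the bookkeeping Peter suggests (recording a gap and tracking how VBR replays fill it until it closes) is the C sketch below. It is an illustration under stated assumptions, not something from the HLD: struct transno_gap, gap_init(), gap_fill() and gap_closed() are hypothetical, a gap is limited to 64 transnos so that one word can serve as a bitmap, and it relies on Mike's answer that a VBR replay keeps its original transno. Surviving a server restart, which Peter also raises, would additionally require storing this record persistently; that part is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one gap in the server transno sequence. */
struct transno_gap {
    uint64_t first;  /* first missing transno */
    uint64_t len;    /* number of missing transnos, 1..64 */
    uint64_t filled; /* bit i set => transno first + i has been replayed */
};

bool gap_init(struct transno_gap *g, uint64_t first, uint64_t len)
{
    if (len == 0 || len > 64)
        return false;
    g->first = first;
    g->len = len;
    g->filled = 0;
    return true;
}

/* Record a VBR replay that reuses its original transno.
 * Returns true if the transno falls inside this gap. */
bool gap_fill(struct transno_gap *g, uint64_t transno)
{
    if (transno < g->first || transno - g->first >= g->len)
        return false;
    g->filled |= UINT64_C(1) << (transno - g->first);
    return true;
}

/* The gap is closed once every missing transno has been replayed. */
bool gap_closed(const struct transno_gap *g)
{
    assert(g->len >= 1 && g->len <= 64);
    uint64_t all = (g->len == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << g->len) - 1);
    return (g->filled & all) == all;
}
```

A bitmap rather than a simple counter is used so that a replay seen twice does not close the gap early.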
Hello, Peter

On Wed, 11 Jun 2008 18:24:15 +0400, Peter Braam <Peter.Braam at Sun.COM> wrote:

> Mike -
>
> This deserves some pretty serious thinking and I should not be the only
> one to discuss this with you, because it is so complex.

Is it worth adding the lustre-recovery@ mailing list to CC? This discussion is exactly about recovery issues.

> OK - so the thing to try to be very clear about is if after encountering
> a gap the recovery can ever switch back to non-VBR recovery.
>
> Also, it isn't clear what happens if the servers saw a few gaps, but
> power down and back up. It is possible that even when clients reconnect,
> they don't know anymore that there were gaps, yet it can affect sequence
> number recovery.

Please see the attached file with an investigation of this use case. I am going to add it to the HLD after discussion. There is a problem with switching back from VBR to ordinary recovery if recovery was interrupted and started again.

> If all clients and servers power off you can restart normal recovery.
>
> This raises the question if there is a point in keeping sequence
> recovery if we have version recovery,

Sequence recovery is a simple and proven approach, which is why it is better to use it together with VBR as an additional check. Using VBR alone leads to the following problems:

1) Replays may be done out of order, so version checking can fail even when all clients are present. E.g. client1 makes a change with version 1 and client2 makes one with version 2. If client2 joins recovery before client1, a version mismatch will occur. The obvious solution is to wait for RECOVERY_TIMEOUT if the version is less than needed, i.e. some other client will probably set it. This gives us a new problem:

2) Waiting for a needed version to arrive leads to multithreaded recovery. E.g. client1 waits for version N of object K while, at the same time, client2 needs version A of object H, so multiple replays can be waiting for the needed versions of different objects, and we have to handle that somehow. During sequence recovery we wait for the needed transaction in a per-server sequence, but with VBR multiple requests can wait for their needed versions, because version sequences are per object. This is worth discussing as a possible future approach to recovery, but sequence recovery is a simple way to start.

> because one missing client appears to kick you
> into VBR mode forever. If you want to retain it, you'd have to record
> the gaps and track how they are getting filled with VBR operations and
> may close.

The epochs (boot cycle counters) are used to track which clients should participate in recovery. A missed client affects recovery only once, in the recovery it missed. After that, its epoch in the last_rcvd client_data will be less than the server's last epoch, and it will not be included in the main recovery. If all clients with an epoch equal to the server's last epoch are connected, ordinary recovery can be used. See section 3.1 of the VBR HLD for details about epoch management.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: vbr_transactions.pdf
Type: application/pdf
Size: 60921 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-devel/attachments/20080617/46830568/attachment-0001.pdf
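Mikhail's out-of-order example and the epoch rule from the last reply can be sketched in C as follows. The names (vbr_check(), enum vbr_result, struct client_rec, can_use_ordinary_recovery()) are hypothetical, and so is the assumption that an object's version is the transno of its last change and that each replay carries the version the object had before the original operation. Treat this as an illustration of the rules described in the thread, not the Lustre implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of checking one replay against an object's version. */
enum vbr_result { VBR_APPLY, VBR_WAIT, VBR_MISMATCH };

/* A replay may run only if the object is still at the version it had
 * before the original operation (pre_version). */
enum vbr_result vbr_check(uint64_t obj_version, uint64_t pre_version)
{
    if (obj_version == pre_version)
        return VBR_APPLY;
    /* Object is older than expected: an earlier change by another client
     * has not been replayed yet, so waiting might resolve it. */
    if (obj_version < pre_version)
        return VBR_WAIT;
    return VBR_MISMATCH;
}

/* Hypothetical per-client record standing in for last_rcvd client_data. */
struct client_rec {
    uint32_t epoch;     /* boot epoch the client last took part in */
    bool     connected; /* reconnected during this recovery */
};

/* Ordinary (sequence) recovery is possible only if every client whose
 * epoch equals the server's last epoch has reconnected; clients with an
 * older epoch already missed a recovery and are not waited for. */
bool can_use_ordinary_recovery(const struct client_rec *c, size_t n,
                               uint32_t server_epoch)
{
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].epoch == server_epoch && !c[i].connected)
            return false;
    }
    return true;
}
```

With the example from the message, client1's change moves an object from version 0 to 1 and client2's moves it from 1 to 2. If client2 replays first, vbr_check(0, 1) returns VBR_WAIT: this is the case that forces either a RECOVERY_TIMEOUT wait or keeping sequence ordering.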