Hi all,

after the problem with our MDS I reported earlier this evening, I did indeed do a complete restart of the server, so afterwards the MDS was in recovery. There were 386 clients to be recovered. After 100 min, only one of these was left, but obviously it never came back. So I aborted recovery, and the file system seems to be back and running.

My question: what happens to the one client that was not recovered? Of course that entails the question of what happens during recovery anyhow, but in the first place I'm interested in the left-over client. There can be no real damage to the client or the jobs that were running on it; they are all dead anyhow, since Lustre was gone for such a long time. But maybe there are tell-tale signs on the client? According to Andreas and Robert, there is no way to identify the client from the MDS side. What are the effects on the client side? Maybe I have to remount Lustre on that machine?

Regards,
Thomas
On Thu, 2009-03-05 at 22:19 +0100, Thomas Roth wrote:
> My question: what happens to the one client that was not recovered?

It, and all of the clients that have transactions that need to be
replayed after the AWOL client's transactions, are evicted and their
transactions discarded.

> There can be no real damage to the client or the jobs that were
> running on it; they are all dead anyhow, since Lustre was gone for
> such a long time.

Clients configured for failover will wait indefinitely for a Lustre
server to return to service, so there is no concept of "such a long
time".

> What are the effects on the client side?
> Maybe I have to remount Lustre on that machine?

An evicted client will reconnect without need for unmounting etc.
However, since it was evicted, any applications that were processing
Lustre I/O will get an EIO.

b.
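[Since an eviction surfaces to the application only as a plain EIO from the read(2)/write(2) call, the application itself is the only place it can be handled. The following is a minimal C sketch of what such handling might look like; the file path, retry count and sleep interval are made-up illustrations rather than anything Lustre prescribes, and whether a retry succeeds depends on the client having reconnected in the meantime.]

-------------------------------------------
/* Sketch: how an application might see and handle the EIO that an
 * eviction produces.  File name, retry count and sleep interval are
 * arbitrary assumptions for illustration. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t write_with_retry(int fd, const void *buf, size_t len)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0)
            return n;                  /* write went through */
        if (errno != EIO)
            return -1;                 /* some other error: give up */
        /* EIO here most likely means the client was evicted.  The Lustre
         * client reconnects on its own, so a later retry may succeed. */
        fprintf(stderr, "write: EIO (possible eviction), retrying\n");
        sleep(5);
    }
    return -1;
}

int main(void)
{
    /* hypothetical path on a Lustre mount */
    int fd = open("/lustre/scratch/output.dat",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    static const char msg[] = "job output\n";
    if (write_with_retry(fd, msg, sizeof msg - 1) < 0)
        perror("write_with_retry");
    close(fd);
    return 0;
}
-------------------------------------------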
Thanks Brian.

Brian J. Murrell wrote:
> On Thu, 2009-03-05 at 22:19 +0100, Thomas Roth wrote:
>> My question: what happens to the one client that was not recovered?
>
> It, and all of the clients that have transactions that need to be
> replayed after the AWOL client's transactions, are evicted and their
> transactions discarded.
>
>> There can be no real damage to the client or the jobs that were
>> running on it; they are all dead anyhow, since Lustre was gone for
>> such a long time.
>
> Clients configured for failover will wait indefinitely for a Lustre
> server to return to service, so there is no concept of "such a long
> time".

What I meant: the average batch job that wants to read from or write to Lustre will abort if a file cannot be accessed. The reason doesn't matter to the jobs or the user. So the Lustre client may wait forever, but for the users that is irrelevant; they have to resubmit their jobs in any case.

I was wondering whether a client whose transactions have not been replayed may get into some zombie state. Of course I see in the logs of MDS and clients what is supposed to happen: the remaining stuff on the client is discarded, inodes deleted etc. In some cases this will not work, I'm sure. But then a reboot of the client will clean up.

Regards,
Thomas

>> What are the effects on the client side?
>> Maybe I have to remount Lustre on that machine?
>
> An evicted client will reconnect without need for unmounting etc.
> However, since it was evicted, any applications that were processing
> Lustre I/O will get an EIO.
>
> b.

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
> Thanks Brian.

NP.

> What I meant: the average batch job that wants to read from or write to
> Lustre will abort if a file cannot be accessed. The reason doesn't
> matter to the jobs or the user.

That may be so, but what I am saying is that when a lustre client wants to perform an i/o operation on behalf of an application running on that machine and the target it wants to do the i/o with is down, the lustre client will wait and block the application's i/o indefinitely.

That means that unless the application has some kind of timer in it so that it can abort the read(2)/write(2), it will wait forever, as the read(2) or write(2) system call that it issued will simply wait for the lustre client to complete -- forever, if the target that the lustre client wanted to do the i/o with never comes back.

> So the Lustre client may wait forever, but for the users that is
> irrelevant; they have to resubmit their jobs in any case.

But what signals them to resubmit? A job waiting on I/O to a missing target will just "hang" (the proper term is block) until the target comes back. Is there some kind of timer that aborts a job if it takes too long? If so, then that is pretty orthogonal to the discussion of what happens to a lustre client during (a failed) recovery.

> I was wondering whether a client whose transactions have not been
> replayed may get into some zombie state.

No. It should be evicted (that is why the transactions are not replayed) and will reconnect once recovery has been aborted and the target resumes its normal (FULL) state.

> Of course I see in the logs of
> MDS and clients what is supposed to happen: the remaining stuff on the
> client is discarded, inodes deleted etc. In some cases this will not
> work, I'm sure. But then a reboot of the client will clean up.

A reboot of the client should never be necessary to return it to the filesystem.

b.
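[The kind of application-level timer Brian mentions can be as simple as an alarm(2) that interrupts the blocking system call. Below is a minimal C sketch under some assumptions: the file path and the 30 s limit are arbitrary, and whether a signal actually interrupts an in-flight Lustre request can depend on the client version and state, so this is a best-effort pattern rather than a guarantee.]

-------------------------------------------
/* Sketch of an application-side timeout around a blocking read(2).
 * SIGALRM is installed without SA_RESTART so the alarm interrupts the
 * system call, which then fails with EINTR instead of blocking forever
 * on an unreachable target.  File name and timeout are arbitrary. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;                          /* nothing to do: just interrupt */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;           /* no SA_RESTART: read() gets EINTR */
    sigaction(SIGALRM, &sa, NULL);

    /* hypothetical path on a Lustre mount */
    int fd = open("/lustre/scratch/input.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    alarm(30);                          /* allow the read 30 seconds */
    ssize_t n = read(fd, buf, sizeof buf);
    alarm(0);                           /* cancel the timer */

    if (n < 0 && errno == EINTR)
        fprintf(stderr, "read timed out - target probably unreachable\n");
    else if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
-------------------------------------------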
Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
>> Thanks Brian.
>
> NP.
>
>> What I meant: the average batch job that wants to read from or write to
>> Lustre will abort if a file cannot be accessed. The reason doesn't
>> matter to the jobs or the user.
>
> That may be so, but what I am saying is that when a lustre client wants
> to perform an i/o operation on behalf of an application running on that
> machine and the target it wants to do the i/o with is down, the lustre
> client will wait and block the application's i/o indefinitely.
>
> That means that unless the application has some kind of timer in it so
> that it can abort the read(2)/write(2), it will wait forever, as the
> read(2) or write(2) system call that it issued will simply wait for the
> lustre client to complete -- forever, if the target that the lustre
> client wanted to do the i/o with never comes back.

But this is not what our users observe. Even on an otherwise perfectly working system, they report I/O errors on access to some files. This is distributed randomly over time, servers, clients and jobs, and the files are always there and undamaged when checked afterwards.

I can usually see something happening in the logs of OST and client: the OST starts with "timeout on bulk PUT after 6+0s", which it at first handles by "ignoring bulk IO comm error" in the hope that the "client will retry". But then the OST says "Request x49132126 took longer than estimated (7+5s); client may timeout." Of course I might be falsely connecting error messages that are independent of each other.

The client has a log that is easier to read: "Request ... has timed out (limit 7s)", "Connection to service was lost; in progress operations using this service will fail", and finally "Connection restored to service". I guess I should append the full lines to this mail. Anyhow, in these minutes the user's job reports "SysError: error reading from file..." - and aborts.

The network connection between client and OSS can be trusted to be extremely buggy, but interruptions should only be on the order of seconds. So if the client were just blocking the application's I/O, it should smoothly get over these? This restoration of the connection happens 1 sec after the first error message, so both sides can see each other again immediately.

-------------------------------------------
kern.log on the OSS:
-------------------------------------------
Mar  6 11:23:49 OSS100 kernel: [14066606.171264] LustreError: 15145:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 6+0s req@ffff81021fa8d938 x93767754/t0 o3->a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335029 ref 1 fl Interpret:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.615287] Lustre: 14771:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: 1011add7-13d1-0efc-65e3-14f94eab88c4 reconnecting
Mar  6 11:23:50 OSS100 kernel: [14066607.616000] Lustre: 15112:0:(ost_handler.c:1270:ost_brw_write()) gsilust-OST0027: ignoring bulk IO comm error with 6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID id 12345-CLIENT_A@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.616105] Lustre: 15112:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x98164002 took longer than estimated (7+4s); client may timeout. req@ffff8101557cbad0 x98164002/t0 o4->6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID:0/0 lens 384/352 e 0 to 0 dl 1236335026 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616492] LustreError: 21161:0:(service.c:568:ptlrpc_check_req()) @@@ DROPPING req from old connection 1 < 2 req@ffff81015fc066e0 x520348/t0 o13->gsilust-mdtlov_UUID@NET_0x200008cb57036_UUID:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616591] Lustre: 15145:0:(ost_handler.c:925:ost_brw_read()) gsilust-OST0027: ignoring bulk IO comm error with a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID id 12345-CLIENT_B@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.623479] Lustre: 15094:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x49132126 took longer than estimated (7+5s); client may timeout. req@ffff810152edd2b0 x49132126/t0 o3->fe639162-9537-9ab5-36f5-fba5c0a43171@NET_0x200008cb572f6_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335025 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:24:24 OSS100 kernel: [14066641.871733] Lustre: 14753:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3 reconnecting

-------------------------------------------
kern.log on CLIENT_A:
-------------------------------------------
Mar  6 11:24:24 CLIENT_A kernel: Lustre: Request x93767754 sent from gsilust-OST0027-osc-ffff810808e68c00 to NID OSS100@tcp 41s ago has timed out (limit 7s).
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 166-1: gsilust-OST0027-osc-ffff810808e68c00: Connection to service gsilust-OST0027 via nid OSS100@tcp was lost; in progress operations using this service will fail.
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 167-0: This client was evicted by gsilust-OST0027; in progress operations using this service will fail.
Mar  6 11:24:25 CLIENT_A kernel: Lustre: gsilust-OST0027-osc-ffff810808e68c00: Connection restored to service gsilust-OST0027 using nid OSS100@tcp.
--------------------------------------------------------------------------

Regards,
Thomas
On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
> But this is not what our users observe. Even on an otherwise perfectly
> working system, they report I/O errors on access to some files.

EIO == eviction.

> I can usually see something happening in the logs of OST and client:
> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
> handles by "ignoring bulk IO comm error" in the hope that the "client
> will retry".

Wait a minute. This thread is about server recovery, not communications failures. You are mixing up errors and situations here.

Communications failures will result in timeouts on the server, and that will result in evictions, which will result in EIOs for your applications. This has got nothing to do with server recovery though.

> "Request ... has timed out (limit 7s)", "Connection to service was
> lost; in progress operations using this service will fail", and finally
> "Connection restored to service".

Yes. This is a timeout and nothing to do with the subject of server recovery.

b.
Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
>> But this is not what our users observe. Even on an otherwise perfectly
>> working system, they report I/O errors on access to some files.
>
> EIO == eviction.
>
>> I can usually see something happening in the logs of OST and client:
>> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
>> handles by "ignoring bulk IO comm error" in the hope that the "client
>> will retry".
>
> Wait a minute. This thread is about server recovery, not communications
> failures. You are mixing up errors and situations here.
>
> Communications failures will result in timeouts on the server, and that
> will result in evictions, which will result in EIOs for your
> applications. This has got nothing to do with server recovery though.

You are right, of course; this comes from a different situation. I just assumed that if a client cannot cope with a 1 sec interruption due to a communication failure, resulting in an EIO, how could it (or rather the application) survive an interruption of the entire system of several hours?

Of course, if the client does react in a different manner during server recovery, then the application will also see things differently. I guess that's what I misunderstood. In fact the client's logs during yesterday's recovery don't look so bad at all ;-) Just a number of "Request xyz sent from MDT0000-mdc to NID MGS ... timed out", as expected.

Thanks for pointing this out,
Thomas

>> "Request ... has timed out (limit 7s)", "Connection to service was
>> lost; in progress operations using this service will fail", and finally
>> "Connection restored to service".
>
> Yes. This is a timeout and nothing to do with the subject of server
> recovery.
>
> b.
Thomas Roth wrote:
> Brian J. Murrell wrote:
>> On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
>>> But this is not what our users observe. Even on an otherwise perfectly
>>> working system, they report I/O errors on access to some files.
>>
>> EIO == eviction.

To be clear: while this is evidently true in this one case, there are many other reasons why clients could get EIO, including the server's path to the disk being down or returning errors.

>>> I can usually see something happening in the logs of OST and client:
>>> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
>>> handles by "ignoring bulk IO comm error" in the hope that the "client
>>> will retry".
>>
>> Wait a minute. This thread is about server recovery, not communications
>> failures. You are mixing up errors and situations here.
>>
>> Communications failures will result in timeouts on the server, and that
>> will result in evictions, which will result in EIOs for your
>> applications. This has got nothing to do with server recovery though.
>
> You are right, of course; this comes from a different situation. I just
> assumed that if a client cannot cope with a 1 sec interruption due to a
> communication failure, resulting in an EIO, how could it (or rather the
> application) survive an interruption of the entire system of several hours?
> Of course, if the client does react in a different manner during server
> recovery, then the application will also see things differently.
> I guess that's what I misunderstood. In fact the client's logs during
> yesterday's recovery don't look so bad at all ;-) Just a number of
> "Request xyz sent from MDT0000-mdc to NID MGS ... timed out", as expected.
> Thanks for pointing this out,
> Thomas

As Brian said, there is a difference between the _server_ detecting that a client is "down" and evicting it, and the _client_ continually attempting to reconnect and "finish" its IO (server failover). Failover does not "solve" network stability problems.

Basically, clients try forever, servers do not. With a successful recovery, the clients can retry any operations that have not been acknowledged by the server, but if a client is "down" (in reality or due to network issues), it is evicted so the servers are able to reclaim locks, etc., and keep the filesystem from hanging due to a client. Hanging due to a server being down is the feature that allows failover.

Kevin

>>> "Request ... has timed out (limit 7s)", "Connection to service was
>>> lost; in progress operations using this service will fail", and finally
>>> "Connection restored to service".
>>
>> Yes. This is a timeout and nothing to do with the subject of server
>> recovery.
>>
>> b.