Hi all,

after the problem with our MDS I reported earlier this evening, I did indeed do a complete restart of the server, so afterwards the MDS was in recovery. There were 386 clients to be recovered. After 100 min, only one of these was left, but obviously it never came back. So I aborted recovery, and the file system seems to be back and running.

My question: what happens to the one client that was not recovered? Of course that entails the question of what happens during recovery anyhow, but in the first place I'm interested in the left-over client. There can be no real damage to the client or the jobs that were running on it; they are all dead anyhow, since Lustre was gone for such a long time. But maybe there are tell-tale signs on the client? According to Andreas and Robert, there is no way to identify the client from the MDS side. What are the effects on the client side? Maybe I have to remount Lustre on that machine?

Regards,
Thomas
On Thu, 2009-03-05 at 22:19 +0100, Thomas Roth wrote:
> My question: what happens to the one client that was not recovered?

It, and all of the clients that have transactions that need to be
replayed after the AWOL client's transactions, are evicted and their
transactions discarded.

> There can be no real damage to the client or the jobs that were
> running on it; they are all dead anyhow, since Lustre was gone for
> such a long time.

Clients configured for failover will wait indefinitely for a Lustre
server to return to service, so there is no concept of "such a long
time".

> What are the effects on the client side?
> Maybe I have to remount Lustre on that machine?

An evicted client will reconnect without need for unmounting etc.
However, since it was evicted, any applications that were processing
Lustre I/O will get an EIO.

b.
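[Since an eviction surfaces to the application only as a plain EIO from the read(2)/write(2) call, the application itself is the only place it can be handled. The following is a minimal C sketch of what such handling might look like; the file path, retry count and sleep interval are made-up illustrations rather than anything Lustre prescribes, and whether a retry succeeds depends on the client having reconnected in the meantime.]

-------------------------------------------
/* Sketch: how an application might see and handle the EIO that an
 * eviction produces.  File name, retry count and sleep interval are
 * arbitrary assumptions for illustration. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t write_with_retry(int fd, const void *buf, size_t len)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0)
            return n;                  /* write went through */
        if (errno != EIO)
            return -1;                 /* some other error: give up */
        /* EIO here most likely means the client was evicted.  The Lustre
         * client reconnects on its own, so a later retry may succeed. */
        fprintf(stderr, "write: EIO (possible eviction), retrying\n");
        sleep(5);
    }
    return -1;
}

int main(void)
{
    /* hypothetical path on a Lustre mount */
    int fd = open("/lustre/scratch/output.dat",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    static const char msg[] = "job output\n";
    if (write_with_retry(fd, msg, sizeof msg - 1) < 0)
        perror("write_with_retry");
    close(fd);
    return 0;
}
-------------------------------------------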
Thanks Brian.

Brian J. Murrell wrote:
> On Thu, 2009-03-05 at 22:19 +0100, Thomas Roth wrote:
>> My question: what happens to the one client that was not recovered?
>
> It, and all of the clients that have transactions that need to be
> replayed after the AWOL client's transactions, are evicted and their
> transactions discarded.
>
>> There can be no real damage to the client or the jobs that were
>> running on it; they are all dead anyhow, since Lustre was gone for
>> such a long time.
>
> Clients configured for failover will wait indefinitely for a Lustre
> server to return to service, so there is no concept of "such a long
> time".

What I meant: the average batch job that wants to read from or write to Lustre will abort if a file cannot be accessed. The reason doesn't matter to the jobs or the user. So the Lustre client may wait forever, but for the users that is irrelevant; they have to resubmit their jobs in any case.

I was wondering whether a client whose transactions have not been replayed may get into some zombie state. Of course I see in the logs of MDS and clients what is supposed to happen: the remaining stuff on the client is discarded, inodes deleted etc. In some cases this will not work, I'm sure. But then a reboot of the client will clean up.

Regards,
Thomas

>> What are the effects on the client side?
>> Maybe I have to remount Lustre on that machine?
>
> An evicted client will reconnect without need for unmounting etc.
> However, since it was evicted, any applications that were processing
> Lustre I/O will get an EIO.
>
> b.

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
> Thanks Brian.

NP.

> What I meant: the average batch job that wants to read from or write to
> Lustre will abort if a file cannot be accessed. The reason doesn't
> matter to the jobs or the user.

That may be so, but what I am saying is that when a lustre client wants to perform an i/o operation on behalf of an application running on that machine and the target it wants to do the i/o with is down, the lustre client will wait and block the application's i/o indefinitely.

That means that unless the application has some kind of timer in it so that it can abort the read(2)/write(2), it will wait forever, as the read(2) or write(2) system call that it issued will simply wait for the lustre client to complete -- forever, if the target that the lustre client wanted to do the i/o with never comes back.

> So the Lustre client may wait forever, but for the users that is
> irrelevant; they have to resubmit their jobs in any case.

But what signals them to resubmit? A job waiting on I/O to a missing target will just "hang" (the proper term is block) until the target comes back. Is there some kind of timer that aborts a job if it takes too long? If so, then that is pretty orthogonal to the discussion of what happens to a lustre client during (a failed) recovery.

> I was wondering whether a client whose transactions have not been
> replayed may get into some zombie state.

No. It should be evicted (that is why the transactions are not replayed) and will reconnect once recovery has been aborted and the target resumes its normal (FULL) state.

> Of course I see in the logs of
> MDS and clients what is supposed to happen: the remaining stuff on the
> client is discarded, inodes deleted etc. In some cases this will not
> work, I'm sure. But then a reboot of the client will clean up.

A reboot of the client should never be necessary to return it to the filesystem.

b.
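[The kind of application-level timer Brian mentions can be as simple as an alarm(2) that interrupts the blocking system call. Below is a minimal C sketch under some assumptions: the file path and the 30 s limit are arbitrary, and whether a signal actually interrupts an in-flight Lustre request can depend on the client version and state, so this is a best-effort pattern rather than a guarantee.]

-------------------------------------------
/* Sketch of an application-side timeout around a blocking read(2).
 * SIGALRM is installed without SA_RESTART so the alarm interrupts the
 * system call, which then fails with EINTR instead of blocking forever
 * on an unreachable target.  File name and timeout are arbitrary. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;                          /* nothing to do: just interrupt */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;           /* no SA_RESTART: read() gets EINTR */
    sigaction(SIGALRM, &sa, NULL);

    /* hypothetical path on a Lustre mount */
    int fd = open("/lustre/scratch/input.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    alarm(30);                          /* allow the read 30 seconds */
    ssize_t n = read(fd, buf, sizeof buf);
    alarm(0);                           /* cancel the timer */

    if (n < 0 && errno == EINTR)
        fprintf(stderr, "read timed out - target probably unreachable\n");
    else if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
-------------------------------------------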
Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
>> Thanks Brian.
>
> NP.
>
>> What I meant: the average batch job that wants to read from or write to
>> Lustre will abort if a file cannot be accessed. The reason doesn't
>> matter to the jobs or the user.
>
> That may be so, but what I am saying is that when a lustre client wants
> to perform an i/o operation on behalf of an application running on that
> machine and the target it wants to do the i/o with is down, the lustre
> client will wait and block the application's i/o indefinitely.
>
> That means that unless the application has some kind of timer in it so
> that it can abort the read(2)/write(2), it will wait forever, as the
> read(2) or write(2) system call that it issued will simply wait for the
> lustre client to complete -- forever, if the target that the lustre
> client wanted to do the i/o with never comes back.

But this is not what our users observe. Even on an otherwise perfectly working system, they report I/O errors on access to some files. This is distributed randomly over time, servers, clients and jobs, and the files are always there and undamaged when checked afterwards.

I can usually see something happening in the logs of OST and client: the OST starts with "timeout on bulk PUT after 6+0s", which it at first handles by "ignoring bulk IO comm error" in the hope that the "client will retry". But then the OST says "Request x49132126 took longer than estimated (7+5s); client may timeout." Of course I might be falsely connecting error messages that are independent of each other.

The client has a log that is easier to read: "Request ... has timed out (limit 7s)", "Connection to service was lost; in progress operations using this service will fail", and finally "Connection restored to service". I guess I should append the full lines to this mail. Anyhow, in these minutes the user's job reports "SysError: error reading from file..." - and aborts.

The network connection between client and OSS can be trusted to be extremely buggy, but interruptions should only be on the order of seconds. So if the client were just blocking the application's I/O, it should smoothly get over these? This restoration of the connection happens 1 sec after the first error message, so both sides can see each other again immediately.

-------------------------------------------
kern.log on the OSS:
-------------------------------------------
Mar  6 11:23:49 OSS100 kernel: [14066606.171264] LustreError: 15145:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 6+0s req@ffff81021fa8d938 x93767754/t0 o3->a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335029 ref 1 fl Interpret:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.615287] Lustre: 14771:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: 1011add7-13d1-0efc-65e3-14f94eab88c4 reconnecting
Mar  6 11:23:50 OSS100 kernel: [14066607.616000] Lustre: 15112:0:(ost_handler.c:1270:ost_brw_write()) gsilust-OST0027: ignoring bulk IO comm error with 6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID id 12345-CLIENT_A@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.616105] Lustre: 15112:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x98164002 took longer than estimated (7+4s); client may timeout. req@ffff8101557cbad0 x98164002/t0 o4->6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID:0/0 lens 384/352 e 0 to 0 dl 1236335026 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616492] LustreError: 21161:0:(service.c:568:ptlrpc_check_req()) @@@ DROPPING req from old connection 1 < 2 req@ffff81015fc066e0 x520348/t0 o13->gsilust-mdtlov_UUID@NET_0x200008cb57036_UUID:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616591] Lustre: 15145:0:(ost_handler.c:925:ost_brw_read()) gsilust-OST0027: ignoring bulk IO comm error with a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID id 12345-CLIENT_B@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.623479] Lustre: 15094:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x49132126 took longer than estimated (7+5s); client may timeout. req@ffff810152edd2b0 x49132126/t0 o3->fe639162-9537-9ab5-36f5-fba5c0a43171@NET_0x200008cb572f6_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335025 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:24:24 OSS100 kernel: [14066641.871733] Lustre: 14753:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3 reconnecting

-------------------------------------------
kern.log on CLIENT_A:
-------------------------------------------
Mar  6 11:24:24 CLIENT_A kernel: Lustre: Request x93767754 sent from gsilust-OST0027-osc-ffff810808e68c00 to NID OSS100@tcp 41s ago has timed out (limit 7s).
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 166-1: gsilust-OST0027-osc-ffff810808e68c00: Connection to service gsilust-OST0027 via nid OSS100@tcp was lost; in progress operations using this service will fail.
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 167-0: This client was evicted by gsilust-OST0027; in progress operations using this service will fail.
Mar  6 11:24:25 CLIENT_A kernel: Lustre: gsilust-OST0027-osc-ffff810808e68c00: Connection restored to service gsilust-OST0027 using nid OSS100@tcp.
--------------------------------------------------------------------------

Regards,
Thomas
On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
> But this is not what our users observe. Even on an otherwise perfectly
> working system, they report I/O errors on access to some files.

EIO == eviction.

> I can usually see something happening in the logs of OST and client:
> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
> handles by "ignoring bulk IO comm error" in the hope that the "client
> will retry".

Wait a minute. This thread is about server recovery, not communications failures. You are mixing up errors and situations here.

Communications failures will result in timeouts on the server, and that will result in evictions, which will result in EIOs for your applications. This has got nothing to do with server recovery though.

> "Request ... has timed out (limit 7s)", "Connection to service was
> lost; in progress operations using this service will fail", and finally
> "Connection restored to service".

Yes. This is a timeout and nothing to do with the subject of server recovery.

b.
Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
>> But this is not what our users observe. Even on an otherwise perfectly
>> working system, they report I/O errors on access to some files.
>
> EIO == eviction.
>
>> I can usually see something happening in the logs of OST and client:
>> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
>> handles by "ignoring bulk IO comm error" in the hope that the "client
>> will retry".
>
> Wait a minute. This thread is about server recovery, not communications
> failures. You are mixing up errors and situations here.
>
> Communications failures will result in timeouts on the server, and that
> will result in evictions, which will result in EIOs for your
> applications. This has got nothing to do with server recovery though.

You are right, of course; this comes from a different situation. I just assumed that if a client cannot cope with a 1 sec interruption due to a communication failure, resulting in an EIO, how could it (or rather the application) survive an interruption of the entire system of several hours?

Of course, if the client does react in a different manner during server recovery, then the application will also see things differently. I guess that's what I misunderstood. In fact the client's logs during yesterday's recovery don't look so bad at all ;-) Just a number of "Request xyz sent from MDT0000-mdc to NID MGS ... timed out", as expected.

Thanks for pointing this out,
Thomas

>> "Request ... has timed out (limit 7s)", "Connection to service was
>> lost; in progress operations using this service will fail", and finally
>> "Connection restored to service".
>
> Yes. This is a timeout and nothing to do with the subject of server
> recovery.
>
> b.
Thomas Roth wrote:
> Brian J. Murrell wrote:
>> On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
>>> But this is not what our users observe. Even on an otherwise perfectly
>>> working system, they report I/O errors on access to some files.
>>
>> EIO == eviction.

To be clear: while this is evidently true in this one case, there are many other reasons why clients could get EIO, including the server's path to the disk being down or returning errors.

>>> I can usually see something happening in the logs of OST and client:
>>> the OST starts with "timeout on bulk PUT after 6+0s", which it at first
>>> handles by "ignoring bulk IO comm error" in the hope that the "client
>>> will retry".
>>
>> Wait a minute. This thread is about server recovery, not communications
>> failures. You are mixing up errors and situations here.
>>
>> Communications failures will result in timeouts on the server, and that
>> will result in evictions, which will result in EIOs for your
>> applications. This has got nothing to do with server recovery though.
>
> You are right, of course; this comes from a different situation. I just
> assumed that if a client cannot cope with a 1 sec interruption due to a
> communication failure, resulting in an EIO, how could it (or rather the
> application) survive an interruption of the entire system of several hours?
> Of course, if the client does react in a different manner during server
> recovery, then the application will also see things differently.
> I guess that's what I misunderstood. In fact the client's logs during
> yesterday's recovery don't look so bad at all ;-) Just a number of
> "Request xyz sent from MDT0000-mdc to NID MGS ... timed out", as expected.
> Thanks for pointing this out,
> Thomas

As Brian said, there is a difference between the _server_ detecting that a client is "down" and evicting it, and the _client_ continually attempting to reconnect and "finish" its IO (server failover). Failover does not "solve" network stability problems.

Basically, clients try forever, servers do not. With a successful recovery, the clients can retry any operations that have not been acknowledged by the server, but if a client is "down" (in reality or due to network issues), it is evicted so the servers are able to reclaim locks, etc., and keep the filesystem from hanging due to a client. Hanging due to a server being down is the feature that allows failover.

Kevin

>>> "Request ... has timed out (limit 7s)", "Connection to service was
>>> lost; in progress operations using this service will fail", and finally
>>> "Connection restored to service".
>>
>> Yes. This is a timeout and nothing to do with the subject of server
>> recovery.
>>
>> b.