Thomas Roth
2009-Jan-14 10:34 UTC
[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
Hi all,

on our production cluster we have for a surprisingly long time (> 1 day) seen only the following two error messages (and no visible problems), although the system is under heavy load right now:

  Jan 14 10:44:33 server1 kernel: LustreError: 5118:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req at ffff8107fd6c4c50 x2077599/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1231927273 ref 1 fl Interpret:/0/0 rc -107/0

and:

  Jan 14 10:46:42 server1 kernel: LustreError: 6766:0:(mgs_handler.c:557:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS

error (-107) is /* Transport endpoint is not connected */ - I have seen this before on clients which had lost the connection to the cluster. But this is on the MGS/MDS - one server with one partition for the MGS and one for the MDT.

The second error suggests of course that the MGS is actually not connected - but how can a Lustre system run when its MGS isn't there? Makes no sense, does it?

O.k., the cluster is running Debian Etch 64bit, Kernel 2.6.22, Lustre 1.6.5.1. The "operation 101" thing is supposed to have been solved in the 1.6.4 -> 1.6.5 upgrade, according to the change logs. Either it hasn't, or I have a real problem where this error message really applies.

It is also remarkable that nobody seems to know the meaning of "operation X on unconnected MGS" - via Google one will find many questions but no answers - at least that's my impression (and I didn't search Bugzilla).

Many thanks,
Thomas
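For anyone triaging lines like the first message quoted above, here is a minimal Python sketch that pulls the opcode and return code out of such a line. The field layout in the regular expression is an assumption inferred from the single sample in this thread, not taken from the Lustre source. The names `parse_reply_error` and `ERRNO_NAMES` are made up for illustration, and 107 is hard-coded as the Linux value for "Transport endpoint is not connected".

```python
import re

# Linux errno value mentioned in the thread; hard-coded so the sketch does
# not depend on the errno table of the machine running it.
ERRNO_NAMES = {107: "ENOTCONN (Transport endpoint is not connected)"}

# Assumed layout: "processing error (<err>) ... o<opcode>-> ... rc <rc>/<n>"
REQ_RE = re.compile(
    r"processing error \((?P<err>-?\d+)\).*? o(?P<opc>\d+)->.*? rc (?P<rc>-?\d+)/"
)


def parse_reply_error(line):
    """Return opcode, rc and an errno label for a matching line, else None."""
    m = REQ_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "opcode": int(m.group("opc")),
        "rc": rc,
        # rc is a negated errno, so look up its absolute value
        "errno_name": ERRNO_NAMES.get(-rc, "unknown"),
    }


# The sample line from this thread, joined back into one string.
SAMPLE = (
    "Jan 14 10:44:33 server1 kernel: LustreError: "
    "5118:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error "
    "(-107) req at ffff8107fd6c4c50 x2077599/t0 o101-><?>@<?>:0/0 lens 232/0 e "
    "0 to 0 dl 1231927273 ref 1 fl Interpret:/0/0 rc -107/0"
)

if __name__ == "__main__":
    print(parse_reply_error(SAMPLE))
```

Run against the sample, this reports opcode 101 with rc -107, which matches the "operation 101" in the second message.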
Cliff White
2009-Jan-15 04:05 UTC
[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
Thomas Roth wrote:
> error (-107) is /* Transport endpoint is not connected */ - I have
> seen this before on clients which had lost the connection to the
> cluster. But this is on the MGS/MDS - one server with one partition for
> the MGS and one for the MDT.

Remember, this is a distributed client/server system. When any node needs to connect to a service, there will be a client process. So, an OSS (which needs to talk to the MDS) will have a metadata client (mdc) running on it.

> The second error suggests of course that the MGS is actually not
> connected - but how can a Lustre system run when its MGS isn't there?
> Makes no sense, does it?

Ah, that's the beauty of Lustre. The MGS is needed for two things:
- New clients get the mount from the MGS
- Configuration changes are propagated from the MGS.
So, if you are not actively mounting clients, and not changing the configuration, in fact Lustre can run just fine without the MGS. Filesystem users will not even notice it's gone, unless they are attempting a mount.

Likewise, the MDS is used for metadata transactions. If a client is not actively touching metadata (for example, a client already has an open file and is doing IO only), you can fail the MDS without the clients noticing.

Those two errors are quite harmless in this case - 'operation x on unconnected MGS' means a client was evicted: the client is attempting to replay an RPC, but the server has destroyed the import (due to the eviction) and it has not been re-established.

cliffw
Thomas Roth
2009-Jan-15 11:54 UTC
[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
Thank you for this clarification on the operation X message!

"Running" Lustre without MGS or even MDT is something I have tested already - involuntarily ;-)
But I was confused because in this case, there were new mounts coming all the time, so the MGS was there and answering, and at the same time Lustre talks about an unconnected MGS.

Thomas

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
Andreas Dilger
2009-Jan-15 15:56 UTC
[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
On Jan 14, 2009 11:34 +0100, Thomas Roth wrote:
> The second error suggests of course that the MGS is actually not
> connected - but how can a Lustre system run when its MGS isn't there?
> Makes no sense, does it?

It means some client is trying to perform operations on the MGS before it is connected.

> O.k., the cluster is running Debian Etch 64bit, Kernel 2.6.22, Lustre
> 1.6.5.1. The "operation 101" thing is supposed to have been solved in
> the 1.6.4 -> 1.6.5 upgrade, according to the change logs.

There are a million things that might cause "operation 101" problems. 101 = LDLM_ENQUEUE, so this is just a lock enqueue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
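As a rough illustration of the opcode reading above, the following Python sketch turns an "operation N on unconnected MGS" line into a named opcode. It is only a sketch: the table `OPCODE_NAMES` contains just the one mapping stated in this thread (101 = LDLM_ENQUEUE), the function name `describe_unconnected_op` is hypothetical, and the message pattern is assumed from the single log line posted here.

```python
import re

# Only the mapping given in this thread; other opcodes are deliberately
# left out rather than guessed.
OPCODE_NAMES = {101: "LDLM_ENQUEUE"}

# Assumed message shape, taken from the one sample line in the thread.
MGS_RE = re.compile(r"operation (\d+) on unconnected MGS")


def describe_unconnected_op(line):
    """Return '<name> (opcode N)' for a matching log line, else None."""
    m = MGS_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group(1))
    name = OPCODE_NAMES.get(opcode, "unknown")
    return f"{name} (opcode {opcode})"


MGS_SAMPLE = (
    "Jan 14 10:46:42 server1 kernel: LustreError: "
    "6766:0:(mgs_handler.c:557:mgs_handle()) lustre_mgs: operation 101 on "
    "unconnected MGS"
)

if __name__ == "__main__":
    print(describe_unconnected_op(MGS_SAMPLE))
```

Opcodes that are not in the table come back as "unknown" instead of being guessed, so the table can be extended from the Lustre headers for the version actually in use.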
Thomas Roth
2009-Jan-15 18:50 UTC
[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
Thanks, Andreas.

Andreas Dilger wrote:
> It means some client is trying to perform operations on the MGS before
> it is connected.

Who? Before the client is connected, or before the MGS is connected? Of course the client can't do anything before it is connected. But the MGS is connected in the sense that it is mounted and responsive - I can do a fresh client mount of this system any time. Maybe that's more semantics than Lustre.

In any case, I am reassured by your comments, in particular since the cluster is doing fine in this situation.

Regards,
Thomas