In my continuing quest (and self-education) on Lustre networking (lctl ping and obd_ping in particular...): My MGS/MDS box is losing the connection to one, and only one, particular OSS and then restoring it, all within the same wall-clock second:

MGS/MDS /var/log/messages:

Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection to service crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: LustreError: 167-0: This client was evicted by crew8-OST0003; in progress operations using this service will fail.
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: 931:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection restored to service crew8-OST0003 using nid 172.18.0.15@o2ib.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:59 mds1 kernel: Lustre: MDS crew8-MDT0000: crew8-OST0003_UUID now active, resetting orphans

My corresponding problem OSS has a "processing error" (???) and then resets its own connection:

OSS4 /var/log/messages:

Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous similar messages
Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS connection from 172.18.0.10@o2ib
Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think it's dead, and I am evicting it.

My client box here has the same connection error, but minutes later(!!). Odd. The boxes all use ntpd and sync from a common time server here. But the notable thing is that the obd_ping, lost connection, eviction and then restoration all occur within a wall-clock minute of one another.

crew01 /var/log/messages:

Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid 172.18.0.15@o2ib.
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid 172.18.0.15@o2ib.

I have swapped IB network cables. The Linux (CentOS 5 on all systems) ping has no dropped packets between any of the systems on the o2ib network. All lctl pings return normally. All systems are running the same OS code---

[root@oss4 ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

What is this "LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)" error on my OSS? As the end-users are not noticing anything, and all of the activity on this one OSS is "no communication--evicted--restored" inside of a minute, should I do anything other than clean my becoming-voluminous logfiles more frequently?

megan
On Sep 22, 2008 11:34 -0400, Ms. Megan Larko wrote:
> My MGS/MDS box is losing the connection to one, and only one, particular
> OSS and then restoring it, all within the same wall-clock second:
> MGS/MDS /var/log/messages:
>
> My corresponding problem OSS has a "processing error" (???) and
> then resets its own connection:
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107) req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0
> ref 0 fl Interpret:/0/0 rc -107/0
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous
> similar messages
> Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS
> connection from 172.18.0.10@o2ib
> Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
> Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from
> client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think
> it's dead, and I am evicting it.

This indicates that the MDS is trying to ping the OSS server ("o400" is OBD_PING) to indicate that it is alive, but the OSS is having trouble replying to these requests, apparently because it thinks the MDS/client is not connected (-107 = -ENOTCONN).

> My client box here has the same connection error, but minutes
> later(!!). Odd. The boxes all use ntpd and sync from a common time
> server here. But the notable thing is that the obd_ping, lost
> connection, eviction and then restoration all occur within a wall-clock
> minute of one another.

There isn't a requirement that the clocks be synchronized for Lustre, and the difference in when the messages are printed on the client and the MDS is likely just due to different timing in the skipping of similar messages.

> I have swapped IB network cables. The Linux (CentOS 5 on all
> systems) ping has no dropped packets between any of the systems on the
> o2ib network. All lctl pings return normally. All systems are
> running the same OS code---
> [root@oss4 ~]# uname -a
> Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun
> Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> What is this "LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107)" error on my OSS? As the end-users are not noticing anything,
> and all of the activity on this one OSS is "no
> communication--evicted--restored" inside of a minute, should I do
> anything other than clean my becoming-voluminous logfiles more
> frequently?

It definitely looks like there is some sort of strange problem. You could enable RPC tracing and dump a debug log (lctl dk /tmp/lustre.log), checking whether the client is sending a ping (o400), whether the server is getting it, and what is happening during the processing.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
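A minimal sketch of the RPC-trace capture described above, assuming the Lustre 1.6-era /proc debug interface; the exact proc path, the "+rpctrace" mask flag, and the "o400" grep pattern are assumptions to verify against your own build:

[root@oss4 ~]# echo +rpctrace > /proc/sys/lnet/debug   # add RPC tracing to the current debug mask
[root@oss4 ~]# lctl clear                              # empty the kernel debug buffer before reproducing
  (wait for the next lost/evicted/restored cycle to appear in /var/log/messages)
[root@oss4 ~]# lctl dk /tmp/lustre.log                 # dump the debug buffer to a file
[root@oss4 ~]# grep o400 /tmp/lustre.log               # look for the OBD_PING (opcode 400) requests and replies

Running the same capture on the MDS over the same window should show whether the ping is being sent there and whether it ever reaches oss4.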