In my continuing quest (and self-education) on Lustre networking (lctl ping and obd_ping in particular...): My MGS/MDS box is losing the connection to one, and only one, particular OSS and then restoring it, all within the same wall-clock second:

MGS/MDS /var/log/messages:

Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection to service crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: LustreError: 167-0: This client was evicted by crew8-OST0003; in progress operations using this service will fail.
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: 931:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection restored to service crew8-OST0003 using nid 172.18.0.15@o2ib.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:59 mds1 kernel: Lustre: MDS crew8-MDT0000: crew8-OST0003_UUID now active, resetting orphans

My corresponding problem OSS has a "processing error" (???) and then resets its own connection:

OSS4 /var/log/messages:

Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous similar messages
Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS connection from 172.18.0.10@o2ib
Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think it's dead, and I am evicting it.

My client box here has the same connection error, but minutes later(!!). Odd. The boxes all use ntpd and sync from a common time server here. But the notable thing is that the obd_ping, lost connection, eviction and then restoration all occur within a wall-clock minute of one another.

crew01 /var/log/messages:

Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid 172.18.0.15@o2ib.
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid 172.18.0.15@o2ib.

I have swapped IB network cables. The Linux (CentOS 5 on all systems) ping has no dropped packets between any of the systems on the o2ib network. All lctl pings return normally. All systems are running the same OS code---

[root@oss4 ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

What is this "LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)" error on my OSS? As the end-users are not noticing anything, and all of the activity on this one OSS is "no communication--evicted--restored" inside of a minute, should I do anything other than clean my becoming-voluminous logfiles more frequently?

megan
On Sep 22, 2008 11:34 -0400, Ms. Megan Larko wrote:
> My MGS/MDS box is losing the connection to one, and only one, particular
> OSS and then restoring it, all within the same wall-clock second:
> MGS/MDS /var/log/messages:
>
> My corresponding problem OSS has a "processing error" (???) and
> then resets its own connection:
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107) req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0
> ref 0 fl Interpret:/0/0 rc -107/0
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous
> similar messages
> Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS
> connection from 172.18.0.10@o2ib
> Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
> Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from
> client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think
> it's dead, and I am evicting it.

This indicates that the MDS is trying to ping the OSS server ("o400" is OBD_PING) to indicate that it is alive, but the OSS is having trouble replying to these requests, apparently because it thinks the MDS/client is not connected (-107 = -ENOTCONN).

> My client box here has the same connection error, but minutes
> later(!!). Odd. The boxes all use ntpd and sync from a common time
> server here. But the notable thing is that the obd_ping, lost
> connection, eviction and then restoration all occur within a wall-clock
> minute of one another.

There isn't a requirement that the clocks be synchronized for Lustre, and the difference in when the messages are printed on the client and the MDS is likely just due to different timing in the skipping of similar messages.

> I have swapped IB network cables. The Linux (CentOS 5 on all
> systems) ping has no dropped packets between any of the systems on the
> o2ib network. All lctl pings return normally. All systems are
> running the same OS code---
> [root@oss4 ~]# uname -a
> Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun
> Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> What is this "LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107)" error on my OSS? As the end-users are not noticing anything,
> and all of the activity on this one OSS is "no
> communication--evicted--restored" inside of a minute, should I do
> anything other than clean my becoming-voluminous logfiles more
> frequently?

It definitely looks like there is some sort of strange problem. You could enable RPC tracing and dump a debug log (lctl dk /tmp/lustre.log), checking whether the client is sending a ping (o400), whether the server is getting it, and what is happening during the processing.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
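A minimal sketch of the RPC-trace capture described above, assuming the Lustre 1.6-era /proc debug interface; the exact proc path, the "+rpctrace" mask flag, and the "o400" grep pattern are assumptions to verify against your own build:

[root@oss4 ~]# echo +rpctrace > /proc/sys/lnet/debug   # add RPC tracing to the current debug mask
[root@oss4 ~]# lctl clear                              # empty the kernel debug buffer before reproducing
  (wait for the next lost/evicted/restored cycle to appear in /var/log/messages)
[root@oss4 ~]# lctl dk /tmp/lustre.log                 # dump the debug buffer to a file
[root@oss4 ~]# grep o400 /tmp/lustre.log               # look for the OBD_PING (opcode 400) requests and replies

Running the same capture on the MDS over the same window should show whether the ping is being sent there and whether it ever reaches oss4.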