Hello List,

I have finally got my new Lustre disk on-line, having rescued as much as
I could from the old, hardware-failed volume.  The new disk is mounted on
new hardware, OSS4 (for Object Storage Server 4 -- no, I am not
imaginative).  The disk OSTs are called "crew8".  They mount happily as
shown on OSS4:

/dev/sdb1             6.3T  897G  5.1T  15% /srv/lustre/OST/crew8-OST0000
/dev/sdb2             6.3T  867G  5.1T  15% /srv/lustre/OST/crew8-OST0001
/dev/sdc1             6.3T  892G  5.1T  15% /srv/lustre/OST/crew8-OST0002
/dev/sdc2             6.3T 1003G  5.0T  17% /srv/lustre/OST/crew8-OST0003
/dev/sdd1             6.3T  907G  5.1T  15% /srv/lustre/OST/crew8-OST0004
/dev/sdd2             6.3T  877G  5.1T  15% /srv/lustre/OST/crew8-OST0005
/dev/sdi1             6.3T  916G  5.1T  16% /srv/lustre/OST/crew8-OST0006
/dev/sdi2             6.3T  920G  5.1T  16% /srv/lustre/OST/crew8-OST0007
/dev/sdj1             6.3T  901G  5.1T  15% /srv/lustre/OST/crew8-OST0008
/dev/sdj2             6.3T  895G  5.1T  15% /srv/lustre/OST/crew8-OST0009
/dev/sdk1             6.3T  878G  5.1T  15% /srv/lustre/OST/crew8-OST0010
/dev/sdk2             6.3T  891G  5.1T  15% /srv/lustre/OST/crew8-OST0011

The MGS/MDS (which serves two other Lustre volumes for us currently)
shows the following info:

[root@mds1 ~]# lctl dl
  0 UP mgs MGS MGS 13
  1 UP mgc MGC172.18.0.10@o2ib 81039216-0261-c74d-3f2f-a504788ad8f8 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 7
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  9 UP mds crew3-MDT0000 crew3mds_UUID 7
 10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 13 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
 14 UP mds crew8-MDT0000 crew8-MDT0000_UUID 9
 15 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 22 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 23 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 24 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

(NOTE: the last two disks came in as crew8-OST000a and crew8-OST000b,
not crew8-OST0010 and crew8-OST0011 respectively.  I don't know if that
has anything at all to do with my issue.)

The clients are forever losing this one crew8 volume (mounted on the
clients as /crewdat).  From /var/log/messages:

[larkoc@crew01 ~]$ tail /var/log/messages
Sep 18 13:53:10 crew01 kernel: Lustre: crew8-OST0002-osc-ffff8101edbff400: Connection restored to service crew8-OST0002 using nid 172.18.0.15@o2ib.
Sep 18 13:53:10 crew01 kernel: Lustre: Skipped 4 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The obd_ping operation failed with -107
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: crew8-OST0003-osc-ffff81083ea5c400: Connection to service crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: crew8-OST0003-osc-ffff81083ea5c400: Connection to service crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages

The MGS/MDS /var/log/messages reads:

[root@mds1 ~]# tail /var/log/messages
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to service crew8-OST0005 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: LustreError: 167-0: This client was evicted by crew8-OST0005; in progress operations using this service will fail.
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: 568:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection restored to service crew8-OST0005 using nid 172.18.0.15@o2ib.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: MDS crew8-MDT0000: crew8-OST000b_UUID now active, resetting orphans
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages

The OSS4 box /var/log/messages:

[root@oss4 ~]# tail /var/log/messages
Sep 18 13:40:40 oss4 kernel: Lustre: crew8-OST0000: haven't heard from client 794ff121-dfec-3934-338e-6b7f861f69b6 (at 172.18.1.2@o2ib) in 195 seconds. I think it's dead, and I am evicting it.
Sep 18 13:40:40 oss4 kernel: Lustre: Skipped 25 previous similar messages
Sep 18 13:44:50 oss4 kernel: LustreError: 3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81042b8c9a00 x8274144/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:44:50 oss4 kernel: LustreError: 3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 23 previous similar messages
Sep 18 13:50:58 oss4 kernel: Lustre: crew8-OST000b: received MDS connection from 172.18.0.10@o2ib
Sep 18 13:50:58 oss4 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:51:00 oss4 kernel: Lustre: crew8-OST0006: haven't heard from client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 251 seconds. I think it's dead, and I am evicting it.
Sep 18 13:51:00 oss4 kernel: Lustre: Skipped 30 previous similar messages
Sep 18 13:55:08 oss4 kernel: LustreError: 3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81042d8dba00 x9085095/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:55:08 oss4 kernel: LustreError: 3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 35 previous similar messages

So I am seeing that OSS4 is repeatedly losing its network contact with
the MGS/MDS machine mds1.
Each time this occurs, mds1 tells the client that the disk mounted as
/crewdat has been evicted and that in-progress operations will wait for
recovery to complete.  A check on mds1 of the crew8-MDT0000 target shows
that no recovery is occurring (and none has, as far as I can tell, since
I mounted the disk on the clients post-recovery).  On mds1:

cat /proc/fs/lustre/mds/crew8-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1221745009
recovery_end: 1221745185
recovered_clients: 1
unrecovered_clients: 0
last_transno: 33954534
replayed_requests: 0

I am guessing that I need to increase a Lustre client timeout value for
our o2ib connections so that the new disk stops generating these
messages (the /crewdat disk itself seems to be fine for user access).
The other two Lustre volumes on the system seem content.

Is my guess correct?  If yes, which timeout value do I need to increase?

Thank you,
megan
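For anyone following along, a minimal sketch of how the global Lustre
timeout could be inspected and raised on a 1.6-era system is shown
below.  The /proc path is the same one quoted later in this thread; the
value 600 is only an illustrative number, and the conf_param form is
the persistent, MGS-side variant as documented for 1.6 -- treat this as
a sketch rather than a recommendation.

    # Check the current global obd timeout (in seconds) on any node.
    cat /proc/sys/lustre/timeout

    # Raise it temporarily on a running node (not persistent across
    # remounts); 600 is only an example value.
    echo 600 > /proc/sys/lustre/timeout

    # Set it persistently for one filesystem from the MGS node
    # (example value; crew8 is the fsname from this thread).
    lctl conf_param crew8.sys.timeout=600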
On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
> /dev/sdk1  6.3T  878G  5.1T  15% /srv/lustre/OST/crew8-OST0010
> /dev/sdk2  6.3T  891G  5.1T  15% /srv/lustre/OST/crew8-OST0011
>
> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>
> (NOTE: the last two disks came in as crew8-OST000a and crew8-OST000b,
> not crew8-OST0010 and crew8-OST0011 respectively.  I don't know if
> that has anything at all to do with my issue.)

Hmm, that is a bit strange; I don't know that I've seen this before.

> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
>
> The MGS/MDS /var/log/messages reads:
> [root@mds1 ~]# tail /var/log/messages
> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
> service crew8-OST0005 via nid 172.18.0.15@o2ib was lost; in progress
>
> So I am seeing that OSS4 is repeatedly losing its network contact
> with the MGS/MDS machine mds1.

It is also losing its connection to the crew01 client; I'd suspect some
kind of network problem (e.g. a cable).

> I am guessing that I need to increase a Lustre client timeout value
> for our o2ib connections so that the new disk stops generating these
> messages (the /crewdat disk itself seems to be fine for user access).

This seems unlikely, unless you have a large cluster (e.g. 500+ clients).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
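As a way of following up on the "network problem" theory, a few generic
checks run from oss4 might look like the sketch below.  The NID is the
mds1 address already quoted in the logs; lctl ping and the
infiniband-diags tools (ibstat, perfquery) are standard commands, but
the exact output format depends on the installed OFED release.

    # Verify basic LNET reachability from oss4 to the MGS/MDS NID.
    lctl ping 172.18.0.10@o2ib

    # Confirm the local HCA port is Active/LinkUp and note its rate.
    ibstat

    # Dump the local port's IB error counters (symbol errors, link
    # downed, etc.); counters that keep climbing usually mean a bad
    # cable, connector, or switch port.
    perfquery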
Hello!

I don't think it is a timeout issue any longer.  The timeout value is
the same for all of the Lustre systems mounted via our MGS/MDS system,
and the value is rather high: it is currently 1000, as reported by
"cat /proc/sys/lustre/timeout" on the MGS/MDS box.

I changed the IB cable on the problem box, using the same IB card, PCI
slot and slot on the IB SilverStorm switch.  The errors I now see on
the clients are the same, but on the OSS serving crew8-OST0000 through
crew8-OST0011 I now get:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
LustreError: 4346:0:(filter.c:2674:filter_destroy_precreated())
LustreError: 4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error -107

Perhaps it could be the IB card?  It is a Mellanox Technologies MT25204
[InfiniHost III Lx HCA].  This is the same card in many, but not all, of
our other systems.  I can try a new IB card on Monday.

On the OSS, the following lines repeat every two minutes (from
/var/log/messages):

Sep 20 22:20:32 oss4 kernel: LustreError: 3775:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1221963532, 100s ago) req@ffff810383d5de00 x46975/t0 o250->MGS@MGC172.18.0.10@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0 rc 0/-22
Sep 20 22:20:32 oss4 kernel: LustreError: 3775:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 5 previous similar messages

Thank you,
megan

On Sat, Sep 20, 2008 at 6:23 PM, Andreas Dilger <adilger@sun.com> wrote:
> On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
>> /dev/sdk1  6.3T  878G  5.1T  15% /srv/lustre/OST/crew8-OST0010
>> /dev/sdk2  6.3T  891G  5.1T  15% /srv/lustre/OST/crew8-OST0011
>>
>> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
>> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>>
>> (NOTE: the last two disks came in as crew8-OST000a and crew8-OST000b,
>> not crew8-OST0010 and crew8-OST0011 respectively.  I don't know if
>> that has anything at all to do with my issue.)
>
> Hmm, that is a bit strange; I don't know that I've seen this before.
>
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
>>
>> The MGS/MDS /var/log/messages reads:
>> [root@mds1 ~]# tail /var/log/messages
>> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
>> service crew8-OST0005 via nid 172.18.0.15@o2ib was lost; in progress
>>
>> So I am seeing that OSS4 is repeatedly losing its network contact
>> with the MGS/MDS machine mds1.
>
> It is also losing its connection to the crew01 client; I'd suspect some
> kind of network problem (e.g. a cable).
>
>> I am guessing that I need to increase a Lustre client timeout value
>> for our o2ib connections so that the new disk stops generating these
>> messages (the /crewdat disk itself seems to be fine for user access).
>
> This seems unlikely, unless you have a large cluster (e.g. 500+ clients).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
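Since it is the MGC on oss4 that keeps timing out against the MGS, it
may also be worth looking at what LNET itself reports on that node.  A
hedged sketch using standard lctl commands and the 1.6-era /proc files
(nothing site-specific beyond the o2ib network already in use here):

    # NIDs this node advertises; oss4 should list an @o2ib NID.
    lctl list_nids

    # LNET's view of the local network interfaces and of known peers;
    # stuck queues or growing error counts here point at the IB/LNET
    # layer rather than at Lustre-level timeouts.
    cat /proc/sys/lnet/nis
    cat /proc/sys/lnet/peers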
Ms. Megan Larko wrote:
> I changed the IB cable on the problem box, using the same IB card, PCI
> slot and slot on the IB SilverStorm switch.  The errors I now see on
> the clients are the same, but on the OSS serving crew8-OST0000 through
> crew8-OST0011 I now get:
>
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
> LustreError: 4346:0:(filter.c:2674:filter_destroy_precreated())
> LustreError: 4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error -107
>
> Perhaps it could be the IB card?  It is a Mellanox Technologies
> MT25204 [InfiniHost III Lx HCA].  This is the same card in many, but
> not all, of our other systems.  I can try a new IB card on Monday.

Which subnet manager are you using?  You should look at the log files to
see why you are getting these "multicast join failed" messages, which
are an indication that there is something pretty wrong with the
InfiniBand fabric.  For some reason (for example, the node does not
support the speed used for the multicast group), the node could not
join the group.  This is especially critical because this particular
multicast group is used for all IPv4 broadcast traffic (e.g. IPv4 ARP
requests).

Since InfiniBand multicast is not well understood, let me summarize:

The SM assigns a multicast LID for each multicast group.  Most switches
only support 1024 multicast LIDs, and some SMs cannot map more than one
group to the same LID, so multicast sometimes breaks when you get too
many groups (i.e. more than ~900 nodes with just link-local IPv6
addresses -- see below).

When a node first joins a multicast group, it selects the group speed
(typically SDR 4x or DDR 4x).  Nodes that do not support (at least) that
speed are not allowed to join later, as all multicast messages for that
LID are sent at that speed (i.e. an SDR node cannot join a DDR mcast
group, as it could not keep up).

With IPv6, ARPs are done using multicast (which is perfect for broadcast
LANs, where only the target node takes an interrupt to process the ARP
request), and this can lead to a multicast group being created per IPv6
address.  Note that IPv4 also uses a few multicast groups.  With
InfiniBand it is a little messy: the node has to query the MC list from
the SM to know which LID to use to send the multicast ARP.

Try checking the link speeds, and look at "saquery -g".

Kevin Van Maren
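A minimal sketch of the checks Kevin suggests, using the standard
infiniband-diags tools of that era; exact output formats vary with the
OFED release, and nothing below is specific to this site:

    # List the multicast groups the subnet manager knows about, with
    # their MTU and rate; the IPv4 broadcast group is the
    # ff12:401b:...:ffff:ffff one from the error message above.
    saquery -g

    # Show the local HCA port state and rate (e.g. 10 Gb/sec (4X) for
    # SDR, 20 Gb/sec (4X) for DDR) so it can be compared with the rate
    # of the multicast group the port failed to join.
    ibstat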