Charles Taylor
2008-May-27 13:46 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
We have a few nodes that locked up due to memory oversubscription. After rebooting, they can no longer communicate with any of our three MDSs over IB and, consequently, we cannot mount our Lustre 1.6.4.2 file systems on these nodes any longer. All other communication via the IB port (ipoib for pings, ssh, etc.) seems fine. If we re-cable a node to use its second IB port, communication with the MDSs is re-established and we can mount the file systems again, with everything working as expected. Note that this affects only a few nodes (out of 400) that seem to have gotten into a bad state with regard to Lustre.

Relevant info:

Lustre 1.6.4.2

CentOS 4.5 w/ updated kernel:
Linux r5b-s30.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Wed Feb 20 10:59:48 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

OFED 1.2

HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
  Primary image is valid, unknown source
  Secondary image is valid, unknown source

  Vital Product Data
    Product Name: Lion cub
    P/N: MHEA28-1TC
    E/C: A2
    S/N: MT0637X00650
    Freq/Power: PCIe x8
    Checksum: Ok
    Date Code: N/A

Whether the MDSs themselves are part of the problem we don't know, because we have not tried rebooting them yet (kind of painful), but I'm guessing that if we rebooted them the issue would go away. I suppose it could be a problem at the IB layer (LID re-assignment or some such), but since Lustre is the only app manifesting the issue, that seems unlikely. I'm just wondering if anyone else has encountered this and might know of a way to clear it out (some obscure lnet command) without rebooting the MDSs.

Charlie Taylor
UF HPC Center
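Since switching to the second IB port restores communication, a re-cabling-free variant of the same workaround may be possible: LNET can be bound to the second port through its module options rather than by moving the cable. A sketch only, assuming the second port shows up as ib1 on these nodes (the interface name is an assumption):

# in /etc/modprobe.conf on an affected client: bind the o2ib LND to the second IPoIB interface
options lnet networks="o2ib(ib1)"

The client init script in the follow-up message loads lnet with plain networks=o2ib, which by default binds to the first IPoIB interface; naming the interface explicitly is what moves LNET onto the other port.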
Charles Taylor
2008-May-27 13:50 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Whoops, I meant to include the mount-time error messages....

/etc/init.d/lustre-client start
IB HCA detected - will try to sleep until link state becomes ACTIVE
State becomes ACTIVE
Loading Lustre lnet module with option networks=o2ib:  [  OK  ]
Loading Lustre kernel module:                          [  OK  ]
mount -t lustre 10.13.24.40@o2ib:/ufhpc /ufhpc/scratch:
mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.40@o2ib:/ufhpc
mount -t lustre 10.13.24.90@o2ib:/crn /crn/scratch:
mount.lustre: mount 10.13.24.90@o2ib:/crn at /crn/scratch failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.90@o2ib:/crn
mount -t lustre 10.13.24.85@o2ib:/hpcdata /ufhpc/hpcdata:
mount.lustre: mount 10.13.24.85@o2ib:/hpcdata at /ufhpc/hpcdata failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.85@o2ib:/hpcdata

Charlie Taylor
UF HPC Center

On May 27, 2008, at 9:46 AM, Charles Taylor wrote:
> [original report quoted in full; see the first message above]
Isaac Huang
2008-May-27 14:13 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
On Tue, May 27, 2008 at 09:50:38AM -0400, Charles Taylor wrote:
> Whoops, I meant to include the mount-time error messages....
> [mount output quoted; see above]
> mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot
> send after transport endpoint shutdown

Was there any error message in 'dmesg'? Can you try 'lctl ping
10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
peer_list' and 'lctl --net o2ib conn_list').

Isaac
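Taken together, the suggested checks amount to a quick LNET diagnostic pass on an affected client (the NID here is the one from the failed /crn mount above):

lctl list_nids                  # NIDs this node presents to LNET
lctl ping 10.13.24.90@o2ib      # LNET-level ping to the MDS, independent of ipoib
lctl --net o2ib peer_list       # peers the o2ib LND currently knows about
lctl --net o2ib conn_list       # open o2ib connections to those peers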
Charles Taylor
2008-May-27 14:36 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Here it is for one of the other MDSs (10.13.16.24@o2ib). As you can
see, the ipoib ping succeeds but the "lctl ping" fails, as does the
mount. The last few lines of dmesg are also below.

[root@r5b-s41 ~]# ping 10.13.16.24
PING 10.13.16.24 (10.13.16.24) 56(84) bytes of data.
64 bytes from 10.13.16.24: icmp_seq=0 ttl=64 time=0.168 ms

--- 10.13.16.24 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.168/0.168/0.168/0.000 ms, pipe 2

[root@r5b-s41 ~]# lctl ping 10.13.16.24@o2ib
failed to ping 10.13.16.24@o2ib: Input/output error

[root@r5b-s41 ~]# mount -t lustre 10.13.16.24@o2ib:/ufhpc /ufhpc/scratch
mount.lustre: mount 10.13.16.24@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot send after transport endpoint shutdown

dmesg....

LustreError: 12980:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8102320ee400 x15/t0 o501->MGS@MGC10.13.16.24@o2ib_0:26 lens 136/120 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC10.13.16.24@o2ib: The configuration from log 'ufhpc-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 12980:0:(llite_lib.c:1021:ll_fill_super()) Unable to process log: -108
Lustre: client ffff810232fc3800 umount complete
LustreError: 12980:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount (-108)

Thanks,

Charlie Taylor

On May 27, 2008, at 10:13 AM, Isaac Huang wrote:
> Was there any error message in 'dmesg'? Can you try 'lctl ping
> 10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
> peer_list' and 'lctl --net o2ib conn_list').
>
> Isaac
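One point worth keeping in mind with these results: the plain ping and the lctl ping take different paths through the fabric. The ICMP ping rides ipoib, while the o2ib LND uses native IB verbs (with connection setup via the connection manager), so ipoib can look perfectly healthy while verbs-level traffic fails. A quick way to look at the verbs layer directly, assuming the OFED userspace diagnostics are installed:

ibv_devinfo | egrep -i 'state|lid'    # port state plus the LID/SM LID the subnet manager assigned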
Charles Taylor
2008-May-27 14:52 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Thanks but I'm going to withdraw this for now. I was too quick on the
trigger. We are seeing some issues with LID assignment (upon reboot)
for the nodes in question on our SM. Sorry for the wasted BW.

Charlie Taylor
UF HPC Center

On May 27, 2008, at 10:13 AM, Isaac Huang wrote:
> Was there any error message in 'dmesg'? Can you try 'lctl ping
> 10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
> peer_list' and 'lctl --net o2ib conn_list').
>
> Isaac
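For the record, the LID theory is straightforward to check from the node side. A sketch, assuming the infiniband-diags tools are installed and the Tavor-compatible HCA registers as mthca0 (the device name is an assumption):

ibstat mthca0 1 | grep -i 'base lid'    # LID the SM currently assigns to port 1

If that value changes across a reboot while peers or the subnet manager's path records still reference the old LID, that could be consistent with the symptoms above: ipoib recovers on its own, but established verbs-level state does not.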