Charles Taylor
2008-May-27 13:46 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
We have a few nodes that locked up due to memory oversubscription. After rebooting, they can no longer communicate with any of our three MDSs over IB and, consequently, we cannot mount our Lustre 1.6.4.2 file systems on these nodes any longer. All other communication via the IB port (ipoib for pings, ssh, etc.) seems fine. If we re-cable a node to use its second IB port, communication with the MDSs is re-established and we can mount the file systems again, with everything working as expected. Note that this affects only a few nodes (out of 400) that seem to have gotten into a bad state with regard to Lustre.

Relevant info:

Lustre 1.6.4.2

CentOS 4.5 w/ updated kernel:
Linux r5b-s30.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Wed Feb 20 10:59:48 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

OFED 1.2

HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
  Primary image is valid, unknown source
  Secondary image is valid, unknown source

  Vital Product Data
    Product Name: Lion cub
    P/N: MHEA28-1TC
    E/C: A2
    S/N: MT0637X00650
    Freq/Power: PCIe x8
    Checksum: Ok
    Date Code: N/A

Whether the MDSs themselves are part of the problem we don't know, because we have not tried rebooting them yet (kind of painful), but I'm guessing that if we rebooted them the issue would go away. I suppose it could be a problem at the IB layer (LID re-assignment or some such), but since Lustre is the only app manifesting the issue, that seems unlikely. I'm just wondering if anyone else has encountered this and might know of a way to clear it out (some obscure lnet command) without rebooting the MDSs.

Charlie Taylor
UF HPC Center
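Since switching to the second IB port restores communication, a re-cabling-free variant of the same workaround may be possible: LNET can be bound to the second port through its module options rather than by moving the cable. A sketch only, assuming the second port shows up as ib1 on these nodes (the interface name is an assumption):

# in /etc/modprobe.conf on an affected client: bind the o2ib LND to the second IPoIB interface
options lnet networks="o2ib(ib1)"

The client init script in the follow-up message loads lnet with plain networks=o2ib, which by default binds to the first IPoIB interface; naming the interface explicitly is what moves LNET onto the other port.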
Charles Taylor
2008-May-27 13:50 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Whoops, I meant to include the mount-time error messages....

/etc/init.d/lustre-client start
IB HCA detected - will try to sleep until link state becomes ACTIVE
State becomes ACTIVE
Loading Lustre lnet module with option networks=o2ib:  [  OK  ]
Loading Lustre kernel module:                          [  OK  ]
mount -t lustre 10.13.24.40@o2ib:/ufhpc /ufhpc/scratch:
mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.40@o2ib:/ufhpc
mount -t lustre 10.13.24.90@o2ib:/crn /crn/scratch:
mount.lustre: mount 10.13.24.90@o2ib:/crn at /crn/scratch failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.90@o2ib:/crn
mount -t lustre 10.13.24.85@o2ib:/hpcdata /ufhpc/hpcdata:
mount.lustre: mount 10.13.24.85@o2ib:/hpcdata at /ufhpc/hpcdata failed: Cannot send after transport endpoint shutdown
                                                       [FAILED]
Error: Failed to mount 10.13.24.85@o2ib:/hpcdata

Charlie Taylor
UF HPC Center

On May 27, 2008, at 9:46 AM, Charles Taylor wrote:
> [original report quoted in full; see the first message above]
Isaac Huang
2008-May-27 14:13 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
On Tue, May 27, 2008 at 09:50:38AM -0400, Charles Taylor wrote:
> Whoops, I meant to include the mount-time error messages....
> [mount output quoted; see above]
> mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot
> send after transport endpoint shutdown

Was there any error message in 'dmesg'? Can you try 'lctl ping
10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
peer_list' and 'lctl --net o2ib conn_list').

Isaac
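Taken together, the suggested checks amount to a quick LNET diagnostic pass on an affected client (the NID here is the one from the failed /crn mount above):

lctl list_nids                  # NIDs this node presents to LNET
lctl ping 10.13.24.90@o2ib      # LNET-level ping to the MDS, independent of ipoib
lctl --net o2ib peer_list       # peers the o2ib LND currently knows about
lctl --net o2ib conn_list       # open o2ib connections to those peers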
Charles Taylor
2008-May-27 14:36 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Here it is for one of the other MDSs (10.13.16.24@o2ib). As you can
see, the ipoib ping succeeds but the "lctl ping" fails, as does the
mount. The last few lines of dmesg are also below.

[root@r5b-s41 ~]# ping 10.13.16.24
PING 10.13.16.24 (10.13.16.24) 56(84) bytes of data.
64 bytes from 10.13.16.24: icmp_seq=0 ttl=64 time=0.168 ms

--- 10.13.16.24 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.168/0.168/0.168/0.000 ms, pipe 2

[root@r5b-s41 ~]# lctl ping 10.13.16.24@o2ib
failed to ping 10.13.16.24@o2ib: Input/output error

[root@r5b-s41 ~]# mount -t lustre 10.13.16.24@o2ib:/ufhpc /ufhpc/scratch
mount.lustre: mount 10.13.16.24@o2ib:/ufhpc at /ufhpc/scratch failed: Cannot send after transport endpoint shutdown

dmesg....

LustreError: 12980:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8102320ee400 x15/t0 o501->MGS@MGC10.13.16.24@o2ib_0:26 lens 136/120 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC10.13.16.24@o2ib: The configuration from log 'ufhpc-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 12980:0:(llite_lib.c:1021:ll_fill_super()) Unable to process log: -108
Lustre: client ffff810232fc3800 umount complete
LustreError: 12980:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount (-108)

Thanks,

Charlie Taylor

On May 27, 2008, at 10:13 AM, Isaac Huang wrote:
> Was there any error message in 'dmesg'? Can you try 'lctl ping
> 10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
> peer_list' and 'lctl --net o2ib conn_list').
>
> Isaac
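One point worth keeping in mind with these results: the plain ping and the lctl ping take different paths through the fabric. The ICMP ping rides ipoib, while the o2ib LND uses native IB verbs (with connection setup via the connection manager), so ipoib can look perfectly healthy while verbs-level traffic fails. A quick way to look at the verbs layer directly, assuming the OFED userspace diagnostics are installed:

ibv_devinfo | egrep -i 'state|lid'    # port state plus the LID/SM LID the subnet manager assigned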
Charles Taylor
2008-May-27 14:52 UTC
[Lustre-discuss] Failure to communicate with MDS via o2ib
Thanks but I'm going to withdraw this for now. I was too quick on the
trigger. We are seeing some issues with LID assignment (upon reboot)
for the nodes in question on our SM. Sorry for the wasted BW.

Charlie Taylor
UF HPC Center

On May 27, 2008, at 10:13 AM, Isaac Huang wrote:
> Was there any error message in 'dmesg'? Can you try 'lctl ping
> 10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
> peer_list' and 'lctl --net o2ib conn_list').
>
> Isaac
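For the record, the LID theory is straightforward to check from the node side. A sketch, assuming the infiniband-diags tools are installed and the Tavor-compatible HCA registers as mthca0 (the device name is an assumption):

ibstat mthca0 1 | grep -i 'base lid'    # LID the SM currently assigns to port 1

If that value changes across a reboot while peers or the subnet manager's path records still reference the old LID, that could be consistent with the symptoms above: ipoib recovers on its own, but established verbs-level state does not.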