Hi All,

I have an existing Lustre filesystem that I want to make available to another cluster. Both clusters are IB based, and the servers have dual IB ports that are connected to the two IB fabrics of the clusters.

Cluster A, the original cluster, works fine with the filesystem. The clients for cluster B cannot mount the filesystem, yet lctl ping does work. Here are the details:

MGS/MDS:
/etc/modprobe.conf.local:
    options lnet ip2nets="o2ib0(ib0) 10.149.0.*;o2ib1(ib1) 10.150.0.*"
ib0 IP: 10.149.0.69/16
ib1 IP: 10.150.0.69/16
lctl list_nids:
    10.149.0.69@o2ib
    10.150.0.69@o2ib1

OSS1:
/etc/modprobe.conf.local:
    options lnet ip2nets="o2ib0(ib0) 10.149.0.*;o2ib1(ib1) 10.150.0.*"
ib0 IP: 10.149.0.70/16
ib1 IP: 10.150.0.70/16
lctl list_nids:
    10.149.0.70@o2ib
    10.150.0.70@o2ib1

OSS2-3 are similar to OSS1:
OSS2 ib0 IP: 10.149.0.71/16
OSS2 ib1 IP: 10.150.0.71/16
OSS3 ib0 IP: 10.149.0.72/16
OSS3 ib1 IP: 10.150.0.72/16

Cluster A Client:
/etc/modprobe.conf.local:
    options lnet networks=o2ib(ib1)
ib1 IP: 10.149.0.2/16
# lctl list_nids
    10.149.0.2@o2ib

Cluster B Client:
/etc/modprobe.conf.local:
    options lnet ip2nets="o2ib1(ib0) 10.150.0.*"
ib0 IP: 10.150.0.120/16
# lctl list_nids
    10.150.0.120@o2ib1

---

I can use lctl ping from the cluster B client to ping the MDS/MGS:

Client B# lctl ping 10.150.0.69@o2ib1
    12345-0@lo
    12345-10.149.0.69@o2ib
    12345-10.150.0.69@o2ib1

And in the other direction, from the MGS/MDS to the client:

mds # lctl ping 10.150.0.120@o2ib1
    12345-0@lo
    12345-10.150.0.120@o2ib1

Yet the mount continues to fail. I see the following messages in /var/log/messages on the client:

Apr 21 14:37:02 gpute-47 kernel: Lustre: MGC10.150.0.69@o2ib1: Reactivating import
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for 10.149.0.69@o2ib
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer 10.149.0.69@o2ib!
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(ldlm_lib.c:334:client_obd_setup()) can't add initial connection
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(obd_config.c:363:class_setup()) setup lustre-MDT0000-mdc-ffff810322afe400 failed (-2)
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(obd_config.c:1102:class_config_llog_handler()) Err -2 on cfg command:
Apr 21 14:37:02 gpute-47 kernel: Lustre: cmd=cf003 0:lustre-MDT0000-mdc 1:lustre-MDT0000_UUID 2:10.149.0.69@o2ib
Apr 21 14:37:02 gpute-47 kernel: LustreError: 15c-8: MGC10.150.0.69@o2ib1: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 21 14:37:02 gpute-47 kernel: LustreError: 20920:0:(llite_lib.c:1064:ll_fill_super()) Unable to process log: -2
Apr 21 14:37:02 gpute-47 kernel: LustreError: 20920:0:(obd_config.c:430:class_cleanup()) Device 2 not setup
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Apr 21 14:37:03 gpute-47 kernel: Lustre: client ffff810322afe400 umount complete
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(obd_mount.c:1991:lustre_fill_super()) Unable to mount (-2)

It appears that the client is still looking for the 10.149.xxx NID but does not find it. The other part that bothers me is that there are no messages at all on the MGS/MDS server when I attempt to mount the client. I would have expected some sort of failure message, unless the request is not reaching the server at all.

Any ideas?
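One way to read the log above: the client reaches the MGS on o2ib1, but the 'lustre-client' configuration log it receives names the MDT only as 10.149.0.69@o2ib, a network the cluster B client is not attached to. The sketch below illustrates that check with plain string handling. It assumes there are no LNET routers between the fabrics, and the helper names net_of and reachable are made up for illustration; they are not Lustre commands.

```shell
#!/bin/sh
# Illustrative helpers only (hypothetical names, not part of Lustre).

# Print the LNET network of a NID, treating "o2ib" and "o2ib0" as the same network.
net_of() {
    net=${1##*@}
    case $net in
        *[0-9]) ;;              # already carries a network number, e.g. o2ib1
        *) net="${net}0" ;;     # bare name such as "o2ib" means network 0
    esac
    printf '%s\n' "$net"
}

# reachable TARGET_NID LOCAL_NID...
# Succeeds if TARGET_NID is on the same LNET network as one of the local NIDs.
# Assumption: no LNET routing is configured.
reachable() {
    target=$(net_of "$1"); shift
    for local_nid in "$@"; do
        [ "$(net_of "$local_nid")" = "$target" ] && return 0
    done
    return 1
}

# The cluster B client only has 10.150.0.120@o2ib1 (from lctl list_nids).
reachable 10.150.0.69@o2ib1 10.150.0.120@o2ib1 && echo "MGS NID reachable"
reachable 10.149.0.69@o2ib  10.150.0.120@o2ib1 || echo "MDT NID from config log not reachable"
```

Under those assumptions, this is consistent with lctl ping succeeding while the mount fails with "No NID found for 10.149.0.69@o2ib".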
On 2010-04-21, at 14:07, Dennis Nelson wrote:
> I have an existing Lustre filesystem that I want to make available to
> another cluster. Both clusters are IB based and the servers have dual IB
> ports which are connected to the two IB fabrics of the clusters.
>
> Cluster A, the original cluster, works fine with the filesystem. The
> clients for cluster B cannot mount the filesystem yet lctl ping does work.
> Here are the details:
>
> 21187:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer
> 10.149.0.69@o2ib!
> 21187:0:(obd_config.c:1102:class_config_llog_handler()) Err -2 on cfg
> Apr 21 14:37:02 gpute-47 kernel: LustreError: 15c-8: MGC10.150.0.69@o2ib1:
> The configuration from log 'lustre-client' failed (-2). This may be the
> result of communication errors between this node and the MGS, a bad
> configuration, or other errors. See the syslog for more information.
>
> It appears that the client is still looking for the 10.149.xxx NID but does
> not find it. The other part that bothers me is that there are no messages
> at all on the MGS/MDS server when I attempt to mount the client. I would
> have expected some sort of failure message unless it is not reaching it at
> all.

I suspect you need to rewrite the filesystem configuration to include these new interfaces. I believe there is a section in the manual on how to correctly change network interfaces.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
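For reference, the manual section on changing server NIDs is usually carried out with tunefs.lustre --writeconf, which regenerates the configuration logs so that they list every NID the servers currently report. The following is a dry-run sketch under that assumption: it only prints the steps, and the device paths are hypothetical placeholders that must be replaced with the real MDT and OST devices. Check the manual for your Lustre version before running anything, since writeconf also discards parameters stored in the old logs.

```shell
#!/bin/sh
# Dry-run sketch: prints the writeconf steps instead of executing them.
# The device paths below are hypothetical placeholders, not taken from the thread.
MDT_DEV=/dev/mdt_device
OST_DEVS="/dev/ost0_device /dev/ost1_device"

writeconf_plan() {
    echo "# 1. Unmount all clients, then the MDT, then every OST"
    echo "# 2. Regenerate the configuration logs on every target:"
    echo "tunefs.lustre --writeconf $MDT_DEV"
    for dev in $OST_DEVS; do
        echo "tunefs.lustre --writeconf $dev"
    done
    echo "# 3. Remount in order: MGS/MDT first, then the OSTs, then the clients"
}

writeconf_plan
```

Each tunefs.lustre command is run on the server that owns the device. After the remount, a client on either fabric should receive a configuration log that lists both the o2ib and o2ib1 NIDs, provided lctl list_nids on each server shows both before the targets are mounted.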