Hi,
I'm having difficulty getting one of my clients to work with a
multirail IB configuration. Here's what I've got:
   Host     OS Version  Lustre Version  Function      Storage                     Interface ib0     Interface ib1
1. bmr1-s7  CentOS 5.7  2.1.1           MGS,MDS,OSS1  mdt,mdt2,ost1->6,ost13->18  192.168.1.25/24   192.168.1.35/24
2. bmr1-s8  CentOS 5.7  2.1.1           OSS2          ost7->12,ost19->24          192.168.1.26/24   192.168.1.36/24
3. bmr1-s5  CentOS 5.7  2.1.1           OSS3          ost25->30                   192.168.1.20/24   192.168.1.30/24
4. bmr1-s6  CentOS 5.7  2.1.1           OSS4          ost31->36                   192.168.1.21/24   192.168.1.31/24
5. bmr2-s9  CentOS 5.7  2.1.1           Client        n/a                         192.168.1.209/24
The "/lustre" filesystem consists of mdt and ost1->12 (using
bmr1-s7 and bmr1-s8).
The "/lustre2" filesystem consists of mdt2 and ost13->36 (using
bmr1-s7, bmr1-s8, bmr1-s5, and bmr1-s6).
On each OSS, half the OSTs are available only on ib0 and the other half only on
ib1.
From bmr1-s5 and bmr1-s6 (used as clients), I can successfully mount and
access "/lustre". I can also successfully mount "/lustre2".
From bmr2-s9, I can mount neither "/lustre" nor "/lustre2". Originally, the
issue with bmr2-s9 was that it was running 1.8.6-wc1 (server on CentOS 5.6).
Since this config (i.e., multirail) wasn't supported on that version, I
upgraded to 2.1.1. I first tried installing and testing the 2.1.1 client,
without success. Then, since the 2.1.1 server had worked on both bmr1-s5 and
bmr1-s6, I thought I'd try that next. Unfortunately, it still didn't work.
1a) Here's what I see on the client when I try to mount "/lustre":
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]#
1b) Here's an excerpt from "/var/log/messages" on the client
(after executing the above command):
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: 5512:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.25@o2ib->MGC192.168.1.25@o2ib_0 netid 50000: select flavor null
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: MGC192.168.1.25@o2ib: Reactivating import
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(ldlm_lib.c:357:client_obd_setup()) can't add initial connection
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:522:class_setup()) setup lustre-OST0001-osc-ffff81045d783c00 failed (-2)
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:1361:class_config_llog_handler()) Err -2 on cfg command:
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: cmd=cf003 0:lustre-OST0001-osc 1:lustre-OST0001_UUID 2:192.168.1.35@o2ib1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 15c-8: MGC192.168.1.25@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(llite_lib.c:950:ll_fill_super()) Unable to process log: -2
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 4923:0:(lov_obd.c:927:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_config.c:567:class_cleanup()) Device 5 not setup
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: client ffff81045d783c00 umount complete
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-2)
1c) Here's an excerpt from "/var/log/messages" on the server
(after executing the above command):
Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 2e13dea0-ec9c-0fbd-0f95-7b16246f2626@192.168.1.209@o2ib t0 exp 0000000000000000 cur 1360699254 last 0
Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x50000c0a801d1_UUID netid 50000: select flavor null
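One possible reading of the failing cfg command in 1b: the configuration log points the client at an OST on 192.168.1.35@o2ib1, while the only client NID that appears in these logs is on o2ib. The sketch below illustrates that check in Python. It assumes, without confirmation from the thread, that bmr2-s9 has only the o2ib network configured; parse_nid and client_networks are hypothetical names used for illustration and are not part of Lustre.

```python
def parse_nid(nid):
    """Split an LNet NID of the form address@network into its two parts."""
    address, _, network = nid.partition("@")
    if not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    # A network name without a number means network 0 ("o2ib" == "o2ib0").
    if not network[-1].isdigit():
        network += "0"
    return address, network


# Assumption: the client has a single LNet network, as suggested by its
# NID 192.168.1.209@o2ib in the server log (1c).
client_networks = {"o2ib0"}

# NID taken from the failing cfg command in the client log (1b).
target_nid = "192.168.1.35@o2ib1"

address, network = parse_nid(target_nid)
reachable = network in client_networks
print(f"{target_nid}: network {network} configured on client: {reachable}")
```

Under that assumption the target network is not among the client's networks, which would be consistent with the "can't add initial connection" error.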
2a) Here's what I see on the client when I try to mount
"/lustre" (using the other interface):
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.35@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.35@o2ib:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@bmr2-s9 ~]#
2b) Here's an excerpt from "/var/log/messages" on the client
(after executing the above command):
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 5580:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.35@o2ib->MGC192.168.1.35@o2ib_0 netid 50000: select flavor null
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721863 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699617] [real_sent 1360699617] [current 1360699617] [deadline 5s] [delay -5s] req@ffff81043b76e400 x1426793186721863/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699622 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 1 previous similar message
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81043b76e000 x1426793186721864/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 5s
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721868 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699642] [real_sent 1360699642] [current 1360699642] [deadline 10s] [delay -10s] req@ffff810430e30800 x1426793186721868/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699652 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:22 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81045d7ce800 x1426793186721867/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 10s
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721872 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699667] [real_sent 1360699667] [current 1360699667] [deadline 15s] [delay -15s] req@ffff810444576c00 x1426793186721872/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699682 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:47 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile 'lustre-client' could not be read from the MGS. Does that filesystem exist?
Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-22)
2c) Here's an excerpt from "/var/log/messages" on the server
(after executing the above command):
Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar messages
Feb 12 15:07:22 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
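The second mount in 2a addresses the MGS as 192.168.1.35@o2ib, whereas the mgsnode parameter in the tunefs.lustre output (section 3) lists that address under o2ib1. The following Python sketch compares the two mount NIDs against that list; normalize is a hypothetical helper written for this illustration, and the comparison is only one plausible explanation of the "bad dst nid" rejection, not a confirmed diagnosis.

```python
def normalize(nid):
    """Return the NID with an explicit network number ("o2ib" -> "o2ib0")."""
    address, _, network = nid.partition("@")
    if network and not network[-1].isdigit():
        network += "0"
    return f"{address}@{network}"


# mgsnode parameter as printed by tunefs.lustre in section 3.
mgsnode = "192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1"
known = {normalize(n) for n in mgsnode.split(",")}

# The two MGS NIDs used in the mount attempts (1a and 2a).
for mount_nid in ("192.168.1.25@o2ib", "192.168.1.35@o2ib"):
    listed = normalize(mount_nid) in known
    print(f"{mount_nid} listed in mgsnode: {listed}")
```

Only the first NID matches an mgsnode entry; the second pairs the ib1 address with the o2ib network, a combination the server does not advertise.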
3) Here's what one of the MDTs looks like (the other is similarly
configured):
[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1
Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x105
(MDT MGS writeconf )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1
exiting before disk write.
[root@bmr1-s7 ~]#
4) Here's what one of the OSTs looks like (the others are similarly
configured):
[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0
exiting before disk write.
[root@bmr1-s7 ~]#
I'd appreciate any help or direction on a potential resolution. Let me
know what additional information is needed, if any. Hopefully, I'm
just missing something simple.
Thanks in advance,
...Brian
Hi,
The two LNet networks need separate IPoIB networks (IP subnets), e.g.:
o2ib0 - 10.0.0.0/24
o2ib1 - 10.1.0.0/24
Nodes with two physical interfaces (like your servers) have ib0 on o2ib0 and
ib1 on o2ib1.
Nodes with one physical interface (like your clients) have two logical
interfaces: ib0 on o2ib0 and ib0:0 on o2ib1.
Grégoire.
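To make the subnet point concrete, here is a small Python sketch using the standard ipaddress module. It checks whether the ib0 and ib1 addresses from the original post fall in the same IP network, and does the same for the split layout suggested above. shares_subnet is a hypothetical helper, and the 10.x host addresses are illustrative values that do not appear in the thread.

```python
import ipaddress

# Addresses as listed in the original post: (ib0, ib1) for each server.
servers = {
    "bmr1-s7": ("192.168.1.25/24", "192.168.1.35/24"),
    "bmr1-s8": ("192.168.1.26/24", "192.168.1.36/24"),
    "bmr1-s5": ("192.168.1.20/24", "192.168.1.30/24"),
    "bmr1-s6": ("192.168.1.21/24", "192.168.1.31/24"),
}


def shares_subnet(cidr_a, cidr_b):
    """Return True when both interface addresses are in the same IP network."""
    net_a = ipaddress.ip_interface(cidr_a).network
    net_b = ipaddress.ip_interface(cidr_b).network
    return net_a == net_b


for host, (ib0, ib1) in servers.items():
    print(host, "ib0 and ib1 share a subnet:", shares_subnet(ib0, ib1))

# Suggested layout: one subnet per LNet network (host parts are made up).
print("split layout shares a subnet:",
      shares_subnet("10.0.0.25/24", "10.1.0.25/24"))
```

In the posted configuration every server has both rails in 192.168.1.0/24, which is the situation the reply advises against.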
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces
at lists.lustre.org] On behalf of mages, brian
Sent: Tuesday, February 12, 2013 9:34 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue
Hi Grégoire,
Thanks for the reply.
I thought that might be the issue as well. However, it doesn't explain
why the other two clients work successfully when both interfaces are on the same
network.
...Brian
From: Gregoire Pichon [mailto:gregoire.pichon at bull.net]
Sent: Wednesday, February 13, 2013 4:19 AM
To: mages, brian; lustre-discuss at lists.lustre.org
Subject: RE: Multirail IB Configuration Issue
Hi,
The two LNet networks need separate IPoIB networks.
o2ib0 - 10.0.0.0/24
o2ib1 - 10.1.0.0/24
Nodes with 2 physical interfaces (like you servers) have ib0 on o2ib0 and ib1 on
o2ib1
Nodes with 1 physical interface (like your clients) have 2 logical interfaces:
ib0 on o2ib0 and ib0:0 on o2ib1.
Gr?goire.
De : lustre-discuss-bounces at lists.lustre.org<mailto:lustre-discuss-bounces
at lists.lustre.org> [mailto:lustre-discuss-bounces at lists.lustre.org] De
la part de mages, brian
Envoy? : mardi 12 f?vrier 2013 21:34
? : lustre-discuss at lists.lustre.org<mailto:lustre-discuss at
lists.lustre.org>
Objet : [Lustre-discuss] Multirail IB Configuration Issue
Hi,
I''m having difficulty getting one of my clients to work with a
multirail IB configuration. Here''s what I''ve got:
Host OS Version Lustre Version Function
Storage Interface ib0 Interface ib1
1. bmr1-s7 CentOS 5.7 2.1.1 MGS,MDS,OSS1
mdt,mdt2,ost1->6,ost13->18 192.168.1.25/24 192.168.1.35/24
2. bmr1-s8 CentOS 5.7 2.1.1 OSS2
ost7->12,ost19->24 192.168.1.26/24 192.168.1.36/24
3. bmr1-s5 CentOS 5.7 2.1.1 OSS3
ost25->30 192.168.1.20/24 192.168.1.30/24
4. bmr1-s6 CentOS 5.7 2.1.1 OSS4
ost31->36 192.168.1.21/24 192.168.1.31/24
5. bmr2-s9 CentOS 5.7 2.1.1 Client n/a
192.168.1.209/24
The "/lustre" filesystem consists of mdt and ost1->12 (using
bmr1-s7 and bmr1-s8).
The "/lustre2" filesystem consists of mdt2 and ost13->36 (using
bmr1-s7, bmr1-s8, bmr1-s5, and bmr1-s6).
On each OSS, half the OSTs are available only on ib0 and the other half only on
ib1.
>From bmr1-s5 and bmr1-s6 (using as clients), I can successfully mount and
access "/lustre". I can also successfully mount "/lustre2".
>From bmr2-s9, I can neither mount "/lustre" nor
"/lustre2". Originally, the issue with bmr2-s9 was that it was
running 1.8.6-wc1 (server on CentOS 5.6). Since this config (i.e., multirail)
wasn''t supported on that version, I upgraded to 2.1.1. Originally, I
tried installing and testing the 2.1.1 client without success. Then, since it
had worked with the 2.1.1 server on both bmr1-s5 and bmr1-s6, I thought
I''d try that next. Unfortunately, it still didn''t work.
1a) Here''s what I see on the client when I try to mount
"/lustre":
[root at bmr2-s9 ~]# mount -t lustre 192.168.1.25 at
o2ib:/lustre<mailto:192.168.1.25 at o2ib:/lustre> /mnt/lustre
mount.lustre: mount 192.168.1.25 at o2ib:/lustre<mailto:192.168.1.25 at
o2ib:/lustre> at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root at bmr2-s9 ~]#
1b) Here''s an excerpt from "/var/log/messages" on the client
(after executing the above command):
Feb 12 15:00:54 bmr2-s9 kernel: Lustre:
5512:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.25 at
o2ib->MGC192.168.1.25 at o2ib_0<mailto:MGC192.168.1.25 at
o2ib-%3eMGC192.168.1.25 at o2ib_0> netid 50000: select flavor null
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: MGC192.168.1.25 at
o2ib<mailto:MGC192.168.1.25 at o2ib>: Reactivating import
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5523:0:(ldlm_lib.c:357:client_obd_setup()) can''t add initial connection
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5523:0:(obd_config.c:522:class_setup()) setup
lustre-OST0001-osc-ffff81045d783c00 failed (-2)
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5523:0:(obd_config.c:1361:class_config_llog_handler()) Err -2 on cfg command:
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: cmd=cf003 0:lustre-OST0001-osc
1:lustre-OST0001_UUID 2:192.168.1.35 at o2ib1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 15c-8: MGC192.168.1.25 at
o2ib<mailto:MGC192.168.1.25 at o2ib>: The configuration from log
''lustre-client'' failed (-2). This may be the result of
communication errors between this node and the MGS, a bad configuration, or
other errors. See the syslog for more information.
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5512:0:(llite_lib.c:950:ll_fill_super()) Unable to process log: -2
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
4923:0:(lov_obd.c:927:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5512:0:(obd_config.c:567:class_cleanup()) Device 5 not setup
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5512:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC:
canceling anyway
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5512:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: client ffff81045d783c00 umount complete
Feb 12 15:00:54 bmr2-s9 kernel: LustreError:
5512:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-2)
1c) Here's an excerpt from "/var/log/messages" on the server
(after executing the above command):
Feb 12 15:00:54 bmr1-s7 kernel: Lustre:
25911:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from
2e13dea0-ec9c-0fbd-0f95-7b16246f2626@192.168.1.209@o2ib
t0 exp 0000000000000000 cur 1360699254 last 0
Feb 12 15:00:54 bmr1-s7 kernel: Lustre:
25911:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
MGS->NET_0x50000c0a801d1_UUID netid 50000: select flavor null
2a) Here's what I see on the client when I try to mount
"/lustre" (using the other interface):
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.35@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.35@o2ib:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@bmr2-s9 ~]#
2b) Here's an excerpt from "/var/log/messages" on the client
(after executing the above command):
Feb 12 15:06:57 bmr2-s9 kernel: Lustre:
5580:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
MGC192.168.1.35@o2ib->MGC192.168.1.35@o2ib_0 netid 50000: select flavor null
Feb 12 15:06:57 bmr2-s9 kernel: Lustre:
4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721863
sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to
network error: [sent 1360699617] [real_sent 1360699617] [current 1360699617]
[deadline 5s] [delay -5s] req@ffff81043b76e400 x1426793186721863/t0(0)
o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699622 ref 1 fl
Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:06:57 bmr2-s9 kernel: Lustre:
4948:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar
message
Feb 12 15:06:57 bmr2-s9 kernel: LustreError:
3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected:
o2iblnd fatal error
Feb 12 15:06:57 bmr2-s9 kernel: LustreError:
3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 1 previous similar message
Feb 12 15:07:03 bmr2-s9 kernel: LustreError:
5580:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired
req@ffff81043b76e000 x1426793186721864/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25
lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:03 bmr2-s9 kernel: LustreError:
5580:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 6 previous similar
messages
Feb 12 15:07:22 bmr2-s9 kernel: Lustre:
4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried
all connections, increasing latency to 5s
Feb 12 15:07:22 bmr2-s9 kernel: Lustre:
4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721868
sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to
network error: [sent 1360699642] [real_sent 1360699642] [current 1360699642]
[deadline 10s] [delay -10s] req@ffff810430e30800 x1426793186721868/t0(0)
o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699652 ref 1 fl
Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:22 bmr2-s9 kernel: LustreError:
3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected:
o2iblnd fatal error
Feb 12 15:07:24 bmr2-s9 kernel: LustreError:
5591:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired
req@ffff81045d7ce800 x1426793186721867/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25
lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:24 bmr2-s9 kernel: LustreError:
5591:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar
message
Feb 12 15:07:47 bmr2-s9 kernel: Lustre:
4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried
all connections, increasing latency to 10s
Feb 12 15:07:47 bmr2-s9 kernel: Lustre:
4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721872
sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to
network error: [sent 1360699667] [real_sent 1360699667] [current 1360699667]
[deadline 15s] [delay -15s] req@ffff810444576c00 x1426793186721872/t0(0)
o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699682 ref 1 fl
Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:47 bmr2-s9 kernel: LustreError:
3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected:
o2iblnd fatal error
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile
'lustre-client' could not be read from the MGS. Does that
filesystem exist?
Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete
Feb 12 15:07:54 bmr2-s9 kernel: LustreError:
5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-22)
2c) Here's an excerpt from "/var/log/messages" on the server
(after executing the above command):
Feb 12 15:06:57 bmr1-s7 kernel: LustreError:
9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept
192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid
192.168.1.35@o2ib
Feb 12 15:06:57 bmr1-s7 kernel: LustreError:
9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar
messages
Feb 12 15:07:22 bmr1-s7 kernel: LustreError:
9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept
192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid
192.168.1.35@o2ib
3) Here's what one of the MDTs looks like (the other is similarly
configured):
[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1
Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x105
(MDT MGS writeconf )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1
exiting before disk write.
[root@bmr1-s7 ~]#
4) Here's what one of the OSTs looks like (the others are similarly
configured):
[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0
exiting before disk write.
[root@bmr1-s7 ~]#
I'd appreciate any help or direction on a potential resolution. Let me
know what additional information is needed, if any. Hopefully, I'm
just missing something simple.
Thanks in advance,
...Brian
Hi,
It appears that I've resolved the issue and therefore wanted to provide
an update to this list. As I noted in the description of my configuration, the
client only has a single IB interface. After changing the options for lnet in
"/etc/modprobe.conf" (on the client) from "options lnet
networks=o2ib0(ib0)" to "options lnet
networks=o2ib0(ib0),o2ib1(ib0)", things started working.
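For reference, the change described above amounts to the following edit on the client. This only restates the two option lines quoted in this message; the comments are added for illustration, and the remark about o2ib1 is an inference from the earlier client log (the failed initial connection to 192.168.1.35@o2ib1), not something confirmed in the thread.

```text
# /etc/modprobe.conf on the client (single IB interface, ib0)

# Before: only the o2ib0 network is defined on the client.
# options lnet networks=o2ib0(ib0)

# After: both LNet networks are bound to the same physical interface.
options lnet networks=o2ib0(ib0),o2ib1(ib0)
```

After the LNet modules are reloaded, running `lctl list_nids` on the client would be expected to list one NID per configured network.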
Now, I said "appears" above because I am seeing an issue that
I've not seen in the past. Occasionally, while testing workloads with
8 concurrent clients, I see a client being evicted. The stack trace is not
always the same. Here's an excerpt from "/var/log/messages":
Feb 26 11:26:05 bmr2-s14 kernel: Lustre:
7648:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1361895936/real 0] req@ffff81013fe3d800
x1428048654366757/t0(0) o4->lustre2-OST0015-osc-ffff810229235c00@192.168.1.31@o2ib1:6/4
lens 456/416 e 0 to 1 dl 1361895943 ref 3 fl
Rpc:X/0/ffffffff rc 0/-1
Feb 26 11:26:05 bmr2-s14 kernel: Lustre:
7648:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 10 previous similar
messages
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: lustre2-OST0010-osc-ffff810229235c00:
Connection to lustre2-OST0010 (at 192.168.1.20@o2ib) was lost;
in progress operations using this service will wait for recovery to complete
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: Skipped 2 previous similar messages
Feb 26 11:26:21 bmr2-s14 kernel: Lustre:
7647:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1361895964/real 0] req@ffff8102438a2800
x1428048654378315/t0(0) o400->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4
lens 192/192 e 0 to 1 dl 1361895981 ref 2 fl
Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:26:21 bmr2-s14 kernel: Lustre:
7647:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 14 previous similar
messages
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: lustre2-OST0013-osc-ffff810229235c00:
Connection restored to lustre2-OST0013 (at 192.168.1.31@o2ib1)
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: Skipped 8 previous similar messages
Feb 26 11:26:32 bmr2-s14 kernel: LNetError:
7580:0:(o2iblnd_cb.c:2989:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3
seconds
Feb 26 11:26:32 bmr2-s14 kernel: LNetError:
7580:0:(o2iblnd_cb.c:3052:kiblnd_check_conns()) Timed out RDMA with
192.168.1.20@o2ib (55): c: 8, oc: 0, rc: 16
Feb 26 11:27:21 bmr2-s14 kernel: Lustre:
7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1361896015/real 0] req@ffff810082199800
x1428048654380582/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4
lens 368/512 e 0 to 1 dl 1361896041 ref 2 fl
Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:27:21 bmr2-s14 kernel: Lustre:
7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 7 previous similar
messages
Feb 26 11:29:11 bmr2-s14 kernel: Lustre:
7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1361896115/real 0] req@ffff81009d25ec00
x1428048654380680/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4
lens 368/512 e 0 to 1 dl 1361896151 ref 2 fl
Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:29:11 bmr2-s14 kernel: Lustre:
7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 5 previous similar
messages
Feb 26 11:29:15 bmr2-s14 kernel: INFO: task iozone:9201 blocked for more than
120 seconds.
Feb 26 11:29:15 bmr2-s14 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 11:29:15 bmr2-s14 kernel: iozone D ffffffff801546d1 0 9201
1 9202 9269 7846 (NOTLB)
Feb 26 11:29:15 bmr2-s14 kernel: ffff8101278f5aa8 0000000000000082
ffff8101278f5ab8 ffffffff80062ff2
Feb 26 11:29:15 bmr2-s14 kernel: ffff81021dbaddf0 0000000000000007
ffff81014f521820 ffff810108617100
Feb 26 11:29:15 bmr2-s14 kernel: 00003976bb915dac 0000000000001fbe
ffff81014f521a08 000000018006ec8f
Feb 26 11:29:15 bmr2-s14 kernel: Call Trace:
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8006ec8f>]
do_gettimeofday+0x40/0x90
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80028d0e>] sync_page+0x0/0x43
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800637ce>]
io_schedule+0x3f/0x67
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80028d4c>] sync_page+0x3e/0x43
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800350d9>]
wait_on_page_bit+0x6c/0x72
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800a2e8b>]
wake_bit_function+0x0/0x23
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80047cae>]
pagevec_lookup_tag+0x1a/0x21
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8001d19f>]
mpage_writepages+0x18d/0x37d
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e7f850>]
:lustre:ll_writepage+0x0/0x430
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8005a8a6>]
do_writepages+0x20/0x2f
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8004f767>]
__filemap_fdatawrite_range+0x50/0x5b
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800c8cf4>]
sync_page_range+0x3d/0xa0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800c8ff2>]
generic_file_writev+0x8a/0xa3
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88ea430d>]
:lustre:vvp_io_write_start+0xfd/0x1b0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88aaea50>]
:obdclass:cl_io_start+0x90/0xf0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88ab1718>]
:obdclass:cl_io_loop+0x88/0x130
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e5d16e>]
:lustre:ll_file_io_generic+0x43e/0x480
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e5d335>]
:lustre:ll_file_writev+0x185/0x1f0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e66a71>]
:lustre:ll_file_write+0x121/0x190
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80016b92>]
vfs_write+0xce/0x174
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8001745b>] sys_write+0x45/0x6e
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 26 11:29:15 bmr2-s14 kernel:
Here's some additional info showing loss of connection to 3 of the 6
OSTs located on this OSS (on the .20@o2ib interface):
[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_conn_uuid
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_server_uuid
lustre2-OST0000_UUID FULL
lustre2-OST0001_UUID FULL
lustre2-OST0002_UUID FULL
lustre2-OST0003_UUID FULL
lustre2-OST0004_UUID FULL
lustre2-OST0005_UUID FULL
lustre2-OST0006_UUID FULL
lustre2-OST0007_UUID FULL
lustre2-OST0008_UUID FULL
lustre2-OST0009_UUID FULL
lustre2-OST000a_UUID FULL
lustre2-OST000b_UUID FULL
lustre2-OST000c_UUID CONNECTING
lustre2-OST000d_UUID FULL
lustre2-OST000e_UUID CONNECTING
lustre2-OST000f_UUID FULL
lustre2-OST0010_UUID CONNECTING
lustre2-OST0011_UUID FULL
lustre2-OST0012_UUID FULL
lustre2-OST0013_UUID FULL
lustre2-OST0014_UUID FULL
lustre2-OST0015_UUID FULL
lustre2-OST0016_UUID FULL
lustre2-OST0017_UUID FULL
[root@bmr2-s14 ~]#
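With 24 targets, the few that are not in the FULL state are easy to miss by eye. A minimal sketch of a filter for the listing above follows; the function name `not_full` is made up for illustration, and it assumes the two-column "<target>_UUID <STATE>" format shown in this output.

```shell
#!/bin/sh
# Print only the OSC imports whose state is anything other than FULL.
# Expects lines of the form "<target>_UUID <STATE>" on stdin or in files.
# not_full is a hypothetical helper name, not a Lustre utility.
not_full() {
    awk '$2 != "FULL" { print $1, $2 }' "$@"
}

# Self-contained demonstration using three lines copied from the listing above.
printf '%s\n' \
    'lustre2-OST000b_UUID FULL' \
    'lustre2-OST000c_UUID CONNECTING' \
    'lustre2-OST000d_UUID FULL' | not_full
```

On a client, the same function could be fed directly from the proc files, for example `cat /proc/fs/lustre/osc/lustre2-OST*-osc-*/ost_server_uuid | not_full`.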
Based on some research, I've experimented with setting "options
ko2iblnd peer_credits=16 concurrent_sends=16" in /etc/modprobe.conf and
this has made the issue occur less frequently. However, it is still occurring.
I'm not sure if this has something to do with both server interfaces
being located on the same network or something else.
Any input would be appreciated.
Thanks,
...Brian
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces
at lists.lustre.org] On Behalf Of mages, brian
Sent: Tuesday, February 12, 2013 3:34 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue
On Tue, Feb 26, 2013 at 01:04:06PM -0500, mages, brian wrote:
> Hi,
>
> It appears that I've resolved the issue and therefore wanted to provide an
> update to this list. As I noted in the description of my configuration, the
> client only has a single IB interface. After changing the options for lnet in
> "/etc/modprobe.conf" (on the client) from "options lnet networks=o2ib0(ib0)"
> to "options lnet networks=o2ib0(ib0),o2ib1(ib0)", things started working.
Why do you want two o2ib networks over the same interface?
> ......
> Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:2989:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
> Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:3052:kiblnd_check_conns()) Timed out RDMA with 192.168.1.20@o2ib (55): c: 8, oc: 0, rc: 16
This often indicates a problem with the underlying network, i.e. the HCA
couldn't complete an outgoing message in time: either something is wrong on
the network or with 192.168.1.20@o2ib. Did you see any errors on
192.168.1.20@o2ib too?
- Isaac
Hi,
We had similar problems when we tried to set up two LNet networks on the same
interface. I think Gregoire got the solution: you create an alias interface on
the client (ib0:0) and you'll have two logical interfaces.
The LNet configuration on clients should then be something like this:
o2ib0(ib0), o2ib1(ib0:0)
Hope this helps,
Diego
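A rough sketch of what the alias-interface suggestion could look like on the client is given below. It is an illustration under stated assumptions, not a tested configuration: the alias address 192.168.1.219 is made up for the example (any free address on the subnet would do), and the thread does not confirm that this resolved the timeouts.

```text
# On the client: add an IP alias so ib0 is presented as two logical interfaces.
# 192.168.1.219 is a made-up example address, not taken from the thread.
ifconfig ib0:0 192.168.1.219 netmask 255.255.255.0 up

# /etc/modprobe.conf on the client: one LNet network per logical interface.
options lnet networks=o2ib0(ib0),o2ib1(ib0:0)
```

The design point is that each LNet network gets its own interface name, so o2ib0 and o2ib1 no longer share a single entry for ib0.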