Hi,

I'm having difficulty getting one of my clients to work with a multirail IB configuration. Here's what I've got:

   Host     OS Version  Lustre Version  Function      Storage                     Interface ib0     Interface ib1
1. bmr1-s7  CentOS 5.7  2.1.1           MGS,MDS,OSS1  mdt,mdt2,ost1->6,ost13->18  192.168.1.25/24   192.168.1.35/24
2. bmr1-s8  CentOS 5.7  2.1.1           OSS2          ost7->12,ost19->24          192.168.1.26/24   192.168.1.36/24
3. bmr1-s5  CentOS 5.7  2.1.1           OSS3          ost25->30                   192.168.1.20/24   192.168.1.30/24
4. bmr1-s6  CentOS 5.7  2.1.1           OSS4          ost31->36                   192.168.1.21/24   192.168.1.31/24
5. bmr2-s9  CentOS 5.7  2.1.1           Client        n/a                         192.168.1.209/24

The "/lustre" filesystem consists of mdt and ost1->12 (using bmr1-s7 and bmr1-s8).
The "/lustre2" filesystem consists of mdt2 and ost13->36 (using bmr1-s7, bmr1-s8, bmr1-s5, and bmr1-s6).
On each OSS, half the OSTs are available only on ib0 and the other half only on ib1.

From bmr1-s5 and bmr1-s6 (using as clients), I can successfully mount and access "/lustre". I can also successfully mount "/lustre2".
From bmr2-s9, I can neither mount "/lustre" nor "/lustre2".

Originally, the issue with bmr2-s9 was that it was running 1.8.6-wc1 (server on CentOS 5.6). Since this config (i.e., multirail) wasn't supported on that version, I upgraded to 2.1.1. Originally, I tried installing and testing the 2.1.1 client without success. Then, since it had worked with the 2.1.1 server on both bmr1-s5 and bmr1-s6, I thought I'd try that next. Unfortunately, it still didn't work.

1a) Here's what I see on the client when I try to mount "/lustre":

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]#

1b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:00:54 bmr2-s9 kernel: Lustre: 5512:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.25@o2ib->MGC192.168.1.25@o2ib_0 netid 50000: select flavor null
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: MGC192.168.1.25@o2ib: Reactivating import
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(ldlm_lib.c:357:client_obd_setup()) can't add initial connection
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:522:class_setup()) setup lustre-OST0001-osc-ffff81045d783c00 failed (-2)
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:1361:class_config_llog_handler()) Err -2 on cfg command:
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: cmd=cf003 0:lustre-OST0001-osc 1:lustre-OST0001_UUID 2:192.168.1.35@o2ib1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 15c-8: MGC192.168.1.25@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(llite_lib.c:950:ll_fill_super()) Unable to process log: -2
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 4923:0:(lov_obd.c:927:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_config.c:567:class_cleanup()) Device 5 not setup
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: client ffff81045d783c00 umount complete
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-2)

1c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 2e13dea0-ec9c-0fbd-0f95-7b16246f2626@192.168.1.209@o2ib t0 exp 0000000000000000 cur 1360699254 last 0
Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x50000c0a801d1_UUID netid 50000: select flavor null

2a) Here's what I see on the client when I try to mount "/lustre" (using the other interface):

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.35@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.35@o2ib:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@bmr2-s9 ~]#

2b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 5580:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.35@o2ib->MGC192.168.1.35@o2ib_0 netid 50000: select flavor null
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721863 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699617] [real_sent 1360699617] [current 1360699617] [deadline 5s] [delay -5s] req@ffff81043b76e400 x1426793186721863/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699622 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 1 previous similar message
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81043b76e000 x1426793186721864/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 5s
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721868 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699642] [real_sent 1360699642] [current 1360699642] [deadline 10s] [delay -10s] req@ffff810430e30800 x1426793186721868/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699652 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:22 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81045d7ce800 x1426793186721867/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 10s
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721872 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699667] [real_sent 1360699667] [current 1360699667] [deadline 15s] [delay -15s] req@ffff810444576c00 x1426793186721872/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699682 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:47 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile 'lustre-client' could not be read from the MGS. Does that filesystem exist?
Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-22)

2c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar messages
Feb 12 15:07:22 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib

3) Here's what one of the MDTs looks like (the other is similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5 (MDT MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1

Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x105 (MDT MGS writeconf )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1

exiting before disk write.
[root@bmr1-s7 ~]#

4) Here's what one of the OSTs looks like (the others are similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2 (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0

Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x102 (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0

exiting before disk write.
[root@bmr1-s7 ~]#

I'd appreciate any help or direction on a potential resolution. Let me know what additional information is needed, if any. Hopefully, I'm just missing something simple.

Thanks in advance,
...Brian
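
One possible reading of the first failure (1b) is that the client is handed a NID on o2ib1 while it only has an interface on o2ib. The sketch below is a hedged illustration of that check, not a confirmed diagnosis: the function names are hypothetical, and the client network list is an assumption taken from the single client NID (192.168.1.209@o2ib) that appears in the server log. On a live client the list would come from `lctl list_nids`.

```shell
#!/bin/sh
# Print the LNet network part of a NID, e.g. "192.168.1.35@o2ib1" -> "o2ib1".
nid_net() {
    printf '%s\n' "${1##*@}"
}

# "o2ib" and "o2ib0" name the same LNet network (network number 0),
# so append the implicit 0 when the name carries no number.
normalize_net() {
    case "$1" in
        *[0-9]) printf '%s\n' "$1" ;;
        *)      printf '%s0\n' "$1" ;;
    esac
}

# Return 0 if network $1 is in the space-separated list of local networks $2.
net_is_local() {
    want=$(normalize_net "$1")
    for n in $2; do
        [ "$(normalize_net "$n")" = "$want" ] && return 0
    done
    return 1
}

# NID taken from the failing cfg command in the client log (1b).
failing_nid="192.168.1.35@o2ib1"
# Assumption: the client is configured with a single network, o2ib.
client_nets="o2ib"

if net_is_local "$(nid_net "$failing_nid")" "$client_nets"; then
    echo "client has an interface on $(nid_net "$failing_nid")"
else
    echo "client has no interface on $(nid_net "$failing_nid")"
fi
```

If the assumption about the client's networks holds, the OSC setup for lustre-OST0001 has no local network to reach 192.168.1.35@o2ib1 from, which would be consistent with the "can't add initial connection" error.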
Hi,

The two LNet networks need separate IPoIB networks:

  o2ib0 - 10.0.0.0/24
  o2ib1 - 10.1.0.0/24

Nodes with 2 physical interfaces (like your servers) have ib0 on o2ib0 and ib1 on o2ib1.
Nodes with 1 physical interface (like your clients) have 2 logical interfaces: ib0 on o2ib0 and ib0:0 on o2ib1.

Grégoire.

From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On behalf of mages, brian
Sent: Tuesday, February 12, 2013 21:34
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue
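
The layout Grégoire describes could be expressed with the lnet "networks" module option roughly as follows. This is a sketch under stated assumptions, not a tested configuration: the file location (/etc/modprobe.conf, as commonly used on CentOS 5) and the host addresses are hypothetical, the tcp networks that appear in the mgsnode list are left out, and whether o2iblnd in 2.1.1 accepts an alias name such as ib0:0 here should be verified on the client.

```
# Server with two IB ports (hypothetical addresses):
#   ib0 = 10.0.0.25/24, ib1 = 10.1.0.25/24
options lnet networks="o2ib0(ib0),o2ib1(ib1)"

# Client with one IB port (hypothetical addresses):
#   ib0 = 10.0.0.209/24, alias ib0:0 = 10.1.0.209/24
options lnet networks="o2ib0(ib0),o2ib1(ib0:0)"
```

After the lnet module is reloaded with the new option, `lctl list_nids` on each node should show one NID per configured network.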
Hi Grégoire,

Thanks for the reply. I thought that might be the issue as well. However, it doesn't explain why the other two clients work successfully when both interfaces are on the same network.

...Brian

From: Gregoire Pichon [mailto:gregoire.pichon at bull.net]
Sent: Wednesday, February 13, 2013 4:19 AM
To: mages, brian; lustre-discuss at lists.lustre.org
Subject: RE: Multirail IB Configuration Issue
LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35 at o2ib<mailto:192.168.1.35 at o2ib> rejected: o2iblnd fatal error Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile ''lustre-client'' could not be read from the MGS. Does that filesystem exist? Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount (-22) 2c) Here''s an excerpt from "/var/log/messages" on the server (after executing the above command): Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can''t accept 192.168.1.209 at o2ib<mailto:192.168.1.209 at o2ib> on 192.168.1.25 at o2ib<mailto:192.168.1.25 at o2ib> (ib1:1:192.168.1.35): bad dst nid 192.168.1.35 at o2ib<mailto:192.168.1.35 at o2ib> Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar messages Feb 12 15:07:22 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can''t accept 192.168.1.209 at o2ib<mailto:192.168.1.209 at o2ib> on 192.168.1.25 at o2ib<mailto:192.168.1.25 at o2ib> (ib1:1:192.168.1.35): bad dst nid 192.168.1.35 at o2ib<mailto:192.168.1.35 at o2ib> 3) Here''s what one of the MDTs looks like (the other is similarly configured): [root at bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp checking for existing Lustre data: found CONFIGS/mountdata Reading CONFIGS/mountdata Read previous values: Target: lustre-MDT0000 Index: 0 Lustre FS: lustre Mount type: ldiskfs Flags: 0x5 (MDT MGS ) Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro Parameters: mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1<mailto:mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1> Permanent disk data: Target: lustre-MDT0000 Index: 0 Lustre FS: lustre 
Mount type: ldiskfs Flags: 0x105 (MDT MGS writeconf ) Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro Parameters: mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1<mailto:mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1> exiting before disk write. [root at bmr1-s7 ~]# 4) Here''s what one of the OSTs looks like (the others are similarly configured): [root at bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf checking for existing Lustre data: found CONFIGS/mountdata Reading CONFIGS/mountdata Read previous values: Target: lustre-OST0000 Index: 0 Lustre FS: lustre Mount type: ldiskfs Flags: 0x2 (OST ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1<mailto:mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1> network=o2ib0 Permanent disk data: Target: lustre-OST0000 Index: 0 Lustre FS: lustre Mount type: ldiskfs Flags: 0x102 (OST writeconf ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1<mailto:mgsnode=192.168.1.25 at o2ib,192.168.1.35 at o2ib1,10.244.78.88 at tcp,192.168.1.25 at tcp1> network=o2ib0 exiting before disk write. [root at bmr1-s7 ~]# I''d appreciate any help or direction on a potential resolution. Let me know what additional information is needed, if any. Hopefully, I''m just missing something simple. Thanks in advance, ...Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20130213/efc8be48/attachment-0001.html
Hi,

It appears that I've resolved the issue and therefore wanted to provide an update to this list. As I noted in the description of my configuration, the client only has a single IB interface. After changing the options for lnet in "/etc/modprobe.conf" (on the client) from "options lnet networks=o2ib0(ib0)" to "options lnet networks=o2ib0(ib0),o2ib1(ib0)", things started working.

Now, I said "appears" above because I am seeing an issue that I've not seen in the past. Occasionally, while testing workloads with 8 concurrent clients, I see a client being evicted. The stack trace is not always the same. Here's an excerpt from "/var/log/messages":

Feb 26 11:26:05 bmr2-s14 kernel: Lustre: 7648:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1361895936/real 0] req@ffff81013fe3d800 x1428048654366757/t0(0) o4->lustre2-OST0015-osc-ffff810229235c00@192.168.1.31@o2ib1:6/4 lens 456/416 e 0 to 1 dl 1361895943 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: 7648:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: lustre2-OST0010-osc-ffff810229235c00: Connection to lustre2-OST0010 (at 192.168.1.20@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: Skipped 2 previous similar messages
Feb 26 11:26:21 bmr2-s14 kernel: Lustre: 7647:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1361895964/real 0] req@ffff8102438a2800 x1428048654378315/t0(0) o400->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 192/192 e 0 to 1 dl 1361895981 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:26:21 bmr2-s14 kernel: Lustre: 7647:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: lustre2-OST0013-osc-ffff810229235c00: Connection restored to lustre2-OST0013 (at 192.168.1.31@o2ib1)
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: Skipped 8 previous similar messages
Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:2989:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:3052:kiblnd_check_conns()) Timed out RDMA with 192.168.1.20@o2ib (55): c: 8, oc: 0, rc: 16
Feb 26 11:27:21 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1361896015/real 0] req@ffff810082199800 x1428048654380582/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 368/512 e 0 to 1 dl 1361896041 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:27:21 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Feb 26 11:29:11 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1361896115/real 0] req@ffff81009d25ec00 x1428048654380680/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 368/512 e 0 to 1 dl 1361896151 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:29:11 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Feb 26 11:29:15 bmr2-s14 kernel: INFO: task iozone:9201 blocked for more than 120 seconds.
Feb 26 11:29:15 bmr2-s14 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 11:29:15 bmr2-s14 kernel: iozone D ffffffff801546d1 0 9201 1 9202 9269 7846 (NOTLB)
Feb 26 11:29:15 bmr2-s14 kernel: ffff8101278f5aa8 0000000000000082 ffff8101278f5ab8 ffffffff80062ff2
Feb 26 11:29:15 bmr2-s14 kernel: ffff81021dbaddf0 0000000000000007 ffff81014f521820 ffff810108617100
Feb 26 11:29:15 bmr2-s14 kernel: 00003976bb915dac 0000000000001fbe ffff81014f521a08 000000018006ec8f
Feb 26 11:29:15 bmr2-s14 kernel: Call Trace:
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80028d0e>] sync_page+0x0/0x43
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800637ce>] io_schedule+0x3f/0x67
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80028d4c>] sync_page+0x3e/0x43
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800350d9>] wait_on_page_bit+0x6c/0x72
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80047cae>] pagevec_lookup_tag+0x1a/0x21
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8001d19f>] mpage_writepages+0x18d/0x37d
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e7f850>] :lustre:ll_writepage+0x0/0x430
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8005a8a6>] do_writepages+0x20/0x2f
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8004f767>] __filemap_fdatawrite_range+0x50/0x5b
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800c8cf4>] sync_page_range+0x3d/0xa0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff800c8ff2>] generic_file_writev+0x8a/0xa3
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88ea430d>] :lustre:vvp_io_write_start+0xfd/0x1b0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88aaea50>] :obdclass:cl_io_start+0x90/0xf0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88ab1718>] :obdclass:cl_io_loop+0x88/0x130
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e5d16e>] :lustre:ll_file_io_generic+0x43e/0x480
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e5d335>] :lustre:ll_file_writev+0x185/0x1f0
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff88e66a71>] :lustre:ll_file_write+0x121/0x190
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff80016b92>] vfs_write+0xce/0x174
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8001745b>] sys_write+0x45/0x6e
Feb 26 11:29:15 bmr2-s14 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 26 11:29:15 bmr2-s14 kernel:

Here's some additional info showing loss of connection to 3 of the 6 OSTs located on this OSS (on the .20@o2ib interface):

[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_conn_uuid
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_server_uuid
lustre2-OST0000_UUID FULL
lustre2-OST0001_UUID FULL
lustre2-OST0002_UUID FULL
lustre2-OST0003_UUID FULL
lustre2-OST0004_UUID FULL
lustre2-OST0005_UUID FULL
lustre2-OST0006_UUID FULL
lustre2-OST0007_UUID FULL
lustre2-OST0008_UUID FULL
lustre2-OST0009_UUID FULL
lustre2-OST000a_UUID FULL
lustre2-OST000b_UUID FULL
lustre2-OST000c_UUID CONNECTING
lustre2-OST000d_UUID FULL
lustre2-OST000e_UUID CONNECTING
lustre2-OST000f_UUID FULL
lustre2-OST0010_UUID CONNECTING
lustre2-OST0011_UUID FULL
lustre2-OST0012_UUID FULL
lustre2-OST0013_UUID FULL
lustre2-OST0014_UUID FULL
lustre2-OST0015_UUID FULL
lustre2-OST0016_UUID FULL
lustre2-OST0017_UUID FULL
[root@bmr2-s14 ~]#

Based on some research, I've experimented with setting "options ko2iblnd peer_credits=16 concurrent_sends=16" in /etc/modprobe.conf and this has made the issue occur less frequently. However, it is still occurring. I'm not sure if this has something to do with both server interfaces being located on the same network or something else. Any input would be appreciated.

Thanks,
...Brian

From: lustre-discuss-bounces@lists.lustre.org [mailto:lustre-discuss-bounces@lists.lustre.org] On Behalf Of mages, brian
Sent: Tuesday, February 12, 2013 3:34 PM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue

[snip]
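For anyone following along, here is a minimal sketch of how the import states in the ost_server_uuid listing above could be filtered down to the targets that are not FULL. The function name not_full_imports is made up for illustration, and the /proc path in the comment is simply the one quoted in the message; it may differ on other Lustre versions.

```shell
#!/bin/sh
# Sketch only: print the OSC imports whose state is anything other than FULL.
# Input on stdin: one "<target>_UUID <state>" pair per line, which is the
# layout of the ost_server_uuid files shown in the message above.
not_full_imports() {
    awk '$2 != "FULL" { print $1, $2 }'
}

# Possible use on a client (path copied from the message above; adjust the
# glob to match your own mount instance):
#   cat /proc/fs/lustre/osc/lustre2-OST*-osc-*/ost_server_uuid | not_full_imports
```

Run periodically during a workload, this would make it easier to see whether the stuck imports always belong to the same OSS NID.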
On Tue, Feb 26, 2013 at 01:04:06PM -0500, mages, brian wrote:
> Hi,
>
> It appears that I've resolved the issue and therefore wanted to provide an update to this list. As I noted in the description of my configuration, the client only has a single IB interface. After changing the options for lnet in "/etc/modprobe.conf" (on the client) from "options lnet networks=o2ib0(ib0)" to "options lnet networks=o2ib0(ib0),o2ib1(ib0)", things started working.

Why do you want two o2ib networks over the same interface?

> ......
> Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:2989:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
> Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:3052:kiblnd_check_conns()) Timed out RDMA with 192.168.1.20@o2ib (55): c: 8, oc: 0, rc: 16

This often indicates a problem with the underlying network, i.e. the HCA couldn't complete an outgoing message in time: either something is wrong on the network or with 192.168.1.20@o2ib. Did you see any errors on 192.168.1.20@o2ib too?

- Isaac
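A rough sketch for following up on that question: count the "Timed out RDMA with <NID>" messages per peer in a syslog stream, so the same check can be run on the client and on the server side. The function name rdma_timeouts_by_peer is hypothetical, and the pattern is taken from the log line quoted above; other Lustre versions may word the message differently.

```shell
#!/bin/sh
# Sketch only: count o2iblnd RDMA timeout messages per peer NID.
# Reads syslog lines on stdin and prints "<count> <nid>" per peer.
# Assumes the NID appears in its real form (e.g. 192.168.1.20@o2ib),
# not in the " at " form used by the list archive.
rdma_timeouts_by_peer() {
    sed -n 's/.*Timed out RDMA with \([^ ]*\) .*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Possible use (log path as used elsewhere in this thread):
#   rdma_timeouts_by_peer < /var/log/messages
```

If one peer accounts for nearly all of the timeouts, that would point at that host or its link rather than at the client.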
Hi,

We had similar problems when we tried to set up two lnet networks on the same interface. I think Gregoire got the solution: you create an alias interface on the client (ib0:0) and you'll have two logical interfaces. The lnet configuration on clients should then be something like this: o2ib0(ib0), o2ib1(ib0:0).

Hope this helps,
Diego
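A small sketch of what that suggestion might look like on the client, assuming ib0 is the physical port and ib0:0 the alias. The helper name lnet_alias_options and the alias address in the comment are made up for illustration; whether this also addresses the evictions reported earlier in the thread has not been confirmed here.

```shell
#!/bin/sh
# Sketch only: build the lnet option line for a client that reaches two
# o2ib networks through one physical IPoIB port plus an alias on that port.
lnet_alias_options() {
    phys=$1                # physical IPoIB interface, e.g. ib0
    alias_if="${phys}:0"   # alias interface, as suggested above
    printf 'options lnet networks=o2ib0(%s),o2ib1(%s)\n' "$phys" "$alias_if"
}

# Prints the line that would go into /etc/modprobe.conf on the client.
lnet_alias_options ib0

# The alias needs its own address before lnet is loaded, for example
# (the address below is hypothetical, pick one that fits your fabric):
#   ifconfig ib0:0 192.168.1.219 netmask 255.255.255.0 up
```

The lnet module has to be reloaded for a changed networks option to take effect, so any mounted Lustre filesystems would need to be unmounted first.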