Howdy,

Lustre 1.8.5 using the EL5-provided RPMs on both clients and servers:

    lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
    lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

The servers and clients are all running CentOS 5.5 x86_64 with kernel 2.6.18-194.17.1.el5 (servers running the Lustre-patched kernel).

We have two InfiniBand networks, o2ib0 and o2ib1, as well as Ethernet. Here's the lnet modprobe option used on the MDS and OSSes:

    options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(eth0)"

The compute nodes that mount via tcp0 don't have any problems. The compute nodes that mount via o2ib1 don't have any problems. The compute nodes attached to o2ib0 fail to mount the Lustre file system at boot (output of dmesg is at the end).

The compute nodes are Dell M610 blades. There are 3 Dell M1000e chassis switches (Mellanox InfiniScale IV M3601Q 32-port 40Gb/s switches), each attached to a QLogic 12300 36-port QDR switch via 8 cables. The compute nodes are directly attached to the M3601Q switches internally (blades). The Lustre servers are attached directly to the QLogic 12300 switch. All of our InfiniBand tests have checked out and the switches do not report any errors.

Here's the scenario that enables me to mount the Lustre file system after the compute node has booted:

1. ssh to the node.
2. Check ibstat to ensure the card reports its port as active: success.
3. Run ibswitches as a test to ensure it can see the switches: success.
4. Ping another IPoIB address using regular ping:

    # ping 192.168.2.20
    PING 192.168.2.20 (192.168.2.20) 56(84) bytes of data.
    64 bytes from 192.168.2.20: icmp_seq=1 ttl=64 time=1.93 ms

5. Try to ping the MDS using lctl ping:

    # lctl ping 192.168.2.20@o2ib
    failed to ping 192.168.2.20@o2ib: Input/output error

6. Try it again (this step isn't actually necessary; after the single failed ping, I can then mount):

    # lctl ping 192.168.2.20@o2ib
    12345-0@lo
    12345-192.168.2.20@o2ib
    12345-192.168.3.20@o2ib1
    12345-172.20.0.20@tcp

7. Now mount Lustre:

    # mount /lustre
    # mount | grep lustre
    192.168.2.20@o2ib:/lustre on /lustre type lustre (rw,_netdev)

Instead of doing the "lctl ping" I can also do a mount, which will fail, followed by another which will succeed.

Here are the messages logged during boot. Anyone have any suggestions?

Thanks,

Mike

Lustre: Listener bound to ib0:192.168.2.229:987:mlx4_0
Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Lustre: Added LNI 192.168.2.229@o2ib [8/64/0/180]
Lustre: Lustre Client File System; http://www.lustre.org/
Lustre: 4989:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1363390903615489 sent from MGC192.168.2.20@o2ib to NID 192.168.2.20@o2ib 5s ago has timed out (5s prior to deadline). req@ffff81062b455c00 x1363390903615489/t0 o250->MGS@MGC192.168.2.20@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1300230903 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 5076:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81062b455000 x1363390903615491/t0 o501->MGS@MGC192.168.2.20@o2ib_0:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC192.168.2.20@o2ib: The configuration from log 'lustre-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 5076:0:(llite_lib.c:1079:ll_fill_super()) Unable to process log: -108
Lustre: client ffff81062dddf400 umount complete
LustreError: 5076:0:(obd_mount.c:2050:lustre_fill_super()) Unable to mount (-108)
ib_srp: ASYNC event= 11 on device= mlx4_0
ib_srp: ASYNC event= 17 on device= mlx4_0
ib_srp: ASYNC event= 9 on device= mlx4_0
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
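(For anyone wanting to script the workaround above rather than logging in by hand, a minimal sketch -- assuming the same MGS NID 192.168.2.20@o2ib and the /lustre fstab entry shown here, run late in boot, e.g. from rc.local -- could be:

    # the first lctl ping is expected to fail with Input/output error; the second normally succeeds
    lctl ping 192.168.2.20@o2ib || true
    lctl ping 192.168.2.20@o2ib && mount /lustre

This only papers over the symptom; it doesn't explain why the first connection attempt over o2ib0 fails.)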
If you are using DDR and QDR cards, or any two different cards in the same machine, there is no guarantee that the same IB cards get assigned to ib0 and ib1.

To fix that problem you need to comment out the following 3 lines in /etc/init.d/openibd:

    #for i in `grep "^driver: " /etc/sysconfig/hwconf | sed -e 's/driver: //' | grep -w "ib_mthca\|ib_ipath\|mlx4_core\|cxgb3\|iw_nes"`; do
    #    load_modules $i
    #done

and include the following lines instead (we wanted the DDR card to be ib0 and the QDR card to be ib1):

    load_modules ib_mthca
    /bin/sleep 10
    load_modules mlx4_core

You will also need to restart openibd once again (we included it in rc.local) to make sure that the same IB cards are assigned to the devices ib0 and ib1.

Nirmal
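(A quick sanity check after restarting openibd, not part of Nirmal's instructions, to confirm the cards came up in the intended order:

    ibstat                       # both HCAs (e.g. mthca0 for the DDR card, mlx4_0 for the QDR card) should show their ports Active
    ifconfig ib0 ; ifconfig ib1  # each interface should carry the IPoIB address expected for that fabric

Only then let lnet load and mount Lustre.)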
Thanks, I forgot to include the card info:

The servers each have a single IB card: dual-port MT26528 QDR.
o2ib0(ib0) on each server is attached to the QLogic switch (with three attached M3601Q switches, 48 attached blades).
o2ib1(ib1) on each server is attached to a stack of two M3601Q switches with 24 attached blades.

The blades connected to o2ib0 each have an MT26428 QDR IB card.
The blades connected to o2ib1 each have an MT25418 DDR IB card.
On 2011-03-16, at 3:04 PM, Mike Hanby wrote:
> The blades connected to o2ib0 each have an MT26428 QDR IB card
> The blades connected to o2ib1 each have an MT25418 DDR IB card

You may also want to check out the ip2nets option for specifying the Lustre networks. It is made to handle configuration issues like this, where the interface name is not constant across client/server nodes.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
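(For illustration only, and going by the addressing visible in Mike's lctl ping output earlier -- 192.168.2.x on o2ib0, 192.168.3.x on o2ib1, 172.20.x.x on tcp0 -- an ip2nets setting replacing the networks= option might look something like:

    options lnet ip2nets="o2ib0(ib0) 192.168.2.* ; o2ib1(ib1) 192.168.3.* ; tcp0(eth0) 172.20.*.*"

The same modprobe option can then be shipped to servers and clients alike, each node joining only the networks whose IP ranges it matches.)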
Hi,

We are having this problem right now with our Lustre 2.0. We tried the proposed solutions but didn't get it to work.

We have 2 QDR IB cards on 4 servers, and we have to do "lctl ping" from each server to every client if we want clients to connect to the servers. We don't have the ib_mthca module loaded because we don't have DDR cards, and we configured ip2nets with no result.

Our ip2nets configuration (the [7-10] interfaces are on servers, the others are on clients):

    o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*

So the only way of having clients connected to servers is doing something like this on every server:

    for i in $CLIENT_IB_LIST ; do
        lctl ping $i@o2ib0
        lctl ping $i@o2ib1
    done

Before "lctl ping" we get messages like this one:

    Lustre: 50389:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-10.50.1.7@o2ib1: peer not alive

After "lctl ping" everything works right.

Maybe I'm missing something, or this is a known bug in Lustre 2.0...
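(One generic check that helps when debugging this kind of setup: after lnet loads, run

    lctl list_nids

on each node. It prints the NIDs the node actually configured, so a client whose single ib0 really joined both networks should list both an ...@o2ib and an ...@o2ib1 NID.)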
Shouldn't your ip2nets look like this:

    o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.0.* ; o2ib1(ib0) 10.50.1.*
On 22/03/2011 14:41, Mike Hanby wrote:
> Shouldn't your ip2nets look like this:
>
> o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.0.* ; o2ib1(ib0) 10.50.1.*

Well, not exactly. My clients have only one IB interface and I want this interface to have access to both LNETs (o2ib0, o2ib1). One interface in two LNETs.
Hi Diego,

Do you have any other module parameters for lnet and the LND?

Regards,
Liang
I think we don't have any other parameters; I checked, and there are no others.

I'm wondering why it's not possible to "lctl ping" from a client to the second interface on a server, and why it is possible to "lctl ping" in the opposite direction once the first connection is established. There are no Lustre routers in the middle. Maybe an OFED bug? Strange, as there's no problem with standard ping.

[root@berlin4 ~]# cat /etc/modprobe.d/lustre.conf
install mdc /sbin/modprobe lquota >/dev/null 2>&1; /sbin/modprobe --ignore-install mdc
#install fld /sbin/modprobe ptlrpc
#install fid /sbin/modprobe ptlrpc
#install mdc /sbin/modprobe ptlrpc
#install lustre /sbin/modprobe ptlrpc
install fld /sbin/modprobe ptlrpc ; /sbin/modprobe --ignore-install fld
install lnet \
  PORTKEEPER=/sbin/keep_port_988 ;\
  [ -x $PORTKEEPER ] && killall -w `basename $PORTKEEPER` ;\
  LNET_NETWORKS_LIST=""; \
  if /sbin/lsmod|grep -qE "^elan"; then \
    LNET_NETWORKS_LIST="elan0"; \
  fi; \
  if /sbin/lsmod|grep -qE "^ib_mthca|^mlx4_ib"; then \
    LNET_NETWORKS_LIST="o2ib(ib0),o2ib1(ib1)"; \
  fi; \
  if [ -z "${LNET_NETWORKS_LIST}" ]; then \
    LNET_OPTIONS="networks=tcp0(eth0)"; \
  else \
    LNET_OPTIONS="networks=${LNET_NETWORKS_LIST},tcp0(eth0)"; \
  fi; \
  if [ -e /etc/lustre/routers.conf ]; then . /etc/lustre/routers.conf ; fi;\
  if [ -e /etc/lustre/multirail.conf ]; then . /etc/lustre/multirail.conf ; fi;\
  /sbin/modprobe --ignore-install lnet $LNET_OPTIONS $LNET_ROUTER_OPTIONS $LNET_MULTIRAIL_OPTIONS
remove libcfs \
  PORTKEEPER=/sbin/keep_port_988 ;\
  attempt=0;\
  while [ 1 ];\
  do\
    rmmod `lsmod | grep libcfs | awk '{ print $4}' | tr ',' ' '` >/dev/null 2>&1;\
    [ $? == 0 ] && break;\
    attempt=`expr $attempt + 1`;\
    [ $attempt -gt 4 ] && break;\
  done;\
  modprobe -r --ignore-remove libcfs;\
  modprobe -r ldiskfs >/dev/null 2>&1;\
  if [ $? == 0 ] && [ $attempt -le 4 ] && [ -x $PORTKEEPER ]; then \
    $PORTKEEPER > /dev/null ;\
  fi
options lpfc lpfc_sg_seg_cnt=256

[root@berlin4 ~]# cat /etc/lustre/multirail.conf
#!/bin/sh
unset LNET_OPTIONS
unset LNET_ROUTER_OPTIONS
unset LNET_MULTIRAIL_OPTIONS
#export LNET_MULTIRAIL_OPTIONS="networks=o2ib0(ib0),o2ib1(ib1)"
export LNET_ROUTER_OPTIONS="ip2nets=\"o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.* \""
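(Reading the install rule and multirail.conf together, and assuming routers.conf is absent: sourcing multirail.conf unsets LNET_OPTIONS and LNET_MULTIRAIL_OPTIONS, so the module should effectively be loaded as

    /sbin/modprobe --ignore-install lnet ip2nets="o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*"

i.e. only the ip2nets option reaches lnet and no networks= option remains.)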
--
Diego Moreno
Bull S.A.S
1, rue de Provence, B.P. 208
38432 ECHIROLLES CEDEX, FRANCE
Phone: +33 (0) 4 76 29 71 86 (229-7186)
http://www.bull-world.com/