Problem is similer to http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html But by looking at the thread could not really get the solution for the problem. I have two RHEL5 Linux servers installed with following packages - kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp e2fsprogs-1.40.7.sun3-0redhat machine 1: with ib0 IP address : 172.24.198.111 machine 2: with ib0 IP address : 172.24.198.112 /etc/modprobe.conf contains options lnet networks=o2ib TCP networking worked fine and now I am trying with Infiniband network finding it difficult in communicating with IB nodes mounting effort throghs me the following error [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error Is the MGS running? /var/log/messages : Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has timed out (limit 5s). Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration failed for lustre-OSTffff: -5 Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running? Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5 Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not registered Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success) Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0 Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5) All pinging efforts also failed to the IB NIDS local/remote can ping the ip address : [root at p186 ~]# ping 172.24.198.112 PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms --- 172.24.198.112 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1000ms rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms [root at p186 ~]# ping 172.24.198.111 PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms --- 172.24.198.111 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1000ms rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms but cant ping the NIDS : [root at p186 ~]# lctl ping 172.24.198.112 at o2ib failed to ping 172.24.198.112 at o2ib: Input/output error [root at p186 ~]# lctl ping 172.24.198.111 at o2ib failed to ping 172.24.198.111 at o2ib: Input/output error Any idea why lnet cant ping NIDS ? some more configurations: [root at p186 ~]# ibstat CA ''mthca0'' CA type: MT23108 Number of ports: 2 Firmware version: 3.5.0 Hardware version: a1 Node GUID: 0x0002c9020021550c Machines are connected via IB switch. Looking forward for help. ~subbu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090115/0af49ff6/attachment-0001.html
Subbu, I''d suggest: 1) make sure ko2iblnd has been brought up (please check if there is any error message when startup ko2iblnd) 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl ping, if it still can''t work please post error messages Regards Liang subbu kl:> Problem is similer to > http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html > But by looking at the thread could not really get the solution for the > problem. > > I have two RHEL5 Linux servers installed with following packages - > > kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 > kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > e2fsprogs-1.40.7.sun3-0redhat > > > machine 1: with ib0 IP address : 172.24.198.111 > machine 2: with ib0 IP address : 172.24.198.112 > > /etc/modprobe.conf contains > options lnet networks=o2ib > > TCP networking worked fine and now I am trying with Infiniband network > finding it difficult in communicating with IB nodes mounting effort > throghs me the following error > > [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 > mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error > Is the MGS running? > > /var/log/messages : > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 > seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with > ordered data mode. > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 > seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with > ordered data mode. > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled > Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from > MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has timed out > (limit 5s). > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration > failed for lustre-OSTffff: -5 > Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error > with the MGS. Is the MGS running? > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5 > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not > registered > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 > success) > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 > goal hits, 0 2^N hits, 0 breaks, 0 lost > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it > took 0 > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 > discarded > Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5) > > All pinging efforts also failed to the IB NIDS local/remote > can ping the ip address : > [root at p186 ~]# ping 172.24.198.112 > PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. > 64 bytes from 172.24.198.112 <http://172.24.198.112>: icmp_seq=1 > ttl=64 time=0.052 ms > 64 bytes from 172.24.198.112 <http://172.24.198.112>: icmp_seq=2 > ttl=64 time=0.024 ms > > --- 172.24.198.112 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms > rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms > [root at p186 ~]# ping 172.24.198.111 > PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. > 64 bytes from 172.24.198.111 <http://172.24.198.111>: icmp_seq=1 > ttl=64 time=2.16 ms > 64 bytes from 172.24.198.111 <http://172.24.198.111>: icmp_seq=2 > ttl=64 time=0.296 ms > > --- 172.24.198.111 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms > rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms > > but cant ping the NIDS : > [root at p186 ~]# lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > [root at p186 ~]# lctl ping 172.24.198.111 at o2ib > failed to ping 172.24.198.111 at o2ib: Input/output error > > Any idea why lnet cant ping NIDS ? > > some more configurations: > [root at p186 ~]# ibstat > CA ''mthca0'' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.5.0 > Hardware version: a1 > Node GUID: 0x0002c9020021550c > > Machines are connected via IB switch. > > Looking forward for help. > > ~subbu > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
Liang, after executing following echo : echo +neterror > /proc/sys/lnet/printk now lctlt ping shows the following error # lctl ping 172.24.198.112 at o2ib failed to ping 172.24.198.112 at o2ib: Input/output error Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198.112 at o2ib: connection failed Looks like some problem with "IB connection manager" ! 1. do we have any help docs to setup IPoIB and Lustre, lustre operation manual has very minimal info about this . I think I am missing some IPoIB setup part here. 2. or is it mannual assignment of IP addresses to "ib0" is creating some problem *Some more supporting info : *subnet manager of following version is also running : OpenSM 3.1.8 Initially I got this error for MDS mount Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get IP address for interface ib0 Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB interface ib0: -99 Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg) Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd . . . then I mannually set the IP address for ib0 as folows : ifconfig ib0 172.24.198.111 [root at p186 ~]# ifconfig ib0 ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0 UP BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) then it mounted sucessfully Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111 at o2ib [8/64] Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, initializing Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled Jan 16 09:47:09 p128 kernel: Lustre: 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device /dev/loop0 has started . . . ~subbu On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:> Subbu, > > I''d suggest: > 1) make sure ko2iblnd has been brought up (please check if there is any > error message when startup ko2iblnd) > 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl ping, if it > still can''t work please post error messages > > Regards > Liang > > subbu kl: > >> Problem is similer to >> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html >> But by looking at the thread could not really get the solution for the >> problem. >> >> I have two RHEL5 Linux servers installed with following packages - >> >> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 >> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> e2fsprogs-1.40.7.sun3-0redhat >> >> >> machine 1: with ib0 IP address : 172.24.198.111 >> machine 2: with ib0 IP address : 172.24.198.112 >> >> /etc/modprobe.conf contains >> options lnet networks=o2ib >> >> TCP networking worked fine and now I am trying with Infiniband network >> finding it difficult in communicating with IB nodes mounting effort throghs >> me the following error >> >> [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 >> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error >> Is the MGS running? >> >> /var/log/messages : >> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 >> seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 >> seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled >> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from >> MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has timed out >> (limit 5s). >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration >> failed for lustre-OSTffff: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with >> the MGS. Is the MGS running? >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not >> registered >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 >> success) >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 >> goal hits, 0 2^N hits, 0 breaks, 0 lost >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took >> 0 >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 >> discarded >> Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5) >> >> All pinging efforts also failed to the IB NIDS local/remote >> can ping the ip address : >> [root at p186 ~]# ping 172.24.198.112 >> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: icmp_seq=1 ttl=64 >> time=0.052 ms >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: icmp_seq=2 ttl=64 >> time=0.024 ms >> >> --- 172.24.198.112 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms >> [root at p186 ~]# ping 172.24.198.111 >> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: icmp_seq=1 ttl=64 >> time=2.16 ms >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: icmp_seq=2 ttl=64 >> time=0.296 ms >> >> --- 172.24.198.111 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms >> >> but cant ping the NIDS : >> [root at p186 ~]# lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> [root at p186 ~]# lctl ping 172.24.198.111 at o2ib >> failed to ping 172.24.198.111 at o2ib: Input/output error >> >> Any idea why lnet cant ping NIDS ? >> >> some more configurations: >> [root at p186 ~]# ibstat >> CA ''mthca0'' >> CA type: MT23108 >> Number of ports: 2 >> Firmware version: 3.5.0 >> Hardware version: a1 >> Node GUID: 0x0002c9020021550c >> >> Machines are connected via IB switch. >> >> Looking forward for help. >> >> ~subbu >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> > >-- . . . s u b b u "You''ve got to be original, because if you''re like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090116/46bb0933/attachment-0001.html
Subbu, We don''t have any tip for setup IPoIB, looks like linux can''t find the ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I think it''s because you didn''t assign any address to ib0 (or failed to assign address to ib0) before loading o2iblnd in the first try. I can reproduce exactly same error by: 1. modprobe ib_ipoib 2. ifconfig ib0 up // without assign any address 3. modprobe ko2iblnd 4. lctl network up Regards Liang subbu kl:> Liang, > after executing following echo : > echo +neterror > /proc/sys/lnet/printk > > now lctlt ping shows the following error > > # lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112 at o2ib: > ROUTE ERROR -22 > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting > messages for 172.24.198.112 at o2ib: connection failed > > Looks like some problem with "IB connection manager" ! > > 1. do we have any help docs to setup IPoIB and Lustre, lustre > operation manual has very minimal info about this . I think I am > missing some IPoIB setup part here. > 2. or is it mannual assignment of IP addresses to "ib0" is creating > some problem > > > *Some more supporting info : > *subnet manager of following version is also running : OpenSM 3.1.8 > > Initially I got this error for MDS mount > > Jan 16 09:45:20 p128 kernel: LustreError: > 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get IP address > for interface ib0 > Jan 16 09:45:20 p128 kernel: LustreError: > 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB interface > ib0: -99 > Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting > up LNI o2ib > Jan 16 09:45:21 p128 kernel: LustreError: > 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): > Input/output error > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): > Unknown symbol in module, or unknown parameter (see dmesg) > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol > ptlrpc_lprocfs_register_obd > . > . > . > > then I mannually set the IP address for ib0 as folows : > ifconfig ib0 172.24.198.111 > > [root at p186 ~]# ifconfig ib0 > ib0 Link encap:InfiniBand HWaddr > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0 > UP BROADCAST MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > then it mounted sucessfully > > Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111 at o2ib [8/64] > Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started > Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter > lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 > Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, > initializing > Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving > dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with > recovery enabled > Jan 16 09:47:09 p128 kernel: Lustre: > 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: > group upcall set to /usr/sbin/l_getgroups > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter > group_upcall=/usr/sbin/l_getgroups > Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device > /dev/loop0 has started > . > . > . > > > ~subbu > > > On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com > <mailto:Zhen.Liang at sun.com>> wrote: > > Subbu, > > I''d suggest: > 1) make sure ko2iblnd has been brought up (please check if there > is any error message when startup ko2iblnd) > 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl > ping, if it still can''t work please post error messages > > Regards > Liang > > subbu kl: > > Problem is similer to > http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html > But by looking at the thread could not really get the solution > for the problem. > > I have two RHEL5 Linux servers installed with following packages - > > kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 > kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > e2fsprogs-1.40.7.sun3-0redhat > > > machine 1: with ib0 IP address : 172.24.198.111 > machine 2: with ib0 IP address : 172.24.198.112 > > /etc/modprobe.conf contains > options lnet networks=o2ib > > TCP networking worked fine and now I am trying with Infiniband > network finding it difficult in communicating with IB nodes > mounting effort throghs me the following error > > [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 > mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: > Input/output error > Is the MGS running? > > /var/log/messages : > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit > interval 5 seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem > with ordered data mode. > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit > interval 5 seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem > with ordered data mode. > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled > Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from > MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has > timed out (limit 5s). > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1062:server_start_targets()) Required > registration failed for lustre-OSTffff: -5 > Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication > error with the MGS. Is the MGS running? > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start > targets: -5 > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:119:server_deregister_mount()) > lustre-OSTffff not registered > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 > reqs (0 success) > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents > scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated > and it took 0 > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 > preallocated, 0 discarded > Jan 15 16:55:30 p186 kernel: Lustre: server umount > lustre-OSTffff complete > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount > (-5) > > All pinging efforts also failed to the IB NIDS local/remote > can ping the ip address : > [root at p186 ~]# ping 172.24.198.112 > PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. > 64 bytes from 172.24.198.112 <http://172.24.198.112>: > icmp_seq=1 ttl=64 time=0.052 ms > 64 bytes from 172.24.198.112 <http://172.24.198.112>: > icmp_seq=2 ttl=64 time=0.024 ms > > > --- 172.24.198.112 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms > rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms > [root at p186 ~]# ping 172.24.198.111 > PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. > 64 bytes from 172.24.198.111 <http://172.24.198.111>: > icmp_seq=1 ttl=64 time=2.16 ms > 64 bytes from 172.24.198.111 <http://172.24.198.111>: > icmp_seq=2 ttl=64 time=0.296 ms > > > --- 172.24.198.111 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms > rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms > > but cant ping the NIDS : > [root at p186 ~]# lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > [root at p186 ~]# lctl ping 172.24.198.111 at o2ib > failed to ping 172.24.198.111 at o2ib: Input/output error > > Any idea why lnet cant ping NIDS ? > > some more configurations: > [root at p186 ~]# ibstat > CA ''mthca0'' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.5.0 > Hardware version: a1 > Node GUID: 0x0002c9020021550c > > Machines are connected via IB switch. > > Looking forward for help. > > ~subbu > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > <mailto:Lustre-discuss at lists.lustre.org> > http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > > > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone else, what > do they need you for?" > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
Liang, Right; you reproduced the exact problem. But as you can see in my previous mail I think I have solved that problem by mannually assiging IP to ib0 (check this line # ifconfig ib0 172.24.198.111 and *"Added LNI" lines *) we are back to sqare one now I guess ! LNET is up with mannually assigned IPs. normal ping succeds between machines but not lctl ping. so my current problem is this : # lctl ping 172.24.198.112 at o2ib failed to ping 172.24.198.112 at o2ib: Input/output error /var/log/messages: Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198.112 at o2ib: connection failed how can I get rid of this connection problem? ~subbu On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:> Subbu, > > We don''t have any tip for setup IPoIB, looks like linux can''t find the > ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I think it''s because you > didn''t assign any address to ib0 (or failed to assign address to ib0) before > loading o2iblnd in the first try. > I can reproduce exactly same error by: > 1. modprobe ib_ipoib > 2. ifconfig ib0 up // without assign any address > 3. modprobe ko2iblnd > 4. lctl network up > > Regards > Liang > > subbu kl: > >> Liang, >> after executing following echo : >> echo +neterror > /proc/sys/lnet/printk >> >> now lctlt ping shows the following error >> >> # lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> >> Jan 16 10:24:14 p128 kernel: Lustre: >> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112 at o2ib: >> ROUTE ERROR -22 >> Jan 16 10:24:14 p128 kernel: Lustre: >> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages >> for 172.24.198.112 at o2ib: connection failed >> >> Looks like some problem with "IB connection manager" ! >> >> 1. do we have any help docs to setup IPoIB and Lustre, lustre operation >> manual has very minimal info about this . I think I am missing some IPoIB >> setup part here. >> 2. or is it mannual assignment of IP addresses to "ib0" is creating some >> problem >> >> >> *Some more supporting info : >> *subnet manager of following version is also running : OpenSM 3.1.8 >> >> Initially I got this error for MDS mount >> >> Jan 16 09:45:20 p128 kernel: LustreError: >> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get IP address for >> interface ib0 >> Jan 16 09:45:20 p128 kernel: LustreError: >> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB interface ib0: >> -99 >> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up >> LNI o2ib >> Jan 16 09:45:21 p128 kernel: LustreError: >> 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed >> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc >> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): >> Input/output error >> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc >> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): >> Unknown symbol in module, or unknown parameter (see dmesg) >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >> ptlrpc_lprocfs_register_obd >> . >> . >> . >> >> then I mannually set the IP address for ib0 as folows : >> # ifconfig ib0 172.24.198.111 >> >> [root at p186 ~]# ifconfig ib0 >> ib0 Link encap:InfiniBand HWaddr >> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 >> inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0 >> UP BROADCAST MULTICAST MTU:65520 Metric:1 >> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:256 >> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) >> >> then it mounted sucessfully >> >> * Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111 at o2ib[8/64] >> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started* >> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter >> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 >> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr >> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, >> initializing >> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev >> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled >> Jan 16 09:47:09 p128 kernel: Lustre: >> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group >> upcall set to /usr/sbin/l_getgroups >> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter >> group_upcall=/usr/sbin/l_getgroups >> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device >> /dev/loop0 has started >> . >> . >> . >> >> >> ~subbu >> >> >> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com <mailto: >> Zhen.Liang at sun.com>> wrote: >> >> Subbu, >> >> I''d suggest: >> 1) make sure ko2iblnd has been brought up (please check if there >> is any error message when startup ko2iblnd) >> 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl >> ping, if it still can''t work please post error messages >> >> Regards >> Liang >> >> subbu kl: >> >> Problem is similer to >> >> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html >> But by looking at the thread could not really get the solution >> for the problem. >> >> I have two RHEL5 Linux servers installed with following packages - >> >> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 >> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> e2fsprogs-1.40.7.sun3-0redhat >> >> >> machine 1: with ib0 IP address : 172.24.198.111 >> machine 2: with ib0 IP address : 172.24.198.112 >> >> /etc/modprobe.conf contains >> options lnet networks=o2ib >> >> TCP networking worked fine and now I am trying with Infiniband >> network finding it difficult in communicating with IB nodes >> mounting effort throghs me the following error >> >> [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 >> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: >> Input/output error >> Is the MGS running? >> >> /var/log/messages : >> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit >> interval 5 seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem >> with ordered data mode. >> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit >> interval 5 seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem >> with ordered data mode. >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled >> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from >> MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has >> timed out (limit 5s). >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1062:server_start_targets()) Required >> registration failed for lustre-OSTffff: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication >> error with the MGS. Is the MGS running? >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start >> targets: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:119:server_deregister_mount()) >> lustre-OSTffff not registered >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 >> reqs (0 success) >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents >> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated >> and it took 0 >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >> preallocated, 0 discarded >> Jan 15 16:55:30 p186 kernel: Lustre: server umount >> lustre-OSTffff complete >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount >> (-5) >> >> All pinging efforts also failed to the IB NIDS local/remote >> can ping the ip address : >> [root at p186 ~]# ping 172.24.198.112 >> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >> icmp_seq=1 ttl=64 time=0.052 ms >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >> icmp_seq=2 ttl=64 time=0.024 ms >> >> >> --- 172.24.198.112 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms >> [root at p186 ~]# ping 172.24.198.111 >> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >> icmp_seq=1 ttl=64 time=2.16 ms >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >> icmp_seq=2 ttl=64 time=0.296 ms >> >> >> --- 172.24.198.111 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms >> >> but cant ping the NIDS : >> [root at p186 ~]# lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> [root at p186 ~]# lctl ping 172.24.198.111 at o2ib >> failed to ping 172.24.198.111 at o2ib: Input/output error >> >> Any idea why lnet cant ping NIDS ? >> >> some more configurations: >> [root at p186 ~]# ibstat >> CA ''mthca0'' >> CA type: MT23108 >> Number of ports: 2 >> Firmware version: 3.5.0 >> Hardware version: a1 >> Node GUID: 0x0002c9020021550c >> >> Machines are connected via IB switch. >> >> Looking forward for help. >> >> ~subbu >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> <mailto:Lustre-discuss at lists.lustre.org> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >> >> >> >> -- >> . . . s u b b u >> "You''ve got to be original, because if you''re like someone else, what do >> they need you for?" >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> > >-- . . . s u b b u "You''ve got to be original, because if you''re like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090116/4b101f30/attachment-0001.html
problem remained same, when I run lctl ping with tcpdump 4.0.0 I dont see any activity on ib0 ! another exhaustive Lustre debug log I took with lctl ping do you see any problem with it ? Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(module.c:160:libcfs_psdev_open()) Process entered Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced ''ldu'': 8 at f5bc6620 (tot 7258558). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(module.c:228:libcfs_ioctl()) Process entered Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1223:LNetNIInit()) refs 1 Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1614:lnet_ping()) kmalloced ''info'': 144 at f0b95880 (tot 7258702). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced ''eq'': 48 at efda1a00 (tot 7258750). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:72:LNetEQAlloc()) kmalloced ''eq->eq_events'': 240 at f0b95c80 (tot 7258990). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced ''md'': 84 at ed16acc0 (tot 7259074). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced ''msg'': 268 at f205a400 (tot 7259342). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-move.c:2395:LNetGet()) LNetGet -> 12345-172.24.198.140 at o2ib Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags to 12345-172.24.198.140 at o2ib Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced ''peer'': 56 at efda18c0 (tot 7259398). Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] -> 172.24.198.140 at o2ib (1)++ Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] -> 172.24.198.140 at o2ib (2)++ Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] -> 172.24.198.140 at o2ib (3)-- Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:146:lib_get_event()) Process entered Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, eq->size: 2 Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:39 p186 kernel: Lustre: 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198.140 at o2ib Addr resolved: 0 Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:146:lib_get_event()) Process entered Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, eq->size: 2 Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:239:LNetEQPoll()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) poll 0(-1 -1) Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0 Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:146:lib_get_event()) Process entered Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, eq->size: 2 Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 : -4352 : ffffef00) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 : -512 : fffffe00) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : 2817 : b01) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : 2047 : 7ff) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 : -226464 : fffc8b60) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 : -8750811 : ff7a7925) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : 5821091 : 58d2a3) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : 3356952 : 333918) Jan 23 17:23:56 p186 kernel: Lustre: 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 : -4352 : ffffef00) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 : -512 : fffffe00) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : 2817 : b01) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : 2047 : 7ff) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 : -226464 : fffc8b60) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 : -8750811 : ff7a7925) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : 5821091 : 58d2a3) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : 3356952 : 333918) Jan 23 17:24:21 p186 kernel: Lustre: 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847) Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: ROUTE ERROR -110 Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] -> 172.24.198.140 at o2ib (2)-- Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(router.c:151:lnet_notify()) 172.24.198.141 at o2ib notifying 172.24.198.140 at o2ib: down Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(router.c:82:lnet_notify_locked()) Old news Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198.140 at o2ib: connection failed Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ed16acc0 Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed ''md'': 84 at ed16acc0 (tot 7259314). Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed ''msg'': 268 at f205a400 (tot 7259046). Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] -> 172.24.198.140 at o2ib (1)-- Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed ''peer'': 56 at efda18c0 (tot 7258990). Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:146:lib_get_event()) Process entered Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, eq->size: 2 Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1) Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:232:LNetEQPoll()) Process leaving (rc=1 : 1 : 1) Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) poll 1(4 -113) unlinked Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed ''eq'': 48 at efda1a00 (tot 7258942). Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:135:LNetEQFree()) kfreed ''events'': 240 at f0b95c80 (tot 7258702). Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1772:lnet_ping()) kfreed ''info'': 144 at f0b95880 (tot 7258558). Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(module.c:336:libcfs_ioctl()) Process leaving (rc=4294967291 : -5 : fffffffb) Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(module.c:178:libcfs_psdev_release()) Process entered Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(module.c:183:libcfs_psdev_release()) kfreed ''ldu'': 8 at f5bc6620 (tot 7258550). Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 : 0 : 0) ~subbu On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <subbukl at gmail.com> wrote:> Liang, > > Right; you reproduced the exact problem. But as you can see in my previous > mail I think I have solved that problem by mannually assiging IP to ib0 > (check this line # ifconfig ib0 172.24.198.111 and *"Added LNI" lines *) > > we are back to sqare one now I guess ! LNET is up with mannually assigned > IPs. normal ping succeds between machines but not lctl ping. > > so my current problem is this : > # lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > > /var/log/messages: > > Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687: > kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages > for 172.24.198.112 at o2ib: connection failed > > how can I get rid of this connection problem? > > ~subbu > > > > On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com> wrote: > >> Subbu, >> >> We don''t have any tip for setup IPoIB, looks like linux can''t find the >> ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I think it''s because you >> didn''t assign any address to ib0 (or failed to assign address to ib0) before >> loading o2iblnd in the first try. >> I can reproduce exactly same error by: >> 1. modprobe ib_ipoib >> 2. ifconfig ib0 up // without assign any address >> 3. modprobe ko2iblnd >> 4. lctl network up >> >> Regards >> Liang >> >> subbu kl: >> >>> Liang, >>> after executing following echo : >>> echo +neterror > /proc/sys/lnet/printk >>> >>> now lctlt ping shows the following error >>> >>> # lctl ping 172.24.198.112 at o2ib >>> failed to ping 172.24.198.112 at o2ib: Input/output error >>> >>> Jan 16 10:24:14 p128 kernel: Lustre: >>> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112 at o2ib: >>> ROUTE ERROR -22 >>> Jan 16 10:24:14 p128 kernel: Lustre: >>> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages >>> for 172.24.198.112 at o2ib: connection failed >>> >>> Looks like some problem with "IB connection manager" ! >>> >>> 1. do we have any help docs to setup IPoIB and Lustre, lustre operation >>> manual has very minimal info about this . I think I am missing some IPoIB >>> setup part here. >>> 2. or is it mannual assignment of IP addresses to "ib0" is creating some >>> problem >>> >>> >>> *Some more supporting info : >>> *subnet manager of following version is also running : OpenSM 3.1.8 >>> >>> Initially I got this error for MDS mount >>> >>> Jan 16 09:45:20 p128 kernel: LustreError: >>> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get IP address for >>> interface ib0 >>> Jan 16 09:45:20 p128 kernel: LustreError: >>> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB interface ib0: >>> -99 >>> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up >>> LNI o2ib >>> Jan 16 09:45:21 p128 kernel: LustreError: >>> 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed >>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc >>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): >>> Input/output error >>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc >>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): >>> Unknown symbol in module, or unknown parameter (see dmesg) >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >>> ptlrpc_lprocfs_register_obd >>> . >>> . >>> . >>> >>> then I mannually set the IP address for ib0 as folows : >>> # ifconfig ib0 172.24.198.111 >>> >>> [root at p186 ~]# ifconfig ib0 >>> ib0 Link encap:InfiniBand HWaddr >>> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 >>> inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0 >>> UP BROADCAST MULTICAST MTU:65520 Metric:1 >>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> collisions:0 txqueuelen:256 >>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) >>> >>> then it mounted sucessfully >>> >>> * Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111 at o2ib[8/64] >>> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started* >>> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter >>> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 >>> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr >>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, >>> initializing >>> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev >>> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled >>> Jan 16 09:47:09 p128 kernel: Lustre: >>> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group >>> upcall set to /usr/sbin/l_getgroups >>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter >>> group_upcall=/usr/sbin/l_getgroups >>> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device >>> /dev/loop0 has started >>> . >>> . >>> . >>> >>> >>> ~subbu >>> >>> >>> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com <mailto: >>> Zhen.Liang at sun.com>> wrote: >>> >>> Subbu, >>> >>> I''d suggest: >>> 1) make sure ko2iblnd has been brought up (please check if there >>> is any error message when startup ko2iblnd) >>> 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl >>> ping, if it still can''t work please post error messages >>> >>> Regards >>> Liang >>> >>> subbu kl: >>> >>> Problem is similer to >>> >>> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html >>> But by looking at the thread could not really get the solution >>> for the problem. >>> >>> I have two RHEL5 Linux servers installed with following packages - >>> >>> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 >>> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> e2fsprogs-1.40.7.sun3-0redhat >>> >>> >>> machine 1: with ib0 IP address : 172.24.198.111 >>> machine 2: with ib0 IP address : 172.24.198.112 >>> >>> /etc/modprobe.conf contains >>> options lnet networks=o2ib >>> >>> TCP networking worked fine and now I am trying with Infiniband >>> network finding it difficult in communicating with IB nodes >>> mounting effort throghs me the following error >>> >>> [root at p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1 >>> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: >>> Input/output error >>> Is the MGS running? >>> >>> /var/log/messages : >>> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit >>> interval 5 seconds >>> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem >>> with ordered data mode. >>> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit >>> interval 5 seconds >>> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem >>> with ordered data mode. >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled >>> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from >>> MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib 5s ago has >>> timed out (limit 5s). >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1062:server_start_targets()) Required >>> registration failed for lustre-OSTffff: -5 >>> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication >>> error with the MGS. Is the MGS running? >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start >>> targets: -5 >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:119:server_deregister_mount()) >>> lustre-OSTffff not registered >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 >>> reqs (0 success) >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents >>> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated >>> and it took 0 >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >>> preallocated, 0 discarded >>> Jan 15 16:55:30 p186 kernel: Lustre: server umount >>> lustre-OSTffff complete >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount >>> (-5) >>> >>> All pinging efforts also failed to the IB NIDS local/remote >>> can ping the ip address : >>> [root at p186 ~]# ping 172.24.198.112 >>> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data. >>> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >>> icmp_seq=1 ttl=64 time=0.052 ms >>> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >>> icmp_seq=2 ttl=64 time=0.024 ms >>> >>> >>> --- 172.24.198.112 ping statistics --- >>> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >>> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms >>> [root at p186 ~]# ping 172.24.198.111 >>> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data. >>> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >>> icmp_seq=1 ttl=64 time=2.16 ms >>> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >>> icmp_seq=2 ttl=64 time=0.296 ms >>> >>> >>> --- 172.24.198.111 ping statistics --- >>> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms >>> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms >>> >>> but cant ping the NIDS : >>> [root at p186 ~]# lctl ping 172.24.198.112 at o2ib >>> failed to ping 172.24.198.112 at o2ib: Input/output error >>> [root at p186 ~]# lctl ping 172.24.198.111 at o2ib >>> failed to ping 172.24.198.111 at o2ib: Input/output error >>> >>> Any idea why lnet cant ping NIDS ? >>> >>> some more configurations: >>> [root at p186 ~]# ibstat >>> CA ''mthca0'' >>> CA type: MT23108 >>> Number of ports: 2 >>> Firmware version: 3.5.0 >>> Hardware version: a1 >>> Node GUID: 0x0002c9020021550c >>> >>> Machines are connected via IB switch. >>> >>> Looking forward for help. >>> >>> ~subbu >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> <mailto:Lustre-discuss at lists.lustre.org> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> >>> >>> >>> -- >>> . . . s u b b u >>> "You''ve got to be original, because if you''re like someone else, what do >>> they need you for?" >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >> >> > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone else, what do > they need you for?" >-- . . . s u b b u "You''ve got to be original, because if you''re like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090123/15249da9/attachment-0001.html
Subbu, I think we can''t see anything from tcpdump even run ping sucessfully, because we only need ipoib for connecting (not for transaction). I think we need these information for diagnosing: 1. modprobe.conf of two nodes with IB 2. ifconfig on these two nodes 3. routing table on these two nodes 4. try lctl ping itself on both nodes and see if any error (with +neterror) Regards Liang subbu kl:> problem remained same, when I run lctl ping with tcpdump 4.0.0 I dont > see any activity on ib0 ! > > another exhaustive Lustre debug log I took with lctl ping do you see > any problem with it ? > > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(module.c:160:libcfs_psdev_open()) Process entered > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced ''ldu'': 8 at > f5bc6620 (tot 7258558). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0) > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(module.c:228:libcfs_ioctl()) Process entered > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving > (rc=0 : 0 : 0) > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(api-ni.c:1223:LNetNIInit()) refs 1 > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(api-ni.c:1614:lnet_ping()) kmalloced ''info'': 144 at f0b95880 > (tot 7258702). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced ''eq'': 48 at > efda1a00 (tot 7258750). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-eq.c:72:LNetEQAlloc()) kmalloced ''eq->eq_events'': 240 at > f0b95c80 (tot 7258990). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced ''md'': 84 at > ed16acc0 (tot 7259074). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced ''msg'': 268 at > f205a400 (tot 7259342). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-move.c:2395:LNetGet()) LNetGet -> 12345-172.24.198.140 at o2ib > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags > to 12345-172.24.198.140 at o2ib > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced ''peer'': 56 at > efda18c0 (tot 7259398). > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] -> > 172.24.198.140 at o2ib (1)++ > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] -> > 172.24.198.140 at o2ib (2)++ > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] -> > 172.24.198.140 at o2ib (3)-- > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-eq.c:146:lib_get_event()) Process entered > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, > eq->size: 2 > Jan 23 17:23:39 p186 kernel: Lustre: > 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) > Jan 23 17:23:39 p186 kernel: Lustre: > 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198.140 at o2ib > Addr resolved: 0 > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:146:lib_get_event()) Process entered > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, > eq->size: 2 > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:239:LNetEQPoll()) Process leaving (rc=0 : 0 : 0) > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(api-ni.c:1665:lnet_ping()) poll 0(-1 -1) > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0 > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:146:lib_get_event()) Process entered > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, > eq->size: 2 > Jan 23 17:23:40 p186 kernel: Lustre: > 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294962944 : -4352 : ffffef00) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294966784 : -512 : fffffe00) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 > : 2817 : b01) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 > : 2047 : 7ff) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294740832 : -226464 : fffc8b60) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4286216485 : -8750811 : ff7a7925) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=5821091 : 5821091 : 58d2a3) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=3356952 : 3356952 : 333918) > Jan 23 17:23:56 p186 kernel: Lustre: > 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294962944 : -4352 : ffffef00) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294966784 : -512 : fffffe00) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 > : 2817 : b01) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 > : 2047 : 7ff) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4294740832 : -226464 : fffc8b60) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=4286216485 : -8750811 : ff7a7925) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=5821091 : 5821091 : 58d2a3) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving > (rc=3356952 : 3356952 : 333918) > Jan 23 17:24:21 p186 kernel: Lustre: > 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847) > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: > ROUTE ERROR -110 > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] -> > 172.24.198.140 at o2ib (2)-- > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(router.c:151:lnet_notify()) 172.24.198.141 at o2ib notifying > 172.24.198.140 at o2ib: down > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(router.c:82:lnet_notify_locked()) Old news > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting > messages for 172.24.198.140 at o2ib: connection failed > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ed16acc0 > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed ''md'': 84 at ed16acc0 > (tot 7259314). > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed ''msg'': 268 at f205a400 > (tot 7259046). > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] -> > 172.24.198.140 at o2ib (1)-- > Jan 23 17:24:29 p186 kernel: Lustre: > 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed ''peer'': 56 at > efda18c0 (tot 7258990). > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-eq.c:146:lib_get_event()) Process entered > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, > eq->size: 2 > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1) > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-eq.c:232:LNetEQPoll()) Process leaving (rc=1 : 1 : 1) > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(api-ni.c:1665:lnet_ping()) poll 1(4 -113) unlinked > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed ''eq'': 48 at efda1a00 > (tot 7258942). > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(lib-eq.c:135:LNetEQFree()) kfreed ''events'': 240 at f0b95c80 > (tot 7258702). > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(api-ni.c:1772:lnet_ping()) kfreed ''info'': 144 at f0b95880 > (tot 7258558). > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(module.c:336:libcfs_ioctl()) Process leaving (rc=4294967291 : > -5 : fffffffb) > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(module.c:178:libcfs_psdev_release()) Process entered > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(module.c:183:libcfs_psdev_release()) kfreed ''ldu'': 8 at > f5bc6620 (tot 7258550). > Jan 23 17:24:29 p186 kernel: Lustre: > 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 : > 0 : 0) > > ~subbu > > On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <subbukl at gmail.com > <mailto:subbukl at gmail.com>> wrote: > > Liang, > > Right; you reproduced the exact problem. But as you can see in my > previous mail I think I have solved that problem by mannually > assiging IP to ib0 (check this line # ifconfig ib0 172.24.198.111 > and *"Added LNI" lines *) > > we are back to sqare one now I guess ! LNET is up with mannually > assigned IPs. normal ping succeds between machines but not lctl ping. > > so my current problem is this : > > # lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > > /var/log/messages: > > > Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687: > kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting > messages for 172.24.198.112 at o2ib: connection failed > > how can I get rid of this connection problem? > > ~subbu > > > > On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com > <mailto:Zhen.Liang at sun.com>> wrote: > > Subbu, > > We don''t have any tip for setup IPoIB, looks like linux can''t > find the ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I > think it''s because you didn''t assign any address to ib0 (or > failed to assign address to ib0) before loading o2iblnd in > the first try. > I can reproduce exactly same error by: > 1. modprobe ib_ipoib > 2. ifconfig ib0 up // without assign any address > 3. modprobe ko2iblnd > 4. lctl network up > > Regards > Liang > > subbu kl: > > Liang, > after executing following echo : > echo +neterror > /proc/sys/lnet/printk > > now lctlt ping shows the following error > > # lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) > 172.24.198.112 at o2ib: ROUTE ERROR -22 > Jan 16 10:24:14 p128 kernel: Lustre: > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) > Deleting messages for 172.24.198.112 at o2ib: connection failed > > Looks like some problem with "IB connection manager" ! > > 1. do we have any help docs to setup IPoIB and Lustre, > lustre operation manual has very minimal info about this . > I think I am missing some IPoIB setup part here. > 2. or is it mannual assignment of IP addresses to "ib0" > is creating some problem > > > *Some more supporting info : > *subnet manager of following version is also running : > OpenSM 3.1.8 > > Initially I got this error for MDS mount > > Jan 16 09:45:20 p128 kernel: LustreError: > 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get > IP address for interface ib0 > Jan 16 09:45:20 p128 kernel: LustreError: > 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB > interface ib0: -99 > Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error > -100 starting up LNI o2ib > Jan 16 09:45:21 p128 kernel: LustreError: > 4991:0:(events.c:707:ptlrpc_init_portals()) network > initialisation failed > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting > ptlrpc > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): > Input/output error > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting > osc > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): > Unknown symbol in module, or unknown parameter (see dmesg) > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol > ldlm_prep_enqueue_req > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol > ldlm_resource_get > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol > ptlrpc_lprocfs_register_obd > . > . > . > > then I mannually set the IP address for ib0 as folows : > # ifconfig ib0 172.24.198.111 > > [root at p186 ~]# ifconfig ib0 > ib0 Link encap:InfiniBand HWaddr > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr:172.24.198.112 Bcast:172.24.255.255 > Mask:255.255.0.0 > UP BROADCAST MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > then it mounted sucessfully > > *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI > 172.24.198.111 at o2ib [8/64] > Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started* > Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter > lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 > Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new > disk, initializing > Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 > now serving dev > (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with > recovery enabled > Jan 16 09:47:09 p128 kernel: Lustre: > 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) > lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: > set parameter group_upcall=/usr/sbin/l_getgroups > Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 > on device /dev/loop0 has started > . > . > . > > > ~subbu > > > On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen > <Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com> > <mailto:Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com>>> > wrote: > > Subbu, > > I''d suggest: > 1) make sure ko2iblnd has been brought up (please check > if there > is any error message when startup ko2iblnd) > 2) echo +neterror > /proc/sys/lnet/printk, then try > with lctl > ping, if it still can''t work please post error messages > > Regards > Liang > > subbu kl: > > Problem is similer to > > http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html > But by looking at the thread could not really get > the solution > for the problem. > > I have two RHEL5 Linux servers installed with > following packages - > > kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 > kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > > lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > > lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp > e2fsprogs-1.40.7.sun3-0redhat > > > machine 1: with ib0 IP address : 172.24.198.111 > machine 2: with ib0 IP address : 172.24.198.112 > > /etc/modprobe.conf contains > options lnet networks=o2ib > > TCP networking worked fine and now I am trying with > Infiniband > network finding it difficult in communicating with > IB nodes > mounting effort throghs me the following error > > [root at p186 ~]# mount -t lustre -o loop > /tmp/lustre-ost1 /mnt/ost1 > mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: > Input/output error > Is the MGS running? > > /var/log/messages : > Jan 15 16:55:25 p186 kernel: kjournald starting. > Commit > interval 5 seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, > internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted > filesystem > with ordered data mode. > Jan 15 16:55:25 p186 kernel: kjournald starting. > Commit > interval 5 seconds > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, > internal journal > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted > filesystem > with ordered data mode. > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file > extents enabled > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc > enabled > Jan 15 16:55:30 p186 kernel: Lustre: Request x7 > sent from > MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib > 5s ago has > timed out (limit 5s). > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1062:server_start_targets()) > Required > registration failed for lustre-OSTffff: -5 > Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: > Communication > error with the MGS. Is the MGS running? > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1597:server_fill_super()) > Unable to start > targets: -5 > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1382:server_put_super()) no obd > lustre-OSTffff > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:119:server_deregister_mount()) > lustre-OSTffff not registered > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 > blocks 0 > reqs (0 success) > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 > extents > scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 > generated > and it took 0 > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 > preallocated, 0 discarded > Jan 15 16:55:30 p186 kernel: Lustre: server umount > lustre-OSTffff complete > Jan 15 16:55:30 p186 kernel: LustreError: > 7193:0:(obd_mount.c:1951:lustre_fill_super()) > Unable to mount > (-5) > > All pinging efforts also failed to the IB NIDS > local/remote > can ping the ip address : > [root at p186 ~]# ping 172.24.198.112 > PING 172.24.198.112 (172.24.198.112) 56(84) bytes > of data. > 64 bytes from 172.24.198.112 <http://172.24.198.112>: > icmp_seq=1 ttl=64 time=0.052 ms > 64 bytes from 172.24.198.112 <http://172.24.198.112>: > icmp_seq=2 ttl=64 time=0.024 ms > > > --- 172.24.198.112 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, > time 1000ms > rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms > [root at p186 ~]# ping 172.24.198.111 > PING 172.24.198.111 (172.24.198.111) 56(84) bytes > of data. > 64 bytes from 172.24.198.111 <http://172.24.198.111>: > icmp_seq=1 ttl=64 time=2.16 ms > 64 bytes from 172.24.198.111 <http://172.24.198.111>: > icmp_seq=2 ttl=64 time=0.296 ms > > > --- 172.24.198.111 ping statistics --- > 2 packets transmitted, 2 received, 0% packet loss, > time 1000ms > rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms > > but cant ping the NIDS : > [root at p186 ~]# lctl ping 172.24.198.112 at o2ib > failed to ping 172.24.198.112 at o2ib: Input/output error > [root at p186 ~]# lctl ping 172.24.198.111 at o2ib > failed to ping 172.24.198.111 at o2ib: Input/output error > > Any idea why lnet cant ping NIDS ? > > some more configurations: > [root at p186 ~]# ibstat > CA ''mthca0'' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.5.0 > Hardware version: a1 > Node GUID: 0x0002c9020021550c > > Machines are connected via IB switch. > > Looking forward for help. > > ~subbu > > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > <mailto:Lustre-discuss at lists.lustre.org> > <mailto:Lustre-discuss at lists.lustre.org > <mailto:Lustre-discuss at lists.lustre.org>> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone > else, what do they need you for?" > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > <mailto:Lustre-discuss at lists.lustre.org> > http://lists.lustre.org/mailman/listinfo/lustre-discuss > > > > > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone else, > what do they need you for?" > > > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone else, what > do they need you for?" > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
Liang, please find the info you have asked below. There are two nodes MDS and OSS1 connected throgh a silverstorme Infiniband switch and MDS running IB subnet manager running. [root at MDS ~]# cat /etc/modprobe.conf alias eth0 bnx2 alias eth1 bnx2 alias scsi_hostadapter megaraid_sas alias scsi_hostadapter1 ata_piix alias scsi_hostadapter2 usb-storage alias ib0 ib_ipoib alias ib1 ib_ipoib alias net-pf-27 ib_sdp options loop max_loop=64 options lnet networks=o2ib(ib0) options ib_madeye data=1 [root at MDS ~]# ifconfig eth0 Link encap:Ethernet HWaddr 00:18:8B:40:63:C3 inet addr:172.24.198.128 Bcast:172.24.255.255 Mask:255.255.0.0 inet6 addr: fe80::218:8bff:fe40:63c3/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:4203 errors:0 dropped:0 overruns:0 frame:0 TX packets:1069 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:415345 (405.6 KiB) TX bytes:109548 (106.9 KiB) Interrupt:169 Memory:f8000000-f8012100 ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.24.198.140 Bcast:172.24.255.255 Mask:255.255.0.0 inet6 addr: fe80::202:c902:22:cd49/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:8 errors:0 dropped:0 overruns:0 frame:0 TX packets:11 errors:0 dropped:6 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:4163 (4.0 KiB) TX bytes:4205 (4.1 KiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:1614 errors:0 dropped:0 overruns:0 frame:0 TX packets:1614 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:5322452 (5.0 MiB) TX bytes:5322452 (5.0 MiB) [root at MDS ~]# lctl list_nids 172.24.198.140 at o2ib [root at MDS ~]# route -e Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 172.24.0.0 * 255.255.0.0 U 0 0 0 eth0 172.24.0.0 * 255.255.0.0 U 0 0 0 ib0 169.254.0.0 * 255.255.0.0 U 0 0 0 ib0 default 172.24.198.250 0.0.0.0 UG 0 0 0 eth0 [root at MDS ~]# echo +neterror > /proc/sys/lnet/printk [root at MDS ~]# echo +neterror > /proc/sys/lnet/printk [root at MDS ~]# lctl list_nids 172.24.198.140 at o2ib [root at MDS ~]# lctl ping 172.24.198.140 at o2ib 12345-0 at lo 12345-172.24.198.140 at o2ib [root at MDS ~]# lctl ping 172.24.198.141 at o2ib failed to ping 172.24.198.141 at o2ib: Input/output error /var/log/messages : Jan 27 15:41:41 MDS kernel: Lustre: 5649:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.141 at o2ib: ROUTE ERROR -22 Jan 27 15:41:41 MDS kernel: Lustre: 5649:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198.141 at o2ib: connection failed [root at OSS1 ~]# cat /etc/modprobe.conf alias eth0 e1000 alias eth1 e1000 alias scsi_hostadapter megaraid_mbox alias scsi_hostadapter1 qla2xxx options loop max_loop=64 alias ib0 ib_ipoib options lnet networks=o2ib(ib0) options ib_ipoib debug_level=1 options ib_ipoib mcast_debug_level=1 ### BEGIN MPP Driver Comments ### remove mppUpper if [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e "Please Unload Physical HBA Driver prior to unloading mppUpper."; else /sbin/modprobe -r --ignore-remove mppUpper; fi # Additional config info can be found in /opt/mpp/modprobe.conf.mppappend. # The Above config info is needed if you want to make mkinitrd manually. # Please read the Readme file that came with MPP driver for building RamDisk manually. # Edit the ''/etc/modprobe.conf'' file and run ''mppUpdate'' to create Ramdisk dynamically. ### END MPP Driver Comments ### #alias ib1 ib_ipoib alias net-pf-27 ib_sdp options ib_madeye data=1 [root at OSS1 ~]# ifconfig eth0 Link encap:Ethernet HWaddr 00:13:72:5D:3B:65 inet addr:172.24.198.186 Bcast:172.24.255.255 Mask:255.255.0.0 inet6 addr: fe80::213:72ff:fe5d:3b65/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:7831 errors:0 dropped:0 overruns:0 frame:0 TX packets:1007 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:809440 (790.4 KiB) TX bytes:99439 (97.1 KiB) Base address:0xdcc0 Memory:df7e0000-df800000 ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.24.198.141 Bcast:172.24.255.255 Mask:255.255.0.0 inet6 addr: fe80::202:c902:21:550d/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:10 errors:0 dropped:0 overruns:0 frame:0 TX packets:11 errors:0 dropped:7 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:4097 (4.0 KiB) TX bytes:5202 (5.0 KiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:94 errors:0 dropped:0 overruns:0 frame:0 TX packets:94 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:8962 (8.7 KiB) TX bytes:8962 (8.7 KiB) virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0 inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:49 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:9166 (8.9 KiB) [root at OSS1 ~]# lctl list_nids 172.24.198.141 at o2ib [root at OSS1 ~]# route -e Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 192.168.122.0 * 255.255.255.0 U 0 0 0 virbr0 172.24.0.0 * 255.255.0.0 U 0 0 0 eth0 172.24.0.0 * 255.255.0.0 U 0 0 0 ib0 169.254.0.0 * 255.255.0.0 U 0 0 0 ib0 default 172.24.198.250 0.0.0.0 UG 0 0 0 eth0 [root at OSS1 ~]# echo +neterror > /proc/sys/lnet/printk [root at OSS1 ~]# echo +neterror > /proc/sys/lnet/printk [root at OSS1 ~]# lctl list_nids 172.24.198.141 at o2ib [root at OSS1 ~]# lctl ping 172.24.198.141 at o2ib 12345-0 at lo 12345-172.24.198.141 at o2ib [root at OSS1 ~]# lctl ping 172.24.198.140 at o2ib failed to ping 172.24.198.140 at o2ib: Input/output error /var/log/messages : Jan 27 15:34:17 OSS1 kernel: Lustre: 2776:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: ROUTE ERROR -22 Jan 27 15:34:17 OSS1 kernel: Lustre: 2776:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198.140 at o2ib: connection failed ~subbu On Sat, Jan 24, 2009 at 8:06 AM, Liang Zhen <Zhen.Liang at sun.com> wrote:> Subbu, > I think we can''t see anything from tcpdump even run ping sucessfully, > because we only need ipoib for connecting (not for transaction). > I think we need these information for diagnosing: > 1. modprobe.conf of two nodes with IB > 2. ifconfig on these two nodes > 3. routing table on these two nodes > 4. try lctl ping itself on both nodes and see if any error (with +neterror) > > Regards > Liang > > subbu kl: > >> problem remained same, when I run lctl ping with tcpdump 4.0.0 I dont >> see any activity on ib0 ! >> >> another exhaustive Lustre debug log I took with lctl ping do you see any >> problem with it ? >> >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(module.c:160:libcfs_psdev_open()) Process entered >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced ''ldu'': 8 at f5bc6620 >> (tot 7258558). >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0) >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(module.c:228:libcfs_ioctl()) >> Process entered >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving (rc=0 : 0 >> : 0) >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1223:LNetNIInit()) >> refs 1 >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1614:lnet_ping()) >> kmalloced ''info'': 144 at f0b95880 (tot 7258702). >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced ''eq'': 48 at efda1a00 (tot >> 7258750). >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:72:LNetEQAlloc()) >> kmalloced ''eq->eq_events'': 240 at f0b95c80 (tot 7258990). >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced ''md'': 84 at ed16acc0 (tot >> 7259074). >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced ''msg'': 268 at f205a400 >> (tot 7259342). >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-move.c:2395:LNetGet()) >> LNetGet -> 12345-172.24.198.140 at o2ib >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags to >> 12345-172.24.198.140 at o2ib >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced ''peer'': 56 at >> efda18c0 (tot 7259398). >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] -> >> 172.24.198.140 at o2ib (1)++ >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] -> >> 172.24.198.140 at o2ib (2)++ >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] -> >> 172.24.198.140 at o2ib (3)-- >> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) >> Process entered >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >> eq->size: 2 >> Jan 23 17:23:39 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >> Jan 23 17:23:39 p186 kernel: Lustre: >> 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198.140 at o2ib Addr >> resolved: 0 >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >> eq->size: 2 >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:239:LNetEQPoll()) >> Process leaving (rc=0 : 0 : 0) >> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) >> poll 0(-1 -1) >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0 >> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) >> Process entered >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >> eq->size: 2 >> Jan 23 17:23:40 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 >> : -4352 : ffffef00) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 >> : -512 : fffffe00) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : >> 2817 : b01) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : >> 2047 : 7ff) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 >> : -226464 : fffc8b60) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 >> : -8750811 : ff7a7925) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : >> 5821091 : 58d2a3) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : >> 3356952 : 333918) >> Jan 23 17:23:56 p186 kernel: Lustre: >> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 >> : -4352 : ffffef00) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 >> : -512 : fffffe00) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : >> 2817 : b01) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : >> 2047 : 7ff) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 >> : -226464 : fffc8b60) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 >> : -8750811 : ff7a7925) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : >> 5821091 : 58d2a3) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : >> 3356952 : 333918) >> Jan 23 17:24:21 p186 kernel: Lustre: >> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847) >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: >> ROUTE ERROR -110 >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] -> >> 172.24.198.140 at o2ib (2)-- >> Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(router.c:151:lnet_notify()) >> 172.24.198.141 at o2ib notifying 172.24.198.140 at o2ib: down >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(router.c:82:lnet_notify_locked()) Old news >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages >> for 172.24.198.140 at o2ib: connection failed >> Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(lib-md.c:73:lnet_md_unlink()) >> Unlinking md ed16acc0 >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed ''md'': 84 at ed16acc0 (tot >> 7259314). >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed ''msg'': 268 at f205a400 (tot >> 7259046). >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] -> >> 172.24.198.140 at o2ib (1)-- >> Jan 23 17:24:29 p186 kernel: Lustre: >> 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed ''peer'': 56 at efda18c0 >> (tot 7258990). >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >> eq->size: 2 >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1) >> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:232:LNetEQPoll()) >> Process leaving (rc=1 : 1 : 1) >> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) >> poll 1(4 -113) unlinked >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed ''eq'': 48 at efda1a00 (tot >> 7258942). >> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:135:LNetEQFree()) >> kfreed ''events'': 240 at f0b95c80 (tot 7258702). >> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1772:lnet_ping()) >> kfreed ''info'': 144 at f0b95880 (tot 7258558). >> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(module.c:336:libcfs_ioctl()) >> Process leaving (rc=4294967291 : -5 : fffffffb) >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(module.c:178:libcfs_psdev_release()) Process entered >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(module.c:183:libcfs_psdev_release()) kfreed ''ldu'': 8 at f5bc6620 >> (tot 7258550). >> Jan 23 17:24:29 p186 kernel: Lustre: >> 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 : 0 : 0) >> >> ~subbu >> >> On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <subbukl at gmail.com <mailto: >> subbukl at gmail.com>> wrote: >> >> Liang, >> >> Right; you reproduced the exact problem. But as you can see in my >> previous mail I think I have solved that problem by mannually >> assiging IP to ib0 (check this line # ifconfig ib0 172.24.198.111 >> and *"Added LNI" lines *) >> >> we are back to sqare one now I guess ! LNET is up with mannually >> assigned IPs. normal ping succeds between machines but not lctl ping. >> >> so my current problem is this : >> >> # lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> >> /var/log/messages: >> >> >> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687: >> kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 >> Jan 16 10:24:14 p128 kernel: Lustre: >> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting >> messages for 172.24.198.112 at o2ib: connection failed >> >> how can I get rid of this connection problem? >> >> ~subbu >> >> >> >> On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com >> <mailto:Zhen.Liang at sun.com>> wrote: >> >> Subbu, >> >> We don''t have any tip for setup IPoIB, looks like linux can''t >> find the ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I >> think it''s because you didn''t assign any address to ib0 (or >> failed to assign address to ib0) before loading o2iblnd in >> the first try. >> I can reproduce exactly same error by: >> 1. modprobe ib_ipoib >> 2. ifconfig ib0 up // without assign any address >> 3. modprobe ko2iblnd >> 4. lctl network up >> >> Regards >> Liang >> >> subbu kl: >> >> Liang, >> after executing following echo : >> echo +neterror > /proc/sys/lnet/printk >> >> now lctlt ping shows the following error >> >> # lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> >> Jan 16 10:24:14 p128 kernel: Lustre: >> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) >> 172.24.198.112 at o2ib: ROUTE ERROR -22 >> Jan 16 10:24:14 p128 kernel: Lustre: >> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) >> Deleting messages for 172.24.198.112 at o2ib: connection failed >> >> Looks like some problem with "IB connection manager" ! >> >> 1. do we have any help docs to setup IPoIB and Lustre, >> lustre operation manual has very minimal info about this . >> I think I am missing some IPoIB setup part here. >> 2. or is it mannual assignment of IP addresses to "ib0" >> is creating some problem >> >> >> *Some more supporting info : >> *subnet manager of following version is also running : >> OpenSM 3.1.8 >> >> Initially I got this error for MDS mount >> >> Jan 16 09:45:20 p128 kernel: LustreError: >> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get >> IP address for interface ib0 >> Jan 16 09:45:20 p128 kernel: LustreError: >> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB >> interface ib0: -99 >> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error >> -100 starting up LNI o2ib >> Jan 16 09:45:21 p128 kernel: LustreError: >> 4991:0:(events.c:707:ptlrpc_init_portals()) network >> initialisation failed >> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting >> ptlrpc >> >> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): >> Input/output error >> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting >> osc >> >> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): >> Unknown symbol in module, or unknown parameter (see dmesg) >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >> ldlm_prep_enqueue_req >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >> ldlm_resource_get >> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >> ptlrpc_lprocfs_register_obd >> . >> . >> . >> >> then I mannually set the IP address for ib0 as folows : >> # ifconfig ib0 172.24.198.111 >> >> [root at p186 ~]# ifconfig ib0 >> ib0 Link encap:InfiniBand HWaddr >> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 >> inet addr:172.24.198.112 Bcast:172.24.255.255 >> Mask:255.255.0.0 >> UP BROADCAST MULTICAST MTU:65520 Metric:1 >> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:256 >> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) >> >> then it mounted sucessfully >> >> *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI >> 172.24.198.111 at o2ib [8/64] >> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started* >> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter >> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 >> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr >> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new >> disk, initializing >> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 >> now serving dev >> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with >> recovery enabled >> Jan 16 09:47:09 p128 kernel: Lustre: >> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) >> lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups >> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: >> set parameter group_upcall=/usr/sbin/l_getgroups >> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 >> on device /dev/loop0 has started >> . >> . >> . >> >> >> ~subbu >> >> >> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen >> <Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com> >> <mailto:Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com>>> >> wrote: >> >> Subbu, >> >> I''d suggest: >> 1) make sure ko2iblnd has been brought up (please check >> if there >> is any error message when startup ko2iblnd) >> 2) echo +neterror > /proc/sys/lnet/printk, then try >> with lctl >> ping, if it still can''t work please post error messages >> >> Regards >> Liang >> >> subbu kl: >> >> Problem is similer to >> >> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html >> But by looking at the thread could not really get >> the solution >> for the problem. >> >> I have two RHEL5 Linux servers installed with >> following packages - >> >> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 >> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> >> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> >> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >> e2fsprogs-1.40.7.sun3-0redhat >> >> >> machine 1: with ib0 IP address : 172.24.198.111 >> machine 2: with ib0 IP address : 172.24.198.112 >> >> /etc/modprobe.conf contains >> options lnet networks=o2ib >> >> TCP networking worked fine and now I am trying with >> Infiniband >> network finding it difficult in communicating with >> IB nodes >> mounting effort throghs me the following error >> >> [root at p186 ~]# mount -t lustre -o loop >> /tmp/lustre-ost1 /mnt/ost1 >> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: >> Input/output error >> Is the MGS running? >> >> /var/log/messages : >> Jan 15 16:55:25 p186 kernel: kjournald starting. >> Commit >> interval 5 seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, >> internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted >> filesystem >> with ordered data mode. >> Jan 15 16:55:25 p186 kernel: kjournald starting. >> Commit >> interval 5 seconds >> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, >> internal journal >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted >> filesystem >> with ordered data mode. >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file >> extents enabled >> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc >> enabled >> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 >> sent from >> MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib >> 5s ago has >> timed out (limit 5s). >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1062:server_start_targets()) >> Required >> registration failed for lustre-OSTffff: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: >> Communication >> error with the MGS. Is the MGS running? >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1597:server_fill_super()) >> Unable to start >> targets: -5 >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1382:server_put_super()) no obd >> lustre-OSTffff >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:119:server_deregister_mount()) >> lustre-OSTffff not registered >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >> blocks 0 >> reqs (0 success) >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >> extents >> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >> generated >> and it took 0 >> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >> preallocated, 0 discarded >> Jan 15 16:55:30 p186 kernel: Lustre: server umount >> lustre-OSTffff complete >> Jan 15 16:55:30 p186 kernel: LustreError: >> 7193:0:(obd_mount.c:1951:lustre_fill_super()) >> Unable to mount >> (-5) >> >> All pinging efforts also failed to the IB NIDS >> local/remote >> can ping the ip address : >> [root at p186 ~]# ping 172.24.198.112 >> PING 172.24.198.112 (172.24.198.112) 56(84) bytes >> of data. >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >> icmp_seq=1 ttl=64 time=0.052 ms >> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >> icmp_seq=2 ttl=64 time=0.024 ms >> >> >> --- 172.24.198.112 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, >> time 1000ms >> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms >> [root at p186 ~]# ping 172.24.198.111 >> PING 172.24.198.111 (172.24.198.111) 56(84) bytes >> of data. >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >> icmp_seq=1 ttl=64 time=2.16 ms >> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >> icmp_seq=2 ttl=64 time=0.296 ms >> >> >> --- 172.24.198.111 ping statistics --- >> 2 packets transmitted, 2 received, 0% packet loss, >> time 1000ms >> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms >> >> but cant ping the NIDS : >> [root at p186 ~]# lctl ping 172.24.198.112 at o2ib >> failed to ping 172.24.198.112 at o2ib: Input/output error >> [root at p186 ~]# lctl ping 172.24.198.111 at o2ib >> failed to ping 172.24.198.111 at o2ib: Input/output error >> >> Any idea why lnet cant ping NIDS ? >> >> some more configurations: >> [root at p186 ~]# ibstat >> CA ''mthca0'' >> CA type: MT23108 >> Number of ports: 2 >> Firmware version: 3.5.0 >> Hardware version: a1 >> Node GUID: 0x0002c9020021550c >> >> Machines are connected via IB switch. >> >> Looking forward for help. >> >> ~subbu >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> <mailto:Lustre-discuss at lists.lustre.org> >> <mailto:Lustre-discuss at lists.lustre.org >> <mailto:Lustre-discuss at lists.lustre.org>> >> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >> >> >> -- . . . s u b b u >> "You''ve got to be original, because if you''re like someone >> else, what do they need you for?" >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> <mailto:Lustre-discuss at lists.lustre.org> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >> >> >> >> -- . . . s u b b u >> "You''ve got to be original, because if you''re like someone else, >> what do they need you for?" >> >> >> >> >> -- >> . . . s u b b u >> "You''ve got to be original, because if you''re like someone else, what do >> they need you for?" >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> > >-- . . . s u b b u "You''ve got to be original, because if you''re like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090127/e85bdf80/attachment-0001.html
Problem solved. Earlier we had kept both my ethernet and infiniband network interfaces in the same subnet. After changing the ipoib network interfaces subnet to different subnet it worked. I think probably it makes sense to add a note about this subnet config in Lustre manual as well. Thanks again. ~subbu On Tue, Jan 27, 2009 at 3:49 PM, subbu kl <subbukl at gmail.com> wrote:> Liang, > > please find the info you have asked below. > > There are two nodes MDS and OSS1 connected throgh a silverstorme Infiniband > switch and MDS running IB subnet manager running. > > [root at MDS ~]# cat /etc/modprobe.conf > alias eth0 bnx2 > alias eth1 bnx2 > alias scsi_hostadapter megaraid_sas > alias scsi_hostadapter1 ata_piix > alias scsi_hostadapter2 usb-storage > alias ib0 ib_ipoib > alias ib1 ib_ipoib > alias net-pf-27 ib_sdp > options loop max_loop=64 > options lnet networks=o2ib(ib0) > options ib_madeye data=1 > > [root at MDS ~]# ifconfig > eth0 Link encap:Ethernet HWaddr 00:18:8B:40:63:C3 > inet addr:172.24.198.128 Bcast:172.24.255.255 Mask:255.255.0.0 > inet6 addr: fe80::218:8bff:fe40:63c3/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:4203 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1069 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:415345 (405.6 KiB) TX bytes:109548 (106.9 KiB) > Interrupt:169 Memory:f8000000-f8012100 > > ib0 Link encap:InfiniBand HWaddr > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr:172.24.198.140 Bcast:172.24.255.255 Mask:255.255.0.0 > inet6 addr: fe80::202:c902:22:cd49/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > RX packets:8 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11 errors:0 dropped:6 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:4163 (4.0 KiB) TX bytes:4205 (4.1 KiB) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:1614 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1614 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:5322452 (5.0 MiB) TX bytes:5322452 (5.0 MiB) > > [root at MDS ~]# lctl list_nids > 172.24.198.140 at o2ib > [root at MDS ~]# route -e > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > 172.24.0.0 * 255.255.0.0 U 0 0 0 eth0 > 172.24.0.0 * 255.255.0.0 U 0 0 0 ib0 > 169.254.0.0 * 255.255.0.0 U 0 0 0 ib0 > default 172.24.198.250 0.0.0.0 UG 0 0 0 eth0 > > [root at MDS ~]# echo +neterror > /proc/sys/lnet/printk > > [root at MDS ~]# echo +neterror > /proc/sys/lnet/printk > [root at MDS ~]# lctl list_nids > 172.24.198.140 at o2ib > [root at MDS ~]# lctl ping 172.24.198.140 at o2ib > 12345-0 at lo > 12345-172.24.198.140 at o2ib > [root at MDS ~]# lctl ping 172.24.198.141 at o2ib > failed to ping 172.24.198.141 at o2ib: Input/output error > > > /var/log/messages : > > Jan 27 15:41:41 MDS kernel: Lustre: > 5649:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.141 at o2ib: ROUTE > ERROR -22 > Jan 27 15:41:41 MDS kernel: Lustre: > 5649:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages > for 172.24.198.141 at o2ib: connection failed > > > > > > > > > > > > > [root at OSS1 ~]# cat /etc/modprobe.conf > alias eth0 e1000 > alias eth1 e1000 > alias scsi_hostadapter megaraid_mbox > alias scsi_hostadapter1 qla2xxx > options loop max_loop=64 > alias ib0 ib_ipoib > options lnet networks=o2ib(ib0) > options ib_ipoib debug_level=1 > options ib_ipoib mcast_debug_level=1 > ### BEGIN MPP Driver Comments ### > remove mppUpper if [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e > "Please Unload Physical HBA Driver prior to unloading mppUpper."; else > /sbin/modprobe -r --ignore-remove mppUpper; fi > # Additional config info can be found in /opt/mpp/modprobe.conf.mppappend. > # The Above config info is needed if you want to make mkinitrd manually. > # Please read the Readme file that came with MPP driver for building > RamDisk manually. > # Edit the ''/etc/modprobe.conf'' file and run ''mppUpdate'' to create Ramdisk > dynamically. > ### END MPP Driver Comments ### > #alias ib1 ib_ipoib > alias net-pf-27 ib_sdp > options ib_madeye data=1 > > [root at OSS1 ~]# ifconfig > eth0 Link encap:Ethernet HWaddr 00:13:72:5D:3B:65 > inet addr:172.24.198.186 Bcast:172.24.255.255 Mask:255.255.0.0 > inet6 addr: fe80::213:72ff:fe5d:3b65/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:7831 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1007 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:100 > RX bytes:809440 (790.4 KiB) TX bytes:99439 (97.1 KiB) > Base address:0xdcc0 Memory:df7e0000-df800000 > > ib0 Link encap:InfiniBand HWaddr > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr:172.24.198.141 Bcast:172.24.255.255 Mask:255.255.0.0 > inet6 addr: fe80::202:c902:21:550d/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > RX packets:10 errors:0 dropped:0 overruns:0 frame:0 > TX packets:11 errors:0 dropped:7 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:4097 (4.0 KiB) TX bytes:5202 (5.0 KiB) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:94 errors:0 dropped:0 overruns:0 frame:0 > TX packets:94 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:8962 (8.7 KiB) TX bytes:8962 (8.7 KiB) > > virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 > inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0 > inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:49 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:0 (0.0 b) TX bytes:9166 (8.9 KiB) > > [root at OSS1 ~]# lctl list_nids > 172.24.198.141 at o2ib > [root at OSS1 ~]# route -e > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > 192.168.122.0 * 255.255.255.0 U 0 0 0 virbr0 > 172.24.0.0 * 255.255.0.0 U 0 0 0 eth0 > 172.24.0.0 * 255.255.0.0 U 0 0 0 ib0 > 169.254.0.0 * 255.255.0.0 U 0 0 0 ib0 > default 172.24.198.250 0.0.0.0 UG 0 0 0 eth0 > > [root at OSS1 ~]# echo +neterror > /proc/sys/lnet/printk > [root at OSS1 ~]# echo +neterror > /proc/sys/lnet/printk > [root at OSS1 ~]# lctl list_nids > 172.24.198.141 at o2ib > [root at OSS1 ~]# lctl ping 172.24.198.141 at o2ib > 12345-0 at lo > 12345-172.24.198.141 at o2ib > [root at OSS1 ~]# lctl ping 172.24.198.140 at o2ib > failed to ping 172.24.198.140 at o2ib: Input/output error > > > /var/log/messages : > > Jan 27 15:34:17 OSS1 kernel: Lustre: > 2776:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: ROUTE > ERROR -22 > Jan 27 15:34:17 OSS1 kernel: Lustre: > 2776:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages > for 172.24.198.140 at o2ib: connection failed > > > > ~subbu > > On Sat, Jan 24, 2009 at 8:06 AM, Liang Zhen <Zhen.Liang at sun.com> wrote: > >> Subbu, >> I think we can''t see anything from tcpdump even run ping sucessfully, >> because we only need ipoib for connecting (not for transaction). >> I think we need these information for diagnosing: >> 1. modprobe.conf of two nodes with IB >> 2. ifconfig on these two nodes >> 3. routing table on these two nodes >> 4. try lctl ping itself on both nodes and see if any error (with >> +neterror) >> >> Regards >> Liang >> >> subbu kl: >> >>> problem remained same, when I run lctl ping with tcpdump 4.0.0 I dont >>> see any activity on ib0 ! >>> >>> another exhaustive Lustre debug log I took with lctl ping do you see any >>> problem with it ? >>> >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(module.c:160:libcfs_psdev_open()) Process entered >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced ''ldu'': 8 at f5bc6620 >>> (tot 7258558). >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0) >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(module.c:228:libcfs_ioctl()) Process entered >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving (rc=0 : 0 >>> : 0) >>> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1223:LNetNIInit()) >>> refs 1 >>> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(api-ni.c:1614:lnet_ping()) >>> kmalloced ''info'': 144 at f0b95880 (tot 7258702). >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced ''eq'': 48 at efda1a00 (tot >>> 7258750). >>> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:72:LNetEQAlloc()) >>> kmalloced ''eq->eq_events'': 240 at f0b95c80 (tot 7258990). >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced ''md'': 84 at ed16acc0 (tot >>> 7259074). >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced ''msg'': 268 at f205a400 >>> (tot 7259342). >>> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-move.c:2395:LNetGet()) >>> LNetGet -> 12345-172.24.198.140 at o2ib >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags to >>> 12345-172.24.198.140 at o2ib >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced ''peer'': 56 at >>> efda18c0 (tot 7259398). >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] -> >>> 172.24.198.140 at o2ib (1)++ >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] -> >>> 172.24.198.140 at o2ib (2)++ >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] -> >>> 172.24.198.140 at o2ib (3)-- >>> Jan 23 17:23:39 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) >>> Process entered >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >>> eq->size: 2 >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >>> Jan 23 17:23:39 p186 kernel: Lustre: >>> 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198.140 at o2ib Addr >>> resolved: 0 >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >>> eq->size: 2 >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >>> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:239:LNetEQPoll()) >>> Process leaving (rc=0 : 0 : 0) >>> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) >>> poll 0(-1 -1) >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0 >>> Jan 23 17:23:40 p186 kernel: Lustre: 14294:0:(lib-eq.c:209:LNetEQPoll()) >>> Process entered >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >>> eq->size: 2 >>> Jan 23 17:23:40 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 >>> : -4352 : ffffef00) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 >>> : -512 : fffffe00) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : >>> 2817 : b01) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : >>> 2047 : 7ff) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 >>> : -226464 : fffc8b60) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 >>> : -8750811 : ff7a7925) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : >>> 5821091 : 58d2a3) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : >>> 3356952 : 333918) >>> Jan 23 17:23:56 p186 kernel: Lustre: >>> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294962944 >>> : -4352 : ffffef00) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294966784 >>> : -512 : fffffe00) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 : >>> 2817 : b01) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 : >>> 2047 : 7ff) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4294740832 >>> : -226464 : fffc8b60) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=4286216485 >>> : -8750811 : ff7a7925) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=5821091 : >>> 5821091 : 58d2a3) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=3356952 : >>> 3356952 : 333918) >>> Jan 23 17:24:21 p186 kernel: Lustre: >>> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847) >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: >>> ROUTE ERROR -110 >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] -> >>> 172.24.198.140 at o2ib (2)-- >>> Jan 23 17:24:29 p186 kernel: Lustre: 2794:0:(router.c:151:lnet_notify()) >>> 172.24.198.141 at o2ib notifying 172.24.198.140 at o2ib: down >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(router.c:82:lnet_notify_locked()) Old news >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting messages >>> for 172.24.198.140 at o2ib: connection failed >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ed16acc0 >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed ''md'': 84 at ed16acc0 (tot >>> 7259314). >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed ''msg'': 268 at f205a400 (tot >>> 7259046). >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] -> >>> 172.24.198.140 at o2ib (1)-- >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed ''peer'': 56 at efda18c0 >>> (tot 7258990). >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, >>> eq->size: 2 >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1) >>> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:232:LNetEQPoll()) >>> Process leaving (rc=1 : 1 : 1) >>> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1665:lnet_ping()) >>> poll 1(4 -113) unlinked >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed ''eq'': 48 at efda1a00 (tot >>> 7258942). >>> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(lib-eq.c:135:LNetEQFree()) >>> kfreed ''events'': 240 at f0b95c80 (tot 7258702). >>> Jan 23 17:24:29 p186 kernel: Lustre: 14294:0:(api-ni.c:1772:lnet_ping()) >>> kfreed ''info'': 144 at f0b95880 (tot 7258558). >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(module.c:336:libcfs_ioctl()) Process leaving (rc=4294967291 : -5 : >>> fffffffb) >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(module.c:178:libcfs_psdev_release()) Process entered >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(module.c:183:libcfs_psdev_release()) kfreed ''ldu'': 8 at f5bc6620 >>> (tot 7258550). >>> Jan 23 17:24:29 p186 kernel: Lustre: >>> 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 : 0 : 0) >>> >>> ~subbu >>> >>> On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <subbukl at gmail.com <mailto: >>> subbukl at gmail.com>> wrote: >>> >>> Liang, >>> >>> Right; you reproduced the exact problem. But as you can see in my >>> previous mail I think I have solved that problem by mannually >>> assiging IP to ib0 (check this line # ifconfig ib0 172.24.198.111 >>> and *"Added LNI" lines *) >>> >>> we are back to sqare one now I guess ! LNET is up with mannually >>> assigned IPs. normal ping succeds between machines but not lctl ping. >>> >>> so my current problem is this : >>> >>> # lctl ping 172.24.198.112 at o2ib >>> failed to ping 172.24.198.112 at o2ib: Input/output error >>> >>> /var/log/messages: >>> >>> >>> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687: >>> kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22 >>> Jan 16 10:24:14 p128 kernel: Lustre: >>> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting >>> messages for 172.24.198.112 at o2ib: connection failed >>> >>> how can I get rid of this connection problem? >>> >>> ~subbu >>> >>> >>> >>> On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com >>> <mailto:Zhen.Liang at sun.com>> wrote: >>> >>> Subbu, >>> >>> We don''t have any tip for setup IPoIB, looks like linux can''t >>> find the ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I >>> think it''s because you didn''t assign any address to ib0 (or >>> failed to assign address to ib0) before loading o2iblnd in >>> the first try. >>> I can reproduce exactly same error by: >>> 1. modprobe ib_ipoib >>> 2. ifconfig ib0 up // without assign any address >>> 3. modprobe ko2iblnd >>> 4. lctl network up >>> >>> Regards >>> Liang >>> >>> subbu kl: >>> >>> Liang, >>> after executing following echo : >>> echo +neterror > /proc/sys/lnet/printk >>> >>> now lctlt ping shows the following error >>> >>> # lctl ping 172.24.198.112 at o2ib >>> failed to ping 172.24.198.112 at o2ib: Input/output error >>> >>> Jan 16 10:24:14 p128 kernel: Lustre: >>> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) >>> 172.24.198.112 at o2ib: ROUTE ERROR -22 >>> Jan 16 10:24:14 p128 kernel: Lustre: >>> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) >>> Deleting messages for 172.24.198.112 at o2ib: connection failed >>> >>> Looks like some problem with "IB connection manager" ! >>> >>> 1. do we have any help docs to setup IPoIB and Lustre, >>> lustre operation manual has very minimal info about this . >>> I think I am missing some IPoIB setup part here. >>> 2. or is it mannual assignment of IP addresses to "ib0" >>> is creating some problem >>> >>> >>> *Some more supporting info : >>> *subnet manager of following version is also running : >>> OpenSM 3.1.8 >>> >>> Initially I got this error for MDS mount >>> >>> Jan 16 09:45:20 p128 kernel: LustreError: >>> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can''t get >>> IP address for interface ib0 >>> Jan 16 09:45:20 p128 kernel: LustreError: >>> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can''t query IPoIB >>> interface ib0: -99 >>> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error >>> -100 starting up LNI o2ib >>> Jan 16 09:45:21 p128 kernel: LustreError: >>> 4991:0:(events.c:707:ptlrpc_init_portals()) network >>> initialisation failed >>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting >>> ptlrpc >>> >>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): >>> Input/output error >>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting >>> osc >>> >>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): >>> Unknown symbol in module, or unknown parameter (see dmesg) >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >>> ldlm_prep_enqueue_req >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >>> ldlm_resource_get >>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol >>> ptlrpc_lprocfs_register_obd >>> . >>> . >>> . >>> >>> then I mannually set the IP address for ib0 as folows : >>> # ifconfig ib0 172.24.198.111 >>> >>> [root at p186 ~]# ifconfig ib0 >>> ib0 Link encap:InfiniBand HWaddr >>> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 >>> inet addr:172.24.198.112 Bcast:172.24.255.255 >>> Mask:255.255.0.0 >>> UP BROADCAST MULTICAST MTU:65520 Metric:1 >>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> collisions:0 txqueuelen:256 >>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) >>> >>> then it mounted sucessfully >>> >>> *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI >>> 172.24.198.111 at o2ib [8/64] >>> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started* >>> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter >>> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000 >>> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr >>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new >>> disk, initializing >>> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 >>> now serving dev >>> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with >>> recovery enabled >>> Jan 16 09:47:09 p128 kernel: Lustre: >>> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) >>> lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups >>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: >>> set parameter group_upcall=/usr/sbin/l_getgroups >>> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 >>> on device /dev/loop0 has started >>> . >>> . >>> . >>> >>> >>> ~subbu >>> >>> >>> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen >>> <Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com> >>> <mailto:Zhen.Liang at sun.com <mailto:Zhen.Liang at sun.com>>> >>> wrote: >>> >>> Subbu, >>> >>> I''d suggest: >>> 1) make sure ko2iblnd has been brought up (please check >>> if there >>> is any error message when startup ko2iblnd) >>> 2) echo +neterror > /proc/sys/lnet/printk, then try >>> with lctl >>> ping, if it still can''t work please post error messages >>> >>> Regards >>> Liang >>> >>> subbu kl: >>> >>> Problem is similer to >>> >>> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html >>> But by looking at the thread could not really get >>> the solution >>> for the problem. >>> >>> I have two RHEL5 Linux servers installed with >>> following packages - >>> >>> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1 >>> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> >>> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> >>> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp >>> e2fsprogs-1.40.7.sun3-0redhat >>> >>> >>> machine 1: with ib0 IP address : 172.24.198.111 >>> machine 2: with ib0 IP address : 172.24.198.112 >>> >>> /etc/modprobe.conf contains >>> options lnet networks=o2ib >>> >>> TCP networking worked fine and now I am trying with >>> Infiniband >>> network finding it difficult in communicating with >>> IB nodes >>> mounting effort throghs me the following error >>> >>> [root at p186 ~]# mount -t lustre -o loop >>> /tmp/lustre-ost1 /mnt/ost1 >>> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: >>> Input/output error >>> Is the MGS running? >>> >>> /var/log/messages : >>> Jan 15 16:55:25 p186 kernel: kjournald starting. >>> Commit >>> interval 5 seconds >>> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, >>> internal journal >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted >>> filesystem >>> with ordered data mode. >>> Jan 15 16:55:25 p186 kernel: kjournald starting. >>> Commit >>> interval 5 seconds >>> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, >>> internal journal >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted >>> filesystem >>> with ordered data mode. >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file >>> extents enabled >>> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc >>> enabled >>> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 >>> sent from >>> MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib >>> 5s ago has >>> timed out (limit 5s). >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1062:server_start_targets()) >>> Required >>> registration failed for lustre-OSTffff: -5 >>> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: >>> Communication >>> error with the MGS. Is the MGS running? >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1597:server_fill_super()) >>> Unable to start >>> targets: -5 >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1382:server_put_super()) no obd >>> lustre-OSTffff >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:119:server_deregister_mount()) >>> lustre-OSTffff not registered >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >>> blocks 0 >>> reqs (0 success) >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >>> extents >>> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >>> generated >>> and it took 0 >>> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 >>> preallocated, 0 discarded >>> Jan 15 16:55:30 p186 kernel: Lustre: server umount >>> lustre-OSTffff complete >>> Jan 15 16:55:30 p186 kernel: LustreError: >>> 7193:0:(obd_mount.c:1951:lustre_fill_super()) >>> Unable to mount >>> (-5) >>> >>> All pinging efforts also failed to the IB NIDS >>> local/remote >>> can ping the ip address : >>> [root at p186 ~]# ping 172.24.198.112 >>> PING 172.24.198.112 (172.24.198.112) 56(84) bytes >>> of data. >>> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >>> icmp_seq=1 ttl=64 time=0.052 ms >>> 64 bytes from 172.24.198.112 <http://172.24.198.112>: >>> icmp_seq=2 ttl=64 time=0.024 ms >>> >>> >>> --- 172.24.198.112 ping statistics --- >>> 2 packets transmitted, 2 received, 0% packet loss, >>> time 1000ms >>> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms >>> [root at p186 ~]# ping 172.24.198.111 >>> PING 172.24.198.111 (172.24.198.111) 56(84) bytes >>> of data. >>> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >>> icmp_seq=1 ttl=64 time=2.16 ms >>> 64 bytes from 172.24.198.111 <http://172.24.198.111>: >>> icmp_seq=2 ttl=64 time=0.296 ms >>> >>> >>> --- 172.24.198.111 ping statistics --- >>> 2 packets transmitted, 2 received, 0% packet loss, >>> time 1000ms >>> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms >>> >>> but cant ping the NIDS : >>> [root at p186 ~]# lctl ping 172.24.198.112 at o2ib >>> failed to ping 172.24.198.112 at o2ib: Input/output error >>> [root at p186 ~]# lctl ping 172.24.198.111 at o2ib >>> failed to ping 172.24.198.111 at o2ib: Input/output error >>> >>> Any idea why lnet cant ping NIDS ? >>> >>> some more configurations: >>> [root at p186 ~]# ibstat >>> CA ''mthca0'' >>> CA type: MT23108 >>> Number of ports: 2 >>> Firmware version: 3.5.0 >>> Hardware version: a1 >>> Node GUID: 0x0002c9020021550c >>> >>> Machines are connected via IB switch. >>> >>> Looking forward for help. >>> >>> ~subbu >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> <mailto:Lustre-discuss at lists.lustre.org> >>> <mailto:Lustre-discuss at lists.lustre.org >>> <mailto:Lustre-discuss at lists.lustre.org>> >>> >>> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> >>> >>> -- . . . s u b b u >>> "You''ve got to be original, because if you''re like someone >>> else, what do they need you for?" >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> <mailto:Lustre-discuss at lists.lustre.org> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> >>> >>> >>> -- . . . s u b b u >>> "You''ve got to be original, because if you''re like someone else, >>> what do they need you for?" >>> >>> >>> >>> >>> -- >>> . . . s u b b u >>> "You''ve got to be original, because if you''re like someone else, what do >>> they need you for?" >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >> >> > > > -- > . . . s u b b u > "You''ve got to be original, because if you''re like someone else, what do > they need you for?" >-- . . . s u b b u "You''ve got to be original, because if you''re like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090131/9c4a6014/attachment-0001.html