Hi all,

A few Lustre nodes hung for no apparent reason. After we rebooted the server, we found that the Lustre file system could not be mounted properly. We are using Voltaire ibhost.

# /usr/sbin/lconf --node node06-ib0 /etc/lustre/config.xml
loading module: ptlrpc srcdir None devdir ptlrpc
Bad module options? Check dmesg.
! modprobe (error 1):
> FATAL: Error inserting ptlrpc (/lib/modules/2.6.9-55.0.9.EL_lustre.1.4.11.1custom/kernel/fs/lustre/ptlrpc.ko): Input/output error

# dmesg
LustreError: 4273:0:(viblnd.c:1890:kibnal_startup()) Can't find an active port on InfiniHost_III_Ex0
LustreError: Error -100 starting up LNI vib
LustreError: 4273:0:(events.c:647:ptlrpc_init_portals()) network initialisation failed

Could anyone tell me how to fix this problem? Thanks in advance.

--
Regards,
Changer

On Mon, Jan 07, 2008 at 06:20:52PM +0800, Changer Van wrote:
> ......
> # dmesg
>
> LustreError: 4273:0:(viblnd.c:1890:kibnal_startup())
> Can't find an active port on InfiniHost_III_Ex0

It means that viblnd couldn't find a port whose link state was active on the HCA InfiniHost_III_Ex0, i.e. no link on the device was usable.

Were there any other error messages from viblnd before this one? Did you see this problem on just one node?

Isaac

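For readers following along, here is a minimal Python sketch that pulls the source locations and the numeric error code out of the LustreError lines in the original post. The log text is copied from that post; the function name `summarize` and the one-entry errno table are illustrative assumptions, not part of Lustre. On Linux, errno 100 is ENETDOWN ("Network is down"), which agrees with the explanation above that no usable link was found.

```python
import re

# Log lines copied from the dmesg output in the original post.
DMESG = """\
LustreError: 4273:0:(viblnd.c:1890:kibnal_startup()) Can't find an active port on InfiniHost_III_Ex0
LustreError: Error -100 starting up LNI vib
LustreError: 4273:0:(events.c:647:ptlrpc_init_portals()) network initialisation failed
"""

# Hand-written table of Linux errno values; only the entry needed here.
LINUX_ERRNO = {100: "ENETDOWN (Network is down)"}

# Matches "(file.c:123:function())" as printed by Lustre debug messages.
LOCATION = re.compile(r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)")
# Matches "Error -100" style return codes.
ERROR_CODE = re.compile(r"Error (-\d+)")


def summarize(dmesg_text):
    """Return (source locations, decoded error codes) found in LustreError lines."""
    locations, codes = [], []
    for line in dmesg_text.splitlines():
        if not line.startswith("LustreError:"):
            continue
        loc = LOCATION.search(line)
        if loc:
            locations.append((loc["file"], int(loc["line"]), loc["func"]))
        code = ERROR_CODE.search(line)
        if code:
            number = -int(code.group(1))  # kernel code is negative; errno is positive
            codes.append((number, LINUX_ERRNO.get(number, "unknown")))
    return locations, codes


locations, codes = summarize(DMESG)
print(locations)
print(codes)
```

The locations show that the failure starts in viblnd's startup routine and is then reported by ptlrpc, so the module load error is a consequence of the network layer failing, not a separate fault.
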
On Jan 8, 2008 1:35 AM, Isaac Huang <He.Huang at sun.com> wrote:
> Were there any other error messages from viblnd before this one?

There were no other error messages, only a related message like 'ADDRCONF(NETDEV_UP):ipoib0: link is not ready'.

> Did you see this problem on just one node?

Four nodes cannot mount the Lustre file system. The other nodes can mount it, but they log the following error messages:

# dmesg
divert: not allocating divert_blk for non-ethernet device ipoib0
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(127.0.0.1) failed
new: ipoib_allow_arp_joins: 1
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(11.0.0.4) failed
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(11.0.0.4) failed
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(11.0.0.4) failed

How can I check the link on the device? Thanks in advance.

--
Regards,
Changer

If you're using IPoIB, you can use standard TCP/IP diagnostic tools the same way you would on an Ethernet link (ifconfig, ping, traceroute, telnet, etc.).

If you're using a copper-to-optical converter in your data path as well, the Emcore MIAs have link lights on them which will tell you if a physical link is present (check the documentation). I know with STP InfiniBand connectors, there is some ambiguity about terminology with some vendors and manufacturers, and the fibre arrangement doesn't provide a lot of wiggle room.

Klaus

On 1/7/08 7:56 PM, "Changer Van" <changerv at gmail.com> did etch on stone tablets:
> How can I check the link on the device? Thanks in advance.

The network connection is down. I cannot ping the other nodes. I ran the vstat command and found that the port_state of one port is 'PORT_INITIALIZE'. What does 'PORT_INITIALIZE' mean? Does it mean my IB card is broken?

1 HCA found:
        hca_id=InfiniHost_III_Ex0
        pci_location={BUS=0x20,DEV/FUNC=0x00}
        vendor_id=0x02C9
        vendor_part_id=0x6282
        hw_ver=0xA0
        fw_ver=5.1.400
        PSID=MT_0140000001
        num_phys_ports=2
                port=1
                port_state=PORT_INITIALIZE
                sm_lid=0x0000
                port_lid=0x0000
                port_lmc=0x00
                max_mtu=2048

                port=2
                port_state=PORT_DOWN
                sm_lid=0x0000
                port_lid=0x0000
                port_lmc=0x00
                max_mtu=2048

--
Regards,
Changer

On Jan 9, 2008 3:27 AM, Klaus Steden <klaus.steden at thomson.net> wrote:
> If you're using IPoIB, you can use standard TCP/IP diagnostic tools the
> same way you would on an Ethernet link (ifconfig, ping, traceroute, telnet,
> etc.)

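As an aside on reading the vstat output above, the following Python sketch parses the per-port fields and attaches a hint to each state. The hints rest on general InfiniBand port-state semantics (a port in the Initialize state has a physical link but has not yet been configured by a subnet manager, and a LID of 0 means none has been assigned); they are not taken from Voltaire documentation. The names `parse_ports`, `has_lid`, and `describe` are hypothetical helpers written for this illustration, and only the two states seen in this thread are covered.

```python
# Abridged vstat output from the message above (field names unchanged).
VSTAT = """\
1 HCA found:
hca_id=InfiniHost_III_Ex0
num_phys_ports=2
port=1
port_state=PORT_INITIALIZE
sm_lid=0x0000
port_lid=0x0000
port=2
port_state=PORT_DOWN
sm_lid=0x0000
port_lid=0x0000
"""

# Hints based on general InfiniBand semantics (assumption, see lead-in).
HINTS = {
    "PORT_DOWN": "no physical link; check the cable and the switch port",
    "PORT_INITIALIZE": "physical link is up, but no subnet manager has configured the port",
}


def parse_ports(text):
    """Collect the key=value fields that follow each 'port=N' line."""
    ports, current = [], None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if not sep:
            continue  # lines such as "1 HCA found:" carry no field
        if key == "port":
            current = {"port": int(value)}
            ports.append(current)
        elif current is not None:
            current[key] = value  # HCA-level fields before the first port are ignored
    return ports


def has_lid(port):
    """True if the port has been assigned a non-zero LID."""
    return int(port.get("port_lid", "0x0"), 16) != 0


def describe(port):
    state = port.get("port_state", "unknown")
    return f"port {port['port']}: {state}: {HINTS.get(state, 'no hint for this state')}"


ports = parse_ports(VSTAT)
for p in ports:
    print(describe(p), "| LID assigned:", has_lid(p))
```

Under these assumptions, port 1 in the posted output points at a missing or unresponsive subnet manager more than at a broken card, while port 2 simply has nothing usable connected.
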
I don't know if the Voltaire IB stack is the same as OFED, but I'm guessing it has a subnet manager. Check that. I've had similar issues when my subnet manager has crashed.

On Jan 9, 2008, at 3:08 AM, Changer Van wrote:
> The network connection is down. I cannot ping the other nodes.
> I ran the vstat command and found that the port_state of one port is
> 'PORT_INITIALIZE'.
> What does 'PORT_INITIALIZE' mean? Does it mean my IB card is broken?

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org

Hi there,

I suppose I should ask all the obvious questions, like:

* does your IB card have an IP address assigned to it?
* does it have the right netmask configured for the address?
* do the other machines on the IB network also have addresses on them with the correct netmasks?
* have you tested TCP/IP on any other hosts in your cluster and confirmed that it is working as expected?

I don't have a lot of experience with broken IB cards, but they're just as subject to IP misconfiguration as Ethernet can be ...

I don't recall offhand if there are hardware layer diagnostics for the (I'm assuming Voltaire) card itself, but there probably are, and they could help you verify physical connectivity. However, assuming your IP configuration is both valid and usable, if it's still inoperable, then you probably have a bad card.

Have you tried the other IB port (if the card has more than one), or swapping another IB card from another host that you know to be working?

Klaus

On 1/9/08 12:08 AM, "Changer Van" <changerv at gmail.com> did etch on stone tablets:
> The network connection is down. I cannot ping the other nodes.
> I ran the vstat command and found that the port_state of one port is
> 'PORT_INITIALIZE'.
> What does 'PORT_INITIALIZE' mean? Does it mean my IB card is broken?

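The address and netmask questions in the checklist above can be sanity-checked offline. This is a small sketch using Python's standard `ipaddress` module; `same_subnet` is a hypothetical helper name. The address 11.0.0.4 appears in the dmesg output earlier in the thread, while the other addresses and every prefix length below are invented for illustration.

```python
import ipaddress


def same_subnet(a, b):
    """True if two 'address/prefix' interface strings fall in the same IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


# 11.0.0.4 is taken from the thread; the peers and prefix lengths are hypothetical.
matching = same_subnet("11.0.0.4/24", "11.0.0.6/24")
mask_mismatch = same_subnet("11.0.0.4/24", "11.0.0.6/16")
other_network = same_subnet("11.0.0.4/24", "11.0.1.6/24")

print(matching, mask_mismatch, other_network)
```

Comparing networks instead of raw addresses also catches the case where two hosts share an address range but disagree on the netmask. A check like this only validates configuration; it says nothing about whether the InfiniBand link itself is up.
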
Yes, the subnet manager had crashed. I rebooted the InfiniBand switch, and everything is fine now.

Regards,
Changer

On 1/9/08, Aaron Knister <aaron at iges.org> wrote:
> I don't know if the Voltaire IB stack is the same as OFED, but I'm
> guessing it has a subnet manager. Check that. I've had similar issues when
> my subnet manager has crashed.