Danny Sternkopf
2008-Jul-31 14:08 UTC
[Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
Hi,

installed all the new Lustre 1.6.5.1 packages on a CentOS5.1 system and if I start OpenIB the server crashes. It also can't be rebooted anymore until the kernel-ib RPM is deinstalled.

The list of installed packages:

lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-source-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
kernel-lustre-source-2.6.18-53.1.14.el5_lustre.1.6.5.1
lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp

The previous OFED installation is completely removed. What's wrong? During boot the server hangs at udev startup. Then it stops booting.

I also tried to BUILD OFED 1.3.1 and 1.3, but it fails due to missing modules, as also mentioned here:
http://lists.lustre.org/pipermail/lustre-discuss/2008-June/007767.html

Did anybody get it running?

best regards,
Danny

--
Danny Sternkopf http://www.nec.de/hpc dsternkopf at hpce.nec.com
HPCE Division Germany phone: +49-711-68770-35 fax: +49-711-6877145
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NEC Deutschland GmbH, Hansaallee 101, 40549 Düsseldorf
Geschäftsführer Yuya Momose
Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743
Brian J. Murrell
2008-Jul-31 14:32 UTC
[Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
On Thu, 2008-07-31 at 16:08 +0200, Danny Sternkopf wrote:
> Hi,
>
> installed all the new Lustre 1.6.5.1 packages on a CentOS5.1 system and
> if I start OpenIB the server crashes. It also can't be rebooted anymore
> until the kernel-ib RPM is deinstalled.

That sounds very suspect.

> Did anybody get it running?

Most certainly our QA department had it all running before we released it.

I suspect that you have some other problem masquerading itself as a problem with the OFED stack.

I'm afraid there is not much we can do to help you without seeing some logs or error messages or the like. You might have to instrument your boot with some debugging to see where it's really getting stuck.

b.
Danny Sternkopf
2008-Aug-01 08:57 UTC
[Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
Hi Brian,

we got the following messages when starting IB:

Jul 31 15:22:55 doss1 kernel: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
Jul 31 15:22:55 doss1 kernel: ib_mthca: Initializing 0000:20:00.0
Jul 31 15:22:55 doss1 kernel: GSI 24 sharing vector 0x92 and IRQ 24
Jul 31 15:22:55 doss1 kernel: ACPI: PCI Interrupt 0000:20:00.0[A] -> GSI 24 (level, low) -> IRQ 146
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: HCA FW version 3.1.000 is old (3.5.000 is current).
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: If you have problems, try updating your HCA FW.
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: NOP command failed to generate interrupt (IRQ 170).
Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: Trying again with MSI/MSI-X disabled.
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ returned status 0xff
Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_MPT failed (-11)

So we updated the HCA FW and it resolved the problem. Now IB is working.

How about the 2nd issue?
http://lists.lustre.org/pipermail/lustre-discuss/2008-June/007767.html

Is there any news?

Thank you and best regards,
Danny

Brian J. Murrell wrote:
> On Thu, 2008-07-31 at 16:08 +0200, Danny Sternkopf wrote:
>> Hi,
>>
>> installed all the new Lustre 1.6.5.1 packages on a CentOS5.1 system and
>> if I start OpenIB the server crashes. It also can't be rebooted anymore
>> until the kernel-ib RPM is deinstalled.
>
> That sounds very suspect.
>
>> Did anybody get it running?
>
> Most certainly our QA department had it all running before we released
> it.
>
> I suspect that you have some other problem masquerading itself as a
> problem with the OFED stack.
>
> I'm afraid there is not much we can do to help you without seeing some
> logs or error messages or the like. You might have to instrument your
> boot with some debugging to see where it's really getting stuck.
>
> b.
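The firmware warning in the log above is easy to miss among the later HW2SW_EQ and HW2SW_MPT failures. A minimal sketch of how one might scan kernel log lines for it follows. The function name find_old_firmware and the sample list are hypothetical and not part of any Lustre or OFED tool; the regular expression is derived only from the single ib_mthca message format quoted in this thread and may not match other driver versions.

```python
import re

# Matches the ib_mthca warning exactly as quoted in this thread, e.g.
#   ib_mthca 0000:20:00.0: HCA FW version 3.1.000 is old (3.5.000 is current).
# Assumption: other driver versions may word this message differently.
FW_WARNING = re.compile(
    r"ib_mthca (?P<pci>[0-9a-f:.]+): HCA FW version (?P<found>[\d.]+) is old "
    r"\((?P<current>[\d.]+) is current\)"
)


def find_old_firmware(log_lines):
    """Return (pci_address, found_version, current_version) for each warning."""
    hits = []
    for line in log_lines:
        match = FW_WARNING.search(line)
        if match:
            hits.append(
                (match.group("pci"), match.group("found"), match.group("current"))
            )
    return hits


# Three lines copied from the log in the message above.
SAMPLE = [
    "Jul 31 15:22:55 doss1 kernel: ib_mthca: Initializing 0000:20:00.0",
    "Jul 31 15:22:56 doss1 kernel: ib_mthca 0000:20:00.0: HCA FW version "
    "3.1.000 is old (3.5.000 is current).",
    "Jul 31 15:23:56 doss1 kernel: ib_mthca 0000:20:00.0: HW2SW_EQ failed (-11)",
]

if __name__ == "__main__":
    for pci, found, current in find_old_firmware(SAMPLE):
        print(f"{pci}: firmware {found} is older than {current}")
```

To run this against a real log, pass an open file object in place of SAMPLE. The log location (for example /var/log/messages) is an assumption that depends on the syslog configuration of the node.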
Brian J. Murrell
2008-Aug-01 12:45 UTC
[Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
On Fri, 2008-08-01 at 10:57 +0200, Danny Sternkopf wrote:
>
> So we updated the HCA FW and it resolved the problem. Now IB is working.

Great.

> How about the 2nd issue?
> http://lists.lustre.org/pipermail/lustre-discuss/2008-June/007767.html

I'm not really sure off the top of my head -- it looks like some mismatched build options or something. However, I would encourage you to use our 1.6.5.1 release which includes the kernel-ib RPM already built.

b.