Adeyemi Adesanya
2010-Mar-23 03:22 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Hi.

I'm working on installing Lustre 1.8.2 on RHEL5.4. I noticed that the kernel-ib RPM is not available from the download site. I did get hold of the OFED 1.5 source and built a version of kernel-ib, but is this step essential for o2ib LNET on RHEL5? I would like to try and retain compatibility with the RedHat OFED distribution and use the versions of openmpi, etc. supplied by RedHat. Introducing a different OFED distribution could create a dependency mess. The Lustre 1.8.2 patched RHEL5 kernel appears to already include Infiniband drivers in "/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.2/kernel/drivers/infiniband". What am I missing?

-------
Yemi
Lawrence Sorrillo
2010-Mar-23 12:47 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
I don't believe the trio of lustre-1.8.2, OFED-1.5 and RHEL5.4 is supported.

I think there is a problem integrating OFED-1.5 with lustre-1.8.2 (specifically with the module ko2iblnd.ko). I tried this for several days and failed at this module every time.

The following page suggests that some of these modules are not friendly with OFED-1.5:

http://wiki.lustre.org/index.php/Lustre_Release_Information#Lustre_Support_Matrix

If you get this to work, please give me a shout about it.

Cheers,
~Lawrence
Marco Aurelio L Gomes
2010-Mar-23 13:01 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
I've also tried to get Lustre 1.8.2 working with RHEL5.4 and OFED 1.5, but I didn't get this trio working. Even with OFED 1.4.2 I had problems when modprobing the lustre module.

The only trio that works here is Lustre 1.8.1, OFED 1.4.2 and kernel 2.6.18-128 (RHEL5.3).

If someone has had success with Lustre 1.8.2 and OFED 1.5, please let me know.

Regards,

Marco Gomes
Systems/HPC-Cluster
Numeric Offshore Tank
+55 11 3777-4142 #250
+55 11 3091-5350 #250
Ken Hornstein
2010-Mar-23 13:15 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
> I've tried also to get Lustre 1.8.2 working with RHEL5.4 and OFED 1.5
> but I didn't get this trio working. Even with OFED 1.4.2 I had problems
> when modprobing lustre module.

I think you had problems with the module symbol versions, right? Those are relatively easy to track down, once you know a few tricks; the core problem is that you (or someone else) compiled Lustre by pointing it at the "wrong" version of OFED.

If that's your problem, then let me know; I can give you some guidance on how to figure out what is wrong.

--Ken
Ken Hornstein
2010-Mar-23 13:51 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
> You're right, I had problems with the module symbol versions using
> Lustre 1.8.2 packages available at Sun website, kernel
> 2.6.18-164.11.1.el5 (RHEL 5.4) and OFED 1.5. The same problems happens
> when using OFED 1.4.2.

So since this comes up now and then, I've cc'd the list.

You can Google around to find more about kernel symbol versioning. The short answer is that there is a CRC associated with each exported symbol in the loaded kernel, and that version is recorded in the module when it is compiled. That's all well and good, but figuring out what happens when it doesn't work is a pain, because the information isn't all in one place (and nobody has explained it well, at least that I've seen).

When a module (like Lustre) is compiled, it's pointed at a file called "Module.symvers"; that contains the versions of the symbols that modules are expected to link against, and those versions are recorded in the module object file. When you get this mismatch at module load time, one of two things is happening: the "wrong" OFED is being loaded, or you linked against the "wrong" Module.symvers file.

How do you figure out which one is the problem? Well, let's take a common OFED symbol, like rdma_connect. You can find out the version of this symbol by grep'ing /proc/kallsyms. On our system:

# grep rdma_connect /proc/kallsyms
ffffffffa0375510 u rdma_connect [ko2iblnd]
ffffffffa0375510 u rdma_connect [rdma_ucm]
ffffffffa0375510 u rdma_connect [ib_sdp]
ffffffffa0377000 r __ksymtab_rdma_connect [rdma_cm]
ffffffffa0377225 r __kstrtab_rdma_connect [rdma_cm]
ffffffffa03770f0 r __kcrctab_rdma_connect [rdma_cm]
000000000ef3a1e8 a __crc_rdma_connect [rdma_cm]
ffffffffa0375510 T rdma_connect [rdma_cm]

The symbol you care about is the absolute symbol, the one prefixed by __crc. So in this case, we are interested in __crc_rdma_connect, and that symbol's version is 0x0ef3a1e8. This is the symbol version used by the currently running kernel.

Which version is Lustre linked against? Well, for that you need to find the ko2iblnd.ko file, and dump the __versions section.

# objdump -s -j __versions ko2iblnd.ko | less
[...]
0670 00000000 00000000 00000000 00000000 ................
0680 e8a1f30e 00000000 72646d61 5f636f6e ........rdma_con
0690 6e656374 00000000 00000000 00000000 nect............
06a0 00000000 00000000 00000000 00000000 ................

This display isn't as pretty, but you want to look in the hex dump just before the symbol name. In this case, right before rdma_connect, you will see "e8a1f30e" ... which is the little-endian version of our symbol version! So they match up, and everything works.

If you want to find out which symbol version is in a particular OFED module (in this case, we want to look at rdma_cm.ko), you can do this:

# nm ./kernel/drivers/infiniband/core/rdma_cm.ko | grep rdma_connect
00000000cd7aa3e6 A __crc_rdma_connect

Wrong version! But we're ACTUALLY using the module located here:

# nm ./updates/kernel/drivers/infiniband/core/rdma_cm.ko | grep rdma_connect
000000000ef3a1e8 A __crc_rdma_connect

Which is the "correct" version. But if you LINK against the first version, you'll get these errors when you try to load Lustre. Note that my Module.symvers file for this kernel contains:

0xcd7aa3e6 rdma_connect drivers/infiniband/core/rdma_cm EXPORT_SYMBOL

Which is wrong! In this case, you need to explicitly point Lustre at the OFED directory which contains the Module.symvers file.

(Can you tell I've beaten my head against the wall over this issue a WHOLE LOT? :-/)

--Ken
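The manual comparison in the message above (CRC from /proc/kallsyms or nm versus the bytes in the __versions dump) can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample text is copied from the output quoted in the thread, and the byte order is assumed to be little-endian as on the x86_64 systems discussed here.

```python
import re


def crc_from_versions_dump(hex_word: str) -> int:
    """Decode the 4 CRC bytes as printed by `objdump -s -j __versions`.

    objdump prints raw bytes in memory order; on a little-endian machine
    (assumed here) they must be reversed to get the numeric CRC.
    """
    return int.from_bytes(bytes.fromhex(hex_word), byteorder="little")


def crc_from_symbol_table(text: str, symbol: str):
    """Return the value of __crc_<symbol> from `nm` or /proc/kallsyms text.

    Only absolute symbols (type 'a' or 'A') are considered.
    Returns None when the symbol is not present.
    """
    pattern = re.compile(
        r"^([0-9a-fA-F]+)\s+[aA]\s+__crc_" + re.escape(symbol) + r"\b",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return int(match.group(1), 16) if match else None


# Sample lines taken from the kallsyms output quoted in the thread.
KALLSYMS_SAMPLE = (
    "ffffffffa03770f0 r __kcrctab_rdma_connect [rdma_cm]\n"
    "000000000ef3a1e8 a __crc_rdma_connect [rdma_cm]\n"
    "ffffffffa0375510 T rdma_connect [rdma_cm]\n"
)

if __name__ == "__main__":
    running = crc_from_symbol_table(KALLSYMS_SAMPLE, "rdma_connect")
    linked = crc_from_versions_dump("e8a1f30e")
    print(f"running={running:#010x} linked={linked:#010x} match={running == linked}")
```

Feeding the same helper the nm output of the non-"updates" rdma_cm.ko would give 0xcd7aa3e6, which is the mismatch described above.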
Lawrence Sorrillo
2010-Mar-23 14:53 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Ken:

This is a wonderful post. Very very helpful indeed. Thank you.

I also noticed something in trying to compile lustre. The "Module.symvers" file is very much needed but is only created after you do "make/make modules" in the /usr/src/linux directory. It does NOT exist before then, and the lustre installation guide makes it seem like this file should exist when you unpack the sources. But it does not. Hence you cannot go on to run configure on lustre and expect the modules to work.

~Lawrence
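Since Module.symvers is a plain-text file with one exported symbol per line (CRC, symbol name, module path, export type, as in the line Ken quoted), checking it against the running kernel can be sketched as below. The function names and the shape of the `loaded_crcs` input are hypothetical; the sample line is the one from the thread.

```python
def parse_symvers(text: str) -> dict:
    """Map symbol name -> (crc, module path) from Module.symvers content."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: 0x<crc> <symbol> <module path> <export type>
        if len(fields) < 3 or not fields[0].startswith("0x"):
            continue
        table[fields[1]] = (int(fields[0], 16), fields[2])
    return table


def find_mismatches(symvers_text: str, loaded_crcs: dict) -> dict:
    """Return {symbol: (crc in Module.symvers, crc in running kernel)}
    for every symbol whose two CRCs differ."""
    table = parse_symvers(symvers_text)
    return {
        name: (table[name][0], crc)
        for name, crc in loaded_crcs.items()
        if name in table and table[name][0] != crc
    }


# The Module.symvers line quoted in the thread (fields are whitespace-separated).
SYMVERS_SAMPLE = "0xcd7aa3e6\trdma_connect\tdrivers/infiniband/core/rdma_cm\tEXPORT_SYMBOL\n"

if __name__ == "__main__":
    # CRC of rdma_connect in the running kernel, from /proc/kallsyms in the thread.
    for name, (built, running) in find_mismatches(
        SYMVERS_SAMPLE, {"rdma_connect": 0x0EF3A1E8}
    ).items():
        print(f"{name}: Module.symvers has {built:#010x}, kernel has {running:#010x}")
```

A non-empty result means Lustre would be built against symbol versions that the loaded OFED modules do not provide.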
Ken Hornstein
2010-Mar-23 15:09 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
> I also noticed something in trying to compile lustre. The
> "Module.symvers" file is very much needed but is only created after you do
> "make/make modules" in the /usr/src/linux directory. It does NOT exist
> before then and the lustre installation guide makes it seem
> like this file should exist when you unpack the sources. But it does
> not. Hence you cannot go on to run configure on lustre and expect the
> modules to work.

In my (limited) experience, it depends on your distribution. If you get a "development" kernel RPM that corresponds to an already-built binary kernel, I've found it to already be there. If you're compiling your own kernel from scratch, yeah, you have to have it already built before you compile modules against it ... but that's true for any third-party kernel module, I would think.

--Ken
Marco Aurelio L Gomes
2010-Mar-23 15:13 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Ken,

Thank you very much for your post, it worked!

Regards,

Marco
Ken Hornstein
2010-Mar-23 15:20 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
> Thank you very much for your post, it worked!

So ... what was your problem? Wrong version of OFED loaded? Or Lustre was compiled using the wrong symbol versions?

--Ken
Brian J. Murrell
2010-Mar-23 15:25 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
On Mon, 2010-03-22 at 20:22 -0700, Adeyemi Adesanya wrote:
> Hi.

Hi,

> I'm working on installing Lustre 1.8.2 on RHEL5.4. I noticed that the
> kernel-ib RPM is not available from the download site.

That's right. That's because for RHEL5, our 1.8.2 release has the o2ib LND built against the OFED that's in the RHEL5 kernel (1.4.1rc3 IIRC, which is what was made 1.4.2, again, IIRC). So you don't need a kernel-ib RPM and o2iblnd will work with RHEL5's built-in OFED.

> I did get hold
> of the OFED 1.5 source and built a version of kernel-ib but is this
> step essential for o2ib LNET on RHEL5?

Not at all, per the above. You just install the RHEL5.4 kernel and then when you (or your kernel does a) modprobe o2iblndm all should work just fine.

> I would like to try and retain
> compatibility with the RedHat OFED distribution and use the versions
> of openmpi, etc supplied by RedHat.

Then I would recommend using the OFED that shipped with the RHEL5.4 kernel, which is what Lustre 1.8.2 is doing.

> The Lustre 1.8.2 patched
> RHEL5 kernel appears to already include Infiniband drivers in "/lib/
> modules/2.6.18-164.11.1.el5_lustre.1.8.2/kernel/drivers/infiniband".

These were not put there by us but by RH. If you look in a stock RHEL5.4 kernel you will find the exact same modules.

b.
Lawrence Sorrillo
2010-Mar-23 15:59 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Brian:

The caveat here is that the RHEL5 kernel-version used should be one supported by SUN?

If this is so then to get a lustre client service one only has to install the corresponding

lustre-client-modules-xxxx
lustre-client

Then lustre with IB in RHEL5 should work?

The module, o2iblndm, does not seem to come from a stock RHEL5 build. It comes from lustre-client-modules-xxxx?

~Lawrence
Brian J. Murrell
2010-Mar-23 16:12 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
On Tue, 2010-03-23 at 11:59 -0400, Lawrence Sorrillo wrote:
> The caveat here is that the RHEL5 kernel-version used should be one
> supported by SUN?

There are two kernels. One which we distribute, patched, for the servers. It has OFED and the lustre-modules RPM has an o2iblnd.ko built against it.

The other kernel is the patchless kernel, available from RH. It should be the one we declared as officially supported by the release and will have the same version as the patched kernel we distribute per the above. The o2iblnd in the lustre-client-modules will be built against that.

It is also possible through something called "weak-modules" to use the modules in one given lustre-client-modules package with another kernel that has a matching kABI, but utilizing that feature is left as an exercise for the reader (for the time being).

> If this is so then to get a lustre client service one only has to
> install the corresponding
>
> lustre-client-modules-xxxx
> lustre-client
>
> Then lustre with IB in RHEL5 should work?

That's correct. For RHEL5 on 1.8.2, and likely more future releases than not. It is our goal to utilize the vendor's integrated stack when it makes sense to do so. We hope it makes sense most, if not all of the time now that OFED seems to be stabilizing and Linux vendors are including up-to-date releases of it more frequently.

> The module, o2iblndm, does not seem to come from a stock RHEL5 build.

No. That's the Lustre module that utilizes OFED.

> It
> comes from lustre-client-modules-xxxx?

Yes. And lustre-modules, for the servers.

b.
Marco Aurelio L Gomes
2010-Mar-23 16:26 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Sorry, my problem was the wrong OFED version. I removed my OFED installation and it worked.

Marco

On Tue, 2010-03-23 at 11:20 -0400, Ken Hornstein wrote:
> > Thank you very much for your post, it worked!
>
> So ... what was your problem? Wrong version of OFed loaded? Or Lustre
> was compiled using the wrong symbol versions?
>
> --Ken
Adesanya, Adeyemi
2010-Mar-23 17:21 UTC
[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?
Hi Brian.

Thanks for the clarification. I have no desire to use any other OFED stack apart from the version that Redhat ship.

-------
Yemi
Has anyone seen this before?

I have a lustre client that will work well soon after reboot (giving 300MB/sec writes over SDR infiniband to a lustre mount) but then after a couple of hours the mount will stop working: I get hangs on files coming from particular OSTs. Simultaneously, other clients, built a bit differently, do not hang on the same OST.

All clients with this particular build share this same malady.

This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.

(uname -a)
Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

Here is what it displays (/var/log/messages) soon after reboot and for initial read/writes to the lustre mount areas.

Apr 6 13:37:04 host0 kernel: Lustre: OBD class driver, http://www.lustre.org/
Apr 6 13:37:04 host0 kernel: Lustre: Lustre Version: 1.8.2
Apr 6 13:37:04 host0 kernel: Lustre: Build Version: 1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
Apr 6 13:37:05 host0 kernel: Lustre: Listener bound to ib0:172.17.3.61:987:mthca0
Apr 6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Apr 6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3.61@o2ib [8/64/0/180]
Apr 6 13:37:05 host0 kernel: Lustre: Added LNI X.X.X.X@tcp [8/256/0/180]
Apr 6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
Apr 6 13:37:06 host0 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Apr 6 13:37:06 host0 kernel: Lustre: MGC172.17.1.83@o2ib: Reactivating import
Apr 6 13:37:06 host0 kernel: Lustre: Client lustre-client has started

....
[Everything is fine here; just OS messages that do not pertain to lustre]
....

Apr 6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
Apr 6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in 36986 seconds.
Apr 7 08:38:36 host0 : error getting update info: (104, 'Connection reset by peer')
Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 9 seconds
Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 172.17.1.108@o2ib (84)
Apr 7 09:09:45 host0 kernel: LustreError: 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.17.1.108@o2ib: -113
Apr 7 09:09:45 host0 kernel: LustreError: 5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff810509419000 x1332294902650884/t0 o400->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre: 5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1332294902650884 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 172.17.1.108@o2ib 0s ago has failed due to network error (17s prior to deadline).
Apr 7 09:09:45 host0 kernel: req@ffff810509419000 x1332294902650884/t0 o400->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre: lustre-OST0018-osc-ffff810335e15c00: Connection to service lustre-OST0018 via nid 172.17.1.108@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 7 09:09:45 host0 kernel: LustreError: 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.17.1.108@o2ib: -113
Apr 7 09:09:45 host0 kernel: LustreError: 5313:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff8104345b2c00 x1332294902650898/t0 o8->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 1270645791 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: Lustre: 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1332294902650898 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 172.17.1.108@o2ib 0s ago has failed due to network error (6s prior to deadline).
Apr 7 09:09:45 host0 kernel: req@ffff8104345b2c00 x1332294902650898/t0 o8->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 1270645791 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:45 host0 kernel: LustreError: 5312:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
Apr 7 09:09:45 host0 kernel: Lustre: lustre-OST0019-osc-ffff810335e15c00: Connection to service lustre-OST0019 via nid 172.17.1.108@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 7 09:09:52 host0 kernel: Lustre: 5314:0:(import.c:524:import_select_connection()) lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing latency to 2s
Apr 7 09:09:59 host0 kernel: Lustre: 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1332294902654188 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 172.17.1.108@o2ib 7s ago has timed out (7s prior to deadline).
Apr 7 09:09:59 host0 kernel: req@ffff8104ff9c6c00 x1332294902654188/t0 o8->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 1270645799 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:09:59 host0 kernel: Lustre: 5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Apr 7 09:10:00 host0 kernel: Lustre: 5314:0:(import.c:524:import_select_connection()) lustre-OST0018-osc-ffff810335e15c00: tried all connections, increasing latency to 3s
Apr 7 09:10:00 host0 kernel: Lustre: 5314:0:(import.c:524:import_select_connection()) Skipped 2 previous similar messages
Apr 7 09:10:08 host0 kernel: Lustre: 5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1332294902658081 sent from lustre-OST0018-osc-ffff810335e15c00 to NID 172.17.1.108@o2ib 8s ago has timed out (8s prior to deadline).
Apr 7 09:10:08 host0 kernel: req@ffff810378e91400 x1332294902658081/t0 o8->lustre-OST0018_UUID@172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 1270645808 ref 2 fl Rpc:N/0/0 rc 0/0
Apr 7 09:10:08 host0 kernel: Lustre: 5313:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 2 previous similar messages

~Lawrence
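When reading a client log like the one above, it can help to reduce it to which services were lost via which NID and what the numeric status codes mean. The following is a rough sketch only: the function name and regular expressions are hypothetical, the errno names are the standard Linux values for 107 and 113, and the patterns accept both "@" and the archive's " at " spelling of NIDs.

```python
import re
from collections import Counter

# Linux errno values that appear in the log excerpts (values from Linux <errno.h>).
LINUX_ERRNO = {107: "ENOTCONN", 113: "EHOSTUNREACH"}

# "Connection to service <svc> via nid <addr>@<net> was lost"
LOST_RE = re.compile(
    r"Connection to service (\S+) via nid (\S+?)(?:@| at )(\w+) was lost"
)
# "Error sending PUT to <id>-<addr>@<net>: -<errno>"
PUT_RE = re.compile(r"Error sending PUT to \S+?(?:@| at )\w+: -(\d+)")


def summarize(log_text: str):
    """Return (list of (service, nid) lost connections, Counter of PUT errors)."""
    lost = []
    errors = Counter()
    for line in log_text.splitlines():
        m = LOST_RE.search(line)
        if m:
            service, addr, net = m.groups()
            lost.append((service, f"{addr}@{net}"))
        m = PUT_RE.search(line)
        if m:
            code = int(m.group(1))
            errors[LINUX_ERRNO.get(code, str(code))] += 1
    return lost, errors


# Two lines copied from the client log in the thread.
SAMPLE_LOG = (
    "Apr 7 09:09:45 host0 kernel: LustreError: 5312:0:(lib-move.c:2436:LNetPut()) "
    "Error sending PUT to 12345-172.17.1.108 at o2ib: -113\n"
    "Apr 7 09:09:45 host0 kernel: Lustre: lustre-OST0018-osc-ffff810335e15c00: "
    "Connection to service lustre-OST0018 via nid 172.17.1.108 at o2ib was lost; "
    "in progress operations using this service will wait for recovery to complete.\n"
)

if __name__ == "__main__":
    lost, errors = summarize(SAMPLE_LOG)
    for service, nid in lost:
        print(f"lost {service} via {nid}")
    for name, count in errors.items():
        print(f"{name}: {count}")
```

In this excerpt every failure points at the single server NID 172.17.1.108@o2ib, and status -113 corresponds to EHOSTUNREACH ("No route to host").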
Also, the logs from the OST that serves the files on which we see the hangs show the following errors:

Apr 7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr 7 02:51:45 loss09 kernel: Lustre: lustre-OST001a: haven't heard from client dd7aee74-0bb9-7b4a-4c7f-d0e78fff45ef (at 172.17.0.160@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Apr 7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr 7 02:53:18 loss09 kernel: LustreError: 13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error (-107) req@ffff81018021c000 x1326517357508998/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1270623204 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 7 02:53:18 loss09 kernel: LustreError: 13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 5 previous similar messages
Apr 7 09:12:42 loss09 kernel: Lustre: lustre-OST001a: haven't heard from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Apr 7 09:13:07 host09 kernel: Lustre: lustre-OST0018: haven't heard from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3.61@o2ib) in 227 seconds. I think it's dead, and I am evicting it.

172.17.3.61@o2ib is the IB interface for the client experiencing the hang condition.

~Lawrence
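For anyone reading the return codes in the excerpts above: the negative numbers are Linux errno values, with -113 being EHOSTUNREACH and -107 being ENOTCONN. Below is a minimal Python sketch for pulling them out of log lines. The helper names (`decode_status`, `find_status_codes`) and the small lookup table are illustrative assumptions, not part of Lustre.

```python
import re

# Linux errno values that appear in the excerpts (numbers as defined on Linux).
# Hypothetical lookup table for illustration only; extend it as needed.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches "status -113", "processing error (-107)" and a trailing ": -113".
STATUS_RE = re.compile(r"(?:status |error \(|: )-(\d+)")


def decode_status(code):
    """Return a readable name for a negative kernel return code."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in table"))
    return f"{name} ({text})"


def find_status_codes(line):
    """Return the negative status codes found in one log line."""
    return [-int(digits) for digits in STATUS_RE.findall(line)]


if __name__ == "__main__":
    sample = ("LustreError: 5312:0:(lib-move.c:2436:LNetPut()) "
              "Error sending PUT to 12345-172.17.1.108@o2ib: -113")
    for code in find_status_codes(sample):
        print(code, decode_status(code))
```

Read this way, the client could not reach the OSS at 172.17.1.108@o2ib (-113), and the server later answered an evicted client with "not connected" (-107).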
On Wed, 2010-04-07 at 10:23 -0400, Lawrence Sorrillo wrote:
> Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 9 seconds
> Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 172.17.1.108@o2ib (84)

Your network is failing. You need to test and fix your network.

b.
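Before testing the fabric, it can help to see which peers the timeouts point at by tallying the "Timed out RDMA with" lines per NID. Below is a rough Python sketch, assuming syslog lines shaped like the ones quoted in this thread. The function name `count_rdma_timeouts` is hypothetical, and the pattern accepts both the real `@` form and the archive's ` at ` rendering of a NID.

```python
import re
from collections import Counter

# Peer NID in a kiblnd_check_conns() timeout line. Accepts "addr@net" as
# logged by the kernel and "addr at net" as rendered by the mail archive.
TIMEOUT_RE = re.compile(r"Timed out RDMA with (\S+)(?: at |@)(\w+)")


def count_rdma_timeouts(lines):
    """Count o2iblnd RDMA timeouts per peer NID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[f"{match.group(1)}@{match.group(2)}"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Apr 7 09:09:30 host0 kernel: LustreError: "
        "5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) "
        "Timed out RDMA with 172.17.1.108@o2ib (84)",
    ]
    for nid, hits in count_rdma_timeouts(sample).most_common():
        print(nid, hits)
```

On a real client the input would be the syslog file itself (for example an open handle on /var/log/messages). If every timeout names the same server NID, the path to that server is the first thing to test.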
Lawrence Sorrillo
2010-Apr-08 14:56 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
I am about to try to build lustre again, as I am getting hangs with the lustre mounts in my previous build.

"Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 9 seconds
Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 172.17.1.108@o2ib (84)"

Here is the plan: Lustre 1.8.2 on RHEL5 x86_64, using the OFED in the RHEL5 kernel. I have gathered the following packages from the lustre site:

e2fsprogs-1.41.6.sun1-0redhat.rhel5.x86_64.rpm
kernel-2.6.18-164.6.1.0.1.el5.src.rpm
lustre-client-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm
lustre-client-modules-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm

I want to build the kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm binary from kernel-2.6.18-164.6.1.0.1.el5.src.rpm. Then I am hoping to install the two lustre-client packages along with the newly created kernel binary and be done with it.

Since this setup is meant to work with IB, I am hoping that everything will be consistent between the IB/OFED that comes with kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm and the lustre-client modules. I know that RHEL5 comes with OFED 1.4.1-rc3.

Is this a reasonable approach? Thanks.
Brian J. Murrell
2010-Apr-08 15:19 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
On Thu, 2010-04-08 at 10:56 -0400, Lawrence Sorrillo wrote:
> I am about to try to build lustre again as I am getting hangs with the
> lustre mounts in my previous build.
>
> "Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 9 seconds
> Apr 7 09:09:30 host0 kernel: LustreError: 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 172.17.1.108@o2ib (84)"

What makes you think that this is a software problem and that rebuilding the software stack will resolve it? FWIW, every time I have seen this type of problem reported, the fabric was flaky.

> Here is the plan. Lustre 1.8.2 on rhel5 x86_64 using the ofed in the rhel5 kernel.

In case it's not what you mean, why don't you just use the pre-built packages that we have built and extensively tested in our QA department for you?

> I have gathered the following packages from the lustre site:
> e2fsprogs-1.41.6.sun1-0redhat.rhel5.x86_64.rpm
> kernel-2.6.18-164.6.1.0.1.el5.src.rpm

Why do you need a kernel src.rpm?

> lustre-client-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm
> lustre-client-modules-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm
>
> I want to get the kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm binary from
> kernel-2.6.18-164.6.1.0.1.el5.src.rpm.

Why not just use the binary kernel we provide instead of rebuilding your own? It's the *exact* same kernel that we used in our QA testing and therefore a known quantity.

b.
>Why not just use the binary kernel we provide instead of rebuilding your
>own? It's the *exact* same kernel that we used in our QA testing and
>therefore a known quantity.

I have to agree with Brian here ... the best success that we've had is to either use _everything_ from Sun/Oracle (I'm just not used to thinking of you guys as "Oracle" yet!), or compile _everything_ yourself. We do the latter on some systems (for various reasons), but I prefer it when we can do the former. Mixing and matching just leads you into trouble (like the symbol version problems you were encountering).

--Ken
Lawrence Sorrillo
2010-Apr-08 15:43 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
Brian:

I greatly appreciate your input. The IB connections for this set of builds are SDR, while the rest of the fabric is either DDR or QDR. We have one large fabric. It appears that only the nodes with this build (and SDR connections) are affected this way. I guess I can install a DDR card with a different cable and IB port and see if this makes a difference. All the machines built this way are experiencing the hangs, so I assumed it was not hardware. Although it could be just hardware-they-all-share.

I can't find the pre-built kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm. I only found the source (kernel-2.6.18-164.6.1.0.1.el5.src.rpm), hence the reason I need to build the binary version. Do you have it somewhere? I can't use the lustre patched version, as I have other software to install that expects a stock kernel version. I am hoping to use the pre-built lustre-client rpms with my built binary (hoping for no module versioning complaints). Overly hopeful?

~Lawrence
Brian J. Murrell
2010-Apr-08 15:51 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
On Thu, 2010-04-08 at 11:43 -0400, Lawrence Sorrillo wrote:
> Brian:

Hi Lawrence,

> I greatly appreciate your input. These IB connections for this set of
> builds are SDR when the rest of the fabric is either DDR or QDR. We have
> one large fabric.
> It appears that only these nodes with this build(and SDR connections )
> are affected this way.

Ahhh. Perhaps a firmware incompatibility? ISTR that we ran into some firmware compatibility issues on our IB hardware in the lab when we started 1.5 testing. A firmware upgrade resolved it. I don't recall what the nature of the problem was, but if you are experiencing hardware issues with just one brand of hardware, firmware is most likely the culprit.

> I can't find the pre-built, kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm.

Yes, that's because that is just RedHat's stock kernel. You can get that from wherever you got your distro and updates from (i.e. RH, Centos, etc.).

> I
> only found the source (kernel-2.6.18-164.6.1.0.1.el5.src.rpm).

It's strange that we have that on our download site. I will ask our release manager about that one.

> Do you have it
> somewhere?

No. Per the above, you get it wherever you get the rest of your distro.

> I can't use the lustre patched version as I have other
> software to
> install that expects a stock kernel version.

Indeed, and on a patchless client, you don't want to use the patched kernel.

> I am hoping to use the
> pre-built lustre-client rpms with my built binary(hoping for no modules
> versioning complaints).

If you get the kernel specified above from RH (or whoever your distro vendor is) you shouldn't get any.

> Overly hopeful?

Not at all.

b.
Lawrence Sorrillo
2010-Apr-08 15:58 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
I will definitely look into the hardware issues.

Also, I can't get (I have looked everywhere)

kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm

Our RHN service only provides

kernel-2.6.18-164.6.1.el5.x86_64.rpm

Cheers,
~Lawrence
Brian J. Murrell
2010-Apr-08 16:27 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
On Thu, 2010-04-08 at 11:58 -0400, Lawrence Sorrillo wrote:
> I will definitely look into the hardware issues.
>
> Also, I can't get (I have looked everywhere)
>
> kernel-2.6.18-164.6.1.0.1.el5.x86_64.rpm

Hrm. I wonder why you'd need that kernel. Then I looked at your original post in this thread where you said:

> I have gathered the following packages from the lustre site:
> e2fsprogs-1.41.6.sun1-0redhat.rhel5.x86_64.rpm
> kernel-2.6.18-164.6.1.0.1.el5.src.rpm
> lustre-client-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm
> lustre-client-modules-1.8.2-2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm

But I am looking at our download site right now and I see (for 1.8.2/RHEL5):

Lustre client modules (Client for unpatched vendor kernel)
lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2.x86_64.rpm

Which means you want kernel-2.6.18_164.11.1.el5.

Now that I am looking at our download site, I am also not seeing the src.rpm for the stock/vendor kernel that you were referring to.

Oh wait. I see. You are looking at OEL5 packages. Are you in fact running OEL5 or RHEL5 (or something else even)?

b.
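The check Brian does by eye here, reading the required kernel version out of the lustre-client-modules file name, can be sketched in Python. This assumes the package naming seen in this thread (`lustre-client-modules-<lustre>-<kernel>_lustre.<lustre>.<arch>.rpm`) and that the first underscore in the kernel part stands in for the hyphen of the `uname -r` string, since RPM version and release fields cannot contain hyphens. `required_kernel` is a hypothetical helper name, not a Lustre tool.

```python
import re

# File name layout assumed from the packages named in this thread:
#   lustre-client-modules-<lustre>-<kernel>_lustre.<lustre>.<arch>.rpm
PKG_RE = re.compile(
    r"^lustre-client-modules-(?P<lustre>[\d.]+)-"
    r"(?P<kernel>.+)_lustre\.(?P=lustre)\.(?P<arch>\w+)\.rpm$"
)


def required_kernel(package):
    """Return the kernel release a client-modules package was built for."""
    match = PKG_RE.match(package)
    if match is None:
        raise ValueError(f"unrecognised package name: {package}")
    # The first underscore replaces the hyphen found in `uname -r` output.
    return match.group("kernel").replace("_", "-", 1)


if __name__ == "__main__":
    rhel_pkg = ("lustre-client-modules-1.8.2-"
                "2.6.18_164.11.1.el5_lustre.1.8.2.x86_64.rpm")
    oel_pkg = ("lustre-client-modules-1.8.2-"
               "2.6.18_164.6.1.0.1.el5_lustre.1.8.2.x86_64.rpm")
    print("RHEL5 package wants kernel", required_kernel(rhel_pkg))
    print("OEL5 package wants kernel", required_kernel(oel_pkg))
```

Comparing the result with the output of `uname -r` on the client shows straight away whether the modules and the running kernel belong together.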
Lawrence Sorrillo
2010-Apr-08 17:27 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
You are absolutely correct. I have been looking at OEL5 instead of RHEL5. Thanks. This may explain some of my difficulties in compiling this stuff on previous occasions.

~Lawrence
Lawrence Sorrillo
2010-Apr-08 21:13 UTC
[Lustre-discuss] RHEL5's OFED with lustre1.8.2 on IB
Brian:

This worked without a hitch.

Although the OFED that ships with RHEL5 is bare, without the really nifty tools that other OFED installations possess.

I am now testing to see if the hang condition reappears.

Thanks,
~Lawrence
What's missing? When you say 'nifty tools', the goodies in infiniband-diags and ibutils come to mind. Maybe you just need to install the RPMs.

-frank

On Apr 8, 2010, at 2:13 PM, Lawrence Sorrillo wrote:
> Brian:
>
> This worked without a hitch.
>
> Although the OFED with rhel5 is bare-without the real nifty tools that
> other OFED installations possess.
>
> I am now testing to see if the hang-condition re-appears.
>
> Thanks,
> ~Lawrence