Rhys McMurdo
2009-Sep-21 21:06 UTC
[Linux_hpc_swstack] [hpcdev-discuss] infiniband and sun hpc software for linux
Hi Yann,

Firstly, this probably isn't the best list to ask these questions. There is a mailing list for the Linux HPC software stack available at linux_hpc_swstack at lists.lustre.org

Secondly, if I had to guess at your problem, it looks like either you do not have an OpenSM daemon running, or your library paths are not right.

Check the opensmd status via /etc/init.d/opensmd status. Also, what does ldd /usr/sbin/ibstat show?

Regards,

Rhys

2009/9/21 Yann JOBIC <jobic at polytech.univ-mrs.fr>
> Hello,
>
> I've got 2 X4600, CentOS 5.3, the latest firmware for 375-3549 cards from the
> Mellanox website (for Sun cards), and Sun HPC Software for Linux.
>
> When I run ibstat to check the health of my InfiniBand cards, I get nothing.
> When I run the strace tool to see what happened, I get:
>
> [root at Lidia ~]# strace ibstat
> execve("/usr/sbin/ibstat", ["ibstat"], [/* 34 vars */]) = 0
> brk(0) = 0x1bb06000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b47a0ae0000
> uname({sys="Linux", node="Lidia", ...}) = 0
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/x86_64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/x86_64", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/tls", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/x86_64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64/x86_64", 0x7fff09fc8b80) = -1 ENOENT (No such file or directory)
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libopensm.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gnu/ClusterTools-8.2/lib/64", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [....]
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libosmvendor.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> [...]
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libosmcomp.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
> [...]
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libibmad.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
> [...]
> open("/usr/mpi/gnu/ClusterTools-8.2/lib/64/libibumad.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
> [...]
>
> It's missing some other files.
>
> When I flashed the firmware, I had this warning:
>
> [root at Lilou ~]# mstflint -d 03:00.0 -i fw-25408-2_6_000-375-3549-01.bin b
>
> Current FW version on flash: 2.5.100
> New FW version: 2.6.0
>
> You are about to replace current PSID on flash - "SUN0070000001" with a
> different PSID - "SUN0070130001".
> Note: It is highly recommended not to change the PSID.
>
> Do you want to continue ? (y/n) [n] : y
>
> Burning second FW image without signatures - OK
> Restoring second signature - OK
>
> I followed the deployment documentation. Did I miss something?
> Has anybody had this kind of problem?
>
> Thanks,
>
> Yann
>
> --
> Yann JOBIC
> HPC engineer
> Polytech Marseille DME
> IUSTI-CNRS UMR 6595
> Technopôle de Château Gombert
> 5 rue Enrico Fermi
> 13453 Marseille cedex 13
> Tel : (33) 4 91 10 69 39
> ou (33) 4 91 10 69 43
> Fax : (33) 4 91 10 69 69
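The two checks Rhys asks for can be sketched as a small shell helper. This is only an illustration: the function name `missing_libs` is hypothetical and not part of any tool mentioned in the thread. It relies on the fact that `ldd` marks libraries it cannot resolve with `=> not found`. Failed `open()` calls in an strace are not conclusive on their own, since the dynamic linker probes each directory on its search path in turn and only the final result matters.

```shell
#!/bin/sh
# List shared libraries that the dynamic linker cannot resolve.
# Reads `ldd` output on stdin and prints one library name per line.
# (missing_libs is a hypothetical helper name used for illustration.)
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage on the affected host (commands taken from the thread):
#   ldd /usr/sbin/ibstat | missing_libs
#   /etc/init.d/opensmd status
```

An empty result from `missing_libs` would suggest the library path is fine and the problem lies elsewhere.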
Zhiqi Tao
2009-Sep-22 00:46 UTC
[Linux_hpc_swstack] [hpcdev-discuss] infiniband and sun hpc software for linux
Hi Yann,

Yeah, I'd like to echo what Rhys suggested as well. Please use sminfo to verify that a subnet manager is active in your IB network.

If an SM is not running, sminfo prints:

sminfo: iberror: query failed

If an SM is running, sminfo prints the LID and other SM node information. Example:

sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1

BTW, what's your IB card model? For example:

# lspci|grep Infini
05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

Regards,

Zhiqi
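The sminfo check above could be scripted roughly as follows. This is a sketch that assumes sminfo output has exactly the two shapes quoted in Zhiqi's message; the helper name `sm_lid` is hypothetical. It may also be that sminfo writes its error line to stderr, in which case `2>&1` is needed before the pipe.

```shell
#!/bin/sh
# Extract the subnet manager LID from `sminfo` output read on stdin.
# Prints the LID and returns 0 if an SM answered; prints nothing and
# returns 1 otherwise. (sm_lid is a hypothetical helper name.)
sm_lid() {
    sed -n 's/^sminfo: sm lid \(0x[0-9a-fA-F]*\) .*/\1/p' | grep .
}

# Usage:
#   if lid=$(sminfo 2>&1 | sm_lid); then
#       echo "subnet manager active at LID $lid"
#   else
#       echo "no subnet manager answered" >&2
#   fi
```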
Kevin Van Maren
2009-Sep-22 01:04 UTC
[Linux_hpc_swstack] [hpcdev-discuss] infiniband and sun hpc software for linux
It is not clear whether you have a library path issue, as you trimmed too much from the strace. I would say not: if you did, you would get an exec error about not being able to find a shared library, not "nothing".

Rather, it sounds like the driver is not loading properly. ibstat should work even without a subnet manager running.

This could very well have been caused by loading the wrong firmware for that card. Given that the PSID was different, are you sure you flashed the right firmware for that card?

Kevin
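Kevin's PSID question suggests a guard that could be run before burning. The sketch below assumes that `mstflint -d <device> q` prints a line of the form `PSID: <value>`; the helper names `psid_of` and `same_psid` are hypothetical, and the expected PSID has to be supplied by hand from the firmware image's documentation.

```shell
#!/bin/sh
# Extract the PSID value from mstflint query output read on stdin.
# Assumes a "PSID: <value>" line; psid_of is a hypothetical helper name.
psid_of() {
    awk '$1 == "PSID:" { print $2 }'
}

# Return 0 only when both PSIDs are non-empty and identical.
same_psid() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (device address and PSID values taken from the thread):
#   flash=$(mstflint -d 03:00.0 q | psid_of)
#   same_psid "$flash" "SUN0070130001" || echo "PSID mismatch: do not burn" >&2
```

With the values reported in the thread, the card's "SUN0070000001" does not match the image's "SUN0070130001", so the guard would refuse.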
Yann JOBIC
2009-Sep-22 07:04 UTC
[Linux_hpc_swstack] [hpcdev-discuss] infiniband and sun hpc software for linux
Kevin Van Maren wrote:
> It is not clear if you have a library path issue or not, as you
> trimmed too much from the strace. I would say not, as if you did you
> would get an exec error about not being able to find a shared library,
> not "nothing".
>
> Rather it sounds like the driver is not loading properly. ibstat
> should work even w/o a subnet manager running.
>
> This very much could have been caused by your loading the wrong
> firmware for that card. Given that the PSID was different, are you
> sure you flashed the right firmware for that card?
>
> Kevin

You're right:

mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:03:00.0 failed with error -5

Also, I can't see the card with mstflint:

[root at Lidia ~]# lspci | grep Mell
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
[root at Lidia ~]# mstflint -d 03:00.0 v
Warning: memory access to device 03:00.0 failed: Input/output error
Warning: Fallback on IO: much slower, and unsafe if device in use.
*** ERROR *** Can not open 03:00.0: Not a directory
MFE_CR_ERROR

However, when I boot Ubuntu (live CD), I can see the card with mstflint:
http://img29.imageshack.us/img29/9251/mstflint.png

I installed the firmware from here:
http://www.mellanox.com/content/pages.php?pg=firmware_table_Sun

I've got a SUN0070000001 (375-3549, X4217A-Z).

I also think that this firmware is odd. Do you know where I can get the right one?

Thanks,

Yann
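The kernel messages quoted in this reply can be picked out of a longer log with a short filter. This is a minimal sketch that assumes the message format shown above, `mlx4_core: probe of <pci address> failed with error <n>`; the helper name `failed_probes` is hypothetical.

```shell
#!/bin/sh
# Print the PCI address of every device whose mlx4_core probe failed.
# Reads kernel log text on stdin, one address per output line.
# (failed_probes is a hypothetical helper name used for illustration.)
failed_probes() {
    sed -n 's/^.*mlx4_core: probe of \([0-9a-fA-F:.]*\) failed with error.*$/\1/p'
}

# Usage:
#   dmesg | failed_probes
#   ls /sys/class/infiniband   # empty or absent: no HCA registered with the IB stack
```

An address printed here means the driver module loaded but could not initialize that card, which would explain ibstat printing nothing.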