Malcolm Cowe
2008-Oct-06 09:58 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Hi Folks,

We are trying to create a small Lustre environment on behalf of a customer. There are 2 X4200m2 MDS servers, both dual-attached to an STK 6140 array over FC. This is an active-passive arrangement with a single shared volume; Heartbeat is used to co-ordinate file system failover. There is a single X4500 OSS server, the storage for which is split into 6 OSTs. Finally, we have 2 X4600m2 clients, just for kicks. All systems are connected together over Ethernet and InfiniBand, with the IB network being used for Lustre, and every system is running RHEL 4.5 AS. The X4500 OST volumes are created using software RAID, while the X4200m2 MDT is accessed using DM Multipath.

We downloaded the Lustre binary packages from Sun's web site and installed them onto each of the servers. Unfortunately, the resulting system is very unstable and is prone to lock-ups on the servers (uptimes are measured in hours). These lock-ups happen without warning, and with very little, if any, debug information in the system logs. We have also observed the servers locking up on shutdown (kernel panics).
Based on the documentation in the Lustre operations manual, we installed the RPMs as follows:

rpm -Uvh --force e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm
rpm -ivh kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64.rpm
rpm -ivh kernel-lustre-source-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64.rpm
rpm -ivh lustre-modules-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm    # (many "unknown symbol" warnings)
rpm -ivh lustre-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
rpm -ivh lustre-source-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm      # (many "unknown symbol" warnings)
mv /etc/init.d/openibd /etc/init.d/openibd.rhel4default
rpm -ivh --force kernel-ib-1.3-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
cp /etc/init.d/openibd /etc/init.d/openibd.lustre.1.6.5.1

We then reboot the system and load RHEL using the Lustre kernel. Next we install the Voltaire OFED software:

1. Copy the kernel config used to build the Lustre-patched kernel into the Lustre kernel source tree:

   cp /boot/config-2.6.9-67.0.7.EL_lustre.1.6.5.1smp \
      /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1/.config

2. Change into the Lustre kernel source and edit the Makefile, changing the "custom" suffix to "smp" in the variable EXTRAVERSION.

3. In the Lustre kernel source, run these setup commands:

   make oldconfig || make menuconfig
   make include/asm
   make include/linux/version.h
   make SUBDIRS=scripts

4. Change into the "-obj" directory and run these setup commands:

   cd /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1-obj/x86_64/smp
   ln -s /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1/include .

5. Unpack the Voltaire OFED tar-ball:

   tar zxf VoltaireOFED-5.1.3.1_5.tgz

6. Change to the unpacked software directory and run the installation script. To build the OFED packages with the Voltaire certified configuration, run:

   cd VoltaireOFED-5.1.3.1_5
   ./install.pl -c ofed.conf.Volt

7. Once complete, reboot.

8. Configure any IPoIB interfaces as required.

9. Add the following to /etc/modprobe.conf:

   options lnet networks="o2ib0(ib0)"

10. Load the Lustre LNET kernel module:

    modprobe lnet

11. Start the Lustre core networking service:

    lctl network up

12. Check the system log (/var/log/messages) for confirmation.

Create the MGS/MDT Lustre volume:

1. Format the MGS/MDT device:

   mkfs.lustre [ --reformat ] --fsname lfs01 --mdt --mgs --failnode=mds-2@o2ib0 /dev/dm-0

2. Create the MGS/MDT file system mount point:

   mkdir -p /lustre/mdt/lfs01

3. Mount the file system. This initiates the MGS and MDT services for Lustre:

   mount -t lustre /dev/dm-0 /lustre/mdt/lfs01

With the exception of the OST volume creation, we use an equivalent process to bring the OSS online. The cabling has been checked and verified.

So we re-built the system from scratch and applied only Sun's RDAC modules and Voltaire OFED to the stock RHEL 4.5 kernel (2.6.9-55.ELsmp). We removed the second MDS from the h/w configuration and did not install Heartbeat. The shared storage was re-formatted as a regular EXT3 file system using the DM multipathing device, /dev/dm-0, and mounted onto the host. Running I/O tests against the mounted file system over an extended period did not elicit a single error or warning message in the log related to the multipathing or the SCSI device.

Once we were confident that the system was running in a consistent and stable manner, we re-installed the Lustre packages, omitting the kernel-ib packages. We had to re-build and re-install the RDAC support as well. This means that the system has support for the Lustre file system but no InfiniBand support at all. /etc/modprobe.conf is updated so that the lnet networks option is set to "tcp". The MDS/MGS volume is recreated on the DM device.

We have tried the following configurations on the X4200m2:

* RHEL vanilla kernel, multipathd, RDAC. EXT-3 file system. PASSED.
* RHEL vanilla kernel, multipathd, RDAC, Voltaire OFED. EXT-3 file system. PASSED.
* Lustre-supplied kernel, Lustre software. No IB. MDS/MGS file system. FAILED.
* Lustre-supplied kernel, Lustre software, RDAC. No IB. MDS/MGS file system (full Lustre FS over Ethernet). FAILED.
* Lustre-supplied kernel, Lustre software, RDAC, Voltaire OFED. EXT-3 file system. FAILED.
* Lustre-supplied kernel, Lustre software, RDAC, Voltaire OFED. MDS/MGS file system (full Lustre FS over IB). FAILED.

Our findings indicate that there is a problem within the binary distribution of Lustre. This may be because we are applying the 2.6.9-67 RHEL kernel to a platform based upon 2.6.9-55, or it may be a more subtle issue in the interaction with the underlying hardware. We could use some advice on how best to proceed, since our deadline fast approaches. For example, is our build process, as documented above, clean? Currently, we're looking at building from source, to see if this results in a more stable environment.

Regards,

Malcolm.

-- 
<http://www.sun.com>
*Malcolm Cowe*
/Solutions Integration Engineer/

*Sun Microsystems, Inc.*
Blackness Road
Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: Malcolm.Cowe at Sun.COM
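For completeness, the "equivalent process" on the OSS side presumably looks like the following. This is a sketch only: the md device name, mount point, and MGS node name (mds-1) are assumptions for illustration, since the original post does not list them.

```shell
# Hedged sketch of bringing one OST online on the X4500 (repeat per OST
# device).  Device name, mount point and --mgsnode value are assumed.
mkfs.lustre --fsname lfs01 --ost --mgsnode=mds-1@o2ib0 /dev/md11
mkdir -p /lustre/ost0/lfs01
mount -t lustre /dev/md11 /lustre/ost0/lfs01
```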
Brian J. Murrell
2008-Oct-06 13:18 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
On Mon, 2008-10-06 at 10:58 +0100, Malcolm Cowe wrote:

> rpm -Uvh --force e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm

You should not (have to) use --force. If you do, there is either an operational error or a bug in our packages. In the latter case, please file a bug in our bugzilla.

> rpm -ivh lustre-modules-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
> # (many "unknown symbol" warnings)

Can you paste them here?

> rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
> # (many "unknown symbol" warnings)

Ditto.

> rpm -ivh --force kernel-ib-1.3-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm

Again, you should not need to use --force.

> We then reboot the system and load RHEL using the Lustre kernel. Now
> we install the Voltaire OFED software:

Why? The kernel-ib package you installed above should provide a working OFED stack.

> 1. Unpack the Voltaire OFED tar-ball:
>
>    tar zxf VoltaireOFED-5.1.3.1_5.tgz

Do you really need 1.3.1? If so, then you should not install the 1.3 kernel-ib package we provide above. I really wonder why you need 1.3.1 though.

> * Lustre supplied kernel, Lustre software. No IB. MDS/MGS file
>   system. FAILED.

Failed in what way?

> * Lustre supplied kernel, Lustre software, RDAC. No IB. MDS/MGS
>   file system (Full Lustre FS over Ethernet). FAILED.

Again, in what way?

> * Lustre supplied kernel, Lustre software, RDAC, Voltaire OFED.
>   EXT-3 file system. FAILED.

Ditto.

> * Lustre supplied kernel, Lustre software. RDAC, Voltaire OFED.
>   MDS/MGS file system (Full Lustre FS over IB). FAILED.

And ditto again. You have to provide more details than just "FAILED" if we are to try to help diagnose a problem.

> Our findings indicate that there is a problem within the binary
> distribution of Lustre.

I think that many of our users use it as is, so it cannot be all that bad.

> This may be due to the fact that we are applying the 2.6.9-67 RHEL
> kernel to a platform based upon 2.6.9-55,

That shouldn't be a problem in and of itself.

b.
Ms. Megan Larko
2008-Oct-06 14:24 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Hello,

Reading through the message from Malcolm Cowe about a new Lustre environment, he mentioned that there were "unknown symbol" warnings during his installation procedure. I also saw warnings when I was doing a Lustre 1.6.5.1 install, and I know from general Linux experience that an rpm is not properly installed when it produces that many warnings of the type I was seeing (some were ldiskfs issues, for example).

What I discovered is that the order in which the Lustre 1.6.5.1 RPMs are installed does matter, and that it is not the same order as indicated in the Lustre Manual version 1.12 for 1.6.4. The order I used, which generated no "unknown symbol" errors for an installation of Lustre 1.6.5.1, was this:

1) kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1.x86_64.rpm

If using InfiniBand (IB), this is next:

2) kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
3) lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
4) lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
5) lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

The above were done using rpm -i (install), which works well for kernels so that you keep multiple versions (hopefully including a good one to which you may return if necessary). The last package cannot be installed with -i, but with -U:

rpm -Uvh e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm

This has not been a problem for us, as the newer version seems to get along fine with a 1.6.4.3 version of Lustre (I appreciate backwards compatibility).

If a module installation does produce many "unknown symbol" references, then find the rpm which will satisfy those references and install it. To actually have the references satisfied, one must return to the package that had complained about the "unknown symbol" and, having already installed the package that satisfies those symbols, run "rpm --force -ivh" to force a retry of the package with the issues.

This procedure can be iterative, as sometimes more than one package may be needed to satisfy all the references of the desired package. I do know from personal experience that if a package has "unknown symbols", especially if those symbols are used/accessed, it can panic the box. My experience with this is on 64-bit hardware using CentOS 5 as the base operating system.

Best of luck,

megan
Malcolm Cowe
2008-Oct-06 14:47 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Hey Brian,

I'll have to re-install the system from scratch in order to be able to answer some of your questions, which I'll get started on this evening. What I was hoping for in the first instance was a sanity check of our installation methods.

With respect to the OFED stack used, we are using the latest official software stack supplied by Voltaire. The reason for this is that there is more to OFED than just the kernel modules: there are many libraries and tools, plus the latest firmware for the cards. It's what the customer has asked for, and it is what the card vendor expects us to do. We may be able to get away with OFED 1.3, but I would still like some guidance on how to install the rest of the OFED stack -- do we use the OFED source to rebuild everything, or can we take the Lustre-supplied kernel modules and just layer on the other stuff separately? Like I said, sanity-checking the install procedure is important.

Finally, when I said that one file system fails while another passes, I mean that the server locks solid or crashes, usually with no debug to speak of (nothing in the system logs). Even while the system is up and running the Lustre kernel, if we attempt a clean shutdown, the kernel panics.

Since I need to rebuild the systems anyway, I will also try installing the packages in the order mentioned by Megan Larko, to see how that affects the installation. We have been following the instructions in the Lustre Operations Manual (v. 1.14).

Regards,

Malcolm.
Brian J. Murrell
2008-Oct-06 14:59 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
On Mon, 2008-10-06 at 15:47 +0100, Malcolm Cowe wrote:

> Hey Brian,

Hey Malcolm,

> I'll have to re-install the system from scratch in order to be able to
> answer some of your questions, which I'll get started on this evening.

OK.

> What I was hoping for in the first instance was a sanity check of our
> installation methods.

I think I commented on those. If you are going to build your OFED stack you don't need to install the one we provide.

> With respect to the OFED stack used, we are using the latest official
> software stack supplied by Voltaire. The reason for this is that there
> is more to OFED than just the kernel modules, including many libraries
> and tools,

None of these should be necessary for Lustre to use I/B.

> plus the latest firmware for the cards.

Hrm. Can you not upgrade firmware independent of upgrading the whole OFED stack? That seems very limiting.

> It's what the customer has asked for, and it is what the card vendor
> expects us to do.

Fair enough. I was just pointing out that you don't need our OFED stack if you are going to install your own.

> We may be able to get away with OFED 1.3, but I would still like some
> guidance on how to install the rest of the OFED stack

We don't supply the userspace tools because they are not really necessary for Lustre.

> do we use the OFED source to rebuild everything, or can we pick the
> Lustre supplied kernel modules and just layer on the other stuff
> separately?

Yes, you should be able to do that. I say that quite generally as I'm not entirely clear on your operating environment.

> Finally, when I said that one file system fails versus another passes,
> I mean that the server locks solid, crashes, usually with no debug to
> speak of (nothing in the system logs).

Nothing on the console either?

> Even while the system is up and running the lustre kernel, if we
> attempt a clean shutdown, the kernel panics.

Hrm. A panic is quite different than locking solid with no messages at all. A solid lock with no messages is indicative of hardware problems.

> Since I need to rebuild the systems anyway, I will also try to install
> the packages in the order mentioned by Megan Larko, to see how that
> affects the installation.

I'm not entirely convinced of her process. You should not need to use --force and reinstall packages already installed. I'd be more interested in knowing exactly your installation steps and the errors you get from them. Please try to avoid the use of --force so we can see why it's necessary. You will have to use "rpm -U" with e2fsprogs though, as she mentions. Do all of your work with the "script(1)" tool so you can easily log it.

b.
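A minimal way to follow the script(1) suggestion above (file names are illustrative, and "uname -r" stands in for the real rpm commands):

```shell
# Run the install commands under script(1) so every warning that scrolls by,
# including rpm's "unknown symbol" output, is captured in a typescript file
# that can be posted to the list.
script -c 'uname -r' lustre-install.log
# Inspect the captured session for the warnings Brian asked about:
grep -i 'unknown symbol' lustre-install.log || echo "no symbol warnings captured"
```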
Andreas Dilger
2008-Oct-07 08:17 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
On Oct 06, 2008 10:59 -0400, Brian J. Murrell wrote:

> On Mon, 2008-10-06 at 15:47 +0100, Malcolm Cowe wrote:
> > With respect to the OFED stack used, we are using the latest official
> > software stack supplied by Voltaire. The reason for this is that there
> > is more to OFED than just the kernel modules, including many libraries
> > and tools,
>
> None of these should be necessary for Lustre to use I/B.

Also very important to note is that if you are changing the IB stack, then Lustre also needs to be recompiled to work with the new IB stack.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
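As a sketch of what that recompile involves, assuming the replacement OFED kernel sources live under /usr/src/ofa_kernel (both paths are assumptions, not taken from the thread):

```shell
# Rebuild Lustre so its o2ib LND is compiled against the replacement OFED
# tree rather than the IB stack in the patched kernel.  Paths illustrative.
cd /usr/src/redhat/BUILD/lustre-1.6.5.1
./configure --with-linux=/usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1 \
            --with-o2ib=/usr/src/ofa_kernel
make rpms
```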
Andreas Dilger
2008-Oct-07 08:48 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
On Oct 06, 2008 10:24 -0400, Ms. Megan Larko wrote:

> The order I used which generated no "unknown symbol" errors for
> installation of lustre 1.6.5.1 was this:
> 1) kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1.x86_64.rpm
>
> If using infiniband (IB) this is next:
> 2) kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
> 3) lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
> 4) lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
> 5) lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

That is good to know for the documentation. However, I suspect that if all of these packages are installed at the same time there would also not be any symbol warnings.

> If a module installation does have many "unknown symbol" references,
> then find the rpm which will satisfy those references and install that
> module. To actually have them satisfied one must return to the
> package that had complained about the "unknown symbol" and having
> already installed the package to satisfy those symbols then "rpm
> --force -ivh" to force a retry of the package with the issues.

That isn't quite correct. The missing module symbols are the output of "depmod -ae", which is run in the RPM post-install after kernel modules are installed. Even if there are such warnings, once the missing modules are later installed and "depmod -ae" is run again, it should report no warnings, regardless of the order in which the modules were installed. That means there is no need to reinstall the RPMs or to install them in a particular order, though of course avoiding the warnings is always nicer. You can always run "depmod -ae" by hand to re-verify the modules of the currently installed kernels.

Cheers, Andreas
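The depmod re-check described above can be sketched as follows; the kernel version string is taken from Megan's package names, so adjust it for the kernel actually installed:

```shell
# Re-run the same check the RPM post-install performs.  If this completes
# quietly, the earlier "unknown symbol" warnings were only an install-order
# artifact and all module symbols now resolve.
KVER=2.6.18-53.1.14.el5_lustre.1.6.5.1smp
depmod -ae -F /boot/System.map-$KVER $KVER
```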
Malcolm Cowe
2008-Oct-07 09:43 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Andreas Dilger wrote:

> Also very important to note is that if you are changing the IB stack,
> then Lustre also needs to be recompiled to work with the new IB stack.

Yes. As a matter of fact, you have anticipated a question I have: how does one re-build Lustre in a safe and consistent manner? I'm working through the docs, but I have come across a problem when I try to run "make rpms" in the Lustre source:

make[4]: *** No rule to make target `/usr/src/redhat/BUILD/lustre-1.6.5.1/ldiskfs/Module.symvers', needed by `Module.symvers'.  Stop.

How do I ensure that the build environment that Lustre requires is properly prepared? I could just hoick a soft link to the Module.symvers file in the kernel tree, but that's a little messy.

I've attached a draft copy of the build process to this message. Again, I'm just looking to sanity-check the method, since I'm obviously missing something. I'm going to rebuild the servers today so that I can provide the debug information that Brian requested.

Regards,

Malcolm.
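One thing worth checking for the Module.symvers failure (a sketch under assumptions, not a verified fix): the ldiskfs Module.symvers is generated during the build rather than shipped, so configure needs to be pointed at the patched kernel tree and complete cleanly before "make rpms" is run.

```shell
# Hedged sketch: verify the build environment before attempting "make rpms".
cd /usr/src/redhat/BUILD/lustre-1.6.5.1
./configure --with-linux=/usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1 \
    2>&1 | tee configure.log
# Any hit here usually explains the later Module.symvers failure:
grep -i 'error' configure.log
make rpms
```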
Malcolm Cowe
2008-Oct-07 09:46 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Now with attachment. Sorry.

Malcolm.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Lustre-from-Source-RHEL4.5.txt
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081007/f84bca54/attachment.txt
Malcolm Cowe
2008-Oct-07 16:04 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Brian (and company),

Thanks for taking an interest in this. I appreciate that you guys have a lot on your plate, so I'm grateful for any feedback you can provide.

As requested, I have attached a transcript of the RPM installation process, as used against a completely clean RHEL 4.5 AS installation on an X4200m2 server with a PCIe InfiniBand HCA.

Taking your advice on board regarding the OFED kernel modules, I am going to try creating the file system using only the material supplied as part of the Lustre download, plus the RDAC kernel modules for the STK 6140.

Regards,

Malcolm.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-binary-install.out
Type: application/octet-stream
Size: 82539 bytes
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081007/236ff3a6/attachment-0001.obj
Malcolm Cowe
2008-Oct-13 08:41 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
The X4200m2 MDS systems and the X4500 OSS were rebuilt using the stock Lustre packages (kernel + modules + userspace). With the exception of the RDAC kernel module, no additional software was applied to the systems. We recreated our volumes and ran the servers over the weekend. However, the OSS crashed about 8 hours in. The syslog output is attached to this message.

It looks like it could be similar to bug #16404, which would mean patching and rebuilding the kernel. Given my lack of success trying to build from source, I am again asking for some guidance on how to do this. I sent out the steps I used to try to build from source on the 7th because I was encountering problems and was unable to get a working set of packages. Included in that message was output from quilt implying that the kernel patching process was not working properly.

Regards,

Malcolm.

--
Malcolm Cowe
Solutions Integration Engineer

Sun Microsystems, Inc.
Blackness Road
Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: Malcolm.Cowe at Sun.COM
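[Editorial sketch] Since the quilt output suggested the patch series was not applying, it can help to isolate patch application from the rest of the build. The following self-contained example shows the same mechanism quilt drives underneath, patch(1) with a dry run first; the directory, file, and patch here are placeholders, not the real Lustre series:

```shell
# Build a one-file "tree" and a unified diff, then verify the patch
# applies cleanly before actually touching the tree.
mkdir -p /tmp/patchdemo && cd /tmp/patchdemo
printf 'hello\n' > file.txt
cat > fix.patch <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1 @@
-hello
+patched
EOF
patch -p1 --dry-run < fix.patch   # reports failed hunks without changing anything
patch -p1 < fix.patch             # apply for real
grep patched file.txt
```

Running `quilt push -av` in the patched kernel tree gives roughly the same per-hunk report; a rejected hunk there usually means the series does not match the kernel version being patched.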
Brock Palen
2008-Oct-13 13:53 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
I know you said the only addition was the RDAC (for the MDS's, I assume; we use it too, just fine). When I ran faultmond from Sun's dcmu RPM (RHEL 4 here), the x4500s would crash like clockwork, roughly every 48 hours. For such a simple bit of code I was surprised: the one time I forgot to turn it on while working on the load, the crash did not happen. Just FYI, it was unrelated to Lustre (we use the provided RPMs, no kernel build); disabling faultmond solved my problem on the x4500.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Oct 13, 2008, at 4:41 AM, Malcolm Cowe wrote:

> The X4200m2 MDS systems and the X4500 OSS were rebuilt using the
> stock Lustre packages (kernel + modules + userspace). With the
> exception of the RDAC kernel module, no additional software was
> applied to the systems. We recreated our volumes and ran the
> servers over the weekend. However, the OSS crashed about 8 hours
> in. The syslog output is attached to this message.
>
> Looks like it could be similar to bug #16404, which means patching
> and rebuilding the kernel. Given my lack of success at trying to
> build from source, I am again asking for some guidance on how to do
> this. I sent out the steps I used to try to build from source on
> the 7th because I was encountering problems and was unable to get a
> working set of packages. Included in that message was output from
> quilt that implies that the kernel patching process was not working
> properly.
>
> Regards,
>
> Malcolm.
>
> --
> Malcolm Cowe
> Solutions Integration Engineer
>
> Sun Microsystems, Inc.
> Blackness Road
> Linlithgow, West Lothian EH49 7LR UK
> Phone: x73602 / +44 1506 673 602
> Email: Malcolm.Cowe at Sun.COM
>
> Oct 10 06:49:39 oss-1 kernel: LDISKFS FS on md15, internal journal
> Oct 10 06:49:39 oss-1 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Oct 10 06:53:42 oss-1 kernel: kjournald starting.
Commit interval > 5 seconds > Oct 10 06:53:42 oss-1 kernel: LDISKFS FS on md16, internal journal > Oct 10 06:53:42 oss-1 kernel: LDISKFS-fs: mounted filesystem with > ordered data mode. > Oct 10 06:57:49 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 06:57:49 oss-1 kernel: LDISKFS FS on md17, internal journal > Oct 10 06:57:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with > ordered data mode. > Oct 10 07:44:55 oss-1 faultmond: 16:Polling all 48 slots for drive > fault > Oct 10 07:45:00 oss-1 faultmond: Polling cycle 16 is complete > Oct 10 07:56:23 oss-1 kernel: Lustre: OBD class driver, > info at clusterfs.com > Oct 10 07:56:23 oss-LDISKFS-fs: file extents enabled1 kernel: > Lustre VersionLDISKFS-fs: mballoc enabled > : 1.6.5.1 > Oct 10 07:56:23 oss-1 kernel: Build Version: > 1.6.5.1-19691231190000-PRISTINE-.cache.OLDRPMS.20080618230526.linux- > smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64-2.6.9-67.0.7.EL_lustre. > 1.6.5.1smp > Oct 10 07:56:24 oss-1 kernel: Lustre: Added LNI 192.168.30.111 at o2ib > [8/64] > Oct 10 07:56:24 oss-1 kernel: Lustre: Lustre Client File System; > info at clusterfs.com > Oct 10 07:56:24 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal > on md21 > Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 07:56:24 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal > on md21 > Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mballoc enabled > Lustre: Request x1 sent from MGC192.168.30.101 at o2ib to NID > 192.168.30.101 at o2ib 5s ago has timed out (limit 5s). 
> Oct 10 07:56:30 oss-1 kernel: Lustre: Request x1 sent from > MGC192.168.30.101 at o2ib to NID 192.168.30.101 at o2ib 5s ago has timed > out (limit 5s). > LustreError: 4685:0:(events.c:55:request_out_callback()) @@@ type > 4, status -113 req at 00000101f8ef3200 x3/t0 o250- > >MGS at MGC192.168.30.101@o2ib_1:26/25 lens 240/400 e 0 to 5 dl > 1223621815 ref 2 fl Rpc:/0/0 rc 0/0 > Lustre: Request x3 sent from MGC192.168.30.101 at o2ib to NID > 192.168.30.102 at o2ib 0s ago has timed out (limit 5s). > LustreError: 18125:0:(obd_mount.c:1062:server_start_targets()) > Required registration failed for lfs01-OSTffff: -5 > LustreError: 15f-b: Communication error with the MGS. Is the MGS > running? > LustreError: 18125:0:(obd_mount.c:1597:server_fill_super()) Unable > to start targets: -5 > LustreError: 18125:0:(obd_mount.c:1382:server_put_super()) no obd > lfs01-OSTffff > LustreError: 18125:0:(obd_mount.c:119:server_deregister_mount()) > lfs01-OSTffff not registered > LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success) > LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 > breaks, 0 lost > LDISKFS-fs: mballoc: 0 generated and it took 0 > LDISKFS-fs: mballoc: 0 preallocated, 0 discarded > Oct 10 07:56:50 oss-1 kernel: Lustre: Changing connection for > MGC192.168.30.101 at o2ib to MGC192.1Lustre: server umount lfs01- > OSTffff complete > 68.30.101 at o2ib_1LustreError: 18125:0:(obd_mount.c: > 1951:lustre_fill_super()) Unable to mount (-5) > /192.168.30.102 at o2ib > Oct 10 07:56:50 oss-1 kernel: LustreError: 4685:0:(events.c: > 55:request_out_callback()) @@@ type 4, status -113 > req at 00000101f8ef3200 x3/t0 o250->MGS at MGC192.168.30.101@o2ib_1:26/25 > lens 240/400 e 0 to 5 dl 1223621815 ref 2 fl Rpc:/0/0 rc 0/0Oct 10 > 07:56:50 oss-1 kernel: Lustre: Request x3 sent from > MGC192.168.30.101 at o2ib to NID 192.168.30.102 at o2ib 0s ago has timed > out (limit 5s). 
> Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: > 1062:server_start_targets()) Required registration failed for lfs01- > OSTffff: -5 > Oct 10 07:56:50 oss-1 kernel: LustreError: 15f-b: Communication > error with the MGS. Is the MGS running? > Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: > 1597:server_fill_super()) Unable to start targets: -5 > Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: > 1382:server_put_super()) no obd lfs01-OSTffff > Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: > 119:server_deregister_mount()) lfs01-OSTffff not registered > Oct 10 07:56:50 oss-1 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs > (0 success) > Oct 10 07:56:50 oss-1 kernel: LDISKFS-fs: mballoc: 0 extents > scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost > Oct 10 07:56:51 oss-1 kernel: LDISKFS-fs: mballoc: 0 generated and > it took 0 > Oct 10 07:56:51 oss-1 kernel: LDISKFS-fs: mballoc: 0 preallocated, > 0 discarded > Oct 10 07:56:51 oss-1 kernel: Lustre: server umount lfs01-OSTffff > complete > Oct 10 07:56:51 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: > 1951:lustre_fill_super()) Unable to mount (-5) > LustreError: 6644:0:(events.c:55:request_out_callback()) @@@ type > 4, status -113 req at 00000103f7a50600 x1/t0 o250- > >MGS at MGC192.168.30.101@o2ib_1:26/25 lens 240/400 e 0 to 5 dl > 1223621790 ref 1 fl Complete:EX/0/0 rc -110/0 > Oct 10 07:57:15 oss-1 kernel: LustreError: 6644:0:(events.c: > 55:request_out_callback()) @@@ type 4, status -113 > req at 00000103f7a50600 x1/t0 o250->MGS at MGC192.168.30.101@o2ib_1:26/25 > lens 240/400 e 0 to 5 dl 1223621790 ref 1 fl Complete:EX/0/0 rc -110/0 > Oct 10 08:04:09 oss-1 sshd(pam_unix)[18530]: session opened for > user root by root(uid=0) > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Lustre: lfs01-OST0000: new disk, initializing > Lustre: Server lfs01-OST0000 on device /dev/md11 has started > Oct 10 08:06:49 oss-1 kernel: kjournald 
starting. Commit interval > 5 seconds > Oct 10 08:06:49 oss-1 kernel: LDISKFS FS on md11, external journal > on md21 > Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:06:49 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:06:49 oss-1 kernel: LDISKFS FS on md11, external journal > on md21 > Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:06:49 oss-1 kernel: Lustre: Filtering OBD driver; > info at clusterfs.com > Oct 10 08:06:49 oss-1 kernel: Lustre: lfs01-OST0000: new disk, > initializing > Oct 10 08:06:49 oss-1 kernel: Lustre: OST lfs01-OST0000 now serving > dev (lfs01-OST0000/ccc68ac6-5b58-acd6-455b-2df9d2980009) with > recovery enabled > Oct 10 08:06:49 oss-1 kernel: Lustre: Server lfs01-OST0000 on > device /dev/md11 has started > Lustre: lfs01-OST0000: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:06:54 oss-1 kernel: Lustre: lfs01-OST0000: received MDS > connection from 192.168.30.101 at o2ib > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Lustre: lfs01-OST0001: new disk, initializing > Lustre: Server lfs01-OST0001 on device /dev/md12 has started > Oct 10 08:06:56 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:06:56 oss-1 kernel: LDISKFS FS on md12, external journal > on md22 > Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:06:56 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:06:56 oss-1 kernel: LDISKFS FS on md12, external journal > on md22 > Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. 
> Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:06:56 oss-1 kernel: Lustre: lfs01-OST0001: new disk, > initializing > Oct 10 08:06:56 oss-1 kernel: Lustre: OST lfs01-OST0001 now serving > dev (lfs01-OST0001/b2122e87-be36-bd1a-4e40-fdd41e626d0b) with > recovery enabled > Oct 10 08:06:56 oss-1 kernel: Lustre: Server lfs01-OST0001 on > device /dev/md12 has started > Lustre: lfs01-OST0001: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:07:01 oss-1 kernel: Lustre: lfs01-OST0001: received MDS > connection from 192.168.30.101 at o2ib > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Lustre: lfs01-OST0002: new disk, initializing > Lustre: Server lfs01-OST0002 on device /dev/md13 has started > Oct 10 08:07:02 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:02 oss-1 kernel: LDISKFS FS on md13, external journal > on md23 > Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:07:02 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:02 oss-1 kernel: LDISKFS FS on md13, external journal > on md23 > Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. 
> Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:07:02 oss-1 kernel: Lustre: lfs01-OST0002: new disk, > initializing > Oct 10 08:07:02 oss-1 kernel: Lustre: OST lfs01-OST0002 now serving > dev (lfs01-OST0002/13c66dfa-47c5-b350-43e3-3c3b67c358b6) with > recovery enabled > Oct 10 08:07:02 oss-1 kernel: Lustre: Server lfs01-OST0002 on > device /dev/md13 has started > Lustre: lfs01-OST0002: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:07:06 oss-1 kernel: Lustre: lfs01-OST0002: received MDS > connection from 192.168.30.101 at o2ib > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Oct 10 08:07:08 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > OcLustre: lfs01-OST0003: new disk, initializing > t 10 08:07:08 oss-1 kernel: LDISKFS FS on md15, external > journalLustre: Server lfs01-OST0003 on device /dev/md15 has started > on md25 > Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:07:08 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:08 oss-1 kernel: LDISKFS FS on md15, external journal > on md25 > Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. 
> Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:07:08 oss-1 kernel: Lustre: lfs01-OST0003: new disk, > initializing > Oct 10 08:07:08 oss-1 kernel: Lustre: OST lfs01-OST0003 now serving > dev (lfs01-OST0003/d6fd7a9d-3bb8-ae05-41ed-bbfb1b6b0303) with > recovery enabled > Oct 10 08:07:08 oss-1 kernel: Lustre: Server lfs01-OST0003 on > device /dev/md15 has started > Lustre: lfs01-OST0003: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:07:12 oss-1 kernel: Lustre: lfs01-OST0003: received MDS > connection from 192.168.30.101 at o2ib > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Lustre: lfs01-OST0004: new disk, initializing > Oct 10 08:07:14 oss-1 kernel: kjournald starting. Commit > intervLustre: Server lfs01-OST0004 on device /dev/md16 has started > al 5 seconds > Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on md16, external journal > on md26 > Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:07:14 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on md16, external journal > on md26 > Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. 
> Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:07:14 oss-1 kernel: Lustre: lfs01-OST0004: new disk, > initializing > Oct 10 08:07:14 oss-1 kernel: Lustre: OST lfs01-OST0004 now serving > dev (lfs01-OST0004/661dcb52-7ef9-8274-45d7-4441e36410d1) with > recovery enabled > Oct 10 08:07:14 oss-1 kernel: Lustre: Server lfs01-OST0004 on > device /dev/md16 has started > Lustre: lfs01-OST0004: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:07:18 oss-1 kernel: Lustre: lfs01-OST0004: received MDS > connection from 192.168.30.101 at o2ib > LDISKFS-fs: file extents enabled > LDISKFS-fs: mballoc enabled > Lustre: lfs01-OST0005: new disk, initializing > Lustre: Server lfs01-OST0005 on device /dev/md17 has started > Oct 10 08:07:19 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:19 oss-1 kernel: LDISKFS FS on md17, external journal > on md27 > Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. > Oct 10 08:07:19 oss-1 kernel: kjournald starting. Commit interval > 5 seconds > Oct 10 08:07:19 oss-1 kernel: LDISKFS FS on md17, external journal > on md27 > Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: mounted filesystem with > journal data mode. 
> Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: file extents enabled > Oct 10 08:07:20 oss-1 kernel: LDISKFS-fs: mballoc enabled > Oct 10 08:07:20 oss-1 kernel: Lustre: lfs01-OST0005: new disk, > initializing > Oct 10 08:07:20 oss-1 kernel: Lustre: OST lfs01-OST0005 now serving > dev (lfs01-OST0005/978ba68c-0ba7-9ac7-439f-964ca7bf86a3) with > recovery enabled > Oct 10 08:07:20 oss-1 kernel: Lustre: Server lfs01-OST0005 on > device /dev/md17 has started > Lustre: lfs01-OST0005: received MDS connection from > 192.168.30.101 at o2ib > Oct 10 08:07:25 oss-1 kernel: Lustre: lfs01-OST0005: received MDS > connection from 192.168.30.101 at o2ib > Oct 10 08:45:00 oss-1 faultmond: 17:Polling all 48 slots for drive > fault > Oct 10 08:45:06 oss-1 faultmond: Polling cycle 17 is complete > Oct 10 09:45:06 oss-1 faultmond: 18:Polling all 48 slots for drive > fault > Oct 10 09:45:12 oss-1 faultmond: Polling cycle 18 is complete > Oct 10 10:45:12 oss-1 faultmond: 19:Polling all 48 slots for drive > fault > Oct 10 10:45:17 oss-1 faultmond: Polling cycle 19 is complete > > LustreError: 18732:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0001: slow setattr 85s > Oct 10 10:48:14 oss-1 kernel: LustreError: 18732:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 85s > Oct 10 11:45:17 oss-1 faultmond: 20:Polling all 48 slots for drive > fault > Oct 10 11:45:25 oss-1 faultmond: Polling cycle 20 is complete > Oct 10 12:45:25 oss-1 faultmond: 21:Polling all 48 slots for drive > fault > Oct 10 12:45:33 oss-1 faultmond: Polling cycle 21 is complete > Lustre: 18805:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0005: slow setattr 33s > Oct 10 13:14:46 oss-1 kernel: Lustre: 18805:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 33s > Lustre: 18794:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0000: slow setattr 43s > Oct 10 13:15:03 oss-1 kernel: Lustre: 18794:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0000: slow setattr 43s 
> Lustre: 18815:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0004: slow setattr 40s > Oct 10 13:15:13 oss-1 kernel: Lustre: 18815:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 40s > Lustre: 18809:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0003: slow i_mutex 31s > Lustre: 18753:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0003: slow i_mutex 31s > Oct 10 13:15:25 oss-1 kernel: Lustre: 18809:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s > Oct 10 13:15:25 oss-1 kernel: Lustre: 18753:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s > Lustre: 18768:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0002: slow i_mutex 34s > Lustre: 18768:0:(filter_io_26.c:700:filter_commitrw_write()) > Skipped 2 previous similar messages > Oct 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 34s > Oct 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: > 700:filter_commitrw_write()) Skipped 2 previous similar messages > Lustre: 18833:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0001: slow i_mutex 37s > Oct 10 13:15:31 oss-1 kernel: Lustre: 18833:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0001: slow i_mutex 37s > Lustre: 18812:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0002: slow i_mutex 40s > Lustre: 18844:0:(filter_io_26.c:765:filter_commitrw_write()) lfs01- > OST0003: slow direct_io 40s > Oct 10 13:15:34 oss-1 kernel: Lustre: 18812:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 40s > Oct 10 13:15:34 oss-1 kernel: Lustre: 18844:0:(filter_io_26.c: > 765:filter_commitrw_write()) lfs01-OST0003: slow direct_io 40s > Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0001: slow setattr 41s > Lustre: 18849:0:(filter_io_26.c:765:filter_commitrw_write()) lfs01- > OST0001: slow direct_io 31s 
> Oct 10 13:15:35 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 41s > Oct 10 13:15:35 oss-1 kernel: Lustre: 18849:0:(filter_io_26.c: > 765:filter_commitrw_write()) lfs01-OST0001: slow direct_io 31s > LustreError: 18765:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0002: slow setattr 51s > Oct 10 13:15:38 oss-1 kernel: LustreError: 18765:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 51s > Lustre: 18756:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- > OST0002: slow i_mutex 45s > Oct 10 13:15:39 oss-1 kernel: Lustre: 18756:0:(filter_io_26.c: > 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 45s > Oct 10 13:45:33 oss-1 faultmond: 22:Polling all 48 slots for drive > fault > Oct 10 13:45:41 oss-1 faultmond: Polling cycle 22 is complete > Oct 10 14:45:41 oss-1 faultmond: 23:Polling all 48 slots for drive > fault > Oct 10 14:45:49 oss-1 faultmond: Polling cycle 23 is complete > Lustre: 18740:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0000: slow setattr 38s > Oct 10 15:40:41 oss-1 kernel: Lustre: 18740:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0000: slow setattr 38s > LustreError: 18830:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0004: slow setattr 60s > Lustre: 18767:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0005: slow setattr 38s > Oct 10 15:41:13 oss-1 kernel: LustreError: 18830:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 60s > Oct 10 15:41:13 oss-1 kernel: Lustre: 18767:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 38s > Lustre: 18796:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0001: slow setattr 44s > Oct 10 15:41:20 oss-1 kernel: Lustre: 18796:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 44s > LustreError: 18831:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0002: slow setattr 62s > Oct 10 15:41:21 oss-1 kernel: LustreError: 
18831:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 62s > Oct 10 15:45:49 oss-1 faultmond: 24:Polling all 48 slots for drive > fault > Oct 10 15:45:58 oss-1 faultmond: Polling cycle 24 is complete > Oct 10 16:45:58 oss-1 faultmond: 25:Polling all 48 slots for drive > fault > Oct 10 16:46:06 oss-1 faultmond: Polling cycle 25 is complete > Oct 10 17:46:06 oss-1 faultmond: 26:Polling all 48 slots for drive > fault > Oct 10 17:46:15 oss-1 faultmond: Polling cycle 26 is complete > Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0000: slow setattr 41s > Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ > Slow req_in handling 7s req at 00000101e8f1de00 x15789/t0 o13-><?>@<? > >:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 > Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ > Slow req_in handling 7s req at 00000101e8f1da00 x15790/t0 o13-><?>@<? > >:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 > Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) > Skipped 3 previous similar messages > Lustre: 18764:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0004: slow setattr 40s > Oct 10 18:06:33 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0000: slow setattr 41s > Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0:(service.c: > 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s > req at 00000101e8f1de00 x15789/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to 0 > dl 0 ref 1 fl New:/0/0 rc 0/0 > Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0:(service.c: > 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s > req at 00000101e8f1da00 x15790/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to 0 > dl 0 ref 1 fl New:/0/0 rc 0/0 > Lustre: 18845:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0002: slow setattr 44s > Lustre: 18579:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ > Slow req_in handling 14s req at 00000103f8dabe00 x7271650/t0 
o103-><? > >@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 > Oct 10 18:06:54 oss-1 kernel: Lustre: 18726:0:(service.c: > 918:ptlrpc_server_handle_req_in()) Skipped 3 previous similar messages > Oct 10 18:06:54 oss-1 kernel: Lustre: 18764:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 40s > Oct 10 18:06:54 oss-1 kernel: Lustre: 18845:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 44s > Oct 10 18:06:54 oss-1 kernel: Lustre: 18579:0:(service.c: > 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 14s > req at 00000103f8dabe00 x7271650/t0 o103-><?>@<?>:0/0 lens 232/0 e 0 > to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 > Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0005: slow setattr 32s > Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr()) Skipped 1 > previous similar message > Oct 10 18:06:59 oss-1 kernel: Lustre: 18766:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 32s > Oct 10 18:06:59 oss-1 kernel: Lustre: 18766:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) Skipped 1 previous similar message > Lustre: 18826:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- > OST0003: slow setattr 45s > Oct 10 18:07:04 oss-1 kernel: Lustre: 18826:0:(lustre_fsfilt.h: > 312:fsfilt_setattr()) lfs01-OST0003: slow setattr 45s > Oct 10 18:46:15 oss-1 faultmond: 27:Polling all 48 slots for drive > fault > ----------- [cut here ] --------- [please bite here ] --------- > Kernel BUG at spinlock:76 > invalid operand: 0000 [1] SMP > CPU 2 > Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) > lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) > obdclass(U) lvfs(U) ldiskfs(U) lnet(U) libcfs(U) raid5(U) xor(U) > parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) > ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) rdma_ucm(U) > qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) > md5(U) ipv6(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) 
mlx4_ib(U) mlx4_core
> (U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_multipath
> (U) dm_mod(U) button(U) battery(U) ac(U) joydev(U) ohci_hcd(U)
> ehci_hcd(U) hw_random(U) edac_mc(U) ib_mthca(U) ib_umad(U) ib_ucm
> (U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) e1000(U)
> ext3(U) jbd(U) raid1(U) mv_sata(U) sd_mod(U) scsi_mod(U)
Malcolm Cowe
2008-Oct-13 15:03 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Brock Palen wrote:> I know you say the only addition was the RDAC for the MDS''s I assume > (we use it also just fine). > >Yes, the MDS''s share a STK 6140.> When I ran faultmond from suns dcmu rpm (RHEL 4 here) the x4500''s > would crash like clock work ~48 hours. For a very simple bit of code > I was surpised that once when I forgot to turn it on when working on > the load this would happen. Just FYI it was unrelated to lustre > (using provided rpm''s no kernel build) this solved my problem on the > x4500 > >The DCMU RPM is installed. I didn''t explicitly install this, so it must have been bundled in with the SIA CD... I''ll try removing the rpm to see what happens. Thanks for the heads up. Regards, Malcolm.> Brock Palen > www.umich.edu/~brockp > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > > On Oct 13, 2008, at 4:41 AM, Malcolm Cowe wrote: > > >> The X4200m2 MDS systems and the X4500 OSS were rebuilt using the >> stock Lustre packages (Kernel + modules + userspace). With the >> exception of the RDAC kernel module, no additional software was >> applied to the systems. We recreated our volumes and ran the >> servers over the weekend. However, the OSS crashed about 8 hours >> in. The syslog output is attached to this message. >> >> Looks like it could be similar to bug #16404, which means patching >> and rebuilding the kernel. Given my lack of success at trying to >> build from source, I am again asking for some guidance on how to do >> this. I sent out the steps I used to try and build from source on >> the 7th because I was encountering problems and was unable to get a >> working set of packages. Included in that messages was output from >> quilt that implies that the kernel patching process was not working >> properly. >> >> >> Regards, >> >> Malcolm. >> >> -- >> <6g_top.gif> >> Malcolm Cowe >> Solutions Integration Engineer >> >> Sun Microsystems, Inc. 
>> Blackness Road >> Linlithgow, West Lothian EH49 7LR UK >> Phone: x73602 / +44 1506 673 602 >> Email: Malcolm.Cowe at Sun.COM >> <6g_top.gif> >> Oct 10 06:49:39 oss-1 kernel: LDISKFS FS on md15, internal journal >> Oct 10 06:49:39 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> ordered data mode. >> Oct 10 06:53:42 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 06:53:42 oss-1 kernel: LDISKFS FS on md16, internal journal >> Oct 10 06:53:42 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> ordered data mode. >> Oct 10 06:57:49 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 06:57:49 oss-1 kernel: LDISKFS FS on md17, internal journal >> Oct 10 06:57:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> ordered data mode. >> Oct 10 07:44:55 oss-1 faultmond: 16:Polling all 48 slots for drive >> fault >> Oct 10 07:45:00 oss-1 faultmond: Polling cycle 16 is complete >> Oct 10 07:56:23 oss-1 kernel: Lustre: OBD class driver, >> info at clusterfs.com >> Oct 10 07:56:23 oss-LDISKFS-fs: file extents enabled1 kernel: >> Lustre VersionLDISKFS-fs: mballoc enabled >> : 1.6.5.1 >> Oct 10 07:56:23 oss-1 kernel: Build Version: >> 1.6.5.1-19691231190000-PRISTINE-.cache.OLDRPMS.20080618230526.linux- >> smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64-2.6.9-67.0.7.EL_lustre. >> 1.6.5.1smp >> Oct 10 07:56:24 oss-1 kernel: Lustre: Added LNI 192.168.30.111 at o2ib >> [8/64] >> Oct 10 07:56:24 oss-1 kernel: Lustre: Lustre Client File System; >> info at clusterfs.com >> Oct 10 07:56:24 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal >> on md21 >> Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 07:56:24 oss-1 kernel: kjournald starting. 
Commit interval >> 5 seconds >> Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal >> on md21 >> Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Lustre: Request x1 sent from MGC192.168.30.101 at o2ib to NID >> 192.168.30.101 at o2ib 5s ago has timed out (limit 5s). >> Oct 10 07:56:30 oss-1 kernel: Lustre: Request x1 sent from >> MGC192.168.30.101 at o2ib to NID 192.168.30.101 at o2ib 5s ago has timed >> out (limit 5s). >> LustreError: 4685:0:(events.c:55:request_out_callback()) @@@ type >> 4, status -113 req at 00000101f8ef3200 x3/t0 o250- >> >>> MGS at MGC192.168.30.101@o2ib_1:26/25 lens 240/400 e 0 to 5 dl >>> >> 1223621815 ref 2 fl Rpc:/0/0 rc 0/0 >> Lustre: Request x3 sent from MGC192.168.30.101 at o2ib to NID >> 192.168.30.102 at o2ib 0s ago has timed out (limit 5s). >> LustreError: 18125:0:(obd_mount.c:1062:server_start_targets()) >> Required registration failed for lfs01-OSTffff: -5 >> LustreError: 15f-b: Communication error with the MGS. Is the MGS >> running? 
>> LustreError: 18125:0:(obd_mount.c:1597:server_fill_super()) Unable >> to start targets: -5 >> LustreError: 18125:0:(obd_mount.c:1382:server_put_super()) no obd >> lfs01-OSTffff >> LustreError: 18125:0:(obd_mount.c:119:server_deregister_mount()) >> lfs01-OSTffff not registered >> LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success) >> LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 >> breaks, 0 lost >> LDISKFS-fs: mballoc: 0 generated and it took 0 >> LDISKFS-fs: mballoc: 0 preallocated, 0 discarded >> Lustre: server umount lfs01-OSTffff complete >> LustreError: 18125:0:(obd_mount.c:1951:lustre_fill_super()) Unable >> to mount (-5) >> Oct 10 07:56:50 oss-1 kernel: Lustre: Changing connection for >> MGC192.168.30.101 at o2ib to >> MGC192.168.30.101 at o2ib_1/192.168.30.102 at o2ib >> Oct 10 07:56:50 oss-1 kernel: LustreError: 4685:0:(events.c: >> 55:request_out_callback()) @@@ type 4, status -113 >> req at 00000101f8ef3200 x3/t0 o250->MGS at MGC192.168.30.101@o2ib_1:26/25 >> lens 240/400 e 0 to 5 dl 1223621815 ref 2 fl Rpc:/0/0 rc 0/0 >> Oct 10 07:56:50 oss-1 kernel: Lustre: Request x3 sent from >> MGC192.168.30.101 at o2ib to NID 192.168.30.102 at o2ib 0s ago has timed >> out (limit 5s). >> Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: >> 1062:server_start_targets()) Required registration failed for lfs01- >> OSTffff: -5 >> Oct 10 07:56:50 oss-1 kernel: LustreError: 15f-b: Communication >> error with the MGS. Is the MGS running? 
>> Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: >> 1597:server_fill_super()) Unable to start targets: -5 >> Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: >> 1382:server_put_super()) no obd lfs01-OSTffff >> Oct 10 07:56:50 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: >> 119:server_deregister_mount()) lfs01-OSTffff not registered >> Oct 10 07:56:50 oss-1 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs >> (0 success) >> Oct 10 07:56:50 oss-1 kernel: LDISKFS-fs: mballoc: 0 extents >> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost >> Oct 10 07:56:51 oss-1 kernel: LDISKFS-fs: mballoc: 0 generated and >> it took 0 >> Oct 10 07:56:51 oss-1 kernel: LDISKFS-fs: mballoc: 0 preallocated, >> 0 discarded >> Oct 10 07:56:51 oss-1 kernel: Lustre: server umount lfs01-OSTffff >> complete >> Oct 10 07:56:51 oss-1 kernel: LustreError: 18125:0:(obd_mount.c: >> 1951:lustre_fill_super()) Unable to mount (-5) >> LustreError: 6644:0:(events.c:55:request_out_callback()) @@@ type >> 4, status -113 req at 00000103f7a50600 x1/t0 o250- >> >>> MGS at MGC192.168.30.101@o2ib_1:26/25 lens 240/400 e 0 to 5 dl >>> >> 1223621790 ref 1 fl Complete:EX/0/0 rc -110/0 >> Oct 10 07:57:15 oss-1 kernel: LustreError: 6644:0:(events.c: >> 55:request_out_callback()) @@@ type 4, status -113 >> req at 00000103f7a50600 x1/t0 o250->MGS at MGC192.168.30.101@o2ib_1:26/25 >> lens 240/400 e 0 to 5 dl 1223621790 ref 1 fl Complete:EX/0/0 rc -110/0 >> Oct 10 08:04:09 oss-1 sshd(pam_unix)[18530]: session opened for >> user root by root(uid=0) >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0000: new disk, initializing >> Lustre: Server lfs01-OST0000 on device /dev/md11 has started >> Oct 10 08:06:49 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:06:49 oss-1 kernel: LDISKFS FS on md11, external journal >> on md21 >> Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:06:49 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:06:49 oss-1 kernel: LDISKFS FS on md11, external journal >> on md21 >> Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:06:49 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:06:49 oss-1 kernel: Lustre: Filtering OBD driver; >> info at clusterfs.com >> Oct 10 08:06:49 oss-1 kernel: Lustre: lfs01-OST0000: new disk, >> initializing >> Oct 10 08:06:49 oss-1 kernel: Lustre: OST lfs01-OST0000 now serving >> dev (lfs01-OST0000/ccc68ac6-5b58-acd6-455b-2df9d2980009) with >> recovery enabled >> Oct 10 08:06:49 oss-1 kernel: Lustre: Server lfs01-OST0000 on >> device /dev/md11 has started >> Lustre: lfs01-OST0000: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:06:54 oss-1 kernel: Lustre: lfs01-OST0000: received MDS >> connection from 192.168.30.101 at o2ib >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0001: new disk, initializing >> Lustre: Server lfs01-OST0001 on device /dev/md12 has started >> Oct 10 08:06:56 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:06:56 oss-1 kernel: LDISKFS FS on md12, external journal >> on md22 >> Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:06:56 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:06:56 oss-1 kernel: LDISKFS FS on md12, external journal >> on md22 >> Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:06:56 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:06:56 oss-1 kernel: Lustre: lfs01-OST0001: new disk, >> initializing >> Oct 10 08:06:56 oss-1 kernel: Lustre: OST lfs01-OST0001 now serving >> dev (lfs01-OST0001/b2122e87-be36-bd1a-4e40-fdd41e626d0b) with >> recovery enabled >> Oct 10 08:06:56 oss-1 kernel: Lustre: Server lfs01-OST0001 on >> device /dev/md12 has started >> Lustre: lfs01-OST0001: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:07:01 oss-1 kernel: Lustre: lfs01-OST0001: received MDS >> connection from 192.168.30.101 at o2ib >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0002: new disk, initializing >> Lustre: Server lfs01-OST0002 on device /dev/md13 has started >> Oct 10 08:07:02 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:02 oss-1 kernel: LDISKFS FS on md13, external journal >> on md23 >> Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:07:02 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:02 oss-1 kernel: LDISKFS FS on md13, external journal >> on md23 >> Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:07:02 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:07:02 oss-1 kernel: Lustre: lfs01-OST0002: new disk, >> initializing >> Oct 10 08:07:02 oss-1 kernel: Lustre: OST lfs01-OST0002 now serving >> dev (lfs01-OST0002/13c66dfa-47c5-b350-43e3-3c3b67c358b6) with >> recovery enabled >> Oct 10 08:07:02 oss-1 kernel: Lustre: Server lfs01-OST0002 on >> device /dev/md13 has started >> Lustre: lfs01-OST0002: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:07:06 oss-1 kernel: Lustre: lfs01-OST0002: received MDS >> connection from 192.168.30.101 at o2ib >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0003: new disk, initializing >> Lustre: Server lfs01-OST0003 on device /dev/md15 has started >> Oct 10 08:07:08 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:08 oss-1 kernel: LDISKFS FS on md15, external journal >> on md25 >> Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:07:08 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:08 oss-1 kernel: LDISKFS FS on md15, external journal >> on md25 >> Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:07:08 oss-1 kernel: Lustre: lfs01-OST0003: new disk, >> initializing >> Oct 10 08:07:08 oss-1 kernel: Lustre: OST lfs01-OST0003 now serving >> dev (lfs01-OST0003/d6fd7a9d-3bb8-ae05-41ed-bbfb1b6b0303) with >> recovery enabled >> Oct 10 08:07:08 oss-1 kernel: Lustre: Server lfs01-OST0003 on >> device /dev/md15 has started >> Lustre: lfs01-OST0003: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:07:12 oss-1 kernel: Lustre: lfs01-OST0003: received MDS >> connection from 192.168.30.101 at o2ib >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0004: new disk, initializing >> Lustre: Server lfs01-OST0004 on device /dev/md16 has started >> Oct 10 08:07:14 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on md16, external journal >> on md26 >> Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:07:14 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on md16, external journal >> on md26 >> Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:07:14 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:07:14 oss-1 kernel: Lustre: lfs01-OST0004: new disk, >> initializing >> Oct 10 08:07:14 oss-1 kernel: Lustre: OST lfs01-OST0004 now serving >> dev (lfs01-OST0004/661dcb52-7ef9-8274-45d7-4441e36410d1) with >> recovery enabled >> Oct 10 08:07:14 oss-1 kernel: Lustre: Server lfs01-OST0004 on >> device /dev/md16 has started >> Lustre: lfs01-OST0004: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:07:18 oss-1 kernel: Lustre: lfs01-OST0004: received MDS >> connection from 192.168.30.101 at o2ib >> LDISKFS-fs: file extents enabled >> LDISKFS-fs: mballoc enabled >> Lustre: lfs01-OST0005: new disk, initializing >> Lustre: Server lfs01-OST0005 on device /dev/md17 has started >> Oct 10 08:07:19 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:19 oss-1 kernel: LDISKFS FS on md17, external journal >> on md27 >> Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. >> Oct 10 08:07:19 oss-1 kernel: kjournald starting. Commit interval >> 5 seconds >> Oct 10 08:07:19 oss-1 kernel: LDISKFS FS on md17, external journal >> on md27 >> Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: mounted filesystem with >> journal data mode. 
>> Oct 10 08:07:19 oss-1 kernel: LDISKFS-fs: file extents enabled >> Oct 10 08:07:20 oss-1 kernel: LDISKFS-fs: mballoc enabled >> Oct 10 08:07:20 oss-1 kernel: Lustre: lfs01-OST0005: new disk, >> initializing >> Oct 10 08:07:20 oss-1 kernel: Lustre: OST lfs01-OST0005 now serving >> dev (lfs01-OST0005/978ba68c-0ba7-9ac7-439f-964ca7bf86a3) with >> recovery enabled >> Oct 10 08:07:20 oss-1 kernel: Lustre: Server lfs01-OST0005 on >> device /dev/md17 has started >> Lustre: lfs01-OST0005: received MDS connection from >> 192.168.30.101 at o2ib >> Oct 10 08:07:25 oss-1 kernel: Lustre: lfs01-OST0005: received MDS >> connection from 192.168.30.101 at o2ib >> Oct 10 08:45:00 oss-1 faultmond: 17:Polling all 48 slots for drive >> fault >> Oct 10 08:45:06 oss-1 faultmond: Polling cycle 17 is complete >> Oct 10 09:45:06 oss-1 faultmond: 18:Polling all 48 slots for drive >> fault >> Oct 10 09:45:12 oss-1 faultmond: Polling cycle 18 is complete >> Oct 10 10:45:12 oss-1 faultmond: 19:Polling all 48 slots for drive >> fault >> Oct 10 10:45:17 oss-1 faultmond: Polling cycle 19 is complete >> >> LustreError: 18732:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0001: slow setattr 85s >> Oct 10 10:48:14 oss-1 kernel: LustreError: 18732:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 85s >> Oct 10 11:45:17 oss-1 faultmond: 20:Polling all 48 slots for drive >> fault >> Oct 10 11:45:25 oss-1 faultmond: Polling cycle 20 is complete >> Oct 10 12:45:25 oss-1 faultmond: 21:Polling all 48 slots for drive >> fault >> Oct 10 12:45:33 oss-1 faultmond: Polling cycle 21 is complete >> Lustre: 18805:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0005: slow setattr 33s >> Oct 10 13:14:46 oss-1 kernel: Lustre: 18805:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 33s >> Lustre: 18794:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0000: slow setattr 43s >> Oct 10 13:15:03 oss-1 kernel: Lustre: 18794:0:(lustre_fsfilt.h: >> 
312:fsfilt_setattr()) lfs01-OST0000: slow setattr 43s >> Lustre: 18815:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0004: slow setattr 40s >> Oct 10 13:15:13 oss-1 kernel: Lustre: 18815:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 40s >> Lustre: 18809:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0003: slow i_mutex 31s >> Lustre: 18753:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0003: slow i_mutex 31s >> Oct 10 13:15:25 oss-1 kernel: Lustre: 18809:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s >> Oct 10 13:15:25 oss-1 kernel: Lustre: 18753:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s >> Lustre: 18768:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0002: slow i_mutex 34s >> Lustre: 18768:0:(filter_io_26.c:700:filter_commitrw_write()) >> Skipped 2 previous similar messages >> Oct 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 34s >> Oct 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: >> 700:filter_commitrw_write()) Skipped 2 previous similar messages >> Lustre: 18833:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0001: slow i_mutex 37s >> Oct 10 13:15:31 oss-1 kernel: Lustre: 18833:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0001: slow i_mutex 37s >> Lustre: 18812:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0002: slow i_mutex 40s >> Lustre: 18844:0:(filter_io_26.c:765:filter_commitrw_write()) lfs01- >> OST0003: slow direct_io 40s >> Oct 10 13:15:34 oss-1 kernel: Lustre: 18812:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 40s >> Oct 10 13:15:34 oss-1 kernel: Lustre: 18844:0:(filter_io_26.c: >> 765:filter_commitrw_write()) lfs01-OST0003: slow direct_io 40s >> Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0001: slow setattr 41s >> Lustre: 
18849:0:(filter_io_26.c:765:filter_commitrw_write()) lfs01- >> OST0001: slow direct_io 31s >> Oct 10 13:15:35 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 41s >> Oct 10 13:15:35 oss-1 kernel: Lustre: 18849:0:(filter_io_26.c: >> 765:filter_commitrw_write()) lfs01-OST0001: slow direct_io 31s >> LustreError: 18765:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0002: slow setattr 51s >> Oct 10 13:15:38 oss-1 kernel: LustreError: 18765:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 51s >> Lustre: 18756:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >> OST0002: slow i_mutex 45s >> Oct 10 13:15:39 oss-1 kernel: Lustre: 18756:0:(filter_io_26.c: >> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 45s >> Oct 10 13:45:33 oss-1 faultmond: 22:Polling all 48 slots for drive >> fault >> Oct 10 13:45:41 oss-1 faultmond: Polling cycle 22 is complete >> Oct 10 14:45:41 oss-1 faultmond: 23:Polling all 48 slots for drive >> fault >> Oct 10 14:45:49 oss-1 faultmond: Polling cycle 23 is complete >> Lustre: 18740:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0000: slow setattr 38s >> Oct 10 15:40:41 oss-1 kernel: Lustre: 18740:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0000: slow setattr 38s >> LustreError: 18830:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0004: slow setattr 60s >> Lustre: 18767:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0005: slow setattr 38s >> Oct 10 15:41:13 oss-1 kernel: LustreError: 18830:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 60s >> Oct 10 15:41:13 oss-1 kernel: Lustre: 18767:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 38s >> Lustre: 18796:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0001: slow setattr 44s >> Oct 10 15:41:20 oss-1 kernel: Lustre: 18796:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 44s >> LustreError: 
18831:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0002: slow setattr 62s >> Oct 10 15:41:21 oss-1 kernel: LustreError: 18831:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 62s >> Oct 10 15:45:49 oss-1 faultmond: 24:Polling all 48 slots for drive >> fault >> Oct 10 15:45:58 oss-1 faultmond: Polling cycle 24 is complete >> Oct 10 16:45:58 oss-1 faultmond: 25:Polling all 48 slots for drive >> fault >> Oct 10 16:46:06 oss-1 faultmond: Polling cycle 25 is complete >> Oct 10 17:46:06 oss-1 faultmond: 26:Polling all 48 slots for drive >> fault >> Oct 10 17:46:15 oss-1 faultmond: Polling cycle 26 is complete >> Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0000: slow setattr 41s >> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ >> Slow req_in handling 7s req at 00000101e8f1de00 x15789/t0 o13-><?>@<? >> >>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> >> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ >> Slow req_in handling 7s req at 00000101e8f1da00 x15790/t0 o13-><?>@<? 
>> >>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> >> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) >> Skipped 3 previous similar messages >> Lustre: 18764:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0004: slow setattr 40s >> Oct 10 18:06:33 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0000: slow setattr 41s >> Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0:(service.c: >> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >> req at 00000101e8f1de00 x15789/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to 0 >> dl 0 ref 1 fl New:/0/0 rc 0/0 >> Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0:(service.c: >> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >> req at 00000101e8f1da00 x15790/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to 0 >> dl 0 ref 1 fl New:/0/0 rc 0/0 >> Lustre: 18845:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0002: slow setattr 44s >> Lustre: 18579:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ >> Slow req_in handling 14s req at 00000103f8dabe00 x7271650/t0 o103-><? 
>> >>> @<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> >> Oct 10 18:06:54 oss-1 kernel: Lustre: 18726:0:(service.c: >> 918:ptlrpc_server_handle_req_in()) Skipped 3 previous similar messages >> Oct 10 18:06:54 oss-1 kernel: Lustre: 18764:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 40s >> Oct 10 18:06:54 oss-1 kernel: Lustre: 18845:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0002: slow setattr 44s >> Oct 10 18:06:54 oss-1 kernel: Lustre: 18579:0:(service.c: >> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 14s >> req at 00000103f8dabe00 x7271650/t0 o103-><?>@<?>:0/0 lens 232/0 e 0 >> to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >> Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0005: slow setattr 32s >> Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr()) Skipped 1 >> previous similar message >> Oct 10 18:06:59 oss-1 kernel: Lustre: 18766:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0005: slow setattr 32s >> Oct 10 18:06:59 oss-1 kernel: Lustre: 18766:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) Skipped 1 previous similar message >> Lustre: 18826:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >> OST0003: slow setattr 45s >> Oct 10 18:07:04 oss-1 kernel: Lustre: 18826:0:(lustre_fsfilt.h: >> 312:fsfilt_setattr()) lfs01-OST0003: slow setattr 45s >> Oct 10 18:46:15 oss-1 faultmond: 27:Polling all 48 slots for drive >> fault >> ----------- [cut here ] --------- [please bite here ] --------- >> Kernel BUG at spinlock:76 >> invalid operand: 0000 [1] SMP >> CPU 2 >> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) >> lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) >> obdclass(U) lvfs(U) ldiskfs(U) lnet(U) libcfs(U) raid5(U) xor(U) >> parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) >> ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) rdma_ucm(U) >> qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) >> md5(U) ipv6(U) 
>> iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) ds(U)
>> yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_multipath(U)
>> dm_mod(U) button(U) battery(U) ac(U) joydev(U) ohci_hcd(U)
>> ehci_hcd(U) hw_random(U) edac_mc(U) ib_mthca(U) ib_umad(U)
>> ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U)
>> e1000(U) ext3(U) jbd(U) raid1(U) mv_sata(U) sd_mod(U) scsi_mod(U)
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss

-- 
Malcolm Cowe
Solutions Integration Engineer
Sun Microsystems, Inc.
Blackness Road, Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: Malcolm.Cowe at Sun.COM
Brock Palen
2008-Oct-13 15:31 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
I never uninstalled it (I still use some of the tools in it). Faultmond is a service; just chkconfig it off.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Oct 13, 2008, at 11:03 AM, Malcolm Cowe wrote:

> Brock Palen wrote:
>> I know you say the only addition was the RDAC for the MDS's, I
>> assume (we use it also just fine).
> Yes, the MDS's share a STK 6140.
>> When I ran faultmond from Sun's dcmu RPM (RHEL 4 here), the
>> x4500's would crash like clockwork roughly every 48 hours. For
>> such a simple bit of code I was surprised that, once, when I
>> forgot to turn it on while working on the load, this would
>> happen. Just FYI, it was unrelated to Lustre (using the provided
>> RPMs, no kernel build); this solved my problem on the x4500.
> The DCMU RPM is installed. I didn't explicitly install it, so it
> must have been bundled in with the SIA CD... I'll try removing the
> RPM to see what happens. Thanks for the heads up.
>
> Regards,
>
> Malcolm.
>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>> On Oct 13, 2008, at 4:41 AM, Malcolm Cowe wrote:
>>> The X4200m2 MDS systems and the X4500 OSS were rebuilt using the
>>> stock Lustre packages (kernel + modules + userspace). With the
>>> exception of the RDAC kernel module, no additional software was
>>> applied to the systems. We recreated our volumes and ran the
>>> servers over the weekend. However, the OSS crashed about 8 hours
>>> in. The syslog output is attached to this message. Looks like it
>>> could be similar to bug #16404, which means patching and
>>> rebuilding the kernel. Given my lack of success at trying to
>>> build from source, I am again asking for some guidance on how to
>>> do this. I sent out the steps I used to try and build from source
>>> on the 7th because I was encountering problems and was unable to
>>> get a working set of packages. 
>>> Included in that message was output from quilt that implies that
>>> the kernel patching process was not working properly.
>>>
>>> Regards,
>>>
>>> Malcolm.
>>>
>>> [quoted syslog attachment snipped -- it is reproduced in full
>>> earlier in the thread]
Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: >>> file extents enabled Oct 10 08:07:08 oss-1 kernel: LDISKFS-fs: >>> mballoc enabled Oct 10 08:07:08 oss-1 kernel: Lustre: lfs01- >>> OST0003: new disk, initializing Oct 10 08:07:08 oss-1 kernel: >>> Lustre: OST lfs01-OST0003 now serving dev (lfs01-OST0003/ >>> d6fd7a9d-3bb8-ae05-41ed-bbfb1b6b0303) with recovery enabled Oct >>> 10 08:07:08 oss-1 kernel: Lustre: Server lfs01-OST0003 on device / >>> dev/md15 has started Lustre: lfs01-OST0003: received MDS >>> connection from 192.168.30.101 at o2ib Oct 10 08:07:12 oss-1 kernel: >>> Lustre: lfs01-OST0003: received MDS connection from >>> 192.168.30.101 at o2ib LDISKFS-fs: file extents enabled LDISKFS-fs: >>> mballoc enabled Lustre: lfs01-OST0004: new disk, initializing Oct >>> 10 08:07:14 oss-1 kernel: kjournald starting. Commit >>> intervLustre: Server lfs01-OST0004 on device /dev/md16 has >>> started al 5 seconds Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on >>> md16, external journal on md26 Oct 10 08:07:14 oss-1 kernel: >>> LDISKFS-fs: mounted filesystem with journal data mode. Oct 10 >>> 08:07:14 oss-1 kernel: kjournald starting. Commit interval 5 >>> seconds Oct 10 08:07:14 oss-1 kernel: LDISKFS FS on md16, >>> external journal on md26 Oct 10 08:07:14 oss-1 kernel: LDISKFS- >>> fs: mounted filesystem with journal data mode. 
Oct 10 08:07:14 >>> oss-1 kernel: LDISKFS-fs: file extents enabled Oct 10 08:07:14 >>> oss-1 kernel: LDISKFS-fs: mballoc enabled Oct 10 08:07:14 oss-1 >>> kernel: Lustre: lfs01-OST0004: new disk, initializing Oct 10 >>> 08:07:14 oss-1 kernel: Lustre: OST lfs01-OST0004 now serving >>> dev (lfs01-OST0004/661dcb52-7ef9-8274-45d7-4441e36410d1) with >>> recovery enabled Oct 10 08:07:14 oss-1 kernel: Lustre: Server >>> lfs01-OST0004 on device /dev/md16 has started Lustre: lfs01- >>> OST0004: received MDS connection from 192.168.30.101 at o2ib Oct 10 >>> 08:07:18 oss-1 kernel: Lustre: lfs01-OST0004: received MDS >>> connection from 192.168.30.101 at o2ib LDISKFS-fs: file extents >>> enabled LDISKFS-fs: mballoc enabled Lustre: lfs01-OST0005: new >>> disk, initializing Lustre: Server lfs01-OST0005 on device /dev/ >>> md17 has started Oct 10 08:07:19 oss-1 kernel: kjournald >>> starting. Commit interval 5 seconds Oct 10 08:07:19 oss-1 kernel: >>> LDISKFS FS on md17, external journal on md27 Oct 10 08:07:19 >>> oss-1 kernel: LDISKFS-fs: mounted filesystem with journal data >>> mode. Oct 10 08:07:19 oss-1 kernel: kjournald starting. Commit >>> interval 5 seconds Oct 10 08:07:19 oss-1 kernel: LDISKFS FS on >>> md17, external journal on md27 Oct 10 08:07:19 oss-1 kernel: >>> LDISKFS-fs: mounted filesystem with journal data mode. 
Oct 10 >>> 08:07:19 oss-1 kernel: LDISKFS-fs: file extents enabled Oct 10 >>> 08:07:20 oss-1 kernel: LDISKFS-fs: mballoc enabled Oct 10 >>> 08:07:20 oss-1 kernel: Lustre: lfs01-OST0005: new disk, >>> initializing Oct 10 08:07:20 oss-1 kernel: Lustre: OST lfs01- >>> OST0005 now serving dev (lfs01- >>> OST0005/978ba68c-0ba7-9ac7-439f-964ca7bf86a3) with recovery >>> enabled Oct 10 08:07:20 oss-1 kernel: Lustre: Server lfs01- >>> OST0005 on device /dev/md17 has started Lustre: lfs01-OST0005: >>> received MDS connection from 192.168.30.101 at o2ib Oct 10 08:07:25 >>> oss-1 kernel: Lustre: lfs01-OST0005: received MDS connection from >>> 192.168.30.101 at o2ib Oct 10 08:45:00 oss-1 faultmond: 17:Polling >>> all 48 slots for drive fault Oct 10 08:45:06 oss-1 faultmond: >>> Polling cycle 17 is complete Oct 10 09:45:06 oss-1 faultmond: >>> 18:Polling all 48 slots for drive fault Oct 10 09:45:12 oss-1 >>> faultmond: Polling cycle 18 is complete Oct 10 10:45:12 oss-1 >>> faultmond: 19:Polling all 48 slots for drive fault Oct 10 >>> 10:45:17 oss-1 faultmond: Polling cycle 19 is complete >>> LustreError: 18732:0:(lustre_fsfilt.h:312:fsfilt_setattr()) >>> lfs01- OST0001: slow setattr 85s Oct 10 10:48:14 oss-1 kernel: >>> LustreError: 18732:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>> lfs01-OST0001: slow setattr 85s Oct 10 11:45:17 oss-1 faultmond: >>> 20:Polling all 48 slots for drive fault Oct 10 11:45:25 oss-1 >>> faultmond: Polling cycle 20 is complete Oct 10 12:45:25 oss-1 >>> faultmond: 21:Polling all 48 slots for drive fault Oct 10 >>> 12:45:33 oss-1 faultmond: Polling cycle 21 is complete Lustre: >>> 18805:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0005: >>> slow setattr 33s Oct 10 13:14:46 oss-1 kernel: Lustre: 18805:0: >>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0005: slow >>> setattr 33s Lustre: 18794:0:(lustre_fsfilt.h:312:fsfilt_setattr >>> ()) lfs01- OST0000: slow setattr 43s Oct 10 13:15:03 oss-1 >>> kernel: Lustre: 18794:0:(lustre_fsfilt.h: 
312:fsfilt_setattr()) >>> lfs01-OST0000: slow setattr 43s Lustre: 18815:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01- OST0004: slow setattr 40s Oct 10 >>> 13:15:13 oss-1 kernel: Lustre: 18815:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01-OST0004: slow setattr 40s Lustre: >>> 18809:0:(filter_io_26.c:700:filter_commitrw_write()) lfs01- >>> OST0003: slow i_mutex 31s Lustre: 18753:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01- OST0003: slow i_mutex 31s Oct >>> 10 13:15:25 oss-1 kernel: Lustre: 18809:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s Oct >>> 10 13:15:25 oss-1 kernel: Lustre: 18753:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01-OST0003: slow i_mutex 31s >>> Lustre: 18768:0:(filter_io_26.c:700:filter_commitrw_write()) >>> lfs01- OST0002: slow i_mutex 34s Lustre: 18768:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) Skipped 2 previous similar messages >>> Oct 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 34s Oct >>> 10 13:15:28 oss-1 kernel: Lustre: 18768:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) Skipped 2 previous similar messages >>> Lustre: 18833:0:(filter_io_26.c:700:filter_commitrw_write()) >>> lfs01- OST0001: slow i_mutex 37s Oct 10 13:15:31 oss-1 kernel: >>> Lustre: 18833:0:(filter_io_26.c: 700:filter_commitrw_write()) >>> lfs01-OST0001: slow i_mutex 37s Lustre: 18812:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01- OST0002: slow i_mutex 40s >>> Lustre: 18844:0:(filter_io_26.c:765:filter_commitrw_write()) >>> lfs01- OST0003: slow direct_io 40s Oct 10 13:15:34 oss-1 kernel: >>> Lustre: 18812:0:(filter_io_26.c: 700:filter_commitrw_write()) >>> lfs01-OST0002: slow i_mutex 40s Oct 10 13:15:34 oss-1 kernel: >>> Lustre: 18844:0:(filter_io_26.c: 765:filter_commitrw_write()) >>> lfs01-OST0003: slow direct_io 40s Lustre: 18741:0: >>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0001: slow 
>>> setattr 41s Lustre: 18849:0:(filter_io_26.c: >>> 765:filter_commitrw_write()) lfs01- OST0001: slow direct_io 31s >>> Oct 10 13:15:35 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 41s Oct 10 >>> 13:15:35 oss-1 kernel: Lustre: 18849:0:(filter_io_26.c: >>> 765:filter_commitrw_write()) lfs01-OST0001: slow direct_io 31s >>> LustreError: 18765:0:(lustre_fsfilt.h:312:fsfilt_setattr()) >>> lfs01- OST0002: slow setattr 51s Oct 10 13:15:38 oss-1 kernel: >>> LustreError: 18765:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>> lfs01-OST0002: slow setattr 51s Lustre: 18756:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01- OST0002: slow i_mutex 45s Oct >>> 10 13:15:39 oss-1 kernel: Lustre: 18756:0:(filter_io_26.c: >>> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 45s Oct >>> 10 13:45:33 oss-1 faultmond: 22:Polling all 48 slots for drive >>> fault Oct 10 13:45:41 oss-1 faultmond: Polling cycle 22 is >>> complete Oct 10 14:45:41 oss-1 faultmond: 23:Polling all 48 slots >>> for drive fault Oct 10 14:45:49 oss-1 faultmond: Polling cycle 23 >>> is complete Lustre: 18740:0:(lustre_fsfilt.h:312:fsfilt_setattr >>> ()) lfs01- OST0000: slow setattr 38s Oct 10 15:40:41 oss-1 >>> kernel: Lustre: 18740:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>> lfs01-OST0000: slow setattr 38s LustreError: 18830:0: >>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0004: slow >>> setattr 60s Lustre: 18767:0:(lustre_fsfilt.h:312:fsfilt_setattr >>> ()) lfs01- OST0005: slow setattr 38s Oct 10 15:41:13 oss-1 >>> kernel: LustreError: 18830:0:(lustre_fsfilt.h: 312:fsfilt_setattr >>> ()) lfs01-OST0004: slow setattr 60s Oct 10 15:41:13 oss-1 kernel: >>> Lustre: 18767:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01- >>> OST0005: slow setattr 38s Lustre: 18796:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01- OST0001: slow setattr 44s Oct 10 >>> 15:41:20 oss-1 kernel: Lustre: 18796:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) 
lfs01-OST0001: slow setattr 44s >>> LustreError: 18831:0:(lustre_fsfilt.h:312:fsfilt_setattr()) >>> lfs01- OST0002: slow setattr 62s Oct 10 15:41:21 oss-1 kernel: >>> LustreError: 18831:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>> lfs01-OST0002: slow setattr 62s Oct 10 15:45:49 oss-1 faultmond: >>> 24:Polling all 48 slots for drive fault Oct 10 15:45:58 oss-1 >>> faultmond: Polling cycle 24 is complete Oct 10 16:45:58 oss-1 >>> faultmond: 25:Polling all 48 slots for drive fault Oct 10 >>> 16:46:06 oss-1 faultmond: Polling cycle 25 is complete Oct 10 >>> 17:46:06 oss-1 faultmond: 26:Polling all 48 slots for drive fault >>> Oct 10 17:46:15 oss-1 faultmond: Polling cycle 26 is complete >>> Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >>> OST0000: slow setattr 41s Lustre: 18726:0:(service.c: >>> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >>> req at 00000101e8f1de00 x15789/t0 o13-><?>@<? >>>> >>>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ >>> Slow req_in handling 7s req at 00000101e8f1da00 x15790/t0 o13-><?>@<? 
>>>> >>>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) >>> Skipped 3 previous similar messages Lustre: 18764:0: >>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0004: slow >>> setattr 40s Oct 10 18:06:33 oss-1 kernel: Lustre: 18741:0: >>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0000: slow >>> setattr 41s Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0: >>> (service.c: 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in >>> handling 7s req at 00000101e8f1de00 x15789/t0 o13-><?>@<?>:0/0 lens >>> 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Oct 10 18:06:33 >>> oss-1 kernel: Lustre: 18726:0:(service.c: >>> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >>> req at 00000101e8f1da00 x15790/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to >>> 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Lustre: 18845:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01- OST0002: slow setattr 44s Lustre: >>> 18579:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ Slow >>> req_in handling 14s req at 00000103f8dabe00 x7271650/t0 o103-><? 
>>>> >>>> @<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>> Oct 10 18:06:54 oss-1 kernel: Lustre: 18726:0:(service.c: >>> 918:ptlrpc_server_handle_req_in()) Skipped 3 previous similar >>> messages Oct 10 18:06:54 oss-1 kernel: Lustre: 18764:0: >>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0004: slow >>> setattr 40s Oct 10 18:06:54 oss-1 kernel: Lustre: 18845:0: >>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0002: slow >>> setattr 44s Oct 10 18:06:54 oss-1 kernel: Lustre: 18579:0: >>> (service.c: 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in >>> handling 14s req at 00000103f8dabe00 x7271650/t0 o103-><?>@<?>:0/0 >>> lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Lustre: 18766:0: >>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0005: slow >>> setattr 32s Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr >>> ()) Skipped 1 previous similar message Oct 10 18:06:59 oss-1 >>> kernel: Lustre: 18766:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>> lfs01-OST0005: slow setattr 32s Oct 10 18:06:59 oss-1 kernel: >>> Lustre: 18766:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) Skipped 1 >>> previous similar message Lustre: 18826:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01- OST0003: slow setattr 45s Oct 10 >>> 18:07:04 oss-1 kernel: Lustre: 18826:0:(lustre_fsfilt.h: >>> 312:fsfilt_setattr()) lfs01-OST0003: slow setattr 45s Oct 10 >>> 18:46:15 oss-1 faultmond: 27:Polling all 48 slots for drive fault >>> ----------- [cut here ] --------- [please bite here ] --------- >>> Kernel BUG at spinlock:76 invalid operand: 0000 [1] SMP CPU 2 >>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) >>> lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) >>> obdclass(U) lvfs(U) ldiskfs(U) lnet(U) libcfs(U) raid5(U) xor(U) >>> parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) >>> ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) rdma_ucm >>> (U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib >>> 
(U) md5(U) ipv6(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) >>> mlx4_core (U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) >>> dm_multipath (U) dm_mod(U) button(U) battery(U) ac(U) joydev(U) >>> ohci_hcd(U) ehci_hcd(U) hw_random(U) edac_mc(U) ib_mthca(U) >>> ib_umad(U) ib_ucm (U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) >>> ib_core(U) e1000(U) ext3(U) jbd(U) raid1(U) mv_sata(U) sd_mod(U) >>> scsi_mod(U) _______________________________________________ >>> Lustre-discuss mailing list Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> _______________________________________________ Lustre-discuss >> mailing list Lustre-discuss at lists.lustre.org http:// >> lists.lustre.org/mailman/listinfo/lustre-discuss > > -- > <6g_top.gif> > Malcolm Cowe > Solutions Integration Engineer > > Sun Microsystems, Inc. > Blackness Road > Linlithgow, West Lothian EH49 7LR UK > Phone: x73602 / +44 1506 673 602 > Email: Malcolm.Cowe at Sun.COM
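[Editor's note] The faultmond entries interleaved with the Lustre warnings in the log above come from the daemon shipped in Sun's dcmu package, which polls the X4500 drive slots once an hour. A minimal sketch of taking it out of the picture on RHEL 4, assuming the SysV init script is installed under the name faultmond (as the log's tag suggests); run as root on the OSS:

```shell
# Stop the daemon that is currently running
# (service name "faultmond" is assumed from the syslog tag).
service faultmond stop

# Prevent it from starting again at boot, in all runlevels.
chkconfig faultmond off

# Confirm it is now disabled everywhere.
chkconfig --list faultmond
```

This only disables the service; the dcmu RPM itself stays installed, so the other tools it provides remain available.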
Malcolm Cowe
2008-Oct-14 09:07 UTC
[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues
Well, guess what: got a kernel panic overnight. Ugh. I've attached the output to this message... Faultmond was still running last night, so I'll make sure to disable it before trying again.

Regards,

Malcolm.

Brock Palen wrote:
> I never uninstalled it (I still use some of the tools in it).
> Faultmond is a service, just chkconfig it off.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
> On Oct 13, 2008, at 11:03 AM, Malcolm Cowe wrote:
>
>> Brock Palen wrote:
>>
>>> I know you say the only addition was the RDAC for the MDS's, I
>>> assume (we use it also just fine).
>>>
>> Yes, the MDS's share a STK 6140.
>>
>>> When I ran faultmond from Sun's dcmu rpm (RHEL 4 here), the x4500's
>>> would crash like clockwork, ~48 hours. For a very simple bit of
>>> code I was surprised that, once when I forgot to turn it on when
>>> working on the load, this would happen. Just FYI, it was unrelated
>>> to Lustre (using provided rpm's, no kernel build); this solved my
>>> problem on the x4500.
>>>
>> The DCMU RPM is installed. I didn't explicitly install this, so it
>> must have been bundled in with the SIA CD... I'll try removing the
>> rpm to see what happens. Thanks for the heads up.
>>
>> Regards,
>>
>> Malcolm.
>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>> On Oct 13, 2008, at 4:41 AM, Malcolm Cowe wrote:
>>>
>>>> The X4200m2 MDS systems and the X4500 OSS were rebuilt using the
>>>> stock Lustre packages (kernel + modules + userspace). With the
>>>> exception of the RDAC kernel module, no additional software was
>>>> applied to the systems. We recreated our volumes and ran the
>>>> servers over the weekend. However, the OSS crashed about 8 hours
>>>> in. The syslog output is attached to this message. Looks like it
>>>> could be similar to bug #16404, which means patching and
>>>> rebuilding the kernel.
>>>> Given my lack of success at trying to build from source, I am
>>>> again asking for some guidance on how to do this. I sent out the
>>>> steps I used to try and build from source on the 7th because I
>>>> was encountering problems and was unable to get a working set of
>>>> packages. Included in that message was output from quilt that
>>>> implies that the kernel patching process was not working properly.
>>>>
>>>> Regards,
>>>>
>>>> Malcolm.
>>>>
>>>> --
>>>> Malcolm Cowe
>>>> Solutions Integration Engineer
>>>>
>>>> Sun Microsystems, Inc.
>>>> Blackness Road
>>>> Linlithgow, West Lothian EH49 7LR UK
>>>> Phone: x73602 / +44 1506 673 602
>>>> Email: Malcolm.Cowe at Sun.COM
(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0001: slow >>>> setattr 41s Lustre: 18849:0:(filter_io_26.c: >>>> 765:filter_commitrw_write()) lfs01- OST0001: slow direct_io 31s >>>> Oct 10 13:15:35 oss-1 kernel: Lustre: 18741:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 41s Oct 10 >>>> 13:15:35 oss-1 kernel: Lustre: 18849:0:(filter_io_26.c: >>>> 765:filter_commitrw_write()) lfs01-OST0001: slow direct_io 31s >>>> LustreError: 18765:0:(lustre_fsfilt.h:312:fsfilt_setattr()) >>>> lfs01- OST0002: slow setattr 51s Oct 10 13:15:38 oss-1 kernel: >>>> LustreError: 18765:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>>> lfs01-OST0002: slow setattr 51s Lustre: 18756:0:(filter_io_26.c: >>>> 700:filter_commitrw_write()) lfs01- OST0002: slow i_mutex 45s Oct >>>> 10 13:15:39 oss-1 kernel: Lustre: 18756:0:(filter_io_26.c: >>>> 700:filter_commitrw_write()) lfs01-OST0002: slow i_mutex 45s Oct >>>> 10 13:45:33 oss-1 faultmond: 22:Polling all 48 slots for drive >>>> fault Oct 10 13:45:41 oss-1 faultmond: Polling cycle 22 is >>>> complete Oct 10 14:45:41 oss-1 faultmond: 23:Polling all 48 slots >>>> for drive fault Oct 10 14:45:49 oss-1 faultmond: Polling cycle 23 >>>> is complete Lustre: 18740:0:(lustre_fsfilt.h:312:fsfilt_setattr >>>> ()) lfs01- OST0000: slow setattr 38s Oct 10 15:40:41 oss-1 >>>> kernel: Lustre: 18740:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>>> lfs01-OST0000: slow setattr 38s LustreError: 18830:0: >>>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0004: slow >>>> setattr 60s Lustre: 18767:0:(lustre_fsfilt.h:312:fsfilt_setattr >>>> ()) lfs01- OST0005: slow setattr 38s Oct 10 15:41:13 oss-1 >>>> kernel: LustreError: 18830:0:(lustre_fsfilt.h: 312:fsfilt_setattr >>>> ()) lfs01-OST0004: slow setattr 60s Oct 10 15:41:13 oss-1 kernel: >>>> Lustre: 18767:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01- >>>> OST0005: slow setattr 38s Lustre: 18796:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01- OST0001: slow setattr 44s Oct 10 >>>> 
15:41:20 oss-1 kernel: Lustre: 18796:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01-OST0001: slow setattr 44s >>>> LustreError: 18831:0:(lustre_fsfilt.h:312:fsfilt_setattr()) >>>> lfs01- OST0002: slow setattr 62s Oct 10 15:41:21 oss-1 kernel: >>>> LustreError: 18831:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>>> lfs01-OST0002: slow setattr 62s Oct 10 15:45:49 oss-1 faultmond: >>>> 24:Polling all 48 slots for drive fault Oct 10 15:45:58 oss-1 >>>> faultmond: Polling cycle 24 is complete Oct 10 16:45:58 oss-1 >>>> faultmond: 25:Polling all 48 slots for drive fault Oct 10 >>>> 16:46:06 oss-1 faultmond: Polling cycle 25 is complete Oct 10 >>>> 17:46:06 oss-1 faultmond: 26:Polling all 48 slots for drive fault >>>> Oct 10 17:46:15 oss-1 faultmond: Polling cycle 26 is complete >>>> Lustre: 18741:0:(lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- >>>> OST0000: slow setattr 41s Lustre: 18726:0:(service.c: >>>> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >>>> req at 00000101e8f1de00 x15789/t0 o13-><?>@<? >>>> >>>>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>>>> >>>> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ >>>> Slow req_in handling 7s req at 00000101e8f1da00 x15790/t0 o13-><?>@<? 
>>>> >>>>> :0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>>>> >>>> Lustre: 18726:0:(service.c:918:ptlrpc_server_handle_req_in()) >>>> Skipped 3 previous similar messages Lustre: 18764:0: >>>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0004: slow >>>> setattr 40s Oct 10 18:06:33 oss-1 kernel: Lustre: 18741:0: >>>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0000: slow >>>> setattr 41s Oct 10 18:06:33 oss-1 kernel: Lustre: 18726:0: >>>> (service.c: 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in >>>> handling 7s req at 00000101e8f1de00 x15789/t0 o13-><?>@<?>:0/0 lens >>>> 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Oct 10 18:06:33 >>>> oss-1 kernel: Lustre: 18726:0:(service.c: >>>> 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 7s >>>> req at 00000101e8f1da00 x15790/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to >>>> 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Lustre: 18845:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01- OST0002: slow setattr 44s Lustre: >>>> 18579:0:(service.c:918:ptlrpc_server_handle_req_in()) @@@ Slow >>>> req_in handling 14s req at 00000103f8dabe00 x7271650/t0 o103-><? 
>>>> >>>>> @<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >>>>> >>>> Oct 10 18:06:54 oss-1 kernel: Lustre: 18726:0:(service.c: >>>> 918:ptlrpc_server_handle_req_in()) Skipped 3 previous similar >>>> messages Oct 10 18:06:54 oss-1 kernel: Lustre: 18764:0: >>>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0004: slow >>>> setattr 40s Oct 10 18:06:54 oss-1 kernel: Lustre: 18845:0: >>>> (lustre_fsfilt.h: 312:fsfilt_setattr()) lfs01-OST0002: slow >>>> setattr 44s Oct 10 18:06:54 oss-1 kernel: Lustre: 18579:0: >>>> (service.c: 918:ptlrpc_server_handle_req_in()) @@@ Slow req_in >>>> handling 14s req at 00000103f8dabe00 x7271650/t0 o103-><?>@<?>:0/0 >>>> lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 Lustre: 18766:0: >>>> (lustre_fsfilt.h:312:fsfilt_setattr()) lfs01- OST0005: slow >>>> setattr 32s Lustre: 18766:0:(lustre_fsfilt.h:312:fsfilt_setattr >>>> ()) Skipped 1 previous similar message Oct 10 18:06:59 oss-1 >>>> kernel: Lustre: 18766:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) >>>> lfs01-OST0005: slow setattr 32s Oct 10 18:06:59 oss-1 kernel: >>>> Lustre: 18766:0:(lustre_fsfilt.h: 312:fsfilt_setattr()) Skipped 1 >>>> previous similar message Lustre: 18826:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01- OST0003: slow setattr 45s Oct 10 >>>> 18:07:04 oss-1 kernel: Lustre: 18826:0:(lustre_fsfilt.h: >>>> 312:fsfilt_setattr()) lfs01-OST0003: slow setattr 45s Oct 10 >>>> 18:46:15 oss-1 faultmond: 27:Polling all 48 slots for drive fault >>>> ----------- [cut here ] --------- [please bite here ] --------- >>>> Kernel BUG at spinlock:76 invalid operand: 0000 [1] SMP CPU 2 >>>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) >>>> lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) >>>> obdclass(U) lvfs(U) ldiskfs(U) lnet(U) libcfs(U) raid5(U) xor(U) >>>> parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) >>>> ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) rdma_ucm >>>> (U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) 
iw_cm(U) ib_addr(U) ib_ipoib(U) md5(U) ipv6(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U) mlx4_ib(U) mlx4_core(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_multipath(U) dm_mod(U) button(U) battery(U) ac(U) joydev(U) ohci_hcd(U) ehci_hcd(U) hw_random(U) edac_mc(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) e1000(U) ext3(U) jbd(U) raid1(U) mv_sata(U) sd_mod(U) scsi_mod(U)

-- 
Malcolm Cowe
Solutions Integration Engineer

Sun Microsystems, Inc.
Blackness Road
Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: Malcolm.Cowe at Sun.COM

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: x4500-oss-kernel-panic.txt
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081014/495299c3/attachment-0001.txt