Today I put an x4540 into production alongside our two x4500s. The machine was tested beforehand, but it is now panicking every few hours. The traces are different every time: the first was in i8042, the next was in socklnd, and the most recent was in the bonding code (we bond all 4 interfaces into one). Other than i8042, it looks like the problem may be somewhere in networking.

I do not know how to capture the stack trace; I only ever get one page of it on the console. Is there documentation on how to capture this? Also, is anyone else seeing panics with x4540s?

Server: RHEL4 2.6.9-67.0.22.EL_lustre.1.6.6smp
Clients: RHEL4 patchless 1.6.6

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Wed, 2009-06-10 at 19:06 -0400, Brock Palen wrote:
> I do not know how to capture the stack trace, I only have one page
> ever at a time, is there documentation on how to capture this?

You want a serial console, or lacking that, a netconsole.

b.
I thought I had netconsole set up:

modprobe netconsole netconsole=6666@10.164.3.156/bond0,514@141.212.30.35/00:10:DC:FE:70:59

But nothing got logged after:

Jun 11 09:32:34 oss3 kernel: [...network console startup...]

I attached a screenshot of the dump from the console, though it is incomplete. Should I just update the server to 1.6.7.2? It is just strange that the two x4500s had no issues, but the x4540 does.

Attachment: oss3crash.jpg
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090611/1f46401d/attachment-0001.jpg

Brock Palen

On Jun 10, 2009, at 8:13 PM, Brian J. Murrell wrote:
> You want a serial console, or lacking that, a netconsole.
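For reference, the netconsole module parameter has the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. Below is a minimal sketch that assembles that string from its parts and prints the resulting modprobe command without running it. The helper name netconsole_param is hypothetical, and the values are simply the ones from the message above.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole= parameter string.
# Format: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
    # $1 src port, $2 src IP, $3 device, $4 target port, $5 target IP, $6 target MAC
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Print (do not execute) the command; the printed line would be run as root.
echo "modprobe netconsole netconsole=$(netconsole_param 6666 10.164.3.156 bond0 514 141.212.30.35 00:10:DC:FE:70:59)"
```

The target MAC should be that of the next hop toward the log host (the gateway, if the receiver is on another subnet). Kernel messages only reach netconsole if they pass the console log level, so raising it with `dmesg -n 8` before testing may help. Whether netconsole works over a bonded device on this kernel is not established in this thread; pointing it at a physical interface is one thing to try.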
Hi!

On Thu, Jun 11, 2009 at 10:08:30AM -0400, Brock Palen wrote:
> I attached a screen shot of the dump from console, its incomplete
> though.
> Should I just update the server to 1.6.7.2 ? Just strange that the
> two x4500 had no issues, but the x4540 does

The X4500 comes with Intel NICs, while the X4540 is equipped with NVidia chips. The Linux driver for the latter (forcedeth) reportedly isn't well tuned. Raising the max_interrupt_work parameter of the forcedeth module to 100 cured stability problems for us. That's on a RHEL5-based system, though. Not sure how the driver in RHEL4 fares, but it might still be worth a try.

Regards,
Daniel.
We have been running Lustre servers on hardware with an Nvidia chipset (nVidia Corporation MCP55 Ethernet (rev a3)) for well over a year now. The following two options seem to work best on these servers:

options forcedeth max_interrupt_work=50 optimization_mode=1

Setting optimization_mode=1 enables interrupt coalescing.

Nirmal
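One way to put that options line in place without duplicating it is sketched below. The function name add_forcedeth_options is hypothetical, and the config file path is passed in by the caller; this is an illustration under those assumptions, not a tested procedure for these servers.

```shell
#!/bin/sh
# Hypothetical helper: append the forcedeth options line to a modprobe
# config file unless an "options forcedeth" line already exists there.
add_forcedeth_options() {
    conf_file="$1"
    line='options forcedeth max_interrupt_work=50 optimization_mode=1'
    if grep -q '^options forcedeth' "$conf_file" 2>/dev/null; then
        echo "forcedeth options already set in $conf_file, leaving it unchanged"
    else
        echo "$line" >> "$conf_file"
        echo "added: $line"
    fi
}
```

As root this would be called as `add_forcedeth_options /etc/modprobe.conf` (that path is an assumption for RHEL4/5-era systems). The options only take effect the next time the module is loaded, so a reboot or a driver reload from the local console is needed. `modinfo -p forcedeth` lists the parameters the installed driver accepts, which is worth checking first since driver versions differ.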
On Jun 15, 2009, at 11:44 AM, Nirmal Seenu wrote:
> We have been running the Lustre servers on a machine with Nvidia
> chipset(nVidia Corporation MCP55 Ethernet (rev a3)) for well over a
> year now, the following two options seems to work the best on these
> servers:
>
> options forcedeth max_interrupt_work=50 optimization_mode=1

Thanks, we put those in place and disabled bonding for now (running on one overtaxed gig-e port). We also tried noapic because of some notes online about the crashes we were seeing, but with that option the MPT disk controllers in the machine do not start up (it sets all drives offline).

Thanks for the note,
Brock

> optimization_mode enables Interrupt coalescing.
>
> Nirmal