Matthew Baker
2007-Dec-07 14:00 UTC
[Xen-users] Fatal Trap 18 (convincing hardware engineer)
Hi all,

I have 2 servers with identical hardware (lspci at the bottom of this email).

An extra Intel PRO/1000 MT Dual Port Server Adapter[1] has been connected into the second slot on a pci-x capable riser (the first slot taken by the SAS Raid controller).

When this nic *is* connected *and* the boxes boot a Xen kernel (debian 4.0 2.6.18-5-xen and using Xen HyperVisor(PAE) 3.0.3-0-4) after about 2 days I get this error on the console:

(XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
(XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
(XEN) CPU: 1
(XEN) EIP: e008:[<ff1193be>]CPU: 3
(XEN) EIP: e008:[<ff1193be>] idle_loop+0x4e/0x60 idle_loop+0x4e/0x60
(XEN) EFLAGS: 00000246 CONTEXT: hypervisor
(XEN) eax: 00000000 ebx: ffbeffb4 ecx: 00000001 edx: 00000000
(XEN) esi: ffbeffb4 edi: ffbf6080 ebp: 000090dc esp: ffbeffa8
(XEN) cr0: 8005003b cr4: 000006f0 cr3: a3363000 cr2: b7f2c260
(XEN)
(XEN) EFLAGS: 00000246 CONTEXT: hypervisor
(XEN) ds: e010 es: e010 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) eax: 00000000 ebx: ffbe3fb4 ecx: 096a03ba edx: ff18c080
(XEN) Xen stack trace from esp=ffbeffa8:
(XEN) esi: ffbf0080 edi: 07a0403a ebp: 000090dc esp: ffbe3fa8
(XEN) 00000001cr0: 8005003b cr4: 000006f0 cr3: a1b80000 cr2: b7edd260
(XEN) 00000001 00001000 00000001 00000000 00000000 00000001 00000001ds: e010 8(XEN)
(XEN) 00000000Xen stack trace from esp=ffbe3fa8:
(XEN) 00000000 00000001 00f90000 00000003 c01013a7 ffbf0080 00000061 00000001(XEN) 0000007b 0000007b 00000000 00000000 00000001 ffbf6080 00000003
(XEN) Xen call trace:
(XEN) [<ff1193be>]
(XEN) idle_loop+0x4e/0x60
(XEN) 00000000
(XEN) ************************************
(XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000.
(XEN) System shutting down -- need manual reset.
(XEN) ************************************

The machine obviously hangs.

If I remove the PCI NIC the machine stays up. If I boot into a vanilla kernel with the NIC in the box it stays up.

I have NICs like these bought in batch running in other machines that are also running Xen. The machines aren't really used a great deal (at the moment, although they need to be soon) and as far as I can tell there's no other issue with respect to the system that is failing (i.e. the obvious stuff like disk space running out or exhaustive cronjobs). There are no logs other than the one to the console suggesting a failure elsewhere.

Our hardware engineer is convinced it's either a Xen or driver issue.

I've seen the thread at
http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
and have directed the engineer at this.

My questions to the list are:

1. Can this be caused by anything else (other than hardware)?
2. Is there anything I can do to debug this further to confirm what part of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?

Any help on this would be greatly appreciated.

Many thanks,

Matt

--
Matthew Baker, UNIX Systems Administrator
----------------------------------------------------
Institute for Learning and Research Technology (ILRT)
A: University of Bristol, 8-10 Berkeley Square, Bristol. BS8 1HH
W: http://www.ilrt.bristol.ac.uk
E: matt.baker@bris.ac.uk
T: +44 (0)117 928 7121

-- lspci

00:00.0 Host bridge: Intel Corporation E7320 Memory Controller Hub (rev 0c)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev)
00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 (re)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Cont)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 0)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev )
01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev )
02:02.0 PCI bridge: Intel Corporation 80331 [Lindsay] I/O processor (PCI-X Brid)
03:0e.0 RAID bus controller: Adaptec AAC-RAID (rev 0a)
06:01.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Cont)
06:02.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Cont)
07:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
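When collecting evidence like this over several crashes, it can help to pull the fatal-trap banner out of a saved console capture automatically. The following is only a minimal sketch in Python: the function name `find_fatal_traps` and the regular expression are illustrative assumptions modelled on the single banner line shown in the console output above, not on any Xen tool or documented log format.

```python
import re

# Pattern modelled on the one banner line in the console capture above
# (an assumption: other Xen versions may print it differently).
TRAP_RE = re.compile(
    r"CPU(?P<cpu>\d+) FATAL TRAP (?P<trap>\d+) \((?P<name>[^)]*)\), "
    r"ERROR_CODE (?P<code>[0-9a-fA-F]+)"
)


def find_fatal_traps(console_text):
    """Return one dict per fatal-trap banner found in a console capture."""
    return [
        {
            "cpu": int(m.group("cpu")),
            "trap": int(m.group("trap")),
            "name": m.group("name"),
            "error_code": int(m.group("code"), 16),
        }
        for m in TRAP_RE.finditer(console_text)
    ]


if __name__ == "__main__":
    # Banner copied from the message above, including the stray stack word
    # that the interleaved output fused onto the front of it.
    sample = "(XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000."
    for trap in find_fatal_traps(sample):
        print(trap)
```

Because the pattern is searched for anywhere in the text rather than anchored to the start of a line, it still matches when output from two CPUs is interleaved, as it is in the capture above.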
Robbie Dinn
2007-Dec-11 10:19 UTC
Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)
No one else seems to have taken the bite, so I will even though I may not be best qualified to do so.

Matthew Baker wrote:
> Hi all,
>
> I have 2 servers with identical hardware (lspci at the bottom of this
> email).

Two identical servers is good. But I wasn't clear from your description whether they behaved the same.

Assuming they behave differently then that might mean you have one substandard component in one of the machines. Record all the serial numbers of the components, or label them yourself, then begin swapping them between the machines. If you can get the fault to move from one machine to the other, you can maybe pin it on one component.

Your hardware guy may have already tried the above. If you have two machines and they both show the fault, that's more tricky.

> An Extra Intel PRO/1000 MT Dual Port Server Adapter[1] has been
> connected into the second slot on a pci-x capable riser (the first slot
> taken by the SAS Raid controller).
>
> When this nic *is* connected *and* the boxes boot a Xen kernel (debian
> 4.0 2.6.18-5-xen and using Xen HyperVisor(PAE) 3.0.3-0-4) after about 2
> days I get this error on the console:
>
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) CPU: 1
> (XEN) EIP: e008:[<ff1193be>]CPU: 3
> (XEN) EIP: e008:[<ff1193be>] idle_loop+0x4e/0x60 idle_loop+0x4e/0x60
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) eax: 00000000 ebx: ffbeffb4 ecx: 00000001 edx: 00000000
> (XEN) esi: ffbeffb4 edi: ffbf6080 ebp: 000090dc esp: ffbeffa8
> (XEN) cr0: 8005003b cr4: 000006f0 cr3: a3363000 cr2: b7f2c260
> (XEN)
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) ds: e010 es: e010 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) eax: 00000000 ebx: ffbe3fb4 ecx: 096a03ba edx: ff18c080
> (XEN) Xen stack trace from esp=ffbeffa8:
> (XEN) esi: ffbf0080 edi: 07a0403a ebp: 000090dc esp: ffbe3fa8
> (XEN) 00000001cr0: 8005003b cr4: 000006f0 cr3: a1b80000 cr2: b7edd260
> (XEN) 00000001 00001000 00000001 00000000 00000000 00000001
> 00000001ds: e010 8(XEN)
> (XEN) 00000000Xen stack trace from esp=ffbe3fa8:
> (XEN) 00000000 00000001 00f90000 00000003 c01013a7 ffbf0080
> 00000061 00000001(XEN) 0000007b 0000007b 00000000 00000000 00000001
> ffbf6080 00000003
> (XEN) Xen call trace:
> (XEN) [<ff1193be>]
> (XEN) idle_loop+0x4e/0x60
> (XEN) 00000000
> (XEN) ************************************
> (XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000.

I had a brief look to see if I could find the place in the source code where this is being printed out, but I drew a blank. I must not be looking at the right version of the source tree. File .xen/arch/x86/traps.c looks like a good candidate.

> (XEN) System shutting down -- need manual reset.
> (XEN) ************************************
>
> The machine obviously hangs.
>
> If I remove the PCI NIC the machine stays up. If I boot into a vanilla
> kernel with the NIC in the box it stays up.
>
> I have NICs like these bought in batch running in other machines that
> are also running Xen. The machines aren't really used a great deal (at
> the moment although need to be soon) and as far as i can tell there's no
> other issue with respect to the system that is failing, i.e the obvious
> stuff like disk space running out or exhaustive cronjobs). There are no
> logs other than the one to the console suggesting a failure elsewhere.
>
> Our hardware engineer is convinced it's either a Xen or driver issue.

I can see why he might think so or want to say so.

> I've seen the thread at
> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
> and have directed the engineer at this.
>
> My questions to the list are:
>
> 1. Can this be caused by anything else (other than hardware)?
> 2. Is there anything I can do to debug this further to confirm what part
> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?

Grasping at straws, could you try running a memory test program, e.g. memtest86?

Is this a server class machine with ECC memory? If so, is it possible to get the linux kernel to report any soft memory errors that get corrected via the ECC hardware?

Is there anything in linux/Documentation/drivers/edac/edac.txt that might help? (I have not used this myself.) There may be non-fatal errors that are happening before the fatal one. That might give you or your hardware engineer a clue as to where else to look.

How about building a linux kernel with some form of debugging turned on? This might help you to see if something is scribbling on memory when it shouldn't be.

I don't really know the answer, but good link anyway.

> Any help on this would be greatly appreciated.
>
> Many thanks,
>
> Matt
Matthew Baker
2007-Dec-12 11:21 UTC
Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)
Robbie Dinn wrote:
> No one else seems to have taken the bite, so I will even
> though I may not be best qualified to do so.

and I thank you for that. ;-)

> Matthew Baker wrote:
>> Hi all,
>>
>> I have 2 servers with identical hardware (lspci at the bottom of this
>> email).
> Two identical servers is good. But I wasn't clear from your description
> whether they behaved the same.

Well, both servers have exhibited "problems", but I've only been able to capture the panic on one machine. So my assumption that it is the same cause may be wrong.

> Assuming they behave differently then that might mean you have one
> substandard component in one of the machines. Record all the serial
> numbers of the components, or label them yourself, then begin
> swapping them between the machines. If you can get the fault to
> move from one machine to the other, you can maybe pin it on one component.

I'm going to be able to get both these boxes out of a rack into a place where I can do some better diagnosis from this angle. I'm beginning to believe it may be related to one box more than the other.

> Your hardware guy may have already tried the above. If you have two
> machines and they both show the fault, that's more tricky.

Fingers crossed.

>> Our hardware engineer is convinced it's either a Xen or driver issue.
>
> I can see why he might think so or want to say so.

Yes, as can I.

>> I've seen the thread at
>> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
>> and have directed the engineer at this.
>>
>> My questions to the list are:
>>
>> 1. Can this be caused by anything else (other than hardware)?
>> 2. Is there anything I can do to debug this further to confirm what part
>> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?
>
> grasping at straws, could you try running a memory test program, eg memtest86.

Yes, we've run some diagnostics on one of the boxes and all seems well. However, we still need to compare them.

> Is this a server class machine with ECC memory? If so, is it possible
> to get the linux kernel to report any soft memory errors that get corrected
> via the ECC hardware?
>
> Is there anything in linux/Documentation/drivers/edac/edac.txt
> that might help? (I have not used this myself). There may be
> non fatal errors that are happening before the fatal one.
> That might give you or your hardware engineer a clue as to
> where else to look.

Ah, this looks good. The edac modules were loaded already (by udev I presume). I've enabled the logging features via /sys. Thanks for the tip.

> How about building a linux kernel with some form of debugging
> turned on? This might help you to see if something is
> scribbling on memory when it shouldn't be.

Yes, we've thought about enabling the gdb-stub as described in
http://wiki.xensource.com/xenwiki/XenPPC/Debug/XenGDBStub
I'm presuming this will work for architectures other than ppc. I see this as a last resort, as kernel debugging can be quite time consuming!

Thanks for your help; it has given me some ideas on how to approach this.

Matt

--
Matthew Baker, UNIX Systems Administrator
----------------------------------------------------
Institute for Learning and Research Technology (ILRT)
A: University of Bristol, 8-10 Berkeley Square, Bristol. BS8 1HH
W: http://www.ilrt.bristol.ac.uk
E: matt.baker@bris.ac.uk
T: +44 (0)117 928 7121
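The EDAC idea discussed above (watching for corrected memory errors that precede the fatal machine check) could be approached roughly as follows. This is a hedged sketch, not something taken from the thread: it assumes the EDAC driver exposes per-controller `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files under `/sys/devices/system/edac/mc/mcN/`, which may differ by kernel version, and the helper names `read_edac_counts` and `rising_counters` are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the EDAC memory-controller entries;
# check your own kernel's EDAC documentation before relying on it.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mc* directory."""
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not loaded, or sysfs laid out differently
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


def rising_counters(before, after):
    """Return the per-controller increase for every counter that went up."""
    rising = {}
    for mc, now in after.items():
        prev = before.get(mc, {})
        for name, value in now.items():
            old = prev.get(name)
            if value is not None and old is not None and value > old:
                rising.setdefault(mc, {})[name] = value - old
    return rising


if __name__ == "__main__":
    # Print a single snapshot; run periodically (e.g. from cron) and compare
    # snapshots with rising_counters() to spot errors before a crash.
    print(read_edac_counts())
```

Taking a snapshot on each of the two servers at regular intervals would give the kind of side-by-side comparison mentioned above, with a rising `ce_count` on only one box pointing towards that box's memory.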