thr3ads.net - Xen users - [Xen-users] Fatal Trap 18 (convincing hardware engineer) [Dec 2007]

Matthew Baker

2007-Dec-07 14:00 UTC

[Xen-users] Fatal Trap 18 (convincing hardware engineer)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I have 2 servers with identical hardware (lspci at the bottom of this
email).

An Extra Intel PRO/1000 MT Dual Port Server Adapter[1] has been
connected into the second slot on a pci-x capable riser (the first slot
taken by the SAS Raid controller).

When this nic *is* connected *and* the boxes boot a Xen kernel (debian
4.0 2.6.18-5-xen and using Xen HyperVisor(PAE) 3.0.3-0-4) after about 2
days I get this error on the console:

(XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
(XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
(XEN) CPU: 1
(XEN) EIP: e008:[<ff1193be>]CPU: 3
(XEN) EIP: e008:[<ff1193be>] idle_loop+0x4e/0x60 idle_loop+0x4e/0x60
(XEN) EFLAGS: 00000246 CONTEXT: hypervisor
(XEN) eax: 00000000 ebx: ffbeffb4 ecx: 00000001 edx: 00000000
(XEN) esi: ffbeffb4 edi: ffbf6080 ebp: 000090dc esp: ffbeffa8
(XEN) cr0: 8005003b cr4: 000006f0 cr3: a3363000 cr2: b7f2c260
(XEN)
(XEN) EFLAGS: 00000246 CONTEXT: hypervisor
(XEN) ds: e010 es: e010 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) eax: 00000000 ebx: ffbe3fb4 ecx: 096a03ba edx: ff18c080
(XEN) Xen stack trace from esp=ffbeffa8:
(XEN) esi: ffbf0080 edi: 07a0403a ebp: 000090dc esp: ffbe3fa8
(XEN) 00000001cr0: 8005003b cr4: 000006f0 cr3: a1b80000 cr2: b7edd260
(XEN) 00000001 00001000 00000001 00000000 00000000 00000001
00000001ds: e010 8(XEN)
(XEN) 00000000Xen stack trace from esp=ffbe3fa8:
(XEN) 00000000 00000001 00f90000 00000003 c01013a7 ffbf0080
00000061 00000001(XEN) 0000007b 0000007b 00000000 00000000 00000001
ffbf6080 00000003
(XEN) Xen call trace:
(XEN) [<ff1193be>]
(XEN) idle_loop+0x4e/0x60
(XEN) 00000000
(XEN) ************************************
(XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000.
(XEN) System shutting down -- need manual reset.
(XEN) ************************************

The machine obviously hangs.
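
For anyone capturing this output over a serial console, a minimal sketch for pulling the trap summary out of a capture might look like the following. It assumes the summary line keeps the shape shown above and that ERROR_CODE is printed in hexadecimal; `TRAP_RE` and `find_fatal_traps` are illustrative names, not part of Xen. On x86, vector 18 is the machine-check exception (#MC), which matches the "(machine check)" label in the log.

```python
import re

# Matches the summary line printed for a fatal trap, e.g.
# "(XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000."
# Assumption: ERROR_CODE is a hexadecimal field.
TRAP_RE = re.compile(
    r"CPU(?P<cpu>\d+) FATAL TRAP (?P<vector>\d+) \((?P<name>[^)]*)\), "
    r"ERROR_CODE (?P<code>[0-9a-fA-F]+)"
)


def find_fatal_traps(console_text):
    """Return one dict per fatal-trap summary line found in the capture."""
    traps = []
    for match in TRAP_RE.finditer(console_text):
        traps.append({
            "cpu": int(match.group("cpu")),
            "vector": int(match.group("vector")),
            "name": match.group("name"),
            "error_code": int(match.group("code"), 16),
        })
    return traps


if __name__ == "__main__":
    sample = "(XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000."
    for trap in find_fatal_traps(sample):
        print(trap)
```

Because the two CPUs interleave their register dumps, the pattern searches anywhere in the text rather than anchoring at the start of a line.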

If I remove the PCI NIC the machine stays up. If I boot into a vanilla
kernel with the NIC in the box it stays up.

I have NICs like these, bought in batch, running in other machines that
are also running Xen. The machines aren't really used a great deal (at
the moment, although they need to be soon), and as far as I can tell
there's no other issue with the system that is failing (i.e. the obvious
stuff like disk space running out or exhaustive cronjobs). There are no
logs other than the one to the console suggesting a failure elsewhere.

Our hardware engineer is convinced it's either a Xen or driver issue.
I've seen the thread at
http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
and have directed the engineer at this.

My questions to the list are:

1. Can this be caused by anything else (other than hardware)?
2. Is there anything I can do to debug this further to confirm what part
of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?

Any help on this would be greatly appreciated.

Many thanks,

Matt

- --
 Matthew Baker, UNIX Systems Administrator
 ----------------------------------------------------
 Institute for Learning and Research Technology (ILRT)
 A: University of Bristol,
    8-10 Berkeley Square,
    Bristol.
    BS8 1HH
 W: http://www.ilrt.bristol.ac.uk
 E: matt.baker@bris.ac.uk
 T: +44 (0)117 928 7121

- -- lspci

00:00.0 Host bridge: Intel Corporation E7320 Memory Controller Hub (rev 0c)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev)
00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 (re)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Cont)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 0)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev )
01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev )
02:02.0 PCI bridge: Intel Corporation 80331 [Lindsay] I/O processor (PCI-X Brid)
03:0e.0 RAID bus controller: Adaptec AAC-RAID (rev 0a)
06:01.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Cont)
06:02.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Cont)
07:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHWVH6Lvm7pB/aicMRAioDAJ0Vw2dVALMkYylyR6Pjlw71y8ZZpQCfV+KU
Ia7+fPLZQsMXtjmFk5KSNyA=6fWn
-----END PGP SIGNATURE-----

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Robbie Dinn

2007-Dec-11 10:19 UTC

Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)

No one else seems to have taken the bite, so I will even
though I may not be best qualified to do so.

Matthew Baker wrote:
> Hi all,
> 
> I have 2 servers with identical hardware (lspci at the bottom of this
> email).
Two identical servers is good. But I wasn't clear from your description
whether they behaved the same.

Assuming they behave differently then that might mean you have one
substandard component in one of the machines. Record all the serial
numbers of the components, or label them yourself, then begin
swapping them between the machines. If you can get the fault to
move from one machine to the other, you can maybe pin it on one component.

Your hardware guy may have already tried the above. If you have two
machines and they both show the fault, that's more tricky.
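
The swap-and-observe approach can be kept honest with a small log of which labelled parts were installed for each run and whether that run faulted. The following is only a sketch under the assumption that a single part is at fault and that each run is left long enough to show the problem; `suspect_components` and the labels in the example are hypothetical.

```python
def suspect_components(runs):
    """Narrow down suspects from a log of swap experiments.

    `runs` is a list of (components, faulted) pairs, where `components`
    holds the labels of the parts installed for that run. A part stays
    suspect if it was present in every faulting run and absent from
    every clean run.
    """
    faulting = [set(parts) for parts, faulted in runs if faulted]
    clean = [set(parts) for parts, faulted in runs if not faulted]
    if not faulting:
        return set()
    suspects = set.intersection(*faulting)
    for installed in clean:
        suspects -= installed
    return suspects


if __name__ == "__main__":
    # Hypothetical labels for parts moved between the two chassis.
    log = [
        ({"nic-A", "riser-A", "cpu-A"}, True),
        ({"nic-A", "riser-B", "cpu-B"}, True),
        ({"nic-B", "riser-A", "cpu-A"}, False),
    ]
    print(suspect_components(log))
```

If the result comes back empty, the single-fault assumption probably does not hold and the fault may depend on a combination of parts.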
> 
> An Extra Intel PRO/1000 MT Dual Port Server Adapter[1] has been
> connected into the second slot on a pci-x capable riser (the first slot
> taken by the SAS Raid controller).
> 
> When this nic *is* connected *and* the boxes boot a Xen kernel (debian
> 4.0 2.6.18-5-xen and using Xen HyperVisor(PAE) 3.0.3-0-4) after about 2
> days I get this error on the console:
> 
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) CPU: 1
> (XEN) EIP: e008:[<ff1193be>]CPU: 3
> (XEN) EIP: e008:[<ff1193be>] idle_loop+0x4e/0x60 idle_loop+0x4e/0x60
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) eax: 00000000 ebx: ffbeffb4 ecx: 00000001 edx: 00000000
> (XEN) esi: ffbeffb4 edi: ffbf6080 ebp: 000090dc esp: ffbeffa8
> (XEN) cr0: 8005003b cr4: 000006f0 cr3: a3363000 cr2: b7f2c260
> (XEN)
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) ds: e010 es: e010 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) eax: 00000000 ebx: ffbe3fb4 ecx: 096a03ba edx: ff18c080
> (XEN) Xen stack trace from esp=ffbeffa8:
> (XEN) esi: ffbf0080 edi: 07a0403a ebp: 000090dc esp: ffbe3fa8
> (XEN) 00000001cr0: 8005003b cr4: 000006f0 cr3: a1b80000 cr2: b7edd260
> (XEN) 00000001 00001000 00000001 00000000 00000000 00000001
> 00000001ds: e010 8(XEN)
> (XEN) 00000000Xen stack trace from esp=ffbe3fa8:
> (XEN) 00000000 00000001 00f90000 00000003 c01013a7 ffbf0080
> 00000061 00000001(XEN) 0000007b 0000007b 00000000 00000000 00000001
> ffbf6080 00000003
> (XEN) Xen call trace:
> (XEN) [<ff1193be>]
> (XEN) idle_loop+0x4e/0x60
> (XEN) 00000000
> (XEN) ************************************
> (XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000.
I had a brief look to see if I could find the place in the source
code where this is being printed out, but I drew a blank. I must
not be looking at the right version of the source tree.
File .xen/arch/x86/traps.c looks like a good candidate.


> (XEN) System shutting down -- need manual reset.
> (XEN) ************************************
> 
> The machine obviously hangs.
> 
> If I remove the PCI NIC the machine stays up. If I boot into a vanilla
> kernel with the NIC in the box it stays up.
> 
> I have NICs like these bought in batch running in other machines that
> are also running Xen. The machines aren't really used a great deal (at
> the moment although need to be soon) and as far as i can tell there's no
> other issue with respect to the system that is failing, i.e the obvious
> stuff like disk space running out or exhaustive cronjobs). There are no
> logs other than the one to the console suggesting a failure elsewhere.
> 
> Our hardware engineer is convinced it's either a Xen or driver issue.
I can see why he might think so or want to say so.
> I've seen the thread at
> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
> and have directed the engineer at this.
> 
> My questions to the list are:
> 
> 1. Can this be caused by anything else (other than hardware)?
> 2. Is there anything I can do to debug this further to confirm what part
> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?
Grasping at straws: could you try running a memory test program, e.g. memtest86?

Is this a server-class machine with ECC memory? If so, is it possible
to get the Linux kernel to report any soft memory errors that get corrected
by the ECC hardware?

Is there anything in linux/Documentation/drivers/edac/edac.txt
that might help? (I have not used this myself.) There may be
non-fatal errors happening before the fatal one.
That might give you or your hardware engineer a clue as to
where else to look.
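
As a rough illustration of watching for corrected errors, the sketch below polls the EDAC counters exposed through sysfs. It assumes the memory-controller entries live under /sys/devices/system/edac/mc/mcN with `ce_count` (corrected) and `ue_count` (uncorrected) files, as described in the kernel's EDAC documentation; the function name `read_edac_counts` is made up for this example, and whether the counters are populated under a Xen dom0 kernel is not something this thread confirms.

```python
from pathlib import Path

# Assumed sysfs location of the EDAC memory-controller entries.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...} for each controller."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Run from cron and logged with a timestamp, a rising `ce_count` in the hours before a crash would point towards memory rather than the NIC.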

How about building a Linux kernel with some form of debugging
turned on? This might help you to see if something is
scribbling on memory when it shouldn't be.


I don't really know the answer, but good link anyway.
> 
> Any help on this would be greatly appreciated.
> 
> Many thanks,
> 
> Matt
> 


Matthew Baker

2007-Dec-12 11:21 UTC

Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robbie Dinn wrote:
> No one else seems to have taken the bite, so I will even
> though I may not be best qualified to do so.
And I thank you for that. ;-)
> Matthew Baker wrote:
>> Hi all,
>>
>> I have 2 servers with identical hardware (lspci at the bottom of this
>> email).
> Two identical servers is good. But I wasn't clear from your description
> whether they behaved the same.
Well, both servers have exhibited "problems", but I've only been able to
capture the panic on one machine. So my assumption that it is the same
cause may be wrong.
> Assuming they behave differently then that might mean you have one
> substandard component in one of the machines. Record all the serial
> numbers of the components, or label them yourself, then begin
> swapping them between the machines. If you can get the fault to
> move from one machine to the other, you can maybe pin it on one component.
I'm going to be able to get both these boxes out of the rack and into a
place where I can do some better diagnosis from this angle. I'm beginning
to believe it may be related to one box more than the other.
> Your hardware guy may have already tried the above. If you have two
> machines and they both show the fault, that's more tricky.
Fingers crossed.
>> Our hardware engineer is convinced it's either a Xen or driver issue.
> 
> I can see why he might think so or want to say so.
Yes, as can I.
>> I've seen the thread at
>> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
>> and have directed the engineer at this.
>>
>> My questions to the list are:
>>
>> 1. Can this be caused by anything else (other than hardware)?
>> 2. Is there anything I can do to debug this further to confirm what part
>> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?
> 
> grasping at straws, could you try running a memory test program, eg memtest86.
Yes, we've run some diagnostics on one of the boxes and all seems well.
However, we still need to compare them.
> Is this a server class machine with ECC memory? If so, is it possible
> to get the linux kernel to report any soft memory errors that get corrected
> via the ECC hardware? 
> 
> Is there anything in linux/Documentation/drivers/edac/edac.txt
> that might help? (I have not used this myself). There may be
> non fatal errors that are happening before the fatal one.
> That might give you or your hardware engineer a clue as to
> where else to look.
Ah, this looks good. The edac modules were loaded already (by udev, I
presume). I've enabled the logging features via /sys. Thanks for the
tip.
> How about building a linux kernel with some form of debugging
> turned on? This might help you to see if something is
> scribbling on memory when it shouldn't be.
Yes, we've thought about enabling the gdb-stub as described in
http://wiki.xensource.com/xenwiki/XenPPC/Debug/XenGDBStub
I'm presuming this will work for architectures other than ppc. I see this
as a last resort, as kernel debugging can be quite time-consuming!

Thanks for your help; it has given me some ideas on how to approach this.

Matt

- --
 Matthew Baker, UNIX Systems Administrator
 ----------------------------------------------------
 Institute for Learning and Research Technology (ILRT)
 A: University of Bristol,
    8-10 Berkeley Square,
    Bristol.
    BS8 1HH
 W: http://www.ilrt.bristol.ac.uk
 E: matt.baker@bris.ac.uk
 T: +44 (0)117 928 7121
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHX8QsLvm7pB/aicMRAuGeAJ4mb4NSPj6YeRSC48iKz2N0U3jm3gCfZM1d
Pr3mJfQZsO0bvCvtUoqjwT8=XSUr
-----END PGP SIGNATURE-----
