thr3ads.net - Xen users - [Xen-users] Shutting down domU causes "hda: interrupt lost" on dom0 and freezes the box. [Apr 2008]


Jamie J. Begin

2008-Apr-29 18:12 UTC

[Xen-users] Shutting down domU causes "hda: interrupt lost" on dom0 and freezes the box.

I mentioned this yesterday, but it was buried in another thread:

 

When I try to reboot dom0, it switches to runlevel 6 and the xen init.d
script attempts to stop a domU containing an Asterisk installation.  It's at
that point that I get an "hda: interrupt lost" on the physical console.  SSH
becomes inaccessible and eventually the system pukes up a bunch of ext3 and
RAID controller related errors and freezes.  I have to physically power
cycle the box to get it back up.

 

I suspect that a PCIe telephony card that I'm passing to the domU using
pciback is the source of the problem.  The card is a Digium AEX800 (which is
actually a PCIe version of Digium's PCI-based TDM800P). Based on some
preliminary testing, the card seems to function just fine in the domU.
lspci output is:

 

0b:08.0 Ethernet controller: Digium, Inc. Unknown device 8002 (rev 11)
        Subsystem: Digium, Inc. Unknown device 8002
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at dc00 [disabled] [size=256]
        Region 1: Memory at fc7dfc00 (32-bit, non-prefetchable) [disabled] [size=1K]
        Expansion ROM at fc7e0000 [disabled] [size=128K]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

 

Any suggestions?  This is a new Dell PowerEdge 1950 with a PERC SATA RAID 1
array, running CentOS 5.1 (2.6.18-53.1.14.el5xen) in both the dom0 and domU.


 



_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Mark Williamson

2008-Apr-29 18:21 UTC


Re: [Xen-users] Shutting down domU causes "hda: interrupt lost" on dom0 and freezes the box.

> When I try to reboot dom0, it switches to runlevel 6 and the xen init.d
> script attempts to stop a domU containing an Asterisk installation.  It's
> at that point I get an "hda: interrupt lost" on the physical console.  SSH
> becomes inaccessible and eventually the system pukes up a bunch of ext3 and
> RAID controller related errors and freezes.  I have to physically power
> cycle the box to get it back up.

Ugh, that's nasty :-(

> I suspect that a PCIe telephony card that I'm passing to the domU using
> pciback is the source of the problem.  The card is a Digium AEX800 (which
> is actually a PCIe version of Digium's PCI-based TDM800P). Based on some
> preliminary testing, the card seems to function just fine in the domU.
> lspci output is:

I was actually just thinking "I wonder if he's using PCI passthrough" ;-)

A few thoughts spring to mind:

1) Any idea if this is happening during a normal shutdown of the domU or if
that shutdown is timing out, resulting in the domain being rudely destroyed?

2) Is there any chance that the domains are being suspended rather than
shut down?  That might do funny things...

3) Does this happen if you manually shut down the domain?  If you manually
destroy the domain?

It's also possible that this is some kind of bug in the PCI passthrough.  I
didn't actually know that it worked for PCIe, but it's nice to know that it
(sort of) does :-)

I apologise for suggesting testing on your own system; I imagine that doing
this repeatedly is not doing your filesystem consistency any good :-(  It's
possible that you'll get more suggestions than I've been able to provide from
the xen-devel list.  Still, if you feel like doing some testing it may help.

It's also possible that there have been other reports like this, although I
don't remember hearing of them.  Have you done a quick search of the mailing
list archives and the bugzilla?  (Or even just a google, in case someone
grumbled about it on their blog.)

The Asterisk-in-a-domU configuration seems to be rather popular, which is 
cool :-)

Cheers,
Mark


-- 
Push Me Pull You - Distributed SCM tool (http://www.cl.cam.ac.uk/~maw48/pmpu/)


Jamie J. Begin

2008-Apr-29 21:28 UTC


RE: [Xen-users] Shutting down domU causes "hda: interrupt lost" on dom0 and freezes the box.

I'm not too worried about screwing up the filesystem, but thanks. :-)  We're
moving offices and I have about three weeks to get this working, so I can
nuke-n-reload as much as needed.  Our new telecom provider is able to hand
off our voice trunks using SIP, so I'm hoping to use this one physical
server as an all-in-one edge device: Asterisk, SER (a SIP proxy), firewall,
and possibly a caching Squid proxy.  Using Xen will also make it easier to
replicate this config at our other offices.

I started doing some more testing along the lines you suggested:

1. Issuing a "poweroff" from within an SSH session on the domU works fine.

3. Executing "xm destroy <domU>" causes 'irq 16: nobody cared (try booting
with the "irqpoll" option)' to be written to the console, followed by a dom0
kernel panic. Yay!

2. Executing "xm save <domU> filename" causes the same HDD errors I
previously mentioned.

I started to think "resource conflict." I edited grub.conf to pass
"acpi=off" to the dom0 kernel, hoping maybe the problem was with how ACPI
handled the IRQs, but no luck.  I then took a look at the IRQ assignments in
the BIOS and discovered that both the RAID controller and the Digium card
were assigned to the same IRQ.  Trying to change the IRQ on either the card
or the RAID controller just caused the other to change as well; I couldn't
individually assign them.  So I poked around under "Integrated Devices"
in the BIOS and changed "System Interrupts Assignment" from "Standard" to
"Distributed."

Bam! It now worked.  I wasn't getting IRQ conflicts between the storage
subsystem and the telephony card. However, I was now getting a kernel
panic at boot in the domU when it tried to load the zaptel drivers
(needed for the telephony card). So I mounted the LVM volume in the Xen image
using kpartx and manually blew away the zaptel drivers in /lib/modules.
I unmounted the image and booted the domU. This time it booted fine, so I
recompiled and reinstalled the zaptel stuff.  I rebooted the domU again, and
still got a kernel panic.

Mainly out of frustration, I then just decided to reboot the physical
server.  For some reason, that fixed that problem. It now works.  No
conflicts with the storage controller, no problems shutting down domU, and
the card still works fine in Asterisk.  Problem solved.  
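
For anyone who lands here with similar symptoms, the IRQ sharing may be visible from dom0 without a trip into the BIOS. Below is a minimal Python sketch, assuming `lspci -vv` output in the same format as the listing earlier in this thread. The function name and the two extra sample devices are made up for illustration and do not come from the actual box.

```python
import re
from collections import defaultdict

# A device header starts at column 0, e.g. "0b:08.0 Ethernet controller: ..."
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$")
# The interrupt line as printed by "lspci -vv"
IRQ = re.compile(r"routed to IRQ (\d+)")


def shared_irqs(lspci_vv_text):
    """Return {irq: [pci addresses]} for every IRQ used by more than one device."""
    by_irq = defaultdict(list)
    current = None
    for line in lspci_vv_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        irq = IRQ.search(line)
        if irq and current is not None:
            by_irq[int(irq.group(1))].append(current)
    return {irq: devs for irq, devs in by_irq.items() if len(devs) > 1}


# Sample input: the first entry is from this thread, the other two are hypothetical.
SAMPLE = """\
0b:08.0 Ethernet controller: Digium, Inc. Unknown device 8002 (rev 11)
        Interrupt: pin A routed to IRQ 16
01:00.0 RAID bus controller: (hypothetical storage controller)
        Interrupt: pin A routed to IRQ 16
00:1d.0 USB Controller: (hypothetical device)
        Interrupt: pin A routed to IRQ 21
"""

print(shared_irqs(SAMPLE))
```

On a real system the text would come from running `lspci -vv` as root in dom0. Any IRQ that lists both a passed-through device and a dom0 storage controller is worth a closer look.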

The very first computer problem I ever solved was changing the IRQ selection
jumpers on an internal modem because it was conflicting with the IO card in
my 286.  I can't believe the same problem is still stumping me almost 20
years later.


Mark Williamson

2008-Apr-30 13:27 UTC


Re: [Xen-users] Shutting down domU causes "hda: interrupt lost" on dom0 and freezes the box.

Thanks Jamie,

I think your debugging has identified the cause of the problem with a decent
degree of certainty.  Sounds like it's more or less what I had originally
suspected:

The guest OS has to be able to mask interrupt lines in order to operate the
PCI device it's been given.  If that interrupt line is shared with dom0
devices then the domU can mask a line that dom0 needs to get interrupts on.
That's fine if the guest is well behaved but in your situation the guest was
leaving the line masked when it was suspended - at least that's roughly what
it sounds like, although I couldn't describe the exact chain of events.
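
The masking scenario described above can be pictured with a toy Python model. This is only an illustration of the hypothesis, not Xen code and not a description of the real chain of events. The class, the function, and IRQ number 17 are all made up for the example.

```python
class IrqLine:
    """One physical interrupt line. A crude illustration, not Xen code."""

    def __init__(self, number):
        self.number = number
        self.masked = False
        self.users = []  # (domain, device) pairs wired to this line

    def attach(self, domain, device):
        self.users.append((domain, device))

    def deliver(self):
        """Domains notified when a device on this line raises an interrupt."""
        if self.masked:
            return []  # nobody hears anything while the line is masked
        return sorted({domain for domain, _ in self.users})


def guest_leaves_line_masked(line):
    """Hypothetical failure: the guest masks its line and never unmasks it."""
    line.masked = True


# Shared case: RAID controller (dom0) and telephony card (domU) both on IRQ 16.
shared = IrqLine(16)
shared.attach("dom0", "RAID controller")
shared.attach("domU", "telephony card")
guest_leaves_line_masked(shared)
shared_result = shared.deliver()  # dom0 loses its disk interrupts as well

# Distributed case: each device has its own line (17 is an assumed number).
raid_line = IrqLine(16)
raid_line.attach("dom0", "RAID controller")
card_line = IrqLine(17)
card_line.attach("domU", "telephony card")
guest_leaves_line_masked(card_line)
distributed_result = raid_line.deliver()  # dom0 is unaffected

print(shared_result, distributed_result)
```

In the shared case the masked line silences dom0's storage controller too, which would fit the "hda: interrupt lost" symptom. With separate lines the misbehaving guest only hurts itself.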

Your pulling BIOS switches around sounds like it's distributed the IRQs across
multiple lines, which means that the guest screwing up doesn't break dom0.
This is also beneficial for performance (imagine having to upcall
into two entire operating systems just to find out which driver is interested
in an interrupt!) so it's a good switch to make in any case, if it works.

> 1. Issuing a "poweroff" from within a SSH session on the domU works fine.

This would be since the guest was able to shut down cleanly and not leave the
hardware in a funny state.

> 3. Executing "xm destroy <domU>" causes 'irq 16: nobody cared (try booting
> with the "irqpoll" option)' to be written to the console, followed by a dom0
> kernel panic. Yay!

A full panic?  Eww!  Anyhow, sounds like the domU hadn't been able to shut
down the hardware right, so there were unexpected IRQs coming through.  I'm
not sure why you'd get a panic though, that sounds nastier.

> 2. Executing "xm save <domU> filename" causes the same HDD errors I
> previously mentioned.

Right.  Your init scripts were probably doing an xm save on reboot, I guess.

Did the xm save actually appear to proceed?  Did it even succeed?  It's a bit
scary if it does since I'm not at all sure it's a supported configuration (it
wasn't last time I heard; I had thought we'd disabled this).  Can you remind
me what version of Xen you're running and where you got it from?

IMPORTANT NOTE: xm save-ing and xm restore-ing guests with passed-through PCI
devices doesn't sound like it should make much sense.  I wouldn't be at all
surprised if it was actually unsafe and I certainly wouldn't be surprised if
it didn't actually work properly.  You should look into disabling this before
production use; make sure they get shut down conventionally and then booted
from scratch when dom0 next comes up.  It will save you headaches!
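
One way to check whether dom0 is set up to save guests at shutdown, sketched in Python. This assumes the stock xendomains init script, which on CentOS reads /etc/sysconfig/xendomains and treats an empty XENDOMAINS_SAVE as "do not save, shut the domains down instead". Verify that against the comments in your own copy of the file before relying on it. The function name is made up for illustration.

```python
import shlex


def xendomains_save_dir(sysconfig_text):
    """Return the XENDOMAINS_SAVE value, or '' if it is unset or empty."""
    value = ""
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        name, sep, rest = line.partition("=")
        if sep and name.strip() == "XENDOMAINS_SAVE":
            # shlex strips shell quoting and trailing comments
            parts = shlex.split(rest, comments=True)
            value = parts[0] if parts else ""
    return value


# Two sample fragments (assumed contents, not copied from the poster's system).
saving_enabled = "XENDOMAINS_SAVE=/var/lib/xen/save\nXENDOMAINS_RESTORE=true\n"
saving_disabled = 'XENDOMAINS_SAVE=""\nXENDOMAINS_RESTORE=false\n'

for label, text in (("enabled", saving_enabled), ("disabled", saving_disabled)):
    directory = xendomains_save_dir(text)
    if directory:
        print(label, "-> guests would be saved to", directory)
    else:
        print(label, "-> guests would be shut down, not saved")
```

With saving disabled, a guest that owns a passed-through PCI device gets a normal shutdown when dom0 goes down, which is the case that worked in the tests reported earlier in the thread.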
> Mainly out of frustration, I then just decided to reboot the physical
> server.  For some reason, that fixed that problem. It now works.  No
> conflicts with the storage controller, no problems shutting down domU, and
> the card still works fine in Asterisk.  Problem solved.

Ho hum, not sure why that helped but if it works reliably we'll just smile,
nod and move on!

> The very first computer problem I ever solved was changing the IRQ
> selection jumpers on an internal modem because it was conflicting with the
> IO card in my 286.  I can't believe the same problem is still stumping me
> almost 20 years later.

At least we're not having to recompile Xen and put it on a special floppy
because we don't have enough base memory free.  Some things have moved on ...
slightly.

Anyhow, I bet when you had to change the jumpers around the problem was only 
able to take down one OS.  With modern technology, you can blow away several 
at once :-D

Cheers,
Mark


-- 
Push Me Pull You - Distributed SCM tool (http://www.cl.cam.ac.uk/~maw48/pmpu/)
