James C. McPherson
2005-Oct-10 14:29 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Hi all,

I've been trying to figure out why my audio hardware (ac97 controller
using audio810/audioi810 onboard a Gigabyte nForce3-250) is using an
incredible number of interrupts (of the order of 80000/sec, even when
not playing any audio).

I was running intrstat, and tried running /usr/demo/dtrace/intr.d
at the same time.

This results in a hard hang every time I attempt it. I'm running in
64-bit mode with

SunOS doppio 5.11 onnv-gate:2005-10-04 i86pc i386 i86pc

+ build24 bits

Has anybody seen this already? Logged it? What should I be looking
for in the way of useful data to provide? I've got snooping=1 set
but I'm loath to attempt to kill this box again in a hurry since it's
my primary workstation and I do have that day job thing to focus on :)

Incidentally, are there any reports of onboard ac97 hardware
monopolizing interrupts?

thanks in advance,
James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems
Bryan Cantrill
2005-Oct-10 15:54 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
> Hi all,
> I've been trying to figure out why my audio hardware (ac97 controller
> using audio810/audioi810 onboard a gigabyte nForce3-250) is using an
> incredible number of interrupts. (Of the order of 80000/sec, even when
> not playing any audio).
>
> I was running intrstat, and tried running /usr/demo/dtrace/intr.d
> at the same time.
>
> This results in a hard hang every time I attempt it.

One of three things is happening:

(1) There is a bug in the driver that is somehow tickled by the temporal
    distortion induced by the DTrace enabling.

(2) The system isn't actually hung, it's just in a ton of pain. Be sure
    that you leave the machine long enough for the entire DTrace deadman
    to kick in (40 seconds); if DTrace detects that the machine is no
    longer making forward progress after this time, it will kill the
    enabling. If DTrace isn't killing off the enabling, it means that
    the machine is still making forward progress (clock is firing,
    processes are running, etc.)

(3) There is a bug in the deadman mechanism, or elsewhere in DTrace
    that is inducing (or at least not recovering from) the hang.

Of these, (3) is the least likely -- but certainly anything is possible.
Perhaps you can experiment with dialing the probe effect of intr.d down
(by not aggregating on a string) to see if you can at least root-cause
the interrupt storm? That might give us insight into the likelihood of
(1)...

- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
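The reasoning in (2) can be made concrete with a toy watchdog. This is only a sketch of the idea, not DTrace's implementation: the `Deadman` class, its method names, and the once-per-second check are all hypothetical. The only figure taken from the message above is the 40-second timeout.

```python
# Toy model of a deadman-style watchdog. NOT DTrace's implementation:
# the class, its methods and the 1-second check interval are hypothetical.
# Only the 40-second timeout comes from the message above.

DEADMAN_TIMEOUT_S = 40.0


class Deadman:
    def __init__(self, now, timeout=DEADMAN_TIMEOUT_S):
        self.timeout = timeout
        self.last_progress = now
        self.enabling_active = True

    def note_progress(self, now):
        """Record that the system demonstrably made forward progress."""
        self.last_progress = now

    def check(self, now):
        """Kill the enabling if no progress was seen for `timeout` seconds."""
        if self.enabling_active and now - self.last_progress >= self.timeout:
            self.enabling_active = False
        return self.enabling_active


def simulate(progress_times, end_time):
    """Check once per simulated second.

    Returns the second at which the enabling is killed, or None if it
    survives until end_time.
    """
    dm = Deadman(now=0.0)
    progress = set(progress_times)
    for t in range(1, end_time + 1):
        if t in progress:
            dm.note_progress(float(t))
        if not dm.check(float(t)):
            return t
    return None
```

In this model a machine that keeps making even slow progress (`simulate(range(1, 121), 120)`) never loses its enabling, while one that makes none (`simulate([], 120)`) loses it at the 40-second mark. That is the inference in (2): an enabling that survives well past the timeout suggests the machine is still moving.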
James C. McPherson
2005-Oct-10 19:47 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Seth Goldberg wrote:
> How many interrupts does mpstat report? It's possible that if mpstat is
> reporting a normal-looking number of interrupts, that if you're sharing
> the same IRQ with the audio hardware and the audio hardware repeatedly
> claims interrupts (returns nonzero from its ISR), the looping construct we
> use during interrupt handling will never exit, and the machine will appear
> to hang -- can you set up a serial console on this system, and if so, can
> you break into kmdb during the hang?

mpstat? I hadn't thought of running it -- this is a single-cpu machine.

It's a PC, and last night after another perceived hard-hang I retried
the operation from the console without X - same thing.

James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems
Seth Goldberg
2005-Oct-10 19:55 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
}Seth Goldberg wrote:
}> How many interrupts does mpstat report? It's possible that if mpstat is
}> reporting a normal-looking number of interrupts, that if you're sharing the
}> same IRQ with the audio hardware and the audio hardware repeatedly claims
}> interrupts (returns nonzero from its ISR), the looping construct we use
}> during interrupt handling will never exit, and the machine will appear to
}> hang -- can you set up a serial console on this system, and if so, can you
}> break into kmdb during the hang?
}
}mpstat? I hadn't thought of running it -- this is a single-cpu machine.

Yes -- the mpstat interrupt statistics are different from the dtrace
interrupt statistics.

}
}It's a PC, and last night after another perceived hard-hang I retried
}the operation from the console without X - same thing.

You can try something else if you don't want to set up a serial console.
You can drop the interrupt priority of the audio device by editing its
conf file -- change its interrupt-priorities property to 1. That way, the
keyboard interrupt will be able to get through if there is truly an audio
driver interrupt storm.

--S
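The "looping construct" in the quoted text can be pictured with a toy model. This sketch only illustrates the behaviour described there (keep calling the handlers on a shared line for as long as any of them claims the interrupt). It is not the Solaris dispatch code; all function names are hypothetical, and `max_passes` exists only so the model terminates.

```python
# Toy model of dispatch on a shared interrupt line, following the
# description quoted above. Hypothetical names; not Solaris kernel code.

def dispatch_shared_irq(handlers, max_passes=1000):
    """Call every handler on the line; repeat while any handler claims.

    Returns the number of passes made, or None if the cap was reached.
    The cap is an artifact of this model so that it always terminates.
    """
    for passes in range(1, max_passes + 1):
        claimed = False
        for handler in handlers:
            if handler():  # a true return means "this interrupt was mine"
                claimed = True
        if not claimed:
            return passes
    return None


def well_behaved(pending):
    """Build an ISR that claims only while it has pending work."""
    state = {"pending": pending}

    def isr():
        if state["pending"] > 0:
            state["pending"] -= 1
            return True
        return False

    return isr


def always_claims():
    """An ISR that claims every time, as in an interrupt storm."""
    return True
```

With only well-behaved handlers the loop exits one pass after the work runs out. Add a single handler that always claims and the loop never exits on its own, which from the console would look like a hang.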
James C. McPherson
2005-Oct-10 19:58 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Bryan Cantrill wrote:
...
> One of three things is happening:
> (1) There is a bug in the driver that is somehow tickled by the temporal
>     distortion induced by the DTrace enabling.
> (2) The system isn't actually hung, it's just in a ton of pain. Be sure
>     that you leave the machine long enough for the entire DTrace deadman
>     to kick in (40 seconds); if DTrace detects that the machine is no
>     longer making forward progress after this time, it will kill the
>     enabling. If DTrace isn't killing off the enabling, it means that
>     the machine is still making forward progress (clock is firing,
>     processes are running, etc.)

This is what I figure is happening. I've left the box in this state
for at least 2 minutes.

> (3) There is a bug in the deadman mechanism, or elsewhere in DTrace
>     that is inducing (or at least not recovering from) the hang.
> Of these, (3) is the least likely -- but certainly anything is possible.

Well sure -- I've seen your code .... :)

> Perhaps you can experiment with dialing the probe effect of intr.d down
> (by not aggregating on a string) to see if you can at least root-cause
> the interrupt storm? That might give us insight into the likelihood of
> (1)...

I'll give it a go and find out. I tried using the attached D script
as well, but that is aggregating on a string as well.

thanks,
James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: intr.d
URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20051011/a1917a52/attachment.ksh>
James C. McPherson
2005-Oct-10 20:25 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Seth Goldberg wrote:
>James McPherson wrote:
> }Seth Goldberg wrote:
> }> How many interrupts does mpstat report? It's possible that if mpstat is
> }> reporting a normal-looking number of interrupts, that if you're sharing the
> }> same IRQ with the audio hardware and the audio hardware repeatedly claims
> }> interrupts (returns nonzero from its ISR), the looping construct we use
> }> during interrupt handling will never exit, and the machine will appear to
> }> hang -- can you set up a serial console on this system, and if so, can you
> }> break into kmdb during the hang?
> }mpstat? I hadn't thought of running it -- this is a single-cpu machine.
> Yes -- the mpstat interrupt statistics are different from the dtrace
> interrupt statistics.

Now this is interesting. I just ran mpstat:

doppio:tmp $ mpstat 1
CPU minf mjf xcal   intr   ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0   13   1    0 635384 635284  360   16    0    0   0   960   0 100  0   0
  0   10   0    0 616197 616095  355   51    0    0   0  1013   0 100  0   0
  0    0   0    0 630354 630255  312    9    0    0   0   784   0 100  0   0
  0    0   0    0 629379 629279  321   14    0    0   0   863   0 100  0   0
  0    0   0    0 629996 629896  325   18    0    0   0   874   0 100  0   0
  0    0   0    0 629518 629418  341   23    0    0   0   874   0 100  0   0
  0    0   0    0 629095 628995  304   19    0    0   0   798   0 100  0   0
  0    0   0    0 627890 627790  352   24    0    0   0  1050   0 100  0   0
  0    0   0    0 602069 601969  379   79    0    0   0  1099   0 100  0   0
  0    0   0    0 640340 640240  379    6    0    0   0  1075   0 100  0   0
  0    0   0    0 638873 638773  436    1    0    0   0  1255   0 100  0   0
  0    0   0    0 638833 638733  436    8    0    0   0  1303   0 100  0   0
  0  316   0    0 640835 640735  388   13    0    0   0  1691   0 100  0   0
  0    0   0    0 635549 635449  416    2    0    0   0  1072   0 100  0   0
  0  454  16    0 623652 623552  589   90    0    2   0  1877   0 100  0   0
  0   11   0    0 614002 613902  126  101    0    0   0   124   0 100  0   0
  0 1914  52    0 582372 582263  399   99    0    0   0  2044   0 100  0   0
  0    5   2    0  85936  85753  300  105    0    1   0   813   0 100  0   0
  0    0   0    0  86528  86347  295   55    0    0   0   680   0 100  0   0
  0    0   0    0  86581  86399  320   69    0    1   0   788   0 100  0   0
CPU minf mjf xcal   intr   ithr  csw icsw migr smtx srw syscl usr sys wt idl
  0    1   0    0  86325  86143  267   67    0    0   0   599   0 100  0   0
  0    0   0    0  86002  85819  264   81    0    0   0   740   0 100  0   0
  0    0   0    0  85931  85749  181  106    0    1   0   487   0 100  0   0
  0    0   0    0  85764  85585  151  103    0    0   0   357   0 100  0   0
  0    0   0    0  85295  85113  274  114    0    2   0   764   0 100  0   0
  0    0   0    0  85457  85274  261  105    0    0   0   706   0 100  0   0
  0    0   0    0  85507  85326  277  106    0    0   0   772   0 100  0   0
  0    0   0    0  85655  85472  258  102    0    0   0   721   0 100  0   0
  0    0   0    0  85588  85407  231  102    0    0   0   659   0 100  0   0
  0    0   0    0  85573  85391  259  121    0    1   0   669   0 100  0   0

see where intr and ithr drops from 582372/582263 down to 85936/85753 ?
That's when I started running intrstat as well.

> }It's a PC, and last night after another perceived hard-hang I retried
> }the operation from the console without X - same thing.
> You can try something else if you don't want to set up a serial
> console. You can drop the interrupt priority of the audio device by
> editing its conf file -- change its interrupt-priorities property to 1.
> That way, the keyboard interrupt will be able to get through if there is
> truly an audio driver interrupt storm.

And of course that would help with my keyboard+mouse being usb, too.
I'll check and get back to you.

James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems
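For anyone wanting to spot such a drop without reading the columns by eye, here is a minimal sketch in Python. The sample is six rows copied from the mpstat output in this message; `parse_mpstat` and `largest_drop` are hypothetical helper names, and the parser assumes the 16-column layout shown in the header line.

```python
# Sketch: find the biggest one-interval fall in an mpstat column.
# Sample rows are copied from the message; helper names are hypothetical.

MPSTAT_SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 454 16 0 623652 623552 589 90 0 2 0 1877 0 100 0 0
0 11 0 0 614002 613902 126 101 0 0 0 124 0 100 0 0
0 1914 52 0 582372 582263 399 99 0 0 0 2044 0 100 0 0
0 5 2 0 85936 85753 300 105 0 1 0 813 0 100 0 0
0 0 0 0 86528 86347 295 55 0 0 0 680 0 100 0 0
0 0 0 0 86581 86399 320 69 0 1 0 788 0 100 0 0
"""


def parse_mpstat(text):
    """Turn mpstat text into a list of {column: int} dicts, one per row."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":  # header line (mpstat repeats it periodically)
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # skip anything that does not match the header layout
        rows.append(dict(zip(header, map(int, fields))))
    return rows


def largest_drop(rows, column="intr"):
    """Return (row_index, before, after) for the biggest fall in `column`."""
    best = None
    for i in range(1, len(rows)):
        before, after = rows[i - 1][column], rows[i][column]
        drop = before - after
        if best is None or drop > best[0]:
            best = (drop, i, before, after)
    if best is None:
        return None
    return best[1], best[2], best[3]


rows = parse_mpstat(MPSTAT_SAMPLE)
print(largest_drop(rows, "intr"))
print(largest_drop(rows, "ithr"))
```

On this sample the largest fall in both `intr` and `ithr` lands on the fourth row, the 582372/582263 to 85936/85753 transition pointed out above.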
Seth Goldberg
2005-Oct-10 20:27 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
}> You can try something else if you don't want to set up a serial console.
}> You can drop the interrupt priority of the audio device by editing its conf
}> file -- change its interrupt-priorities property to 1. That way, the
}> keyboard interrupt will be able to get through if there is truly an audio
}> driver interrupt storm.
}
}And of course that would help with my keyboard+mouse being usb, too.

In the case of USB, you'll probably want to increase the USB controller's
interrupt priority also, by adding the property to the host controller
driver's conf file (USB controllers' priorities default to `1') with an
appropriate priority.

--S
Seth Goldberg
2005-Oct-10 20:28 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
}
}see where intr and ithr drops from 582372/582263 down to 85936/85753 ? That's
}when I started running intrstat as well.

That makes sense -- the processor is doing more things, so it cannot
continue to service interrupts at the higher rate.

--S
James C. McPherson
2005-Oct-10 20:34 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Seth Goldberg wrote:
> }
> }see where intr and ithr drops from 582372/582263 down to 85936/85753 ? That's
> }when I started running intrstat as well.
>
> That makes sense -- the processor is doing more things, so it cannot
> continue to service interrupts at the higher rate.

What I was not expecting was the closeness of the numbers to what
intrstat reports -- there's roughly 500 intr/sec more reported by
intrstat, but it's consistent.

I love learning experiences :)

cheers,
James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems
James C. McPherson
2005-Oct-11 01:27 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
Hi all,

Seth Goldberg has just spent a lot of cycles helping me with this,
and this email summarises the findings.

Bryan Cantrill wrote:
...
> One of three things is happening:
> (1) There is a bug in the driver that is somehow tickled by the temporal
>     distortion induced by the DTrace enabling.

Nope

> (2) The system isn't actually hung, it's just in a ton of pain. Be sure
>     that you leave the machine long enough for the entire DTrace deadman
>     to kick in (40 seconds); if DTrace detects that the machine is no
>     longer making forward progress after this time, it will kill the
>     enabling. If DTrace isn't killing off the enabling, it means that
>     the machine is still making forward progress (clock is firing,
>     processes are running, etc.)

Closest to the mark

> (3) There is a bug in the deadman mechanism, or elsewhere in DTrace
>     that is inducing (or at least not recovering from) the hang.

As far as Seth could determine, definitely not this option

The winner is a bug in my motherboard bios. I guess that's not really
a surprise :|

Seth root-caused the problem to be a new failure mode for allocation
of interrupts in this system based around interrupt line polarities
which prevents correct allocation of a non-shared IRQ for the audio
device after an IRQ has been assigned for my usb devices.

Seth concluded his analysis thus:

------------------------------------------------------------------------
maybe this board can switch polarities of interrupt controller inputs,
or maybe the laws of physics cease to exist on your board.
------------------------------------------------------------------------

Since my laptop apparently doesn't need a cpu, I'm betting on the
latter :) (http://blogs.sun.com/roller/page/jmcp/20050830)

cheers!
James C. McPherson
--
SAN Engineering Product Development
Data Management Group
Sun Microsystems
Seth Goldberg
2005-Oct-11 01:39 UTC
[dtrace-discuss] dtrace+ interrupt monitoring = hard hang -- existing bug?
}> (2) The system isn't actually hung, it's just in a ton of pain. Be sure
}>     that you leave the machine long enough for the entire DTrace deadman
}>     to kick in (40 seconds); if DTrace detects that the machine is no
}>     longer making forward progress after this time, it will kill the
}>     enabling. If DTrace isn't killing off the enabling, it means that
}>     the machine is still making forward progress (clock is firing,
}>     processes are running, etc.)
}
}Closest to the mark
}

The wrong interrupt controller input was programmed with the wrong
polarity, causing continuous interrupts to be sent to the CPU. Hey, at
least now James knows the maximum theoretical # of interrupts / second
his system can process ;).

}The winner is a bug in my motherboard bios. I guess that's not really
}a surprise :|

Specifically, the ACPI tables were lying about the interrupt polarity
for a particular set of interrupts. I modified his DSDT ACPI table,
recompiled it, and stuffed it into /boot/acpi/tables, so the system
could override the bad one offered by the BIOS.

--S
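Why a wrong polarity setting yields continuous interrupts can be shown with a toy model of a level-triggered input. This is a sketch under assumptions, not a description of James's board: the message does not say which direction the polarity was wrong, so the active-low device and the 97/3 sample split below are made up for illustration, and the function names are hypothetical.

```python
# Toy model of a level-triggered interrupt input. The active-low device
# and the sample counts are assumptions chosen for illustration only.

HIGH, LOW = 1, 0


def asserted(line_level, active_level):
    """A level-triggered input is asserted while the line sits at the
    level the controller was programmed to treat as active."""
    return line_level == active_level


def count_asserted(samples, active_level):
    """Count the samples during which the controller sees a request."""
    return sum(1 for level in samples if asserted(level, active_level))


# Hypothetical active-low device: idles HIGH, pulls LOW briefly to interrupt.
samples = [HIGH] * 97 + [LOW] * 3

print(count_asserted(samples, LOW))   # programmed to match the device
print(count_asserted(samples, HIGH))  # programmed with the opposite polarity
```

With the matching polarity the input is asserted only during the three samples where the device really requests service. With the opposite polarity it is asserted for the other 97, so the idle line itself looks like a request that never goes away.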
Snafoo
2006-Nov-19 00:00 UTC
[dtrace-discuss] Re: dtrace+ interrupt monitoring = hard hang -- existing bug?
> }> (2) The system isn't actually hung, it's just in a ton of pain. Be sure
> }>     that you leave the machine long enough for the entire DTrace deadman
> }>     to kick in (40 seconds); if DTrace detects that the machine is no
> }>     longer making forward progress after this time, it will kill the
> }>     enabling. If DTrace isn't killing off the enabling, it means that
> }>     the machine is still making forward progress (clock is firing,
> }>     processes are running, etc.)
> }
> }Closest to the mark
> }
>
> The wrong interrupt controller input was programmed with the wrong
> polarity, causing continuous interrupts to be sent to the CPU. Hey, at
> least now James knows the maximum theoretical # of interrupts / seconds
> his system can process ;).
>
> }The winner is a bug in my motherboard bios. I guess that's not really
> }a surprise :|
>
> Specifically, the ACPI tables were lying about the interrupt polarity
> for a particular set of interrupts. I modified his DSDT ACPI
> table, recompiled it, and stuffed it into /boot/acpi/tables, so the system
> could override the bad one offered by the BIOS.
>
> --S

Hi,

does anybody know the changes made to the DSDT ACPI tables? I think I've
got the same problem with my MSI K8N Neo Platinum Motherboard (this
really seems not to be the best choice for Solaris x86).

My HW Specs:

MSI K8N Neo Platinum Motherboard (nForce 3 250Gb Chipset)
AMD Athlon 64 3000+
3 x SATA HD & 1 x IDE DVD-RW
MSI ATI RX9250 (RADEON 9250)

Current intrstat output:

      device |      cpu0 %tim
-------------+---------------
  audio810#0 |     51802  2.9
      ehci#0 |         0  0.0
       nfo#0 |     51802  3.3
   pci-ide#1 |         0  0.0
   pci-ide#2 |    103605  5.3

Compared to my experiences with this system with Win XP Pro, FreeBSD 64
and Knoppix, it is extremely slow with Solaris x86, but since these
problems are often noticeable with the MSI NEO motherboards/AWARD BIOS
I'm hoping that there's already a solution.

Regards,
Snafoo

This message posted from opensolaris.org
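The intrstat figures in this message can be pulled apart with a few lines of Python. This is a sketch that assumes the single-CPU `device | count %tim` layout intrstat printed here; `parse_intrstat` is a hypothetical helper name, and the sample reproduces the numbers from the message.

```python
# Sketch: parse single-CPU intrstat output into {device: (count, pct_time)}.
# The helper name is hypothetical; the sample uses the message's numbers.

INTRSTAT_SAMPLE = """\
      device |      cpu0 %tim
-------------+---------------
  audio810#0 |     51802  2.9
      ehci#0 |         0  0.0
       nfo#0 |     51802  3.3
   pci-ide#1 |         0  0.0
   pci-ide#2 |    103605  5.3
"""


def parse_intrstat(text):
    """Return {device: (interrupt_count, percent_time)} for one CPU column."""
    counts = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # the separator row uses '+', not '|'
        name, _, rest = line.partition("|")
        name = name.strip()
        fields = rest.split()
        if name == "device" or len(fields) < 2 or not fields[0].isdigit():
            continue  # header row or anything unexpected
        counts[name] = (int(fields[0]), float(fields[1]))
    return counts


stats = parse_intrstat(INTRSTAT_SAMPLE)
shared_guess = stats["audio810#0"][0] + stats["nfo#0"][0]
print(shared_guess, stats["pci-ide#2"][0])
```

One thing the numbers show: audio810#0 and nfo#0 report identical counts, and their sum is within one interrupt of the pci-ide#2 count. That pattern might indicate the three devices are being charged for the same interrupts on a shared line, though the output alone does not prove it.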