James Lamanna
2010-Mar-30 00:30 UTC
[asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
Hi,
I'm trying to figure out the cause of a soft lockup I experienced:
Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029]
Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk
Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[<c046e7fe>] CPU: 0
Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
Mar 29 09:38:24 pstn1 kernel: EFLAGS: 00000286 Tainted: GF (2.6.18-128.1.10.el5 #1)
Mar 29 09:38:24 pstn1 kernel: EAX: 00000029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0
Mar 29 09:38:24 pstn1 kernel: ESI: 00000286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b
Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 000006d0
Mar 29 09:38:24 pstn1 kernel: [<c05b067c>] kfree_skbmem+0x8/0x61
Mar 29 09:38:24 pstn1 kernel: [<c05e9aaf>] __udp_queue_rcv_skb+0x4a/0x51
Mar 29 09:38:24 pstn1 kernel: [<c05ad993>] release_sock+0x44/0x91
Mar 29 09:38:24 pstn1 kernel: [<c05ea939>] udp_sendmsg+0x44e/0x514
Mar 29 09:38:24 pstn1 kernel: [<c05efdec>] inet_sendmsg+0x35/0x3f
Mar 29 09:38:24 pstn1 kernel: [<c05ab30c>] sock_sendmsg+0xce/0xe8
Mar 29 09:38:24 pstn1 kernel: [<c043464f>] autoremove_wake_function+0x0/0x2d
Mar 29 09:38:24 pstn1 kernel: [<c04ea17b>] copy_from_user+0x17/0x5d
Mar 29 09:38:24 pstn1 kernel: [<c04ea3a1>] copy_to_user+0x31/0x48
Mar 29 09:38:24 pstn1 kernel: [<f89ab141>] zt_chan_read+0x1e0/0x20b [zaptel]
Mar 29 09:38:24 pstn1 kernel: [<c04ea195>] copy_from_user+0x31/0x5d
Mar 29 09:38:24 pstn1 kernel: [<c05ac4c4>] sys_sendto+0x116/0x140
Mar 29 09:38:24 pstn1 kernel: [<c0415d4f>] flush_tlb_page+0x74/0x77
Mar 29 09:38:24 pstn1 kernel: [<c0461331>] do_wp_page+0x3bf/0x40a
Mar 29 09:38:24 pstn1 kernel: [<c04284f1>] current_fs_time+0x4a/0x55
Mar 29 09:38:24 pstn1 kernel: [<c0488f9b>] touch_atime+0x60/0x91
Mar 29 09:38:24 pstn1 kernel: [<c047d9d0>] pipe_readv+0x315/0x321
Mar 29 09:38:24 pstn1 kernel: [<c05acde4>] sys_socketcall+0x106/0x19e
Mar 29 09:38:24 pstn1 kernel: [<c0404f17>] syscall_call+0x7/0xb
Mar 29 09:38:24 pstn1 kernel: ======================
This occurred during a "high load" period (52 calls across 3 PRI
spans).
A couple days ago I moved the interrupts for my PRI card to CPU0 from
CPU3, because CPU3 was handling everything else:
           CPU0       CPU1       CPU2       CPU3
  0:        306          0          0 3684057379  IO-APIC-edge   timer
  1:          0          0          0      13468  IO-APIC-edge   i8042
  8:          0          0          0          3  IO-APIC-edge   rtc
  9:          0          0          0          0  IO-APIC-level  acpi
 12:          0          0          0          4  IO-APIC-edge   i8042
169:          0          0          0          0  IO-APIC-level  uhci_hcd:usb2
177:          0          0          0   18392593  IO-APIC-level  ata_piix
185:          0          0          0          1  IO-APIC-level  ehci_hcd:usb1
193:          0          0          0          0  IO-APIC-level  uhci_hcd:usb3
201:          0          0          0 2090021759  IO-APIC-level  eth0
209:  149621223          0          0 3534419461  IO-APIC-level  wct4xxp
(The CPU3 number for wct4xxp is not increasing any more).
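For anyone wanting to compare distributions, here is a rough sketch (not part of the original post) of how the per-CPU share of each IRQ could be computed from /proc/interrupts-style text. The function names `parse_interrupts` and `busiest_cpu` are made up for illustration. The sample rows are copied from the table above.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into (cpu_names, {irq: (counts, description)})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    ncpu = len(cpus)
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        # Skip rows that do not carry one numeric counter per CPU.
        if len(fields) < ncpu or not all(f.isdigit() for f in fields[:ncpu]):
            continue
        counts = [int(f) for f in fields[:ncpu]]
        table[label.strip()] = (counts, " ".join(fields[ncpu:]))
    return cpus, table


def busiest_cpu(cpus, counts):
    """Return the name of the CPU with the highest counter for one IRQ."""
    return cpus[counts.index(max(counts))]


# Two rows taken from the table in this post.
SAMPLE = """\
 CPU0 CPU1 CPU2 CPU3
201: 0 0 0 2090021759 IO-APIC-level eth0
209: 149621223 0 0 3534419461 IO-APIC-level wct4xxp
"""

if __name__ == "__main__":
    cpus, table = parse_interrupts(SAMPLE)
    for irq, (counts, desc) in table.items():
        total = sum(counts) or 1  # avoid dividing by zero for idle IRQs
        shares = ", ".join(f"{c}={n / total:.1%}" for c, n in zip(cpus, counts))
        print(f"IRQ {irq} ({desc}): {shares}")
```

The counters are cumulative since boot, so an IRQ that was moved recently (like wct4xxp here) still shows most of its history on the old CPU. Diffing two snapshots taken a few seconds apart would show where the interrupts are landing now.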
What is the interrupt distribution of other people's systems?
Before I made this change I was having a problem with D-channels
dropping occasionally, so I thought it might be an interrupt/load
issue.
Thank you.
-- James
Matt Watson
2010-Mar-30 03:38 UTC
[asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
Dell server by any chance? I have a similar problem with a TE220B in a Dell 1950 III server, and I've seen several other people having issues with Digium cards in Dell servers as well.

I've actually done something similar to what you have done: isolated the TE220B onto its own IRQ and set processor affinity for all the IRQs to particular cores. So far I haven't had kernel panics since doing this, but it's still a little too early to say whether it has fixed the issue 100% or not.

-- Matt
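As a sketch of the affinity setup discussed in this thread: on Linux, an IRQ is pinned by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The helper names below (`affinity_mask`, `affinity_path`) are hypothetical, and IRQ 209 is only an example taken from James's table. The code computes the value to write and does not write it.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid CPU number: {cpu}")
        mask |= 1 << cpu  # bit N selects CPU N
    return format(mask, "x")


def affinity_path(irq):
    """Return the procfs file that holds the affinity mask for an IRQ."""
    return f"/proc/irq/{irq}/smp_affinity"


if __name__ == "__main__":
    # Example only: pin IRQ 209 (wct4xxp in James's table) to CPU0.
    irq, cpus = 209, [0]
    print(f"echo {affinity_mask(cpus)} > {affinity_path(irq)}")
```

Writing the mask needs root. If the irqbalance daemon is running it may later change the affinity again, so it is worth checking whether it is active before relying on a manual setting.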