Christopher S. Aker
2012-Feb-13 22:41 UTC
BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
Xen: xen-4.1-testing @ cs 23224, 64 bit
Dom0: 3.2.5 PAE *and* 3.0.4 PAE
Hardware: both IGB and e1000e NICed machines

Under network stress testing (iperf, UDP, bidirectional), the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB and e1000e NICs, with the following:

BUG: unable to handle kernel NULL pointer dereference at 00000474
IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
*pdpt = 00000000009f7027 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 0, comm: swapper/1 Not tainted 3.2.5-2 #3 Supermicro X8DT6/X8DT6
EIP: 0061:[<c100e967>] EFLAGS: 00010006 CPU: 1
EIP is at xen_spin_lock_flags+0x27/0x70
EAX: 00000400 EBX: 00000474 ECX: 00000001 EDX: 00000001
ESI: 00000200 EDI: 00000400 EBP: e84aff08 ESP: e84afef8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process swapper/1 (pid: 0, ti=e84ae000 task=e84589c0 task.ti=e8470000)
Stack:
 00000001 00000274 e6058400 e605848c e84aff1c c16688ba 00000000 e6058400
 e605848c e84aff40 c149f1a0 c100828b c1668904 c184f880 00000474 00010f66
 f38b4000 e6058454 e84aff50 c149f23d 000005e8 ffffffff e84aff60 c149f2c5
Call Trace:
 [<c16688ba>] _raw_spin_lock_irqsave+0x2a/0x40
 [<c149f1a0>] xen_netbk_schedule_xenvif+0x80/0xf0
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c1668904>] ? _raw_spin_unlock_irqrestore+0x14/0x20
 [<c149f23d>] xen_netbk_check_rx_xenvif+0x2d/0x70
 [<c149f2c5>] tx_credit_callback+0x45/0x50
 [<c1059f92>] run_timer_softirq+0x112/0x2d0
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c10a072d>] ? check_for_new_grace_period+0x9d/0xc0
 [<c10a062e>] ? rcu_process_gp_end+0x3e/0xa0
 [<c149f280>] ? xen_netbk_check_rx_xenvif+0x70/0x70
 [<c1052586>] __do_softirq+0x96/0x1b0
 [<c10524f0>] ? irq_enter+0x70/0x70
 <IRQ>
 [<c10523e5>] ? irq_exit+0x95/0xb0
 [<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
 [<c166f647>] ? xen_do_upcall+0x7/0xc
 [<c106007b>] ? do_prlimit+0x5b/0x1b0
 [<c10013a7>] ? hypercall_page+0x3a7/0x1000
 [<c1007a42>] ? xen_safe_halt+0x12/0x20
 [<c1016fac>] ? default_idle+0x6c/0x170
 [<c100ef4e>] ? cpu_idle+0x5e/0x90
 [<c165db1f>] ? cpu_bringup_and_idle+0xd/0xf
Code: 00 00 00 00 55 b9 01 00 00 00 89 e5 83 ec 10 89 75 f8 89 d6 89 5d f4 81 e6 00 02 00 00 89 c3 89 7d fc bf 00 04 00 00 89 f8 89 ca <86> 13 84 d2 74 0a f3 90 80 3b 00 74 f3 48 75 f6 84 d2 75 0d 8b
EIP: [<c100e967>] xen_spin_lock_flags+0x27/0x70 SS:ESP 0069:e84afef8
CR2: 0000000000000474
---[ end trace 76e317879747e85d ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper/1 Tainted: G D 3.2.5-2 #3
Call Trace:
 [<c166624a>] panic+0x57/0x15f
 [<c16699b5>] oops_end+0xc5/0xd0
 [<c1032dbe>] no_context+0xbe/0x190
 [<c1032f28>] __bad_area_nosemaphore+0x98/0x140
 [<c1621cd0>] ? br_flood_deliver+0x20/0x20
 [<f2dcd0b7>] ? ebt_nat_out+0x27/0x34 [ebtable_nat]
 [<c1621cd0>] ? br_flood_deliver+0x20/0x20
 [<c1032fe2>] bad_area_nosemaphore+0x12/0x20
 [<c166bad2>] do_page_fault+0x2d2/0x3f0
 [<c15896ca>] ? nf_hook_slow+0x5a/0x110
 [<c1621d21>] ? br_dev_queue_push_xmit+0x51/0x70
 [<c1621f77>] ? br_forward_finish+0x57/0x60
 [<c1621cd0>] ? br_flood_deliver+0x20/0x20
 [<c1621df3>] ? __br_forward+0xb3/0xd0
 [<c166b800>] ? spurious_fault+0x130/0x130
 [<c16691ee>] error_code+0x5a/0x60
 [<c166b800>] ? spurious_fault+0x130/0x130
 [<c100e967>] ? xen_spin_lock_flags+0x27/0x70
 [<c16688ba>] _raw_spin_lock_irqsave+0x2a/0x40
 [<c149f1a0>] xen_netbk_schedule_xenvif+0x80/0xf0
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c1668904>] ? _raw_spin_unlock_irqrestore+0x14/0x20
 [<c149f23d>] xen_netbk_check_rx_xenvif+0x2d/0x70
 [<c149f2c5>] tx_credit_callback+0x45/0x50
 [<c1059f92>] run_timer_softirq+0x112/0x2d0
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c10a072d>] ? check_for_new_grace_period+0x9d/0xc0
 [<c10a062e>] ? rcu_process_gp_end+0x3e/0xa0
 [<c149f280>] ? xen_netbk_check_rx_xenvif+0x70/0x70
 [<c1052586>] __do_softirq+0x96/0x1b0
 [<c10524f0>] ? irq_enter+0x70/0x70
 <IRQ>
 [<c10523e5>] ? irq_exit+0x95/0xb0
 [<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
 [<c166f647>] ? xen_do_upcall+0x7/0xc
 [<c106007b>] ? do_prlimit+0x5b/0x1b0
 [<c10013a7>] ? hypercall_page+0x3a7/0x1000
 [<c1007a42>] ? xen_safe_halt+0x12/0x20
 [<c1016fac>] ? default_idle+0x6c/0x170
 [<c100ef4e>] ? cpu_idle+0x5e/0x90
 [<c165db1f>] ? cpu_bringup_and_idle+0xd/0xf
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Should we be pointing the finger at the hypervisor, or is this a dom0 issue?

Thanks,
-Chris
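One detail worth noting about this report: the fault address 00000474 matches EBX in the register dump, and the marked instruction in the Code: line (<86> 13, i.e. xchg %dl,(%ebx)) is the byte-exchange that grabs the lock. That is the classic signature of taking a spinlock that lives at offset 0x474 inside a structure whose base pointer is NULL: computing the address of the field does not fault, but the xchg on it does. A minimal userspace sketch of that failure shape, with an entirely made-up layout standing in for whatever structure really owns the lock (and noting that dereferencing through NULL is undefined behaviour in C; this only mirrors what happens at runtime in the kernel):

    #include <stdio.h>
    #include <stddef.h>

    struct fake_netbk {
        /* padding standing in for the real fields before the lock */
        char         other_state[0x474];
        unsigned int schedule_list_lock;  /* lands at offset 0x474 */
    };

    int main(void)
    {
        struct fake_netbk *netbk = NULL;

        /* &netbk->schedule_list_lock evaluates to (void *)0x474; a
         * real spin_lock() on it would fault at exactly that address. */
        printf("lock address with a NULL base: %p\n",
               (void *)&netbk->schedule_list_lock);
        return 0;
    }

So the question the thread turns to below is which structure owns a lock at that offset, and how its base pointer ends up NULL.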
Christopher S. Aker
2012-Feb-14 00:54 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
> Under network stress testing (iperf, UDP, bidirectional), the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB and e1000e NICs, with the following:
>
> BUG: unable to handle kernel NULL pointer dereference at 00000474
> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70

This happens regardless of whether CONFIG_PARAVIRT_SPINLOCKS is enabled or disabled.

-Chris
Ian Campbell
2012-Feb-14 10:09 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
On Tue, 2012-02-14 at 00:54 +0000, Christopher S. Aker wrote:
> On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
> > Under network stress testing (iperf, UDP, bidirectional), the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB and e1000e NICs, with the following:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 00000474
> > IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
>
> This happens regardless of whether CONFIG_PARAVIRT_SPINLOCKS is enabled or disabled.

I think that rules out the recent pv spinlock bug (fixed by 7a7546b377bdaa25ac77f33d9433c59f259b9688, which is in various stable trees AFAIK).

What line of code does that IP correspond to within xen_spin_lock_flags? Likewise for the one in xen_netbk_schedule_xenvif from the stack.

I suspect the lock must be &netbk->net_schedule_list_lock, but I don't see how that can ever be NULL, nor does its offset appear to be 0x474, at least in my tree -- although that may depend on debug options.

Are you rebooting guests or plugging/unplugging vifs while this is going on? What about hotplugging CPUs (in dom0 in particular)?

Does this happen as soon as the test starts, or does it work for a bit before failing?

Ian.
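Both questions can be answered mechanically with the vmlinux that produced the oops: addr2line -e vmlinux c100e967 (or gdb's "list *0xc100e967") gives the source line for the IP, and the offset guess can be checked with pahole or a throwaway printk. A hypothetical sketch of the latter, meant to be compiled temporarily into drivers/net/xen-netback/netback.c (where struct xen_netbk is defined) and called from the driver's init path; the field name comes from Ian's mail, and the function name is invented:

    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/stddef.h>

    /* Throwaway debugging aid, not part of any tree: report where
     * net_schedule_list_lock actually sits with this configuration,
     * for comparison against the faulting address 0x474 in the oops. */
    static void __init dump_netbk_lock_offset(void)
    {
            pr_info("xen-netback: net_schedule_list_lock at offset 0x%zx\n",
                    offsetof(struct xen_netbk, net_schedule_list_lock));
    }

As Ian notes, the offset can shift with debug options (lockdep in particular fattens spinlock_t), so the comparison only means something against the exact config that crashed.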
Andrew Cooper
2012-Feb-14 13:58 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
On 14/02/12 10:09, Ian Campbell wrote:
> On Tue, 2012-02-14 at 00:54 +0000, Christopher S. Aker wrote:
>> On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
>>> Under network stress testing (iperf, UDP, bidirectional), the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB and e1000e NICs, with the following:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 00000474
>>> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
>> This happens regardless of whether CONFIG_PARAVIRT_SPINLOCKS is enabled or disabled.
> I think that rules out the recent pv spinlock bug (fixed by
> 7a7546b377bdaa25ac77f33d9433c59f259b9688, which is in various stable
> trees AFAIK).
>
> What line of code does that IP correspond to within xen_spin_lock_flags?
> Likewise for the one in xen_netbk_schedule_xenvif from the stack.
>
> I suspect the lock must be &netbk->net_schedule_list_lock, but I don't
> see how that can ever be NULL, nor does its offset appear to be 0x474,
> at least in my tree -- although that may depend on debug options.
>
> Are you rebooting guests or plugging/unplugging vifs while this is going
> on? What about hotplugging CPUs (in dom0 in particular)?
>
> Does this happen as soon as the test starts, or does it work for a bit
> before failing?
>
> Ian.

I don't know if this is related, but it looks very similar to a bug a friend of mine encountered. I tried to investigate but got nowhere. The panic can be found at: pastebin.com/ExCwhzpy

The panic looks as if it is on the same logical instruction. (There is a 32/64-bit difference which would likely explain the slightly different byte offset in the dereference.)

The difference here is that this bug is from the ext4 path, suggesting it might be a spinlock problem rather than a network problem (assuming, of course, that this is in fact the same bug).

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, citrix.com
Christopher S. Aker
2012-Feb-14 15:28 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
On 2/14/12 5:09 AM, Ian Campbell wrote:
> I suspect the lock must be &netbk->net_schedule_list_lock, but I don't
> see how that can ever be NULL, nor does its offset appear to be 0x474,
> at least in my tree -- although that may depend on debug options.
>
> Are you rebooting guests or plugging/unplugging vifs while this is going on?

Sorry - I'm relaying some of this on behalf of another team member.

The dom0 insta-panics when running iperf and then bringing a vif down (for a network rules rebuild). We've only been able to trigger it when the dom0 is dealing with a boatload of packets. So it smells like a race between the vif being marked down and something thinking it can still deliver packets to it...

root@dom0# ip link set vif2.0 down   (return = BOOM)
br0: port 4(vif2.0) entering forwarding state
BUG: unable to handle kernel NULL pointer dereference at 00000474
IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
...

-Chris
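Putting that reproducer next to the call trace suggests a shape for the race. The sketch below is a simplification, not the real netback code: the structure layouts are hypothetical, the lock name comes from Ian's mail, and whether teardown really clears the vif's netback pointer at this point is exactly the open question. But if it does, a tx credit timer that can still fire afterwards would fault at NULL + 0x474:

    #include <linux/spinlock.h>

    /* Simplified, hypothetical shapes of the structures involved. */
    struct xen_netbk {
            /* ... other per-group state ... */
            spinlock_t net_schedule_list_lock;  /* suppose offset 0x474 */
    };

    struct xenvif {
            struct xen_netbk *netbk;  /* cleared during teardown? */
    };

    /* Roughly what the timer path does, per the call trace:
     * tx_credit_callback -> xen_netbk_check_rx_xenvif ->
     * xen_netbk_schedule_xenvif -> spin_lock_irqsave(). If vif->netbk
     * has already been cleared when the timer fires, this dereferences
     * NULL + 0x474, matching the oops. */
    static void schedule_xenvif_sketch(struct xenvif *vif)
    {
            unsigned long flags;

            spin_lock_irqsave(&vif->netbk->net_schedule_list_lock, flags);
            /* ... put the vif on the schedule list ... */
            spin_unlock_irqrestore(&vif->netbk->net_schedule_list_lock, flags);
    }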
Ian Campbell
2012-Feb-14 15:37 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
On Tue, 2012-02-14 at 15:28 +0000, Christopher S. Aker wrote:
> On 2/14/12 5:09 AM, Ian Campbell wrote:
> > I suspect the lock must be &netbk->net_schedule_list_lock, but I don't
> > see how that can ever be NULL, nor does its offset appear to be 0x474,
> > at least in my tree -- although that may depend on debug options.
> >
> > Are you rebooting guests or plugging/unplugging vifs while this is going on?
>
> Sorry - I'm relaying some of this on behalf of another team member.
>
> The dom0 insta-panics when running iperf and then bringing a vif down
> (for a network rules rebuild). We've only been able to trigger it when
> the dom0 is dealing with a boatload of packets. So it smells like a
> race between the vif being marked down and something thinking it can
> still deliver packets to it...
>
> root@dom0# ip link set vif2.0 down   (return = BOOM)
> br0: port 4(vif2.0) entering forwarding state
> BUG: unable to handle kernel NULL pointer dereference at 00000474
> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
> ...

I wonder if this is related to the discussion in
lists.xen.org/archives/html/xen-devel/2011-09/msg00981.html
lists.xen.org/archives/html/xen-devel/2011-09/msg00985.html

Jan, do you recall what you did about that issue?

Is it as simple as moving the call to netif_stop_queue(dev) in xenvif_close before the carrier check?

Ian.
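For concreteness, here is roughly what that reordering would look like, assuming xenvif_close in these trees has the upstream shape (carrier check first, then netif_stop_queue); this is a sketch of the suggestion, not a tested patch:

    static int xenvif_close(struct net_device *dev)
    {
            struct xenvif *vif = netdev_priv(dev);

            /* Stop the queue before the carrier check, so nothing can
             * queue new packets while xenvif_down() tears things down. */
            netif_stop_queue(dev);
            if (netif_carrier_ok(dev))
                    xenvif_down(vif);
            return 0;
    }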
Jan Beulich
2012-Feb-14 15:52 UTC
Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
>>> On 14.02.12 at 16:37, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2012-02-14 at 15:28 +0000, Christopher S. Aker wrote:
>> On 2/14/12 5:09 AM, Ian Campbell wrote:
>> > I suspect the lock must be &netbk->net_schedule_list_lock, but I don't
>> > see how that can ever be NULL, nor does its offset appear to be 0x474,
>> > at least in my tree -- although that may depend on debug options.
>> >
>> > Are you rebooting guests or plugging/unplugging vifs while this is going on?
>>
>> Sorry - I'm relaying some of this on behalf of another team member.
>>
>> The dom0 insta-panics when running iperf and then bringing a vif down
>> (for a network rules rebuild). We've only been able to trigger it when
>> the dom0 is dealing with a boatload of packets. So it smells like a
>> race between the vif being marked down and something thinking it can
>> still deliver packets to it...
>>
>> root@dom0# ip link set vif2.0 down   (return = BOOM)
>> br0: port 4(vif2.0) entering forwarding state
>> BUG: unable to handle kernel NULL pointer dereference at 00000474
>> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
>> ...
>
> I wonder if this is related to the discussion in
> lists.xen.org/archives/html/xen-devel/2011-09/msg00981.html
> lists.xen.org/archives/html/xen-devel/2011-09/msg00985.html
>
> Jan, do you recall what you did about that issue?

I did what I indicated we were asked to at the time - add a safety check to start_xmit() (which xen-netback appears to have too, albeit in a slightly different form). Also, start_xmit() doesn't appear on the call stack Christopher sent.

> Is it as simple as moving the call to netif_stop_queue(dev) in
> xenvif_close before the carrier check?

As I said back then, that would appear to be more consistent, but you're much more of an expert here than I am. We continue to have netif_stop_queue() after the carrier check in our trees, without anyone having reported a problem like this since we made the above adjustment.

Jan
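To make the kind of safety check Jan describes concrete, a rough sketch of such an early-out in the transmit path is below. This is not Jan's actual patch; the exact condition (vif->netbk, the carrier state, or both) and the drop accounting are assumptions layered on the general shape of xenvif_start_xmit:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static netdev_tx_t xenvif_start_xmit(struct sk_buff *skb,
                                         struct net_device *dev)
    {
            struct xenvif *vif = netdev_priv(dev);

            /* Safety check: drop the packet rather than touch per-netbk
             * state through a pointer teardown may already have cleared. */
            if (vif->netbk == NULL || !netif_carrier_ok(dev))
                    goto drop;

            /* ... normal transmit path ... */
            return NETDEV_TX_OK;

    drop:
            dev->stats.tx_dropped++;
            dev_kfree_skb_any(skb);
            return NETDEV_TX_OK;
    }

Note that, as Jan points out, start_xmit() is not on the reported call stack (the crash comes in via the tx credit timer), so a check like this would paper over only the bridged-transmit variant of the race, not the timer-driven one.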