Christopher Thunes

2008-Jun-27 15:59 UTC

[Xen-devel] System crash on pskb_expand_head call

Hi all,
   I wanted to drop this here and see if anyone has any ideas on where 
this issue may be stemming from. We run an unusual setup which involves 
running arptables on bridge devices for our HVM guests. Here is all the 
debug information we currently have:

Kernel BUG at ...ib/xen-3.2.1/linux-2.6.18-xen.hg/net/core/skbuff.c:695
invalid opcode: 0000 [1] SMP
CPU 3
Modules linked in: xt_mac tun arptable_filter arp_tables xt_physdev
iptable_filter ip_tables x_tables bridge ipv6 button ac battery nfs
lockd nfs_acl sunrpc sg sr_mod parport_pc parport floppy serio_raw
pcspkr i2c_i801 i2c_core joydev ext3 jbd dm_mirror dm_snapshot dm_mod
sd_mod ide_cd cdrom usbhid usb_storage aacraid ehci_hcd e1000 piix
scsi_mod uhci_hcd usbcore thermal processor fan
Pid: 9699, comm: qemu-dm Tainted: GF     2.6.18.8-xen #1
RIP: e030:[<ffffffff803948e5>]  [<ffffffff803948e5>] pskb_expand_head+0x2a/0x138
RSP: e02b:ffff880007d0fc08  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880019b6c0c0 RCX: ffff880078365000
RDX: 0000000000000134 RSI: 0000000000000020 RDI: ffff880078365100
RBP: ffff88007f10e000 R08: ffff880078365012 R09: 0000000000000194
R10: ffff88007f10e000 R11: ffffffff8039b349 R12: 0000000000000000
R13: ffff8800789f7e04 R14: 0000000000000002 R15: 0000000000000178
FS:  00002ba5be9835f0(0000) GS:ffffffff804d7180(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process qemu-dm (pid: 9699, threadinfo ffff880048626000, task 
ffff880001239080)
Stack:  ffff8800789f7dc0 ffff880019b6c0c0 ffff88007f10e000 ffff880019b6c0c0
  ffff8800789f7e04 ffffffff80394a4b 0000000000000178 ffff880019b6c0c0
  ffff88007f10e000 ffff880019b6c0c0 ffff8800789f7e04 0000000000000002
Call Trace:
  <IRQ> [<ffffffff80394a4b>] __pskb_pull_tail+0x58/0x26e
  [<ffffffff8039b4a3>] dev_queue_xmit+0x15a/0x313
  [<ffffffff8039e9e2>] neigh_update+0x304/0x3d9
  [<ffffffff803a7ccd>] eth_header_cache_update+0x0/0x12
  [<ffffffff803d6cc8>] arp_process+0x579/0x5c2
  [<ffffffff803d674f>] arp_process+0x0/0x5c2
  [<ffffffff803b0b22>] nf_hook_slow+0x58/0xc4
  [<ffffffff803d674f>] arp_process+0x0/0x5c2
  [<ffffffff803d6e17>] arp_rcv+0x106/0x129
  [<ffffffff80398d95>] netif_receive_skb+0x0/0x2eb
  [<ffffffff80398ffb>] netif_receive_skb+0x266/0x2eb
  [<ffffffff88247950>] :bridge:br_pass_frame_up+0x67/0x69
  [<ffffffff88247a18>] :bridge:br_handle_frame_finish+0xc6/0xf8
  [<ffffffff88247bd2>] :bridge:br_handle_frame+0x188/0x1a6
  [<ffffffff80398f5f>] netif_receive_skb+0x1ca/0x2eb
  [<ffffffff8039af67>] process_backlog+0xd0/0x182
  [<ffffffff8039b1e2>] net_rx_action+0xe3/0x24a
  [<ffffffff802356ec>] __do_softirq+0x83/0x117
  [<ffffffff8020b1ac>] call_softirq+0x1c/0x28
  <EOI> [<ffffffff8020d01f>] do_softirq+0x6a/0xeb
  [<ffffffff8039936e>] netif_rx_ni+0x19/0x1d
  [<ffffffff882702b8>] :tun:tun_chr_writev+0x1d0/0x204
  [<ffffffff88270306>] :tun:tun_chr_write+0x1a/0x1f
  [<ffffffff802803f7>] vfs_write+0xce/0x174
  [<ffffffff802809b5>] sys_write+0x45/0x6e
  [<ffffffff8020a4fc>] system_call+0x68/0x6d
  [<ffffffff8020a494>] system_call+0x0/0x6d

We have been experiencing really nasty whole-system crashes from this on 
multiple machines during the last few weeks, and I'm hoping that someone 
may be able to shed some light on the issue. If there is any more 
information you think I may be able to provide, just let me know. From 
what I can tell so far, it seems to come from the combination of running 
arptables on a bridge device, as this traceback is identical on all the 
systems where we are experiencing the issue.

Much Thanks,
Christopher Thunes


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Christopher Thunes

2008-Jun-30 18:23 UTC

Re: [Xen-devel] System crash on pskb_expand_head call

Hello again,

I've looked into this again and found the following. In skbuff.c, the 
line it is crashing on is a deliberate BUG() assertion, found here:

	if (skb_shared(skb))
		BUG();

To attempt to debug this, we have replaced the above with:

	if (skb_shared(skb)) {
		printk(KERN_INFO "skb_shared BUG problem detected\n");
		printk(KERN_INFO "  skb->users: %d\n", atomic_read(&skb->users));
		printk(KERN_INFO "  skb->dev: %s\n", skb->dev->name);
		printk(KERN_INFO "  skb->fclone: %d\n", skb->fclone);
	}

We have also removed all iptables and arptables rules from our setup and 
ensured that no (ip/arp)tables modules are loaded at all. We 
triggered this bug again and got the following output in dmesg:

	Hello skb_shared BUG problem detected
	  skb->users: 2
	  skb->dev: br0
	  skb->fclone: 0
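
For readers following along, here is a minimal user-space sketch of why `skb->users: 2` trips the check. It assumes that skb_shared() in this kernel generation simply tests whether the reference count differs from 1. The `model_*` names and the one-field struct are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for struct sk_buff: only the reference count. */
struct model_skb {
	int users;	/* models atomic_t skb->users */
};

/* Models skb_shared(): true when the skb has more than one holder. */
static int model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/* Models skb_get(): a second holder takes its own reference. */
static struct model_skb *model_skb_get(struct model_skb *skb)
{
	skb->users++;
	return skb;
}

/*
 * Models the guard at the top of pskb_expand_head(). Returns -1 at the
 * point where the real kernel would hit BUG(), 0 otherwise.
 */
static int model_expand_head(const struct model_skb *skb)
{
	if (model_skb_shared(skb)) {
		fprintf(stderr, "shared skb: users=%d\n", skb->users);
		return -1;
	}
	return 0;
}
```

With one holder the guard passes; as soon as a second reference is taken, the same call fails, which matches the `users: 2` reading above.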

My original suspicion was that this problem was related to arptables, 
given the traceback I originally posted, but now that it is not even loaded 
I don't think it can be the cause. At the time the bug was triggered, the 
system wasn't under any abnormal conditions; load and network IO were 
normal. Our network setup within Xen is fairly simple. 
We run mostly HVM guests, and all tap devices are in a common network 
placed on a bridge device. IP forwarding is enabled to allow traffic 
between the bridge and the outside network. I would really appreciate some 
insight into this, as I'm not quite sure where to go from here.

Thanks again,
Christopher Thunes



Keir Fraser

2008-Jul-01 10:36 UTC

Re: [Xen-devel] System crash on pskb_expand_head call

I'm afraid this isn't very helpful to you, but this looks like a
non-Xen-specific bug. Probably the crash is happening via a call to
__skb_linearize() in dev_queue_xmit(). Either:
 1) dev_queue_xmit() should not be called with an skbuff with skb->users
greater than one; or
 2) dev_queue_xmit() should not be calling __skb_linearize().
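
As an illustration of what option 1 would mean for a caller, here is a user-space sketch of the "make sure I am the only holder before handing the buffer on" pattern, loosely modelled on the kernel's skb_share_check(). Everything named `model_*` is hypothetical. Note one deliberate simplification: the real helper clones the skb, whereas this model copies the payload. It is not a proposed fix for the crash in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: refcount plus a small payload. */
struct model_skb {
	int users;
	size_t len;
	unsigned char data[64];
};

/* Allocate a buffer with a single holder (users == 1). */
static struct model_skb *model_skb_alloc(const void *payload, size_t len)
{
	struct model_skb *skb;

	if (len > sizeof(skb->data))
		return NULL;
	skb = calloc(1, sizeof(*skb));
	if (!skb)
		return NULL;
	skb->users = 1;
	skb->len = len;
	memcpy(skb->data, payload, len);
	return skb;
}

/* Models kfree_skb(): drop one reference, free on the last one. */
static void model_kfree_skb(struct model_skb *skb)
{
	if (skb && --skb->users == 0)
		free(skb);
}

/*
 * Loosely models skb_share_check(): if someone else also holds the buffer,
 * give the caller a private one and drop the caller's reference to the
 * shared original. Returns NULL if the private copy cannot be allocated.
 */
static struct model_skb *model_skb_share_check(struct model_skb *skb)
{
	struct model_skb *priv;

	if (skb->users == 1)
		return skb;
	priv = model_skb_alloc(skb->data, skb->len);
	model_kfree_skb(skb);
	return priv;
}
```

A caller following option 1 would run its buffer through such a check before transmitting, so that the transmit path only ever sees buffers with a single holder.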

I don't know enough about the details of Linux's internal networking
interfaces to say which of the above ought to be the case. It doesn't look
like the backtrace you supplied touches any Xen-specific code in the kernel.
That is, I would expect you could repro this issue with a vanilla (non-Xen)
kernel set up in a similar way and with a user-space test harness stuffing
appropriately crafted packets through tuntap. If you could repro on a more
modern Linux kernel you may well get some help from the Linux networking
maintainers. It might even be worth posting them this crash despite it being
against an 'old' kernel version.
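
A starting point for such a harness might look like the sketch below. It assumes the standard /dev/net/tun interface with TUNSETIFF, IFF_TAP and IFF_NO_PI; the function names and the choice of a broadcast ARP request are assumptions for illustration only. Which frames, if any, reproduce the crash is not known from this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#define ARP_FRAME_LEN 42	/* 14-byte Ethernet header + 28-byte ARP body */

/*
 * Build a broadcast ARP request (Ethernet/IPv4) into buf.
 * Returns the frame length, or 0 if buf is too small.
 */
static size_t build_arp_request(unsigned char *buf, size_t cap,
				const unsigned char src_mac[6],
				const unsigned char src_ip[4],
				const unsigned char dst_ip[4])
{
	/* htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, oper=1 */
	static const unsigned char fixed[8] = {
		0x00, 0x01, 0x08, 0x00, 6, 4, 0x00, 0x01
	};

	if (cap < ARP_FRAME_LEN)
		return 0;
	memset(buf, 0xff, 6);		/* destination: broadcast */
	memcpy(buf + 6, src_mac, 6);	/* source MAC */
	buf[12] = 0x08;			/* ethertype 0x0806 = ARP */
	buf[13] = 0x06;
	memcpy(buf + 14, fixed, 8);
	memcpy(buf + 22, src_mac, 6);	/* sender hardware address */
	memcpy(buf + 28, src_ip, 4);	/* sender protocol address */
	memset(buf + 32, 0, 6);		/* target hardware address: unknown */
	memcpy(buf + 38, dst_ip, 4);	/* target protocol address */
	return ARP_FRAME_LEN;
}

/* Open (or create) a tap device with the given name. Needs root. */
int open_tap(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;	/* raw Ethernet frames */
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Write one frame into the tap; the kernel receives it as if from the wire. */
int inject_frame(int fd, const unsigned char *frame, size_t len)
{
	return write(fd, frame, len) == (ssize_t)len ? 0 : -1;
}
```

To approximate the setup in this thread, the tap device would be brought up and added to a bridge (for example br0) before frames are injected. A write() on the tap descriptor enters the kernel through the tun driver's write path, the same entry point that appears at the bottom of the posted call trace.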

 --Keir
