Birger Tödtmann
2005-Jun-04 17:05 UTC
[Xen-users] kernel oops/IRQ exception when networking between many domUs
Hi,

I am trying to build experimental networks with Xen and stumbled over the same problem that Mark Doll described quite well in his posting "xen_net: Failed to connect all virtual interfaces: err=-100", here:

http://lists.xensource.com/archives/html/xen-users/2005-04/msg00447.html

As the problem was still present in 2.0.6, I tried 3.0-devel and found that NR_PIRQS and NR_DYNIRQS had been adjusted there - so I hoped for the best. I was then able to fire up my virtual test network and get it running with 20 nodes and approx. 120 interfaces, without problems at first. The vifs are wired to ~60 bridge interfaces, 2 vifs each, and I can access all domU nodes with the console etc. The kernel version is 2.6.11, coming with xen-unstable as of May 31.

The problem: after allowing free packet delivery within the network by issuing

    sysctl -w net.bridge.bridge-nf-call-iptables=0

(which had until then been set to 1, with my iptables rules blocking all traffic), the whole machine froze after a very short time (immediately to 2-3 seconds), apparently when the first packet travelled through the network. No output, no kernel oops, nothing to see - and magic sysrq was gone as well(!). This behaviour was deterministic.

I had quite some difficulty getting more information. What I finally did was to set the sysctl *before* starting the domUs. Funnily, nothing happened after starting the first 10-12 nodes, but after "xm create"ing one or two more nodes, the system oopsed with at least some info - but sysrq was gone as well. So I wrote it down on a piece of paper ;-) , hopefully someone can make sense of it:

    Stack: 00000000 d06cea20 2f001020 c8b04780 c0403f1c c028cbfa 0002f001 0000000d
           ffffffff 08b78020 00000052 00000001 00000028 0000005e 00008b85 d21fe000
           00000006 c0457824 0000011d c0453240 00283d58 e01c3a6e c0403cec da6bccd0
    Call Trace:
     [<c0109c51>] show_stack+0x80/0x96
     [<c0100de1>] show_registers+0x15a/0x1d1
     [<c010a001>] die+0x106/0x1c4
     [<c010a4aa>] do_invalid_op+0xb5/0xbf
     [<c010985b>] error_code+0x2b/0x30
     [<c028cbfa>] net_rx_action+0x484/0x4df
     [<c01239a9>] tasklet_action+0x7b/0xe0
     [<c0123533>] __do_softirq+0x6f/0xef
     [<c0123632>] do_softirq+0x7f/0x97
     [<c0123706>] irq_exit+0x3a/0x3c
     [<c010d819>] do_IRQ+0x25/0x2c
     [<c0105efe>] evtchn_do_upcall+0x62/0x82
     [<c010988c>] hypervisor_callback+0x2c/0x34
     [<c0107673>] cpu_idle+0x33/0x41
     [<c04047a9>] start_kernel+0x196/0x1e8
     [<c010006c>] 0xc010006c
    Code: 08 a8 75 30 83 c4 5b 5e 5f 5d c3 bb 01 00 00 00 31 f6 b8 0c 00 00 00 bf f0 7f 00 00 8d 4d 08 89 da cd 82 83 e8 01 2e 74 8e <0f> 0b 66 00 2c 7a 35 c0 eb 84 e8 f8 b1 09 00 eb c9 e8 f6 98 e7
    <0>Kernel panic - not syncing: Fatal exception in interrupt

Any suggestions?

Regards,
Birger

PS: I attach the scripts starting the virtual network, for the interested user. Beware, they have no decent design but are mere hacks. The root filesystem used is available here: http://www.iem.uni-due.de/~birger/downloads/root_fs
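[For readers who want to rebuild a similar topology without the attached scripts, a minimal sketch follows. The bridge naming and loop bounds are illustrative assumptions, not taken from the scripts; on a Xen dom0 of this era the domU interfaces appear as vifD.N once "xm create" has run.]

    #!/bin/sh
    # Sketch: one bridge per point-to-point link, two vifs each, roughly
    # matching the ~60-bridge / ~120-vif network described above.

    # Keep bridged frames away from iptables (the sysctl toggled above);
    # 0 = do not pass bridged traffic through the iptables hooks.
    sysctl -w net.bridge.bridge-nf-call-iptables=0

    i=0
    while [ $i -lt 60 ]; do
        brctl addbr "link$i"        # per-link bridge (name is illustrative)
        ifconfig "link$i" up
        i=$((i + 1))
    done

    # After "xm create"ing the domains, attach each vif to its link, e.g.:
    #   brctl addif link0 vif1.0
    #   brctl addif link0 vif2.0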
Keir Fraser
2005-Jun-05 16:52 UTC
Re: [Xen-devel] [Xen-users] kernel oops/IRQ exception when networking between many domUs
On 4 Jun 2005, at 18:05, Birger Tödtmann wrote:

> Funnily, nothing happened after starting the first 10-12 nodes, but
> after "xm create"ing one or two more nodes, the system oopsed with at
> least some info - but sysrq was gone as well. So I wrote it down on a
> piece of paper ;-) , hopefully someone can make sense of it:

Do you have the vmlinux file? It would be useful to know where in net_rx_action the crash is happening.

 -- Keir
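[For reference: with the vmlinux in hand, the hand-copied EIP (c028cbfa, which sits inside net_rx_action in the trace above) can be mapped back to a function offset or source line. A sketch, assuming an unstripped vmlinux and, for source lines, a kernel built with debug info:]

    # function+offset (and file:line if debug info is present)
    addr2line -e vmlinux -f c028cbfa

    # or disassemble around it, as done in the follow-up below
    gdb vmlinux
    (gdb) disassemble net_rx_action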
Birger Toedtmann
2005-Jun-06 06:42 UTC
[Xen-users] Re: kernel oops/IRQ exception when networking between many domUs
Re-post without attachments for list readers.

Keir Fraser wrote on Sun, Jun 05, 2005 at 05:52:13PM +0100:
> On 4 Jun 2005, at 18:05, Birger Tödtmann wrote:
>
> > Funnily, nothing happened after starting the first 10-12 nodes, but
> > after "xm create"ing one or two more nodes, the system oopsed with at
> > least some info - but sysrq was gone as well. So I wrote it down on a
> > piece of paper ;-) , hopefully someone can make sense of it:
>
> Do you have the vmlinux file? It would be useful to know where in
> net_rx_action the crash is happening.

Apparently it is happening somewhere here:

    [...]
    0xc028cbe5 <net_rx_action+1135>: test   %eax,%eax
    0xc028cbe7 <net_rx_action+1137>: je     0xc028ca82 <net_rx_action+780>
    0xc028cbed <net_rx_action+1143>: mov    %esi,%eax
    0xc028cbef <net_rx_action+1145>: shr    $0xc,%eax
    0xc028cbf2 <net_rx_action+1148>: mov    %eax,(%esp)
    0xc028cbf5 <net_rx_action+1151>: call   0xc028c4c4 <free_mfn>
    0xc028cbfa <net_rx_action+1156>: mov    $0xffffffff,%ecx
                                     ^^^^^^^^^^
    0xc028cbff <net_rx_action+1161>: jmp    0xc028ca82 <net_rx_action+780>
    0xc028cc04 <net_rx_action+1166>: call   0xc02c59fe <net_ratelimit>
    0xc028cc09 <net_rx_action+1171>: test   %eax,%eax
    0xc028cc0b <net_rx_action+1173>: jne    0xc028cc47 <net_rx_action+1233>
    0xc028cc0d <net_rx_action+1175>: mov    0xc0378b60,%eax
    [...]

which is, I presume, reflected by this section within net_rx_action():

    [...]
        /* Check the reassignment error code. */
        status = NETIF_RSP_OKAY;
        if ( unlikely(mcl[1].args[5] != 0) )
        {
            DPRINTK("Failed MMU update transferring to DOM%u\n", netif->domid);
            free_mfn(mdata >> PAGE_SHIFT);
            status = NETIF_RSP_ERROR;
        }
    [...]

Kernel image and System.map attached.

Regards,
 -- 
Birger Tödtmann
Technik der Rechnernetze, Institut für Experimentelle Mathematik und
Institut für Informatik und Wirtschaftsinformatik, Universität Duisburg-Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann
pgp:0x6FB166C9  icq:294947817
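[A note for readers lining the two up: PAGE_SHIFT is 12 on i386, so the "shr $0xc,%eax" is the "mdata >> PAGE_SHIFT", and 0xc028cbfa is the return address after "call free_mfn" - exactly the address in the original trace. If the kernel was built with debug info, gdb can confirm the source line directly (a sketch; debug info is an assumption):]

    gdb vmlinux
    (gdb) list *(net_rx_action+1151)    # the line containing the free_mfn() call
    (gdb) list *0xc028cbfa              # same neighbourhood, by raw address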
Keir Fraser
2005-Jun-06 08:23 UTC
Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On 5 Jun 2005, at 17:57, Birger Toedtmann wrote:

> Apparently it is happening somewhere here:
>
> [...]
> 0xc028cbe5 <net_rx_action+1135>: test   %eax,%eax
> 0xc028cbe7 <net_rx_action+1137>: je     0xc028ca82 <net_rx_action+780>
> 0xc028cbed <net_rx_action+1143>: mov    %esi,%eax
> 0xc028cbef <net_rx_action+1145>: shr    $0xc,%eax
> 0xc028cbf2 <net_rx_action+1148>: mov    %eax,(%esp)
> 0xc028cbf5 <net_rx_action+1151>: call   0xc028c4c4 <free_mfn>
> 0xc028cbfa <net_rx_action+1156>: mov    $0xffffffff,%ecx
>                                  ^^^^^^^^^^

Most likely the driver has tried to send a bogus page to a domU. Because it's bogus, the transfer fails. The driver then tries to free the page back to Xen, but that also fails because the page is bogus. This confuses the driver, which then BUG()s out.

It's not at all clear where the bogus address comes from: the driver basically just reads the address out of an skbuff and converts it from a virtual to a physical address. But something is obviously going wrong, perhaps under memory pressure. :-(

 -- Keir
Birger Tödtmann
2005-Jun-06 08:52 UTC
[Xen-users] Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Monday, 06.06.2005, at 09:23 +0100, Keir Fraser wrote:
> On 5 Jun 2005, at 17:57, Birger Toedtmann wrote:
>
> > Apparently it is happening somewhere here:
> >
> > [...]
> > 0xc028cbe5 <net_rx_action+1135>: test   %eax,%eax
> > 0xc028cbe7 <net_rx_action+1137>: je     0xc028ca82 <net_rx_action+780>
> > 0xc028cbed <net_rx_action+1143>: mov    %esi,%eax
> > 0xc028cbef <net_rx_action+1145>: shr    $0xc,%eax
> > 0xc028cbf2 <net_rx_action+1148>: mov    %eax,(%esp)
> > 0xc028cbf5 <net_rx_action+1151>: call   0xc028c4c4 <free_mfn>
> > 0xc028cbfa <net_rx_action+1156>: mov    $0xffffffff,%ecx
> >                                  ^^^^^^^^^^
>
> Most likely the driver has tried to send a bogus page to a domU.
> Because it's bogus, the transfer fails. The driver then tries to free
> the page back to Xen, but that also fails because the page is bogus.
> This confuses the driver, which then BUG()s out.

I commented out the free_mfn() and status= lines: the kernel now reports the following after it configured the 10th domU and ~80th vif, with approx. 20-25 bridges up. Just an idea: the number of vifs + bridges is somewhere around the magic 128 (the NR_IRQS problem in 2.0.x!) when the crash happens - could this hint at something?

    [...]
    Jun  6 10:12:14 lomin kernel: 10.2.23.8: port 2(vif10.3) entering forwarding state
    Jun  6 10:12:14 lomin kernel: 10.2.35.16: topology change detected, propagating
    Jun  6 10:12:14 lomin kernel: 10.2.35.16: port 2(vif10.4) entering forwarding state
    Jun  6 10:12:14 lomin kernel: 10.2.35.20: topology change detected, propagating
    Jun  6 10:12:14 lomin kernel: 10.2.35.20: port 2(vif10.5) entering forwarding state
    Jun  6 10:12:20 lomin kernel: c014cea4
    Jun  6 10:12:20 lomin kernel: [do_page_fault+643/1665] do_page_fault+0x469/0x738
    Jun  6 10:12:20 lomin kernel: [<c0115720>] do_page_fault+0x469/0x738
    Jun  6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [do_page_fault+49/1665] do_page_fault+0x217/0x738
    Jun  6 10:12:20 lomin kernel: [<c01154ce>] do_page_fault+0x217/0x738
    Jun  6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: PREEMPT
    Jun  6 10:12:20 lomin kernel: Modules linked in: dm_snapshot pcmcia bridge ipt_REJECT ipt_state iptable_filter ipt_MASQUERADE iptable_nat ip_conntrack ip_tables autofs4 snd_seq snd_seq_device evdev usbhid rfcomm l2cap bluetooth dm_mod cryptoloop snd_pcm_oss snd_mixer_oss snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore snd_page_alloc tun uhci_hcd usb_storage usbcore irtty_sir sir_dev ircomm_tty ircomm irda yenta_socket rsrc_nonstatic pcmcia_core 3c59x
    Jun  6 10:12:20 lomin kernel: CPU:    0
    Jun  6 10:12:20 lomin kernel: EIP:    0061:[do_wp_page+622/1175]    Not tainted VLI
    Jun  6 10:12:20 lomin kernel: EIP:    0061:[<c014cea4>]    Not tainted VLI
    Jun  6 10:12:20 lomin kernel: EFLAGS: 00010206   (2.6.11.11-xen0)
    Jun  6 10:12:20 lomin kernel: EIP is at handle_mm_fault+0x5d/0x222
    Jun  6 10:12:20 lomin kernel: eax: 15555b18   ebx: d8788000   ecx: 00000b18   edx: 15555b18
    Jun  6 10:12:20 lomin kernel: esi: dcfc3b4c   edi: dcaf5580   ebp: d8789ee4   esp: d8789ebc
    Jun  6 10:12:20 lomin kernel: ds: 0069   es: 0069   ss: 0069
    Jun  6 10:12:20 lomin kernel: Process python (pid: 4670, threadinfo=d8788000 task=de1a1520)
    Jun  6 10:12:20 lomin kernel: Stack: 00000040 00000001 d40e687c d40e6874 00000006 d40e685c d8789f14 dcaf5580
    Jun  6 10:12:20 lomin kernel:        dcaf55ac d40e6b1c d8789fbc c01154ce dcaf5580 d40e6b1c b4ec6ff0 00000001
    Jun  6 10:12:20 lomin kernel:        00000001 de1a1520 b4ec6ff0 00000006 d8789fc4 d8789fc4 c03405b0 00000006
    Jun  6 10:12:20 lomin kernel: Call Trace:
    Jun  6 10:12:20 lomin kernel: [dump_stack+16/32] show_stack+0x80/0x96
    Jun  6 10:12:20 lomin kernel: [<c0109c51>] show_stack+0x80/0x96
    Jun  6 10:12:20 lomin kernel: [show_registers+384/457] show_registers+0x15a/0x1d1
    Jun  6 10:12:20 lomin kernel: [<c0109de1>] show_registers+0x15a/0x1d1
    Jun  6 10:12:20 lomin kernel: [die+301/458] die+0x106/0x1c4
    Jun  6 10:12:20 lomin kernel: [<c010a001>] die+0x106/0x1c4
    Jun  6 10:12:20 lomin kernel: [do_page_fault+675/1665] do_page_fault+0x489/0x738
    Jun  6 10:12:20 lomin kernel: [<c0115740>] do_page_fault+0x489/0x738
    Jun  6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [do_page_fault+49/1665] do_page_fault+0x217/0x738
    Jun  6 10:12:20 lomin kernel: [<c01154ce>] do_page_fault+0x217/0x738
    Jun  6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
    Jun  6 10:12:20 lomin kernel: Code: 8b 47 1c c1 ea 16 83 43 14 01 8d 34 90 85 f6 0f 84 52 01 00 00 89 f2 8b 4d 10 89 f8 e8 4a d1 ff ff 85 c0 89 c2 0f 84 3c 01 00 00 <8b> 00 a8 81 75 3d 85 c0 0f 84 01 01 00 00 a8 40 0f 84 a4 00 00

> It's not at all clear where the bogus address comes from: the driver
> basically just reads the address out of an skbuff and converts it from
> a virtual to a physical address. But something is obviously going
> wrong, perhaps under memory pressure. :-(

Where - within the domUs or dom0? The latter has lots of memory at hand; the domUs are quite strapped for memory. I'll try to find out...

Regards,
 -- 
Birger Tödtmann
Technik der Rechnernetze, Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann  pgp:0x6FB166C9
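[A quick way to test the "magic 128" idea is to count the suspect resources in dom0 as domains come up. A sketch; the grep patterns are assumptions about the naming on this system (the bridges in the log above are named after subnets, e.g. 10.2.23.8), and "Dynamic-irq" is, to the best of my knowledge, the label the Xen kernels of this era give event-channel IRQs in /proc/interrupts:]

    # vifs and bridges currently configured in dom0
    ifconfig -a | grep -c '^vif'
    brctl show | sed 1d | awk 'NF >= 2' | wc -l   # one line per bridge

    # dynamic (event-channel) IRQs bound in dom0
    grep -c 'Dynamic-irq' /proc/interrupts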
Birger Tödtmann
2005-Jun-06 08:56 UTC
[Xen-users] Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Monday, 06.06.2005, at 10:52 +0200, Birger Tödtmann wrote:
[...]
> I commented out the free_mfn() and status= lines: the kernel now
> reports the following after it configured the 10th domU and ~80th vif,
> with approx. 20-25 bridges up. Just an idea: the number of vifs +
> bridges is

Correction: I meant 40-45 bridge devices are then up and running.

> somewhere around the magic 128 (the NR_IRQS problem in 2.0.x!) when
> the crash happens - could this hint at something?

 -- 
Birger Tödtmann
Technik der Rechnernetze, Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann  pgp:0x6FB166C9
Keir Fraser
2005-Jun-06 09:26 UTC
Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On 6 Jun 2005, at 09:52, Birger Tödtmann wrote:

> I commented out the free_mfn() and status= lines: the kernel now
> reports the following after it configured the 10th domU and ~80th vif,
> with approx. 20-25 bridges up. Just an idea: the number of vifs +
> bridges is somewhere around the magic 128 (the NR_IRQS problem in
> 2.0.x!) when the crash happens - could this hint at something?

The crashes you see with free_mfn removed will be impossible to debug -- things are very screwed by that point. Even the crash within free_mfn might be far removed from the cause of the crash, if it's due to memory corruption.

It's perhaps worth investigating what critical limit you might be hitting, and what resource it is that's limited. e.g., can you create a few vifs, but connected together by some very large number of bridges (daisy-chained together)? Or can you create a large number of vifs if they are connected together by just one bridge? This kind of thing will give us an idea of where the bug might be lurking.

 -- Keir
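[The second of the two suggested experiments (many vifs on a single bridge) is straightforward to script; a sketch with assumed names. The first experiment, daisy-chaining bridges in series, needs some way of patching bridges together and is omitted here:]

    # one bridge, every domU vif attached to it
    brctl addbr testbr0
    ifconfig testbr0 up
    for vif in $(ifconfig -a | awk '$1 ~ /^vif[0-9]/ {print $1}'); do
        brctl addif testbr0 "$vif"
        ifconfig "$vif" 0.0.0.0 up
    done
    brctl show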
Birger Tödtmann
2005-Jun-06 12:30 UTC
[Xen-users] Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
[...]
> > somewhere around the magic 128 (the NR_IRQS problem in 2.0.x!) when
> > the crash happens - could this hint at something?
>
> The crashes you see with free_mfn removed will be impossible to debug
> -- things are very screwed by that point. Even the crash within
> free_mfn might be far removed from the cause of the crash, if it's due
> to memory corruption.
>
> It's perhaps worth investigating what critical limit you might be
> hitting, and what resource it is that's limited. e.g., can you create
> a few vifs, but connected together by some very large number of
> bridges (daisy-chained together)? Or can you create a large number of
> vifs if they are connected together by just one bridge?

This is getting really weird - as I found out, I encounter problems with far fewer vifs/bridges than suspected. I just fired up a network with 7 nodes, all with four interfaces, each connected to the same four bridge interfaces. The nodes can ping through the network; however, after a short time the system (dom0) crashes as well. This time it dies in net_rx_action() at a slightly different place:

    [...]
     [<c02b6e15>] kfree_skbmem+0x12/0x29
     [<c02b6ed1>] __kfree_skb+0xa5/0x13f
     [<c028c9b3>] net_rx_action+0x23d/0x4df
    [...]

Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a bit unsure where to go from here... Maybe I should try a different machine for further testing.

Regards
 -- 
Birger Tödtmann
Technik der Rechnernetze, Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann  pgp:0x6FB166C9
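[This failing topology is small enough to script directly; a sketch with assumed names (7 domUs with 4 vifs each, all sharing the same 4 bridges - the vifD.N naming is the convention used throughout this thread):]

    # four shared bridges
    for i in 0 1 2 3; do
        brctl addbr "br$i"
        ifconfig "br$i" up
    done

    # after "xm create"ing the 7 domUs (vifs assumed to be vifD.0 .. vifD.3)
    for d in 1 2 3 4 5 6 7; do
        for i in 0 1 2 3; do
            brctl addif "br$i" "vif$d.$i"
            ifconfig "vif$d.$i" up
        done
    done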
Nils Toedtmann
2005-Jun-07 16:47 UTC
Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Monday, 06.06.2005, at 14:30 +0200, Birger Tödtmann wrote:
> On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
> [...]
> > > somewhere around the magic 128 (the NR_IRQS problem in 2.0.x!) when
> > > the crash happens - could this hint at something?
> >
> > The crashes you see with free_mfn removed will be impossible to debug
> > -- things are very screwed by that point. Even the crash within
> > free_mfn might be far removed from the cause of the crash, if it's
> > due to memory corruption.
> >
> > It's perhaps worth investigating what critical limit you might be
> > hitting, and what resource it is that's limited. e.g., can you create
> > a few vifs, but connected together by some very large number of
> > bridges (daisy-chained together)? Or can you create a large number of
> > vifs if they are connected together by just one bridge?
>
> This is getting really weird - as I found out, I encounter problems
> with far fewer vifs/bridges than suspected. I just fired up a network
> with 7 nodes, all with four interfaces, each connected to the same four
> bridge interfaces. The nodes can ping through the network; however,
> after a short time the system (dom0) crashes as well. This time it
> dies in net_rx_action() at a slightly different place:
>
> [...]
>  [<c02b6e15>] kfree_skbmem+0x12/0x29
>  [<c02b6ed1>] __kfree_skb+0xa5/0x13f
>  [<c028c9b3>] net_rx_action+0x23d/0x4df
> [...]
>
> Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
> bit unsure where to go from here... Maybe I should try a different
> machine for further testing.

I can confirm this bug on an AMD Athlon using xen-unstable from June 5th (latest ChangeSet 1.1677). All testing domains run OSPF daemons which start talking to each other via multicast as soon as the network connections are established.

* "xm create" 20 domains with 122 vifs (+ vif0.0), but this Xen version
  does not UP the vifs. Everything is fine.
* Create 51 transfer bridges, connect some vifs to them (not more than
  two vifs each), and UP all vifs. Now I have lo + eth0 + veth0 +
  123 vif* + 51 br* = 177 devices, all UP. All transfer networks work,
  OSPF tables grow, everything is fine.
* Create a 52nd bridge. Connect 20 vifs to it, but DOWN THEM BEFORE.
  Everything is fine.
* Now UP the vifs connected to the 52nd bridge one after the other. More
  and more multicast traffic shows up. After UPing the 9th vif, dom0
  BOOOOOMs (in net_rx_action, too).

Further experiments show that it seems to be the amount of traffic (and the number of connected vifs?) which triggers the oops: with all OSPF daemons stopped, I could UP all bridges & vifs. But when I did a flood broadcast ping (ping -f -b $broadcastadr) on the 52nd bridge (the one with more than two active ports), dom0 OOPSed again.

I could only reproduce that "too-much-traffic oops" on bridges connecting more than 10 vifs.

It would be interesting whether it happens with unicast traffic, too. I have no time left; will test more tomorrow.

/nils.

PS: Shall we continue crossposting to devel+users?
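[The incremental bring-up on the extra bridge can be scripted so that each port reaches forwarding state before the next vif comes up; a sketch, where the bridge/vif names are illustrative and the 20-second wait assumes the default STP forward delay of 15 s:]

    br=br52
    for vif in vif12.0 vif13.0 vif14.0 vif15.0; do   # names illustrative
        brctl addif "$br" "$vif"
        ifconfig "$vif" up
        sleep 20                        # let the port reach forwarding state
        echo "now up: $vif"
        brctl showstp "$br" | grep -A1 "$vif"
    done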
Nils Toedtmann
2005-Jun-08 12:34 UTC
Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Tuesday, 07.06.2005, at 18:47 +0200, Nils Toedtmann wrote:
> On Monday, 06.06.2005, at 14:30 +0200, Birger Tödtmann wrote:
> > On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
> > [...]
> > > > somewhere around the magic 128 (the NR_IRQS problem in 2.0.x!)
> > > > when the crash happens - could this hint at something?
> > >
> > > The crashes you see with free_mfn removed will be impossible to
> > > debug -- things are very screwed by that point. Even the crash
> > > within free_mfn might be far removed from the cause of the crash,
> > > if it's due to memory corruption.
> > >
> > > It's perhaps worth investigating what critical limit you might be
> > > hitting, and what resource it is that's limited. e.g., can you
> > > create a few vifs, but connected together by some very large number
> > > of bridges (daisy-chained together)? Or can you create a large
> > > number of vifs if they are connected together by just one bridge?
> >
> > This is getting really weird - as I found out, I encounter problems
> > with far fewer vifs/bridges than suspected. I just fired up a network
> > with 7 nodes, all with four interfaces, each connected to the same
> > four bridge interfaces. The nodes can ping through the network;
> > however, after a short time the system (dom0) crashes as well. This
> > time it dies in net_rx_action() at a slightly different place:
> >
> > [...]
> >  [<c02b6e15>] kfree_skbmem+0x12/0x29
> >  [<c02b6ed1>] __kfree_skb+0xa5/0x13f
> >  [<c028c9b3>] net_rx_action+0x23d/0x4df
> > [...]
> >
> > Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
> > bit unsure where to go from here... Maybe I should try a different
> > machine for further testing.
>
> I can confirm this bug on an AMD Athlon using xen-unstable from June
> 5th (latest ChangeSet 1.1677).
[...]

errr ... sorry for the dupe.

> Further experiments show that it seems to be the amount of traffic
> (and the number of connected vifs?) which triggers the oops: with all
> OSPF daemons stopped, I could UP all bridges & vifs. But when I did a
> flood broadcast ping (ping -f -b $broadcastadr) on the 52nd bridge
> (the one with more than two active ports), dom0 OOPSed again.
>
> I could only reproduce that "too-much-traffic oops" on bridges
> connecting more than 10 vifs.
>
> It would be interesting whether it happens with unicast traffic, too.
> I have no time left; will test more tomorrow.

OK, I reproduced the dom0 kernel panic in a simpler situation:

* create some domUs, each having 1 interface in the same subnet
* bridge all the interfaces together (dom0 not having an IP on that
  bridge)
* trigger as much unicast traffic as you want (like unicast flood
  pings): no problem.
* now trigger some broadcast traffic between the domUs:

      ping -i 0,1 -b 192.168.0.255

  BOOOM.

Alternatively, you may down all vifs first, start the flood broadcast ping in the first domU, and bring up one vif after the other (waiting each time >15 sec until the bridge puts the added port into forwarding state). After bringing up 10-15 vifs, dom0 panics.

I could _not_ reproduce this with massive unicast traffic. The problem disappears if I set "net.ipv4.icmp_echo_ignore_broadcasts=1" in all domains. Maybe the problem arises if too many domUs answer broadcasts at the same time (collisions?).

/nils.
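[The whole minimal reproducer above fits in a few shell lines. The subnet, the sysctl, and the flood broadcast ping are the ones from the message; the bridge and vif names are assumptions:]

    # dom0: one bridge, no IP on it, all domU vifs attached
    brctl addbr testbr0
    ifconfig testbr0 0.0.0.0 up
    for vif in $(ifconfig -a | awk '$1 ~ /^vif[0-9]/ {print $1}'); do
        brctl addif testbr0 "$vif"
        ifconfig "$vif" 0.0.0.0 up
    done

    # in each domU: 0 = answer broadcast pings (triggers the panic here),
    #               1 = ignore them (makes the problem disappear)
    sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0

    # in one domU: flood-ping the broadcast address of the shared subnet
    ping -f -b 192.168.0.255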
Nils Toedtmann
2005-Jun-08 14:40 UTC
Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
On Wednesday, 08.06.2005, at 14:34 +0200, Nils Toedtmann wrote:
[...]
> OK, I reproduced the dom0 kernel panic in a simpler situation:
>
> * create some domUs, each having 1 interface in the same subnet
> * bridge all the interfaces together (dom0 not having an IP on that
>   bridge)
> * trigger as much unicast traffic as you want (like unicast flood
>   pings): no problem.
> * now trigger some broadcast traffic between the domUs:
>
>       ping -i 0,1 -b 192.168.0.255
>
>   BOOOM.
>
> Alternatively, you may down all vifs first, start the flood broadcast
> ping in the first domU, and bring up one vif after the other (waiting
> each time >15 sec until the bridge puts the added port into forwarding
> state). After bringing up 10-15 vifs, dom0 panics.
>
> I could _not_ reproduce this with massive unicast traffic. The problem
> disappears if I set "net.ipv4.icmp_echo_ignore_broadcasts=1" in all
> domains. Maybe the problem arises if too many domUs answer broadcasts
> at the same time (collisions?).

More testing: again doing a

    [root@domUtest01 ~]# ping -f -b 192.168.0.255

into the bridged vif subnet. With all domains having "net.ipv4.icmp_echo_ignore_broadcasts=1" (so no one answers the pings), everything is fine. When I switch the pinging domUtest01 itself (and _only_ that domain) to "net.ipv4.icmp_echo_ignore_broadcasts=0", dom0 immediately panics (if there are 15-20 domUs in that bridged subnet).

Another test: putting dom0's vif0.0 on the bridge too, and pinging from dom0. This time, all domains needed to have "net.ipv4.icmp_echo_ignore_broadcasts=0" before I got my oops.

The oopses happen in different places; not all contain "net_rx_action" (all are "Fatal exception in interrupt"; these "dumps" may contain typos because I copied them from the monitor by hand):

    [...]
    error_code
    kfree_skbmem
    __kfree_skb
    net_rx_action
    tasklet_action
    __do_softirq
    soft_irq
    irq_exit
    do_IRQ
    evtchn_do_upcall
    hypervisor_callback
    __wake_up
    sock_def_readable
    unix_stream_sendmsg
    sys_sendto
    sys_send
    sys_socketcall
    syscall_call

or

    [...]
    error_code
    tasklet_action
    __do_softirq
    soft_irq
    irq_exit
    do_IRQ
    evtchn_do_upcall
    hypervisor_callback

or

    [...]
    error_code
    tasklet_action
    __do_softirq
    soft_irq
    evtchn_do_upcall
    hypervisor_callback
    cpu_idle
    start_kernel

or

    [...]
    error_code
    kfree_skbmem
    __kfree_skb
    net_rx_action
    tasklet_action
    __do_softirq
    soft_irq
    irq_exit
    do_IRQ
    evtchn_do_upcall
    hypervisor_callback
    __mmx_memcpy
    memcpy
    dup_task_struct
    copy_process
    do_fork
    sys_clone
    syscall_call

or

    [...]
    error_code
    kfree_skbmem
    __kfree_skb
    net_rx_action
    tasklet_action
    __do_softirq
    soft_irq
    irq_exit
    do_IRQ
    evtchn_do_upcall
    hypervisor_callback
    __wake_up
    sock_def_readable
    unix_stream_sendmsg
    sys_sendto
    sys_send
    sys_socketcall
    syscall_call

or

    [...]
    error_code
    kfree_skbmem
    __kfree_skb
    net_rx_action
    tasklet_action
    __do_softirq
    do_softirq
    local_bh_enable
    dev_queue_xmit
    nf_hook_slow
    ip_finish_output
    dst_output
    ip_push_pending_frames
    raw_sendmsg
    sock_sendmsg
    sys_sendmsg
    sys_socketcall
    syscall_call

and more ...
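[Since these traces had to be copied off the monitor by hand, a serial console would make capturing them easier. A sketch of a GRUB entry for a Xen dom0 of this era, based on the Xen documentation's boot options; the device, baud rate, and paths are assumptions, and dom0's console output is relayed through the hypervisor's console:]

    # /boot/grub/menu.lst (sketch)
    title Xen 3.0-devel, serial console on com1
        kernel /boot/xen.gz com1=115200,8n1 console=com1,vga
        module /boot/vmlinuz-2.6.11-xen0 root=/dev/hda1 ro
        module /boot/initrd-2.6.11-xen0.img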