Hi Wei Liu,

I am doing some network performance testing on Xen 4.1.2 with kernel 3.0, and occasionally hit a crash at BUG_ON(netbk->mmap_pages[idx] != page) in netbk_gop_frag(). From analyzing the drivers/xen/netback module, I think the cause is as follows when sending packets from VM1 to VM2:

1) Two netback threads (the first for VM1's transmit, the second for VM2's receive) run concurrently.

2) The first netback thread does a delayed copy from a foreign granted page to local memory when outstanding packets have been pending too long (above half of one HZ). Then netbk->mmap_pages[idx] is replaced with a newly allocated page.

3) If the packets are forwarded to VM2 by the virtual switch, netbk_gop_frag() is called in the second netback thread. That function checks whether each page in skb frags[] is foreign in order to decide how to do the grant copy.

4) If the page replacement happens after the foreign-page check in netbk_gop_frag(), the BUG is triggered because the page from skb frags[] differs from mmap_pages[idx].

I tried using a spin_lock to protect the page access, but found no appropriate solution. How can this problem be fixed? Would you share some opinions?

In addition, I have tried turning off copy_skb. Then the vif netdevice may not be released after shutting down the VM, because outstanding packets hold a reference on the device for too long, for some as yet unknown reason. It may be that the NIC does not release packets after DMA. Has anyone met such problems? Thanks.

Best regards,
Jerry
>>> On 16.10.13 at 06:13, jerry <jerry.lilijun@huawei.com> wrote:
> Hi Wei Liu,
>
> I am doing some network performance on Xen4.1.2 and kernel 3.0, and get a
> crash with BUG_ON(netbk->mmap_pages[idx] != page) in netbk_gop_frag()
> accidentally.
>
> By analyzing the module drivers/xen/netback,

You aren't looking at the upstream driver, are you? If so, Wei is
very likely the wrong addressee.

Assuming that you instead talk of the SLE11 kernel, I can only
point out that a problem in that code was found and fixed a
couple of months ago (resulting in the BUG_ON() you quoted not
being there anymore), so you're simply not looking at up-to-date
code.

Jan

> I think the reason is as follows when sending packets from VM1 to VM2:
[...]
> Best regards,
> Jerry
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Hi Jan,

Thanks for your reply. Yes, I am using the SLE11 kernel 3.0.58, which is not up-to-date as you assumed. I found a related patch named xen-netback-generalize, committed on Aug 7 and applied to the SLE11 kernel 3.0.98; the BUG_ON(netbk->mmap_pages[idx] != page) has been removed by that patch.

But there may still be concurrency problems in my test. If the page replacement in copy_pending_req() happens after netif_get_page_ext() in netbk_gop_frag(), copy_gop->flags is wrongly marked with GNTCOPY_source_gref. At that point the page in the skb has been replaced with Dom0-local memory, so the later HYPERVISOR_multicall() with GNTTABOP_copy in netbk_rx_actions() gets errors. The message shown is:

(XEN) grant_table.c:305:d0 Bad flags (0) or dom (0). (expected dom 0)

Would you like to share some opinions?

Regards,
Jerry

On 2013/10/16 19:10, Jan Beulich wrote:
> You aren't looking at the upstream driver, are you? If so, Wei is
> very likely the wrong addressee.
>
> Assuming that you instead talk of the SLE11 kernel, I can only
> point out that a problem in that code was found and fixed a
> couple of months ago (resulting in the BUG_ON() you quoted not
> being there anymore), so you're simply not looking at up-to-date
> code.
>
> Jan
[...]
>>> On 17.10.13 at 09:41, jerry <jerry.lilijun@huawei.com> wrote:
> But there may still be concurrency problems in my test.
> If the page replacement in copy_pending_req() happens after
> netif_get_page_ext() in netbk_gop_frag(), copy_gop->flags is wrongly marked
> with GNTCOPY_source_gref.
[...]
> (XEN) grant_table.c:305:d0 Bad flags (0) or dom (0). (expected dom 0)
>
> Would you like to share some opinions?

At a first glance that seems possible, but the question is - does it
cause any problems other than the quoted message to be issued
(and the problematic packet getting re-transmitted)? I'm asking
mainly because fixing this would appear to imply adding locking to
these paths - with the risk of adversely affecting performance.

Jan
Hi Jan,

In my test, the grant table copy error may cause the VM to crash. The stack is as follows:

kernel BUG at /linux/driver/redhat6.2/xen-vnif/xen-netfront.c:372!
Pid: 2658, comm: iperf Not tainted 2.6.32-220.el6.x86_64 #1 Xen HVM domU
RIP: 0010:[<ffffffffa01166ca>] [<ffffffffa01166ca>] xennet_tx_buf_gc+0x18a/0x1f0 [xen_netfront]
RSP: 0018:ffff880004403df8 EFLAGS: 00010096
RAX: 0000000000000049 RBX: ffff8800821986e0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
RBP: ffff880004403e48 R08: ffffffff81c00690 R09: 0000000000000080
R10: 0000000000013030 R11: 0000000000000000 R12: 000000000000003b
R13: 000000000000023d R14: 0000000000000011 R15: 0000000000000011
FS: 00007fd8fd97e700(0000) GS:ffff880004400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000030270aab70 CR3: 0000000080cf4000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process iperf (pid: 2658, threadinfo ffff8800813ba000, task ffff880080d0eb00)
Stack:
 ffff880082198020 ffff880082198f90 ffff88007f8d00c0 0000003f04415fc0
<0> ffff880004403e28 ffff880082198768 ffff880082198020 ffff8800821986e0
<0> 0000000000000282 0000000000000100 ffff880004403e78 ffffffffa0117d4c
Call Trace:
 <IRQ>
 [<ffffffffa0117d4c>] xennet_interrupt+0x4c/0xb0 [xen_netfront]
 [<ffffffff810d94f0>] handle_IRQ_event+0x60/0x170
 [<ffffffff8109b8a3>] ? ktime_get+0x63/0xe0
 [<ffffffff810dbc2e>] handle_edge_irq+0xde/0x180
 [<ffffffff812fe809>] __xen_evtchn_do_upcall+0x1b9/0x1f0
 [<ffffffff812fedbf>] xen_evtchn_do_upcall+0x2f/0x50
 [<ffffffff8100c373>] xen_hvm_callback_vector+0x13/0x20

The BUG code in xen-netfront.c xennet_tx_buf_gc() is:

	if (unlikely(gnttab_query_foreign_access(
		np->grant_tx_ref[id]) != 0)) {
		printk(KERN_ALERT "xennet_tx_buf_gc: warning "
		       "-- grant still in use by backend "
		       "domain.\n");
		BUG();
	}

My guess at the reason is as follows:
1) Xen: _set_status(), called in the hypercall path __gnttab_copy() -> __acquire_grant_for_copy(), fails, and the grant ref is not ended, so the GTF_reading bit cannot be cleared.
2) Netfront: this module invokes BUG() when it sees that the GTF_reading bit is still set.

Regards,
Jerry

On 2013/10/17 16:00, Jan Beulich wrote:
> At a first glance that seems possible, but the question is - does it
> cause any problems other than the quoted message to be issued
> (and the problematic packet getting re-transmitted)? I'm asking
> mainly because fixing this would appear to imply adding locking to
> these paths - with the risk of adversely affecting performance.
>
> Jan
[...]
>>> On 17.10.13 at 12:26, jerry <jerry.lilijun@huawei.com> wrote:
> Hi Jan,

please don't top post.

> In my test, the grant table copy error may cause the VM to crash.
> The stack is as follows:
> kernel BUG at /linux/driver/redhat6.2/xen-vnif/xen-netfront.c:372!
> ...
> The BUG code in xen-netfront.c xennet_tx_buf_gc() is:
> 	if (unlikely(gnttab_query_foreign_access(
> 		np->grant_tx_ref[id]) != 0)) {
> 		printk(KERN_ALERT "xennet_tx_buf_gc: warning "
> 		       "-- grant still in use by backend "
> 		       "domain.\n");
> 		BUG();
>
> My guess at the reason is as follows:
> 1) Xen: _set_status(), called in the hypercall path __gnttab_copy() ->
> __acquire_grant_for_copy(), fails, and the grant ref is not ended,
> so the GTF_reading bit cannot be cleared.
> 2) Netfront: this module invokes BUG() when it sees that the GTF_reading bit is
> still set.

If that was the case, this would be a hypervisor bug: a grant copy
operation is supposed to hold the grant active only for as long as
the copy operation takes. You'll in particular notice that
__acquire_grant_for_copy() in its error path clears GTF_reading
(and GTF_writing, as appropriate) again. You'd likely need to
instrument the code to demonstrate (via a couple of extra log
messages) what you think is not working properly here.

Jan

[...]
On 2013/10/17 20:11, Jan Beulich wrote:
> If that was the case, this would be a hypervisor bug: a grant copy
> operation is supposed to hold the grant active only for as long as
> the copy operation takes. You'll in particular notice that
> __acquire_grant_for_copy() in its error path clears GTF_reading
> (and GTF_writing, as appropriate) again. You'd likely need to
> instrument the code to demonstrate (via a couple of extra log
> messages) what you think is not working properly here.

I have verified that GTF_reading and GTF_writing are indeed cleared after __gnttab_copy(). So the question is where GTF_reading gets set. Could the hypervisor be doing a grant copy operation while the VM's netfront is calling xennet_tx_buf_gc()? Any ideas?

> Jan
[...]
>>> On 22.10.13 at 03:18, jerry <jerry.lilijun@huawei.com> wrote:
> I have verified that GTF_reading and GTF_writing are indeed cleared after
> __gnttab_copy().
> So the question is where GTF_reading gets set.
> Could the hypervisor be doing a grant copy operation while the VM's netfront
> is calling xennet_tx_buf_gc()?

Surely not - grant copy operations would only ever be invoked by
netback.

Jan
On 2013/10/16 19:10, Jan Beulich wrote:
>>>> On 16.10.13 at 06:13, jerry <jerry.lilijun@huawei.com> wrote:
>> In addition, I have tried to turn off copy_skb. Then the vif netdevice may
>> not be released after shutting down VM,
>> that's because outstanding packets hold the reference count of the device
>> too long for some unknown reason.
[...]

The reason why the vif net-device isn't released after shutting down the VM has been found, with copy_skb disabled. Suppose VM1 (vif1.0) sends packets to VM2 (vif2.0) through the virtual switch:

1) VM2's OS is Windows 2003, and it had previously been shut down for some unexpected reason. After being recreated, VM2 stopped during boot at the prompt window named "Shutdown Event Tracker", waiting for the user to enter a message explaining why the computer shut down unexpectedly.

2) VM2 already had vif2.0 created. I then added a new vif net-device using virsh commands. The new vif2.1 was not completely created (it has no interrupts), but its state is running and its TX queue is started by default. The function connect() in xenbus.c has not been called for vif2.1. The related information in xenstore is as follows:

linux-szRoyS:/ # xenstore-ls -f | grep 2 | grep state
/local/domain/0/device-model/2/state = "running"
/local/domain/0/backend/vbd/2/51712/state = "4"
/local/domain/0/backend/vbd/2/51760/state = "4"
/local/domain/0/backend/vif/2/0/state = "4"
/local/domain/0/backend/vif/2/1/state = "2"
/local/domain/0/backend/console/2/0/state = "1"
/local/domain/2/control/uvp/vm_state = "running"
/local/domain/2/device/vbd/51712/state = "4"
/local/domain/2/device/vbd/51760/state = "4"
/local/domain/2/device/vif/0/state = "4"
/local/domain/2/device/vif/1/state = "1"

3) The KOBJ_ONLINE event was generated in backend_create_netif(), called from netback_probe(). This event runs the network script named "vif-bridge", which adds vif2.1 to the virtual switch. Packets from vif1.0 (VM1) are then forwarded or flooded to vif2.1 by the virtual switch. vif2.1 drops these packets because it is not netif_schedulable() in netif_be_start_xmit().

4) After setting vif2.1 down and then up again, the TX queue cannot be started in net_open() because the carrier is off. So its qdisc became a fifo qdisc while the TX queue state stayed stopped. In this case, packets are held in the qdisc queue and cannot be dequeued in dequeue_skb() because of vif2.1's stopped TX queue.

5) If VM1 is destroyed, the packets from vif1.0 cannot be released, so vif1.0 cannot be disconnected. vif1.0 remains unreleased until vif2.1 is set down.

The root cause is that vif2.1 was not created successfully and got into a strange state: running, but with its TX queue stopped. The function backend_create_netif() is called in two places, netback_probe() and frontend_changed(). I think we can remove the backend_create_netif() call in netback_probe(), so that we can make sure the vif net-device is created completely only after the front-end changes to XenbusStateConnected.

The patch is as follows:

--- drivers/xen/netback/xenbus.c.old	2013-10-26 16:23:07.000000000 +0800
+++ drivers/xen/netback/xenbus.c	2013-10-26 16:23:31.000000000 +0800
@@ -156,9 +156,6 @@
 	if (err)
 		goto fail;
 
-	/* This kicks hotplug scripts, so do it immediately. */
-	backend_create_netif(be);
-
 	return 0;
 
 abort_transaction:

Do you have some ideas?
>>> On 26.10.13 at 10:32, jerry <jerry.lilijun@huawei.com> wrote:
[...]
> The root cause is that vif2.1 was not created successfully and got
> into a strange state: running, but with its TX queue stopped.
> The function backend_create_netif() is called in two places,
> netback_probe() and frontend_changed(). I think we can remove the
> backend_create_netif() call in netback_probe().
>
> The patch is as follows:
> --- drivers/xen/netback/xenbus.c.old	2013-10-26 16:23:07.000000000 +0800
> +++ drivers/xen/netback/xenbus.c	2013-10-26 16:23:31.000000000 +0800
> @@ -156,9 +156,6 @@
>  	if (err)
>  		goto fail;
>  
> -	/* This kicks hotplug scripts, so do it immediately. */
> -	backend_create_netif(be);
> -
>  	return 0;
>  
>  abort_transaction:
>
> Do you have some ideas?

No, not really. Would be helpful if this could be matched up to
behavior (and eventual changes thereto) of the upstream driver.

Jan
On Sat, Oct 26, 2013 at 04:32:08PM +0800, jerry wrote:
[...]
> The patch is as follows:
> --- drivers/xen/netback/xenbus.c.old	2013-10-26 16:23:07.000000000 +0800
> +++ drivers/xen/netback/xenbus.c	2013-10-26 16:23:31.000000000 +0800
> @@ -156,9 +156,6 @@
>  	if (err)
>  		goto fail;
>  
> -	/* This kicks hotplug scripts, so do it immediately. */
> -	backend_create_netif(be);
> -
>  	return 0;
>  
>  abort_transaction:
>
> Do you have some ideas?

My gut feeling is that this sort of change is regression-prone, but we
have to live with that.

In any case, is upstream changeset ea732dff5c (xen-netback: Handle
backend state transitions in a more robust way) useful to you?

Wei.
On 2013/10/28 15:43, Jan Beulich wrote:
>>>> On 26.10.13 at 10:32, jerry <jerry.lilijun@huawei.com> wrote:
[...]
>> Do you have some ideas?
>
> No, not really. Would be helpful if this could be matched up to
> behavior (and eventual changes thereto) of the upstream driver.

Hi Wei and Jan,

Thanks for your reply. My VMs are running with the SUSE11 SP2 netback drivers, so the upstream xen-netback driver has not been tested in this situation. The earlier patch may introduce some problems when migrating VMs, so I have a new solution to fix my problem. The patch is as follows:

--- drivers/xen/netback/interface.c.old	2013-10-29 11:46:36.000000000 +0800
+++ drivers/xen/netback/interface.c	2013-10-29 11:46:47.000000000 +0800
@@ -111,8 +111,8 @@
 	netif_t *netif = netdev_priv(dev);
 	if (netback_carrier_ok(netif)) {
 		__netif_up(netif);
-		netif_start_queue(dev);
 	}
+	netif_start_queue(dev);
 	return 0;
 }

With this modification, when a vif is not yet connected to its front-end, the qdisc still dequeues skbs to the vif, where they are then dropped. In other words, the qdisc queue should not cache skbs while the vif has not been completely created. Any ideas?

Jerry
On Mon, 2013-10-28 at 11:43 +0000, Wei Liu wrote:
> On Sat, Oct 26, 2013 at 04:32:08PM +0800, jerry wrote:
> [...]
> > The patch is as follows:
[...]
> > Do you have some ideas?
>
> My gut feeling is that this sort of change is regression-prone, but we
> have to live with that.

This thread/fix doesn't apply to upstream netback, which doesn't have
copy_skb mode, right?

> In any case, is upstream changeset ea732dff5c (xen-netback: Handle
> backend state transitions in a more robust way) useful to you?
>
> Wei.
On Thu, Oct 31, 2013 at 03:17:11PM +0000, Ian Campbell wrote:
[...]
> This thread/fix doesn't apply to upstream netback, which doesn't have
> copy_skb mode, right?

No, it's the SuSE kernel.

> In any case, is upstream changeset ea732dff5c (xen-netback: Handle
> backend state transitions in a more robust way) useful to you?
On 2013/10/31 23:32, Wei Liu wrote:
> On Thu, Oct 31, 2013 at 03:17:11PM +0000, Ian Campbell wrote:
[...]
>> This thread/fix doesn't apply to upstream netback, which doesn't have
>> copy_skb mode, right?
>
> No, it's the SuSE kernel.

Yes, I am using the SuSE11 SP2 kernel. The two main points from my other emails can be summarized as follows:

1) With copy_skb mode enabled, some grant copy operations in the other (RX) netbk thread will fail. This error causes packet retransmits, and sometimes the VM crashes. I currently have no appropriate solution for that problem, so I have to turn copy_skb mode off.

2) With copy_skb disabled, a vif cannot be disconnected when its VM is destroyed while packets it sent have not yet been consumed. I found that those packets were cached in another, abnormal vif's qdisc queues. My solution is to keep the vif TX queue started whenever the vif is set up, so packets are dropped if the vif has been created but is not yet connected. The fix patch is shown as follows:

--- drivers/xen/netback/interface.c.old	2013-10-29 11:46:36.000000000 +0800
+++ drivers/xen/netback/interface.c	2013-10-29 11:46:47.000000000 +0800
@@ -111,8 +111,8 @@
 	netif_t *netif = netdev_priv(dev);
 	if (netback_carrier_ok(netif)) {
 		__netif_up(netif);
-		netif_start_queue(dev);
 	}
+	netif_start_queue(dev);
 	return 0;
 }

> In any case, is upstream changeset ea732dff5c (xen-netback: Handle
> backend state transitions in a more robust way) useful to you?