Xu, Dongxiao
2010-Jun-10 11:48 UTC
[Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Hi Jeremy,

The attached patch should fix the PV network issue seen after applying the
netback multiple threads patchset.

This bug comes from the version upgrade when I resent my patchset last time:

 - In patch v1 & v2, I used kzalloc() to allocate the memory, so there was
   no problem.
 - In patch v3, I changed it to "__get_free_pages(GFP_KERNEL | __GFP_ZERO,
   order)", so there was also no problem.
 - However, in patch v4, when I changed it to vmalloc() according to the
   review comments, the memset(0) was missed. Sorry for that.

*ONE POINT IS*, the phenomenon on my side is different from the bug reports
on the mailing list. They often saw long GPLPV network ping latency, while
on my side the system would crash when enabling the guest network.

After applying the patch, the network works fine on my side in the
following cases:

 - WinXP 32bit with gplpv_XP_0.11.0.213.msi
 - 64bit Linux PV guest
 - 64bit RHEL5u3 HVM guest with PV driver

Could somebody try this fix against the long ping latency phenomenon?
I would appreciate it!

Thanks,
Dongxiao
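The attached patch itself is not reproduced in the archive. As an illustration only (the allocation site and sizes are inferred from the surrounding discussion and the later quoted code, not taken from the patch), the fix described above amounts to zeroing the vmalloc()ed group array:

	/* vmalloc(), unlike kzalloc()/__GFP_ZERO, returns uninitialised memory */
	xen_netbk = vmalloc(sizeof(struct xen_netbk) * xen_netbk_group_nr);
	if (!xen_netbk)
		return -ENOMEM;
	/* the step missing in v4: clear the array before anything reads it */
	memset(xen_netbk, 0, sizeof(struct xen_netbk) * xen_netbk_group_nr);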
Jan Beulich
2010-Jun-10 12:29 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
>>> On 10.06.10 at 13:48, "Xu, Dongxiao" <dongxiao.xu@intel.com> wrote:
> Hi Jeremy,
>
> The attached patch should fix the PV network issue after applying the
> netback multiple threads patchset.
>
> This bug comes from the version upgrade when resending my patchset last
> time.
>
> In patch v1 & v2, I used kzalloc() to allocate memory, so there is no
> problem.
> In patch v3, I changed it to "__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> order)", so there is also no problem.
> However in patch v4, when modified it to vmalloc according to the comments,
> memset(0) is missing. Sorry for that.

A perhaps better alternative would be to use

	__vmalloc(..., GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO, PAGE_KERNEL);

Jan
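For comparison (again a sketch against the same assumed allocation site, not a posted patch), Jan's suggestion lets the allocator do the zeroing, so no separate memset() is needed:

	xen_netbk = __vmalloc(sizeof(struct xen_netbk) * xen_netbk_group_nr,
			      GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
			      PAGE_KERNEL);
	if (!xen_netbk)
		return -ENOMEM;
	/* __GFP_ZERO makes __vmalloc() hand back already-zeroed pages */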
<djmagee@mageenet.net>
2010-Jun-10 12:51 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
I can confirm that this patch solves the problem for me. Tested in Win7
x64 with GPLPV 0.11.0.213, and with a PV Linux domain running Fedora 12
64bit.

Doug Magee
Pasi Kärkkäinen
2010-Jun-10 12:58 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, Jun 10, 2010 at 08:51:24AM -0400, djmagee@mageenet.net wrote:
> I can confirm that this patch solves the problem for me. Tested in Win7
> x64 with GPLPV 0.11.0.213, and with a PV Linux domain running Fedora 12
> 64bit.
>

Good to hear. So I guess Jeremy can revert the revert and apply the bugfix patch :)

-- Pasi
Ian Campbell
2010-Jun-11 09:35 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
> Hi Jeremy,
>
> The attached patch should fix the PV network issue after applying the
> netback multiple threads patchset.

Thanks for this Dongxiao. Do you think this crash was a potential symptom
of this issue? It does seem to go away if I apply your patch.

BUG: unable to handle kernel paging request at 70000027
IP: [<c0294867>] make_tx_response+0x17/0xd0
*pdpt = 0000000000000000
Oops: 0000 [#2] SMP
last sysfs file:
Modules linked in:
Supported: Yes

Pid: 1083, comm: netback/0 Tainted: G      D  (2.6.27.45-0.1.1-x86_32p-xen #222)
EIP: 0061:[<c0294867>] EFLAGS: 00010296 CPU: 0
EIP is at make_tx_response+0x17/0xd0
EAX: 6fffffff EBX: 00000000 ECX: 00000000 EDX: f00610a4
ESI: 6fffffff EDI: f00620a4 EBP: ed0c3f18 ESP: ed0c3f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: e021
Process netback/0 (pid: 1083, ti=ed0c2000 task=ee9de070 task.ti=ed0c2000)
Stack: 00000000 00000000 f00620a4 ed0c3fa8 c029676a c0456000 ee9de070 ed0c3fd0
       ed0c3f94 00000002 ed0c3fb8 f0062ca4 f0061000 6fffffff 011d9000 f00620a4
       f006108c ed0c3f5c c04ffb00 c04ffb00 ed0c3fc0 ed0c3fbc ed0c3fb8 ed0c2000
Call Trace:
 [<c029676a>] ? net_tx_action+0x32a/0xa50
 [<c0296f62>] ? netbk_action_thread+0x62/0x190
 [<c0296f00>] ? netbk_action_thread+0x0/0x190
 [<c013f84c>] ? kthread+0x3c/0x70
 [<c013f810>] ? kthread+0x0/0x70
 [<c0105633>] ? kernel_thread_helper+0x7/0x10
 ======================
Code: ec 8d 41 01 89 47 2c c7 45 e4 ea ff ff ff eb dd 8d 74 26 00 55 66 0f be c9 89 e5 83 ec 0c 89 74 24 04 89 c6 89 1c 24 89 7c 24 08 <8b> 78 28 8b 40 30 0f b7 5a 08 83 e8 01 21 f8 8d 04 40 c1 e0 02
EIP: [<c0294867>] make_tx_response+0x17/0xd0 SS:ESP e021:ed0c3f0c
---[ end trace f7e370bf10f6f981 ]---

The crash is in one of the calls to list_move_tail, and I think it is
because netbk->pending_inuse_head is not initialised until after the
threads and/or tasklets are created (I was running in threaded mode).
Perhaps even though we are now zeroing the netbk struct, those fields
should still be initialised before kicking off any potentially
asynchronous tasks?

I didn't even start any guests, so I think we only got to the reference
to pending_inuse_head because tx_work_todo can return a false positive
if netbk is not properly zeroed, and therefore we can call net_tx_action
before we are ready.

On an unrelated note, do you have any plans to make the number of groups
react dynamically to CPU hotplug? Not necessarily while there are
actually active VIFs (might be tricky to get right) but perhaps only
when netback is idle (i.e. when there are no VIFs configured), since
often the dynamic adjustment of VCPUs happens at start of day to reduce
the domain 0 VCPU allocation from the total number of cores in the
machine to something more manageable.

Ian.
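An editorial aside on the point above: a zero-filled struct list_head is not the same as one set up with INIT_LIST_HEAD(), because list_empty() compares the head's ->next pointer with the head's own address, so a zeroed head still looks non-empty to the list helpers. The ordering Ian is asking for, sketched with the field and thread names mentioned in this mail (illustrative only, not the eventual patch), would be roughly:

	struct xen_netbk *netbk = &xen_netbk[group];

	/* initialise everything the async code may look at first ... */
	INIT_LIST_HEAD(&netbk->pending_inuse_head);
	INIT_LIST_HEAD(&netbk->net_schedule_list);
	spin_lock_init(&netbk->net_schedule_list_lock);

	/* ... and only then start anything that can run asynchronously */
	netbk->kthread.task = kthread_run(netbk_action_thread, netbk,
					  "netback/%u", group);

This is essentially what the patch later in the thread implements.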
Xu, Dongxiao
2010-Jun-17 08:16 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Ian,

Sorry for the late response, I was on vacation days before.

Ian Campbell wrote:
> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>> Hi Jeremy,
>>
>> The attached patch should fix the PV network issue after applying
>> the netback multiple threads patchset.
>
> Thanks for this Dongxiao. Do you think this crash was a potential
> symptom of this issue? It does seem to go away if I apply your patch.

Actually, the phenomenon is the same on my side without the fixing patch.

> BUG: unable to handle kernel paging request at 70000027
> IP: [<c0294867>] make_tx_response+0x17/0xd0
> [...]
> ---[ end trace f7e370bf10f6f981 ]---
>
> The crash is in one of the calls to list_move_tail, and I think it is
> because netbk->pending_inuse_head is not initialised until after the
> threads and/or tasklets are created (I was running in threaded mode).
> Perhaps even though we are now zeroing the netbk struct, those fields
> should still be initialised before kicking off any potentially
> asynchronous tasks?

You are right, I will commit another patch to fix it.

> I didn't even start any guests, so I think we only got to the reference
> to pending_inuse_head because tx_work_todo can return a false positive
> if netbk is not properly zeroed, and therefore we can call net_tx_action
> before we are ready.
>
> On an unrelated note, do you have any plans to make the number of groups
> react dynamically to CPU hotplug? Not necessarily while there are
> actually active VIFs (might be tricky to get right) but perhaps only
> when netback is idle (i.e. when there are no VIFs configured), since
> often the dynamic adjustment of VCPUs happens at start of day to reduce
> the domain 0 VCPU allocation from the total number of cores in the
> machine to something more manageable.

I'm sorry, currently I am busy with some other tasks and may not have
time to do this job.

But if the case is to reduce the dom0 VCPU number, keeping the group
number unchanged will not impact performance, since the group count
reflects the tasklet/kthread number, and it doesn't have a direct
association with dom0's VCPU number.

Thanks,
Dongxiao

> Ian.
Jeremy Fitzhardinge
2010-Jun-21 11:14 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 06/17/2010 09:16 AM, Xu, Dongxiao wrote:
> Ian,
>
> Sorry for the late response, I was on vacation days before.
>
> Ian Campbell wrote:
>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>> Hi Jeremy,
>>>
>>> The attached patch should fix the PV network issue after applying
>>> the netback multiple threads patchset.
>>
>> Thanks for this Dongxiao. Do you think this crash was a potential
>> symptom of this issue? It does seem to go away if I apply your patch.
>
> Actually, the phenomenon is the same on my side without the fixing patch.
>
>> BUG: unable to handle kernel paging request at 70000027
>> IP: [<c0294867>] make_tx_response+0x17/0xd0
>> [...]
>> ---[ end trace f7e370bf10f6f981 ]---
>>
>> The crash is in one of the calls to list_move_tail, and I think it is
>> because netbk->pending_inuse_head is not initialised until after the
>> threads and/or tasklets are created (I was running in threaded mode).
>> Perhaps even though we are now zeroing the netbk struct, those fields
>> should still be initialised before kicking off any potentially
>> asynchronous tasks?
>
> You are right, I will commit another patch to fix it.

Does this patch address it, or does it need something else?

Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code

Split the netbk group array initialization into two phases: one to do
simple "can't fail" initialization of lists, timers, locks, etc; and a
second pass to allocate memory and start the async code.

This makes sure that all the basic stuff is in place by the time the
async code is running.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 9a7ada2..95de476 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1750,8 +1750,10 @@ static int __init netback_init(void)
 	/* We can increase reservation by this much in net_rx_action(). */
 //	balloon_update_driver_allowance(NET_RX_RING_SIZE);
 
+	/* First pass - do simple "can't fail" setup */
 	for (group = 0; group < xen_netbk_group_nr; group++) {
 		struct xen_netbk *netbk = &xen_netbk[group];
+
 		skb_queue_head_init(&netbk->rx_queue);
 		skb_queue_head_init(&netbk->tx_queue);
 
@@ -1764,16 +1766,6 @@ static int __init netback_init(void)
 		netbk->netbk_tx_pending_timer.function =
 			netbk_tx_pending_timeout;
 
-		netbk->mmap_pages =
-			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
-		if (!netbk->mmap_pages) {
-			printk(KERN_ALERT "%s: out of memory\n", __func__);
-			del_timer(&netbk->netbk_tx_pending_timer);
-			del_timer(&netbk->net_timer);
-			rc = -ENOMEM;
-			goto failed_init;
-		}
-
 		for (i = 0; i < MAX_PENDING_REQS; i++) {
 			page = netbk->mmap_pages[i];
 			SetPageForeign(page, netif_page_release);
@@ -1786,6 +1778,26 @@ static int __init netback_init(void)
 		for (i = 0; i < MAX_PENDING_REQS; i++)
 			netbk->pending_ring[i] = i;
 
+		INIT_LIST_HEAD(&netbk->pending_inuse_head);
+		INIT_LIST_HEAD(&netbk->net_schedule_list);
+
+		spin_lock_init(&netbk->net_schedule_list_lock);
+
+		atomic_set(&netbk->netfront_count, 0);
+	}
+
+	/* Second pass - do memory allocation and initialize threads/tasklets */
+	for (group = 0; group < xen_netbk_group_nr; group++) {
+		struct xen_netbk *netbk = &xen_netbk[group];
+
+		netbk->mmap_pages =
+			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
+		if (!netbk->mmap_pages) {
+			printk(KERN_ALERT "%s: out of memory\n", __func__);
+			rc = -ENOMEM;
+			goto failed_init;
+		}
+
 		if (MODPARM_netback_kthread) {
 			init_waitqueue_head(&netbk->kthread.netbk_action_wq);
 			netbk->kthread.task =
@@ -1801,8 +1813,6 @@ static int __init netback_init(void)
 				       "kthread_run() fails at netback\n");
 				free_empty_pages_and_pagevec(netbk->mmap_pages,
 						MAX_PENDING_REQS);
-				del_timer(&netbk->netbk_tx_pending_timer);
-				del_timer(&netbk->net_timer);
 				rc = PTR_ERR(netbk->kthread.task);
 				goto failed_init;
 			}
@@ -1814,13 +1824,6 @@ static int __init netback_init(void)
 				     net_rx_action,
 				     (unsigned long)netbk);
 		}
-
-		INIT_LIST_HEAD(&netbk->pending_inuse_head);
-		INIT_LIST_HEAD(&netbk->net_schedule_list);
-
-		spin_lock_init(&netbk->net_schedule_list_lock);
-
-		atomic_set(&netbk->netfront_count, 0);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
@@ -1850,13 +1853,16 @@ static int __init netback_init(void)
 	return 0;
 
 failed_init:
-	for (i = 0; i < group; i++) {
+	for (i = 0; i < xen_netbk_group_nr; i++) {
 		struct xen_netbk *netbk = &xen_netbk[i];
-		free_empty_pages_and_pagevec(netbk->mmap_pages,
-				MAX_PENDING_REQS);
+
 		del_timer(&netbk->netbk_tx_pending_timer);
 		del_timer(&netbk->net_timer);
-		if (MODPARM_netback_kthread)
+
+		if (netbk->mmap_pages)
+			free_empty_pages_and_pagevec(netbk->mmap_pages,
+					MAX_PENDING_REQS);
+		if (MODPARM_netback_kthread && netbk->kthread.task)
 			kthread_stop(netbk->kthread.task);
 	}
 	vfree(xen_netbk);

J
Xu, Dongxiao
2010-Jun-22 02:29 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Jeremy Fitzhardinge wrote:
> On 06/17/2010 09:16 AM, Xu, Dongxiao wrote:
>> Ian,
>>
>> Sorry for the late response, I was on vacation days before.
>>
>> Ian Campbell wrote:
>>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>>> Hi Jeremy,
>>>>
>>>> The attached patch should fix the PV network issue after applying
>>>> the netback multiple threads patchset.
>>>
>>> Thanks for this Dongxiao. Do you think this crash was a potential
>>> symptom of this issue? It does seem to go away if I apply your patch.
>>
>> Actually, the phenomenon is the same on my side without the fixing patch.
>>
>>> BUG: unable to handle kernel paging request at 70000027
>>> IP: [<c0294867>] make_tx_response+0x17/0xd0
>>> [...]
>>> ---[ end trace f7e370bf10f6f981 ]---
>>>
>>> The crash is in one of the calls to list_move_tail, and I think it is
>>> because netbk->pending_inuse_head is not initialised until after the
>>> threads and/or tasklets are created (I was running in threaded mode).
>>> Perhaps even though we are now zeroing the netbk struct, those fields
>>> should still be initialised before kicking off any potentially
>>> asynchronous tasks?
>>
>> You are right, I will commit another patch to fix it.
>
> Does this patch address it, or does it need something else?

Hi Jeremy,

The patch is good except that we should move the mmap_pages related code
back to after it is initialized. So I modified your patch a little.

Thanks,
Dongxiao

Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code

Split the netbk group array initialization into two phases: one to do
simple "can't fail" initialization of lists, timers, locks, etc; and a
second pass to allocate memory and start the async code.

This makes sure that all the basic stuff is in place by the time the
async code is running.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/xen/netback/netback.c |   46 +++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 9a7ada2..f5d8952 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1750,8 +1750,10 @@ static int __init netback_init(void)
 	/* We can increase reservation by this much in net_rx_action(). */
 //	balloon_update_driver_allowance(NET_RX_RING_SIZE);
 
+	/* First pass - do simple "can't fail" setup */
 	for (group = 0; group < xen_netbk_group_nr; group++) {
 		struct xen_netbk *netbk = &xen_netbk[group];
+
 		skb_queue_head_init(&netbk->rx_queue);
 		skb_queue_head_init(&netbk->tx_queue);
 
@@ -1764,12 +1766,27 @@ static int __init netback_init(void)
 		netbk->netbk_tx_pending_timer.function =
 			netbk_tx_pending_timeout;
 
+		netbk->pending_cons = 0;
+		netbk->pending_prod = MAX_PENDING_REQS;
+		for (i = 0; i < MAX_PENDING_REQS; i++)
+			netbk->pending_ring[i] = i;
+
+		INIT_LIST_HEAD(&netbk->pending_inuse_head);
+		INIT_LIST_HEAD(&netbk->net_schedule_list);
+
+		spin_lock_init(&netbk->net_schedule_list_lock);
+
+		atomic_set(&netbk->netfront_count, 0);
+	}
+
+	/* Second pass - do memory allocation and initialize threads/tasklets */
+	for (group = 0; group < xen_netbk_group_nr; group++) {
+		struct xen_netbk *netbk = &xen_netbk[group];
+
 		netbk->mmap_pages =
 			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
 		if (!netbk->mmap_pages) {
 			printk(KERN_ALERT "%s: out of memory\n", __func__);
-			del_timer(&netbk->netbk_tx_pending_timer);
-			del_timer(&netbk->net_timer);
 			rc = -ENOMEM;
 			goto failed_init;
 		}
@@ -1781,11 +1798,6 @@ static int __init netback_init(void)
 			INIT_LIST_HEAD(&netbk->pending_inuse[i].list);
 		}
 
-		netbk->pending_cons = 0;
-		netbk->pending_prod = MAX_PENDING_REQS;
-		for (i = 0; i < MAX_PENDING_REQS; i++)
-			netbk->pending_ring[i] = i;
-
 		if (MODPARM_netback_kthread) {
 			init_waitqueue_head(&netbk->kthread.netbk_action_wq);
 			netbk->kthread.task =
@@ -1801,8 +1813,6 @@ static int __init netback_init(void)
 				       "kthread_run() fails at netback\n");
 				free_empty_pages_and_pagevec(netbk->mmap_pages,
 						MAX_PENDING_REQS);
-				del_timer(&netbk->netbk_tx_pending_timer);
-				del_timer(&netbk->net_timer);
 				rc = PTR_ERR(netbk->kthread.task);
 				goto failed_init;
 			}
@@ -1814,13 +1824,6 @@ static int __init netback_init(void)
 				     net_rx_action,
 				     (unsigned long)netbk);
 		}
-
-		INIT_LIST_HEAD(&netbk->pending_inuse_head);
-		INIT_LIST_HEAD(&netbk->net_schedule_list);
-
-		spin_lock_init(&netbk->net_schedule_list_lock);
-
-		atomic_set(&netbk->netfront_count, 0);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
@@ -1850,13 +1853,16 @@ static int __init netback_init(void)
 	return 0;
 
 failed_init:
-	for (i = 0; i < group; i++) {
+	for (i = 0; i < xen_netbk_group_nr; i++) {
 		struct xen_netbk *netbk = &xen_netbk[i];
-		free_empty_pages_and_pagevec(netbk->mmap_pages,
-				MAX_PENDING_REQS);
+
 		del_timer(&netbk->netbk_tx_pending_timer);
 		del_timer(&netbk->net_timer);
-		if (MODPARM_netback_kthread)
+
+		if (netbk->mmap_pages)
+			free_empty_pages_and_pagevec(netbk->mmap_pages,
+					MAX_PENDING_REQS);
+		if (MODPARM_netback_kthread && netbk->kthread.task)
 			kthread_stop(netbk->kthread.task);
 	}
 	vfree(xen_netbk);
-- 
1.6.3
Ian Campbell
2010-Jun-24 08:33 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-06-17 at 09:16 +0100, Xu, Dongxiao wrote:
> Ian,
>
> Sorry for the late response, I was on vacation days before.

I was also on vacation, so sorry for _my_ late reply ;-)

> Ian Campbell wrote:
>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>> Hi Jeremy,
>>>
>>> The attached patch should fix the PV network issue after applying
>>> the netback multiple threads patchset.
>>
>> Thanks for this Dongxiao. Do you think this crash was a potential
>> symptom of this issue? It does seem to go away if I apply your patch.
>
> Actually, the phenomenon is the same on my side without the fixing patch.

Great, thanks.

>> On an unrelated note, do you have any plans to make the number of
>> groups react dynamically to CPU hotplug? Not necessarily while there
>> are actually active VIFs (might be tricky to get right) but perhaps
>> only when netback is idle (i.e. when there are no VIFs configured),
>> since often the dynamic adjustment of VCPUs happens at start of day
>> to reduce the domain 0 VCPU allocation from the total number of cores
>> in the machine to something more manageable.
>
> I'm sorry, currently I am busy with some other tasks and may not have
> time to do this job.

I understand.

> But if the case is to reduce the dom0 VCPU number, keeping the group
> number unchanged will not impact performance, since the group count
> reflects the tasklet/kthread number, and it doesn't have a direct
> association with dom0's VCPU number.

Yes, that mitigates the issue to a large degree. I was just concerned
about e.g. 64 threads competing for 4 VCPUs or similar, which seems
wasteful in terms of some resource or other...

For XCP (which may soon switch from 1 to 4 domain 0 VCPUs in the
unstable branch) I've been thinking of the following patch. I wonder if
it might make sense in general? 4 is rather arbitrarily chosen, but I
think even on a 64 core machine you wouldn't want to dedicate more than
some fraction to netback activity, and if you do then it is configurable.

Ian.

netback: allow configuration of maximum number of groups to use

Limit to 4 by default.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

diff -r 7692c6381e1a drivers/xen/netback/netback.c
--- a/drivers/xen/netback/netback.c	Fri Jun 11 08:44:25 2010 +0100
+++ b/drivers/xen/netback/netback.c	Fri Jun 11 09:31:48 2010 +0100
@@ -124,6 +124,10 @@
 static int MODPARM_netback_kthread = 1;
 module_param_named(netback_kthread, MODPARM_netback_kthread, bool, 0);
 MODULE_PARM_DESC(netback_kthread, "Use kernel thread to replace tasklet");
+
+static unsigned int MODPARM_netback_max_groups = 4;
+module_param_named(netback_max_groups, MODPARM_netback_max_groups, uint, 0);
+MODULE_PARM_DESC(netback_max_groups, "Maximum number of netback groups to allocate");
 
 /*
  * Netback bottom half handler.
Xu, Dongxiao
2010-Jun-25 07:31 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Ian Campbell wrote:
> On Thu, 2010-06-17 at 09:16 +0100, Xu, Dongxiao wrote:
>> Ian,
>>
>> Sorry for the late response, I was on vacation days before.
>
> I was also on vacation, so sorry for _my_ late reply ;-)
>
> [...]
>
>> But if the case is to reduce the dom0 VCPU number, keeping the group
>> number unchanged will not impact performance, since the group count
>> reflects the tasklet/kthread number, and it doesn't have a direct
>> association with dom0's VCPU number.
>
> Yes, that mitigates the issue to a large degree. I was just concerned
> about e.g. 64 threads competing for 4 VCPUs or similar, which seems
> wasteful in terms of some resource or other...
>
> For XCP (which may soon switch from 1 to 4 domain 0 VCPUs in the
> unstable branch) I've been thinking of the following patch. I wonder
> if it might make sense in general? 4 is rather arbitrarily chosen, but
> I think even on a 64 core machine you wouldn't want to dedicate more
> than some fraction to netback activity, and if you do then it is
> configurable.

Basically I am OK with it. One concern is that, when the system is
equipped with a 10G NIC, according to my previous experience, 4 dom0
vCPUs may not be enough for eating all the bandwidth.

Thanks,
Dongxiao

> Ian.
Ian Campbell
2010-Jul-01 14:48 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>
> Split the netbk group array initialization into two phases: one to do
> simple "can't fail" initialization of lists, timers, locks, etc; and a
> second pass to allocate memory and start the async code.
>
> This makes sure that all the basic stuff is in place by the time the
> async code is running.

Paul noticed a crash in netback init because...

> @@ -1764,16 +1766,6 @@ static int __init netback_init(void)
>  		netbk->netbk_tx_pending_timer.function =
>  			netbk_tx_pending_timeout;
>  
> -		netbk->mmap_pages =
> -			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
> -		if (!netbk->mmap_pages) {
> -			printk(KERN_ALERT "%s: out of memory\n", __func__);
> -			del_timer(&netbk->netbk_tx_pending_timer);
> -			del_timer(&netbk->net_timer);
> -			rc = -ENOMEM;
> -			goto failed_init;
> -		}
> -
>  		for (i = 0; i < MAX_PENDING_REQS; i++) {
>  			page = netbk->mmap_pages[i];
>  			SetPageForeign(page, netif_page_release);

...this dereference of netbk->mmap_pages[i]...

> @@ -1786,6 +1778,26 @@ static int __init netback_init(void)
[...]
> +		netbk->mmap_pages =
> +			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);

... happens before this initialisation of the array.

Ian.
Jeremy Fitzhardinge
2010-Jul-01 15:29 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 07/01/2010 04:48 PM, Ian Campbell wrote:
> On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
>> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>>
>> Split the netbk group array initialization into two phases: one to do
>> simple "can't fail" initialization of lists, timers, locks, etc; and a
>> second pass to allocate memory and start the async code.
>>
>> This makes sure that all the basic stuff is in place by the time the
>> async code is running.
>
> Paul noticed a crash in netback init because...
>
> [...]
>
> ...this dereference of netbk->mmap_pages[i]...
>
> [...]
>
> ... happens before this initialisation of the array.

Hm, I hadn't meant to commit that properly. I had it locally and
accidentally pushed it out.

I only did that patch as an RFC in response to an issue alluded to by
Dongxiao (or was it you?) about things not being fully initialized by
the time the async code starts. Is this a real issue, and if so, what's
the correct fix?

Thanks,
J
Ian Campbell
2010-Jul-01 15:47 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 16:29 +0100, Jeremy Fitzhardinge wrote:
> On 07/01/2010 04:48 PM, Ian Campbell wrote:
>> On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
>>> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>>>
>>> [...]
>>
>> Paul noticed a crash in netback init because...
>>
>> [...]
>>
>> ... happens before this initialisation of the array.
>
> Hm, I hadn't meant to commit that properly. I had it locally and
> accidentally pushed it out.
>
> I only did that patch as an RFC in response to an issue alluded to by
> Dongxiao (or was it you?) about things not being fully initialized by
> the time the async code starts. Is this a real issue, and if so, what's
> the correct fix?

I don't think there is an actual current issue, just a potential one:
since we are relying on data structures being zeroed rather than
properly initialised to keep the async code from running off into the
weeds, it just seemed a little fragile this way.

Originally I said:

>> The crash is in one of the calls to list_move_tail, and I think it is
>> because netbk->pending_inuse_head is not initialised until after the
>> threads and/or tasklets are created (I was running in threaded mode).
>> Perhaps even though we are now zeroing the netbk struct, those fields
>> should still be initialised before kicking off any potentially
>> asynchronous tasks?

This specific issue was fixed by zeroing the netbk array as it is
allocated; I just thought we could make things more robust by not
triggering the async code until everything was fully set up.

Ian.
Ian Campbell
2010-Jul-01 16:06 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 16:47 +0100, Ian Campbell wrote:
> On Thu, 2010-07-01 at 16:29 +0100, Jeremy Fitzhardinge wrote:
>> I only did that patch as an RFC in response to an issue alluded to by
>> Dongxiao (or was it you?) about things not being fully initialized by
>> the time the async code starts. Is this a real issue, and if so, what's
>> the correct fix?
>
> I don't think there is an actual current issue, just a potential one:
> since we are relying on data structures being zeroed rather than
> properly initialised to keep the async code from running off into the
> weeds, it just seemed a little fragile this way.
>
> [...]
>
> This specific issue was fixed by zeroing the netbk array as it is
> allocated; I just thought we could make things more robust by not
> triggering the async code until everything was fully set up.

I suspect simply not letting the kthread off the leash until the end of
the loop may be sufficient. The tasklet case is, I think, safe until the
tasklets are explicitly tickled. E.g. something like:

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 29b60c8..36f3cc7 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1794,7 +1794,6 @@
 
 		if (!IS_ERR(netbk->kthread.task)) {
 			kthread_bind(netbk->kthread.task, group);
-			wake_up_process(netbk->kthread.task);
 		} else {
 			printk(KERN_ALERT
 				"kthread_run() fails at netback\n");
@@ -1820,6 +1819,9 @@
 		spin_lock_init(&netbk->net_schedule_list_lock);
 
 		atomic_set(&netbk->netfront_count, 0);
+
+		if (MODPARM_netback_kthread)
+			wake_up_process(netbk->kthread.task);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
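As background for the patch above (an editorial note, not part of Ian's mail): kthread_run() is simply kthread_create() followed by wake_up_process(), so a newly created kthread stays parked until it is explicitly woken, and deferring only the wake-up keeps netbk_action_thread from touching any per-group state before the loop has finished initialising it. In general form (illustrative sketch; threadfn and data are placeholders):

	struct task_struct *task;

	task = kthread_create(threadfn, data, "netback/%u", group);
	if (IS_ERR(task))
		return PTR_ERR(task);
	kthread_bind(task, group);	/* pin to its CPU while still asleep */

	/* ... finish initialising all state the thread will use ... */

	wake_up_process(task);		/* only now does threadfn() start running */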
Jeremy Fitzhardinge
2010-Jul-01 16:07 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 07/01/2010 05:47 PM, Ian Campbell wrote:
>> Hm, I hadn't meant to commit that properly. I had it locally and
>> accidentally pushed it out.
>>
>> I only did that patch as an RFC in response to an issue alluded to by
>> Dongxiao (or was it you?) about things not being fully initialized by
>> the time the async code starts. Is this a real issue, and if so, what's
>> the correct fix?
>
> I don't think there is an actual current issue, just a potential one:
> since we are relying on data structures being zeroed rather than
> properly initialised to keep the async code from running off into the
> weeds, it just seemed a little fragile this way.
>
> Originally I said:
>
>>> The crash is in one of the calls to list_move_tail, and I think it is
>>> because netbk->pending_inuse_head is not initialised until after the
>>> threads and/or tasklets are created (I was running in threaded mode).
>>> Perhaps even though we are now zeroing the netbk struct, those fields
>>> should still be initialised before kicking off any potentially
>>> asynchronous tasks?
>
> This specific issue was fixed by zeroing the netbk array as it is
> allocated; I just thought we could make things more robust by not
> triggering the async code until everything was fully set up.

It would only affect system startup time, not domain creation? I was
looking at it because Stefano was having fairly consistent crashes on
domain creation, and it looked like sort-of-racy symptoms.

J
Ian Campbell
2010-Jul-01 16:11 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 17:07 +0100, Jeremy Fitzhardinge wrote:
> On 07/01/2010 05:47 PM, Ian Campbell wrote:
>> I don't think there is an actual current issue, just a potential one:
>> since we are relying on data structures being zeroed rather than
>> properly initialised to keep the async code from running off into the
>> weeds, it just seemed a little fragile this way.
>>
>> [...]
>>
>> This specific issue was fixed by zeroing the netbk array as it is
>> allocated; I just thought we could make things more robust by not
>> triggering the async code until everything was fully set up.
>
> It would only affect system startup time, not domain creation?

I think so, it's a race during netback_init. netbk_action_thread can
reference &netbk->net_schedule_list (via tx_work_todo) before it is
initialised. It is now zeroed, which is probably safe but not exactly
robust.

> I was looking at it because Stefano was having fairly consistent crashes
> on domain creation, and it looked like sort-of-racy symptoms.

I think this particular race would be long gone by the time you started
a domain. Although I suppose having the thread spinning, doing
potentially arbitrary things for a small window during netback_init,
could leave things in a fragile enough state that it might fall apart
when you started a domain. I'm not convinced, though.

Ian.