Yang, Xiaowei
2010-Jan-21 08:16 UTC
[Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
Symptoms
--------
Boot time of a UP HVM guest on one DP Nehalem server (SMT on):

                seconds
  2.6.18        20
  pvops         42

With pvops, dom0 CPU utilization is noticeably higher (peak: 230% vs 99%), and
resched IPIs peak at 200k+ per second.

Analysis
--------
Xentrace shows that accesses to the IDE IOports are 10x slower than accesses
to other IOports that go through QEMU (~300K vs ~30K).

Ftrace of 'sched' events in dom0 shows frequent context switches between
'idle' and 'events/?' (the workqueue execution kthread) on each idle vCPU,
with events/? running lru_add_drain_per_cpu(). This explains where the resched
IPIs come from: the kernel uses them to kick the idle vCPUs into doing the
real work.

lru_add_drain_per_cpu() is triggered by lru_add_drain_all(), which is a
*costly* synchronous operation: it does not return until every vCPU has run
the queued work. Throughout the kernel there are 9 places that call
lru_add_drain_all(): shm, memory migration, mlock, etc. If the IDE IOport
access path invokes one of them, that could be the reason it is so slow.

Ftrace of 'syscalls' in dom0 then confirms the assumption: QEMU really does
call mlock(). It turns out that mlock() is used a lot in Xen (73 places) to
ensure that dom0 user-space buffers passed to the Xen HV by hypercall are
pinned in memory. The IDE IOport access path hits one of them -
HVMOP_set_isa_irq_level.

Searching the kernel change log backwards: in 2.6.28
(http://kernelnewbies.org/Linux_2_6_28), a major change to the mlock
implementation (b291f000: "mlock: mlocked pages are unevictable") put mlocked
pages under the management of the (page frame reclaiming) LRU, and
lru_add_drain_all() is a preparatory step that purges pages from a temporary
data structure (pagevec) onto the LRU lists. That's why a 2.6.18 dom0 doesn't
generate so many resched IPIs.

As a hack, the mlock() before HVMOP_set_isa_irq_level was omitted in pvops
dom0, and guest boot time returned to normal - ~20s.

Solutions?
----------
- Limiting the vCPU# of dom0 is always the easiest one - you may call it a
  workaround rather than a solution:) It not only reduces the total # of
  resched IPIs (mlock# * (vCPU#-1)), but also reduces the cost of each
  handler - because of the spinlock. But the impact is still there, more or
  less, whenever vCPU# > 1.

- To remove the mlock(), another sharing method is needed between dom0
  user-space apps and the Xen HV.

- More ideas?

Thanks,
xiaowei
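As a rough way to see the per-call pinning cost in isolation, here is a
minimal standalone sketch. It is not code from libxc or from the hack above -
it simply repeats the mlock()/munlock() pair that libxc performs around each
small hypercall argument, and on a 2.6.28+ kernel with many (v)CPUs the
mlock() side ends up in lru_add_drain_all():

/*
 * Minimal standalone sketch (illustrative only): mimic the per-hypercall
 * pinning that libxc does.  On a 2.6.28+ kernel each mlock() runs
 * lru_add_drain_all(), which queues work on every online CPU and wakes idle
 * CPUs with resched IPIs, so the cost per pair grows with the CPU count.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    enum { ITERS = 10000 };
    long page = sysconf(_SC_PAGESIZE);
    void *buf;
    struct timespec t0, t1;

    if ( posix_memalign(&buf, page, page) != 0 )
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for ( int i = 0; i < ITERS; i++ )
    {
        mlock(buf, 64);      /* pin a small "hypercall argument" ...   */
        munlock(buf, 64);    /* ... and unpin it again straight after  */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%.2f us per mlock/munlock pair\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) /
           (ITERS * 1000.0));

    free(buf);
    return 0;
}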
Keir Fraser
2010-Jan-21 08:44 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
On 21/01/2010 08:16, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

> - Limiting the vCPU# of dom0 is always the easiest one - you may call it a
>   workaround rather than a solution:) It not only reduces the total # of
>   resched IPIs (mlock# * (vCPU#-1)), but also reduces the cost of each
>   handler - because of the spinlock. But the impact is still there, more or
>   less, whenever vCPU# > 1.
>
> - To remove the mlock(), another sharing method is needed between dom0
>   user-space apps and the Xen HV.

A pre-mlock()ed memory page for small (sub-page) hypercalls? Protected with a
semaphore: failure to acquire the semaphore means take the slow path. Have all
hypercallers in libxc launder their data buffers through a new interface that
tries to grab and copy into the pre-allocated buffer.

 -- Keir
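To make the idea concrete, one possible shape for that laundering interface is
sketched below. This is purely illustrative and is not the trial patch: the
hcall_buf_prep()/hcall_buf_release() names, the __thread bounce page and the
in-use flag are assumptions.

/*
 * Illustrative sketch only: a per-thread, pre-mlock()ed bounce page for
 * small (sub-page) hypercall buffers.  Callers "launder" their argument
 * through prep/release; if the page is unavailable or the buffer is too
 * large, prep returns NULL and the caller falls back to the old
 * mlock()-per-call slow path.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BOUNCE_SIZE 4096

static __thread void *bounce_page;    /* pinned once, reused for every call  */
static __thread int   bounce_in_use;  /* the "semaphore": busy -> slow path  */

/* Return a pinned copy of buf, or NULL meaning "take the mlock() slow path". */
void *hcall_buf_prep(void *buf, size_t len)
{
    if ( len > BOUNCE_SIZE || bounce_in_use )
        return NULL;

    if ( bounce_page == NULL )
    {
        if ( posix_memalign(&bounce_page, BOUNCE_SIZE, BOUNCE_SIZE) != 0 )
            return NULL;
        if ( mlock(bounce_page, BOUNCE_SIZE) != 0 )   /* one-time pinning */
        {
            free(bounce_page);
            bounce_page = NULL;
            return NULL;
        }
    }

    bounce_in_use = 1;
    memcpy(bounce_page, buf, len);        /* copy in before the hypercall ... */
    return bounce_page;
}

/* Copy results back to the caller's buffer and release the bounce page. */
void hcall_buf_release(void *buf, size_t len)
{
    memcpy(buf, bounce_page, len);        /* ... and copy out afterwards */
    bounce_in_use = 0;
}

For sub-page buffers the extra pair of memcpy()s should be noise next to an
lru_add_drain_all() across every dom0 vCPU.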
Keir Fraser
2010-Jan-21 09:27 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
On 21/01/2010 08:44, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> - Limiting the vCPU# of dom0 is always the easiest one - you may call it a
>>   workaround rather than a solution:) It not only reduces the total # of
>>   resched IPIs (mlock# * (vCPU#-1)), but also reduces the cost of each
>>   handler - because of the spinlock. But the impact is still there, more or
>>   less, whenever vCPU# > 1.
>>
>> - To remove the mlock(), another sharing method is needed between dom0
>>   user-space apps and the Xen HV.
>
> A pre-mlock()ed memory page for small (sub-page) hypercalls? Protected with
> a semaphore: failure to acquire the semaphore means take the slow path. Have
> all hypercallers in libxc launder their data buffers through a new interface
> that tries to grab and copy into the pre-allocated buffer.

I'll sort out a trial patch for this myself.

Thanks,
Keir
Keir Fraser
2010-Jan-21 11:23 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
On 21/01/2010 09:27, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> A pre-mlock()ed memory page for small (sub-page) hypercalls? Protected with
>> a semaphore: failure to acquire the semaphore means take the slow path.
>> Have all hypercallers in libxc launder their data buffers through a new
>> interface that tries to grab and copy into the pre-allocated buffer.
>
> I'll sort out a trial patch for this myself.

How does the attached patch work for you? It ought to get you the same speedup
as your hack.

 -- Keir
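For illustration, a converted caller would then look something like the sketch
below. It reuses the assumed hcall_buf_prep()/hcall_buf_release() helpers from
the earlier sketch and is not taken from the attached patch:

/*
 * Illustrative caller conversion (assumed helpers, not the attached patch).
 * The fast path copies the argument through the pre-pinned bounce page and
 * skips mlock()/munlock() entirely; if the bounce page is unavailable, the
 * old slow path is used unchanged.
 */
#include <stddef.h>
#include <sys/mman.h>

void *hcall_buf_prep(void *buf, size_t len);      /* from the earlier sketch */
void  hcall_buf_release(void *buf, size_t len);

int issue_small_hypercall(void *arg, size_t len,
                          int (*do_hypercall)(void *pinned_arg))
{
    int rc;
    void *bounce = hcall_buf_prep(arg, len);   /* NULL means: take slow path */

    if ( bounce != NULL )
    {
        rc = do_hypercall(bounce);             /* argument is already pinned */
        hcall_buf_release(arg, len);           /* copy any results back out  */
        return rc;
    }

    /* Slow path: pin the caller's own buffer around the call, as before. */
    if ( mlock(arg, len) != 0 )
        return -1;
    rc = do_hypercall(arg);
    munlock(arg, len);
    return rc;
}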
Yang, Xiaowei
2010-Jan-22 08:07 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
Keir Fraser wrote:
> On 21/01/2010 09:27, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>
>>> A pre-mlock()ed memory page for small (sub-page) hypercalls? Protected
>>> with a semaphore: failure to acquire the semaphore means take the slow
>>> path. Have all hypercallers in libxc launder their data buffers through a
>>> new interface that tries to grab and copy into the pre-allocated buffer.
>> I'll sort out a trial patch for this myself.
>
> How does the attached patch work for you? It ought to get you the same
> speedup as your hack.

The speed should be almost the same, despite the double memcpy.

Some comments on your trial patch:

1. The pre-allocated buffer also needs to be pinned:

diff -r 6b61ef936e69 tools/libxc/xc_private.c
--- a/tools/libxc/xc_private.c  Fri Jan 22 14:50:30 2010 +0800
+++ b/tools/libxc/xc_private.c  Fri Jan 22 15:32:48 2010 +0800
@@ -188,7 +188,10 @@
          ((hcall_buf = calloc(1, sizeof(*hcall_buf))) != NULL) )
         pthread_setspecific(hcall_buf_pkey, hcall_buf);
     if ( hcall_buf->buf == NULL )
+    {
         hcall_buf->buf = xc_memalign(PAGE_SIZE, PAGE_SIZE);
+        lock_pages(hcall_buf->buf, PAGE_SIZE);
+    }
 
     if ( (len < PAGE_SIZE) && hcall_buf && hcall_buf->buf &&
          !hcall_buf->oldbuf )

2. _xc_clean_hcall_buf() needs a more careful NULL pointer check.

3. It modifies 5 out of the 73 hypercall call sites that invoke mlock. Other
hypercalls could turn out to be the bottleneck later?:)

Thanks,
xiaowei
Keir Fraser
2010-Jan-22 08:31 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
On 22/01/2010 08:07, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> How does the attached patch work for you? It ought to get you the same
>> speedup as your hack.
>
> The speed should be almost the same, despite the double memcpy.

Did you actually try it out and confirm that?

> Some comments on your trial patch:
>
> 1. The pre-allocated buffer also needs to be pinned:
>
> diff -r 6b61ef936e69 tools/libxc/xc_private.c
> --- a/tools/libxc/xc_private.c  Fri Jan 22 14:50:30 2010 +0800
> +++ b/tools/libxc/xc_private.c  Fri Jan 22 15:32:48 2010 +0800

Yes, missed that all-important bit!

> 2. _xc_clean_hcall_buf() needs a more careful NULL pointer check.

Not really: free() accepts NULL. But I suppose it would be clearer to put the
free(hcall_buf) inside the if(hcall_buf) block.

> 3. It modifies 5 out of the 73 hypercall call sites that invoke mlock. Other
> hypercalls could turn out to be the bottleneck later?:)

The point of a new interface was to be able to do the callers incrementally. A
bit of care is needed on each one, and most are not and probably never will be
bottlenecks.

 -- Keir
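For illustration, the cleanup shape under discussion might look like the
following - the names and the struct layout are assumed, not the actual patch:

/*
 * Illustrative sketch of the per-thread cleanup (assumed names): free(NULL)
 * is a legal no-op, but grouping all of the teardown under one NULL check is
 * clearer and keeps the munlock() and free() of the pinned buffer together.
 */
#include <stdlib.h>
#include <sys/mman.h>

#define HCALL_BUF_SIZE 4096

struct hcall_buf {      /* stand-in for the patch's per-thread state */
    void *buf;          /* pre-mlock()ed bounce page; may still be NULL */
    void *oldbuf;
};

static void _xc_clean_hcall_buf_sketch(void *arg)
{
    struct hcall_buf *hcall_buf = arg;

    if ( hcall_buf )
    {
        if ( hcall_buf->buf )
        {
            munlock(hcall_buf->buf, HCALL_BUF_SIZE);  /* undo the one-time pin */
            free(hcall_buf->buf);
        }
        free(hcall_buf);
    }
}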
Yang, Xiaowei
2010-Jan-22 08:48 UTC
Re: [Xen-devel] Root cause of the issue that HVM guest boots slowly with pvops dom0
Keir Fraser wrote:
> On 22/01/2010 08:07, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
>
>>> How does the attached patch work for you? It ought to get you the same
>>> speedup as your hack.
>> The speed should be almost the same, despite the double memcpy.
>
> Did you actually try it out and confirm that?

Yes, I tried it out, and there is no obvious speed difference between your
patch (with my comment 1 included) and the hack.

>> Some comments on your trial patch:
>>
>> 1. The pre-allocated buffer also needs to be pinned:
>>
>> diff -r 6b61ef936e69 tools/libxc/xc_private.c
>> --- a/tools/libxc/xc_private.c  Fri Jan 22 14:50:30 2010 +0800
>> +++ b/tools/libxc/xc_private.c  Fri Jan 22 15:32:48 2010 +0800
>
> Yes, missed that all-important bit!
>
>> 2. _xc_clean_hcall_buf() needs a more careful NULL pointer check.
>
> Not really: free() accepts NULL. But I suppose it would be clearer to put
> the free(hcall_buf) inside the if(hcall_buf) block.
>
>> 3. It modifies 5 out of the 73 hypercall call sites that invoke mlock.
>> Other hypercalls could turn out to be the bottleneck later?:)
>
> The point of a new interface was to be able to do the callers incrementally.
> A bit of care is needed on each one, and most are not and probably never
> will be bottlenecks.

Agree. Anyway, when we hit other pvops performance issues later, let's come
back and check this aspect again.

Thanks,
xiaowei