I am confused by a problem: I have a blade with 64 physical CPUs and 64G of
physical RAM, and defined only one VM with 1 CPU and 40G RAM. The first time
I started the VM it took just 3s, but the second start took 30s. By adding
log prints I located a place in the hypervisor that costs most of the time,
about 98% of the whole start-up time:

xen/common/page_alloc.c

/* Allocate 2^@order contiguous pages. */
static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int node, unsigned int order, unsigned int memflags)
{
    ...
    if ( pg[i].u.free.need_tlbflush )
    {
        /* Add in extra CPUs that need flushing because of this page. */
        cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
        tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
        cpus_or(mask, mask, extra_cpus_mask);
    }
    ...
}

1. On the first start most pages have need_tlbflush unset, so the cost is
small; on the second start most of the RAM has already been used, so
need_tlbflush is true for most pages, and the cost is high.

2. But I repeated the same experiment on another blade with 16 physical CPUs
and 64G physical RAM, and there the second start took only 3s. Tracing the two
second starts, I found that the number of times the
pg[i].u.free.need_tlbflush branch is entered is the same; it is the number of
CPUs that makes the difference.

3. The code I pasted computes a mask that decides which CPUs' TLBs need to be
flushed. I traced the values during the start-up period:

cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
  // after: mask=0, cpu_online_map=0xFFFFFFFFFFFFFFFF,
  //        extra_cpus_mask=0xFFFFFFFFFFFFFFFF
tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
  // after: mask=0, extra_cpus_mask=0
cpus_or(mask, mask, extra_cpus_mask);
  // after: mask=0, extra_cpus_mask=0

Every time it starts with mask=0 and ends with mask=0, so no TLB flush is
actually issued; during the start-up process the whole computation seems
pointless.

4. The problem is that this seemingly pointless code is very time-consuming:
the more CPUs there are, the more time it costs. It takes 30s on my blade
with 64 CPUs, which could use some solution to improve the efficiency.

tupeng
Not sure I understand. With 16 CPUs you find domain startup takes 3s always.
With 64 CPUs you find it takes 3s the first time, then 30s in future? And
this is due to the cost of tlbflush_filter() (not actual TLB flushes, because
you always end up with mask=0)? If tlbflush_filter() were that expensive I'd
expect the 16-CPU case to have the slowdown after the first domain startup,
too.

 -- Keir

On 11/10/2012 16:18, "tupeng212" <tupeng212@gmail.com> wrote:
> [full problem report quoted above; snipped]
> With 16 CPUs you find domain startup takes 3s always.
No:
  With 16 CPUs: first start 0.3s, second start 3s.
  With 64 CPUs: first start 3s, second start 30s.

> With 64 CPUs you find it takes 3s first time, then 30s in future?
Yes.

> And this is due to cost of tlbflush_filter() (not actual TLB flushes,
> because you always end up with mask=0)?
Yes, most of the time is spent in tlbflush_filter() inside that branch. The
TLB flush itself is really very fast, it just sends an IPI to the related
CPUs. During the start-up allocations the computation always ends up with
mask=0, which seems needless.

> If tlbflush_filter() were that expensive I'd expect the 16-CPU case to have
> slowdown after the first domain startup, too.
Yes, you are right, the 16-CPU case also slows down after its first startup.
The reason is clear, and I have discussed it with others: there is no doubt
that tlbflush_filter() is inefficient here, but I don't know how to improve
it.

I also used xen oprofile and found the following two functions are called
most frequently:

  alloc_heap_pages: 40%
  __next_cpu:       20%
  everything else:   0.x%

The call chain is:

  alloc_heap_pages -> tlbflush_filter -> for_each_cpu_mask -> next_cpu -> __next_cpu

so it seems that walking across the CPUs is what is expensive.
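[For context: the per-CPU scaling tupeng is profiling is visible in the shape
of tlbflush_filter() itself. The following is an approximate sketch of the
macro as it appears in Xen of this era (include/asm-x86/flushtlb.h),
reproduced from memory, so details may differ:]

#define tlbflush_filter(mask, page_timestamp)                           \
do {                                                                    \
    unsigned int cpu;                                                   \
    /* One iteration per set bit: with 64 online CPUs and an initially  \
     * full mask, this runs 64 times for every page examined. */        \
    for_each_cpu_mask ( cpu, mask )                                     \
        if ( !NEED_FLUSH(per_cpu(tlbflush_time, cpu), page_timestamp) ) \
            cpu_clear(cpu, mask);                                       \
} while ( 0 )

The per-page cost is therefore O(number of online CPUs), which is why the
overhead grows with the CPU count and why __next_cpu shows up so prominently
in the profile: for_each_cpu_mask is built on next_cpu()/__next_cpu().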
On Fri, 12 Oct 2012 20:24:16 +0800 tupeng212 <tupeng212@gmail.com> wrote:
> With 16 CPUs: first start 0.3s, second start 3s.
> With 64 CPUs: first start 3s, second start 30s.

I had seen this a while ago; it was always page scrubbing. That algorithm was
changed IIRC, so not sure if it could still be the cause.

My 2 cents, thanks
Mukesh
On 12/10/2012 23:39, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> I had seen this a while ago; it was always page scrubbing. That algorithm
> was changed IIRC, so not sure if it could still be the cause.

Tupeng,

If the tlbflush_filter() and cpumask_or() lines are commented out from the
if(need_tlbflush) block in alloc_heap_pages(), what are the domain creation
times like then?

By the way, it looks like you are not using xen-unstable or xen-4.2; can you
try with one of these later versions of Xen?

 -- Keir
> What if you replace the tlbflush_filter() call with
> cpus_clear(&extra_cpus_mask)?
You mean just clear it? That seems a little crude... or would you like to do
it in some other place?

> I assume you see lots of looping in one of those two functions, but only
> single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?
In populate_physmap, all the allocations are 2M in size:

static void populate_physmap(struct memop_args *a)
{
    for ( i = a->nr_done; i < a->nr_extents; i++ )
    {
        /* a->extent_order = 9, so always 2M-sized allocations */
        page = alloc_domheap_pages(d, a->extent_order, a->memflags);
            /* -> alloc_heap_pages */
    }
    /* you mean move that block and the TLB flush here, to avoid the
       per-page loop? */
}

tupeng212

From: Keir Fraser <keir.xen@gmail.com>
Date: 2012-10-13 14:30
To: tupeng212 <tupeng212@gmail.com>
Subject: Re: [Xen-devel] alloc_heap_pages is low efficient with more CPUs

What if you replace the tlbflush_filter() call with
cpus_clear(&extra_cpus_mask)? Seems a bit silly to do, but I'd like to know
how much a few cpumask operations per page is costing (most are of course
much quicker than tlbflush_filter, as they operate on 64 bits per iteration
rather than one bit per iteration).

If that is suitably fast, I think we can have a go at fixing this by pulling
the TLB-flush logic out of alloc_heap_pages() and into the loops in
increase_reservation() and populate_physmap() in common/memory.c. I assume
you see lots of looping in one of those two functions, but only
single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?

 -- Keir

On 13/10/2012 07:21, "tupeng212" <tupeng212@gmail.com> wrote:

>> If the tlbflush_filter() and cpumask_or() lines are commented out from the
>> if(need_tlbflush) block in alloc_heap_pages(), what are the domain creation
>> times like then?
> You mean removing that code from alloc_heap_pages and trying it again? I
> didn't do exactly that, but I measured the whole time spent in the
> if(need_tlbflush) block with time1=NOW() ... block ... time2=NOW(),
> time=time2-time1 (NOW() is in ns; 1s = 10^9 ns). It accounts for most of
> the total time: the whole start takes 30s, and this block alone takes about
> 29s.
>
>> By the way it looks like you are not using xen-unstable or xen-4.2, can you
>> try with one of these later versions of Xen?
> Fortunately other groups here have to use xen-4.2, and we have repeated the
> experiment on that source tree too. It changed nothing; the second start is
> still very slow.
>
> tupeng
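[To make the suggestion above concrete, here is a rough, hypothetical sketch
of what pulling the TLB-flush logic out of alloc_heap_pages() and into the
populate_physmap() loop could look like. The helper
alloc_domheap_pages_no_flush() and its extra arguments are invented for
illustration; they are not existing Xen APIs:]

/* Hypothetical sketch only -- not an actual Xen interface. */
static void populate_physmap(struct memop_args *a)
{
    struct page_info *page;
    unsigned long i;
    uint32_t newest_stamp = 0;   /* newest tlbflush_timestamp in this batch */
    bool_t need_flush = 0;

    for ( i = a->nr_done; i < a->nr_extents; i++ )
    {
        /* Assumed allocator variant: skips the per-page tlbflush_filter()
         * work and instead reports the newest timestamp it saw. */
        page = alloc_domheap_pages_no_flush(a->domain, a->extent_order,
                                            a->memflags, &newest_stamp,
                                            &need_flush);
        /* ... assign the extent to the guest exactly as before ... */
    }

    if ( need_flush )
    {
        /* One filter pass and at most one IPI for the whole loop. */
        cpumask_t mask = cpu_online_map;
        tlbflush_filter(mask, newest_stamp);
        if ( !cpus_empty(mask) )
            flush_tlb_mask(&mask);
    }
}

[Filtering against only the newest timestamp is conservative (it can only add
CPUs to the flush mask, never miss one that is needed), so correctness is
preserved while the O(nr_cpus) work happens once per hypercall rather than
once per page.]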
/* Allocate 2^@order contiguous pages. */
static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int node, unsigned int order, unsigned int memflags)
{
    ...
    cpus_clear(mask);
    for ( i = 0; i < (1 << order); i++ )
    {
        if ( pg[i].u.free.need_tlbflush )
        {
            /* Add in extra CPUs that need flushing because of this page. */
            cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
            tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
            cpus_or(mask, mask, extra_cpus_mask);
        }
    }
    if ( unlikely(!cpus_empty(mask)) )
    {
        perfc_incr(need_flush_tlb_flush);
        flush_tlb_mask(&mask);
    }
    ...
    return pg;
}

Is this equivalent to looking only at the last page,
pg[(1 << order) - 1].u.free.need_tlbflush, computing the mask from it, and
flushing the TLBs in that mask?
If the allocations are 2M size, we can do better quite easily I think. Please
try the attached patch.

 -- Keir

On 13/10/2012 07:46, "tupeng212" <tupeng212@gmail.com> wrote:
> In populate_physmap, all the allocations are 2M in size
> (a->extent_order = 9 in every call).
> [rest of quoted exchange snipped]
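[The attached patch itself is not reproduced in this archive. A plausible
sketch of a change along the lines discussed, kept inside alloc_heap_pages()
(unlike the caller-side sketch earlier), is to remember only the newest
timestamp among the pages that need flushing and run tlbflush_filter() once
per allocation rather than once per page; this would be consistent with the
later remarks that the 4k fallback path gains nothing while 2M and 1G
allocations gain a lot. Sketch only, with a naive timestamp comparison that
ignores wraparound, which real code would have to handle:]

static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int node, unsigned int order, unsigned int memflags)
{
    cpumask_t mask;
    uint32_t newest_stamp = 0;
    bool_t need_tlbflush = 0;
    unsigned long i;
    ...
    for ( i = 0; i < (1 << order); i++ )
    {
        if ( pg[i].u.free.need_tlbflush )
        {
            /* Just remember the newest timestamp; no per-page cpumask work. */
            need_tlbflush = 1;
            if ( pg[i].tlbflush_timestamp > newest_stamp )
                newest_stamp = pg[i].tlbflush_timestamp;
        }
        /* ... existing per-page initialisation ... */
    }

    if ( need_tlbflush )
    {
        mask = cpu_online_map;
        tlbflush_filter(mask, newest_stamp);    /* one pass per allocation */
        if ( !cpus_empty(mask) )
        {
            perfc_incr(need_flush_tlb_flush);
            flush_tlb_mask(&mask);
        }
    }

    return pg;
}

[An order-9 (2M) allocation then does one O(nr_cpus) filter pass instead of
512, and an order-18 (1G) allocation one instead of 262144, while order-0
allocations behave exactly as before.]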
On 13/10/2012 08:20, "tupeng212" <tupeng212@gmail.com> wrote:
> Is this equivalent to looking only at the last page,
> pg[(1 << order) - 1].u.free.need_tlbflush, computing the mask from it, and
> flushing the TLBs in that mask?

No!

 -- Keir
> Please try the attached patch.
Great, you have done a good job: the needless time drops dramatically, to
about 1s.

If nobody has a better proposal, I suggest you commit this patch.
On 15/10/2012 14:27, "tupeng212" <tupeng212@gmail.com> wrote:
> Great, you have done a good job: the needless time drops dramatically, to
> about 1s.
>
> If nobody has a better proposal, I suggest you commit this patch.

I have applied it to xen-unstable. It probably makes sense to put it in 4.1
and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).

 -- Keir
>>> On 15.10.12 at 17:45, Keir Fraser <keir@xen.org> wrote:
> I have applied it to xen-unstable. It probably makes sense to put it in 4.1
> and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).

Will do, but do you have an explanation of how this simple, memory-only
operation (64 CPUs isn't that many) has this dramatic an effect on
performance? Are we bouncing cache lines this badly? If so, which one(s)? I
don't see what would be written frequently from multiple CPUs here -
tlbflush_filter() itself only reads global variables, but never writes them.

Jan
On 16/10/2012 08:51, "Jan Beulich" <JBeulich@suse.com> wrote:
> Will do, but do you have an explanation of how this simple, memory-only
> operation (64 CPUs isn't that many) has this dramatic an effect on
> performance? Are we bouncing cache lines this badly?

It's just the small factors multiplying up. A 40G domain is 10M page
allocations, each of which does 64x per-cpu cpumask bitops and timestamp
compares. That's getting on for a billion (10^9) trips round
tlbflush_filter()'s loop. Each iteration need only take a few CPU cycles for
the effect to actually be noticeable. If the stuff being touched were not in
the cache (which of course it is, since this path is so hot and not touching
much memory) it would probably take an hour to create the domain!

 -- Keir
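[Spelling that arithmetic out with illustrative, assumed figures (the cycle
count and clock speed below are not measurements from this thread):]

  40 GiB domain / 4 KiB per page      ~= 10.5 million pages allocated
  10.5M pages x 64 online CPUs        ~= 670 million tlbflush_filter iterations
  670M iterations x 10 cycles (say)   ~= 6.7e9 cycles
  6.7e9 cycles / 2.5 GHz (say)        ~= 2.7 seconds

[Even at an optimistic 10 cycles per iteration the filter loop alone costs
seconds; add the per-page cpus_andnot()/cpus_or() work and a less favourable
per-iteration cost, and the observed ~29s for this block is within reach.]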
>>> On 16.10.12 at 10:03, Keir Fraser <keir.xen@gmail.com> wrote:
> It's just the small factors multiplying up. A 40G domain is 10M page
> allocations, each of which does 64x per-cpu cpumask bitops and timestamp
> compares. That's getting on for a billion (10^9) trips round
> tlbflush_filter()'s loop.

Which means that your TODO remark in the change description is more than a
nice-to-have, since
- in the fallback case using 4k allocations your change doesn't win anything
- even larger domains would still suffer from this very problem (albeit in
  4.2 we try 1G allocations first, so your change should have an even more
  visible effect there, as long as falling back to smaller chunks isn't
  needed)

One question of course is whether, for sufficiently large allocation requests
on sufficiently large systems, trying to avoid the TLB flush is worth it at
all, considering that determining where to skip the flushes costs so much
more than the flushes themselves.

Jan
On 16/10/2012 09:28, "Jan Beulich" <JBeulich@suse.com> wrote:
> Which means that your TODO remark in the change description is more than a
> nice-to-have [...]

Agreed. This is a simple starting point which is also easy to backport,
though, and solves the immediate observed problem.

> One question of course is whether, for sufficiently large allocation
> requests on sufficiently large systems, trying to avoid the TLB flush is
> worth it at all, considering that determining where to skip the flushes
> costs so much more than the flushes themselves.

It might be a simpler option. I'll see how doing it the filtering way looks.

 -- Keir