Hi George, Everyone,

While reworking a bit my NUMA aware scheduling patches I figured I'm not sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of course) does.

Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's call the vcpu that just woke up v_W, and the vcpu that is currently running on the cpu where that happens v_C. Let's also call the CPU where all this is happening P.

As far as I've understood, in __runq_tickle(), we:

static inline void
__runq_tickle(unsigned int cpu, struct csched_vcpu *new)
{
    [...]
    cpumask_t mask;

    cpumask_clear(&mask);

    /* If strictly higher priority than current VCPU, signal the CPU */
    if ( new->pri > cur->pri )
    {
        [...]
        cpumask_set_cpu(cpu, &mask);
    }

--> Make sure we put the CPU we are on (P) in 'mask', in case the woken
--> vcpu (v_W) has higher priority than the currently running one (v_C).

    /*
     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
     * let them know there is runnable work in the system...
     */
    if ( cur->pri > CSCHED_PRI_IDLE )
    {
        if ( cpumask_empty(prv->idlers) )
            [...]
        else
        {
            cpumask_t idle_mask;

            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpumask_empty(&idle_mask) )
            {
                [...]
                if ( opt_tickle_one_idle )
                {
                    [...]
                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
                }
                else
                    cpumask_or(&mask, &mask, &idle_mask);
            }
            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);

--> Make sure we include one or more (depending on opt_tickle_one_idle)
--> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.

        }
    }

    /* Send scheduler interrupts to designated CPUs */
    if ( !cpumask_empty(&mask) )
        cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);

--> Ask all the CPUs in 'mask' to reschedule. That would mean all the
--> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
--> effect will be that all/some of the CPUs v_W has affinity with
--> _and_ (let's assume so) P will go through scheduling as quickly as
--> possible.
}

Is the above right?

If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?

I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.

So, if that is true, it seems we leave some room for sub-optimal CPU utilization, as well as some non-work-conserving windows. Of course, it is very hard to tell how frequently this actually happens.

As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
Actually, going all the way round, i.e., tickling only CPUs with affinity with v_C (in this case) looks more reasonable, under the assumption that v_W is going to be scheduled on P soon enough.
In general, that would mean tickling the CPUs in the affinity mask of the vcpu with the lower priority, but I've not checked how that would interact with the rest of the scheduling logic yet.

If I got things wrong and/or there's something I missed or overlooked, please accept my apologies. :-)

Thanks and Regards,
Dario
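P.S. Just to make the brute-force option above a bit more concrete, all I have in mind is something like the following, on top of the second block of __runq_tickle() quoted above. It is only an untested sketch reusing the names from that code ('cur' being v_C, 'new' being v_W), not something I have actually tried:

    /*
     * Sketch only: consider idlers from the union of v_W's and v_C's
     * affinity masks, so that whichever of the two ends up waiting in
     * the runqueue has a chance of being picked up quickly.  The obvious
     * downside is that many more pcpus may be asked to reschedule.
     */
    cpumask_t both_affinity, idle_mask;

    cpumask_or(&both_affinity, new->vcpu->cpu_affinity, cur->vcpu->cpu_affinity);
    cpumask_and(&idle_mask, prv->idlers, &both_affinity);

    if ( !cpumask_empty(&idle_mask) )
        cpumask_or(&mask, &mask, &idle_mask);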
On 23/10/12 14:34, Dario Faggioli wrote:
> Hi George, Everyone,
>
> While reworking a bit my NUMA aware scheduling patches I figured I'm not sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of course) does.
>
> Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's call the vcpu that just woke up v_W, and the vcpu that is currently running on the cpu where that happens v_C. Let's also call the CPU where all this is happening P.
>
> As far as I've understood, in __runq_tickle(), we:
>
> static inline void
> __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
> {
>     [...]
>     cpumask_t mask;
>
>     cpumask_clear(&mask);
>
>     /* If strictly higher priority than current VCPU, signal the CPU */
>     if ( new->pri > cur->pri )
>     {
>         [...]
>         cpumask_set_cpu(cpu, &mask);
>     }
>
> --> Make sure we put the CPU we are on (P) in 'mask', in case the woken
> --> vcpu (v_W) has higher priority than the currently running one (v_C).
>
>     /*
>      * If this CPU has at least two runnable VCPUs, we tickle any idlers to
>      * let them know there is runnable work in the system...
>      */
>     if ( cur->pri > CSCHED_PRI_IDLE )
>     {
>         if ( cpumask_empty(prv->idlers) )
>             [...]
>         else
>         {
>             cpumask_t idle_mask;
>
>             cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>             if ( !cpumask_empty(&idle_mask) )
>             {
>                 [...]
>                 if ( opt_tickle_one_idle )
>                 {
>                     [...]
>                     cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
>                 }
>                 else
>                     cpumask_or(&mask, &mask, &idle_mask);
>             }
>             cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
>
> --> Make sure we include one or more (depending on opt_tickle_one_idle)
> --> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.
>
>         }
>     }
>
>     /* Send scheduler interrupts to designated CPUs */
>     if ( !cpumask_empty(&mask) )
>         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>
> --> Ask all the CPUs in 'mask' to reschedule. That would mean all the
> --> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
> --> effect will be that all/some of the CPUs v_W has affinity with
> --> _and_ (let's assume so) P will go through scheduling as quickly as
> --> possible.
> }
>
> Is the above right?

It looks right to me.

> If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?
>
> I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
> Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.

Yes -- the two clauses look a bit like they were conceived independently, and maybe no one thought about how they might interact.

> So, if that is true, it seems we leave some room for sub-optimal CPU utilization, as well as some non-work-conserving windows. Of course, it is very hard to tell how frequently this actually happens.
>
> As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...

Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.

Of course, that's only necessary if:
 * v_C is lower priority than v_W
 * There are no idlers that intersect both v_C's and v_W's affinity masks.

It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.

 -George
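P.S. To sketch the kind of check I mean, reusing the names from the __runq_tickle() code you quoted (completely untested, and only meant to illustrate the idea; just picking the first suitable idler is a simplification of the last_tickle_cpu logic):

    /*
     * Sketch only: waking an extra idler from v_C's (i.e. cur's) affinity
     * is worth considering only if v_C is about to be preempted and no
     * single idler could serve both vcpus anyway.
     */
    cpumask_t common_idlers, cur_idlers;

    cpumask_and(&common_idlers, new->vcpu->cpu_affinity, cur->vcpu->cpu_affinity);
    cpumask_and(&common_idlers, &common_idlers, prv->idlers);

    if ( new->pri > cur->pri && cpumask_empty(&common_idlers) )
    {
        /* No idler can run both: also tickle one idler from v_C's mask. */
        cpumask_and(&cur_idlers, prv->idlers, cur->vcpu->cpu_affinity);
        if ( !cpumask_empty(&cur_idlers) )
            cpumask_set_cpu(cpumask_first(&cur_idlers), &mask);
    }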
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?
> >
> > I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
> > Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.
>
> Yes -- the two clauses look a bit like they were conceived independently, and maybe no one thought about how they might interact.
>
Yep, it looked the same to me.

> > As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>
> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>
Yes, the default is to tickle only 1 idler. However, as we offer that as a command line option, I think we should consider what could happen if one disables it.

I double checked this on Linux and, mutatis mutandis, they sort of go the way I was suggesting, i.e., "pinging" the CPUs with affinity to the task that will likely stay in the runq rather than being picked up locally. However, there are of course big differences between the two schedulers and different assumptions being made, thus I'm not really sure that is the best thing to do for us.

So, yes, it probably makes sense to think about something clever to try to involve CPUs from both the masks without causing too much overhead. I'll put that in my TODO list. :-)

> Of course, that's only necessary if:
>  * v_C is lower priority than v_W
>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>
Sure, I said that in the first place, and I don't think checking for that is too hard... Just a couple more bitmap ops. But again, I'll give it some thinking.

> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>
Definitely. Before trying to "fix" it, I'm interested in finding out what I'd actually be fixing. Will do that.

Thanks for taking the time to check and answer this! :-)
Dario
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>
> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>
> Of course, that's only necessary if:
>  * v_C is lower priority than v_W
>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>
> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>
Ok, I think I managed to reproduce this. Look at the following trace, considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no affinity at all (its vcpus can run everywhere):

  166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
 ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
  .
 ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
 ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
 ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
  .
  166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
  166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
  .
  .
  .
 ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
 ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
 ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
 ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
  166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
  166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running

So, if I'm not reading the trace wrong, when d0v7 wakes up (very first event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are idle, none of them gets tickled and comes to pick d51v1 up, which then has to wait in the runq until d0v7 goes back to sleep.

I suspect this could be because, at d0v7 wakeup time, we try to tickle some pcpu which is in d0v7's affinity, but not in d51v1's (as in the second 'if() {}' block in __runq_tickle() we only care about new->vcpu->cpu_affinity, and in this case new is d0v7).

I know, looking at the timestamps it doesn't look like a big deal in this case, and I'm still working on producing numbers that can better show whether or not this is a real problem.

Anyway, and independently from the results of these tests, why do I care so much?

Well, if you substitute the concept of "vcpu-affinity" with "node-affinity" above (which is what I am doing in my NUMA aware scheduling patches) you'll see why this is bothering me quite a bit. In fact, in that case, waking up a random pcpu with which d0v7 has node-affinity, while d51v1 has not, would cause d51v1 to be pulled by that cpu (since node-affinity is only a preference)!

So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads). :-(

Regards,
Dario
On 15/11/12 12:10, Dario Faggioli wrote:
> On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
>>> As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>>
>> Of course, that's only necessary if:
>>  * v_C is lower priority than v_W
>>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>>
>> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>>
> Ok, I think I managed to reproduce this. Look at the following trace, considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no affinity at all (its vcpus can run everywhere):
>
>   166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
>  ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
>   .
>  ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
>  ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
>  ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
>   .
>   166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
>   166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
>   .
>   .
>   .
>  ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
>  ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
>  ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
>  ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
>   166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
>   166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running
>
> So, if I'm not reading the trace wrong, when d0v7 wakes up (very first event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are idle, none of them gets tickled and comes to pick d51v1 up, which then has to wait in the runq until d0v7 goes back to sleep.
>
> I suspect this could be because, at d0v7 wakeup time, we try to tickle some pcpu which is in d0v7's affinity, but not in d51v1's (as in the second 'if() {}' block in __runq_tickle() we only care about new->vcpu->cpu_affinity, and in this case new is d0v7).
>
> I know, looking at the timestamps it doesn't look like a big deal in this case, and I'm still working on producing numbers that can better show whether or not this is a real problem.
>
> Anyway, and independently from the results of these tests, why do I care so much?
>
> Well, if you substitute the concept of "vcpu-affinity" with "node-affinity" above (which is what I am doing in my NUMA aware scheduling patches) you'll see why this is bothering me quite a bit. In fact, in that case, waking up a random pcpu with which d0v7 has node-affinity, while d51v1 has not, would cause d51v1 to be pulled by that cpu (since node-affinity is only a preference)!
>
> So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
> In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads).

Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.

What do you think?

 -George
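P.S. In __runq_tickle() terms, and again reusing the names from the code quoted at the start of the thread, I am thinking of something along these lines. It is only an untested sketch, not an actual patch (and how it should interact with the rest of that block, e.g. the final cpumask_and(), would need checking):

    /*
     * Sketch only: scan the idlers using the affinity of whichever vcpu is
     * likely to be left waiting in the runqueue on this cpu.  If 'new' wins
     * the priority comparison, 'cur' is the one that will need another
     * pcpu, and vice versa.
     */
    const cpumask_t *runq_affinity = (new->pri > cur->pri) ?
        cur->vcpu->cpu_affinity : new->vcpu->cpu_affinity;
    cpumask_t idle_mask;

    cpumask_and(&idle_mask, prv->idlers, runq_affinity);
    if ( !cpumask_empty(&idle_mask) )
    {
        if ( opt_tickle_one_idle )
            cpumask_set_cpu(cpumask_first(&idle_mask), &mask);
        else
            cpumask_or(&mask, &mask, &idle_mask);
    }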
On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
> > In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads).
>
> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>
EhEh, if you check the whole thread, you'll find evidence that I thought this to be a good idea from the very beginning.

I already have a patch for that; just let me see if the numbers (with and without NUMA scheduling) are aligned with impressions, and then I'll send everything together.

Thanks for your time,
Dario
(Cc-ing David, as it looks like he uses xenalyze quite a bit, and I'm seeking any advice on how to squeeze data from there too :-P)

On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>
Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.

In fact, although I don't yet have the numbers for the NUMA-aware scheduling case (which is what originated all this! :-D), comparing 'upstream' and 'patched' (namely, 'upstream' plus the two attached patches) I can spot some perf regressions. :-(

Here are the results of running some benchmarks on 2, 6 and 10 VMs. Each VM has 2 VCPUs, and they run and execute the benchmarks concurrently on a 16 CPUs host. (Each test is repeated 3 times, and avg +/- stddev is what is reported.) Also, the VCPUs were statically pinned to the host's PCPUs. As already said, numbers for no-pinning and NUMA-scheduling will follow.

+ sysbench --test=memory (throughput, higher is better)
 #VMs |         upstream          |         patched
   2  | 550.97667 +/- 2.3512355   | 540.185   +/- 21.416892
   6  | 443.15    +/- 5.7471797   | 442.66389 +/- 2.1071732
  10  | 313.89233 +/- 1.3237493   | 305.69567 +/- 0.3279853

+ sysbench --test=cpu (time, lower is better)
 #VMs |         upstream          |         patched
   2  | 47.8211   +/- 0.0215503   | 47.816117 +/- 0.0174079
   6  | 62.689122 +/- 0.0877172   | 62.789883 +/- 0.1892171
  10  | 90.321097 +/- 1.4803867   | 91.197767 +/- 0.1032667

+ specjbb2005 (throughput, higher is better)
 #VMs |         upstream          |         patched
   2  | 49591.057 +/- 952.93384   | 50008.28  +/- 1502.4863
   6  | 33538.247 +/- 1089.2115   | 33647.873 +/- 1007.3538
  10  | 21927.87  +/- 831.88742   | 21869.654 +/- 578.236

So, as you can easily see, the numbers are very similar, with cases where the patches produce some slight performance reduction, while I was expecting the opposite, i.e., similar but a little bit better with the patches.

For most of the runs of all the benchmarks I have the full traces (although only for SCHED-* events, IIRC), so I can investigate more. It's a huge amount of data, so it's really hard to make sense of it, and any advice and direction on that would be much appreciated.

For instance, looking at one of the runs of sysbench-memory, here's what I found. With 10 VMs, the memory throughput reported by one of the VMs during one of the runs is as follows:

 upstream: 315.68 MB/s
 patched:  306.69 MB/s

I then went through the traces and I found out that the patched case lasted longer (for transferring the same amount of memory, hence the lower throughput), but with the following runstate-related results:

 upstream: running  for 73.67% of the time
           runnable for 24.94% of the time
 patched:  running  for 74.57% of the time
           runnable for 24.10% of the time

And that is consistent with other random instances I checked.

So, it looks like the patches are, after all, doing their job in increasing (at least a little) the running time, at the expense of the runnable time, of the various VCPUs, but the benefit of that is all being eaten by some other effect --to the point that sometimes things go even worse-- that I'm not able to identify... For now! :-P

Any idea about what's going on and what I should check to better figure that out?
Thanks a lot and Regards,
Dario
On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
> >
> Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.
>
I've just come to think that this approach, although more, say, correct, could have a bad impact on caches and locality in general.

In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C is running, with prio(N)>prio(C).

What upstream does is ask #x and all the idlers that can execute N to reschedule. Doing both is, I think, wrong, as there's the chance of ending up with N being scheduled on #x and C being runnable but not running (in #x's runqueue) even if there are idle cpus that could run it, as they're not poked (as already and repeatedly said).

What the patches do in this case (remember prio(N)>prio(C)) is ask #x and all the idlers that can run C to reschedule, the effect being that N will likely run on #x, after a context switch, and C will run somewhere else, after a migration, potentially wasting its cache-hotness (it is running, after all!).

It looks like we can do better... Something like the below:
 + if there are no idlers where N can run, ask #x and the idlers where C can run to reschedule (exactly what the patches do, although they do that _unconditionally_), as there isn't anything else we can do to try to make sure they both will run;
 + if *there*are* idlers where N can run, _do_not_ ask #x to reschedule and only poke them to come pick N up. In fact, in this case, it is not necessary to send C away for having both the vcpus running, and it seems better to have N experience the migration as, since it's waking up, it's more likely for it than for C to be cache-cold.

I'll run the benchmarks with this variant as soon as the one that I'm running right now finishes... In the meanwhile, any thoughts?

Thanks and Regards,
Dario
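P.S. Translated into the __runq_tickle() names used earlier in the thread (N being 'new', C being 'cur', #x being 'cpu'), and assuming we are in the prio(N)>prio(C) case, the logic above would look roughly like this. Again, only an untested sketch, not the actual patches I will be benchmarking; the existing tail of __runq_tickle() would still do the SCHEDULE_SOFTIRQ raise on 'mask':

    /*
     * Sketch only: prefer migrating the waking vcpu N to an idler in its
     * own affinity; preempt C on this cpu (and poke C's idlers) only if
     * no such idler exists.
     */
    cpumask_t idle_new, idle_cur;

    cpumask_and(&idle_new, prv->idlers, new->vcpu->cpu_affinity);
    cpumask_and(&idle_cur, prv->idlers, cur->vcpu->cpu_affinity);

    if ( !cpumask_empty(&idle_new) )
        /* Some idler can pick N up directly: let C keep running here. */
        cpumask_or(&mask, &mask, &idle_new);
    else
    {
        /* Nothing idle for N: preempt C here and poke C's idlers instead. */
        cpumask_set_cpu(cpu, &mask);
        cpumask_or(&mask, &mask, &idle_cur);
    }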
On 16/11/12 12:00, Dario Faggioli wrote:
> On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
>> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
>>> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>>>
>> Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.
>>
> I've just come to think that this approach, although more, say, correct, could have a bad impact on caches and locality in general.

One thing that xenalyze will already tell you is statistics on how a vcpu migrates over pcpus. For example:

  cpu affinity:      242 7009916158 {621089444|5643356292|19752063006}
    [0]:              15 6940230676 {400952|5643531152|27013831272}
    [1]:              19 6366861827 {117462|5031404806|19751998114}
    [2]:              31 6888557514 {1410800684|5643015454|19752100009}
    [3]:              18 7790887470 {109764|5920027975|25395539566}
  ...

The general format is: "$number $average_cycles {5th percentile|50th percentile|95th percentile}". The first line includes samples from *all* cpus (i.e., it migrated a total of 242 times, averaging 7 billion cycles each time); the subsequent lines show statistics for specific pcpus (i.e., it had 15 sessions on pcpu 0, averaging 6.94 billion cycles, &c). You should be able to use this to do a basic verification of your hypothesis that vcpus are migrating more often.

> In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C is running, with prio(N)>prio(C).
>
> What upstream does is ask #x and all the idlers that can execute N to reschedule. Doing both is, I think, wrong, as there's the chance of ending up with N being scheduled on #x and C being runnable but not running (in #x's runqueue) even if there are idle cpus that could run it, as they're not poked (as already and repeatedly said).
>
> What the patches do in this case (remember prio(N)>prio(C)) is ask #x and all the idlers that can run C to reschedule, the effect being that N will likely run on #x, after a context switch, and C will run somewhere else, after a migration, potentially wasting its cache-hotness (it is running, after all!).
>
> It looks like we can do better... Something like the below:
>  + if there are no idlers where N can run, ask #x and the idlers where C can run to reschedule (exactly what the patches do, although they do that _unconditionally_), as there isn't anything else we can do to try to make sure they both will run;
>  + if *there*are* idlers where N can run, _do_not_ ask #x to reschedule and only poke them to come pick N up. In fact, in this case, it is not necessary to send C away for having both the vcpus running, and it seems better to have N experience the migration as, since it's waking up, it's more likely for it than for C to be cache-cold.

I think that makes a lot of sense -- look forward to seeing the results. :-)

There may be some other tricks we could look at. For example, if N and C are both going to do a significant chunk of computation, then this strategy will work best. But suppose that C does a significant chunk of computation, but N is only going to run for a few hundred microseconds and then go to sleep again?
In that case, it may be easier to just run N on the current processor and not bother with IPIs and such; C will run again in a few microseconds. Conversely, if N will do a significant chunk of work but C's stint is fairly short, we might as well let C continue running, as N will shortly get to run.

How to know if the next time this vcpu runs will be long or short? We could try tracking the runtimes of the last few (maybe 3 or 5) times this vcpu was scheduled, and using that to predict the result.

Do you have traces for any of those runs you did? I might just take a look at them and see if I can make an analysis of cache "temperature" wrt scheduling. :-)

 -George
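P.S. By "tracking the runtimes" I mean nothing fancier than a small per-vcpu history, along these lines. The structure, the field names and the plain averaging are all made up for the sake of the example (nothing like this exists in sched_credit.c today), and where exactly the samples would be taken is an open question:

    /*
     * Sketch only: remember the last few runtimes of a vcpu and predict the
     * next one as their average, to guess whether it will run long or short.
     */
    #define RUN_HISTORY 3

    struct run_history {
        uint64_t sample[RUN_HISTORY]; /* last few runtimes, e.g. in ns */
        unsigned int next;            /* slot to overwrite next */
    };

    /* Record how long the vcpu ran during its latest stint on a pcpu. */
    static void run_history_record(struct run_history *h, uint64_t ran_for)
    {
        h->sample[h->next] = ran_for;
        h->next = (h->next + 1) % RUN_HISTORY;
    }

    /* Predict the next runtime as the plain average of the samples. */
    static uint64_t run_history_predict(const struct run_history *h)
    {
        uint64_t sum = 0;
        unsigned int i;

        for ( i = 0; i < RUN_HISTORY; i++ )
            sum += h->sample[i];

        return sum / RUN_HISTORY;
    }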