Hi George, Everyone,

While reworking a bit my NUMA aware scheduling patches I figured I'm not sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of course) does.

Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's call the vcpu that just woke up v_W, and the vcpu that is currently running on the cpu where that happens v_C. Let's also call the CPU where all this is happening P.

As far as I've understood, in __runq_tickle(), we:

static inline void
__runq_tickle(unsigned int cpu, struct csched_vcpu *new)
{
    [...]
    cpumask_t mask;

    cpumask_clear(&mask);

    /* If strictly higher priority than current VCPU, signal the CPU */
    if ( new->pri > cur->pri )
    {
        [...]
        cpumask_set_cpu(cpu, &mask);
    }

--> Make sure we put the CPU we are on (P) in 'mask', in case the woken
--> vcpu (v_W) has higher priority than the currently running one (v_C).

    /*
     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
     * let them know there is runnable work in the system...
     */
    if ( cur->pri > CSCHED_PRI_IDLE )
    {
        if ( cpumask_empty(prv->idlers) )
            [...]
        else
        {
            cpumask_t idle_mask;

            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpumask_empty(&idle_mask) )
            {
                [...]
                if ( opt_tickle_one_idle )
                {
                    [...]
                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
                }
                else
                    cpumask_or(&mask, &mask, &idle_mask);
            }
            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);

--> Make sure we include one or more (depending on opt_tickle_one_idle)
--> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.

        }
    }

    /* Send scheduler interrupts to designated CPUs */
    if ( !cpumask_empty(&mask) )
        cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);

--> Ask all the CPUs in 'mask' to reschedule. That would mean all the
--> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
--> effect will be that all/some of the CPUs v_W has affinity with
--> _and_ (let's assume so) P will go through scheduling as quickly as
--> possible.
}

Is the above right?

If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?

I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.

So, if that is true, it seems we leave some room for sub-optimal CPU utilization, as well as some non-work-conserving windows. Of course, it is very hard to tell how frequently this actually happens.

As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
Actually, going all the way round, i.e., tickling only CPUs with affinity with v_C (in this case) looks more reasonable, under the assumption that v_W is going to be scheduled on P soon enough.
In general, that would mean tickling the CPUs in the affinity mask of the vcpu with the lower priority, but I've not checked how that would interact with the rest of the scheduling logic yet.

If I got things wrong and/or there's something I missed or overlooked, please accept my apologies. :-)

Thanks and Regards,
Dario
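P.S. Just to make the brute-force option above a bit more concrete, all I have in mind is something like the following, on top of the second block of __runq_tickle() quoted above. It is only an untested sketch reusing the names from that code ('cur' being v_C, 'new' being v_W), not something I have actually tried:

    /*
     * Sketch only: consider idlers from the union of v_W's and v_C's
     * affinity masks, so that whichever of the two ends up waiting in
     * the runqueue has a chance of being picked up quickly.  The obvious
     * downside is that many more pcpus may be asked to reschedule.
     */
    cpumask_t both_affinity, idle_mask;

    cpumask_or(&both_affinity, new->vcpu->cpu_affinity, cur->vcpu->cpu_affinity);
    cpumask_and(&idle_mask, prv->idlers, &both_affinity);

    if ( !cpumask_empty(&idle_mask) )
        cpumask_or(&mask, &mask, &idle_mask);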
On 23/10/12 14:34, Dario Faggioli wrote:
> Hi George, Everyone,
>
> While reworking a bit my NUMA aware scheduling patches I figured I'm not sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of course) does.
>
> Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's call the vcpu that just woke up v_W, and the vcpu that is currently running on the cpu where that happens v_C. Let's also call the CPU where all this is happening P.
>
> As far as I've understood, in __runq_tickle(), we:
>
> static inline void
> __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
> {
>     [...]
>     cpumask_t mask;
>
>     cpumask_clear(&mask);
>
>     /* If strictly higher priority than current VCPU, signal the CPU */
>     if ( new->pri > cur->pri )
>     {
>         [...]
>         cpumask_set_cpu(cpu, &mask);
>     }
>
> --> Make sure we put the CPU we are on (P) in 'mask', in case the woken
> --> vcpu (v_W) has higher priority than the currently running one (v_C).
>
>     /*
>      * If this CPU has at least two runnable VCPUs, we tickle any idlers to
>      * let them know there is runnable work in the system...
>      */
>     if ( cur->pri > CSCHED_PRI_IDLE )
>     {
>         if ( cpumask_empty(prv->idlers) )
>             [...]
>         else
>         {
>             cpumask_t idle_mask;
>
>             cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>             if ( !cpumask_empty(&idle_mask) )
>             {
>                 [...]
>                 if ( opt_tickle_one_idle )
>                 {
>                     [...]
>                     cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
>                 }
>                 else
>                     cpumask_or(&mask, &mask, &idle_mask);
>             }
>             cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
>
> --> Make sure we include one or more (depending on opt_tickle_one_idle)
> --> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.
>
>         }
>     }
>
>     /* Send scheduler interrupts to designated CPUs */
>     if ( !cpumask_empty(&mask) )
>         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>
> --> Ask all the CPUs in 'mask' to reschedule. That would mean all the
> --> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
> --> effect will be that all/some of the CPUs v_W has affinity with
> --> _and_ (let's assume so) P will go through scheduling as quickly as
> --> possible.
> }
>
> Is the above right?

It looks right to me.

> If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?
>
> I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
> Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.

Yes -- the two clauses look a bit like they were conceived independently, and maybe no one thought about how they might interact.

> So, if that is true, it seems we leave some room for sub-optimal CPU utilization, as well as some non-work-conserving windows. Of course, it is very hard to tell how frequently this actually happens.
>
> As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...

Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.

Of course, that's only necessary if:
 * v_C is lower priority than v_W
 * There are no idlers that intersect both v_C's and v_W's affinity masks.

It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.

 -George
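P.S. To sketch the kind of check I mean, reusing the names from the __runq_tickle() code you quoted (completely untested, and only meant to illustrate the idea; just picking the first suitable idler is a simplification of the last_tickle_cpu logic):

    /*
     * Sketch only: waking an extra idler from v_C's (i.e. cur's) affinity
     * is worth considering only if v_C is about to be preempted and no
     * single idler could serve both vcpus anyway.
     */
    cpumask_t common_idlers, cur_idlers;

    cpumask_and(&common_idlers, new->vcpu->cpu_affinity, cur->vcpu->cpu_affinity);
    cpumask_and(&common_idlers, &common_idlers, prv->idlers);

    if ( new->pri > cur->pri && cpumask_empty(&common_idlers) )
    {
        /* No idler can run both: also tickle one idler from v_C's mask. */
        cpumask_and(&cur_idlers, prv->idlers, cur->vcpu->cpu_affinity);
        if ( !cpumask_empty(&cur_idlers) )
            cpumask_set_cpu(cpumask_first(&cur_idlers), &mask);
    }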
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > If yes, here's my question. Is it right to always tickle v_W's affine CPUs, and only them?
> >
> > I'm asking because a possible scenario, at least according to me, is that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects v_W and leaves v_C in its runq. At that point, one of the tickled CPUs (say P') enters schedule, sees that P is not idle, and tries to steal a vcpu from its runq. Now we know that P' has affinity with v_W, but v_W is not there, while v_C is, and if P' is not in v_C's affinity, we've forced P' to reschedule for nothing.
> > Also, there now might be another (or even a number of) CPUs where v_C could run that stay idle, as they have not been tickled.
>
> Yes -- the two clauses look a bit like they were conceived independently, and maybe no one thought about how they might interact.
>
Yep, it looked the same to me.

> > As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>
> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>
Yes, the default is to tickle only 1 idler. However, as we offer that as a command line option, I think we should consider what could happen if one disables it.

I double checked this on Linux and, mutatis mutandis, they sort of go the way I was suggesting, i.e., "pinging" the CPUs with affinity to the task that will likely stay in the runq rather than being picked up locally. However, there are of course big differences between the two schedulers and different assumptions being made, thus I'm not really sure that is the best thing to do for us.

So, yes, it probably makes sense to think about something clever to try to involve CPUs from both the masks without causing too much overhead. I'll put that in my TODO list. :-)

> Of course, that's only necessary if:
>  * v_C is lower priority than v_W
>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>
Sure, I said that in the first place, and I don't think checking for that is too hard... Just a couple more bitmap ops. But again, I'll give it some thinking.

> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>
Definitely. Before trying to "fix" it, I'm interested in finding out what I'd actually be fixing. Will do that.

Thanks for taking the time to check and answer this! :-)
Dario
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>
> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>
> Of course, that's only necessary if:
>  * v_C is lower priority than v_W
>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>
> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>
Ok, I think I managed to reproduce this. Look at the following trace, considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no affinity at all (its vcpus can run everywhere):

  166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
 ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
  .
 ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
 ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
 ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
  .
  166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
  166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
  .
  .
  .
 ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
 ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
 ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
 ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
  166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
  166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running

So, if I'm not reading the trace wrong, when d0v7 wakes up (very first event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are idle, none of them gets tickled and comes to pick d51v1 up, which then has to wait in the runq until d0v7 goes back to sleep.

I suspect this could be because, at d0v7 wakeup time, we try to tickle some pcpu which is in d0v7's affinity, but not in d51v1's (as in the second 'if() {}' block in __runq_tickle() we only care about new->vcpu->cpu_affinity, and in this case new is d0v7).

I know, looking at the timestamps it doesn't look like a big deal in this case, and I'm still working on producing numbers that can better show whether or not this is a real problem.

Anyway, and independently from the results of these tests, why do I care so much?

Well, if you substitute the concept of "vcpu-affinity" with "node-affinity" above (which is what I am doing in my NUMA aware scheduling patches) you'll see why this is bothering me quite a bit. In fact, in that case, waking up a random pcpu with which d0v7 has node-affinity, while d51v1 has not, would cause d51v1 to be pulled by that cpu (since node-affinity is only a preference)!

So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads). :-(

Regards,
Dario
On 15/11/12 12:10, Dario Faggioli wrote:
> On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
>>> As for possible solutions, I think that, for instance, tickling all the CPUs in both v_W's and v_C's affinity masks could solve this, but that would also potentially increase the overhead (by asking _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when it's worth it...
>> Well, in my code, opt_tickle_one_idle is on by default, which means only one other cpu will be woken up. If there were an easy way to make it wake up a CPU in v_C's affinity as well (supposing that there was no overlap), that would probably be a win.
>>
>> Of course, that's only necessary if:
>>  * v_C is lower priority than v_W
>>  * There are no idlers that intersect both v_C's and v_W's affinity masks.
>>
>> It's probably a good idea, though, to try to set up a scenario where this might be an issue and see how often it actually happens.
>>
> Ok, I think I managed to reproduce this. Look at the following trace, considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no affinity at all (its vcpus can run everywhere):
>
>   166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
>  ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
>   .
>  ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
>  ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
>  ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
>   .
>   166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
>   166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
>   .
>   .
>   .
>  ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
>  ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
>  ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
>  ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
>   166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
>   166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running
>
> So, if I'm not reading the trace wrong, when d0v7 wakes up (very first event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are idle, none of them gets tickled and comes to pick d51v1 up, which then has to wait in the runq until d0v7 goes back to sleep.
>
> I suspect this could be because, at d0v7 wakeup time, we try to tickle some pcpu which is in d0v7's affinity, but not in d51v1's (as in the second 'if() {}' block in __runq_tickle() we only care about new->vcpu->cpu_affinity, and in this case new is d0v7).
>
> I know, looking at the timestamps it doesn't look like a big deal in this case, and I'm still working on producing numbers that can better show whether or not this is a real problem.
>
> Anyway, and independently from the results of these tests, why do I care so much?
>
> Well, if you substitute the concept of "vcpu-affinity" with "node-affinity" above (which is what I am doing in my NUMA aware scheduling patches) you'll see why this is bothering me quite a bit. In fact, in that case, waking up a random pcpu with which d0v7 has node-affinity, while d51v1 has not, would cause d51v1 to be pulled by that cpu (since node-affinity is only a preference)!
>
> So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
> In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads).

Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.

What do you think?

 -George
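P.S. In __runq_tickle() terms, and again reusing the names from the code quoted at the start of the thread, I am thinking of something along these lines. It is only an untested sketch, not an actual patch (and how it should interact with the rest of that block, e.g. the final cpumask_and(), would need checking):

    /*
     * Sketch only: scan the idlers using the affinity of whichever vcpu is
     * likely to be left waiting in the runqueue on this cpu.  If 'new' wins
     * the priority comparison, 'cur' is the one that will need another
     * pcpu, and vice versa.
     */
    const cpumask_t *runq_affinity = (new->pri > cur->pri) ?
        cur->vcpu->cpu_affinity : new->vcpu->cpu_affinity;
    cpumask_t idle_mask;

    cpumask_and(&idle_mask, prv->idlers, runq_affinity);
    if ( !cpumask_empty(&idle_mask) )
    {
        if ( opt_tickle_one_idle )
            cpumask_set_cpu(cpumask_first(&idle_mask), &mask);
        else
            cpumask_or(&mask, &mask, &idle_mask);
    }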
On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at pcpu 13's runq for work to steal it does not find anything suitable and gives up, leaving d51v1 in the runq even if there are idle pcpus on which it could run, which is already bad.
> > In the node-affinity case, pcpu 3 will actually manage to steal d51v1 and run it, even if there are idle pcpus with which it has node-affinity, thus defeating most of the benefits of the whole NUMA aware scheduling thing (at least for some workloads).
>
> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>
EhEh, if you check the whole thread, you'll find evidence that I thought this to be a good idea from the very beginning.

I already have a patch for that; just let me see if the numbers (with and without NUMA scheduling) are aligned with impressions, and then I'll send everything together.

Thanks for your time,
Dario
(Cc-ing David, as it looks like he uses xenalyze quite a bit, and I'm seeking any advice on how to squeeze data from there too :-P)

On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>
Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.

In fact, although I don't yet have the numbers for the NUMA-aware scheduling case (which is what originated all this! :-D), comparing 'upstream' and 'patched' (namely, 'upstream' plus the two attached patches) I can spot some perf regressions. :-(

Here are the results of running some benchmarks on 2, 6 and 10 VMs. Each VM has 2 VCPUs, and they run and execute the benchmarks concurrently on a 16 CPUs host. (Each test is repeated 3 times, and avg +/- stddev is what is reported.) Also, the VCPUs were statically pinned to the host's PCPUs. As already said, numbers for no-pinning and NUMA-scheduling will follow.

+ sysbench --test=memory (throughput, higher is better)
 #VMs |         upstream          |         patched
   2  | 550.97667 +/- 2.3512355   | 540.185   +/- 21.416892
   6  | 443.15    +/- 5.7471797   | 442.66389 +/- 2.1071732
  10  | 313.89233 +/- 1.3237493   | 305.69567 +/- 0.3279853

+ sysbench --test=cpu (time, lower is better)
 #VMs |         upstream          |         patched
   2  | 47.8211   +/- 0.0215503   | 47.816117 +/- 0.0174079
   6  | 62.689122 +/- 0.0877172   | 62.789883 +/- 0.1892171
  10  | 90.321097 +/- 1.4803867   | 91.197767 +/- 0.1032667

+ specjbb2005 (throughput, higher is better)
 #VMs |         upstream          |         patched
   2  | 49591.057 +/- 952.93384   | 50008.28  +/- 1502.4863
   6  | 33538.247 +/- 1089.2115   | 33647.873 +/- 1007.3538
  10  | 21927.87  +/- 831.88742   | 21869.654 +/- 578.236

So, as you can easily see, the numbers are very similar, with cases where the patches produce some slight performance reduction, while I was expecting the opposite, i.e., similar but a little bit better with the patches.

For most of the runs of all the benchmarks I have the full traces (although only for SCHED-* events, IIRC), so I can investigate more. It's a huge amount of data, so it's really hard to make sense of it, and any advice and direction on that would be much appreciated.

For instance, looking at one of the runs of sysbench-memory, here's what I found. With 10 VMs, the memory throughput reported by one of the VMs during one of the runs is as follows:

 upstream: 315.68 MB/s
 patched:  306.69 MB/s

I then went through the traces and I found out that the patched case lasted longer (for transferring the same amount of memory, hence the lower throughput), but with the following runstate-related results:

 upstream: running  for 73.67% of the time
           runnable for 24.94% of the time
 patched:  running  for 74.57% of the time
           runnable for 24.10% of the time

And that is consistent with other random instances I checked.

So, it looks like the patches are, after all, doing their job in increasing (at least a little) the running time, at the expense of the runnable time, of the various VCPUs, but the benefit of that is all being eaten by some other effect --to the point that sometimes things go even worse-- that I'm not able to identify... For now! :-P

Any idea about what's going on and what I should check to better figure that out?
Thanks a lot and Regards,
Dario
On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
> >
> Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.
>
I've just come to think that this approach, although more, say, correct, could have a bad impact on caches and locality in general.

In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C is running, with prio(N)>prio(C).

What upstream does is ask #x and all the idlers that can execute N to reschedule. Doing both is, I think, wrong, as there's the chance of ending up with N being scheduled on #x and C being runnable but not running (in #x's runqueue) even if there are idle cpus that could run it, as they're not poked (as already and repeatedly said).

What the patches do in this case (remember prio(N)>prio(C)) is ask #x and all the idlers that can run C to reschedule, the effect being that N will likely run on #x, after a context switch, and C will run somewhere else, after a migration, potentially wasting its cache-hotness (it is running, after all!).

It looks like we can do better... Something like the below:
 + if there are no idlers where N can run, ask #x and the idlers where C can run to reschedule (exactly what the patches do, although they do that _unconditionally_), as there isn't anything else we can do to try to make sure they both will run;
 + if *there*are* idlers where N can run, _do_not_ ask #x to reschedule and only poke them to come pick N up. In fact, in this case, it is not necessary to send C away for having both the vcpus running, and it seems better to have N experience the migration as, since it's waking up, it's more likely for it than for C to be cache-cold.

I'll run the benchmarks with this variant as soon as the one that I'm running right now finishes... In the meanwhile, any thoughts?

Thanks and Regards,
Dario
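P.S. Translated into the __runq_tickle() names used earlier in the thread (N being 'new', C being 'cur', #x being 'cpu'), and assuming we are in the prio(N)>prio(C) case, the logic above would look roughly like this. Again, only an untested sketch, not the actual patches I will be benchmarking; the existing tail of __runq_tickle() would still do the SCHEDULE_SOFTIRQ raise on 'mask':

    /*
     * Sketch only: prefer migrating the waking vcpu N to an idler in its
     * own affinity; preempt C on this cpu (and poke C's idlers) only if
     * no such idler exists.
     */
    cpumask_t idle_new, idle_cur;

    cpumask_and(&idle_new, prv->idlers, new->vcpu->cpu_affinity);
    cpumask_and(&idle_cur, prv->idlers, cur->vcpu->cpu_affinity);

    if ( !cpumask_empty(&idle_new) )
        /* Some idler can pick N up directly: let C keep running here. */
        cpumask_or(&mask, &mask, &idle_new);
    else
    {
        /* Nothing idle for N: preempt C here and poke C's idlers instead. */
        cpumask_set_cpu(cpu, &mask);
        cpumask_or(&mask, &mask, &idle_cur);
    }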
On 16/11/12 12:00, Dario Faggioli wrote:
> On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
>> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
>>> Maybe what we should do is do the wake-up based on who is likely to run on the current cpu: i.e., if "current" is likely to be pre-empted, look at idlers based on "current"'s mask; if "new" is likely to be put on the queue, look at idlers based on "new"'s mask.
>>>
>> Ok, find attached the two (trivial) patches that I produced and am testing these days. Unfortunately, early results show that I/we might be missing something.
>>
> I've just come to think that this approach, although more, say, correct, could have a bad impact on caches and locality in general.

One thing that xenalyze will already tell you is statistics on how a vcpu migrates over pcpus. For example:

  cpu affinity:      242 7009916158 {621089444|5643356292|19752063006}
    [0]:              15 6940230676 {400952|5643531152|27013831272}
    [1]:              19 6366861827 {117462|5031404806|19751998114}
    [2]:              31 6888557514 {1410800684|5643015454|19752100009}
    [3]:              18 7790887470 {109764|5920027975|25395539566}
  ...

The general format is: "$number $average_cycles {5th percentile|50th percentile|95th percentile}". The first line includes samples from *all* cpus (i.e., it migrated a total of 242 times, averaging 7 billion cycles each time); the subsequent lines show statistics for specific pcpus (i.e., it had 15 sessions on pcpu 0, averaging 6.94 billion cycles, &c). You should be able to use this to do a basic verification of your hypothesis that vcpus are migrating more often.

> In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C is running, with prio(N)>prio(C).
>
> What upstream does is ask #x and all the idlers that can execute N to reschedule. Doing both is, I think, wrong, as there's the chance of ending up with N being scheduled on #x and C being runnable but not running (in #x's runqueue) even if there are idle cpus that could run it, as they're not poked (as already and repeatedly said).
>
> What the patches do in this case (remember prio(N)>prio(C)) is ask #x and all the idlers that can run C to reschedule, the effect being that N will likely run on #x, after a context switch, and C will run somewhere else, after a migration, potentially wasting its cache-hotness (it is running, after all!).
>
> It looks like we can do better... Something like the below:
>  + if there are no idlers where N can run, ask #x and the idlers where C can run to reschedule (exactly what the patches do, although they do that _unconditionally_), as there isn't anything else we can do to try to make sure they both will run;
>  + if *there*are* idlers where N can run, _do_not_ ask #x to reschedule and only poke them to come pick N up. In fact, in this case, it is not necessary to send C away for having both the vcpus running, and it seems better to have N experience the migration as, since it's waking up, it's more likely for it than for C to be cache-cold.

I think that makes a lot of sense -- look forward to seeing the results. :-)

There may be some other tricks we could look at. For example, if N and C are both going to do a significant chunk of computation, then this strategy will work best. But suppose that C does a significant chunk of computation, but N is only going to run for a few hundred microseconds and then go to sleep again?
In that case, it may be easier to just run N on the current processor and not bother with IPIs and such; C will run again in a few microseconds. Conversely, if N will do a significant chunk of work but C's stint is fairly short, we might as well let C continue running, as N will shortly get to run.

How to know if the next time this vcpu runs will be long or short? We could try tracking the runtimes of the last few (maybe 3 or 5) times this vcpu was scheduled, and using that to predict the result.

Do you have traces for any of those runs you did? I might just take a look at them and see if I can make an analysis of cache "temperature" wrt scheduling. :-)

 -George
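P.S. By "tracking the runtimes" I mean nothing fancier than a small per-vcpu history, along these lines. The structure, the field names and the plain averaging are all made up for the sake of the example (nothing like this exists in sched_credit.c today), and where exactly the samples would be taken is an open question:

    /*
     * Sketch only: remember the last few runtimes of a vcpu and predict the
     * next one as their average, to guess whether it will run long or short.
     */
    #define RUN_HISTORY 3

    struct run_history {
        uint64_t sample[RUN_HISTORY]; /* last few runtimes, e.g. in ns */
        unsigned int next;            /* slot to overwrite next */
    };

    /* Record how long the vcpu ran during its latest stint on a pcpu. */
    static void run_history_record(struct run_history *h, uint64_t ran_for)
    {
        h->sample[h->next] = ran_for;
        h->next = (h->next + 1) % RUN_HISTORY;
    }

    /* Predict the next runtime as the plain average of the samples. */
    static uint64_t run_history_predict(const struct run_history *h)
    {
        uint64_t sum = 0;
        unsigned int i;

        for ( i = 0; i < RUN_HISTORY; i++ )
            sum += h->sample[i];

        return sum / RUN_HISTORY;
    }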