Hi everyone,

With automatic placement finally landing in xen-unstable, I started thinking about what I could work on next, still in the field of improving Xen's NUMA support. Well, it turned out that running out of things to do is not an option! :-O

In fact, I can think of quite a few open issues in that area, which I'm just braindumping here. If anyone has thoughts or ideas or feedback or whatever, I'd be happy to serve as a collector of them. I've already created a Wiki page to help with the tracking. You can see it here (for now it basically replicates this e-mail):

  http://wiki.xen.org/wiki/Xen_NUMA_Roadmap

I'm putting a [D] (standing for Dario) near the points I've started working on or looking at, and again, I'd be happy to try tracking this too, i.e., keeping the list of "who-is-doing-what" updated, in order to ease collaboration.

So, let's cut the talking:

 - Automatic placement at guest creation time. Basics are there and
   will be shipping with 4.2. However, a lot of other things are
   missing and/or can be improved, for instance:
   [D] * automated verification and testing of the placement;
       * benchmarks and improvements of the placement heuristic;
   [D] * choosing/building up some measure of node load (more accurate
         than just counting vcpus) on which to rely during placement
         (see the node-scoring sketch after this message);
       * consider IONUMA during placement;
       * automatic placement of Dom0, if possible (my current series is
         only affecting DomU);
       * having internal Xen data structures honour the placement (e.g.,
         I've been told that right now vcpu stacks are always allocated
         on node 0... Andrew?).

 [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
       just have them _prefer_ running on the nodes where their memory
       is.

 [D] - Dynamic memory migration between different nodes of the host, as
       the counterpart of the NUMA-aware scheduler.

 - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
   guest ends up on more than one node, make sure it knows it's
   running on a NUMA platform (smaller than the actual host, but
   still NUMA). This interacts with some of the above points:
     * consider this during automatic placement for
       resuming/migrating domains (if they have a virtual topology,
       better not to change it);
     * consider this during memory migration (it can change the
       actual topology; should we update it on-line or disable memory
       migration?).

 - NUMA, ballooning and memory sharing. In some more detail:
     * page sharing on NUMA boxes: it's probably sane to make it
       possible to disable sharing pages across nodes;
     * ballooning and its interaction with placement (races, amount of
       memory needed and reported being different at different times,
       etc.).

 - Inter-VM dependencies and communication issues. If a workload is
   made up of more than just one VM and they all share the same (NUMA)
   host, it might be best to have them share the nodes as much as
   possible, or perhaps do just the opposite, depending on the
   specific characteristics of the workload itself, and this might be
   considered during placement, memory migration and perhaps
   scheduling.

 - Benchmarking and performance evaluation in general. Meaning both
   agreeing on a (set of) relevant workload(s) and on how to extract
   meaningful performance data from there (and maybe how to do that
   automatically?).

So, what do you think?
Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
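As a concrete illustration of the "measure of node load" point above, the following is a minimal, self-contained sketch of a node-scoring function that combines free memory with a vcpu-per-pcpu ratio. It is purely illustrative and is not the libxl placement heuristic; the data layout and the weighting are assumptions made up for the example.

/* Illustrative only: a toy node-scoring function for NUMA placement,
 * NOT the actual libxl heuristic.  Per-node data and weights are made up. */
#include <stdio.h>

struct node_info {
    unsigned long free_kb;   /* free memory on the node, in KiB */
    unsigned int  nr_vcpus;  /* vcpus of domains already placed there */
    unsigned int  nr_pcpus;  /* physical cpus belonging to the node */
};

/* Lower score == better candidate: prefer nodes with more free memory
 * and a lower vcpu-per-pcpu ratio. */
static double node_score(const struct node_info *n, unsigned long dom_kb)
{
    if (n->free_kb < dom_kb)
        return 1e9;                        /* domain does not fit at all */
    double load = (double)n->nr_vcpus / n->nr_pcpus;
    double mem_pressure = (double)dom_kb / n->free_kb;
    return load + mem_pressure;            /* equal weighting, arbitrarily */
}

int main(void)
{
    struct node_info nodes[] = {
        { .free_kb = 8u << 20, .nr_vcpus = 6, .nr_pcpus = 8 },  /* node 0 */
        { .free_kb = 2u << 20, .nr_vcpus = 1, .nr_pcpus = 8 },  /* node 1 */
    };
    unsigned long dom_kb = 1u << 20;       /* a 1 GiB guest */
    int best = 0;

    for (int i = 1; i < 2; i++)
        if (node_score(&nodes[i], dom_kb) < node_score(&nodes[best], dom_kb))
            best = i;
    printf("best node: %d\n", best);
    return 0;
}

A real heuristic would of course also have to deal with multi-node candidates, ties and hard constraints (cpupools, pinning), which this sketch deliberately ignores.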
On Wed, 2012-08-01 at 18:16 +0200, Dario Faggioli wrote:
> Hi everyone,
>
Quite a bad subject... I put it there just as a placeholder and then forgot to change it into something sensible. :-( Sorry for that. I hope the content can still get some attention. :-P

Thanks again and Regards,
Dario
On 01/08/12 17:16, Dario Faggioli wrote:
> Hi everyone,
>
> ...
>
> - Automatic placement at guest creation time. Basics are there and
>   will be shipping with 4.2. However, a lot of other things are
>   missing and/or can be improved, for instance:
>   [D] * automated verification and testing of the placement;
>       * benchmarks and improvements of the placement heuristic;
>   [D] * choosing/building up some measure of node load (more accurate
>         than just counting vcpus) on which to rely during placement;
>       * consider IONUMA during placement;
>       * automatic placement of Dom0, if possible (my current series is
>         only affecting DomU);
>       * having internal Xen data structures honour the placement (e.g.,
>         I've been told that right now vcpu stacks are always allocated
>         on node 0... Andrew?).
>
> ...

- Xen NUMA internals.
Placing items such as the per-cpu stacks and data area on the local NUMA node, rather than unconditionally on node 0 at the moment. As part of this, there will be changes to alloc_{dom,xen}heap_page() to allow specification of which node(s) to allocate memory from.

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On 1 Aug 2012, at 17:16, Dario Faggioli <raistlin@linux.it> wrote:
> - Inter-VM dependencies and communication issues. If a workload is
>   made up of more than just one VM and they all share the same (NUMA)
>   host, it might be best to have them share the nodes as much as
>   possible, or perhaps do just the opposite, depending on the
>   specific characteristics of the workload itself, and this might be
>   considered during placement, memory migration and perhaps
>   scheduling.
>
> - Benchmarking and performance evaluation in general. Meaning both
>   agreeing on a (set of) relevant workload(s) and on how to extract
>   meaningful performance data from there (and maybe how to do that
>   automatically?).

I haven't tried out the latest Xen NUMA features yet, but we've been keeping track of the IPC benchmarks as we get newer machines here:

  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html

The newer chipsets (Sandy Bridge and AMD Valencia) both have quite different inter-core/socket/MPM performance characteristics from their respective previous generations; e.g.

  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpfCBrYh.html
  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmppI61nX.html

Happy to share the raw data if you have cycles to figure out the best way to auto-place multiple VMs so they are near each other from a memory latency perspective. We haven't run many macro-benchmarks though, so in practice it might not matter; it would be nice to settle on a good set of benchmarks to determine that for sure.

-anil
On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
> On 01/08/12 17:16, Dario Faggioli wrote:
>
> ...
>
> - Xen NUMA internals. Placing items such as the per-cpu stacks and
>   data area on the local NUMA node, rather than unconditionally on node
>   0 at the moment. As part of this, there will be changes to
>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>   allocate memory from.

As you see, I already tried to consider that (as you told me about it a couple of weeks ago :-) ). I'll add your wording of it (much better than mine) to the wiki... I understand you're working on this, aren't you? Can I put that down to you?

Thanks and Regards,
Dario
On 01/08/12 17:47, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
>> ...
>> - Xen NUMA internals. Placing items such as the per-cpu stacks and
>>   data area on the local NUMA node, rather than unconditionally on node
>>   0 at the moment. As part of this, there will be changes to
>>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>>   allocate memory from.
>
> As you see, I already tried to consider that (as you told me about it
> a couple of weeks ago :-) ). I'll add your wording of it (much better
> than mine) to the wiki... I understand you're working on this, aren't
> you? Can I put that down to you?
>
> Thanks and Regards,
> Dario

Wow - I completely managed to miss that while reading. Someone will be working on it for XS.next, and that someone will probably be me - put me down for it.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
> I haven't tried out the latest Xen NUMA features yet, but we've been
> keeping track of the IPC benchmarks as we get newer machines here:
>
>   http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
>
Wow... That's really cool. I'll definitely take a deep look at all these data! I'm also adding the link to the wiki, if you're fine with that...

> Happy to share the raw data if you have cycles to figure out the best
> way to auto-place multiple VMs so they are near each other from a memory
> latency perspective.
>
I don't have anything precise in mind yet, but we need to think about this.

> We haven't run many macro-benchmarks though, so in practice it might
> not matter; it would be nice to settle on a good set of benchmarks to
> determine that for sure.
>
Yes, that's what we need. I'm open and available to try to figure this out anytime... I seem to recall you're going to be in San Diego for XenSummit, am I right? If yes, we can discuss this more there.

Thanks and Regards,
Dario
On 01/08/12 17:58, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
>> ...
>> I haven't tried out the latest Xen NUMA features yet, but we've been
>> keeping track of the IPC benchmarks as we get newer machines here:
>>
>>   http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
>>
> Wow... That's really cool. I'll definitely take a deep look at all these
> data! I'm also adding the link to the wiki, if you're fine with that...

No problem with adding a link, as this is public data :) If possible, it'd be splendid to put a note next to this link encouraging people to submit their own results -- doing so is very simple, and helps us extend the database. Instructions are at http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short link, http://fable.io).

>> Happy to share the raw data if you have cycles to figure out the best
>> way to auto-place multiple VMs so they are near each other from a memory
>> latency perspective.
>>
> I don't have anything precise in mind yet, but we need to think about
> this.

While there has been plenty of work on optimizing co-location of different kinds of workloads, there's relatively little work (that I am aware of) on VM scheduling in this environment. One (sadly somewhat lacking) paper at HotCloud this year [1] looked at NUMA-aware VM migration to balance memory accesses. Of greater interest is possibly the Google ISCA paper on the detrimental effect of sharing micro-architectural resources between different kinds of workloads, although it is not explicitly focused on NUMA, and the metrics are defined with regards to specific classes of latency-sensitive jobs [2].

One interesting thing to look at (that we haven't looked at yet) is what memory allocators do about NUMA these days; there is an AMD whitepaper from 2009 discussing the performance benefits of a NUMA-aware version of tcmalloc [3], but I have found it hard to reproduce their results on modern hardware. Of course, being virtualized may complicate matters here, since the memory allocator can no longer freely pick and choose where to allocate from. Scheduling, notably, is key here, since the CPU a process is scheduled on may determine where its memory is allocated -- frequent migrations are likely to be bad for performance due to remote memory accesses, although we have been unable to quantify a significant difference on non-synthetic macrobenchmarks; that said, we did not try very hard so far.
Cheers,
Malte

[1] - Ahn et al., "Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources", in Proceedings of HotCloud 2012, https://www.usenix.org/conference/hotcloud12/dynamic-virtual-machine-scheduling-clouds-architectural-shared-resources
[2] - Tang et al., "The impact of memory subsystem resource sharing on datacenter applications", in Proceedings of ISCA 2011, http://dl.acm.org/citation.cfm?id=2000099
[3] - http://developer.amd.com/Assets/NUMA_aware_heap_memory_manager_article_final.pdf
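To make the allocator discussion above a little more concrete, here is a small sketch using libnuma's node-targeted allocation. It is a generic Linux/libnuma illustration of allocating (and faulting in) memory on a chosen node; it is not taken from tcmalloc or from the AMD whitepaper, and the choice of node is arbitrary.

/* Minimal libnuma sketch: allocate a buffer on a specific NUMA node and
 * touch it so the pages are actually faulted in there.
 * Build with: gcc -o onnode onnode.c -lnuma   (illustrative only) */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int node = numa_max_node();          /* pick the last node, arbitrarily */
    size_t len = 64 << 20;               /* 64 MiB */

    void *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, len);                 /* fault the pages in on 'node' */
    printf("allocated %zu bytes on node %d\n", len, node);

    numa_free(buf, len);
    return 0;
}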
Dario Faggioli wrote on 2012-08-02:
> Hi everyone,
>
> ...
>
> - Automatic placement at guest creation time. Basics are there and
>   will be shipping with 4.2. However, a lot of other things are
>   missing and/or can be improved, for instance:
>   [D] * automated verification and testing of the placement;
>       * benchmarks and improvements of the placement heuristic;
>   [D] * choosing/building up some measure of node load (more accurate
>         than just counting vcpus) on which to rely during placement;
>       * consider IONUMA during placement;

We should consider two things (a rough Dom0-side sketch follows this message):

1. Dom0 IONUMA: devices used by Dom0 should get their DMA buffers from the node where they reside. Currently, Dom0 allocates DMA buffers without providing the node info to the hypercall.

2. Guest IONUMA: when a guest boots up with a pass-through device, we need to allocate the memory from the node where the device resides for further DMA buffer allocation, and let the guest know the IONUMA topology. This relies on guest NUMA support. This topic was mentioned at Xen Summit 2011:
http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf

> ...
Best regards,
Yang
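As a rough illustration of the Dom0 side of point 1 above, the fragment below shows how a Linux driver can discover the NUMA node its device sits on and ask for pages there. It deliberately stops short of the hypercall changes Yang mentions; dev_to_node() and alloc_pages_node() are existing Linux interfaces, while the surrounding function is invented for the example and is not an actual Xen or Linux patch.

/* Sketch only: node-aware buffer allocation in a (hypothetical) Dom0 driver.
 * The remaining work, i.e. passing the node down to Xen's populate_physmap
 * path when new machine pages are needed, is not shown here. */
#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/topology.h>

static struct page *example_alloc_dma_buffer(struct device *dev,
                                             unsigned int order)
{
        int node = dev_to_node(dev);    /* NUMA node the device sits on */

        if (node == NUMA_NO_NODE)
                node = numa_node_id();  /* fall back to the current node */

        /* Ask for pages on that node; the allocator may still fall back
         * to other nodes if that one is exhausted. */
        return alloc_pages_node(node, GFP_KERNEL | __GFP_DMA32, order);
}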
>>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> - Xen NUMA internals. Placing items such as the per-cpu stacks and data
>   area on the local NUMA node, rather than unconditionally on node 0 at
>   the moment. As part of this, there will be changes to
>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>   allocate memory from.

Those interfaces already support flags to be passed, including a node ID. It just needs to be made use of in more places.

Jan
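For readers less familiar with those interfaces, something along the following lines is what is being referred to. MEMF_node() and alloc_xenheap_pages() are real Xen interfaces, but this is only a sketch of how a node-targeted allocation could look; the percpu_area_init() wrapper and the node lookup shown here are invented for the example and are not the actual per-cpu stack code.

/* Sketch: allocating a xenheap page on a specific node via the existing
 * memflags mechanism.  Illustrative only, not a quote of Xen source. */
#include <xen/mm.h>
#include <xen/numa.h>

static void *percpu_area_init(unsigned int cpu)
{
    unsigned int node = cpu_to_node(cpu);        /* node owning this pcpu */

    /* Order-0 xenheap allocation, preferably from 'node' rather than
     * whatever node the allocator would otherwise favour (often node 0). */
    void *p = alloc_xenheap_pages(0, MEMF_node(node));

    if ( p == NULL )
        p = alloc_xenheap_pages(0, 0);           /* fall back to any node */

    return p;
}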
>>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>   guest ends up on more than one node, make sure it knows it's
>   running on a NUMA platform (smaller than the actual host, but
>   still NUMA). This interacts with some of the above points:

The question is whether this is really useful beyond the (I would suppose) relatively small set of cases where migration isn't needed.

>     * consider this during automatic placement for
>       resuming/migrating domains (if they have a virtual topology,
>       better not to change it);
>     * consider this during memory migration (it can change the
>       actual topology; should we update it on-line or disable memory
>       migration?)

The question is whether trading functionality for performance is an acceptable choice.

Jan
On Thu, 2012-08-02 at 10:40 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > - Xen NUMA internals. Placing items such as the per-cpu stacks and data
> >   area on the local NUMA node, rather than unconditionally on node 0 at
> >   the moment. As part of this, there will be changes to
> >   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
> >   allocate memory from.
>
> Those interfaces already support flags to be passed, including a
> node ID. It just needs to be made use of in more places.
>
Yes, I also remember it being already node_affinity conscious, and think it's more a matter of how it is called. I'll update the wiki accordingly (it doesn't need to contain these sort of details anyway).

Thanks and Regards,
Dario
On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >   guest ends up on more than one node, make sure it knows it's
> >   running on a NUMA platform (smaller than the actual host, but
> >   still NUMA). This interacts with some of the above points:
>
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
>
Mmm... Not sure I'm getting what you're saying here, sorry. Are you suggesting that exposing a virtual topology is not a good idea as it poses constraints on/prevents live migration?

If yes, well, I mostly agree that this is a huge issue, and that's why I think we need some bright idea on how to deal with it. I mean, it's easy to make it optional and let it automatically disable migration, giving users the choice of what they prefer, but I think this is more dodging the problem than dealing with it! :-P

> > * consider this during automatic placement for
> >   resuming/migrating domains (if they have a virtual topology,
> >   better not to change it);
> > * consider this during memory migration (it can change the
> >   actual topology; should we update it on-line or disable memory
> >   migration?)
>
> The question is whether trading functionality for performance
> is an acceptable choice.
>
Indeed. Again, I think it is possible to implement things flexibly enough, but then we need to come up with a sane default, so we're not allowed to avoid discussing and deciding on this.

One can argue that it is an issue only for big-enough guests (and/or nearly overcommitted hosts) that don't fit in only one node (as, if they do, there is no virtual topology to export), but I'm not sure we can neglect them on this basis.

Thanks for the feedback,
Dario
>>> On 02.08.12 at 15:34, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> ...
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>>
> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> suggesting that exposing a virtual topology is not a good idea as it
> poses constraints on/prevents live migration?

Yes.

> If yes, well, I mostly agree that this is a huge issue, and that's why
> I think we need some bright idea on how to deal with it. I mean, it's
> easy to make it optional and let it automatically disable migration,
> giving users the choice of what they prefer, but I think this is more
> dodging the problem than dealing with it! :-P

Indeed.

>> The question is whether trading functionality for performance
>> is an acceptable choice.
>>
> Indeed. Again, I think it is possible to implement things flexibly
> enough, but then we need to come up with a sane default, so we're not
> allowed to avoid discussing and deciding on this.
>
> One can argue that it is an issue only for big-enough guests (and/or
> nearly overcommitted hosts) that don't fit in only one node (as, if they
> do, there is no virtual topology to export), but I'm not sure we can
> neglect them on this basis.

We certainly can't, the more that the "big enough" case may not be that infrequent going forward.

Jan
On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>> >   guest ends up on more than one node, make sure it knows it's
>> >   running on a NUMA platform (smaller than the actual host, but
>> >   still NUMA). This interacts with some of the above points:
>>
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>>
> ...
>> > * consider this during automatic placement for
>> >   resuming/migrating domains (if they have a virtual topology,
>> >   better not to change it);
>> > * consider this during memory migration (it can change the
>> >   actual topology; should we update it on-line or disable memory
>> >   migration?)

I think we could use cpu hot-plug to change the "virtual topology" of VMs, couldn't we? We could probably even do that on a running guest if we really needed to.

-George
>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>> ...
>
> I think we could use cpu hot-plug to change the "virtual topology" of
> VMs, couldn't we? We could probably even do that on a running guest
> if we really needed to.

Hmm, not sure - using hotplug behind the back of the guest might be possible, but you'd first need to hot-unplug the vCPU. That's something that I don't think you can do on HVM guests (and for PV guests, guest-visible NUMA support makes even less sense than for HVM ones).

Jan
On 08/03/2012 11:23 AM, Jan Beulich wrote:
>>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>>> ...
>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
>>> suggesting that exposing a virtual topology is not a good idea as it
>>> poses constraints on/prevents live migration?

Honestly, what would be the problems with migration? NUMA awareness is actually a software optimization, so we will not really break something if the advertised topology isn't the real one. This is especially true if we lower the number of NUMA nodes. Say the guest starts with two nodes and then gets migrated to a machine where it can happily live in one node. There would be some extra effort by the guest OS to obey the virtual NUMA topology, but if there isn't actually a NUMA penalty anymore this shouldn't really hurt, right?

Even if we had to go to a machine where we have more nodes for a certain guest than before, this is actually what we have today: guest NUMA unawareness. I am not sure if this is really a migration showstopper, and certainly not a NUMA guest showstopper. But we could make it a config file option, so we leave this decision to the admin. I have talked to people with huge guests, and they keep asking me about this feature.

>>> ...
>> I think we could use cpu hot-plug to change the "virtual topology" of
>> VMs, couldn't we? We could probably even do that on a running guest
>> if we really needed to.
>
> Hmm, not sure - using hotplug behind the back of the guest might
> be possible, but you'd first need to hot-unplug the vCPU. That's
> something that I don't think you can do on HVM guests (and for
> PV guests, guest-visible NUMA support makes even less sense
> than for HVM ones).

I don't think that hotplug would really work. I have checked this some time ago; at least the Linux NUMA code cannot really be fooled by this. The SRAT table is firmware-defined and static by nature, so there is no code in Linux to change the NUMA topology at runtime. This is especially true for the memory layout. But as said above, I don't really buy this as an argument against guest NUMA. At least provide it as an option to people who know what they do.

Regards,
Andre.
-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing in xen-unstable, I started
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a few open issues in that area, which I'm
> just braindumping here.
> ...
>
> * automatic placement of Dom0, if possible (my current series is
>   only affecting DomU)

I think Dom0 NUMA awareness should be one of the top priorities. If I boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which actually has memory from all 8 nodes and thinks its memory is flat. There are some tricks to confine it to node 0 (dom0_mem=<memory of node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires intimate knowledge of the system's parameters and is error-prone. Also this does not work well with ballooning. Actually we could improve the NUMA placement with that: by asking the Dom0 explicitly for memory from a certain node.

> * having internal Xen data structures honour the placement (e.g.,
>   I've been told that right now vcpu stacks are always allocated
>   on node 0... Andrew?).
>
> [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>       just have them _prefer_ running on the nodes where their memory
>       is.

This would be really cool. I once thought about something like a home node: we start with placement to allocate memory from one node, then we relax the VCPU pinning, but mark this node as special for this guest, so that, if possible, it gets run there. But in times of CPU pressure we are happy to let it run on other nodes: CPU starvation is much worse than the NUMA penalty. (A toy sketch of such a home-node preference follows this message.)

> [D] - Dynamic memory migration between different nodes of the host, as
>       the counterpart of the NUMA-aware scheduler.

I once read about a VMware feature: bandwidth-limited migration in the background, hot pages first. So we get flexibility and avoid CPU starving, but still don't hog the system with memory copying. Sounds quite ambitious, though.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
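To illustrate the home-node idea in the message above, here is a toy sketch of a pcpu-picking step that prefers the guest's home node but happily falls back to any idle pcpu under pressure. It uses plain 64-bit masks instead of Xen's cpumask_t and is not based on the actual credit scheduler code; everything here is made up for the example.

/* Toy sketch of home-node-preferring pcpu selection.  Plain uint64_t masks
 * stand in for Xen's cpumask_t; none of this is real scheduler code. */
#include <stdint.h>
#include <stdio.h>

/* Pick an idle pcpu, preferring those belonging to the vcpu's home node.
 * Returns the pcpu index, or -1 if nothing is idle at all. */
static int pick_pcpu(uint64_t idle_mask, uint64_t home_node_mask)
{
    uint64_t preferred = idle_mask & home_node_mask;
    uint64_t candidates = preferred ? preferred : idle_mask;

    if (!candidates)
        return -1;                      /* no idle pcpu: stay where we are */

    return __builtin_ctzll(candidates); /* lowest set bit = chosen pcpu */
}

int main(void)
{
    uint64_t idle = 0xF0;               /* pcpus 4-7 are idle */
    uint64_t home = 0x0F;               /* home node owns pcpus 0-3 */

    /* Home node is busy, so we run remotely rather than starve. */
    printf("picked pcpu %d\n", pick_pcpu(idle, home));
    return 0;
}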
>>> On 03.08.12 at 11:48, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 11:23 AM, Jan Beulich wrote:
>> ...
>
> Honestly, what would be the problems with migration? NUMA awareness is
> actually a software optimization, so we will not really break something
> if the advertised topology isn't the real one.

Sure, nothing would break, but the purpose of the whole feature is improving performance, and that might get entirely lost (or even worse) after a migration to a different topology host.

Jan
>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
>> ...
>> * automatic placement of Dom0, if possible (my current series is
>>   only affecting DomU)
>
> I think Dom0 NUMA awareness should be one of the top priorities. If I
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
> actually has memory from all 8 nodes and thinks its memory is flat.
> There are some tricks to confine it to node 0 (dom0_mem=<memory of
> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
> intimate knowledge of the system's parameters and is error-prone.

How about "dom0_mem=node<n> dom0_vcpus=node<n>" as an extension to the current options?

> Also this does not work well with ballooning.
> Actually we could improve the NUMA placement with that: by asking the
> Dom0 explicitly for memory from a certain node.

Yes, passing sideband information to the balloon driver was always a missing item, not only for NUMA support, but also for address-restricted memory (e.g. such needed to start 32-bit PV guests on big systems).

Jan
On 03/08/12 10:48, Andre Przywara wrote:
>>> I think we could use cpu hot-plug to change the "virtual topology" of
>>> VMs, couldn't we? We could probably even do that on a running guest
>>> if we really needed to.
>> Hmm, not sure - using hotplug behind the back of the guest might
>> be possible, but you'd first need to hot-unplug the vCPU. That's
>> something that I don't think you can do on HVM guests (and for
>> PV guests, guest-visible NUMA support makes even less sense
>> than for HVM ones).
> I don't think that hotplug would really work. I have checked this some
> time ago; at least the Linux NUMA code cannot really be fooled by this.
> The SRAT table is firmware-defined and static by nature, so there is no
> code in Linux to change the NUMA topology at runtime. This is especially
> true for the memory layout.

I was more thinking of giving a VM the biggest topology you would want at boot, and then asking Linux to online or offline vcpus; for example, giving it a 4x2 topology (4 vcores x 2 vnodes). When running on a system with 2 cores per node, you offline 2 vcpus per vnode, giving it an effective layout of 2x2. When running on a system with 4 cores per node, you could offline all of the cores on one node, giving it an effective topology of 4x1.

Unfortunately, I just realized that you could change the number of vcpus in a given node, but you couldn't move the memory around very easily. Unless you have memory hotplug? Hmm..... :-)

-George
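For reference, the in-guest side of the online/offline scheme described above is just the standard Linux CPU hotplug sysfs interface. The tiny sketch below drives it directly; deciding which vcpus to offline so the effective layout matches the host is up to the toolstack or an in-guest agent and is not shown.

/* Minimal sketch: offline or online a guest vCPU through the standard Linux
 * sysfs CPU hotplug interface (/sys/devices/system/cpu/cpuN/online).
 * Needs root inside the guest; cpu0 usually cannot be offlined. */
#include <stdio.h>
#include <stdlib.h>

static int set_cpu_online(unsigned int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/online", cpu);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", online ? 1 : 0);
    return fclose(f);
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cpu> <0|1>\n", argv[0]);
        return 1;
    }
    return set_cpu_online(atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
}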
On 08/03/2012 12:40 PM, Jan Beulich wrote:
>>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
>> ...
>> I think Dom0 NUMA awareness should be one of the top priorities. If I
>> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
>> actually has memory from all 8 nodes and thinks its memory is flat.
>> There are some tricks to confine it to node 0 (dom0_mem=<memory of
>> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
>> intimate knowledge of the system's parameters and is error-prone.
>
> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> an extension to the current options?

Yes, that sounds like a good idea. And relatively easy to implement. Maybe a list or a number of nodes (to make it more complicated ;-)

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
>>> On 03.08.12 at 13:26, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 12:40 PM, Jan Beulich wrote:
>> ...
>> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> an extension to the current options?
>
> Yes, that sounds like a good idea. And relatively easy to implement.
> Maybe a list or a number of nodes (to make it more complicated ;-)

Oh yes, of course I implied this flexibility. Just wanted to give an easy to read example.

Jan
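Since "dom0_mem=node<n>" is only a proposal at this point, the fragment below is nothing more than a standalone sketch of how such a value could be parsed into a node list. It is not taken from Xen's actual dom0_mem handling, and all names and the exact syntax ("node<n>[,node<m>...]") are invented for illustration.

/* Standalone sketch: parse a hypothetical "node<n>[,node<m>...]" value into
 * a node bitmap.  Not Xen code. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int parse_node_list(const char *s, uint64_t *nodemask)
{
    *nodemask = 0;
    while (*s) {
        if (strncmp(s, "node", 4) != 0)
            return -1;                          /* not of the form node<n> */
        char *end;
        unsigned long n = strtoul(s + 4, &end, 10);
        if (end == s + 4 || n >= 64)
            return -1;                          /* missing or absurd node id */
        *nodemask |= 1ULL << n;
        s = (*end == ',') ? end + 1 : end;
        if (s == end && *end)                   /* trailing garbage */
            return -1;
    }
    return 0;
}

int main(void)
{
    uint64_t mask;
    if (parse_node_list("node0,node2", &mask) == 0)
        printf("nodemask = 0x%llx\n", (unsigned long long)mask);  /* 0x5 */
    return 0;
}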
On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote:
> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> >> an extension to the current options?
> >
> > Yes, that sounds like a good idea. And relatively easy to implement.
> > Maybe a list or a number of nodes (to make it more complicated ;-)
>
> Oh yes, of course I implied this flexibility. Just wanted to give
> an easy to read example.
>
Yep, I agree it sounds nice and should not be too hard. I'll update the Wiki page.

I only have one question: should we try to take IONUMA into account here as well? I mean, if it turns out that I/O hubs are connected to some specific node(s), shouldn't we consider pinning/"affining" Dom0 to those node(s), as it most likely will be responsible for some/most DomUs' I/O?

Thanks and Regards,
Dario
>>> On 03.08.12 at 15:14, Dario Faggioli <raistlin@linux.it> wrote:
> On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote:
>> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> >> an extension to the current options?
>> >
>> > Yes, that sounds like a good idea, and relatively easy to implement.
>> > Maybe also allow a list or a number of nodes (to make it more complicated ;-).
>>
>> Oh yes, of course I implied that flexibility; I just wanted to give an
>> easy-to-read example.
>>
> Yep, I agree it sounds nice and should not be too hard. I'll update the
> Wiki page.
>
> I only have one question: should we try to take IONUMA into account here
> as well? I mean, if it turns out that the I/O hubs are connected to some
> specific node(s), shouldn't we consider pinning/"affining" Dom0 to those
> node(s), as it will most likely be responsible for some/most DomUs' I/O?

I don't think the necessary information is available at the time when
Dom0 gets constructed.

Jan
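To make the "dom0_mem=node<n>" syntax proposed in the exchange above a bit more concrete, here is a minimal C sketch of the kind of parsing it would need, including the list-of-nodes flexibility that was mentioned. Everything in it (the function name, the 64-node limit, the fall-back convention) is an assumption made up for illustration; it does not reflect Xen's actual command line handling.

    /* Purely illustrative: parse "node0" or "node0,node2" into a node bitmap.
     * This is NOT actual Xen code; names, types and limits are made up. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int parse_dom0_nodes(const char *arg, uint64_t *node_mask)
    {
        char buf[128], *tok;

        *node_mask = 0;
        if (strlen(arg) >= sizeof(buf))
            return -1;
        strcpy(buf, arg);

        for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
            unsigned long node;

            if (strncmp(tok, "node", 4) != 0)
                return -1;               /* not the new syntax: fall back to the old one */
            node = strtoul(tok + 4, NULL, 10);
            if (node >= 64)
                return -1;
            *node_mask |= 1ULL << node;  /* dom0 memory/vcpus come from these nodes */
        }
        return *node_mask ? 0 : -1;
    }

    int main(void)
    {
        uint64_t mask;

        if (parse_dom0_nodes("node0,node2", &mask) == 0)
            printf("confine dom0 to nodes 0x%llx\n", (unsigned long long)mask);
        return 0;
    }

The resulting mask could then drive both the memory allocation and the vcpu placement for Dom0 at construction time, subject to the caveat above about which information is actually available that early during boot.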
> From: Dario Faggioli [mailto:raistlin@linux.it]
> Subject: [Xen-devel] NUMA TODO-list for xen-devel
>
> Hi everyone,

Hi Dario --

Thanks for your great work on NUMA... an interest area of mine but one,
sadly, I haven't been able to give much time to, so I'm glad you've taken
this bull by the horns.

I've been sitting on an idea for some time that probably deserves some
exposure on your list. Naturally, it involves my favorite topic, tmem
(readers, please don't tune out yet :-).

It has occurred to me that a fundamental tenet of NUMA is to put
infrequently used data on "other" nodes, while pulling frequently used
data onto a "local" node.

Tmem very nicely separates infrequently-used data from frequently-used
data with an API/ABI that is now fully implemented in upstream Linux.

If Xen had an "alloc_page_on_any_node_but_the_current_one()" (or
"any_node_except_this_guests_node_set" for multi-node guests) and Xen's
tmem implementation were to use it, especially in combination with
selfballooning (also upstream), this could solve a significant part of
the NUMA problem when running tmem-enabled guests. The most frequently
used data stays in the guest (thus in the guest's "current node") and the
less frequently used data lives in tmem in the hypervisor (on the
complement of the guest's node set).

Naturally, this doesn't solve any NUMA problems at all for tmem-ignorant
or tmem-disabled guests, but if it works sufficiently well for
tmem-enabled guests, that might encourage other OS's to do a simple
implementation of tmem.

Sadly, I'm not able to invest much time in this idea, but the combination
of tmem and NUMA might interest some developers and/or grad students, in
which case I'd be happy to spend a little time assisting. I'll be at Xen
Summit for at least the first day, so we can chat more if you are
interested.

George/Jan, I suspect you have the best knowledge of tmem outside of
Oracle, as well as being NUMA-fluent, so I'd appreciate your thoughts as
well!

Thanks,
Dan
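As a rough illustration of the "allocate anywhere but on the guest's nodes" idea described above, here is a small sketch. The per-node page accounting and the node-set representation are invented for the example; none of this is Xen's actual tmem or allocator code.

    /* Hypothetical sketch: place tmem pages on nodes the guest does NOT run
     * on, falling back to the guest's own nodes only when the rest of the
     * host is full. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NODES 8

    static unsigned int free_pages[MAX_NODES] = { 100, 0, 50, 50, 0, 0, 0, 0 };

    /* Returns the node to take a tmem page from, or -1 if no memory is left. */
    static int tmem_pick_node(const bool guest_nodes[MAX_NODES])
    {
        int node;

        /* First pass: any node outside the guest's node set (infrequently
         * used data can tolerate being remote). */
        for (node = 0; node < MAX_NODES; node++) {
            if (!guest_nodes[node] && free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }

        /* Second pass: better a "local" tmem page than failing the put. */
        for (node = 0; node < MAX_NODES; node++) {
            if (guest_nodes[node] && free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }

        return -1;
    }

    int main(void)
    {
        bool guest_on[MAX_NODES] = { true };    /* guest confined to node 0 */

        printf("tmem page taken from node %d\n", tmem_pick_node(guest_on));
        return 0;
    }

Combined with selfballooning, a policy of this shape would tend to keep the frequently used working set local to the guest while pushing the cold data onto the complement of its node set, which is the effect described in the mail above.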
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, August 02, 2012 3:43 AM
> To: Dario Faggioli
> Cc: Andre Przywara; Anil Madhavapeddy; George Dunlap; xen-devel; Andrew Cooper; Yang Z Zhang
> Subject: Re: [Xen-devel] NUMA TODO-list for xen-devel
>
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >   guest ends up on more than one nodes, make sure it knows it's
> >   running on a NUMA platform (smaller than the actual host, but
> >   still NUMA). This interacts with some of the above points:
>
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
>
> >     * consider this during automatic placement for
> >       resuming/migrating domains (if they have a virtual topology,
> >       better not to change it);
> >     * consider this during memory migration (it can change the
> >       actual topology; should we update it on-line or disable memory
> >       migration?)
>
> The question is whether trading functionality for performance
> is an acceptable choice.

If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
on the following:

"Virtualization: You can have flexibility or you can have performance.
Pick one."

A couple of years ago, when NUMA was first being extensively discussed
for Xen, I suggested that this should really be a "top level" flag that a
sysadmin should be able to select: either optimize for performance or
optimize for flexibility. Then Xen and the Xen tools should "do the right
thing" depending on the selection.

I still think this is a good way to surface the tradeoffs of a very
complex problem to the vast majority of users/admins. Clearly they will
want "both", but forcing the choice will provoke more thought about their
use model, as well as provide important guidance to the underlying
implementations.
> >>>>> The question is whether this is really useful beyond the (I would
> >>>>> suppose) relatively small set of cases where migration isn't
> >>>>> needed.
> >>>>>
> >>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> >>>> suggesting that exposing a virtual topology is not a good idea as it
> >>>> poses constraints/prevents live migration?
> >
> > Honestly, what would be the problems with migration? NUMA awareness is
> > actually a software optimization, so we will not really break something
> > if the advertised topology isn't the real one.
>
> Sure, nothing would break, but the purpose of the whole feature
> is improving performance, and that might get entirely lost (or
> even worse) after a migration to a different topology host.

+1

In the end, customers who care about getting 99.9% of native performance
should use physical hardware. Live migration means that someone/something
is trying to do resource optimization, so performance optimization is
secondary. But claiming great performance before migration and getting
sucky performance after migration is, IMHO, a disaster, especially when
future "cloud users" won't have a clue whether their environment has
migrated or not.

Just my two cents...
> > [D] - Dynamic memory migration between different nodes of the host. As
> >       the counter-part of the NUMA-aware scheduler.
>
> I once read about a VMware feature: bandwidth-limited migration in the
> background, hot pages first. So we get flexibility and avoid CPU
> starving, but still don't hog the system with memory copying.
> Sounds quite ambitious, though.

Something like this, but between NUMA nodes instead of physical systems?

http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
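A very rough, self-contained sketch of what one background pass of such a "bandwidth-limited, hot pages first" migration between nodes could look like follows. The page list, the access counters and the per-pass budget are all assumptions made up for the example; real code would also have to allocate a frame on the target node, copy it and fix up the P2M/page tables, which is glossed over here.

    /* Conceptual sketch only: rate-limited, hot-pages-first migration of a
     * domain's pages towards a target NUMA node. All data structures and the
     * access-count sampling are invented for the purpose of the example. */
    #include <stdio.h>
    #include <stdlib.h>

    struct page_info {
        unsigned long mfn;
        unsigned int access_count;   /* e.g. sampled by periodic A-bit scanning */
        int node;
    };

    static int hotter_first(const void *a, const void *b)
    {
        const struct page_info *pa = a, *pb = b;

        /* Sort in descending order of access count. */
        return (pa->access_count < pb->access_count) -
               (pa->access_count > pb->access_count);
    }

    /* Move at most 'budget' pages towards 'target_node' in one pass, so the
     * copying never hogs the memory bus and can be spread over time. */
    static unsigned int migrate_pass(struct page_info *pages, unsigned int nr,
                                     int target_node, unsigned int budget)
    {
        unsigned int i, moved = 0;

        qsort(pages, nr, sizeof(*pages), hotter_first);

        for (i = 0; i < nr && moved < budget; i++) {
            if (pages[i].node == target_node)
                continue;
            /* Real code: allocate on target_node, copy the frame, update the
             * translations, free the old frame. Here we only flip a field. */
            pages[i].node = target_node;
            moved++;
        }
        return moved;
    }

    int main(void)
    {
        struct page_info pages[] = {
            { 0x1000, 50, 1 }, { 0x1001, 5, 1 }, { 0x1002, 90, 1 }, { 0x1003, 2, 0 },
        };

        printf("moved %u page(s) this pass\n", migrate_pass(pages, 4, 0, 2));
        return 0;
    }

The budget is what keeps each pass bandwidth-limited; ordering by hotness is what extracts the biggest locality benefit from every page that does get moved.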
>>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> The question is whether trading functionality for performance
>> is an acceptable choice.
>
> If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
> on the following:
>
> "Virtualization: You can have flexibility or you can have performance.
> Pick one."
>
> A couple of years ago, when NUMA was first being extensively discussed
> for Xen, I suggested that this should really be a "top level" flag that a
> sysadmin should be able to select: either optimize for performance or
> optimize for flexibility. Then Xen and the Xen tools should "do the right
> thing" depending on the selection.
>
> I still think this is a good way to surface the tradeoffs of a very
> complex problem to the vast majority of users/admins. Clearly they will
> want "both", but forcing the choice will provoke more thought about their
> use model, as well as provide important guidance to the underlying
> implementations.

I would expect a good part of them to pick performance, and then go whine
about something not working in an emergency. On xen-devel one could
respond with this-is-what-you-get, but you can't necessarily do so to
paying customers...

Jan
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] NUMA TODO-list for xen-devel
>
> >>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> The question is whether trading functionality for performance
> >> is an acceptable choice.
> >
> > If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
> > on the following:
> >
> > "Virtualization: You can have flexibility or you can have performance.
> > Pick one."
> >
> > A couple of years ago, when NUMA was first being extensively discussed
> > for Xen, I suggested that this should really be a "top level" flag that a
> > sysadmin should be able to select: either optimize for performance or
> > optimize for flexibility. Then Xen and the Xen tools should "do the right
> > thing" depending on the selection.
> >
> > I still think this is a good way to surface the tradeoffs of a very
> > complex problem to the vast majority of users/admins. Clearly they will
> > want "both", but forcing the choice will provoke more thought about their
> > use model, as well as provide important guidance to the underlying
> > implementations.
>
> I would expect a good part of them to pick performance, and then go whine
> about something not working in an emergency. On xen-devel one could
> respond with this-is-what-you-get, but you can't necessarily do so to
> paying customers...

Well, you can, but you first have to convince marketing that
virtualization doesn't solve all problems for all users all the time. :-)

The two options would have to be clearly documented as
"flexibility-is-my-highest-priority-and-performance-is-second-priority"
and
"performance-is-my-highest-priority-and-flexibility-is-second-priority",
and when a user selects the latter, they should be prompted with "Are you
really sure you want to use virtualization instead of bare metal?"

Sigh. We can only wish.

Dan
On Thu, 2012-08-02 at 01:04 +0000, Zhang, Yang Z wrote:
> > - Automatic placement at guest creation time. Basics are there and
> >   will be shipping with 4.2. However, a lot of other things are
> >   missing and/or can be improved, for instance:
> >   [...]
> >     * consider IONUMA during placement;
>
> We should consider two things:
> 1. Dom0 IONUMA: devices used by Dom0 should get their DMA buffers from
>    the node where the device resides. Currently, Dom0 allocates DMA
>    buffers without providing the node info to the hypercall.
> 2. Guest IONUMA: when a guest boots up with a pass-through device, we
>    need to allocate its memory from the node where the device resides,
>    for further DMA buffer allocation, and let the guest know the IONUMA
>    topology. This relies on guest NUMA.
> This topic was mentioned at Xen Summit 2011:
> http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf
>
Seems fine, I knew that presentation and I added these details to the
Wiki page (sorry for the delay). Are you (or someone from your group)
perhaps working, or planning to work, on this?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
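To illustrate what "providing the node info to the hypercall" could mean in practice, here is a tiny sketch. The request structure, the device-to-node lookup and the bus-number threshold are all made up for the example; they are not Xen's actual memory interface.

    /* Illustration only: attach a node hint, derived from the device's
     * location, to a DMA buffer request. Structures and helpers are invented
     * for the sketch. */
    #include <stdio.h>

    struct mem_request {
        unsigned long nr_pages;
        int node;                  /* -1 would mean "no preference" */
    };

    /* Pretend lookup: which NUMA node a device on a given PCI bus hangs off. */
    static int device_to_node(unsigned int bus)
    {
        return bus < 0x40 ? 0 : 1;     /* assumption just for this example */
    }

    static void alloc_dma_buffer(unsigned int bus, unsigned long nr_pages)
    {
        struct mem_request req = {
            .nr_pages = nr_pages,
            .node     = device_to_node(bus),   /* the info that is missing today */
        };

        /* A real implementation would hand 'req' to the hypervisor, which
         * would then try to satisfy the allocation from that node. */
        printf("request %lu page(s) on node %d for a device on bus %#x\n",
               req.nr_pages, req.node, bus);
    }

    int main(void)
    {
        alloc_dma_buffer(0x05, 16);
        alloc_dma_buffer(0x80, 16);
        return 0;
    }

The guest IONUMA case is the same idea one level up: the guest needs to be told which (virtual) node the pass-through device sits on, so that it can make the equivalent decision internally, which is why it depends on guest NUMA.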
On Fri, 2012-08-03 at 15:22 -0700, Dan Magenheimer wrote:
> Hi Dario --
>
Hello Dan,

> Thanks for your great work on NUMA... an interest area of mine but one,
> sadly, I haven't been able to give much time to, so I'm glad you've taken
> this bull by the horns.
>
Trying to... Let's see! :-P

> I've been sitting on an idea for some time that probably deserves some
> exposure on your list. Naturally, it involves my favorite topic, tmem
> (readers, please don't tune out yet :-).
>
It sure does! I've already put something quite generic about "memory
sharing" there, because I know that it has all but trivial interactions
with the improved NUMA support I am/we are trying to envision. The fact
that it is, as I said, generic, is due to my ignorance (let's say, for
now) of the whole tmem thing, so thanks for the contribution: it's very
useful to hear your point of view on this!

> It has occurred to me that a fundamental tenet of NUMA is to put
> infrequently used data on "other" nodes, while pulling frequently used
> data onto a "local" node.
>
> Tmem very nicely separates infrequently-used data from frequently-used
> data with an API/ABI that is now fully implemented in upstream Linux.
>
I see, and it seems nice.

> [..]
>
> Naturally, this doesn't solve any NUMA problems at all for tmem-ignorant
> or tmem-disabled guests, but if it works sufficiently well for
> tmem-enabled guests, that might encourage other OS's to do a simple
> implementation of tmem.
>
Sure. In my opinion, this is not an area where we could aim at "solving
every problem for everyone". However, we should definitely target having
a sensible solution for the default and/or most common use cases and
scenarios.

> Sadly, I'm not able to invest much time in this idea, but the combination
> of tmem and NUMA might interest some developers and/or grad students, in
> which case I'd be happy to spend a little time assisting.
>
That's definitely the case. I've tried to put a summary of what you said
in this mail on the Wiki (http://wiki.xen.org/wiki/Xen_NUMA_Roadmap) and
also put your contact next to it. Feel free to update/correct it if you
find something wrong. :-P

> I'll be at Xen Summit for at least the first day, so we can chat more if
> you are interested.
>
I indeed am interested, so let's make that happen! :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
On Thu, 2012-08-02 at 01:04 +0100, Malte Schwarzkopf wrote:
> > Wow... That's really cool. I'll definitely take a deep look at all these
> > data! I'm also adding the link to the wiki, if you're fine with that...
>
> No problem with adding a link, as this is public data :) If possible,
> it'd be splendid to put a note next to this link encouraging people to
> submit their own results -- doing so is very simple, and helps us extend
> the database. Instructions are at
> http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short
> link, http://fable.io).
>
Ok, I've tried doing this; here is how it looks:

http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap#Inter-VM_dependencies_and_communication_issues

Thanks also for the references, I'll definitely take a look at them. :-)

> One interesting thing to look at (that we haven't looked at yet) is what
> memory allocators do about NUMA these days; there is an AMD whitepaper
> from 2009 discussing the performance benefits of a NUMA-aware version of
> tcmalloc [3], but I have found it hard to reproduce their results on
> modern hardware. Of course, being virtualized may complicate matters
> here, since the memory allocator can no longer freely pick and choose
> where to allocate from.
>
> Scheduling, notably, is key here, since the CPU a process is scheduled
> on may determine where its memory is allocated -- frequent migrations
> are likely to be bad for performance due to remote memory accesses,
>
That might be true for Linux, but it's not so much true (fortunately :-P)
for Xen. However, I also think scheduling is a very important aspect of
this whole NUMA thing... I'll repost my NUMA-aware credit scheduler
patches soon.

> although we have been unable to quantify a significant difference on
> non-synthetic macrobenchmarks; that said, we did not try very hard so far.
>
I think both kinds of benchmark are interesting. I tried to concentrate a
bit on macrobenchmarks (specjbb -- I'll let you decide whether that's
synthetic or not :-D).

Another issue, if we want to tackle the problem of
communicating/cooperating VMs, pops up at the interface level, i.e., how
do we want the user to tell us that 2 (or more) VMs are "related"? Up to
what level of detail? Should this "relationship" be permanent, or might
it change over time?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
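On the interface question raised above, one possible (entirely hypothetical) shape for it, just to make the discussion concrete: the user tags related VMs with a shared group name, and the placement code biases its node scoring towards, or away from, nodes that already host members of the same group, depending on the chosen policy. Nothing below reflects the existing libxl placement code, and the implied config key is an assumption.

    /* Hypothetical sketch of group-aware node scoring for placement. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 4

    struct placed_vm {
        const char *name;
        const char *group;    /* e.g. "webapp1"; NULL if not related to anything */
        int node;             /* node the VM ended up on */
    };

    /* Lower score = better candidate. 'colocate' selects between the two
     * policies: pack the group onto shared nodes, or spread it out. */
    static int node_score(int node, const struct placed_vm *vms, int nr_vms,
                          const char *group, int base_load, int colocate)
    {
        int i, mates = 0;

        for (i = 0; i < nr_vms; i++)
            if (group && vms[i].group && vms[i].node == node &&
                strcmp(vms[i].group, group) == 0)
                mates++;

        return colocate ? base_load - mates : base_load + mates;
    }

    int main(void)
    {
        struct placed_vm placed[] = {
            { "web1", "webapp1", 0 }, { "db1", "webapp1", 2 },
        };
        int loads[MAX_NODES] = { 3, 1, 2, 1 };
        int node;

        for (node = 0; node < MAX_NODES; node++)
            printf("node %d: score %d\n", node,
                   node_score(node, placed, 2, "webapp1", loads[node], 1));
        return 0;
    }

Whether the relationship should be permanent or allowed to change over time then becomes a question of whether the tag can be updated on a running domain, and whether such an update should trigger any re-placement or memory migration.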
On Fri, 2012-08-03 at 15:42 -0700, Dan Magenheimer wrote:
> > > [D] - Dynamic memory migration between different nodes of the host. As
> > >       the counter-part of the NUMA-aware scheduler.
> >
> > I once read about a VMware feature: bandwidth-limited migration in the
> > background, hot pages first. So we get flexibility and avoid CPU
> > starving, but still don't hog the system with memory copying.
> > Sounds quite ambitious, though.
>
> Something like this, but between NUMA nodes instead of physical systems?
>
> http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
>
Likely. The analogy between this kind of "memory migration" and the
actual live migration we already have is indeed something I want to take
advantage of. The fact that we support that small thing called
_paravirtualization_ is complicating it all quite a bit, but I'm looking
into it...

Thanks for the reference. :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
On Fri, 2012-08-03 at 12:02 +0200, Andre Przywara wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> > ...
> >
> >     * automatic placement of Dom0, if possible (my current series is
> >       only affecting DomU)
>
> I think Dom0 NUMA awareness should be one of the top priorities. If I
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
> actually has memory from all 8 nodes and thinks its memory is flat.
>
Ok, I updated the Wiki page with a link to this (sub)thread --- more
specifically, to the mails where we agree about the new syntax. I can
work on it, but not in the next few days, so let's see if anyone steps up
before I get to look at it. :-)

> > [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
> >       just have them _prefer_ running on the nodes where their memory
> >       is.
>
> This would be really cool. I once thought about something like a
> home node: we start with placement, to allocate memory from one node;
> then we relax the VCPU pinning, but mark that node as special for this
> guest, so that it, if possible, gets run there. But in times of
> CPU pressure we are happy to let it run on other nodes: CPU starving is
> much worse than the NUMA penalty.
>
Yep. Patches coming soon.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
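The "home node" behaviour described above can be summed up in a tiny scoring sketch. It is only an illustration of the policy, not the credit scheduler code, and the cost weights are arbitrary assumptions: queue length dominates, being off the home node is only a small penalty.

    /* Illustrative only: pick a pcpu preferring the vcpu's home node, but
     * never at the price of leaving the vcpu waiting while remote pcpus idle. */
    #include <stdio.h>

    #define NR_CPUS 8

    struct pcpu {
        int node;
        int nr_runnable;   /* vcpus already queued on this pcpu */
    };

    static int pick_cpu(const struct pcpu cpus[NR_CPUS], int home_node)
    {
        int cpu, best = -1, best_cost = 1 << 30;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            /* Queue length dominates; being off the home node only adds a
             * small penalty, so an idle remote pcpu still beats a busy
             * local one: CPU starving is worse than the NUMA penalty. */
            int cost = cpus[cpu].nr_runnable * 4 +
                       (cpus[cpu].node == home_node ? 0 : 1);

            if (cost < best_cost) {
                best_cost = cost;
                best = cpu;
            }
        }
        return best;
    }

    int main(void)
    {
        struct pcpu cpus[NR_CPUS] = {
            { 0, 2 }, { 0, 2 }, { 0, 3 }, { 0, 2 },   /* home node 0: all busy  */
            { 1, 0 }, { 1, 1 }, { 1, 2 }, { 1, 0 },   /* node 1: has idle pcpus */
        };

        printf("picked pcpu %d\n", pick_cpu(cpus, 0));
        return 0;
    }

With equal load the home node wins because of the small penalty term, but as soon as the home node's runqueues build up, a remote pcpu takes over, which is exactly the "prefer, don't pin" behaviour being discussed.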