Xen vnuma introduction.

The patchset introduces vnuma to paravirtualized Xen guests running as domU.
A Xen subop hypercall is used to retrieve the vnuma topology information.
Based on the topology retrieved from Xen, the number of NUMA nodes, the
memory ranges, the distance table and the cpumask are set. If initialization
fails, a 'dummy' node is set and the nodemask is unset. The vNUMA topology is
constructed by the Xen toolstack. The Xen patchset is available at
https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.

Example dmesg of a vnuma-enabled PV domain:

[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xffffffff]
[    0.000000]   node   1: [mem 0x100000000-0x1ffffffff]
[    0.000000]   node   2: [mem 0x200000000-0x2ffffffff]
[    0.000000]   node   3: [mem 0x300000000-0x3ffffffff]
[    0.000000] On node 0 totalpages: 1048479
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 1044480 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152

numactl output:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

The current Linux patchset is available at
https://git.gitorious.org/xenvnuma/linuxvnuma.git:v3

The Xen patchset is available at:
https://git.gitorious.org/xenvnuma/xenvnuma.git:v3

TODO:
* dom0, PVH and HVM vnuma support;
* multiple memory ranges per node support;
* benchmarking.

Elena Ufimtseva (2):
  xen: vnuma support for PV guests running as domU
  xen: enable vnuma for PV guest

 arch/x86/include/asm/xen/vnuma.h |   12 ++++
 arch/x86/mm/numa.c               |    3 +
 arch/x86/xen/Makefile            |    2 +-
 arch/x86/xen/setup.c             |    6 +-
 arch/x86/xen/vnuma.c             |  127 ++++++++++++++++++++++++++++++++++++++
 include/xen/interface/memory.h   |   44 +++++++++++++
 6 files changed, 192 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/xen/vnuma.h
 create mode 100644 arch/x86/xen/vnuma.c

-- 
1.7.10.4
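Schematically, the guest-side initialization described above amounts to
something like the sketch below. This is not the code from the patch: the
subop name, the structure layout and the field names are placeholders for
illustration only; the real definitions live in arch/x86/xen/vnuma.c and
include/xen/interface/memory.h of the series, and the exact ABI may differ.

/* Hedged sketch of a PV guest pulling its vNUMA topology from Xen and
 * feeding it into the x86 NUMA layer. All names marked "placeholder"
 * are made up for illustration. */
#include <linux/init.h>
#include <linux/cpumask.h>
#include <asm/numa.h>
#include <asm/xen/hypercall.h>

struct vnuma_topology_info_example {      /* placeholder layout */
        unsigned int nr_nodes;
        u64 *mem_start, *mem_end;         /* one memory range per node (for now) */
        unsigned int *distance;           /* nr_nodes * nr_nodes matrix          */
        unsigned int *cpu_to_node;        /* one entry per vcpu                  */
};

static int __init xen_numa_init_example(void)
{
        struct vnuma_topology_info_example numa;
        unsigned int i, j;
        int cpu, rc;

        /* Subop added by the Xen side of the series; name is a placeholder. */
        rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &numa);
        if (rc < 0 || numa.nr_nodes == 0)
                return -EINVAL;           /* caller falls back to a dummy node */

        for (i = 0; i < numa.nr_nodes; i++)
                numa_add_memblk(i, numa.mem_start[i], numa.mem_end[i]);

        for (i = 0; i < numa.nr_nodes; i++)
                for (j = 0; j < numa.nr_nodes; j++)
                        numa_set_distance(i, j,
                                numa.distance[i * numa.nr_nodes + j]);

        for_each_possible_cpu(cpu)
                numa_set_node(cpu, numa.cpu_to_node[cpu]);

        return 0;
}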
Konrad Rzeszutek Wilk
2013-Nov-19 15:38 UTC
Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
> Xen vnuma introduction.
>
> The patchset introduces vnuma to paravirtualized Xen guests
> running as domU.
> A Xen subop hypercall is used to retrieve the vnuma topology information.
> Based on the topology retrieved from Xen, the number of NUMA nodes, the
> memory ranges, the distance table and the cpumask are set.
> If initialization fails, a 'dummy' node is set and the nodemask is
> unset.
> The vNUMA topology is constructed by the Xen toolstack. The Xen patchset
> is available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.

Yeey!

One question - I know you had questions about the
PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
be harvested for AutoNUMA balancing.

And that the hypercall to set such a PTE entry disallows
PROT_GLOBAL (it strips it off)? That means that when the
Linux page system kicks in (as it has ~PAGE_PRESENT) the
Linux page fault handler won't see PROT_GLOBAL (as it has
been filtered out). Which means that the AutoNUMA code won't
kick in.

(see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)

Was that problem ever answered?
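A toy illustration of the concern above, in ordinary userspace C. This is
not kernel or Xen code, and the bit positions are made up: it only shows
that if the hypervisor filters a flag bit out of a PTE written by the
guest, a later fault handler that keys on that bit can no longer tell a
NUMA-hinting entry apart from an ordinary non-present one.

#include <stdio.h>
#include <stdint.h>

#define PTE_PRESENT (1u << 0)          /* hypothetical "present" bit        */
#define PTE_GLOBAL  (1u << 8)          /* hypothetical bit reused as "hint" */

static uint32_t hypervisor_validate(uint32_t pte)
{
        return pte & ~PTE_GLOBAL;      /* models the hypercall stripping it */
}

int main(void)
{
        /* Present bit cleared, hint bit set: what the balancing scan writes. */
        uint32_t hinted = (0x1234u << 12) | PTE_GLOBAL;
        uint32_t stored = hypervisor_validate(hinted);

        printf("guest wrote:  present=%d hint=%d\n",
               !!(hinted & PTE_PRESENT), !!(hinted & PTE_GLOBAL));
        printf("fault sees:   present=%d hint=%d\n",
               !!(stored & PTE_PRESENT), !!(stored & PTE_GLOBAL));
        return 0;
}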
Dario Faggioli
2013-Nov-19 18:29 UTC
Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
> > The patchset introduces vnuma to paravirtualized Xen guests
> > running as domU.
> > A Xen subop hypercall is used to retrieve the vnuma topology information.
> > Based on the topology retrieved from Xen, the number of NUMA nodes, the
> > memory ranges, the distance table and the cpumask are set.
> > If initialization fails, a 'dummy' node is set and the nodemask is
> > unset.
> > The vNUMA topology is constructed by the Xen toolstack. The Xen patchset
> > is available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.
>
> Yeey!
>
:-)

> One question - I know you had questions about the
> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
> be harvested for AutoNUMA balancing.
>
> And that the hypercall to set such a PTE entry disallows
> PROT_GLOBAL (it strips it off)? That means that when the
> Linux page system kicks in (as it has ~PAGE_PRESENT) the
> Linux page fault handler won't see PROT_GLOBAL (as it has
> been filtered out). Which means that the AutoNUMA code won't
> kick in.
>
> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)
>
> Was that problem ever answered?
>
I think the issue is a twofold one.

If I remember correctly (Elena, please correct me if I'm wrong), Elena
was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest.
That's what pushed her to investigate the issue, and led to what you're
summing up above.

However, it appears the crash was due to something completely unrelated
to Xen and vNUMA, was affecting baremetal too, and got fixed, which
means the crash is now gone.

It remains to be seen (I think) whether that also means that AutoNUMA
works. In fact, chatting about this in Edinburgh, Elena managed to
convince me pretty badly that we should --as part of the vNUMA support--
do something about this, in order to make it work. At that time I
thought we should be doing something to avoid the system going ka-boom,
but as I said, even now that it does not crash anymore, she was so
persuasive that I now find it quite hard to believe that we really don't
need to do anything. :-P

I guess, as soon as we get the chance, we should see if this actually
works, i.e., in addition to seeing the proper topology and not crashing,
verify that AutoNUMA in the guest is actually doing its job.

What do you think? Again, Elena, please chime in and explain how things
are, if I got something wrong. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Elena Ufimtseva
2013-Dec-04 00:35 UTC
Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
On Tue, Nov 19, 2013 at 1:29 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote:
>> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
>> > The patchset introduces vnuma to paravirtualized Xen guests
>> > running as domU.
>> > A Xen subop hypercall is used to retrieve the vnuma topology information.
>> > Based on the topology retrieved from Xen, the number of NUMA nodes, the
>> > memory ranges, the distance table and the cpumask are set.
>> > If initialization fails, a 'dummy' node is set and the nodemask is
>> > unset.
>> > The vNUMA topology is constructed by the Xen toolstack. The Xen patchset
>> > is available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.
>>
>> Yeey!
>>
> :-)
>
>> One question - I know you had questions about the
>> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
>> be harvested for AutoNUMA balancing.
>>
>> And that the hypercall to set such a PTE entry disallows
>> PROT_GLOBAL (it strips it off)? That means that when the
>> Linux page system kicks in (as it has ~PAGE_PRESENT) the
>> Linux page fault handler won't see PROT_GLOBAL (as it has
>> been filtered out). Which means that the AutoNUMA code won't
>> kick in.
>>
>> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)
>>
>> Was that problem ever answered?
>>
> I think the issue is a twofold one.
>
> If I remember correctly (Elena, please correct me if I'm wrong), Elena
> was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest.
> That's what pushed her to investigate the issue, and led to what you're
> summing up above.
>
> However, it appears the crash was due to something completely unrelated
> to Xen and vNUMA, was affecting baremetal too, and got fixed, which
> means the crash is now gone.
>
> It remains to be seen (I think) whether that also means that AutoNUMA
> works. In fact, chatting about this in Edinburgh, Elena managed to
> convince me pretty badly that we should --as part of the vNUMA support--
> do something about this, in order to make it work. At that time I
> thought we should be doing something to avoid the system going ka-boom,
> but as I said, even now that it does not crash anymore, she was so
> persuasive that I now find it quite hard to believe that we really don't
> need to do anything. :-P

Yes, you were right Dario :) See at the end. PV guests do not crash,
but they have user space memory corruption.
Ok, so I will try to understand what again happened during this
weekend. Meanwhile, I am posting the patches for Xen.

>
> I guess, as soon as we get the chance, we should see if this actually
> works, i.e., in addition to seeing the proper topology and not crashing,
> verify that AutoNUMA in the guest is actually doing its job.
>
> What do you think? Again, Elena, please chime in and explain how things
> are, if I got something wrong. :-)
>

Oh guys, I feel really bad about not replying to these emails... Somehow
these replies all got deleted.. weird.

Ok, about the automatic balancing. At the moment of the last patch,
automatic NUMA balancing seemed to work, but after rebasing on top of
3.12-rc2 I see similar issues. I will try to figure out which commits
broke it and will contact Ingo Molnar and Mel Gorman.

Konrad,
as of the PROT_GLOBAL flag, I will double check once more to exclude
errors from my side.
Last time I was able to have numa_balancing working without any
modifications from the hypervisor side.
But again, I want to double check this, some experiments might only have
appeared to be good :)

> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

-- 
Elena
Elena Ufimtseva
2013-Dec-04 06:20 UTC
Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> On Tue, Nov 19, 2013 at 1:29 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
>>> > The patchset introduces vnuma to paravirtualized Xen guests
>>> > running as domU.
>>> > A Xen subop hypercall is used to retrieve the vnuma topology information.
>>> > Based on the topology retrieved from Xen, the number of NUMA nodes, the
>>> > memory ranges, the distance table and the cpumask are set.
>>> > If initialization fails, a 'dummy' node is set and the nodemask is
>>> > unset.
>>> > The vNUMA topology is constructed by the Xen toolstack. The Xen patchset
>>> > is available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.
>>>
>>> Yeey!
>>>
>> :-)
>>
>>> One question - I know you had questions about the
>>> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
>>> be harvested for AutoNUMA balancing.
>>>
>>> And that the hypercall to set such a PTE entry disallows
>>> PROT_GLOBAL (it strips it off)? That means that when the
>>> Linux page system kicks in (as it has ~PAGE_PRESENT) the
>>> Linux page fault handler won't see PROT_GLOBAL (as it has
>>> been filtered out). Which means that the AutoNUMA code won't
>>> kick in.
>>>
>>> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)
>>>
>>> Was that problem ever answered?
>>>
>> I think the issue is a twofold one.
>>
>> If I remember correctly (Elena, please correct me if I'm wrong), Elena
>> was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest.
>> That's what pushed her to investigate the issue, and led to what you're
>> summing up above.
>>
>> However, it appears the crash was due to something completely unrelated
>> to Xen and vNUMA, was affecting baremetal too, and got fixed, which
>> means the crash is now gone.
>>
>> It remains to be seen (I think) whether that also means that AutoNUMA
>> works. In fact, chatting about this in Edinburgh, Elena managed to
>> convince me pretty badly that we should --as part of the vNUMA support--
>> do something about this, in order to make it work. At that time I
>> thought we should be doing something to avoid the system going ka-boom,
>> but as I said, even now that it does not crash anymore, she was so
>> persuasive that I now find it quite hard to believe that we really don't
>> need to do anything. :-P
>
> Yes, you were right Dario :) See at the end. PV guests do not crash,
> but they have user space memory corruption.
> Ok, so I will try to understand what again happened during this
> weekend. Meanwhile, I am posting the patches for Xen.
>
>>
>> I guess, as soon as we get the chance, we should see if this actually
>> works, i.e., in addition to seeing the proper topology and not crashing,
>> verify that AutoNUMA in the guest is actually doing its job.
>>
>> What do you think? Again, Elena, please chime in and explain how things
>> are, if I got something wrong. :-)
>>
>
> Oh guys, I feel really bad about not replying to these emails... Somehow
> these replies all got deleted.. weird.
>
> Ok, about the automatic balancing. At the moment of the last patch,
> automatic NUMA balancing seemed to work, but after rebasing on top of
> 3.12-rc2 I see similar issues. I will try to figure out which commits
> broke it and will contact Ingo Molnar and Mel Gorman.
>
> Konrad,
> as of the PROT_GLOBAL flag, I will double check once more to exclude
> errors from my side.
> Last time I was able to have numa_balancing working without any
> modifications from the hypervisor side.
> But again, I want to double check this, some experiments might only have
> appeared to be good :)
>
>> Regards,
>> Dario
>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>

As of now I have patch v4 ready for review. Not sure if it will be more
beneficial to post it for review or to look closer at the current problem.

The issue I am seeing right now is different from what was happening
before. The corruption happens on the change_prot_numa path:

[ 6638.021439] pfn 45e602, highest_memmap_pfn - 14ddd7
[ 6638.021444] BUG: Bad page map in process dd pte:800000045e602166 pmd:abf1a067
[ 6638.021449] addr:00007f4fda2d8000 vm_flags:00100073 anon_vma:ffff8800abf77b90 mapping: (null) index:7f4fda2d8
[ 6638.021457] CPU: 1 PID: 1033 Comm: dd Tainted: G B W 3.13.0-rc2+ #10
[ 6638.021462]  0000000000000000 00007f4fda2d8000 ffffffff813ca5b1 ffff88010d68deb8
[ 6638.021471]  ffffffff810f2c88 00000000abf1a067 800000045e602166 0000000000000000
[ 6638.021482]  000000000045e602 ffff88010d68deb8 00007f4fda2d8000 800000045e602166
[ 6638.021492] Call Trace:
[ 6638.021497]  [<ffffffff813ca5b1>] ? dump_stack+0x41/0x51
[ 6638.021503]  [<ffffffff810f2c88>] ? print_bad_pte+0x19d/0x1c9
[ 6638.021509]  [<ffffffff810f3aef>] ? vm_normal_page+0x94/0xb3
[ 6638.021519]  [<ffffffff810fb788>] ? change_protection+0x35c/0x5a8
[ 6638.021527]  [<ffffffff81107965>] ? change_prot_numa+0x13/0x24
[ 6638.021533]  [<ffffffff81071697>] ? task_numa_work+0x1fb/0x299
[ 6638.021539]  [<ffffffff8105ef54>] ? task_work_run+0x7b/0x8f
[ 6638.021545]  [<ffffffff8100e658>] ? do_notify_resume+0x53/0x68
[ 6638.021552]  [<ffffffff813d4432>] ? int_signal+0x12/0x17
[ 6638.021560] pfn 45d732, highest_memmap_pfn - 14ddd7
[ 6638.021565] BUG: Bad page map in process dd pte:800000045d732166 pmd:10d684067
[ 6638.021572] addr:00007fff7c143000 vm_flags:00100173 anon_vma:ffff8800abf77960 mapping: (null) index:7fffffffc
[ 6638.021582] CPU: 1 PID: 1033 Comm: dd Tainted: G B W 3.13.0-rc2+ #10
[ 6638.021587]  0000000000000000 00007fff7c143000 ffffffff813ca5b1 ffff8800abf339b0
[ 6638.021595]  ffffffff810f2c88 000000010d684067 800000045d732166 0000000000000000
[ 6638.021603]  000000000045d732 ffff8800abf339b0 00007fff7c143000 800000045d732166

The code has changed since the last problem; I will work on this to see
where it comes from.

Elena

>
>
> --
> Elena

-- 
Elena
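For reference, the "BUG: Bad page map" report above comes from a check of
roughly the following shape in mm/memory.c of that era (a paraphrased,
simplified sketch, not the verbatim source). The pfn printed in the trace
(0x45e602) is larger than highest_memmap_pfn (0x14ddd7), which is exactly
the condition that makes vm_normal_page() complain.

struct page *vm_normal_page_sketch(struct vm_area_struct *vma,
                                   unsigned long addr, pte_t pte)
{
        unsigned long pfn = pte_pfn(pte);

        /* pte_special() / VM_MIXEDMAP / VM_PFNMAP handling elided */

        if (unlikely(pfn > highest_memmap_pfn)) {
                /* pfn has no struct page behind it */
                print_bad_pte(vma, addr, pte, NULL);   /* "BUG: Bad page map" */
                return NULL;
        }
        return pfn_to_page(pfn);
}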
Dario Faggioli
2013-Dec-05 01:13 UTC
Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
On mer, 2013-12-04 at 01:20 -0500, Elena Ufimtseva wrote:
> On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> > Oh guys, I feel really bad about not replying to these emails... Somehow
> > these replies all got deleted.. weird.
> >
No worries... You should see *my* backlog. :-P

> > Ok, about the automatic balancing. At the moment of the last patch,
> > automatic NUMA balancing seemed to work, but after rebasing on top of
> > 3.12-rc2 I see similar issues. I will try to figure out which commits
> > broke it and will contact Ingo Molnar and Mel Gorman.
> >
> As of now I have patch v4 ready for review. Not sure if it will be more
> beneficial to post it for review or to look closer at the current
> problem.
>
You mean the Linux side? Perhaps stick somewhere a reference to the git
tree/branch where it lives, but, before re-sending, let's wait for it to
be as issue free as we can tell?

> The issue I am seeing right now is different from what was happening
> before. The corruption happens on the change_prot_numa path:
>
Ok, so, I think I need to step back a bit from the actual stack trace
and look at the big picture. Please, Elena or anyone, correct me if I'm
saying something wrong about how Linux's autonuma works and interacts
with Xen.

The way it worked when I last looked at it was sort of like this:
 - there was a kthread scanning all the pages, removing the PAGE_PRESENT
   bit from actually present pages, and adding a new special one
   (PAGE_NUMA or something like that);
 - when a page fault is triggered and the PAGE_NUMA flag is found, it
   figures out the page is actually there, so no swap or anything.
   However, it tracks from what node the access to that page came from,
   matches it with the node where the page actually is, and collects
   some statistics about that;
 - at some point (and here I don't remember the exact logic, since it
   changed quite a few times) pages ranking badly in the stats above are
   moved from one node to another.

Is this description still accurate? If yes, here's what I would (double)
check, when running this in a PV guest on top of Xen:

 1. the NUMA hinting page faults: are we getting and handling them
    correctly in the PV guest? Are the stats in the guest kernel being
    updated in a sensible way, i.e., do they make sense and properly
    relate to the virtual topology of the guest? At some point we
    thought it would have been necessary to intercept these faults and
    make sure the above is true with some help from the hypervisor...
    Is this the case? Why? Why not?

 2. what happens when autonuma tries to move pages from one node to
    another? For us, that would mean moving from one virtual node to
    another... Is there a need to do anything at all? I mean, is this,
    from our perspective, just copying the content of an MFN from node X
    into another MFN on node Y, or do we need to update some of our
    vnuma tracking data structures in Xen?

If we have this figured out already, then I think we just chase bugs and
repost the series. If not, well, I think we should. :-D

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
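A schematic of the three steps described above, as simplified pseudo-kernel
C rather than the actual implementation. At the time the hinting flag was
spelled _PAGE_NUMA on x86 and the scan was driven from task_numa_work(),
which is visible in the trace earlier in the thread; task_numa_fault_stats()
below is a made-up placeholder for the real accounting hook.

/* Step 1: the periodic scan (task_numa_work() -> change_prot_numa())
 * marks present user PTEs for hinting: present bit off, hint flag on. */
static void numa_hint_scan_sketch(pte_t *ptep)
{
        if (pte_present(*ptep))
                set_pte(ptep, pte_mknuma(*ptep));
}

/* Step 2: the fault handler recognises a hinting fault: the page is still
 * resident, so record which node the faulting CPU sits on versus where
 * the page lives, then make the pte ordinary again. */
static void numa_hint_fault_sketch(unsigned long addr, pte_t *ptep,
                                   struct page *page)
{
        int cpu_node  = numa_node_id();
        int page_node = page_to_nid(page);

        task_numa_fault_stats(cpu_node, page_node);   /* placeholder name */
        set_pte(ptep, pte_mknonnuma(*ptep));
}

/* Step 3 (not shown): pages whose statistics show repeated remote access
 * are migrated toward the node that keeps touching them. */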