Konrad Rzeszutek Wilk
2009-Nov-13 22:16 UTC
[Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
# HG changeset patch
# User konrad.wilk@oracle.com
# Date 1258150318 18000
# Node ID 82762bc10aa5a193173d8a83a5dbada1003bdcd2
# Parent  88adf22e0fe3a77d0be95530b74c3781ffc918f1
Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.

If the user hasn't used the dom0_mem= boot parameter, the privileged domain
usurps all of the memory. During launch of PV guests with PCI pass-through
we ratchet down the memory for the privileged domain to the required memory
for the PV guest. However, for PV guests with PCI pass-through we do not
take into account that the PV guest is going to swap its SWIOTLB memory
for DMA32 memory - in fact, swap 64MB of it. This patch balloons down
the privileged domain so that there are 64MB of DMA32 memory available.

Note: If 'dom0_mem' is used, the user will probably never encounter this failure.

P.S. If you see:

    about to get started...

and nothing after that, and xenctx shows:

    Call Trace:
      [<ffffffff8132cfe3>] __const_udelay+0x1e  <--
      [<ffffffff816b9043>] panic+0x1c0
      [<ffffffff81013335>] xen_swiotlb_fixup+0x123
      [<ffffffff81a05e17>] xen_swiotlb_init_with_default_size+0x9c
      [<ffffffff81a05f91>] xen_swiotlb_init+0x4b
      [<ffffffff81a0ab72>] pci_iommu_alloc+0x86
      [<ffffffff81a22972>] mem_init+0x28
      [<ffffffff813201a9>] sort_extable+0x39
      [<ffffffff819feb90>] start_kernel+0x301
      [<ffffffff819fdf76>] x86_64_start_reservations+0x101
      [<ffffffff81a03cdf>] xen_start_kernel+0x715

then this is the patch for it.

diff -r 88adf22e0fe3 -r 82762bc10aa5 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Fri Nov 13 17:10:09 2009 -0500
+++ b/tools/python/xen/lowlevel/xc/xc.c	Fri Nov 13 17:11:58 2009 -0500
@@ -1059,6 +1059,7 @@
     int i, j, max_cpu_id;
     uint64_t free_heap;
     PyObject *ret_obj, *node_to_cpu_obj, *node_to_memory_obj;
+    PyObject *node_to_dma32_mem_obj;
     xc_cpu_to_node_t map[MAX_CPU_ID + 1];
     const char *virtcap_names[] = { "hvm", "hvm_directio" };

@@ -1128,10 +1129,27 @@
         Py_DECREF(pyint);
     }

+    xc_dom_loginit();
+    /* DMA memory. */
+    node_to_dma32_mem_obj = PyList_New(0);
+
+    for ( i = 0; i < info.nr_nodes; i++ )
+    {
+        PyObject *pyint;
+
+        xc_availheap(self->xc_handle, 0, 32, i, &free_heap);
+        xc_dom_printf("Node:%d: DMA32:%ld\n", i, free_heap);
+        pyint = PyInt_FromLong(free_heap / 1024);
+        PyList_Append(node_to_dma32_mem_obj, pyint);
+        Py_DECREF(pyint);
+    }
+
     PyDict_SetItemString(ret_obj, "node_to_cpu", node_to_cpu_obj);
     Py_DECREF(node_to_cpu_obj);
     PyDict_SetItemString(ret_obj, "node_to_memory", node_to_memory_obj);
     Py_DECREF(node_to_memory_obj);
+    PyDict_SetItemString(ret_obj, "node_to_dma32_mem", node_to_dma32_mem_obj);
+    Py_DECREF(node_to_dma32_mem_obj);

     return ret_obj;
 #undef MAX_CPU_ID

diff -r 88adf22e0fe3 -r 82762bc10aa5 tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Fri Nov 13 17:10:09 2009 -0500
+++ b/tools/python/xen/xend/XendConfig.py	Fri Nov 13 17:11:58 2009 -0500
@@ -2111,6 +2111,13 @@
     def is_hap(self):
         return self['platform'].get('hap', 0)

+    def is_pv_and_has_pci(self):
+        for dev_type, dev_info in self.all_devices_sxpr():
+            if dev_type != 'pci':
+                continue
+            return not self.is_hvm()
+        return False
+
     def update_platform_pci(self):
         pci = []
         for dev_type, dev_info in self.all_devices_sxpr():

diff -r 88adf22e0fe3 -r 82762bc10aa5 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Fri Nov 13 17:10:09 2009 -0500
+++ b/tools/python/xen/xend/XendDomainInfo.py	Fri Nov 13 17:11:58 2009 -0500
@@ -2580,7 +2580,8 @@

     def _setCPUAffinity(self):
-        """ Repin domain vcpus if a restricted cpus list is provided
+        """ Repin domain vcpus if a restricted cpus list is provided.
+            Returns the choosen node number.
         """

         def has_cpus():
@@ -2597,6 +2598,7 @@
                     return True
             return False

+        index = 0
         if has_cpumap():
             for v in range(0, self.info['VCPUs_max']):
                 if self.info['vcpus_params'].has_key('cpumap%i' % v):
@@ -2647,6 +2649,54 @@
                 cpumask = info['node_to_cpu'][index]
                 for v in range(0, self.info['VCPUs_max']):
                     xc.vcpu_setaffinity(self.domid, v, cpumask)
+        return index
+
+    def _freeDMAmemory(self, node):
+
+        # If we are PV and have PCI devices the guest will
+        # turn on a SWIOTLB. The SWIOTLB _MUST_ be located in the DMA32
+        # zone (under 4GB). To do so, we need to balloon down Dom0 to where
+        # there is enough (64MB) memory under the 4GB mark. This balloon-ing
+        # might take more memory out than just 64MB thought :-(
+        if not self.info.is_pv_and_has_pci():
+            return
+
+        retries = 2000
+        ask_for_mem = 0;
+        need_mem = 0
+        try:
+            while (retries > 0):
+                physinfo = xc.physinfo()
+                free_mem = physinfo['free_memory']
+                nr_nodes = physinfo['nr_nodes']
+                node_to_dma32_mem = physinfo['node_to_dma32_mem']
+                if (node > nr_nodes):
+                    return;
+                # Extra 2MB above 64GB seems to do the trick.
+                need_mem = 64 * 1024 + 2048 - node_to_dma32_mem[node]
+                # our starting point. We ask just for the difference to
+                # be have an extra 64MB under 4GB.
+                ask_for_mem = max(need_mem, ask_for_mem);
+                if (need_mem > 0):
+                    log.debug('_freeDMAmemory (%d) Need %dKiB DMA memory. '
+                              'Asking for %dKiB', retries, need_mem,
+                              ask_for_mem)
+
+                    balloon.free(ask_for_mem, self)
+                    ask_for_mem = ask_for_mem + 2048;
+                else:
+                    # OK. We got enough DMA memory.
+                    break
+                retries = retries - 1
+        except:
+            # This is best-try after all.
+            need_mem = max(1, need_mem);
+            pass
+
+        if (need_mem > 0):
+            log.warn('We tried our best to balloon down DMA memory to '
+                     'accomodate your PV guest. We need %dKiB extra memory.',
+                     need_mem)

     def _setSchedParams(self):
         if XendNode.instance().xenschedinfo() == 'credit':
@@ -2668,7 +2718,7 @@
         # repin domain vcpus if a restricted cpus list is provided
         # this is done prior to memory allocation to aide in memory
         # distribution for NUMA systems.
-        self._setCPUAffinity()
+        node = self._setCPUAffinity()

         # Set scheduling parameters.
         self._setSchedParams()
@@ -2730,6 +2780,8 @@
         if self.info.target():
             self._setTarget(self.info.target())

+        self._freeDMAmemory(node)
+
         self._createDevices()

         self.image.cleanupTmpImages()

diff -r 88adf22e0fe3 -r 82762bc10aa5 tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Fri Nov 13 17:10:09 2009 -0500
+++ b/tools/python/xen/xend/XendNode.py	Fri Nov 13 17:11:58 2009 -0500
@@ -872,11 +872,11 @@
         except:
             str='none\n'
         return str[:-1];
-    def format_node_to_memory(self, pinfo):
+    def format_node_to_memory(self, pinfo, key):
         str=''
         whitespace=''
         try:
-            node_to_memory=pinfo['node_to_memory']
+            node_to_memory=pinfo[key]
             for i in range(0, pinfo['nr_nodes']):
                 str+='%snode%d:%d\n' % (whitespace,
                                         i,
@@ -896,7 +896,10 @@
         info['total_memory'] = info['total_memory'] / 1024
         info['free_memory'] = info['free_memory'] / 1024
         info['node_to_cpu'] = self.format_node_to_cpu(info)
-        info['node_to_memory'] = self.format_node_to_memory(info)
+        info['node_to_memory'] = self.format_node_to_memory(info,
+                                     'node_to_memory')
+        info['node_to_dma32_mem'] = self.format_node_to_memory(info,
+                                     'node_to_dma32_mem')

         ITEM_ORDER = ['nr_cpus',
                       'nr_nodes',
@@ -908,7 +911,8 @@
                       'total_memory',
                       'free_memory',
                       'node_to_cpu',
-                      'node_to_memory'
+                      'node_to_memory',
+                      'node_to_dma32_mem'
                       ]

         return [[k, info[k]] for k in ITEM_ORDER]
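[Editorial note, not part of the posted patch: to make the numbers concrete, here is a small standalone sketch of the shortfall calculation that _freeDMAmemory() performs with the new 'node_to_dma32_mem' values, which the xc binding reports in KiB (the xc.c hunk divides the availheap figure by 1024). The physinfo numbers below are made up for illustration only.]

    # Sketch of the DMA32 shortfall calculation used by _freeDMAmemory(),
    # with hypothetical physinfo values instead of a live xc.physinfo() call.

    TARGET_KIB = 64 * 1024 + 2048   # 64 MiB for the SWIOTLB plus 2 MiB slack

    def dma32_shortfall(node_to_dma32_mem, node):
        """Return how many KiB must still be ballooned out of Dom0 so that
        the given NUMA node has TARGET_KIB free below the 4 GiB mark."""
        free_dma32_kib = node_to_dma32_mem[node]
        return max(0, TARGET_KIB - free_dma32_kib)

    if __name__ == '__main__':
        # Example: node 0 currently has 12 MiB free below 4 GiB.
        fake_physinfo = {'nr_nodes': 1, 'node_to_dma32_mem': [12 * 1024]}
        need = dma32_shortfall(fake_physinfo['node_to_dma32_mem'], 0)
        print('Need %d KiB more DMA32 memory' % need)   # -> Need 55296 KiB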
Jan Beulich
2009-Nov-16 10:02 UTC
Re: [Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 13.11.09 23:16 >>>
>Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
>
>If the user hasn't used the dom0_mem= boot parameter, the privileged domain
>usurps all of the memory. During launch of PV guests with PCI pass-through
>we ratchet down the memory for the privileged domain to the required memory
>for the PV guest. However, for PV guests with PCI pass-through we do not
>take into account that the PV guest is going to swap its SWIOTLB memory
>for DMA32 memory - in fact, swap 64MB of it. This patch balloons down
>the privileged domain so that there are 64MB of DMA32 memory available.

While I realize that the patch was already committed, I still wanted to
raise my concerns with it:

For one, it is logically (just for a much smaller total amount and for a more
narrow memory range) identical to what would be needed for 32-bit PV
DomUs on a 64-bit hypervisor, so *if* this patch is considered conceptually valid,
then it should be abstracted and used for both purposes.

I think, however, that it is conceptually wrong, because it may mean that
all of the memory possibly removable from Dom0 can get ballooned out
without in fact yielding the memory needed by the to-be-started DomU.

Besides that, hard-coding the value to 64MB doesn't seem very nice
either (while I realize that both 2.6.18 and pv-ops default to 64MB, I
do not think this is really appropriate, especially given that in the 2.6.18
tree Dom0 can be run with as little as 2MB, and I highly doubt that the
demand of a DomU can by default be assumed to exceed that of Dom0),
and in particular it doesn't help with the case where one really has to use
a larger-than-default swiotlb.

Jan
Konrad Rzeszutek Wilk
2009-Nov-16 15:04 UTC
Re: [Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
On Mon, Nov 16, 2009 at 10:02:39AM +0000, Jan Beulich wrote:
> >>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 13.11.09 23:16 >>>
> >Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
> >
> >If the user hasn't used the dom0_mem= boot parameter, the privileged domain
> >usurps all of the memory. During launch of PV guests with PCI pass-through
> >we ratchet down the memory for the privileged domain to the required memory
> >for the PV guest. However, for PV guests with PCI pass-through we do not
> >take into account that the PV guest is going to swap its SWIOTLB memory
> >for DMA32 memory - in fact, swap 64MB of it. This patch balloons down
> >the privileged domain so that there are 64MB of DMA32 memory available.
>
> While I realize that the patch was already committed, I still wanted to
> raise my concerns with it:

Thank you for raising them.

> For one, it is logically (just for a much smaller total amount and for a more
> narrow memory range) identical to what would be needed for 32-bit PV
> DomUs on a 64-bit hypervisor, so *if* this patch is considered conceptually valid,

Meaning if you want to run an 8GB 32-bit PV guest, do the same. But if the
PV guest is, say, using 512MB, there is no need to allocate 64MB?

> then it should be abstracted and used for both purposes.
>
> I think, however, that it is conceptually wrong, because it may mean that
> all of the memory possibly removable from Dom0 can get ballooned out
> without in fact yielding the memory needed by the to-be-started DomU.

There is a limit at which it stops. Perhaps I should add a failsafe
wherein, if we don't get enough of the memory, we give it back to Dom0?

> Besides that, hard-coding the value to 64MB doesn't seem very nice
> either (while I realize that both 2.6.18 and pv-ops default to 64MB, I
> do not think this is really appropriate, especially given that in the 2.6.18
> tree Dom0 can be run with as little as 2MB, and I highly doubt that the
> demand of a DomU can by default be assumed to exceed that of Dom0),
> and in particular it doesn't help with the case where one really has to use
> a larger-than-default swiotlb.

Sure. But the user will get a notice in the log pointing them to the fact that
we could not get enough memory. Maybe I should expand it some more and say
something along these lines: "Your best bet is to use dom0_mem=2GB. We've tried
to deflate the amount of memory the privileged domain is using, but we fear
to go any lower. Your guest might not start."

Since the SWIOTLB size is determined by the 'swiotlb' argument passed to
the guest, what if we scanned for that and, if it has a number, calculated
how much memory that amounts to and used that value? The default would still
be 64MB.
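[Editorial note: scanning the guest kernel command line as suggested above could look roughly like the sketch below. This is only an illustration of the idea, not code from the patch: the helper name is invented, and it assumes the Linux convention that 'swiotlb=<n>' gives the number of I/O TLB slabs of 2 KiB each, so the 64MB default corresponds to 32768 slabs.]

    # Hypothetical sketch: derive the required DMA32 amount (in KiB) from a
    # "swiotlb=<nslabs>" option in the guest's extra kernel arguments.
    import re

    DEFAULT_KIB = 64 * 1024          # fall back to the 64 MiB default

    def swiotlb_kib_from_extra(extra):
        """Return the SWIOTLB size in KiB implied by 'swiotlb=N' in the
        guest kernel command line, or the default if it is absent."""
        m = re.search(r'\bswiotlb=(\d+)', extra or '')
        if not m:
            return DEFAULT_KIB
        nslabs = int(m.group(1))
        return nslabs * 2            # 2 KiB per slab -> KiB

    # Example: extra = "console=hvc0 swiotlb=65536" implies 128 MiB below 4 GiB.
    print(swiotlb_kib_from_extra("console=hvc0 swiotlb=65536"))   # 131072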
Jan Beulich
2009-Nov-16 15:31 UTC
Re: [Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 16.11.09 16:04 >>>
>On Mon, Nov 16, 2009 at 10:02:39AM +0000, Jan Beulich wrote:
>>
>> For one, it is logically (just for a much smaller total amount and for a more
>> narrow memory range) identical to what would be needed for 32-bit PV
>> DomUs on a 64-bit hypervisor, so *if* this patch is considered conceptually valid,
>
>Meaning if you want to run an 8GB 32-bit PV guest, do the same. But if the
>PV guest is, say, using 512MB, there is no need to allocate 64MB?

No, you probably misunderstood (and I probably implied too much in my
response): On a system with more than 168G, just ballooning out
arbitrary memory from Dom0 in order to start a 32-bit PV DomU won't
guarantee that the domain can actually start, as memory beyond the
128G boundary is unusable there for such guests.

Conceptually, ballooning out arbitrary amounts of memory to find a
certain amount below 4G is identical to ballooning out more than the
amount a guest needs in order to find as much as it needs below 128G.

>> then it should be abstracted and used for both purposes.
>>
>> I think, however, that it is conceptually wrong, because it may mean that
>> all of the memory possibly removable from Dom0 can get ballooned out
>> without in fact yielding the memory needed by the to-be-started DomU.
>
>There is a limit at which it stops. Perhaps I should add a failsafe
>wherein, if we don't get enough of the memory, we give it back to Dom0?

That would reduce the risk for Dom0, yes, but it doesn't eliminate it (and
I am of the opinion that, especially as long as Dom0 is not restartable, we
have to avoid putting any sort of extra risk on it).

>> Besides that, hard-coding the value to 64MB doesn't seem very nice
>> either (while I realize that both 2.6.18 and pv-ops default to 64MB, I
>> do not think this is really appropriate, especially given that in the 2.6.18
>> tree Dom0 can be run with as little as 2MB, and I highly doubt that the
>> demand of a DomU can by default be assumed to exceed that of Dom0),
>> and in particular it doesn't help with the case where one really has to use
>> a larger-than-default swiotlb.
>
>Sure. But the user will get a notice in the log pointing them to the fact that
>we could not get enough memory. Maybe I should expand it some more and say
>something along these lines: "Your best bet is to use dom0_mem=2GB. We've tried
>to deflate the amount of memory the privileged domain is using, but we fear
>to go any lower. Your guest might not start."
>
>Since the SWIOTLB size is determined by the 'swiotlb' argument passed to
>the guest, what if we scanned for that and, if it has a number, calculated
>how much memory that amounts to and used that value? The default would still
>be 64MB.

That might be an option, but it is very Linux-centric. I think the amount
should be configurable per guest if something like this is being done at all.

Jan
Konrad Rzeszutek Wilk
2009-Nov-16 16:38 UTC
Re: [Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
> No, you probably misunderstood (and I probably implied too much in my
> response): On a system with more than 168G, just ballooning out
> arbitrary memory from Dom0 in order to start a 32-bit PV DomU won't
> guarantee that the domain can actually start, as memory beyond the
> 128G boundary is unusable there for such guests.

I see what you mean.

> Conceptually, ballooning out arbitrary amounts of memory to find a
> certain amount below 4G is identical to ballooning out more than the
> amount a guest needs in order to find as much as it needs below 128G.

Perhaps an extra mechanism in the balloon driver in the Dom0 kernel
to only give back to Xen pages below the 4GB mark? And if it
fails, return a failure back to Xend?

That would solve the problem of ballooning out more than needed and
would eliminate the "oh, let's balloon out 2 more MB and see if that
did it".

.. snip ..

> >Since the SWIOTLB size is determined by the 'swiotlb' argument passed to
> >the guest, what if we scanned for that and, if it has a number, calculated
> >how much memory that amounts to and used that value? The default would still
> >be 64MB.
>
> That might be an option, but it is very Linux-centric. I think the amount
> should be configurable per guest if something like this is being done at all.

How about both? Have a 'pci_mem' argument to set the default and also
check the 'swiotlb' argument in 'extra'. If there is a 'swiotlb' argument in
'extra', use that value (converted to kilobytes, of course).
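[Editorial note: neither a 'pci_mem' guest-config option nor the 'extra' scan exists in the committed patch, so the following is purely a sketch of the proposal being discussed, with invented names. An explicit swiotlb= value wins, a per-guest 'pci_mem' setting (taken here in MiB) comes next, and the hard-coded 64MB default is the fallback; the 2 KiB-per-slab conversion is the same assumption as in the earlier sketch.]

    # Sketch of the proposed precedence (hypothetical option names):
    # 1. 'swiotlb=N' in the guest's 'extra' kernel arguments, if present;
    # 2. otherwise a per-guest 'pci_mem' setting (here taken in MiB);
    # 3. otherwise the hard-coded 64 MiB default.
    import re

    def dma32_target_kib(extra, pci_mem_mib=None):
        m = re.search(r'\bswiotlb=(\d+)', extra or '')
        if m:
            return int(m.group(1)) * 2        # slabs -> KiB (2 KiB per slab)
        if pci_mem_mib is not None:
            return pci_mem_mib * 1024         # MiB -> KiB
        return 64 * 1024                      # current default

    print(dma32_target_kib("swiotlb=131072"))          # 262144 KiB = 256 MiB
    print(dma32_target_kib("", pci_mem_mib=128))       # 131072 KiB = 128 MiB
    print(dma32_target_kib(""))                        # 65536 KiB = 64 MiB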
Jan Beulich
2009-Nov-16 16:47 UTC
Re: [Xen-devel] [PATCH Xen-unstable] Balloon down memory to achieve enough DMA32 memory for PV guests with PCI pass-through to successfully launch.
>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 16.11.09 17:38 >>>
>Perhaps an extra mechanism in the balloon driver in the Dom0 kernel
>to only give back to Xen pages below the 4GB mark? And if it
>fails, return a failure back to Xend?
>
>That would solve the problem of ballooning out more than needed and
>would eliminate the "oh, let's balloon out 2 more MB and see if that
>did it".

Unfortunately that's something you won't easily get to work, as you can't
tell Linux's page allocator that you want it to try to allocate a specific
(preselected by you, i.e. the balloon driver) page. But that would be
necessary here, as you would have to scan the p2m table to find suitable
pages...

>> >Since the SWIOTLB size is determined by the 'swiotlb' argument passed to
>> >the guest, what if we scanned for that and, if it has a number, calculated
>> >how much memory that amounts to and used that value? The default would still
>> >be 64MB.
>>
>> That might be an option, but it is very Linux-centric. I think the amount
>> should be configurable per guest if something like this is being done at all.
>
>How about both? Have a 'pci_mem' argument to set the default and also
>check the 'swiotlb' argument in 'extra'. If there is a 'swiotlb' argument in
>'extra', use that value (converted to kilobytes, of course).

Oh, yes, I was thinking of one augmenting the other, of course.

Jan