I've been slowly working on the DMA problem I ran into; I thought I was making progress, but I think I'm up against a wall, so more discussion and ideas might be helpful.

The problem is that on x86_32 PAE and x86_64, our physical address size is greater than 32 bits, yet many (most?) I/O devices can only address the first 32 bits of memory. So if/when we try to do DMA to an address that has bits above 32 set (call these high addresses), truncation causes the DMA to happen to the wrong address.

I saw this problem on x86_64 with 6GB of RAM: if I made dom0 too big, the allocator put it in high memory; the Linux kernel booted fine, but the partition scan failed and it couldn't mount root.

My original solution was to add another type to the Xen zoneinfo array to divide memory between high and low, and then to allocate low memory only when a domain needs to do DMA or when high memory is exhausted. This was an easy patch that worked fine. I can provide it if anyone wants it.

On the Linux side of things, my first approach was to try to use Linux zones to divide up memory. Currently under Xen, all memory is placed in the DMA zone. I was hoping I could loop over memory somewhere, check the machine address of each page, and place it in the proper zone. The first problem with this approach is that Linux zones are designed more for dealing with the even smaller ISA address space. That aside, the zone code seems to make large assumptions about memory being (mostly) contiguous, and most frequently deals with "start" and "size" rather than arrays of pages. I started looking at the code, thinking that I might change that, but at some point finally realized that on an abstract level, what I was fundamentally doing was the exact reason that the pfn/mfn mapping exists---teaching Linux about non-contiguous memory looks fairly non-trivial.

The next approach I started on was to have Xen reback memory with low pages when it went to do DMA.
dma_alloc_coherent() makes a call to xen_contig_memory(), which forces a range of memory to be backed by machine-contiguous pages by freeing the buffer to Xen and then asking for it back[1]. I tried adding another hypercall to request that DMA'able pages be returned. This worked great for the network cards, but disk was another story. First off, there were several code paths that do DMA that don't end up calling xen_contig_memory() (which right now is fine because it's only ever called on single pages). I started down the path of finding those, but in the meantime realized that for disk, we could be DMA'ing to any memory. Additionally, Michael Hohnbaum reminded me of page flipping. Between these two, it seems reasonable to think that the pool of free DMA memory could eventually become exhausted.

That is the wall.

Footnote: this will not be a problem on all machines. AMD x86_64 has an IOMMU, which should make this a non-problem (if the kernel chooses to use it). Unfortunately, from what I understand, EM64T is not so blessed.

sRp

1| Incidentally, it seems to me that optimally xen_contig_memory() should just return if order==0.

--
Scott Parish
Signed-off-by: srparish@us.ibm.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On Tuesday 12 July 2005 12:18 pm, Scott Parish wrote:
> I've been slowly working on the DMA problem I ran into; thought I was
> making progress, but I think I'm up against a wall, so more discussion
> and ideas might be helpful.
>
> The problem was that on x86_32 PAE and x86_64, our physical address size
> is greater than 32 bits, yet many (most?) I/O devices can only address
> the first 32 bits of memory. So if/when we try to do DMA to an address
> that has bits above 32 set (call these high addresses), due to
> truncation the DMA ends up happening to the wrong address.
>
> I saw this problem on x86_64 with 6GB of RAM: if I made dom0 too big, the
> allocator put it in high memory; the Linux kernel booted fine, but the
> partition scan failed, and it couldn't mount root.

Why not have the allocator force all driver domains to be in memory < 4GB?

> My original solution was to add another type to the Xen zoneinfo array
> to divide memory between high and low, and only allocate low memory
> when a domain needs to do DMA or when high memory is exhausted. This was
> an easy patch that worked fine. I can provide it if anyone wants it.
>
> On the Linux side of things, my first approach was to try to use Linux
> zones to divide up memory. Currently under Xen, all memory is placed in
> the DMA zone. I was hoping I could somewhere loop over memory, check
> the machine address of each page, and place it in the proper zones. The
> first problem with this approach is that Linux zones are designed more
> for dealing with the even smaller ISA address space. That aside, it
> seems to make large assumptions about memory being (mostly) contiguous
> and most frequently deals with "start" and "size" rather than arrays
> of pages.
> I started looking at code, thinking that I might change
> that, but at some point finally realized that on an abstract level,
> what I was fundamentally doing was the exact reason that the pfn/mfn
> mapping exists---teaching Linux about non-contiguous memory looks fairly
> non-trivial.
>
> The next approach I started on was to have Xen reback memory with
> low pages when it went to do DMA. dma_alloc_coherent() makes a call
> to xen_contig_memory(), which forces a range of memory to be backed
> by machine-contiguous pages by freeing the buffer to Xen, and then
> asking for it back[1]. I tried adding another hypercall to request that
> DMA'able pages be returned. This worked great for the network cards, but
> disk was another story. First off, there were several code paths that
> do DMA that don't end up calling xen_contig_memory (which right now is
> fine because it's only ever called on single pages). I started down the
> path of finding those, but in the meantime realized that for disk, we
> could be DMA'ing to any memory. Additionally, Michael Hohnbaum reminded
> me of page flipping. Between these two, it seems reasonable to think
> that the pool of free DMA memory could eventually become exhausted.

Running out of DMA'able memory happens. Perf sucks, but it shouldn't kill your system. What's the problem?

> That is the wall.
>
> Footnote: this will not be a problem on all machines. AMD x86_64 has
> an IOMMU which should make this a non-problem (if the kernel chooses to
> use it). Unfortunately, from what I understand, EM64T is not so blessed.

AMD64 has IOMMU HW acceleration. EM64T has software IOMMU. Whenever I get IOMMU working on x86-64, this should solve your problem.

> sRp
>
> 1| Incidentally, it seems to me that optimally xen_contig_memory()
> should just return if order==0.

Thanks,
Jon
Nakajima, Jun
2005-Jul-13 04:36 UTC
RE: [Xen-devel] high memory dma update: up against a wall
Scott Parish wrote:
> I've been slowly working on the DMA problem I ran into; thought I was
> making progress, but I think I'm up against a wall, so more discussion
> and ideas might be helpful.

I think porting swiotlb (arch/ia64/lib/swiotlb.c) is one of the other approaches for EM64T, as we are using it in native x86_64 Linux. We need at least 64MB of physically contiguous memory below 4GB for that. For dom0, I think we can find such an area at boot time. We have a plan to work on that, but it will be after OLS...

Basically, io_tlb_start is the starting address of the buffer. You need to ensure that the memory is physically contiguous in machine physical. I think it's easy to find such an area in dom0. alloc_bootmem_low_pages() may not work, so you may need to write a new (simple) function.

swiotlb_init_with_default_size(size_t default_size)
{
        unsigned long i;

        if (!io_tlb_nslabs) {
                io_tlb_nslabs = (default_size >> PAGE_SHIFT);
                io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
        }

        /*
         * Get IO TLB memory from the low pages
         */
        io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
                                               (1 << IO_TLB_SHIFT));

Another thing is to use virt_to_bus(), not virt_to_phys(). See below.

void *
swiotlb_alloc_coherent(struct device *hwdev, size_t size,
                       dma_addr_t *dma_handle, int flags)
{
        unsigned long dev_addr;
        void *ret;
        int order = get_order(size);

        /*
         * XXX fix me: the DMA API should pass us an explicit DMA mask
         * instead, or use ZONE_DMA32 (ia64 overloads ZONE_DMA to be a ~32
         * bit range instead of a 16MB one).
         */
        flags |= GFP_DMA;
        ret = (void *)__get_free_pages(flags, order);
        if (ret && address_needs_mapping(hwdev, virt_to_phys(ret))) {
                /*
                 * The allocated memory isn't reachable by the device.
                 * Fall back on swiotlb_map_single().
                 */
                free_pages((unsigned long) ret, order);
                ret = NULL;
        }
If not, allocate memory chunk from the buffer: if (!ret) { /* * We are either out of memory or the device can''t DMA * to GFP_DMA memory; fall back on * swiotlb_map_single(), which will grab memory from * the lowest available address range. */ dma_addr_t handle; handle = swiotlb_map_single(NULL, NULL, size, DMA_FROM_DEVICE); if (dma_mapping_error(handle)) return NULL; ret = phys_to_virt(handle); } Jun --- Intel Open Source Technology Center _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2005-Jul-13 08:48 UTC
Re: [Xen-devel] high memory dma update: up against a wall
On 13 Jul 2005, at 00:21, Jon Mason wrote:
> AMD64 has IOMMU HW acceleration. EM64T has software IOMMU. Whenever
> I get IOMMU working on x86-64, this should solve your problem.

If dom0 controls the IOMMU (instead of Xen) then that will be a good solution for dom0 drivers. For other domains, with no control over an IOMMU, we'll probably fall back to bounce buffers with an associated drop in performance. Most people run drivers in dom0 anyway.

 -- Keir
> > I saw this problem on x86_64 with 6gigs ram, if i made dom0 too big,
> > the allocator put it in high memory, the linux kernel booted fine,
> > but the partition scan failed, and it couldn't mount root.
>
> Why not have the allocator force all driver domains to be in
> memory < 4GB?

It's irrelevant whether the driver domains are in memory below 4GB -- they are passed pages by other domains which they want to DMA into.

It's clear that privileged domains need to support bounce buffers for hardware that can't DMA above 4GB. We could try and optimise the situation by giving each domain some memory below 4GB so that it can maintain a zone to use in preference for skb's etc. It can't help for most block I/O, since pretty much any of the domain's pages can be a target.

However, I'm not convinced that it's worth implementing such a solution. Keir and I just looked in Linux's driver directory and found that pretty much all the chips used in server hardware over the last few years are >4GB capable: tg3, e1000, mpt_fusion, aacraid, megaraid, aic7xxx etc. The only exception seems to be IDE/SATA controllers. For the latter, having separate memory zones won't help. We need to use the GART or other I/O MMU to translate the DMA in the driver domain.

I think we just go with bounce buffers for the moment, and add I/O MMU support once we've had a chance to discuss it further. I suspect that on most server hardware we won't need it anyway.

[Is there much extant hardware with >4GB of memory that doesn't have disk or network hardware capable of DMA above 4GB? My guess would be no, but can anyone put forward hard data?]

Ian