Jeremy Fitzhardinge
2007-Oct-12 16:58 UTC
[Xen-devel] Interaction between Xen and XFS: stray RW mappings
Hi Dave & other XFS folk,

I'm tracking down a bug which appears to be a bad interaction between XFS
and Xen. It looks like XFS is holding RW mappings on free pages, which Xen
is trying to get an exclusive RO mapping on so it can turn them into
pagetables.

I'm assuming the pages are actually free, and this isn't a use-after-free
problem. From a quick poke around, the most likely piece of XFS code is
the stuff in xfs_map.c which creates a virtually contiguous mapping of pages
with vmap, and seems to delay unmapping them.

When pinning a pagetable, Xen tries to eliminate any RW aliases of the pages
it's using. This is generally trivial because pages returned by
get_free_page don't have any mappings aside from the normal kernel mapping.
High pages, when using CONFIG_HIGHPTE, may have a residual kmap mapping, so
we clear out the kmap cache if we encounter a highpage in the pagetable.

I guess we could create a special-case interface to do the same thing with
XFS mappings, but it would be nicer to have something more generic.

Is my analysis correct? Or should XFS not be holding stray mappings? Or is
there already some kind of generic mechanism I can use to get it to release
its mappings?

Thanks,
J
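To make the pinning step above concrete, here is a conceptual sketch. It is purely illustrative: make_linear_mapping_readonly(), page_has_stray_rw_alias() and xen_pin_pagetable() are hypothetical helpers standing in for the real Xen/Linux interfaces, not actual code.

	/*
	 * Sketch of the invariant Xen enforces: a page may only become a
	 * pagetable once no writable alias of it remains anywhere.
	 */
	static int pin_as_pagetable(struct page *page)
	{
		make_linear_mapping_readonly(page);	/* the normal kernel mapping */
		if (page_has_stray_rw_alias(page))	/* e.g. a lazy XFS vmap */
			return -EBUSY;			/* Xen refuses the pin */
		return xen_pin_pagetable(page);		/* hypothetical hypercall wrapper */
	}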
Jeremy Fitzhardinge
2007-Oct-12 17:08 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Jeremy Fitzhardinge wrote:
> I guess we could create a special-case interface to do the same thing
> with XFS mappings, but it would be nicer to have something more generic.
>
> Is my analysis correct? Or should XFS not be holding stray mappings?
> Or is there already some kind of generic mechanism I can use to get it
> to release its mappings?

This test patch confirms my theory:

diff -r 36a518c1fb4b fs/xfs/linux-2.6/xfs_buf.c
--- a/fs/xfs/linux-2.6/xfs_buf.c	Fri Oct 12 10:03:56 2007 -0700
+++ b/fs/xfs/linux-2.6/xfs_buf.c	Fri Oct 12 10:07:03 2007 -0700
@@ -186,6 +186,11 @@ free_address(
 	void		*addr)
 {
 	a_list_t	*aentry;
+
+#ifdef CONFIG_XEN
+	vunmap(addr);
+	return;
+#endif
 
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {

With this in place, the problem goes away.

J
David Chinner
2007-Oct-14 22:56 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Fri, Oct 12, 2007 at 09:58:43AM -0700, Jeremy Fitzhardinge wrote:
> Hi Dave & other XFS folk,
>
> I'm tracking down a bug which appears to be a bad interaction between XFS
> and Xen. It looks like XFS is holding RW mappings on free pages, which Xen
> is trying to get an exclusive RO mapping on so it can turn them into
> pagetables.
>
> I'm assuming the pages are actually free, and this isn't a use-after-free
> problem. From a quick poke around, the most likely piece of XFS code is
> the stuff in xfs_map.c which creates a virtually contiguous mapping of pages
> with vmap, and seems to delay unmapping them.

You mean xfs_buf.c. And yes, we delay unmapping pages until we have
a batch of them to unmap. vmap and vunmap do not scale, so this
batching helps alleviate some of the worst of the problems.

> When pinning a pagetable, Xen tries to eliminate any RW aliases of the pages
> it's using. This is generally trivial because pages returned by
> get_free_page don't have any mappings aside from the normal kernel mapping.
> High pages, when using CONFIG_HIGHPTE, may have a residual kmap mapping, so
> we clear out the kmap cache if we encounter a highpage in the pagetable.
>
> I guess we could create a special-case interface to do the same thing with
> XFS mappings, but it would be nicer to have something more generic.

*nod*

> Is my analysis correct? Or should XFS not be holding stray mappings? Or is
> there already some kind of generic mechanism I can use to get it to release
> its mappings?

The xfsbufd cleans out any stale mappings - it's woken by the memory
shrinker interface (i.e. calls xfsbufd_wakeup()). Otherwise every
64th buffer being vmap()d will flush out stale mappings.

Realistically, if this delayed release of vmaps is a problem for
Xen, then I think that some generic VM solution is needed to this
problem as vmap() is likely to become more common in future (think
large blocks in filesystems). Nick - any comments?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
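The batching Dave describes lives in fs/xfs/linux-2.6/xfs_buf.c. A simplified sketch of the scheme, modelled on the free_address()/purge_addresses() code quoted later in this thread (details approximate):

	typedef struct a_list {
		void		*vm_addr;
		struct a_list	*next;
	} a_list_t;

	static a_list_t *as_free_head;
	static int as_list_len;
	static DEFINE_SPINLOCK(as_lock);

	static void free_address(void *addr)
	{
		a_list_t *aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);

		if (likely(aentry)) {		/* queue for a later batched vunmap */
			spin_lock(&as_lock);
			aentry->next = as_free_head;
			aentry->vm_addr = addr;
			as_free_head = aentry;
			as_list_len++;
			spin_unlock(&as_lock);
		} else {			/* allocation failed: unmap eagerly */
			vunmap(addr);
		}
	}

	static void purge_addresses(void)	/* run by xfsbufd / shrinker */
	{
		a_list_t *aentry, *old;

		spin_lock(&as_lock);
		aentry = as_free_head;
		as_free_head = NULL;
		as_list_len = 0;
		spin_unlock(&as_lock);

		while ((old = aentry) != NULL) {
			/* the expensive vunmap()s happen off the hot path */
			vunmap(aentry->vm_addr);
			aentry = aentry->next;
			kfree(old);
		}
	}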
Jeremy Fitzhardinge
2007-Oct-14 23:12 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
David Chinner wrote:
> You mean xfs_buf.c.

Yes, sorry.

> And yes, we delay unmapping pages until we have a batch of them
> to unmap. vmap and vunmap do not scale, so this batching helps
> alleviate some of the worst of the problems.

How much performance does it cost? What kind of workloads would it show
up under?

> Realistically, if this delayed release of vmaps is a problem for
> Xen, then I think that some generic VM solution is needed to this
> problem as vmap() is likely to become more common in future (think
> large blocks in filesystems). Nick - any comments?

Well, the only real problem is that the pages are returned to the free
pool and reallocated while still being part of a mapping. If the pages
are still owned by the filesystem/pagecache, then there's no problem.

What's the lifetime of things being vmapped/unmapped in xfs? Are they
necessarily being freed when they're unmapped, or could unmapping of
freed memory be more immediate than other memory? Maybe it just needs
a notifier chain or something.

J
David Chinner
2007-Oct-14 23:33 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Sun, Oct 14, 2007 at 04:12:20PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > You mean xfs_buf.c.
>
> Yes, sorry.
>
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this batching helps
> > alleviate some of the worst of the problems.
>
> How much performance does it cost?

Every vunmap() call causes a global TLB sync, and the region lists
are global with a spin lock protecting them. I think Nick has shown
a 64p Altix with ~60 cpus spinning on the vmap locks under a simple
workload....

> What kind of workloads would it show
> up under?

A directory traversal when using large directory block sizes with
large directories....

> > Realistically, if this delayed release of vmaps is a problem for
> > Xen, then I think that some generic VM solution is needed to this
> > problem as vmap() is likely to become more common in future (think
> > large blocks in filesystems). Nick - any comments?
>
> Well, the only real problem is that the pages are returned to the free
> pool and reallocated while still being part of a mapping. If the pages
> are still owned by the filesystem/pagecache, then there's no problem.

The pages are still attached to the blockdev address space mapping,
but there's nothing stopping them from being reclaimed before they
are unmapped.

> What's the lifetime of things being vmapped/unmapped in xfs? Are they
> necessarily being freed when they're unmapped, or could unmapping of
> freed memory be more immediate than other memory?

It's all "freed memory". At the time we pull the buffer down, there
are no further references to the buffer. The pages are released and
the mapping is never used again until it is torn down. It is torn
down either on the next xfsbufd run (either memory pressure or every
15s) or every 64th new vmap() call to map new buffers.

> Maybe it just needs a notifier chain or something.

We've already got a memory shrinker hook that triggers this reclaim.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
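The "global TLB sync" is the expensive part: after the kernel page tables are edited, stale translations may still sit in every CPU's TLB, so the unmap path ends in a cross-CPU shootdown. A sketch of the pattern, assuming x86; note the exact on_each_cpu() argument list has varied across kernel versions, so treat this as illustrative:

	static void flush_one(void *unused)
	{
		__flush_tlb_all();	/* flush this CPU's TLB, kernel entries included */
	}

	static void global_kernel_tlb_flush(void)
	{
		/*
		 * IPI every online CPU and wait for all of them to finish.
		 * This synchronous rendezvous is what serializes so badly
		 * on large machines like the 64p Altix mentioned above.
		 */
		on_each_cpu(flush_one, NULL, 1);
	}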
Jeremy Fitzhardinge
2007-Oct-15 00:57 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Nick Piggin wrote:
> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> because it generally has to invalidate TLBs on all CPUs.

I see.

> I'm looking at some more general solutions to this (already have some
> batching / lazy unmapping that replaces the XFS specific one), however
> they are still likely going to leave vmap mappings around after freeing
> the page.

Hm. Could there be a call to shoot down any lazy mappings of a page, so
the Xen pagetable code could use it on any pagetable page? Ideally one
that could be used on any page, but only causes expensive operations
where needed.

> We _could_ hold on to the pages as well, but that's pretty inefficient.
> The memory cost of keeping the mappings around tends to be well under
> 1% the cost of the page itself. OTOH we could also avoid lazy flushes
> on architectures where it is not costly. Either way, it probably would
> require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
> Xen. Although... it would be nice if Xen could take advantage of some
> of these optimisations as well.

In general the lazy unmappings won't worry Xen. It's only for the
specific case of allocating memory for pagetables. Xen can do a bit of
extra optimisation for cross-cpu tlb flushes (if the target vcpus are
not currently running, then you don't need to do anything), but they're
still an expensive operation, so the optimisation is definitely useful.

> What's the actual problem for Xen? Anything that can be changed?

Not easily. Xen doesn't use shadow pagetables. Instead, it gives the
guest domains direct access to the real CPU's pagetable, but makes sure
they're always mapped RO so that the hypervisor can control updates to
the pagetables (either by trapping writes or via explicit hypercalls).
This means that when constructing a new pagetable, Xen will verify that
all the mappings of pages making up the new pagetable are RO before
allowing it to be used. If there are stray RW mappings of those pages,
pagetable construction will fail.

Aside from XFS, the only other case I've found where there could be
stray RW mappings is when using high pages which are still in the kmap
cache; I added an explicit call to flush the kmap cache to handle this.
If vmap and kmap can be unified (at least the lazy unmap aspects of
them), then that would be a nice little cleanup.

J
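The kmap-cache flush Jeremy mentions corresponds to kmap_flush_unused(). A sketch of how a pagetable-allocation path might use it; only the kmap_flush_unused() call is a real kernel interface, the surrounding function is illustrative:

	#include <linux/highmem.h>

	static void prepare_highpage_for_pagetable(struct page *pg)
	{
		/*
		 * A highmem page that was kmap()ed and released can linger
		 * RW in the kmap cache; evict all unused kmap slots (and
		 * flush the TLBs) before the page is handed over as a
		 * pagetable.
		 */
		if (PageHighMem(pg))
			kmap_flush_unused();
	}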
Jeremy Fitzhardinge
2007-Oct-15 03:42 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Nick Piggin wrote:
> Yeah, it would be possible. The easiest way would just be to shoot down
> all lazy vmaps (because you're doing the global IPIs anyway, which are
> the expensive thing, at which point you may as well purge the rest of
> your lazy mappings).

Sure.

> If it is sufficiently rare, then it could be the simplest thing to do.

Yes. If there's some way to tell whether a particular page is in a lazy
mapping then that would help, since we could easily tell whether we need
to do the whole shootdown thing. I would expect the population of lazily
mapped pages in the whole freepage pool to be pretty small, but if the
allocator tends to return the most recently freed pages you might hit
them fairly regularly (shoving them at the other end of the freelist
might be useful).

> OK, I see. Because even though it is technically safe where we are
> using it (because nothing writes through the mappings after the page
> is freed), a corrupted guest could use the same window to do bad
> things with the pagetables?

That's right. The hypervisor doesn't trust the guests, so it prevents
them from getting into a state where they can do bad things.

> For Xen -- shouldn't be a big deal. We can have a single Linux mm API
> to call, and we can do the right thing WRT vmap/kmap. I should try to
> merge my current lazy vmap patches which replace the XFS stuff, so we
> can implement such an API and fix your XFS issue?

Sounds good.

> That's not going to
> happen for at least a cycle or two though, so in the meantime maybe
> an ifdef for that XFS vmap batching code would help?

For now I've proposed a patch to simply eagerly vunmap everything when
CONFIG_XEN is set. It certainly works, but I don't have a good feel for
how much of a performance hit that imposes on XFS.

A slightly more subtle change would be to test whether we're actually
running under Xen before taking the eager path, so at least the
performance burden only affects actual Xen users (and I presume xfs+xen
is a fairly rare combination, or this problem would have turned up
earlier; perhaps the old xenified kernels have some other workaround
for it).

J
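The "test at runtime" variant could look like the sketch below. running_on_xen() is a hypothetical predicate standing in for whatever Xen-detection the kernel of the day provides, and queue_for_batched_vunmap() stands in for the existing lazy path:

	static void free_address(void *addr)
	{
		if (running_on_xen()) {		/* hypothetical predicate */
			vunmap(addr);		/* eager: no stray RW aliases left */
			return;
		}
		queue_for_batched_vunmap(addr);	/* normal lazy batching (sketch) */
	}

Native users keep the batching and its scalability benefit; only guests actually running under Xen pay the eager-unmap cost.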
David Chinner
2007-Oct-15 04:11 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Sun, Oct 14, 2007 at 08:42:34PM -0700, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > That's not going to
> > happen for at least a cycle or two though, so in the meantime maybe
> > an ifdef for that XFS vmap batching code would help?
>
> For now I've proposed a patch to simply eagerly vunmap everything when
> CONFIG_XEN is set. It certainly works, but I don't have a good feel for
> how much of a performance hit that imposes on XFS.

With defaults - little effect, as vmap should never be used. It's
only when you start using larger block sizes for metadata that this
becomes an issue. The CONFIG_XEN workaround should be fine until we
get a proper vmap cache....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Nick Piggin
2007-Oct-15 04:15 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Monday 15 October 2007 09:12, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > You mean xfs_buf.c.
>
> Yes, sorry.
>
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this batching helps
> > alleviate some of the worst of the problems.
>
> How much performance does it cost? What kind of workloads would it show
> up under?
>
> > Realistically, if this delayed release of vmaps is a problem for
> > Xen, then I think that some generic VM solution is needed to this
> > problem as vmap() is likely to become more common in future (think
> > large blocks in filesystems). Nick - any comments?
>
> Well, the only real problem is that the pages are returned to the free
> pool and reallocated while still being part of a mapping. If the pages
> are still owned by the filesystem/pagecache, then there's no problem.
>
> What's the lifetime of things being vmapped/unmapped in xfs? Are they
> necessarily being freed when they're unmapped, or could unmapping of
> freed memory be more immediate than other memory?

Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
because it generally has to invalidate TLBs on all CPUs.

I'm looking at some more general solutions to this (already have some
batching / lazy unmapping that replaces the XFS specific one), however
they are still likely going to leave vmap mappings around after freeing
the page.

We _could_ hold on to the pages as well, but that's pretty inefficient.
The memory cost of keeping the mappings around tends to be well under
1% the cost of the page itself. OTOH we could also avoid lazy flushes
on architectures where it is not costly. Either way, it probably would
require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
Xen. Although... it would be nice if Xen could take advantage of some
of these optimisations as well.

What's the actual problem for Xen? Anything that can be changed?
Jeremy Fitzhardinge
2007-Oct-15 04:18 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
David Chinner wrote:
> With defaults - little effect, as vmap should never be used. It's
> only when you start using larger block sizes for metadata that this
> becomes an issue. The CONFIG_XEN workaround should be fine until we
> get a proper vmap cache....

Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
options, so there must be at least *some* vmapping going on there.

J
David Chinner
2007-Oct-15 04:25 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Sun, Oct 14, 2007 at 09:18:17PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > With defaults - little effect, as vmap should never be used. It's
> > only when you start using larger block sizes for metadata that this
> > becomes an issue. The CONFIG_XEN workaround should be fine until we
> > get a proper vmap cache....
>
> Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
> options, so there must be at least *some* vmapping going on there.

Sorry - I should have been more precise - vmap should never be used in
performance critical paths on default configs. Log recovery will
trigger vmap/vunmap usage, so this is probably what you are seeing.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Nick Piggin
2007-Oct-15 07:26 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Monday 15 October 2007 10:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > because it generally has to invalidate TLBs on all CPUs.
>
> I see.
>
> > I'm looking at some more general solutions to this (already have some
> > batching / lazy unmapping that replaces the XFS specific one), however
> > they are still likely going to leave vmap mappings around after freeing
> > the page.
>
> Hm. Could there be a call to shoot down any lazy mappings of a page, so
> the Xen pagetable code could use it on any pagetable page? Ideally one
> that could be used on any page, but only causes expensive operations
> where needed.

Yeah, it would be possible. The easiest way would just be to shoot down
all lazy vmaps (because you're doing the global IPIs anyway, which are
the expensive thing, at which point you may as well purge the rest of
your lazy mappings).

If it is sufficiently rare, then it could be the simplest thing to do.

> > We _could_ hold on to the pages as well, but that's pretty inefficient.
> > The memory cost of keeping the mappings around tends to be well under
> > 1% the cost of the page itself. OTOH we could also avoid lazy flushes
> > on architectures where it is not costly. Either way, it probably would
> > require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
> > Xen. Although... it would be nice if Xen could take advantage of some
> > of these optimisations as well.
>
> In general the lazy unmappings won't worry Xen. It's only for the
> specific case of allocating memory for pagetables. Xen can do a bit of
> extra optimisation for cross-cpu tlb flushes (if the target vcpus are
> not currently running, then you don't need to do anything), but they're
> still an expensive operation, so the optimisation is definitely useful.

OK.

> > What's the actual problem for Xen? Anything that can be changed?
>
> Not easily. Xen doesn't use shadow pagetables. Instead, it gives the
> guest domains direct access to the real CPU's pagetable, but makes sure
> they're always mapped RO so that the hypervisor can control updates to
> the pagetables (either by trapping writes or via explicit hypercalls).
> This means that when constructing a new pagetable, Xen will verify that
> all the mappings of pages making up the new pagetable are RO before
> allowing it to be used. If there are stray RW mappings of those pages,
> pagetable construction will fail.

OK, I see. Because even though it is technically safe where we are
using it (because nothing writes through the mappings after the page
is freed), a corrupted guest could use the same window to do bad
things with the pagetables?

> Aside from XFS, the only other case I've found where there could be
> stray RW mappings is when using high pages which are still in the kmap
> cache; I added an explicit call to flush the kmap cache to handle this.
> If vmap and kmap can be unified (at least the lazy unmap aspects of
> them), then that would be a nice little cleanup.

vmap is slightly harder than kmap in some respects. However it would be
really nice to get vmap fast and general enough to completely replace
all the kmap crud -- that's one goal, but the first thing I'm doing is
to concentrate on just vmap to work out how to make it as fast as
possible.

For Xen -- shouldn't be a big deal. We can have a single Linux mm API
to call, and we can do the right thing WRT vmap/kmap. I should try to
merge my current lazy vmap patches which replace the XFS stuff, so we
can implement such an API and fix your XFS issue? That's not going to
happen for at least a cycle or two though, so in the meantime maybe
an ifdef for that XFS vmap batching code would help?
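For reference, the single mm API Nick sketches here is essentially what later landed in mainline (2.6.28) as vm_unmap_aliases(), which purges all lazily kept kernel vmap aliases in one global shootdown. Shown as it ended up, not as anything that existed at the time of this thread:

	#include <linux/vmalloc.h>

	static void before_pinning_or_attribute_change(void)
	{
		/*
		 * Blow away every lazily kept vmap alias in one pass, so no
		 * freed-but-still-mapped page can alias the page about to be
		 * pinned as a pagetable (or remapped uncached).
		 */
		vm_unmap_aliases();
	}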
Christoph Hellwig
2007-Oct-15 08:31 UTC
[Xen-devel] Re: [xfs-masters] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, Oct 15, 2007 at 02:25:46PM +1000, David Chinner wrote:
> > Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
> > options, so there must be at least *some* vmapping going on there.
>
> Sorry - I should have been more precise - vmap should never be used in
> performance critical paths on default configs. Log recovery will
> trigger vmap/vunmap usage, so this is probably what you are seeing.

The iclogs are also vmapped, but they aren't unmapped until unmount,
so this optimization doesn't matter for them either.
Andi Kleen
2007-Oct-15 09:36 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
David Chinner <dgc@sgi.com> writes:
>
> And yes, we delay unmapping pages until we have a batch of them
> to unmap. vmap and vunmap do not scale, so this batching helps
> alleviate some of the worst of the problems.

You're keeping vmaps around for already freed pages?

That will be a big problem for proper PAT support, which needs
to track all mappings to memory. It's not just a problem for Xen.

In fact I suspect it is already broken with DRM or AGP for example, which
can use UC and WC mappings -- if you keep the mapping around and
DRM or AGP makes the page uncacheable in another mapping, you're
creating an illegal cache attribute alias. These are known to occasionally
create cache corruptions on several x86s, giving ___VERY___ hard to debug
bugs once in a blue moon.

Probably it'll require some generic VM batching mechanism where
Xen or PAT code can hook into the list or force unmap the mappings
as needed.

Definitely needs to be fixed if true. You're lucky that Xen caught it
in time.

-Andi
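The aliasing hazard Andi describes, restated against the i386 API of the time (change_page_attr()/global_flush_tlb()); a sketch with error handling omitted:

	#include <asm/cacheflush.h>

	static void map_page_uncached(struct page *page)
	{
		/* An AGP/DRM-style driver marks the page uncacheable ... */
		change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
		global_flush_tlb();

		/*
		 * ... but if some other *cached* RW mapping of the same page
		 * still exists (e.g. a lazily kept vmap), the CPU may pull
		 * the line into the cache through that alias, and a later
		 * writeback silently corrupts the uncached user's data:
		 * an illegal cache-attribute alias.
		 */
	}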
Andi Kleen
2007-Oct-15 11:07 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> Is this true even if you don't write through those old mappings?

I think it happened for reads too. It is a little counter-intuitive
because in theory the CPU doesn't need to write back non-dirty lines,
but in the one case which took so long to debug exactly this happened
somehow.

As it is undefined for reads and writes in the architecture, better
be safe than sorry.

And x86 CPUs are out of order and do speculative execution, and that
can lead to arbitrary memory accesses even if the code never touches
a particular address.

Newer Intel CPUs have something called self-snoop which was supposed
to handle this; but in some situations it doesn't seem to catch it
either.

> Is DRM or AGP then not also broken with lazy highmem flushing, or
> how do they solve that?

AGP doesn't allocate highmem pages. Not sure about the DRM code.

-Andi
Nick Piggin
2007-Oct-15 11:28 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Monday 15 October 2007 21:07, Andi Kleen wrote:
> On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > Is this true even if you don't write through those old mappings?
>
> I think it happened for reads too. It is a little counter-intuitive
> because in theory the CPU doesn't need to write back non-dirty lines,
> but in the one case which took so long to debug exactly this happened
> somehow.
>
> As it is undefined for reads and writes in the architecture, better
> be safe than sorry.

Yes, typo. I meant reads or writes.

> And x86 CPUs are out of order and do speculative execution, and that
> can lead to arbitrary memory accesses even if the code never touches
> a particular address.
>
> Newer Intel CPUs have something called self-snoop which was supposed
> to handle this; but in some situations it doesn't seem to catch it
> either.

Fair enough, so we have to have this lazy tlb flush hook for
Xen/PAT/etc. I don't think it should be much problem to implement.

> > Is DRM or AGP then not also broken with lazy highmem flushing, or
> > how do they solve that?
>
> AGP doesn't allocate highmem pages. Not sure about the DRM code.

Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).
Andi Kleen
2007-Oct-15 12:54 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
> Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).

I meant I'm not sure if it uses that memory uncached. I admit to not
quite understanding that code. There used to be at least one place
where it set UC for a user mapping though.

-Andi
Nick Piggin
2007-Oct-15 14:56 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Monday 15 October 2007 19:36, Andi Kleen wrote:
> David Chinner <dgc@sgi.com> writes:
> > And yes, we delay unmapping pages until we have a batch of them
> > to unmap. vmap and vunmap do not scale, so this batching helps
> > alleviate some of the worst of the problems.
>
> You're keeping vmaps around for already freed pages?
>
> That will be a big problem for proper PAT support, which needs
> to track all mappings to memory. It's not just a problem for Xen.
>
> In fact I suspect it is already broken with DRM or AGP for example, which
> can use UC and WC mappings -- if you keep the mapping around and
> DRM or AGP makes the page uncacheable in another mapping, you're
> creating an illegal cache attribute alias. These are known to occasionally
> create cache corruptions on several x86s, giving ___VERY___ hard to debug
> bugs once in a blue moon.

Is this true even if you don't write through those old mappings?

Is DRM or AGP then not also broken with lazy highmem flushing, or
how do they solve that?

> Probably it'll require some generic VM batching mechanism where
> Xen or PAT code can hook into the list or force unmap the mappings
> as needed.

Definitely.

> Definitely needs to be fixed if true. You're lucky that Xen caught it
> in time.

I've been thinking that a simple debug mode could be a good idea. Have
a new field in the struct page to count the number of possible
unflushed mappings to the page so that things like PAT could go BUG_ON
appropriate conditions. Would that be helpful?
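A sketch of the debug mode Nick floats here. Everything in it is hypothetical: struct page has no such field, and CONFIG_DEBUG_VMAP_ALIAS is an invented option name.

	#ifdef CONFIG_DEBUG_VMAP_ALIAS		/* hypothetical config option */
	/* hypothetical field in struct page: atomic_t _lazy_maps; */

	static inline void lazy_map_inc(struct page *page)
	{
		atomic_inc(&page->_lazy_maps);	/* lazy vunmap: alias kept alive */
	}

	static inline void lazy_map_dec(struct page *page)
	{
		atomic_dec(&page->_lazy_maps);	/* alias really torn down + flushed */
	}

	static inline void assert_no_lazy_maps(struct page *page)
	{
		/*
		 * PAT- or Xen-style code would call this before changing the
		 * caching attributes of, or pinning, the page.
		 */
		BUG_ON(atomic_read(&page->_lazy_maps) != 0);
	}
	#endif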
Dave Airlie
2007-Oct-21 12:17 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On 10/15/07, Andi Kleen <andi@firstfloor.org> wrote:
> > Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).
>
> I meant I'm not sure if it uses that memory uncached. I admit to not
> quite understanding that code. There used to be at least one place
> where it set UC for a user mapping though.

Currently the only DRM memory handed to userspace is vmalloc_32 in
drm_scatter.c. I notice the VIA driver does its own vmalloc for
dmablit, so it may have an issue with this if highmem is involved.

This will change with the upcoming memory manager, so I'll need to
investigate it a bit perhaps...

Dave.
Benjamin Herrenschmidt
2007-Oct-21 22:16 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, 2007-10-15 at 13:07 +0200, Andi Kleen wrote:
> On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > Is this true even if you don't write through those old mappings?
>
> I think it happened for reads too. It is a little counter-intuitive
> because in theory the CPU doesn't need to write back non-dirty lines,
> but in the one case which took so long to debug exactly this happened
> somehow.

The problem exists also on ppc, and afaik is due to the line being in
the cache at all (either dirty (write) or not (read)), thus causing the
snoop logic to hit; that is what's causing the problem vs. non-cached
accesses.

Also, on some processors, the simple fact of having the page mapped can
cause the CPU to prefetch from it even if it's not actually accessed
(speculative prefetch can cross page boundaries if things are mapped).

Ben.
dean gaudet
2007-Oct-22 03:18 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, 15 Oct 2007, Nick Piggin wrote:
> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> because it generally has to invalidate TLBs on all CPUs.

why is that? ignoring 32-bit archs we have heaps of address space
available... couldn't the kernel just burn address space and delay
global TLB invalidate by some relatively long time (say 1 second)?

-dean
Jeremy Fitzhardinge
2007-Oct-22 03:34 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
dean gaudet wrote:
> On Mon, 15 Oct 2007, Nick Piggin wrote:
>
>> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
>> because it generally has to invalidate TLBs on all CPUs.
>
> why is that? ignoring 32-bit archs we have heaps of address space
> available... couldn't the kernel just burn address space and delay
> global TLB invalidate by some relatively long time (say 1 second)?

Yes, that's precisely the problem. xfs does delay the unmap, leaving
stray mappings, which upsets Xen.

J
dean gaudet
2007-Oct-22 04:28 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> dean gaudet wrote:
> > On Mon, 15 Oct 2007, Nick Piggin wrote:
> >
> >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> >> because it generally has to invalidate TLBs on all CPUs.
> >
> > why is that? ignoring 32-bit archs we have heaps of address space
> > available... couldn't the kernel just burn address space and delay
> > global TLB invalidate by some relatively long time (say 1 second)?
>
> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> stray mappings, which upsets Xen.

sounds like a bug in xen to me :)

-dean
Nick Piggin
2007-Oct-22 04:39 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Monday 22 October 2007 14:28, dean gaudet wrote:
> On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:
> > dean gaudet wrote:
> > > On Mon, 15 Oct 2007, Nick Piggin wrote:
> > >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> > >> because it generally has to invalidate TLBs on all CPUs.
> > >
> > > why is that? ignoring 32-bit archs we have heaps of address space
> > > available... couldn't the kernel just burn address space and delay
> > > global TLB invalidate by some relatively long time (say 1 second)?
> >
> > Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > stray mappings, which upsets Xen.
>
> sounds like a bug in xen to me :)

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.
Andi Kleen
2007-Oct-22 09:49 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, Oct 22, 2007 at 08:16:01AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2007-10-15 at 13:07 +0200, Andi Kleen wrote:
> > On Tue, Oct 16, 2007 at 12:56:46AM +1000, Nick Piggin wrote:
> > > Is this true even if you don't write through those old mappings?
> >
> > I think it happened for reads too. It is a little counter-intuitive
> > because in theory the CPU doesn't need to write back non-dirty lines,
> > but in the one case which took so long to debug exactly this happened
> > somehow.
>
> The problem exists also on ppc, and afaik is due to the line being in
> the cache at all (either dirty (write) or not (read)), thus causing the
> snoop logic to hit; that is what's causing the problem vs. non-cached
> accesses.

That makes sense. Snoop can effectively turn a read into a write.

> Also, on some processors, the simple fact of having the page mapped can
> cause the CPU to prefetch from it even if it's not actually accessed
> (speculative prefetch can cross page boundaries if things are mapped).

Exactly that happens on x86. Normally prefetches stop on a TLB miss,
but the CPU can do speculative TLB fetches too. Also, even without any
prefetching, the CPU does speculative execution and that can lead to
random addresses being followed.

-Andi
Andi Kleen
2007-Oct-22 13:47 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> stray mappings, which upsets Xen.

Again, it doesn't just upset Xen; keeping mappings to freed pages is
wrong generally and violates the x86 (and likely others like PPC)
architecture because it can cause illegal caching attribute aliases.

The patch that went into the tree was really not correct -- this
bogus optimization should have been unconditionally removed, or if
you really wanted an ifdef, made dependent on !CONFIG_XEN &&
!CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
uses uncached mappings in memory).

You just worked around the obvious failure and leave the non-obvious
rare corruptions in, which isn't a good strategy.

-Andi
Jeremy Fitzhardinge
2007-Oct-22 18:32 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
dean gaudet wrote:
> sounds like a bug in xen to me :)

I explained at the head of this thread how and why Xen works in this
manner. It's certainly a change from native execution; whether you
consider it to be a bug is a different matter. But it turns out that
leaving stray mappings around on pages which get later used in special
ways is a bad idea, and can manifest in all kinds of random unpleasant
ways. Getting a clear error out of Xen is probably the nicest way to
discover and debug this problem.

J
Jeremy Fitzhardinge
2007-Oct-22 18:37 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Nick Piggin wrote:
> You could call it a bug I think. I don't know much about Xen though,
> whether or not it expects to be able to run an arbitrary OS kernel.

Xen's paravirtualized mode always requires a guest OS to be modified;
certainly some operating systems would be very hard to make work with
Xen. But you can always fall back to using shadow pagetables or full
hvm (VT/SVM) mode.

> Presumably, the hypervisor _could_ write protect and trap writes to
> *all* page table page mappings.

Xen manages this stuff with refcounts; it doesn't maintain an rmap for
these pages, so it would have to exhaustively search to do this. But
aside from that, Xen never actively modifies pagetables, so this would
be a new behaviour in the interface.

J
Jeremy Fitzhardinge
2007-Oct-22 18:40 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Andi Kleen wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>> Yes, that's precisely the problem. xfs does delay the unmap, leaving
>> stray mappings, which upsets Xen.
>
> Again, it doesn't just upset Xen; keeping mappings to freed pages is
> wrong generally and violates the x86 (and likely others like PPC)
> architecture because it can cause illegal caching attribute aliases.
>
> The patch that went into the tree was really not correct -- this
> bogus optimization should have been unconditionally removed, or if
> you really wanted an ifdef, made dependent on !CONFIG_XEN &&
> !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> uses uncached mappings in memory).
>
> You just worked around the obvious failure and leave the non-obvious
> rare corruptions in, which isn't a good strategy.

Well, at least it becomes a known issue and/or placeholder for when Nick
does his grand unified vmap manager. I guess a clean workaround would
be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
I'll cook up a patch.

J
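The compile-time switch Jeremy proposes would amount to something like the sketch below. CONFIG_XFS_LAZY_UNMAP is the hypothetical symbol from his mail (it never went into mainline), and queue_for_batched_vunmap() stands in for the existing batching path. Per Andi's follow-up, such an option could only safely be enabled where kernel pages are never remapped uncached:

	static void free_address(void *addr)
	{
	#ifndef CONFIG_XFS_LAZY_UNMAP
		/* safe default: eager unmap, no stray mappings survive */
		vunmap(addr);
	#else
		/* opt-in lazy batching for configs known not to remap
		 * kernel pages uncached */
		queue_for_batched_vunmap(addr);		/* sketch */
	#endif
	}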
Andi Kleen
2007-Oct-22 19:07 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> >
> >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> >> stray mappings, which upsets Xen.
> >
> > Again, it doesn't just upset Xen; keeping mappings to freed pages is
> > wrong generally and violates the x86 (and likely others like PPC)
> > architecture because it can cause illegal caching attribute aliases.
> >
> > The patch that went into the tree was really not correct -- this
> > bogus optimization should have been unconditionally removed, or if
> > you really wanted an ifdef, made dependent on !CONFIG_XEN &&
> > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > uses uncached mappings in memory).
> >
> > You just worked around the obvious failure and leave the non-obvious
> > rare corruptions in, which isn't a good strategy.
>
> Well, at least it becomes a known issue and/or placeholder for when Nick

It's hidden now so it doesn't cause any obvious failures any more. Just
subtle ones, which is much worse.

But why not just disable it? It's not critical functionality, just an
optimization that unfortunately turned out to be incorrect.

It could be made correct short term by not freeing the pages until
vunmap, for example.

> does his grand unified vmap manager. I guess a clean workaround would
> be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
> I'll cook up a patch.

This option could only be safely set on architectures that don't
care about caching attribute aliases or never remap kernel pages
uncached. That's not x86, powerpc, ia64 at a minimum.

I think the only good short term fix is to turn your ifdef CONFIG_XEN
into a #if 0

-Andi
Jeremy Fitzhardinge
2007-Oct-22 19:11 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
Andi Kleen wrote:
> It's hidden now so it doesn't cause any obvious failures any more. Just
> subtle ones, which is much worse.

I think anything detected by Xen is still classed as "obscure" ;)

> But why not just disable it? It's not critical functionality, just an
> optimization that unfortunately turned out to be incorrect.
>
> It could be made correct short term by not freeing the pages until
> vunmap, for example.

I think it only ends up holding 64 pages (or is it 64 mappings?), so
it's not a horrible use of memory. Particularly since it responds to
memory pressure callbacks.

>> does his grand unified vmap manager. I guess a clean workaround would
>> be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
>> I'll cook up a patch.
>
> This option could only be safely set on architectures that don't
> care about caching attribute aliases or never remap kernel pages
> uncached. That's not x86, powerpc, ia64 at a minimum.
>
> I think the only good short term fix is to turn your ifdef CONFIG_XEN
> into a #if 0

#if 1, but yes, I see your point.

J
David Chinner
2007-Oct-22 22:32 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > Andi Kleen wrote:
> > > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> > >
> > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > >> stray mappings, which upsets Xen.
> > >
> > > Again, it doesn't just upset Xen; keeping mappings to freed pages is
> > > wrong generally and violates the x86 (and likely others like PPC)
> > > architecture because it can cause illegal caching attribute aliases.
> > >
> > > The patch that went into the tree was really not correct -- this
> > > bogus optimization should have been unconditionally removed, or if
> > > you really wanted an ifdef, made dependent on !CONFIG_XEN &&
> > > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > > uses uncached mappings in memory).
> > >
> > > You just worked around the obvious failure and leave the non-obvious
> > > rare corruptions in, which isn't a good strategy.
> >
> > Well, at least it becomes a known issue and/or placeholder for when Nick
>
> It's hidden now so it doesn't cause any obvious failures any more. Just
> subtle ones, which is much worse.

There's no evidence of any problems ever being caused by this code.
It's been there for years.

> But why not just disable it? It's not critical functionality, just an
> optimization that unfortunately turned out to be incorrect.

It is critical functionality for larger machines because vmap()/vunmap()
really does suck in terms of scalability.

> It could be made correct short term by not freeing the pages until
> vunmap, for example.

Could vmap()/vunmap() take references to the pages that are
mapped? That way delaying the unmap would also delay the freeing
of the pages and hence we'd have no problems with the pages
being reused before the mapping is torn down. That'd work for
Xen even with XFS's lazy unmapping scheme, and would allow
Nick's more advanced methods to work as well....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Andi Kleen
2007-Oct-22 23:35 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > > Andi Kleen wrote:
> > > > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> > > >
> > > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > > >> stray mappings, which upsets Xen.
> > > >
> > > > Again, it doesn't just upset Xen; keeping mappings to freed pages is
> > > > wrong generally and violates the x86 (and likely others like PPC)
> > > > architecture because it can cause illegal caching attribute aliases.
> > > >
> > > > The patch that went into the tree was really not correct -- this
> > > > bogus optimization should have been unconditionally removed, or if
> > > > you really wanted an ifdef, made dependent on !CONFIG_XEN &&
> > > > !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
> > > > uses uncached mappings in memory).
> > > >
> > > > You just worked around the obvious failure and leave the non-obvious
> > > > rare corruptions in, which isn't a good strategy.
> > >
> > > Well, at least it becomes a known issue and/or placeholder for when Nick
> >
> > It's hidden now so it doesn't cause any obvious failures any more. Just
> > subtle ones, which is much worse.
>
> There's no evidence of any problems ever being caused by this code.

We had a fairly nasty bug a couple of years ago with such a mapping
that took months to track down. Luckily we had help from a hardware
debug lab; without that it would have been very near impossible. But
just stabilizing the problem was a nightmare.

You will only see problems if the memory you're still mapping is
allocated by someone who turns it into an uncached mapping. And even
then it doesn't always happen, because the CPU does not always do the
necessary prefetches.

Also the corruption will not be on the XFS side, but on the uncached
mapping, so typically some data sent to some device gets a few
corrupted cache-line-sized blocks. Depending on the device this might
also not be fatal -- e.g. if it is texture data, some occasional
corrupted pixels are not that visible. But there can still be cases
where it causes real failures when it hits something critical (in the
failures where it was tracked down, it was usually hitting some
command ring and the hardware erroring out).

Also, every time I broke change_page_attr() flushing (which handles
exactly this problem and is tricky, so it got broken a few times) I
eventually got complaints/reports from some users who ran into such
data inconsistencies. Granted, it was mostly from closed 3d driver
users, who seem to be most aggressive in using uncached mappings. But
with more systems running more aggressive 3d drivers it'll likely get
worse.

> It's been there for years.

That doesn't mean it is correct.

Basically you're saying: I don't care if I corrupt device driver data.
That's not a very nice thing to say.

> > But why not just disable it? It's not critical functionality, just an
> > optimization that unfortunately turned out to be incorrect.
>
> It is critical functionality for larger machines because vmap()/vunmap()
> really does suck in terms of scalability.

Critical perhaps, but also broken.

And if it's that big a problem, would it really be that difficult to
change only the time-critical paths using it not to? I assume it's
only the directory code where it is a problem? That would likely be
even faster in general.

> > It could be made correct short term by not freeing the pages until
> > vunmap, for example.
>
> Could vmap()/vunmap() take references to the pages that are
> mapped? That way delaying the unmap would also delay the freeing
> of the pages and hence we'd have no problems with the pages
> being reused before the mapping is torn down. That'd work for
> Xen even with XFS's lazy unmapping scheme, and would allow
> Nick's more advanced methods to work as well....

You could always just keep around an array of the pages and then drop
the reference count after unmap. Or walk the vmalloc mapping and
generate such an array before freeing, then unmap and then drop the
reference counts.

Or the other alternative would be to change the time-critical code
that uses it not to.

-Andi
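Andi's "keep an array of the pages and drop the reference count after unmap", as a minimal wrapper sketch around the existing vmap()/vunmap(); this is essentially what Dave's patch later in the thread folds into vmap_pages()/vunmap_pages() themselves:

	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	void *vmap_with_refs(struct page **pages, unsigned int count,
			     unsigned long flags, pgprot_t prot)
	{
		unsigned int i;
		void *addr = vmap(pages, count, flags, prot);

		if (addr)
			for (i = 0; i < count; i++)
				get_page(pages[i]);	/* pin pages while mapped */
		return addr;
	}

	/* caller keeps the page array; on the (possibly lazy) unmap: */
	void vunmap_with_refs(void *addr, struct page **pages, unsigned int count)
	{
		unsigned int i;

		vunmap(addr);
		for (i = 0; i < count; i++)
			put_page(pages[i]);	/* now the pages may really be freed */
	}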
Zachary Amsden
2007-Oct-23 00:16 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, 2007-10-23 at 01:35 +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > > > Andi Kleen wrote:
> > > > > Jeremy Fitzhardinge <jeremy@goop.org> writes:
> > > > >
> > > > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving
> > > > >> stray mappings, which upsets Xen.
> > > > >
> > > > > Again, it doesn't just upset Xen; keeping mappings to freed pages is
> > > > > wrong generally and violates the x86 (and likely others like PPC)
> > > > > architecture because it can cause illegal caching attribute aliases.

It is a serious offense to leave stray mappings for memory which can
get re-mapped to I/O devices... especially with PCI and other device
hotplug. I have to back up Andi on this one unconditionally.

On architectures where you have multibyte, non-wordsize updates to
hardware page tables, you even have races here when setting, updating
and clearing PTEs that must be done in a sequence where no aliasing of
mappings to partially written PTEs can result in I/O memory getting
mapped in a cacheable state. The window here is only one instruction,
and yes, it is possible for a window this small to create a problem.
A large (or permanently lazy) window is extremely frightening.

These things do cause bugs. The bugs take a very long time to show up
and are very difficult to track down, since they can basically cause
random behavior, such as hanging the machine or losing orbit and
crashing into the moon.

Zach
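The "multibyte, non-wordsize PTE update" window Zach refers to: on 32-bit PAE a PTE is 64 bits wide but is stored with two 32-bit writes, so the kernel orders them such that no transiently half-written entry is ever valid. A simplified sketch of the pattern found in the kernel's pgtable-3level.h:

	static inline void pae_set_pte(pte_t *ptep, pte_t pte)
	{
		ptep->pte_high = pte.pte_high;	/* high word first: entry not yet
						 * valid, present bit untouched */
		smp_wmb();			/* order the two stores */
		ptep->pte_low = pte.pte_low;	/* present bit lives here, so it
						 * is written last */
	}

If another CPU (or a speculative walk through a stale alias) could observe the entry between the two stores, it would see a mix of old and new halves; that one-instruction window is the small-scale version of the hazard, and a lazily kept stray mapping is the same hazard held open indefinitely.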
David Chinner
2007-Oct-23 00:36 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > > It's hidden now so it doesn't cause any obvious failures any more.
> > > Just subtle ones, which is much worse.
> >
> > There's no evidence of any problems ever being caused by this code.

.....

> Also the corruption will not be on the XFS side, but on the uncached
> mapping, so typically some data sent to some device gets a few
> corrupted cache-line-sized blocks. Depending on the device this might
> also not be fatal -- e.g. if it is texture data, some occasional
> corrupted pixels are not that visible. But there can still be cases
> where it causes real failures when it hits something critical (in the
> failures where it was tracked down, it was usually hitting some
> command ring and the hardware erroring out).

There seems to be little evidence of that occurring around the place.

> > It's been there for years.
>
> That doesn't mean it is correct.

Right, but it also points to the fact that it's not causing problems
for 99.999% of people out there.

> Basically you're saying: I don't care if I corrupt device driver data.
> That's not a very nice thing to say.

No, I did not say that. I said that there's no evidence that points to
this causing problems anywhere.

> > > But why not just disable it? It's not critical functionality, just an
> > > optimization that unfortunately turned out to be incorrect.
> >
> > It is critical functionality for larger machines because vmap()/vunmap()
> > really does suck in terms of scalability.
>
> Critical perhaps, but also broken.
>
> And if it's that big a problem, would it really be that difficult to
> change only the time-critical paths using it not to? I assume it's
> only the directory code where it is a problem? That would likely be
> even faster in general.

Difficult - yes. All the btree code in XFS would have to be rewritten
to remove the assumption that the buffer structures are contiguous in
memory. Any bug we introduce in doing this will result in on-disk
corruption, which is far worse than occasionally crashing a machine
with a stray mapping.

> > > It could be made correct short term by not freeing the pages until
> > > vunmap, for example.
> >
> > Could vmap()/vunmap() take references to the pages that are
> > mapped? That way delaying the unmap would also delay the freeing
> > of the pages and hence we'd have no problems with the pages
> > being reused before the mapping is torn down. That'd work for
> > Xen even with XFS's lazy unmapping scheme, and would allow
> > Nick's more advanced methods to work as well....
>
> You could always just keep around an array of the pages and then drop
> the reference count after unmap. Or walk the vmalloc mapping and
> generate such an array before freeing, then unmap and then drop the
> reference counts.

You mean like vmap() could record the pages passed to it in the
area->pages array, and we walk and release them in __vunmap() like it
already does for vfree()?

If we did this, it would probably be best to pass a page release
function into the vmap or vunmap call - we'd need page_cache_release()
called on the page rather than __free_page()....

The solution belongs behind the vmap/vunmap interface, not in XFS....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
David Chinner
2007-Oct-23 07:04 UTC
[Xen-devel] [patch] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> > On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > > Could vmap()/vunmap() take references to the pages that are mapped? That
> > > way delaying the unmap would also delay the freeing of the pages and hence
> > > we'd have no problems with the pages being reused before the mapping is
> > > torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and
> > > would allow Nick's more advanced methods to work as well....
> >
> > You could always just keep around an array of the pages and then drop the
> > reference count after unmap. Or walk the vmalloc mapping and generate such
> > an array before freeing, then unmap and then drop the reference counts.
>
> You mean like vmap() could record the pages passed to it in the area->pages
> array, and we walk and release them in __vunmap() like it already does
> for vfree()?
>
> If we did this, it would probably be best to pass a page release function
> into the vmap or vunmap call - we'd need page_cache_release() called on
> the page rather than __free_page()....
>
> The solution belongs behind the vmap/vunmap interface, not in XFS....

Lightly tested(*) patch that does this with lazy unmapping below for
comment.

(*) a) kernel boots, b) made an XFS filesystem with 64k directory
blocks, created ~100,000 files in a directory to get a wide btree
(~1700 blocks, still only a single level) and ran repeated finds
across it, dropping caches in between. Each traversal maps and
unmaps every btree block.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_buf.c |   21 +----
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  160 +++++++++++++++++++++++++++++++++++----------
 3 files changed, 133 insertions(+), 52 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-10-23 14:48:32.131088616 +1000
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable. If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail. This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr, put_page);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr, put_page);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 			purge_addresses();
-		bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
-					VM_MAP, PAGE_KERNEL);
+		bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+					VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
 		if (unlikely(bp->b_addr == NULL))
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
Index: 2.6.x-xfs-new/include/linux/vmalloc.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/vmalloc.h	2007-09-12 15:41:41.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/vmalloc.h	2007-10-23 14:36:56.264860921 +1000
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
 
+extern void *vmap_pages(struct page **pages, unsigned int count,
+			unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr, void (*release)(struct page *page));
+
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
 void vmalloc_sync_all(void);
Index: 2.6.x-xfs-new/mm/vmalloc.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/vmalloc.c	2007-09-12 15:41:48.000000000 +1000
+++ 2.6.x-xfs-new/mm/vmalloc.c	2007-10-23 16:29:40.130384957 +1000
@@ -318,7 +318,56 @@ struct vm_struct *remove_vm_area(void *a
 	return v;
 }
 
-static void __vunmap(void *addr, int deallocate_pages)
+static void vm_area_release_page(struct page *page)
+{
+	__free_page(page);
+}
+
+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+				unsigned int nr_pages, int node)
+{
+	struct page **pages;
+	unsigned int array_size;
+
+	array_size = (nr_pages * sizeof(struct page *));
+
+	area->nr_pages = nr_pages;
+	/* Please note that the recursion is strictly bounded. */
+	if (array_size > PAGE_SIZE) {
+		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+					PAGE_KERNEL, node);
+		area->flags |= VM_VPAGES;
+	} else {
+		pages = kmalloc_node(array_size,
+				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+				node);
+	}
+	area->pages = pages;
+	if (!area->pages) {
+		remove_vm_area(area->addr);
+		kfree(area);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static void vm_area_release_pagearray(struct vm_struct *area,
+				void (*release)(struct page *))
+{
+	int i;
+
+	for (i = 0; i < area->nr_pages; i++) {
+		BUG_ON(!area->pages[i]);
+		release(area->pages[i]);
+	}
+
+	if (area->flags & VM_VPAGES)
+		vfree(area->pages);
+	else
+		kfree(area->pages);
+}
+
+static void __vunmap(void *addr, void (*release)(struct page *))
 {
 	struct vm_struct *area;
 
@@ -341,19 +390,8 @@ static void __vunmap(void *addr, int dea
 
 	debug_check_no_locks_freed(addr, area->size);
 
-	if (deallocate_pages) {
-		int i;
-
-		for (i = 0; i < area->nr_pages; i++) {
-			BUG_ON(!area->pages[i]);
-			__free_page(area->pages[i]);
-		}
-
-		if (area->flags & VM_VPAGES)
-			vfree(area->pages);
-		else
-			kfree(area->pages);
-	}
+	if (release)
+		vm_area_release_pagearray(area, release);
 
 	kfree(area);
 	return;
@@ -372,7 +410,7 @@ static void __vunmap(void *addr, int dea
 void vfree(void *addr)
 {
 	BUG_ON(in_interrupt());
-	__vunmap(addr, 1);
+	__vunmap(addr, vm_area_release_page);
 }
 EXPORT_SYMBOL(vfree);
 
@@ -388,11 +426,30 @@ EXPORT_SYMBOL(vfree);
 void vunmap(void *addr)
 {
 	BUG_ON(in_interrupt());
-	__vunmap(addr, 0);
+	__vunmap(addr, NULL);
 }
 EXPORT_SYMBOL(vunmap);
 
 /**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr:	memory base address
+ * @release:	release page callback function
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * then call the release function on each page in the associated
+ * array of pages attached to the area.
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr, void (*release)(struct page *))
+{
+	BUG_ON(in_interrupt());
+	__vunmap(addr, release);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
  * @count: number of pages to map
@@ -422,32 +479,63 @@ void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages:	array of page pointers
+ * @count:	number of pages to map
+ * @flags:	vm_area->flags
+ * @prot:	page protection for the mapping
+ * @gfp_mask:	flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space, taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+		unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+	struct vm_struct *area;
+	struct page **pgp;
+	int i;
+
+	if (count > num_physpages)
+		return NULL;
+
+	area = get_vm_area((count << PAGE_SHIFT), flags);
+	if (!area)
+		return NULL;
+	if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+		return NULL;
+
+	/* map_vm_area modifies pgp */
+	pgp = pages;
+	if (map_vm_area(area, prot, &pgp)) {
+		vunmap(area->addr);
+		return NULL;
+	}
+	/*
+	 * now that the region is mapped, take a reference to each
+	 * page and store them in the area page array.
+	 */
+	for (i = 0; i < area->nr_pages; i++) {
+		get_page(pages[i]);
+		area->pages[i] = pages[i];
+	}
+
+	return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
 void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				pgprot_t prot, int node)
 {
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+	unsigned int nr_pages;
+	int i;
 
 	nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
-	array_size = (nr_pages * sizeof(struct page *));
-
-	area->nr_pages = nr_pages;
-	/* Please note that the recursion is strictly bounded. */
-	if (array_size > PAGE_SIZE) {
-		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
-					PAGE_KERNEL, node);
-		area->flags |= VM_VPAGES;
-	} else {
-		pages = kmalloc_node(array_size,
-				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
-				node);
-	}
-	area->pages = pages;
-	if (!area->pages) {
-		remove_vm_area(area->addr);
-		kfree(area);
+	if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
 		return NULL;
-	}
 
 	for (i = 0; i < area->nr_pages; i++) {
 		if (node < 0)
@@ -461,6 +549,8 @@ void *__vmalloc_area_node(struct vm_stru
 		}
 	}
 
+	/* map_vm_area modifies pages */
+	pages = area->pages;
 	if (map_vm_area(area, prot, &pages))
 		goto fail;
 	return area->addr;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
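A usage sketch for the interface this first version of the patch
introduces, modelled on the xfs_buf.c hunk above. The function name
example_use_mapping() is invented for illustration; put_page is the
release callback the patch itself passes, chosen because XFS buffer
pages come from the page cache and want page_cache_release() semantics
rather than a bare __free_page().

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Map, use, and release a set of pagecache pages (sketch only). */
static int example_use_mapping(struct page **pages, unsigned int count)
{
	void *addr = vmap_pages(pages, count, VM_MAP, PAGE_KERNEL,
				GFP_KERNEL);

	if (!addr)
		return -ENOMEM;

	/* ... access the virtually contiguous region ... */

	/*
	 * Safe even if this call is deferred: vmap_pages() pinned each
	 * page, so none can be freed and reallocated while still mapped.
	 */
	vunmap_pages(addr, put_page);
	return 0;
}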
Andi Kleen
2007-Oct-23 09:28 UTC
[Xen-devel] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> > That doesn't mean it is correct.
>
> Right, but it also points to the fact that it's not causing problems
> for 99.999% of people out there.

So you're waiting for someone to take months to debug this again?

> You mean like vmap() could record the pages passed to it in the area->pages
> array,

The page tables contain pointers to the pages anyway. vunmap() has to
walk them. It would not be very difficult to store them in an array
during the walk.

> If we did this, it would probably be best to pass a page release function
> into the vmap or vunmap call - we'd need page_cache_release() called on
> the page rather than __free_page()....

Possible. Normally vmalloc pages should not be on the LRU, except yours,
so it would probably be fine to just change it.

> The solution belongs behind the vmap/vunmap interface, not in XFS....

You could also just keep the array from map time around yourself.
Then you could do the release yourself.

-Andi

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
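Andi's "walk the mapping and generate the array before freeing"
suggestion can be sketched with vmalloc_to_page(), which resolves an
address inside a vmalloc/vmap area back to its struct page. This is
illustrative only, not code from the thread: vunmap_collect_put() is an
invented name, and the allocation-failure path is elided in a way real
code could not get away with.

#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: rebuild the page array from the mapping, unmap, then unpin. */
static void vunmap_collect_put(void *addr, unsigned int count)
{
	struct page **pages;
	unsigned int i;

	pages = kcalloc(count, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return;			/* real code would need a fallback */

	/* resolve each virtual page back to its struct page */
	for (i = 0; i < count; i++)
		pages[i] = vmalloc_to_page(addr + i * PAGE_SIZE);

	vunmap(addr);			/* tear down the kernel alias */

	for (i = 0; i < count; i++)
		put_page(pages[i]);	/* now the pages may really be freed */

	kfree(pages);
}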
Andi Kleen
2007-Oct-23 09:30 UTC
[Xen-devel] Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 05:04:14PM +1000, David Chinner wrote:
> On Tue, Oct 23, 2007 at 10:36:41AM +1000, David Chinner wrote:
> > On Tue, Oct 23, 2007 at 01:35:14AM +0200, Andi Kleen wrote:
> > > On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > > > Could vmap()/vunmap() take references to the pages that are mapped? That
> > > > way delaying the unmap would also delay the freeing of the pages and hence
> > > > we'd have no problems with the pages being reused before the mapping is
> > > > torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and
> > > > would allow Nick's more advanced methods to work as well....
> > >
> > > You could always just keep around an array of the pages and then drop the
> > > reference count after unmap. Or walk the vmalloc mapping and generate such
> > > an array before freeing, then unmap and then drop the reference counts.
> >
> > You mean like vmap() could record the pages passed to it in the area->pages
> > array, and we walk and release them in __vunmap() like it already does
> > for vfree()?
> >
> > If we did this, it would probably be best to pass a page release function
> > into the vmap or vunmap call - we'd need page_cache_release() called on
> > the page rather than __free_page()....
> >
> > The solution belongs behind the vmap/vunmap interface, not in XFS....
>
> Lightly tested(*) patch that does this with lazy unmapping
> below for comment.

Thanks

>
> (*) a) kernel boots, b) made an XFS filesystem with 64k directory
> blocks, created ~100,000 files in a directory to get a wide btree
> (~1700 blocks, still only a single level) and ran repeated finds
> across it, dropping caches in between. Each traversal maps and
> unmaps every btree block.

Hmm, the __free_page -> page_cache_release() change in vfree() would
have been simpler, wouldn't it? But if it works it is fine.

-Andi

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
David Chinner
2007-Oct-23 12:41 UTC
[Xen-devel] Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings
On Tue, Oct 23, 2007 at 11:30:35AM +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 05:04:14PM +1000, David Chinner wrote:
> > > You mean like vmap() could record the pages passed to it in the area->pages
> > > array, and we walk and release them in __vunmap() like it already does
> > > for vfree()?
> > >
> > > If we did this, it would probably be best to pass a page release function
> > > into the vmap or vunmap call - we'd need page_cache_release() called on
> > > the page rather than __free_page()....
> > >
> > > The solution belongs behind the vmap/vunmap interface, not in XFS....
> >
> > Lightly tested(*) patch that does this with lazy unmapping
> > below for comment.
>
> Thanks
>
> >
> > (*) a) kernel boots, b) made an XFS filesystem with 64k directory
> > blocks, created ~100,000 files in a directory to get a wide btree
> > (~1700 blocks, still only a single level) and ran repeated finds
> > across it, dropping caches in between. Each traversal maps and
> > unmaps every btree block.
>
> Hmm, the __free_page -> page_cache_release() change in vfree() would
> have been simpler, wouldn't it?

Yes, it would, but I tried to leave vmalloc doing the same thing as it
was doing before. I think it would be safe simply to call put_page()
directly in the __vunmap() code and drop all the release function
passing, but I don't know enough about that code to say for certain.

I'll spin up another patch tomorrow that does this and see how it goes.

> But if it works it is fine.

Seems to - it's passed XFSQA with 64k directories and a bunch of
dirstress workouts as well.

Nick, Jeremy, (others?) any objections to this approach to solving
the problem?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
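The simplification David proposes here works because, in kernels of
this vintage, the page cache release helper was just an alias for
put_page(), so a single release path covers both the vmalloc-owned and
pagecache cases. From include/linux/pagemap.h of that era (paraphrased
from memory, so treat as approximate):

#define page_cache_get(page)		get_page(page)
#define page_cache_release(page)	put_page(page)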
Jeremy Fitzhardinge
2007-Oct-23 14:33 UTC
[Xen-devel] Re: [patch] Re: Interaction between Xen and XFS: stray RW mappings
David Chinner wrote:
> Nick, Jeremy, (others?) any objections to this approach to solving
> the problem?

Seems sound in principle. I think Nick's shootdown-stray-mappings mm
API call is a better long-term answer, but this will do for now. I'll
test it out today.

J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
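As a historical footnote beyond this thread: the generic shootdown call
Jeremy alludes to eventually landed in mainline as vm_unmap_aliases(),
as part of Nick Piggin's later vmap rework, giving code like Xen's
pagetable pinning a single call to flush lazily deferred kernel
aliases. A sketch of the calling pattern; the surrounding function is
hypothetical:

#include <linux/vmalloc.h>

/* Sketch of a caller that must guarantee no stray RW kernel aliases
 * exist before making a page read-only (e.g. pagetable pinning). */
static void make_page_ro_safely(struct page *page)
{
	/* flush any lazily deferred vmap aliases system-wide */
	vm_unmap_aliases();

	/* ... now safe to establish the exclusive RO mapping ... */
}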
David Chinner
2007-Oct-24 04:36 UTC
[Xen-devel] [PATCH] Allow lazy unmapping by taking extra page references V2
Allow lazy unmapping of vmap()d regions by taking a reference on each
page in the region being vmap()ed. The page references get released
after the region is vunmap()ed, thereby ensuring we don't leave stray
mappings on freed pages that could lead to pages being reallocated
with incorrect mappings still associated with them.

Tested with XFS filesystems with 64k directory blocks with dirstress
and XFSQA.

Version 2:
- drop the release function stuff and just call put_page()
  as it falls down to the same code in all calling cases.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/xfs/linux-2.6/xfs_buf.c |   21 +-------
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  118 +++++++++++++++++++++++++++++++++++--------
 3 files changed, 106 insertions(+), 37 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-10-24 09:16:26.930345566 +1000
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable. If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail. This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 			purge_addresses();
-		bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
-					VM_MAP, PAGE_KERNEL);
+		bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+					VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
 		if (unlikely(bp->b_addr == NULL))
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
Index: 2.6.x-xfs-new/include/linux/vmalloc.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/vmalloc.h	2007-09-12 15:41:41.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/vmalloc.h	2007-10-24 09:16:55.194741855 +1000
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
 
+extern void *vmap_pages(struct page **pages, unsigned int count,
+			unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
 void vmalloc_sync_all(void);
Index: 2.6.x-xfs-new/mm/vmalloc.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/vmalloc.c	2007-09-12 15:41:48.000000000 +1000
+++ 2.6.x-xfs-new/mm/vmalloc.c	2007-10-24 09:30:53.447102103 +1000
@@ -318,6 +318,34 @@ struct vm_struct *remove_vm_area(void *a
 	return v;
 }
 
+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+				unsigned int nr_pages, int node)
+{
+	struct page **pages;
+	unsigned int array_size;
+
+	array_size = (nr_pages * sizeof(struct page *));
+
+	area->nr_pages = nr_pages;
+	/* Please note that the recursion is strictly bounded. */
+	if (array_size > PAGE_SIZE) {
+		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+					PAGE_KERNEL, node);
+		area->flags |= VM_VPAGES;
+	} else {
+		pages = kmalloc_node(array_size,
+				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+				node);
+	}
+	area->pages = pages;
+	if (!area->pages) {
+		remove_vm_area(area->addr);
+		kfree(area);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
 static void __vunmap(void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
@@ -346,7 +374,7 @@ static void __vunmap(void *addr, int dea
 
 	for (i = 0; i < area->nr_pages; i++) {
 		BUG_ON(!area->pages[i]);
-		__free_page(area->pages[i]);
+		put_page(area->pages[i]);
 	}
 
 	if (area->flags & VM_VPAGES)
@@ -393,6 +421,23 @@ void vunmap(void *addr)
 EXPORT_SYMBOL(vunmap);
 
 /**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr:	memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+	BUG_ON(in_interrupt());
+	__vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
  * @count: number of pages to map
@@ -422,32 +467,63 @@ void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages:	array of page pointers
+ * @count:	number of pages to map
+ * @flags:	vm_area->flags
+ * @prot:	page protection for the mapping
+ * @gfp_mask:	flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space, taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+		unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+	struct vm_struct *area;
+	struct page **pgp;
+	int i;
+
+	if (count > num_physpages)
+		return NULL;
+
+	area = get_vm_area((count << PAGE_SHIFT), flags);
+	if (!area)
+		return NULL;
+	if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+		return NULL;
+
+	/* map_vm_area modifies pgp */
+	pgp = pages;
+	if (map_vm_area(area, prot, &pgp)) {
+		vunmap(area->addr);
+		return NULL;
+	}
+	/*
+	 * now that the region is mapped, take a reference to each
+	 * page and store them in the area page array.
+	 */
+	for (i = 0; i < area->nr_pages; i++) {
+		get_page(pages[i]);
+		area->pages[i] = pages[i];
+	}
+
+	return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
 void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				pgprot_t prot, int node)
 {
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+	unsigned int nr_pages;
+	int i;
 
 	nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
-	array_size = (nr_pages * sizeof(struct page *));
-
-	area->nr_pages = nr_pages;
-	/* Please note that the recursion is strictly bounded. */
-	if (array_size > PAGE_SIZE) {
-		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
-					PAGE_KERNEL, node);
-		area->flags |= VM_VPAGES;
-	} else {
-		pages = kmalloc_node(array_size,
-				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
-				node);
-	}
-	area->pages = pages;
-	if (!area->pages) {
-		remove_vm_area(area->addr);
-		kfree(area);
+	if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
 		return NULL;
-	}
 
 	for (i = 0; i < area->nr_pages; i++) {
 		if (node < 0)
@@ -461,6 +537,8 @@ void *__vmalloc_area_node(struct vm_stru
 		}
 	}
 
+	/* map_vm_area modifies pages */
+	pages = area->pages;
 	if (map_vm_area(area, prot, &pages))
 		goto fail;
 	return area->addr;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2007-Oct-24 05:08 UTC
[Xen-devel] Re: [PATCH] Allow lazy unmapping by taking extra page references V2
David Chinner wrote:
> Allow lazy unmapping of vmap()d regions by taking a reference
> on each page in the region being vmap()ed. The page references
> get released after the region is vunmap()ed, thereby ensuring
> we don't leave stray mappings on freed pages that could lead to
> pages being reallocated with incorrect mappings still associated
> with them.
>
> Tested with XFS filesystems with 64k directory blocks with
> dirstress and XFSQA.
>
> Version 2:
> - drop the release function stuff and just call put_page()
>   as it falls down to the same code in all calling cases.
>

This doesn't apply cleanly to the current git tree.

J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
David Chinner
2007-Oct-24 21:48 UTC
[Xen-devel] [PATCH] Allow lazy unmapping by taking extra page references V3
Allow lazy unmapping of vmap()d regions by taking a reference on each
page in the region being vmap()ed. The page references get released
after the region is vunmap()ed, thereby ensuring we don't leave stray
mappings on freed pages that could lead to pages being reallocated
with incorrect mappings still associated with them.

Tested with XFS filesystems with 64k directory blocks with dirstress
and XFSQA.

Version 3:
- compile on latest -git

Version 2:
- drop the release function stuff and just call put_page()
  as it falls down to the same code in all calling cases.

Signed-off-by: Dave Chinner <dgc@sgi.com>

---
 fs/xfs/linux-2.6/xfs_buf.c |   21 +-------
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  118 +++++++++++++++++++++++++++++++++++--------
 3 files changed, 106 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable. If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail. This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 			purge_addresses();
-		bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
-					VM_MAP, PAGE_KERNEL);
+		bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+					VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
 		if (unlikely(bp->b_addr == NULL))
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
 
+extern void *vmap_pages(struct page **pages, unsigned int count,
+			unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
 void vmalloc_sync_all(void);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..95ba0fd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -319,6 +319,34 @@ struct vm_struct *remove_vm_area(void *a
 	return v;
 }
 
+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+				unsigned int nr_pages, int node)
+{
+	struct page **pages;
+	unsigned int array_size;
+
+	array_size = (nr_pages * sizeof(struct page *));
+
+	area->nr_pages = nr_pages;
+	/* Please note that the recursion is strictly bounded. */
+	if (array_size > PAGE_SIZE) {
+		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+					PAGE_KERNEL, node);
+		area->flags |= VM_VPAGES;
+	} else {
+		pages = kmalloc_node(array_size,
+				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
+				node);
+	}
+	area->pages = pages;
+	if (!area->pages) {
+		remove_vm_area(area->addr);
+		kfree(area);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
 static void __vunmap(void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
@@ -347,7 +375,7 @@ static void __vunmap(void *addr, int dea
 
 	for (i = 0; i < area->nr_pages; i++) {
 		BUG_ON(!area->pages[i]);
-		__free_page(area->pages[i]);
+		put_page(area->pages[i]);
 	}
 
 	if (area->flags & VM_VPAGES)
@@ -394,6 +422,23 @@ void vunmap(void *addr)
 EXPORT_SYMBOL(vunmap);
 
 /**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr:	memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+	BUG_ON(in_interrupt());
+	__vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
  * @count: number of pages to map
@@ -423,32 +468,63 @@ void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages:	array of page pointers
+ * @count:	number of pages to map
+ * @flags:	vm_area->flags
+ * @prot:	page protection for the mapping
+ * @gfp_mask:	flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space, taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+		unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+	struct vm_struct *area;
+	struct page **pgp;
+	int i;
+
+	if (count > num_physpages)
+		return NULL;
+
+	area = get_vm_area((count << PAGE_SHIFT), flags);
+	if (!area)
+		return NULL;
+	if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+		return NULL;
+
+	/* map_vm_area modifies pgp */
+	pgp = pages;
+	if (map_vm_area(area, prot, &pgp)) {
+		vunmap(area->addr);
+		return NULL;
+	}
+	/*
+	 * now that the region is mapped, take a reference to each
+	 * page and store them in the area page array.
+	 */
+	for (i = 0; i < area->nr_pages; i++) {
+		get_page(pages[i]);
+		area->pages[i] = pages[i];
+	}
+
+	return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
 void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				pgprot_t prot, int node)
 {
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+	unsigned int nr_pages;
+	int i;
 
 	nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
-	array_size = (nr_pages * sizeof(struct page *));
-
-	area->nr_pages = nr_pages;
-	/* Please note that the recursion is strictly bounded. */
-	if (array_size > PAGE_SIZE) {
-		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
-					PAGE_KERNEL, node);
-		area->flags |= VM_VPAGES;
-	} else {
-		pages = kmalloc_node(array_size,
-				(gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
-				node);
-	}
-	area->pages = pages;
-	if (!area->pages) {
-		remove_vm_area(area->addr);
-		kfree(area);
+	if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
 		return NULL;
-	}
 
 	for (i = 0; i < area->nr_pages; i++) {
 		if (node < 0)
@@ -462,6 +538,8 @@ void *__vmalloc_area_node(struct vm_stru
 		}
 	}
 
+	/* map_vm_area modifies pages */
+	pages = area->pages;
 	if (map_vm_area(area, prot, &pages))
 		goto fail;
 	return area->addr;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2007-Oct-24 22:46 UTC
[Xen-devel] Re: [PATCH] Allow lazy unmapping by taking extra page references V3
David Chinner wrote:
> Allow lazy unmapping of vmap()d regions by taking a reference
> on each page in the region being vmap()ed. The page references
> get released after the region is vunmap()ed, thereby ensuring
> we don't leave stray mappings on freed pages that could lead to
> pages being reallocated with incorrect mappings still associated
> with them.
>
> Tested with XFS filesystems with 64k directory blocks with
> dirstress and XFSQA.
>
> Version 3:
> - compile on latest -git
>

Not quite:

  CC      mm/vmalloc.o
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray':
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function)
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.)
make[3]: *** [mm/vmalloc.o] Error 1
make[2]: *** [mm] Error 2
make[2]: *** Waiting for unfinished jobs....

GFP_RECLAIM_MASK now?

J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
David Chinner
2007-Oct-24 23:21 UTC
[Xen-devel] Re: [PATCH] Allow lazy unmapping by taking extra page references V3
On Wed, Oct 24, 2007 at 03:46:16PM -0700, Jeremy Fitzhardinge wrote:
> David Chinner wrote:
> > Version 3:
> > - compile on latest -git
> >
>
> Not quite:
>
>   CC      mm/vmalloc.o
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray':
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function)
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once
> /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.)
> make[3]: *** [mm/vmalloc.o] Error 1
> make[2]: *** [mm] Error 2
> make[2]: *** Waiting for unfinished jobs....
>
> GFP_RECLAIM_MASK now?

Yeah, it is. Not sure what happened there - I did make that change.
New diff below.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable. If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail. This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 			purge_addresses();
-		bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
-					VM_MAP, PAGE_KERNEL);
+		bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+					VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
 		if (unlikely(bp->b_addr == NULL))
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
 
+extern void *vmap_pages(struct page **pages, unsigned int count,
+			unsigned long flags, pgprot_t prot, gfp_t gfp_mask);
+extern void vunmap_pages(void *addr);
+
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
 void vmalloc_sync_all(void);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..720f338 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -319,6 +319,34 @@ struct vm_struct *remove_vm_area(void *a
 	return v;
 }
 
+static int vm_area_alloc_pagearray(struct vm_struct *area, gfp_t gfp_mask,
+				unsigned int nr_pages, int node)
+{
+	struct page **pages;
+	unsigned int array_size;
+
+	array_size = (nr_pages * sizeof(struct page *));
+
+	area->nr_pages = nr_pages;
+	/* Please note that the recursion is strictly bounded. */
+	if (array_size > PAGE_SIZE) {
+		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+					PAGE_KERNEL, node);
+		area->flags |= VM_VPAGES;
+	} else {
+		pages = kmalloc_node(array_size,
+				(gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
+				node);
+	}
+	area->pages = pages;
+	if (!area->pages) {
+		remove_vm_area(area->addr);
+		kfree(area);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
 static void __vunmap(void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
@@ -347,7 +375,7 @@ static void __vunmap(void *addr, int dea
 
 	for (i = 0; i < area->nr_pages; i++) {
 		BUG_ON(!area->pages[i]);
-		__free_page(area->pages[i]);
+		put_page(area->pages[i]);
 	}
 
 	if (area->flags & VM_VPAGES)
@@ -394,6 +422,23 @@ void vunmap(void *addr)
 EXPORT_SYMBOL(vunmap);
 
 /**
+ * vunmap_pages - release virtual mapping obtained by vmap_pages()
+ * @addr:	memory base address
+ *
+ * Free the virtually contiguous memory area starting at @addr,
+ * which was created from the page array passed to vmap_pages(),
+ * releasing the reference on the pages gained in vmap_pages().
+ *
+ * Must not be called in interrupt context.
+ */
+void vunmap_pages(void *addr)
+{
+	BUG_ON(in_interrupt());
+	__vunmap(addr, 1);
+}
+EXPORT_SYMBOL(vunmap_pages);
+
+/**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
  * @count: number of pages to map
@@ -423,32 +468,63 @@ void *vmap(struct page **pages, unsigned
 }
 EXPORT_SYMBOL(vmap);
 
+/**
+ * vmap_pages - map an array of pages into virtually contiguous space
+ * @pages:	array of page pointers
+ * @count:	number of pages to map
+ * @flags:	vm_area->flags
+ * @prot:	page protection for the mapping
+ * @gfp_mask:	flags for the page level allocator
+ *
+ * Maps @count pages from @pages into contiguous kernel virtual
+ * space, taking a reference to each page and keeping track of all
+ * the pages within the vm area structure.
+ */
+void *vmap_pages(struct page **pages, unsigned int count,
+		unsigned long flags, pgprot_t prot, gfp_t gfp_mask)
+{
+	struct vm_struct *area;
+	struct page **pgp;
+	int i;
+
+	if (count > num_physpages)
+		return NULL;
+
+	area = get_vm_area((count << PAGE_SHIFT), flags);
+	if (!area)
+		return NULL;
+	if (vm_area_alloc_pagearray(area, gfp_mask, count, -1))
+		return NULL;
+
+	/* map_vm_area modifies pgp */
+	pgp = pages;
+	if (map_vm_area(area, prot, &pgp)) {
+		vunmap(area->addr);
+		return NULL;
+	}
+	/*
+	 * now that the region is mapped, take a reference to each
+	 * page and store them in the area page array.
+	 */
+	for (i = 0; i < area->nr_pages; i++) {
+		get_page(pages[i]);
+		area->pages[i] = pages[i];
+	}
+
+	return area->addr;
+}
+EXPORT_SYMBOL(vmap_pages);
+
 void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				pgprot_t prot, int node)
 {
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+	unsigned int nr_pages;
+	int i;
 
 	nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
-	array_size = (nr_pages * sizeof(struct page *));
-
-	area->nr_pages = nr_pages;
-	/* Please note that the recursion is strictly bounded. */
-	if (array_size > PAGE_SIZE) {
-		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
-					PAGE_KERNEL, node);
-		area->flags |= VM_VPAGES;
-	} else {
-		pages = kmalloc_node(array_size,
-				(gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
-				node);
-	}
-	area->pages = pages;
-	if (!area->pages) {
-		remove_vm_area(area->addr);
-		kfree(area);
+	if (vm_area_alloc_pagearray(area, gfp_mask, nr_pages, node))
 		return NULL;
-	}
 
 	for (i = 0; i < area->nr_pages; i++) {
 		if (node < 0)
@@ -462,6 +538,8 @@ void *__vmalloc_area_node(struct vm_stru
 		}
 	}
 
+	/* map_vm_area modifies pages */
+	pages = area->pages;
 	if (map_vm_area(area, prot, &pages))
 		goto fail;
 	return area->addr;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
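To close the loop on the interface as it stands at the end of the
thread: vunmap_pages() now takes only the address, because the
put_page() of each recorded page happens inside __vunmap() itself.
A hypothetical caller (function names invented for illustration)
looks like this:

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Sketch of a caller of the final vmap_pages()/vunmap_pages() API. */
static void *map_buffer(struct page **pages, unsigned int count)
{
	/* pins each page via get_page() inside vmap_pages() */
	return vmap_pages(pages, count, VM_MAP, PAGE_KERNEL, GFP_KERNEL);
}

static void unmap_buffer(void *addr)
{
	/* drops the pinned references via put_page() inside __vunmap() */
	vunmap_pages(addr);
}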