Rafal Wojtczuk
2010-Apr-12 18:54 UTC
[Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
Hello,

Would someone on the list enlighten me on the following issue, possibly related to mfn management in the PV guest.

Environment: xen-3.4.3, pvops 2.6.32.9 in dom0 and domU, all 64bit.

Usermode code (if you are interested, at http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/vchan/init.c;h=cb2fb851c3b97804b115dbf58fd47a30d6d0a8a3;hb=HEAD) in PV domU does the following:
1) gets a page via ring_ref_v=mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE |MAP_ANON,-1, 0);
2) mlock(ring_ref_v, 4096)
3) determines the mfn number of the frame holding ring_ref_v via a call to u2mfn driver (http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/u2mfn/u2mfn.c;h=6ff113c07c50ef078ab04d9e61d2faab338357e7;hb=HEAD); the driver basically does get_user_pages+kmap+virt_to_mfn (and later kunmap+put_page for cleanup).

Then, the usermode code in dom0 does xc_map_foreign_range on the returned mfn, and can communicate with the code in domU over this shared page...
...but sometimes, apparently the page that backs ring_ref_v changes: if the domU application calls u2mfn ioctl again with ring_ref_v argument, it returns a different mfn. And naturally the code in dom0 reads garbage from the address returned by its previous call to xc_map_foreign_range.

I find it puzzling. Is this behaviour normal/expected ?
Mlock man pages say "All pages that contain a part of the specified address range are guaranteed to be resident in RAM", not "be resident at the same RAM location", but why would a frame backing a mlock-ed memory be changed ? Is there some memory defragmentation going on ? Or maybe only frame->frame number function changes (but why would it) ?

Anyway, this behaviour causes problems, as you can see in http://www.qubes-os.org/trac/ticket/16#comment:4
It would be nice if the mfn of a frame that holds a given mlock-ed usermode page could be made constant.
If you can offer some insight, that would be helpful, particularly:
1) Why this does not happen to pages allocated in kernel mode (if it did, it would break the split drivers model) ?
2) Can this frame-changing behaviour be switched off at Xen/kernel level?
3) Would using grant tables (instead of brutal xc_map_foreign_range) change the situation ?

BTW, for Qubes it is necessary to map PV domU usermode pages in dom0; particularly, map X server composition buffers.

Regards,
Rafal Wojtczuk
The Qubes OS Project
http://qubes-os.org

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Apr-12 20:01 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 11:54 AM, Rafal Wojtczuk wrote:
> Hello,
> Would someone on the list enlighten me on the following issue, possibly related
> to mfn management in the PV guest.
> Environment: xen-3.4.3, pvops 2.6.32.9 in dom0 and domU, all 64bit.
> Usermode code
> (if you are interested, at http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/vchan/init.c;h=cb2fb851c3b97804b115dbf58fd47a30d6d0a8a3;hb=HEAD)
> in PV domU does the following:
> 1) gets a page via
> ring_ref_v=mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE |MAP_ANON,-1, 0);
> 2) mlock(ring_ref_v, 4096)
> 3) determines the mfn number of the frame holding ring_ref_v via a call to
> u2mfn driver
> (http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/u2mfn/u2mfn.c;h=6ff113c07c50ef078ab04d9e61d2faab338357e7;hb=HEAD)
> the driver basically does get_user_pages+kmap+virt_to_mfn (and later
> kunmap+put_page for cleanup).

Yeah, this looks fundamentally suspect. Using mlock in this way is going to be fragile.

> Then, the
> usermode code in dom0 does xc_map_foreign_range on the returned mfn, and can
> communicate with the code in domU over this shared page...
> ...but sometimes, apparently the page that backs ring_ref_v changes: if the
> domU application calls u2mfn ioctl again with ring_ref_v argument, it
> returns a different mfn.
> And naturally the code in dom0 reads garbage from the address returned by
> its previous call to xc_map_foreign_range.
>
> I find it puzzling. Is this behaviour normal/expected ?

Yes.

> Mlock man pages say "All pages that contain a part of the specified address
> range are guaranteed to be resident in RAM", not "be resident at the same
> RAM location", but why would a frame backing a mlock-ed memory be changed ?
> Is there some memory defragmentation going on ?

Yes, the kernel can move usermode memory around to defrag memory.
This is done to allow higher-order memory allocations to keep working even on a long-running system which would otherwise fragment the address space. Ideally it allows 2M page allocations to succeed.

> Or maybe only frame->frame
> number function changes (but why would it) ?
>
> Anyway, this behaviour causes problems, as you can see in
> http://www.qubes-os.org/trac/ticket/16#comment:4
> It would be nice if the mfn of a frame that holds a given mlock-ed usermode
> page could be made constant.
>
> If you can offer some insight, that would be helpful, particularly:
> 1) Why this does not happen to pages allocated in kernel mode (if it did, it
> would break the split drivers model) ?

No, kernel allocations are not movable by default.

> 2) Can this frame-changing behaviour be switched off at Xen/kernel level?

Not that I know of, and it wouldn't be desirable if it could be.

> 3) Would using grant tables (instead of brutal xc_map_foreign_range) change
> the situation ?
> BTW, for Qubes it is necessary to map PV domU usermode pages in dom0;
> particularly, map X server composition buffers.

Why is it necessary to map usermode pages? It just seems like asking for trouble. Why not make it so that the domU X server gets the memory from the kernel (via some kind of driver), and then map that through to dom0?

J
Joanna Rutkowska
2010-Apr-12 20:21 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 10:01 PM, Jeremy Fitzhardinge wrote:
> Why is it necessary to map usermode pages? It just seems like asking
> for trouble. Why not make it so that the domU X server gets the memory
> from the kernel (via some kind of driver), and then map that through to
> dom0?

Because we want to avoid modifying Xorg sources -- it normally allocates its composition buffers using malloc, and if we wanted to make it use some kernel allocated memory (by our custom driver) we would need to patch Xorg, which we obviously wanted to avoid...

joanna.

ps. Copied this to qubes-devel as well.
Jeremy Fitzhardinge
2010-Apr-12 20:39 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 01:21 PM, Joanna Rutkowska wrote:
> On 04/12/2010 10:01 PM, Jeremy Fitzhardinge wrote:
>
>> Why is it necessary to map usermode pages? It just seems like asking
>> for trouble. Why not make it so that the domU X server gets the memory
>> from the kernel (via some kind of driver), and then map that through to
>> dom0?
>>
> Because we want to avoid modifying Xorg sources -- it normally allocates
> its composition buffers using malloc, and if we wanted to make it use
> some kernel allocated memory (by our custom driver) we would need to
> patch Xorg, which we obviously wanted to avoid...

The referenced code doesn't do that; it allocates some memory with mmap, mlocks it, uses /proc/u2mfn to get the mfn then pokes it into xenbus.

But I assume you have other code which wants to grant through the Xorg-allocated framebuffer. That complicates things a bit, but you could still add a device (no /proc files, please) with an ioctl which:

1. takes a range of usermode addresses
2. increments the page refcount for those pages
3. returns the mfns for those pages

That will prevent the pages from being migrated while you're referring to their mfns. You need to add something to explicitly decrement the refcount to prevent a memory leak, presumably at the time you tear down the mapping in dom0. Ideally you'd arrange to do that triggered off unmap of the memory range (by isolating the pages in their own new vma) so that it all gets cleaned up on process exit.

J
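One way to picture the interface proposed above is from the userspace side. The sketch below is an assumption-laden illustration and not the u2mfn driver or any existing Xen interface: the request layout `struct pin_req`, the ioctl numbers `PIN_IOC_PIN` and `PIN_IOC_UNPIN`, and the helper names are all hypothetical. The kernel half is not shown; per the description above it would take a reference on each page in the range, fill in the mfns, and hold the references until the explicit unpin (or until the device is closed).

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request layout, invented for this sketch. */
struct pin_req {
    uint64_t addr;      /* start of the usermode range */
    uint64_t nr_pages;  /* number of pages covered by the range */
    uint64_t mfns;      /* user pointer to nr_pages uint64_t slots */
};

/* Hypothetical ioctl numbers for the two operations described above. */
#define PIN_IOC_PIN   _IOWR('q', 1, struct pin_req) /* take refs, return mfns */
#define PIN_IOC_UNPIN _IOW('q', 2, struct pin_req)  /* drop the refs again */

/* Build a request covering `len` bytes starting at `addr`. */
static struct pin_req make_req(void *addr, size_t len, uint64_t *mfns)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct pin_req req;

    req.addr = (uint64_t)(uintptr_t)addr;
    req.nr_pages = (len + page - 1) / page;   /* round up to whole pages */
    req.mfns = (uint64_t)(uintptr_t)mfns;
    return req;
}

/*
 * Ask the hypothetical device to pin the range and report its mfns.
 * Returns an open fd that the caller keeps for as long as dom0 maps the
 * pages, or -1 if the device cannot be opened or the ioctl fails.
 */
static int pin_range(const char *dev, void *addr, size_t len, uint64_t *mfns)
{
    struct pin_req req = make_req(addr, len, mfns);
    int fd = open(dev, O_RDWR);

    if (fd < 0)
        return -1;
    if (ioctl(fd, PIN_IOC_PIN, &req) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Explicitly drop the references once dom0 has torn down its mapping. */
static int unpin_range(int fd, void *addr, size_t len)
{
    struct pin_req req = make_req(addr, len, NULL);
    int rc = ioctl(fd, PIN_IOC_UNPIN, &req);

    close(fd);
    return rc;
}
```

Keeping the fd open for the lifetime of the pin is a design choice of this sketch: it gives the driver a natural place (the release handler) to drop any remaining references if the process exits without calling the unpin ioctl.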
Joanna Rutkowska
2010-Apr-12 21:19 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 10:39 PM, Jeremy Fitzhardinge wrote:
> On 04/12/2010 01:21 PM, Joanna Rutkowska wrote:
>> On 04/12/2010 10:01 PM, Jeremy Fitzhardinge wrote:
>>
>>> Why is it necessary to map usermode pages? It just seems like asking
>>> for trouble. Why not make it so that the domU X server gets the memory
>>> from the kernel (via some kind of driver), and then map that through to
>>> dom0?
>>>
>> Because we want to avoid modifying Xorg sources -- it normally allocates
>> its composition buffers using malloc, and if we wanted to make it use
>> some kernel allocated memory (by our custom driver) we would need to
>> patch Xorg, which we obviously wanted to avoid...
>>
>
> The referenced code doesn't do that; it allocates some memory with
> mmap, mlocks it, uses /proc/u2mfn to get the mfn then pokes it into xenbus.

Right, that's for the "ring" page, which we use to implement a ring buffer, and we then pass mfns of the actual Xorg's composition buffers over this ring buffer to Dom0.

Interestingly, I have never seen garbage in any of the composition buffers (which are directly displayed by our appviewers, so it would be immediately visible), just as if only the mfn for the "ring" page could be modified, but the composition buffer's mfn were somehow pinned...

This might suggest that the memory used by the composition buffers (which are in usermode) is somehow locked?

Thanks,
j.
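To make the role of the "ring" page concrete, here is a minimal sketch of a single-producer, single-consumer ring that fits in one 4096-byte page and carries mfn values. It is not the vchan layout: the structure `struct mfn_ring`, the slot count, and the function names are assumptions made for illustration, and the GCC builtin `__sync_synchronize()` stands in for whatever barriers the real code uses.

```c
#include <stdint.h>

#define RING_PAGE_SIZE 4096
#define RING_SLOTS     256   /* power of two, so index wraparound stays consistent */

/* Hypothetical layout of the shared page. */
struct mfn_ring {
    volatile uint32_t prod;       /* advanced only by the producer (domU) */
    volatile uint32_t cons;       /* advanced only by the consumer (dom0) */
    uint64_t slot[RING_SLOTS];    /* mfns of the buffers being announced */
};

/* Compile-time check that the whole ring fits in a single page. */
typedef char ring_fits_in_page[sizeof(struct mfn_ring) <= RING_PAGE_SIZE ? 1 : -1];

static void ring_init(struct mfn_ring *r)
{
    r->prod = 0;
    r->cons = 0;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_put(struct mfn_ring *r, uint64_t mfn)
{
    if (r->prod - r->cons == RING_SLOTS)
        return -1;
    r->slot[r->prod % RING_SLOTS] = mfn;
    __sync_synchronize();         /* publish the slot before the index */
    r->prod++;
    return 0;
}

/* Consumer side: returns 0 and stores the mfn, or -1 if the ring is empty. */
static int ring_get(struct mfn_ring *r, uint64_t *mfn)
{
    if (r->cons == r->prod)
        return -1;
    __sync_synchronize();         /* read the slot only after seeing the index */
    *mfn = r->slot[r->cons % RING_SLOTS];
    __sync_synchronize();         /* finish the read before freeing the slot */
    r->cons++;
    return 0;
}
```

Both sides must be looking at the same physical frame for this to work. If the frame backing the producer's mapping is replaced, the consumer keeps reading the old frame and sees stale or unrelated data, which matches the garbage reported for the ring page earlier in the thread.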
Jeremy Fitzhardinge
2010-Apr-12 21:26 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 02:19 PM, Joanna Rutkowska wrote:
> Right, that's for the "ring" page, which we use to implement a ring
> buffer, and we then pass mfns of the actual Xorg's composition buffers
> over this ring buffer to Dom0.
>
> Interestingly, I have never seen garbage in any of the composition
> buffers (which are directly displayed by our appviewers, so it would be
> immediately visible), just as if only the mfn for the "ring" page
> could be modified, but the composition buffer's mfn were somehow pinned...
>
> This might suggest that the memory used by the composition buffers
> (which are in usermode) is somehow locked?

Worth looking into.

I'm not at all familiar with how X manages composition buffers, but it seems to me that in normal use, one would want to be able to either allocate that buffer in texture memory (so it can be used as a texture source), or at least copy updates into texture memory. Couldn't you hook into that transfer to the composition hardware (ie, dom0)?

J
Joanna Rutkowska
2010-Apr-12 21:36 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/12/2010 11:26 PM, Jeremy Fitzhardinge wrote:
> On 04/12/2010 02:19 PM, Joanna Rutkowska wrote:
>> Right, that's for the "ring" page, which we use to implement a ring
>> buffer, and we then pass mfns of the actual Xorg's composition buffers
>> over this ring buffer to Dom0.
>>
>> Interestingly, I have never seen garbage in any of the composition
>> buffers (which are directly displayed by our appviewers, so it would be
>> immediately visible), just as if only the mfn for the "ring" page
>> could be modified, but the composition buffer's mfn were somehow pinned...
>>
>> This might suggest that the memory used by the composition buffers
>> (which are in usermode) is somehow locked?
>>
>
> Worth looking into.
>
> I'm not at all familiar with how X manages composition buffers, but it
> seems to me that in normal use, one would want to be able to either
> allocate that buffer in texture memory (so it can be used as a texture
> source), or at least copy updates into texture memory. Couldn't you
> hook into that transfer to the composition hardware (ie, dom0)?

We will definitely look into this. Thanks a lot for your help, Jeremy!

joanna.
Rafal Wojtczuk
2010-Apr-19 11:25 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On Mon, Apr 12, 2010 at 01:39:43PM -0700, Jeremy Fitzhardinge wrote:
> But I assume you have other code which wants to grant through the
> Xorg-allocated framebuffer. That complicates things a bit, but you could
> still add a device (no /proc files, please) with an ioctl which:
>
> 1. takes a range of usermode addresses
> 2. increments the page refcount for those pages

Or, do not decrement the count, by not calling put_page() ?

> 3. returns the mfns for those pages
>
> That will prevent the pages from being migrated while you're referring
> to their mfns.

After removing the call to put_page() in u2mfn_ioctl(), see once again
http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/u2mfn/u2mfn.c;h=6ff113c07c50ef078ab04d9e61d2faab338357e7;hb=HEAD#l35
the page's mfn changed again. Even commenting out the kunmap() call in this function did not help either. Am I missing something ?

The only working way (for the ring buffer case) is to acquire memory via kmalloc and pass it to userspace via remap_pfn_range. But this is unsuitable for the case of X composition buffers, because we don't want to alter the way X allocates memory (it calls plain malloc). We could hijack X's malloc() via LD_PRELOAD, but then we cannot distinguish which calls are made because of composition buffer allocation.

> You need to add something to explicitly decrement the
> refcount to prevent a memory leak, presumably at the time you tear down
> the mapping in dom0. Ideally you'd arrange to do that triggered off
> unmap of the memory range (by isolating the pages in their own new vma)
> so that it all gets cleaned up on process exit.

By "triggered off unmap" do you mean setting the vm_ops field in struct vm_area_struct to a custom struct vm_operations_struct (particularly, with a custom close() method), or is there something simpler ?

> I'm not at all familiar with how X manages composition buffers, but it
> seems to me that in normal use, one would want to be able to either
> allocate that buffer in texture memory (so it can be used as a texture
> source), or at least copy updates into texture memory. Couldn't you
> hook into that transfer to the composition hardware (ie, dom0)?

We are talking about X running in domU; there is no related hardware. We can determine where the composition buffer is only after it has been allocated.

> No, kernel allocations are not movable by default.

Could you mention a few more details on the related migration mechanism ? E.g. which PG_ flag (set by kmalloc) makes a page unmovable ? Preferably, with pointers to relevant code ? I guess it is in linux/mm/migrate.c, but I am getting lost trying to figure out which parts are NUMA specific and which are not; and particularly, what triggers the migration.

Interestingly, Xorg guys claim X server does nothing special with the memory acquired by malloc() for the composition buffer. Yet, so far no corruption of the displayed images has been observed. Maybe a single page vma (that stores the ring buffer) is particularly attractive for the migration/defragmentation algorithm, and that is why it is easy to trigger its relocation (but not so with the composition buffer case) ?

Once again thanks a lot for your explanations.

Regards,
Rafal Wojtczuk
The Qubes OS Project
http://qubes-os.org
Jeremy Fitzhardinge
2010-Apr-19 16:48 UTC
Re: [Xen-devel] The mfn of the frame, that holds a mlock-ed PV domU usermode page, can change
On 04/19/2010 04:25 AM, Rafal Wojtczuk wrote:
> On Mon, Apr 12, 2010 at 01:39:43PM -0700, Jeremy Fitzhardinge wrote:
>
>> But I assume you have other code which wants to grant through the
>> Xorg-allocated framebuffer. That complicates things a bit, but you could
>> still add a device (no /proc files, please) with an ioctl which:
>>
>> 1. takes a range of usermode addresses
>> 2. increments the page refcount for those pages
>>
> Or, do not decrement the count, by not calling put_page() ?
>
>
>> 3. returns the mfns for those pages
>>
>> That will prevent the pages from being migrated while you're referring
>> to their mfns.
>>
> After removing the call to put_page() in u2mfn_ioctl(), see once again
> http://gitweb.qubes-os.org/gitweb/?p=mainstream/gui.git;a=blob;f=vchan/u2mfn/u2mfn.c;h=6ff113c07c50ef078ab04d9e61d2faab338357e7;hb=HEAD#l35
> the page's mfn changed again.
> Even commenting out the kunmap() call in this function did not help either.
> Am I missing something ?

It definitely shouldn't be possible to move a page with a non-zero refcount. So it looks like something else is going on there. Even if the process exits, those pages should remain in unusable limbo rather than being freed and reallocated.

> The only working way (for the ring buffer case) is to acquire memory via
> kmalloc and pass it to userspace via remap_pfn_range. But this is unsuitable
> for the case of X composition buffers, because we don't want to alter the
> way X allocates memory (it calls plain malloc). We could hijack X's malloc()
> via LD_PRELOAD, but then we cannot distinguish which calls are made because
> of composition buffer allocation.

Yes. Unfortunately that has its own set of problems. For example, if the X server wants to fork for some reason then you become subject to the whims of COW as to what page is being used in which process.

But it seems to me you're operating at the wrong architectural level here.
I fully understand your short-term goal is "get it working", but I think you're going to want to revise this for v2.0. Your architecture is not very different from a standard CPU+GPU compositing setup, except your "GPU" is actually dom0 (which of course may be really using the GPU). X should already have all the interfaces you need to efficiently pass an application's compositing buffer to the "GPU" for rendering. (Maybe you need to do a "Xen DRI" driver to implement this?)

>> You need to add something to explicitly decrement the
>> refcount to prevent a memory leak, presumably at the time you tear down
>> the mapping in dom0. Ideally you'd arrange to do that triggered off
>> unmap of the memory range (by isolating the pages in their own new vma)
>> so that it all gets cleaned up on process exit.
>>
> By "triggered off unmap" do you mean setting the vm_ops field in struct
> vm_area_struct to a custom struct vm_operations_struct (particularly, with a
> custom close() method), or is there something simpler ?

Yes, that's what I had in mind. You'd need to chop the VMA up to isolate the virtual address range you want to apply the close to. But that assumes your range doesn't already have a close method of course; it gets awkward if it does.

>> I'm not at all familiar with how X manages composition buffers, but it
>> seems to me that in normal use, one would want to be able to either
>> allocate that buffer in texture memory (so it can be used as a texture
>> source), or at least copy updates into texture memory. Couldn't you
>> hook into that transfer to the composition hardware (ie, dom0)?
>>
> We are talking about X running in domU; there is no related hardware.
> We can determine where the composition buffer is only after it has
> been allocated.

(See above.)

>> No, kernel allocations are not movable by default.
>>
> Could you mention a few more details on the related migration mechanism ?
> E.g. which PG_ flag (set by kmalloc) makes a page unmovable ? Preferably,
> with pointers to relevant code ?

__GFP_MOVABLE is the key thing to look at. It causes page allocation to allocate the page in a movable zone. All user memory is allocated with GFP_HIGHUSER_MOVABLE (in do_wp_page(), for example), which means that the memory needn't be directly addressable by the kernel (HIGHUSER), and can be moved or reclaimed when necessary (MOVABLE).

> I guess it is in linux/mm/migrate.c, but I am getting lost
> trying to figure out which parts are NUMA specific and which are not; and
> particularly, what triggers the migration.

TBH I've never really looked into the mechanisms of how it works. But I think mm/migrate.c is actually something else, relating to moving pages around between NUMA nodes. I had a quick look at it just now, and migration definitely seems to happen on demand in the buddy allocator (mm/page_alloc.c), if it can't satisfy a memory request. I don't know whether it tries to actively move pages around to decrease fragmentation.

> Interestingly, Xorg guys claim X server does nothing special with the memory
> acquired by malloc() for the composition buffer. Yet, so far no corruption
> of the displayed images has been observed. Maybe a single page vma (that
> stores the ring buffer) is particularly attractive for the
> migration/defragmentation algorithm, and that is why it is easy to trigger
> its relocation (but not so with the composition buffer case) ?

Hm, that doesn't ring true. AFAIK all migration happens at the page level with no reference to VMAs (though it's possible that being mapped into a process address space makes a page temporarily unmigratable, and it needs to wait for something to shoot down/age out the ptes before migrating the page). Again, I'm not well versed in the details.

It's quite possible that the problem you're seeing has nothing to do with page migration at all, and this is a goose chase.

J