On Tue 25-07-17 19:56:24, Wei Wang wrote:
> On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>[...
> >>>>>All you should need is the check for the page reference count, no? I
> >>>>>assume you do some sort of pfn walk and so you should be able to get an
> >>>>>access to the struct page.
> >>>>Not necessarily - the guest struct page is not seen by the hypervisor. The
> >>>>hypervisor only gets those guest pfns which are hinted as unused. From the
> >>>>hypervisor (host) point of view, a guest physical address corresponds to a
> >>>>virtual address of a host process. So, once the hypervisor knows a guest
> >>>>physical page is unused, it knows that the corresponding virtual memory of
> >>>>the process doesn't need to be transferred in the 1st round.
> >>>I am sorry, but I do not understand. Why cannot _guest_ simply check the
> >>>struct page ref count and send them to the hypervisor?
> >>Were you suggesting the following?
> >>1) get a free page block from the page list using the API;
> >No. Use a pfn walk, check the reference count and skip those pages which
> >have 0 ref count.
>
>
> "pfn walk" - do you mean start from the first pfn, and scan all the pfns
> that the VM has?

yes

> >I suspected that you need to do some sort of the pfn
> >walk anyway because you somehow have to evaluate a memory to migrate,
> >right?
>
> We don't need to do the pfn walk in the guest kernel. When the API
> reports, for example, a 2MB free page block, the API caller offers to
> the hypervisor the base address of the page block, and size=2MB, to
> the hypervisor.

So you want to skip pfn walks by regularly calling into the page
allocator to update your bitmap. If that is the case then would an API
that would allow you to update your bitmap via a callback be
sufficient?
Something like
	void walk_free_mem(int node, int min_order,
			void (*visit)(unsigned long pfn, unsigned long nr_pages))

The function will call the given callback for each free memory block on
the given node starting from the given min_order. The callback will be
strictly an atomic and very light context. You can update your bitmap
from there.

This would address my main concern that the allocator internals would
get outside of the allocator proper. A nasty callback which would be too
expensive could still stall other allocations and cause latencies but
the locking will be under mm core control at least.

Does that sound useful?
-- 
Michal Hocko
SUSE Labs
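
To make the proposed interface concrete, here is a minimal userspace sketch of how a walk_free_mem()-style callback could drive a caller-owned bitmap. It is not kernel code: only the walk_free_mem() prototype comes from the message above. The toy free list (demo_free_list), the callback (mark_free), the bitmap (free_bitmap) and the helper pfn_is_free() are hypothetical names made up for illustration, and the node argument is ignored. A real implementation would walk the allocator's own free lists with the locking held inside the mm core.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one free block: first pfn and buddy order. */
struct free_block {
	unsigned long pfn;	/* first page frame number of the block */
	int order;		/* block covers 1UL << order pages */
};

/* Toy stand-in for the allocator's per-node free lists (assumption). */
const struct free_block demo_free_list[] = {
	{ 0x100, 0 },	/* 1 page */
	{ 0x200, 9 },	/* 512 pages, i.e. 2MB with 4KB pages */
	{ 0x800, 10 },	/* 1024 pages */
};

#define DEMO_PAGES	4096UL
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

/* Caller-owned bitmap: a set bit means "this pfn was reported free". */
unsigned long free_bitmap[DEMO_PAGES / BITS_PER_WORD];

/* The callback: short and non-blocking, it only sets bits. */
void mark_free(unsigned long pfn, unsigned long nr_pages)
{
	for (unsigned long p = pfn; p < pfn + nr_pages && p < DEMO_PAGES; p++)
		free_bitmap[p / BITS_PER_WORD] |= 1UL << (p % BITS_PER_WORD);
}

/* Visit every free block of at least min_order; node is unused here. */
void walk_free_mem(int node, int min_order,
		   void (*visit)(unsigned long pfn, unsigned long nr_pages))
{
	size_t n = sizeof(demo_free_list) / sizeof(demo_free_list[0]);

	(void)node;
	assert(visit != NULL);

	for (size_t i = 0; i < n; i++) {
		if (demo_free_list[i].order >= min_order)
			visit(demo_free_list[i].pfn,
			      1UL << demo_free_list[i].order);
	}
}

/* Helper to query the bitmap after a walk. */
int pfn_is_free(unsigned long pfn)
{
	return (free_bitmap[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1;
}
```

Calling walk_free_mem(0, 9, mark_free) in this model marks only the two blocks of order 9 and above and skips the single order-0 page. The caller never touches the free list structure itself, which is the separation the proposal is after.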
On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > >>>[...
> > We don't need to do the pfn walk in the guest kernel. When the API
> > reports, for example, a 2MB free page block, the API caller offers to
> > the hypervisor the base address of the page block, and size=2MB, to
> > the hypervisor.
>
> So you want to skip pfn walks by regularly calling into the page allocator to
> update your bitmap. If that is the case then would an API that would allow you
> to update your bitmap via a callback be sufficient? Something like
> void walk_free_mem(int node, int min_order,
> void (*visit)(unsigned long pfn, unsigned long nr_pages))
>
> The function will call the given callback for each free memory block on the given
> node starting from the given min_order. The callback will be strictly an atomic
> and very light context. You can update your bitmap from there.

I would need to introduce more about the background here:
The hypervisor and the guest live in their own address space. The hypervisor's bitmap
isn't seen by the guest. I think we also wouldn't be able to give a callback function
from the hypervisor to the guest in this case.

>
> This would address my main concern that the allocator internals would get
> outside of the allocator proper.

What issue would it have to expose the internal, for_each_zone()?
I think new code which would call it will also be strictly checked when they are
pushed to upstream.

Best,
Wei
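
The host-side behaviour described earlier in the thread (the guest reports a base address and a size, and the hypervisor then skips that memory in the first transfer round) can be sketched roughly as below. This is an illustration under assumptions, not code from the patch series or from any hypervisor: migrate_bitmap, migrate_bitmap_init(), report_free_range() and page_needs_transfer() are hypothetical names, and a 4KB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SHIFT	12		/* assumption: 4KB pages */
#define GUEST_PAGES	2048UL		/* toy guest: 8MB of memory */
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

/* Host-private bitmap, never visible to the guest: 1 = must transfer. */
unsigned long migrate_bitmap[GUEST_PAGES / BITS_PER_WORD];

/* Start of migration: every guest page is a transfer candidate. */
void migrate_bitmap_init(void)
{
	memset(migrate_bitmap, 0xff, sizeof(migrate_bitmap));
}

/*
 * Handle one hint from the guest: a free block given as guest physical
 * base address plus size in bytes. The covered pages are dropped from
 * the first-round transfer set.
 */
void report_free_range(uint64_t gpa, uint64_t size)
{
	unsigned long first = (unsigned long)(gpa >> DEMO_PAGE_SHIFT);
	unsigned long nr = (unsigned long)(size >> DEMO_PAGE_SHIFT);

	for (unsigned long p = first; p < first + nr && p < GUEST_PAGES; p++)
		migrate_bitmap[p / BITS_PER_WORD] &= ~(1UL << (p % BITS_PER_WORD));
}

/* Does this guest pfn still need to be sent in the first round? */
int page_needs_transfer(unsigned long pfn)
{
	return (migrate_bitmap[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1;
}
```

In this model the guest only ever sends (address, size) pairs. The bitmap stays on the host, which matches the point above that the hypervisor's bitmap is not seen by the guest.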
On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> > On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > > >>>[...
> > > We don't need to do the pfn walk in the guest kernel. When the API
> > > reports, for example, a 2MB free page block, the API caller offers to
> > > the hypervisor the base address of the page block, and size=2MB, to
> > > the hypervisor.
> >
> > So you want to skip pfn walks by regularly calling into the page allocator to
> > update your bitmap. If that is the case then would an API that would allow you
> > to update your bitmap via a callback be sufficient? Something like
> > void walk_free_mem(int node, int min_order,
> > void (*visit)(unsigned long pfn, unsigned long nr_pages))
> >
> > The function will call the given callback for each free memory block on the given
> > node starting from the given min_order. The callback will be strictly an atomic
> > and very light context. You can update your bitmap from there.
>
> I would need to introduce more about the background here:
> The hypervisor and the guest live in their own address space. The hypervisor's bitmap
> isn't seen by the guest. I think we also wouldn't be able to give a callback function
> from the hypervisor to the guest in this case.

How did you plan to use your original API which export struct page array
then?

> > This would address my main concern that the allocator internals would get
> > outside of the allocator proper.
>
> What issue would it have to expose the internal, for_each_zone()?

zone is a MM internal concept. No code outside of the MM proper should
really care about zones.
-- 
Michal Hocko
SUSE Labs