Alexander Duyck
2020-Dec-23 16:39 UTC
[RFC PATCH 1/3] mm: support hugetlb free page reporting
On Tue, Dec 22, 2020 at 7:39 PM Liang Li <liliang324 at gmail.com> wrote:> > > > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev, > > > + struct hstate *h, unsigned int nid, > > > + struct scatterlist *sgl, unsigned int *offset) > > > +{ > > > + struct list_head *list = &h->hugepage_freelists[nid]; > > > + unsigned int page_len = PAGE_SIZE << h->order; > > > + struct page *page, *next; > > > + long budget; > > > + int ret = 0, scan_cnt = 0; > > > + > > > + /* > > > + * Perform early check, if free area is empty there is > > > + * nothing to process so we can skip this free_list. > > > + */ > > > + if (list_empty(list)) > > > + return ret; > > > + > > > + spin_lock_irq(&hugetlb_lock); > > > + > > > + if (huge_page_order(h) > MAX_ORDER) > > > + budget = HUGEPAGE_REPORTING_CAPACITY; > > > + else > > > + budget = HUGEPAGE_REPORTING_CAPACITY * 32; > > > > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we > > don't even really need budget since this should probably be pulling > > out no more than one hugepage at a time. > > I want to disting a 2M page and 1GB page here. The order of 1GB page is greater > than MAX_ORDER while 2M page's order is less than MAX_ORDER.The budget here is broken. When I put the budget in page reporting it was so that we wouldn't try to report all of the memory in a given region. It is meant to hold us to no more than one pass through 1/16 of the free memory. So essentially we will be slowly processing all of memory and it will take 16 calls (32 seconds) for us to process a system that is sitting completely idle. It is meant to pace us so we don't spend a ton of time doing work that will be undone, not to prevent us from burying a CPU which is what seems to be implied here. Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it in the original definition because it was how many pages we could scoop out at a time and then I was aiming for a 16th of that. Here you are arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the amount of work you will doo since you are using it as a multiple instead of a divisor.> > > > > + /* loop through free list adding unreported pages to sg list */ > > > + list_for_each_entry_safe(page, next, list, lru) { > > > + /* We are going to skip over the reported pages. */ > > > + if (PageReported(page)) { > > > + if (++scan_cnt >= MAX_SCAN_NUM) { > > > + ret = scan_cnt; > > > + break; > > > + } > > > + continue; > > > + } > > > + > > > > It would probably have been better to place this set before your new > > set. I don't see your new set necessarily being the best use for page > > reporting. > > I haven't really latched on to what you mean, could you explain it again?It would be better for you to spend time understanding how this patch set works before you go about expanding it to do other things. Mistakes like the budget one above kind of point out the fact that you don't understand how this code was supposed to work and just kind of shoehorned you page zeroing code onto it. It would be better to look at trying to understand this code first before you extend it to support your zeroing use case. So adding huge pages first might make more sense than trying to zero and push the order down. The fact is the page reporting extension should be minimal for huge pages since they are just passed as a scatterlist so you should only need to add a small bit to page_reporting.c to extend it to support this use case.> > > > > + /* > > > + * If we fully consumed our budget then update our > > > + * state to indicate that we are requesting additional > > > + * processing and exit this list. > > > + */ > > > + if (budget < 0) { > > > + atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED); > > > + next = page; > > > + break; > > > + } > > > + > > > > If budget is only ever going to be 1 then we probably could just look > > at making this the default case for any time we find a non-reported > > page. > > and here again.It comes down to the fact that the changes you made have a significant impact on how this is supposed to function. Reducing the scatterlist to a size of one makes the whole point of doing batching kind of pointless. Basically the code should be rewritten with the assumption that if you find a page you report it. The old code would batch things up because there is significant overhead to be addressed when going to the hypervisor to report said memory. Your code doesn't seem to really take anything like that into account and instead is using an arbitrary budget value based on the page size.> > > + /* Attempt to pull page from list and place in scatterlist */ > > > + if (*offset) { > > > + isolate_free_huge_page(page, h, nid); > > > + /* Add page to scatter list */ > > > + --(*offset); > > > + sg_set_page(&sgl[*offset], page, page_len, 0); > > > + > > > + continue; > > > + } > > > + > > > > There is no point in the continue case if we only have a budget of 1. > > We should probably just tighten up the loop so that all it does is > > search until it finds the 1 page it can pull, pull it, and then return > > it. The scatterlist doesn't serve much purpose and could be reduced to > > just a single entry. > > I will think about it more. > > > > +static int > > > +hugepage_reporting_process_hstate(struct page_reporting_dev_info *prdev, > > > + struct scatterlist *sgl, struct hstate *h) > > > +{ > > > + unsigned int leftover, offset = HUGEPAGE_REPORTING_CAPACITY; > > > + int ret = 0, nid; > > > + > > > + for (nid = 0; nid < MAX_NUMNODES; nid++) { > > > + ret = hugepage_reporting_cycle(prdev, h, nid, sgl, &offset); > > > + > > > + if (ret < 0) > > > + return ret; > > > + } > > > + > > > + /* report the leftover pages before going idle */ > > > + leftover = HUGEPAGE_REPORTING_CAPACITY - offset; > > > + if (leftover) { > > > + sgl = &sgl[offset]; > > > + ret = prdev->report(prdev, sgl, leftover); > > > + > > > + /* flush any remaining pages out from the last report */ > > > + spin_lock_irq(&hugetlb_lock); > > > + hugepage_reporting_drain(prdev, h, sgl, leftover, !ret); > > > + spin_unlock_irq(&hugetlb_lock); > > > + } > > > + > > > + return ret; > > > +} > > > + > > > > If HUGEPAGE_REPORTING_CAPACITY is 1 it would make more sense to > > rewrite this code to just optimize for a find and process a page > > approach rather than trying to batch pages. > > Yes, I will make a change. Thanks for your comments!Lastly I would recommend setting up and testing page reporting with the virtio-balloon driver. I worry that your patch set would have a significant negative impact on the performance of it. As I mentioned before it was designed to be more of a leaky bucket solution to reporting memory and was supposed to take about 30 seconds for it to flush all of the memory in a guest. Your changes seem to be trying to do a much more aggressive task and I worry that what you are going to find is that it will easily push CPUs to 100% on an active system since it will be aggressively trying to zero memory as soon as it is freed rather than taking it at a slower pace.