Nadav Amit
2022-Oct-10 14:47 UTC
[RFC PATCH v4 2/7] Enable balloon drivers to report inflated memory
On Oct 10, 2022, at 12:24 AM, Alexander Atanasov <alexander.atanasov at virtuozzo.com> wrote:

> Hello,
>
> On 10.10.22 9:18, Nadav Amit wrote:
>> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <alexander.atanasov at virtuozzo.com> wrote:
>>> So with all balloon drivers the guest gets a large amount of RAM on
>>> boot, then the balloon is inflated. But these places have already
>>> been initialized, and they assume the system has the given amount of
>>> totalram, which is no longer true the moment they start to operate.
>>> The result is that too much space gets used and it degrades userspace
>>> performance.
>>> Example - fs/eventpoll.c:static int __init eventpoll_init(void) - 4%
>>> of RAM for eventpoll - when you inflate half of the RAM it becomes 8%
>>> of the RAM - do you really need 8% of your RAM to be used for
>>> eventpoll?
>>>
>>> To solve this you need to register, and when notified update cache
>>> sizes, limits, and whatever else is calculated from the amount of
>>> memory.
>> Hmm.. Not sure about all of that. Most balloon drivers are manually
>> managed, and call adjust_managed_page_count(), and as a result might
>> want to redo all the calculations that are based on totalram_pages().
>
> Yes, I will say that it looks mixed: manual for large changes and
> automatic for small changes. VMware and Hyper-V have automatic and
> manual /not sure exactly what you can change on a running VM, but I
> guess you can/ - virtio is only manual. I do not know about dlpar /
> xen.
>
> The scenario is like this: start a VM with 4GB RAM, reduce to 2GB with
> the balloon - the VM can be upgraded later.
>
> All we are talking about relates to memory hotplug/unplug /where unplug
> is close to nonexistent, hence the balloons are used/.
>
> All values should be recalculated on memory hotplug too, so you can use
> the newly available RAM.
>
> RAM is the most valuable resource of all, so I consider using it
> optimally to be of great importance.
>
>> Side-note: That's not the case for the VMware balloon. I actually
>> considered calling adjust_managed_page_count() just to conform with
>> other balloon drivers. But since we use totalram_pages() to
>> communicate the total-ram to the hypervisor, this would create an
>> endless (and wrong) feedback loop. I am not claiming it is not
>> possible for the VMware balloon driver to call
>> adjust_managed_page_count(), but the chances are that it would create
>> more harm than good.
>
> Virtio does both - depending on the deflate-on-OOM option. I already
> suggested unifying all drivers to inflate the used memory, as it seems
> more logical to me since nobody expects totalram_pages() to change, but
> the current state is that both ways are accepted and changing them can
> break existing users.
> See the discussion here:
> https://lore.kernel.org/lkml/20220809095358.2203355-1-alexander.atanasov at virtuozzo.com/

Thanks for the reminder. I wish you could somehow summarize all of that
into the cover-letter and/or the commit messages for these patches.

>> Back to the matter at hand. It seems that you wish that the notifiers
>> would be called following any changes that would be reflected in
>> totalram_pages(). So, doesn't it make more sense to call it from
>> adjust_managed_page_count()?
>
> It will hurt performance - all drivers work page by page, i.e. they
> update by +1/-1, and they do so under locks, which as you already noted
> can lead to bad things.
> The notifier will accumulate the change and let its users know how much
> has changed, so they can decide if they have to recalculate - it can
> even do so asynchronously in order to not disturb the drivers.
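If I understand the proposal correctly, it is roughly something along
these lines (a sketch just so we are talking about the same thing; all
names below are made up, not existing kernel interfaces):

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

/* Hypothetical sketch of the "accumulating notifier" idea. */
static atomic_long_t balloon_pages_delta;	/* accumulated +/- pages */
static BLOCKING_NOTIFIER_HEAD(balloon_notifier_list);

static void balloon_notify_workfn(struct work_struct *work);
static DECLARE_WORK(balloon_notify_work, balloon_notify_workfn);

/* Hot path: drivers call this per page; only an atomic add here. */
void balloon_note_page(long delta)
{
	atomic_long_add(delta, &balloon_pages_delta);
	schedule_work(&balloon_notify_work);
}

/*
 * Deferred path: drain the accumulated delta and call the notifier
 * chain outside of any driver locks.
 */
static void balloon_notify_workfn(struct work_struct *work)
{
	long delta = atomic_long_xchg(&balloon_pages_delta, 0);

	if (delta != 0)
		blocking_notifier_call_chain(&balloon_notifier_list, 0,
					     &delta);
}

The work item would be the whole point: the chain runs outside the
balloon drivers' locks, and users see one batched delta instead of the
+1/-1 updates.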
So updating the counters by 1 is ok (using an atomic operation, which is
not free)? And the reason it is (relatively) cheap is that nobody
actually looks at the value (i.e., nobody actually acts on the value)?

If nobody considers the value, then doesn't it make sense just to update
it less frequently, and then call the notifiers?

>>> The difference is here:
>>>
>>> mm/zswap.c:	return totalram_pages() * zswap_max_pool_percent / 100 <
>>> mm/zswap.c:	return totalram_pages() * zswap_accept_thr_percent / 100
>>>
>>> It uses percents, and you can recalculate easily with:
>>>
>>> +static inline unsigned long totalram_pages_current(void)
>>> +{
>>> +	unsigned long inflated = 0;
>>> +#ifdef CONFIG_MEMORY_BALLOON
>>> +	extern atomic_long_t mem_balloon_inflated_free_kb;
>>> +	inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>>> +	inflated >>= (PAGE_SHIFT - 10);
>>> +#endif
>>> +	return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>>> +}

So we have here two values, and there appears to be a hidden assumption
that they are both updated atomically. Otherwise, inflated might
theoretically be greater than _totalram_pages, we get a negative value,
and all hell breaks loose. But _totalram_pages and
mem_balloon_inflated_free_kb are not updated atomically together (each
one is on its own, but not the two together).

>>> And you are good when you switch to the _current version -
>>> si_meminfo_current is alike.
>>>
>>> On init (probably) all use some kind of fractions to calculate, but
>>> when there is a set value, via /proc/sys/net/ipv4/tcp_wmem for
>>> example, it is just a value and you cannot recalculate it. And here,
>>> please, share your ideas how to solve this.
>> I don't get all of that. Now that you provided some more explanations,
>> it sounds that what you want is adjust_managed_page_count(), which we
>> already have and which affects the output of totalram_pages().
>> Therefore, totalram_pages() anyhow accounts for the balloon memory
>> (excluding VMware's). So why do we need to take
>> mem_balloon_inflated_free_kb into account?
>
> Ok, you have this:
>
>  / totalram
> |----used----|b1|----free------|b2|
>
> Drivers can inflate both b1 and b2: with b1, free gets smaller; with
> b2, totalram pages get smaller. So when you need totalram_pages() to do
> a calculation, you need to adjust it with the pages that are inflated
> in free/used (b1). VMware is not an exception; virtio does the same.
> And according to mst and davidh it is okay like this.
> So I am proposing a way to handle both cases.

Ugh. What about the BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can't
this information be used instead of yet another counter? Unless, of
course, you get the atomicity that I mentioned before.

>> Sounds to me that all you want is some notifier to be called from
>> adjust_managed_page_count(). What am I missing?
>
> The notifier will act as an accumulator to report the size of the
> change, and it will make things easier for the drivers and users wrt
> locking.
> The notifier is similar to the memory hotplug notifier.

Overall, I am not convinced that there is any value in separating the
value and the notifier. You can batch both or not batch both. In
addition, as I mentioned, having two values seems racy.
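To illustrate that last point with your own snippet: even if the
subtraction is clamped so it cannot wrap around, the two counters are
still read at different times, so the result can be transiently wrong
(illustrative sketch only, based on the snippet above):

static inline unsigned long totalram_pages_current(void)
{
	unsigned long total = (unsigned long)atomic_long_read(&_totalram_pages);

#ifdef CONFIG_MEMORY_BALLOON
	extern atomic_long_t mem_balloon_inflated_free_kb;
	unsigned long inflated;

	inflated = (unsigned long)atomic_long_read(&mem_balloon_inflated_free_kb);
	inflated >>= (PAGE_SHIFT - 10);	/* kB -> pages */

	/*
	 * _totalram_pages and mem_balloon_inflated_free_kb are read at
	 * different times, so a concurrent inflate/deflate can make
	 * inflated > total here.  Clamping avoids the wrap-around, but
	 * the returned value may still be momentarily wrong.
	 */
	if (inflated >= total)
		return 0;
	total -= inflated;
#endif
	return total;
}

Batching the update of a single value, and calling the notifiers when it
changes, would avoid the problem entirely.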