Hello,

I have been testing autoballooning on a production Xen system today (with
cleancache + frontswap on Xen-provided tmem). For most of the idle or
CPU-centric VMs it seems to work just fine.

However, on one of the web-serving VMs there is also a cron job running
every few minutes which walks a rather large directory (and this directory
is on OCFS2, so this is a rather time-consuming process). As long as the
dcache/inode cache was large enough (which it was before, since the VM was
allocated 4 GB and only uses 1-2 GB most of the time), this was not a
problem.

Now, with self-ballooning, the memory gets reduced to somewhere between 1
and 2 GB, and after a few minutes the load goes through the ceiling. Jobs
reading through said directories pile up (stuck in D state, waiting for the
FS), and most of the time kswapd is spinning at 100%. If I deactivate
self-ballooning and assign the VM 3 GB, everything goes back to normal
after a few minutes (and "ls -l" on said directory is served from the cache
again).

Now, I am aware that this problem is partly self-made. The directory was
not actually supposed to contain that many files, and the next job not
waiting for the previous one to terminate is asking for trouble - but
still, I would consider this a possible regression, since it seems
self-ballooning is constantly thrashing the VM's caches. Not all caches can
be saved in cleancache.

What about an additional tunable: a user-specified amount of pages that is
added on top of the computed target number of pages? This way, one could
manually reserve a bit more room for other types of caches. (In fact, I
might try this myself, since it shouldn't be too hard to do.)

Any opinions on this?

Thank you,
Jana
> From: Jana Saout [mailto:jana@saout.de]
> Subject: [Xen-devel] Self-ballooning question / cache issue
>
> [...]
>
> What about an additional tunable: a user-specified amount of pages that
> is added on top of the computed target number of pages? This way, one
> could manually reserve a bit more room for other types of caches. (In
> fact, I might try this myself, since it shouldn't be too hard to do.)
>
> Any opinions on this?

Hi Jana --

Thanks for doing this analysis. While your workload is a bit unusual, I
agree that you have exposed a problem that will need to be resolved. It
was observed three years ago that the next "frontend" for tmem could
handle a cleancache-like mechanism for the dcache. Until now, I had
thought that this was purely optional and would yield only a small
performance improvement. But with your workload, I think the combination
of selfballooning forcing out dcache entries and those entries not being
saved in tmem is causing the problem you are seeing.

I think the best solution for this will be a "cleandcache" patch in the
Linux guest... but given how long it has taken to get cleancache and
frontswap into the kernel (and the fact that a working cleandcache patch
doesn't even exist yet), I wouldn't hold my breath ;-) I will put it on
the "to do" list though.

Your idea of the tunable is interesting (and patches are always welcome!)
but I am skeptical that it will solve the problem, since I would guess the
Linux kernel shrinks the dcache roughly in proportion to the size of the
page cache. So even with your "user-specified amount of pages that is
added on top of the computed target number of pages", the RAM will still
be shared across all caches and only some small portion of the added RAM
will likely be used for the dcache.

However, if you have a chance to try it, I would be interested in your
findings. Note that you can already set a permanent floor for
selfballooning ("min_usable_mb") or, of course, just turn off
selfballooning altogether.

Thanks,
Dan
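
P.S. For reference, min_usable_mb (like the other selfballoon tunables)
can be changed at runtime through sysfs. Assuming the selfballoon
attributes live in the usual place next to the Xen balloon device (the
exact path may differ depending on your kernel), something like

    echo 512 > /sys/devices/system/xen_memory/xen_memory0/selfballoon_min_usable_mb

(as root) would keep selfballooning from ever shrinking the guest below
512 MB of usable RAM.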
Hi Dan,

> > [...]
>
> Thanks for doing this analysis. While your workload is a bit unusual, I
> agree that you have exposed a problem that will need to be resolved. It
> was observed three years ago that the next "frontend" for tmem could
> handle a cleancache-like mechanism for the dcache. Until now, I had
> thought that this was purely optional and would yield only a small
> performance improvement. But with your workload, I think the combination
> of selfballooning forcing out dcache entries and those entries not being
> saved in tmem is causing the problem you are seeing.

Yes. In fact, I've been rolling out selfballooning across a development
system and most VMs were just fine with the defaults. The overall memory
savings from going from a static to a dynamic memory allocation is quite
significant - without the VMs having to resort to actual to-disk paging
when there is a sudden increase in memory usage. Quite nice.

Just for information: the filesystem this machine is using is OCFS2
(shared across 5 VMs) and the directory contains 45k files (*cough* - I'm
aware that's not optimal; I'm currently talking to the developer of that
application about not scanning the entire list of files every minute) -
so a scan takes a few minutes (especially stat'ing every file).

I have also been observing that kswapd seems rather busy at times on some
VMs, even when there is no actual swapping taking place (or could it be
frontswap, or just page reclaim?). This can be mitigated by increasing
the memory reserve a bit using my trivial test patch (see below).

> I think the best solution for this will be a "cleandcache" patch in the
> Linux guest... but given how long it has taken to get cleancache and
> frontswap into the kernel (and the fact that a working cleandcache patch
> doesn't even exist yet), I wouldn't hold my breath ;-) I will put it on
> the "to do" list though.

That sounds nice!

> Your idea of the tunable is interesting (and patches are always welcome!)
> but I am skeptical that it will solve the problem, since I would guess the
> Linux kernel shrinks the dcache roughly in proportion to the size of the
> page cache. So even with your "user-specified amount of pages that is
> added on top of the computed target number of pages", the RAM will still
> be shared across all caches and only some small portion of the added RAM
> will likely be used for the dcache.

That's true. In fact, I have to add about 1 GB of memory in order to keep
the relevant dcache / inode cache entries in the cache. When I do that,
the largest portion of memory is still eaten up by the regular page
cache. So this is more of a workaround than a solution, but for now it
works.

I've attached the simple patch I've whipped up below.

> However, if you have a chance to try it, I would be interested in your
> findings. Note that you can already set a permanent floor for
> selfballooning ("min_usable_mb") or, of course, just turn off
> selfballooning altogether.

Sure, that's always a possibility. However, the VM already had an overly
large amount of memory before to avoid the problem. Now it runs with less
memory (still a bit more than required), and when a load spike comes, it
can quickly balloon up, which is exactly what I was looking for.

Jana

----
Author: Jana Saout <jana@saout.de>
Date:   Sun Apr 29 22:09:29 2012 +0200

    Add selfballooning memory reservation tunable.

diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
index 146c948..7d041cb 100644
--- a/drivers/xen/xen-selfballoon.c
+++ b/drivers/xen/xen-selfballoon.c
@@ -105,6 +105,12 @@ static unsigned int selfballoon_interval __read_mostly = 5;
  */
 static unsigned int selfballoon_min_usable_mb;
 
+/*
+ * Amount of RAM in MB to add to the target number of pages.
+ * Can be used to reserve some more room for caches and the like.
+ */
+static unsigned int selfballoon_reserved_mb;
+
 static void selfballoon_process(struct work_struct *work);
 static DECLARE_DELAYED_WORK(selfballoon_worker, selfballoon_process);
 
@@ -217,7 +223,8 @@ static void selfballoon_process(struct work_struct *work)
 		cur_pages = totalram_pages;
 		tgt_pages = cur_pages; /* default is no change */
 		goal_pages = percpu_counter_read_positive(&vm_committed_as) +
-				totalreserve_pages;
+				totalreserve_pages +
+				MB2PAGES(selfballoon_reserved_mb);
 #ifdef CONFIG_FRONTSWAP
 		/* allow space for frontswap pages to be repatriated */
 		if (frontswap_selfshrinking && frontswap_enabled)
@@ -397,6 +404,30 @@ static DEVICE_ATTR(selfballoon_min_usable_mb, S_IRUGO | S_IWUSR,
 		   show_selfballoon_min_usable_mb,
 		   store_selfballoon_min_usable_mb);
 
+SELFBALLOON_SHOW(selfballoon_reserved_mb, "%d\n",
+		 selfballoon_reserved_mb);
+
+static ssize_t store_selfballoon_reserved_mb(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf,
+					     size_t count)
+{
+	unsigned long val;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return -EINVAL;
+	selfballoon_reserved_mb = val;
+	return count;
+}
+
+static DEVICE_ATTR(selfballoon_reserved_mb, S_IRUGO | S_IWUSR,
+		   show_selfballoon_reserved_mb,
+		   store_selfballoon_reserved_mb);
+
 
 #ifdef CONFIG_FRONTSWAP
 SELFBALLOON_SHOW(frontswap_selfshrinking, "%d\n", frontswap_selfshrinking);
@@ -480,6 +511,7 @@ static struct attribute *selfballoon_attrs[] = {
 	&dev_attr_selfballoon_downhysteresis.attr,
 	&dev_attr_selfballoon_uphysteresis.attr,
 	&dev_attr_selfballoon_min_usable_mb.attr,
+	&dev_attr_selfballoon_reserved_mb.attr,
 #ifdef CONFIG_FRONTSWAP
 	&dev_attr_frontswap_selfshrinking.attr,
 	&dev_attr_frontswap_hysteresis.attr,
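
(Usage note: the new tunable appears alongside the existing selfballoon
knobs in sysfs. Assuming the attributes sit in the usual location for the
Xen balloon device - the exact path may differ depending on the kernel -
reserving an extra gigabyte on top of the computed target looks like

    echo 1024 > /sys/devices/system/xen_memory/xen_memory0/selfballoon_reserved_mb

and reading the file back returns the currently configured reserve in MB.
Writing it requires CAP_SYS_ADMIN.)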
> From: Jana Saout [mailto:jana@saout.de]
> Subject: Re: [Xen-devel] Self-ballooning question / cache issue

Hi Jana --

Since you have tested this patch and have found it useful, and since its
use is entirely optional, it is OK with me for it to be upstreamed at the
next window. Konrad cc'ed.

You will need to add a Signed-off-by line to the patch but other than
that you can consider it

Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

> > [...]
>
> That's true. In fact, I have to add about 1 GB of memory in order to keep
> the relevant dcache / inode cache entries in the cache. When I do that,
> the largest portion of memory is still eaten up by the regular page
> cache. So this is more of a workaround than a solution, but for now it
> works.
>
> I've attached the simple patch I've whipped up below.
>
> [...]
On Wed, May 02, 2012 at 10:51:12AM -0700, Dan Magenheimer wrote:
> > From: Jana Saout [mailto:jana@saout.de]
> > Subject: Re: [Xen-devel] Self-ballooning question / cache issue
>
> Hi Jana --
>
> Since you have tested this patch and have found it useful, and since its
> use is entirely optional, it is OK with me for it to be upstreamed at the
> next window. Konrad cc'ed.
>
> You will need to add a Signed-off-by line to the patch but other than
> that you can consider it
>
> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

Looks good. Can you resend it with the right tags to xen-devel and lkml
and to me please?

> > [...]