Elliott Mitchell
2019-Apr-14 17:19 UTC
[Pkg-xen-devel] Bug#927071: xen: Odd memory-leak-like behavior for Dom0
Package: src:xen
Version: 4.8.5+shim4.10.2+xsa282-1+deb9u11
Severity: important

I'm observing an odd memory-leak-like behavior for Xen's Dom0. I've been attempting to reduce the memory usage of Dom0, so I'd been slowly decreasing the amount allocated to it. Presently I'm using "dom0_mem=360M,max:384M".

Two weeks ago I noticed `xl list` was showing less memory for Dom0 than I thought I'd told Xen to allocate. Worse, `xl mem-set 0 384` fails to have any effect. Further observation indicates Dom0 is getting ballooned down by roughly 1-2MB per day, in spite of Xen having plenty of free memory (more than 1GB is available). I just tried, and `xl mem-set 0` to less than the current limit works, but `xl mem-set 0 384` merely sets Dom0 back to the previous limit.

Over time this is forcing restarts of Dom0, which is Bad(tm). This is a rather serious situation.
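For reference, the observation boils down to this sequence (a sketch; 0 is Dom0's domain ID and 384 matches the max: setting above):

    # Show how much memory the hypervisor currently assigns to Dom0
    xl list 0

    # Attempt to balloon Dom0 back up to its configured maximum
    xl mem-set 0 384

    # Check again: in the behavior described above, the Mem column
    # does not return to 384, and keeps creeping down over days
    xl list 0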
Elliott Mitchell
2019-Apr-21 23:10 UTC
[Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation
I'm referring to this as "balloon-leak" since it looks sort of like a memory leak, but is instead memory disappearing into the balloon.

There are two things which have been happening more recently and may have exaggerated this problem. First, one of the DomUs is acting as a fileserver, and that has been getting more use recently. Second, I've been testing block-device hotplug as a mechanism to transfer data between VMs (`xl block-detach` from one VM, then `xl block-attach` to another VM; see the sketch below).

The DomUs appear absolutely unaffected by this. Even when Dom0 has gotten into a state where `shutdown -r now` fails, the DomUs appear to be chugging away with no problems.

There is plenty of free memory for creating additional VMs (perhaps too much, and that confused Xen?), so it is really puzzling that memory is being ballooned away from Dom0. After the next restart I plan to double the allocation for Dom0 and see whether it is able to last more than a week.
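For context, the hotplug transfer sequence is roughly the following (the domain names, volume path, and xvdb device name are all hypothetical):

    # Detach the shared disk from the source DomU
    xl block-detach source-vm xvdb

    # Attach the same backing device to the destination DomU
    # (positional disk spec: target, format, vdev, access; path is made up)
    xl block-attach dest-vm /dev/vg0/shared,raw,xvdb,w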
Hans van Kranenburg
2019-Apr-22 14:02 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
Hi,

On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> Referring to this as "balloon-leak" as it looks sort of like a
> memory-leak, but is instead memory disappearing into the balloon.

I haven't heard about this symptom before your report.

> There are two things which have been happening more recently which may
> have exaggerated this problem. First, one of the DomUs is acting as a
> fileserver and that has been getting more usage recently. Second, I've
> been testing block-device hotplug as a mechanism to transfer data between
> VMs (xl block-detach from one VM, then xl block-attach to another VM).

Did you look at the numbers and see if it happens when you do this?

> The DomUs appear absolutely unaffected by this. Even as Dom0 has gotten
> to a situation where `shutdown -r now` fails, the DomUs appear to be
> chugging away with no problems.
>
> There is plenty of free memory for creating additional VMs (perhaps too
> much, and that confused Xen?), so this is really puzzling that memory is
> being ballooned away from Dom0. At this point I plan after the next
> restart to double the allocation for Dom0 and see whether Dom0 is able
> to last more than a week.

Weird. Can you log memory stats over time, so that you can see when it happens, and correlate it to other events?

Hans
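A minimal way to capture such a log, assuming a shell loop run in Dom0 (the log path and one-minute interval are arbitrary choices):

    # Append a timestamped snapshot of the memory numbers every minute
    while true; do
        date >> /var/log/xen-mem.log
        xl info | grep -E 'total_memory|free_memory' >> /var/log/xen-mem.log
        xl list >> /var/log/xen-mem.log
        sleep 60
    done

Correlating the timestamps in that log against other activity (block hotplug, fileserver load) should show when the ballooning steps happen.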
Elliott Mitchell
2019-Apr-30 22:55 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> > There is plenty of free memory for creating additional VMs (perhaps too
> > much, and that confused Xen?), so this is really puzzling that memory is
> > being ballooned away from Dom0. At this point I plan after the next
> > restart to double the allocation for Dom0 and see whether Dom0 is able
> > to last more than a week.
>
> Weird. Can you log memory stats over time, so that you can see when it
> happens, and correlate it to other events?

At this point there is only one real pattern I've noticed: `smartd` was always the process which triggered the kernel OOM-killer.

Originally I attributed this to `smartd` making some large memory allocation during its night-time tasks (which I would have chalked up to `smartd` perhaps not being that well written). Yet I never saw anything else trigger the OOM-killer, and I'm now willing to speculate that some I/O operation `smartd` was doing triggers a bug in Xen.
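One way to confirm that pattern, assuming kernel messages land in the usual places, is to pull the OOM-killer records out of the logs and compare their timestamps against smartd's scheduled activity:

    # List OOM-killer invocations with timestamps (systemd journal)
    journalctl -k | grep -i 'out of memory'

    # Or, on a syslog-only system
    grep -i 'out of memory' /var/log/kern.log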
Elliott Mitchell
2019-Jul-20 01:04 UTC
[Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation
What I'm seeing seems related to the topic of XSA-300: mainly, something ballooning out pages.

The Debian Wiki advises reducing the amount of memory used by Domain-0. Perhaps it should instead advise keeping the Domain-0 maximum substantially higher than the actual allocation, in order to leave room for ballooning out pages used for I/O? In my case it looks like Domain-0 can function with less than 300MB of allocated memory, but needs around 200MB of ballooned pages for I/O (an illustrative setting is sketched below).
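Concretely, on a Debian system that would mean something like the following in the hypervisor command line (e.g. via GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub; the 300M/512M figures just reflect the numbers in this report, not a tested recommendation):

    # Dom0 gets 300M actually allocated, with headroom up to 512M
    # for pages ballooned out for I/O
    GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=300M,max:512M"

    # Then regenerate the boot configuration
    update-grub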
Hans van Kranenburg
2020-Sep-18 20:55 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
Hi again,

On 5/1/19 12:55 AM, Elliott Mitchell wrote:
> On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
>> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
>>> There is plenty of free memory for creating additional VMs (perhaps too
>>> much, and that confused Xen?), so this is really puzzling that memory is
>>> being ballooned away from Dom0. At this point I plan after the next
>>> restart to double the allocation for Dom0 and see whether Dom0 is able
>>> to last more than a week.
>>
>> Weird. Can you log memory stats over time, so that you can see when it
>> happens, and correlate it to other events?
>
> At this point there is only one real pattern I've noticed: `smartd` was
> always the process which triggered the kernel OOM-killer.
>
> Originally I attributed this to `smartd` making some large memory
> allocation during its night-time tasks (which I would have chalked up to
> `smartd` perhaps not being that well written). Yet I never saw anything
> else trigger the OOM-killer, and I'm now willing to speculate that some
> I/O operation `smartd` was doing triggers a bug in Xen.

At first I replied with "I haven't heard about this symptom before your report.", but later I realized that I am totally seeing the same kind of behaviour.

During a debian-xen day in Feb 2020, I even had a closer look at this together with Ian, and we ended up thinking that there's actually some kind of obscure miscalculation bug happening. If you look closely at the numbers in `xl info` and `xl list`, you'll see that the numbers just do not add up. The dom0 gets some kind of fake down-ballooning which is an accounting error.

I can't provide more proof right now, because I have to reproduce the thing in a simplified environment to be able to provide a kind of walk-through scenario with all the output of the numbers. And yes, I have seen OOM killers do stuff in customer production environments because of this. O_O

A member of my team has been busy doing storage migrations where we attach new block devices to domUs and then sync all their data to the new filesystem (moving from ext4 to btrfs and also to new iSCSI storage), and later reboot after a final sync and then swap block devices, etc.

From the graphs we've been looking at, combined with when migration stuff is happening, I have gotten a suspicion that the fake dom0 down-ballooning is related to grant mappings, since it seems like the dom0 memory is not decreasing when attaching the new disk, but it is when starting activity using it.

To be continued....

Hans
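In the meantime, anyone wanting to look for the same mismatch can compare the hypervisor-wide totals against the per-domain allocations (a sketch, not the exact procedure used above; the question is whether free memory plus the domains' allocations add up to the total):

    # Hypervisor-wide totals
    xl info | grep -E 'total_memory|free_memory|outstanding_claims'

    # Per-domain allocations, to sum up against the totals
    xl list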