Elliott Mitchell
2019-Apr-14 17:19 UTC
[Pkg-xen-devel] Bug#927071: xen: Odd memory-leak-like behavior for Dom0
Package: src:xen
Version: 4.8.5+shim4.10.2+xsa282-1+deb9u11
Severity: important

I'm observing an odd memory-leak-like behavior for Xen's Dom0. I've been attempting to reduce the memory usage of Dom0, so I'd been slowly decreasing the amount allocated to it. Presently I'm using "dom0_mem=360M,max:384M".

Two weeks ago I noticed `xl list` was showing less memory for Dom0 than I thought I'd told Xen to allocate. Worse, `xl mem-set 0 384` fails to have any effect. Further observation indicates Dom0 is getting ballooned down by roughly 1-2MB per day, in spite of Xen having plenty of free memory (more than 1GB is available). I just tried, and `xl mem-set 0` to less than the current limit works, but `xl mem-set 0 384` merely sets Dom0 back to the previous limit.

Over time this is forcing restarts of Dom0, which is Bad(tm). This is a rather serious situation.
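For reference, the observation boils down to this sequence (a sketch; 0 is Dom0's domain ID and 384 matches the max: setting above):

    # Show how much memory the hypervisor currently assigns to Dom0
    xl list 0

    # Attempt to balloon Dom0 back up to its configured maximum
    xl mem-set 0 384

    # Check again: in the behavior described above, the Mem column
    # does not return to 384, and keeps creeping down over days
    xl list 0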
Elliott Mitchell
2019-Apr-21 23:10 UTC
[Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation
I'm referring to this as "balloon-leak" since it looks sort of like a memory leak, but is instead memory disappearing into the balloon.

There are two things which have been happening more recently and may have exaggerated this problem. First, one of the DomUs is acting as a fileserver, and that has been getting more use recently. Second, I've been testing block-device hotplug as a mechanism to transfer data between VMs (`xl block-detach` from one VM, then `xl block-attach` to another VM; see the sketch below).

The DomUs appear absolutely unaffected by this. Even when Dom0 has gotten into a state where `shutdown -r now` fails, the DomUs appear to be chugging away with no problems.

There is plenty of free memory for creating additional VMs (perhaps too much, and that confused Xen?), so it is really puzzling that memory is being ballooned away from Dom0. After the next restart I plan to double the allocation for Dom0 and see whether it is able to last more than a week.
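For context, the hotplug transfer sequence is roughly the following (the domain names, volume path, and xvdb device name are all hypothetical):

    # Detach the shared disk from the source DomU
    xl block-detach source-vm xvdb

    # Attach the same backing device to the destination DomU
    # (positional disk spec: target, format, vdev, access; path is made up)
    xl block-attach dest-vm /dev/vg0/shared,raw,xvdb,w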
Hans van Kranenburg
2019-Apr-22 14:02 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
Hi,

On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> Referring to this as "balloon-leak" as it looks sort of like a
> memory-leak, but is instead memory disappearing into the balloon.

I haven't heard about this symptom before your report.

> There are two things which have been happening more recently which may
> have exaggerated this problem. First, one of the DomUs is acting as a
> fileserver and that has been getting more usage recently. Second, I've
> been testing block-device hotplug as a mechanism to transfer data between
> VMs (xl block-detach from one VM, then xl block-attach to another VM).

Did you look at the numbers and see if it happens when you do this?

> The DomUs appear absolutely unaffected by this. Even as Dom0 has gotten
> to a situation where `shutdown -r now` fails, the DomUs appear to be
> chugging away with no problems.
>
> There is plenty of free memory for creating additional VMs (perhaps too
> much, and that confused Xen?), so this is really puzzling that memory is
> being ballooned away from Dom0. At this point I plan after the next
> restart to double the allocation for Dom0 and see whether Dom0 is able
> to last more than a week.

Weird. Can you log memory stats over time, so that you can see when it happens, and correlate it to other events?

Hans
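A minimal way to capture such a log, assuming a shell loop run in Dom0 (the log path and one-minute interval are arbitrary choices):

    # Append a timestamped snapshot of the memory numbers every minute
    while true; do
        date >> /var/log/xen-mem.log
        xl info | grep -E 'total_memory|free_memory' >> /var/log/xen-mem.log
        xl list >> /var/log/xen-mem.log
        sleep 60
    done

Correlating the timestamps in that log against other activity (block hotplug, fileserver load) should show when the ballooning steps happen.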
Elliott Mitchell
2019-Apr-30 22:55 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> > There is plenty of free memory for creating additional VMs (perhaps too
> > much, and that confused Xen?), so this is really puzzling that memory is
> > being ballooned away from Dom0. At this point I plan after the next
> > restart to double the allocation for Dom0 and see whether Dom0 is able
> > to last more than a week.
>
> Weird. Can you log memory stats over time, so that you can see when it
> happens, and correlate it to other events?

At this point there is only one real pattern I've noticed: `smartd` was always the process which triggered the kernel OOM-killer.

Originally I attributed this to `smartd` making some large memory allocation during its night-time tasks (which I would have chalked up to `smartd` perhaps not being that well written). Yet I never saw anything else trigger the OOM-killer, and I'm now willing to speculate that some I/O operation `smartd` was doing triggers a bug in Xen.
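One way to confirm that pattern, assuming kernel messages land in the usual places, is to pull the OOM-killer records out of the logs and compare their timestamps against smartd's scheduled activity:

    # List OOM-killer invocations with timestamps (systemd journal)
    journalctl -k | grep -i 'out of memory'

    # Or, on a syslog-only system
    grep -i 'out of memory' /var/log/kern.log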
Elliott Mitchell
2019-Jul-20 01:04 UTC
[Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation
What I'm seeing seems related to the topic of XSA-300: mainly, something ballooning out pages.

The Debian Wiki advises reducing the amount of memory used by Domain-0. Perhaps it should instead advise keeping the Domain-0 maximum substantially higher than the actual allocation, in order to leave room for ballooning out pages used for I/O? In my case it looks like Domain-0 can function with less than 300MB of allocated memory, but needs around 200MB of ballooned pages for I/O (an illustrative setting is sketched below).
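Concretely, on a Debian system that would mean something like the following in the hypervisor command line (e.g. via GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub; the 300M/512M figures just reflect the numbers in this report, not a tested recommendation):

    # Dom0 gets 300M actually allocated, with headroom up to 512M
    # for pages ballooned out for I/O
    GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=300M,max:512M"

    # Then regenerate the boot configuration
    update-grub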
Hans van Kranenburg
2020-Sep-18 20:55 UTC
[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation
Hi again,

On 5/1/19 12:55 AM, Elliott Mitchell wrote:
> On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
>> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
>>> There is plenty of free memory for creating additional VMs (perhaps too
>>> much, and that confused Xen?), so this is really puzzling that memory is
>>> being ballooned away from Dom0. At this point I plan after the next
>>> restart to double the allocation for Dom0 and see whether Dom0 is able
>>> to last more than a week.
>>
>> Weird. Can you log memory stats over time, so that you can see when it
>> happens, and correlate it to other events?
>
> At this point there is only one real pattern I've noticed: `smartd` was
> always the process which triggered the kernel OOM-killer.
>
> Originally I attributed this to `smartd` making some large memory
> allocation during its night-time tasks (which I would have chalked up to
> `smartd` perhaps not being that well written). Yet I never saw anything
> else trigger the OOM-killer, and I'm now willing to speculate that some
> I/O operation `smartd` was doing triggers a bug in Xen.

At first I replied with "I haven't heard about this symptom before your report.", but later I realized that I am totally seeing the same kind of behaviour.

During a debian-xen day in Feb 2020, I even had a closer look at this together with Ian, and we ended up thinking that there's actually some kind of obscure miscalculation bug happening. If you look closely at the numbers in `xl info` and `xl list`, you'll see that the numbers just do not add up. The dom0 gets some kind of fake down-ballooning which is an accounting error.

I can't provide more proof right now, because I have to reproduce the thing in a simplified environment to be able to provide a kind of walk-through scenario with all the output of the numbers. And yes, I have seen OOM killers do stuff in customer production environments because of this. O_O

A member of my team has been busy doing storage migrations where we attach new block devices to domUs and then sync all their data to the new filesystem (moving from ext4 to btrfs and also to new iSCSI storage), and later reboot after a final sync and then swap block devices, etc.

From the graphs we've been looking at, combined with when migration stuff is happening, I have gotten a suspicion that the fake dom0 down-ballooning is related to grant mappings, since it seems like the dom0 memory is not decreasing when attaching the new disk, but it is when starting activity using it.

To be continued....

Hans
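In the meantime, anyone wanting to look for the same mismatch can compare the hypervisor-wide totals against the per-domain allocations (a sketch, not the exact procedure used above; the question is whether free memory plus the domains' allocations add up to the total):

    # Hypervisor-wide totals
    xl info | grep -E 'total_memory|free_memory|outstanding_claims'

    # Per-domain allocations, to sum up against the totals
    xl list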