Hans van Kranenburg
2018-Jan-06 22:17 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Christian and everyone else,

Ack on reassign to Xen.

On 01/06/2018 04:11 PM, Yves-Alexis Perez wrote:
> control: reassign -1 xen-hypervisor-4.8-amd64
>
> On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
>> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
>>> According to that link, the fix seems to be configuration rather than
>>> code. Does this mean this bug against the kernel should be closed?
>>
>> Yes, the problem seems to be in the Xen hypervisor and not the Linux
>> kernel itself. The default value for the gnttab_max_frames parameter
>> needs to be increased to avoid domU disk IO hangs, for example:
>>
>> GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>>
>> So either close the bug or reassign it to the xen-hypervisor package so
>> they can increase the default value for this parameter in the
>> hypervisor code.
>
> Ok, I'll reassign and let the Xen maintainers handle that (maybe in a
> stable update).
>
> @Xen maintainers: see the complete bug log for more information, but
> basically it seems that a domU freeze happens with the "new" multi-queue
> xen blk driver, and the fix is to increase a configuration value.
> Valentin suggests adding that to the default.

The dom0 gnttab_max_frames boot setting controls how many pages are allocated to be filled with 'grants'. The grant concept is about sharing information between dom0 and a domU. It allows memory pages to be shared back and forth, so that e.g. a domU can fill a page with outgoing network packets or disk writes, and dom0 can then take over ownership of the memory page, read the contents and do its trick with it. In this way, zero-copy IO is implemented.

When running Xen domUs, the total number of network interfaces and block devices attached to all of the running domUs (and, apparently, how heavily they are used) causes the usage of these grants to increase. At some point you run out of grants because all of the pages are filled.
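To put rough numbers on the above (a back-of-the-envelope sketch, not figures from the thread): a grant table frame is a single 4 KiB page, and a version-1 grant entry is 8 bytes, so each frame holds 512 grant references:

```python
# Back-of-the-envelope grant capacity, assuming 4 KiB grant table frames
# and 8-byte version-1 grant entries (numbers not taken from the thread).
PAGE_SIZE = 4096       # bytes per grant table frame (one page)
GRANT_V1_ENTRY = 8     # bytes per version-1 grant entry

def max_grants(gnttab_max_frames: int) -> int:
    """Total simultaneous grant references available for a given frame limit."""
    return gnttab_max_frames * PAGE_SIZE // GRANT_V1_ENTRY

for frames in (32, 64, 128, 256):
    print(f"gnttab_max_frames={frames:3d} -> {max_grants(frames):6d} grants")
# gnttab_max_frames= 32 ->  16384 grants
# gnttab_max_frames=256 -> 131072 grants
```

So the old default of 32 frames caps a guest at 16384 in-flight grants, which a few busy multi-queue devices can plausibly exhaust, while even 256 frames costs only 1 MiB of pages.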
I agree that the upstream default, 32, is quite low. This is indeed a configuration issue. I myself ran into this years ago with a growing number of domUs and network interfaces in use. We have been using gnttab_max_nr_frames=128 for a long time already instead.

I was tempted to reassign src:xen, but in the meantime this option has already been removed again, so this bug no longer applies to unstable (well, as soon as we get something new in there), as far as I can see quickly now:

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

Including a better default for gnttab_max_nr_frames in the grub config in the Debian xen package in stable sounds reasonable from a best practices point of view.

But I would be interested in learning more about the relation with block mq, though. Does using newer Linux kernels (like from stretch-backports) for the domU always put a bigger strain on this? Or is it just related to the overall number of network devices and block devices you are adding to your domUs in your specific situation, and did you just trip over the default limit?

In any case, the grub option is a conffile, so any user upgrading has to accept/merge the change, so we won't cause a stable user to just run out of memory without notice because of a few extra kilobytes of memory usage.

Hans van Kranenburg

P.S. The Debian Xen team is in the process of being "rebooted" while the current shitstorm about Meltdown/Spectre is going on, so don't hold your breath. :)
Hans van Kranenburg
2018-Jan-07 18:36 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 01/07/2018 10:05 AM, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
>> I agree that the upstream default, 32, is quite low. This is indeed a
>> configuration issue. I myself ran into this years ago with a growing
>> number of domUs and network interfaces in use. We have been using
>> gnttab_max_nr_frames=128 for a long time already instead.
>>
>> I was tempted to reassign src:xen, but in the meantime, this option has
>> already been removed again, so this bug does not apply to unstable
>> (well, as soon as we get something new in there) any more (as far as I
>> can see quickly now).
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30
>
> It does not seem to be removed but increased the default from 32 to 64?

Ehm, yes, you are correct. I was misreading and mixing things up. Let's try again...

The referenced commit talks about the removal of the obsolete gnttab_max_nr_frames from the documentation, so it is not related.

>> Including a better default for gnttab_max_nr_frames in the grub config
>> in the debian xen package in stable sounds reasonable from a best
>> practices point of view.

So, that's gnttab_max_frames, not gnttab_max_nr_frames... I was reading out loud from my old Jessie dom0 grub config.

>> But, I would be interested in learning more about the relation with
>> block mq. Does using newer linux kernels (like from stretch-backports)
>> for the domU always put a bigger strain on this? Or is it just related
>> to the overall number of network devices and block devices you are
>> adding to your domUs in your specific situation, and did you just trip
>> over the default limit?
> After upgrading the domU and dom0 from jessie to stretch on a big
> postgresql database server (50 VCPUs, 200GB RAM), it started freezing
> very soon after boot, as posted here:
>
> https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
>
> It did not have these problems while running jessie versions of the
> hypervisor and the kernels. The problem seems to be related to the
> number of CPUs used, as smaller domUs with a few VCPUs did not hang
> like this. Could it be that a large number of VCPUs -> more queues in
> the Xen mq driver -> faster exhaustion of allocated pages?

That exactly seems to be the case, yes. Maybe this is also one of the reasons that the default max was increased in Xen.

"xen/blkback: make pool of persistent grants and free pages per-queue"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4bf0065b7251afb723a29b2fd58f7c38f8ce297

Recently a tool was added to "dump guest grant table info". You could see if it compiles on the 4.8 source and whether it works? It would be interesting to get some idea of how high or low these numbers are in different scenarios. I mean, I'm using 128, you 256, and we don't even know if the actual value is maybe just above 32? :]

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a

If this is something users are going to run into while not doing more unusual things like having dozens of vcpus or network interfaces, then changing the default could prevent hours of frustration and debugging for them.

The least invasive option is to add the option to the documentation of GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.d/xen.cfg, like "If you have more than xyz disks or network interfaces in a domU, use this, blah blah." Actually setting the option there is not a good idea, because people can still have GRUB_CMDLINE_XEN_DEFAULT set in e.g. /etc/default/grub, which would override and damage things.
The other option is to add a patch that bumps the default in the upstream code from 32 to 64, including documentation etc.

Sorry for the earlier confusion,
Hans
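As a concrete illustration of the "least invasive option" described above (a sketch only; the actual wording and threshold would be up to the package maintainers), the documentation hint in /etc/default/grub.d/xen.cfg could look like this:

```shell
# /etc/default/grub.d/xen.cfg -- illustrative comment block, not an actual patch
#
# If domUs with many vcpus, block devices or network interfaces freeze on
# disk or network I/O, the hypervisor may be running out of grant table
# frames. Raising the limit costs only a few extra pages of dom0 memory:
#
#GRUB_CMDLINE_XEN_DEFAULT="gnttab_max_frames=256"
```

Keeping the line commented out avoids the override problem mentioned above: users who already set GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub are not silently clobbered.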
Christian Schwamborn
2018-Jan-15 10:12 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Hans and Valentin,

first of all: thanks for your help and explanations, that is very helpful. I was on vacation last week and couldn't answer right away.

On 07.01.2018 19:36, Hans van Kranenburg wrote:
> If this is something users are going to run into while not doing more
> unusual things like having dozens of vcpus or network interfaces, then
> changing the default could prevent hours of frustration and debugging
> for them.

As a reference: dom0 is stretch.

root@zero:~# xl list
Name               ID   Mem VCPUs  State  Time(s)
Domain-0            0  1961     2  r----- 407972.8
xaver-jessie       10  2048     2  -b---- 177520.8
ustrich-jessie     12  2048     2  -b----   8555.9
ourea-stretch      14  8192     2  -b---- 167352.7
arriba             17  4096     2  -b----   5108.3

All domUs have one network interface on a bridge.

xaver-jessie has 5 block devices (phys, lvm)
ustrich-jessie has 4 block devices (phys, lvm)
ourea-stretch has 16 block devices (phys, lvm)
arriba has just one (phys, lvm) and is an HVM Windows system

As you can see, nothing crazy with lots of vcpus or network interfaces. The crashing (freezing) domU was ourea-stretch, which is the one with the most load (smb, some web services, cal/card dav, psql, ldap, postfix, cyrus ...). As mentioned, the freezes stopped after switching to the backports kernel; nothing else changed. I was desperate at that time to get this newly installed system to work and frankly stopped all planned updates to stretch on other systems at that point, until I knew what was going on.

Is there an easy way to get/monitor the used grant frames? As I understand it, the xen-diag tool you mentioned doesn't compile on xen 4.8?

Christian
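For reference, the tool Hans mentioned (added in xen.git commit df36d82e) is a small standalone binary; if it can be built against the 4.8 tree, the invocation would look roughly like this (the subcommand name comes from that commit; treating it as working on 4.8 is an assumption, since the thread suggests it may not compile there):

```shell
# Query the current and maximum grant table size of a domU with xen-diag
# (from xen.git commit df36d82e; may need backporting to build on 4.8).
# The domid 14 below is ourea-stretch from the xl list output above.
./xen-diag gnttab_query_size 14
```

Comparing the reported current size against gnttab_max_frames for a loaded domU would show how close it actually gets to the limit.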