Yves-Alexis Perez
2018-Jan-06 15:11 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
control: reassign -1 xen-hypervisor-4.8-amd64

On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> > According to that link, the fix seems to be configuration rather than
> > code.
> > Does this mean this bug against the kernel should be closed?
>
> Yes, the problem seems to be in the Xen hypervisor and not the Linux
> kernel itself. The default value for the gnttab_max_frames parameter
> needs to be increased to avoid domU disk IO hangs, for example:
>
>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>
> So either close the bug or reassign it to xen-hypervisor package so
> they can increase the default value for this parameter in the
> hypervisor code.

Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable update).

@Xen maintainers: see the complete bug log for more information, but basically it seems that domU freezes happen with the "new" multi-queue Xen blk driver, and the fix is to increase a configuration value. Valentin suggests making that value the default.

Regards,
--
Yves-Alexis
Hans van Kranenburg
2018-Jan-06 22:17 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Christian and everyone else,

Ack on reassign to Xen.

On 01/06/2018 04:11 PM, Yves-Alexis Perez wrote:
> control: reassign -1 xen-hypervisor-4.8-amd64
>
> On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
>> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
>>> According to that link, the fix seems to be configuration rather than
>>> code.
>>> Does this mean this bug against the kernel should be closed?
>>
>> Yes, the problem seems to be in the Xen hypervisor and not the Linux
>> kernel itself. The default value for the gnttab_max_frames parameter
>> needs to be increased to avoid domU disk IO hangs, for example:
>>
>>   GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>>
>> So either close the bug or reassign it to xen-hypervisor package so
>> they can increase the default value for this parameter in the
>> hypervisor code.
>>
> Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable
> update).
>
> @Xen maintainers: see the complete bug log for more information, but basically
> it seems that a domu freezes happens with the "new" multi-queue xen blk
> driver, and the fix is to increase a configuration value. Valentin suggests
> adding that to the default.

The dom0 gnttab_max_frames boot setting controls how many pages are allocated to hold 'grants'.

The grant concept is related to sharing information between the dom0 and domU. It allows memory pages to be shared back and forth, so that e.g. a domU can fill a page with outgoing network packets or disk writes. Then the dom0 can take over ownership of the memory page, read the contents and process them. In this way, zero-copy IO is implemented.

When running Xen domUs, the total number of network interfaces and block devices attached to all running domUs (and, apparently, how heavily they are used) drives up the number of grants in use. At some point you run out of grants because all of the pages are filled.

I agree that the upstream default, 32, is quite low. This is indeed a configuration issue. I myself ran into this years ago with a growing number of domUs and network interfaces in use. We have been using gnttab_max_nr_frames=128 for a long time already instead.

I was tempted to reassign to src:xen, but in the meantime this option has already been removed again, so as far as I can see from a quick look, this bug will no longer apply to unstable (well, as soon as we get something new in there).

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

Including a better default for gnttab_max_nr_frames in the grub config in the Debian Xen package in stable sounds reasonable from a best practices point of view.

But I would be interested in learning more about the relation with block mq. Does using newer Linux kernels (like from stretch-backports) for the domU always put a bigger strain on this? Or is it just related to the overall number of network devices and block devices you are adding to your domUs in your own specific situation, and did you just trip over the default limit?

In any case, the grub option file is a conffile, so any user upgrading has to accept or merge the change. That means we won't cause a stable user to run out of memory without notice because of a few extra kilobytes of memory usage.

Hans van Kranenburg

P.S. Debian Xen team is in the process of being "rebooted" while the current shitstorm about meltdown/spectre is going on, so don't hold your breath. :)
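The capacity arithmetic behind the grant discussion above can be sketched roughly as follows. This is an illustrative calculation, not Xen code. It assumes 4 KiB pages, 8-byte version-1 grant entries and 16-byte version-2 entries, and that the frame limit applies per domain; the function names are made up for this example.

```python
# Rough grant-table capacity arithmetic (illustrative sketch, not Xen code).
# Assumptions: 4 KiB pages; 8-byte version-1 and 16-byte version-2 grant
# entries; the frame limit is applied per domain.

PAGE_SIZE = 4096
ENTRY_SIZE = {1: 8, 2: 16}  # bytes per grant entry, keyed by table version


def grants_available(frames, version=1):
    """Number of grant entries that fit in `frames` grant-table pages."""
    return frames * (PAGE_SIZE // ENTRY_SIZE[version])


def table_size_kib(frames):
    """Upper bound on grant-table memory for `frames` pages, in KiB."""
    return frames * PAGE_SIZE // 1024


if __name__ == "__main__":
    # Values mentioned in the thread: 32 (default), 64, 128 and 256.
    for frames in (32, 64, 128, 256):
        print(f"{frames} frames: {grants_available(frames)} v1 grants, "
              f"at most {table_size_kib(frames)} KiB")
```

Under these assumptions the default of 32 frames gives 16384 version-1 grants, and raising the limit to 256 caps the table at 1 MiB.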
Valentin Vidic
2018-Jan-07 09:05 UTC
[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
> I agree that the upstream default, 32 is quite low. This is indeed a
> configuration issue. I myself ran into this years ago with a growing
> number of domUs and network interfaces in use. We have been using
> gnttab_max_nr_frames=128 for a long time already instead.
>
> I was tempted to reassign src:xen, but in the meantime, this option has
> already been removed again, so this bug does not apply to unstable
> (well, as soon as we get something new in there) any more (as far as I
> can see quickly now).
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

It does not seem to have been removed; rather, the default seems to have been increased from 32 to 64?

> Including a better default for gnttab_max_nr_frames in the grub config
> in the debian xen package in stable sounds reasonable from a best
> practices point of view.
>
> But, I would be interested in learning more about the relation with
> block mq although. Does using newer linux kernels (like from
> stretch-backports) for the domU always put a bigger strain on this? Or,
> is it just related to the overall number of network devices and block
> devices you are adding to your domUs in your specific own situation, and
> did you just trip over the default limit?

After upgrading the domU and dom0 from jessie to stretch on a big postgresql database server (50 VCPUs, 200GB RAM), it started freezing very soon after boot, as posted here:

https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html

It did not have these problems while running the jessie versions of the hypervisor and the kernels. The problem seems to be related to the number of CPUs used, as smaller domUs with a few VCPUs did not hang like this. Could it be that a large number of VCPUs -> more queues in the Xen mq driver -> faster exhaustion of allocated pages?

--
Valentin
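The hypothesis in the last paragraph above can be put into numbers with a back-of-the-envelope sketch. Every constant here is an assumption made for illustration and is not taken from the thread: one ring page per queue holding 32 requests, 11 data segments per request, one queue per VCPU, no indirect descriptors or persistent-grant caching, and 512 version-1 grants per grant-table frame. The function names are hypothetical.

```python
# Back-of-the-envelope estimate of grants used by multi-queue block devices.
# All constants are assumptions for illustration, not measured values:
# one ring page per queue, 32 requests per ring, 11 segments per request,
# one queue per VCPU, 512 v1 grant entries per 4 KiB grant-table frame.

REQUESTS_PER_RING = 32
SEGMENTS_PER_REQUEST = 11
GRANTS_PER_FRAME = 512


def grants_per_device(queues):
    """Worst case: one grant per ring page plus one per in-flight segment."""
    return queues * (1 + REQUESTS_PER_RING * SEGMENTS_PER_REQUEST)


def frames_needed(queues, devices=1):
    """Grant-table frames needed to hold the worst-case grant count."""
    total = grants_per_device(queues) * devices
    return -(-total // GRANTS_PER_FRAME)  # ceiling division


if __name__ == "__main__":
    for vcpus in (1, 4, 50):
        print(f"{vcpus} queues: {grants_per_device(vcpus)} grants, "
              f"{frames_needed(vcpus)} frames")
```

With these assumed figures, a single 50-queue device needs 35 frames in the worst case, more than the 32-frame default mentioned in the thread, while a 4-queue device needs only 3. This only shows that the hypothesis is plausible in shape; it has not been checked against the actual driver behaviour.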