thr3ads.net - CentOS virt - [CentOS-virt] kernel-4.9.37-29.el7 (and el6) [Jul 2017]

Johnny Hughes

2017-Jul-19 14:23 UTC

[CentOS-virt] kernel-4.9.37-29.el7 (and el6)

On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
> On Mon, 17 Jul 2017, Johnny Hughes wrote:
> 
>> Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6,
>> with the one config file change) working for everyone:
>>
>> (turn off: CONFIG_IO_STRICT_DEVMEM)
> 
> Hello.
> Maybe it's not the most appropriate thread or time, but I have been
> signalling it before:
> 
> 4.9.* kernels do not work well for me any more (nor for other people,
> as far as I know). The last stable kernel was 4.9.13-22.
> 
> Since 4.9.25-26 I often get:
> on 3 Supermicro servers (different generations):
> - memory allocation errors on Dom0 and corresponding lost page writes
>     due to buffer I/O errors on PV guests
> - after such a memory allocation error on dom0 I have also spotted:
>     - NFS client hangups on guests (server not responding, still trying
>       => server OK)
>     - iptables lockups on PV guest reboot
> 
> on 1 Supermicro server:
> - memory allocation errors on Dom0 and SATA lockups (many, if not all,
>     SATA channels at once):
>     exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
>     hard resetting link
>     failed to IDENTIFY (I/O error, err_mask=0x4)
>     then: blk_update_request: I/O error, dev sd., sector ....
> 
> 
> All of these machines have been tested with memtest; no memory problems
> were detected.
> None of this occurs when I boot 4.9.13-22.
> Most of my guests are CentOS 6 x86_64, bridged.
> 
> Has anyone had such problems, or dealt with them somehow?
> 
> 
> Since spotting these errors I have done many tests, compiling and testing to
> pinpoint a single code change (kernel version, patch) - no conclusions yet.
> 
> But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size and
> configuration.
> 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been
> compiled into the kernel; here is a shortened but significant list:
> - iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
> - SATA_AHCI
> - ATA_AHCI (PATA, what the heck?)
> - FBDEV_FRONTEND
> - HID_MAGICMOUSE
> - HID_NTRIG
> - USB_XHCI
> - INTEL_SMARTCONNECT

Modules that are not loaded are not used.  A module has no impact at all on
performance or compatibility unless it is used.  If you take an lsmod of
the kernel that works and one of the kernel with issues, we can see if
there are LOADED modules that might cause issues.

The modules that are built are the same as in Fedora and, for modules
present in the RHEL 7 kernel, the same as in RHEL 7.

We did troubleshoot and turn off some things recently; one thing in
particular was CONFIG_IO_STRICT_DEVMEM, which is on in Fedora but
off in some other distros, and which causes issues with iSCSI and some
other things.

We also added some specific Xen patches: one for the netback queue, one for
APIC, one for nested dom0.  Upstream has also added several Xen
patches since 4.9.13.

And yes, we did change the kernel configs specifically to add in
iptables, as many people want it.

If you can point to problems with a specific module, we can discuss it
here and turn it off if necessary.
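
The lsmod comparison suggested above could be done along these lines. This
is only a minimal sketch: the function names and the idea of saving each
`lsmod` output to a text file first are assumptions made for illustration,
not something used elsewhere in this thread.

```python
import sys


def loaded_modules(lsmod_text):
    """Return the set of module names found in the text output of `lsmod`."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header line.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def compare_lsmod(good_text, bad_text):
    """Return (only_in_good, only_in_bad) as sorted lists of module names."""
    good = loaded_modules(good_text)
    bad = loaded_modules(bad_text)
    return sorted(good - bad), sorted(bad - good)


# Hypothetical usage: python compare_lsmod.py lsmod-good.txt lsmod-bad.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_good, open(sys.argv[2]) as f_bad:
        only_good, only_bad = compare_lsmod(f_good.read(), f_bad.read())
    print("loaded only on working kernel:", ", ".join(only_good) or "(none)")
    print("loaded only on failing kernel:", ", ".join(only_bad) or "(none)")
```

Note that `lsmod` lists only loadable modules. Code built directly into the
kernel image does not show up there, so such a comparison says nothing about
options that changed from module to built-in.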


Johnny Hughes

2017-Jul-19 14:43 UTC

[CentOS-virt] kernel-4.9.37-29.el7 (and el6)

On 07/19/2017 09:23 AM, Johnny Hughes wrote:
> On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
>> On Mon, 17 Jul 2017, Johnny Hughes wrote:
>>
>>> Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6,
>>> with the one config file change) working for everyone:
>>>
>>> (turn off: CONFIG_IO_STRICT_DEVMEM)
>>
>> Hello.
>> Maybe it's not the most appropriate thread or time, but I have been
>> signalling it before:
>>
>> 4.9.* kernels do not work well for me any more (nor for other people,
>> as far as I know). The last stable kernel was 4.9.13-22.
>>
>> Since 4.9.25-26 I often get:
>> on 3 Supermicro servers (different generations):
>> - memory allocation errors on Dom0 and corresponding lost page writes
>>     due to buffer I/O errors on PV guests
>> - after such a memory allocation error on dom0 I have also spotted:
>>     - NFS client hangups on guests (server not responding, still trying
>>       => server OK)
>>     - iptables lockups on PV guest reboot
>>
>> on 1 Supermicro server:
>> - memory allocation errors on Dom0 and SATA lockups (many, if not all,
>>     SATA channels at once):
>>     exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
>>     hard resetting link
>>     failed to IDENTIFY (I/O error, err_mask=0x4)
>>     then: blk_update_request: I/O error, dev sd., sector ....
>>
>>
>> All of these machines have been tested with memtest; no memory problems
>> were detected.
>> None of this occurs when I boot 4.9.13-22.
>> Most of my guests are CentOS 6 x86_64, bridged.
>>
>> Has anyone had such problems, or dealt with them somehow?
>>
>>
>> Since spotting these errors I have done many tests, compiling and testing to
>> pinpoint a single code change (kernel version, patch) - no conclusions yet.
>>
>> But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size and
>> configuration.
>> 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been
>> compiled into the kernel; here is a shortened but significant list:
>> - iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
>> - SATA_AHCI
>> - ATA_AHCI (PATA, what the heck?)
>> - FBDEV_FRONTEND
>> - HID_MAGICMOUSE
>> - HID_NTRIG
>> - USB_XHCI
>> - INTEL_SMARTCONNECT
>>
> Modules that are not loaded are not used.  A module has no impact at all on
> performance or compatibility unless it is used.  If you take an lsmod of
> the kernel that works and one of the kernel with issues, we can see if
> there are LOADED modules that might cause issues.
> 
> The modules that are built are the same as in Fedora and, for modules
> present in the RHEL 7 kernel, the same as in RHEL 7.
> 
> We did troubleshoot and turn off some things recently; one thing in
> particular was CONFIG_IO_STRICT_DEVMEM, which is on in Fedora but
> off in some other distros, and which causes issues with iSCSI and some
> other things.
> 
> We also added some specific Xen patches: one for the netback queue, one for
> APIC, one for nested dom0.  Upstream has also added several Xen
> patches since 4.9.13.

There are several very important patches in this kernel for Xen (for
example):

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.36



Piotr Gackiewicz

2017-Jul-20 10:31 UTC

[CentOS-virt] kernel-4.9.37-29.el7 (and el6)

On Wed, 19 Jul 2017, Johnny Hughes wrote:
> On 07/19/2017 09:23 AM, Johnny Hughes wrote:
>> On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
>>> On Mon, 17 Jul 2017, Johnny Hughes wrote:
>>>
>>>> Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6,
>>>> with the one config file change) working for everyone:
>>>>
>>>> (turn off: CONFIG_IO_STRICT_DEVMEM)
>>>
>>> Hello.
>>> Maybe it's not the most appropriate thread or time, but I have been
>>> signalling it before:
>>>
>>> 4.9.* kernels do not work well for me any more (nor for other people,
>>> as far as I know). The last stable kernel was 4.9.13-22.

I think I have nailed down the faulty combo.
My tests showed that the SLUB allocator does not work well in Xen Dom0, on top
of the Xen hypervisor.
It fails on at least one of my test servers (old AMD K8, 1 processor,
1 core, only 1 paravirt guest).
If the kernel with SLUB is booted natively (without the Xen hypervisor), it
works well.
If it is booted as a Xen hypervisor module, it almost instantly gets a page
allocation failure.


The kernel config was changed from SLAB to SLUB starting with 4.9.25. That is
when problems started to explode in my production environment, and on the
test server mentioned above.

After recompiling the recent 4.9.34 with SLAB, everything works well on that
test machine.
I will try to test 4.9.38 with the same config on my production servers.
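
A config change such as the SLAB to SLUB switch, or a driver going from
module to built-in, can be spotted by diffing the two kernel config files.
The following is a rough sketch under stated assumptions: the function names
are made up for illustration, and the input is assumed to be the text of
files such as /boot/config-<version> for each kernel.

```python
def parse_kernel_config(text):
    """Map each CONFIG_ option in a kernel .config text to its value.

    Options written as "# CONFIG_FOO is not set" are recorded as "n".
    """
    opts = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def changed_options(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ.

    An option missing from one file is treated as "n" (an assumption made
    here to keep the comparison simple).
    """
    old = parse_kernel_config(old_text)
    new = parse_kernel_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "n")
        after = new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes
```

Filtering the result for pairs like ("m", "y") would list what moved from
module to built-in, and looking up CONFIG_SLAB and CONFIG_SLUB would show
which allocator each kernel was built with.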

Moreover, digging into logs of memory allocation failures on my production
supermicro servers resulted in some interesting findings:

Jul  9 05:02:47 xen kernel: [3040088.089379] gzip: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
Jul 10 12:18:01 xen kernel: [3152495.802565] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.815871] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:18:01 xen kernel: [3152495.816826] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.832477] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:20:20 xen kernel: [3152635.070680] 1.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:20:20 xen kernel: [3152635.083952] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 12 09:15:15 xen kernel: [118420.343615] 10.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 12 09:15:15 xen kernel: [118420.359779] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)

What is node "-1" ?
8-/

I think it should be reported to Xen and/or SLUB developers.
I suggest releasing new Xen kernels with SLAB, until the issue is resolved.
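
For anyone wanting to check their own logs for the same pattern, the
failures can be tallied per task. This is a minimal sketch: the regular
expression is derived only from the log entries quoted above, it assumes
each entry sits on a single line as it would in a syslog file, and the
names `ALLOC_FAILURE` and `summarize_failures` are made up for illustration.

```python
import re
from collections import Counter

# Matches entries such as:
#   ... kernel: [3152495.802565] 2.xvda5-0: page allocation failure:
#   order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
# when the whole entry is on one line.
ALLOC_FAILURE = re.compile(
    r"kernel: \[\s*\d+\.\d+\] (?P<task>.+?): page allocation failure: "
    r"order:(?P<order>\d+), mode:0x[0-9a-fA-F]+\((?P<flags>[^)]*)\)"
)


def summarize_failures(log_text):
    """Count page allocation failures per (task, order, gfp flags)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ALLOC_FAILURE.search(line)
        if match:
            key = (match.group("task"), int(match.group("order")),
                   match.group("flags"))
            counts[key] += 1
    return counts
```

Calling `summarize_failures(open("/var/log/messages").read())` on Dom0 (the
log path is an assumption and depends on the syslog setup) gives a quick
view of which tasks fail and with which allocation flags.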

Regards,

-- 
Piotr Gackiewicz
Intertele S.A. - operator systemów ITL.PL i DOMENY.ITL.PL
al. T. Rejtana 10, 35-310 Rzeszów
TEL: +48 17 8507580, FAX: +48 17 8520275

http://www.itl.pl       - niezawodne usługi hostingowe
http://domeny.itl.pl    - tanie domeny internetowe
http://www.intertele.pl
