On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
> On Mon, 17 Jul 2017, Johnny Hughes wrote:
>
>> Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6,
>> with the one config file change) working for everyone:
>>
>> (turn off: CONFIG_IO_STRICT_DEVMEM)
>
> Hello.
> Maybe it's not the most appropriate thread or time, but I have been
> signalling it before:
>
> 4.9.* kernels do not work well for me any more (nor for other people,
> as far as I know). Last stable kernel was 4.9.13-22.
>
> Since 4.9.25-26 I often get:
> on 3 supermicro servers (different generations):
> - memory allocation errors on Dom0 and corresponding lost page writes
>   due to buffer I/O error on PV guests
> - after such a memory allocation error on dom0 I have also spotted:
>   - NFS client hangups on guests (server not responding, still trying
>     => server OK)
>   - iptables lockups on PV guest reboot
>
> on 1 supermicro server:
> - memory allocation errors on Dom0 and SATA lockups (many, if not all,
>   SATA channels at once):
>     exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
>     hard resetting link
>     failed to IDENTIFY (I/O error, err_mask=0x4)
>   then: blk_update_request: I/O error, dev sd., sector ....
>
> All of these machines have been tested with memtest, no memory
> problems detected.
> No such things occur when I boot 4.9.13-22.
> Most of my guests are centos 6 x86_64, bridged.
>
> Has anyone had such problems, or dealt with them somehow?
>
> Since spotting these errors I have done many tests, compiled and tested to
> pinpoint a single code change (kernel version, patch) - no conclusions yet.
>
> But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size
> and configuration.
> 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been
> compiled into the kernel; here is a shortened, but significant list:
> - iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
> - SATA_AHCI
> - ATA_AHCI (PATA, what the heck?)
> - FBDEV_FRONTEND
> - HID_MAGICKMOUSE
> - HID_NTRIG
> - USB_XHCI
> - INTEL_SMARTCONNECT
>

Modules that are not loaded are not used. A module has no impact at all
on performance or compatibility unless it is used. If you take an lsmod
of the kernel that works and one of the kernel with issues, we can see
if there are LOADED modules that might cause issues.

The modules that are built are the same as Fedora and, if in the RHEL 7
kernel, RHEL 7.

We did troubleshoot and turn off some things recently. One thing in
particular was CONFIG_IO_STRICT_DEVMEM, which is on in Fedora but off in
some other distros, and which causes issues with iSCSI and some other
things.

We also added some specific Xen patches: one for netback queue, one for
apic, one for nested dom0. Upstream has also added several Xen patches
since 4.9.13.

And yes, we did change the kernel configs specifically to add in
iptables, as many people want them.

If you can point to problems with a specific module, we can discuss it
here and turn it off if necessary.
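The lsmod comparison requested in the reply above could be done along these lines. This is only a minimal sketch: it assumes the output of `lsmod` was saved to a text file on each kernel, and the function names (`loaded_modules`, `compare_lsmod`) and the sample module lists are made up for illustration.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names found in saved `lsmod` output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header line.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def compare_lsmod(good_text, bad_text):
    """List modules loaded on only one of the two kernels."""
    good = loaded_modules(good_text)
    bad = loaded_modules(bad_text)
    return {
        "only_good": sorted(good - bad),
        "only_bad": sorted(bad - good),
    }


# Illustrative input only; not taken from any of the affected machines.
good_sample = "Module  Size  Used by\nxt_tcpudp 16384 2\nahci 36864 3\n"
bad_sample = "Module  Size  Used by\nahci 36864 3\nxen_netback 57344 0\n"
print(compare_lsmod(good_sample, bad_sample))
```

One caveat: lsmod only lists loadable modules, so anything built into the kernel image (=y in the config) will not appear in either list.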
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
> We also added some specific Xen patches: one for netback queue, one for
> apic, one for nested dom0. Upstream has also added several Xen
> patches since 4.9.13.

There are several very important patches in this kernel for Xen (for
example):

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.36
On Wed, 19 Jul 2017, Johnny Hughes wrote:
> On 07/19/2017 09:23 AM, Johnny Hughes wrote:
>> On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
>>> 4.9.* kernels do not work well for me any more (nor for other people,
>>> as far as I know). Last stable kernel was 4.9.13-22.

I think I have nailed down the faulty combo.

My tests showed that the SLUB allocator does not work well in Xen Dom0,
on top of the Xen hypervisor. It fails on at least one of my testing
servers (old AMD K8 (1 proc, 1 core), only 1 paravirt guest).

If a kernel with SLUB is booted directly on the hardware (w/o the Xen
hypervisor), it works well. If booted as a Xen hypervisor module, it
almost instantly gets a page allocation failure.

SLAB=>SLUB was changed in the kernel config starting from 4.9.25. Then
problems started to explode in my production environment, and on the
testing server mentioned above.

After recompiling the recent 4.9.34 with SLAB, everything works well on
that testing machine. I will try to test 4.9.38 with the same config on
my production servers.
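To confirm which options differ between a working and a failing build (the SLAB=>SLUB switch, or modules that moved from =m to =y), the two kernel config files can be compared. The following is a minimal sketch, assuming both configs are available as text (on CentOS they are normally installed as /boot/config-<version>). The function names and the sample option values are hypothetical and only illustrate the idea.

```python
import re

# "CONFIG_FOO=y", "CONFIG_FOO=m", "CONFIG_FOO=123", ...
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
# "# CONFIG_FOO is not set"
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_* option to its value; unset options map to 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def config_changes(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        # An option missing from one file is treated as not set.
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


# Illustrative fragments only; real config files have thousands of lines.
old_cfg = "CONFIG_SLAB=y\n# CONFIG_SLUB is not set\nCONFIG_SATA_AHCI=m\n"
new_cfg = "# CONFIG_SLAB is not set\nCONFIG_SLUB=y\nCONFIG_SATA_AHCI=y\n"
for option, (before, after) in config_changes(old_cfg, new_cfg).items():
    print(f"{option}: {before} -> {after}")
```

A plain `diff` of the two files gives the same information; the point of parsing is that "not set" and "missing" are treated alike, which keeps the output short.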
Moreover, digging into the logs of memory allocation failures on my
production supermicro servers resulted in some interesting findings:

Jul 9 05:02:47 xen kernel: [3040088.089379] gzip: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
Jul 10 12:18:01 xen kernel: [3152495.802565] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.815871] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:18:01 xen kernel: [3152495.816826] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.832477] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:20:20 xen kernel: [3152635.070680] 1.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:20:20 xen kernel: [3152635.083952] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 12 09:15:15 xen kernel: [118420.343615] 10.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 12 09:15:15 xen kernel: [118420.359779] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)

What is node "-1"? 8-/

I think it should be reported to Xen and/or SLUB developers.
I suggest releasing new Xen kernels with SLAB until the issue is
resolved.

Regards,

-- 
Piotr Gackiewicz
Intertele S.A. - operator of the ITL.PL and DOMENY.ITL.PL systems
al. T. Rejtana 10, 35-310 Rzeszów
TEL: +48 17 8507580, FAX: +48 17 8520275
http://www.itl.pl - reliable hosting services
http://domeny.itl.pl - cheap internet domains
http://www.intertele.pl
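For anyone who wants to check their own dom0 logs for the same pattern, the failures can be tallied by allocation flags, by task, and by the node reported in the SLUB message. This is a minimal sketch under the assumption that the log lines look like the ones quoted above; the regular expressions and the `summarize` function are illustrative, not an official parser. As far as I know, node -1 in the SLUB message corresponds to the kernel's NUMA_NO_NODE constant, meaning the caller did not ask for a particular NUMA node, so the value by itself is probably not the anomaly.

```python
import re
from collections import Counter

# e.g. "[...] gzip: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)"
PAGE_FAIL = re.compile(
    r"\]\s+(?P<task>.+?): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)
# e.g. "SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)"
SLUB_FAIL = re.compile(
    r"SLUB: Unable to allocate memory on node (?P<node>-?\d+), "
    r"gfp=(?P<gfp>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)


def summarize(log_text):
    """Count page allocation failures by flags and task, SLUB failures by node."""
    by_flags, by_task, slub_nodes = Counter(), Counter(), Counter()
    for line in log_text.splitlines():
        m = PAGE_FAIL.search(line)
        if m:
            by_flags[m.group("flags")] += 1
            by_task[m.group("task")] += 1
            continue
        m = SLUB_FAIL.search(line)
        if m:
            slub_nodes[int(m.group("node"))] += 1
    return by_flags, by_task, slub_nodes


# Two of the lines quoted in the message above, used as sample input.
sample = (
    "Jul 9 05:02:47 xen kernel: [3040088.089379] gzip: page allocation "
    "failure: order:0, mode:0x2080020(GFP_ATOMIC)\n"
    "Jul 10 12:18:01 xen kernel: [3152495.815871] SLUB: Unable to allocate "
    "memory on node -1, gfp=0x2000000(GFP_NOWAIT)\n"
)
flags, tasks, nodes = summarize(sample)
print(dict(flags), dict(tasks), dict(nodes))
```

In the excerpt quoted above, every failure is order:0 (a single page) and all but one come from tasks named after a guest block device (N.xvda5-0), which fits the lost page writes seen on the PV guests.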