flight 18851 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-rhel6hvm-amd    7 redhat-install  fail REGR. vs. 18778
 test-amd64-i386-pv              7 debian-install  fail REGR. vs. 18778
 test-amd64-i386-xl-multivcpu    7 debian-install  fail REGR. vs. 18778

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-pcipt-intel             9 guest-start       fail never pass
 test-amd64-amd64-xl-qemuu-win7-amd64       13 guest-stop        fail never pass
 test-amd64-i386-xl-winxpsp3-vcpus1         13 guest-stop        fail never pass
 test-amd64-i386-xend-winxpsp3              16 leak-check/check  fail never pass
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1   13 guest-stop        fail never pass
 test-amd64-amd64-xl-qemuu-winxpsp3         13 guest-stop        fail never pass
 test-amd64-i386-xend-qemut-winxpsp3        16 leak-check/check  fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64       13 guest-stop        fail never pass
 test-amd64-amd64-xl-qemut-winxpsp3         13 guest-stop        fail never pass
 test-amd64-amd64-xl-winxpsp3               13 guest-stop        fail never pass
 test-amd64-amd64-xl-win7-amd64             13 guest-stop        fail never pass
 test-amd64-i386-xl-qemut-win7-amd64        13 guest-stop        fail never pass
 test-amd64-i386-xl-win7-amd64              13 guest-stop        fail never pass

version targeted for testing:
 xen                  fb3f1c1855bd9aca625bc0d040be4cdcc216e958
baseline version:
 xen                  8a7769b4453168e23e8935a85e9a875ef5117253

------------------------------------------------------------
People who touched revisions under test:
  Andrew Cooper <andrew.cooper3@citrix.com>
  Ian Campbell <ian.campbell@citrix.com>
  Ian Campbell <ijc@hellion.org.uk>
  Ian Jackson <ian.jackson@eu.citrix.com>
  Jaeyong Yoo <jaeyong.yoo@samsung.com>
  Jan Beulich <jbeulich@suse.com>
  Julien Grall <julien.grall@linaro.org>
  Keir Fraser <keir@xen.org>
  Matt Wilson <msw@amazon.com>
  Sander Eikelenboom <linux@eikelenboom.it>
  Suravee Suthikulpanit <suravee.suthikulapanit@amd.com>
  Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
  Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
------------------------------------------------------------

jobs:
 build-amd64                                  pass
 build-armhf                                  pass
 build-i386                                   pass
 build-amd64-oldkern                          pass
 build-i386-oldkern                           pass
 build-amd64-pvops                            pass
 build-i386-pvops                             pass
 test-amd64-amd64-xl                          pass
 test-amd64-i386-xl                           pass
 test-amd64-i386-rhel6hvm-amd                 fail
 test-amd64-i386-qemut-rhel6hvm-amd           pass
 test-amd64-i386-qemuu-rhel6hvm-amd           pass
 test-amd64-amd64-xl-qemut-win7-amd64         fail
 test-amd64-i386-xl-qemut-win7-amd64          fail
 test-amd64-amd64-xl-qemuu-win7-amd64         fail
 test-amd64-amd64-xl-win7-amd64               fail
 test-amd64-i386-xl-win7-amd64                fail
 test-amd64-i386-xl-credit2                   pass
 test-amd64-amd64-xl-pcipt-intel              fail
 test-amd64-i386-rhel6hvm-intel               pass
 test-amd64-i386-qemut-rhel6hvm-intel         pass
 test-amd64-i386-qemuu-rhel6hvm-intel         pass
 test-amd64-i386-xl-multivcpu                 fail
 test-amd64-amd64-pair                        pass
 test-amd64-i386-pair                         pass
 test-amd64-amd64-xl-sedf-pin                 pass
 test-amd64-amd64-pv                          pass
 test-amd64-i386-pv                           fail
 test-amd64-amd64-xl-sedf                     pass
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1     fail
 test-amd64-i386-xl-winxpsp3-vcpus1           fail
 test-amd64-i386-xend-qemut-winxpsp3          fail
 test-amd64-amd64-xl-qemut-winxpsp3           fail
 test-amd64-amd64-xl-qemuu-winxpsp3           fail
 test-amd64-i386-xend-winxpsp3                fail
 test-amd64-amd64-xl-winxpsp3                 fail

------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
    http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
    http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary

Not pushing.

(No revision log; it would be 333 lines long.)

>>> On 29.08.13 at 21:18, xen.org <ian.jackson@eu.citrix.com> wrote:
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778
>  test-amd64-i386-pv 7 debian-install fail REGR. vs. 18778
>  test-amd64-i386-xl-multivcpu 7 debian-install fail REGR. vs. 18778

So these all appear to be timeouts of infrastructure operations for
which I have no immediate explanation. The only odd thing is

[ 12.719551] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 12.726458] IPv6: ADDRCONF(NETDEV_UP): xenbr0: link is not ready

in each of the respective woodlouse---var-log-dmesg files. Is
woodlouse suffering from a network connectivity issue, perhaps as a
result of the kernel update?

In any event, throughout the last runs it has - afaics - always been
woodlouse that had failures (and the stickiness of failed tests then
likely prevents them from ever getting a success elsewhere). So perhaps
worth trying to take woodlouse out of the pool temporarily?

Jan

xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778

I have had a bisection report about this:

  From: "xen.org" <osstest@woking.cam.xci-test.com>
  From: "xen.org" <ian.jackson@eu.citrix.com>
  X-rewrote-sender: osstest@woking.cam.xci-test.com
  Date: Mon, 02 Sep 2013 14:33:30 +0100

  branch xen-unstable
  xen branch xen-unstable
  job test-amd64-i386-qemut-rhel6hvm-amd
  test redhat-install

  Tree: linux git://xenbits.xen.org/linux-pvops.git
  Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
  Tree: qemu git://xenbits.xen.org/staging/qemu-xen-unstable.git
  Tree: qemuu git://xenbits.xen.org/staging/qemu-upstream-unstable.git
  Tree: xen git://xenbits.xen.org/xen.git

  *** Found and reproduced problem changeset ***

  Bug is in tree: linux git://xenbits.xen.org/linux-pvops.git
  Bug introduced: 8bf3379a74bc9132751bfa685bad2da318fd59d7
  Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00

  commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
  Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Date: Thu Aug 29 09:47:51 2013 -0700

      Linux 3.10.10

  [etc.]

The head commit there is a merge. The email contained all the log
messages in between those two, so bounced. (The bisector didn't
examine the other parent of the merge, I think because it wasn't an
ancestor of the baseline "good" revision.)

I'm not sure why my osstest push gate didn't catch this, but the
regression is indeed caused by the change from Jeremy's old tree to
Linux 3.10.y.

Ian.

Ian Jackson writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> > flight 18851 xen-unstable real [real]
> > http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
> >
> > Regressions :-(
> >
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> >  test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778

xen.org writes ("[xen-unstable test] 19006: regressions - trouble: broken/fail/pass"):
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-multivcpu 7 debian-install fail REGR. vs. 18778

I looked at this one from 19006. The system is running under Xen but
has no guests. It shows a wget process running. I can't easily tell
whether it has hung, but there are no other signs of trouble in the
logs.

The tester was able to ssh in and get process listings and so forth so
it must be that
 (a) just the debootstrap stuff has hung
 (b) trying to ssh in to collect logs unwedged it
 (c) the problem is actually poor performance, not a hang.

The system allows 2ks for a debootstrap, which should be ample (given
that there's a local mirror). So I think it's probably a performance
regression.

I will try to repro this tomorrow.

Ian.

>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> *** Found and reproduced problem changeset ***
>
> Bug is in tree: linux git://xenbits.xen.org/linux-pvops.git
> Bug introduced: 8bf3379a74bc9132751bfa685bad2da318fd59d7
> Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00
>
>
> commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
> Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date: Thu Aug 29 09:47:51 2013 -0700
>
> Linux 3.10.10
>
> [etc.]
>
> The head commit there is a merge. The email contained all the log
> messages in between those two, so bounced. (The bisector didn't
> examine the other parent of the merge, I think because it wasn't an
> ancestor of the baseline "good" revision.)
>
> I'm not sure why my osstest push gate didn't catch this, but the
> regression is indeed caused by the change from Jeremy's old tree to
> Linux 3.10.y.

So how do we want to deal with that? Linux maintainers - any
chance you could help out? The staging tree having been stuck
for over a week is certainly less than ideal...

Jan

Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
...
> > I'm not sure why my osstest push gate didn't catch this, but the
> > regression is indeed caused by the change from Jeremy's old tree to
> > Linux 3.10.y.

It appears that the push gate didn't catch it because it's host
specific, and it got lucky and didn't run a test on that host.

> So how do we want to deal with that? Linux maintainers - any
> chance you could help out? The staging tree having been stuck
> for over a week is certainly less than ideal...

David Vrabel pointed out that more modern kernels have a different
interpretation of things like "dom0_mem=256M", and can waste lots and
lots of actual memory on pointless bookkeeping for future expansion
(which the kernel envisages but we do not).

I have changed it to "dom0_mem=256M,max:256M". I got a push of this
change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
test runs yet reported have used this change.

...

I have just checked the database and flights 19046 onwards are using
this new command-line option. None of them have reported yet. In
fact due to the backlog the system is rather clogged with runs using
the old osstest. I'm going to manually kill those.

Ian.
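
[Editorial illustration] The difference between the two dom0_mem settings
mentioned above can be shown with a small sketch. This is neither Xen nor
osstest code: parse_dom0_mem and parse_size are invented names, and the
parser is deliberately simplified (only the "<size>", "min:<size>" and
"max:<size>" forms with K/M/G suffixes). It only shows that a bare size
states no upper bound, which, as described above, is the case in which a
newer kernel may set aside bookkeeping for future expansion.

```python
# Simplified illustration only; NOT the real Xen command-line parser.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '256M' to bytes (K/M/G suffixes only)."""
    suffix = text[-1].upper()
    if suffix not in UNITS:
        # Assumption: this sketch refuses anything without a K/M/G suffix.
        raise ValueError("unsupported size: %r" % text)
    return int(text[:-1]) * UNITS[suffix]


def parse_dom0_mem(value):
    """Return the initial, min and max sizes named in a dom0_mem value."""
    result = {"initial": None, "min": None, "max": None}
    for item in value.split(","):
        if item.startswith("min:"):
            result["min"] = parse_size(item[4:])
        elif item.startswith("max:"):
            result["max"] = parse_size(item[4:])
        else:
            result["initial"] = parse_size(item)
    return result


old = parse_dom0_mem("256M")
new = parse_dom0_mem("256M,max:256M")
print(old["max"])  # None: no upper bound was stated
print(new["max"])  # 268435456: growth is capped at the initial size
```

With the second form the initial size and the maximum coincide, so there
is nothing left for the kernel to plan expansion into.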

On 04/09/13 11:41, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> ...
>>> I'm not sure why my osstest push gate didn't catch this, but the
>>> regression is indeed caused by the change from Jeremy's old tree to
>>> Linux 3.10.y.
>
> It appears that the push gate didn't catch it because it's host
> specific, and it got lucky and didn't run a test on that host.
>
>> So how do we want to deal with that? Linux maintainers - any
>> chance you could help out? The staging tree having been stuck
>> for over a week is certainly less than ideal...
>
> David Vrabel pointed out that more modern kernels have a different
> interpretation of things like "dom0_mem=256M", and can waste lots and
> lots of actual memory on pointless bookkeeping for future expansion
> (which the kernel envisages but we do not).
>
> I have changed it to "dom0_mem=256M,max:256M". I got a push of this
> change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
> test runs yet reported have used this change.

Woodlouse's e820 as seen by the kernel looks like:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
[ 0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
[ 0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
[ 0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
[ 0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
[ 0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
[ 0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
[ 0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
[ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[ 0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
[ 0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
[ 0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
[ 0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved

That last reserved entry I think confuses the early setup and it does
odd things like:

[ 0.000000] Set 266338518 page(s) to 1-1 mapping

Possibly relevant kernel thread here:

http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html

I note that the e820 as seen by Xen does not have this reserved region

(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009a800 (usable)
(XEN) 000000000009a800 - 00000000000a0000 (reserved)
(XEN) 00000000000e6000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000d7f90000 (usable)
(XEN) 00000000d7f9e000 - 00000000d7fa0000 type 9
(XEN) 00000000d7fa0000 - 00000000d7fae000 (ACPI data)
(XEN) 00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
(XEN) 00000000d7fe0000 - 00000000d7fee000 (reserved)
(XEN) 00000000d7ff0000 - 00000000d8000000 (reserved)
(XEN) 00000000e0000000 - 00000000f0000000 (reserved)
(XEN) 00000000fec00000 - 00000000fec03000 (reserved)
(XEN) 00000000fee00000 - 00000000fee01000 (reserved)
(XEN) 00000000ff700000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000228000000 (usable)

So it must be being added by Xen?

David
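
[Editorial illustration] The arithmetic behind the "1-1 mapping" figure can
be checked from the kernel's view of the e820 map quoted above. The sketch
below rests on an assumption the thread does not confirm: that the kernel
counts every page below the end of the map which is not covered by a
usable or unusable entry. The function name non_ram_pages is invented for
this example. Under that assumption the numbers reproduce the logged
value, which is consistent with David's reading that the high reserved
entry dominates the count, but it is not a statement about the kernel's
actual code path.

```python
PAGE = 4096

# (start, inclusive end, type), copied from the kernel-side map in the message.
KERNEL_E820 = [
    (0x0000000000000000, 0x0000000000099fff, "usable"),
    (0x000000000009a800, 0x00000000000fffff, "reserved"),
    (0x0000000000100000, 0x00000000d7f8ffff, "usable"),
    (0x00000000d7f9e000, 0x00000000d7f9ffff, "type 9"),
    (0x00000000d7fa0000, 0x00000000d7fadfff, "ACPI data"),
    (0x00000000d7fae000, 0x00000000d7fdffff, "ACPI NVS"),
    (0x00000000d7fe0000, 0x00000000d7fedfff, "reserved"),
    (0x00000000d7ff0000, 0x00000000d7ffffff, "reserved"),
    (0x00000000e0000000, 0x00000000efffffff, "reserved"),
    (0x00000000fec00000, 0x00000000fec02fff, "reserved"),
    (0x00000000fee00000, 0x00000000feefffff, "reserved"),
    (0x00000000ff700000, 0x00000000ffffffff, "reserved"),
    (0x0000000100000000, 0x00000001884d1fff, "usable"),
    (0x00000001884d2000, 0x0000000227ffffff, "unusable"),
    (0x000000fd00000000, 0x000000ffffffffff, "reserved"),
]

# Assumption: these two types are the ones treated as RAM for this count.
RAM_TYPES = {"usable", "unusable"}


def non_ram_pages(e820):
    """Pages below the end of the map that no usable/unusable entry covers."""
    top_pfn = (max(end for _, end, _ in e820) + 1) // PAGE
    ram = 0
    for start, end, kind in e820:
        if kind in RAM_TYPES:
            ram += (end + 1) // PAGE - start // PAGE
    return top_pfn - ram


with_top_entry = non_ram_pages(KERNEL_E820)
without_top_entry = non_ram_pages(KERNEL_E820[:-1])
print(with_top_entry)     # 266338518, the figure in the kernel log
print(without_top_entry)  # 164054
```

Dropping only the last reserved entry shrinks the count from roughly 266
million pages to about 164 thousand, which is why that entry stands out.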

>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
> Woodlouse's e820 as seen by the kernel looks like:
>
> [ 0.000000] e820: BIOS-provided physical RAM map:
> ...
> [ 0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
> [ 0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
> [ 0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
>
> That last reserved entry I think confuses the early setup and it does
> odd things like:
>
> [ 0.000000] Set 266338518 page(s) to 1-1 mapping
>
> Possibly relevant kernel thread here:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html
>
> I note that the e820 as seen by Xen does not have this reserved region
>
> (XEN) Xen-e820 RAM map:
> ...
> (XEN) 00000000ff700000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 0000000228000000 (usable)
>
> So it must be being added by Xen?

Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
address range"). But that's the case on all AMD systems, and
I thought it wasn't just woodlouse that's an AMD one - Ian?

In any event - how can the kernel side code make _any_
assumptions on what is or is not in the E820 table? I've
recently seen logs from a system where reserved (MMIO)
blocks appear right below the 1Tb (or maybe it was even 16Tb)
boundary, without Xen inserting them.

I would certainly be willing to revert that patch for the time
being if we have reasons to believe this helps, but only as long
as it is clear that the kernel needs fixing, and that I'll want this
back before 4.4 goes out. Do we have baseline (8a7769b4)
test results including the new kernel, with part of it run on
woodlouse?

Jan

On 05/09/13 13:20, Jan Beulich wrote:
>>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
>> ...
>> That last reserved entry I think confuses the early setup and it does
>> odd things like:
>>
>> [ 0.000000] Set 266338518 page(s) to 1-1 mapping
>> ...
>> So it must be being added by Xen?
>
> Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
> address range"). But that's the case on all AMD systems, and
> I thought it wasn't just woodlouse that's an AMD one - Ian?
>
> In any event - how can the kernel side code make _any_
> assumptions on what is or is not in the E820 table? I've
> recently seen logs from a system where reserved (MMIO)
> blocks appear right below the 1Tb (or maybe it was even 16Tb)
> boundary, without Xen inserting them.
>
> I would certainly be willing to revert that patch for the time
> being if we have reasons to believe this helps, but only as long
> as it is clear that the kernel needs fixing, and that I'll want this
> back before 4.4 goes out. Do we have baseline (8a7769b4)
> test results including the new kernel, with part of it run on
> woodlouse?

This looks like a red herring. Having poked about in woodlouse it looks
like something is screwy with interrupts. The tg3 cards aren't using
MSI and the USB controller is using edge not level handlers. Another
machine with the same chipset is happily using MSIs.

Malcolm (Cc) has some suggestions for things to try.

David

Ian Jackson
2013-Sep-06 10:38 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> This looks like a red herring. Having poked about in woodlouse it looks
> like something is screwy with interrupts. The tg3 cards aren't using
> MSI and the USB controller is using edge not level handlers. Another
> machine with the same chipset is happily using MSIs.

I did the following tests overnight:

 * 3.4.60 kernel:

   Pass! [adhoc flight 19081]

 * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
     Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
     Date: Wed Sep 04 21:54:01 BST 2013
     Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>

   Fail as before (in this case, timeout in debootstrap trying to
   install a guest). [adhoc flight 19082]

 * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"

   Total boot failure. SATA controller complaining bitterly about
   lost interrupts. [adhoc flight 19085]

I also took woodlouse out of the main test pool, which is how we got a
push of 4.2. I'm going to put it back now, and make a change to
switch to Linux 3.4.y for general tests.

I think this gets the 3.10.y problem off the critical path for
everything else but of course we should still fix it. I will leave
the 3.10.y push gate in place.

Ian.

Jan Beulich
2013-Sep-06 10:49 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
>>> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring. Having poked about in woodlouse it looks
>> like something is screwy with interrupts. The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers. Another
>> machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]
>
> * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
> Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
> Date: Wed Sep 04 21:54:01 BST 2013
> Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
>
> Fail as before (in this case, timeout in debootstrap trying to
> install a guest). [adhoc flight 19082]
>
> * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
>
> Total boot failure. SATA controller complaining bitterly about
> lost interrupts. [adhoc flight 19085]
>
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2. I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.

For -unstable this also resulted in just a single remaining test failure
(test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
which appears to be the result of the migration, after the
first few thousand pages, seeing a rapid decrease of speed
(which then likely causes that timeout). I couldn't spot anything
in the logs that would explain this though. But I did notice that
in two of the three runs there was no xend.log captured on
the source host in the first place - is there an explanation for
this?

In any event I'm going to take these almost-pushes as a
"good enough" sign to pull over the two or three commits into
the stable branches, in the expectation that we should be
able to get a push there over the weekend, and then release
early next week.

Looking through the logs of *-mite it also seems like you gave
3.11 a try, hitting a BUG() in balloon.c.

Jan

David Vrabel
2013-Sep-06 10:58 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On 06/09/13 11:38, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring. Having poked about in woodlouse it looks
>> like something is screwy with interrupts. The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers. Another
>> machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]

Where are the logs for this run?

I tried:
http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/

David

Ian Jackson
2013-Sep-06 11:06 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> For -unstable this also resulted in just a single left test failure
> (test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
> which appears to be the result of the migration, after the
> first few thousand pages, seeing a rapid decrease of speed
> (which then likely causes that timeout). I couldn't spot anything
> in the logs that would explain this though. But I did notice that
> in two of the three runs there was not xend.log captured on
> the source host in the first place - is there an explanation for
> this?

Looking at the logs-capture log, it appears that itch-mite was totally
unresponsive by then. The log capture script decided to power cycle
it. After having done that, xend wasn't running. Due to a bug in the
script it didn't retry the log capture.

> In any event I'm going to take these almost-pushes as a
> "good enough" sign to pull over the two or three commits into
> the stable branches, in the expectation that we should be
> able to get a push there over the weekend, and then release
> early next week.

OK.

> Looking through the logs of *-mite it also seems like you gave
> 3.11 a try, hitting a BUG() in balloon.c.

That'll be the "linux-linus" test, which isn't doing very well.

Ian.

Ian Jackson
2013-Sep-06 11:50 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
David Vrabel writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06/09/13 11:38, Ian Jackson wrote:
> > I did the following tests overnight:
> >
> > * 3.4.60 kernel:
> >
> > Pass! [adhoc flight 19081]
>
> Where are the logs for this run?
>
> I tried:
> http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/

It doesn't automatically publish the logs of adhoc flights. I have
just done this now (for all three I mentioned).

Ian.

Konrad Rzeszutek Wilk
2013-Sep-06 12:49 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On Fri, Sep 06, 2013 at 12:06:38PM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> > On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> > For -unstable this also resulted in just a single left test failure
> > (test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
> > which appears to be the result of the migration, after the
> > first few thousand pages, seeing a rapid decrease of speed
> > (which then likely causes that timeout). I couldn't spot anything
> > in the logs that would explain this though. But I did notice that
> > in two of the three runs there was not xend.log captured on
> > the source host in the first place - is there an explanation for
> > this?
>
> Looking at the logs-capture log, it appears that itch-mite was totally
> unresponsive by then. The log capture script decided to power cycle
> it. After having done that, xend wasn't running. Due to a bug in the
> script it didn't retry the log capture.
>
> > In any event I'm going to take these almost-pushes as a
> > "good enough" sign to pull over the two or three commits into
> > the stable branches, in the expectation that we should be
> > able to get a push there over the weekend, and then release
> > early next week.
>
> OK.
>
> > Looking through the logs of *-mite it also seems like you gave
> > 3.11 a try, hitting a BUG() in balloon.c.
>
> That'll be the "linux-linus" test, which isn't doing very well.

I think Boris has a patch that fixes the regression.

>
> Ian.

Konrad Rzeszutek Wilk
2013-Sep-06 12:57 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On Fri, Sep 06, 2013 at 11:38:42AM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> > This looks like a red herring. Having poked about in woodlouse it looks
> > like something is screwy with interrupts. The tg3 cards aren't using
> > MSI and the USB controller is using edge not level handlers. Another
> > machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]
>
> * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
> Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
> Date: Wed Sep 04 21:54:01 BST 2013
> Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
>
> Fail as before (in this case, timeout in debootstrap trying to
> install a guest). [adhoc flight 19082]
>
> * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
>
> Total boot failure. SATA controller complaining bitterly about
> lost interrupts. [adhoc flight 19085]

Somebody (Andrew? David?) took a look at the box and found that the
MSIs were all out of whack. I guess with the 'noacpi' parameter the
thinking is that the ACPI _PRT are out of whack with the more modern
kernels?

I am not that familiar with oss-test - but is each of the set of boxes
running a different version of the hypervisor? Meaning you don't
randomly install from scratch a new version of a hypervisor on different
boxes?

Thanks!

>
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2. I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.
>
> I think this gets the 3.10.y problem off the critical path for
> everything else but of course we should still fix it. I will leave
> the 3.10.y push gate in place.

Aye. Is this issue (network incredibly slow) only surfacing on this box?
No - I thought I saw the issue on gall and lice with the upstream Linux?
Are those two machines the same as woodlouse?

>
> Ian.

Ian Jackson
2013-Sep-06 13:34 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Konrad Rzeszutek Wilk writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> I am not that familiar with oss-test - but is each of the set of boxes
> running a different version of the hypervisor? Meaning you don't
> randomly install from scratch a new version of a hypervisor on different
> boxes?

No, each test is of a specific version of the hypervisor, a specific
version of the kernel, etc.

For each test the tester will pick a machine from the test pool. The
scheduling algorithm tries to pick a machine which has not recently
run this test, unless the test failed most recently, in which case it
tries to pick (the) one it failed on.

Each test job involves a complete wipe of the system, and then
installing a dom0 OS with the selected hypervisor and kernel.

> > I think this gets the 3.10.y problem off the critical path for
> > everything else but of course we should still fix it. I will leave
> > the 3.10.y push gate in place.
>
> Aye. Is this issue (network incredibly slow) only surfacing on this box?
> No - I thought I saw the issue on gall and lice with the upstream Linux?
> Are those two machines the same as woodlouse?

No, they are entirely different. This incredibly slow network issue
has only been seen on woodlouse. Most of the machines are in
identical pairs, but not woodlouse, sadly.

ian.
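
[Editorial illustration] The host selection rule described in this message
(prefer a machine that has not recently run the test, but return to the
machine of the most recent failure) can be sketched as below. This is not
osstest's scheduler: pick_host, the history format and the host names
"host-b" and "host-c" are all invented for the example, and real
scheduling presumably weighs other factors such as host availability.

```python
def pick_host(pool, history, test):
    """Choose a host for `test` (hypothetical sketch, not osstest code).

    pool:    list of host names that could run the test
    history: list of (test, host, passed) tuples, oldest first
    """
    runs = [(host, passed) for name, host, passed in history if name == test]
    if runs:
        last_host, last_passed = runs[-1]
        # Sticky failure: go back to the host of the most recent failure.
        if not last_passed and last_host in pool:
            return last_host
    # Otherwise prefer the host that ran this test least recently;
    # hosts that never ran it sort first.
    last_seen = {host: index for index, (host, _) in enumerate(runs)}
    return min(pool, key=lambda host: last_seen.get(host, -1))


pool = ["woodlouse", "host-b", "host-c"]
history = [("debian-install", "host-b", True)]
print(pick_host(pool, history, "debian-install"))  # woodlouse (never ran it)

history.append(("debian-install", "woodlouse", False))
print(pick_host(pool, history, "debian-install"))  # woodlouse (sticky failure)
```

The second call shows the stickiness noted earlier in the thread: once a
test has failed on one host, this rule keeps sending it back there until
it passes.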