flight 18851 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
Regressions :-(
Tests which did not succeed and are blocking,
including tests which could not be run:
test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778
test-amd64-i386-pv 7 debian-install fail REGR. vs. 18778
test-amd64-i386-xl-multivcpu 7 debian-install fail REGR. vs. 18778
Tests which did not succeed, but are not blocking:
test-amd64-amd64-xl-pcipt-intel 9 guest-start fail never pass
test-amd64-amd64-xl-qemuu-win7-amd64 13 guest-stop fail never pass
test-amd64-i386-xl-winxpsp3-vcpus1 13 guest-stop fail never pass
test-amd64-i386-xend-winxpsp3 16 leak-check/check fail never pass
test-amd64-i386-xl-qemut-winxpsp3-vcpus1 13 guest-stop fail never pass
test-amd64-amd64-xl-qemuu-winxpsp3 13 guest-stop fail never pass
test-amd64-i386-xend-qemut-winxpsp3 16 leak-check/check fail never pass
test-amd64-amd64-xl-qemut-win7-amd64 13 guest-stop fail never pass
test-amd64-amd64-xl-qemut-winxpsp3 13 guest-stop fail never pass
test-amd64-amd64-xl-winxpsp3 13 guest-stop fail never pass
test-amd64-amd64-xl-win7-amd64 13 guest-stop fail never pass
test-amd64-i386-xl-qemut-win7-amd64 13 guest-stop fail never pass
test-amd64-i386-xl-win7-amd64 13 guest-stop fail never pass
version targeted for testing:
xen fb3f1c1855bd9aca625bc0d040be4cdcc216e958
baseline version:
xen 8a7769b4453168e23e8935a85e9a875ef5117253
------------------------------------------------------------
People who touched revisions under test:
Andrew Cooper <andrew.cooper3@citrix.com>
Ian Campbell <ian.campbell@citrix.com>
Ian Campbell <ijc@hellion.org.uk>
Ian Jackson <ian.jackson@eu.citrix.com>
Jaeyong Yoo <jaeyong.yoo@samsung.com>
Jan Beulich <jbeulich@suse.com>
Julien Grall <julien.grall@linaro.org>
Keir Fraser <keir@xen.org>
Matt Wilson <msw@amazon.com>
Sander Eikelenboom <linux@eikelenboom.it>
Suravee Suthikulpanit <suravee.suthikulapanit@amd.com>
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
------------------------------------------------------------
jobs:
build-amd64 pass
build-armhf pass
build-i386 pass
build-amd64-oldkern pass
build-i386-oldkern pass
build-amd64-pvops pass
build-i386-pvops pass
test-amd64-amd64-xl pass
test-amd64-i386-xl pass
test-amd64-i386-rhel6hvm-amd fail
test-amd64-i386-qemut-rhel6hvm-amd pass
test-amd64-i386-qemuu-rhel6hvm-amd pass
test-amd64-amd64-xl-qemut-win7-amd64 fail
test-amd64-i386-xl-qemut-win7-amd64 fail
test-amd64-amd64-xl-qemuu-win7-amd64 fail
test-amd64-amd64-xl-win7-amd64 fail
test-amd64-i386-xl-win7-amd64 fail
test-amd64-i386-xl-credit2 pass
test-amd64-amd64-xl-pcipt-intel fail
test-amd64-i386-rhel6hvm-intel pass
test-amd64-i386-qemut-rhel6hvm-intel pass
test-amd64-i386-qemuu-rhel6hvm-intel pass
test-amd64-i386-xl-multivcpu fail
test-amd64-amd64-pair pass
test-amd64-i386-pair pass
test-amd64-amd64-xl-sedf-pin pass
test-amd64-amd64-pv pass
test-amd64-i386-pv fail
test-amd64-amd64-xl-sedf pass
test-amd64-i386-xl-qemut-winxpsp3-vcpus1 fail
test-amd64-i386-xl-winxpsp3-vcpus1 fail
test-amd64-i386-xend-qemut-winxpsp3 fail
test-amd64-amd64-xl-qemut-winxpsp3 fail
test-amd64-amd64-xl-qemuu-winxpsp3 fail
test-amd64-i386-xend-winxpsp3 fail
test-amd64-amd64-xl-winxpsp3 fail
------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images
Logs, config files, etc. are available at
http://www.chiark.greenend.org.uk/~xensrcts/logs
Test harness code can be found at
http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary
Not pushing.
(No revision log; it would be 333 lines long.)
>>> On 29.08.13 at 21:18, xen.org <ian.jackson@eu.citrix.com> wrote:
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
> test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778
> test-amd64-i386-pv 7 debian-install fail REGR. vs. 18778
> test-amd64-i386-xl-multivcpu 7 debian-install fail REGR. vs. 18778

So these all appear to be timeouts of infrastructure operations that
don't have an immediate explanation to me. The only odd thing is

[ 12.719551] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 12.726458] IPv6: ADDRCONF(NETDEV_UP): xenbr0: link is not ready

in each of the respective woodlouse---var-log-dmesg files. Is woodlouse
suffering from a network connectivity issue, perhaps as a result of the
kernel update?

In any event, throughout the last runs it has - afaics - always been
woodlouse that had failures (and the stickiness of failed tests then
likely prevents them from ever getting a success elsewhere). So perhaps
worth trying to take woodlouse out of the pool temporarily?

Jan
xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
> test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778
I have had a bisection report about this:
From: "xen.org" <osstest@woking.cam.xci-test.com>
From: "xen.org" <ian.jackson@eu.citrix.com>
X-rewrote-sender: osstest@woking.cam.xci-test.com
Date: Mon, 02 Sep 2013 14:33:30 +0100
branch xen-unstable
xen branch xen-unstable
job test-amd64-i386-qemut-rhel6hvm-amd
test redhat-install
Tree: linux git://xenbits.xen.org/linux-pvops.git
Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
Tree: qemu git://xenbits.xen.org/staging/qemu-xen-unstable.git
Tree: qemuu git://xenbits.xen.org/staging/qemu-upstream-unstable.git
Tree: xen git://xenbits.xen.org/xen.git
*** Found and reproduced problem changeset ***
Bug is in tree: linux git://xenbits.xen.org/linux-pvops.git
Bug introduced: 8bf3379a74bc9132751bfa685bad2da318fd59d7
Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00
commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Thu Aug 29 09:47:51 2013 -0700
Linux 3.10.10
[etc.]
The head commit there is a merge. The email contained all the log
messages in between those two, so bounced. (The bisector didn't
examine the other parent of the merge, I think because it wasn't an
ancestor of the baseline "good" revision.)
I'm not sure why my osstest push gate didn't catch this, but the
regression is indeed caused by the change from Jeremy's old tree to
Linux 3.10.y.
Ian.
Ian Jackson writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> > flight 18851 xen-unstable real [real]
> > http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
> >
> > Regressions :-(
> >
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> > test-amd64-i386-rhel6hvm-amd 7 redhat-install fail REGR. vs. 18778
xen.org writes ("[xen-unstable test] 19006: regressions - trouble: broken/fail/pass"):
> Tests which did not succeed and are blocking,
> including tests which could not be run:
> test-amd64-i386-xl-multivcpu 7 debian-install fail REGR. vs. 18778
I looked at this one from 19006. The system is running under Xen but
has no guests. It shows a wget process running. I can't easily tell
whether it has hung, but there are no other signs of trouble in the
logs.
The tester was able to ssh in and get process listings and so forth, so
it must be that (a) just the debootstrap stuff has hung, (b) trying to
ssh in to collect logs unwedged it, or (c) the problem is actually poor
performance, not a hang.
The system allows 2ks (2000 seconds) for a debootstrap, which should be
ample (given that there's a local mirror).
So I think it's probably a performance regression. I will try to
repro this tomorrow.
Ian.
>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> *** Found and reproduced problem changeset ***
>
> Bug is in tree: linux git://xenbits.xen.org/linux-pvops.git
> Bug introduced: 8bf3379a74bc9132751bfa685bad2da318fd59d7
> Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00
>
>
> commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
> Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date: Thu Aug 29 09:47:51 2013 -0700
>
> Linux 3.10.10
>
> [etc.]
>
> The head commit there is a merge. The email contained all the log
> messages in between those two, so bounced. (The bisector didn't
> examine the other parent of the merge, I think because it wasn't an
> ancestor of the baseline "good" revision.)
>
> I'm not sure why my osstest push gate didn't catch this, but the
> regression is indeed caused by the change from Jeremy's old tree to
> Linux 3.10.y.

So how do we want to deal with that? Linux maintainers - any
chance you could help out? The staging tree having been stuck
for over a week is certainly less than ideal...

Jan
Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
...
> > I'm not sure why my osstest push gate didn't catch this, but the
> > regression is indeed caused by the change from Jeremy's old tree to
> > Linux 3.10.y.
It appears that the push gate didn't catch it because it's host
specific, and it got lucky and didn't run a test on that host.
> So how do we want to deal with that? Linux maintainers - any
> chance you could help out? The staging tree having been stuck
> for over a week is certainly less than ideal...
David Vrabel pointed out that more modern kernels have a different
interpretation of things like "dom0_mem=256M", and can waste lots and
lots of actual memory on pointless bookkeeping for future expansion
(which the kernel envisages but we do not).
I have changed it to "dom0_mem=256M,max:256M". I got a push of this
change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
test runs yet reported have used this change.
...
I have just checked the database and flights 19046 onwards are using
this new command-line option. None of them have reported yet. In
fact due to the backlog the system is rather clogged with runs using
the old osstest. I'm going to manually kill those.
Ian.
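The difference between the two option strings mentioned above can be made visible by splitting them into their parts. What follows is only an illustrative sketch: `parse_dom0_mem` is a hypothetical helper written for this note, it accepts only the plain amount plus `min:` and `max:` fields with a K, M or G suffix, and it says nothing about how Xen or the dom0 kernel act on the values.

```python
import re

# Binary size suffixes accepted by this sketch (bare numbers are rejected
# here to avoid guessing at a default unit).
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_dom0_mem(value):
    """Split a dom0_mem= value into 'amt', 'min' and 'max', in bytes.

    Fields that are absent from the string are reported as None.
    """
    parts = {"amt": None, "min": None, "max": None}
    for field in value.split(","):
        key, sep, size = field.partition(":")
        if not sep:
            # No "min:"/"max:" prefix, so this is the plain amount.
            key, size = "amt", field
        if key not in parts:
            raise ValueError("unknown field: %r" % field)
        match = re.fullmatch(r"(\d+)([KMG])", size)
        if match is None:
            raise ValueError("unsupported size: %r" % size)
        parts[key] = int(match.group(1)) * _UNITS[match.group(2)]
    return parts


if __name__ == "__main__":
    for option in ("256M", "256M,max:256M"):
        print(option, "->", parse_dom0_mem(option))
```

With the first string the `max` field stays unset, which is the case David Vrabel's remark is about; the second string pins it to the same 256M as the initial amount.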
On 04/09/13 11:41, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> ...
>>> I'm not sure why my osstest push gate didn't catch this, but the
>>> regression is indeed caused by the change from Jeremy's old tree to
>>> Linux 3.10.y.
>
> It appears that the push gate didn't catch it because it's host
> specific, and it got lucky and didn't run a test on that host.
>
>> So how do we want to deal with that? Linux maintainers - any
>> chance you could help out? The staging tree having been stuck
>> for over a week is certainly less than ideal...
>
> David Vrabel pointed out that more modern kernels have a different
> interpretation of things like "dom0_mem=256M", and can waste lots and
> lots of actual memory on pointless bookkeeping for future expansion
> (which the kernel envisages but we do not).
>
> I have changed it to "dom0_mem=256M,max:256M". I got a push of this
> change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
> test runs yet reported have used this change.

Woodlouse's e820 as seen by the kernel looks like:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
[ 0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
[ 0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
[ 0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
[ 0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
[ 0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
[ 0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
[ 0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
[ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[ 0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
[ 0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
[ 0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
[ 0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved

That last reserved entry I think confuses the early setup and it does
odd things like:

[ 0.000000] Set 266338518 page(s) to 1-1 mapping

Possibly relevant kernel thread here:

http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html

I note that the e820 as seen by Xen does not have this reserved region

(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009a800 (usable)
(XEN) 000000000009a800 - 00000000000a0000 (reserved)
(XEN) 00000000000e6000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000d7f90000 (usable)
(XEN) 00000000d7f9e000 - 00000000d7fa0000 type 9
(XEN) 00000000d7fa0000 - 00000000d7fae000 (ACPI data)
(XEN) 00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
(XEN) 00000000d7fe0000 - 00000000d7fee000 (reserved)
(XEN) 00000000d7ff0000 - 00000000d8000000 (reserved)
(XEN) 00000000e0000000 - 00000000f0000000 (reserved)
(XEN) 00000000fec00000 - 00000000fec03000 (reserved)
(XEN) 00000000fee00000 - 00000000fee01000 (reserved)
(XEN) 00000000ff700000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000228000000 (usable)

So it must be being added by Xen?

David
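As a rough cross-check of the numbers in David's message, the page count in the "Set ... page(s) to 1-1 mapping" line can be compared against the kernel-side e820 map he quotes. The sketch below is plain arithmetic on those dmesg lines, assuming 4 KiB pages. It is not a reconstruction of what the kernel's early setup code does, and the helper names (`parse_e820`, `pages_of`, `non_ram_pfns`) are made up for this example.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Kernel-side map copied from the dmesg excerpt in the message above.
KERNEL_E820 = """\
Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
"""


def parse_e820(text):
    """Return (start, end_exclusive, kind) tuples from dmesg-style lines."""
    pattern = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] (.+)")
    return [
        (int(m.group(1), 16), int(m.group(2), 16) + 1, m.group(3).strip())
        for m in pattern.finditer(text)
    ]


def pages_of(regions, kind):
    """Number of whole pages covered by regions of the given kind."""
    return sum((end - start) // PAGE_SIZE
               for start, end, k in regions if k == kind)


def non_ram_pfns(regions):
    """Page frames below the top of the map that are neither usable nor unusable."""
    top_pfn = max(end for _, end, _ in regions) // PAGE_SIZE
    return top_pfn - pages_of(regions, "usable") - pages_of(regions, "unusable")


if __name__ == "__main__":
    regions = parse_e820(KERNEL_E820)
    print("with the last reserved entry:   ", non_ram_pfns(regions))
    print("without the last reserved entry:", non_ram_pfns(regions[:-1]))
```

With the figures above this comes to 266338518 when the final reserved entry is included, which matches the logged count, and 164054 when the map is cut off before it. The match suggests that the reserved region just below 1 TiB is what inflates the logged number, though the subtraction alone does not show why.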
>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
> On 04/09/13 11:41, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>> ...
>>>> I'm not sure why my osstest push gate didn't catch this, but the
>>>> regression is indeed caused by the change from Jeremy's old tree to
>>>> Linux 3.10.y.
>>
>> It appears that the push gate didn't catch it because it's host
>> specific, and it got lucky and didn't run a test on that host.
>>
>>> So how do we want to deal with that? Linux maintainers - any
>>> chance you could help out? The staging tree having been stuck
>>> for over a week is certainly less than ideal...
>>
>> David Vrabel pointed out that more modern kernels have a different
>> interpretation of things like "dom0_mem=256M", and can waste lots and
>> lots of actual memory on pointless bookkeeping for future expansion
>> (which the kernel envisages but we do not).
>>
>> I have changed it to "dom0_mem=256M,max:256M". I got a push of this
>> change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
>> test runs yet reported have used this change.
>
> Woodlouse's e820 as seen by the kernel looks like:
>
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
> [ 0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
> [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
> [ 0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
> [ 0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
> [ 0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
> [ 0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
> [ 0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
> [ 0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
> [ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
> [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
> [ 0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
> [ 0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
> [ 0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
> [ 0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
>
> That last reserved entry I think confuses the early setup and it does
> odd things like:
>
> [ 0.000000] Set 266338518 page(s) to 1-1 mapping
>
> Possibly relevant kernel thread here:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html
>
> I note that the e820 as seen by Xen does not have this reserved region
>
> (XEN) Xen-e820 RAM map:
> (XEN) 0000000000000000 - 000000000009a800 (usable)
> (XEN) 000000000009a800 - 00000000000a0000 (reserved)
> (XEN) 00000000000e6000 - 0000000000100000 (reserved)
> (XEN) 0000000000100000 - 00000000d7f90000 (usable)
> (XEN) 00000000d7f9e000 - 00000000d7fa0000 type 9
> (XEN) 00000000d7fa0000 - 00000000d7fae000 (ACPI data)
> (XEN) 00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
> (XEN) 00000000d7fe0000 - 00000000d7fee000 (reserved)
> (XEN) 00000000d7ff0000 - 00000000d8000000 (reserved)
> (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
> (XEN) 00000000fec00000 - 00000000fec03000 (reserved)
> (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
> (XEN) 00000000ff700000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 0000000228000000 (usable)
>
> So it must be being added by Xen?

Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
address range"). But that's the case on all AMD systems, and
I thought it wasn't just woodlouse that's an AMD one - Ian?

In any event - how can the kernel side code make _any_
assumptions on what is or is not in the E820 table? I've
recently seen logs from a system where reserved (MMIO)
blocks appear right below the 1Tb (or maybe it was even 16Tb)
boundary, without Xen inserting them.

I would certainly be willing to revert that patch for the time
being if we have reasons to believe this helps, but only as long
as it is clear that the kernel needs fixing, and that I'll want this
back before 4.4 goes out. Do we have baseline (8a7769b4)
test results including the new kernel, with part of it run on
woodlouse?

Jan
On 05/09/13 13:20, Jan Beulich wrote:
>>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 04/09/13 11:41, Ian Jackson wrote:
>>> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>>> ...
>>>>> I'm not sure why my osstest push gate didn't catch this, but the
>>>>> regression is indeed caused by the change from Jeremy's old tree to
>>>>> Linux 3.10.y.
>>>
>>> It appears that the push gate didn't catch it because it's host
>>> specific, and it got lucky and didn't run a test on that host.
>>>
>>>> So how do we want to deal with that? Linux maintainers - any
>>>> chance you could help out? The staging tree having been stuck
>>>> for over a week is certainly less than ideal...
>>>
>>> David Vrabel pointed out that more modern kernels have a different
>>> interpretation of things like "dom0_mem=256M", and can waste lots and
>>> lots of actual memory on pointless bookkeeping for future expansion
>>> (which the kernel envisages but we do not).
>>>
>>> I have changed it to "dom0_mem=256M,max:256M". I got a push of this
>>> change at "Wed, 4 Sep 2013 03:50:14 +0100". I don't think any of the
>>> test runs yet reported have used this change.
>>
>> Woodlouse's e820 as seen by the kernel looks like:
>>
>> [ 0.000000] e820: BIOS-provided physical RAM map:
>> [ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
>> [ 0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
>> [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
>> [ 0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
>> [ 0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
>> [ 0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
>> [ 0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
>> [ 0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
>> [ 0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
>> [ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
>> [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
>> [ 0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
>> [ 0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
>> [ 0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
>> [ 0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
>>
>> That last reserved entry I think confuses the early setup and it does
>> odd things like:
>>
>> [ 0.000000] Set 266338518 page(s) to 1-1 mapping
>>
>> Possibly relevant kernel thread here:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html
>>
>> I note that the e820 as seen by Xen does not have this reserved region
>>
>> (XEN) Xen-e820 RAM map:
>> (XEN) 0000000000000000 - 000000000009a800 (usable)
>> (XEN) 000000000009a800 - 00000000000a0000 (reserved)
>> (XEN) 00000000000e6000 - 0000000000100000 (reserved)
>> (XEN) 0000000000100000 - 00000000d7f90000 (usable)
>> (XEN) 00000000d7f9e000 - 00000000d7fa0000 type 9
>> (XEN) 00000000d7fa0000 - 00000000d7fae000 (ACPI data)
>> (XEN) 00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
>> (XEN) 00000000d7fe0000 - 00000000d7fee000 (reserved)
>> (XEN) 00000000d7ff0000 - 00000000d8000000 (reserved)
>> (XEN) 00000000e0000000 - 00000000f0000000 (reserved)
>> (XEN) 00000000fec00000 - 00000000fec03000 (reserved)
>> (XEN) 00000000fee00000 - 00000000fee01000 (reserved)
>> (XEN) 00000000ff700000 - 0000000100000000 (reserved)
>> (XEN) 0000000100000000 - 0000000228000000 (usable)
>>
>> So it must be being added by Xen?
>
> Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
> address range"). But that's the case on all AMD systems, and
> I thought it wasn't just woodlouse that's an AMD one - Ian?
>
> In any event - how can the kernel side code make _any_
> assumptions on what is or is not in the E820 table? I've
> recently seen logs from a system where reserved (MMIO)
> blocks appear right below the 1Tb (or maybe it was even 16Tb)
> boundary, without Xen inserting them.
>
> I would certainly be willing to revert that patch for the time
> being if we have reasons to believe this helps, but only as long
> as it is clear that the kernel needs fixing, and that I'll want this
> back before 4.4 goes out. Do we have baseline (8a7769b4)
> test results including the new kernel, with part of it run on
> woodlouse?

This looks like a red herring. Having poked about in woodlouse it looks
like something is screwy with interrupts. The tg3 cards aren't using
MSI and the USB controller is using edge not level handlers. Another
machine with the same chipset is happily using MSIs.

Malcolm (Cc) has some suggestions for things to try.

David
Ian Jackson
2013-Sep-06 10:38 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> This looks like a red herring. Having poked about in woodlouse it looks
> like something is screwy with interrupts. The tg3 cards aren't using
> MSI and the USB controller is using edge not level handlers. Another
> machine with the same chipset is happily using MSIs.
I did the following tests overnight:
* 3.4.60 kernel:
Pass! [adhoc flight 19081]
* 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
Date: Wed Sep 04 21:54:01 BST 2013
Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
Fail as before (in this case, timeout in debootstrap trying to
install a guest). [adhoc flight 19082]
* 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
Total boot failure. SATA controller complaining bitterly about
lost interrupts. [adhoc flight 19085]
I also took woodlouse out of the main test pool, which is how we got a
push of 4.2. I'm going to put it back now, and make a change to
switch to Linux 3.4.y for general tests.
I think this gets the 3.10.y problem off the critical path for
everything else but of course we should still fix it. I will leave
the 3.10.y push gate in place.
Ian.
Jan Beulich
2013-Sep-06 10:49 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
>>> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring. Having poked about in woodlouse it looks
>> like something is screwy with interrupts. The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers. Another
>> machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]
>
> * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
> Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
> Date: Wed Sep 04 21:54:01 BST 2013
> Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
>
> Fail as before (in this case, timeout in debootstrap trying to
> install a guest). [adhoc flight 19082]
>
> * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
>
> Total boot failure. SATA controller complaining bitterly about
> lost interrupts. [adhoc flight 19085]
>
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2. I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.

For -unstable this also resulted in just a single remaining test failure
(test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
which appears to be the result of the migration, after the
first few thousand pages, seeing a rapid decrease of speed
(which then likely causes that timeout). I couldn't spot anything
in the logs that would explain this though. But I did notice that
in two of the three runs there was no xend.log captured on
the source host in the first place - is there an explanation for
this?

In any event I'm going to take these almost-pushes as a
"good enough" sign to pull over the two or three commits into
the stable branches, in the expectation that we should be
able to get a push there over the weekend, and then release
early next week.

Looking through the logs of *-mite it also seems like you gave
3.11 a try, hitting a BUG() in balloon.c.

Jan
David Vrabel
2013-Sep-06 10:58 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On 06/09/13 11:38, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring. Having poked about in woodlouse it looks
>> like something is screwy with interrupts. The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers. Another
>> machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]

Where are the logs for this run?

I tried:
http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/

David
Ian Jackson
2013-Sep-06 11:06 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> For -unstable this also resulted in just a single remaining test failure
> (test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
> which appears to be the result of the migration, after the
> first few thousand pages, seeing a rapid decrease of speed
> (which then likely causes that timeout). I couldn't spot anything
> in the logs that would explain this though. But I did notice that
> in two of the three runs there was no xend.log captured on
> the source host in the first place - is there an explanation for
> this?
Looking at the logs-capture log, it appears that itch-mite was totally
unresponsive by then. The log capture script decided to power cycle
it. After having done that, xend wasn't running. Due to a bug in the
script it didn't retry the log capture.
> In any event I'm going to take these almost-pushes as a
> "good enough" sign to pull over the two or three commits into
> the stable branches, in the expectation that we should be
> able to get a push there over the weekend, and then release
> early next week.
OK.
> Looking through the logs of *-mite it also seems like you gave
> 3.11 a try, hitting a BUG() in balloon.c.
That'll be the "linux-linus" test, which isn't doing very well.
Ian.
Ian Jackson
2013-Sep-06 11:50 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
David Vrabel writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06/09/13 11:38, Ian Jackson wrote:
> > I did the following tests overnight:
> >
> > * 3.4.60 kernel:
> >
> > Pass! [adhoc flight 19081]
>
> Where are the logs for this run?
>
> I tried:
> http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/
It doesn't automatically publish the logs of adhoc flights. I have
just done this now (for all three I mentioned).
Ian.
Konrad Rzeszutek Wilk
2013-Sep-06 12:49 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On Fri, Sep 06, 2013 at 12:06:38PM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> > On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> > For -unstable this also resulted in just a single remaining test failure
> > (test-amd64-i386-pair 17 guest-migrate/src_host/dst_host),
> > which appears to be the result of the migration, after the
> > first few thousand pages, seeing a rapid decrease of speed
> > (which then likely causes that timeout). I couldn't spot anything
> > in the logs that would explain this though. But I did notice that
> > in two of the three runs there was no xend.log captured on
> > the source host in the first place - is there an explanation for
> > this?
>
> Looking at the logs-capture log, it appears that itch-mite was totally
> unresponsive by then. The log capture script decided to power cycle
> it. After having done that, xend wasn't running. Due to a bug in the
> script it didn't retry the log capture.
>
> > In any event I'm going to take these almost-pushes as a
> > "good enough" sign to pull over the two or three commits into
> > the stable branches, in the expectation that we should be
> > able to get a push there over the weekend, and then release
> > early next week.
>
> OK.
>
> > Looking through the logs of *-mite it also seems like you gave
> > 3.11 a try, hitting a BUG() in balloon.c.
>
> That'll be the "linux-linus" test, which isn't doing very well.

I think Boris has a patch that fixes the regression.
>
> Ian.
Konrad Rzeszutek Wilk
2013-Sep-06 12:57 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
On Fri, Sep 06, 2013 at 11:38:42AM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> > This looks like a red herring. Having poked about in woodlouse it looks
> > like something is screwy with interrupts. The tg3 cards aren't using
> > MSI and the USB controller is using edge not level handlers. Another
> > machine with the same chipset is happily using MSIs.
>
> I did the following tests overnight:
>
> * 3.4.60 kernel:
>
> Pass! [adhoc flight 19081]
>
> * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
> Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
> Date: Wed Sep 04 21:54:01 BST 2013
> Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
>
> Fail as before (in this case, timeout in debootstrap trying to
> install a guest). [adhoc flight 19082]
>
> * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
>
> Total boot failure. SATA controller complaining bitterly about
> lost interrupts. [adhoc flight 19085]

Somebody (Andrew? David?) took a look at the box and found that the
MSIs were all out of whack. I guess with the 'noacpi' parameter the
thinking is that the ACPI _PRT are out of whack with the more modern
kernels?

I am not that familiar with oss-test - but is each of the set of boxes
running a different version of the hypervisor? Meaning you don't
randomly install from scratch a new version of a hypervisor on different
boxes?

Thanks!
>
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2. I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.
>
> I think this gets the 3.10.y problem off the critical path for
> everything else but of course we should still fix it. I will leave
> the 3.10.y push gate in place.

Aye. Is this issue (network incredibly slow) only surfacing on this box?
No - I thought I saw the issue on gall and lice with the upstream Linux?
Are those two machines the same as woodlouse?
>
> Ian.
Ian Jackson
2013-Sep-06 13:34 UTC
Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]
Konrad Rzeszutek Wilk writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> I am not that familiar with oss-test - but is each of the set of boxes
> running a different version of the hypervisor? Meaning you don't
> randomly install from scratch a new version of a hypervisor on different
> boxes?
No, each test is of a specific version of the hypervisor, a specific
version of the kernel, etc.
For each test the tester will pick a machine from the test pool. The
scheduling algorithm tries to pick a machine which has not recently
run this test, unless the test failed most recently, in which case it
tries to pick (the) one it failed on.
Each test job involves a complete wipe of the system, and then
installing a dom0 OS with the selected hypervisor and kernel.
> > I think this gets the 3.10.y problem off the critical path for
> > everything else but of course we should still fix it. I will leave
> > the 3.10.y push gate in place.
>
> Aye. Is this issue (network incredibly slow) only surfacing on this box?
> No - I thought I saw the issue on gall and lice with the upstream Linux?
> Are those two machines the same as woodlouse?
No, they are entirely different. This incredibly slow network issue
has only been seen on woodlouse. Most of the machines are in
identical pairs, but not woodlouse, sadly.
ian.
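The host-selection rule Ian describes above (avoid a machine that recently ran this test, unless the test's most recent run failed, in which case go back to the machine it failed on) can be sketched as follows. This is a toy model of the rule as stated in the message, not osstest's actual allocator, which presumably weighs more than this; `pick_host`, the history format and the host names are all invented for illustration.

```python
def pick_host(test, hosts, history):
    """Choose a host for `test` from the pool `hosts`.

    `history` is a list of (test, host, passed) tuples, oldest first.
    Runs on hosts that are no longer in the pool are ignored.
    """
    runs = [(host, passed) for name, host, passed in history
            if name == test and host in hosts]

    # Sticky failure: if the most recent run of this test failed,
    # try again on the host it failed on.
    if runs and not runs[-1][1]:
        return runs[-1][0]

    # Otherwise prefer the host that ran this test least recently;
    # hosts that never ran it sort first.
    last_run = {host: -1 for host in hosts}
    for index, (host, _) in enumerate(runs):
        last_run[host] = index
    return min(hosts, key=lambda host: last_run[host])


if __name__ == "__main__":
    pool = ["host-a", "host-b", "host-c"]
    log = [("debian-install", "host-a", True),
           ("debian-install", "host-b", False)]
    # The last run failed on host-b, so the test sticks to host-b.
    print(pick_host("debian-install", pool, log))
    # With host-b taken out of the pool, the test can move elsewhere.
    print(pick_host("debian-install", ["host-a", "host-c"], log))
```

The second call mirrors what happened in the thread: once the failing machine was removed from the pool, the previously failing tests were free to run, and pass, on other hosts.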