thr3ads.net - Xen devel - [xen-unstable test] 18851: regressions

xen.org

2013-Aug-29 19:18 UTC

[xen-unstable test] 18851: regressions - FAIL

flight 18851 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-rhel6hvm-amd  7 redhat-install            fail REGR. vs. 18778
 test-amd64-i386-pv            7 debian-install            fail REGR. vs. 18778
 test-amd64-i386-xl-multivcpu  7 debian-install            fail REGR. vs. 18778

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-pcipt-intel  9 guest-start                 fail never pass
 test-amd64-amd64-xl-qemuu-win7-amd64 13 guest-stop             fail never pass
 test-amd64-i386-xl-winxpsp3-vcpus1 13 guest-stop               fail never pass
 test-amd64-i386-xend-winxpsp3 16 leak-check/check              fail never pass
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1 13 guest-stop         fail never pass
 test-amd64-amd64-xl-qemuu-winxpsp3 13 guest-stop               fail never pass
 test-amd64-i386-xend-qemut-winxpsp3 16 leak-check/check        fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 13 guest-stop             fail never pass
 test-amd64-amd64-xl-qemut-winxpsp3 13 guest-stop               fail never pass
 test-amd64-amd64-xl-winxpsp3 13 guest-stop                     fail never pass
 test-amd64-amd64-xl-win7-amd64 13 guest-stop                   fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 13 guest-stop              fail never pass
 test-amd64-i386-xl-win7-amd64 13 guest-stop                    fail never pass

version targeted for testing:
 xen                  fb3f1c1855bd9aca625bc0d040be4cdcc216e958
baseline version:
 xen                  8a7769b4453168e23e8935a85e9a875ef5117253

------------------------------------------------------------
People who touched revisions under test:
  Andrew Cooper <andrew.cooper3@citrix.com>
  Ian Campbell <ian.campbell@citrix.com>
  Ian Campbell <ijc@hellion.org.uk>
  Ian Jackson <ian.jackson@eu.citrix.com>
  Jaeyong Yoo <jaeyong.yoo@samsung.com>
  Jan Beulich <jbeulich@suse.com>
  Julien Grall <julien.grall@linaro.org>
  Keir Fraser <keir@xen.org>
  Matt Wilson <msw@amazon.com>
  Sander Eikelenboom <linux@eikelenboom.it>
  Suravee Suthikulpanit <suravee.suthikulapanit@amd.com>
  Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
  Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
------------------------------------------------------------

jobs:
 build-amd64                                                  pass    
 build-armhf                                                  pass    
 build-i386                                                   pass    
 build-amd64-oldkern                                          pass    
 build-i386-oldkern                                           pass    
 build-amd64-pvops                                            pass    
 build-i386-pvops                                             pass    
 test-amd64-amd64-xl                                          pass    
 test-amd64-i386-xl                                           pass    
 test-amd64-i386-rhel6hvm-amd                                 fail    
 test-amd64-i386-qemut-rhel6hvm-amd                           pass    
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass    
 test-amd64-amd64-xl-qemut-win7-amd64                         fail    
 test-amd64-i386-xl-qemut-win7-amd64                          fail    
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail    
 test-amd64-amd64-xl-win7-amd64                               fail    
 test-amd64-i386-xl-win7-amd64                                fail    
 test-amd64-i386-xl-credit2                                   pass    
 test-amd64-amd64-xl-pcipt-intel                              fail    
 test-amd64-i386-rhel6hvm-intel                               pass    
 test-amd64-i386-qemut-rhel6hvm-intel                         pass    
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass    
 test-amd64-i386-xl-multivcpu                                 fail    
 test-amd64-amd64-pair                                        pass    
 test-amd64-i386-pair                                         pass    
 test-amd64-amd64-xl-sedf-pin                                 pass    
 test-amd64-amd64-pv                                          pass    
 test-amd64-i386-pv                                           fail    
 test-amd64-amd64-xl-sedf                                     pass    
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1                     fail    
 test-amd64-i386-xl-winxpsp3-vcpus1                           fail    
 test-amd64-i386-xend-qemut-winxpsp3                          fail    
 test-amd64-amd64-xl-qemut-winxpsp3                           fail    
 test-amd64-amd64-xl-qemuu-winxpsp3                           fail    
 test-amd64-i386-xend-winxpsp3                                fail    
 test-amd64-amd64-xl-winxpsp3                                 fail    


------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
    http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
    http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 333 lines long.)

Jan Beulich

2013-Aug-30 10:36 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

>>> On 29.08.13 at 21:18, xen.org <ian.jackson@eu.citrix.com> wrote:
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/ 
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-rhel6hvm-amd  7 redhat-install            fail REGR. vs. 18778
>  test-amd64-i386-pv            7 debian-install            fail REGR. vs. 18778
>  test-amd64-i386-xl-multivcpu  7 debian-install            fail REGR. vs. 18778

So these all appear to be timeouts of infrastructure operations
that I don't have an immediate explanation for. The only odd
thing is

[   12.719551] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   12.726458] IPv6: ADDRCONF(NETDEV_UP): xenbr0: link is not ready

in each of the respective woodlouse---var-log-dmesg files. Is
woodlouse suffering from a network connectivity issue, perhaps
as a result of the kernel update? In any event, throughout the
last runs it has - afaics - always been woodlouse that had failures
(and the stickiness of failed tests then likely prevents them from
ever getting a success elsewhere). So perhaps it is worth trying to take
woodlouse out of the pool temporarily?

Jan

Ian Jackson

2013-Sep-02 15:10 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> flight 18851 xen-unstable real [real]
> http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-rhel6hvm-amd  7 redhat-install       fail REGR. vs. 18778

I have had a bisection report about this:

  From: "xen.org" <osstest@woking.cam.xci-test.com>
  From: "xen.org" <ian.jackson@eu.citrix.com>
  X-rewrote-sender: osstest@woking.cam.xci-test.com
  Date: Mon, 02 Sep 2013 14:33:30 +0100

  branch xen-unstable
  xen branch xen-unstable
  job test-amd64-i386-qemut-rhel6hvm-amd
  test redhat-install

  Tree: linux git://xenbits.xen.org/linux-pvops.git
  Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
  Tree: qemu git://xenbits.xen.org/staging/qemu-xen-unstable.git
  Tree: qemuu git://xenbits.xen.org/staging/qemu-upstream-unstable.git
  Tree: xen git://xenbits.xen.org/xen.git

  *** Found and reproduced problem changeset ***

    Bug is in tree:  linux git://xenbits.xen.org/linux-pvops.git
    Bug introduced:  8bf3379a74bc9132751bfa685bad2da318fd59d7
    Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00


    commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
    Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Date:   Thu Aug 29 09:47:51 2013 -0700

        Linux 3.10.10

    [etc.]

The head commit there is a merge.  The email contained all the log
messages in between those two, so bounced.  (The bisector didn't
examine the other parent of the merge, I think because it wasn't an
ancestor of the baseline "good" revision.)
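
For readers unfamiliar with bisection, the idea behind a report like the one above can be sketched as a binary search over an ordered list of revisions. This is only a minimal illustration assuming a linear history; it is not osstest's bisector, which has to cope with merges and several trees at once. The names `first_bad`, `revisions` and `is_bad` are made up for this example.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and
    the history is linear, so every revision after the first bad one is
    also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return revisions[hi]


# Hypothetical history r0..r7 in which the test starts failing at r5.
history = [f"r{i}" for i in range(8)]
culprit = first_bad(history, lambda rev: int(rev[1:]) >= 5)
print(culprit)
```

With a merge in the history the search is over a graph rather than a list, which is why a bisector may legitimately skip a parent that is not reachable from the known-good baseline.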

I'm not sure why my osstest push gate didn't catch this, but the
regression is indeed caused by the change from Jeremy's old tree to
Linux 3.10.y.

Ian.

Ian Jackson

2013-Sep-02 17:09 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

Ian Jackson writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> xen.org writes ("[xen-unstable test] 18851: regressions - FAIL"):
> > flight 18851 xen-unstable real [real]
> > http://www.chiark.greenend.org.uk/~xensrcts/logs/18851/
> > 
> > Regressions :-(
> > 
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> >  test-amd64-i386-rhel6hvm-amd  7 redhat-install       fail REGR. vs. 18778

xen.org writes ("[xen-unstable test] 19006: regressions - trouble: broken/fail/pass"):
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-multivcpu  7 debian-install         fail REGR. vs. 18778

I looked at this one from 19006.  The system is running under Xen but
has no guests.  It shows a wget process running.  I can't easily tell
whether it has hung, but there are no other signs of trouble in the
logs.

The tester was able to ssh in and get process listings and so forth, so
it must be that either (a) just the debootstrap stuff has hung, (b) trying
to ssh in to collect logs unwedged it, or (c) the problem is actually poor
performance, not a hang.

The system allows 2ks (2000 seconds) for a debootstrap, which should be
ample (given that there's a local mirror).

So I think it's probably a performance regression.  I will try to
repro this tomorrow.

Ian.

Jan Beulich

2013-Sep-04 09:04 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>   *** Found and reproduced problem changeset ***
> 
>     Bug is in tree:  linux git://xenbits.xen.org/linux-pvops.git
>     Bug introduced:  8bf3379a74bc9132751bfa685bad2da318fd59d7
>     Bug not present: a938a246d34912423c560f475ccf1ce0c71d9d00
> 
> 
>     commit 8bf3379a74bc9132751bfa685bad2da318fd59d7
>     Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>     Date:   Thu Aug 29 09:47:51 2013 -0700
> 
>         Linux 3.10.10
> 
>     [etc.]
> 
> The head commit there is a merge.  The email contained all the log
> messages in between those two, so bounced.  (The bisector didn't
> examine the other parent of the merge, I think because it wasn't an
> ancestor of the baseline "good" revision.)
> 
> I'm not sure why my osstest push gate didn't catch this, but the
> regression is indeed caused by the change from Jeremy's old tree to
> Linux 3.10.y.

So how do we want to deal with that? Linux maintainers - any
chance you could help out? The staging tree having been stuck
for over a week is certainly less than ideal...

Jan

Ian Jackson

2013-Sep-04 10:41 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
...
> > I'm not sure why my osstest push gate didn't catch this, but the
> > regression is indeed caused by the change from Jeremy's old tree to
> > Linux 3.10.y.

It appears that the push gate didn't catch it because it's host
specific, and it got lucky and didn't run a test on that host.

> So how do we want to deal with that? Linux maintainers - any
> chance you could help out? The staging tree having been stuck
> for over a week is certainly less than ideal...

David Vrabel pointed out that more modern kernels have a different
interpretation of things like "dom0_mem=256M", and can waste lots and
lots of actual memory on pointless bookkeeping for future expansion
(which the kernel envisages but we do not).

I have changed it to "dom0_mem=256M,max:256M".  I got a push of this
change at "Wed, 4 Sep 2013 03:50:14 +0100".  I don't think any of the
test runs yet reported have used this change.
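
To make the difference between the two settings concrete, here is a rough sketch that splits a dom0_mem value into its initial, min and max parts. It is a simplification written for illustration only: it handles just plain sizes with a K, M or G suffix plus the min: and max: prefixes, which is narrower than what Xen's real command-line parser accepts, and the helper names `parse_size` and `parse_dom0_mem` are invented here.

```python
import re

# Binary multipliers for the size suffixes handled by this sketch.
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text: str) -> int:
    """Convert a string such as '256M' into a byte count."""
    match = re.fullmatch(r"(\d+)([KMG])", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_dom0_mem(value: str) -> dict:
    """Split a dom0_mem value into initial/min/max byte counts (None if unset)."""
    limits = {"initial": None, "min": None, "max": None}
    for item in value.split(","):
        key, sep, size = item.partition(":")
        if sep:  # an item of the form 'min:<size>' or 'max:<size>'
            if key not in ("min", "max"):
                raise ValueError(f"unknown dom0_mem key: {key!r}")
            limits[key] = parse_size(size)
        else:  # a bare size sets the initial allocation
            limits["initial"] = parse_size(item)
    return limits


old = parse_dom0_mem("256M")
new = parse_dom0_mem("256M,max:256M")
print(old["max"], new["max"])
```

In the old setting no maximum is given at all, which is the case where a newer kernel may reserve bookkeeping for growth; in the new setting the maximum equals the initial allocation.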

...

I have just checked the database and flights 19046 onwards are using
this new command-line option.  None of them have reported yet.  In
fact due to the backlog the system is rather clogged with runs using
the old osstest.  I'm going to manually kill those.

Ian.

David Vrabel

2013-Sep-05 11:24 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

On 04/09/13 11:41, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> ...
>>> I'm not sure why my osstest push gate didn't catch this, but the
>>> regression is indeed caused by the change from Jeremy's old tree to
>>> Linux 3.10.y.
> 
> It appears that the push gate didn't catch it because it's host
> specific, and it got lucky and didn't run a test on that host.
> 
>> So how do we want to deal with that? Linux maintainers - any
>> chance you could help out? The staging tree having been stuck
>> for over a week is certainly less than ideal...
> 
> David Vrabel pointed out that more modern kernels have a different
> interpretation of things like "dom0_mem=256M", and can waste lots and
> lots of actual memory on pointless bookkeeping for future expansion
> (which the kernel envisages but we do not).
> 
> I have changed it to "dom0_mem=256M,max:256M".  I got a push of this
> change at "Wed, 4 Sep 2013 03:50:14 +0100".  I don't think any of the
> test runs yet reported have used this change.

Woodlouse's e820 as seen by the kernel looks like:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
[    0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
[    0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
[    0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
[    0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
[    0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
[    0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[    0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
[    0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
[    0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved

That last reserved entry I think confuses the early setup and it does
odd things like:

[    0.000000] Set 266338518 page(s) to 1-1 mapping

Possibly relevant kernel thread here:

http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html

I note that the e820 as seen by Xen does not have this reserved region

(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009a800 (usable)
(XEN)  000000000009a800 - 00000000000a0000 (reserved)
(XEN)  00000000000e6000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000d7f90000 (usable)
(XEN)  00000000d7f9e000 - 00000000d7fa0000 type 9
(XEN)  00000000d7fa0000 - 00000000d7fae000 (ACPI data)
(XEN)  00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
(XEN)  00000000d7fe0000 - 00000000d7fee000 (reserved)
(XEN)  00000000d7ff0000 - 00000000d8000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fec00000 - 00000000fec03000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000ff700000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000228000000 (usable)

So it must be being added by Xen?
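
The comparison between the two listings can also be done mechanically. The sketch below is illustrative only: it parses a few lines copied from the two maps above and reports kernel-visible entries that start at or above the top of Xen's own map. The regular expressions and the names `kernel_ranges`, `xen_top` and `extra` are assumptions made for this example, not part of any Xen or Linux tool.

```python
import re

# Kernel lines give an inclusive end address; Xen lines give an exclusive one.
KERNEL_RE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\] (.+)$")
XEN_RE = re.compile(r"\(XEN\)\s+([0-9a-f]+) - ([0-9a-f]+)")

kernel_log = [
    "[    0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable",
    "[    0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable",
    "[    0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved",
]
xen_log = [
    "(XEN)  00000000ff700000 - 0000000100000000 (reserved)",
    "(XEN)  0000000100000000 - 0000000228000000 (usable)",
]


def kernel_ranges(lines):
    """Yield (start, end_exclusive, type) for each kernel e820 line."""
    for line in lines:
        match = KERNEL_RE.search(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16) + 1  # convert inclusive to exclusive
            yield start, end, match.group(3).strip()


def xen_top(lines):
    """Return the highest end address that appears in Xen's map."""
    return max(int(m.group(2), 16) for m in map(XEN_RE.search, lines) if m)


top = xen_top(xen_log)
extra = [entry for entry in kernel_ranges(kernel_log) if entry[0] >= top]
for start, end, kind in extra:
    size = end - start
    print(f"{start:#x}-{end:#x} {kind}: {size // 2**30} GiB, {size // 4096} pages")
```

On the sample lines this flags only the range starting at 0xfd00000000, which spans 12 GiB (3145728 pages of 4 KiB) and ends at the 1 TiB boundary.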

David

Jan Beulich

2013-Sep-05 12:20 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
> On 04/09/13 11:41, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>> ...
>>>> I'm not sure why my osstest push gate didn't catch this, but the
>>>> regression is indeed caused by the change from Jeremy's old tree to
>>>> Linux 3.10.y.
>> 
>> It appears that the push gate didn't catch it because it's host
>> specific, and it got lucky and didn't run a test on that host.
>> 
>>> So how do we want to deal with that? Linux maintainers - any
>>> chance you could help out? The staging tree having been stuck
>>> for over a week is certainly less than ideal...
>> 
>> David Vrabel pointed out that more modern kernels have a different
>> interpretation of things like "dom0_mem=256M", and can waste lots and
>> lots of actual memory on pointless bookkeeping for future expansion
>> (which the kernel envisages but we do not).
>> 
>> I have changed it to "dom0_mem=256M,max:256M".  I got a push of this
>> change at "Wed, 4 Sep 2013 03:50:14 +0100".  I don't think any of the
>> test runs yet reported have used this change.
> 
> Woodlouse's e820 as seen by the kernel looks like:
> 
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
> [    0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
> [    0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
> [    0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
> [    0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
> [    0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
> [    0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
> [    0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
> [    0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
> [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
> [    0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
> [    0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
> [    0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
> [    0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
> 
> That last reserved entry I think confuses the early setup and it does
> odd things like:
> 
> [    0.000000] Set 266338518 page(s) to 1-1 mapping
> 
> Possibly relevant kernel thread here:
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html 
> 
> I note that the e820 as seen by Xen does not have this reserved region
> 
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009a800 (usable)
> (XEN)  000000000009a800 - 00000000000a0000 (reserved)
> (XEN)  00000000000e6000 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 00000000d7f90000 (usable)
> (XEN)  00000000d7f9e000 - 00000000d7fa0000 type 9
> (XEN)  00000000d7fa0000 - 00000000d7fae000 (ACPI data)
> (XEN)  00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
> (XEN)  00000000d7fe0000 - 00000000d7fee000 (reserved)
> (XEN)  00000000d7ff0000 - 00000000d8000000 (reserved)
> (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> (XEN)  00000000fec00000 - 00000000fec03000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee01000 (reserved)
> (XEN)  00000000ff700000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 0000000228000000 (usable)
> 
> So it must be being added by Xen?

Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
address range"). But that's the case on all AMD systems, and
I thought it wasn't just woodlouse that's an AMD one - Ian?

In any event - how can the kernel side code make _any_
assumptions on what is or is not in the E820 table? I've
recently seen logs from a system where reserved (MMIO)
blocks appear right below the 1Tb (or maybe it was even 16Tb)
boundary, without Xen inserting them.

I would certainly be willing to revert that patch for the time
being if we have reasons to believe this helps, but only as long
as it is clear that the kernel needs fixing, and that I'll want this
back before 4.4 goes out. Do we have baseline (8a7769b4)
test results including the new kernel, with part of it run on
woodlouse?

Jan

David Vrabel

2013-Sep-05 14:09 UTC

Re: [xen-unstable test] 18851: regressions - FAIL

On 05/09/13 13:20, Jan Beulich wrote:
>>>> On 05.09.13 at 13:24, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 04/09/13 11:41, Ian Jackson wrote:
>>> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>>>> On 02.09.13 at 17:10, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>>> ...
>>>>> I'm not sure why my osstest push gate didn't catch this, but the
>>>>> regression is indeed caused by the change from Jeremy's old tree to
>>>>> Linux 3.10.y.
>>>
>>> It appears that the push gate didn't catch it because it's host
>>> specific, and it got lucky and didn't run a test on that host.
>>>
>>>> So how do we want to deal with that? Linux maintainers - any
>>>> chance you could help out? The staging tree having been stuck
>>>> for over a week is certainly less than ideal...
>>>
>>> David Vrabel pointed out that more modern kernels have a different
>>> interpretation of things like "dom0_mem=256M", and can waste lots and
>>> lots of actual memory on pointless bookkeeping for future expansion
>>> (which the kernel envisages but we do not).
>>>
>>> I have changed it to "dom0_mem=256M,max:256M".  I got a push of this
>>> change at "Wed, 4 Sep 2013 03:50:14 +0100".  I don't think any of the
>>> test runs yet reported have used this change.
>>
>> Woodlouse's e820 as seen by the kernel looks like:
>>
>> [    0.000000] e820: BIOS-provided physical RAM map:
>> [    0.000000] Xen: [mem 0x0000000000000000-0x0000000000099fff] usable
>> [    0.000000] Xen: [mem 0x000000000009a800-0x00000000000fffff] reserved
>> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000d7f8ffff] usable
>> [    0.000000] Xen: [mem 0x00000000d7f9e000-0x00000000d7f9ffff] type 9
>> [    0.000000] Xen: [mem 0x00000000d7fa0000-0x00000000d7fadfff] ACPI data
>> [    0.000000] Xen: [mem 0x00000000d7fae000-0x00000000d7fdffff] ACPI NVS
>> [    0.000000] Xen: [mem 0x00000000d7fe0000-0x00000000d7fedfff] reserved
>> [    0.000000] Xen: [mem 0x00000000d7ff0000-0x00000000d7ffffff] reserved
>> [    0.000000] Xen: [mem 0x00000000e0000000-0x00000000efffffff] reserved
>> [    0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec02fff] reserved
>> [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
>> [    0.000000] Xen: [mem 0x00000000ff700000-0x00000000ffffffff] reserved
>> [    0.000000] Xen: [mem 0x0000000100000000-0x00000001884d1fff] usable
>> [    0.000000] Xen: [mem 0x00000001884d2000-0x0000000227ffffff] unusable
>> [    0.000000] Xen: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
>>
>> That last reserved entry I think confuses the early setup and it does
>> odd things like:
>>
>> [    0.000000] Set 266338518 page(s) to 1-1 mapping
>>
>> Possibly relevant kernel thread here:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1110.1/01213.html 
>>
>> I note that the e820 as seen by Xen does not have this reserved region
>>
>> (XEN) Xen-e820 RAM map:
>> (XEN)  0000000000000000 - 000000000009a800 (usable)
>> (XEN)  000000000009a800 - 00000000000a0000 (reserved)
>> (XEN)  00000000000e6000 - 0000000000100000 (reserved)
>> (XEN)  0000000000100000 - 00000000d7f90000 (usable)
>> (XEN)  00000000d7f9e000 - 00000000d7fa0000 type 9
>> (XEN)  00000000d7fa0000 - 00000000d7fae000 (ACPI data)
>> (XEN)  00000000d7fae000 - 00000000d7fe0000 (ACPI NVS)
>> (XEN)  00000000d7fe0000 - 00000000d7fee000 (reserved)
>> (XEN)  00000000d7ff0000 - 00000000d8000000 (reserved)
>> (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
>> (XEN)  00000000fec00000 - 00000000fec03000 (reserved)
>> (XEN)  00000000fee00000 - 00000000fee01000 (reserved)
>> (XEN)  00000000ff700000 - 0000000100000000 (reserved)
>> (XEN)  0000000100000000 - 0000000228000000 (usable)
>>
>> So it must be being added by Xen?
> 
> Yes - see d838ac25 ("x86: don't allow Dom0 access to the HT
> address range"). But that's the case on all AMD systems, and
> I thought it wasn't just woodlouse that's an AMD one - Ian?
> 
> In any event - how can the kernel side code make _any_
> assumptions on what is or is not in the E820 table? I've
> recently seen logs from a system where reserved (MMIO)
> blocks appear right below the 1Tb (or maybe it was even 16Tb)
> boundary, without Xen inserting them.
> 
> I would certainly be willing to revert that patch for the time
> being if we have reasons to believe this helps, but only as long
> as it is clear that the kernel needs fixing, and that I'll want this
> back before 4.4 goes out. Do we have baseline (8a7769b4)
> test results including the new kernel, with part of it run on
> woodlouse?

This looks like a red herring.  Having poked about in woodlouse it looks
like something is screwy with interrupts.  The tg3 cards aren't using
MSI and the USB controller is using edge not level handlers.  Another
machine with the same chipset is happily using MSIs.

Malcolm (Cc) has some suggestions for things to try.

David

Ian Jackson

2013-Sep-06 10:38 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> This looks like a red herring.  Having poked about in woodlouse it looks
> like something is screwy with interrupts.  The tg3 cards aren't using
> MSI and the USB controller is using edge not level handlers.  Another
> machine with the same chipset is happily using MSIs.

I did the following tests overnight:

 * 3.4.60 kernel:

   Pass!  [adhoc flight 19081]

 * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
   Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
   Date: Wed Sep 04 21:54:01 BST 2013
   Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>

   Fail as before (in this case, timeout in debootstrap trying to
   install a guest).  [adhoc flight 19082]

 * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"

   Total boot failure.  SATA controller complaining bitterly about
   lost interrupts.  [adhoc flight 19085]

I also took woodlouse out of the main test pool, which is how we got a
push of 4.2.  I'm going to put it back now, and make a change to
switch to Linux 3.4.y for general tests.

I think this gets the 3.10.y problem off the critical path for
everything else but of course we should still fix it.  I will leave
the 3.10.y push gate in place.

Ian.

Jan Beulich

2013-Sep-06 10:49 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

>>> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring.  Having poked about in woodlouse it looks
>> like something is screwy with interrupts.  The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers.  Another
>> machine with the same chipset is happily using MSIs.
> 
> I did the following tests overnight:
> 
>  * 3.4.60 kernel:
> 
>    Pass!  [adhoc flight 19081]
> 
>  * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
>    Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
>    Date: Wed Sep 04 21:54:01 BST 2013
>    Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
> 
>    Fail as before (in this case, timeout in debootstrap trying to
>    install a guest).  [adhoc flight 19082]
> 
>  * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
> 
>    Total boot failure.  SATA controller complaining bitterly about
>    lost interrupts.  [adhoc flight 19085]
> 
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2.  I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.

For -unstable this also resulted in just a single left test failure
(test-amd64-i386-pair   17 guest-migrate/src_host/dst_host),
which appears to be the result of the migration, after the
first few thousand pages, seeing a rapid decrease of speed
(which then likely causes that timeout). I couldn't spot anything
in the logs that would explain this though. But I did notice that
in two of the three runs there was not xend.log captured on
the source host in the first place - is there an explanation for
this?

In any event I'm going to take these almost-pushes as a
"good enough" sign to pull over the two or three commits into
the stable branches, in the expectation that we should be
able to get a push there over the weekend, and then release
early next week.

Looking through the logs of *-mite it also seems like you gave
3.11 a try, hitting a BUG() in balloon.c.

Jan

David Vrabel

2013-Sep-06 10:58 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

On 06/09/13 11:38, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
>> This looks like a red herring.  Having poked about in woodlouse it looks
>> like something is screwy with interrupts.  The tg3 cards aren't using
>> MSI and the USB controller is using edge not level handlers.  Another
>> machine with the same chipset is happily using MSIs.
> 
> I did the following tests overnight:
> 
>  * 3.4.60 kernel:
> 
>    Pass!  [adhoc flight 19081]

Where are the logs for this run?

I tried:

http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/

David

Ian Jackson

2013-Sep-06 11:06 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> For -unstable this also resulted in just a single left test failure
> (test-amd64-i386-pair   17 guest-migrate/src_host/dst_host),
> which appears to be the result of the migration, after the
> first few thousand pages, seeing a rapid decrease of speed
> (which then likely causes that timeout). I couldn't spot anything
> in the logs that would explain this though. But I did notice that
> in two of the three runs there was not xend.log captured on
> the source host in the first place - is there an explanation for
> this?

Looking at the logs-capture log, it appears that itch-mite was totally
unresponsive by then.  The log capture script decided to power cycle
it.  After having done that, xend wasn't running.  Due to a bug in the
script it didn't retry the log capture.

> In any event I'm going to take these almost-pushes as a
> "good enough" sign to pull over the two or three commits into
> the stable branches, in the expectation that we should be
> able to get a push there over the weekend, and then release
> early next week.

OK.

> Looking through the logs of *-mite it also seems like you gave
> 3.11 a try, hitting a BUG() in balloon.c.

That'll be the "linux-linus" test, which isn't doing very well.

Ian.

Ian Jackson

2013-Sep-06 11:50 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

David Vrabel writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> On 06/09/13 11:38, Ian Jackson wrote:
> > I did the following tests overnight:
> > 
> >  * 3.4.60 kernel:
> > 
> >    Pass!  [adhoc flight 19081]
> 
> Where are the logs for this run?
> 
> I tried:
> http://www.chiark.greenend.org.uk/~xensrcts/logs/19081/
It doesn't automatically publish the logs of adhoc flights.  I have
just done this now (for all three I mentioned).

Ian.

Konrad Rzeszutek Wilk

2013-Sep-06 12:49 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

On Fri, Sep 06, 2013 at 12:06:38PM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> > On 06.09.13 at 12:38, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> > For -unstable this also resulted in just a single left test failure
> > (test-amd64-i386-pair   17 guest-migrate/src_host/dst_host),
> > which appears to be the result of the migration, after the
> > first few thousand pages, seeing a rapid decrease of speed
> > (which then likely causes that timeout). I couldn't spot anything
> > in the logs that would explain this though. But I did notice that
> > in two of the three runs there was not xend.log captured on
> > the source host in the first place - is there an explanation for
> > this?
> 
> Looking at the logs-capture log, it appears that itch-mite was totally
> unresponsive by then.  The log capture script decided to power cycle
> it.  After having done that, xend wasn't running.  Due to a bug in the
> script it didn't retry the log capture.
> 
> > In any event I'm going to take these almost-pushes as a
> > "good enough" sign to pull over the two or three commits into
> > the stable branches, in the expectation that we should be
> > able to get a push there over the weekend, and then release
> > early next week.
> 
> OK.
> 
> > Looking through the logs of *-mite it also seems like you gave
> > 3.11 a try, hitting a BUG() in balloon.c.
> 
> That'll be the "linux-linus" test, which isn't doing very well.
I think Boris has a patch that fixes the regression.
> 
> Ian.

Konrad Rzeszutek Wilk

2013-Sep-06 12:57 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

On Fri, Sep 06, 2013 at 11:38:42AM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 18851: regressions - FAIL"):
> > This looks like a red herring.  Having poked about in woodlouse it looks
> > like something is screwy with interrupts.  The tg3 cards aren't using
> > MSI and the USB controller is using edge not level handlers.  Another
> > machine with the same chipset is happily using MSIs.
> 
> I did the following tests overnight:
> 
>  * 3.4.60 kernel:
> 
>    Pass!  [adhoc flight 19081]
> 
>  * 3.10.10 + patch from Zoltan Kiss to limit SKB_FRAG_PAGE_ORDER
>    Subject: net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
>    Date: Wed Sep 04 21:54:01 BST 2013
>    Message-ID: <1378327638-23956-1-git-send-email-zoltan.kiss@citrix.com>
> 
>    Fail as before (in this case, timeout in debootstrap trying to
>    install a guest).  [adhoc flight 19082]
> 
>  * 3.10.10, kernel command line "pci=noacpi and pci=nocrs"
> 
>    Total boot failure.  SATA controller complaining bitterly about
>    lost interrupts.  [adhoc flight 19085]
Somebody (Andrew? David?) took a look at the box and found that the MSIs
were all out of whack. I guess with the 'noacpi' parameter the thinking is
that the ACPI _PRT are out of whack with the more modern kernels?

I am not that familiar with oss-test - but is each of the set of boxes
running a different version of the hypervisor? Meaning you don't
randomly install from scratch a new version of a hypervisor on different
boxes?

Thanks!

>
> I also took woodlouse out of the main test pool, which is how we got a
> push of 4.2.  I'm going to put it back now, and make a change to
> switch to Linux 3.4.y for general tests.
> 
> I think this gets the 3.10.y problem off the critical path for
> everything else but of course we should still fix it.  I will leave
> the 3.10.y push gate in place.
Aye. Is this issue (network incredibly slow) only surfacing on this box?
No - I thought I saw the issue on gall and lice with the upstream Linux?
Are those two machines the same as woodlouse?
>
> Ian.

Ian Jackson

2013-Sep-06 13:34 UTC

Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]

Konrad Rzeszutek Wilk writes ("Re: [xen-unstable test] 18851: regressions - FAIL [and 1 more messages]"):
> I am not that familiar with oss-test - but is each of the set of boxes
> running a different version of the hypervisor? Meaning you don't
> randomly install from scratch a new version of a hypervisor on different
> boxes?
No, each test is of a specific version of the hypervisor, a specific
version of the kernel, etc.

For each test the tester will pick a machine from the test pool.  The
scheduling algorithm tries to pick a machine which has not recently
run this test, unless the test failed most recently, in which case it
tries to pick (the) one it failed on.

Each test job involves a complete wipe of the system, and then
installing a dom0 OS with the selected hypervisor and kernel.
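
The host-selection policy described above could be sketched roughly as
follows.  This is an illustration only, not the actual osstest code: the
HostHistory record, the pick_host function and the integer timestamps are
all hypothetical names and simplifications made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HostHistory:
    """Hypothetical per-host record of when each test last ran there."""
    name: str
    # Maps test name -> time of last run (larger = more recent).
    # A test that never ran on this host has no entry.
    last_run: Dict[str, int] = field(default_factory=dict)


def pick_host(test: str, pool: List[HostHistory],
              last_failed_on: Optional[str] = None) -> HostHistory:
    """Pick a host for `test` from the test pool.

    If the most recent run of the test failed, prefer the host it failed
    on, provided that host is still in the pool.  Otherwise pick the host
    that ran this test least recently; hosts that never ran it count as
    the oldest.
    """
    if not pool:
        raise ValueError("test pool is empty")
    if last_failed_on is not None:
        for host in pool:
            if host.name == last_failed_on:
                return host
    return min(pool, key=lambda h: h.last_run.get(test, -1))
```

With this policy a failing test sticks to the host it failed on, which
makes it easier to tell a host-specific problem from a real regression;
a host taken out of the pool simply stops being a candidate.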
> > I think this gets the 3.10.y problem off the critical path for
> > everything else but of course we should still fix it.  I will leave
> > the 3.10.y push gate in place.
> 
> Aye. Is this issue (network incredibly slow) only surfacing on this box?
> No - I thought I saw the issue on gall and lice with the upstream Linux?
> Are those two machines the same as woodlouse?
No, they are entirely different.  This incredibly slow network issue
has only been seen on woodlouse.  Most of the machines are in
identical pairs, but not woodlouse, sadly.

Ian.
