flight 11574 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/11574/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-i386-win            7 windows-install  fail pass in 11561
 test-i386-i386-win             7 windows-install  fail pass in 11561
 test-amd64-i386-xend-winxpsp3  7 windows-install  fail in 11561 pass in 11574
 test-amd64-amd64-xl-win        7 windows-install  fail in 11561 pass in 11574
 test-amd64-amd64-win           7 windows-install  fail in 11561 pass in 11574

Regressions which are regarded as allowable (not blocking):
 test-amd64-i386-win-vcpus1     7 windows-install  fail like 11561

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-pcipt-intel      9 guest-start       fail never pass
 test-amd64-i386-rhel6hvm-intel       9 guest-start.2     fail never pass
 test-amd64-i386-rhel6hvm-amd         9 guest-start.2     fail never pass
 test-amd64-amd64-xl-win7-amd64      13 guest-stop        fail never pass
 test-amd64-i386-xend-winxpsp3       16 leak-check/check  fail never pass
 test-amd64-amd64-xl-win             13 guest-stop        fail never pass
 test-i386-i386-xl-win               13 guest-stop        fail never pass
 test-amd64-i386-xl-win-vcpus1       13 guest-stop        fail never pass
 test-amd64-i386-xl-win7-amd64       13 guest-stop        fail never pass
 test-i386-i386-xl-winxpsp3          13 guest-stop        fail never pass
 test-amd64-amd64-win                16 leak-check/check  fail never pass
 test-amd64-amd64-xl-winxpsp3        13 guest-stop        fail never pass
 test-amd64-i386-xl-winxpsp3-vcpus1  13 guest-stop        fail never pass
 test-amd64-i386-win                 16 leak-check/check  fail in 11561 never pass
 test-i386-i386-win                  16 leak-check/check  fail in 11561 never pass

version targeted for testing:
 xen                  370924e204dc
baseline version:
 xen                  370924e204dc

jobs:
 build-amd64                          pass
 build-i386                           pass
 build-amd64-oldkern                  pass
 build-i386-oldkern                   pass
 build-amd64-pvops                    pass
 build-i386-pvops                     pass
 test-amd64-amd64-xl                  pass
 test-amd64-i386-xl                   pass
 test-i386-i386-xl                    pass
 test-amd64-i386-rhel6hvm-amd         fail
 test-amd64-amd64-xl-win7-amd64       fail
 test-amd64-i386-xl-win7-amd64        fail
 test-amd64-i386-xl-credit2           pass
 test-amd64-amd64-xl-pcipt-intel      fail
 test-amd64-i386-rhel6hvm-intel       fail
 test-amd64-i386-xl-multivcpu         pass
 test-amd64-amd64-pair                pass
 test-amd64-i386-pair                 pass
 test-i386-i386-pair                  pass
 test-amd64-amd64-xl-sedf-pin         pass
 test-amd64-amd64-pv                  pass
 test-amd64-i386-pv                   pass
 test-i386-i386-pv                    pass
 test-amd64-amd64-xl-sedf             pass
 test-amd64-i386-win-vcpus1           fail
 test-amd64-i386-xl-win-vcpus1        fail
 test-amd64-i386-xl-winxpsp3-vcpus1   fail
 test-amd64-amd64-win                 fail
 test-amd64-i386-win                  fail
 test-i386-i386-win                   fail
 test-amd64-amd64-xl-win              fail
 test-i386-i386-xl-win                fail
 test-amd64-i386-xend-winxpsp3        fail
 test-amd64-amd64-xl-winxpsp3         fail
 test-i386-i386-xl-winxpsp3           fail

------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
    http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
    http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary

Published tested tree is already up to date.
xen.org writes ("[xen-unstable test] 11574: tolerable FAIL"):
> Tests which are failing intermittently (not blocking):
>  test-amd64-i386-win  7 windows-install  fail pass in 11561

This is a host-specific, but deterministic, failure.  My bisector has
been working on it (the basis pass was in November so there has been a
lot of ground to cover and I had to make some new arrangements to be
able to run an ad-hoc bisection of this problem) and the results so
far are fingering changesets between 24367:537ceb11d51e (bad) and
24362:d35bedf334f2 (good).  Extract from hg log -v is below.

I have looked at the logs of one of these failures and there is
nothing of note.  The screenshot shows the Windows screensaver, so I
guess neither the guest itself nor qemu have crashed.  The likely
situation is either that the guest has somehow locked up and ceased to
make progress, or that the networking is nonfunctional.

I'm expecting the bisector to finger a specific changeset and will let
you know when it does, but this will take some more hours and maybe
only finish overnight or tomorrow.

Ian.

changeset:   24367:537ceb11d51e
user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
date:        Tue Dec 06 20:31:49 2011 +0000
files:       xen/arch/x86/hvm/hvm.c
description:
Improve handling of nested page faults

Add checks for access type. Be less reliant on implicit semantics.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>


changeset:   24366:534b2a15e669
user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
date:        Tue Dec 06 20:10:32 2011 +0000
files:       xen/arch/x86/mm/mem_sharing.c xen/arch/x86/mm/p2m.c xen/include/public/mem_event.h
description:
x86/mm: Allow dummy responses on the mem_event ring.

Ring semantics require that for every request, a response be put. This
allows consumer to place a dummy response if need be.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>


changeset:   24365:c65d1a9769b4
user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
date:        Tue Dec 06 20:10:32 2011 +0000
files:       xen/arch/x86/mm/mem_event.c xen/arch/x86/mm/mem_sharing.c xen/arch/x86/mm/p2m.c xen/include/asm-x86/mem_event.h
description:
x86/mm: Consume multiple mem event responses off the ring

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scannell <adin@scanneel.ca>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>


changeset:   24364:0964341efd65
user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
date:        Tue Dec 06 20:10:32 2011 +0000
files:       xen/arch/x86/mm/mem_event.c
description:
x86/mm: Allow memevent responses to be signaled via the event channel

Don't require a separate domctl to notify the memevent interface that
an event has occured.  This domctl can be taxing, particularly when you
are scaling events and paging to many domains across a single system.
Instead, we use the existing event channel to signal when we place
something in the ring (as per normal ring operation).

Signed-off-by: Adin Scannell <adin@scannell.ca>
Signed-off-by: Keir Fraser <keir@xen.org>
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>


changeset:   24363:1620291f0c4a
user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
date:        Tue Dec 06 20:10:32 2011 +0000
files:       xen/arch/ia64/vmx/vmx_init.c xen/arch/x86/hvm/hvm.c xen/arch/x86/mm/mem_event.c xen/common/event_channel.c xen/include/xen/event.h xen/include/xen/sched.h
description:
Create a generic callback mechanism for Xen-bound event channels

For event channels for which Xen is the consumer, there currently is
a single action. With this patch, we allow event channel creators to
specify a generic callback (or no callback). Because the expectation
is that there will be few callbacks, they are stored in a small table.

Signed-off-by: Adin Scannell <adin@scannell.ca>
Signed-off-by: Keir Fraser <keir@xen.org>
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Committed-by: Tim Deegan <tim@xen.org>
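
For readers unfamiliar with bisection, the narrowing between a known-good
and a known-bad changeset described in Ian's message can be sketched
roughly as below.  This is only an illustration, not the osstest bisector:
first_bad_revision(), rev_passes_fn and example_passes() are hypothetical
names, and the sketch assumes a linear run of revision numbers with a
single pass-to-fail transition, which real Mercurial history need not have.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: returns true if the test passes at `rev`. */
typedef bool (*rev_passes_fn)(int rev);

/*
 * Return the first failing revision in (good, bad].
 * Assumes `good` passes, `bad` fails, and there is exactly one
 * pass-to-fail transition between them.
 */
static int first_bad_revision(int good, int bad, rev_passes_fn passes)
{
    assert(good < bad);

    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;

        if (passes(mid))
            good = mid;   /* culprit is later than mid */
        else
            bad = mid;    /* culprit is mid or earlier */
    }
    return bad;
}

/* Illustrative stand-in for a real test run: pretend 24367 broke it. */
static bool example_passes(int rev)
{
    return rev < 24367;
}
```

Calling first_bad_revision(24362, 24367, example_passes) with the range
quoted above needs only three test runs (24364, 24365, 24366) to settle on
a single changeset, which is why each step of a real bisection, costing a
full Windows install, takes hours rather than days.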
Ian Jackson writes ("Re: [xen-unstable test] 11574: tolerable FAIL"):
> This is a host-specific, but deterministic, failure.  My bisector has
> been working on it (the basis pass was in November so there has been a
> lot of ground to cover and I had to make some new arrangements to be
> able to run an ad-hoc bisection of this problem) and the results so
> far are fingering changesets between 24367:537ceb11d51e (bad) and
> 24362:d35bedf334f2 (good).

The bisector is now minded to blame this changeset:

  changeset:   24367:537ceb11d51e
  user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
  date:        Tue Dec 06 20:31:49 2011 +0000
  files:       xen/arch/x86/hvm/hvm.c
  description:
  Improve handling of nested page faults

  Add checks for access type. Be less reliant on implicit semantics.

  Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
  Acked-by: Tim Deegan <tim@xen.org>
  Committed-by: Tim Deegan <tim@xen.org>

It's doing a couple more before-and-after repros to be sure, but this
looks like a plausible candidate.  Full diff below.

I'm not an expert on this area.  Perhaps someone could speculate what
Windows is doing that is triggering one of these new checks ?

Ian.

# HG changeset patch
# User Andres Lagar-Cavilla <andres@lagarcavilla.org>
# Date 1323203509 0
# Node ID 537ceb11d51ef60cd4abffd2f54de0ae0ca50008
# Parent  534b2a15e6695dfd8866c0ff626b002cbf57991a
Improve handling of nested page faults

Add checks for access type. Be less reliant on implicit semantics.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

diff -r 534b2a15e669 -r 537ceb11d51e xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c	Tue Dec 06 20:10:32 2011 +0000
+++ b/xen/arch/x86/hvm/hvm.c	Tue Dec 06 20:31:49 2011 +0000
@@ -1288,7 +1288,8 @@ int hvm_hap_nested_page_fault(unsigned l
      * If this GFN is emulated MMIO or marked as read-only, pass the fault
      * to the mmio handler.
      */
-    if ( (p2mt == p2m_mmio_dm) || (p2mt == p2m_ram_ro) )
+    if ( (p2mt == p2m_mmio_dm) ||
+         (access_w && (p2mt == p2m_ram_ro)) )
     {
         if ( !handle_mmio() )
             hvm_inject_exception(TRAP_gp_fault, 0, 0);
@@ -1302,7 +1303,7 @@ int hvm_hap_nested_page_fault(unsigned l
         p2m_mem_paging_populate(v->domain, gfn);
 
     /* Mem sharing: unshare the page and try again */
-    if ( p2mt == p2m_ram_shared )
+    if ( access_w && (p2mt == p2m_ram_shared) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
         mem_sharing_unshare_page(p2m->domain, gfn, 0);
@@ -1319,14 +1320,17 @@ int hvm_hap_nested_page_fault(unsigned l
          * a large page, we do not change other pages type within that large
          * page.
          */
-        paging_mark_dirty(v->domain, mfn_x(mfn));
-        p2m_change_type(v->domain, gfn, p2m_ram_logdirty, p2m_ram_rw);
+        if ( access_w )
+        {
+            paging_mark_dirty(v->domain, mfn_x(mfn));
+            p2m_change_type(v->domain, gfn, p2m_ram_logdirty, p2m_ram_rw);
+        }
         rc = 1;
         goto out_put_gfn;
     }
 
     /* Shouldn't happen: Maybe the guest was writing to a r/o grant mapping? */
-    if ( p2mt == p2m_grant_map_ro )
+    if ( access_w && (p2mt == p2m_grant_map_ro) )
     {
         gdprintk(XENLOG_WARNING,
                  "trying to write to read-only grant mapping\n");
> Ian Jackson writes ("Re: [xen-unstable test] 11574: tolerable FAIL"):
>> This is a host-specific, but deterministic, failure.  My bisector has
>> been working on it (the basis pass was in November so there has been a
>> lot of ground to cover and I had to make some new arrangements to be
>> able to run an ad-hoc bisection of this problem) and the results so
>> far are fingering changesets between 24367:537ceb11d51e (bad) and
>> 24362:d35bedf334f2 (good).
>
> The bisector is now minded to blame this changeset:
>
>   changeset:   24367:537ceb11d51e
>   user:        Andres Lagar-Cavilla <andres@lagarcavilla.org>
>   date:        Tue Dec 06 20:31:49 2011 +0000
>   files:       xen/arch/x86/hvm/hvm.c
>   description:
>   Improve handling of nested page faults
>
>   Add checks for access type. Be less reliant on implicit semantics.
>
>   Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>   Acked-by: Tim Deegan <tim@xen.org>
>   Committed-by: Tim Deegan <tim@xen.org>
>
> It's doing a couple more before-and-after repros to be sure, but this
> looks like a plausible candidate.  Full diff below.

Thanks Ian,
just to be sure, none of the following conditions apply on your test:
- hvm uses memory sharing
- hvm runs a backend pv driver
- hvm uses populate-on-demand.

If all the above apply, I can quickly put together a micro-patch to
assert that the deal-breaker is

> -    if ( (p2mt == p2m_mmio_dm) || (p2mt == p2m_ram_ro) )
> +    if ( (p2mt == p2m_mmio_dm) ||
> +         (access_w && (p2mt == p2m_ram_ro)) )
>

Would you be able to test that?
Thanks again
Andres

> I'm not an expert on this area.  Perhaps someone could speculate what
> Windows is doing that is triggering one of these new checks ?
>
> Ian.
>
> # HG changeset patch
> # User Andres Lagar-Cavilla <andres@lagarcavilla.org>
> # Date 1323203509 0
> # Node ID 537ceb11d51ef60cd4abffd2f54de0ae0ca50008
> # Parent  534b2a15e6695dfd8866c0ff626b002cbf57991a
> Improve handling of nested page faults
>
> Add checks for access type. Be less reliant on implicit semantics.
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
> Acked-by: Tim Deegan <tim@xen.org>
> Committed-by: Tim Deegan <tim@xen.org>
>
> diff -r 534b2a15e669 -r 537ceb11d51e xen/arch/x86/hvm/hvm.c
> --- a/xen/arch/x86/hvm/hvm.c	Tue Dec 06 20:10:32 2011 +0000
> +++ b/xen/arch/x86/hvm/hvm.c	Tue Dec 06 20:31:49 2011 +0000
> @@ -1288,7 +1288,8 @@ int hvm_hap_nested_page_fault(unsigned l
>       * If this GFN is emulated MMIO or marked as read-only, pass the fault
>       * to the mmio handler.
>       */
> -    if ( (p2mt == p2m_mmio_dm) || (p2mt == p2m_ram_ro) )
> +    if ( (p2mt == p2m_mmio_dm) ||
> +         (access_w && (p2mt == p2m_ram_ro)) )
>      {
>          if ( !handle_mmio() )
>              hvm_inject_exception(TRAP_gp_fault, 0, 0);
> @@ -1302,7 +1303,7 @@ int hvm_hap_nested_page_fault(unsigned l
>          p2m_mem_paging_populate(v->domain, gfn);
>
>      /* Mem sharing: unshare the page and try again */
> -    if ( p2mt == p2m_ram_shared )
> +    if ( access_w && (p2mt == p2m_ram_shared) )
>      {
>          ASSERT(!p2m_is_nestedp2m(p2m));
>          mem_sharing_unshare_page(p2m->domain, gfn, 0);
> @@ -1319,14 +1320,17 @@ int hvm_hap_nested_page_fault(unsigned l
>           * a large page, we do not change other pages type within that large
>           * page.
>           */
> -        paging_mark_dirty(v->domain, mfn_x(mfn));
> -        p2m_change_type(v->domain, gfn, p2m_ram_logdirty, p2m_ram_rw);
> +        if ( access_w )
> +        {
> +            paging_mark_dirty(v->domain, mfn_x(mfn));
> +            p2m_change_type(v->domain, gfn, p2m_ram_logdirty, p2m_ram_rw);
> +        }
>          rc = 1;
>          goto out_put_gfn;
>      }
>
>      /* Shouldn't happen: Maybe the guest was writing to a r/o grant mapping? */
> -    if ( p2mt == p2m_grant_map_ro )
> +    if ( access_w && (p2mt == p2m_grant_map_ro) )
>      {
>          gdprintk(XENLOG_WARNING,
>                   "trying to write to read-only grant mapping\n");
>
Hi,

At 16:00 +0000 on 24 Jan (1327420844), Ian Jackson wrote:
> Ian Jackson writes ("Re: [xen-unstable test] 11574: tolerable FAIL"):
> > This is a host-specific, but deterministic, failure.
>
> Improve handling of nested page faults
>
> Add checks for access type. Be less reliant on implicit semantics.
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
> Acked-by: Tim Deegan <tim@xen.org>
> Committed-by: Tim Deegan <tim@xen.org>

Hmm,  Didn't have to pull on that thread too hard to find it's not tied
to anything.  The access_* arguments to hvm_hap_nested_page_fault()
aren't plumbed in on AMD:

    ret = hvm_hap_nested_page_fault(gpa, 0, ~0ul, 0, 0, 0, 0);

so gating behaviour on them is not going to work so well.  Sorry about
that - I should definitely have caught this.  (But Andres, did you test
this, or any of your mm work, on AMD?)

The attached patch ought to fix it.  Smoke-tested but I won't have
good enough access to my test machines to check Windows installs before
Thursday.

Cheers,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
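
The effect Tim describes can be illustrated with a small standalone
sketch.  It is not Xen code: the enum, the function names and the two
"caller" wrappers are hypothetical simplifications, and the read-only-RAM
check is used only as an example of the gated conditions in the diff
(the thread does not establish which of them Windows actually trips).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for a few p2m page types. */
typedef enum { P2M_RAM_RW, P2M_RAM_RO, P2M_MMIO_DM } p2m_type;

/* Check before changeset 24367: any fault on r/o RAM is emulated. */
static bool goes_to_mmio_old(p2m_type t)
{
    return t == P2M_MMIO_DM || t == P2M_RAM_RO;
}

/* Check after changeset 24367: only write faults on r/o RAM are. */
static bool goes_to_mmio_new(p2m_type t, bool access_w)
{
    return t == P2M_MMIO_DM || (access_w && t == P2M_RAM_RO);
}

/* A caller that passes the real access type through. */
static bool fault_flags_plumbed(p2m_type t, bool is_write)
{
    return goes_to_mmio_new(t, is_write);
}

/* A caller that passes 0 for the access flags, like the call quoted above. */
static bool fault_flags_hardcoded(p2m_type t, bool is_write)
{
    (void)is_write;               /* the real access type is dropped here */
    return goes_to_mmio_new(t, false);
}

/* Walk through the cases that matter for a write to read-only RAM. */
static void demonstrate(void)
{
    assert(goes_to_mmio_old(P2M_RAM_RO));             /* old: handled   */
    assert(fault_flags_plumbed(P2M_RAM_RO, true));    /* new, plumbed   */
    assert(!fault_flags_hardcoded(P2M_RAM_RO, true)); /* new, hardcoded */
}
```

With the flags hardcoded to 0, a guest write to read-only RAM no longer
reaches the emulation path at all, whereas the pre-24367 check handled it
regardless of access type.  That is the sense in which gating on arguments
that one caller never fills in silently changes behaviour on that path.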
> Hi,
>
> At 16:00 +0000 on 24 Jan (1327420844), Ian Jackson wrote:
>> Ian Jackson writes ("Re: [xen-unstable test] 11574: tolerable FAIL"):
>> > This is a host-specific, but deterministic, failure.
>>
>> Improve handling of nested page faults
>>
>> Add checks for access type. Be less reliant on implicit semantics.
>>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>> Acked-by: Tim Deegan <tim@xen.org>
>> Committed-by: Tim Deegan <tim@xen.org>
>
> Hmm,  Didn't have to pull on that thread too hard to find it's not tied
> to anything.  The access_* arguments to hvm_hap_nested_page_fault()
> aren't plumbed in on AMD:
>
>     ret = hvm_hap_nested_page_fault(gpa, 0, ~0ul, 0, 0, 0, 0);
>
> so gating behaviour on them is not going to work so well.  Sorry about
> that - I should definitely have caught this.  (But Andres, did you test
> this, or any of your mm work, on AMD?)

Phew, thanks Tim. FWIW:
Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

And, no, we haven't tested on AMD, because we depend critically on the
support that has been built into EPT for paging/mem_access.

Andres

>
> The attached patch ought to fix it.  Smoke-tested but I won't have
> good enough access to my test machines to check Windows installs before
> Thursday.
>
> Cheers,
>
> Tim.
>
Tim Deegan writes ("Re: [Xen-devel] [xen-unstable test] 11574: tolerable FAIL"):
> At 16:00 +0000 on 24 Jan (1327420844), Ian Jackson wrote:
> Hmm,  Didn't have to pull on that thread too hard to find it's not tied
> to anything.  The access_* arguments to hvm_hap_nested_page_fault()
> aren't plumbed in on AMD:
>
>     ret = hvm_hap_nested_page_fault(gpa, 0, ~0ul, 0, 0, 0, 0);
>
> so gating behaviour on them is not going to work so well.  Sorry about
> that - I should definitely have caught this.  (But Andres, did you test
> this, or any of your mm work, on AMD?)

In general it's probably unrealistic to ask every submitter to test
every patch on two different systems...

> The attached patch ought to fix it.  Smoke-tested but I won't have
> good enough access to my test machines to check Windows installs before
> Thursday.

I'd be quite happy if this patch went into -unstable right away.  We
should be able to tell from the automatic tests whether it is awful
:-) and the current situation is rather poor.

Also when we have got rid of this host-specific failure I can push my
new test machinery branch which is intended to (mostly) prevent
host-specific failures being regarded as heisenbugs.

Thanks,
Ian.
At 17:44 +0000 on 24 Jan (1327427040), Ian Jackson wrote:
> Tim Deegan writes ("Re: [Xen-devel] [xen-unstable test] 11574: tolerable FAIL"):
> > At 16:00 +0000 on 24 Jan (1327420844), Ian Jackson wrote:
> > Hmm,  Didn't have to pull on that thread too hard to find it's not tied
> > to anything.  The access_* arguments to hvm_hap_nested_page_fault()
> > aren't plumbed in on AMD:
> >
> >     ret = hvm_hap_nested_page_fault(gpa, 0, ~0ul, 0, 0, 0, 0);
> >
> > so gating behaviour on them is not going to work so well.  Sorry about
> > that - I should definitely have caught this.  (But Andres, did you test
> > this, or any of your mm work, on AMD?)
>
> In general it's probably unrealistic to ask every submitter to test
> every patch on two different systems...

True.

> > The attached patch ought to fix it.  Smoke-tested but I won't have
> > good enough access to my test machines to check Windows installs before
> > Thursday.
>
> I'd be quite happy if this patch went into -unstable right away.  We
> should be able to tell from the automatic tests whether it is awful
> :-) and the current situation is rather poor.

Done.

Cheers,

Tim.