flight 6374 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/6374/

Regressions :-(

Tests which did not succeed and are blocking:
 test-amd64-i386-pv                        5 xen-boot              fail REGR. vs. 6369

Tests which did not succeed, but are not blocking, including regressions
(tests previously passed) regarded as allowable:
 test-amd64-amd64-win                     16 leak-check/check      fail  never pass
 test-amd64-amd64-xl-win                  13 guest-stop            fail  never pass
 test-amd64-i386-rhel6hvm-amd              8 guest-saverestore     fail  never pass
 test-amd64-i386-rhel6hvm-intel            8 guest-saverestore     fail  never pass
 test-amd64-i386-win-vcpus1               16 leak-check/check      fail  never pass
 test-amd64-i386-win                      16 leak-check/check      fail  never pass
 test-amd64-i386-xl-credit2                9 guest-start           fail  like 6367
 test-amd64-i386-xl-win-vcpus1            13 guest-stop            fail  never pass
 test-amd64-xcpkern-i386-rhel6hvm-amd      8 guest-saverestore     fail  never pass
 test-amd64-xcpkern-i386-rhel6hvm-intel    8 guest-saverestore     fail  never pass
 test-amd64-xcpkern-i386-win              16 leak-check/check      fail  never pass
 test-amd64-xcpkern-i386-xl-credit2       11 guest-localmigrate    fail  like 6369
 test-amd64-xcpkern-i386-xl-win           13 guest-stop            fail  never pass
 test-i386-i386-win                       16 leak-check/check      fail  never pass
 test-i386-i386-xl-win                    13 guest-stop            fail  never pass
 test-i386-xcpkern-i386-win               16 leak-check/check      fail  never pass

version targeted for testing:
 xen                  22cc047eb146
baseline version:
 xen                  6fa299ad15c8

------------------------------------------------------------
People who touched revisions under test:
  Ian Campbell <ian.campbell@citrix.com>
  Ian Jackson <ian.jackson@eu.citrix.com>
  Jan Beulich <jbeulich@novell.com>
  Jim Fehlig <jfehlig@novell.com>
  Liu, Jinsong <jinsong.liu@intel.com>
  Stefano Stabellini <stefano.stabellini@eu.citrix.com>
  Wei Gang <gang.wei@intel.com>
------------------------------------------------------------

jobs:
 build-i386-xcpkern                           pass
 build-amd64                                  pass
 build-i386                                   pass
 build-amd64-oldkern                          pass
 build-i386-oldkern                           pass
 build-amd64-pvops                            pass
 build-i386-pvops                             pass
 test-amd64-amd64-xl                          pass
 test-amd64-i386-xl                           pass
 test-i386-i386-xl                            pass
 test-amd64-xcpkern-i386-xl                   pass
 test-i386-xcpkern-i386-xl                    pass
 test-amd64-i386-rhel6hvm-amd                 fail
 test-amd64-xcpkern-i386-rhel6hvm-amd         fail
 test-amd64-i386-xl-credit2                   fail
 test-amd64-xcpkern-i386-xl-credit2           fail
 test-amd64-i386-rhel6hvm-intel               fail
 test-amd64-xcpkern-i386-rhel6hvm-intel       fail
 test-amd64-i386-xl-multivcpu                 pass
 test-amd64-xcpkern-i386-xl-multivcpu         pass
 test-amd64-amd64-pair                        pass
 test-amd64-i386-pair                         pass
 test-i386-i386-pair                          pass
 test-amd64-xcpkern-i386-pair                 pass
 test-i386-xcpkern-i386-pair                  pass
 test-amd64-amd64-pv                          pass
 test-amd64-i386-pv                           fail
 test-i386-i386-pv                            pass
 test-amd64-xcpkern-i386-pv                   pass
 test-i386-xcpkern-i386-pv                    pass
 test-amd64-i386-win-vcpus1                   fail
 test-amd64-i386-xl-win-vcpus1                fail
 test-amd64-amd64-win                         fail
 test-amd64-i386-win                          fail
 test-i386-i386-win                           fail
 test-amd64-xcpkern-i386-win                  fail
 test-i386-xcpkern-i386-win                   fail
 test-amd64-amd64-xl-win                      fail
 test-i386-i386-xl-win                        fail
 test-amd64-xcpkern-i386-xl-win               fail

------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
 http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
 http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary

Not pushing.
------------------------------------------------------------
changeset:   23020:22cc047eb146
tag:         tip
user:        Liu, Jinsong <jinsong.liu@intel.com>
date:        Thu Mar 10 18:35:32 2011 +0000

    x86: Fix cpuidle bug

    Before invoking C3, bus master disable / flush cache should be the last step;
    After resume from C3, bus master enable should be the first step;

    Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
    Acked-by: Wei Gang <gang.wei@intel.com>

changeset:   23019:c8947c24536a
user:        Ian Campbell <ian.campbell@citrix.com>
date:        Thu Mar 10 18:21:42 2011 +0000

    libxl: do not rely on guest to respond when forcing pci device removal

    This is consistent with the expected semantics of a forced device
    removal and also avoids a delay when destroying an HVM domain which
    either does not support hot unplug (does not respond to SCI) or has
    crashed.

    Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
    Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

changeset:   23018:a46101334ee2
user:        Jim Fehlig <jfehlig@novell.com>
date:        Thu Mar 10 18:17:16 2011 +0000

    libxl: Call setsid(2) before exec'ing device model

    While doing development on libvirt libxenlight driver I noticed that
    terminating a libxenlight client causes any qemu-dm processes that were
    indirectly created by the client to also terminate.  Calling setsid(2)
    before exec'ing qemu-dm resolves the issue.

    Signed-off-by: Jim Fehlig <jfehlig@novell.com>
    Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

changeset:   23017:b16644e446ef
user:        Stefano Stabellini <stefano.stabellini@eu.citrix.com>
date:        Thu Mar 10 18:11:31 2011 +0000

    update README

    update README: we are missing few compile time dependencies and a link
    to the pvops kernel page on the wiki.

    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
    Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

changeset:   23016:6fa299ad15c8
user:        Jan Beulich <jbeulich@novell.com>
date:        Wed Mar 09 17:25:44 2011 +0000

    x86: remove pre-686 CPU support bits

    ... as Xen doesn't run on such CPUs anyway.  Clearly these bits were
    particularly odd to have on x86-64.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

(qemu changes not included)
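The setsid(2) changeset above fixes a classic session/process-group problem: a child that is exec'd directly stays in the parent's session and process group, so events aimed at the parent (for instance the controlling client exiting or its terminal closing) also reach qemu-dm. Below is a minimal illustrative sketch of the general fork/setsid/exec pattern in plain C; it is not libxl's code, and the helper name spawn_detached is invented for the example.

    /* Illustrative sketch only -- not libxl's implementation.  Detach a
     * spawned device model from the parent's session so that signals aimed
     * at the parent's process group no longer reach it. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t spawn_detached(char *const argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: start a new session (and process group) before exec,
             * detaching it from the parent's controlling terminal. */
            if (setsid() == (pid_t)-1) {
                perror("setsid");
                _exit(127);
            }
            execvp(argv[0], argv);
            perror("execvp");
            _exit(127);
        }
        return pid;   /* parent: pid of the detached child, or -1 on error */
    }

    int main(void)
    {
        char *args[] = { "sleep", "60", NULL };   /* stand-in for qemu-dm */
        printf("spawned pid %d\n", (int)spawn_detached(args));
        return 0;
    }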
Ian Jackson
2011-Mar-11 17:51 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
> flight 6374 xen-unstable real [real]
> Tests which did not succeed and are blocking:
>  test-amd64-i386-pv            5 xen-boot              fail REGR. vs. 6369

Xen crash in scheduler (non-credit2).

Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
Mar 11 13:46:57.931763 (XEN) CPU:    1
Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
Mar 11 13:46:57.998802 (XEN)    ffff82c480119557 00007cfe580503c7 ffff82c4802d1ac0 ffff82c4802d0cc0
Mar 11 13:46:58.010781 (XEN)    ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0
Mar 11 13:46:58.019765 (XEN)    01ff8301a7fb3048 0000000000000800 0000000000000000 0000000000000100
Mar 11 13:46:58.019798 (XEN)    0000000000000000 0000000000000f02 0000000000000000 0000000000000f00
Mar 11 13:46:58.031777 (XEN)    0000000000000000 ffff8301a7fb3048 ffff8301a7fb3048 ffff8301a7eac048
Mar 11 13:46:58.039906 (XEN)    0000000000000001 ffff82c4802d0cc0 0000000000000000 ffff8301a7fafcc8
Mar 11 13:46:58.039930 (XEN)    ffff82c480119582 ffff8301a7fafd28 ffff82c480122c8d 0000000100000001
Mar 11 13:46:58.051781 (XEN)    ffff82c4802d0cc0 ffff82c4802d0cc0 ffff8300d7cdc000 0000000000000206
Mar 11 13:46:58.063769 (XEN)    ffff8300d7cdc000 0000000000000001 ffff8300d7cdc000 000000018b4d75e5
Mar 11 13:46:58.063807 (XEN)    ffff8301a7fb3040 ffff8301a7fafd48 ffff82c480122e24 ed543b2d00000000
Mar 11 13:46:58.075781 (XEN)    ffff8300d7afc000 ffff8301a7fafe38 ffff82c480157f17 ffff82c480123dd4
Mar 11 13:46:58.087771 (XEN)    ffff82c4802d0cc8 ffff8301a7fafe38 ffff82c480118c8a ffff82c4802d0cc0
Mar 11 13:46:58.098761 (XEN)    ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0
Mar 11 13:46:58.098797 (XEN)    000000018b4d75e5 ffff8301a7fafe68 00000001a7e80e70 ffff8301a7ffa400
Mar 11 13:46:58.110773 (XEN)    ffff8301a7ffaee8 ffff8301a7fafdf8 ffc08301a7ffaf90 0000000000000086
Mar 11 13:46:58.119760 (XEN)    ffff8301a7fafdf8 ffff82c480123b91 0000000000000001 0000000000000000
Mar 11 13:46:58.119794 (XEN)    0000000000000000 ffff8301a7fafe38 ffff8300d7afc000 ffff8300d7cdc000
Mar 11 13:46:58.134790 (XEN)    0000000000000003 000000018b4d75e5 ffff8301a7fb3040 ffff8301a7fafeb8
Mar 11 13:46:58.139763 (XEN)    ffff82c4801226b4 ffff8301a7fafe68 000000018b4d75e5 ffff8301a7fb3100
Mar 11 13:46:58.139804 (XEN)    ffff8300d7cdc060 ffff8300d7afc000 ffffffffffffffff ffff8301a7faff00
Mar 11 13:46:58.154777 (XEN) Xen call trace:
Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
Mar 11 13:46:58.198817 (XEN)
Mar 11 13:46:58.198828 (XEN)
Mar 11 13:46:58.198839 (XEN) ****************************************
Mar 11 13:46:58.207765 (XEN) Panic on CPU 1:
Mar 11 13:46:58.207787 (XEN) FATAL TRAP: vector = 2 (nmi)
Mar 11 13:46:58.207813 (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
Mar 11 13:46:58.218761 (XEN) ****************************************
Mar 11 13:46:58.218788 (XEN)
Mar 11 13:46:58.218802 (XEN) Reboot in five seconds...
Tim Deegan
2011-Mar-14 10:02 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I think this hang comes because although this code:

    cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
    if ( commit )
       CSCHED_PCPU(nxt)->idle_bias = cpu;
    cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));

removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
have been in cpus in the first place, and none of its siblings are
either since nxt might not be its sibling.

Possible fix:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c Mon Mar 14 09:25:07 2011 +0000
@@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
+            cpus_andnot(cpus, cpus, nxt_idlers);
         }
         else
         {

which guarantees that nxt will be removed from cpus, though I suspect
this means that we might not pick the best HT pair in a particular core.
Scheduler code is twisty and hurts my brain so I'd like George's
opinion before checking anything in.

Cheers,

Tim.

P.S. the patch above is a one-liner for clarity: a better fix would be:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c Mon Mar 14 09:26:11 2011 +0000
@@ -533,12 +533,8 @@ _csched_cpu_pick(const struct scheduler
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
         }
-        else
-        {
-            cpus_andnot(cpus, cpus, nxt_idlers);
-        }
+        cpus_andnot(cpus, cpus, nxt_idlers);
     }

     return cpu;

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
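To make the termination concern above concrete, here is a deliberately simplified model in plain C. It uses a uint32_t bitmask instead of Xen's cpumask_t, and the helpers sibling_mask and pick_first are made up for the example: the loop only terminates if every pick is guaranteed to come from the working set, so that the and-not step clears at least one bit per pass.

    /* Hypothetical, simplified model of the hazard -- not Xen's cpumask API.
     * 'cpus' is the working set of candidate CPUs; each pass is supposed to
     * strip the picked CPU and its hyperthread sibling from the set. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t sibling_mask(unsigned int cpu)
    {
        /* Assume CPUs 2n and 2n+1 are HT siblings (illustrative only). */
        return 3u << (cpu & ~1u);
    }

    int pick_best(uint32_t cpus, unsigned int (*pick)(uint32_t cpus))
    {
        int best = -1;

        while (cpus != 0) {
            unsigned int cpu = pick(cpus);
            best = (int)cpu;
            /*
             * Termination hinges on this AND-NOT clearing at least one bit.
             * If pick() may return a CPU outside 'cpus' (as with the
             * idle_bias-based cycle over nxt_idlers), the mask can stay
             * unchanged and the loop spins forever under the watchdog.
             */
            cpus &= ~sibling_mask(cpu);
        }
        return best;
    }

    static unsigned int pick_first(uint32_t cpus)
    {
        unsigned int cpu = 0;
        while (!(cpus & (1u << cpu)))
            cpu++;
        return cpu;   /* always a member of 'cpus', so pick_best terminates */
    }

    int main(void)
    {
        printf("picked %d\n", pick_best(0xffu, pick_first));   /* CPUs 0-7 */
        return 0;
    }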
Jan Beulich
2011-Mar-14 10:33 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
>>> On 11.03.11 at 18:51, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>> flight 6374 xen-unstable real [real]
>> Tests which did not succeed and are blocking:
>>  test-amd64-i386-pv            5 xen-boot              fail REGR. vs. 6369
>
> Xen crash in scheduler (non-credit2).
>
> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
> Mar 11 13:46:57.931763 (XEN) CPU:    1
> Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
> Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
> Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
> ...
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I suppose that's a result of 22957:c5c4688d5654 - as I understand it,
exiting the loop is only possible if two consecutive invocations of
pick_cpu return the same result.  This, however, is precisely what the
pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
systems (so that it's not always the same entity that gets selected).

But even beyond that particular aspect, relying on any form of
"stability" of the returned value isn't correct.

Plus, running pick_cpu repeatedly without actually using its result
is wrong wrt idle_bias updating too - that's why csched_vcpu_acct()
calls _csched_cpu_pick() with the commit argument set to false (which
will result in a subsequent call - through pick_cpu - with the argument
set to true being likely to return the same value, but there's no
correctness dependency on that).  So 22948:2d35823a86e7 already wasn't
really correct in putting a loop around pick_cpu.

It's also not clear to me what the surrounding

    if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )

is supposed to filter, as the lock pointer gets set only when a
CPU gets brought up.

As I don't really understand what is being tried to achieve here,
I also can't really suggest a possible fix other than reverting both
offending changesets.
Jan
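Jan's reading of the caller loop can be illustrated with another small model, again hypothetical code rather than the actual vcpu_migrate()/pick_cpu path from changeset 22957: if the exit condition is "two consecutive picks agree" and the picker round-robins over its candidates (the effect attributed to idle_bias above), the loop never settles.

    /* Hypothetical model of "loop until two consecutive picks agree" --
     * not the actual Xen scheduler code.  A picker that round-robins over
     * its candidates never returns the same value twice in a row, so the
     * loop below would never exit without the artificial bound. */
    #include <stdio.h>

    static unsigned int round_robin_pick(unsigned int ncpus)
    {
        static unsigned int bias;
        bias = (bias + 1) % ncpus;    /* analogous to advancing idle_bias */
        return bias;
    }

    unsigned int settle_on_cpu(unsigned int ncpus, unsigned int max_tries)
    {
        unsigned int prev = round_robin_pick(ncpus);

        /* max_tries exists only so this demo terminates; the real hazard is
         * an unbounded loop that trips the watchdog NMI seen in the trace. */
        for (unsigned int i = 0; i < max_tries; i++) {
            unsigned int cur = round_robin_pick(ncpus);
            if (cur == prev)
                return cur;           /* "stable" result -- may never happen */
            prev = cur;
        }
        return prev;
    }

    int main(void)
    {
        printf("settled on %u\n", settle_on_cpu(4, 1000));
        return 0;
    }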
Jan Beulich
2011-Mar-14 10:39 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:02, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
>
> I think this hang comes because although this code:
>
>     cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>     if ( commit )
>        CSCHED_PCPU(nxt)->idle_bias = cpu;
>     cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>
> removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> have been in cpus in the first place, and none of its siblings are
> either since nxt might not be its sibling.

I had originally spent quite a while to verify that the loop this is in
can't be infinite (i.e. there's going to be always at least one bit
removed from "cpus"), and did so again during the last half hour or so.
I'm certain (hardened also by the CPU masks we see on the stack) that
it's not this function itself that's looping infinitely, but rather its
caller (see my other reply sent just a few minutes ago).

> Possible fix:
>
> diff -r b9a5d116102d xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
> +++ b/xen/common/sched_credit.c Mon Mar 14 09:25:07 2011 +0000
> @@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler
>              cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>              if ( commit )
>                 CSCHED_PCPU(nxt)->idle_bias = cpu;
> -            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> +            cpus_andnot(cpus, cpus, nxt_idlers);
>          }
>          else
>          {
>
> which guarantees that nxt will be removed from cpus, though I suspect
> this means that we might not pick the best HT pair in a particular core.
> Scheduler code is twisty and hurts my brain so I'd like George's
> opinion before checking anything in.

No - that was precisely done the opposite direction to get
better symmetry of load across all CPUs.  With what you propose,
idle_bias would become meaningless.

Jan
Tim Deegan
2011-Mar-14 10:52 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
> > I think this hang comes because although this code:
> >
> >     cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
> >     if ( commit )
> >        CSCHED_PCPU(nxt)->idle_bias = cpu;
> >     cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> >
> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> > have been in cpus in the first place, and none of its siblings are
> > either since nxt might not be its sibling.
>
> I had originally spent quite a while to verify that the loop this is in
> can't be infinite (i.e. there's going to be always at least one bit
> removed from "cpus"), and did so again during the last half hour
> or so.

I'm pretty sure there are possible passes through this loop that don't
remove any cpus, though I haven't constructed the full history that gets
you there.  But the cpupool patches you suggest in your other email look
like much stronger candidates for this hang.

> > which guarantees that nxt will be removed from cpus, though I suspect
> > this means that we might not pick the best HT pair in a particular core.
> > Scheduler code is twisty and hurts my brain so I'd like George's
> > opinion before checking anything in.
>
> No - that was precisely done the opposite direction to get
> better symmetry of load across all CPUs.  With what you propose,
> idle_bias would become meaningless.

I don't see why it would.  As I said, having picked a core we might not
iterate to pick the best cpu within that core, but the round-robining
effect is still there.  And even if not, I figured a hypervisor crash is
worse than a suboptimal scheduling decision. :)

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Juergen Gross
2011-Mar-14 14:40 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
On 03/14/11 11:33, Jan Beulich wrote:
>>>> On 11.03.11 at 18:51, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
>> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>>> flight 6374 xen-unstable real [real]
>>> Tests which did not succeed and are blocking:
>>>  test-amd64-i386-pv            5 xen-boot              fail REGR. vs. 6369
>>
>> Xen crash in scheduler (non-credit2).
>>
>> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
>> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
>> Mar 11 13:46:57.931763 (XEN) CPU:    1
>> Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
>> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
>> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
>> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
>> Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
>> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
>> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
>> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
>> Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
>> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>> ...
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
>
> I suppose that's a result of 22957:c5c4688d5654 - as I understand it,
> exiting the loop is only possible if two consecutive invocations of
> pick_cpu return the same result.  This, however, is precisely what the
> pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
> systems (so that it's not always the same entity that gets selected).
>
> But even beyond that particular aspect, relying on any form of
> "stability" of the returned value isn't correct.
>
> Plus, running pick_cpu repeatedly without actually using its result
> is wrong wrt idle_bias updating too - that's why csched_vcpu_acct()
> calls _csched_cpu_pick() with the commit argument set to false (which
> will result in a subsequent call - through pick_cpu - with the argument
> set to true being likely to return the same value, but there's no
> correctness dependency on that).  So 22948:2d35823a86e7 already wasn't
> really correct in putting a loop around pick_cpu.
>
> It's also not clear to me what the surrounding
>     if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
> is supposed to filter, as the lock pointer gets set only when a
> CPU gets brought up.

Yeah, but the vcpu can change cpus while we don't hold the lock.
This means old_cpu can change between selecting the lock and actually
taking it...

> As I don't really understand what is being tried to achieve here,
> I also can't really suggest a possible fix other than reverting both
> offending changesets.

I'll send a patch as a suggestion :-)


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions           e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                          Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
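Juergen's point is the usual take-then-revalidate pattern for per-CPU locks: sample the vCPU's current CPU, take that CPU's scheduler lock, then re-check that the vCPU has not migrated in the meantime (otherwise the wrong lock is held). A hedged sketch of the idea using pthreads follows; the types and names (pcpu, vcpu, vcpu_sched_lock) are invented for illustration and this is not Xen's vcpu_schedule_lock().

    /* Illustrative sketch of "lock the CPU the vcpu is on, then revalidate"
     * -- hypothetical types and names, not Xen's locking code. */
    #include <pthread.h>
    #include <stdio.h>

    #define NR_CPUS 4

    struct pcpu { pthread_mutex_t sched_lock; };
    struct vcpu { volatile int processor; };    /* may change while unlocked */

    static struct pcpu pcpus[NR_CPUS] = {
        { PTHREAD_MUTEX_INITIALIZER }, { PTHREAD_MUTEX_INITIALIZER },
        { PTHREAD_MUTEX_INITIALIZER }, { PTHREAD_MUTEX_INITIALIZER },
    };

    static pthread_mutex_t *vcpu_sched_lock(struct vcpu *v)
    {
        for (;;) {
            int cpu = v->processor;                  /* sample current cpu */
            pthread_mutex_t *lock = &pcpus[cpu].sched_lock;

            pthread_mutex_lock(lock);
            /* The vcpu may have migrated between the sample and the lock
             * acquisition; if so we hold the wrong CPU's lock and retry. */
            if (cpu == v->processor)
                return lock;                         /* still valid */
            pthread_mutex_unlock(lock);
        }
    }

    int main(void)
    {
        struct vcpu v = { .processor = 2 };
        pthread_mutex_t *lock = vcpu_sched_lock(&v);
        printf("locked scheduler lock of cpu %d\n", v.processor);
        pthread_mutex_unlock(lock);
        return 0;
    }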
Jan Beulich
2011-Mar-14 16:08 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> >
>> >     cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> >     if ( commit )
>> >        CSCHED_PCPU(nxt)->idle_bias = cpu;
>> >     cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> >
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>>
>> I had originally spent quite a while to verify that the loop this is in
>> can't be infinite (i.e. there's going to be always at least one bit
>> removed from "cpus"), and did so again during the last half hour
>> or so.
>
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: The logic can select a CPU that's not in the
vCPU's affinity mask.  How I managed to not note this when I originally
put this change together I can't tell.  I'll send a patch in a moment,
and I think after that patch it's also easier to see that each iteration
will remove at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain so I'd like George's
>> > opinion before checking anything in.
>>
>> No - that was precisely done the opposite direction to get
>> better symmetry of load across all CPUs.  With what you propose,
>> idle_bias would become meaningless.
>
> I don't see why it would.  As I said, having picked a core we might not
> iterate to pick the best cpu within that core, but the round-robining
> effect is still there.  And even if not, I figured a hypervisor crash is
> worse than a suboptimal scheduling decision. :)

Sure.  Just that this code has been there for quite a long time, and
it would be really strange to only now see it start producing hangs
(which apparently aren't that difficult to reproduce - iirc a similar
one was sent around by Ian a few days earlier).

Jan
Tim Deegan
2011-Mar-14 16:17 UTC
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
At 16:08 +0000 on 14 Mar (1300118917), Jan Beulich wrote:
> Actually, while I don't think that this can happen, something else is
> definitely broken here: The logic can select a CPU that's not in the
> vCPU's affinity mask.  How I managed to not note this when I originally
> put this change together I can't tell.  I'll send a patch in a moment,
> and I think after that patch it's also easier to see that each iteration
> will remove at least one bit.

Yes, as long as the cpu selected has to be in "cpus", the loop is
definitely safe.

> Sure.  Just that this code has been there for quite a long time, and
> it would be really strange to only now see it start producing hangs
> (which apparently aren't that difficult to reproduce - iirc a similar
> one was sent around by Ian a few days earlier).

Agreed; the other branch of this thread is clearly where this particular
hang is coming from.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)