I've been tracking down a bug where a multi-vcpu VM hangs in hvmloader on credit2, but not on credit1. It hangs while trying to bring up extra cpus.

It turns out that an unintended quirk in credit2 (some might call it a bug) causes a scheduling order which exposes a race in the vlapic init_sipi tasklet handling code.

The code as it stands right now is meant to do this:
* v0 does an APIC ICR write with APIC_DM_STARTUP, trapping to Xen.
* vlapic code checks to see that v1 is down (vlapic.c:318); finds that it is down, and schedules the tasklet, returning X86_EMUL_RETRY (vlapic.c:270).
* Tasklet runs, brings up v1.
* v1 starts running.
* v0 re-executes the instruction, finds that v1 is up, and returns X86_EMUL_OK, allowing the instruction to move forward.
* v1 does some diagnostics, and takes itself offline.

Unfortunately, the credit2 scheduler almost always preempts v0 immediately, allowing v1 to run to completion and bring itself back offline again before v0 can retry the MMIO. It looks like this:
* v0 does APIC ICR APIC_DM_STARTUP write, trapping to Xen.
* vlapic code checks to see that v1 is down; finds that it is down, schedules the tasklet, returns X86_EMUL_RETRY.
* Tasklet runs, brings up v1.
* Credit2 preempts v0, allowing v1 to run.
* v1 starts running.
* v1 does some diagnostics, and takes itself offline.
* v0 re-executes the instruction, finds that v1 is down, and again schedules the tasklet and returns X86_EMUL_RETRY.
* For some reason the tasklet doesn't actually bring up v1 again (presumably because it hasn't had an APIC_DM_INIT again); so v0 is stuck doing X86_EMUL_RETRY forever.

The problem is that VPF_down is used as the test to see if the tasklet has finished its work; but there's no guarantee that the scheduler will run v0 before v1 has come up and gone back down again.

I discussed this with Tim, and we agreed that we should ask you.
One option would be to simply make vlapic_schedule_sipi_init_ipi() always return X86_EMUL_OK, but we weren't sure if that might cause some other problems.

The "right" solution, if synchronization is needed, is to have an explicit signal sent back that the instruction can be allowed to complete, rather than relying on reading VPF_down, which may cause races.

Thoughts?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
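The two orderings above, and the "explicit signal" alternative, can be sketched as a toy model. This is not Xen code: every name below (toy_vcpu, toy_tasklet, toy_simulate, and the init_pending and sipi_done fields) is hypothetical, and the rule that the tasklet only brings v1 up while an INIT is still pending is an assumption that mirrors the guess in the message, not something confirmed from vlapic.c.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in Xen. */
struct toy_vcpu {
    bool down;          /* stands in for VPF_down on v1 */
    bool init_pending;  /* assumption: v1 got an INIT and awaits a SIPI */
    bool sipi_done;     /* hypothetical explicit "tasklet finished" signal */
};

/* Tasklet: brings v1 up only if an INIT is still pending (assumed),
 * and always records that it has finished its work. */
static void toy_tasklet(struct toy_vcpu *v1)
{
    if (v1->init_pending) {
        v1->init_pending = false;
        v1->down = false;
    }
    v1->sipi_done = true;
}

/* v1 runs its diagnostics and takes itself offline again. */
static void toy_v1_run_to_completion(struct toy_vcpu *v1)
{
    v1->down = true;
}

/*
 * Returns the attempt number on which v0's ICR write completes, or -1 if
 * it is still retrying after max_tries attempts.
 *   v1_runs_first:   the scheduler lets v1 run to completion before v0 retries
 *   explicit_signal: completion is judged by sipi_done instead of !down
 */
static int toy_simulate(bool v1_runs_first, bool explicit_signal, int max_tries)
{
    struct toy_vcpu v1 = { .down = true, .init_pending = true, .sipi_done = false };

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        /* v0 executes (or re-executes) the ICR write. */
        bool complete = explicit_signal ? v1.sipi_done : !v1.down;
        if (complete)
            return attempt;             /* the X86_EMUL_OK case */

        /* The X86_EMUL_RETRY case: the tasklet runs before the retry. */
        toy_tasklet(&v1);

        if (v1_runs_first)
            toy_v1_run_to_completion(&v1);
    }
    return -1;                          /* stuck retrying */
}
```

In this model, toy_simulate(false, false, 10) completes on the second attempt (the intended ordering), toy_simulate(true, false, 10) never completes (the ordering seen on credit2), and toy_simulate(true, true, 10) completes on the second attempt because it no longer depends on when v1 happens to be up.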
Good race! I'll work out a fix and let you know.

 K.

On 18/10/2010 18:16, "George Dunlap" <dunlapg@umich.edu> wrote:
> [...]
It turned out there is a really simple hypervisor fix for this, which I checked in as c/s 22265.

By the way, I tested this with "sched=credit2 maxcpus=1", and in that configuration the next thing of note is that the xenstore interactions in hvmloader run like a dog. Probably something to do with hvmloader polling via yield, and some bad interaction with credit2's handling of yield. I seemed to get stuck with no VCPUs running at all, and dom0 unresponsive, which was weird. Probably hvmloader should really be using SCHEDOP_poll and properly waiting on its end of the event channel. But there is still obviously some fishy scheduling issue here, regardless of hvmloader's current naivety.

 -- Keir

On 18/10/2010 18:16, "George Dunlap" <dunlapg@umich.edu> wrote:
> [...]
Yes, credit2 definitely has some issues. The strange scheduling order that exposed this race condition was definitely produced by some code that wasn't doing what it should do. :-) There's also the problem that the credit2 code misunderstands the current cpupools interface, which results in VMs ending up with a huge negative credit balance. I've got fixes, and will probably send a patchqueue in the next couple of days.

Credit2 doesn't implement yield (yet), so it will probably just run the same vcpu again until the yielding vcpu's credit is lower than another runnable VM's credit. This is pretty much the same thing credit1 would do until I added the yield patch (which I think was just a month or two ago in -unstable). However, this combined with the bug above means that a new VM would have to burn credit all the way down to dom0's negative credit before "yield" would give up priority. I think that would probably explain the dog-like behavior you're seeing if you only have 1 cpu. :-)

 -George

On Wed, Oct 20, 2010 at 10:00 AM, Keir Fraser <keir@xen.org> wrote:
> [...]
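The yield behaviour described in the reply above can be illustrated with a small toy model. It is only a sketch under stated assumptions: the scheduler always picks the runnable vcpu with the most credit, yield is a no-op, and the running vcpu burns a fixed amount of credit per slice. The names (toy_sched_vcpu, toy_pick_next, toy_slices_until_switch) and the credit figures are hypothetical and are not taken from the credit2 source.

```c
#include <stddef.h>

/* Toy model only; not the credit2 data structures. */
struct toy_sched_vcpu {
    const char *name;
    long credit;
};

/* Pick the vcpu with the most credit; ties go to the lowest index. */
static size_t toy_pick_next(const struct toy_sched_vcpu *v, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (v[i].credit > v[best].credit)
            best = i;
    return best;
}

/*
 * Count how many slices a yielding guest vcpu keeps running before the
 * other vcpu is picked. Yield is modelled as a no-op, so the guest is
 * simply re-selected until its credit drops below the other's.
 * Returns -1 if burn_per_slice is not positive (the loop would not end).
 */
static long toy_slices_until_switch(long yielder_credit, long other_credit,
                                    long burn_per_slice)
{
    struct toy_sched_vcpu v[2] = {
        { "hvm-guest", yielder_credit },
        { "dom0",      other_credit   },
    };
    long slices = 0;

    if (burn_per_slice <= 0)
        return -1;

    while (toy_pick_next(v, 2) == 0) {
        v[0].credit -= burn_per_slice;  /* guest "yields" but runs again */
        slices++;
    }
    return slices;
}
```

With made-up numbers, a guest starting at 100 credit against another vcpu at 50, burning 10 per slice, runs 6 slices before the switch. If the other vcpu sits at -1000 (a large negative balance like the one the cpupools bug is said to produce), the guest runs 111 slices first, which is the kind of delay that would make a yield-based polling loop crawl on a single cpu.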