Hello,

We have had a bug raised against Xen-3.4 that the kexec path fails, on
HP BL465c G7 blades. The problem does not reproduce on any other AMD
machines I have to hand.

On further investigation, it appears that if the crashing cpu is #0,
then the kexec path hangs forever trying to grab the already locked
legacy_hpet_event.lock in hpet_disable_legacy_broadcast(). Removing the
lock/unlock pair causes the kexec crash path to work as expected.

If the crashing cpu is not #0, then local_time_calibration() gets
worried and dumps the calibration data, and hangs at some later point
which I have yet to find. This hang happens while performing the NMI
shootdown of other cpus.

The support engineer who raised the bug says that it doesn't occur with
Xen-4.1. Is there anything architecturally new in the Magny-Cours
processors which might explain this behavior?

I am unwilling to try and backport the hpet code from Xen-4.x without
understanding the problem, although it is a possible solution.

Thanks

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> We have had a bug raised against Xen-3.4 that the kexec path fails, on
> HP BL465c G7 blades. The problem does not reproduce on any other AMD
> machines I have to hand.
>
> On further investigation, it appears that if the crashing cpu is #0,
> then the kexec path hangs forever trying to grab the already locked
> legacy_hpet_event.lock in hpet_disable_legacy_broadcast(). Removing the
> lock/unlock pair causes the kexec crash path to work as expected.

Are you sure it is locked (rather than never initialized)? The problem
could be that hpet_broadcast_is_available() returns true because of
num_hpets_used > 0, yet hpet_broadcast_init() didn't make it down
to spin_lock_init(&legacy_hpet_event.lock).

> If the crashing cpu is not #0, then local_time_calibration() gets
> worried and dumps the calibration data, and hangs at some later point
> which I have yet to find. This hang happens while performing the NMI
> shootdown of other cpus.
>
> The support engineer who raised the bug says that it doesn't occur with
> Xen-4.1. Is there anything architecturally new in the Magny-Cours
> processors which might explain this behavior?

Possibly more a question of the surrounding platform, namely whether
there are HPETs in the system, and whether they get used for the
C-state broadcasting.

Jan
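
To illustrate why a lock that was never initialized can look exactly
like one that is already held, here is a minimal sketch. It is a toy
model, not Xen's spinlock code: the type and function names are
hypothetical, and it assumes an encoding in which 1 means "unlocked"
and anything <= 0 means "held", so that zero-filled static storage
reads as held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model lock. Assumed encoding: 1 = unlocked, <= 0 = held. */
typedef struct {
    int lock;
} model_spinlock_t;

/* Puts the lock into the unlocked state (the step that gets skipped). */
static void model_spin_lock_init(model_spinlock_t *l)
{
    assert(l != NULL);
    l->lock = 1;
}

/*
 * One acquisition attempt. Returns true if the lock was taken.
 * A blocking spin_lock() would simply retry until this succeeds.
 */
static bool model_spin_trylock(model_spinlock_t *l)
{
    assert(l != NULL);
    if ( l->lock <= 0 )
        return false;   /* looks held: a blocking caller spins forever */
    l->lock = 0;
    return true;
}

/* Static storage is zero-filled, so this lock starts out looking held. */
static model_spinlock_t never_initialised;

/* This one is given a proper init call by the caller before use. */
static model_spinlock_t initialised;
```

With this encoding, no attempt on never_initialised can ever succeed,
which is indistinguishable from the symptom of waiting on a lock that
some other cpu holds.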

On 16/08/11 11:09, Jan Beulich wrote:
>>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> We have had a bug raised against Xen-3.4 that the kexec path fails, on
>> HP BL465c G7 blades. The problem does not reproduce on any other AMD
>> machines I have to hand.
>>
>> On further investigation, it appears that if the crashing cpu is #0,
>> then the kexec path hangs forever trying to grab the already locked
>> legacy_hpet_event.lock in hpet_disable_legacy_broadcast(). Removing the
>> lock/unlock pair causes the kexec crash path to work as expected.
> Are you sure it is locked (rather than never initialized)? The problem
> could be that hpet_broadcast_is_available() returns true because of
> num_hpets_used > 0, yet hpet_broadcast_init() didn't make it down
> to spin_lock_init(&legacy_hpet_event.lock).

That is a very good point. I had not considered it, and it turns out
that legacy broadcast is never set up:

(XEN) HPET: starting hpet_broadcast_init()
(XEN) HPET: hpet_setup() successful
(XEN) HPET: 4 timers in total, 3 timers will be used for broadcast

hpet_broadcast_init() exits inside the "if ( num_hpets_used > 0 )"
clause (as the boot dmesg doesn't printk the line immediately following
the if clause), meaning that legacy broadcasts are never set up.

Therefore, the logic

    if ( hpet_broadcast_is_available() )
        hpet_disable_legacy_broadcast();

in several places is wrong, and should be "if hpet_legacy broadcast
used". Judging by the similarities in this regard between Xen-3.4 and
Xen-4.x, I am now not certain that Xen-4.x is immune and will now
proceed to investigate this.

>> If the crashing cpu is not #0, then local_time_calibration() gets
>> worried and dumps the calibration data, and hangs at some later point
>> which I have yet to find. This hang happens while performing the NMI
>> shootdown of other cpus.
>>
>> The support engineer who raised the bug says that it doesn't occur with
>> Xen-4.1. Is there anything architecturally new in the Magny-Cours
>> processors which might explain this behavior?
> Possibly more a question of the surrounding platform, namely whether
> there are HPETs in the system, and whether they get used for the
> C-state broadcasting.
>
> Jan
>

Why would C-state broadcasting make a difference at this point? I have
narrowed the crash down a bit, and local_time_calibration() is dumping
its state after one_cpu_only() and before the shootdown actually
occurs. However, I can't see any code between these two points which
alters the state of the other CPU, which should still be running
normally at this point.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
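
The difference between guarding on "any broadcast available" and
guarding on "legacy broadcast actually set up" can be sketched as a
small model. Everything here is hypothetical (the struct, the function
names, and the assumption that availability is "legacy set up, or
num_hpets_used > 0"); it only mirrors the control flow described in
this message and is not the real Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the broadcast state, not Xen's data structures. */
struct model_hpet_state {
    unsigned int num_hpets_used;  /* timers claimed for per-timer broadcast */
    bool legacy_set_up;           /* true only if the legacy path ran its
                                     setup, including the lock init */
};

/* Mirrors the early exit seen in the boot log above. */
static void model_broadcast_init(struct model_hpet_state *s,
                                 unsigned int timers_for_broadcast)
{
    assert(s != NULL);
    s->num_hpets_used = timers_for_broadcast;
    s->legacy_set_up = false;
    if ( timers_for_broadcast > 0 )
        return;                   /* legacy setup is never reached */
    s->legacy_set_up = true;
}

/* Broad guard (assumed shape): true for either kind of broadcast. */
static bool model_broadcast_is_available(const struct model_hpet_state *s)
{
    return s->legacy_set_up || s->num_hpets_used > 0;
}

/* Narrow guard: true only when the legacy path was really set up. */
static bool model_legacy_broadcast_used(const struct model_hpet_state *s)
{
    return s->legacy_set_up;
}

/*
 * Returns true if the shutdown path would call the legacy disable
 * routine, and so take its lock, although that lock was never set up.
 */
static bool model_shutdown_hits_bad_lock(const struct model_hpet_state *s,
                                         bool narrow_guard)
{
    bool call_disable = narrow_guard ? model_legacy_broadcast_used(s)
                                     : model_broadcast_is_available(s);
    return call_disable && !s->legacy_set_up;
}
```

With 3 timers used for broadcast, as in the log above, the broad guard
walks into the uninitialized lock and the narrow guard does not. With
no timers used, both guards behave the same.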

>>> On 16.08.11 at 14:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 16/08/11 11:09, Jan Beulich wrote:
>>>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> The support engineer who raised the bug says that it doesn't occur with
>>> Xen-4.1. Is there anything architecturally new in the Magny-Cours
>>> processors which might explain this behavior?
>> Possibly more a question of the surrounding platform, namely whether
>> there are HPETs in the system, and whether they get used for the
>> C-state broadcasting.
>
> Why would C-state broadcasting make a difference at this point? I have
> narrowed the crash down a bit, and local_time_calibration() is dumping
> its state after one_cpu_only() and before the shootdown actually
> occurs. However, I can't see any code between these two points which
> alters the state of the other CPU, which should still be running
> normally at this point.

That "num_hpets_used > 0" check in hpet_broadcast_is_available() could
be false for all other AMD systems you had tried this on, and hence
you might not be getting into hpet_disable_legacy_broadcast() there at
all.

(4.0.2 and 4.1.1 have, btw., an extra non-zero check against
legacy_hpet_event.shift in hpet_disable_legacy_broadcast(); 4.0.1 and
4.1.0 don't.)

Jan
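
A callee-side check of the kind mentioned for 4.0.2 and 4.1.1 might
look roughly like the sketch below. This is a toy model under stated
assumptions, not the actual patch: the struct layout, the names, the
value 32, and the premise that shift is non-zero only once legacy
setup has run are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the legacy broadcast event. */
struct model_legacy_event {
    unsigned int shift;  /* assumed: non-zero only after legacy setup */
    int lock;            /* assumed encoding: 1 = unlocked, 0 = held */
    bool broadcasting;
};

/* Legacy setup: initialises the lock and gives shift a non-zero value. */
static void model_legacy_setup(struct model_legacy_event *ev)
{
    assert(ev != NULL);
    ev->lock = 1;
    ev->shift = 32;      /* illustrative value only */
    ev->broadcasting = true;
}

/*
 * Returns true if the lock was taken and broadcast disabled, false if
 * the event was never set up and the routine bailed out early.
 */
static bool model_disable_legacy_broadcast(struct model_legacy_event *ev)
{
    assert(ev != NULL);
    if ( ev->shift == 0 )
        return false;    /* never set up: do not touch the lock */

    assert(ev->lock == 1);   /* model only: expect an uncontended lock */
    ev->lock = 0;            /* take the lock */
    ev->broadcasting = false;
    ev->lock = 1;            /* release the lock */
    return true;
}
```

The early return keeps every caller safe without each call site having
to know whether the legacy path was set up.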