I discovered a triple fault bug in the crash kernel while working on the wider reentrant NMI/MCE issues discussed over the past few weeks on the list. As this is quite a substantial change, I present it here as an RFC ahead of the longer series for reentrant behaviour (which I am still working on).

I present patches 2 through 4 simply as debugging aids for the bug fixed in patch 1. I am primarily concerned about RFC feedback on patch 1, although comments on patches 2 and 3 will not go amiss, as they are expected to be in the longer reentrant series. Patch 4 is a debug key '1' which deliberately creates reentrant NMIs and stack corruption. It is not intended for actually committing into the codebase.

I am at a loss to explain why the triple fault is actually occurring; I failed to get anything from early printk in the crash kernel, but did not attempt to debug the extra gubbins which kexec puts in the blob. The fault however is completely deterministic, and can be verified by commenting out the call to enable_nmis() in machine_kexec().

My setup is xen-unstable, 2.6.32-classic based dom0 and kdump kernels, with kexec-tools 2.0.2.

~Andrew
Andrew Cooper
2012-Dec-04 18:16 UTC
[PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
Experimentally, certain crash kernels will triple fault very early after starting if started with NMIs disabled. This was discovered when experimenting with a debug keyhandler which deliberately created a reentrant NMI, causing stack corruption.

Because of this discovered bug, and because the future changes to NMI handling will make the kexec path more fragile, take the time now to bullet-proof the kexec behaviour to be safe in all circumstances.

This patch adds three new low-level routines:
 * trap_nop
   This is a no-op handler which irets immediately
 * nmi_crash
   This is a special NMI handler for use during a kexec crash
 * enable_nmis
   This function enables NMIs by executing an iret-to-self

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:

 * Disable interrupt stack tables.
   Disabling ISTs will prevent stack corruption from future NMIs and
   MCEs. As Xen is going to crash, we are not concerned with the
   security race condition with 'sysret'; any guest which managed to
   benefit from the race condition will only be able to execute a
   handful of instructions before it is killed anyway, and a guest is
   still unable to trigger the race condition in the first place.

 * Swap the NMI trap handlers.
   The crashing pcpu gets the nop handler, to prevent it getting stuck
   in an NMI context and causing a hang instead of a crash. The
   non-crashing pcpus all get the nmi_crash handler, which is designed
   never to return.

do_nmi_crash() will:

 * Save the crash notes and shut the pcpu down.
   There is now an extra per-cpu variable to prevent us from executing
   this multiple times. In the case where we reenter midway through,
   attempt the whole operation again in preference to not completing
   it in the first place.

 * Set up another NMI at the LAPIC.
   Even when the LAPIC has been disabled, the ID and command registers
   are still usable. As a result, we can deliberately queue up a new
   NMI to re-interrupt us later if NMIs get unlatched. Because of the
   call to __stop_this_cpu(), we have to hand-craft self_nmi() to be
   safe from General Protection Faults.

 * Fall into an infinite loop.

machine_kexec() will:

 * Swap the MCE handlers to be a nop.
   We cannot prevent MCEs from being delivered when we pass off to the
   crash kernel, and the less Xen context is being touched the better.

 * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,97 @@
 
 static atomic_t waiting_for_crash_ipi;
 static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
 
-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
 {
-    /* Don't do anything if this handler is invoked on crashing cpu.
-     * Otherwise, system will completely hang. Crashing cpu can get
-     * an NMI if system was initially booted with nmi_watchdog parameter.
+    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+    ASSERT( crashing_cpu != smp_processor_id() );
+
+    /* Save crash information and shut down CPU.  Attempt only once. */
+    if ( ! this_cpu(crash_save_done) )
+    {
+        kexec_crash_save_cpu();
+        __stop_this_cpu();
+
+        this_cpu(crash_save_done) = 1;
+    }
+
+    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
+     * back to its boot state, so we are unable to rely on the regular
+     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+     * (The likely scenario is that we have reverted from x2apic mode to
+     * xapic, at which point #GPFs will occur if we use the apic_*
+     * functions.)
+     *
+     * The ICR and APIC ID of the LAPIC are still valid even during
+     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
+     * queue up another NMI, to force us back into this loop if we exit.
      */
-    if ( cpu == crashing_cpu )
-        return 1;
-    local_irq_disable();
+    switch ( current_local_apic_mode() )
+    {
+        u32 apic_id;
 
-    kexec_crash_save_cpu();
+    case APIC_MODE_X2APIC:
+        apic_id = apic_rdmsr(APIC_ID);
 
-    __stop_this_cpu();
+        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
+        break;
 
-    atomic_dec(&waiting_for_crash_ipi);
+    case APIC_MODE_XAPIC:
+        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
+
+        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+            cpu_relax();
+
+        apic_mem_write(APIC_ICR2, apic_id << 24);
+        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+        break;
+
+    default:
+        break;
+    }
 
     for ( ; ; )
         halt();
-
-    return 1;
 }
 
 static void nmi_shootdown_cpus(void)
 {
     unsigned long msecs;
+    int i;
+    int cpu = smp_processor_id();
 
     local_irq_disable();
 
-    crashing_cpu = smp_processor_id();
+    crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;
 
     atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
-    /* Would it be better to replace the trap vector here? */
-    set_nmi_callback(crash_nmi_callback);
+
+    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash, which
+     * invokes do_nmi_crash (above), causing them to write state and
+     * fall into a loop.  The crashing pcpu gets the nop handler to
+     * cause it to return to this function ASAP.
+     *
+     * Furthermore, disable stack tables for NMIs and MCEs to prevent
+     * race conditions resulting in corrupt stack frames.  As Xen is
+     * about to die, we no longer care about the security-related race
+     * condition with 'sysret' which ISTs are designed to solve.
+     */
+    for ( i = 0; i < nr_cpu_ids; ++i )
+        if ( idt_tables[i] )
+        {
+            idt_tables[i][TRAP_nmi].a           &= ~(7UL << 32);
+            idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
+
+            if ( i == cpu )
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+            else
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
+        }
+
     /* Ensure the new callback function is set before sending out the NMI. */
     wmb();
 
diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
      */
     local_irq_disable();
 
+    /* Now regular interrupts are disabled, we need to reduce the impact
+     * of interrupts not disabled by 'cli'.  NMIs have already been
+     * dealt with by machine_crash_shutdown().
+     *
+     * Set all pcpu MCE handlers to be a noop. */
+    set_intr_gate(TRAP_machine_check, &trap_nop);
+
+    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
+     * not like running with NMIs disabled. */
+    enable_nmis();
+
     /*
      * compat_machine_kexec() returns to idle pagetables, which requires us
      * to be running on a static GDT mapping (idle pagetables have no GDT
 
diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -635,11 +635,42 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
         jmp   handle_ist_exception
 
+ENTRY(nmi_crash)
+        cli
+        pushq $0
+        movl  $TRAP_nmi,4(%rsp)
+        SAVE_ALL
+        movq  %rsp,%rdi
+        callq do_nmi_crash /* Does not return */
+        ud2
+
 ENTRY(machine_check)
         pushq $0
         movl  $TRAP_machine_check,4(%rsp)
         jmp   handle_ist_exception
 
+/* No-op trap handler.  Required for the kexec path. */
+ENTRY(trap_nop)
+        iretq
+
+/* Enable NMIs.  No special register assumptions, and all preserved. */
+ENTRY(enable_nmis)
+        push %rax
+        movq %rsp, %rax /* Grab RSP before pushing */
+
+        /* Set up stack frame */
+        pushq $0               /* SS */
+        pushq %rax             /* RSP */
+        pushfq                 /* RFLAGS */
+        pushq $__HYPERVISOR_CS /* CS */
+        leaq  1f(%rip),%rax
+        pushq %rax             /* RIP */
+
+        iretq /* Disable the hardware NMI latch */
+1:
+        popq %rax
+        retq
+
 .section .rodata, "a", @progbits
 
 ENTRY(exception_table)

diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
 DECLARE_TRAP_HANDLER(divide_error);
 DECLARE_TRAP_HANDLER(debug);
 DECLARE_TRAP_HANDLER(nmi);
+DECLARE_TRAP_HANDLER(nmi_crash);
 DECLARE_TRAP_HANDLER(int3);
 DECLARE_TRAP_HANDLER(overflow);
 DECLARE_TRAP_HANDLER(bounds);
@@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
 DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
 #undef DECLARE_TRAP_HANDLER
 
+void trap_nop(void);
+void enable_nmis(void);
+
 void syscall_enter(void);
 void sysenter_entry(void);
 void sysenter_eflags_saved(void);
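The IST-clearing line above, idt_tables[i][TRAP_nmi].a &= ~(7UL << 32), works because in a 64-bit interrupt gate the 3-bit IST index lives in bits 32-34 of the gate's low quadword, and an index of zero tells the CPU not to switch stacks when delivering the exception. A minimal standalone sketch of the same operation (the helper name is invented for illustration, not part of the patch):

    #include <stdint.h>

    /* Bits 32-34 of the low quadword of a 64-bit IDT gate hold the IST
     * index; zero means "deliver on the current stack, no IST switch". */
    #define GATE_IST_MASK (7ULL << 32)

    static inline void gate_clear_ist(uint64_t *gate_lo)
    {
        *gate_lo &= ~GATE_IST_MASK;
    }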
Andrew Cooper
2012-Dec-04 18:16 UTC
[PATCH 2 of 4 RFC] xen/panic: Introduce panic_in_progress
The panic() function will re-enable NMIs. In addition, because of its private spinlock, panic() is not safe to reenter on the same pcpu from an NMI or MCE. We introduce a panic_in_progress flag and an is_panic_in_progress() helper, to be used in subsequent patches. (And also fix a whitespace error.)

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r b15d3ae525af -r 48a60a407e15 xen/drivers/char/console.c
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -65,6 +65,8 @@ static int __read_mostly sercon_handle
 
 static DEFINE_SPINLOCK(console_lock);
 
+static atomic_t panic_in_progress = ATOMIC_INIT(0);
+
 /*
  * To control the amount of printing, thresholds are added.
  * These thresholds correspond to the XENLOG logging levels.
@@ -980,13 +982,22 @@ static int __init debugtrace_init(void)
  *
  ************************************************************** */
 
+/* Is a panic() in progress?  Used to prevent reentrancy of panic() from
+ * NMIs/MCEs, as there is potential to deadlock from those contexts. */
+long is_panic_in_progress(void)
+{
+    return atomic_read(&panic_in_progress);
+}
+
 void panic(const char *fmt, ...)
 {
     va_list args;
     unsigned long flags;
     static DEFINE_SPINLOCK(lock);
     static char buf[128];
-    
+    atomic_set(&panic_in_progress, 1);
+
     debugtrace_dump();
 
     /* Protects buf[] and ensure multi-line message prints atomically. */

diff -r b15d3ae525af -r 48a60a407e15 xen/include/xen/lib.h
--- a/xen/include/xen/lib.h
+++ b/xen/include/xen/lib.h
@@ -78,6 +78,7 @@ extern void debugtrace_printk(const char
 #define _p(_x) ((void *)(unsigned long)(_x))
 extern void printk(const char *format, ...)
     __attribute__ ((format (printf, 1, 2)));
+extern long is_panic_in_progress(void);
 extern void panic(const char *format, ...)
     __attribute__ ((format (printf, 1, 2)));
 extern long vm_assist(struct domain *, unsigned int, unsigned int);
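The intended consumer of this flag runs in NMI or MCE context (patch 3 adds the real one); the pattern is simply an early bail-out. A minimal sketch, with a hypothetical caller name:

    /* Hypothetical example, not part of the patch: code which may run in
     * NMI/MCE context backs off rather than risking a deadlock against
     * panic()'s private spinlock. */
    void example_nmi_context_work(void)
    {
        if ( is_panic_in_progress() )
            return; /* Xen is already dying; stay out of the way. */

        /* ... normal NMI-context work ... */
    }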
Andrew Cooper
2012-Dec-04 18:16 UTC
[PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
The (old) function do_nmi() is not reentrant-safe. Rename it to _do_nmi() and present a new do_nmi() which guards against reentrancy.

If a reentrant NMI has been detected, then it is highly likely that the outer NMI exception frame has been corrupted, meaning we cannot return to the original context. In this case, we panic() in an obvious manner rather than falling into an infinite loop.

panic() however is not safe to reenter from an NMI context, as an NMI (or MCE) can interrupt it inside its critical section, at which point a new call to panic() will deadlock. As a result, we bail early if a panic() is already in progress, as Xen is about to die anyway.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--
I am fairly sure this is safe with the current kexec_crash functionality, which involves holding all non-crashing pcpus in an NMI loop. In the case of reentrant NMIs and panic_in_progress, we will repeatedly bail early in an infinite loop of NMIs, which has the same intended effect of simply causing all non-crashing CPUs to stay out of the way while the main crash occurs.

diff -r 48a60a407e15 -r f6ad86b61d5a xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -88,6 +88,7 @@ static char __read_mostly opt_nmi[10] =
 string_param("nmi", opt_nmi);
 
 DEFINE_PER_CPU(u64, efer);
+static DEFINE_PER_CPU(bool_t, nmi_in_progress) = 0;
 
 DEFINE_PER_CPU_READ_MOSTLY(u32, ler_msr);
 
@@ -3182,7 +3183,8 @@ static int dummy_nmi_callback(struct cpu
 
 static nmi_callback_t nmi_callback = dummy_nmi_callback;
 
-void do_nmi(struct cpu_user_regs *regs)
+/* This function should never be called directly.  Use do_nmi() instead. */
+static void _do_nmi(struct cpu_user_regs *regs)
 {
     unsigned int cpu = smp_processor_id();
     unsigned char reason;
@@ -3208,6 +3210,44 @@ void do_nmi(struct cpu_user_regs *regs)
     }
 }
 
+/* This function is NOT SAFE to call from C code in general.
+ * Use with extreme care! */
+void do_nmi(struct cpu_user_regs *regs)
+{
+    bool_t *in_progress = &this_cpu(nmi_in_progress);
+
+    if ( is_panic_in_progress() )
+    {
+        /* A panic is already in progress.  It may have reenabled NMIs,
+         * or we are simply unlucky to receive one right now.  Either
+         * way, bail early, as Xen is about to die.
+         *
+         * TODO: Ideally we should exit without executing an iret, to
+         * leave NMIs disabled, but that option is not currently
+         * available to us.
+         */
+        return;
+    }
+
+    if ( test_and_set_bool(*in_progress) )
+    {
+        /* Crash in an obvious manner, as opposed to falling into an
+         * infinite loop because our exception frame corrupted the
+         * exception frame of the previous NMI.
+         *
+         * TODO: This check does not cover all possible cases of corrupt
+         * exception frames, but it is substantially better than
+         * nothing.
+         */
+        console_force_unlock();
+        show_execution_state(regs);
+        panic("Reentrant NMI detected\n");
+    }
+
+    _do_nmi(regs);
+    *in_progress = 0;
+}
+
 void set_nmi_callback(nmi_callback_t callback)
 {
     nmi_callback = callback;
Andrew Cooper
2012-Dec-04 18:16 UTC
[PATCH 4 of 4 RFC] xen/nmi: DO NOT APPLY: debugkey to deliberately invoke a reentrant NMI
FOR TESTING PURPOSES ONLY. Use debug key '1'.

diff -r f6ad86b61d5a -r 303d94fa720c xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -39,6 +39,7 @@ static unsigned int nmi_perfctr_msr; /*
 static unsigned int nmi_p4_cccr_val;
 static DEFINE_PER_CPU(struct timer, nmi_timer);
 static DEFINE_PER_CPU(unsigned int, nmi_timer_ticks);
+static DEFINE_PER_CPU(unsigned int, reent_state);
 
 /* opt_watchdog: If true, run a watchdog NMI on each processor. */
 bool_t __initdata opt_watchdog = 0;
@@ -532,10 +533,106 @@ static struct keyhandler nmi_stats_keyha
     .desc = "NMI statistics"
 };
 
+void do_nmi_reent_fault(struct cpu_user_regs *regs)
+{
+    printk("In fault handler from NMI - iret from this is bad\n");
+    mdelay(10);
+}
+
+void do_nmi_reentrant_handler(void)
+{
+    volatile unsigned int *state_ptr = &(this_cpu(reent_state));
+    unsigned int state;
+
+    state = *state_ptr;
+
+    switch ( state )
+    {
+    case 0: // Not a reentrant NMI request
+        return;
+
+    case 1: // Outer NMI call - let's try to set up a reentrant case
+        printk("In outer reentrant case\n");
+
+        // Signal inner state
+        *state_ptr = state = 2;
+        // Queue up another NMI which can't currently be delivered
+        self_nmi();
+        /* printk("Queued up supplementary NMI\n"); */
+        /* mdelay(10); */
+        /* // iret from exception handler should re-enable NMIs */
+        run_in_exception_handler(do_nmi_reent_fault);
+        /* mdelay(10); */
+
+        return;
+
+    case 2:
+        printk("In inner reentrant case - stuff will probably break quickly\n");
+        *state_ptr = state = 0;
+    }
+}
+
+static void nmi_reentrant(unsigned char key)
+{
+    volatile unsigned int *state_ptr = &(this_cpu(reent_state));
+    unsigned int state;
+    s_time_t deadline;
+
+    printk("*** Trying to force reentrant NMI ***\n");
+
+    if ( *state_ptr != 0 )
+    {
+        printk("State not ready - waiting for up to 2 seconds\n");
+        deadline = (NOW() + SECONDS(2));
+        do
+        {
+            cpu_relax();
+            state = *state_ptr;
+        } while ( NOW() < deadline && state != 0 );
+
+        if ( state != 0 )
+        {
+            printk("Forcibly resetting state\n");
+            *state_ptr = 0;
+        }
+        else
+            printk("Waiting seemed to work\n");
+    }
+
+    // Set up trap for NMI handler
+    *state_ptr = 1;
+    deadline = (NOW() + SECONDS(1));
+    printk("Set up trigger for NMI handler and waiting for magic\n");
+    self_nmi();
+
+    do
+    {
+        cpu_relax();
+        state = *state_ptr;
+    } while ( NOW() < deadline && state != 0 );
+
+    if ( state != 0 )
+    {
+        printk("Forcibly resetting state\n");
+        *state_ptr = 0;
+    }
+    else
+        printk("Apparently AOK\n");
+
+    printk("*** Done reentrant test ***\n");
+}
+
+static struct keyhandler nmi_reentrant_keyhandler = {
+    .diagnostic = 0,
+    .u.fn = nmi_reentrant,
+    .desc = "Reentrant NMI test"
+};
+
 static __init int register_nmi_trigger(void)
 {
     register_keyhandler('N', &nmi_trigger_keyhandler);
     register_keyhandler('n', &nmi_stats_keyhandler);
+    register_keyhandler('1', &nmi_reentrant_keyhandler);
     return 0;
 }
 __initcall(register_nmi_trigger);

diff -r f6ad86b61d5a -r 303d94fa720c xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3183,6 +3183,7 @@ static int dummy_nmi_callback(struct cpu
 
 static nmi_callback_t nmi_callback = dummy_nmi_callback;
 
+extern void do_nmi_reentrant_handler(void);
 /* This function should never be called directly.  Use do_nmi() instead. */
 static void _do_nmi(struct cpu_user_regs *regs)
 {
@@ -3191,6 +3192,8 @@ static void _do_nmi(struct cpu_user_regs
 
     ++nmi_count(cpu);
 
+    do_nmi_reentrant_handler();
+
     if ( nmi_callback(regs, cpu) )
         return;
Andrew Cooper
2012-Dec-04 18:25 UTC
Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
On 04/12/12 18:16, Andrew Cooper wrote:
> Experimentally, certain crash kernels will triple fault very early after
> starting if started with NMIs disabled.
> [...]
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( ! this_cpu(crash_save_done) )
> +    {
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;

And I have already found a bug. The "atomic_dec(&waiting_for_crash_ipi);" from below should appear here. Without it, we just wait the full second in nmi_shootdown_cpus() and continue anyway.

~Andrew

> +    }
> [...]

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
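Concretely, the fix described above would move the decrement into the once-only block, along these lines (a sketch of the described fix, not a posted patch):

    /* do_nmi_crash(), with the missing decrement restored so that
     * nmi_shootdown_cpus() sees this pcpu check in, rather than always
     * timing out after its one-second wait. */
    if ( ! this_cpu(crash_save_done) )
    {
        kexec_crash_save_cpu();
        __stop_this_cpu();

        this_cpu(crash_save_done) = 1;
        atomic_dec(&waiting_for_crash_ipi); /* moved from the old callback */
    }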
Jan Beulich
2012-Dec-05 09:17 UTC
Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> [...]
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));

As in the description you talk about the LAPIC being (possibly) disabled here - did you check that this would not #GP in that case?

> [...]
> ENTRY(machine_check)
>         pushq $0
>         movl  $TRAP_machine_check,4(%rsp)
>         jmp   handle_ist_exception
>
> +/* No-op trap handler.  Required for the kexec path. */
> +ENTRY(trap_nop)

I'd prefer you to re-use an existing IRETQ (e.g. the one you add below) here - this is not performance critical, and hence there's no need for it to be aligned.

Jan

> +        iretq
> [...]
Jan Beulich
2012-Dec-05 09:21 UTC
Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> The (old) function do_nmi() is not reentrant-safe. Rename it to
> _do_nmi() and present a new do_nmi() which guards against reentrancy.
> [...]
> +    if ( is_panic_in_progress() )
> +    {
> +        /* A panic is already in progress.  It may have reenabled NMIs,
> +         * or we are simply unlucky to receive one right now.  Either
> +         * way, bail early, as Xen is about to die.
> +         *
> +         * TODO: Ideally we should exit without executing an iret, to
> +         * leave NMIs disabled, but that option is not currently
> +         * available to us.

You could easily provide the ground work for this here by having the function return a bool_t (even if not immediately consumed by the caller in this same patch).

Jan

> +         */
> +        return;
> +    }
> [...]
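For illustration, the ground work Jan suggests could look like the following sketch (hypothetical; the revised patch is not part of this thread). do_nmi() reports whether a normal iret exit is safe, even though the assembly caller does not yet consume the result:

    bool_t do_nmi(struct cpu_user_regs *regs)
    {
        bool_t *in_progress = &this_cpu(nmi_in_progress);

        if ( is_panic_in_progress() )
            return 0; /* caller should avoid iret, leaving NMIs latched */

        if ( test_and_set_bool(*in_progress) )
        {
            console_force_unlock();
            show_execution_state(regs);
            panic("Reentrant NMI detected\n"); /* does not return */
        }

        _do_nmi(regs);
        *in_progress = 0;

        return 1; /* normal exit; iret is safe */
    }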
Andrew Cooper
2012-Dec-05 10:05 UTC
Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
On 05/12/12 09:17, Jan Beulich wrote:
>>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> [...]
>> +    switch ( current_local_apic_mode() )
>> +    {
>> +        u32 apic_id;
>>
>> +    case APIC_MODE_X2APIC:
>> +        apic_id = apic_rdmsr(APIC_ID);
>>
>> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> As in the description you talk about the LAPIC being (possibly)
> disabled here - did you check that this would not #GP in that
> case?

Yes - we are switching on current_local_apic_mode(), which reads the APICBASE MSR to work out which mode we are in and returns an enum apic_mode. With this information, all the apic_*msr accesses are in the x2apic case, and all the apic_mem_* accesses are in the xapic case. If the apic_mode is disabled, then we skip trying to set up the next NMI.

The patch is sadly rather less legible than I would have hoped, but the code post-patch is far more logical to read.

>> [...]
>> +/* No-op trap handler.  Required for the kexec path. */
>> +ENTRY(trap_nop)
> I'd prefer you to re-use an existing IRETQ (e.g. the one you add
> below) here - this is not performance critical, and hence there's
> no need for it to be aligned.
>
> Jan

Certainly.

~Andrew

>> +        iretq
>> [...]

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Andrew Cooper
2012-Dec-05 10:47 UTC
Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
On 05/12/12 09:21, Jan Beulich wrote:
>>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> The (old) function do_nmi() is not reentrant-safe. Rename it to
>> _do_nmi() and present a new do_nmi() which guards against reentrancy.
>> [...]
>> +        /* A panic is already in progress.  It may have reenabled NMIs,
>> +         * or we are simply unlucky to receive one right now.  Either
>> +         * way, bail early, as Xen is about to die.
>> +         *
>> +         * TODO: Ideally we should exit without executing an iret, to
>> +         * leave NMIs disabled, but that option is not currently
>> +         * available to us.
> You could easily provide the ground work for this here by having
> the function return a bool_t (even if not immediately consumed by
> the caller in this same patch).
>
> Jan

Will do. I had considered a bool_t and was planning to integrate it later in development.

~Andrew

>> [...]

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Tim Deegan
2012-Dec-06 11:36 UTC
Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
> do_nmi_crash() will:
>
>  * Save the crash notes and shut the pcpu down.
>    There is now an extra per-cpu variable to prevent us from executing
>    this multiple times.  In the case where we reenter midway through,
>    attempt the whole operation again in preference to not completing
>    it in the first place.
>
>  * Set up another NMI at the LAPIC.
>    Even when the LAPIC has been disabled, the ID and command registers
>    are still usable.  As a result, we can deliberately queue up a new
>    NMI to re-interrupt us later if NMIs get unlatched.

Can you please include this reasoning in the code itself, as well as in the commit message?

> machine_kexec() will:
>
>  * Swap the MCE handlers to be a nop.
>    We cannot prevent MCEs from being delivered when we pass off to the
>    crash kernel, and the less Xen context is being touched the better.
>
>  * Explicitly enable NMIs.

Would it be cleaner to have this path explicitly set the IDT entry to invoke the noop handler? Or do we know this is always the case when we reach this point?

I'm just wondering about the case where we get here with an NMI pending. Presumably if we have some other source of NMIs active while we kexec, the post-exec kernel will crash if it overwrites our handlers &c before setting up its own IDT. But if kexec()ing with NMIs _disabled_ also fails then we haven't much choice. :|

Otherwise, this looks good to me.

Tim.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> [...]
Tim Deegan
2012-Dec-06 11:39 UTC
Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
At 18:16 +0000 on 04 Dec (1354644970), Andrew Cooper wrote:
> +/* This function is NOT SAFE to call from C code in general.
> + * Use with extreme care! */
> +void do_nmi(struct cpu_user_regs *regs)
> +{
> +    bool_t *in_progress = &this_cpu(nmi_in_progress);
> +
> +    if ( is_panic_in_progress() )
> +    {
> [...]
> +        return;
> +    }
> +
> +    if ( test_and_set_bool(*in_progress) )
> +    {
> [...]
> +        console_force_unlock();
> +        show_execution_state(regs);
> +        panic("Reentrant NMI detected\n");
> +    }
> +
> +    _do_nmi(regs);

I think you need a barrier() above and below _do_nmi() in case the compiler inlines _do_nmi() and then decides it can shuffle the writes to in_progress around.

Otherwise, this looks fine, and you can add my ack if you like.

> +    *in_progress = 0;
> +}
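Concretely, Tim's suggestion amounts to the following (a sketch, assuming Xen's barrier() compiler-barrier macro):

    /* Compiler barriers pin the ordering of the in_progress writes
     * relative to the handler body, even if _do_nmi() gets inlined. */
    barrier();
    _do_nmi(regs);
    barrier();
    *in_progress = 0;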
Andrew Cooper
2012-Dec-06 11:58 UTC
Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
On 06/12/12 11:36, Tim Deegan wrote:
> At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
>>  * Set up another NMI at the LAPIC.
>>    Even when the LAPIC has been disabled, the ID and command registers
>>    are still usable.  As a result, we can deliberately queue up a new
>>    NMI to re-interrupt us later if NMIs get unlatched.
> Can you please include this reasoning in the code itself, as well as in
> the commit message?

There is a comment alluding to this (the final sentence in the block beginning "Poor man's self_nmi()"), but I will reword it to make it clearer.

>> machine_kexec() will:
>> [...]
>>  * Explicitly enable NMIs.
> Would it be cleaner to have this path explicitly set the IDT entry to
> invoke the noop handler?  Or do we know this is always the case when we
> reach this point?

Early versions of this patch did change the NMI entry here, but that was before I decided that nmi_shootdown_cpus() needed redoing. As it currently stands, all pcpus other than us have the nmi_crash handler, and we have the nop handler, because the call graph looks like:

    kexec_crash()
        ...
        machine_crash_shutdown()
            nmi_shootdown_cpus()
        machine_kexec()

I will however clarify this NMI setup with a comment at this location.

> I'm just wondering about the case where we get here with an NMI pending.
> Presumably if we have some other source of NMIs active while we kexec,
> the post-exec kernel will crash if it overwrites our handlers &c before
> setting up its own IDT.  But if kexec()ing with NMIs _disabled_ also
> fails then we haven't much choice. :|
>
> Otherwise, this looks good to me.
>
> Tim.

We do have another source of NMIs active; the watchdog disable routine only tells the NMI handler to ignore watchdog NMIs, rather than disabling the source of the NMIs themselves. It is another item on my hitlist.

The other problem we have with the kexec mechanism is to do with taking NMIs and MCEs between purgatory changing addressing schemes and setting up its own IDT. I believe that compat_machine_kexec() does this in an unsafe way even before jumping to purgatory. I was wondering whether it would be possible to monopolise some very low pages which could be identity-mapped in any paging scheme, but I can't think of a nice way to fix a move into real mode, at which point the loaded IDT is completely invalid. Another item for the hitlist.

~Andrew

> [...]

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com