Here is V4 of the series.  It incorporates all previous feedback, and adds IDT entry helper functions to address the existing race conditions and IST issues.

 xen/arch/x86/hvm/svm/svm.c      |  12 ++--
 xen/arch/x86/x86_64/traps.c     |   6 +-
 xen/include/asm-x86/processor.h |  19 ++++-
 xen/arch/x86/crash.c            | 117 ++++++++++++++++++++++++++++++++++-----
 xen/arch/x86/machine_kexec.c    |  19 ++++++
 xen/arch/x86/x86_64/entry.S     |  34 +++++++++++
 xen/include/asm-x86/desc.h      |  42 ++++++++++++++
 xen/include/asm-x86/processor.h |   4 +
 8 files changed, 224 insertions(+), 29 deletions(-)

~Andrew
Andrew Cooper
2012-Dec-11 15:34 UTC
[PATCH 1 of 2 V4] x86/IST: Create set_ist() helper function
 xen/arch/x86/hvm/svm/svm.c      | 12 ++++++------
 xen/arch/x86/x86_64/traps.c     |  6 +++---
 xen/include/asm-x86/processor.h | 18 ++++++++++++++----
 3 files changed, 23 insertions(+), 13 deletions(-)

... to save using open-coded bitwise operations, and update all IST
manipulation sites to use the function.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--
Changes since v1:
 * Change ASSERT() to assert against IST_MAX, the maximum Xen used IST
   value, rather than the maximum possible IST value.
 * Removed second '& 7UL' as it would be optimised out in all acceptable
   use cases.

diff -r bc624b00d6d6 -r def9d03429f6 xen/arch/x86/hvm/svm/svm.c
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -869,9 +869,9 @@ static void svm_ctxt_switch_from(struct
     svm_vmload(per_cpu(root_vmcb, cpu));
 
     /* Resume use of ISTs now that the host TR is reinstated. */
-    idt_tables[cpu][TRAP_double_fault].a  |= IST_DF << 32;
-    idt_tables[cpu][TRAP_nmi].a           |= IST_NMI << 32;
-    idt_tables[cpu][TRAP_machine_check].a |= IST_MCE << 32;
+    set_ist(&idt_tables[cpu][TRAP_double_fault],  IST_DF);
+    set_ist(&idt_tables[cpu][TRAP_nmi],           IST_NMI);
+    set_ist(&idt_tables[cpu][TRAP_machine_check], IST_MCE);
 }
 
 static void svm_ctxt_switch_to(struct vcpu *v)
@@ -893,9 +893,9 @@ static void svm_ctxt_switch_to(struct vc
      * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
      * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
      */
-    idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
-    idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
-    idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);
+    set_ist(&idt_tables[cpu][TRAP_double_fault],  IST_NONE);
+    set_ist(&idt_tables[cpu][TRAP_nmi],           IST_NONE);
+    set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
 
     svm_restore_dr(v);
 
diff -r bc624b00d6d6 -r def9d03429f6 xen/arch/x86/x86_64/traps.c
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -370,9 +370,9 @@ void __devinit subarch_percpu_traps_init
     {
         /* Specify dedicated interrupt stacks for NMI, #DF, and #MC. */
         set_intr_gate(TRAP_double_fault, &double_fault);
-        idt_table[TRAP_double_fault].a  |= IST_DF << 32;
-        idt_table[TRAP_nmi].a           |= IST_NMI << 32;
-        idt_table[TRAP_machine_check].a |= IST_MCE << 32;
+        set_ist(&idt_table[TRAP_double_fault],  IST_DF);
+        set_ist(&idt_table[TRAP_nmi],           IST_NMI);
+        set_ist(&idt_table[TRAP_machine_check], IST_MCE);
 
         /*
          * The 32-on-64 hypercall entry vector is only accessible from ring 1.
diff -r bc624b00d6d6 -r def9d03429f6 xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -425,10 +425,20 @@ struct tss_struct {
     u8 __cacheline_filler[24];
 } __cacheline_aligned __attribute__((packed));
 
-#define IST_DF  1UL
-#define IST_NMI 2UL
-#define IST_MCE 3UL
-#define IST_MAX 3UL
+#define IST_NONE 0UL
+#define IST_DF   1UL
+#define IST_NMI  2UL
+#define IST_MCE  3UL
+#define IST_MAX  3UL
+
+/* Set the interrupt stack table used by a particular interrupt
+ * descriptor table entry. */
+static always_inline void set_ist(idt_entry_t * idt, unsigned long ist)
+{
+    /* ist is a 3 bit field, 32 bits into the idt entry. */
+    ASSERT( ist <= IST_MAX );
+    idt->a = ( idt->a & ~(7UL << 32) ) | ( ist << 32 );
+}
 
 #define IDT_ENTRIES 256
 extern idt_entry_t idt_table[];
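[For readers unfamiliar with the descriptor format: the arithmetic in set_ist() ("a 3 bit field, 32 bits into the idt entry") follows directly from the architectural layout of a long-mode IDT gate.  The sketch below is purely illustrative -- C bitfield ordering is compiler-specific and Xen does not define the entry this way -- it only annotates which bits live where in the two 64-bit words of idt_entry_t.]

#include <stdint.h>

/* Illustrative layout of one 16-byte long-mode IDT gate (Intel SDM vol 3 /
 * AMD APM).  Xen's idt_entry_t stores this as two u64s: "a" covers bits
 * 0-63 below, "b" covers bits 64-127. */
struct idt_gate_bits {
    uint64_t offset_15_0  : 16;  /* handler address bits 15:0                */
    uint64_t selector     : 16;  /* code segment selector                    */
    uint64_t ist          :  3;  /* IST index, 0 = no IST; bits 32-34, hence
                                  * the "<< 32" and "~(7UL << 32)" above     */
    uint64_t zero0        :  5;
    uint64_t type         :  4;  /* 14 = 64-bit interrupt gate               */
    uint64_t zero1        :  1;
    uint64_t dpl          :  2;  /* descriptor privilege level               */
    uint64_t present      :  1;  /* the "1UL << 47" in _set_gate()           */
    uint64_t offset_31_16 : 16;  /* handler address bits 31:16               */
    uint64_t offset_63_32 : 32;  /* handler address bits 63:32 (word "b")    */
    uint64_t reserved     : 32;
};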
Andrew Cooper
2012-Dec-11 15:34 UTC
[PATCH 2 of 2 V4] x86/kexec: Change NMI and MCE handling on kexec path
 xen/arch/x86/crash.c            | 117 ++++++++++++++++++++++++++++++++++-----
 xen/arch/x86/machine_kexec.c    |  19 ++++++
 xen/arch/x86/x86_64/entry.S     |  34 +++++++++++
 xen/include/asm-x86/desc.h      |  42 ++++++++++++++
 xen/include/asm-x86/processor.h |   4 +
 5 files changed, 201 insertions(+), 15 deletions(-)

Experimentally, certain crash kernels will triple fault very early after
starting if started with NMIs disabled.  This was discovered when
experimenting with a debug keyhandler which deliberately created a
reentrant NMI, causing stack corruption.

Because of this discovered bug, and because the future changes to NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.

This patch adds three new low level routines:
 * nmi_crash
    This is a special NMI handler for use during a kexec crash.
 * enable_nmis
    This function enables NMIs by executing an iret-to-self, to disengage
    the hardware NMI latch.
 * trap_nop
    This is a no-op handler which irets immediately.  It is actually in
    the middle of enable_nmis to reuse the iret instruction, without
    having a single lone aligned iret inflating the code size.

And adds three new IDT entry helper routines:
 * _write_gate_lower
    This is a substitute for using cmpxchg16b to update a 128bit
    structure at once.  It assumes that the top 64 bits are unchanged
    (and ASSERT()s the fact) and performs a regular write on the lower
    64 bits.
 * _set_gate_lower
    This is functionally equivalent to the already present _set_gate(),
    except it uses _write_gate_lower rather than updating both 64bit
    values.
 * _update_gate_addr_lower
    This is designed to update an IDT entry handler only, without
    altering any other settings in the entry.  It also uses
    _write_gate_lower.

The IDT entry helpers are required because:
 * It is unsafe to attempt a disable/update/re-enable cycle on the NMI
   or MCE IDT entries.
 * We need to be able to update NMI handlers without changing the IST
   entry.

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:
 * Disable the crashing cpu's NMI/MCE interrupt stack tables.
    Disabling the stack tables removes race conditions which would lead
    to corrupt exception frames and infinite loops.  As this pcpu is
    never planning to execute a sysret back to a pv vcpu, the update is
    safe from a security point of view.
 * Swap the NMI trap handlers.
    The crashing pcpu gets the nop handler, to prevent it getting stuck
    in an NMI context, causing a hang instead of a crash.  The
    non-crashing pcpus all get the nmi_crash handler, which is designed
    never to return.

do_nmi_crash() will:
 * Save the crash notes and shut the pcpu down.
    There is now an extra per-cpu variable to prevent us from executing
    this multiple times.  In the case where we reenter midway through,
    attempt the whole operation again in preference to not completing
    it in the first place.
 * Set up another NMI at the LAPIC.
    Even when the LAPIC has been disabled, the ID and command registers
    are still usable.  As a result, we can deliberately queue up a new
    NMI to re-interrupt us later if NMIs get unlatched.  Because of the
    call to __stop_this_cpu(), we have to hand craft self_nmi() to be
    safe from General Protection Faults.
 * Fall into an infinite loop.

machine_kexec() will:
 * Swap the MCE handlers to be a nop.
    We cannot prevent MCEs from being delivered when we pass off to the
    crash kernel, and the less Xen context is being touched the better.
 * Explicitly enable NMIs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--
Changes since v3:
 * Added IDT entry helpers to safely update NMI/MCE entries.

Changes since v2:
 * Rework the alteration of the stack tables to completely remove the
   possibility of a PV domain getting very lucky with the "NMI or MCE in
   a 1 instruction race window on sysret" and managing to execute code in
   the hypervisor context.
 * Make use of set_ist() from the previous patch in the series to avoid
   open-coding the IST manipulation.

Changes since v1:
 * Reintroduce atomic_dec(&waiting_for_crash_ipi); which got missed
   during the original refactoring.
 * Fold trap_nop into the middle of enable_nmis to reuse the iret.
 * Expand comments in areas as per Tim's suggestions.

diff -r def9d03429f6 -r 528bd4030389 xen/arch/x86/crash.c
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,128 @@
 static atomic_t waiting_for_crash_ipi;
 static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
 
-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
 {
-    /* Don't do anything if this handler is invoked on crashing cpu.
-     * Otherwise, system will completely hang. Crashing cpu can get
-     * an NMI if system was initially booted with nmi_watchdog parameter.
+    int cpu = smp_processor_id();
+
+    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+    ASSERT( cpu != crashing_cpu );
+
+    /* Save crash information and shut down CPU.  Attempt only once. */
+    if ( ! this_cpu(crash_save_done) )
+    {
+        /* Disable the interrupt stack table for the MCE handler.  This
+         * prevents race conditions between clearing MCIP and receving a
+         * new MCE, during which the exception frame would be clobbered
+         * and the MCE handler fall into an infinite loop.  We are soon
+         * going to disable the NMI watchdog, so the loop would not be
+         * caught.
+         *
+         * We do not need to change the NMI IST, as the nmi_crash
+         * handler is immue to corrupt exception frames, by virtue of
+         * being designed never to return.
+         *
+         * This update is safe from a security point of view, as this
+         * pcpu is never going to try to sysret back to a PV vcpu.
+         */
+        set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+
+        kexec_crash_save_cpu();
+        __stop_this_cpu();
+
+        this_cpu(crash_save_done) = 1;
+        atomic_dec(&waiting_for_crash_ipi);
+    }
+
+    /* Poor mans self_nmi().  __stop_this_cpu() has reverted the LAPIC
+     * back to its boot state, so we are unable to rely on the regular
+     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+     * (The likely scenario is that we have reverted from x2apic mode to
+     * xapic, at which point #GPFs will occur if we use the apic_*
+     * functions)
+     *
+     * The ICR and APIC ID of the LAPIC are still valid even during
+     * software disable (Intel SDM Vol 3, 10.4.7.2).  As a result, we
+     * can deliberately queue up another NMI at the LAPIC which will not
+     * be delivered as the hardware NMI latch is currently in effect.
+     * This means that if NMIs become unlatched (e.g. following a
+     * non-fatal MCE), the LAPIC will force us back here rather than
+     * wandering back into regular Xen code.
      */
-    if ( cpu == crashing_cpu )
-        return 1;
-    local_irq_disable();
+    switch ( current_local_apic_mode() )
+    {
+        u32 apic_id;
 
-    kexec_crash_save_cpu();
+    case APIC_MODE_X2APIC:
+        apic_id = apic_rdmsr(APIC_ID);
 
-    __stop_this_cpu();
+        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
+        break;
 
-    atomic_dec(&waiting_for_crash_ipi);
+    case APIC_MODE_XAPIC:
+        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
+
+        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+            cpu_relax();
+
+        apic_mem_write(APIC_ICR2, apic_id << 24);
+        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+        break;
+
+    default:
+        break;
+    }
 
     for ( ; ; )
         halt();
-
-    return 1;
 }
 
 static void nmi_shootdown_cpus(void)
 {
     unsigned long msecs;
+    int i, cpu = smp_processor_id();
 
     local_irq_disable();
 
-    crashing_cpu = smp_processor_id();
+    crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;
 
     atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
-    /* Would it be better to replace the trap vector here? */
-    set_nmi_callback(crash_nmi_callback);
+
+    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
+     * invokes do_nmi_crash (above), which cause them to write state and
+     * fall into a loop.  The crashing pcpu gets the nop handler to
+     * cause it to return to this function ASAP.
+     */
+    for ( i = 0; i < nr_cpu_ids; ++i )
+        if ( idt_tables[i] )
+        {
+
+            if ( i == cpu )
+            {
+                /* Disable the interrupt stack tables for this MCE and
+                 * NMI handler (shortly to become a nop) as there is a 1
+                 * instruction race window where NMIs could be
+                 * re-enabled and corrupt the exception frame, leaving
+                 * us unable to continue on this crash path (which half
+                 * defeats the point of using the nop handler in the
+                 * first place).
+                 *
+                 * This update is safe from a security point of view, as
+                 * this pcpu is never going to try to sysret back to a
+                 * PV vcpu.
                 */
+                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
+            }
+            else
+                /* Do not update stack table for other pcpus. */
+                _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
+        }
+
     /* Ensure the new callback function is set before sending out the NMI. */
     wmb();
 
diff -r def9d03429f6 -r 528bd4030389 xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -81,12 +81,31 @@ void machine_kexec(xen_kexec_image_t *im
         .base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
         .limit = LAST_RESERVED_GDT_BYTE
     };
+    int i;
 
    /* We are about to permenantly jump out of the Xen context into the kexec
     * purgatory code.  We really dont want to be still servicing interupts.
     */
    local_irq_disable();
 
+    /* Now regular interrupts are disabled, we need to reduce the impact
+     * of interrupts not disabled by 'cli'.
+     *
+     * The NMI handlers have already been set up nmi_shootdown_cpus().  All
+     * pcpus other than us have the nmi_crash handler, while we have the nop
+     * handler.
+     *
+     * The MCE handlers touch extensive areas of Xen code and data.  At this
+     * point, there is nothing we can usefully do, so set the nop handler.
+     */
+    for ( i = 0; i < nr_cpu_ids; ++i )
+        if ( idt_tables[i] )
+            _update_gate_addr_lower(&idt_tables[i][TRAP_machine_check], &trap_nop);
+
+    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
+     * not like running with NMIs disabled.
+     */
+    enable_nmis();
+
    /*
     * compat_machine_kexec() returns to idle pagetables, which requires us
     * to be running on a static GDT mapping (idle pagetables have no GDT
diff -r def9d03429f6 -r 528bd4030389 xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -635,11 +635,45 @@ ENTRY(nmi)
        movl  $TRAP_nmi,4(%rsp)
        jmp   handle_ist_exception
 
+ENTRY(nmi_crash)
+        cli
+        pushq $0
+        movl  $TRAP_nmi,4(%rsp)
+        SAVE_ALL
+        movq  %rsp,%rdi
+        callq do_nmi_crash /* Does not return */
+        ud2
+
 ENTRY(machine_check)
        pushq $0
        movl  $TRAP_machine_check,4(%rsp)
        jmp   handle_ist_exception
 
+/* Enable NMIs.  No special register assumptions. Only %rax is not preserved. */
+ENTRY(enable_nmis)
+        movq  %rsp, %rax /* Grab RSP before pushing */
+
+        /* Set up stack frame */
+        pushq $0               /* SS */
+        pushq %rax             /* RSP */
+        pushfq                 /* RFLAGS */
+        pushq $__HYPERVISOR_CS /* CS */
+        leaq  1f(%rip),%rax
+        pushq %rax             /* RIP */
+
+        iretq /* Disable the hardware NMI latch */
+1:
+        retq
+
+/* No op trap handler.  Required for kexec crash path.  This is not
+ * declared with the ENTRY() macro to avoid wasted alignment space.
+ */
+.globl trap_nop
+trap_nop:
+        iretq
+
+
 .section .rodata, "a", @progbits
 
 ENTRY(exception_table)
diff -r def9d03429f6 -r 528bd4030389 xen/include/asm-x86/desc.h
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -106,6 +106,19 @@ typedef struct {
     u64 a, b;
 } idt_entry_t;
 
+/* Write the lower 64 bits of an IDT Entry.  This relies on the upper 32
+ * bits of the address not changing, which is a safe aumption until our
+ * code size exceeds 4GB.
+ *
+ * Ideally, we would use cmpxchg16b, but this is not supported on some
+ * old AMD 64bit capable processors, and has no safe equivelent.
+ */
+static inline void _write_gate_lower(idt_entry_t * gate, idt_entry_t * new)
+{
+    ASSERT(gate->b == new->b);
+    *(volatile unsigned long *)&gate->a = new->a;
+}
+
 #define _set_gate(gate_addr,type,dpl,addr)               \
 do {                                                     \
     (gate_addr)->a = 0;                                  \
@@ -122,6 +135,35 @@ do {
          (1UL << 47);                                    \
 } while (0)
 
+#define _set_gate_lower(gate_addr,type,dpl,addr)         \
+    do {                                                 \
+    idt_entry_t idte;                                    \
+    idte.b = (gate_addr)->b;                             \
+    idte.a =                                             \
+        (((unsigned long)(addr) & 0xFFFF0000UL) << 32) | \
+        ((unsigned long)(dpl) << 45) |                   \
+        ((unsigned long)(type) << 40) |                  \
+        ((unsigned long)(addr) & 0xFFFFUL) |             \
+        ((unsigned long)__HYPERVISOR_CS64 << 16) |       \
+        (1UL << 47);                                     \
+    _write_gate_lower((gate_addr), &idte);               \
+} while (0)
+
+/* Update the lower half handler of an IDT Entry, without changing any
+ * other configuration. */
+static inline void _update_gate_addr_lower(idt_entry_t * gate, void * addr)
+{
+    idt_entry_t idte;
+    idte.a = gate->a;
+
+    idte.b = ((unsigned long)(addr) >> 32);
+    idte.a &= 0x0000FFFFFFFF0000ULL;
+    idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
+        ((unsigned long)(addr) & 0xFFFFUL);
+
+    _write_gate_lower(gate, &idte);
+}
+
 #define _set_tssldt_desc(desc,addr,limit,type)           \
 do {                                                     \
     (desc)[0].b = (desc)[1].b = 0;                       \
diff -r def9d03429f6 -r 528bd4030389 xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -527,6 +527,7 @@ void do_ ## _name(struct cpu_user_regs *
 DECLARE_TRAP_HANDLER(divide_error);
 DECLARE_TRAP_HANDLER(debug);
 DECLARE_TRAP_HANDLER(nmi);
+DECLARE_TRAP_HANDLER(nmi_crash);
 DECLARE_TRAP_HANDLER(int3);
 DECLARE_TRAP_HANDLER(overflow);
 DECLARE_TRAP_HANDLER(bounds);
@@ -545,6 +546,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
 DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
 #undef DECLARE_TRAP_HANDLER
 
+void trap_nop(void);
+void enable_nmis(void);
+
 void syscall_enter(void);
 void sysenter_entry(void);
 void sysenter_eflags_saved(void);
Jan Beulich
2012-Dec-11 15:58 UTC
Re: [PATCH 2 of 2 V4] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 11.12.12 at 16:34, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> -    /* Would it be better to replace the trap vector here? */
> -    set_nmi_callback(crash_nmi_callback);
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
> +     * invokes do_nmi_crash (above), which cause them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +
> +            if ( i == cpu )
> +            {
> +                /* Disable the interrupt stack tables for this MCE and
> +                 * NMI handler (shortly to become a nop) as there is a 1
> +                 * instruction race window where NMIs could be
> +                 * re-enabled and corrupt the exception frame, leaving
> +                 * us unable to continue on this crash path (which half
> +                 * defeats the point of using the nop handler in the
> +                 * first place).
> +                 *
> +                 * This update is safe from a security point of view, as
> +                 * this pcpu is never going to try to sysret back to a
> +                 * PV vcpu.
> +                 */

This comment appears to have become stale with the latest
changes.

> +                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);

No need for the extra & on functions and arrays.

> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
> +            }
> +            else
> +                /* Do not update stack table for other pcpus. */
> +                _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
> +        }
> +
>...
> +/* Write the lower 64 bits of an IDT Entry.  This relies on the upper 32
> + * bits of the address not changing, which is a safe aumption until our

assumption

> + * code size exceeds 4GB.

1Gb.

> + *
> + * Ideally, we would use cmpxchg16b, but this is not supported on some
> + * old AMD 64bit capable processors, and has no safe equivelent.

equivalent

> + */
> +static inline void _write_gate_lower(idt_entry_t * gate, idt_entry_t * new)

static inline void _write_gate_lower(idt_entry_t *gate, idt_entry_t *new)

(similar extra blanks elsewhere)

Also, to make clear which of the two is the entry written, const-
qualifying the other one might be a good idea.

> +{
> +    ASSERT(gate->b == new->b);
> +    *(volatile unsigned long *)&gate->a = new->a;

volatile? And if so, why not volatile-qualify the function parameter?

> +}
> +
>  #define _set_gate(gate_addr,type,dpl,addr)               \
>  do {                                                     \
>      (gate_addr)->a = 0;                                  \
> @@ -122,6 +135,35 @@ do {
>           (1UL << 47);                                    \
>  } while (0)
>
> +#define _set_gate_lower(gate_addr,type,dpl,addr)         \
> +    do {                                                 \
> +    idt_entry_t idte;                                    \
> +    idte.b = (gate_addr)->b;                             \
> +    idte.a =                                             \
> +        (((unsigned long)(addr) & 0xFFFF0000UL) << 32) | \
> +        ((unsigned long)(dpl) << 45) |                   \
> +        ((unsigned long)(type) << 40) |                  \
> +        ((unsigned long)(addr) & 0xFFFFUL) |             \
> +        ((unsigned long)__HYPERVISOR_CS64 << 16) |       \
> +        (1UL << 47);                                     \
> +    _write_gate_lower((gate_addr), &idte);               \

No need for extra inner parentheses.

> +} while (0)
> +
> +/* Update the lower half handler of an IDT Entry, without changing any
> + * other configuration. */
> +static inline void _update_gate_addr_lower(idt_entry_t * gate, void * addr)

Any reason for this being an inline function and the other being
a macro?

Jan

> +{
> +    idt_entry_t idte;
> +    idte.a = gate->a;
> +
> +    idte.b = ((unsigned long)(addr) >> 32);
> +    idte.a &= 0x0000FFFFFFFF0000ULL;
> +    idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> +        ((unsigned long)(addr) & 0xFFFFUL);
> +
> +    _write_gate_lower(gate, &idte);
> +}
> +
>  #define _set_tssldt_desc(desc,addr,limit,type)           \
>  do {                                                     \
>      (desc)[0].b = (desc)[1].b = 0;                       \
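[Taken together, these notes suggest a signature along the following lines.  This is only a sketch of the suggested style -- const on the source entry, volatile on the parameter rather than via a cast, no blank after the '*' -- and not necessarily the form that was eventually committed:]

/* Sketch only: _write_gate_lower() with the review suggestions applied. */
static inline void _write_gate_lower(volatile idt_entry_t *gate,
                                     const idt_entry_t *new)
{
    ASSERT(gate->b == new->b);
    gate->a = new->a;   /* single 64-bit store; gate is volatile-qualified */
}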
Andrew Cooper
2012-Dec-11 17:24 UTC
Re: [PATCH 2 of 2 V4] x86/kexec: Change NMI and MCE handling on kexec path
On 11/12/12 15:58, Jan Beulich wrote:
>>>> On 11.12.12 at 16:34, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> -    /* Would it be better to replace the trap vector here? */
>> -    set_nmi_callback(crash_nmi_callback);
>> +
>> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
>> +     * invokes do_nmi_crash (above), which cause them to write state and
>> +     * fall into a loop.  The crashing pcpu gets the nop handler to
>> +     * cause it to return to this function ASAP.
>> +     */
>> +    for ( i = 0; i < nr_cpu_ids; ++i )
>> +        if ( idt_tables[i] )
>> +        {
>> +
>> +            if ( i == cpu )
>> +            {
>> +                /* Disable the interrupt stack tables for this MCE and
>> +                 * NMI handler (shortly to become a nop) as there is a 1
>> +                 * instruction race window where NMIs could be
>> +                 * re-enabled and corrupt the exception frame, leaving
>> +                 * us unable to continue on this crash path (which half
>> +                 * defeats the point of using the nop handler in the
>> +                 * first place).
>> +                 *
>> +                 * This update is safe from a security point of view, as
>> +                 * this pcpu is never going to try to sysret back to a
>> +                 * PV vcpu.
>> +                 */
> This comment appears to have become stale with the latest
> changes.

Ok

>
>> +                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> No need for the extra & on functions and arrays.

I was continuing the prevailing style from traps.c.  Personally, I prefer
the & notation for function pointers, to be consistent with regular
pointers, even though I am aware that it is not strictly needed.  I am
not fussed if you wish to insist on one style, but we do have mixed
styles across the codebase.

>
>> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
>> +            }
>> +            else
>> +                /* Do not update stack table for other pcpus. */
>> +                _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
>> +        }
>> +
>> ...
>> +/* Write the lower 64 bits of an IDT Entry.  This relies on the upper 32
>> + * bits of the address not changing, which is a safe aumption until our
> assumption
>
>> + * code size exceeds 4GB.
> 1Gb.
>
>> + *
>> + * Ideally, we would use cmpxchg16b, but this is not supported on some
>> + * old AMD 64bit capable processors, and has no safe equivelent.
> equivalent
>
>> + */
>> +static inline void _write_gate_lower(idt_entry_t * gate, idt_entry_t * new)
> static inline void _write_gate_lower(idt_entry_t *gate, idt_entry_t *new)
>
> (similar extra blanks elsewhere)
>
> Also, to make clear which of the two is the entry written, const-
> qualifying the other one might be a good idea.

Ok

>
>> +{
>> +    ASSERT(gate->b == new->b);
>> +    *(volatile unsigned long *)&gate->a = new->a;
> volatile? And if so, why not volatile-qualify the function parameter?

I was looking to avoid having the compiler inline this function and
decide that it can merge *gate_addr and idte together, resulting in
multiple writes to gate_addr.  Without the volatile, the compiler is
free to make this optimization, which puts us back in the racy case we
are trying to avoid.

The reason for avoiding the volatile function parameter is so the
assertion equality can be optimized where possible.

>
>> +}
>> +
>>  #define _set_gate(gate_addr,type,dpl,addr)               \
>>  do {                                                     \
>>      (gate_addr)->a = 0;                                  \
>> @@ -122,6 +135,35 @@ do {
>>           (1UL << 47);                                    \
>>  } while (0)
>>
>> +#define _set_gate_lower(gate_addr,type,dpl,addr)         \
>> +    do {                                                 \
>> +    idt_entry_t idte;                                    \
>> +    idte.b = (gate_addr)->b;                             \
>> +    idte.a =                                             \
>> +        (((unsigned long)(addr) & 0xFFFF0000UL) << 32) | \
>> +        ((unsigned long)(dpl) << 45) |                   \
>> +        ((unsigned long)(type) << 40) |                  \
>> +        ((unsigned long)(addr) & 0xFFFFUL) |             \
>> +        ((unsigned long)__HYPERVISOR_CS64 << 16) |       \
>> +        (1UL << 47);                                     \
>> +    _write_gate_lower((gate_addr), &idte);               \
> No need for extra inner parentheses.
>
>> +} while (0)
>> +
>> +/* Update the lower half handler of an IDT Entry, without changing any
>> + * other configuration. */
>> +static inline void _update_gate_addr_lower(idt_entry_t * gate, void * addr)
> Any reason for this being an inline function and the other being
> a macro?
>
> Jan

Where possible, I prefer static inline in preference to macros because
of the added type checking etc.

_set_gate_lower is based on _set_gate, so I used the _set_gate style.

Again, I am not overly fussed if you have a strong preference for
style.  (Both of these styles are mixed across the codebase and have no
indication of preference in CODING_STYLE)

~Andrew

>
>> +{
>> +    idt_entry_t idte;
>> +    idte.a = gate->a;
>> +
>> +    idte.b = ((unsigned long)(addr) >> 32);
>> +    idte.a &= 0x0000FFFFFFFF0000ULL;
>> +    idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
>> +        ((unsigned long)(addr) & 0xFFFFUL);
>> +
>> +    _write_gate_lower(gate, &idte);
>> +}
>> +
>>  #define _set_tssldt_desc(desc,addr,limit,type)           \
>>  do {                                                     \
>>      (desc)[0].b = (desc)[1].b = 0;                       \
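[The hazard being discussed can be illustrated outside of Xen.  The sketch below uses hypothetical names and a stand-in struct, not Xen code: once the writer helper is inlined into its caller, the compiler is in principle free to drop the local copy and apply the mask/merge steps directly to the live descriptor, unless the final store goes through a volatile lvalue.]

typedef struct { unsigned long a, b; } gate_t;   /* stand-in for idt_entry_t */

/* Plain store: after inlining, nothing stops the compiler from rewriting the
 * caller below into "live->a &= mask; live->a |= bits;", i.e. the live entry
 * may transiently hold a half-updated value that an NMI/MCE could consume. */
static inline void write_lower_plain(gate_t *live, const gate_t *src)
{
    live->a = src->a;
}

/* Volatile store: the update must remain a single 64-bit write of the fully
 * formed value, so an asynchronous observer sees either the old or the new
 * lower word, never a mixture. */
static inline void write_lower_volatile(gate_t *live, const gate_t *src)
{
    *(volatile unsigned long *)&live->a = src->a;
}

static void update_handler_bits(gate_t *live, unsigned long addr_bits)
{
    gate_t tmp = *live;

    tmp.a &= 0x0000FFFFFFFF0000UL;   /* keep selector/type/DPL/P/IST */
    tmp.a |= addr_bits;              /* splice in the new handler address */

    write_lower_volatile(live, &tmp);
}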
Jan Beulich
2012-Dec-12 08:05 UTC
Re: [PATCH 2 of 2 V4] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 11.12.12 at 18:24, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 11/12/12 15:58, Jan Beulich wrote:
>>>>> On 11.12.12 at 16:34, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> -    /* Would it be better to replace the trap vector here? */
>>> -    set_nmi_callback(crash_nmi_callback);
>>> +
>>> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
>>> +     * invokes do_nmi_crash (above), which cause them to write state and
>>> +     * fall into a loop.  The crashing pcpu gets the nop handler to
>>> +     * cause it to return to this function ASAP.
>>> +     */
>>> +    for ( i = 0; i < nr_cpu_ids; ++i )
>>> +        if ( idt_tables[i] )
>>> +        {
>>> +
>>> +            if ( i == cpu )
>>> +            {
>>> +                /* Disable the interrupt stack tables for this MCE and
>>> +                 * NMI handler (shortly to become a nop) as there is a 1
>>> +                 * instruction race window where NMIs could be
>>> +                 * re-enabled and corrupt the exception frame, leaving
>>> +                 * us unable to continue on this crash path (which half
>>> +                 * defeats the point of using the nop handler in the
>>> +                 * first place).
>>> +                 *
>>> +                 * This update is safe from a security point of view, as
>>> +                 * this pcpu is never going to try to sysret back to a
>>> +                 * PV vcpu.
>>> +                 */
>> This comment appears to have become stale with the latest
>> changes.
>
> Ok
>
>>
>>> +                _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
>> No need for the extra & on functions and arrays.
>
> I was continuing the prevailing style from traps.c.  Personally, I prefer
> the & notation for function pointers, to be consistent with regular
> pointers, even though I am aware that it is not strictly needed.  I am
> not fussed if you wish to insist on one style, but we do have mixed
> styles across the codebase.

I'll leave that to your preference then.

>>> +{
>>> +    ASSERT(gate->b == new->b);
>>> +    *(volatile unsigned long *)&gate->a = new->a;
>> volatile? And if so, why not volatile-qualify the function parameter?
>
> I was looking to avoid having the compiler inline this function and
> decide that it can merge *gate_addr and idte together, resulting in
> multiple writes to gate_addr.  Without the volatile, the compiler is
> free to make this optimization, which puts us back in the racy case we
> are trying to avoid.  The reason for avoiding the volatile function
> parameter is so the assertion equality can be optimized where possible.

Optimizing ASSERT() expressions is pointless.  If you really want
volatile here, put it on the parameter.  Casts, as having some risk
associated with them, ought to be avoided wherever possible.

>>> +} while (0)
>>> +
>>> +/* Update the lower half handler of an IDT Entry, without changing any
>>> + * other configuration. */
>>> +static inline void _update_gate_addr_lower(idt_entry_t * gate, void * addr)
>> Any reason for this being an inline function and the other being
>> a macro?
>>
>> Jan
>
> Where possible, I prefer static inline in preference to macros because
> of the added type checking etc.
>
> _set_gate_lower is based on _set_gate, so I used the _set_gate style.
>
> Again, I am not overly fussed if you have a strong preference for
> style.  (Both of these styles are mixed across the codebase and have no
> indication of preference in CODING_STYLE)

Generally, when a parameter is referenced more than once, or when local
variables are needed, we should prefer inline functions over macros.
With the exception of this causing dependency problems, of course.

Jan
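[For completeness, the usual argument behind this guideline is double evaluation and the absence of type checking in macros.  A toy example, not taken from the patch:]

#define SET_LOW_BYTE(x, v) ((x) = ((x) & ~0xffUL) | ((v) & 0xff))

static inline void set_low_byte(unsigned long *x, unsigned long v)
{
    *x = (*x & ~0xffUL) | (v & 0xff);
}

/* The macro evaluates its first argument twice, so
 *
 *     SET_LOW_BYTE(table[idx++], 0x20);   // idx is incremented twice
 *
 * silently misbehaves, and nothing checks the argument types.  The inline
 * function evaluates each argument exactly once, is type-checked, and costs
 * nothing extra once inlined. */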