Hello, this is V3 of this patch series, following detailed code review from three colleagues. Patch 1 implements a helper function for working with IST entries, to avoid open-coded bitwise manipulation, while the major difference in patch 2 is a reworking of the IST alterations to remove the (very remote) possibility of re-enabling the sysret context-switch security vulnerability. ~Andrew
Andrew Cooper
2012-Dec-06 21:42 UTC
[PATCH 1 of 2 V3] x86/IST: Create set_ist() helper function
... to save using open-coded bitwise operations, and update all IST
manipulation sites to use the function.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--

I am not overly happy with the name set_ist(), and certainly not tied to
it.  However, I am unable to think of a better name.  set_idt_ist() is
wrong, as is set_irq_ist(), while set_idt_entry_ist() just seems too
kludgy.  The comment and parameter types do explicitly state what is
expected to be passed, but suggestions welcome for a better name.

diff -r bc624b00d6d6 -r 43f86afe90be xen/arch/x86/hvm/svm/svm.c
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -869,9 +869,9 @@ static void svm_ctxt_switch_from(struct
     svm_vmload(per_cpu(root_vmcb, cpu));

     /* Resume use of ISTs now that the host TR is reinstated. */
-    idt_tables[cpu][TRAP_double_fault].a  |= IST_DF << 32;
-    idt_tables[cpu][TRAP_nmi].a           |= IST_NMI << 32;
-    idt_tables[cpu][TRAP_machine_check].a |= IST_MCE << 32;
+    set_ist(&idt_tables[cpu][TRAP_double_fault], IST_DF);
+    set_ist(&idt_tables[cpu][TRAP_nmi], IST_NMI);
+    set_ist(&idt_tables[cpu][TRAP_machine_check], IST_MCE);
 }

 static void svm_ctxt_switch_to(struct vcpu *v)
@@ -893,9 +893,9 @@ static void svm_ctxt_switch_to(struct vc
      * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
      * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
      */
-    idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
-    idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
-    idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);
+    set_ist(&idt_tables[cpu][TRAP_double_fault], IST_NONE);
+    set_ist(&idt_tables[cpu][TRAP_nmi], IST_NONE);
+    set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);

     svm_restore_dr(v);

diff -r bc624b00d6d6 -r 43f86afe90be xen/arch/x86/x86_64/traps.c
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -370,9 +370,9 @@ void __devinit subarch_percpu_traps_init
     {
         /* Specify dedicated interrupt stacks for NMI, #DF, and #MC. */
         set_intr_gate(TRAP_double_fault, &double_fault);
-        idt_table[TRAP_double_fault].a  |= IST_DF << 32;
-        idt_table[TRAP_nmi].a           |= IST_NMI << 32;
-        idt_table[TRAP_machine_check].a |= IST_MCE << 32;
+        set_ist(&idt_table[TRAP_double_fault], IST_DF);
+        set_ist(&idt_table[TRAP_nmi], IST_NMI);
+        set_ist(&idt_table[TRAP_machine_check], IST_MCE);

         /*
          * The 32-on-64 hypercall entry vector is only accessible from ring 1.
diff -r bc624b00d6d6 -r 43f86afe90be xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -425,10 +425,20 @@ struct tss_struct {
     u8 __cacheline_filler[24];
 } __cacheline_aligned __attribute__((packed));

-#define IST_DF   1UL
-#define IST_NMI  2UL
-#define IST_MCE  3UL
-#define IST_MAX  3UL
+#define IST_NONE 0UL
+#define IST_DF   1UL
+#define IST_NMI  2UL
+#define IST_MCE  3UL
+#define IST_MAX  3UL
+
+/* Set the interrupt stack table used by a particular interrupt
+ * descriptor table entry. */
+static always_inline void set_ist(idt_entry_t * idt, unsigned long ist)
+{
+    /* ist is a 3 bit field, 32 bits into the idt entry. */
+    ASSERT( ist < 8 );
+    idt->a = ( idt->a & ~(7UL << 32) ) | ( (ist & 7UL) << 32 );
+}

 #define IDT_ENTRIES 256
 extern idt_entry_t idt_table[];
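For readers following the bit arithmetic: in a 64-bit interrupt gate, the IST index occupies bits 32-34 of the descriptor's low quadword (Xen's idt->a), which is where the 7UL << 32 mask comes from. Below is a minimal sketch of the gate layout per the Intel SDM; the struct and field names are hypothetical, not Xen's actual idt_entry_t definition.

#include <stdint.h>

/* Illustrative layout of one 64-bit IDT gate descriptor (16 bytes).
 * Field positions follow the Intel SDM; names are hypothetical. */
struct idt_gate64 {
    uint16_t offset_low;   /* handler address bits 0-15                 */
    uint16_t selector;     /* code segment selector                     */
    uint16_t ist:3;        /* bits 32-34 of the low quadword: IST index */
    uint16_t zero0:5;      /* reserved, must be zero                    */
    uint16_t type:4;       /* 0xE = 64-bit interrupt gate               */
    uint16_t zero1:1;      /* reserved, must be zero                    */
    uint16_t dpl:2;        /* descriptor privilege level                */
    uint16_t present:1;    /* segment present flag                      */
    uint16_t offset_mid;   /* handler address bits 16-31                */
    uint32_t offset_high;  /* handler address bits 32-63                */
    uint32_t reserved;
};

With the descriptor viewed as two 64-bit words, clearing the IST field is exactly idt->a &= ~(7UL << 32), as in the open-coded sites the patch replaces.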
Andrew Cooper
2012-Dec-06 21:42 UTC
[PATCH 2 of 2 V3] x86/kexec: Change NMI and MCE handling on kexec path
Experimentally, certain crash kernels will triple fault very early after
starting if started with NMIs disabled.  This was discovered when
experimenting with a debug keyhandler which deliberately created a
reentrant NMI, causing stack corruption.

Because of this discovered bug, and because future changes to the NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.

This patch adds three new low-level routines:
 * nmi_crash
   This is a special NMI handler for use during a kexec crash.
 * enable_nmis
   This function enables NMIs by executing an iret-to-self, to disengage
   the hardware NMI latch.
 * trap_nop
   This is a no-op handler which irets immediately.  It is actually in
   the middle of enable_nmis to reuse the iret instruction, without
   having a single lone aligned iret inflating the code size.

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:
 * Disable the crashing cpu's NMI/MCE interrupt stack tables.
   Disabling the stack tables removes race conditions which would lead
   to corrupt exception frames and infinite loops.  As this pcpu is
   never planning to execute a sysret back to a pv vcpu, the update is
   safe from a security point of view.
 * Swap the NMI trap handlers.
   The crashing pcpu gets the nop handler, to prevent it getting stuck
   in an NMI context, causing a hang instead of a crash.  The
   non-crashing pcpus all get the nmi_crash handler, which is designed
   never to return.

do_nmi_crash() will:
 * Save the crash notes and shut the pcpu down.
   There is now an extra per-cpu variable to prevent us from executing
   this multiple times.  In the case where we reenter midway through,
   attempt the whole operation again in preference to not completing it
   in the first place.
 * Set up another NMI at the LAPIC.
   Even when the LAPIC has been disabled, the ID and command registers
   are still usable.  As a result, we can deliberately queue up a new
   NMI to re-interrupt us later if NMIs get unlatched.  Because of the
   call to __stop_this_cpu(), we have to hand-craft self_nmi() to be
   safe from General Protection Faults.
 * Fall into an infinite loop.

machine_kexec() will:
 * Swap the MCE handlers to be a nop.
   We cannot prevent MCEs from being delivered when we pass off to the
   crash kernel, and the less Xen context is being touched the better.
 * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--

Changes since v2:
 * Rework the alteration of the stack tables to completely remove the
   possibility of a PV domain getting very lucky with the "NMI or MCE in
   a 1 instruction race window on sysret" and managing to execute code
   in the hypervisor context.
 * Make use of set_ist() from the previous patch in the series to avoid
   open-coding the IST manipulation.

Changes since v1:
 * Reintroduce atomic_dec(&waiting_for_crash_ipi); which got missed
   during the original refactoring.
 * Fold trap_nop into the middle of enable_nmis to reuse the iret.
 * Expand comments in areas as per Tim's suggestions.

diff -r 43f86afe90be -r 3757511a7852 xen/arch/x86/crash.c
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,129 @@

 static atomic_t waiting_for_crash_ipi;
 static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);

-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
 {
-    /* Don't do anything if this handler is invoked on crashing cpu.
-     * Otherwise, system will completely hang. Crashing cpu can get
-     * an NMI if system was initially booted with nmi_watchdog parameter.
+    int cpu = smp_processor_id();
+
+    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+    ASSERT( cpu != crashing_cpu );
+
+    /* Save crash information and shut down CPU.  Attempt only once. */
+    if ( ! this_cpu(crash_save_done) )
+    {
+        /* Disable the interrupt stack table for the MCE handler.  This
+         * prevents race conditions between clearing MCIP and receiving a
+         * new MCE, during which the exception frame would be clobbered
+         * and the MCE handler fall into an infinite loop.  We are soon
+         * going to disable the NMI watchdog, so the loop would not be
+         * caught.
+         *
+         * We do not need to change the NMI IST, as the nmi_crash
+         * handler is immune to corrupt exception frames, by virtue of
+         * being designed never to return.
+         *
+         * This update is safe from a security point of view, as this
+         * pcpu is never going to try to sysret back to a PV vcpu.
+         */
+        set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+
+        kexec_crash_save_cpu();
+        __stop_this_cpu();
+
+        this_cpu(crash_save_done) = 1;
+        atomic_dec(&waiting_for_crash_ipi);
+    }
+
+    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
+     * back to its boot state, so we are unable to rely on the regular
+     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+     * (The likely scenario is that we have reverted from x2apic mode to
+     * xapic, at which point #GPFs will occur if we use the apic_*
+     * functions)
+     *
+     * The ICR and APIC ID of the LAPIC are still valid even during
+     * software disable (Intel SDM Vol 3, 10.4.7.2).  As a result, we
+     * can deliberately queue up another NMI at the LAPIC which will not
+     * be delivered as the hardware NMI latch is currently in effect.
+     * This means that if NMIs become unlatched (e.g. following a
+     * non-fatal MCE), the LAPIC will force us back here rather than
+     * wandering back into regular Xen code.
      */
-    if ( cpu == crashing_cpu )
-        return 1;
-    local_irq_disable();
+    switch ( current_local_apic_mode() )
+    {
+        u32 apic_id;

-    kexec_crash_save_cpu();
+    case APIC_MODE_X2APIC:
+        apic_id = apic_rdmsr(APIC_ID);

-    __stop_this_cpu();
+        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL
+                   | ((u64)apic_id << 32));
+        break;

-    atomic_dec(&waiting_for_crash_ipi);
+    case APIC_MODE_XAPIC:
+        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
+
+        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+            cpu_relax();
+
+        apic_mem_write(APIC_ICR2, apic_id << 24);
+        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+        break;
+
+    default:
+        break;
+    }

     for ( ; ; )
         halt();
-
-    return 1;
 }

 static void nmi_shootdown_cpus(void)
 {
     unsigned long msecs;
+    int i, cpu = smp_processor_id();

     local_irq_disable();

-    crashing_cpu = smp_processor_id();
+    crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;

     atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
-    /* Would it be better to replace the trap vector here? */
-    set_nmi_callback(crash_nmi_callback);
+
+    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
+     * invokes do_nmi_crash (above), which causes them to write state and
+     * fall into a loop.  The crashing pcpu gets the nop handler to
+     * cause it to return to this function ASAP.
+     */
+    for ( i = 0; i < nr_cpu_ids; ++i )
+        if ( idt_tables[i] )
+        {
+            if ( i == cpu )
+            {
+                /* Disable the interrupt stack tables for this MCE and
+                 * NMI handler (shortly to become a nop) as there is a 1
+                 * instruction race window where NMIs could be
+                 * re-enabled and corrupt the exception frame, leaving
+                 * us unable to continue on this crash path (which half
+                 * defeats the point of using the nop handler in the
+                 * first place).
+                 *
+                 * This update is safe from a security point of view, as
+                 * this pcpu is never going to try to sysret back to a
+                 * PV vcpu.
+                 */
+                set_ist(&idt_tables[i][TRAP_nmi], IST_NONE);
+                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
+
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+            }
+            else
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
+        }
+
     /* Ensure the new callback function is set before sending out the NMI. */
     wmb();

diff -r 43f86afe90be -r 3757511a7852 xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -87,6 +87,22 @@ void machine_kexec(xen_kexec_image_t *im
      */
     local_irq_disable();

+    /* Now that regular interrupts are disabled, we need to reduce the
+     * impact of interrupts not disabled by 'cli'.
+     *
+     * The NMI handlers have already been set up by nmi_shootdown_cpus().
+     * All pcpus other than us have the nmi_crash handler, while we have
+     * the nop handler.
+     *
+     * The MCE handlers touch extensive areas of Xen code and data.  At
+     * this point, there is nothing we can usefully do, so set the nop
+     * handler.
+     */
+    set_intr_gate(TRAP_machine_check, &trap_nop);
+
+    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
+     * not like running with NMIs disabled. */
+    enable_nmis();
+
     /*
      * compat_machine_kexec() returns to idle pagetables, which requires us
      * to be running on a static GDT mapping (idle pagetables have no GDT

diff -r 43f86afe90be -r 3757511a7852 xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -635,11 +635,45 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
         jmp   handle_ist_exception

+ENTRY(nmi_crash)
+        cli
+        pushq $0
+        movl  $TRAP_nmi,4(%rsp)
+        SAVE_ALL
+        movq  %rsp,%rdi
+        callq do_nmi_crash /* Does not return */
+        ud2
+
 ENTRY(machine_check)
         pushq $0
         movl  $TRAP_machine_check,4(%rsp)
         jmp   handle_ist_exception

+/* Enable NMIs.  No special register assumptions, and all preserved. */
+ENTRY(enable_nmis)
+        pushq %rax
+        movq  %rsp, %rax /* Grab RSP before pushing */
+
+        /* Set up stack frame */
+        pushq $0               /* SS */
+        pushq %rax             /* RSP */
+        pushfq                 /* RFLAGS */
+        pushq $__HYPERVISOR_CS /* CS */
+        leaq  1f(%rip),%rax
+        pushq %rax             /* RIP */
+
+/* No-op trap handler.  Required for the kexec crash path.
+ * It is not used in performance-critical code, and saves having a single
+ * lone aligned iret.  It does not use ENTRY to declare the symbol, to
+ * avoid the explicit alignment. */
+.globl trap_nop;
+trap_nop:
+
+        iretq /* Disable the hardware NMI latch */
+1:
+        popq  %rax
+        retq
+
 .section .rodata, "a", @progbits

 ENTRY(exception_table)

diff -r 43f86afe90be -r 3757511a7852 xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -527,6 +527,7 @@ void do_ ## _name(struct cpu_user_regs *
 DECLARE_TRAP_HANDLER(divide_error);
 DECLARE_TRAP_HANDLER(debug);
 DECLARE_TRAP_HANDLER(nmi);
+DECLARE_TRAP_HANDLER(nmi_crash);
 DECLARE_TRAP_HANDLER(int3);
 DECLARE_TRAP_HANDLER(overflow);
 DECLARE_TRAP_HANDLER(bounds);
@@ -545,6 +546,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
 DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
 #undef DECLARE_TRAP_HANDLER

+void trap_nop(void);
+void enable_nmis(void);
+
 void syscall_enter(void);
 void sysenter_entry(void);
 void sysenter_eflags_saved(void);
Jan Beulich
2012-Dec-07 11:45 UTC
Re: [PATCH 1 of 2 V3] x86/IST: Create set_ist() helper function
>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> ... to save using open-coded bitwise operations, and update all IST
> manipulation sites to use the function.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ...
> +/* Set the interrupt stack table used by a particular interrupt
> + * descriptor table entry. */
> +static always_inline void set_ist(idt_entry_t * idt, unsigned long ist)
> +{
> +    /* ist is a 3 bit field, 32 bits into the idt entry. */
> +    ASSERT( ist < 8 );

This ought to check against IST_MAX.

> +    idt->a = ( idt->a & ~(7UL << 32) ) | ( (ist & 7UL) << 32 );

And with the check above, the rightmost & is pretty pointless.

Jan

> +}
>
>  #define IDT_ENTRIES 256
>  extern idt_entry_t idt_table[];
Jan Beulich
2012-Dec-07 11:52 UTC
Re: [PATCH 2 of 2 V3] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
> +     * invokes do_nmi_crash (above), which causes them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +            if ( i == cpu )
> +            {
> +                /* Disable the interrupt stack tables for this MCE and
> +                 * NMI handler (shortly to become a nop) as there is a 1
> +                 * instruction race window where NMIs could be
> +                 * re-enabled and corrupt the exception frame, leaving
> +                 * us unable to continue on this crash path (which half
> +                 * defeats the point of using the nop handler in the
> +                 * first place).
> +                 *
> +                 * This update is safe from a security point of view, as
> +                 * this pcpu is never going to try to sysret back to a
> +                 * PV vcpu.
> +                 */
> +                set_ist(&idt_tables[i][TRAP_nmi], IST_NONE);
> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
> +
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);

This makes the first set_ist() above pointless, doesn't it?

> +            }
> +            else
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
> +        }
> +
>      /* Ensure the new callback function is set before sending out the NMI. */
>      wmb();
> ...
> +/* Enable NMIs.  No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> +        pushq %rax

What's the point of saving %rax here, btw?

Jan

> +        movq  %rsp, %rax /* Grab RSP before pushing */
> ...
Andrew Cooper
2012-Dec-07 21:01 UTC
Re: [PATCH 1 of 2 V3] x86/IST: Create set_ist() helper function
On 07/12/2012 11:45, Jan Beulich wrote:
>>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> ... to save using open-coded bitwise operations, and update all IST
>> manipulation sites to use the function.
>> ...
>> +/* Set the interrupt stack table used by a particular interrupt
>> + * descriptor table entry. */
>> +static always_inline void set_ist(idt_entry_t * idt, unsigned long ist)
>> +{
>> +    /* ist is a 3 bit field, 32 bits into the idt entry. */
>> +    ASSERT( ist < 8 );
> This ought to check against IST_MAX.

Ok.

>> +    idt->a = ( idt->a & ~(7UL << 32) ) | ( (ist & 7UL) << 32 );
> And with the check above, the rightmost & is pretty pointless.
>
> Jan

I was trying to protect against overflowing into the rest of the bits.
I would hope that all actual instantiations use the IST_ macros, and
that the compiler will optimise it away, but in the case that someone
calls set_ist(<entry>, 100), the compiler will do the right thing.

Thinking about it though, anyone calling set_ist() with a value greater
than 7 probably has larger bugs in their code.  I will remove it.

~Andrew
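For reference, the shape the helper would take with both review points folded in - a sketch of a would-be V4, not the committed code:

/* Sketch only: set_ist() with the review feedback applied - assert
 * against IST_MAX instead of the raw field width, and drop the now
 * redundant "& 7UL" on the inserted value. */
static always_inline void set_ist(idt_entry_t *idt, unsigned long ist)
{
    /* ist is a 3 bit field, 32 bits into the idt entry. */
    ASSERT( ist <= IST_MAX );
    idt->a = ( idt->a & ~(7UL << 32) ) | ( ist << 32 );
}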
Andrew Cooper
2012-Dec-07 21:18 UTC
Re: [PATCH 2 of 2 V3] x86/kexec: Change NMI and MCE handling on kexec path
On 07/12/2012 11:52, Jan Beulich wrote:
>>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> ...
>> +                set_ist(&idt_tables[i][TRAP_nmi], IST_NONE);
>> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
>> +
>> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> This makes the first set_ist() above pointless, doesn't it?

No.  The set_ist() is to remove the possibility of stack corruption from
reentrant NMIs, while the trap_nop handler is so we don't get diverted
into the regular NMI handler.  There is nothing the NMI handler would do
which could alter the outcome, and there are many cases where the
regular NMI handler would try to panic, starting us reentrantly on the
kexec path again (where we would deadlock on the one_cpu_only() check).

Given that we are going to crash, any time spent in the NMI handler is
time not spent trying to kill the other pcpus, which themselves might be
heading towards a cascade failure.

>> ...
>> +/* Enable NMIs.  No special register assumptions, and all preserved. */
>> +ENTRY(enable_nmis)
>> +        pushq %rax
> What's the point of saving %rax here, btw?
>
> Jan

Because at the moment I believe I might need to call it from asm
context, when doing some of the later fixes.  I figured that it was
better to make it safe now, rather than patch it up later.

~Andrew
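For illustration, a hedged sketch of the kind of assembly call site being anticipated here; no such caller exists in this series. Because enable_nmis preserves all registers, it could be dropped into an existing entry.S path with no save/restore around it:

        /* Hypothetical future caller in entry.S: re-enable NMIs
         * mid-path without clobbering any live register state. */
        callq enable_nmis    /* all registers preserved by contract */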
Jan Beulich
2012-Dec-10 09:32 UTC
Re: [PATCH 2 of 2 V3] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 07.12.12 at 22:18, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 07/12/2012 11:52, Jan Beulich wrote:
>>>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> ...
>>> +                set_ist(&idt_tables[i][TRAP_nmi], IST_NONE);
>>> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
>>> +
>>> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
>> This makes the first set_ist() above pointless, doesn't it?
>
> No.  The set_ist() is to remove the possibility of stack corruption from
> reentrant NMIs, while the trap_nop handler is so we don't get diverted
> into the regular NMI handler.  There is nothing the NMI handler would do
> which could alter the outcome, and there are many cases where the
> regular NMI handler would try to panic, starting us reentrantly on the
> kexec path again (where we would deadlock on the one_cpu_only() check).

I think you didn't get the point of the question: _set_gate() clears
the IST field of the descriptor anyway, so why clear it separately
first, and then overwrite it again?

>>> ...
>>> +/* Enable NMIs.  No special register assumptions, and all preserved. */
>>> +ENTRY(enable_nmis)
>>> +        pushq %rax
>> What's the point of saving %rax here, btw?
>
> Because at the moment I believe I might need to call it from asm
> context, when doing some of the later fixes.  I figured that it was
> better to make it safe now, rather than patch it up later.

I don't think that's good practice - if you end up not calling the
thing from assembly code, the question on the purpose of saving
%rax will re-surface sooner or later.  Plus the patch by itself
wouldn't really explain either why this is so (which might be
of interest in the context of backporting it to older trees).

Jan
Andrew Cooper
2012-Dec-10 12:54 UTC
Re: [PATCH 2 of 2 V3] x86/kexec: Change NMI and MCE handling on kexec path
On 10/12/12 09:32, Jan Beulich wrote:
>>>> On 07.12.12 at 22:18, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 07/12/2012 11:52, Jan Beulich wrote:
>>>>>> On 06.12.12 at 22:42, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> ...
>>>> +                set_ist(&idt_tables[i][TRAP_nmi], IST_NONE);
>>>> +                set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
>>>> +
>>>> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
>>> This makes the first set_ist() above pointless, doesn't it?
>> No.  The set_ist() is to remove the possibility of stack corruption from
>> reentrant NMIs, while the trap_nop handler is so we don't get diverted
>> into the regular NMI handler.
>> ...
> I think you didn't get the point of the question: _set_gate() clears
> the IST field of the descriptor anyway, so why clear it separately
> first, and then overwrite it again?

Ah - my apologies.  I indeed was not understanding the point.

I will need to fix up the calls then.  Having _set_gate() change the
IST as well reintroduces the security vulnerability.  I will create a
new function similar to _set_gate() which only changes the handler and
nothing else.

>>>> ...
>>>> +/* Enable NMIs.  No special register assumptions, and all preserved. */
>>>> +ENTRY(enable_nmis)
>>>> +        pushq %rax
>>> What's the point of saving %rax here, btw?
>> Because at the moment I believe I might need to call it from asm
>> context, when doing some of the later fixes.  I figured that it was
>> better to make it safe now, rather than patch it up later.
> I don't think that's good practice - if you end up not calling the
> thing from assembly code, the question on the purpose of saving
> %rax will re-surface sooner or later.  Plus the patch by itself
> wouldn't really explain either why this is so (which might be
> of interest in the context of backporting it to older trees).
>
> Jan

Ok - will remove.

~Andrew
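A sketch of the replacement helper Andrew describes - one which rewrites only the handler address of an existing gate, leaving the IST, type, DPL and present bits untouched. The name and exact form are hypothetical (this is not the function which eventually went into Xen), and it assumes idt_entry_t is the pair of 64-bit words 'a' and 'b' manipulated elsewhere in this thread:

/* Hypothetical helper: update only the handler address of a live IDT
 * entry.  Word 'a' carries offset bits 0-15 in its bits 0-15 and offset
 * bits 16-31 in its bits 48-63; word 'b' carries offset bits 32-63 in
 * its low half.  Bits 16-47 of 'a' (selector, IST, type, DPL, P) are
 * preserved. */
static inline void _update_gate_addr(idt_entry_t *gate, void *addr)
{
    unsigned long p = (unsigned long)addr;

    gate->a = ( gate->a & 0x0000ffffffff0000UL ) |
              ( (p & 0xffff0000UL) << 32 ) | ( p & 0xffffUL );
    gate->b = ( gate->b & 0xffffffff00000000UL ) | ( p >> 32 );
}

A production version would also need to consider an NMI arriving between the two descriptor writes; this sketch ignores that ordering concern.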