thr3ads.net - Xen devel - [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler [Nov 2013]

If this information is useful, please help other people find it:
Share via:

Zhu Yanhai

2013-Nov-06 06:41 UTC

[PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

As we know Intel X86''s CR0.TS is a sticky bit, which means once set
it remains set until cleared by some software routines, in other words,
the exception handler expects the bit is set when it starts to execute.

However xen doesn''t simulate this behavior quite well for PV guests -
vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
so the guest kernel''s #NM handler runs with CR0.TS cleared. Generally
speaking
it''s fine since the linux kernel executes the exception handler with
interrupt disabled and a sane #NM handler will clear the bit anyway
before it exits, but there''s a catch: if it''s the first FPU
trap for the process,
the linux kernel must allocate a piece of SLAB memory for it to save
the FPU registers, which opens a schedule window as the memory
allocation might sleep -- and with CR0.TS keeps clear!

[see the code below in linux kernel,

void math_state_restore(void)
{
    struct task_struct *tsk = current;

    if (!tsk_used_math(tsk)) {
        local_irq_enable();
        /*
         * does a slab alloc which can sleep
         */
        if (init_fpu(tsk)) {                 <<<< Here it might open
a schedule window
            /*
             * ran out of memory!
             */
            do_group_exit(SIGKILL);
            return;
        }
        local_irq_disable();
    }

    __thread_fpu_begin(tsk);    <<<< Here the process gets marked as
a ''fpu user''
                                         after the schedule window

    /*
     * Paranoid restore. send a SIGSEGV if we fail to restore the state.
     */
    if (unlikely(restore_fpu_checking(tsk))) {
        drop_init_fpu(tsk);
        force_sig(SIGSEGV, tsk);
        return;
    }

    tsk->fpu_counter++;
}
]

The check in linux kernel''s switch_fpu_prepare() doesn''t
stts() either because
the current process only gets marked as a FPU user after the schedule window
(the story is a bit different if eagerfpu is enabled, anyway a sane hypervisor
cannot depend on such undetermined things). And then supposing that the new
process scheduled-in wants to touch FPU, nobody will do fxrstor/frstor for it
anymore,
conducing to a serious data damage.

Also, The point is everything is fine on linux + baremetal since CR0.TS will
keep set until the interrupted #NM handler got the memory it needs and exits,
so the incomer FPU user will get trapped as it''s supposed to be.

The test case is as below,

        buf = malloc(BUF_SIZE);
        if (!buf) {
                fprintf(stderr, "error %s during %s\n",
                        strerror(-err),
                        "malloc");
                return 1;
        }
        memset(buf, IO_PATTERN, BUF_SIZE);
        memset(cmp_buf, IO_PATTERN, BUF_SIZE);

        if (memcmp(buf, cmp_buf, BUF_SIZE)) {
                unsigned long long *ubuf = (unsigned long long *)buf;
                int i;

                for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
                        printf("%d: 0x%llx\n", i, ubuf[i]);

                return 2;
        }

Two shell scripts on each box''s dom0 runs above program repeatedly
until
the compare fails (so every time the C program is a new fpu user and triggers
memory allocation). we can see the data damage at least once with
xen 4.3 + linux 2.6.32 on ~200 physical machines within two hours.
With xen 4.3 + linux 3.11.6 stable it becomes harder to reproduce
(guess it''s because of the eagerfpu feature introduced in linux kernel
3.7)
but it''s still possible to come out within about four hours.

The fix here is trying to make xen behave as close to the hardware as possible.

This bug only has effects on PV guests (and including dom0 kernel of course).

Cc: Wan Jia <jia.wanj@alibaba-inc.com>
Cc: Shen Yiben <zituan@taobao.com>
Cc: Charles Wang <muming.wq@taobao.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
---
 xen/arch/x86/traps.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 77c200b..b0321a6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3267,8 +3267,8 @@ void do_device_not_available(struct cpu_user_regs *regs)
 
     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
     {
+        stts();
         do_guest_trap(TRAP_no_device, regs, 0);
-        curr->arch.pv_vcpu.ctrlreg[0] &= ~X86_CR0_TS;
     }
     else
         TRACE_0D(TRC_PV_MATH_STATE_RESTORE);
-- 
1.7.4.4

Jan Beulich

2013-Nov-06 08:51 UTC

head link

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com>
wrote:
> As we know Intel X86''s CR0.TS is a sticky bit, which means once
set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
Since when would that be the case? CR0.TS is entirely unaffected
by exception invocations according to all I know. All that is known
here is that #NM wouldn''t have occurred in the first place if CR0.TS
was clear.
> However xen doesn''t simulate this behavior quite well for PV
guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very
beginning,
> so the guest kernel''s #NM handler runs with CR0.TS cleared.
Generally
> speaking
> it''s fine since the linux kernel executes the exception handler
with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there''s a catch: if it''s the first
FPU trap for the
> process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
> 
> [see the code below in linux kernel,
You''re apparently referring to the pvops kernel.
> void math_state_restore(void)
> {
>     struct task_struct *tsk = current;
> 
>     if (!tsk_used_math(tsk)) {
>         local_irq_enable();
>         /*
>          * does a slab alloc which can sleep
>          */
>         if (init_fpu(tsk)) {                 <<<< Here it might
open a schedule window
>             /*
>              * ran out of memory!
>              */
>             do_group_exit(SIGKILL);
>             return;
>         }
>         local_irq_disable();
>     }
> 
>     __thread_fpu_begin(tsk);    <<<< Here the process gets
marked as a ''fpu user''
>                                          after the schedule window
> 
>     /*
>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>      */
>     if (unlikely(restore_fpu_checking(tsk))) {
>         drop_init_fpu(tsk);
>         force_sig(SIGSEGV, tsk);
>         return;
>     }
> 
>     tsk->fpu_counter++;
> }
> ]
May I direct your attention to the XenoLinux one:

asmlinkage void math_state_restore(void)
{
	struct task_struct *me = current;

	/* NB. ''clts'' is done for us by Xen during virtual trap. */
	__get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
	if (!used_math())
		init_fpu(me);
	restore_fpu_checking(&me->thread.i387.fxsave);
	task_thread_info(me)->status |= TS_USEDFPU;
}

Note the comment close to the beginning - the fact that CR0.TS
is clear at exception handler entry is actually part of the PV ABI,
i.e. by altering hypervisor behavior here you break all forward
ported kernels.

Nevertheless I agree that there is an issue, but this needs to be
fixed on the Linux side (hence adding the Linux maintainers to Cc);
this issue was introduced way back in 2.6.26 (before that there
was no allocation on that path). It''s not clear though whether
using GFP_ATOMIC for the allocation would be preferable over
stts() before calling the allocation function (and clts() if it
succeeded), or whether perhaps to defer the stts() until we
actually know the task is being switched out. It''s going to be an
ugly, Xen-specific hack in any event.

Jan

Zhu Yanhai

2013-Nov-06 09:15 UTC

head link

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

2013/11/6 Jan Beulich <JBeulich@suse.com>:>>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com>
wrote:
>> As we know Intel X86''s CR0.TS is a sticky bit, which means
once set
>> it remains set until cleared by some software routines, in other words,
>> the exception handler expects the bit is set when it starts to execute.
>
> Since when would that be the case? CR0.TS is entirely unaffected
> by exception invocations according to all I know. All that is known
> here is that #NM wouldn''t have occurred in the first place if
CR0.TS
> was clear.
Yes, you are right, nevertheless IMHO no exception handlers in the
real world do not clear this bit.
>
>> However xen doesn''t simulate this behavior quite well for PV
guests -
>> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very
beginning,
>> so the guest kernel''s #NM handler runs with CR0.TS cleared.
Generally
>> speaking
>> it''s fine since the linux kernel executes the exception
handler with
>> interrupt disabled and a sane #NM handler will clear the bit anyway
>> before it exits, but there''s a catch: if it''s the
first FPU trap for the
>> process,
>> the linux kernel must allocate a piece of SLAB memory for it to save
>> the FPU registers, which opens a schedule window as the memory
>> allocation might sleep -- and with CR0.TS keeps clear!
>>
>> [see the code below in linux kernel,
>
> You''re apparently referring to the pvops kernel.
Yes, it''s from Linus tree with tag 3.11.
>
>> void math_state_restore(void)
>> {
>>     struct task_struct *tsk = current;
>>
>>     if (!tsk_used_math(tsk)) {
>>         local_irq_enable();
>>         /*
>>          * does a slab alloc which can sleep
>>          */
>>         if (init_fpu(tsk)) {                 <<<< Here it
might open a schedule window
>>             /*
>>              * ran out of memory!
>>              */
>>             do_group_exit(SIGKILL);
>>             return;
>>         }
>>         local_irq_disable();
>>     }
>>
>>     __thread_fpu_begin(tsk);    <<<< Here the process gets
marked as a ''fpu user''
>>                                          after the schedule window
>>
>>     /*
>>      * Paranoid restore. send a SIGSEGV if we fail to restore the
state.
>>      */
>>     if (unlikely(restore_fpu_checking(tsk))) {
>>         drop_init_fpu(tsk);
>>         force_sig(SIGSEGV, tsk);
>>         return;
>>     }
>>
>>     tsk->fpu_counter++;
>> }
>> ]
>
> May I direct your attention to the XenoLinux one:
>
> asmlinkage void math_state_restore(void)
> {
>         struct task_struct *me = current;
>
>         /* NB. ''clts'' is done for us by Xen during
virtual trap. */
>         __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>         if (!used_math())
>                 init_fpu(me);
>         restore_fpu_checking(&me->thread.i387.fxsave);
>         task_thread_info(me)->status |= TS_USEDFPU;
> }
>
> Note the comment close to the beginning - the fact that CR0.TS
> is clear at exception handler entry is actually part of the PV ABI,
> i.e. by altering hypervisor behavior here you break all forward
> ported kernels.
I see, XenoLinux kernel doesn''t sleep in init_fpu() so it
doesn''t have
this issue. But I wonder why PV ABI decide to clear this bit for the
guest kernel, isn''t it better for the guest kernel itself to see bit
set? Since it''s more similar with the hardware. I know the ABI cannot
be changed, just for curious.
>
> Nevertheless I agree that there is an issue, but this needs to be
> fixed on the Linux side (hence adding the Linux maintainers to Cc);
> this issue was introduced way back in 2.6.26 (before that there
> was no allocation on that path). It''s not clear though whether
> using GFP_ATOMIC for the allocation would be preferable over
> stts() before calling the allocation function (and clts() if it
> succeeded), or whether perhaps to defer the stts() until we
> actually know the task is being switched out. It''s going to be an
> ugly, Xen-specific hack in any event.
Yes, it also can be fixed at the linux kernel side. I didn''t know such
behavior was part of PV ABI before.

--
Thanks,
Zhu Yanhai
>
> Jan
>

Jan Beulich

2013-Nov-06 09:28 UTC

head link

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

>>> On 06.11.13 at 10:15, Zhu Yanhai <zhu.yanhai@gmail.com>
wrote:
> 2013/11/6 Jan Beulich <JBeulich@suse.com>:
>> May I direct your attention to the XenoLinux one:
>>
>> asmlinkage void math_state_restore(void)
>> {
>>         struct task_struct *me = current;
>>
>>         /* NB. ''clts'' is done for us by Xen during
virtual trap. */
>>         __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>>         if (!used_math())
>>                 init_fpu(me);
>>         restore_fpu_checking(&me->thread.i387.fxsave);
>>         task_thread_info(me)->status |= TS_USEDFPU;
>> }
>>
>> Note the comment close to the beginning - the fact that CR0.TS
>> is clear at exception handler entry is actually part of the PV ABI,
>> i.e. by altering hypervisor behavior here you break all forward
>> ported kernels.
> 
> I see, XenoLinux kernel doesn''t sleep in init_fpu() so it
doesn''t have
> this issue. But I wonder why PV ABI decide to clear this bit for the
> guest kernel, isn''t it better for the guest kernel itself to see
bit
> set? Since it''s more similar with the hardware. I know the ABI
cannot
> be changed, just for curious.
Quite obvious - performance. Since (as you also confirmed) it is
(almost) guaranteed for the handler to want to clear the bit, we
can save it from having to do another hypercall here. In the
forward ported XenoLinux the change is quite trivial (leaving
aside any optimization, as we''re on a rarely used path here
anyway - it''s being taken only the first time a process accesses
the FPU): stts() before the local_irq_enable(), and clts() after
the local_irq_disable(). But the x86 maintainers probably won''t
like this for pvops...

Jan

Xen devel - Nov 2013 - [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler

[PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler

Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest''s #NM exception handler