Peter Moody
2013-Jan-28 19:17 UTC
100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
TL;DR: the domU crash I reported over the summer on Xen 4.0.1 can be
reproduced on 4.1.3, on more processor families, and without the
special memory/CPU configurations I previously reported.

Longer version:
Apropos of this thread [1] from last summer, I've managed to test for
this bug on a more recent version of Xen and I can confirm that it
exists in at least 4.1.3. Also, based on the release notes for 4.0.1
[2] (the original version of Xen where I encountered this issue), I
reproduced the bug on an AMD Athlon processor in case the interrupts
issue mentioned had an effect.

The patch I posted to the audit list didn't actually fix the problem.

Steps I used to reproduce:
1) installed Xen from Ubuntu packages and booted into the Xen-enabled system.
2) installed an Ubuntu 12.10 domU using a 20G flat file as disk (the
   previous system used drbd).
3) installed auditd and inserted any syscall rule (audit on chmods,
   for example).
4) compiled the attached sample program as a 32-bit binary.
5) ran it (works as a normal user).

The result is an immediate crash (if KILLDIR doesn't exist or isn't
writable, you just get a segfault). Interestingly, it also seems to
leave dom0 in a funky state where dom0 is unable to reboot (I think it
has to do with the disk file not being unmounted). I can only recover
from this cleanly by running xm destroy on the crashed domain.

This is my Xen configuration for this particular domain:

memory = "1024"
disk = [ 'file:/home/pmoody/virt/xen/xen-bug/disk1.img,xvda,w', ]
vif = [ 'bridge=xenbr0', ]
vcpus=4
on_reboot = "restart"
on_crash = "restart"

(the number of vcpus doesn't appear to be important).

From my recent testing, it seems like it should be very easy for
someone else to reproduce this issue.

So, does anyone have any idea of what might be going on?
Cheers,
peter

[1] http://lists.xen.org/archives/html/xen-devel/2012-08/msg01052.html
[2] http://wiki.xen.org/wiki/Xen_4.0_Release_Notes#Known_issues

--
[ Peter Moody | Security Engineer | Google ]
Jan Beulich
2013-Jan-29 11:38 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
>>> On 28.01.13 at 20:17, Peter Moody <pmoody@google.com> wrote:
> TL;DR, the domU crash I reported over the summer on Xen 4.0.1 can be
> reproduced on 4.1.3 and on more processor families and without the
> special memory/cpu configurations I previously reported.
>
> Longer version:
> apropos of this thread [1] from last summer, I've managed to test for
> this bug on a more recent version of Xen and I can confirm that it
> exists in at least 4.1.3. Also, based on the release notes for 4.0.1
> [2] (the original version of Xen where I encountered this issue), I
> reproduced the bug on an AMD Athlon processor in case the interrupts
> issue mentioned had an effect.

I'm surprised this is still unresolved, but part of the problem may
be that you tag your problem (in the subject) to a particular Xen
version, thus implying it is a hypervisor issue. From the data you
provide I would think it's a kernel issue though.

> The patch I posted to the audit list didn't actually fix the problem.
>
> Steps I used to reproduce:
> 1) installed Xen from ubuntu packages and boot into Xen enabled system.
> 2) installed ubuntu 12.10 domU using 20G flat file as disk (the
> previous system used drbd).
> 3) installed auditd and inserted any syscall rule (audit on chmod's
> for example).
> 4) compiled the attached sample program as a 32 bit binary.
> 5) ran it (works as a normal user).
>
> The result is an immediate crash (if KILLDIR doesn't exist or isn't
> writable, you just get a segfault).
>
> So, does anyone have any idea of what might be going on?

Sure - the code in question wants to run with interrupts enabled,
but they aren't, for (I think) a fairly obvious reason:
arch/x86/ia32/ia32entry.S:auditsys_exit has hard STI/CLI in it,
when those really should be ENABLE_INTERRUPTS() and
DISABLE_INTERRUPTS() respectively. Does the below help?
Jan

--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -207,7 +207,7 @@ sysexit_from_sys_call:
 	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jnz ia32_ret_from_sys_call
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	movl %eax,%esi		/* second arg, syscall return value */
 	cmpl $-MAX_ERRNO,%eax	/* is it an error ? */
 	jbe 1f
@@ -217,7 +217,7 @@ sysexit_from_sys_call:
 	call __audit_syscall_exit
 	movq RAX-ARGOFFSET(%rsp),%rax /* reload syscall return value */
 	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jz \exit
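To illustrate why the macros matter, here is a minimal user-space C
model (not kernel code) of the indirection that ENABLE_INTERRUPTS() and
DISABLE_INTERRUPTS() provide. Every name in it (irq_ops, cpu_if,
pv_event_mask, native_ops, pv_ops, enable_interrupts,
disable_interrupts) is a hypothetical stand-in chosen for this sketch;
it assumes only that a native kernel tracks interrupt state in the CPU
flag while a paravirtualized guest tracks it in a separate,
guest-visible mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; every name below is hypothetical, not a kernel API. */

static bool cpu_if;                      /* models the hardware interrupt flag */
static unsigned char pv_event_mask = 1;  /* models a PV guest's mask; 1 = masked */

/* Table of interrupt-state operations, selected per platform. */
struct irq_ops {
    void (*enable)(void);
    void (*disable)(void);
    bool (*disabled)(void);
};

/* Native backend: operates on the (modelled) CPU flag. */
static void native_enable(void)   { cpu_if = true; }
static void native_disable(void)  { cpu_if = false; }
static bool native_disabled(void) { return !cpu_if; }

/* PV backend: operates on the guest-visible mask instead. */
static void pv_enable(void)   { pv_event_mask = 0; }
static void pv_disable(void)  { pv_event_mask = 1; }
static bool pv_disabled(void) { return pv_event_mask != 0; }

const struct irq_ops native_ops = { native_enable, native_disable, native_disabled };
const struct irq_ops pv_ops     = { pv_enable, pv_disable, pv_disabled };

/* Stand-ins for the macros: always go through the active ops table. */
void enable_interrupts(const struct irq_ops *ops)
{
    assert(ops != NULL);
    ops->enable();
}

void disable_interrupts(const struct irq_ops *ops)
{
    assert(ops != NULL);
    ops->disable();
}
```

With pv_ops selected, enable_interrupts() clears the guest-visible mask
and leaves cpu_if alone. Code that hard-codes the native path never
touches the state the PV guest actually checks, which is the situation
the patch above addresses.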
David Vrabel
2013-Jan-29 11:56 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
On 28/01/13 19:17, Peter Moody wrote:
> TL;DR, the domU crash I reported over the summer on Xen 4.0.1 can be
> reproduced on 4.1.3 and on more processor families and without the
> special memory/cpu configurations I previously reported.
>
> Longer version:
> apropos of this thread [1] from last summer, I've managed to test for
> this bug on a more recent version of Xen and I can confirm that it
> exists in at least 4.1.3. Also, based on the release notes for 4.0.1
> [2] (the original version of Xen where I encountered this issue), I
> reproduced the bug on an AMD Athlon processor in case the interrupts
> issue mentioned had an effect.
>
> The patch I posted to the audit list didn't actually fix the problem.
>
> Steps I used to reproduce:
> 1) installed Xen from ubuntu packages and boot into Xen enabled system.
> 2) installed ubuntu 12.10 domU using 20G flat file as disk (the
> previous system used drbd).
> 3) installed auditd and inserted any syscall rule (audit on chmod's
> for example).
> 4) compiled the attached sample program as a 32 bit binary.
> 5) ran it (works as a normal user).

The BUG triggers because irqs_disabled() is true.

The call to __audit_syscall_exit is from ia32_sysenter_target in
arch/x86/ia32/ia32entry.S, which attempts to enable interrupts prior to
the call with an sti instruction.

I don't think this works as expected with a PV kernel and I'm surprised
that this doesn't cause a #GP fault.

Jan (Cc'd) is more familiar with these low-level bits, but does this
(untested) patch help?

---8<-------------
From 8a3ebe942a8e6f930ee1636e8fe54a357144b007 Mon Sep 17 00:00:00 2001
From: David Vrabel <david.vrabel@citrix.com>
Date: Tue, 29 Jan 2013 11:48:14 +0000
Subject: [PATCH] x86/ia32: correctly enable irqs before calling __audit_syscall_exit

Before calling __audit_syscall_exit, local interrupts were being
enabled with sti (and then disabled with cli). This does not work in
paravirtualized guests, so use the correct ENABLE_INTERRUPTS() and
DISABLE_INTERRUPTS() macros instead.

This fixes a BUG when auditing system calls from a 32-bit userspace
process inside a 64-bit Xen PV guest.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/ia32/ia32entry.S |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 102ff7c..142c4ce 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -207,7 +207,7 @@ sysexit_from_sys_call:
 	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jnz ia32_ret_from_sys_call
 	TRACE_IRQS_ON
-	sti
+	ENABLE_INTERRUPTS(CLBR_NONE)
 	movl %eax,%esi		/* second arg, syscall return value */
 	cmpl $-MAX_ERRNO,%eax	/* is it an error ? */
 	jbe 1f
@@ -217,7 +217,7 @@ sysexit_from_sys_call:
 	call __audit_syscall_exit
 	movq RAX-ARGOFFSET(%rsp),%rax /* reload syscall return value */
 	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	cli
+	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jz \exit
Jan Beulich
2013-Jan-29 12:57 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
>>> On 29.01.13 at 12:56, David Vrabel <david.vrabel@citrix.com> wrote:
> On 28/01/13 19:17, Peter Moody wrote:
>> TL;DR, the domU crash I reported over the summer on Xen 4.0.1 can be
>> reproduced on 4.1.3 and on more processor families and without the
>> special memory/cpu configurations I previously reported.
>>
>> Longer version:
>> apropos of this thread [1] from last summer, I've managed to test for
>> this bug on a more recent version of Xen and I can confirm that it
>> exists in at least 4.1.3. Also, based on the release notes for 4.0.1
>> [2] (the original version of Xen where I encountered this issue), I
>> reproduced the bug on an AMD Athlon processor in case the interrupts
>> issue mentioned had an effect.
>>
>> The patch I posted to the audit list didn't actually fix the problem.
>>
>> Steps I used to reproduce:
>> 1) installed Xen from ubuntu packages and boot into Xen enabled system.
>> 2) installed ubuntu 12.10 domU using 20G flat file as disk (the
>> previous system used drbd).
>> 3) installed auditd and inserted any syscall rule (audit on chmod's
>> for example).
>> 4) compiled the attached sample program as a 32 bit binary.
>> 5) ran it (works as a normal user).
>
> The BUG is because irqs_disabled().
>
> The call to __audit_syscall_exit is from ia32_sysenter_target in
> arch/x86/ia32/ia32entry.S which attempts to enable interrupts prior to
> the call with an sti instruction.
>
> I don't think this works as expected with a PV kernel and I'm surprised
> that this doesn't cause a #GP fault.

It does, but it gets dealt with by the hypervisor. It's just that the
code handling this is commented out (i.e. both STI and CLI are
effectively NOPs), because of the inconsistency their emulation would
cause with PUSHF/POPF. See the respective cases in
arch/x86/traps.c:emulate_privileged_op().

> Jan (Cc'd) is more familar with these low-level bits but does (untested)
> this patch help?

Apart from the title and description, the one I had sent a couple
of minutes before yours is identical, so I guess both of us having
come to the same conclusion is a good sign...

Jan
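The sequence described above can be sketched as a small user-space C
model. This is a simplification under stated assumptions, not the
kernel's or the hypervisor's code: struct pv_vcpu, hard_sti, hard_cli,
pv_irq_enable, pv_irq_disable and audit_exit_hits_bug are hypothetical
names, the guest's notion of "interrupts disabled" is reduced to a
single mask byte, and the exact location of the failing check inside
the audit path is assumed rather than taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one PV vCPU; all names are hypothetical. */
struct pv_vcpu {
    unsigned char event_mask; /* nonzero: the guest considers IRQs disabled */
};

/*
 * Hard STI/CLI in a PV guest: the instruction traps, and per the
 * explanation in the thread the hypervisor treats it as a NOP, so the
 * guest-visible mask is left unchanged.
 */
static void hard_sti(struct pv_vcpu *v) { (void)v; }
static void hard_cli(struct pv_vcpu *v) { (void)v; }

/* Paravirtualized equivalents: update the mask the guest really checks. */
static void pv_irq_enable(struct pv_vcpu *v)  { v->event_mask = 0; }
static void pv_irq_disable(struct pv_vcpu *v) { v->event_mask = 1; }

static bool irqs_disabled(const struct pv_vcpu *v)
{
    return v->event_mask != 0;
}

/* Returns true if the modelled audit exit path would hit the BUG. */
bool audit_exit_hits_bug(bool use_pv_ops)
{
    struct pv_vcpu v = { .event_mask = 1 }; /* exit path starts with IRQs off */
    bool bug;

    if (use_pv_ops)
        pv_irq_enable(&v);
    else
        hard_sti(&v);

    /* Stand-in for the irqs_disabled() check reached via the audit call. */
    bug = irqs_disabled(&v);

    if (use_pv_ops)
        pv_irq_disable(&v);
    else
        hard_cli(&v);

    assert(irqs_disabled(&v)); /* either way, IRQs are off again on return */
    return bug;
}
```

In this model audit_exit_hits_bug(false) reports the failure because
the NOP'd sti never clears the mask, while audit_exit_hits_bug(true)
does not, which matches the behaviour the patch is meant to produce.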
Peter Moody
2013-Jan-29 21:05 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
On Tue, Jan 29, 2013 at 3:38 AM, Jan Beulich <JBeulich@suse.com> wrote:

> Sure - the code in question wants to run with interrupts enabled,
> but they aren't for (I think) quite obvious a reason:
> arch/x86/ia32/ia32entry.S:auditsys_exit has hard STI/CLI in it,
> when those really should be ENABLE_INTERRUPTS() and
> DISABLE_INTERRUPTS() respectively. Does the below help?

I'll test this when I get home tonight. This is for the guest kernel, right?

Thanks!

Cheers,
peter

--
[ Peter Moody | Security Engineer | Google ]
Peter Moody
2013-Jan-29 21:44 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
On Tue, Jan 29, 2013 at 1:05 PM, Peter Moody <pmoody@google.com> wrote:
> On Tue, Jan 29, 2013 at 3:38 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
>> Sure - the code in question wants to run with interrupts enabled,
>> but they aren't for (I think) quite obvious a reason:
>> arch/x86/ia32/ia32entry.S:auditsys_exit has hard STI/CLI in it,
>> when those really should be ENABLE_INTERRUPTS() and
>> DISABLE_INTERRUPTS() respectively. Does the below help?
>
> I'll test this when I get home tonight. This is for the guest kernel, right?

I just tested on one of my previously crashing work machines and this
looks good. Thanks again!

Cheers,
peter

--
[ Peter Moody | Security Engineer | Google ]
Peter Moody
2013-Jan-29 22:21 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
On Tue, Jan 29, 2013 at 1:44 PM, Peter Moody <pmoody@google.com> wrote:
> On Tue, Jan 29, 2013 at 1:05 PM, Peter Moody <pmoody@google.com> wrote:
>> On Tue, Jan 29, 2013 at 3:38 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>
>>> Sure - the code in question wants to run with interrupts enabled,
>>> but they aren't for (I think) quite obvious a reason:
>>> arch/x86/ia32/ia32entry.S:auditsys_exit has hard STI/CLI in it,
>>> when those really should be ENABLE_INTERRUPTS() and
>>> DISABLE_INTERRUPTS() respectively. Does the below help?
>>
>> I'll test this when I get home tonight. This is for the guest kernel, right?
>
> I just tested on one of my previously crashing work machines and this
> looks good. Thanks again!

Poor form to keep replying to myself here, but is this patch something
that would make it back into Linus' tree from here, or is there a more
appropriate forum where I should push this?

Cheers,
peter

--
[ Peter Moody | Security Engineer | Google ]
Jan Beulich
2013-Jan-30 07:57 UTC
Re: 100% reliable oops on Xen 4.1.3 (initially reported on 4.0.1)
>>> On 29.01.13 at 23:21, Peter Moody <pmoody@google.com> wrote:
> On Tue, Jan 29, 2013 at 1:44 PM, Peter Moody <pmoody@google.com> wrote:
>> On Tue, Jan 29, 2013 at 1:05 PM, Peter Moody <pmoody@google.com> wrote:
>>> On Tue, Jan 29, 2013 at 3:38 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>
>>>> Sure - the code in question wants to run with interrupts enabled,
>>>> but they aren't for (I think) quite obvious a reason:
>>>> arch/x86/ia32/ia32entry.S:auditsys_exit has hard STI/CLI in it,
>>>> when those really should be ENABLE_INTERRUPTS() and
>>>> DISABLE_INTERRUPTS() respectively. Does the below help?
>>>
>>> I'll test this when I get home tonight. This is for the guest kernel, right?
>>
>> I just tested on one of my previously crashing work machines and this
>> looks good. Thanks again!
>
> poor form to keep replying to myself here but, is this patch something
> that would make it back into linus' tree from here or is there a more
> appropriate forum where I should push this?

Just sent this upstream, with you Cc-ed so you can see eventual
progress. Hopefully older stable trees will also pick this up
sooner or later.

Jan