This seems to be some combination of Xen and the audit subsystem, but
the attached program crashes my machine 100% of the time.

steps to reproduce the crash:

 * 1) compile with gcc -m32
 * 2) start auditd, install any rule (I've only tested syscall
      auditing, but any syscall seems to work).
        /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
 * 3) run'n wait (this only loops twice for me before dying)
        ./a.out
 * 4) bask in instantaneous kernel oops.

here's xm info from dom0

[xen2.atl] root@gntb1:~# xm info
host                   : gntb1.atl.corp.google.com
release                : 3.2.13-ganeti-rx6-xen0
version                : #1 SMP Thu Jun 7 12:59:40 CEST 2012
machine                : x86_64
nr_cpus                : 12
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2660
hw_caps                : bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000
virt_caps              : hvm
total_memory           : 32755
free_memory            : 22665
node_to_cpu            : node0:0,2,4,6,8,10
                         node1:1,3,5,7,9,11
node_to_memory         : node0:13083
                         node1:9582
node_to_dma32_mem      : node0:0
                         node1:3235
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : placeholder dom0_mem=1024M loglvl=all com1=115200,8n1 console=com1 iommu=0
cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
cc_compile_by          : pmacedo
cc_compile_domain      : google.com
cc_compile_date        : Wed Mar 16 15:24:06 UTC 2011
xend_config_format     : 4

I'm not sure what you need from the domU. It's running 2.6.38.8 (but
I've seen this bug all the way up to 3.5.0-rc7, the latest I've
tested). It's a fairly beefy setup, 32G memory and 6 cpus.

I suspect xen as opposed to auditd because:

a) this only happens on our xen machines (though not all of them)
b) one of my stack traces started with

[172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

Anyone have any idea what's going on?

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
>

Did you try with a later Xen version? 4.0.1 is quite old.
For example the latest in the Xen 4.0.x series, which is 4.0.4? Or Xen 4.1.3?

-- Pasi

> steps to reproduce the crash:
>
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall
>       auditing, but any syscall seems to work).
>         /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>         ./a.out
>  * 4) bask in instantaneous kernel oops.
>
> here's xm info from dom0
>
> [...]
>
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>
> Anyone have any idea what's going on?
>
> Cheers,
> peter
>
> --
> Peter Moody      Google    1.650.253.7306
> Security Engineer          pgp:0xC3410038

> /*
>  * steps:
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall auditing, but any syscall seems to work).
>  *      /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>  *      ./a.out
>  * 4) bask in instantaneous kernel oops.
> [ 571.282777] ------------[ cut here ]------------
> [ 571.282786] kernel BUG at fs/buffer.c:1263!
> [ 571.282790] invalid opcode: 0000 [#1] SMP
> [ 571.282795] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
> [ 571.282798] CPU 0
> [ 571.282802] Pid: 7457, comm: a.out Not tainted 2.6.38.8-gg868-ganetixenu #1
> [ 571.282808] RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
> [ 571.282819] RSP: e02b:ffff88079b7ddc78  EFLAGS: 00010046
> [ 571.282822] RAX: ffff8807bc290000 RBX: ffff8806d9bb9a98 RCX: 00000000023dc17c
> [ 571.282826] RDX: 0000000000001000 RSI: 00000000023dc17c RDI: ffff8807fec29a00
> [ 571.282830] RBP: ffff88079b7ddcd8 R08: 0000000000000001 R09: ffff8806d9bb99c0
> [ 571.282834] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8806d9bb99c4
> [ 571.282839] R13: ffff8806d9bb99f0 R14: ffff8807feff9060 R15: 00000000023dc17c
> [ 571.282845] FS:  00007f8f6a76a7c0(0000) GS:ffff8807fff26000(0063) knlGS:0000000000000000
> [ 571.282849] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
> [ 571.282853] CR2: 00000000f76c6970 CR3: 00000007a250b000 CR4: 0000000000002660
> [ 571.282857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 571.282861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 571.282866] Process a.out (pid: 7457, threadinfo ffff88079b7dc000, task ffff8807786843e0)
> [ 571.282870] Stack:
> [ 571.282872]  ffff88079b7ddc98 ffffffff81654cd1 ffff88079b7ddca8 ffff8806d9bba440
> [ 571.282879]  ffff88079b7ddd08 ffffffff811c9294 ffff8807ffffffc3 0000000000000014
> [ 571.282887]  ffff8806d9bb9a98 ffff8806d9bb99c4 ffff8806d9bb99f0 ffff8807feff9060
> [ 571.282895] Call Trace:
> [ 571.282901]  [<ffffffff81654cd1>] ? down_read+0x11/0x30
> [ 571.282907]  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
> [ 571.282913]  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
> [ 571.282918]  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
> [ 571.282923]  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
> [ 571.282928]  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
> [ 571.282933]  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
> [ 571.282938]  [<ffffffff8114065f>] evict+0x1f/0xb0
> [ 571.282945]  [<ffffffff81006d52>] ? check_events+0x12/0x20
> [ 571.282949]  [<ffffffff81140c14>] iput+0x1a4/0x290
> [ 571.282955]  [<ffffffff8113ed05>] dput+0x265/0x310
> [ 571.282959]  [<ffffffff81132435>] path_put+0x15/0x30
> [ 571.282965]  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
> [ 571.282971]  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
> [ 571.282974] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
> [ 571.283027] RIP [<ffffffff81153853>] __find_get_block+0x1f3/0x200
> [ 571.283033] RSP <ffff88079b7ddc78>
> [ 571.283036] ---[ end trace 5975ffe20808ecd2 ]---
>  *
>  */
>
> #include <stdio.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define KILLDIR "/usr/local/tmp/crasher/kill_dir"
>
> int main(void) {
>   FILE *f;
>   char fullpath[512];
>   int i = 0;
>
>   while (1) {
>     fprintf(stderr, "%d ", i++);
>     mkdir(KILLDIR, 0777);
>     chdir(KILLDIR);
>     sprintf(fullpath, "%s/file", KILLDIR);
>     f = fopen(fullpath, "w+");
>     fprintf(f, "nothing to see here");
>     fclose(f);
>     unlink("/usr/local/tmp/crasher/kill_dir/file");
>     rmdir(KILLDIR);
>   }
>   return 0;
> }
>>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.

Do these kernel versions refer to plain upstream ones?

Is the subject referring to 4.0.1 in any way meaningful? I.e. does the
problem not occur with other Xen versions?

> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is a weak indication of a problem with Xen, but could as well just
indicate it's a problem that only gets surfaced under Xen.

It would certainly help if you included the full oops message (or
multiple of them if they're meaningfully different).

Jan
On Tue, Aug 14, 2012 at 09:46:28AM +0300, Pasi Kärkkäinen wrote:
> On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> > This seems to be some combination of Xen and the audit subsystem, but
> > the attached program crashes my machine 100% of the time.
> >
>
> Did you try with a later Xen version? 4.0.1 is quite old.
> For example the latest in the Xen 4.0.x series, which is 4.0.4? Or Xen 4.1.3?

This is 4.0.1 from Debian, so it has at least all the CVE fixes
applied. We haven't tried with a newer Xen yet, though.

regards,
iustin
On Tue, Aug 14, 2012 at 09:27:31AM +0100, Jan Beulich wrote:
> >>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> > I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> > I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> > tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> Do these kernel versions refer to plain upstream ones?

They are mostly Ubuntu kernels, so not vanilla.

> Is the subject referring to 4.0.1 in any way meaningful? I.e. does the
> problem not occur with other Xen versions?

I believe this was only related to the version we run, not that it's
known to be fixed in other Xen versions. We will try to test with newer
Xens to see.

regards,
iustin
On Tue, 2012-08-14 at 01:03 +0100, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
>
> steps to reproduce the crash:
>
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall
>       auditing, but any syscall seems to work).
>         /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>         ./a.out
>  * 4) bask in instantaneous kernel oops.
>
> here's xm info from dom0
>
> [...]
>
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is likely to be a coincidence IMHO, since this function forces a
call to the hypervisor to trigger the (re)injection of any pending
interrupts (typically after reenabling interrupts), so it is not
unusual for it to be at the bottom of any stack trace which happens in
interrupt context.

The example stack trace in crasher.c doesn't involve Xen -- can you
post any examples of ones which do?

Ian.
On Tue, Aug 14, 2012 at 2:19 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>
>> a) this only happens on our xen machines (though not all of them)
>> b) one of my stack traces started with
>>
>> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>
> This is likely to be a coincidence IMHO, since this function forces a
> call to the hypervisor to trigger the (re)injection of any pending
> interrupts (typically after reenabling interrupts), so it is not
> unusual for it to be at the bottom of any stack trace which happens in
> interrupt context.
>
> The example stack trace in crasher.c doesn't involve Xen -- can you
> post any examples of ones which do?

Hi Ian, here's the trace in question. I'm perfectly happy with this
not being a xen issue, if for no other reason than that it means I have
one less thing I need to look at. The python script in question was
essentially doing the same thing as crasher.c, though in the middle of
other, more productive activities.

Cheers,
peter

------------[ cut here ]------------
kernel BUG at fs/buffer.c:1263!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 3
Pid: 27277, comm: python2.6 Not tainted 2.6.38.8-gg868-ganetixenu #1
RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
RSP: e02b:ffff880496cffc78  EFLAGS: 00010046
RAX: ffff8807b9480000 RBX: ffff88049f172de8 RCX: 000000000086dafd
RDX: 0000000000001000 RSI: 000000000086dafd RDI: ffff8807ba4dd380
RBP: ffff880496cffcd8 R08: 0000000000000001 R09: ffff88049f172d10
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88049f172d14
R13: ffff88049f172d40 R14: ffff8807ba4b7228 R15: 000000000086dafd
FS:  00007f667a0ca700(0000) GS:ffff8807fff74000(0063) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000a130260 CR3: 00000004e978c000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python2.6 (pid: 27277, threadinfo ffff880496cfe000, task ffff8804b5a72d40)
Stack:
 ffff880496cffc98 ffffffff81654cd1 ffff880496cffca8 ffff88062d8d2440
 ffff880496cffd08 ffffffff811c9294 ffff8804ffffffc3 0000000000000014
 ffff88049f172de8 ffff88049f172d14 ffff88049f172d40 ffff8807ba4b7228
Call Trace:
 [<ffffffff81654cd1>] ? down_read+0x11/0x30
 [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
 [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
 [<ffffffff811bb104>] ext3_free_data+0x114/0x160
 [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
 [<ffffffff812133f5>] ? journal_start+0xb5/0x100
 [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
 [<ffffffff8114065f>] evict+0x1f/0xb0
 [<ffffffff81006d52>] ? check_events+0x12/0x20
 [<ffffffff81140c14>] iput+0x1a4/0x290
 [<ffffffff8113ed05>] dput+0x265/0x310
 [<ffffffff81132435>] path_put+0x15/0x30
 [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
 [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
 [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81006d52>] ? check_events+0x12/0x20
Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
 RSP <ffff880496cffc78>
---[ end trace d45267c89c4e0548 ]---

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
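[Editor's note: the Python script that hit this trace was never posted to the thread. A minimal sketch of what an equivalent loop might look like, assuming it simply mirrored crasher.c -- the base directory, function name, and iteration count below are illustrative, not taken from the original script:]

```python
# Hypothetical Python equivalent of crasher.c: repeatedly create a
# directory and a file inside it, then unlink and rmdir them, so each
# pass exercises the evict()/iput() path under audit_syscall_exit.
import os
import tempfile

# Illustrative base path; the original script's paths are unknown.
BASE = tempfile.mkdtemp(prefix="crasher-")
KILLDIR = os.path.join(BASE, "kill_dir")

def crash_loop(iterations):
    for i in range(iterations):
        os.mkdir(KILLDIR)
        path = os.path.join(KILLDIR, "file")
        with open(path, "w") as f:
            f.write("nothing to see here")
        os.unlink(path)    # remove the file...
        os.rmdir(KILLDIR)  # ...then its directory, dropping the inode

crash_loop(10)  # the real script presumably looped indefinitely
```

Each iteration ends in an unlink/rmdir pair, matching the dput -> iput -> ext3_evict_inode sequence visible in the trace above.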
>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
> Hi Ian, here's the trace in question. I'm perfectly happy with this
> not being a xen issue, if for no other reason than that it means I have
> one less thing I need to look at. The python script in question was
> essentially doing the same thing as crasher.c, though in the middle of
> other, more productive activities.
> ...
> Call Trace:
>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>  [<ffffffff8114065f>] evict+0x1f/0xb0
>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>  [<ffffffff81140c14>] iput+0x1a4/0x290
>  [<ffffffff8113ed05>] dput+0x265/0x310
>  [<ffffffff81132435>] path_put+0x15/0x30
>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>  [<ffffffff81006d52>] ? check_events+0x12/0x20

This obviously is just a leftover on the stack; one can see clearly
that we're in the middle of a syscall, which would never have
xen_force_evtchn_callback that deep in the stack (i.e. at the point
where we just came from user mode).

Jan
On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>> not being a xen issue, if for no other reason than that it means I have
>> one less thing I need to look at. The python script in question was
>> essentially doing the same thing as crasher.c, though in the middle of
>> other, more productive activities.
>> ...
>> Call Trace:
>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>  [<ffffffff81132435>] path_put+0x15/0x30
>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>
> This obviously is just a leftover on the stack; one can see clearly
> that we're in the middle of a syscall, which would never have
> xen_force_evtchn_callback that deep in the stack (i.e. at the point
> where we just came from user mode).

Interesting, thanks. Do you have any idea why something like this
would only be reproducible (thus far anyway; still trying to get my
hands on some other test systems) on xen? And not just xen, but on
this particular xen configuration (huge memory, lots of cpus, etc)? Is
this likely a race condition with the audit subsystem or some other
part of the kernel that this configuration somehow tickles?

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
>>> On 14.08.12 at 17:55, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>>> not being a xen issue, if for no other reason than that it means I have
>>> one less thing I need to look at. The python script in question was
>>> essentially doing the same thing as crasher.c, though in the middle of
>>> other, more productive activities.
>>> ...
>>> Call Trace:
>>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>>  [<ffffffff81132435>] path_put+0x15/0x30
>>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>
>> This obviously is just a leftover on the stack; one can see clearly
>> that we're in the middle of a syscall, which would never have
>> xen_force_evtchn_callback that deep in the stack (i.e. at the point
>> where we just came from user mode).
>
> Interesting, thanks. Do you have any idea why something like this
> would only be reproducible (thus far anyway; still trying to get my
> hands on some other test systems) on xen? And not just xen, but on
> this particular xen configuration (huge memory, lots of cpus, etc)? Is
> this likely a race condition with the audit subsystem or some other
> part of the kernel that this configuration somehow tickles?

From the above, as well as your indication that the traces are highly
variable between instances, I'd suppose this is memory corruption of
some sort, which can easily be hidden by all sorts of factors.

Until you can find a pattern, I don't think much can be done by anyone
not having an affected system available for debugging.

Jan
On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:

> From the above, as well as your indication that the traces are highly
> variable between instances, I'd suppose this is memory corruption of
> some sort, which can easily be hidden by all sorts of factors.
>
> Until you can find a pattern, I don't think much can be done by anyone
> not having an affected system available for debugging.

So I have such a system :).

Are there any pointers or tips you can give me to help me track down
the root cause? I realize that's a broad question, and a perfectly
justifiable answer is "read the memory management chapter of
understanding linux device drivers", but at this point basically any
advice you can give me is appreciated (and will most likely get me
closer to the solution).

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
>>> On 14.08.12 at 18:16, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
>> From the above, as well as your indication that the traces are highly
>> variable between instances, I'd suppose this is memory corruption of
>> some sort, which can easily be hidden by all sorts of factors.
>>
>> Until you can find a pattern, I don't think much can be done by anyone
>> not having an affected system available for debugging.
>
> So I have such a system :).

That's what I implied.

> Are there any pointers or tips you can give me to help me track down
> the root cause? I realize that's a broad question, and a perfectly
> justifiable answer is "read the memory management chapter of
> understanding linux device drivers", but at this point basically any
> advice you can give me is appreciated (and will most likely get me
> closer to the solution).

As said, figuring out a pattern in the crashes would likely help with
placing debug prints, breakpoints, or anything similar to aid in
detecting the presumed corruption earlier. Without a pattern, there's
regretfully not much I can suggest.

Jan
Just to close the loop over here: this is an audit bug, not a xen bug.

https://www.redhat.com/archives/linux-audit/2012-August/msg00018.html

Cheers,
peter

On Tue, Aug 14, 2012 at 9:26 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 14.08.12 at 18:16, Peter Moody <pmoody@google.com> wrote:
>> On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>
>>> From the above, as well as your indication that the traces are highly
>>> variable between instances, I'd suppose this is memory corruption of
>>> some sort, which can easily be hidden by all sorts of factors.
>>>
>>> Until you can find a pattern, I don't think much can be done by anyone
>>> not having an affected system available for debugging.
>>
>> So I have such a system :).
>
> That's what I implied.
>
>> Are there any pointers or tips you can give me to help me track down
>> the root cause? I realize that's a broad question, and a perfectly
>> justifiable answer is "read the memory management chapter of
>> understanding linux device drivers", but at this point basically any
>> advice you can give me is appreciated (and will most likely get me
>> closer to the solution).
>
> As said, figuring out a pattern in the crashes would likely help with
> placing debug prints, breakpoints, or anything similar to aid in
> detecting the presumed corruption earlier. Without a pattern, there's
> regretfully not much I can suggest.
>
> Jan

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038