This seems to be some combination of Xen and the audit subsystem, but
the attached program crashes my machine 100% of the time.

steps to reproduce the crash:

 * 1) compile with gcc -m32
 * 2) start auditd, install any rule (I've only tested syscall
      auditing, but any syscall seems to work).
        /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
 * 3) run'n wait (this only loops twice for me before dying)
        ./a.out
 * 4) bask in instantaneous kernel oops.

here's xm info from dom0

[xen2.atl] root@gntb1:~# xm info
host                   : gntb1.atl.corp.google.com
release                : 3.2.13-ganeti-rx6-xen0
version                : #1 SMP Thu Jun 7 12:59:40 CEST 2012
machine                : x86_64
nr_cpus                : 12
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2660
hw_caps                : bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000
virt_caps              : hvm
total_memory           : 32755
free_memory            : 22665
node_to_cpu            : node0:0,2,4,6,8,10
                         node1:1,3,5,7,9,11
node_to_memory         : node0:13083
                         node1:9582
node_to_dma32_mem      : node0:0
                         node1:3235
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : placeholder dom0_mem=1024M loglvl=all com1=115200,8n1 console=com1 iommu=0
cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
cc_compile_by          : pmacedo
cc_compile_domain      : google.com
cc_compile_date        : Wed Mar 16 15:24:06 UTC 2011
xend_config_format     : 4

I'm not sure what you need from the domU. It's running 2.6.38.8 (but
I've seen this bug all the way up to 3.5.0-rc7, the latest I've
tested). It's a fairly beefy setup, 32G memory and 6 cpus.

I suspect xen as opposed to auditd because:

a) this only happens on our xen machines (though not all of them)
b) one of my stack traces started with

[172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

Anyone have any idea what's going on?

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
>

Did you try with a later Xen version? 4.0.1 is quite old.
For example the latest in the Xen 4.0.x series, which is 4.0.4? Or Xen 4.1.3?

-- Pasi

> steps to reproduce the crash:
>
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall
>       auditing, but any syscall seems to work).
>         /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>         ./a.out
>  * 4) bask in instantaneous kernel oops.
>
> here's xm info from dom0
>
> [...]
>
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>
> Anyone have any idea what's going on?
>
> Cheers,
> peter
>
> --
> Peter Moody      Google    1.650.253.7306
> Security Engineer          pgp:0xC3410038

> /*
>  * steps:
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall auditing, but any syscall seems to work).
>  *      /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>  *      ./a.out
>  * 4) bask in instantaneous kernel oops.
> [ 571.282777] ------------[ cut here ]------------
> [ 571.282786] kernel BUG at fs/buffer.c:1263!
> [ 571.282790] invalid opcode: 0000 [#1] SMP
> [ 571.282795] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
> [ 571.282798] CPU 0
> [ 571.282802] Pid: 7457, comm: a.out Not tainted 2.6.38.8-gg868-ganetixenu #1
> [ 571.282808] RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
> [ 571.282819] RSP: e02b:ffff88079b7ddc78  EFLAGS: 00010046
> [ 571.282822] RAX: ffff8807bc290000 RBX: ffff8806d9bb9a98 RCX: 00000000023dc17c
> [ 571.282826] RDX: 0000000000001000 RSI: 00000000023dc17c RDI: ffff8807fec29a00
> [ 571.282830] RBP: ffff88079b7ddcd8 R08: 0000000000000001 R09: ffff8806d9bb99c0
> [ 571.282834] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8806d9bb99c4
> [ 571.282839] R13: ffff8806d9bb99f0 R14: ffff8807feff9060 R15: 00000000023dc17c
> [ 571.282845] FS:  00007f8f6a76a7c0(0000) GS:ffff8807fff26000(0063) knlGS:0000000000000000
> [ 571.282849] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
> [ 571.282853] CR2: 00000000f76c6970 CR3: 00000007a250b000 CR4: 0000000000002660
> [ 571.282857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 571.282861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 571.282866] Process a.out (pid: 7457, threadinfo ffff88079b7dc000, task ffff8807786843e0)
> [ 571.282870] Stack:
> [ 571.282872]  ffff88079b7ddc98 ffffffff81654cd1 ffff88079b7ddca8 ffff8806d9bba440
> [ 571.282879]  ffff88079b7ddd08 ffffffff811c9294 ffff8807ffffffc3 0000000000000014
> [ 571.282887]  ffff8806d9bb9a98 ffff8806d9bb99c4 ffff8806d9bb99f0 ffff8807feff9060
> [ 571.282895] Call Trace:
> [ 571.282901]  [<ffffffff81654cd1>] ? down_read+0x11/0x30
> [ 571.282907]  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
> [ 571.282913]  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
> [ 571.282918]  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
> [ 571.282923]  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
> [ 571.282928]  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
> [ 571.282933]  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
> [ 571.282938]  [<ffffffff8114065f>] evict+0x1f/0xb0
> [ 571.282945]  [<ffffffff81006d52>] ? check_events+0x12/0x20
> [ 571.282949]  [<ffffffff81140c14>] iput+0x1a4/0x290
> [ 571.282955]  [<ffffffff8113ed05>] dput+0x265/0x310
> [ 571.282959]  [<ffffffff81132435>] path_put+0x15/0x30
> [ 571.282965]  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
> [ 571.282971]  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
> [ 571.282974] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
> [ 571.283027] RIP [<ffffffff81153853>] __find_get_block+0x1f3/0x200
> [ 571.283033] RSP <ffff88079b7ddc78>
> [ 571.283036] ---[ end trace 5975ffe20808ecd2 ]---
>  *
>  */
>
> #include <stdio.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define KILLDIR "/usr/local/tmp/crasher/kill_dir"
>
> int main(void) {
>   FILE *f;
>   char fullpath[512];
>   int i = 0;
>
>   while (1) {
>     fprintf(stderr, "%d ", i++);
>     mkdir(KILLDIR, 0777);
>     chdir(KILLDIR);
>     sprintf(fullpath, "%s/file", KILLDIR);
>     f = fopen(fullpath, "w+");
>     fprintf(f, "nothing to see here");
>     fclose(f);
>     unlink("/usr/local/tmp/crasher/kill_dir/file");
>     rmdir(KILLDIR);
>   }
>   return 0;
> }
>>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.

Do these kernel versions refer to plain upstream ones?

Is the subject referring to 4.0.1 in any way meaningful? I.e. does the
problem not occur with other Xen versions?

> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is a weak indication of a problem with Xen, but could as well just
indicate it's a problem that only gets surfaced under Xen.

It would certainly help if you included the full oops message (or
multiple of them if they're meaningfully different).

Jan
On Tue, Aug 14, 2012 at 09:46:28AM +0300, Pasi Kärkkäinen wrote:
> On Mon, Aug 13, 2012 at 05:03:06PM -0700, Peter Moody wrote:
> > This seems to be some combination of Xen and the audit subsystem, but
> > the attached program crashes my machine 100% of the time.
> >
>
> Did you try with a later Xen version? 4.0.1 is quite old.
> For example the latest in the Xen 4.0.x series, which is 4.0.4? Or Xen 4.1.3?

This is 4.0.1 from Debian, so it has at least all the CVE fixes
applied. We haven't tried with a newer Xen yet, though.

regards,
iustin
On Tue, Aug 14, 2012 at 09:27:31AM +0100, Jan Beulich wrote:
> >>> On 14.08.12 at 02:03, Peter Moody <pmoody@google.com> wrote:
> > I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> > I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> > tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> Do these kernel versions refer to plain upstream ones?

They are mostly Ubuntu kernels, so not vanilla.

> Is the subject referring to 4.0.1 in any way meaningful? I.e. does the
> problem not occur with other Xen versions?

I believe this was only related to the version we run, not that it's
known to be fixed in other Xen versions. We will try to test with newer
Xens to see.

regards,
iustin
On Tue, 2012-08-14 at 01:03 +0100, Peter Moody wrote:
> This seems to be some combination of Xen and the audit subsystem, but
> the attached program crashes my machine 100% of the time.
>
> steps to reproduce the crash:
>
>  * 1) compile with gcc -m32
>  * 2) start auditd, install any rule (I've only tested syscall
>       auditing, but any syscall seems to work).
>         /etc/init.d/auditd start ; auditctl -D ; auditctl -a exit,always -F arch=64 -S chmod
>  * 3) run'n wait (this only loops twice for me before dying)
>         ./a.out
>  * 4) bask in instantaneous kernel oops.
>
> here's xm info from dom0
>
> [...]
>
> I'm not sure what you need from the domU. It's running 2.6.38.8 (but
> I've seen this bug all the way up to 3.5.0-rc7, the latest I've
> tested). It's a fairly beefy setup, 32G memory and 6 cpus.
>
> I suspect xen as opposed to auditd because:
>
> a) this only happens on our xen machines (though not all of them)
> b) one of my stack traces started with
>
> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10

This is likely to be a coincidence IMHO, since this function forces a
call to the hypervisor to trigger the (re)injection of any pending
interrupts (typically after reenabling interrupts), so it is not
unusual for it to be at the bottom of any stack trace which happens in
interrupt context.

The example stack trace in crasher.c doesn't involve Xen -- can you
post any examples of ones which do?

Ian.
On Tue, Aug 14, 2012 at 2:19 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>
>> a) this only happens on our xen machines (though not all of them)
>> b) one of my stack traces started with
>>
>> [172577.560441] [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>
> This is likely to be a coincidence IMHO, since this function forces a
> call to the hypervisor to trigger the (re)injection of any pending
> interrupts (typically after reenabling interrupts), so it is not
> unusual for it to be at the bottom of any stack trace which happens in
> interrupt context.
>
> The example stack trace in crasher.c doesn't involve Xen -- can you
> post any examples of ones which do?

Hi Ian, here's the trace in question. I'm perfectly happy with this
not being a xen issue, if for no other reason than that it means I have
one less thing I need to look at. The python script in question was
essentially doing the same thing as crasher.c, though in the middle of
other, more productive activities.

Cheers,
peter

------------[ cut here ]------------
kernel BUG at fs/buffer.c:1263!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 3
Pid: 27277, comm: python2.6 Not tainted 2.6.38.8-gg868-ganetixenu #1
RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
RSP: e02b:ffff880496cffc78  EFLAGS: 00010046
RAX: ffff8807b9480000 RBX: ffff88049f172de8 RCX: 000000000086dafd
RDX: 0000000000001000 RSI: 000000000086dafd RDI: ffff8807ba4dd380
RBP: ffff880496cffcd8 R08: 0000000000000001 R09: ffff88049f172d10
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88049f172d14
R13: ffff88049f172d40 R14: ffff8807ba4b7228 R15: 000000000086dafd
FS:  00007f667a0ca700(0000) GS:ffff8807fff74000(0063) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000a130260 CR3: 00000004e978c000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python2.6 (pid: 27277, threadinfo ffff880496cfe000, task ffff8804b5a72d40)
Stack:
 ffff880496cffc98 ffffffff81654cd1 ffff880496cffca8 ffff88062d8d2440
 ffff880496cffd08 ffffffff811c9294 ffff8804ffffffc3 0000000000000014
 ffff88049f172de8 ffff88049f172d14 ffff88049f172d40 ffff8807ba4b7228
Call Trace:
 [<ffffffff81654cd1>] ? down_read+0x11/0x30
 [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
 [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
 [<ffffffff811bb104>] ext3_free_data+0x114/0x160
 [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
 [<ffffffff812133f5>] ? journal_start+0xb5/0x100
 [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
 [<ffffffff8114065f>] evict+0x1f/0xb0
 [<ffffffff81006d52>] ? check_events+0x12/0x20
 [<ffffffff81140c14>] iput+0x1a4/0x290
 [<ffffffff8113ed05>] dput+0x265/0x310
 [<ffffffff81132435>] path_put+0x15/0x30
 [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
 [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
 [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81006d52>] ? check_events+0x12/0x20
Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89
RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
 RSP <ffff880496cffc78>
---[ end trace d45267c89c4e0548 ]---

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
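[Editor's note: the Python script that hit this trace was never posted to the thread. A minimal sketch of what an equivalent loop might look like, assuming it simply mirrored crasher.c -- the base directory, function name, and iteration count below are illustrative, not taken from the original script:]

```python
# Hypothetical Python equivalent of crasher.c: repeatedly create a
# directory and a file inside it, then unlink and rmdir them, so each
# pass exercises the evict()/iput() path under audit_syscall_exit.
import os
import tempfile

# Illustrative base path; the original script's paths are unknown.
BASE = tempfile.mkdtemp(prefix="crasher-")
KILLDIR = os.path.join(BASE, "kill_dir")

def crash_loop(iterations):
    for i in range(iterations):
        os.mkdir(KILLDIR)
        path = os.path.join(KILLDIR, "file")
        with open(path, "w") as f:
            f.write("nothing to see here")
        os.unlink(path)    # remove the file...
        os.rmdir(KILLDIR)  # ...then its directory, dropping the inode

crash_loop(10)  # the real script presumably looped indefinitely
```

Each iteration ends in an unlink/rmdir pair, matching the dput -> iput -> ext3_evict_inode sequence visible in the trace above.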
>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
> Hi Ian, here's the trace in question. I'm perfectly happy with this
> not being a xen issue, if for no other reason than that it means I have
> one less thing I need to look at. The python script in question was
> essentially doing the same thing as crasher.c, though in the middle of
> other, more productive activities.
> ...
> Call Trace:
>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>  [<ffffffff8114065f>] evict+0x1f/0xb0
>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>  [<ffffffff81140c14>] iput+0x1a4/0x290
>  [<ffffffff8113ed05>] dput+0x265/0x310
>  [<ffffffff81132435>] path_put+0x15/0x30
>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>  [<ffffffff81006d52>] ? check_events+0x12/0x20

This obviously is just a leftover on the stack; one can see clearly
that we're in the middle of a syscall, which would never have
xen_force_evtchn_callback that deep in the stack (i.e. at the point
where we just came from user mode).

Jan
On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>> not being a xen issue, if for no other reason than that it means I have
>> one less thing I need to look at. The python script in question was
>> essentially doing the same thing as crasher.c, though in the middle of
>> other, more productive activities.
>> ...
>> Call Trace:
>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>  [<ffffffff81132435>] path_put+0x15/0x30
>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>
> This obviously is just a leftover on the stack; one can see clearly
> that we're in the middle of a syscall, which would never have
> xen_force_evtchn_callback that deep in the stack (i.e. at the point
> where we just came from user mode).

Interesting, thanks. Do you have any idea why something like this
would only be reproducible (thus far anyway; still trying to get my
hands on some other test systems) on xen? And not just xen, but on
this particular xen configuration (huge memory, lots of cpus, etc)? Is
this likely a race condition with the audit subsystem or some other
part of the kernel that this configuration somehow tickles?

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
>>> On 14.08.12 at 17:55, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 7:47 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 14.08.12 at 16:42, Peter Moody <pmoody@google.com> wrote:
>>> Hi Ian, here's the trace in question. I'm perfectly happy with this
>>> not being a xen issue, if for no other reason than that it means I have
>>> one less thing I need to look at. The python script in question was
>>> essentially doing the same thing as crasher.c, though in the middle of
>>> other, more productive activities.
>>> ...
>>> Call Trace:
>>>  [<ffffffff81654cd1>] ? down_read+0x11/0x30
>>>  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
>>>  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
>>>  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
>>>  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
>>>  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
>>>  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
>>>  [<ffffffff8114065f>] evict+0x1f/0xb0
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>>  [<ffffffff81140c14>] iput+0x1a4/0x290
>>>  [<ffffffff8113ed05>] dput+0x265/0x310
>>>  [<ffffffff81132435>] path_put+0x15/0x30
>>>  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
>>>  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
>>>  [<ffffffff810065ad>] ? xen_force_evtchn_callback+0xd/0x10
>>>  [<ffffffff81006d52>] ? check_events+0x12/0x20
>>
>> This obviously is just a leftover on the stack; one can see clearly
>> that we're in the middle of a syscall, which would never have
>> xen_force_evtchn_callback that deep in the stack (i.e. at the point
>> where we just came from user mode).
>
> Interesting, thanks. Do you have any idea why something like this
> would only be reproducible (thus far anyway; still trying to get my
> hands on some other test systems) on xen? And not just xen, but on
> this particular xen configuration (huge memory, lots of cpus, etc)? Is
> this likely a race condition with the audit subsystem or some other
> part of the kernel that this configuration somehow tickles?

From the above, as well as your indication that the traces are highly
variable between instances, I'd suppose this is memory corruption of
some sort, which can easily be hidden by all sorts of factors.

Until you can find a pattern, I don't think much can be done by anyone
not having an affected system available for debugging.

Jan
On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:

> From the above, as well as your indication that the traces are highly
> variable between instances, I'd suppose this is memory corruption of
> some sort, which can easily be hidden by all sorts of factors.
>
> Until you can find a pattern, I don't think much can be done by anyone
> not having an affected system available for debugging.

So I have such a system :).

Are there any pointers or tips you can give me to help me track down
the root cause? I realize that's a broad question, and a perfectly
justifiable answer is "read the memory management chapter of
understanding linux device drivers", but at this point basically any
advice you can give me is appreciated (and will most likely get me
closer to the solution).

Cheers,
peter

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038
>>> On 14.08.12 at 18:16, Peter Moody <pmoody@google.com> wrote:
> On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
>> From the above, as well as your indication that the traces are highly
>> variable between instances, I'd suppose this is memory corruption of
>> some sort, which can easily be hidden by all sorts of factors.
>>
>> Until you can find a pattern, I don't think much can be done by anyone
>> not having an affected system available for debugging.
>
> So I have such a system :).

That's what I implied.

> Are there any pointers or tips you can give me to help me track down
> the root cause? I realize that's a broad question, and a perfectly
> justifiable answer is "read the memory management chapter of
> understanding linux device drivers", but at this point basically any
> advice you can give me is appreciated (and will most likely get me
> closer to the solution).

As said, figuring out a pattern in the crashes would likely help with
placing debug prints, breakpoints, or anything similar to aid in
detecting the presumed corruption earlier. Without a pattern, there's
regretfully not much I can suggest.

Jan
Just to close the loop over here: this is an audit bug, not a xen bug.

https://www.redhat.com/archives/linux-audit/2012-August/msg00018.html

Cheers,
peter

On Tue, Aug 14, 2012 at 9:26 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 14.08.12 at 18:16, Peter Moody <pmoody@google.com> wrote:
>> On Tue, Aug 14, 2012 at 9:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>
>>> From the above, as well as your indication that the traces are highly
>>> variable between instances, I'd suppose this is memory corruption of
>>> some sort, which can easily be hidden by all sorts of factors.
>>>
>>> Until you can find a pattern, I don't think much can be done by anyone
>>> not having an affected system available for debugging.
>>
>> So I have such a system :).
>
> That's what I implied.
>
>> Are there any pointers or tips you can give me to help me track down
>> the root cause? I realize that's a broad question, and a perfectly
>> justifiable answer is "read the memory management chapter of
>> understanding linux device drivers", but at this point basically any
>> advice you can give me is appreciated (and will most likely get me
>> closer to the solution).
>
> As said, figuring out a pattern in the crashes would likely help with
> placing debug prints, breakpoints, or anything similar to aid in
> detecting the presumed corruption earlier. Without a pattern, there's
> regretfully not much I can suggest.
>
> Jan

--
Peter Moody      Google    1.650.253.7306
Security Engineer          pgp:0xC3410038