Hey folks, I''m looking for some help debugging a kernel oops that only seems to manifest itself on xen instances. We have a small number of machines that seem to be having a weird interaction between auditd and xen. When under a high-ish amount of load, with auditd running, these machines crash almost 100% of the time. Here are two separate OOPs traces that appear to be from the same bug: from kernel 2.6.38.8 [32581.835885] ------------[ cut here ]------------ [32581.835894] kernel BUG at fs/buffer.c:1263! [32581.835898] invalid opcode: 0000 [#1] SMP [32581.835903] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings [32581.835906] CPU 0 [32581.835910] Pid: 14899, comm: python2.6 Not tainted 2.6.38.8-gg868-ganetixenu #1 [32581.835917] RIP: e030:[<ffffffff81153853>] [<ffffffff81153853>] __find_get_block+0x1f3/0x200 [32581.835927] RSP: e02b:ffff88067cfcdc78 EFLAGS: 00010046 [32581.835931] RAX: ffff8807be6b0000 RBX: ffff8807740c50f0 RCX: 00000000007e980a [32581.835935] RDX: 0000000000001000 RSI: 00000000007e980a RDI: ffff8807b8cad380 [32581.835939] RBP: ffff88067cfcdcd8 R08: 0000000000000001 R09: ffff8807740c5018 [32581.835944] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8807740c501c [32581.835949] R13: ffff8807740c5048 R14: ffff8807fef36210 R15: 00000000007e980a [32581.835955] FS: 00007f37fd752700(0000) GS:ffff8807fff26000(0063) knlGS:0000000000000000 [32581.835960] CS: e033 DS: 002b ES: 002b CR0: 000000008005003b [32581.835965] CR2: 00000000ea515000 CR3: 000000068522b000 CR4: 0000000000002660 [32581.835970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [32581.835975] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [32581.835979] Process python2.6 (pid: 14899, threadinfo ffff88067cfcc000, task ffff8806e772da80) [32581.835984] Stack: [32581.835986] ffff88067cfcdc98 ffffffff81654cd1 ffff88067cfcdca8 ffff8807740c5a98 [32581.835994] ffff88067cfcdd08 ffffffff811c9294 ffff8806ffffffc3 0000000000000014 [32581.836001] ffff8807740c50f0 ffff8807740c501c ffff8807740c5048 ffff8807fef36210 [32581.836009] Call Trace: [32581.836015] [<ffffffff81654cd1>] ? down_read+0x11/0x30 [32581.836021] [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0 [32581.836027] [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190 [32581.836032] [<ffffffff811bb104>] ext3_free_data+0x114/0x160 [32581.836037] [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950 [32581.836042] [<ffffffff812133f5>] ? journal_start+0xb5/0x100 [32581.836047] [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0 [32581.836052] [<ffffffff8114065f>] evict+0x1f/0xb0 [32581.836058] [<ffffffff81006d52>] ? check_events+0x12/0x20 [32581.836063] [<ffffffff81140c14>] iput+0x1a4/0x290 [32581.836068] [<ffffffff8113ed05>] dput+0x265/0x310 [32581.836072] [<ffffffff81132435>] path_put+0x15/0x30 [32581.836078] [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260 [32581.836084] [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f [32581.836088] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 [32581.836141] RIP [<ffffffff81153853>] __find_get_block+0x1f3/0x200 [32581.836146] RSP <ffff88067cfcdc78> [32581.836157] ---[ end trace 0658a2308b35c81e ]--- and from kernel 3.2.5 [12109.709499] ------------[ cut here ]------------ [12109.709508] kernel BUG at fs/buffer.c:1265! [12109.709512] invalid opcode: 0000 [#1] SMP [12109.709517] CPU 0 [12109.709521] Pid: 21385, comm: python2.6 Not tainted 3.2.5-gg886-ganetixenu #1 [12109.709527] RIP: e030:[<ffffffff81166473>] [<ffffffff81166473>] __find_get_block+0x1f3/0x200 [12109.709536] RSP: e02b:ffff88061fa99c68 EFLAGS: 00010046 [12109.709540] RAX: ffff8807c91e0000 RBX: ffff88028e268cf8 RCX: 0000000002523819 [12109.709544] RDX: 0000000000001000 RSI: 0000000002523819 RDI: ffff8807dede2080 [12109.709549] RBP: ffff88061fa99cc8 R08: 0000000000000001 R09: ffff88028e268c20 [12109.709553] R10: ffff8806cf8bc000 R11: ffffffff81c30e08 R12: ffff88028e268c24 [12109.709557] R13: ffff88028e268c50 R14: ffff8807defe8528 R15: 0000000002523819 [12109.709563] FS: 00007fa1f4026700(0000) GS:ffff8807fff2b000(0063) knlGS:0000000000000000 [12109.709567] CS: e033 DS: 002b ES: 002b CR0: 000000008005003b [12109.709571] CR2: 0000000002bb6000 CR3: 00000006c8730000 CR4: 0000000000002660 [12109.709575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12109.709580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [12109.709584] Process python2.6 (pid: 21385, threadinfo ffff88061fa98000, task ffff8806e37796a0) [12109.709589] Stack: [12109.709591] ffff88061fa99cb8 ffffffff81143001 ffff88061fa99c98 ffffffff81699f91 [12109.709599] ffff88061fa99d28 ffff8806fcb79610 ffff88061fa99d08 ffffffff811e5d34 [12109.709606] ffff88028e268cf8 ffff88028e268c24 ffff88028e268c50 ffff8807defe8528 [12109.709613] Call Trace: [12109.709619] [<ffffffff81143001>] ? generic_permission+0x101/0x250 [12109.709625] [<ffffffff81699f91>] ? down_read+0x11/0x30 [12109.709631] [<ffffffff811e5d34>] ? ext3_xattr_get+0xf4/0x2b0 [12109.709635] [<ffffffff811d0ac8>] ext3_clear_blocks+0x128/0x190 [12109.709640] [<ffffffff811d0c44>] ext3_free_data+0x114/0x160 [12109.709644] [<ffffffff811d13da>] ext3_truncate+0x50a/0x610 [12109.709649] [<ffffffff811465c6>] ? path_lookupat+0x66/0x690 [12109.709654] [<ffffffff8112b5d7>] ? kmem_cache_alloc+0x37/0x120 [12109.709659] [<ffffffff812393b5>] ? journal_start+0xb5/0x100 [12109.709664] [<ffffffff811d2638>] ext3_evict_inode+0x218/0x280 [12109.709669] [<ffffffff81151051>] evict+0xa1/0x1a0 [12109.709673] [<ffffffff81151258>] iput+0x108/0x210 [12109.709677] [<ffffffff8114f385>] dput+0x285/0x330 [12109.709681] [<ffffffff81142425>] path_put+0x15/0x30 [12109.709686] [<ffffffff810afb11>] audit_syscall_exit+0x171/0x260 [12109.709692] [<ffffffff8169df6a>] sysexit_audit+0x21/0x5f [12109.709695] Code: 82 a0 ff 00 00 85 c0 75 de 65 48 89 1c 25 a0 ff 00 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 [12109.709752] RIP [<ffffffff81166473>] __find_get_block+0x1f3/0x200 [12109.709757] RSP <ffff88061fa99c68> [12109.709763] ---[ end trace 842635a4dd3e3b5f ]--- That BUG() line is: static inline void check_irqs_on(void) { #ifdef irqs_disabled BUG_ON(irqs_disabled()); #endif } so it seems like something about the load is either causing __find_get_block to be called while irqs are disabled or there''s something mucking with the irq flags. Finally, here''s the relevant bits from xm info on one of the affected nodes: # xm info host : release : 3.2.13-ganeti-rx6-xen0 version : #1 SMP Thu Jun 7 12:59:40 CEST 2012 machine : x86_64 nr_cpus : 24 nr_nodes : 2 cores_per_socket : 6 threads_per_core : 2 cpu_mhz : 3200 hw_caps : bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000 virt_caps : total_memory : 98292 free_memory : 82906 node_to_cpu : node0:0-5,12-17 node1:6-11,18-23 node_to_memory : node0:40810 node1:42095 node_to_dma32_mem : node0:1969 node1:0 max_node_id : 1 xen_major : 4 xen_minor : 0 xen_extra : .1 xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p xen_scheduler : credit xen_pagesize : 4096 platform_params : virt_start=0xffff800000000000 xen_changeset : unavailable xen_commandline : placeholder dom0_mem=2048M loglvl=all com1=115200,8n1 console=com1 iommu=0 cc_compiler : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) cc_compile_by : cc_compile_domain : google.com cc_compile_date : Wed Mar 16 15:24:06 UTC 2011 xend_config_format : 4 I should be getting access to one of these machines shortly, but I''m new to xen-debugging, so I was hoping that someone might be able to give me some hints on debugging xen-specific kernel issues. Cheers, peter -- Peter Moody Google 1.650.253.7306 Security Engineer pgp:0xC3410038
Peter Moody
2012-Jul-25 17:32 UTC
Re: Advice/hints on debugging a xen/auditd kernel issue.
Is there a more suitable list for this question? perhaps xen-devel? On Mon, Jul 23, 2012 at 8:26 AM, Peter Moody <pmoody@google.com> wrote:> Hey folks, > > I''m looking for some help debugging a kernel oops that only seems to > manifest itself on xen instances. > > We have a small number of machines that seem to be having a weird > interaction between auditd and xen. When under a high-ish amount of > load, with auditd running, these machines crash almost 100% of the > time. > > Here are two separate OOPs traces that appear to be from the same bug: > > from kernel 2.6.38.8 > > [32581.835885] ------------[ cut here ]------------ > [32581.835894] kernel BUG at fs/buffer.c:1263! > [32581.835898] invalid opcode: 0000 [#1] SMP > [32581.835903] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings > [32581.835906] CPU 0 > [32581.835910] Pid: 14899, comm: python2.6 Not tainted > 2.6.38.8-gg868-ganetixenu #1 > [32581.835917] RIP: e030:[<ffffffff81153853>] [<ffffffff81153853>] > __find_get_block+0x1f3/0x200 > [32581.835927] RSP: e02b:ffff88067cfcdc78 EFLAGS: 00010046 > [32581.835931] RAX: ffff8807be6b0000 RBX: ffff8807740c50f0 RCX: 00000000007e980a > [32581.835935] RDX: 0000000000001000 RSI: 00000000007e980a RDI: ffff8807b8cad380 > [32581.835939] RBP: ffff88067cfcdcd8 R08: 0000000000000001 R09: ffff8807740c5018 > [32581.835944] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8807740c501c > [32581.835949] R13: ffff8807740c5048 R14: ffff8807fef36210 R15: 00000000007e980a > [32581.835955] FS: 00007f37fd752700(0000) GS:ffff8807fff26000(0063) > knlGS:0000000000000000 > [32581.835960] CS: e033 DS: 002b ES: 002b CR0: 000000008005003b > [32581.835965] CR2: 00000000ea515000 CR3: 000000068522b000 CR4: 0000000000002660 > [32581.835970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [32581.835975] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [32581.835979] Process python2.6 (pid: 14899, threadinfo > ffff88067cfcc000, task ffff8806e772da80) > [32581.835984] Stack: > [32581.835986] ffff88067cfcdc98 ffffffff81654cd1 ffff88067cfcdca8 > ffff8807740c5a98 > [32581.835994] ffff88067cfcdd08 ffffffff811c9294 ffff8806ffffffc3 > 0000000000000014 > [32581.836001] ffff8807740c50f0 ffff8807740c501c ffff8807740c5048 > ffff8807fef36210 > [32581.836009] Call Trace: > [32581.836015] [<ffffffff81654cd1>] ? down_read+0x11/0x30 > [32581.836021] [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0 > [32581.836027] [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190 > [32581.836032] [<ffffffff811bb104>] ext3_free_data+0x114/0x160 > [32581.836037] [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950 > [32581.836042] [<ffffffff812133f5>] ? journal_start+0xb5/0x100 > [32581.836047] [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0 > [32581.836052] [<ffffffff8114065f>] evict+0x1f/0xb0 > [32581.836058] [<ffffffff81006d52>] ? check_events+0x12/0x20 > [32581.836063] [<ffffffff81140c14>] iput+0x1a4/0x290 > [32581.836068] [<ffffffff8113ed05>] dput+0x265/0x310 > [32581.836072] [<ffffffff81132435>] path_put+0x15/0x30 > [32581.836078] [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260 > [32581.836084] [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f > [32581.836088] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05 > 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff > 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 > 49 89 > [32581.836141] RIP [<ffffffff81153853>] __find_get_block+0x1f3/0x200 > [32581.836146] RSP <ffff88067cfcdc78> > [32581.836157] ---[ end trace 0658a2308b35c81e ]--- > > and from kernel 3.2.5 > > [12109.709499] ------------[ cut here ]------------ > [12109.709508] kernel BUG at fs/buffer.c:1265! > [12109.709512] invalid opcode: 0000 [#1] SMP > [12109.709517] CPU 0 > [12109.709521] Pid: 21385, comm: python2.6 Not tainted 3.2.5-gg886-ganetixenu #1 > [12109.709527] RIP: e030:[<ffffffff81166473>] [<ffffffff81166473>] > __find_get_block+0x1f3/0x200 > [12109.709536] RSP: e02b:ffff88061fa99c68 EFLAGS: 00010046 > [12109.709540] RAX: ffff8807c91e0000 RBX: ffff88028e268cf8 RCX: 0000000002523819 > [12109.709544] RDX: 0000000000001000 RSI: 0000000002523819 RDI: ffff8807dede2080 > [12109.709549] RBP: ffff88061fa99cc8 R08: 0000000000000001 R09: ffff88028e268c20 > [12109.709553] R10: ffff8806cf8bc000 R11: ffffffff81c30e08 R12: ffff88028e268c24 > [12109.709557] R13: ffff88028e268c50 R14: ffff8807defe8528 R15: 0000000002523819 > [12109.709563] FS: 00007fa1f4026700(0000) GS:ffff8807fff2b000(0063) > knlGS:0000000000000000 > [12109.709567] CS: e033 DS: 002b ES: 002b CR0: 000000008005003b > [12109.709571] CR2: 0000000002bb6000 CR3: 00000006c8730000 CR4: 0000000000002660 > [12109.709575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [12109.709580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [12109.709584] Process python2.6 (pid: 21385, threadinfo > ffff88061fa98000, task ffff8806e37796a0) > [12109.709589] Stack: > [12109.709591] ffff88061fa99cb8 ffffffff81143001 ffff88061fa99c98 > ffffffff81699f91 > [12109.709599] ffff88061fa99d28 ffff8806fcb79610 ffff88061fa99d08 > ffffffff811e5d34 > [12109.709606] ffff88028e268cf8 ffff88028e268c24 ffff88028e268c50 > ffff8807defe8528 > [12109.709613] Call Trace: > [12109.709619] [<ffffffff81143001>] ? generic_permission+0x101/0x250 > [12109.709625] [<ffffffff81699f91>] ? down_read+0x11/0x30 > [12109.709631] [<ffffffff811e5d34>] ? ext3_xattr_get+0xf4/0x2b0 > [12109.709635] [<ffffffff811d0ac8>] ext3_clear_blocks+0x128/0x190 > [12109.709640] [<ffffffff811d0c44>] ext3_free_data+0x114/0x160 > [12109.709644] [<ffffffff811d13da>] ext3_truncate+0x50a/0x610 > [12109.709649] [<ffffffff811465c6>] ? path_lookupat+0x66/0x690 > [12109.709654] [<ffffffff8112b5d7>] ? kmem_cache_alloc+0x37/0x120 > [12109.709659] [<ffffffff812393b5>] ? journal_start+0xb5/0x100 > [12109.709664] [<ffffffff811d2638>] ext3_evict_inode+0x218/0x280 > [12109.709669] [<ffffffff81151051>] evict+0xa1/0x1a0 > [12109.709673] [<ffffffff81151258>] iput+0x108/0x210 > [12109.709677] [<ffffffff8114f385>] dput+0x285/0x330 > [12109.709681] [<ffffffff81142425>] path_put+0x15/0x30 > [12109.709686] [<ffffffff810afb11>] audit_syscall_exit+0x171/0x260 > [12109.709692] [<ffffffff8169df6a>] sysexit_audit+0x21/0x5f > [12109.709695] Code: 82 a0 ff 00 00 85 c0 75 de 65 48 89 1c 25 a0 ff > 00 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff > 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 > 49 89 > [12109.709752] RIP [<ffffffff81166473>] __find_get_block+0x1f3/0x200 > [12109.709757] RSP <ffff88061fa99c68> > [12109.709763] ---[ end trace 842635a4dd3e3b5f ]--- > > That BUG() line is: > > static inline void check_irqs_on(void) > { > #ifdef irqs_disabled > BUG_ON(irqs_disabled()); > #endif > } > > so it seems like something about the load is either causing > __find_get_block to be called while irqs are disabled or there''s > something mucking with the irq flags. > > Finally, here''s the relevant bits from xm info on one of the affected nodes: > > # xm info > host : > release : 3.2.13-ganeti-rx6-xen0 > version : #1 SMP Thu Jun 7 12:59:40 CEST 2012 > machine : x86_64 > nr_cpus : 24 > nr_nodes : 2 > cores_per_socket : 6 > threads_per_core : 2 > cpu_mhz : 3200 > hw_caps : > bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000 > virt_caps : > total_memory : 98292 > free_memory : 82906 > node_to_cpu : node0:0-5,12-17 > node1:6-11,18-23 > node_to_memory : node0:40810 > node1:42095 > node_to_dma32_mem : node0:1969 > node1:0 > max_node_id : 1 > xen_major : 4 > xen_minor : 0 > xen_extra : .1 > xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p > xen_scheduler : credit > xen_pagesize : 4096 > platform_params : virt_start=0xffff800000000000 > xen_changeset : unavailable > xen_commandline : placeholder dom0_mem=2048M loglvl=all > com1=115200,8n1 console=com1 iommu=0 > cc_compiler : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) > cc_compile_by : > cc_compile_domain : google.com > cc_compile_date : Wed Mar 16 15:24:06 UTC 2011 > xend_config_format : 4 > > I should be getting access to one of these machines shortly, but I''m > new to xen-debugging, so I was hoping that someone might be able to > give me some hints on debugging xen-specific kernel issues. > > Cheers, > peter > > -- > Peter Moody Google 1.650.253.7306 > Security Engineer pgp:0xC3410038-- Peter Moody Google 1.650.253.7306 Security Engineer pgp:0xC3410038