Charles Duffy
2005-Dec-23 19:03 UTC
[Xen-devel] DomU Oopsing on xen-3.0-testing changeset 8259
One of my DomUs is sporadically oopsing, roughly once per day. This was first observed on a pre-3.0-release changeset; after upgrading to changeset 8259 on the xen-3.0-testing branch (after the release), it still occurs. This effectively kills the instance when it occurs -- worse, the instance in question *stays* down even though panic=5 is specified as an extra parameter to be passed to the DomU kernel.

The text of the oops is attached, as are my kernel configs (which are a touch nonstandard).

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Charles Duffy
2006-Jun-19 13:12 UTC
[Xen-devel] Oops in xen 3.0.2 dequeue_signal [was: Re: DomU Oopsing on xen-3.0-testing changeset 8259]
I'm seeing the same behavior I previously reported against xen-3.0-testing changeset 8259, albeit much more sporadically, on Xen 3.0.2 (with a 2.6.16.16 kernel built via the Gentoo Xen packages). I'd use stock XenSource binaries, but last I checked they don't have support for some of my hardware (i.e. the 3w9xxx driver).

Hints on anything I can do to provide more detailed information (in the hopes of actually getting this fixed) would be welcome.

ksymoops output looks like the following:

RIP: e030:[<ffffffff8013b1e3>] <ffffffff8013b1e3>{__dequeue_signal+259}
Using defaults from ksymoops -t elf64-x86-64 -a i386:x86-64
RSP: e02b:ffff88003144fe38 EFLAGS: 00010446
RAX: 0000000000000000 RBX: ffff88000b3e06d0 RCX: 0000000000000009
RDX: 0000000000000200 RSI: ffff88003144feb0 RDI: 0000000000000000
RBP: ffff88003144fe68 R08: ffff88003144e000 R09: 0000000000000000
R10: 0000000000000060 R11: 00000000fffffffa R12: ffff88001b05c950
R13: 000000000000000a R14: ffff88003144feb8 R15: 000000000000000a
FS: 00002b464b3890a0(0063) GS:ffffffff80535000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Stack: 000000003144fe88 0000000000000000 ffff88003144feb0 ffff88003144feb8
       ffff88000b3e00c0 0000000000000000 ffff88003144fe98 ffffffff8013b2f0
       0000000000000000 0000000000000000
Call Trace: <ffffffff8013b2f0>{dequeue_signal+48} <ffffffff8013dcf4>{sys_rt_sigtimedwait+596}
       <ffffffff8013e02a>{do_tkill+250} <ffffffff8013ab12>{recalc_sigpending+18}
       <ffffffff8013d621>{sigprocmask+225} <ffffffff8013d79c>{sys_rt_sigprocmask+268}
       <ffffffff8010b27e>{system_call+134} <ffffffff8010b1f8>{system_call+0}
Code: 48 8b 00 0f 18 08 48 39 df 75 e4 4d 85 e4 74 64 49 8b 54 24

>>RIP; ffffffff8013b1e3 <__dequeue_signal+103/1e0>   <====

>>RBX; ffff88000b3e06d0 <__start___xen_guest+ffff88000b3d42ea/ffffffff800f3c1a>
>>RSI; ffff88003144feb0 <__start___xen_guest+ffff880031443aca/ffffffff800f3c1a>
>>RBP; ffff88003144fe68 <__start___xen_guest+ffff880031443a82/ffffffff800f3c1a>
>>R08; ffff88003144e000 <__start___xen_guest+ffff880031441c1a/ffffffff800f3c1a>
>>R11; 00000000fffffffa <__start___xen_guest+ffff3c14/ffffffff800f3c1a>
>>R12; ffff88001b05c950 <__start___xen_guest+ffff88001b05056a/ffffffff800f3c1a>
>>R14; ffff88003144feb8 <__start___xen_guest+ffff880031443ad2/ffffffff800f3c1a>

Trace; ffffffff8013b2f0 <dequeue_signal+30/e0>
Trace; ffffffff8013e02a <do_tkill+fa/150>
Trace; ffffffff8013d621 <sigprocmask+e1/150>
Trace; ffffffff8010b27e <system_call+86/8b>

Code; ffffffff8013b1e3 <__dequeue_signal+103/1e0>
0000000000000000 <_RIP>:
Code; ffffffff8013b1e3 <__dequeue_signal+103/1e0>   <====
   0: 48 8b 00          mov (%rax),%rax   <====
Code; ffffffff8013b1e6 <__dequeue_signal+106/1e0>
   3: 0f 18 08          prefetcht0 (%rax)
Code; ffffffff8013b1e9 <__dequeue_signal+109/1e0>
   6: 48 39 df          cmp %rbx,%rdi
Code; ffffffff8013b1ec <__dequeue_signal+10c/1e0>
   9: 75 e4             jne ffffffffffffffef <_RIP+0xffffffffffffffef>
Code; ffffffff8013b1ee <__dequeue_signal+10e/1e0>
   b: 4d 85 e4          test %r12,%r12
Code; ffffffff8013b1f1 <__dequeue_signal+111/1e0>
   e: 74 64             je 74 <_RIP+0x74>
Code; ffffffff8013b1f3 <__dequeue_signal+113/1e0>
  10: 49 8b 54 24 00    mov 0x0(%r12),%rdx

CR2: 0000000000000000
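A note on reading the two notations in this dump: the raw oops line says `__dequeue_signal+259`, while ksymoops prints `__dequeue_signal+103/1e0`. These agree if the kernel prints the offset in decimal and ksymoops prints offset and size in hexadecimal. That is an assumption, but every symbol that appears in both forms above is consistent with it. A minimal Python sketch of the check follows; the helper name `offsets_agree` is hypothetical and the patterns only cover the entry shapes seen in this trace.

```python
import re

# Raw oops entries look like {symbol+259}; the offset is assumed to be decimal.
RAW_ENTRY = re.compile(r"\{(\w+)\+(\d+)\}")
# ksymoops entries look like <symbol+103/1e0>; offset and size are assumed hex.
KSYM_ENTRY = re.compile(r"<(\w+)\+([0-9a-f]+)/([0-9a-f]+)>")


def offsets_agree(raw_entry: str, ksym_entry: str) -> bool:
    """Return True if both entries name the same symbol and byte offset."""
    raw = RAW_ENTRY.search(raw_entry)
    ksym = KSYM_ENTRY.search(ksym_entry)
    if raw is None or ksym is None:
        return False
    same_symbol = raw.group(1) == ksym.group(1)
    same_offset = int(raw.group(2), 10) == int(ksym.group(2), 16)
    return same_symbol and same_offset


# Pairs copied from the Call Trace and Trace; lines of the oops above.
pairs = [
    ("{__dequeue_signal+259}", "<__dequeue_signal+103/1e0>"),
    ("{dequeue_signal+48}", "<dequeue_signal+30/e0>"),
    ("{do_tkill+250}", "<do_tkill+fa/150>"),
    ("{sigprocmask+225}", "<sigprocmask+e1/150>"),
    ("{system_call+134}", "<system_call+86/8b>"),
]

for raw, ksym in pairs:
    print(raw, ksym, offsets_agree(raw, ksym))
```

Read this way, the faulting instruction is `mov (%rax),%rax` at byte 0x103 of `__dequeue_signal`, with RAX and CR2 both zero in the dump.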
Keir Fraser
2006-Jun-19 13:59 UTC
Re: [Xen-devel] Oops in xen 3.0.2 dequeue_signal [was: Re: DomU Oopsing on xen-3.0-testing changeset 8259]
On 19 Jun 2006, at 14:12, Charles Duffy wrote:

> I'm seeing the same behavior I previously reported against
> xen-3.0-testing changeset 8259, albeit much more sporadically, on Xen
> 3.0.2 (with a 2.6.16.16 kernel built via the Gentoo Xen packages). I'd
> use stock XenSource binaries, but last I checked they don't have
> support for some of my hardware (i.e. the 3w9xxx driver).
>
> Hints on anything I can do to provide more detailed information (in
> the hopes of actually getting this fixed) would be welcome.

Does it always crash in __dequeue_signal()? You might have to add some tracing in there to find out exactly which part of the function it is crashing in.

 -- Keir
Keir Fraser
2006-Jun-19 14:00 UTC
Re: [Xen-devel] Oops in xen 3.0.2 dequeue_signal [was: Re: DomU Oopsing on xen-3.0-testing changeset 8259]
On 19 Jun 2006, at 14:12, Charles Duffy wrote:

> I'm seeing the same behavior I previously reported against
> xen-3.0-testing changeset 8259, albeit much more sporadically, on Xen
> 3.0.2 (with a 2.6.16.16 kernel built via the Gentoo Xen packages). I'd
> use stock XenSource binaries, but last I checked they don't have
> support for some of my hardware (i.e. the 3w9xxx driver).
>
> Hints on anything I can do to provide more detailed information (in
> the hopes of actually getting this fixed) would be welcome.

Also, try upgrading to 3.0-testing tip and see if you still get the problem.

 -- Keir
Charles Duffy
2006-Jun-26 21:15 UTC
Re: [Xen-devel] Oops in xen 3.0.2 dequeue_signal [was: Re: DomU Oopsing on xen-3.0-testing changeset 8259]
Keir Fraser wrote:
> Does it always crash in __dequeue_signal()?

Yes.

> You might have to add some tracing in there to find out exactly which
> part of the function it is crashing in.

Is there any way to do that with less performance impact than adding printks to what I presume is a very frequently called function? This system is used by internal personnel; while its performance and uptime aren't critical, it would be a good thing if they weren't impacted more than necessary.

> Also, try upgrading to 3.0-testing tip and see if you still get the problem.

Should it be adequate to upgrade this DomU only, or is there cause to also upgrade the hypervisor and Dom0?

(I'm also building a kernel with debug symbols; I anticipate that this will let me get an annotated disassembly of the code in question, and thus figure out which line of source maps to the instruction offset from the top of the function we're in... at least, in theory).
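One way to approach the source-line mapping described in the last paragraph, without touching the running kernel, is to work out the function's address range from the faulting RIP and its offset, then disassemble only that range of a debug-enabled vmlinux. The Python sketch below only builds the command line; it does not run anything. The helper name and the `vmlinux` path are hypothetical, the numbers are taken from the oops posted earlier in this thread, and it assumes GNU objdump (`-d`, `-S`, `-l`, `--start-address`, `--stop-address`).

```python
def objdump_command(vmlinux, rip, offset, size):
    """Build an objdump invocation covering the function that contains rip.

    rip    -- faulting instruction address from the oops
    offset -- byte offset of rip within the function (hex part before '/')
    size   -- function size in bytes (hex part after '/')
    """
    start = rip - offset   # first byte of the function
    stop = start + size    # first byte after the function
    return [
        "objdump", "-d", "-S", "-l",
        f"--start-address={start:#x}",
        f"--stop-address={stop:#x}",
        vmlinux,
    ]


# Values from ">>RIP; ffffffff8013b1e3 <__dequeue_signal+103/1e0>".
cmd = objdump_command("vmlinux", 0xffffffff8013b1e3, 0x103, 0x1e0)
print(" ".join(cmd))
```

With a vmlinux built with debug info, `-S` interleaves source with the disassembly and `-l` adds file and line markers, so the instruction at the faulting address can be read off next to its source line.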
Charles Duffy
2006-Jul-18 20:20 UTC
[Xen-devel] Re: Oops in xen 3.0.2 dequeue_signal [was: Re: DomU Oopsing on xen-3.0-testing changeset 8259]
Keir Fraser wrote:
> On 19 Jun 2006, at 14:12, Charles Duffy wrote:
>
>> I'm seeing the same behavior I previously reported against
>> xen-3.0-testing changeset 8259, albeit much more sporadically, on Xen
>> 3.0.2 (with a 2.6.16.16 kernel built via the Gentoo Xen packages). I'd
>> use stock XenSource binaries, but last I checked they don't have
>> support for some of my hardware (i.e. the 3w9xxx driver).
>>
>> Hints on anything I can do to provide more detailed information (in
>> the hopes of actually getting this fixed) would be welcome.
>
> Does it always crash in __dequeue_signal()? You might have to add some
> tracing in there to find out exactly which part of the function it is
> crashing in.

Okay. I've rebuilt against a debug-enabled kernel, and (on getting another panic) disassembled vmlinux to try to match the failing instruction to an individual source line. The crash appears to occur at the second instruction generated for kernel/signal.c:1976 (from Linux-2.6.16.16+Xen 3.0.2):

kernel/signal.c:1976
        /* Run the handler. */
        *return_ka = *ka;
ffffffff8013d152: 48 8b 75 d0    mov 0xffffffffffffffd0(%rbp),%rsi
ffffffff8013d156: 48 89 06       mov %rax,(%rsi)    <<<=== HERE
ffffffff8013d159: 48 8b 42 f0    mov 0xfffffffffffffff0(%rdx),%rax
ffffffff8013d15d: 48 89 46 08    mov %rax,0x8(%rsi)
ffffffff8013d161: 48 8b 42 f8    mov 0xfffffffffffffff8(%rdx),%rax
ffffffff8013d165: 48 89 46 10    mov %rax,0x10(%rsi)
ffffffff8013d169: 48 8b 41 18    mov 0x18(%rcx),%rax
ffffffff8013d16d: 48 89 46 18    mov %rax,0x18(%rsi)

My x86 assembler is tremendously rusty, but it looks to me like return_ka (which is passed in as a parameter to get_signal_to_deliver) points somewhere it shouldn't. This parameter is passed in from arch/x86_64/kernel/signal.c's do_signal(), where it's declared as a function-local variable with its home on the stack.

The code all looks fine at a glance -- but since the top of the stack is at ffff88013e87fe18, it doesn't make much sense for a variable living on the stack, defined just a few calls ago, to be at 7c51186269a192da. I'm guessing there's some kind of funky race condition going on -- but beyond that vague assertion, I'm pretty much lost. Ideas, anyone?

ksymoops output follows:

CPU 0
Pid: 16571, comm: java Not tainted 2.6.16.18-xen #4
RIP: e030:[<ffffffff8013d156>] <ffffffff8013d156>{get_signal_to_deliver+662}
Using defaults from ksymoops -t elf64-x86-64 -a i386:x86-64
RSP: e02b:ffff88013e87fdc8 EFLAGS: 00010406
RAX: 00002ab89d447a1b RBX: 000000000000000a RCX: ffff88000061eb68
RDX: ffff88000061eb80 RSI: 7c51186269a192da RDI: ffff880144962750
RBP: ffff88013e87fe18 R08: 0000000000000000 R09: 0000000000003a66
R10: 0000000000000000 R11: ffffffff8010b27e R12: 000000000000000a
R13: ffff88013e87fe48 R14: 0000000000000008 R15: ffff88013e87fe48
FS: 00002b47b9c0f900(0063) GS:ffffffff80535000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Stack: ffff88013e87fe68 7acefa865eaca248 26ab946c27ba950b 46b67a71dd1c67e3
       7c51186269a192da 1287d8a161cad8d5 60de4c306b46ae9f a035a0ac294ee773
       6cd46345a1e152ae 228b761ceaf9a045
Call Trace: <ffffffff8010b27e>{system_call+134} <ffffffff8010ad69>{sys_rt_sigsuspend+249}
       <ffffffff8010b681>{ptregscall_common+61}
Code: 48 89 06 48 8b 42 f0 48 89 46 08 48 8b 42 f8 48 89 46 10 48

>>RIP; ffffffff8013d156 <get_signal_to_deliver+296/6e0>   <====

>>RAX; 00002ab89d447a1b <__crc_ioctl_by_bdev+2ab79d5e3940/fffffffe8029bf25>
>>RCX; ffff88000061eb68 <__crc_ioctl_by_bdev+ffff87ff007baa8d/fffffffe8029bf25>
>>RDX; ffff88000061eb80 <__crc_ioctl_by_bdev+ffff87ff007baaa5/fffffffe8029bf25>
>>RSI; 7c51186269a192da <__crc_ioctl_by_bdev+7c51186169bb51ff/fffffffe8029bf25>
>>RDI; ffff880144962750 <__crc_ioctl_by_bdev+ffff880044afe675/fffffffe8029bf25>
>>RBP; ffff88013e87fe18 <__crc_ioctl_by_bdev+ffff88003ea1bd3d/fffffffe8029bf25>
>>R11; ffffffff8010b27e <system_call+86/8b>
>>R13; ffff88013e87fe48 <__crc_ioctl_by_bdev+ffff88003ea1bd6d/fffffffe8029bf25>
>>R15; ffff88013e87fe48 <__crc_ioctl_by_bdev+ffff88003ea1bd6d/fffffffe8029bf25>

Trace; ffffffff8010b27e <system_call+86/8b>
Trace; ffffffff8010b681 <ptregscall_common+3d/64>

Code; ffffffff8013d156 <get_signal_to_deliver+296/6e0>
0000000000000000 <_RIP>:
Code; ffffffff8013d156 <get_signal_to_deliver+296/6e0>   <====
   0: 48 89 06       mov %rax,(%rsi)   <====
Code; ffffffff8013d159 <get_signal_to_deliver+299/6e0>
   3: 48 8b 42 f0    mov 0xfffffffffffffff0(%rdx),%rax
Code; ffffffff8013d15d <get_signal_to_deliver+29d/6e0>
   7: 48 89 46 08    mov %rax,0x8(%rsi)
Code; ffffffff8013d161 <get_signal_to_deliver+2a1/6e0>
   b: 48 8b 42 f8    mov 0xfffffffffffffff8(%rdx),%rax
Code; ffffffff8013d165 <get_signal_to_deliver+2a5/6e0>
   f: 48 89 46 10    mov %rax,0x10(%rsi)
Code; ffffffff8013d169 <get_signal_to_deliver+2a9/6e0>
  13: 48 00 00       rex64 add %al,(%rax)
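The register and stack dump above allow one more cross-check on where the bad RSI value came from. The instruction before the faulting one loads RSI from `0xffffffffffffffd0(%rbp)`, that is, from RBP minus 0x30. The Python sketch below looks that slot up in the dumped stack words and tests whether the value is a canonical x86-64 address. It assumes the `Stack:` line starts at RSP and that addresses are 48 bits wide; the helper names are hypothetical.

```python
WORD = 8  # bytes per stack word on x86-64


def stack_word_at(address, rsp, stack_words):
    """Return the dumped word stored at address, or None if not in the dump."""
    delta = address - rsp
    if delta < 0 or delta % WORD != 0:
        return None
    index = delta // WORD
    return stack_words[index] if index < len(stack_words) else None


def is_canonical(address):
    """48-bit canonical form: bits 63..47 are all zeros or all ones."""
    top = address >> 47
    return top == 0 or top == 0x1FFFF


# Values copied from the oops above.
rsp = 0xffff88013e87fdc8
rbp = 0xffff88013e87fe18
rsi = 0x7c51186269a192da
stack = [
    0xffff88013e87fe68, 0x7acefa865eaca248, 0x26ab946c27ba950b,
    0x46b67a71dd1c67e3, 0x7c51186269a192da, 0x1287d8a161cad8d5,
    0x60de4c306b46ae9f, 0xa035a0ac294ee773, 0x6cd46345a1e152ae,
    0x228b761ceaf9a045,
]

slot = rbp - 0x30  # 0xffffffffffffffd0 is -0x30 as a signed 64-bit value
print(hex(slot))
print(stack_word_at(slot, rsp, stack) == rsi)
print(is_canonical(rsi), is_canonical(rsp))
```

Under those assumptions the slot is ffff88013e87fde8, the fifth word of the dump, and it holds exactly the value found in RSI. So the pointer was read from a stack slot that already contained a non-canonical value, and the neighbouring words do not look like kernel addresses either. This narrows the question to what overwrote that region of the stack; it does not by itself identify the cause.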