Xen: (XEN) Xen version 4.1.5-pre (maker@(none)) (gcc version 4.4.5 (Debian 4.4.5-8) ) Mon Feb 4 12:59:38 EST 2013 Dom0: 3.7.7-1-x86_64 #1 SMP Thu Feb 14 15:58:35 EST 2013 x86_64 GNU/Linux DomU: 3.7.10 (32 bit) We''re seeing this bug pop up fairly regularly. The BUG below is from the first time it''s triggered after a reboot - subsequent triggers produce similar tracebacks with the kernel flagged as "Tainted: G D" - I can provide more examples if desired. ------------[ cut here ]------------ kernel BUG at arch/x86/xen/enlighten.c:425! invalid opcode: 0000 [#1] SMP Modules linked in: Pid: 3158, comm: ntpd Not tainted 3.7.10-linode49 #2 EIP: 0061:[<c0102e8a>] EFLAGS: 00210282 CPU: 0 EIP is at set_aliased_prot+0x9a/0x110 EAX: ffffffea EBX: ee2e6000 ECX: 2c74a063 EDX: 80000019 ESI: 00000000 EDI: 00002000 EBP: 2c74a063 ESP: ead9da10 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069 CR0: 8005003b CR2: 4a258bc4 CR3: 2aa7a000 CR4: 00002660 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process ntpd (pid: 3158, ti=ead9c000 task=eb6e3300 task.ti=ead9c000) Stack: 0007fb6b ee2e6000 80000019 ec1ed000 00000001 ee2e6000 00000200 00002000 eb6e3300 c0102f27 eb19cc80 eb19cc80 00000000 c010ba5f eb19cc80 eb19cc80 c013083c c010863b c078c405 00000000 00000000 c091e1c0 c015ade1 00000000 Call Trace: [<c0102f27>] ? xen_free_ldt+0x27/0x40 [<c010ba5f>] ? destroy_context+0x2f/0xa0 [<c013083c>] ? __mmdrop+0x1c/0xb0 [<c010863b>] ? __switch_to+0x13b/0x3a0 [<c078c405>] ? _raw_spin_lock+0x5/0x10 [<c015ade1>] ? finish_task_switch+0xb1/0xc0 [<c078b77d>] ? __schedule+0x1fd/0x590 [<c0639aa7>] ? get_unique_tuple+0x167/0x200 [<c06258c2>] ? nf_ct_invert_tuple+0x52/0x70 [<c078c62f>] ? _raw_spin_lock_bh+0xf/0x20 [<c078ad9d>] ? schedule_hrtimeout_range_clock+0x10d/0x120 [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 [<c014ecca>] ? add_wait_queue+0x1a/0x50 [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 [<c078adbf>] ? schedule_hrtimeout_range+0xf/0x20 [<c01e4a47>] ? poll_schedule_timeout+0x37/0x50 [<c01e57f5>] ? do_select+0x445/0x510 [<c04f97f5>] ? info_for_irq+0x5/0x20 [<c04f9eb0>] ? evtchn_from_irq+0x10/0x40 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c01e4be0>] ? __pollwait+0xf0/0xf0 [<c069c55c>] ? udp_sendmsg+0x2ec/0x820 [<c0678250>] ? ip_append_page+0x4e0/0x4e0 [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 [<c05d9cc2>] ? __skb_recv_datagram+0xf2/0x260 [<c078c62f>] ? _raw_spin_lock_bh+0xf/0x20 [<c05d1586>] ? lock_sock_fast+0x16/0x50 [<c069bf76>] ? udp_recvmsg+0x76/0x2e0 [<c06a293c>] ? inet_recvmsg+0x5c/0xb0 [<c06a3817>] ? inet_sendmsg+0x47/0xb0 [<c05ce847>] ? sock_recvmsg+0xe7/0x100 [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 [<c0112017>] ? save_xstate_sig+0x297/0x310 [<c01e5a6d>] ? core_sys_select+0x1ad/0x2a0 [<c014165f>] ? __set_task_blocked+0x2f/0x80 [<c01419a9>] ? set_current_blocked+0x49/0x60 [<c0142fd4>] ? signal_delivered+0x44/0x60 [<c0111002>] ? fpu_finit+0x52/0x60 [<c018adb6>] ? __audit_syscall_exit+0x3b6/0x3f0 [<c0112749>] ? syscall_trace_leave+0xd9/0x110 [<c01e5dcf>] ? sys_select+0x2f/0xb0 [<c078c844>] ? syscall_call+0x7/0xb Code: d2 0f a4 c2 0c c1 e0 0c 09 da 09 c6 89 f0 ff 15 d8 34 92 c0 31 f6 89 c5 8b 5c 24 04 89 54 24 08 89 c1 e8 3a e3 ff ff 85 c0 74 04 <0f> 0b eb fe 8b 04 24 8b 54 24 0c c1 e0 05 8b 04 10 c1 e8 1e 69 EIP: [<c0102e8a>] set_aliased_prot+0x9a/0x110 SS:ESP 0069:ead9da10 ---[ end trace a8da5f670b3c74bb ]--- Several processes have been seen to trigger it - audispd gam java klogd ntpd perl syslogd Eventually init will trigger it and cause a kernel panic: Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b I''ve done some Googling to see if this is a known issue. All I could find was a closed Debian bug from Dec 2010 with no resolution: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=607709 Please let me know if you need further info from me. Thanks, James Sinclair _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 08.03.13 at 03:23, James Sinclair <james.sinclair@linode.com> wrote: > Xen: (XEN) Xen version 4.1.5-pre (maker@(none)) (gcc version 4.4.5 (Debian > 4.4.5-8) ) Mon Feb 4 12:59:38 EST 2013 > Dom0: 3.7.7-1-x86_64 #1 SMP Thu Feb 14 15:58:35 EST 2013 x86_64 GNU/Linux > DomU: 3.7.10 (32 bit) > > We''re seeing this bug pop up fairly regularly. The BUG below is from the > first time it''s triggered after a reboot - subsequent triggers produce similar > tracebacks with the kernel flagged as "Tainted: G D" - I can provide more > examples if desired.Two fundamental things that are missing: For one, this is almost certainly being accompanied by some hypervisor message(s), so you will want to provide the hypervisor log (making sure the log level is high enough). And then, even more with you using a not yet released hypervisor version, you would want to clarify whether this is a regression of some sort (i.e. whether a certain hypervisor and/or kernel version are known not to exhibit this).> ------------[ cut here ]------------ > kernel BUG at arch/x86/xen/enlighten.c:425! > invalid opcode: 0000 [#1] SMP > Modules linked in: > Pid: 3158, comm: ntpd Not tainted 3.7.10-linode49 #2 > EIP: 0061:[<c0102e8a>] EFLAGS: 00210282 CPU: 0 > EIP is at set_aliased_prot+0x9a/0x110 > EAX: ffffffea EBX: ee2e6000 ECX: 2c74a063 EDX: 80000019 > ESI: 00000000 EDI: 00002000 EBP: 2c74a063 ESP: ead9da10 > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069 > CR0: 8005003b CR2: 4a258bc4 CR3: 2aa7a000 CR4: 00002660 > DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > DR6: ffff0ff0 DR7: 00000400 > Process ntpd (pid: 3158, ti=ead9c000 task=eb6e3300 task.ti=ead9c000) > Stack: > 0007fb6b ee2e6000 80000019 ec1ed000 00000001 ee2e6000 00000200 00002000 > eb6e3300 c0102f27 eb19cc80 eb19cc80 00000000 c010ba5f eb19cc80 eb19cc80 > c013083c c010863b c078c405 00000000 00000000 c091e1c0 c015ade1 00000000 > Call Trace: > [<c0102f27>] ? xen_free_ldt+0x27/0x40Quite unusual to have a process with LDT these days. And hence quite likely for this to be the culprit. The primary guess therefore would be that the LDT didn''t get cleared yet (MMUEXT_SET_LDT). Jan> [<c010ba5f>] ? destroy_context+0x2f/0xa0 > [<c013083c>] ? __mmdrop+0x1c/0xb0 > [<c010863b>] ? __switch_to+0x13b/0x3a0 > [<c078c405>] ? _raw_spin_lock+0x5/0x10 > [<c015ade1>] ? finish_task_switch+0xb1/0xc0 > [<c078b77d>] ? __schedule+0x1fd/0x590 > [<c0639aa7>] ? get_unique_tuple+0x167/0x200 > [<c06258c2>] ? nf_ct_invert_tuple+0x52/0x70 > [<c078c62f>] ? _raw_spin_lock_bh+0xf/0x20 > [<c078ad9d>] ? schedule_hrtimeout_range_clock+0x10d/0x120 > [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 > [<c014ecca>] ? add_wait_queue+0x1a/0x50 > [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 > [<c078adbf>] ? schedule_hrtimeout_range+0xf/0x20 > [<c01e4a47>] ? poll_schedule_timeout+0x37/0x50 > [<c01e57f5>] ? do_select+0x445/0x510 > [<c04f97f5>] ? info_for_irq+0x5/0x20 > [<c04f9eb0>] ? evtchn_from_irq+0x10/0x40 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c01e4be0>] ? __pollwait+0xf0/0xf0 > [<c069c55c>] ? udp_sendmsg+0x2ec/0x820 > [<c0678250>] ? ip_append_page+0x4e0/0x4e0 > [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 > [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 > [<c05d9cc2>] ? __skb_recv_datagram+0xf2/0x260 > [<c078c62f>] ? _raw_spin_lock_bh+0xf/0x20 > [<c05d1586>] ? lock_sock_fast+0x16/0x50 > [<c069bf76>] ? udp_recvmsg+0x76/0x2e0 > [<c06a293c>] ? inet_recvmsg+0x5c/0xb0 > [<c06a3817>] ? inet_sendmsg+0x47/0xb0 > [<c05ce847>] ? sock_recvmsg+0xe7/0x100 > [<c078c437>] ? _raw_spin_lock_irqsave+0x27/0x40 > [<c078c481>] ? _raw_spin_unlock_irqrestore+0x11/0x20 > [<c0112017>] ? save_xstate_sig+0x297/0x310 > [<c01e5a6d>] ? core_sys_select+0x1ad/0x2a0 > [<c014165f>] ? __set_task_blocked+0x2f/0x80 > [<c01419a9>] ? set_current_blocked+0x49/0x60 > [<c0142fd4>] ? signal_delivered+0x44/0x60 > [<c0111002>] ? fpu_finit+0x52/0x60 > [<c018adb6>] ? __audit_syscall_exit+0x3b6/0x3f0 > [<c0112749>] ? syscall_trace_leave+0xd9/0x110 > [<c01e5dcf>] ? sys_select+0x2f/0xb0 > [<c078c844>] ? syscall_call+0x7/0xb > Code: d2 0f a4 c2 0c c1 e0 0c 09 da 09 c6 89 f0 ff 15 d8 34 92 c0 31 f6 > 89 c5 8b 5c 24 04 89 54 24 08 89 c1 e8 3a e3 ff ff 85 c0 74 04 <0f> 0b eb fe 8b > 04 24 8b 54 24 0c c1 e0 05 8b 04 10 c1 e8 1e 69 > EIP: [<c0102e8a>] set_aliased_prot+0x9a/0x110 SS:ESP 0069:ead9da10 > ---[ end trace a8da5f670b3c74bb ]--- > > Several processes have been seen to trigger it - audispd gam java klogd ntpd > perl syslogd > Eventually init will trigger it and cause a kernel panic: > > Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b > > I''ve done some Googling to see if this is a known issue. All I could find > was a closed Debian bug from Dec 2010 with no resolution: > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=607709 > > Please let me know if you need further info from me. > > Thanks, > James Sinclair
>>> On 12.03.13 at 04:56, James Sinclair <james.sinclair@linode.com> wrote:(re-adding xen-devel to Cc)> I''ve done further testing and I have confirmed that I can trigger the bug > pretty reliably on several Xen versions: > > * xen-3.4.4 > * xen-4.1.4 > * xen-4.1.5-pre > > Dom0 kernels range between 2.6.18 & 3.7.7-1. I''ve also tried quite a few DomU > kernels and I can trigger it in everything I''ve tried between 2.6.39.1 and > 3.7.10. In the end, I haven''t managed to find a combination of Xen, Dom0 & > DomU where I cannot trigger it.Partly because you apparently never tried non-pvops DomU kernels. But that''s not the point here.> As for the hypervisor log, I''m not seeing anything get logged at all. I > suspect that I''m not capturing them correctly. Using Xen 4.1.4 I added the > following to the command line: > > loglvl=all guest_loglvl=all sync_console console_to_ring earlyprintk=xen > debug loglevel=8Now, this mixture of hypervisor and kernel options already suggests that you don''t look in the right place. Please put Xen options on the Xen command line, and kernel ones on the kernel line.> I''m seeing extra logs from Xen during boot, but nothing when the bug is > triggered in the DomU. I''m assuming any extra logging will show up on the > console or in /var/log/xen/xend.log - is there somewhere else I should be > looking? Or something I''m not setting correctly?They show up on the serial console (if in use) or in the output of "xl dmesg" (or "xm dmesg" if xend is in use). Without extra precautions, they _won''t_ show up anywhere under /var/log. Jan
On 12/03/2013, at 6:45 PM, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 12.03.13 at 04:56, James Sinclair <james.sinclair@linode.com> wrote: > > (re-adding xen-devel to Cc) > >> I''ve done further testing and I have confirmed that I can trigger the bug >> pretty reliably on several Xen versions: >> >> * xen-3.4.4 >> * xen-4.1.4 >> * xen-4.1.5-pre >> >> Dom0 kernels range between 2.6.18 & 3.7.7-1. I''ve also tried quite a few DomU >> kernels and I can trigger it in everything I''ve tried between 2.6.39.1 and >> 3.7.10. In the end, I haven''t managed to find a combination of Xen, Dom0 & >> DomU where I cannot trigger it. > > Partly because you apparently never tried non-pvops DomU kernels. > But that''s not the point here. > >> As for the hypervisor log, I''m not seeing anything get logged at all. I >> suspect that I''m not capturing them correctly. Using Xen 4.1.4 I added the >> following to the command line: >> >> loglvl=all guest_loglvl=all sync_console console_to_ring earlyprintk=xen >> debug loglevel=8 > > Now, this mixture of hypervisor and kernel options already > suggests that you don''t look in the right place. Please put Xen > options on the Xen command line, and kernel ones on the kernel > line.The documentation I found that mentioned using "earlyprintk=xen debug loglevel=8" did not make it clear those are kernel options. I''ve corrected that now: Hypervisor options: loglvl=all guest_loglvl=all sync_console console_to_ring Kernel options: earlyprintk=xen debug loglevel=8> >> I''m seeing extra logs from Xen during boot, but nothing when the bug is >> triggered in the DomU. I''m assuming any extra logging will show up on the >> console or in /var/log/xen/xend.log - is there somewhere else I should be >> looking? Or something I''m not setting correctly? > > They show up on the serial console (if in use) or in the output of > "xl dmesg" (or "xm dmesg" if xend is in use). Without extra > precautions, they _won''t_ show up anywhere under /var/log.I''m not seeing anything extra logged on the serial console nor in "xl dmesg" when the bug hits. The only log entry I''m getting is a warning from XendDomainInfo that the domain has crashed, and that''s only logged when the bug hits a critical process (usually init) that causes the domU kernel to panic. I''m working on building a hypervisor with "debug=y" enabled to see if that gives me anything extra. James
On 13/03/2013, at 9:56 AM, James Sinclair <james.sinclair@linode.com> wrote:> On 12/03/2013, at 6:45 PM, Jan Beulich <JBeulich@suse.com> wrote: > >>>>> On 12.03.13 at 04:56, James Sinclair <james.sinclair@linode.com> wrote: >> >> (re-adding xen-devel to Cc) >> >>> I''ve done further testing and I have confirmed that I can trigger the bug >>> pretty reliably on several Xen versions: >>> >>> * xen-3.4.4 >>> * xen-4.1.4 >>> * xen-4.1.5-pre >>> >>> Dom0 kernels range between 2.6.18 & 3.7.7-1. I''ve also tried quite a few DomU >>> kernels and I can trigger it in everything I''ve tried between 2.6.39.1 and >>> 3.7.10. In the end, I haven''t managed to find a combination of Xen, Dom0 & >>> DomU where I cannot trigger it. >> >> Partly because you apparently never tried non-pvops DomU kernels. >> But that''s not the point here. >> >>> As for the hypervisor log, I''m not seeing anything get logged at all. I >>> suspect that I''m not capturing them correctly. Using Xen 4.1.4 I added the >>> following to the command line: >>> >>> loglvl=all guest_loglvl=all sync_console console_to_ring earlyprintk=xen >>> debug loglevel=8 >> >> Now, this mixture of hypervisor and kernel options already >> suggests that you don''t look in the right place. Please put Xen >> options on the Xen command line, and kernel ones on the kernel >> line. > > The documentation I found that mentioned using "earlyprintk=xen debug loglevel=8" did not make it clear those are kernel options. I''ve corrected that now: > > Hypervisor options: loglvl=all guest_loglvl=all sync_console console_to_ring > Kernel options: earlyprintk=xen debug loglevel=8 > >> >>> I''m seeing extra logs from Xen during boot, but nothing when the bug is >>> triggered in the DomU. I''m assuming any extra logging will show up on the >>> console or in /var/log/xen/xend.log - is there somewhere else I should be >>> looking? Or something I''m not setting correctly? >> >> They show up on the serial console (if in use) or in the output of >> "xl dmesg" (or "xm dmesg" if xend is in use). Without extra >> precautions, they _won''t_ show up anywhere under /var/log. > > I''m not seeing anything extra logged on the serial console nor in "xl dmesg" when the bug hits. The only log entry I''m getting is a warning from XendDomainInfo that the domain has crashed, and that''s only logged when the bug hits a critical process (usually init) that causes the domU kernel to panic. > > I''m working on building a hypervisor with "debug=y" enabled to see if that gives me anything extra.I''ve managed to set up a debug build of the 4.1.4 hypervisor and, with logging turned all the way up (as far as I can tell), nothing gets logged when the domU hits the kernel bug. I''m not sure what to try next. James
On Tue, 2013-03-12 at 22:56 +0000, James Sinclair wrote:> The documentation I found that mentioned using "earlyprintk=xen debug > loglevel=8" did not make it clear those are kernel options.Which document was this? Perhaps we can update it.> I''ve corrected that now: > > Hypervisor options: loglvl=all guest_loglvl=all sync_console console_to_ring > Kernel options: earlyprintk=xen debug loglevel=8Ian.
On 13/03/2013, at 8:50 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:> On Tue, 2013-03-12 at 22:56 +0000, James Sinclair wrote: >> The documentation I found that mentioned using "earlyprintk=xen debug >> loglevel=8" did not make it clear those are kernel options. > > Which document was this? Perhaps we can update it.I thought it was in the wiki but after going through my browser''s history of wiki pages I didn''t find it. Anyway, I''ve corrected my configuration and triggered the bug - still nothing being logged when it happens. What''s the next step? I there something else I should try to diagnose the issue? James