Hi, Mr Ian Campbell and other gurus,

We hit a Xen dom0 alignment-check panic in our tests while restarting some processes. Here is the call stack:

alignment check: 0000 [#1] SMP
last sysfs file: /sys/hypervisor/properties/capabilities
CPU 2
Modules linked in: xt_iprange xt_mac arptable_filter arp_tables xt_physdev 8021q garp xt_state iptable_filter ip_tables autofs4 ipmi_devintf ipmi_si ipmi_msghandler ebtable_filter ebtable_nat ebtable_broute bridge stp llc ebtables lockd sunrpc bonding ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack xenfs dm_multipath fuse xen_netback xen_blkback blktap blkback_pagemap loop nbd video output sbs sbshc parport_pc lp parport joydev ses enclosure snd_seq_dummy serio_raw bnx2 snd_seq_oss snd_seq_midi_event snd_seq dcdbas snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr iTCO_wdt iTCO_vendor_support shpchp raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 cciss
Pid: 8601, comm: connector Not tainted 2.6.32.36xen #1 PowerEdge R710
RIP: e030:[<ffffffffa02ce51a>]  [<ffffffffa02ce51a>] bond_3ad_get_active_agg_info+0x61/0x74 [bonding]
RSP: e02b:ffff88009222b800  EFLAGS: 00050202
RAX: 0000000000000001 RBX: ffff88009222b838 RCX: ffff880250875580
RDX: ffff88024dc76c50 RSI: ffff88009222b838 RDI: ffff88024dc77200
RBP: ffff88009222b808 R08: ffff880246a72f50 R09: ffffffff816fb2a0
R10: ffff8800af2c10e8 R11: ffffffff813cca10 R12: ffff880250875000
R13: ffff8800af2c10e8 R14: ffff880250875580 R15: ffff88024dc1ae80
FS:  00007fd130d61740(0000) GS:ffff880028072000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff1cb42c40 CR3: 00000001f8a5f000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process td_connector (pid: 8601, threadinfo ffff88009222a000, task ffff88008adcc470)
Stack:
 0000000000000002 ffff88009222b878 ffffffffa02cf3db ffff8800af2c10e8
<0> ffff8802508755ac 4f52505f00704550 0000000200000003 0001001100010002
<0> 0000001472655356 0000000000000000 0000000000000002 ffff880250875580
Call Trace:
 [<ffffffffa02cf3db>] bond_3ad_xmit_xor+0x70/0x17f [bonding]
 [<ffffffffa02ccd1d>] bond_start_xmit+0x391/0x3ea [bonding]
 [<ffffffffa0241422>] ? ipv4_confirm+0x179/0x195 [nf_conntrack_ipv4]
 [<ffffffff813a3657>] dev_hard_start_xmit+0x1b9/0x27e
 [<ffffffff813a644a>] dev_queue_xmit+0x267/0x30e
 [<ffffffff813ce523>] ip_finish_output2+0x1a9/0x1ed
 [<ffffffff813ce5c9>] ip_finish_output+0x62/0x67
 [<ffffffff813ce67c>] ip_output+0xae/0xb5
 [<ffffffff813cca20>] dst_output+0x10/0x12
 [<ffffffff813ce0d9>] ip_local_out+0x23/0x28
 [<ffffffff813cf0fa>] ip_queue_xmit+0x2ce/0x32a
 [<ffffffff810acb19>] ? call_rcu_sched+0x15/0x17
 [<ffffffff810acb29>] ? call_rcu+0xe/0x10
 [<ffffffff8121e3c6>] ? radix_tree_node_free+0x14/0x16
 [<ffffffff813dfd6f>] tcp_transmit_skb+0x62d/0x66d
 [<ffffffff8100f175>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8d2>] ? check_events+0x12/0x20
 [<ffffffff81120369>] ? __d_free+0x50/0x55
 [<ffffffff813e118c>] tcp_write_xmit+0x6d8/0x7be
 [<ffffffff813e12d7>] __tcp_push_pending_frames+0x2f/0x62
 [<ffffffff813e12d7>] __tcp_push_pending_frames+0x2f/0x62
 [<ffffffff813e19e3>] tcp_send_fin+0x102/0x10a
 [<ffffffff813d59e2>] tcp_close+0x138/0x388
 [<ffffffff813f1e0e>] inet_release+0x5d/0x64
 [<ffffffff8139361f>] sock_release+0x1f/0x71
 [<ffffffff81393af2>] sock_close+0x27/0x2b
 [<ffffffff8110f063>] __fput+0x112/0x1b6
 [<ffffffff8110f520>] fput+0x1a/0x1c
 [<ffffffff8110a5a9>] filp_close+0x6c/0x77
 [<ffffffff81058c8b>] put_files_struct+0x7c/0xd0
 [<ffffffff81058d18>] exit_files+0x39/0x3e
 [<ffffffff8105a059>] do_exit+0x247/0x677
 [<ffffffff810673d8>] ? freezing+0x13/0x15
 [<ffffffff8105a528>] sys_exit_group+0x0/0x1b
 [<ffffffff8106a843>] get_signal_to_deliver+0x300/0x324
 [<ffffffff810121da>] do_notify_resume+0x90/0x6d6
 [<ffffffff8100c412>] ? xen_mc_flush+0x173/0x195
 [<ffffffff8102f82d>] ? paravirt_end_context_switch+0x17/0x31
 [<ffffffff8100b459>] ? xen_end_context_switch+0x1e/0x22
 [<ffffffff81049a5b>] ? finish_task_switch+0x51/0xa9
 [<ffffffff8101303e>] int_signal+0x12/0x17
Code: fc ff ff 48 85 c0 75 e3 83 c8 ff eb 2e 66 8b 42 06 66 89 03 66 8b 42 32 66 89 43 02 8b 42 0c 66 89 43 04 66 8b 42 16 66 89 43 06 <8b> 42 0e 89 43 08 66 8b 42 12 66 89 43 0c 31 c0 5b c9 c3 55 48
RIP  [<ffffffffa02ce51a>] bond_3ad_get_active_agg_info+0x61/0x74 [bonding]
 RSP <ffff88009222b800>
---[ end trace d269ed1e3064b31a ]---
Kernel panic - not syncing: Fatal exception in interrupt

We guess the panic is caused by the EFLAGS.AC bit being set to 1, which enables CPU alignment checking. Since the kernel contains many unaligned memory accesses, dom0 could panic almost anywhere. But we have no idea what set the AC flag.

We found some posts that may be related to this problem:

http://lists.xen.org/archives/html/xen-devel/2013-01/msg02285.html
http://old-list-archives.xen.org/archives/html/xen-devel/2011-11/msg00827.html
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660425

but all of them report a domU panic (maybe a PV domU), while ours is in dom0.

The Xen version is 4.0.1 and the dom0 kernel comes from Jeremy's git tree:

http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a

It is the xen-2.6.32.36 version of Jeremy's dom0 git tree, so I guess it is too old to be related to the CPU SMAP feature.

Any help is appreciated, thanks.

Best regards,
jerry
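The guess about EFLAGS.AC can be checked against the register dump itself. The following is a minimal sketch, not part of the original report. It assumes the standard x86 layout in which AC is bit 18 of EFLAGS, and it assumes that the bytes `<8b> 42 0e` marked in the Code: line decode to the 4-byte load `mov 0xe(%rdx),%eax`. The helper names `eflags_ac_set` and `is_misaligned` are made up for illustration.

```python
X86_EFLAGS_AC = 1 << 18  # alignment-check flag in EFLAGS (bit 18)


def eflags_ac_set(eflags: int) -> bool:
    """Return True if the AC bit is set in an EFLAGS value."""
    return bool(eflags & X86_EFLAGS_AC)


def is_misaligned(addr: int, size: int) -> bool:
    """Return True if a `size`-byte access at `addr` is not naturally aligned."""
    return addr % size != 0


# Values copied from the oops in the message above.
EFLAGS = 0x00050202
RDX = 0xFFFF88024DC76C50

# Assumption: the faulting bytes "8b 42 0e" are `mov 0xe(%rdx),%eax`,
# i.e. a 4-byte load from RDX + 0x0e.
load_addr = RDX + 0x0E

print(f"EFLAGS.AC set: {eflags_ac_set(EFLAGS)}")
print(f"load address:  {load_addr:#x}")
print(f"misaligned for a 4-byte access: {is_misaligned(load_addr, 4)}")
```

Under these assumptions the AC bit is set and the load address ends in 0x5e, two bytes past a 4-byte boundary, which is consistent with the guess that an ordinary unaligned load tripped the alignment check.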
Pasi Kärkkäinen
2013-Jun-01 10:59 UTC
Re: dom0 alignment check panic due to EFLAGS.AC been set
On Sat, Jun 01, 2013 at 05:27:27PM +0800, Ma JieYue wrote:
>
> We found some mail may be related to this problem,
>
> http://lists.xen.org/archives/html/xen-devel/2013-01/msg02285.html
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-11/msg00827.html
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660425
>
> but all these posts reported a domU panic (maybe PV domU) , while mine
> is related to dom0
>
>
> The Xen version is 4.0.1 and dom0 kernel comes from jeremy's git tree
>

I suggest upgrading your Xen hypervisor. 4.0.1 is very old,
and not even the latest on the 4.0.x branch.

Currently Xen 4.2.2 is the latest stable release.

> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ae333e97552c81ab10395ad1ffc6d6daaadb144a
>
> It is xen-2.6.32.36 version of jeremy's dom0 git tree, so I guess
> maybe it is too old to be related with CPU SMAP feature
>

Jeremy's xen.git is not maintained anymore, so it doesn't have the latest
Xen-related fixes and features, and it is also missing security fixes,
so I don't recommend using it anymore.

You should switch to a mainline Linux 3.x kernel, which should be better in every way.

>
>
> Any help is appreciated, thanks.
>
>
> Best regards,
>
> jerry
>

-- Pasi
Thank you for your reply.

I admit Xen 4.0.1 is old, but from other bug reports on xen-devel,

> http://lists.xen.org/archives/html/xen-devel/2013-01/msg02285.html
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-11/msg00827.html
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660425

I tend to believe the problem still exists, and from
http://lists.xen.org/archives/html/xen-devel/2013-01/msg02285.html
I think there may not yet be any specific patch that fixes this EFLAGS.AC problem.

Obviously, this EFLAGS.AC panic requires three conditions:

1. The AC bit in the CPU EFLAGS register is set, and I don't know why.
2. The AM bit in CR0 allows alignment checks, which is the default behavior.
3. The current CPL is 3, which is where dom0 runs.

I tried to study arch/x86/x86_64/entry.S. I guess create_bounce_frame is called when Xen switches to dom0, and it does clear the AC bit in EFLAGS:

create_bounce_frame:
        ...
.Lft13: movq  %rax,(%rsi)               # RCX
        /* Rewrite our stack frame and return to guest-OS mode. */
        /* IA32 Ref. Vol. 3: TF, VM, RF and NT flags are cleared on trap. */
        /* Also clear AC: alignment checks shouldn't trigger in kernel mode. */
        movl  $TRAP_syscall,UREGS_entry_vector+8(%rsp)
        andl  $~(X86_EFLAGS_AC|X86_EFLAGS_VM|X86_EFLAGS_RF|\
                 X86_EFLAGS_NT|X86_EFLAGS_TF),UREGS_eflags+8(%rsp)
        ...

Also, an alignment check cannot happen while running in Xen itself, whose CPL is 0.

Someone also reported on the mailing list that a 2.6.24 PV kernel never panicked on an alignment check, but after he switched to a 2.6.32 PV kernel it happened often.

So I guess it is a dom0 kernel bug, isn't it?

jeremy, konrad, could you take a look at this?
BRgs
jerry
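The three conditions listed in this message can be evaluated against the register dump from the first message. The sketch below is illustrative only. It assumes the architectural bit positions (EFLAGS TF=8, NT=14, RF=16, VM=17, AC=18; CR0.AM=18), and it takes the CPL from the low two bits of the `CS:` field (e033) on the assumption that this field is the selector live at dump time. `bounce_frame_eflags` models only the `andl` mask quoted from create_bounce_frame, not the rest of that routine. Both function names are hypothetical.

```python
# Architectural EFLAGS bits named in the quoted create_bounce_frame mask.
X86_EFLAGS_TF = 1 << 8
X86_EFLAGS_NT = 1 << 14
X86_EFLAGS_RF = 1 << 16
X86_EFLAGS_VM = 1 << 17
X86_EFLAGS_AC = 1 << 18
X86_CR0_AM = 1 << 18  # alignment mask bit in CR0


def alignment_check_armed(eflags: int, cr0: int, cs: int) -> bool:
    """True only if EFLAGS.AC and CR0.AM are set and the CPL (CS & 3) is 3."""
    return bool(eflags & X86_EFLAGS_AC) and bool(cr0 & X86_CR0_AM) and (cs & 3) == 3


def bounce_frame_eflags(eflags: int) -> int:
    """Apply the same and-mask as the quoted andl instruction (32-bit value)."""
    mask = (X86_EFLAGS_AC | X86_EFLAGS_VM | X86_EFLAGS_RF
            | X86_EFLAGS_NT | X86_EFLAGS_TF)
    return eflags & ~mask & 0xFFFFFFFF


# Values copied from the oops in the first message.
EFLAGS = 0x00050202
CR0 = 0x8005003B
CS = 0xE033

print("armed at panic time:", alignment_check_armed(EFLAGS, CR0, CS))

cleaned = bounce_frame_eflags(EFLAGS)
print(f"EFLAGS after the mask: {cleaned:#010x}")
print("armed after the mask:", alignment_check_armed(cleaned, CR0, CS))
```

With the dumped values all three conditions hold, and applying the mask would leave AC clear. Under the stated assumptions, that suggests the flag was set somewhere other than this particular bounce-frame path.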