Olivier Hanesse
2011-Mar-10 11:25 UTC
[Xen-devel] Kernel panic with 2.6.32-30 under network activity
Hello,

I've got several kernel panics on a domU under network activity (multiple rsync jobs using rsh). I didn't manage to reproduce it manually, but it happened 5 times during the last month. Each time it is the same kernel trace.

I am using Debian 5.0.8 with kernel/hypervisor:

ii  linux-image-2.6.32-bpo.5-amd64  2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs
ii  xen-hypervisor-4.0-amd64        4.0.1-2            The Xen Hypervisor on AMD64

Here is the trace:

[469390.126691] alignment check: 0000 [#1] SMP
[469390.126711] last sysfs file: /sys/devices/virtual/net/lo/operstate
[469390.126718] CPU 0
[469390.126725] Modules linked in: snd_pcsp xen_netfront snd_pcm evdev snd_timer snd soundcore snd_page_alloc ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod xen_blkfront thermal_sys
[469390.126772] Pid: 22077, comm: rsh Not tainted 2.6.32-bpo.5-amd64 #1
[469390.126779] RIP: e030:[<ffffffff8126093d>] [<ffffffff8126093d>] eth_header+0x61/0x9c
[469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
[469390.126802] RAX: 00000000090f0900 RBX: 0000000000000008 RCX: ffff88001ecd0cee
[469390.126811] RDX: 0000000000000800 RSI: 000000000000000e RDI: ffff88001ecd0cee
[469390.126820] RBP: ffff8800029016d0 R08: 0000000000000000 R09: 0000000000000034
[469390.126829] R10: 000000000000000e R11: ffffffff81255821 R12: ffff880002935144
[469390.126838] R13: 0000000000000034 R14: ffff88001fe80000 R15: ffff88001fe80000
[469390.126851] FS: 00007f340c2276e0(0000) GS:ffff880002f4d000(0000) knlGS:0000000000000000
[469390.126860] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[469390.126867] CR2: 00007fffb8f33a8c CR3: 000000001d875000 CR4: 0000000000002660
[469390.126877] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[469390.126886] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[469390.126895] Process rsh (pid: 22077, threadinfo ffff88001ec3e000, task ffff88001ea61530)
[469390.126904] Stack:
[469390.126908]  0000000000000000 0000000000000000 ffff88001ecd0cfc ffff88001f1a4ae8
[469390.126921] <0> ffff880002935100 ffff880002935140 0000000000000000 ffffffff81255a20
[469390.126937] <0> 0000000000000000 ffffffff8127743d 0000000000000000 ffff88001ecd0cfc
[469390.126954] Call Trace:
[469390.126963]  [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
[469390.126974]  [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
[469390.126983]  [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
[469390.126994]  [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
[469390.127003]  [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.127013]  [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
[469390.127022]  [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.127031]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127040]  [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
[469390.127049]  [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
[469390.127059]  [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
[469390.127069]  [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
[469390.127079]  [<ffffffff812410d1>] ? sock_release+0x19/0x66
[469390.127087]  [<ffffffff81241140>] ? sock_close+0x22/0x26
[469390.127097]  [<ffffffff810ef879>] ? __fput+0x100/0x1af
[469390.127106]  [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
[469390.127116]  [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
[469390.127127]  [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
[469390.127135]  [<ffffffff81051141>] ? do_exit+0x236/0x6c6
[469390.127144]  [<ffffffff8100c241>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[469390.127154]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127163]  [<ffffffff8100c205>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[469390.127173]  [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
[469390.127183]  [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
[469390.127193]  [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
[469390.127202]  [<ffffffff812fbf45>] ? page_fault+0x25/0x30
[469390.127211]  [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
[469390.127219]  [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
[469390.127228]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.127240]  [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
[469390.128009]  [<ffffffff81010e0e>] ? int_signal+0x12/0x17
[469390.128009] Code: 89 e8 86 e0 66 89 47 0c 48 85 ed 75 07 49 8b ae 20 02 00 00 8b 45 00 4d 85 e4 89 47 06 66 8b 45 04 66 89 47 0a 74 12 41 8b 04 24 <89> 07 66 41 8b 44 24 04 66 89 47 04 eb 18 41 f6 86 60 01 00 00
[469390.128009] RIP  [<ffffffff8126093d>] eth_header+0x61/0x9c
[469390.128009]  RSP <ffff88001ec3f9b8>
[469390.128009] ---[ end trace dd6b1396ef9d9a96 ]---
[469390.128009] Kernel panic - not syncing: Fatal exception in interrupt
[469390.128009] Pid: 22077, comm: rsh Tainted: G D 2.6.32-bpo.5-amd64 #1
[469390.128009] Call Trace:
[469390.128009]  [<ffffffff812f9d03>] ? panic+0x86/0x143
[469390.128009]  [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
[469390.128009]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009]  [<ffffffff812fbbca>] ? _spin_unlock_irqrestore+0xd/0xe
[469390.128009]  [<ffffffff8104e387>] ? release_console_sem+0x17e/0x1af
[469390.128009]  [<ffffffff812fca65>] ? oops_end+0xa7/0xb4
[469390.128009]  [<ffffffff81012416>] ? do_alignment_check+0x88/0x92
[469390.128009]  [<ffffffff81011a75>] ? alignment_check+0x25/0x30
[469390.128009]  [<ffffffff81255821>] ? neigh_resolve_output+0x0/0x284
[469390.128009]  [<ffffffff8126093d>] ? eth_header+0x61/0x9c
[469390.128009]  [<ffffffff81260900>] ? eth_header+0x24/0x9c
[469390.128009]  [<ffffffff81255a20>] ? neigh_resolve_output+0x1ff/0x284
[469390.128009]  [<ffffffff8127743d>] ? ip_finish_output2+0x1d6/0x22b
[469390.128009]  [<ffffffff8127708f>] ? ip_queue_xmit+0x311/0x386
[469390.128009]  [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
[469390.128009]  [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.128009]  [<ffffffff81287a47>] ? tcp_transmit_skb+0x648/0x687
[469390.128009]  [<ffffffff8100e242>] ? check_events+0x12/0x20
[469390.128009]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009]  [<ffffffff81289ec9>] ? tcp_write_xmit+0x874/0x96c
[469390.128009]  [<ffffffff8128a00e>] ? __tcp_push_pending_frames+0x22/0x53
[469390.128009]  [<ffffffff8127d409>] ? tcp_close+0x176/0x3d0
[469390.128009]  [<ffffffff81299f0c>] ? inet_release+0x4e/0x54
[469390.128009]  [<ffffffff812410d1>] ? sock_release+0x19/0x66
[469390.128009]  [<ffffffff81241140>] ? sock_close+0x22/0x26
[469390.128009]  [<ffffffff810ef879>] ? __fput+0x100/0x1af
[469390.128009]  [<ffffffff810eccb6>] ? filp_close+0x5b/0x62
[469390.128009]  [<ffffffff8104f878>] ? put_files_struct+0x64/0xc1
[469390.128009]  [<ffffffff812fbb02>] ? _spin_lock_irq+0x7/0x22
[469390.128009]  [<ffffffff81051141>] ? do_exit+0x236/0x6c6
[469390.128009]  [<ffffffff8100c241>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[469390.128009]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009]  [<ffffffff8100c205>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[469390.128009]  [<ffffffff81051647>] ? do_group_exit+0x76/0x9d
[469390.128009]  [<ffffffff8105dec1>] ? get_signal_to_deliver+0x318/0x343
[469390.128009]  [<ffffffff8101004f>] ? do_notify_resume+0x87/0x73f
[469390.128009]  [<ffffffff812fbf45>] ? page_fault+0x25/0x30
[469390.128009]  [<ffffffff812fc17a>] ? error_exit+0x2a/0x60
[469390.128009]  [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
[469390.128009]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[469390.128009]  [<ffffffff8119564d>] ? __put_user_4+0x1d/0x30
[469390.128009]  [<ffffffff81010e0e>] ? int_signal+0x12/0x17

I found another post which may be the same bug (same kernel, network activity ...):

http://jira.mongodb.org/browse/SERVER-2383

Any ideas?

Regards

Olivier
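For reference, the Code: bytes around the faulting instruction (<89> 07, i.e. mov %eax,(%rdi)) decode to a 4-byte store followed by a 2-byte store through RDI (ffff88001ecd0cee): what looks like a 6-byte Ethernet-address copy to a destination that is 2-byte but not 4-byte aligned. Such a store is normally harmless on x86; it only raises #AC when alignment checking is enabled, which is where the discussion below ends up. A small sketch of that access pattern (not the kernel's source; buffer names and values are purely illustrative):

/* Sketch of the access pattern at the faulting instruction: a 6-byte
 * address copied as one 32-bit and one 16-bit store, to a destination
 * with the same alignment as RDI (...cee, i.e. 2 mod 4) in the oops. */
#include <stdint.h>
#include <stdio.h>

static void copy_ether_addr(uint8_t *dst, const uint8_t *src)
{
    /* Mirrors the disassembly around <89> 07:
     *   mov (%r12),%eax   ; mov %eax,(%rdi)
     *   mov 0x4(%r12),%ax ; mov %ax,0x4(%rdi)
     * (a portable implementation would use memcpy; this mirrors the
     * generated code). */
    *(uint32_t *)(void *)dst       = *(const uint32_t *)(const void *)src;
    *(uint16_t *)(void *)(dst + 4) = *(const uint16_t *)(const void *)(src + 4);
}

int main(void)
{
    static uint8_t frame[32] __attribute__((aligned(16)));
    static const uint8_t mac[6] = { 0x00, 0x16, 0x3e, 0x00, 0x00, 0x01 }; /* example value */

    copy_ether_addr(frame + 2, mac);  /* dst % 4 == 2, like ...cee */
    printf("dst mod 4 = %lu\n", (unsigned long)((uintptr_t)(frame + 2) & 3));
    return 0;
}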
Konrad Rzeszutek Wilk
2011-Mar-16 03:20 UTC
Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> Hello,
>
> I've got several kernel panics on a domU under network activity (multiple
> rsync jobs using rsh). I didn't manage to reproduce it manually, but it
> happened 5 times during the last month.

Does it happen all the time?

> Each time it is the same kernel trace.
>
> I am using Debian 5.0.8 with kernel/hypervisor:
>
> ii  linux-image-2.6.32-bpo.5-amd64  2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs
> ii  xen-hypervisor-4.0-amd64        4.0.1-2            The Xen Hypervisor on AMD64
>
> Here is the trace:
>
> [469390.126691] alignment check: 0000 [#1] SMP

Alignment check? Was there anything else in the log before this? Was there
anything in the Dom0 log?

> [full oops trace quoted above snipped]
>
> I found another post which may be the same bug (same kernel, network
> activity ...):
>
> http://jira.mongodb.org/browse/SERVER-2383
>
> Any ideas?

None. What type of CPU do you have? Are you pinning your guest to a
specific CPU?
Jan Beulich
2011-Mar-16 09:34 UTC
[Xen-users] Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
>>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
>> [469390.126691] alignment check: 0000 [#1] SMP
>
> Alignment check? Was there anything else in the log before this? Was there
> anything in the Dom0 log?

This together with

>> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286

makes me wonder whether eflags got restored from a corrupted stack slot
somewhere, or whether something in the kernel or one of the modules
intentionally played with EFLAGS.AC.

Jan
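That reading can be checked directly against the numbers already in the oops: bit 18 of EFLAGS is AC, and the reported value 00050286 has it set; the CR0 value in the same register dump (000000008005003b) has AM (also bit 18) set as well. A minimal decoding sketch, not part of the original mail, with the constants copied from the trace above:

/* Decode the saved flags from the oops quoted earlier in the thread. */
#include <stdio.h>

int main(void)
{
    unsigned long eflags = 0x00050286UL;  /* EFLAGS at the faulting RIP */
    unsigned long cr0    = 0x8005003bUL;  /* CR0 from the same dump */

    printf("EFLAGS.AC (bit 18) = %lu\n", (eflags >> 18) & 1);  /* prints 1 */
    printf("CR0.AM    (bit 18) = %lu\n", (cr0    >> 18) & 1);  /* prints 1 */
    /* With both bits set, an unaligned data access at CPL 3 -- where a
     * 64-bit PV guest kernel runs -- raises #AC, matching the
     * "alignment check" line at the top of the oops. */
    return 0;
}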
Olivier Hanesse
2011-Mar-16 09:35 UTC
Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
Hello,

Yes, this bug happens quite often.

About my CPU, I am using:

model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz

There is no log at all before this message on the domU. I got this message
from the Xen console. This guest isn't pinned to a specific cpu:

Name      ID  VCPU  CPU  State  Time(s)  CPU Affinity
Domain-0   0     0    0  r--    18098.1  0
domU      15     0    1  -b-     3060.8  any cpu
domU      15     1    4  -b-     1693.4  any cpu

My dom0 is pinned:

release            : 2.6.32-bpo.5-xen-amd64
version            : #1 SMP Mon Jan 17 22:05:11 UTC 2011
machine            : x86_64
nr_cpus            : 8
nr_nodes           : 1
cores_per_socket   : 4
threads_per_core   : 1
cpu_mhz            : 2493
hw_caps            : bfebfbff:20000800:00000000:00000940:000ce3bd:00000000:00000001:00000000
virt_caps          : hvm
total_memory       : 10239
free_memory        : 405
node_to_cpu        : node0:0-7
node_to_memory     : node0:405
node_to_dma32_mem  : node0:405
max_node_id        : 0
xen_major          : 4
xen_minor          : 0
xen_extra          : .1
xen_caps           : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler      : credit
xen_pagesize       : 4096
platform_params    : virt_start=0xffff800000000000
xen_changeset      : unavailable
xen_commandline    : dom0_mem=512M loglvl=all guest_loglvl=all dom0_max_vcpus=1 dom0_vcpus_pin console=vga,com1 com1=19200,8n1 clocksource=pit cpuidle=0
cc_compiler        : gcc version 4.4.5 (Debian 4.4.5-10)
cc_compile_by      : waldi
cc_compile_domain  : debian.org
cc_compile_date    : Wed Jan 12 14:04:06 UTC 2011
xend_config_format : 4

I was running top/vmstat before this crash and saw nothing strange (kernel
not swapping, no load, not a lot of IO ... just a network rsync).

About logs in Dom0, "xm dmesg" shows:

(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) grant_table.c:204:d0 Increased maptrack size to 2 frames.
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935
(XEN) traps.c:2869: GPF (0060): ffff82c48014efea -> ffff82c4801f9935

I don't know if this is relevant or not. I will check at the next kernel
panic whether another line is appended.

Hope this helps.

Olivier

2011/3/16 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> > I've got several kernel panics on a domU under network activity
> > (multiple rsync jobs using rsh). I didn't manage to reproduce it
> > manually, but it happened 5 times during the last month.
>
> Does it happen all the time?
>
> > Here is the trace:
> >
> > [469390.126691] alignment check: 0000 [#1] SMP
>
> Alignment check? Was there anything else in the log before this? Was
> there anything in the Dom0 log?
>
> [rest of the quoted trace snipped]
>
> None. What type of CPU do you have? Are you pinning your guest to a
> specific CPU?
Ian Campbell
2011-Mar-16 10:11 UTC
Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
On Wed, 2011-03-16 at 09:34 +0000, Jan Beulich wrote:
> >>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
> >> [469390.126691] alignment check: 0000 [#1] SMP
> >
> > Alignment check? Was there anything else in the log before this? Was there
> > anything in the Dom0 log?
>
> This together with
>
> >> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
>
> makes me wonder whether eflags got restored from a corrupted stack slot
> somewhere, or whether something in the kernel or one of the modules
> intentionally played with EFLAGS.AC.

Can a PV kernel running in ring 3 change AC?

The Intel manual says "They should not be modified by application programs"
over a list including AC, but the list also includes e.g. IOPL and IF, so I
suspect it means "can not" rather than "should not"? In which case it can't
happen by accident.

The hypervisor appears to clear the guest's EFLAGS.AC on context switch to
a guest and on the failsafe bounce, but not in e.g. do_iret, so it's not
entirely clear what the policy is...

Ian.
Jan Beulich
2011-Mar-16 10:40 UTC
Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
>>> On 16.03.11 at 11:11, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2011-03-16 at 09:34 +0000, Jan Beulich wrote:
>> >>> On 16.03.11 at 04:20, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > On Thu, Mar 10, 2011 at 12:25:55PM +0100, Olivier Hanesse wrote:
>> >> [469390.126691] alignment check: 0000 [#1] SMP
>> >
>> > Alignment check? Was there anything else in the log before this? Was there
>> > anything in the Dom0 log?
>>
>> This together with
>>
>> >> [469390.126795] RSP: e02b:ffff88001ec3f9b8 EFLAGS: 00050286
>>
>> makes me wonder whether eflags got restored from a corrupted stack slot
>> somewhere, or whether something in the kernel or one of the modules
>> intentionally played with EFLAGS.AC.
>
> Can a PV kernel running in ring 3 change AC?

Yes. We had this problem until we cleared the flag in
create_bounce_frame().

> The Intel manual says "They should not be modified by application
> programs" over a list including AC, but the list also includes e.g. IOPL
> and IF, so I suspect it means "can not" rather than "should not"? In
> which case it can't happen by accident.

No, afaik "should not" is the correct term.

> The hypervisor appears to clear the guest's EFLAGS.AC on context switch
> to a guest and on the failsafe bounce, but not in e.g. do_iret, so it's
> not entirely clear what the policy is...

do_iret() isn't increasing privilege, and hence restoring whatever the
outer context of iret had in place is correct. The important thing is that
on the transition to kernel mode the flag must always get cleared (which I
think has been the case since the problem in create_bounce_frame() was
fixed).

Jan
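To illustrate the scenario confirmed here (a sketch written for this write-up, not taken from the thread): on x86, POPF at ring 3 can set EFLAGS.AC, and once CR0.AM is also set, a misaligned data access at CPL 3 raises #AC, which Linux delivers to user space as SIGBUS. A 64-bit PV guest kernel runs in ring 3 too, so a leaked AC bit trips it in exactly the same way:

/* Sketch: set EFLAGS.AC from ring 3, then perform an unaligned store. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_sigbus(int sig)
{
    /* The #AC arrived as SIGBUS -- the user-space analogue of the
     * alignment-check oops in the guest kernel. */
    (void)sig;
    write(1, "SIGBUS: alignment check fired\n", 30);
    _exit(0);
}

int main(void)
{
    static unsigned char buf[16] __attribute__((aligned(8)));
    unsigned long flags;

    signal(SIGBUS, on_sigbus);

    /* POPF in ring 3 happily sets the AC bit (bit 18). */
    __asm__ volatile("pushf; pop %0" : "=r"(flags));
    flags |= 1UL << 18;
    __asm__ volatile("push %0; popf" : : "r"(flags) : "cc", "memory");

    /* 4-byte store to an address that is 2 mod 4, comparable to the
     * mov %eax,(%rdi) at the faulting RIP with RDI ending in ...cee. */
    *(volatile unsigned int *)(buf + 2) = 0x090f0900;

    printf("no fault: AC did not take effect\n");
    return 0;
}

Compiled with gcc on x86-64, this is expected to print the SIGBUS line, assuming the compiler emits the plain unaligned move the comment describes.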
Olivier Hanesse
2011-Mar-17 10:34 UTC
Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
It happened again a few minutes ago. It is the same kernel stack each time
(alignment check: 0000 [#1] SMP etc.).

The dom0 where all the faulty domUs are running is a dual Xeon 5420, so 8
real cores available. 20 domUs are running on it and 35 vcpus are set up;
is that too much? The bug happens randomly across domUs. I was running the
same config with Xen 3.2 without any issue.

I found this old post:

http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html

It may be related: no issue with 2.6.24, an issue with 2.6.32.

2011/3/16 Jan Beulich <JBeulich@novell.com>

> [previous message quoted in full, snipped]
Jan Beulich
2011-Mar-17 11:57 UTC
[Xen-users] Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
>>> On 17.03.11 at 11:34, Olivier Hanesse <olivier.hanesse@gmail.com> wrote:
> It happened again a few minutes ago. It is the same kernel stack each time
> (alignment check: 0000 [#1] SMP etc.).
>
> The dom0 where all the faulty domUs are running is a dual Xeon 5420, so 8
> real cores available. 20 domUs are running on it and 35 vcpus are set up;
> is that too much? The bug happens randomly across domUs.
> I was running the same config with Xen 3.2 without any issue.

Are we to read this as "same kernels in DomU-s and Dom0"? If so, that would
hint at some subtle Xen regression. If not, you'd need to be more precise
as to what works and what doesn't, and would possibly want to try
intermediate versions to narrow down when this got introduced.

> I found this old post:
> http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html
>
> It may be related: no issue with 2.6.24, an issue with 2.6.32.

Yes, that indeed looks very similar. Nevertheless, without this being
generally reproducible, we'll have to rely on you doing some
analysis/debugging work on this.

Jan
Olivier Hanesse
2011-Mar-17 12:11 UTC
[Xen-users] Re: [Xen-devel] Kernel panic with 2.6.32-30 under network activity
2011/3/17 Jan Beulich <JBeulich@novell.com>

> >>> On 17.03.11 at 11:34, Olivier Hanesse <olivier.hanesse@gmail.com> wrote:
> > It happened again a few minutes ago. It is the same kernel stack each
> > time (alignment check: 0000 [#1] SMP etc.).
> >
> > The dom0 where all the faulty domUs are running is a dual Xeon 5420, so
> > 8 real cores available. 20 domUs are running on it and 35 vcpus are set
> > up; is that too much? The bug happens randomly across domUs.
> > I was running the same config with Xen 3.2 without any issue.
>
> Are we to read this as "same kernels in DomU-s and Dom0"? If so, that
> would hint at some subtle Xen regression. If not, you'd need to be more
> precise as to what works and what doesn't, and would possibly want to
> try intermediate versions to narrow down when this got introduced.

Dom0 and domU are using different kernels (both coming from the Debian
repository, but the same version):

domU: ii linux-image-2.6.32-bpo.5-amd64     2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs
dom0: ii linux-image-2.6.32-bpo.5-xen-amd64 2.6.32-30~bpo50+1  Linux 2.6.32 for 64-bit PCs, Xen dom0 support

I was running Debian Lenny's version for Xen 3.2, so it was 2.6.26.

> > I found this old post:
> > http://lists.xensource.com/archives/html/xen-devel/2010-03/msg01561.html
> >
> > It may be related: no issue with 2.6.24, an issue with 2.6.32.
>
> Yes, that indeed looks very similar. Nevertheless, without this being
> generally reproducible, we'll have to rely on you doing some
> analysis/debugging work on this.

I "pinned" all the domU vcpus so that they don't share a cpu with dom0
(which is pinned to cpu 0). I can run any analysis/debugging tools you
want. I will also try an older kernel (for example 2.6.32-10) and see what
happens.

Regards

Olivier