I was experimenting with DomU redundancy and load balancing, and I think this GPF started to show up after a couple of DomUs with CARP and HAProxy were added that constantly generate a strong flow of network traffic by pinging target machines and each other. Or maybe it is not related to CARP and pinging, but just depends on traffic volume: the more VMs added and running, the higher the chance that Dom0-DomU networking will collapse; the critical point is 8 guest domains, while I need 10.

I can't give exact steps to reproduce, as it happens randomly, usually without any correlated user activity, after several hours (or several minutes) of normal operation. But sometimes it happens not long after a balancer DomU's startup or shutdown. After the GPF happens, all VMs lose their network connectivity.

Dom0 is openSUSE 12.1 on AMD64 (Linux 3.1.0-1.2-xen) with Xen version 4.1.2_05-1.9, which is patched as described in openSUSE bug 727081 (bugzilla.novell.com/show_bug.cgi?id=727081). The supposedly "offending" DomU is paravirtualized NetBSD 5.1.1 for AMD64 with a recompiled kernel (CARP enabled, no other changes). Other VMs are openSUSE 11.4 and 12.1 for AMD64.
Trace log in /var/log/messages always looks similar (varying digits replaced with asterisks ***):

general protection fault: 0000 [#1] SMP
CPU {core-number}
Modules linked in: 8250 8250_pnp af_packet asus_wmi ata_generic blkback_pagemap blkbk blktap bridge btrfs button cdrom dm_mod domctl drm drm_kms_helper edd eeepc_wmi ehci_hcd evtchn fuse gntdev hid hwmon i2c_algo_bit i2c_core i2c_i801 i915 iTCO_vendor_support iTCO_wdt linear llc lzo_compress mei(C) microcode netbk parport parport_pc pata_via pci_hotplug pcspkr ppdev processor r8169 rfkill serial_core [serio_raw] sg snd snd_hda_codec snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hwdep snd_mixer_oss snd_page_alloc snd_pcm snd_pcm_oss snd_seq snd_seq_device snd_timer soundcore sparse_keymap sr_mod stp thermal_sys uas usbbk usbcore usbhid usb_storage video wmi xenblk xenbus_be xennet zlib_deflate
Pid: {process-id}, comm: netback/{0/1} Tainted: G C 3.1.0-1.2-xen #1 System manufacturer System Product Name/P8H67-M
RIP: e030:[<ffffffff803e7451>] [<ffffffff803e7451>] skb_release_data.part.47+0x61/0xc0
RSP: e02b:ffff880******d40 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880********0 RCX: ffff880******000
RDX: {..RCX.+.0e80..} RSI: 00000000000000** RDI: 00***c**00000000
RBP: {.....RBX......} R08: {..RCX.-.cff0..} R09: 0000000*********
R10: 000000000000000* R11: {.task.+.0470..} R12: ffff880026a51000
R13: ffff880********0 R14: ffffc900048****0 R15: 0000000000000001
FS: 00007f*******7*0(0000) GS:ffff880******000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000***********0 CR3: 0000000******000 CR4: 0000000000042660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process netback/{0/1} (pid: {process-id}, threadinfo ffff880******000, task ffff880********0)
Stack:
 0000000000000000 {.....RBX......} 0000000000000000 ffffffff803e7511
 {.....RBX......} ffffffffa0***d2c {.....task.....} {thread.+.1e00.}
 {thread.+.1db0.} {.R14.-.22a40..} ffffc9000000000* 0000000000000000
Call Trace:
 [<ffffffff803e7511>] __kfree_skb+0x11/0x20
 [<ffffffffa0***d2c>] net_rx_action+0x66c/0x9c0 [netbk]
 [<ffffffffa0***72a>] netbk_action_thread+0x5a/0x270 [netbk]
 [<ffffffff8006438e>] kthread+0x7e/0x90
 [<ffffffff8050f814>] kernel_thread_helper+0x4/0x10
Code: 48 8b 7c 02 08 e8 90 69 cf ff 8b 95 d0 00 00 00 48 8b 8d d8 00 00 00 48 01 ca 0f b7 02 39 c3 7c d1 f6 42 0c 10 74 1e 48 8b 7a 30
RIP [<ffffffff803e7451>] skb_release_data.part.47+0x61/0xc0
RSP <ffff880******d40>
---[ end trace **************** ]---

Preceding and subsequent messages don't seem to be related to the GPF; the time gap is from minutes to half an hour or even more. But if they could give some insight, I will post them, too.
On Fri, Feb 03, 2012 at 07:32:40PM +0300, Anton Samsonov wrote:
> I was experimenting with DomU redundancy and load balancing,
> and I think this GPF started to show up after a couple of DomUs
> with CARP and HAProxy were added that constantly generate
> a strong flow of network traffic by pinging target machines
> and each other as well. Or maybe it is not related to CARP
> and pinging, but just depends on traffic volume: the more VMs
> added and running, the more chances that Dom0-DomU networking
> will collapse, the critical point being 8 guest domains, while I need 10.
>
> I can't give exact steps to reproduce, as it happens randomly,
> usually without any correlated user activity, after several hours
> (or several minutes) of normal performance. But sometimes
> it happens not so long after a balancer's DomU startup or shutdown.
> After GPF happens, all VMs lose their networking connectivity.
>
> Dom0 is openSUSE 12.1 on AMD64 (Linux 3.1.0-1.2-xen)

Do you get the same issue with a pv-ops dom0? So also 3.1, but from kernel.org?

> with Xen version 4.1.2_05-1.9, which is patched as described
> in openSUSE bug 727081 (bugzilla.novell.com/show_bug.cgi?id=727081).
> Supposedly "offending" DomU is paravirtualized NetBSD 5.1.1
> for AMD64 with recompiled kernel (CARP enabled, no more changes).

What is CARP?

> Other VMs are openSUSE 11.4 and 12.1 for AMD64.
> Trace log in /var/log/messages always looks similar (varying digits
> replaced with asterisks ***):
>
> general protection fault: 0000 [#1] SMP
> [register dump snipped; quoted in full in the previous message]
>
> Stack:
>  0000000000000000 {.....RBX......} 0000000000000000 ffffffff803e7511
>  {.....RBX......} ffffffffa0***d2c {.....task.....} {thread.+.1e00.}
>  {thread.+.1db0.} {.R14.-.22a40..} ffffc9000000000* 0000000000000000

Hm, that is a pretty neat stack output. Wonder which patch of theirs does that.

> Call Trace:
> [<ffffffff803e7511>] __kfree_skb+0x11/0x20
> [<ffffffffa0***d2c>] net_rx_action+0x66c/0x9c0 [netbk]
> [<ffffffffa0***72a>] netbk_action_thread+0x5a/0x270 [netbk]
> [<ffffffff8006438e>] kthread+0x7e/0x90
> [<ffffffff8050f814>] kernel_thread_helper+0x4/0x10
> [rest of trace snipped]
>
> Preceding and subsequent messages don't seem to be related to the GPF;
> the time gap is from minutes to half an hour or even more. But if they
> could give some insight, I will post them, too.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
lists.xensource.com/xen-devel
2012/2/10 Konrad Rzeszutek Wilk <konrad@darnok.org>:

AS>> Dom0 is openSUSE 12.1 on AMD64 (Linux 3.1.0-1.2-xen)
KRW> Do you get the same issue with a pv-ops dom0? So also 3.1, but from kernel.org?

Unfortunately, I'm not skilled at compiling the kernel myself. I tried building the newest 3.2.6 with all Xen options (which I could find by the "Xen" keyword) enabled, but the resulting system didn't have the netback.ko module at all, barely booted, and xm was not able to communicate with the hypervisor. As for the vanilla kernel package provided by openSUSE, it is not Xen-enabled. Meanwhile, an update was released, so I have been testing 3.1.9-1.4-xen for about a week, though the outcome is still negative.

AS>> Supposedly "offending" DomU is paravirtualized NetBSD 5.1.1
AS>> for AMD64 with recompiled kernel (CARP enabled, no more changes).
KRW> What is CARP?

CARP is the Common Address Redundancy Protocol, a "non-patented version of VRRP", used for high availability and load balancing. It is supported in the NetBSD kernel (although a user-space implementation, uCARP, exists as well), but is not compiled in by default. All my work to enable it was a quite simple recompilation with the following config:

include "arch/amd64/conf/XEN3_DOMU"
pseudo-device carp

It looks like GPFs happen only when those load-balancing DomUs are running: at least, when they are shut off, no fault is observed in a whole day, but when they run, a fault can happen even after a few minutes of Dom0 uptime, especially while DomUs are stopping or starting.

KRW> Hm, that is a pretty neat stack output. Wonder which patch of theirs does that.

It was not a verbatim dump, but generalized text. If you are interested, here is an excerpt from /var/log/messages for the penultimate GPF (with date and hostname removed):

===[ Preceding entries ]===

(Those can be absolutely unrelated to the GPF, but all 3 recent faults after the kernel update happened during VM shutdown, either massive or singular.)
21:18:33 avahi-daemon[3086]: Withdrawing workstation service for vif10.0.
21:18:33 kernel: [29722.267359] br1: port 10(vif10.0) entering forwarding state
21:18:33 kernel: [29722.267443] br1: port 10(vif10.0) entering disabled state
21:18:33 logger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/10/0
21:18:33 logger: /etc/xen/scripts/vif-bridge: brctl delif br1 vif10.0 failed
21:18:33 logger: /etc/xen/scripts/vif-bridge: ifconfig vif10.0 down failed
21:18:33 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif10.0, bridge br1.
21:18:33 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/10/0
21:18:33 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/10/0
21:18:33 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/10/0
21:18:33 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/10/51712
21:18:33 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/10/0
21:18:33 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/10/51712
21:18:53 avahi-daemon[3086]: Withdrawing workstation service for vif9.0.
21:18:53 kernel: [29742.222676] br1: port 9(vif9.0) entering forwarding state
21:18:53 kernel: [29742.222779] br1: port 9(vif9.0) entering disabled state
21:18:53 logger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/9/0
21:18:53 logger: /etc/xen/scripts/vif-bridge: brctl delif br1 vif9.0 failed
21:18:53 logger: /etc/xen/scripts/vif-bridge: ifconfig vif9.0 down failed
21:18:53 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif9.0, bridge br1.
21:18:53 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/9/0
21:18:53 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/9/0
21:18:53 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/9/0
21:18:53 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/9/0
21:18:53 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/9/51712
21:18:53 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/9/51712
21:19:13 avahi-daemon[3086]: Withdrawing workstation service for vif8.0.
21:19:13 kernel: [29762.605500] br1: port 8(vif8.0) entering forwarding state
21:19:13 kernel: [29762.605572] br1: port 8(vif8.0) entering disabled state
21:19:13 logger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/8/0
21:19:13 logger: /etc/xen/scripts/vif-bridge: brctl delif br1 vif8.0 failed
21:19:13 logger: /etc/xen/scripts/vif-bridge: ifconfig vif8.0 down failed
21:19:13 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif8.0, bridge br1.
21:19:13 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/8/0
21:19:13 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/8/0
21:19:13 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/8/0
21:19:13 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/8/0
21:19:13 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/8/51712
21:19:13 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/8/51712
21:19:26 avahi-daemon[3086]: Withdrawing workstation service for vif7.0.
21:19:26 kernel: [29775.558990] br1: port 7(vif7.0) entering forwarding state
21:19:26 kernel: [29775.559105] br1: port 7(vif7.0) entering disabled state
21:19:26 logger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/7/0
21:19:26 logger: /etc/xen/scripts/vif-bridge: brctl delif br1 vif7.0 failed
21:19:26 logger: /etc/xen/scripts/vif-bridge: ifconfig vif7.0 down failed
21:19:26 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif7.0, bridge br1.
21:19:26 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/7/0
21:19:26 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/7/0
21:19:26 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/7/0
21:19:26 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/7/51712
21:19:26 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/7/0
21:19:26 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/7/51712

===[ Fault alert itself ]===

21:19:37 kernel: [29786.610984] general protection fault: 0000 [#1] SMP
21:19:37 kernel: [29786.610992] CPU 0
21:19:37 kernel: [29786.610993] Modules linked in: fuse ip6t_LOG xt_tcpudp xt_pkttype xt_physdev ipt_LOG xt_limit nfsd lockd nfs_acl auth_rpcgss sunrpc usbbk netbk blkbk blkback_pagemap blktap domctl xenbus_be gntdev evtchn af_packet bridge stp llc edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables microcode snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device eeepc_wmi asus_wmi sparse_keymap rfkill usb_storage ppdev pci_hotplug uas 8250_pnp sg i2c_i801 sr_mod wmi parport_pc snd_hda_codec_hdmi snd_hda_codec_realtek parport r8169 pcspkr mei(C) 8250 serial_core iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc usbhid hid dm_mod linear btrfs zlib_deflate lzo_compress i915 drm_kms_helper drm i2c_algo_bit ehci_hcd usbcor
21:19:37 kernel: e i2c_core button video processor thermal_sys hwmon xenblk cdrom xennet ata_generic pata_via
21:19:37 kernel: [29786.611076]
21:19:37 kernel: [29786.611078] Pid: 3461, comm: netback/1 Tainted: G C 3.1.9-1.4-xen #1 System manufacturer System Product Name/P8H67-M
21:19:37 kernel: [29786.611084] RIP: e030:[<ffffffff803e7f81>] [<ffffffff803e7f81>] skb_release_data.part.46+0x61/0xc0
21:19:37 kernel: [29786.611092] RSP: e02b:ffff8802c8339d40 EFLAGS: 00010202
21:19:37 kernel: [29786.611095] RAX: 0000000000000000 RBX: ffff88007ccf39c0 RCX: ffff8800e70db000
21:19:37 kernel: [29786.611098] RDX: ffff8800e70dbe80 RSI: 000000000000001f RDI: 0028f49c00000000
21:19:37 kernel: [29786.611101] RBP: ffff88007ccf39c0 R08: ffff8800e70d0010 R09: 000000000000004e
21:19:37 kernel: [29786.611103] R10: 0000000000000003 R11: ffff8802d0074c30 R12: ffff8802bb22f000
21:19:37 kernel: [29786.611106] R13: ffff88027ea382c0 R14: ffffc900048cb960 R15: 0000000000000001
21:19:37 kernel: [29786.611114] FS: 00007f303913f700(0000) GS:ffff8802de3c2000(0000) knlGS:0000000000000000
21:19:37 kernel: [29786.611117] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
21:19:37 kernel: [29786.611119] CR2: 00000000006b6e30 CR3: 00000002c93fb000 CR4: 0000000000042660
21:19:37 kernel: [29786.611126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
21:19:37 kernel: [29786.611131] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
21:19:37 kernel: [29786.611136] Process netback/1 (pid: 3461, threadinfo ffff8802c8338000, task ffff8802d00747c0)
21:19:37 kernel: [29786.611140] Stack:
21:19:37 kernel: [29786.611144] 0000000000000000 ffff88007ccf39c0 0000000000000000 ffffffff803e8041
21:19:37 kernel: [29786.611151] ffff88007ccf39c0 ffffffffa059fd3c ffff8802d00747c0 ffff8802c8339e00
21:19:37 kernel: [29786.611157] ffff8802c8339db0 ffffc900048a8f20 ffffc90000000002 0000000000000000
21:19:37 kernel: [29786.611164] Call Trace:
21:19:37 kernel: [29786.611173] [<ffffffff803e8041>] __kfree_skb+0x11/0x20
21:19:37 kernel: [29786.611182] [<ffffffffa059fd3c>] net_rx_action+0x66c/0x9c0 [netbk]
21:19:37 kernel: [29786.611201] [<ffffffffa05a173a>] netbk_action_thread+0x5a/0x270 [netbk]
21:19:37 kernel: [29786.611211] [<ffffffff8006444e>] kthread+0x7e/0x90
21:19:37 kernel: [29786.611220] [<ffffffff80510d24>] kernel_thread_helper+0x4/0x10
21:19:37 kernel: [29786.611225] Code: 48 8b 7c 02 08 e8 a0 60 cf ff 8b 95 d0 00 00 00 48 8b 8d d8 00 00 00 48 01 ca 0f b7 02 39 c3 7c d1 f6 42 0c 10 74 1e 48 8b 7a 30
21:19:37 kernel: [29786.611265] RIP [<ffffffff803e7f81>] skb_release_data.part.46+0x61/0xc0
21:19:37 kernel: [29786.611271] RSP <ffff8802c8339d40>
21:19:37 kernel: [29786.671491] ---[ end trace 6875b40b2a9f1d46 ]---

(Note that the numbers after "+" in the call trace did not change after the kernel update, as compared to the previously posted trace, although the absolute addresses did change.)

===[ Subsequent entries ]===

(Again, those can be absolutely unrelated to the GPF, and sometimes happen minutes after.)

21:19:38 avahi-daemon[3086]: Withdrawing workstation service for vif6.0.
21:19:38 kernel: [29787.904571] br1: port 6(vif6.0) entering forwarding state
21:19:38 kernel: [29787.904649] br1: port 6(vif6.0) entering disabled state
21:19:38 logger: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/6/0
21:19:38 logger: /etc/xen/scripts/vif-bridge: brctl delif br1 vif6.0 failed
21:19:38 logger: /etc/xen/scripts/vif-bridge: ifconfig vif6.0 down failed
21:19:38 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif6.0, bridge br1.
21:19:39 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/6/0
21:19:39 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/6/0
21:19:39 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/6/0
21:19:39 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/6/51712
21:19:39 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/6/0
21:19:39 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/6/51712
21:19:58 kernel: [29807.860561] br1: port 5(vif5.0) entering forwarding state
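[Editor's note: to round out the CARP answer above, the guest side of such a setup is just a carp pseudo-interface on each NetBSD balancer. A minimal sketch of what that configuration might look like; the interface name, vhid, password, and addresses below are made up for illustration, and the exact parameters are those of NetBSD's carp(4) and ifconfig.if(5):]

# /etc/ifconfig.carp0 -- hypothetical example values
create
vhid 1 advskew 0 pass examplepass
inet 192.168.1.100 netmask 255.255.255.0

[The underlying interface is chosen by matching the carp address's subnet against the configured interfaces, so the shared address must lie in the same subnet as the xennet interface's own address.]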
On Wed, Feb 15, 2012 at 07:29:54PM +0300, Anton Samsonov wrote:
> 2012/2/10 Konrad Rzeszutek Wilk <konrad@darnok.org>:
>
> AS>> Dom0 is openSUSE 12.1 on AMD64 (Linux 3.1.0-1.2-xen)
> KRW> Do you get the same issue with a pv-ops dom0? So also 3.1, but
> KRW> from kernel.org?
>
> Unfortunately, I'm not skilled at compiling the kernel myself. I tried
> building the newest 3.2.6 with all Xen options (which I could find by
> the "Xen" keyword) enabled, but the resulting system didn't have the
> netback.ko module at all, barely booted, and xm was not able to
> communicate with the hypervisor.

See this wiki page for all the common troubleshooting steps when xend does not start:
wiki.xen.org/wiki/Xen_Common_Problems#Starting_xend_fails.3F

About compiling the dom0 kernel, see this wiki page:
wiki.xen.org/wiki/XenParavirtOps#Configure_kernel_for_dom0_support

Hopefully those help..

-- Pasi
2012/2/15 Pasi Kärkkäinen <pasik@iki.fi>:

AS>> Unfortunately, I'm not skilled at compiling the kernel myself.
AS>> I tried building the newest 3.2.6 with all Xen options enabled,
AS>> but the resulting system didn't have the netback.ko module at all,
AS>> barely booted, and xm was not able to communicate with the VMM.
PK> About compiling the dom0 kernel, see this wiki page:
PK> wiki.xen.org/wiki/XenParavirtOps#Configure_kernel_for_dom0_support

Thanks, it looks like I just mixed up some "y" and "m" options ("m" is presented as a meaningless bullet in the GUI). By the way, that how-to contains dubious lines for CONFIG_XEN_DEV_EVTCHN and CONFIG_XEN_GNTDEV.

Well, now the system boots more eagerly, although the kernel still seems to be slightly incompatible with the distro's environment and my hardware. But at least xend is now responding and is able to run DomUs as usual.

I started and stopped the whole swarm of VMs several times (without letting them run for long), and observed no GPF. But instead I get screen garbling: while a DomU is starting or stopping, the whole graphical desktop is sometimes painted with either black or not-so-random garbage, and even the mouse pointer can become garbled; I have to move or resize windows to get them repainted. Network connectivity between Dom0 and [subsequently started] DomUs does not break, though.

On one hand, I am not sure whether the video driver is not to be blamed for the glitches, because graphics already does not work as usual: it is not hardware-accelerated with my custom kernel (while it is with the stock kernel), and the screen is garbled on Xorg startup, before the login prompt is displayed. On the other hand, this is not in any way normal, as Xen operations must not interfere with Dom0's desktop (or was it direct VRAM corruption?).

This happens even when the "suspicious" domains (NetBSD with CARP) are not started: on a freshly booted Dom0, just having 4 essential DomUs is enough to get that screen garbling when shutting down 1 or 2 of them for the first time.

But when I return to the stock kernel, I can run a dozen of such DomUs (including those NetBSD load balancers), starting and stopping them many times without a problem. Recently, no GPF occurred when only 1 out of 2 balancers was started, or when none of them was started at all; or maybe it just needs much more uptime to accumulate memory corruption for a GPF.
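[Editor's note: for reference, the dom0-relevant Kconfig symbols discussed here look roughly like the fragment below. This is a sketch only; symbol names are as in 3.x pv-ops kernels, the exact set varies by version, and the y/m split is precisely what is easy to get wrong in the menuconfig GUI:]

CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_BACKEND=y
CONFIG_XEN_NETDEV_BACKEND=m   # netback (built as xen-netback.ko in pv-ops)
CONFIG_XEN_BLKDEV_BACKEND=m   # blkback
CONFIG_XEN_DEV_EVTCHN=m
CONFIG_XEN_GNTDEV=m
CONFIG_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y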
On Tue, Feb 21, 2012 at 06:06:14PM +0300, Anton Samsonov wrote:
> [...]
>
> On one hand, I am not sure whether the video driver is not to be blamed
> for the glitches, because graphics already does not work as usual: it is
> not hardware-accelerated with my custom kernel (while it is with the
> stock kernel), and the screen is garbled on Xorg startup, before the
> login prompt is displayed. On the other hand, this is not

So... I am curious: what graphics card do you have, and do you hit any of these Red Hat BZs? RH BZ# 742032, 787403, and 745574?

> in any way normal, as Xen operations must not interfere with Dom0's
> desktop (or was it direct VRAM corruption?).

It is complicated. There is a bug in 3.2 when using radeon or nouveau for a lengthy period of time that ends up "corrupting" memory. The workaround is to provide 'nopat' on the kernel command line.

> This happens even when "suspicious" domains (NetBSD with CARP) are not
> started: on a freshly booted Dom0, just having 4 essential DomUs is
> enough to get that screen garbling when shutting down 1 or 2 of them
> for the first time.

Hmm, that is weird. Never seen that before. Can you include more details on your machine?

> But when I return to the stock kernel, I can run a dozen of such DomUs
> (including those NetBSD load balancers), starting and stopping them many
> times without a problem. Recently, no GPF occurred when only 1 out of 2
> balancers was started, or none of them started at all; or it just needs
> much more uptime to accumulate memory corruption for a GPF.
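[Editor's note: for concreteness, 'nopat' goes on the dom0 kernel's line in the bootloader configuration. With the GRUB legacy setup that openSUSE 12.1 uses for Xen boots, it would look roughly like the sketch below; the paths, device, and version strings are illustrative:]

title Xen
    root (hd0,0)
    kernel /boot/xen.gz
    module /boot/vmlinuz-3.2.6-xen nopat
    module /boot/initrd-3.2.6-xen

[Note that dom0 kernel parameters go on the dom0 "module" line, not on the "kernel" line, which carries the hypervisor's own options.]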
2012/2/21 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>:

AS>> With the custom-built kernel, I didn't yet see any GPF, but screen garbling
AS>> happens almost every time a DomU is starting or stopping:
AS>> the whole graphical desktop is painted with either black or not-so-random garbage.
KRW> So... I am curious, what graphics card do you have, and do you get any of
KRW> these Red Hat BZs? RH BZ# 742032, 787403, and 745574?
KRW> There is a bug in 3.2 when using radeon or nouveau for a lengthy time
KRW> that ends up "corrupting" memory. The workaround is the 'nopat' kernel arg.

My idea is that my custom-built kernel can't be considered a trusted proving ground, as it is of very low quality. The video issue is just the most obvious example. Another indisputable example is how Dom0 reboots: instead of a simple CPU restart, the whole system goes into soft-off for several seconds, then wakes back up. When I boot this kernel in bare-metal mode (without the Xen VMM), none of this happens: the GUI is accelerated (at least in 2D; I don't use an OpenGL desktop), the screen is not garbled at the login and logout dialogs, and the system reboots quickly.

Anyway, I tried your solution with "nopat". It didn't work: with 4 DomUs running for a minute and then shut down in reverse order (4th, 3rd, 2nd, 1st), the screen went black right after the 3rd VM was completely shut down and before the 2nd VM was asked to shut down. There was no "lengthy time" of Dom0 running, my video adapter is neither nVidia nor ATI but an integrated Intel HD Graphics 2000 using the i915 driver, and I see no similarities to the Red Hat bugs you mentioned.

KRW> Can you include more details on your machine?

My guess is that it is not just my hardware that causes the GPF, but either a bug in the netback module, or a compiler issue for the specific combination of Xen (and/or particularly netback) together with openSUSE build technology.

As an example of the latter, look again at Novell BZ #727081 mentioned in the original post; comment #30 says: "The compiler apparently makes use of the 128-byte area called 'red zone' in the ABI, and this is incompatible with xc_cpuid_x86.c:cpuid() using pushes and pops around the cpuid instruction". The consequence is that, on some machines, libxenguest segfaults when you try to start a DomU. With a Core i7-920 there is no problem, but with a Core i5-2300 I faced that issue, and I wonder whether the same incompatibility could take place in the netback module. I thought the traceback might give some hints on where to debug.

My specs are:
MB: Asus P8H67-M (Intel H67 chipset)
CPU: Intel Core i5 model 2300 (Turbo mode disabled)
RAM: 12 GB DDR3-1333 non-ECC (recently checked by MemTest86+ 4.20)
Video: Intel HD Graphics 2000 (integrated into CPU)
Network: dedicated soft bridge for most DomUs, plus a bridged Realtek RTL8111E for the gateway DomU (not with CARP)
On Wed, Feb 22, 2012 at 03:17:24PM +0300, Anton Samsonov wrote:
> [...]
>
> My guess is that it is not just my hardware that causes the GPF, but
> either a bug in the netback module, or a compiler issue for the specific
> combination of Xen (and/or particularly netback) together with openSUSE
> build technology.
>
> My specs are:
> MB: Asus P8H67-M (Intel H67 chipset)
> CPU: Intel Core i5 model 2300 (Turbo mode disabled)
> RAM: 12GB DDR3-1333 non-ECC (recently checked by MemTest86+ 4.20)

Do you have CONFIG_DMAR enabled in your config?

> Video: Intel HD Graphics 2000 (integrated into CPU)
> Network: dedicated soft-bridge for most DomUs,
> + bridged Realtek RTL8111E for gateway DomU (not with CARP)
>>> Anton Samsonov <avscomputing@gmail.com> 02/22/12 11:46 PM >>>
> As an example of the latter, look again at Novell BZ #727081 mentioned
> in the original post; comment #30 says: "The compiler apparently makes use
> of the 128-byte area called 'red zone' in the ABI, and this is incompatible
> with xc_cpuid_x86.c:cpuid() using pushes and pops around the cpuid instruction".
> The consequence is that, on some machines, libxenguest segfaults when you
> try to start a DomU. With a Core i7-920 there is no problem, but with a
> Core i5-2300 I faced that issue, and I wonder whether the same incompatibility
> could take place in the netback module. I thought the traceback might give
> some hints on where to debug.

That's impossible: use of the red zone is disallowed in the kernel via a compiler option. And the problem you cite was a source-code problem, not a compiler one (the fact that it had an effect only on some systems was attributed, IIRC, to the function in question only getting run when a specific hardware feature was available).

Jan