Hi All,
This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
(Sorry it's a little late. :-) If the status changes, I'll have an update later.)

Test environment:
Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
Dom0: Linux kernel 3.9.3
Hardware: Intel Sandy Bridge, Ivy Bridge, Haswell systems

Below are the features we tested.
- PV and HVM guest booting (HVM: Ubuntu, Fedora, RHEL, Windows)
- Save/Restore and live migration
- PCI device assignment and SR-IOV
- Power management: C-state/P-state, Dom0 S3, HVM S3
- AVX and XSAVE instruction sets
- MCE
- CPU online/offline for Dom0
- vCPU hot-plug
- Nested virtualization (please see my report at the following link.)
  http://lists.xen.org/archives/html/xen-devel/2013-05/msg01145.html

New bugs (4): (some of which are not regressions)
1. Sometimes failed to online a CPU in Dom0
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
2. Dom0 call trace when running an SR-IOV HVM guest with igbvf
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
   -- a regression in the Linux kernel (Dom0).
3. Booting multiple guests leads to a Dom0 call trace
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
4. After live migration, the guest console continuously prints "Clocksource tsc unstable"
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854

Old bugs (11):
1. [ACPI] Dom0 can't resume from S3 sleep
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
2. [XL] "xl vcpu-set" causes Dom0 crash or panic
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
3. Sometimes Xen panics on ia32pae Sandy Bridge when restoring a guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
4. 'xl vcpu-set' can't decrease the vCPU number of an HVM guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
5. Dom0 cannot be shut down before PCI device detachment from the guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
6. xl pci-list shows one PCI device (PF or VF) can be assigned to two different guests
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
7. [upstream qemu] Guest free memory with upstream QEMU is 14MB lower than with qemu-xen-unstable.git
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
8. [upstream qemu] 'maxvcpus=NUM' item is not supported in upstream QEMU
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
9. [upstream qemu] Guest console hangs after save/restore or live migration when setting 'hpet=0' in the guest config file
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
10. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839
11. Live migration fails when migrating the same guest more than 2 times
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1845

Best Regards,
Yongjie (Jay)
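A note on new bug 4: which clocksource the guest kernel has actually selected can be checked from inside the guest through the standard kernel sysfs interface. These are the stock kernel paths, not anything Xen-specific; the report does not show how the symptom was confirmed, so this is only a reference:

    # inside the guest: show the active and the available clocksources
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource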
On Mon, May 27, 2013 at 03:49:27AM +0000, Ren, Yongjie wrote:
> Hi All,
> This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
> (Sorry it's a little late. :-) If the status changes, I'll have an update later.)

OK, I've some updates and ideas that can help with narrowing some of these
issues down. Thank you for doing this.

> Test environment:
> Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
> Dom0: Linux kernel 3.9.3

Could you please test v3.10-rc3. There have been some changes for VCPU
hotplug added in v3.10 that I am not sure are in v3.9.

> Hardware: Intel Sandy Bridge, Ivy Bridge, Haswell systems
>
> Below are the features we tested.
.. snip..
> New bugs (4): (some of which are not regressions)
> 1. sometimes failed to online cpu in Dom0
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851

That looks like you are hitting the udev race.

Could you verify that these patches:
https://lkml.org/lkml/2013/5/13/520
fix the issue. (They are destined for v3.11.)

> 2. dom0 call trace when running sriov hvm guest with igbvf
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> -- a regression in Linux kernel (Dom0).

Hm, the call trace you refer to:

[ 68.404440] Already setup the GSI :37
[ 68.405105] igb 0000:04:00.0: Enabling SR-IOV VFs using the module parameter is deprecated - please use the pci sysfs interface.
[ 68.506230] ------------[ cut here ]------------
[ 68.506265] WARNING: at /home/www/builds_xen_unstable/xen-src-27009-20130509/linux-2.6-pvops.git/fs/sysfs/dir.c:536 sysfs_add_one+0xcc/0xf0()
[ 68.506279] Hardware name: S2600CP

is a deprecation warning. Did you follow the 'pci sysfs' interface way?

Looking at da36b64736cf2552e7fb5109c0255d4af804f5e7
    ixgbe: Implement PCI SR-IOV sysfs callback operation
it says it is using this:

commit 1789382a72a537447d65ea4131d8bcc1ad85ce7b
Author: Donald Dutile <ddutile@redhat.com>
Date:   Mon Nov 5 15:20:36 2012 -0500

    PCI: SRIOV control and status via sysfs

    Provide files under sysfs to determine the maximum number of VFs
    an SR-IOV-capable PCIe device supports, and methods to enable and
    disable the VFs on a per-device basis.

    Currently, VF enablement by SR-IOV-capable PCIe devices is done
    via driver-specific module parameters.  If not setup in modprobe
    files, it requires admin to unload & reload PF drivers with number
    of desired VFs to enable.  Additionally, the enablement is system
    wide: all devices controlled by the same driver have the same
    number of VFs enabled.  Although the latter is probably desired,
    there are PCI configurations setup by system BIOS that may not
    enable that to occur.

    Two files are created for the PF of PCIe devices with SR-IOV support:

    sriov_totalvfs  Contains the maximum number of VFs the device
                    could support as reported by the TotalVFs register
                    in the SR-IOV extended capability.

    sriov_numvfs    Contains the number of VFs currently enabled on
                    this device as reported by the NumVFs register in
                    the SR-IOV extended capability.

                    Writing zero to this file disables all VFs.

                    Writing a positive number to this file enables that
                    number of VFs.

    These files are readable for all SR-IOV PF devices.  Writes to the
    sriov_numvfs file are effective only if a driver that supports the
    sriov_configure() method is attached.

    Signed-off-by: Donald Dutile <ddutile@redhat.com>

Can you try that please?

> 3. Booting multiple guests will lead Dom0 call trace
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853

That one worries me. Did you do a git bisect to figure out which
commit is causing this?

> 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854

This looks like a current bug with QEMU unstable missing an ACPI table?

Did you try booting the guest with the old QEMU?

device_model_version = 'qemu-xen-traditional'

> Old bugs: (11)
> 1. [ACPI] Dom0 can't resume from S3 sleep
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707

That should be fixed in v3.11 (as now we have the fixes).
Could you try v3.10 with Rafael's ACPI tree merged in?
(So the patches that he wants to submit for v3.11.)

> 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730

That I think is fixed in v3.10. Could you please check v3.10-rc3?

> 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747

That looks to be with v2.6.32. Is the issue present with v3.9
or v3.10-rc3?

> 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822

That I believe was a QEMU bug:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
which should be in QEMU traditional now (05-21 was when it went
in the tree).

> 5. Dom0 cannot be shutdown before PCI device detachment from guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826

Ok, I can reproduce that too.

> 6. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834

OK, I can reproduce that too:

> xl create /vm-pv.cfg
Parsing config from /vm-pv.cfg
libxl: error: libxl_pci.c:1043:libxl__device_pci_add: PCI device 0:1:0.0 is not assignable
Daemon running with PID 3933

15:11:17 # 16 :/mnt/lab/latest/
> xl pci-list 1
Vdev  Device
05.0  0000:01:00.0

> xl list
Name          ID   Mem  VCPUs  State   Time(s)
Domain-0       0  2047      4  r-----     26.7
latest         1  2043      1  -b----      5.3
latestadesa    4  1024      3  -b----      5.1

15:11:20 # 20 :/mnt/lab/latest/
> xl pci-list 4
Vdev  Device
00.0  0000:01:00.0

The rest I hadn't had a chance to look at. George, have you seen
these issues?

> 7. [upstream qemu] Guest free memory with upstream qemu is 14MB lower than that with qemu-xen-unstable.git
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
> 8. [upstream qemu] 'maxvcpus=NUM' item is not supported in upstream QEMU
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
> 9. [upstream qemu] Guest console hangs after save/restore or live-migration when setting 'hpet=0' in guest config file
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
> 10. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839
> 11. Live migration fail when migrating the same guest for more than 2 times
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1845
>
> Best Regards,
> Yongjie (Jay)
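For reference, the sysfs method described in the commit message above amounts to writing the desired VF count into the PF's sriov_numvfs file. A minimal sketch, assuming the igb PF at 0000:04:00.0 from the quoted call trace (the count of 7 is an arbitrary example, not a value from the thread):

    # how many VFs can this PF support?
    cat /sys/bus/pci/devices/0000:04:00.0/sriov_totalvfs
    # enable 7 VFs; writing 0 disables them again
    echo 7 > /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs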
> > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>
> Ok, I can reproduce that too.

This is what dom0 tells me:

[  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
[  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.620747] init    D ffff880062b59c78  5904  4163  1 0x00000000
[  483.637699] ffff880062b59bc8 0000000000000...
[  483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
[  483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
[  483.689527] Call Trace:
[  483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
[  483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
[  483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
[  483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
[  483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
[  483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
[  483.809929] [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
[  483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
[  483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
[  483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
[  483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
[  483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
[  483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
[  483.927100] [<ffffffff810a88a1>] kernel_restart+0x11...
[  483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
[  483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x...
[  483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
[  483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
[  484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x...
[  484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
[  484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
[  484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
[  484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[  484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
[  484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
[  484.116585] 3 locks held by init/4163:
[  484.131650] #0:  (...){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
[  484.147704] #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
[  484.164359] #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0

create !
title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."
On 28/05/13 16:21, Konrad Rzeszutek Wilk wrote:
>>> 5. Dom0 cannot be shutdown before PCI device detachment from guest
>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>> Ok, I can reproduce that too.
> This is what dom0 tells me:
>
> [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
.. snip..
> create !
> title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."

1. I think that these commands have to come at the top
2. You don't need quotes in the title
3. You need to be polite and say "thanks" at the end so it knows it can stop paying attention. :-)

 -George
Sorry for replying late. :-)

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Tuesday, May 28, 2013 11:16 PM
> To: Ren, Yongjie; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> On Mon, May 27, 2013 at 03:49:27AM +0000, Ren, Yongjie wrote:
> > Hi All,
> > This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
> > (Sorry it's a little late. :-) If the status changes, I'll have an update later.)
>
> OK, I've some updates and ideas that can help with narrowing some of these
> issues down. Thank you for doing this.
>
> > Test environment:
> > Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
> > Dom0: Linux kernel 3.9.3
>
> Could you please test v3.10-rc3. There have been some changes
> for VCPU hotplug added in v3.10 that I am not sure are in v3.9.

I didn't try every bug with v3.10-rc3, but most of them still exist.

> > New bugs (4): (some of which are not regressions)
> > 1. sometimes failed to online cpu in Dom0
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
>
> That looks like you are hitting the udev race.
>
> Could you verify that these patches:
> https://lkml.org/lkml/2013/5/13/520
> fix the issue. (They are destined for v3.11.)

Not tried yet. I'll update it to you later.

> > 2. dom0 call trace when running sriov hvm guest with igbvf
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> > -- a regression in Linux kernel (Dom0).
>
> Hm, the call trace you refer to:
.. snip..
> Can you try that please?

Recently, one of my workmates already had a fix as below.
https://lkml.org/lkml/2013/5/30/20
And it seems to have also already been fixed by another guy.
https://patchwork.kernel.org/patch/2613481/

> > 3. Booting multiple guests will lead Dom0 call trace
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
>
> That one worries me. Did you do a git bisect to figure out which
> commit is causing this?

I only found this bug on some Intel ~EX servers.
I don't know which version of Xen/Dom0 works fine.
If anyone wants to reproduce or debug it, that would be good.
And our team is trying to debug it internally first.

> > 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854
>
> This looks like a current bug with QEMU unstable missing an ACPI table?
>
> Did you try booting the guest with the old QEMU?
>
> device_model_version = 'qemu-xen-traditional'

This issue still exists with traditional qemu-xen.
After more testing, this bug can't be reproduced with some other guests.
A RHEL6.4 guest will have this issue after live migration, while RHEL6.3,
Fedora 17 and Ubuntu 12.10 guests work fine.

> > Old bugs: (11)
> > 1. [ACPI] Dom0 can't resume from S3 sleep
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
>
> That should be fixed in v3.11 (as now we have the fixes).
> Could you try v3.10 with Rafael's ACPI tree merged in?
> (So the patches that he wants to submit for v3.11.)

I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branches),
and found Dom0 S3 sleep/resume doesn't work, either.

> > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
>
> That I think is fixed in v3.10. Could you please check v3.10-rc3?

Still exists on v3.10-rc3.
The following command lines can reproduce it:
# xl vcpu-set 0 1
# xl vcpu-set 0 20

> > 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
>
> That looks to be with v2.6.32. Is the issue present with v3.9
> or v3.10-rc3?

We haven't tested ia32pae Xen for a long time.
Now, we only cover ia32e Xen/Dom0.
So this bug is only a legacy issue.
If we have the effort to verify it, we'll update it in the bugzilla.

> > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
>
> That I believe was a QEMU bug:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> which should be in QEMU traditional now (05-21 was when it went
> in the tree).

In this year and the past year, this bug has always existed (at least in our
testing): 'xl vcpu-set' can't decrease the vCPU number of an HVM guest.

- Jay

> > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>
> Ok, I can reproduce that too.
>
> > 6. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
>
> OK, I can reproduce that too:
.. snip..
> The rest I hadn't had a chance to look at. George, have you seen
> these issues?
.. snip..
On Tue, Jun 04, 2013 at 03:59:33PM +0000, Ren, Yongjie wrote:
> Sorry for replying late. :-)
>
.. snip..
> > > New bugs (4): (some of which are not regressions)
> > > 1. sometimes failed to online cpu in Dom0
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> >
> > That looks like you are hitting the udev race.
> >
> > Could you verify that these patches:
> > https://lkml.org/lkml/2013/5/13/520
> > fix the issue. (They are destined for v3.11.)
>
> Not tried yet. I'll update it to you later.

Thanks!

> > > 2. dom0 call trace when running sriov hvm guest with igbvf
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> > > -- a regression in Linux kernel (Dom0).
.. snip..
> > Can you try that please?
>
> Recently, one of my workmates already had a fix as below.
> https://lkml.org/lkml/2013/5/30/20
> And it seems to have also already been fixed by another guy.
> https://patchwork.kernel.org/patch/2613481/

Great! Care to update the bug with said relevant information?

> > > 3. Booting multiple guests will lead Dom0 call trace
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
> >
> > That one worries me. Did you do a git bisect to figure out which
> > commit is causing this?
>
> I only found this bug on some Intel ~EX servers.
> I don't know which version of Xen/Dom0 works fine.
> If anyone wants to reproduce or debug it, that would be good.
> And our team is trying to debug it internally first.

Ah, OK. Then please continue on debugging it. Thanks!

> > > 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854
> >
> > This looks like a current bug with QEMU unstable missing an ACPI table?
> >
> > Did you try booting the guest with the old QEMU?
> >
> > device_model_version = 'qemu-xen-traditional'
>
> This issue still exists with traditional qemu-xen.
> After more testing, this bug can't be reproduced with some other guests.
> A RHEL6.4 guest will have this issue after live migration, while RHEL6.3,
> Fedora 17 and Ubuntu 12.10 guests work fine.

There is a recent thread on this where the culprit was the PV timeclock
not being updated correctly. But that would seem to be at odds with your
reporting - where you are using Fedora 17 and it works fine.

Hm, I am at a loss on this one.

> > > Old bugs: (11)
> > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> >
> > That should be fixed in v3.11 (as now we have the fixes).
> > Could you try v3.10 with Rafael's ACPI tree merged in?
> > (So the patches that he wants to submit for v3.11.)
>
> I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branches),
> and found Dom0 S3 sleep/resume doesn't work, either.

The patches he has to submit for v3.11 are in the linux-next branch.
You need to use that branch.

> > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> >
> > That I think is fixed in v3.10. Could you please check v3.10-rc3?
>
> Still exists on v3.10-rc3.
> The following command lines can reproduce it:
> # xl vcpu-set 0 1
> # xl vcpu-set 0 20

Ugh, same exact stack trace? And can you attach the full dmesg or serial
output (so that I can see what there is at bootup)?

> > > 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
> >
> > That looks to be with v2.6.32. Is the issue present with v3.9
> > or v3.10-rc3?
>
> We haven't tested ia32pae Xen for a long time.
> Now, we only cover ia32e Xen/Dom0.
> So this bug is only a legacy issue.
> If we have the effort to verify it, we'll update it in the bugzilla.

How about just dropping that bug as 'WONTFIX'?

> > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> >
> > That I believe was a QEMU bug:
> > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > which should be in QEMU traditional now (05-21 was when it went
> > in the tree).
>
> In this year and the past year, this bug has always existed (at least in our
> testing): 'xl vcpu-set' can't decrease the vCPU number of an HVM guest.

Could you retry with Xen 4.3 please?
> > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > >
> > > > That looks like you are hitting the udev race.
> > > >
> > > > Could you verify that these patches:
> > > > https://lkml.org/lkml/2013/5/13/520
> > > > fix the issue. (They are destined for v3.11.)
> > >
> > > Not tried yet. I'll update it to you later.
> >
> > Thanks!
>
> We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> and found 2 times (of the 100) failed.

Hm, does it fail b/c udev can't online the sysfs entry?

.. snip..
> > > > > Old bugs: (11)
> > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> > > >
> > > > That should be fixed in v3.11 (as now we have the fixes).
> > > > Could you try v3.10 with Rafael's ACPI tree merged in?
> > > > (So the patches that he wants to submit for v3.11.)
> > >
> > > I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug
> > > branches), and found Dom0 S3 sleep/resume doesn't work, either.
> >
> > The patches he has to submit for v3.11 are in the linux-next branch.
> > You need to use that branch.
>
> Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> Attached the log.

It does work on my box. So I am not sure if this is related to the
IvyTown box you are using. Does it work on other machines?

> > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> > > >
> > > > That I think is fixed in v3.10. Could you please check v3.10-rc3?
> > >
> > > Still exists on v3.10-rc3.
> > > The following command lines can reproduce it:
> > > # xl vcpu-set 0 1
> > > # xl vcpu-set 0 20
> >
> > Ugh, same exact stack trace? And can you attach the full dmesg or serial
> > output (so that I can see what there is at bootup)?
>
> Yes, the same. Also attached in this mail.

One of the fixes is this one:
http://www.gossamer-threads.com/lists/xen/devel/284897

but the other ones I had not seen. I am wondering if the
update_sd_lb_stats is b/c of the previous conditions (that is, that
tick_nohz_idle_start hadn't been called).

It is a shot in the dark - but if you use the above-mentioned patch
do you still see the update_sd_lb_stats crash?

> > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.

Could you give some more details? Could you include the
/var/log/xen/qemu-... log file?

You are using the traditional QEMU right? (You need to have this in your
guest config:
device_model_version = 'qemu-xen-traditional')
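For reference, a minimal guest config carrying that setting might look like the following sketch. Apart from device_model_version, the values (name, memory, disk path) are hypothetical placeholders, not details from the thread:

    # minimal HVM guest config sketch for xl
    builder = 'hvm'
    name    = 'testguest'
    memory  = 2048
    vcpus   = 4
    maxvcpus = 32
    disk    = [ '/path/to/guest.img,raw,xvda,rw' ]
    device_model_version = 'qemu-xen-traditional'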
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, June 05, 2013 10:50 PM
> To: Ren, Yongjie
> Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> Yongxue; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
.. snip..
> > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > and found 2 times (of the 100) failed.
>
> Hm, does it fail b/c udev can't online the sysfs entry?

I think no.
When it fails to online CPU #3 (trying to online #1-#3), it doesn't show any
info about CPU #3 in the output of the "udevadm monitor --env" command. It does
show info about #1 and #2, which are onlined successfully.

> .. snip..
> > > > > > Old bugs: (11)
> > > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
.. snip..
> > Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> > Attached the log.
>
> It does work on my box. So I am not sure if this is related to the
> IvyTown box you are using. Does it work on other machines?

No, it doesn't work on other machines, either. I also tried on Sandy Bridge,
Ivy Bridge desktop and Haswell mobile machines.

> > > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
.. snip..
> One of the fixes is this one:
> http://www.gossamer-threads.com/lists/xen/devel/284897
>
> but the other ones I had not seen. I am wondering if the
> update_sd_lb_stats is b/c of the previous conditions (that is, that
> tick_nohz_idle_start hadn't been called).
>
> It is a shot in the dark - but if you use the above-mentioned patch
> do you still see the update_sd_lb_stats crash?

Yes, with the patch we still see the update_sd_lb_stats crash.
It has almost the same trace log as before. Log file is attached.

> > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.

Sorry, when I sent that message, I was still using the rhel6.4 kernel as the
guest. After upgrading the guest kernel to 3.10.0-rc3, the result became better.
Basically vCPU increment/decrement can work fine. I'll close that bug.
But there's still a minor issue as follows.
After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
# xl vcpu-set $domID 32
Then you only get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
number to 32 again (from 19), and the guest does get 32 vCPUs.
But 'xl vcpu-set $domID 8' works fine as expected.
vCPU decrement has the same result.
Can you also have a try at reproducing my issue?

> Could you give some more details? Could you include the
> /var/log/xen/qemu-... log file?

Attached the qemu log.

> You are using the traditional QEMU right? (You need to have this in your
> guest config:
> device_model_version = 'qemu-xen-traditional')

Yes.

--
Jay
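For reference, the vCPU count issue described above can be stated as a short repro sequence. The numbers follow the description in the mail; using xl vcpu-list to observe the result is an assumption, since the report doesn't say how the count was checked:

    # guest config carries: vcpus=4, maxvcpus=32
    xl vcpu-set $domID 32    # guest may come up with fewer (e.g. 19) online vCPUs
    xl vcpu-list $domID      # check how many vCPUs actually came online
    xl vcpu-set $domID 32    # repeating the same command brings it to 32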
On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote:
.. snip..
> > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > bug still exists. For example, we did CPU online-offline for Dom0 100
> > > times, and found 2 times (of the 100) failed.
> >
> > Hm, does it fail b/c udev can't online the sysfs entry?
>
> I think no.
> When it fails to online CPU #3 (trying to online #1-#3), it doesn't show any
> info about CPU #3 in the output of the "udevadm monitor --env" command. It
> does show info about #1 and #2, which are onlined successfully.

And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?

> > > > > Old bugs: (11)
> > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
.. snip..
> > It does work on my box. So I am not sure if this is related to the
> > IvyTown box you are using. Does it work on other machines?
>
> No, it doesn't work on other machines, either. I also tried on Sandy Bridge,
> Ivy Bridge desktop and Haswell mobile machines.

I just double-checked on my AMD machines with v3.10-rc5 with
these extra patches:

ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
7c4ae96 Revert "xen/pat: Disable PAT support for now."
729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
bd4fd16 microcode_xen: Add support for AMD family >= 15h
6271c21 x86/microcode: check proper return code.
b9a48c8 xen: add CPU microcode update driver
c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.

and it worked. Let me recompile a kernel without most of them to double-check
whether those patches are making the ACPI S3 suspend/resume work.
This is with Xen 4.3 (82cb411). The machine is an M5A97, BIOS 1208 04/18/2012, with
01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1)
as its graphics card.

> > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
.. snip..
> > It is a shot in the dark - but if you use the above-mentioned patch
> > do you still see the update_sd_lb_stats crash?
>
> Yes, with the patch we still see the update_sd_lb_stats crash.
> It has almost the same trace log as before. Log file is attached.

Would it be possible to do a bit of 'git bisect' to figure out why
this started?

> > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> Sorry, when I sent that message, I was still using the rhel6.4 kernel as the
> guest. After upgrading the guest kernel to 3.10.0-rc3, the result became
> better. Basically vCPU increment/decrement can work fine. I'll close that bug.

Excellent!

> But there's still a minor issue as follows.
> After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> # xl vcpu-set $domID 32
> Then you only get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> number to 32 again (from 19), and the guest does get 32 vCPUs.
> But 'xl vcpu-set $domID 8' works fine as expected.
> vCPU decrement has the same result.
> Can you also have a try at reproducing my issue?

Sure. Now how many pCPUs do you have? And what version of QEMU traditional
were you using?

> > Could you give some more details? Could you include the
> > /var/log/xen/qemu-... log file?
>
> Attached the qemu log.

Thank you.

> > You are using the traditional QEMU right? (You need to have this in your
> > guest config:
> > device_model_version = 'qemu-xen-traditional')
>
> Yes.
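For reference, the kind of bisect being asked for would look roughly like this. The good/bad versions are placeholders, since the thread never established a known-good kernel:

    git bisect start
    git bisect bad v3.10-rc3    # kernel that shows the crash
    git bisect good v3.4        # hypothetical last known-good kernel
    # build and boot each kernel git offers as dom0, run the repro,
    # then mark the result:
    git bisect good             # or: git bisect bad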
> I just double-checked on my AMD machines with v3.10-rc5 with
> these extra patches:
>
> ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
> 7c4ae96 Revert "xen/pat: Disable PAT support for now."
> 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
> bd4fd16 microcode_xen: Add support for AMD family >= 15h
> 6271c21 x86/microcode: check proper return code.
> b9a48c8 xen: add CPU microcode update driver
> c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
> f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
> 29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
>
> and it worked. Let me recompile a kernel without most of them to double-check
> whether those patches are making the ACPI S3 suspend/resume work.

Still works. I removed all but:

c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()

on top of 3.10-rc6 and the suspend/resume on the host works.
On Mon, Jun 17, 2013 at 04:35:39PM -0400, Konrad Rzeszutek Wilk wrote:
> > I just double-checked on my AMD machines with v3.10-rc5 with
> > these extra patches:
.. snip..
> Still works. I removed all but:
>
> c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
>
> on top of 3.10-rc6 and the suspend/resume on the host works.

Correction. This is v3.10-rc6 + Rafael's linux-next branch which had:

f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
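For reference, host S3 in tests like these is normally triggered through the kernel's standard sysfs interface. This is an assumption, as the thread doesn't show the exact command either side used:

    # put the host (dom0) into S3; it should wake on e.g. the power button
    echo mem > /sys/power/state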
> -----Original Message----- > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > Sent: Monday, June 17, 2013 10:23 PM > To: Ren, Yongjie > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian, > Yongxue; xen-devel@lists.xen.org > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1 > > On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote: > > > -----Original Message----- > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > > > Sent: Wednesday, June 05, 2013 10:50 PM > > > To: Ren, Yongjie > > > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian, > > > Yongxue; xen-devel@lists.xen.org > > > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1 > > > > > > > > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851 > > > > > > > > > > > > > > That looks like you are hitting the udev race. > > > > > > > > > > > > > > Could you verify that these patches: > > > > > > > https://lkml.org/lkml/2013/5/13/520 > > > > > > > > > > > > > > fix the issue (They are destined for v3.11) > > > > > > > > > > > > > Not tried yet. I''ll update it to you later. > > > > > > > > > > Thanks! > > > > > > > > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found > this > > > > bug still exist. For example, we did CPU online-offline for Dom0 for > 100 > > > times, > > > > and found 2 times (of 100 times) failed. > > > > > > Hm, does it fail b/c udev can''t online the sysfs entry? > > > > > I think no. > > When it fails to online CPU #3 (trying online #1~#3), it doesn''t show any > info > > about CPU #3 via the output of "devadm monitor --env" CMD. It does > show > > info about #1 and #2 which are onlined succefully. > > And if you re-trigger the the ''xl vcpu-set'' it eventually comes back up right? >We don''t use ''xl vcpu-set'' command when doing the CPU hot-plug. We just call the xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test. (see the attachment about my test code in that bugzilla.) And, yes, if a CPU failed to online, it can also be onlined again when we re-trigger online function.> > > > > .. snip.. > > > > > > > > > > > > > > > > > > > > > > Old bugs: (11) > > > > > > > > 1. [ACPI] Dom0 can''t resume from S3 sleep > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707 > > > > > > > > > > > > > > That should be fixed in v3.11 (as now we have the fixes) > > > > > > > Could you try v3.10 with the Rafael''s ACPI tree merged in? > > > > > > > (so the patches that he wants to submit for v3.11) > > > > > > > > > > > > > I re-tested with Rafel''s linux-pm.git tree (master and acpi-hotplug > > > > > branch), > > > > > > and found Dom0 S3 sleep/resume can''t work, either. > > > > > > > > > > The patches he has to submit for v3.11 are in the linux-next branch. > > > > > You need to use that branch. > > > > > > > > > Dom0 S3 sleep/resume doesn''t work with linux-next branch, either. > > > > attached the log. > > > > > > It does work on my box. So I am not sure if this is related to the > > > IvyTown box you are using. Does it work on other machines? > > > > > No, it doesn''t work on other machines, either. I also tried on > SandyBridge, > > IvyBridge desktop and Haswell mobile machines. > > I just double checked on my AMD machines with v3.10-rc5 with > these extra patches: > > ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on > CPA/set_p.._wb/wc operations. > 7c4ae96 Revert "xen/pat: Disable PAT support for now." > 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value." 
> bd4fd16 microcode_xen: Add support for AMD family >= 15h > 6271c21 x86/microcode: check proper return code. > b9a48c8 xen: add CPU microcode update driver > c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is > emitted > 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in > register_cpu() > f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback. > 29ca6e9 x86 / ACPI / sleep: Provide registration for > acpi_suspend_lowlevel. > > and it worked. Let me recompile a kernel without most of them to > doublecheck > whether those patches are making the ACPI S3 suspend/resume working. > This is with Xen 4.3 (82cb411). The machine is M5A97, BIOS 1208 > 04/18/2012 > with 01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce > 8600 GT] (rev a1) > as its graphic card. >After re-testing with linux-pm.git tree (kernel:3.10.rc6+ commit: a913b188df) on my IvyTown-EP and IvyBridge desktop systems, Dom0 S3 sleep/resume can work! When these codes are upstreamed to linux.git tree, I can close this bug.> > > > > > > > > > > > > > > > > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730 > > > > > > > > > > > > > > That I think is fixed in v3.10. Could you please check v3.10-rc3? > > > > > > > > > > > > > Still exists on v3.10-rc3. > > > > > > The following command lines can reproduce it: > > > > > > # xl vcpu-set 0 1 > > > > > > # xl vcpu-set 0 20 > > > > > > > > > > Ugh, same exact stack trace? And can you attach the full dmesg or > > > serial > > > > > output (so that Ican see what there is at bootup) > > > > > > > > > Yes, the same. Also attached in this mail. > > > > > > One of the fixes is this one: > > > http://www.gossamer-threads.com/lists/xen/devel/284897 > > > > > > but the other ones I had not seen. I am wondering if the > > > update_sd_lb_stats is b/c of the previous conditions (that is the > > > tick_nohz_idle_start hadn''t been called). > > > > > > It is a shoot in the dark - but if you use the above mentioned patch > > > do you still see the update_sd_lb_stats crash? > > > > > Yes, with the patch we still see the update_sd_lb_stats crash. > > It has almost the same trace log as before. Log file is attached. > > Would it be possible to do a bit of ''git bisect'' to figure out why > this started? >It''s hard. This issue exists for a long time. We don''t even know which version of linux upstream as dom0 can work for this bug.> > > > > > > > 4. ''xl vcpu-set'' can''t decrease the vCPU number of a HVM > guest > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822 > > > > > > > > > > > > > > That I believe was an QEMU bug: > > > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html > > > > > > > > > > > > > > which should be in QEMU traditional now (05-21 was when it > went > > > > > > > in the tree) > > > > > > > > > > > > > In this year or past year, this bug always exists (at least in our > > > testing). > > > > > > ''xl vcpu-set'' can''t decrease the vCPU number of a HVM guest > > > > > > > > > > Could you retry with Xen 4.3 please? > > > > > > > > > With Xen 4.3 & Linux:3.10.0-rc3, I can''t decrease the vCPU number of > a > > > guest. > > > > > sorry, when I said this message, I still use rhel6.4 kernel as the guest. > > After upgrading guest kernel to 3.10.0-rc3, the result became better. > > Basically vCPU increment/decrement can work fine. I''ll close that bug. > > Excellent! 
> > But there's still a minor issue, as follows.
> > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > # xl vcpu-set $domID 32
> > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > vCPU decrement has the same result.
> > Can you also try to reproduce my issue?
>
This issue doesn't exist when using the latest QEMU traditional tree.
My previous QEMU was old (March 2013), and I found some of your patches
were applied in May 2013. Those fixes resolve the issue we reported.
Close this bug.

But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
several times (e.g. 5 times), the guest will panic. Log is attached.
Before your patches went into the qemu traditional tree in May 2013, we never
saw a guest kernel panic.
dom0: 3.10.0-rc3
Xen: 4.3.0-RCx
QEMU: the latest traditional tree
guest kernel: 3.10.0-RC3
Shall I file another bug to track this? Can you reproduce it?

> Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> were you using?
>
There are 32 pCPUs in the system we used.

Best Regards,
Yongjie (Jay)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
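For readers following along, the guest configuration that exercises this path
boils down to a 'vcpus' value below 'maxvcpus'; a minimal sketch (the name,
disk and bridge are placeholders, not taken from the report):

    # hvm-guest.cfg -- minimal sketch; only the vcpus/maxvcpus lines matter here
    builder  = "hvm"
    name     = "hvm-test"
    memory   = 2048
    vcpus    = 4              # vCPUs online at boot
    maxvcpus = 32             # ceiling that 'xl vcpu-set' may raise to
    disk     = [ "file:/path/to/guest.img,xvda,w" ]
    vif      = [ "bridge=xenbr0" ]

With such a config, 'xl vcpu-set <domid> 32' is the step that intermittently
brings up fewer vCPUs (e.g. 19) than requested, as described above.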
On Thu, Jun 20, 2013 at 02:53:06AM +0000, Ren, Yongjie wrote:
> > -----Original Message-----
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Monday, June 17, 2013 10:23 PM
> > To: Ren, Yongjie
> > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > Yongxue; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> >
> > On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote:
> > > > -----Original Message-----
> > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > > Sent: Wednesday, June 05, 2013 10:50 PM
> > > > To: Ren, Yongjie
> > > > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > > > Yongxue; xen-devel@lists.xen.org
> > > > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> > > >
> > > > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > > > > >
> > > > > > > > That looks like you are hitting the udev race.
> > > > > > > >
> > > > > > > > Could you verify that these patches:
> > > > > > > > https://lkml.org/lkml/2013/5/13/520
> > > > > > > >
> > > > > > > > fix the issue (they are destined for v3.11)?
> > > > > > > >
> > > > > > > Not tried yet. I'll update you later.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > > > and 2 of the 100 runs failed.
> > > >
> > > > Hm, does it fail b/c udev can't online the sysfs entry?
> > > >
> > > I think not.
> > > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > > info about #1 and #2, which are onlined successfully.
> >
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.

Oh. That is very different from what I thought. You are not offlining/onlining
vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
onto the remaining pCPUs.

There should be no vCPU re-sizing, correct?

> (See the attachment with my test code in that bugzilla entry.)
> And, yes, if a CPU fails to online, it can be onlined again when we
> re-trigger the online function.
>
> > .. snip..
> >
> > > > > > > > > Old bugs: (11)
> > > > > > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> > > > > > > >
> > > > > > > > That should be fixed in v3.11 (as now we have the fixes).
> > > > > > > > Could you try v3.10 with Rafael's ACPI tree merged in?
> > > > > > > > (so the patches that he wants to submit for v3.11)
> > > > > > > >
> > > > > > > I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branch),
> > > > > > > and found Dom0 S3 sleep/resume can't work, either.
> > > > > >
> > > > > > The patches he has to submit for v3.11 are in the linux-next branch.
> > > > > > You need to use that branch.
> > > > > >
> > > > > Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> > > > > Attached the log.
> > > >
> > > > It does work on my box. So I am not sure if this is related to the
> > > > IvyTown box you are using. Does it work on other machines?
> > > >
> > > No, it doesn't work on other machines, either. I also tried on SandyBridge,
> > > IvyBridge desktop and Haswell mobile machines.
> >
> > I just double-checked on my AMD machines with v3.10-rc5 with
> > these extra patches:
> >
> > ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
> > 7c4ae96 Revert "xen/pat: Disable PAT support for now."
> > 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
> > bd4fd16 microcode_xen: Add support for AMD family >= 15h
> > 6271c21 x86/microcode: check proper return code.
> > b9a48c8 xen: add CPU microcode update driver
> > c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> > 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
> > f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
> > 29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
> >
> > and it worked. Let me recompile a kernel without most of them to double-check
> > whether those patches are what makes ACPI S3 suspend/resume work.
> > This is with Xen 4.3 (82cb411). The machine is an M5A97, BIOS 1208 04/18/2012,
> > with 01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1)
> > as its graphics card.
> >
> After re-testing with the linux-pm.git tree (kernel 3.10.rc6+, commit a913b188df) on
> my IvyTown-EP and IvyBridge desktop systems, Dom0 S3 sleep/resume works!
> When this code is upstreamed to the linux.git tree, I can close this bug.

Yes! Though Ben found another issue with extended sleep - where it will not
use the hypercall. <sigh>

> > > > > > > > > 2. [XL] "xl vcpu-set" causes dom0 crash or panic
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> > > > > > > >
> > > > > > > > That I think is fixed in v3.10. Could you please check v3.10-rc3?
> > > > > > > >
> > > > > > > Still exists on v3.10-rc3.
> > > > > > > The following command lines can reproduce it:
> > > > > > > # xl vcpu-set 0 1
> > > > > > > # xl vcpu-set 0 20
> > > > > >
> > > > > > Ugh, the exact same stack trace? And can you attach the full dmesg or serial
> > > > > > output (so that I can see what there is at bootup)?
> > > > > >
> > > > > Yes, the same. Also attached in this mail.
> > > >
> > > > One of the fixes is this one:
> > > > http://www.gossamer-threads.com/lists/xen/devel/284897
> > > >
> > > > but the other ones I had not seen. I am wondering if the
> > > > update_sd_lb_stats is b/c of the previous conditions (that is,
> > > > tick_nohz_idle_start hadn't been called).
> > > >
> > > > It is a shot in the dark - but if you use the above-mentioned patch,
> > > > do you still see the update_sd_lb_stats crash?
> > > >
> > > Yes, with the patch we still see the update_sd_lb_stats crash.
> > > It has almost the same trace log as before. Log file is attached.
> >
> > Would it be possible to do a bit of 'git bisect' to figure out why
> > this started?
> >
> It's hard.
> This issue has existed for a long time. We don't even know which upstream
> Linux version works as Dom0 without hitting this bug.

Then a bit of digging will be needed. Sadly I am out of time to do this ATM.
> > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > >
> > > > > > > > That I believe was a QEMU bug:
> > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > >
> > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > in the tree)
> > > > > > > >
> > > > > > > This year and the past year, this bug has always existed (at least in our testing).
> > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > >
> > > > > > Could you retry with Xen 4.3 please?
> > > > > >
> > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > >
> > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > Basically vCPU increment/decrement works fine. I'll close that bug.
> >
> > Excellent!
> > > But there's still a minor issue, as follows.
> > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > # xl vcpu-set $domID 32
> > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > vCPU decrement has the same result.
> > > Can you also try to reproduce my issue?
> >
> This issue doesn't exist when using the latest QEMU traditional tree.
> My previous QEMU was old (March 2013), and I found some of your patches
> were applied in May 2013. Those fixes resolve the issue we reported.
> Close this bug.

Yes!

> But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> several times (e.g. 5 times), the guest will panic. Log is attached.
> Before your patches went into the qemu traditional tree in May 2013, we never
> saw a guest kernel panic.
> dom0: 3.10.0-rc3
> Xen: 4.3.0-RCx
> QEMU: the latest traditional tree
> guest kernel: 3.10.0-RC3
> Shall I file another bug to track this?

Please.

> Can you reproduce this?

Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
test script you are using?

> > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > were you using?
> >
> There are 32 pCPUs in the system we used.
>
> Best Regards,
> Yongjie (Jay)
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, June 22, 2013 2:18 AM
> To: Ren, Yongjie
> Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> Yongxue; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > >
> > > > > That looks like you are hitting the udev race.
> > > > >
> > > > > Could you verify that these patches:
> > > > > https://lkml.org/lkml/2013/5/13/520
> > > > >
> > > > > fix the issue (they are destined for v3.11)?
> > > > >
> > > > Not tried yet. I'll update you later.
> > >
> > > Thanks!
> > >
> > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > and 2 of the 100 runs failed.
> >
> > Hm, does it fail b/c udev can't online the sysfs entry?
> >
> > I think not.
> > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > info about #1 and #2, which are onlined successfully.
>
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> > We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> > We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.
>
> Oh. That is very different from what I thought. You are not offlining/onlining
> vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
> onto the remaining pCPUs.
>
> There should be no vCPU re-sizing, correct?
>
Yes, for this case we do online/offline for pCPUs, not vCPUs.
(The vCPU number doesn't change.)

> > (See the attachment with my test code in that bugzilla entry.)
> > And, yes, if a CPU fails to online, it can be onlined again when we
> > re-trigger the online function.
> >
> > > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > > >
> > > > > > > > > That I believe was a QEMU bug:
> > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > > >
> > > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > > in the tree)
> > > > > > > > >
> > > > > > > > In this year or the past year, this bug has always existed (at least in our testing).
> > > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > > >
> > > > > > > Could you retry with Xen 4.3 please?
> > > > > > >
> > > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > > >
> > > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > > Basically vCPU increment/decrement works fine. I'll close that bug.
> > >
> > > Excellent!
> > > > But there's still a minor issue, as follows.
> > > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > > # xl vcpu-set $domID 32
> > > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > > vCPU decrement has the same result.
> > > > Can you also try to reproduce my issue?
> > >
> > This issue doesn't exist when using the latest QEMU traditional tree.
> > My previous QEMU was old (March 2013), and I found some of your patches
> > were applied in May 2013. Those fixes resolve the issue we reported.
> > Close this bug.
>
> Yes!
>
> > But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> > several times (e.g. 5 times), the guest will panic. Log is attached.
> > Before your patches went into the qemu traditional tree in May 2013, we never
> > saw a guest kernel panic.
> > dom0: 3.10.0-rc3
> > Xen: 4.3.0-RCx
> > QEMU: the latest traditional tree
> > guest kernel: 3.10.0-RC3
> > Shall I file another bug to track this?
>
> Please.
>
> > Can you reproduce this?
>
> Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
> test script you are using?
>
1. xl vcpu-set $domID 2
2. xl vcpu-set $domID 20
3. Repeat steps #1 and #2 several times. (Guest kernel panics ...)

I also filed a bug in bugzilla to track this.
You can get more info at the following link:
http://bugzilla.xenproject.org/bugzilla/show_bug.cgi?id=1860

--
Jay

> > > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > > were you using?
> > >
> > There are 32 pCPUs in the system we used.
> >
> > Best Regards,
> > Yongjie (Jay)
>
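The three steps above are easy to script; a sketch, assuming $1 names a
running HVM guest built with maxvcpus >= 20 (the iteration count and sleeps
are arbitrary choices, not from the report):

    #!/bin/sh
    # Repro sketch for bug 1860: toggle the vCPU count repeatedly.
    domID=$1
    i=0
    while [ $i -lt 5 ]; do
        xl vcpu-set "$domID" 2
        sleep 5                 # give the guest time to offline vCPUs
        xl vcpu-set "$domID" 20
        sleep 5
        i=$((i + 1))
    done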
On Tue, Jul 02, 2013 at 08:09:48AM +0000, Ren, Yongjie wrote:
> > -----Original Message-----
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Saturday, June 22, 2013 2:18 AM
> > To: Ren, Yongjie
> > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > Yongxue; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> >
> > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > > >
> > > > > > That looks like you are hitting the udev race.
> > > > > >
> > > > > > Could you verify that these patches:
> > > > > > https://lkml.org/lkml/2013/5/13/520
> > > > > >
> > > > > > fix the issue (they are destined for v3.11)?
> > > > > >
> > > > > Not tried yet. I'll update you later.
> > > >
> > > > Thanks!
> > > >
> > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > > and 2 of the 100 runs failed.
> > >
> > > Hm, does it fail b/c udev can't online the sysfs entry?
> > >
> > > I think not.
> > > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > > info about #1 and #2, which are onlined successfully.
> >
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> > We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> > We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.
> >
> > Oh. That is very different from what I thought. You are not offlining/onlining
> > vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
> > onto the remaining pCPUs.
> >
> > There should be no vCPU re-sizing, correct?
> >
> Yes, for this case we do online/offline for pCPUs, not vCPUs.
> (The vCPU number doesn't change.)

OK, so nothing to do with Linux, but mostly with the Xen hypervisor. Do you
know who added this functionality? Can they help?

> > > (See the attachment with my test code in that bugzilla entry.)
> > > And, yes, if a CPU fails to online, it can be onlined again when we
> > > re-trigger the online function.
> > >
> > > > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > > > >
> > > > > > > > > > That I believe was a QEMU bug:
> > > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > > > >
> > > > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > > > in the tree)
> > > > > > > > > >
> > > > > > > > > In this year or the past year, this bug has always existed (at least in our testing).
> > > > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > > > >
> > > > > > > > Could you retry with Xen 4.3 please?
> > > > > > > >
> > > > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > > > >
> > > > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > > > Basically vCPU increment/decrement works fine. I'll close that bug.
> > > >
> > > > Excellent!
> > > > > But there's still a minor issue, as follows.
> > > > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > > > # xl vcpu-set $domID 32
> > > > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > > > vCPU decrement has the same result.
> > > > > Can you also try to reproduce my issue?
> > > >
> > > This issue doesn't exist when using the latest QEMU traditional tree.
> > > My previous QEMU was old (March 2013), and I found some of your patches
> > > were applied in May 2013. Those fixes resolve the issue we reported.
> > > Close this bug.
> >
> > Yes!
> >
> > > But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> > > several times (e.g. 5 times), the guest will panic. Log is attached.
> > > Before your patches went into the qemu traditional tree in May 2013, we never
> > > saw a guest kernel panic.
> > > dom0: 3.10.0-rc3
> > > Xen: 4.3.0-RCx
> > > QEMU: the latest traditional tree
> > > guest kernel: 3.10.0-RC3
> > > Shall I file another bug to track this?
> >
> > Please.
> >
> > > Can you reproduce this?
> >
> > Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
> > test script you are using?
> >
> 1. xl vcpu-set $domID 2
> 2. xl vcpu-set $domID 20
> 3. Repeat steps #1 and #2 several times. (Guest kernel panics ...)
>
> I also filed a bug in bugzilla to track this.
> You can get more info at the following link:
> http://bugzilla.xenproject.org/bugzilla/show_bug.cgi?id=1860

OK, thank you. I am a bit busy right now tracking down some other bugs that
I promised I would look after. But after that I should have some time.

> --
> Jay
>
> > > > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > > > were you using?
> > > >
> > > There are 32 pCPUs in the system we used.
> > >
> > > Best Regards,
> > > Yongjie (Jay)
> >
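For readers without access to the bugzilla attachment, a stress loop along
the lines of the test described above can be sketched against libxc's
physical-CPU hotplug calls (xc_cpu_online/xc_cpu_offline, implemented in
tools/libxc/xc_cpu_hotplug.c); the CPU range, iteration count and error
handling here are illustrative, not the reporter's actual code:

    /* pcpu-stress.c -- sketch of a pCPU online/offline stress test.
     * Build (illustrative): gcc pcpu-stress.c -lxenctrl -o pcpu-stress
     */
    #include <stdio.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int round, cpu;

        if (!xch) {
            fprintf(stderr, "cannot open xc interface\n");
            return 1;
        }

        for (round = 0; round < 100; round++) {
            for (cpu = 1; cpu <= 3; cpu++) {   /* leave pCPU 0 alone */
                if (xc_cpu_offline(xch, cpu))
                    fprintf(stderr, "round %d: offline cpu%d failed\n",
                            round, cpu);
                if (xc_cpu_online(xch, cpu))
                    fprintf(stderr, "round %d: online cpu%d failed\n",
                            round, cpu);
            }
        }

        xc_interface_close(xch);
        return 0;
    }

A run matching the report would see a couple of the online calls fail out of
the 100 rounds, with no corresponding event in 'udevadm monitor --env'.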
Konrad Rzeszutek Wilk
2013-Nov-08 16:21 UTC
Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> >
> > Ok, I can reproduce that too.
>
> This is what dom0 tells me:
>
> [ 483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> [ 483.603675] "echo 0 > /proc/sys/kernel/hung_task_timG^G
> [ 483.620747] init D ffff880062b59c78 5904 4163 1 0x00000000
> [ 483.637699] ffff880062b59bc8 0000000000000^G
> [ 483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> [ 483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> [ 483.689527] Call Trace:
> [ 483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
> [ 483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
> [ 483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> [ 483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> [ 483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
> [ 483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> [ 483.809929] [<ffffffff813ba202>] __xenbus_switch_ste+0x32/0x120
> ^G[ 483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> [ 483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> [ 483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> [ 483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> [ 483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
> [ 483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> [ 483.927100] [<ffffffff810a88a1>] kernel_restart+0x11^G
> [ 483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> [ 483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0^G
> [ 483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> [ 483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> [ 484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x^G
> [ 484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> [ 484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> [ 484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> [ 484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [ 484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> [ 484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> [ 484.116585] 3 locks held by init/4163:
> [ 484.131650]+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> ^G^G^G^G^G^G[ 484.147704] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> [ 484.164359] #2: (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>

A bit of debugging shows that when we are in this state:

MSent SIGKILL to[ 100.454603] xen-pciback pci-1-0: shutdown

telnet> send brk
[ 110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)

... snip..

xenstored       x 0000000000000002  5504  3437      1 0x00000006
 ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
 ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
 ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
Call Trace:
 [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
 [<ffffffff816b1594>] schedule+0x24/0x70
 [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
 [<ffffffff8109c981>] do_group_exit+0x51/0x140
 [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
 [<ffffffff8104c49f>] do_signal+0x4f/0x610
 [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
 [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
 [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
 [<ffffffff816bc372>] int_signal+0x12/0x17

The 'x' means that the task has been killed. (The other two threads, 'xenbus'
and 'xenwatch', are sleeping.)

Since xenstored can nowadays live in a domain and not just in the initial
domain, and xenstored can be restarted at any time, we can't depend on the
task pid. Nor can we depend on the other domain telling us that it is dead.

The best we can do is get out of the way of the shutdown process and not
hang on forever.

This patch should solve it:

From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 8 Nov 2013 10:48:58 -0500
Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
 shutdown/restart.

'read_reply' works with 'process_msg' to read a reply from XenBus.
'process_msg' runs from within the 'xenbus' thread. Whenever a message
shows up in XenBus it is put on the xs_state.reply_list list and
'read_reply' picks it up.

The problem is if the backend domain or the xenstored process is killed.
In that case 'xenbus' is still waiting - and 'read_reply', if called, is
stuck forever waiting for the reply_list to have some contents.

This is normally not a problem - as the backend domain can come back or
the xenstored process can be restarted. However, if the domain is in the
process of being powered off/restarted/halted there is no point in
waiting for it to come back - as we are effectively being terminated and
should not impede the progress.

This patch solves this problem by checking the 'system_state' value to
see if we are heading towards death. We also make the wait mechanism a
bit more asynchronous.

Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b6d5fff..177fb19 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
 
 	while (list_empty(&xs_state.reply_list)) {
 		spin_unlock(&xs_state.reply_lock);
-		/* XXX FIXME: Avoid synchronous wait for response here. */
-		wait_event(xs_state.reply_waitq,
-			   !list_empty(&xs_state.reply_list));
+		wait_event_timeout(xs_state.reply_waitq,
+				   !list_empty(&xs_state.reply_list),
+				   msecs_to_jiffies(500));
+
+		/*
+		 * If we are in the process of being shut down there is
+		 * no point in trying to contact XenBus - it is either
+		 * killed (xenstored application) or the other domain
+		 * has been killed or is unreachable.
+		 */
+		switch (system_state) {
+		case SYSTEM_POWER_OFF:
+		case SYSTEM_RESTART:
+		case SYSTEM_HALT:
+			return ERR_PTR(-EIO);
+		default:
+			break;
+		}
 		spin_lock(&xs_state.reply_lock);
 	}
 
@@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
 
 	mutex_unlock(&xs_state.request_mutex);
 
+	if (IS_ERR(ret))
+		return ret;
+
 	if ((msg->type == XS_TRANSACTION_END) ||
 	    ((req_msg.type == XS_TRANSACTION_START) &&
 	     (msg->type == XS_ERROR)))
-- 
1.7.7.6
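For context, the scenario the patch addresses can be reached with an ordinary
PCI assignment; a sketch (the BDF 04:00.0 and config file name are examples,
and the exact xl invocation may differ by toolstack version):

    # Sketch: shut dom0 down while a guest still holds a passed-through device.
    xl pci-assignable-add 04:00.0          # hand the device to pciback
    xl create guest.cfg 'pci=["04:00.0"]'  # guest boots with the device
    reboot                                 # without the patch, dom0 stalls in
                                           # xenbus_dev_shutdown (trace above)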
xen@bugs.xenproject.org
2013-Nov-08 16:30 UTC
Processed: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
Command failed: Unknown command `On'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 45.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
Matt Wilson
2013-Nov-10 20:20 UTC
Re: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
On Fri, Nov 08, 2013 at 11:21:21AM -0500, Konrad Rzeszutek Wilk wrote:
[...]
> This patch should solve it:
>
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
>
> 'read_reply' works with 'process_msg' to read a reply from XenBus.
> 'process_msg' runs from within the 'xenbus' thread. Whenever a message
> shows up in XenBus it is put on the xs_state.reply_list list and
> 'read_reply' picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
>
> This is normally not a problem - as the backend domain can come back or
> the xenstored process can be restarted. However, if the domain is in the
> process of being powered off/restarted/halted there is no point in
> waiting for it to come back - as we are effectively being terminated and
> should not impede the progress.
>
> This patch solves this problem by checking the 'system_state' value to
> see if we are heading towards death. We also make the wait mechanism a
> bit more asynchronous.
>
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Makes sense to me.

Acked-by: Matt Wilson <msw@amazon.com>

> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>  
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut down there is
> +		 * no point in trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +		case SYSTEM_POWER_OFF:
> +		case SYSTEM_RESTART:
> +		case SYSTEM_HALT:
> +			return ERR_PTR(-EIO);
> +		default:
> +			break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>  
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>  
>  	mutex_unlock(&xs_state.request_mutex);
>  
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
xen@bugs.xenproject.org
2013-Nov-10 20:30 UTC
Processed: Re: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> On Fri, Nov 08, 2013 at 11:21:21AM -0500, Konrad Rzeszutek Wilk wrote:
Command failed: Unknown command `On'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 51.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
Liu, SongtaoX
2013-Nov-11 02:40 UTC
Re: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Yes, the patch fixes the dom0 hang during reboot when a PCI device is still
assigned to a guest. Thanks.

Regards
Songtao

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, November 09, 2013 12:21 AM
> To: Ren, Yongjie; george.dunlap@eu.citrix.com; xen@bugs.xenproject.org
> Cc: Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue; xen-devel@lists.xen.org
> Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> Was:Re: [Xen-devel] test report for Xen 4.3 RC1
>
> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > >
> > > Ok, I can reproduce that too.
> >
> > This is what dom0 tells me:
> >
> > [ 483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> > [ 483.603675] "echo 0 > /proc/sys/kernel/hung_task_timG^G
> > [ 483.620747] init D ffff880062b59c78 5904 4163 1 0x00000000
> > [ 483.637699] ffff880062b59bc8 0000000000000^G
> > [ 483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> > [ 483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> > [ 483.689527] Call Trace:
> > [ 483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
> > [ 483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
> > [ 483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> > [ 483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> > [ 483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
> > [ 483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> > [ 483.809929] [<ffffffff813ba202>] __xenbus_switch_ste+0x32/0x120
> > ^G[ 483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> > [ 483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> > [ 483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> > [ 483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> > [ 483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
> > [ 483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> > [ 483.927100] [<ffffffff810a88a1>] kernel_restart+0x11^G
> > [ 483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> > [ 483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0^G
> > [ 483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> > [ 483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> > [ 484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x^G
> > [ 484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> > [ 484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> > [ 484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> > [ 484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [ 484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> > [ 484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> > [ 484.116585] 3 locks held by init/4163:
> > [ 484.131650]+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> > ^G^G^G^G^G^G[ 484.147704] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> > [ 484.164359] #2: (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>
> A bit of debugging shows that when we are in this state:
>
> MSent SIGKILL to[ 100.454603] xen-pciback pci-1-0: shutdown
>
> telnet> send brk
> [ 110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
> show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
> show-blocked-tasks(w) dump-ftrace-buffer(z)
>
> ... snip..
>
> xenstored       x 0000000000000002  5504  3437      1 0x00000006
>  ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
> Call Trace:
>  [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
>  [<ffffffff816b1594>] schedule+0x24/0x70
>  [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
>  [<ffffffff8109c981>] do_group_exit+0x51/0x140
>  [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
>  [<ffffffff8104c49f>] do_signal+0x4f/0x610
>  [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
>  [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
>  [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
>  [<ffffffff816bc372>] int_signal+0x12/0x17
>
> The 'x' means that the task has been killed. (The other two threads,
> 'xenbus' and 'xenwatch', are sleeping.)
>
> Since xenstored can nowadays live in a domain and not just in the initial
> domain, and xenstored can be restarted at any time, we can't depend on the
> task pid. Nor can we depend on the other domain telling us that it is dead.
>
> The best we can do is get out of the way of the shutdown process and not
> hang on forever.
>
> This patch should solve it:
>
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
>
> 'read_reply' works with 'process_msg' to read a reply from XenBus.
> 'process_msg' runs from within the 'xenbus' thread. Whenever a message
> shows up in XenBus it is put on the xs_state.reply_list list and
> 'read_reply' picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
>
> This is normally not a problem - as the backend domain can come back or
> the xenstored process can be restarted. However, if the domain is in the
> process of being powered off/restarted/halted there is no point in
> waiting for it to come back - as we are effectively being terminated and
> should not impede the progress.
>
> This patch solves this problem by checking the 'system_state' value to
> see if we are heading towards death.
> We also make the wait mechanism a bit more asynchronous.
>
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>  
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut down there is
> +		 * no point in trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +		case SYSTEM_POWER_OFF:
> +		case SYSTEM_RESTART:
> +		case SYSTEM_HALT:
> +			return ERR_PTR(-EIO);
> +		default:
> +			break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>  
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>  
>  	mutex_unlock(&xs_state.request_mutex);
>  
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
> -- 
> 1.7.7.6
xen@bugs.xenproject.org
2013-Nov-11 02:45 UTC
Processed: RE: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> Yes, the patch fixed the dom0 hang issue during rebooting with guest pci de
Command failed: Unknown command `Yes,'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 50.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
On Tue, 2013-05-28 at 16:24 +0100, George Dunlap wrote:
> > create !
> > title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."
>
> 1. I think that these commands have to come at the top
> 2. You don't need quotes in the title
> 3. You need to be polite and say "thanks" at the end so it knows it can
> stop paying attention. :-)

4. Use Bcc and not Cc so that the entire subsequent thread doesn't get sent
to the bot when folks reply-all.

Ian.
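Putting those four points together, a message body the bot should accept
would look like this (sent to xen@bugs.xenproject.org via Bcc, per point 4):

    create !
    title -1 linux, xenbus mutex hangs when rebooting dom0 and guests hung.
    thanks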