Hi All,
This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
(Sorry it's a little late. :-) If the status changes, I'll have an update later.)

Test environment:
Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
Dom0: Linux kernel 3.9.3
Hardware: Intel Sandy Bridge, Ivy Bridge, Haswell systems

Below are the features we tested.
- PV and HVM guest booting (HVM: Ubuntu, Fedora, RHEL, Windows)
- Save/Restore and live migration
- PCI device assignment and SR-IOV
- Power management: C-state/P-state, Dom0 S3, HVM S3
- AVX and XSAVE instruction sets
- MCE
- CPU online/offline for Dom0
- vCPU hot-plug
- Nested virtualization (please see my report at the following link.)
  http://lists.xen.org/archives/html/xen-devel/2013-05/msg01145.html

New bugs (4): (some of which are not regressions)
1. Sometimes failed to online a CPU in Dom0
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
2. Dom0 call trace when running an SR-IOV HVM guest with igbvf
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
   -- a regression in the Linux kernel (Dom0).
3. Booting multiple guests leads to a Dom0 call trace
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
4. After live migration, the guest console continuously prints "Clocksource tsc unstable"
   http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854

Old bugs (11):
1. [ACPI] Dom0 can't resume from S3 sleep
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
2. [XL] "xl vcpu-set" causes Dom0 crash or panic
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
3. Sometimes Xen panics on ia32pae Sandy Bridge when restoring a guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
4. 'xl vcpu-set' can't decrease the vCPU number of an HVM guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
5. Dom0 cannot be shut down before PCI device detachment from the guest
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
6. xl pci-list shows one PCI device (PF or VF) can be assigned to two different guests
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
7. [upstream qemu] Guest free memory with upstream QEMU is 14MB lower than with qemu-xen-unstable.git
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
8. [upstream qemu] 'maxvcpus=NUM' item is not supported in upstream QEMU
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
9. [upstream qemu] Guest console hangs after save/restore or live migration when setting 'hpet=0' in the guest config file
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
10. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839
11. Live migration fails when migrating the same guest more than 2 times
   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1845

Best Regards,
Yongjie (Jay)
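A note on new bug 4: which clocksource the guest kernel has actually selected can be checked from inside the guest through the standard kernel sysfs interface. These are the stock kernel paths, not anything Xen-specific; the report does not show how the symptom was confirmed, so this is only a reference:

    # inside the guest: show the active and the available clocksources
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource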
On Mon, May 27, 2013 at 03:49:27AM +0000, Ren, Yongjie wrote:
> Hi All,
> This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
> (Sorry it's a little late. :-) If the status changes, I'll have an update later.)

OK, I've some updates and ideas that can help with narrowing some of these
issues down. Thank you for doing this.

> Test environment:
> Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
> Dom0: Linux kernel 3.9.3

Could you please test v3.10-rc3. There have been some changes for VCPU
hotplug added in v3.10 that I am not sure are in v3.9.

> Hardware: Intel Sandy Bridge, Ivy Bridge, Haswell systems
>
> Below are the features we tested.
.. snip..
> New bugs (4): (some of which are not regressions)
> 1. sometimes failed to online cpu in Dom0
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851

That looks like you are hitting the udev race.

Could you verify that these patches:
https://lkml.org/lkml/2013/5/13/520
fix the issue. (They are destined for v3.11.)

> 2. dom0 call trace when running sriov hvm guest with igbvf
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> -- a regression in Linux kernel (Dom0).

Hm, the call trace you refer to:

[ 68.404440] Already setup the GSI :37
[ 68.405105] igb 0000:04:00.0: Enabling SR-IOV VFs using the module parameter is deprecated - please use the pci sysfs interface.
[ 68.506230] ------------[ cut here ]------------
[ 68.506265] WARNING: at /home/www/builds_xen_unstable/xen-src-27009-20130509/linux-2.6-pvops.git/fs/sysfs/dir.c:536 sysfs_add_one+0xcc/0xf0()
[ 68.506279] Hardware name: S2600CP

is a deprecation warning. Did you follow the 'pci sysfs' interface way?

Looking at da36b64736cf2552e7fb5109c0255d4af804f5e7
    ixgbe: Implement PCI SR-IOV sysfs callback operation
it says it is using this:

commit 1789382a72a537447d65ea4131d8bcc1ad85ce7b
Author: Donald Dutile <ddutile@redhat.com>
Date:   Mon Nov 5 15:20:36 2012 -0500

    PCI: SRIOV control and status via sysfs

    Provide files under sysfs to determine the maximum number of VFs
    an SR-IOV-capable PCIe device supports, and methods to enable and
    disable the VFs on a per-device basis.

    Currently, VF enablement by SR-IOV-capable PCIe devices is done
    via driver-specific module parameters.  If not setup in modprobe
    files, it requires admin to unload & reload PF drivers with number
    of desired VFs to enable.  Additionally, the enablement is system
    wide: all devices controlled by the same driver have the same
    number of VFs enabled.  Although the latter is probably desired,
    there are PCI configurations setup by system BIOS that may not
    enable that to occur.

    Two files are created for the PF of PCIe devices with SR-IOV support:

    sriov_totalvfs  Contains the maximum number of VFs the device
                    could support as reported by the TotalVFs register
                    in the SR-IOV extended capability.

    sriov_numvfs    Contains the number of VFs currently enabled on
                    this device as reported by the NumVFs register in
                    the SR-IOV extended capability.

                    Writing zero to this file disables all VFs.

                    Writing a positive number to this file enables that
                    number of VFs.

    These files are readable for all SR-IOV PF devices.  Writes to the
    sriov_numvfs file are effective only if a driver that supports the
    sriov_configure() method is attached.

    Signed-off-by: Donald Dutile <ddutile@redhat.com>

Can you try that please?

> 3. Booting multiple guests will lead Dom0 call trace
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853

That one worries me. Did you do a git bisect to figure out which
commit is causing this?

> 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854

This looks like a current bug with QEMU unstable missing an ACPI table?

Did you try booting the guest with the old QEMU?

device_model_version = 'qemu-xen-traditional'

> Old bugs: (11)
> 1. [ACPI] Dom0 can't resume from S3 sleep
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707

That should be fixed in v3.11 (as now we have the fixes).
Could you try v3.10 with Rafael's ACPI tree merged in?
(So the patches that he wants to submit for v3.11.)

> 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730

That I think is fixed in v3.10. Could you please check v3.10-rc3?

> 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747

That looks to be with v2.6.32. Is the issue present with v3.9
or v3.10-rc3?

> 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822

That I believe was a QEMU bug:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
which should be in QEMU traditional now (05-21 was when it went
in the tree).

> 5. Dom0 cannot be shutdown before PCI device detachment from guest
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826

Ok, I can reproduce that too.

> 6. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834

OK, I can reproduce that too:

> xl create /vm-pv.cfg
Parsing config from /vm-pv.cfg
libxl: error: libxl_pci.c:1043:libxl__device_pci_add: PCI device 0:1:0.0 is not assignable
Daemon running with PID 3933

15:11:17 # 16 :/mnt/lab/latest/
> xl pci-list 1
Vdev  Device
05.0  0000:01:00.0

> xl list
Name          ID   Mem  VCPUs  State   Time(s)
Domain-0       0  2047      4  r-----     26.7
latest         1  2043      1  -b----      5.3
latestadesa    4  1024      3  -b----      5.1

15:11:20 # 20 :/mnt/lab/latest/
> xl pci-list 4
Vdev  Device
00.0  0000:01:00.0

The rest I hadn't had a chance to look at. George, have you seen
these issues?

> 7. [upstream qemu] Guest free memory with upstream qemu is 14MB lower than that with qemu-xen-unstable.git
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
> 8. [upstream qemu] 'maxvcpus=NUM' item is not supported in upstream QEMU
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
> 9. [upstream qemu] Guest console hangs after save/restore or live-migration when setting 'hpet=0' in guest config file
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
> 10. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839
> 11. Live migration fail when migrating the same guest for more than 2 times
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1845
>
> Best Regards,
> Yongjie (Jay)
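For reference, the sysfs method described in the commit message above amounts to writing the desired VF count into the PF's sriov_numvfs file. A minimal sketch, assuming the igb PF at 0000:04:00.0 from the quoted call trace (the count of 7 is an arbitrary example, not a value from the thread):

    # how many VFs can this PF support?
    cat /sys/bus/pci/devices/0000:04:00.0/sriov_totalvfs
    # enable 7 VFs; writing 0 disables them again
    echo 7 > /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs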
> > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>
> Ok, I can reproduce that too.

This is what dom0 tells me:

[  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
[  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.620747] init    D ffff880062b59c78  5904  4163  1 0x00000000
[  483.637699] ffff880062b59bc8 0000000000000...
[  483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
[  483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
[  483.689527] Call Trace:
[  483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
[  483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
[  483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
[  483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
[  483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
[  483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
[  483.809929] [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
[  483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
[  483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
[  483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
[  483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
[  483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
[  483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
[  483.927100] [<ffffffff810a88a1>] kernel_restart+0x11...
[  483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
[  483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x...
[  483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
[  483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
[  484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x...
[  484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
[  484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
[  484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
[  484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[  484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
[  484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
[  484.116585] 3 locks held by init/4163:
[  484.131650] #0:  (...){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
[  484.147704] #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
[  484.164359] #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0

create !
title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."
On 28/05/13 16:21, Konrad Rzeszutek Wilk wrote:
>>> 5. Dom0 cannot be shutdown before PCI device detachment from guest
>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>> Ok, I can reproduce that too.
> This is what dom0 tells me:
>
> [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
.. snip..
> create !
> title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."

1. I think that these commands have to come at the top
2. You don't need quotes in the title
3. You need to be polite and say "thanks" at the end so it knows it can stop paying attention. :-)

 -George
Sorry for replying late. :-)

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Tuesday, May 28, 2013 11:16 PM
> To: Ren, Yongjie; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> On Mon, May 27, 2013 at 03:49:27AM +0000, Ren, Yongjie wrote:
> > Hi All,
> > This is a report based on our testing for Xen 4.3.0 RC1 on Intel platforms.
> > (Sorry it's a little late. :-) If the status changes, I'll have an update later.)
>
> OK, I've some updates and ideas that can help with narrowing some of these
> issues down. Thank you for doing this.
>
> > Test environment:
> > Xen: Xen 4.3 RC1 with qemu-upstream-unstable.git
> > Dom0: Linux kernel 3.9.3
>
> Could you please test v3.10-rc3. There have been some changes
> for VCPU hotplug added in v3.10 that I am not sure are in v3.9.

I didn't try every bug with v3.10-rc3, but most of them still exist.

> > New bugs (4): (some of which are not regressions)
> > 1. sometimes failed to online cpu in Dom0
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
>
> That looks like you are hitting the udev race.
>
> Could you verify that these patches:
> https://lkml.org/lkml/2013/5/13/520
> fix the issue. (They are destined for v3.11.)

Not tried yet. I'll update it to you later.

> > 2. dom0 call trace when running sriov hvm guest with igbvf
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> > -- a regression in Linux kernel (Dom0).
>
> Hm, the call trace you refer to:
.. snip..
> Can you try that please?

Recently, one of my workmates already had a fix as below.
https://lkml.org/lkml/2013/5/30/20
And it seems to have also already been fixed by another guy.
https://patchwork.kernel.org/patch/2613481/

> > 3. Booting multiple guests will lead Dom0 call trace
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
>
> That one worries me. Did you do a git bisect to figure out which
> commit is causing this?

I only found this bug on some Intel ~EX servers.
I don't know which version of Xen/Dom0 works fine.
If anyone wants to reproduce or debug it, that would be good.
And our team is trying to debug it internally first.

> > 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854
>
> This looks like a current bug with QEMU unstable missing an ACPI table?
>
> Did you try booting the guest with the old QEMU?
>
> device_model_version = 'qemu-xen-traditional'

This issue still exists with traditional qemu-xen.
After more testing, this bug can't be reproduced with some other guests.
A RHEL6.4 guest will have this issue after live migration, while RHEL6.3,
Fedora 17 and Ubuntu 12.10 guests work fine.

> > Old bugs: (11)
> > 1. [ACPI] Dom0 can't resume from S3 sleep
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
>
> That should be fixed in v3.11 (as now we have the fixes).
> Could you try v3.10 with Rafael's ACPI tree merged in?
> (So the patches that he wants to submit for v3.11.)

I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branches),
and found Dom0 S3 sleep/resume doesn't work, either.

> > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
>
> That I think is fixed in v3.10. Could you please check v3.10-rc3?

Still exists on v3.10-rc3.
The following command lines can reproduce it:
# xl vcpu-set 0 1
# xl vcpu-set 0 20

> > 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
>
> That looks to be with v2.6.32. Is the issue present with v3.9
> or v3.10-rc3?

We haven't tested ia32pae Xen for a long time.
Now, we only cover ia32e Xen/Dom0.
So this bug is only a legacy issue.
If we have the effort to verify it, we'll update it in the bugzilla.

> > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
>
> That I believe was a QEMU bug:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> which should be in QEMU traditional now (05-21 was when it went
> in the tree).

In this year and the past year, this bug has always existed (at least in our
testing): 'xl vcpu-set' can't decrease the vCPU number of an HVM guest.

- Jay

> > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>
> Ok, I can reproduce that too.
>
> > 6. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
> > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
>
> OK, I can reproduce that too:
.. snip..
> The rest I hadn't had a chance to look at. George, have you seen
> these issues?
.. snip..
On Tue, Jun 04, 2013 at 03:59:33PM +0000, Ren, Yongjie wrote:
> Sorry for replying late. :-)
>
.. snip..
> > > New bugs (4): (some of which are not regressions)
> > > 1. sometimes failed to online cpu in Dom0
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> >
> > That looks like you are hitting the udev race.
> >
> > Could you verify that these patches:
> > https://lkml.org/lkml/2013/5/13/520
> > fix the issue. (They are destined for v3.11.)
>
> Not tried yet. I'll update it to you later.

Thanks!

> > > 2. dom0 call trace when running sriov hvm guest with igbvf
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1852
> > > -- a regression in Linux kernel (Dom0).
.. snip..
> > Can you try that please?
>
> Recently, one of my workmates already had a fix as below.
> https://lkml.org/lkml/2013/5/30/20
> And it seems to have also already been fixed by another guy.
> https://patchwork.kernel.org/patch/2613481/

Great! Care to update the bug with said relevant information?

> > > 3. Booting multiple guests will lead Dom0 call trace
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1853
> >
> > That one worries me. Did you do a git bisect to figure out which
> > commit is causing this?
>
> I only found this bug on some Intel ~EX servers.
> I don't know which version of Xen/Dom0 works fine.
> If anyone wants to reproduce or debug it, that would be good.
> And our team is trying to debug it internally first.

Ah, OK. Then please continue on debugging it. Thanks!

> > > 4. After live migration, guest console continuously prints "Clocksource tsc unstable"
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1854
> >
> > This looks like a current bug with QEMU unstable missing an ACPI table?
> >
> > Did you try booting the guest with the old QEMU?
> >
> > device_model_version = 'qemu-xen-traditional'
>
> This issue still exists with traditional qemu-xen.
> After more testing, this bug can't be reproduced with some other guests.
> A RHEL6.4 guest will have this issue after live migration, while RHEL6.3,
> Fedora 17 and Ubuntu 12.10 guests work fine.

There is a recent thread on this where the culprit was the PV timeclock
not being updated correctly. But that would seem to be at odds with your
reporting - where you are using Fedora 17 and it works fine.

Hm, I am at a loss on this one.

> > > Old bugs: (11)
> > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> >
> > That should be fixed in v3.11 (as now we have the fixes).
> > Could you try v3.10 with Rafael's ACPI tree merged in?
> > (So the patches that he wants to submit for v3.11.)
>
> I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branches),
> and found Dom0 S3 sleep/resume doesn't work, either.

The patches he has to submit for v3.11 are in the linux-next branch.
You need to use that branch.

> > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> >
> > That I think is fixed in v3.10. Could you please check v3.10-rc3?
>
> Still exists on v3.10-rc3.
> The following command lines can reproduce it:
> # xl vcpu-set 0 1
> # xl vcpu-set 0 20

Ugh, same exact stack trace? And can you attach the full dmesg or serial
output (so that I can see what there is at bootup)?

> > > 3. Sometimes Xen panic on ia32pae Sandybridge when restore guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
> >
> > That looks to be with v2.6.32. Is the issue present with v3.9
> > or v3.10-rc3?
>
> We haven't tested ia32pae Xen for a long time.
> Now, we only cover ia32e Xen/Dom0.
> So this bug is only a legacy issue.
> If we have the effort to verify it, we'll update it in the bugzilla.

How about just dropping that bug as 'WONTFIX'?

> > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> >
> > That I believe was a QEMU bug:
> > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > which should be in QEMU traditional now (05-21 was when it went
> > in the tree).
>
> In this year and the past year, this bug has always existed (at least in our
> testing): 'xl vcpu-set' can't decrease the vCPU number of an HVM guest.

Could you retry with Xen 4.3 please?
> > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > >
> > > > That looks like you are hitting the udev race.
> > > >
> > > > Could you verify that these patches:
> > > > https://lkml.org/lkml/2013/5/13/520
> > > > fix the issue. (They are destined for v3.11.)
> > >
> > > Not tried yet. I'll update it to you later.
> >
> > Thanks!
>
> We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> and found 2 times (of the 100) failed.

Hm, does it fail b/c udev can't online the sysfs entry?

.. snip..
> > > > > Old bugs: (11)
> > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> > > >
> > > > That should be fixed in v3.11 (as now we have the fixes).
> > > > Could you try v3.10 with Rafael's ACPI tree merged in?
> > > > (So the patches that he wants to submit for v3.11.)
> > >
> > > I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug
> > > branches), and found Dom0 S3 sleep/resume doesn't work, either.
> >
> > The patches he has to submit for v3.11 are in the linux-next branch.
> > You need to use that branch.
>
> Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> Attached the log.

It does work on my box. So I am not sure if this is related to the
IvyTown box you are using. Does it work on other machines?

> > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> > > >
> > > > That I think is fixed in v3.10. Could you please check v3.10-rc3?
> > >
> > > Still exists on v3.10-rc3.
> > > The following command lines can reproduce it:
> > > # xl vcpu-set 0 1
> > > # xl vcpu-set 0 20
> >
> > Ugh, same exact stack trace? And can you attach the full dmesg or serial
> > output (so that I can see what there is at bootup)?
>
> Yes, the same. Also attached in this mail.

One of the fixes is this one:
http://www.gossamer-threads.com/lists/xen/devel/284897

but the other ones I had not seen. I am wondering if the
update_sd_lb_stats is b/c of the previous conditions (that is, that
tick_nohz_idle_start hadn't been called).

It is a shot in the dark - but if you use the above-mentioned patch
do you still see the update_sd_lb_stats crash?

> > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.

Could you give some more details? Could you include the
/var/log/xen/qemu-... log file?

You are using the traditional QEMU right? (You need to have this in your
guest config:
device_model_version = 'qemu-xen-traditional')
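For reference, a minimal guest config carrying that setting might look like the following sketch. Apart from device_model_version, the values (name, memory, disk path) are hypothetical placeholders, not details from the thread:

    # minimal HVM guest config sketch for xl
    builder = 'hvm'
    name    = 'testguest'
    memory  = 2048
    vcpus   = 4
    maxvcpus = 32
    disk    = [ '/path/to/guest.img,raw,xvda,rw' ]
    device_model_version = 'qemu-xen-traditional'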
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, June 05, 2013 10:50 PM
> To: Ren, Yongjie
> Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> Yongxue; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
.. snip..
> > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > and found 2 times (of the 100) failed.
>
> Hm, does it fail b/c udev can't online the sysfs entry?

I think no.
When it fails to online CPU #3 (trying to online #1-#3), it doesn't show any
info about CPU #3 in the output of the "udevadm monitor --env" command. It does
show info about #1 and #2, which are onlined successfully.

> .. snip..
> > > > > > Old bugs: (11)
> > > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
.. snip..
> > Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> > Attached the log.
>
> It does work on my box. So I am not sure if this is related to the
> IvyTown box you are using. Does it work on other machines?

No, it doesn't work on other machines, either. I also tried on Sandy Bridge,
Ivy Bridge desktop and Haswell mobile machines.

> > > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
.. snip..
> One of the fixes is this one:
> http://www.gossamer-threads.com/lists/xen/devel/284897
>
> but the other ones I had not seen. I am wondering if the
> update_sd_lb_stats is b/c of the previous conditions (that is, that
> tick_nohz_idle_start hadn't been called).
>
> It is a shot in the dark - but if you use the above-mentioned patch
> do you still see the update_sd_lb_stats crash?

Yes, with the patch we still see the update_sd_lb_stats crash.
It has almost the same trace log as before. Log file is attached.

> > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.

Sorry, when I sent that message, I was still using the rhel6.4 kernel as the
guest. After upgrading the guest kernel to 3.10.0-rc3, the result became better.
Basically vCPU increment/decrement can work fine. I'll close that bug.
But there's still a minor issue as follows.
After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
# xl vcpu-set $domID 32
Then you only get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
number to 32 again (from 19), and the guest does get 32 vCPUs.
But 'xl vcpu-set $domID 8' works fine as expected.
vCPU decrement has the same result.
Can you also have a try at reproducing my issue?

> Could you give some more details? Could you include the
> /var/log/xen/qemu-... log file?

Attached the qemu log.

> You are using the traditional QEMU right? (You need to have this in your
> guest config:
> device_model_version = 'qemu-xen-traditional')

Yes.

--
Jay
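For reference, the vCPU count issue described above can be stated as a short repro sequence. The numbers follow the description in the mail; using xl vcpu-list to observe the result is an assumption, since the report doesn't say how the count was checked:

    # guest config carries: vcpus=4, maxvcpus=32
    xl vcpu-set $domID 32    # guest may come up with fewer (e.g. 19) online vCPUs
    xl vcpu-list $domID      # check how many vCPUs actually came online
    xl vcpu-set $domID 32    # repeating the same command brings it to 32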
On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote:
.. snip..
> > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > bug still exists. For example, we did CPU online-offline for Dom0 100
> > > times, and found 2 times (of the 100) failed.
> >
> > Hm, does it fail b/c udev can't online the sysfs entry?
>
> I think no.
> When it fails to online CPU #3 (trying to online #1-#3), it doesn't show any
> info about CPU #3 in the output of the "udevadm monitor --env" command. It
> does show info about #1 and #2, which are onlined successfully.

And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?

> > > > > Old bugs: (11)
> > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
.. snip..
> > It does work on my box. So I am not sure if this is related to the
> > IvyTown box you are using. Does it work on other machines?
>
> No, it doesn't work on other machines, either. I also tried on Sandy Bridge,
> Ivy Bridge desktop and Haswell mobile machines.

I just double-checked on my AMD machines with v3.10-rc5 with
these extra patches:

ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
7c4ae96 Revert "xen/pat: Disable PAT support for now."
729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
bd4fd16 microcode_xen: Add support for AMD family >= 15h
6271c21 x86/microcode: check proper return code.
b9a48c8 xen: add CPU microcode update driver
c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.

and it worked. Let me recompile a kernel without most of them to double-check
whether those patches are making the ACPI S3 suspend/resume work.
This is with Xen 4.3 (82cb411). The machine is an M5A97, BIOS 1208 04/18/2012, with
01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1)
as its graphics card.

> > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
.. snip..
> > It is a shot in the dark - but if you use the above-mentioned patch
> > do you still see the update_sd_lb_stats crash?
>
> Yes, with the patch we still see the update_sd_lb_stats crash.
> It has almost the same trace log as before. Log file is attached.

Would it be possible to do a bit of 'git bisect' to figure out why
this started?

> > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
.. snip..
> > > Could you retry with Xen 4.3 please?
> >
> Sorry, when I sent that message, I was still using the rhel6.4 kernel as the
> guest. After upgrading the guest kernel to 3.10.0-rc3, the result became
> better. Basically vCPU increment/decrement can work fine. I'll close that bug.

Excellent!

> But there's still a minor issue as follows.
> After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> # xl vcpu-set $domID 32
> Then you only get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> number to 32 again (from 19), and the guest does get 32 vCPUs.
> But 'xl vcpu-set $domID 8' works fine as expected.
> vCPU decrement has the same result.
> Can you also have a try at reproducing my issue?

Sure. Now how many pCPUs do you have? And what version of QEMU traditional
were you using?

> > Could you give some more details? Could you include the
> > /var/log/xen/qemu-... log file?
>
> Attached the qemu log.

Thank you.

> > You are using the traditional QEMU right? (You need to have this in your
> > guest config:
> > device_model_version = 'qemu-xen-traditional')
>
> Yes.
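For reference, the kind of bisect being asked for would look roughly like this. The good/bad versions are placeholders, since the thread never established a known-good kernel:

    git bisect start
    git bisect bad v3.10-rc3    # kernel that shows the crash
    git bisect good v3.4        # hypothetical last known-good kernel
    # build and boot each kernel git offers as dom0, run the repro,
    # then mark the result:
    git bisect good             # or: git bisect bad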
> I just double-checked on my AMD machines with v3.10-rc5 with
> these extra patches:
>
> ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
> 7c4ae96 Revert "xen/pat: Disable PAT support for now."
> 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
> bd4fd16 microcode_xen: Add support for AMD family >= 15h
> 6271c21 x86/microcode: check proper return code.
> b9a48c8 xen: add CPU microcode update driver
> c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
> f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
> 29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
>
> and it worked. Let me recompile a kernel without most of them to double-check
> whether those patches are making the ACPI S3 suspend/resume work.

Still works. I removed all but:

c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()

on top of 3.10-rc6 and the suspend/resume on the host works.
On Mon, Jun 17, 2013 at 04:35:39PM -0400, Konrad Rzeszutek Wilk wrote:
> > I just double-checked on my AMD machines with v3.10-rc5 with
> > these extra patches:
.. snip..
> Still works. I removed all but:
>
> c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
>
> on top of 3.10-rc6 and the suspend/resume on the host works.

Correction. This is v3.10-rc6 + Rafael's linux-next branch which had:

f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
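For reference, host S3 in tests like these is normally triggered through the kernel's standard sysfs interface. This is an assumption, as the thread doesn't show the exact command either side used:

    # put the host (dom0) into S3; it should wake on e.g. the power button
    echo mem > /sys/power/state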
> -----Original Message----- > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > Sent: Monday, June 17, 2013 10:23 PM > To: Ren, Yongjie > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian, > Yongxue; xen-devel@lists.xen.org > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1 > > On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote: > > > -----Original Message----- > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > > > Sent: Wednesday, June 05, 2013 10:50 PM > > > To: Ren, Yongjie > > > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian, > > > Yongxue; xen-devel@lists.xen.org > > > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1 > > > > > > > > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851 > > > > > > > > > > > > > > That looks like you are hitting the udev race. > > > > > > > > > > > > > > Could you verify that these patches: > > > > > > > https://lkml.org/lkml/2013/5/13/520 > > > > > > > > > > > > > > fix the issue (They are destined for v3.11) > > > > > > > > > > > > > Not tried yet. I''ll update it to you later. > > > > > > > > > > Thanks! > > > > > > > > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found > this > > > > bug still exist. For example, we did CPU online-offline for Dom0 for > 100 > > > times, > > > > and found 2 times (of 100 times) failed. > > > > > > Hm, does it fail b/c udev can''t online the sysfs entry? > > > > > I think no. > > When it fails to online CPU #3 (trying online #1~#3), it doesn''t show any > info > > about CPU #3 via the output of "devadm monitor --env" CMD. It does > show > > info about #1 and #2 which are onlined succefully. > > And if you re-trigger the the ''xl vcpu-set'' it eventually comes back up right? >We don''t use ''xl vcpu-set'' command when doing the CPU hot-plug. We just call the xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test. (see the attachment about my test code in that bugzilla.) And, yes, if a CPU failed to online, it can also be onlined again when we re-trigger online function.> > > > > .. snip.. > > > > > > > > > > > > > > > > > > > > > > Old bugs: (11) > > > > > > > > 1. [ACPI] Dom0 can''t resume from S3 sleep > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707 > > > > > > > > > > > > > > That should be fixed in v3.11 (as now we have the fixes) > > > > > > > Could you try v3.10 with the Rafael''s ACPI tree merged in? > > > > > > > (so the patches that he wants to submit for v3.11) > > > > > > > > > > > > > I re-tested with Rafel''s linux-pm.git tree (master and acpi-hotplug > > > > > branch), > > > > > > and found Dom0 S3 sleep/resume can''t work, either. > > > > > > > > > > The patches he has to submit for v3.11 are in the linux-next branch. > > > > > You need to use that branch. > > > > > > > > > Dom0 S3 sleep/resume doesn''t work with linux-next branch, either. > > > > attached the log. > > > > > > It does work on my box. So I am not sure if this is related to the > > > IvyTown box you are using. Does it work on other machines? > > > > > No, it doesn''t work on other machines, either. I also tried on > SandyBridge, > > IvyBridge desktop and Haswell mobile machines. > > I just double checked on my AMD machines with v3.10-rc5 with > these extra patches: > > ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on > CPA/set_p.._wb/wc operations. > 7c4ae96 Revert "xen/pat: Disable PAT support for now." > 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value." 
> bd4fd16 microcode_xen: Add support for AMD family >= 15h > 6271c21 x86/microcode: check proper return code. > b9a48c8 xen: add CPU microcode update driver > c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is > emitted > 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in > register_cpu() > f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback. > 29ca6e9 x86 / ACPI / sleep: Provide registration for > acpi_suspend_lowlevel. > > and it worked. Let me recompile a kernel without most of them to > doublecheck > whether those patches are making the ACPI S3 suspend/resume working. > This is with Xen 4.3 (82cb411). The machine is M5A97, BIOS 1208 > 04/18/2012 > with 01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce > 8600 GT] (rev a1) > as its graphic card. >After re-testing with linux-pm.git tree (kernel:3.10.rc6+ commit: a913b188df) on my IvyTown-EP and IvyBridge desktop systems, Dom0 S3 sleep/resume can work! When these codes are upstreamed to linux.git tree, I can close this bug.> > > > > > > > > > > > > > > > > > > > 2. [XL]"xl vcpu-set" causes dom0 crash or panic > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730 > > > > > > > > > > > > > > That I think is fixed in v3.10. Could you please check v3.10-rc3? > > > > > > > > > > > > > Still exists on v3.10-rc3. > > > > > > The following command lines can reproduce it: > > > > > > # xl vcpu-set 0 1 > > > > > > # xl vcpu-set 0 20 > > > > > > > > > > Ugh, same exact stack trace? And can you attach the full dmesg or > > > serial > > > > > output (so that Ican see what there is at bootup) > > > > > > > > > Yes, the same. Also attached in this mail. > > > > > > One of the fixes is this one: > > > http://www.gossamer-threads.com/lists/xen/devel/284897 > > > > > > but the other ones I had not seen. I am wondering if the > > > update_sd_lb_stats is b/c of the previous conditions (that is the > > > tick_nohz_idle_start hadn''t been called). > > > > > > It is a shoot in the dark - but if you use the above mentioned patch > > > do you still see the update_sd_lb_stats crash? > > > > > Yes, with the patch we still see the update_sd_lb_stats crash. > > It has almost the same trace log as before. Log file is attached. > > Would it be possible to do a bit of ''git bisect'' to figure out why > this started? >It''s hard. This issue exists for a long time. We don''t even know which version of linux upstream as dom0 can work for this bug.> > > > > > > > 4. ''xl vcpu-set'' can''t decrease the vCPU number of a HVM > guest > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822 > > > > > > > > > > > > > > That I believe was an QEMU bug: > > > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html > > > > > > > > > > > > > > which should be in QEMU traditional now (05-21 was when it > went > > > > > > > in the tree) > > > > > > > > > > > > > In this year or past year, this bug always exists (at least in our > > > testing). > > > > > > ''xl vcpu-set'' can''t decrease the vCPU number of a HVM guest > > > > > > > > > > Could you retry with Xen 4.3 please? > > > > > > > > > With Xen 4.3 & Linux:3.10.0-rc3, I can''t decrease the vCPU number of > a > > > guest. > > > > > sorry, when I said this message, I still use rhel6.4 kernel as the guest. > > After upgrading guest kernel to 3.10.0-rc3, the result became better. > > Basically vCPU increment/decrement can work fine. I''ll close that bug. > > Excellent! 
> > But there's still a minor issue, as follows.
> > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > # xl vcpu-set $domID 32
> > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > vCPU decrement has the same result.
> > Can you also try to reproduce my issue?
>
This issue doesn't exist when using the latest QEMU traditional tree.
My previous QEMU was old (March 2013), and I found some of your patches
were applied in May 2013. Those fixes resolve the issue we reported.
Close this bug.

But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
several times (e.g. 5 times), the guest will panic. Log is attached.
Before your patches went into the qemu traditional tree in May 2013, we never
saw a guest kernel panic.
dom0: 3.10.0-rc3
Xen: 4.3.0-RCx
QEMU: the latest traditional tree
guest kernel: 3.10.0-RC3
Shall I file another bug to track this? Can you reproduce it?

> Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> were you using?
>
There are 32 pCPUs in the system we used.

Best Regards,
Yongjie (Jay)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
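For readers following along, the guest configuration that exercises this path
boils down to a 'vcpus' value below 'maxvcpus'; a minimal sketch (the name,
disk and bridge are placeholders, not taken from the report):

    # hvm-guest.cfg -- minimal sketch; only the vcpus/maxvcpus lines matter here
    builder  = "hvm"
    name     = "hvm-test"
    memory   = 2048
    vcpus    = 4              # vCPUs online at boot
    maxvcpus = 32             # ceiling that 'xl vcpu-set' may raise to
    disk     = [ "file:/path/to/guest.img,xvda,w" ]
    vif      = [ "bridge=xenbr0" ]

With such a config, 'xl vcpu-set <domid> 32' is the step that intermittently
brings up fewer vCPUs (e.g. 19) than requested, as described above.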
On Thu, Jun 20, 2013 at 02:53:06AM +0000, Ren, Yongjie wrote:
> > -----Original Message-----
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Monday, June 17, 2013 10:23 PM
> > To: Ren, Yongjie
> > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > Yongxue; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> >
> > On Sun, Jun 16, 2013 at 04:10:22AM +0000, Ren, Yongjie wrote:
> > > > -----Original Message-----
> > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > > Sent: Wednesday, June 05, 2013 10:50 PM
> > > > To: Ren, Yongjie
> > > > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > > > Yongxue; xen-devel@lists.xen.org
> > > > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> > > >
> > > > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > > > > >
> > > > > > > > That looks like you are hitting the udev race.
> > > > > > > >
> > > > > > > > Could you verify that these patches:
> > > > > > > > https://lkml.org/lkml/2013/5/13/520
> > > > > > > >
> > > > > > > > fix the issue (they are destined for v3.11)?
> > > > > > > >
> > > > > > > Not tried yet. I'll update you later.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > > > and 2 of the 100 runs failed.
> > > >
> > > > Hm, does it fail b/c udev can't online the sysfs entry?
> > > >
> > > I think not.
> > > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > > info about #1 and #2, which are onlined successfully.
> >
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.

Oh. That is very different from what I thought. You are not offlining/onlining
vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
onto the remaining pCPUs.

There should be no vCPU re-sizing, correct?

> (See the attachment with my test code in that bugzilla entry.)
> And, yes, if a CPU fails to online, it can be onlined again when we
> re-trigger the online function.
>
> > .. snip..
> >
> > > > > > > > > Old bugs: (11)
> > > > > > > > > 1. [ACPI] Dom0 can't resume from S3 sleep
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
> > > > > > > >
> > > > > > > > That should be fixed in v3.11 (as now we have the fixes).
> > > > > > > > Could you try v3.10 with Rafael's ACPI tree merged in?
> > > > > > > > (so the patches that he wants to submit for v3.11)
> > > > > > > >
> > > > > > > I re-tested with Rafael's linux-pm.git tree (master and acpi-hotplug branch),
> > > > > > > and found Dom0 S3 sleep/resume can't work, either.
> > > > > >
> > > > > > The patches he has to submit for v3.11 are in the linux-next branch.
> > > > > > You need to use that branch.
> > > > > >
> > > > > Dom0 S3 sleep/resume doesn't work with the linux-next branch, either.
> > > > > Attached the log.
> > > >
> > > > It does work on my box. So I am not sure if this is related to the
> > > > IvyTown box you are using. Does it work on other machines?
> > > >
> > > No, it doesn't work on other machines, either. I also tried on SandyBridge,
> > > IvyBridge desktop and Haswell mobile machines.
> >
> > I just double-checked on my AMD machines with v3.10-rc5 with
> > these extra patches:
> >
> > ebe2886 x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
> > 7c4ae96 Revert "xen/pat: Disable PAT support for now."
> > 729c6ec Revert "xen/pat: Disable PAT using pat_enabled value."
> > bd4fd16 microcode_xen: Add support for AMD family >= 15h
> > 6271c21 x86/microcode: check proper return code.
> > b9a48c8 xen: add CPU microcode update driver
> > c62566c cpu: make sure that cpu/online file created before KOBJ_ADD is emitted
> > 0790542 cpu: fix "crash_notes" and "crash_notes_size" leaks in register_cpu()
> > f90099b xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
> > 29ca6e9 x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
> >
> > and it worked. Let me recompile a kernel without most of them to double-check
> > whether those patches are what makes ACPI S3 suspend/resume work.
> > This is with Xen 4.3 (82cb411). The machine is an M5A97, BIOS 1208 04/18/2012,
> > with 01:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1)
> > as its graphics card.
> >
> After re-testing with the linux-pm.git tree (kernel 3.10.rc6+, commit a913b188df) on
> my IvyTown-EP and IvyBridge desktop systems, Dom0 S3 sleep/resume works!
> When this code is upstreamed to the linux.git tree, I can close this bug.

Yes! Though Ben found another issue with extended sleep - where it will not
use the hypercall. <sigh>

> > > > > > > > > 2. [XL] "xl vcpu-set" causes dom0 crash or panic
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
> > > > > > > >
> > > > > > > > That I think is fixed in v3.10. Could you please check v3.10-rc3?
> > > > > > > >
> > > > > > > Still exists on v3.10-rc3.
> > > > > > > The following command lines can reproduce it:
> > > > > > > # xl vcpu-set 0 1
> > > > > > > # xl vcpu-set 0 20
> > > > > >
> > > > > > Ugh, the exact same stack trace? And can you attach the full dmesg or serial
> > > > > > output (so that I can see what there is at bootup)?
> > > > > >
> > > > > Yes, the same. Also attached in this mail.
> > > >
> > > > One of the fixes is this one:
> > > > http://www.gossamer-threads.com/lists/xen/devel/284897
> > > >
> > > > but the other ones I had not seen. I am wondering if the
> > > > update_sd_lb_stats is b/c of the previous conditions (that is,
> > > > tick_nohz_idle_start hadn't been called).
> > > >
> > > > It is a shot in the dark - but if you use the above-mentioned patch,
> > > > do you still see the update_sd_lb_stats crash?
> > > >
> > > Yes, with the patch we still see the update_sd_lb_stats crash.
> > > It has almost the same trace log as before. Log file is attached.
> >
> > Would it be possible to do a bit of 'git bisect' to figure out why
> > this started?
> >
> It's hard.
> This issue has existed for a long time. We don't even know which upstream
> Linux version works as Dom0 without hitting this bug.

Then a bit of digging will be needed. Sadly I am out of time to do this ATM.
> > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > >
> > > > > > > > That I believe was a QEMU bug:
> > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > >
> > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > in the tree)
> > > > > > > >
> > > > > > > This year and the past year, this bug has always existed (at least in our testing).
> > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > >
> > > > > > Could you retry with Xen 4.3 please?
> > > > > >
> > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > >
> > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > Basically vCPU increment/decrement works fine. I'll close that bug.
> >
> > Excellent!
> > > But there's still a minor issue, as follows.
> > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > # xl vcpu-set $domID 32
> > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > vCPU decrement has the same result.
> > > Can you also try to reproduce my issue?
> >
> This issue doesn't exist when using the latest QEMU traditional tree.
> My previous QEMU was old (March 2013), and I found some of your patches
> were applied in May 2013. Those fixes resolve the issue we reported.
> Close this bug.

Yes!

> But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> several times (e.g. 5 times), the guest will panic. Log is attached.
> Before your patches went into the qemu traditional tree in May 2013, we never
> saw a guest kernel panic.
> dom0: 3.10.0-rc3
> Xen: 4.3.0-RCx
> QEMU: the latest traditional tree
> guest kernel: 3.10.0-RC3
> Shall I file another bug to track this?

Please.

> Can you reproduce this?

Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
test script you are using?

> > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > were you using?
> >
> There are 32 pCPUs in the system we used.
>
> Best Regards,
> Yongjie (Jay)
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, June 22, 2013 2:18 AM
> To: Ren, Yongjie
> Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> Yongxue; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
>
> > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > >
> > > > > That looks like you are hitting the udev race.
> > > > >
> > > > > Could you verify that these patches:
> > > > > https://lkml.org/lkml/2013/5/13/520
> > > > >
> > > > > fix the issue (they are destined for v3.11)?
> > > > >
> > > > Not tried yet. I'll update you later.
> > >
> > > Thanks!
> > >
> > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > and 2 of the 100 runs failed.
> >
> > Hm, does it fail b/c udev can't online the sysfs entry?
> >
> > I think not.
> > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > info about #1 and #2, which are onlined successfully.
>
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> > We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> > We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.
>
> Oh. That is very different from what I thought. You are not offlining/onlining
> vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
> onto the remaining pCPUs.
>
> There should be no vCPU re-sizing, correct?
>
Yes, for this case we do online/offline for pCPUs, not vCPUs.
(The vCPU number doesn't change.)

> > (See the attachment with my test code in that bugzilla entry.)
> > And, yes, if a CPU fails to online, it can be onlined again when we
> > re-trigger the online function.
> >
> > > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > > >
> > > > > > > > > That I believe was a QEMU bug:
> > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > > >
> > > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > > in the tree)
> > > > > > > > >
> > > > > > > > In this year or the past year, this bug has always existed (at least in our testing).
> > > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > > >
> > > > > > > Could you retry with Xen 4.3 please?
> > > > > > >
> > > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > > >
> > > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > > Basically vCPU increment/decrement works fine. I'll close that bug.
> > >
> > > Excellent!
> > > > But there's still a minor issue, as follows.
> > > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > > # xl vcpu-set $domID 32
> > > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > > vCPU decrement has the same result.
> > > > Can you also try to reproduce my issue?
> > >
> > This issue doesn't exist when using the latest QEMU traditional tree.
> > My previous QEMU was old (March 2013), and I found some of your patches
> > were applied in May 2013. Those fixes resolve the issue we reported.
> > Close this bug.
>
> Yes!
>
> > But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> > several times (e.g. 5 times), the guest will panic. Log is attached.
> > Before your patches went into the qemu traditional tree in May 2013, we never
> > saw a guest kernel panic.
> > dom0: 3.10.0-rc3
> > Xen: 4.3.0-RCx
> > QEMU: the latest traditional tree
> > guest kernel: 3.10.0-RC3
> > Shall I file another bug to track this?
>
> Please.
>
> > Can you reproduce this?
>
> Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
> test script you are using?
>
1. xl vcpu-set $domID 2
2. xl vcpu-set $domID 20
3. Repeat steps #1 and #2 several times. (Guest kernel panics ...)

I also filed a bug in bugzilla to track this.
You can get more info at the following link:
http://bugzilla.xenproject.org/bugzilla/show_bug.cgi?id=1860

--
Jay

> > > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > > were you using?
> > >
> > There are 32 pCPUs in the system we used.
> >
> > Best Regards,
> > Yongjie (Jay)
>
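The three steps above are easy to script; a sketch, assuming $1 names a
running HVM guest built with maxvcpus >= 20 (the iteration count and sleeps
are arbitrary choices, not from the report):

    #!/bin/sh
    # Repro sketch for bug 1860: toggle the vCPU count repeatedly.
    domID=$1
    i=0
    while [ $i -lt 5 ]; do
        xl vcpu-set "$domID" 2
        sleep 5                 # give the guest time to offline vCPUs
        xl vcpu-set "$domID" 20
        sleep 5
        i=$((i + 1))
    done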
On Tue, Jul 02, 2013 at 08:09:48AM +0000, Ren, Yongjie wrote:
> > -----Original Message-----
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Saturday, June 22, 2013 2:18 AM
> > To: Ren, Yongjie
> > Cc: george.dunlap@eu.citrix.com; Xu, YongweiX; Liu, SongtaoX; Tian,
> > Yongxue; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
> >
> > > > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1851
> > > > > >
> > > > > > That looks like you are hitting the udev race.
> > > > > >
> > > > > > Could you verify that these patches:
> > > > > > https://lkml.org/lkml/2013/5/13/520
> > > > > >
> > > > > > fix the issue (they are destined for v3.11)?
> > > > > >
> > > > > Not tried yet. I'll update you later.
> > > >
> > > > Thanks!
> > > >
> > > > We tested kernel 3.9.3 with the 2 patches you mentioned, and found this
> > > > bug still exists. For example, we did CPU online-offline for Dom0 100 times,
> > > > and 2 of the 100 runs failed.
> > >
> > > Hm, does it fail b/c udev can't online the sysfs entry?
> > >
> > > I think not.
> > > When it fails to online CPU #3 (trying to online #1~#3), the output of the
> > > "udevadm monitor --env" command shows no info about CPU #3. It does show
> > > info about #1 and #2, which are onlined successfully.
> >
> > And if you re-trigger the 'xl vcpu-set' it eventually comes back up, right?
> >
> > We don't use the 'xl vcpu-set' command when doing the CPU hot-plug.
> > We just call xc_cpu_online/offline() in tools/libxc/xc_cpu_hotplug.c to test.
> >
> > Oh. That is very different from what I thought. You are not offlining/onlining
> > vCPUs - you are offlining/onlining pCPUs! So Xen has to cram the dom0 vCPUs
> > onto the remaining pCPUs.
> >
> > There should be no vCPU re-sizing, correct?
> >
> Yes, for this case we do online/offline for pCPUs, not vCPUs.
> (The vCPU number doesn't change.)

OK, so nothing to do with Linux, but mostly with the Xen hypervisor. Do you
know who added this functionality? Can they help?

> > > (See the attachment with my test code in that bugzilla entry.)
> > > And, yes, if a CPU fails to online, it can be onlined again when we
> > > re-trigger the online function.
> > >
> > > > > > > > > > > 4. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
> > > > > > > > > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
> > > > > > > > > >
> > > > > > > > > > That I believe was a QEMU bug:
> > > > > > > > > > http://lists.xen.org/archives/html/xen-devel/2013-05/msg01054.html
> > > > > > > > > >
> > > > > > > > > > which should be in QEMU traditional now (05-21 was when it went
> > > > > > > > > > in the tree)
> > > > > > > > > >
> > > > > > > > > In this year or the past year, this bug has always existed (at least in our testing).
> > > > > > > > > 'xl vcpu-set' can't decrease the vCPU number of a HVM guest.
> > > > > > > >
> > > > > > > > Could you retry with Xen 4.3 please?
> > > > > > > >
> > > > > > > With Xen 4.3 & Linux 3.10.0-rc3, I can't decrease the vCPU number of a guest.
> > > > > >
> > > > > Sorry - when I wrote that, I was still using the RHEL 6.4 kernel as the guest.
> > > > > After upgrading the guest kernel to 3.10.0-rc3, the result became better.
> > > > > Basically vCPU increment/decrement works fine. I'll close that bug.
> > > >
> > > > Excellent!
> > > > > But there's still a minor issue, as follows.
> > > > > After booting a guest with 'vcpus=4' and 'maxvcpus=32', change its vCPU number:
> > > > > # xl vcpu-set $domID 32
> > > > > Then you get fewer than 32 (e.g. 19) CPUs in the guest; set the vCPU
> > > > > number to 32 again (from 19) and the guest does get 32 vCPUs.
> > > > > But 'xl vcpu-set $domID 8' works fine, as we expected.
> > > > > vCPU decrement has the same result.
> > > > > Can you also try to reproduce my issue?
> > > >
> > > This issue doesn't exist when using the latest QEMU traditional tree.
> > > My previous QEMU was old (March 2013), and I found some of your patches
> > > were applied in May 2013. Those fixes resolve the issue we reported.
> > > Close this bug.
> >
> > Yes!
> >
> > > But they introduced another issue: when doing 'xl vcpu-set' on an HVM guest
> > > several times (e.g. 5 times), the guest will panic. Log is attached.
> > > Before your patches went into the qemu traditional tree in May 2013, we never
> > > saw a guest kernel panic.
> > > dom0: 3.10.0-rc3
> > > Xen: 4.3.0-RCx
> > > QEMU: the latest traditional tree
> > > guest kernel: 3.10.0-RC3
> > > Shall I file another bug to track this?
> >
> > Please.
> >
> > > Can you reproduce this?
> >
> > Could you tell me how you are doing 'xl vcpu-set'? Is there a particular
> > test script you are using?
> >
> 1. xl vcpu-set $domID 2
> 2. xl vcpu-set $domID 20
> 3. Repeat steps #1 and #2 several times. (Guest kernel panics ...)
>
> I also filed a bug in bugzilla to track this.
> You can get more info at the following link:
> http://bugzilla.xenproject.org/bugzilla/show_bug.cgi?id=1860

OK, thank you. I am a bit busy right now tracking down some other bugs that
I promised I would look after. But after that I should have some time.

> --
> Jay
>
> > > > Sure. Now how many PCPUS do you have? And what version of QEMU traditional
> > > > were you using?
> > > >
> > > There are 32 pCPUs in the system we used.
> > >
> > > Best Regards,
> > > Yongjie (Jay)
> >
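For readers without access to the bugzilla attachment, a stress loop along
the lines of the test described above can be sketched against libxc's
physical-CPU hotplug calls (xc_cpu_online/xc_cpu_offline, implemented in
tools/libxc/xc_cpu_hotplug.c); the CPU range, iteration count and error
handling here are illustrative, not the reporter's actual code:

    /* pcpu-stress.c -- sketch of a pCPU online/offline stress test.
     * Build (illustrative): gcc pcpu-stress.c -lxenctrl -o pcpu-stress
     */
    #include <stdio.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int round, cpu;

        if (!xch) {
            fprintf(stderr, "cannot open xc interface\n");
            return 1;
        }

        for (round = 0; round < 100; round++) {
            for (cpu = 1; cpu <= 3; cpu++) {   /* leave pCPU 0 alone */
                if (xc_cpu_offline(xch, cpu))
                    fprintf(stderr, "round %d: offline cpu%d failed\n",
                            round, cpu);
                if (xc_cpu_online(xch, cpu))
                    fprintf(stderr, "round %d: online cpu%d failed\n",
                            round, cpu);
            }
        }

        xc_interface_close(xch);
        return 0;
    }

A run matching the report would see a couple of the online calls fail out of
the 100 rounds, with no corresponding event in 'udevadm monitor --env'.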
Konrad Rzeszutek Wilk
2013-Nov-08 16:21 UTC
Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> >
> > Ok, I can reproduce that too.
>
> This is what dom0 tells me:
>
> [ 483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> [ 483.603675] "echo 0 > /proc/sys/kernel/hung_task_timG^G
> [ 483.620747] init D ffff880062b59c78 5904 4163 1 0x00000000
> [ 483.637699] ffff880062b59bc8 0000000000000^G
> [ 483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> [ 483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> [ 483.689527] Call Trace:
> [ 483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
> [ 483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
> [ 483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> [ 483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> [ 483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
> [ 483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> [ 483.809929] [<ffffffff813ba202>] __xenbus_switch_ste+0x32/0x120
> ^G[ 483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> [ 483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> [ 483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> [ 483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> [ 483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
> [ 483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> [ 483.927100] [<ffffffff810a88a1>] kernel_restart+0x11^G
> [ 483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> [ 483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0^G
> [ 483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> [ 483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> [ 484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x^G
> [ 484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> [ 484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> [ 484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> [ 484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [ 484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> [ 484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> [ 484.116585] 3 locks held by init/4163:
> [ 484.131650]+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> ^G^G^G^G^G^G[ 484.147704] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> [ 484.164359] #2: (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>

A bit of debugging shows that when we are in this state:

MSent SIGKILL to[ 100.454603] xen-pciback pci-1-0: shutdown

telnet> send brk
[ 110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)

... snip..

xenstored       x 0000000000000002  5504  3437      1 0x00000006
 ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
 ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
 ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
Call Trace:
 [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
 [<ffffffff816b1594>] schedule+0x24/0x70
 [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
 [<ffffffff8109c981>] do_group_exit+0x51/0x140
 [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
 [<ffffffff8104c49f>] do_signal+0x4f/0x610
 [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
 [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
 [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
 [<ffffffff816bc372>] int_signal+0x12/0x17

The 'x' means that the task has been killed. (The other two threads, 'xenbus'
and 'xenwatch', are sleeping.)

Since xenstored can nowadays live in a domain and not just in the initial
domain, and xenstored can be restarted at any time, we can't depend on the
task pid. Nor can we depend on the other domain telling us that it is dead.

The best we can do is get out of the way of the shutdown process and not
hang on forever.

This patch should solve it:

From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 8 Nov 2013 10:48:58 -0500
Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
 shutdown/restart.

'read_reply' works with 'process_msg' to read a reply from XenBus.
'process_msg' runs from within the 'xenbus' thread. Whenever a message
shows up in XenBus it is put on the xs_state.reply_list list and
'read_reply' picks it up.

The problem is if the backend domain or the xenstored process is killed.
In that case 'xenbus' is still waiting - and 'read_reply', if called, is
stuck forever waiting for the reply_list to have some contents.

This is normally not a problem - as the backend domain can come back or
the xenstored process can be restarted. However, if the domain is in the
process of being powered off/restarted/halted there is no point in
waiting for it to come back - as we are effectively being terminated and
should not impede the progress.

This patch solves this problem by checking the 'system_state' value to
see if we are heading towards death. We also make the wait mechanism a
bit more asynchronous.

Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b6d5fff..177fb19 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
 
 	while (list_empty(&xs_state.reply_list)) {
 		spin_unlock(&xs_state.reply_lock);
-		/* XXX FIXME: Avoid synchronous wait for response here. */
-		wait_event(xs_state.reply_waitq,
-			   !list_empty(&xs_state.reply_list));
+		wait_event_timeout(xs_state.reply_waitq,
+				   !list_empty(&xs_state.reply_list),
+				   msecs_to_jiffies(500));
+
+		/*
+		 * If we are in the process of being shut down there is
+		 * no point in trying to contact XenBus - it is either
+		 * killed (xenstored application) or the other domain
+		 * has been killed or is unreachable.
+		 */
+		switch (system_state) {
+		case SYSTEM_POWER_OFF:
+		case SYSTEM_RESTART:
+		case SYSTEM_HALT:
+			return ERR_PTR(-EIO);
+		default:
+			break;
+		}
 		spin_lock(&xs_state.reply_lock);
 	}
 
@@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
 
 	mutex_unlock(&xs_state.request_mutex);
 
+	if (IS_ERR(ret))
+		return ret;
+
 	if ((msg->type == XS_TRANSACTION_END) ||
 	    ((req_msg.type == XS_TRANSACTION_START) &&
 	     (msg->type == XS_ERROR)))
-- 
1.7.7.6
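For context, the scenario the patch addresses can be reached with an ordinary
PCI assignment; a sketch (the BDF 04:00.0 and config file name are examples,
and the exact xl invocation may differ by toolstack version):

    # Sketch: shut dom0 down while a guest still holds a passed-through device.
    xl pci-assignable-add 04:00.0          # hand the device to pciback
    xl create guest.cfg 'pci=["04:00.0"]'  # guest boots with the device
    reboot                                 # without the patch, dom0 stalls in
                                           # xenbus_dev_shutdown (trace above)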
xen@bugs.xenproject.org
2013-Nov-08 16:30 UTC
Processed: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
Command failed: Unknown command `On'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 45.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
Matt Wilson
2013-Nov-10 20:20 UTC
Re: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
On Fri, Nov 08, 2013 at 11:21:21AM -0500, Konrad Rzeszutek Wilk wrote:
[...]
> This patch should solve it:
>
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
>
> 'read_reply' works with 'process_msg' to read a reply from XenBus.
> 'process_msg' runs from within the 'xenbus' thread. Whenever a message
> shows up in XenBus it is put on the xs_state.reply_list list and
> 'read_reply' picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
>
> This is normally not a problem - as the backend domain can come back or
> the xenstored process can be restarted. However, if the domain is in the
> process of being powered off/restarted/halted there is no point in
> waiting for it to come back - as we are effectively being terminated and
> should not impede the progress.
>
> This patch solves this problem by checking the 'system_state' value to
> see if we are heading towards death. We also make the wait mechanism a
> bit more asynchronous.
>
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Makes sense to me.

Acked-by: Matt Wilson <msw@amazon.com>

> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>  
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut down there is
> +		 * no point in trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +		case SYSTEM_POWER_OFF:
> +		case SYSTEM_RESTART:
> +		case SYSTEM_HALT:
> +			return ERR_PTR(-EIO);
> +		default:
> +			break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>  
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>  
>  	mutex_unlock(&xs_state.request_mutex);
>  
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
xen@bugs.xenproject.org
2013-Nov-10 20:30 UTC
Processed: Re: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> On Fri, Nov 08, 2013 at 11:21:21AM -0500, Konrad Rzeszutek Wilk wrote:
Command failed: Unknown command `On'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 51.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
Liu, SongtaoX
2013-Nov-11 02:40 UTC
Re: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Yes, the patch fixes the dom0 hang during reboot when a PCI device is still
assigned to a guest. Thanks.

Regards
Songtao

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, November 09, 2013 12:21 AM
> To: Ren, Yongjie; george.dunlap@eu.citrix.com; xen@bugs.xenproject.org
> Cc: Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue; xen-devel@lists.xen.org
> Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> Was:Re: [Xen-devel] test report for Xen 4.3 RC1
>
> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > > http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > >
> > > Ok, I can reproduce that too.
> >
> > This is what dom0 tells me:
> >
> > [ 483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> > [ 483.603675] "echo 0 > /proc/sys/kernel/hung_task_timG^G
> > [ 483.620747] init D ffff880062b59c78 5904 4163 1 0x00000000
> > [ 483.637699] ffff880062b59bc8 0000000000000^G
> > [ 483.655189] ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> > [ 483.672505] ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> > [ 483.689527] Call Trace:
> > [ 483.706298] [<ffffffff816a0814>] schedule+0x24/0x70
> > [ 483.723604] [<ffffffff813bb0dd>] read_reply+0xad/0x160
> > [ 483.741162] [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> > [ 483.758572] [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> > [ 483.775741] [<ffffffff813bb3c6>] xs_single+0x46/0x60
> > [ 483.792791] [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> > [ 483.809929] [<ffffffff813ba202>] __xenbus_switch_ste+0x32/0x120
> > ^G[ 483.826947] [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> > [ 483.843792] [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> > [ 483.860412] [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> > [ 483.877312] [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> > [ 483.894036] [<ffffffff8142e275>] device_shutdown+0x15/0x180
> > [ 483.910605] [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> > [ 483.927100] [<ffffffff810a88a1>] kernel_restart+0x11^G
> > [ 483.943262] [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> > [ 483.959480] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0^G
> > [ 483.975786] [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> > [ 483.991819] [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> > [ 484.007675] [<ffffffff8115c725>] ? __free_pages+0x25/0x^G
> > [ 484.023336] [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> > [ 484.039176] [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> > [ 484.055174] [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> > [ 484.070747] [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [ 484.086121] [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> > [ 484.101318] [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> > [ 484.116585] 3 locks held by init/4163:
> > [ 484.131650]+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> > ^G^G^G^G^G^G[ 484.147704] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> > [ 484.164359] #2: (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>
> A bit of debugging shows that when we are in this state:
>
> MSent SIGKILL to[ 100.454603] xen-pciback pci-1-0: shutdown
>
> telnet> send brk
> [ 110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
> show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
> show-blocked-tasks(w) dump-ftrace-buffer(z)
>
> ... snip..
>
> xenstored       x 0000000000000002  5504  3437      1 0x00000006
>  ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
>  ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
> Call Trace:
>  [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
>  [<ffffffff816b1594>] schedule+0x24/0x70
>  [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
>  [<ffffffff8109c981>] do_group_exit+0x51/0x140
>  [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
>  [<ffffffff8104c49f>] do_signal+0x4f/0x610
>  [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
>  [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
>  [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
>  [<ffffffff816bc372>] int_signal+0x12/0x17
>
> The 'x' means that the task has been killed. (The other two threads,
> 'xenbus' and 'xenwatch', are sleeping.)
>
> Since xenstored can nowadays live in a domain and not just in the initial
> domain, and xenstored can be restarted at any time, we can't depend on the
> task pid. Nor can we depend on the other domain telling us that it is dead.
>
> The best we can do is get out of the way of the shutdown process and not
> hang on forever.
>
> This patch should solve it:
>
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
>
> 'read_reply' works with 'process_msg' to read a reply from XenBus.
> 'process_msg' runs from within the 'xenbus' thread. Whenever a message
> shows up in XenBus it is put on the xs_state.reply_list list and
> 'read_reply' picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
>
> This is normally not a problem - as the backend domain can come back or
> the xenstored process can be restarted. However, if the domain is in the
> process of being powered off/restarted/halted there is no point in
> waiting for it to come back - as we are effectively being terminated and
> should not impede the progress.
>
> This patch solves this problem by checking the 'system_state' value to
> see if we are heading towards death.
> We also make the wait mechanism a bit more asynchronous.
>
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>  
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut down there is
> +		 * no point in trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +		case SYSTEM_POWER_OFF:
> +		case SYSTEM_RESTART:
> +		case SYSTEM_HALT:
> +			return ERR_PTR(-EIO);
> +		default:
> +			break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>  
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>  
>  	mutex_unlock(&xs_state.request_mutex);
>  
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
> -- 
> 1.7.7.6
xen@bugs.xenproject.org
2013-Nov-11 02:45 UTC
Processed: RE: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Processing commands for xen@bugs.xenproject.org:

> Yes, the patch fixed the dom0 hang issue during rebooting with guest pci de
Command failed: Unknown command `Yes,'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 437, <M> line 50.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
On Tue, 2013-05-28 at 16:24 +0100, George Dunlap wrote:
> > create !
> > title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."
>
> 1. I think that these commands have to come at the top
> 2. You don't need quotes in the title
> 3. You need to be polite and say "thanks" at the end so it knows it can
> stop paying attention. :-)

4. Use Bcc and not Cc so that the entire subsequent thread doesn't get sent
to the bot when folks reply-all.

Ian.
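Putting those four points together, a message body the bot should accept
would look like this (sent to xen@bugs.xenproject.org via Bcc, per point 4):

    create !
    title -1 linux, xenbus mutex hangs when rebooting dom0 and guests hung.
    thanks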