Hi everyone,

I would like to ask what the current status of FLR, or rather of FLR emulation, is in the latest Xen, and whether we can expect better support in the future.

I'm asking because with xl (latest build and traditional qemu, not upstream) I have always had problems rebooting domUs which have vga cards passed through to them: apparently the cards don't get reinitialized, which then causes either bluescreens (Windows), black screens (Linux) or a complete freeze of dom0. As far as I understand, this is because the vga cards do not have FLR capability (lspci -vvv shows FLReset-).

So while rebooting lately sometimes works on Windows, it never works on Linux domUs. It appears that xl is simply not capable of dealing with reboots of domUs that have non-FLR vga cards passed through, and I have to reboot dom0 to get the vga cards running again.

Is this the current status, or is this supposed to work and I only have a problem with my setup?

Also, I'm specifically referring to xl because back in the day when I used xm with Xen 4.0 and 4.1 this was never an issue, and I could reboot both Linux and Windows domUs as often as I wanted (with the same hardware setup I now use with xl). So it seems to me that it is possible to handle non-FLR vga cards gracefully, but xl simply does not do that.

It would be great to have a quick roundup of the current situation and future plans, because I'm planning a project to use Xen's vga passthrough in a cloud / big data setup, and the unreliable reboot behaviour is currently a deal breaker for me.

Thanks in advance!

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

On Thu, 2013-09-26 at 18:05 +0200, Matthias wrote:
> I would like to ask what the current status of FLR, or better of FLR
> emulation is in latest Xen and if we can expect better support in the
> future.
>
> Is this the current status or is this supposed to work and I only have
> a problem on my setup?

xl simply asks the dom0 kernel to reset the card, so this is entirely dependent on the functionality of your dom0 kernel and/or the features of the particular hardware WRT allowing things to be reset.

> Also, I'm specifically referring to xl because back in the day when I
> used xm with xen 4.0 and 4.1, this never was an issue and i could
> reboot both linux and windows domUs without issues as often as I
> wanted (with the same hardware setup I now use with xl). So to me it
> seems that there is a possibility to handle non-FLR'ed vga cards
> gracefully, but xl simply isn't capable of that / does not do that.

This I'm afraid I don't know enough about to comment much.

tools/python/xen/util/pci.py appears to implement various FLR quirks for bits of hardware, including some GFX from the looks of things. These all belong in the upstream Linux kernel these days.

You don't say which kernel you are using but you could try updating it. You could also check the kernel source for a quirk for your particular hardware.

Ian.

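Since the reset is delegated to the dom0 kernel, it can help to check from dom0 what a given function offers. The following is only a minimal sketch: the helper names, the optional sysfs root argument and the example BDF are illustrative assumptions and do not come from the thread, and the exact layout of lspci output varies between pciutils versions.

```shell
#!/bin/sh
# Sketch only: helper names and the example BDF are hypothetical.

# has_flr: read `lspci -vvv` output for ONE device on stdin and succeed
# if any capability line advertises Function Level Reset ("FLReset+").
has_flr() {
    grep -q 'FLReset+'
}

# has_reset_node BDF [SYSFS_ROOT]: succeed if a "reset" file exists for
# the device under sysfs. SYSFS_ROOT defaults to /sys/bus/pci/devices
# and is a parameter only so the helper can be exercised without hardware.
has_reset_node() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    [ -e "$root/$bdf/reset" ]
}

# Possible use in dom0 (the BDF below is a placeholder):
#   lspci -vvv -s 0000:01:00.0 | has_flr || echo "no FLR advertised"
#   has_reset_node 0000:01:00.0 || echo "no sysfs reset node"
```

If neither check succeeds for a function, the toolstack has nothing to ask the kernel for, which matches the behaviour described in this thread for cards showing FLReset-.
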
On 26/09/13 17:05, Matthias wrote:
> Hi everyone,
>
> I would like to ask what the current status of FLR, or better of FLR
> emulation is in latest Xen and if we can expect better support in the
> future.

What are these cards, are they multi-function and do they actually support FLR? Many graphics cards do not.

I have the following hack to pciback to fallback to a bus reset for multi-function devices without FLR. Does it help for your use case? You will need to ensure that all functions are co-assigned to the same domain.

David

8<---------------------------------------
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 4e8ba38..5a03e63 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -14,6 +14,7 @@
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/atomic.h>
+#include <linux/delay.h>
 #include <xen/events.h>
 #include <asm/xen/pci.h>
 #include <asm/xen/hypervisor.h>
@@ -43,6 +44,7 @@ struct pcistub_device {
 	struct kref kref;
 	struct list_head dev_list;
 	spinlock_t lock;
+	bool created_reset_file;
 
 	struct pci_dev *dev;
 	struct xen_pcibk_device *pdev;/* non-NULL if struct pci_dev is in use */
@@ -60,6 +62,114 @@ static LIST_HEAD(pcistub_devices);
 static int initialize_devices;
 static LIST_HEAD(seized_devices);
 
+/*
+ * pci_reset_function() will only work if there is a mechanism to
+ * reset that single function (e.g., FLR or a D-state transition).
+ * For PCI hardware that has two or more functions but no per-function
+ * reset, we can do a bus reset iff all the functions are co-assigned
+ * to the same domain.
+ *
+ * If a function has no per-function reset mechanism the 'reset' sysfs
+ * file that the toolstack uses to reset a function prior to assigning
+ * the device will be missing.  In this case, pciback adds its own
+ * which will try a bus reset.
+ *
+ * Note: pciback does not check for co-assigment before doing a bus
+ * reset, only that the devices are bound to pciback.  The toolstack
+ * is assumed to have done the right thing.
+ */
+static int __pcistub_reset_function(struct pci_dev *dev)
+{
+	struct pci_dev *pdev;
+	u16 ctrl;
+	int ret;
+
+	ret = __pci_reset_function_locked(dev);
+	if (ret == 0)
+		return 0;
+
+	if (pci_is_root_bus(dev->bus) || dev->subordinate || !dev->bus->self)
+		return -ENOTTY;
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
+		if (pdev != dev && (!pdev->driver
+				    || strcmp(pdev->driver->name, "pciback")))
+			return -ENOTTY;
+		pci_save_state(pdev);
+	}
+
+	pci_read_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(200);
+
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(200);
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		pci_restore_state(pdev);
+
+	return 0;
+}
+
+static int pcistub_reset_function(struct pci_dev *dev)
+{
+	int ret;
+
+	device_lock(&dev->dev);
+	ret = __pcistub_reset_function(dev);
+	device_unlock(&dev->dev);
+
+	return ret;
+}
+
+static ssize_t pcistub_reset_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+	ssize_t result = strict_strtoul(buf, 0, &val);
+
+	if (result < 0)
+		return result;
+
+	if (val != 1)
+		return -EINVAL;
+
+	result = pcistub_reset_function(pdev);
+	if (result < 0)
+		return result;
+	return count;
+}
+static DEVICE_ATTR(reset, 0200, NULL, pcistub_reset_store);
+
+static int pcistub_try_create_reset_file(struct pcistub_device *psdev)
+{
+	struct device *dev = &psdev->dev->dev;
+	struct sysfs_dirent *reset_dirent;
+	int ret;
+
+	reset_dirent = sysfs_get_dirent(dev->kobj.sd, NULL, "reset");
+	if (reset_dirent) {
+		sysfs_put(reset_dirent);
+		return 0;
+	}
+
+	ret = device_create_file(dev, &dev_attr_reset);
+	if (ret < 0)
+		return ret;
+	psdev->created_reset_file = true;
+	return 0;
+}
+
+static void pcistub_remove_reset_file(struct pcistub_device *psdev)
+{
+	if (psdev && psdev->created_reset_file)
+		device_remove_file(&psdev->dev->dev, &dev_attr_reset);
+}
+
 static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
 {
 	struct pcistub_device *psdev;
@@ -95,12 +205,15 @@ static void pcistub_device_release(struct kref *kref)
 
 	dev_dbg(&dev->dev, "pcistub_device_release\n");
 
+	pcistub_remove_reset_file(psdev);
+
 	xen_unregister_device_domain_owner(dev);
 
 	/* Call the reset function which does not take lock as this
 	 * is called from "unbind" which takes a device_lock mutex.
 	 */
-	__pci_reset_function_locked(dev);
+	__pcistub_reset_function(psdev->dev);
+
 	if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
 		dev_dbg(&dev->dev, "Could not reload PCI state\n");
 	else
@@ -268,7 +381,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev)
 	/* This is OK - we are running from workqueue context
 	 * and want to inhibit the user from fiddling with 'reset'
 	 */
-	pci_reset_function(dev);
+	pcistub_reset_function(psdev->dev);
 	pci_restore_state(psdev->dev);
 
 	/* This disables the device. */
@@ -392,7 +505,7 @@ static int pcistub_init_device(struct pci_dev *dev)
 		dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
 	else {
 		dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
-		__pci_reset_function_locked(dev);
+		__pcistub_reset_function(dev);
 		pci_restore_state(dev);
 	}
 	/* Now disable the device (this also ensures some private device
@@ -467,6 +580,10 @@ static int pcistub_seize(struct pci_dev *dev)
 	if (!psdev)
 		return -ENOMEM;
 
+	err = pcistub_try_create_reset_file(psdev);
+	if (err < 0)
+		goto out;
+
 	spin_lock_irqsave(&pcistub_devices_lock, flags);
 
 	if (initialize_devices) {
@@ -485,10 +602,9 @@ static int pcistub_seize(struct pci_dev *dev)
 	}
 
 	spin_unlock_irqrestore(&pcistub_devices_lock, flags);
-
+out:
 	if (err)
 		pcistub_device_put(psdev);
-
 	return err;
 }

On 09/26/2013 12:20 PM, David Vrabel wrote:
> On 26/09/13 17:05, Matthias wrote:
>> Hi everyone,
>>
>> I would like to ask what the current status of FLR, or better of FLR
>> emulation is in latest Xen and if we can expect better support in the
>> future.
>
> What are these cards, are they multi-function and do they actually
> support FLR? Many graphics cards do not.
>
> I have the following hack to pciback to fallback to a bus reset for
> multi-function devices without FLR. Does it help for your use case?
> You will need to ensure that all functions are co-assigned to the same
> domain.

New kernels (e.g. 3.8) have full support for PCI-e and PCI AF FLRs as well as fallback support for D0-D3 and secondary bus resets. This functionality is also in some of the last 2.6 kernels like 2.6.39. If you are using an older kernel I guess you might need to patch it.

Also, depending on your hw there might be a specific quirk you need (e.g. the 82599 quirk in pci/quirks.c).

Ross

I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4 kernel I found which doesn't give me this issue:
http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html

So I would assume that the kernel should be new enough to handle that. On the other hand, as far as I understand the whole process, the kernel itself will only deal with the vga card if it is actually bound to dom0 / to its driver, which it is not. Is there any way to check whether the reset request from xl is really executed in dom0, or to issue that request manually?

Btw: the hardware is a Radeon HD 5750 and a Radeon HD 5400.

On 26/09/13 18:48, Ross Philipson wrote:
> On 09/26/2013 12:20 PM, David Vrabel wrote:
>> On 26/09/13 17:05, Matthias wrote:
>>> Hi everyone,
>>>
>>> I would like to ask what the current status of FLR, or better of FLR
>>> emulation is in latest Xen and if we can expect better support in the
>>> future.
>>
>> What are these cards, are they multi-function and do they actually
>> support FLR? Many graphics cards do not.
>>
>> I have the following hack to pciback to fallback to a bus reset for
>> multi-function devices without FLR. Does it help for your use case?
>> You will need to ensure that all functions are co-assigned to the same
>> domain.
>
> New kernels (e.g. 3.8) have full support for PCI-e and PCI AF FLRs as
> well as fallback support for D0-D3 and secondary bus resets. This
> functionality is also in the some of the last 2.6 kernels like 2.6.39.
> If you are using an older kernel I guess you might need to patch it.

It will only do a secondary bus reset iff the function to be reset is the only function on that bus. If you have a multi-function device secondary bus reset is not tried.

David

Hi,

thanks for your answers. The cards are an AMD HD 5750 and an HD 5400, both dual-function (because of the audio function), both co-assigned to their respective domU, and neither capable of FLR according to the lspci -vvv output.

Also, @Ross, I'm running a 3.8.2 kernel, so this should be fine. But I assume that the 'official' request where xl asks dom0 to do the reset does not work (if I have understood David correctly) because the cards are dual-function, so no bus reset is actually executed, which causes the misbehaviour. xm, on the other hand, does a bus reset, so it works in this specific case.

I'm currently recompiling the kernel to see if your patch works, David.

Also, just to understand it better, is the secondary bus reset the thing which you can manually invoke via /sys/bus/pci/devices/.../reset ?

So as a workaround, would the following work in principle?

xl pci-assignable-remove 0X:00.0
xl pci-assignable-remove 0X:00.1
echo "1" > /sys/bus/pci/devices/0X:00.0/reset
echo "1" > /sys/bus/pci/devices/0X:00.1/reset
xl pci-assignable-add 0X:00.0
xl pci-assignable-add 0X:00.1

Anyway, thanks for your answers and I will report if the patch works!

2013/9/26 David Vrabel <david.vrabel@citrix.com>
> On 26/09/13 18:48, Ross Philipson wrote:
> > On 09/26/2013 12:20 PM, David Vrabel wrote:
> >> On 26/09/13 17:05, Matthias wrote:
> >>> Hi everyone,
> >>>
> >>> I would like to ask what the current status of FLR, or better of FLR
> >>> emulation is in latest Xen and if we can expect better support in the
> >>> future.
> >>
> >> What are these cards, are they multi-function and do they actually
> >> support FLR? Many graphics cards do not.
> >>
> >> I have the following hack to pciback to fallback to a bus reset for
> >> multi-function devices without FLR. Does it help for your use case?
> >> You will need to ensure that all functions are co-assigned to the same
> >> domain.
> >
> > New kernels (e.g. 3.8) have full support for PCI-e and PCI AF FLRs as
> > well as fallback support for D0-D3 and secondary bus resets. This
> > functionality is also in the some of the last 2.6 kernels like 2.6.39.
> > If you are using an older kernel I guess you might need to patch it.
>
> It will only do a secondary bus reset iff the function to be reset is
> the only function on that bus. If you have a multi-function device
> secondary bus reset is not tried.
>
> David

On 09/26/2013 07:41 PM, Matthias wrote:
> Hi,
>
> thanks for your answers, the cards are a AMD HD 5750 and a HD 5400, both
> with dual functions (due to audio capabilities), both co-assigned to
> their respective domU and both not capable of FLR from lspci -vvv output.
>
> also, @Ross, I'm running a 3.8.2 Kernel, so this should be fine, but I
> assume that the 'official' command where xl asks the dom0 about the
> reset do not work (if I have understand david correctly) since it's dual
> function so no dual bus reset is actually executed causing the
> misbehaviour, and on the other side xm doing a bus reset so it works in
> this specific case.
>
> I'm currently recompiling the kernel to see if your patch works David.
>
> Also, just to understand it better, is the secondary bus reset the thing
> which you can manually invoke via /sys/bus/pci/devices/.../reset ?
>
> So as a workaround, would the following work in principle?
>
> xl pci-assignable-remove 0X:00.0
> xl pci-assignable-remove 0X:00.1
> echo "1" > /sys/bus/pci/devices/0X:00.0/reset
> echo "1" > /sys/bus/pci/devices/0X:00.1/reset

This bit is up to the driver to implement. Since pciback is a placeholder rather than a driver that knows about the hardware, the reset node won't be there.

You could try to do something with setpci to force the registers between D0 and D3 power states in a vague hope that might do something, but I doubt it.

The reason nvidia cards work OK is because the domU driver knows how to reinitialize the hardware and acts accordingly. If the manufacturer won't implement a standard function to reset the hardware, then it is up to their drivers to handle the situation.

If (on Windows domUs) ejecting the card before shutdown/reboot of the domU works, you could probably write some powershell magic that does that on shutdown/reboot as a reasonable workaround.

Gordan

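The setpci idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not something tested in the thread: it assumes a pciutils setpci that accepts the CAP_PM+4.w register syntax, the function names are hypothetical, and, as noted above, there is no guarantee that a D3hot/D0 cycle actually reinitializes a GPU.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# pmcsr_with_state CURRENT STATE: print the 16-bit PMCSR value (hex)
# with the PowerState field (bits 1:0) replaced by STATE
# (0 = D0, 3 = D3hot). All other bits are left as they were read.
pmcsr_with_state() {
    printf '%04x\n' $(( ($1 & ~0x3) | ($2 & 0x3) ))
}

# cycle_d3_d0 BDF: put a function into D3hot and bring it back to D0.
# PMCSR sits at offset 4 of the PCI Power Management capability.
# Needs root in dom0; the device must not be in use by any guest.
cycle_d3_d0() {
    bdf=$1
    cur=$(setpci -s "$bdf" CAP_PM+4.w) || return 1
    setpci -s "$bdf" CAP_PM+4.w="$(pmcsr_with_state "0x$cur" 3)" || return 1
    sleep 1
    setpci -s "$bdf" CAP_PM+4.w="$(pmcsr_with_state "0x$cur" 0)" || return 1
    sleep 1
}

# Possible use (placeholder BDF):
#   cycle_d3_d0 0000:01:00.0
```

Only the PowerState bits are changed; whether the hardware loses its internal state on the way through D3hot depends on the device.
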
Hi Gordon,

I tried your patch on my dom0 kernel and I think it somehow helped, in the sense that I can now reboot the domUs without crashing the whole host. But the Linux domU still gets a black screen, and the Windows 7 domU only boots as far as a black screen with an (actually movable) cursor, but no further. This might only be a coincidence, though; I have to double check this.

I tried some other stuff, too:

1) after domU shutdown, rebind both functions to the dom0 drivers, do a sysfs reset and re-add to assignable devices -> crashes dom0
2) after domU shutdown, rebind both functions to the dom0 drivers and re-add to assignable devices -> dom0 sometimes crashes when the domU using the devices comes up, sometimes not, but no success either way
3) sysfs reset of the devices from within the domU seems to be passed through to dom0 (see commands in the qemu log) but has no effect

Also, I analysed your code and compared it to the stuff in the python tools of xm. It is the same approach and I don't see any obvious differences.

Then I tried to replicate the secondary bus reset on the command line for testing purposes via

printf '\x40' | dd of=/sys/devices/pci0000\:00/0000\:00\:0b.0/config bs=1 seek=$((0x3e)) count=1 conv=notrunc

but I think I got the endianness or the offset slightly wrong, because after that xl refuses to give the device (00:0b.0 is the bus of my 2-function vga card I have assigned to my domU) to the domU and later crashes dom0.

So I'm a little lost at this point and would welcome some suggestions.

Does FLR reset work for any of you with vga cards?

2013/9/26 Gordan Bobic <gordan@bobich.net>
> On 09/26/2013 07:41 PM, Matthias wrote:
>
>> Hi,
>>
>> thanks for your answers, the cards are a AMD HD 5750 and a HD 5400, both
>> with dual functions (due to audio capabilities), both co-assigned to
>> their respective domU and both not capable of FLR from lspci -vvv output.
>>
>> also, @Ross, I'm running a 3.8.2 Kernel, so this should be fine, but I
>> assume that the 'official' command where xl asks the dom0 about the
>> reset do not work (if I have understand david correctly) since it's dual
>> function so no dual bus reset is actually executed causing the
>> misbehaviour, and on the other side xm doing a bus reset so it works in
>> this specific case.
>>
>> I'm currently recompiling the kernel to see if your patch works David.
>>
>> Also, just to understand it better, is the secondary bus reset the thing
>> which you can manually invoke via /sys/bus/pci/devices/.../reset ?
>>
>> So as a workaround, would the following work in principle?
>>
>> xl pci-assignable-remove 0X:00.0
>> xl pci-assignable-remove 0X:00.1
>> echo "1" > /sys/bus/pci/devices/0X:00.0/reset
>> echo "1" > /sys/bus/pci/devices/0X:00.1/reset
>>
>
> This bit is up to the driver to implement. Since pciback is a placeholder
> rather than a driver that knows about the hardware the reset node won't be
> there.
>
> You could try to do something with setpci to force the registers between
> D0 and D3 power states in a vague hope that might do something, but I doubt
> it.
>
> The reason nvidia cards work OK is because the domU driver knows how to
> reinitialize the hardware and acts accordingly. If the manufacturer won't
> implement a standard function to reset the hardware, then it is up to their
> drivers to handle the situation.
>
> As a workaround, if (on Windows domUs) ejecting the card before
> shutdown/reboot of domU works, you could probably write some powershell
> magic that does that on shutdown/reboot as a reasonable workaround.
>
> Gordan

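For comparison with the dd one-liner above: the pciback patch posted earlier in the thread reads the bridge control register, sets only the bus reset bit, waits, clears the bit again, waits, and restores the saved config state of every function on the bus. A rough userspace sketch of the first two parts is shown below. It is an illustration only: the function names are hypothetical, it assumes setpci knows the BRIDGE_CONTROL register name, and it deliberately omits the save/restore of the devices behind the bridge, so those devices will lose their config state.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. Running the reset
# helper on a live system can hang dom0.

PCI_BRIDGE_CTL_BUS_RESET=0x40   # Secondary Bus Reset bit in Bridge Control (offset 0x3e)

# bridge_ctl_assert CUR / bridge_ctl_deassert CUR: print the 16-bit
# Bridge Control value (hex) with only the bus reset bit set or cleared.
bridge_ctl_assert() {
    printf '%04x\n' $(( $1 | $PCI_BRIDGE_CTL_BUS_RESET ))
}
bridge_ctl_deassert() {
    printf '%04x\n' $(( $1 & ~$PCI_BRIDGE_CTL_BUS_RESET ))
}

# secondary_bus_reset BRIDGE_BDF: read-modify-write the bridge control
# register of the bridge above the device, holding reset for a while.
# The patch sleeps 200 ms per step; plain sh only has whole seconds.
secondary_bus_reset() {
    bridge=$1
    cur=$(setpci -s "$bridge" BRIDGE_CONTROL) || return 1
    setpci -s "$bridge" BRIDGE_CONTROL="$(bridge_ctl_assert "0x$cur")" || return 1
    sleep 1
    setpci -s "$bridge" BRIDGE_CONTROL="$(bridge_ctl_deassert "0x$cur")" || return 1
    sleep 1
}

# Possible use (placeholder BDF of the bridge, not of the GPU itself):
#   secondary_bus_reset 0000:00:0b.0
```

Unlike a single-byte write of 0x40, this keeps the other control bits as they were and de-asserts the reset afterwards.
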
On Fri, 27 Sep 2013 14:26:31 +0200, Matthias <matthias.kannenberg@googlemail.com> wrote:
> Hi Gordon,
>
> I tried your patch on my dom0 kernel and I think it somehow helped in
> the sense that now I can reboot the domUs now without crashing the
> whole host, but linux domU still gets a blackscreen and windows7 domU
> only starts till black screen with (actual movable) cursor, but not
> furthor.. this might only be a coincidence, though, have to double
> check this..

What patch? Nothing I posted to the list is fit for public consumption yet. You shouldn't be using it unless you really, REALLY know exactly what it does and know exactly what you are trying to achieve.

> I tried some other stuff, too:
>
> 1) after domU shutdown rebind both functions to the dom0 drivers, do a
> sysfs reset and re-add to assignable devices -> crashes dom0

My experience shows that letting dom0 drivers ever touch the hardware is a recipe for disaster.

> 2) after domU shutdown rebind both functions to the dom0 drivers and
> readd to assignable devices -> dom0 crashes somtime when domU using
> the devices comes up, sometimes not, but no success either way
> 3) sysfs reset of the devices within domU seems to be passed through
> dom0 (see commands in qemu-log) but no effect

It's up to the drivers to do the sensible thing. Nvidia drivers handle this a little more sanely, but if the drivers cannot handle clobbering the device's state into a known state, you are pretty much fighting a losing battle.

> Also, I analysed your code and compared it to the stuff in the python
> tools of xm and it is the same approach and i don't see any obvious
> differences..

I am starting to suspect you aren't actually talking about my code but somebody else's...

> Then I tried to replicate the secondary bus reset on
> command lind for testing purposes via
>
> printf '\x40' | dd of=/sys/devices/pci0000\:00/0000\:00\:0b.0/config bs=1 seek=$((0x3e)) count=1 conv=notrunc
>
> but I think I got some endians or offset slightly wrong because after
> that xl refuses to give the device (00:0b.0 is the bus of my
> 2-function vga card I have assigned to my domU) to the domU and later
> crashes dom0.
>
> So I'm a little lost at that point and would welcome some
> suggestions.
>
> Does FLR reset works for any of you for vga cards?

If you are talking about VGA cards with _proper_ FLR implementations on PCI level - there is no such thing. In all cases it is down to the domU driver to handle the card in whatever state it is. This works reasonably well with supported Nvidia cards (i.e. Quadro [K][2456]000 and Grid K[12] and equivalent modified GeForce cards (Fermi 4xx and Kepler 6xx/7xx series)). I never managed to get it working properly on any other GPUs.

Even with Nvidia cards rebooting can lead to issues. For example, I have two GPUs passed to two different domUs. One is a GTX470 modified to Q5000. The other is a GTX480 modified to Q6000. The domU with the Q5000 always handled reboots reasonably reliably. The one with the Q6000 did not. I have since switched the one with the Q6000 to a QK5000 (modified GTX680), and now the reboots seem to work reasonably reliably, but I have found that there is still a crash if the monitor on the card changes between shutdown and restart - I'm guessing the card remembers its state, and if it isn't consistent when it returns, the driver gets confused. I have other issues (see recent thread about Nvidia passthrough from David), but they seem to be specific to my setup.

It's not perfect, but it's the only workable solution I have found.

Gordan

On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> kernel I found which doesn't give me this issue:
> http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html

So v3.12 (or rather the latest and greatest from Linus) has the mechanism for the NMI - so you can actually see what is causing the stall.

>
> So I would assume that the kernel should be new enough to handle that. On
> the other hand, as far as I understand the whole process, the kernel itself
> will only deal with the vga card if it is actually bind to the dom0 / to
> it's driver which it is not. Is there any way to test either if the
> ask-command from xl is really executed on dom0 or to test this command
> manually?
>
> Btw: Hardware is a Radeon HD 5750 and a Radeon HD 5400..

On Fri, Sep 27, 2013 at 02:27:46PM +0100, Gordan Bobic wrote:
> On Fri, 27 Sep 2013 14:26:31 +0200, Matthias
> <matthias.kannenberg@googlemail.com> wrote:
> > Hi Gordon,
> >
> > I tried your patch on my dom0 kernel and I think it somehow helped in
> > the sense that I can now reboot the domUs without crashing the
> > whole host, but the Linux domU still gets a black screen and the Windows 7
> > domU only boots to a black screen with a (movable) cursor, but no
> > further. This might only be a coincidence, though; I have to double
> > check this.
>
> What patch? Nothing I posted to the list is fit for public
> consumption yet. You shouldn't be using it unless you really,
> REALLY know exactly what it does and know exactly what you
> are trying to achieve.
>
> > I tried some other stuff, too:
> >
> > 1) after domU shutdown, rebind both functions to the dom0 drivers,
> > do a sysfs reset and re-add to assignable devices -> crashes dom0
>
> My experience shows that letting dom0 drivers ever touch the hardware
> is a recipe for disaster.
>
> > 2) after domU shutdown, rebind both functions to the dom0 drivers and
> > re-add to assignable devices -> dom0 sometimes crashes when the domU using
> > the devices comes up, sometimes not, but no success either way
> >
> > 3) sysfs reset of the devices within the domU seems to be passed through
> > dom0 (see commands in qemu-log) but has no effect
>
> It's up to the drivers to do the sensible thing. Nvidia drivers
> handle this a little more sanely, but if the drivers cannot handle
> clobbering the device's state into a known state, you are pretty
> much fighting a losing battle.
>
> > Also, I analysed your code and compared it to the stuff in the Python
> > tools of xm and it is the same approach and I don't see any obvious
> > differences.
>
> I am starting to suspect you aren't actually talking about my code
> but somebody else's...
>
> > Then I tried to replicate the secondary bus reset on the
> > command line for testing purposes via
> >
> >   printf '\x40' | dd of=/sys/devices/pci0000:00/0000:00:0b.0/config bs=1 seek=$((0x3e)) count=1 conv=notrunc
> >
> > but I think I got the endianness or the offset slightly wrong, because
> > after that xl refuses to give the device (00:0b.0 is the bus of my
> > 2-function VGA card I have assigned to my domU) to the domU and later
> > crashes dom0.
> >
> > So I'm a little lost at this point and would welcome some
> > suggestions.
> >
> > Does FLR reset work for any of you for VGA cards?
>
> If you are talking about VGA cards with _proper_ FLR implementations
> on PCI level, there is no such thing. In all cases it is down to
> the domU driver to handle the card in whatever state it is. This
> works reasonably well with supported Nvidia cards (i.e.
> Quadro [K][2456]000 and Grid K[12] and equivalent modified GeForce
> cards (Fermi 4xx and Kepler 6xx/7xx series)). I never managed to
> get it working properly on any other GPUs.
>
> Even with Nvidia cards rebooting can lead to issues. For example,
> I have two GPUs passed to two different domUs. One is a GTX470
> modified to Q5000. The other is a GTX480 modified to Q6000. The
> domU with the Q5000 always handled reboots reasonably reliably. The
> one with the Q6000 did not. I have since switched the one with the Q6000
> to a QK5000 (modified GTX680), and now the reboots seem to work
> reasonably reliably, but I have found that there is still a
> crash if the monitor on the card changes between shutdown and
> restart. I'm guessing the card remembers its state and if it
> isn't consistent when it returns, the driver gets confused. I have
> other issues (see recent thread about Nvidia passthrough from
> David), but they seem to be specific to my setup.

This state thing. If one were to capture the card's state before doing any PCI passthrough and tried to write it back exactly, would that eliminate some of these issues?

I know that pciback does that for the PCI configuration values (or at least it should) whenever a device has been de-assigned from a guest or unplugged.

But I presume that the rest (the BAR contents) is not saved/restored in any way. What would be the worst that could happen if one wrote all of the MMIO values back exactly as they were?

(Probably a recipe for disaster, but who knows.)

> It's not perfect, but it's the only workable solution I have
> found.
>
> Gordan
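For reference, the bit the quoted dd command is aiming at is Secondary Bus Reset, bit 6 (0x40) of the 16-bit, little-endian Bridge Control register at offset 0x3e in the config space of the bridge above the card. Writing a bare 0x40 byte there overwrites the other bits in that byte and, since nothing clears the bit again, leaves the secondary bus held in reset, which may be part of why the device could not be assigned afterwards. The following is only a sketch of a read-modify-write pulse, not what xl or pciback does: the function name, the sysfs path in the docstring and the two delays are assumptions chosen for illustration, and the endpoint's config space would still need saving and restoring around it. Running this against a real bridge can hang the host.

```python
import struct
import time

BRIDGE_CONTROL = 0x3E       # 16-bit Bridge Control register (type 1 header)
SECONDARY_BUS_RESET = 0x40  # bit 6 of Bridge Control


def pulse_secondary_bus_reset(config_path, hold_s=0.002, settle_s=1.0):
    """Assert and then release Secondary Bus Reset on a PCI bridge.

    config_path is the bridge's sysfs config file (hypothetical example:
    /sys/bus/pci/devices/0000:00:0b.0/config). hold_s and settle_s are
    assumed delays, not values taken from the thread.
    Returns the Bridge Control value that was read before the pulse.
    """
    with open(config_path, "r+b", buffering=0) as cfg:
        cfg.seek(BRIDGE_CONTROL)
        (original,) = struct.unpack("<H", cfg.read(2))

        # Set only the reset bit; keep every other bit as it was.
        cfg.seek(BRIDGE_CONTROL)
        cfg.write(struct.pack("<H", original | SECONDARY_BUS_RESET))
        time.sleep(hold_s)

        # Release the reset again, otherwise the bus stays held in reset.
        cfg.seek(BRIDGE_CONTROL)
        cfg.write(struct.pack("<H", original & ~SECONDARY_BUS_RESET))
        time.sleep(settle_s)
    return original
```

The same function can be pointed at an ordinary file holding a config-space dump, which is a safe way to check the offset and byte order before touching hardware.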
On Fri, 27 Sep 2013 09:48:34 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> This state thing. If one were to capture the card's state before
> doing any PCI passthrough and tried to write it back exactly,
> would that eliminate some of these issues?
>
> I know that pciback does that for the PCI configuration values
> (or at least it should) whenever a device has been de-assigned
> from a guest or unplugged.
>
> But I presume that the rest (the BAR contents) is not saved/restored
> in any way. What would be the worst that could happen if one wrote
> all of the MMIO values back exactly as they were?
>
> (Probably a recipe for disaster, but who knows.)

That doesn't cover the entire state of the device. What about the rest of the device memory and the states of all the proprietary registers?

Since there are open source FB and accelerated drivers available for Radeon cards, enough is publicly known about them to be able to achieve suitable resetting. How difficult that might be to achieve, I have no idea. I have seen the open source Radeon Xorg driver successfully reset the GPU when the GPU stopped responding, without taking down Xorg or any of the running apps in the process, so something similar to what it does might just be good enough.

Whether it is a good idea to adopt anything but a fully hands-off approach to any passthrough hardware is a different question entirely.

Gordan
Hi Konrad,

good call! I was able to reproduce the error with the 3.12-rc2 kernel and got a lot of information from the new NMI traces (log attached), but since I'm not a Xen hacker I don't really know how to continue from here. So I might add this to the original post and maybe someone can help me. After all, the error has persisted for half a year now, and besides 2 kernel version / .config combinations (a 3.8.2 and a 3.6.something) I could never trace this issue back (even with bisecting the .config, because at some point it seemed random).

2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > kernel I found which doesn't give me this issue:
> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>
> So v3.12 (or rather the latest and greatest of Linus's tree) has the
> mechanism for the NMI, so you can actually see what is causing the stall.
Hi Matthias,

Have you tried adding "no-cpuidle" to the Xen hypervisor command line in grub?

--
Sander

Friday, September 27, 2013, 7:07:33 PM, you wrote:
> Hi Konrad,
>
> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> a lot of information with the new NMI traces (log attached), but since I'm
> not a Xen hacker I don't really know how to continue from here. So I might
> add this to the original post and maybe someone can help me. After all, the
> error has persisted for half a year now, and besides 2 kernel version / .config
> combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> back (even with bisecting the .config, because at some point it seemed
> random).
Konrad Rzeszutek Wilk
2013-Sep-27 17:53 UTC
Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> Hi Konrad,
>
> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> a lot of information with the new NMI traces (log attached), but since I'm
> not a Xen hacker I don't really know how to continue from here. So I might
> add this to the original post and maybe someone can help me. After all, the
> error has persisted for half a year now, and besides 2 kernel version / .config
> combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> back (even with bisecting the .config, because at some point it seemed
> random).

Can you tell me a bit about how this happens? Is it happening after you boot the machine? Does it happen after a specific workload?

It looks like something in the RCU is taking far too long and the RCU callback mechanism starts complaining. CPU0 is where the RCU mechanism detects that something is off and starts sending NMIs to all CPUs. CPU2 is the only one that looks to be doing RCU callback:

NMI backtrace for cpu 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
RIP: e030:[<ffffffff8125b2b2>] [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
RSP: e02b:ffff88007de439f0 EFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
FS: 00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
Stack:
 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
 ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
Call Trace:
 <IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
 [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
 [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
 [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
 [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
 [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
 [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
 [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
 [<ffffffff813db9c8>] ? printk+0x4f/0x51
 [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598 <=================
 [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
 [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
 [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
 [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
 [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
 [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
 [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
 [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
 [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
 [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
 [<ffffffff8103df22>] ? check_events+0x12/0x20
 [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
 [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
 [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
 [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
 [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
 [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
 [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
 [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
 <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
 [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
 [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3

Which looks to be printing something on the VT console (which is running in KMS mode, as it uses framebuffer calls). So is there something on the screen scrolling wildly in a loop?

But then there are also complaints about the NMI handler itself taking too long:

INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs

I am wondering if there is some time issue on your box.

What version of Xen do you have?
Hi Sander,

thanks for the advice; I actually get no RCU stalls when I use the no-cpuidle option. Do you have a little more insight into what is actually causing this behaviour, and is there a better solution than this option? I don't want to sacrifice my C-states (I would assume this makes the overall server more power hungry?).

Does this have something to do with the new tickless-kernel options in the newer kernels, or is this really only an ACPI incompatibility with Xen?

Thanks!

2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>
> Hi Matthias,
>
> Have you tried adding "no-cpuidle" on the xen/hypervisor commandline in
> grub ?
>
> --
> Sander
Friday, September 27, 2013, 9:19:14 PM, you wrote:
> Hi Sander,
>
> thanks for the advice, I have actually no rcu stalls when I use the
> no-cpuidle function. Do you have a little more insight on what is actually
> causing this behaviour and if there is a better solution than this option,
> because I don't want to sacrifice my C-states (I would assume this makes the
> overall server more power hungry?).
>
> Does this have something to do with the new tickless-kernel options in the
> newer kernel, or is this really only an ACPI incompatibility with Xen?
>
> Thanks!

Are you running xen-unstable? Some patches went in lately.

You also seem to have a motherboard with an AMD 890FX chipset; I suspect your BIOS also has issues around the HPET, as mine had. I was also seeing RCU stalls on boot (and only on boot). Hitting any key on the console when it appeared to stall during boot made it continue in my case (this happened several times). It took a while to find the problems; Jan Beulich has made and committed some patches that went into xen-unstable recently.

Are you running xen-unstable? If not, could you give it a try and provide the xl dmesg / serial log?

--
Sander
Yes, I'm running the most recent xen-unstable staging tree, but I have had these issues at least since February with xen-unstable, so I don't suspect recent changes to be the issue in my case.

I will do some testing with switching from tickless-idle to non-tickless and, since you mentioned HPET issues, maybe with changing the clocksource; we will see what happens.

2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>
> Are you running xen-unstable ?
> Some patches went in lately
>
> You also seem to have a motherboard with an AMD 890fx chipset, I suspect
> your BIOS also has issues around the HPET as mine had.
> I was also seeing RCU stalls on boot (and only on boot) .. hitting any
> key on the console when it appears to stall during boot made it continue in
> my case (happens several times).
> Took a while to find the problems, Jan Beulich has made and committed some
> patches that went in xen-unstable recently.
>
> Are you running xen-unstable ?
> If not, could you give it a try and provide the xl dmesg / serial log ?
>
> --
> Sander
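As a side note on the clocksource experiment mentioned above: Linux exposes the active and the selectable clocksources through sysfs, so it is easy to record what dom0 is using before and after a change. The sketch below only reads those two files; the function name and the `base` parameter are made up for illustration, and which names appear (for example whether `xen` or `hpet` is listed) depends on the kernel and is not something the thread states.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_info(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by sysfs.

    base is overridable (an assumption of this sketch) so the function
    can be exercised against a directory of plain files.
    """
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling `clocksource_info()` with no argument on dom0 reports the live values. Switching is then a matter of writing one of the available names to `current_clocksource` as root, or of setting `clocksource=` on the kernel command line.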
Friday, September 27, 2013, 9:48:39 PM, you wrote:
> Yes, running the most recent xen-unstable-staging tree, but I have these
> issues at least since February with xen-unstable, so I don't suspect recent
> changes to be the issue in my case.
>
> I will do some testing with switching from tickless-idle to non-tickless and
> after you mentioned HPET issues maybe changing the clocksource, will see
> what happens..

I'm now running with tickless-idle, so I suspect it will make no difference.

So I think your best shot is to try to make it boot by pressing a key on the keyboard when it doesn't make progress during boot (see if that works) and, if it does, to provide the output of "xl dmesg".

(BTW, there were 2 separate issues; see threads:
http://lists.xen.org/archives/html/xen-devel/2013-03/msg01796.html
http://lists.xen.org/archives/html/xen-devel/2013-08/msg00201.html )
Hi David,

with your patch as inspiration, I did various tests in the past days but didn't manage to reset my VGA card the right way.

With your patch, and later mine, the secondary bus reset is executed, but after that I can't boot the VM because I get a 'device model is not ready' / 'refused to pass the pci device' error. I also tried not resetting the secondary function of the VGA card once a secondary bus reset had already been executed for the first function's reset, but with the same result.

Do you have any idea whether I am missing anything? I tried it both with and without saving/restoring the configuration, but it seems xenstore can't handle the VGA card after the parent bus reset.

Something else that is odd is that my VGA card does in fact have a sysfs reset file (both functions have a separate one), but neither the normal reset nor doing it by hand makes any difference. I think it is not executed at all, because when I commented out the reset completely, the VM showed the same behaviour on the second boot as with a normal reset.

BTW: I get the same result when I do a D0->D3 transition via the kernel. FLR and AF FLR do not work anyway because the card lacks the capability.

So I compared the xen-pciback reset method with both the pci/pci.c method and what was done in python/xen/util/pci.py, and the actions are basically the same (the quirks in pci.py are only for some Nvidia and integrated VGA devices), and I don't see what I am missing. Can it be that after the parent bus reset the VGA card somehow loses its entry in xenstore or something?

Can you elaborate a bit more on what hardware you have and whether your patch works fine for you? I'm currently testing with an AMD HD5400.

Thanks in advance!
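Whether a function advertises FLR at all can be checked from its config space, which is the same information lspci prints as FLReset+ or FLReset-. PCIe FLR is bit 28 of the Device Capabilities register in the PCI Express capability (ID 0x10), and the Advanced Features (AF) variant is bit 1 of the AF Capabilities byte in the AF capability (ID 0x13). The following is a minimal sketch of that lookup over a raw config-space dump; the function name is made up, the constant names merely follow the style of the Linux headers, and it only looks at the standard capability list.

```python
import struct

PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10     # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34     # pointer to the first capability
PCI_CAP_ID_EXP = 0x10          # PCI Express capability
PCI_CAP_ID_AF = 0x13           # Advanced Features capability
PCI_EXP_DEVCAP_FLR = 1 << 28   # Device Capabilities: FLR supported
PCI_AF_CAP_FLR = 0x02          # AF Capabilities: FLR supported


def flr_support(config):
    """Return (pcie_flr, af_flr) for a bytes dump of PCI config space."""
    pcie_flr = af_flr = False
    if len(config) < 64:
        return pcie_flr, af_flr
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return pcie_flr, af_flr

    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping capability chain
    while pos >= 0x40 and pos + 8 <= len(config) and pos not in seen:
        seen.add(pos)
        cap_id, next_pos = config[pos], config[pos + 1]
        if cap_id == PCI_CAP_ID_EXP:
            # Device Capabilities is the 32-bit register at cap + 4.
            (devcap,) = struct.unpack_from("<I", config, pos + 4)
            pcie_flr = bool(devcap & PCI_EXP_DEVCAP_FLR)
        elif cap_id == PCI_CAP_ID_AF:
            # AF Capabilities is the byte at cap + 3.
            af_flr = bool(config[pos + 3] & PCI_AF_CAP_FLR)
        pos = next_pos & 0xFC
    return pcie_flr, af_flr
```

The input would be the contents of a function's sysfs `config` file read as root; an unprivileged read only returns the first 64 bytes, in which case this sketch reports `(False, False)` because it cannot walk the capability list.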
Matthias
2013-Oct-03 22:34 UTC
Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
Hi Konrad, sorry I missed your entry, google mail might not be the best software to view mailing lists ;) The RCU stall happens roughly 2 minutes after the machine is fully booted, and I''m usually working via SSH by then.. I basically have two cases where the stall happens: 1) Without the no-cpuidle function, It happens when I start xencommons 2) With or without no-cpuidle, this happens sometimes and arbitrary and I have the feeling that logging in via SSH (or network traffic in general?) will increase the chance of the rcu stall and (and this is only a guess) in most cases this actually happens when I enter a command of more then 16 chars in the ssh command prompt. (I don''t really think that this is really causing the issue, I just noticed that when entering the usual commands to start all the xen stuff / boot the domUs, it stalls mostly on the same commands / when ssh freezes I came to the same part of the command). But more ssh-intensive commands like ''dmesg'' or ''htop'' don''t cause it.. Also, I can''t really say what is on the screen because my dom0 does not have a vga card / both vga cards in the server are passed to different domUs and when I don''t hide the vga cards on boot via xen-pciback.hide, the rcu usually does not stall and everything is fine.. 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>> On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote: > > Hi Konrad, > > > > good call! I was able to reproduce the error with the 3.12-rc2 kernel, > got > > a lot of information with the new NMI traces (log attached), but since > I''m > > not a xen hacker I don''t really know how to continue from here. So I > might > > add this to the original post and maybe someone can help me. After all > the > > error persists for half a year now and besides 2 kernel version / .config > > Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue > > back (even with bisecting the .config because at some point it seemed > > random). 
>
> Can you tell me a bit on how this happens? Is it happening after you
> boot the machine? Does it happen after a specific workload?
>
> It looks like something in the RCU is taking far too long and
> the RCU callback mechanism starts complaining. The CPU0 is when the
> RCU mechanism detects that something is off and starts sending NMI to
> all CPUs. CPU2 is the only one that looks to be doing RCU callback:
>
> NMI backtrace for cpu 1
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
> Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
> task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
> RIP: e030:[<ffffffff8125b2b2>] [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
> RSP: e02b:ffff88007de439f0 EFLAGS: 00000046
> RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
> RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
> RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
> R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
> R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
> FS: 00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
> CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
> Stack:
> 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
> ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
> 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
> Call Trace:
> <IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
> [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
> [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
> [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
> [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
> [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
> [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
> [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
> [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
> [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
> [<ffffffff813db9c8>] ? printk+0x4f/0x51
> [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598 <=================
> [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
> [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
> [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
> [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
> [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
> [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
> [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
> [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
> [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
> [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
> [<ffffffff8103df22>] ? check_events+0x12/0x20
> [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
> [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
> [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
> [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
> [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
> [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
> [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
> [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
> <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
> [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
> [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
> Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
>
> Which looks to be printing something on the VT console (which is running
> in KMS mode as it uses framebuffer calls). So is there something on the
> screen scrolling widly in a loop?
>
> But then there are also complains about
>
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs
>
> this taking too long. I am wondering if there is some time issue
> on your box.
>
> What version of Xen do you have?
>
> > 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >
> > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > > > kernel I found which doesn't give me this issue:
> > > > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> > >
> > > So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
> > > for the NMI - so you can actually see what is causing the stall.
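As a rough aid for reading the NMI backtrace quoted above, here is a minimal Python sketch that pulls the symbol names out of the call-trace frames and flags the ones on the console/framebuffer print path that Konrad points at. It assumes the trace has been saved as plain text. The names `frame_symbols`, `console_frames` and the `CONSOLE_SYMBOLS` tuple are hypothetical and chosen only for illustration; they are not part of any Xen or kernel tooling, and the symbol list is simply taken from the frames visible in this thread.

```python
import re

# Matches frames such as "[<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d"
# and captures the symbol name (which may contain dots, e.g. ".constprop.18").
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Illustrative list only: console/framebuffer functions seen in the trace above.
CONSOLE_SYMBOLS = (
    "cfb_imageblit",
    "bit_putcs",
    "fbcon_putcs",
    "vt_console_print",
    "console_unlock",
)


def frame_symbols(trace):
    """Return the symbol names of all call-trace frames, in order."""
    return FRAME_RE.findall(trace)


def console_frames(trace):
    """Return only the frames that belong to the console print path."""
    return [sym for sym in frame_symbols(trace) if sym in CONSOLE_SYMBOLS]


if __name__ == "__main__":
    # Three frames copied from the backtrace in this thread.
    sample = (
        "<IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d\n"
        "[<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc\n"
        "[<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598\n"
    )
    print(console_frames(sample))
```

If most frames of the stalled CPU land in this list, that is consistent with the CPU spending its time printing to the framebuffer console rather than with a hang elsewhere; it does not prove the cause.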
Pasi Kärkkäinen
2013-Oct-04 06:07 UTC
Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote:
> Hi Konrad,
>
> sorry I missed your entry, google mail might not be the best software to
> view mailing lists ;)
>
> The RCU stall happens roughly 2 minutes after the machine is fully booted,
> and I'm usually working via SSH by then..
>
> I basically have two cases where the stall happens:
>
> 1) Without the no-cpuidle function, It happens when I start xencommons
> 2) With or without no-cpuidle, this happens sometimes and arbitrary and I
> have the feeling that logging in via SSH (or network traffic in general?)
> will increase the chance of the rcu stall and (and this is only a guess)
> in most cases this actually happens when I enter a command of more then 16
> chars in the ssh command prompt. (I don't really think that this is really
> causing the issue, I just noticed that when entering the usual commands to
> start all the xen stuff / boot the domUs, it stalls mostly on the same
> commands / when ssh freezes I came to the same part of the command). But
> more ssh-intensive commands like 'dmesg' or 'htop' don't cause it..
>
> Also, I can't really say what is on the screen because my dom0 does not
> have a vga card / both vga cards in the server are passed to different
> domUs and when I don't hide the vga cards on boot via xen-pciback.hide,
> the rcu usually does not stall and everything is fine..
>

For debugging you should have a serial console, so maybe get a PCI serial card if you don't have any management processor offering SOL (Serial over LAN)?

-- Pasi

> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
> On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> > Hi Konrad,
> >
> > good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> > a lot of information with the new NMI traces (log attached), but since I'm
> > not a xen hacker I don't really know how to continue from here. So I might
> > add this to the original post and maybe someone can help me.