thr3ads.net - Xen devel - [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM [Jan 2009]

If this information is useful, please help other people find it:
Share via:

Qing He

2009-Jan-08 09:06 UTC

[Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

This patchset enables MSI-INTx interrupt translation for HVM
domains. The intention of the patch is to use MSI as the physical
interrupt mechanism for passthrough device as much as possible,
thus reducing the pirq sharing among domains.

When MSI is globally enabled, if the device has the MSI capability
but doesn''t used by the guest, hypervisor will try to user MSI as
the underlying pirq and inject translated INTx irq to guest
vioapic. When guest itself enabled MSI or MSI-X, the translation
is automatically turned off.

Add a config file option to disable/enable this feature. Also, in
order to allow the user to override the option per device, a
per-device option mechanism is implemented and an MSI-INTx option
is added

The patch set contains 6 patches and is organized as below:
    [PATCH 1/6] hypervisor patch
   *[PATCH 2/6] qemu patch
    [PATCH 3/6] add pci options in domain config file
    [PATCH 4/6] XenAPI version of the above patch
    [PATCH 5/6] add global and per-device option for MSI-INTx
    [PATCH 6/6] documentation and example

 * against ioemu-remote

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 1/6] passthrough: MSI-INTx translation for HVM

passthrough: MSI-INTx translation for HVM

The patch adds a new type of pt_irq: PT_IRQ_TYPE_MSI_TRANSLATE.
With it, guest dpci irq can now use msi as the underlying pirq
while injected as INTx irq.

Signed-off-by: Qing He <qing.he@intel.com>
---

diff -r e2f36d066b7b -r d11b806ed94f tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/libxc/xc_domain.c	Sun Jan 04 17:41:53 2009 +0800
@@ -888,7 +888,8 @@
     bind->hvm_domid = domid;
     bind->irq_type = irq_type;
     bind->machine_irq = machine_irq;
-    if ( irq_type == PT_IRQ_TYPE_PCI )
+    if ( irq_type == PT_IRQ_TYPE_PCI ||
+         irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
     {
         bind->u.pci.bus = bus;
         bind->u.pci.device = device;    
diff -r e2f36d066b7b -r d11b806ed94f xen/arch/x86/hvm/vmsi.c
--- a/xen/arch/x86/hvm/vmsi.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/arch/x86/hvm/vmsi.c	Sun Jan 04 17:41:53 2009 +0800
@@ -134,7 +134,7 @@
                 "vector=%x trig_mode=%x\n",
                 dest, dest_mode, delivery_mode, vector, trig_mode);
 
-    if ( !test_bit(_HVM_IRQ_DPCI_MSI, &hvm_irq_dpci->mirq[pirq].flags) )
+    if ( !( hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_GUEST_MSI ) )
     {
         gdprintk(XENLOG_WARNING, "pirq %x not msi \n", pirq);
         return 0;
diff -r e2f36d066b7b -r d11b806ed94f xen/drivers/passthrough/io.c
--- a/xen/drivers/passthrough/io.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/drivers/passthrough/io.c	Sun Jan 04 17:41:53 2009 +0800
@@ -23,6 +23,11 @@
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <xen/hvm/irq.h>
+
+static int pt_irq_need_timer(uint32_t flags)
+{
+    return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
+}
 
 static void pt_irq_time_out(void *data)
 {
@@ -93,7 +98,8 @@
 
         if ( !test_and_set_bit(pirq, hvm_irq_dpci->mapping))
         {
-            set_bit(_HVM_IRQ_DPCI_MSI, &hvm_irq_dpci->mirq[pirq].flags);
+            hvm_irq_dpci->mirq[pirq].flags = HVM_IRQ_DPCI_MACH_MSI |
+                                             HVM_IRQ_DPCI_GUEST_MSI;
             hvm_irq_dpci->mirq[pirq].gmsi.gvec = pt_irq_bind->u.msi.gvec;
             hvm_irq_dpci->mirq[pirq].gmsi.gflags =
pt_irq_bind->u.msi.gflags;
             hvm_irq_dpci->msi_gvec_pirq[pt_irq_bind->u.msi.gvec] = pirq;
@@ -104,7 +110,7 @@
                 hvm_irq_dpci->msi_gvec_pirq[pt_irq_bind->u.msi.gvec] = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gflags = 0;
                 hvm_irq_dpci->mirq[pirq].gmsi.gvec = 0;
-                clear_bit(_HVM_IRQ_DPCI_MSI,
&hvm_irq_dpci->mirq[pirq].flags);
+                hvm_irq_dpci->mirq[pirq].flags = 0;
                 clear_bit(pirq, hvm_irq_dpci->mapping);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -150,17 +156,33 @@
         if ( !test_and_set_bit(machine_gsi, hvm_irq_dpci->mapping))
         {
             unsigned int vector = domain_irq_to_vector(d, machine_gsi);
+            unsigned int share;
 
             hvm_irq_dpci->mirq[machine_gsi].dom = d;
+            if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
+            {
+                hvm_irq_dpci->mirq[machine_gsi].flags =
HVM_IRQ_DPCI_MACH_MSI |
+                                                        HVM_IRQ_DPCI_GUEST_PCI
|
+                                                        HVM_IRQ_DPCI_TRANSLATE;
+                share = 0;
+            }
+            else    /* PT_IRQ_TYPE_PCI */
+            {
+                hvm_irq_dpci->mirq[machine_gsi].flags =
HVM_IRQ_DPCI_MACH_PCI |
+                                                        HVM_IRQ_DPCI_GUEST_PCI;
+                share = BIND_PIRQ__WILL_SHARE;
+            }
 
             /* Init timer before binding */
-            init_timer(&hvm_irq_dpci->hvm_timer[vector],
-                       pt_irq_time_out,
&hvm_irq_dpci->mirq[machine_gsi], 0);
+            if ( pt_irq_need_timer(hvm_irq_dpci->mirq[machine_gsi].flags) )
+                init_timer(&hvm_irq_dpci->hvm_timer[vector],
+                           pt_irq_time_out,
&hvm_irq_dpci->mirq[machine_gsi], 0);
             /* Deal with gsi for legacy devices */
-            rc = pirq_guest_bind(d->vcpu[0], machine_gsi,
BIND_PIRQ__WILL_SHARE);
+            rc = pirq_guest_bind(d->vcpu[0], machine_gsi, share);
             if ( unlikely(rc) )
             {
-                kill_timer(&hvm_irq_dpci->hvm_timer[vector]);
+                if (
pt_irq_need_timer(hvm_irq_dpci->mirq[machine_gsi].flags) )
+                    kill_timer(&hvm_irq_dpci->hvm_timer[vector]);
                 hvm_irq_dpci->mirq[machine_gsi].dom = NULL;
                 clear_bit(machine_gsi, hvm_irq_dpci->mapping);
                 hvm_irq_dpci->girq[guest_gsi].machine_gsi = 0;
@@ -237,7 +259,8 @@
         if ( list_empty(&hvm_irq_dpci->mirq[machine_gsi].digl_list) )
         {
             pirq_guest_unbind(d, machine_gsi);
-            kill_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
machine_gsi)]);
+            if ( pt_irq_need_timer(hvm_irq_dpci->mirq[machine_gsi].flags) )
+               
kill_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
machine_gsi)]);
             hvm_irq_dpci->mirq[machine_gsi].dom   = NULL;
             hvm_irq_dpci->mirq[machine_gsi].flags = 0;
             clear_bit(machine_gsi, hvm_irq_dpci->mapping);
@@ -267,7 +290,7 @@
      * PIC) and we need to detect that.
      */
     set_bit(mirq, dpci->dirq_mask);
-    if ( !test_bit(_HVM_IRQ_DPCI_MSI, &dpci->mirq[mirq].flags) )
+    if ( pt_irq_need_timer(dpci->mirq[mirq].flags) )
         set_timer(&dpci->hvm_timer[domain_irq_to_vector(d, mirq)],
                   NOW() + PT_IRQ_TIME_OUT);
     vcpu_kick(d->vcpu[0]);
@@ -276,35 +299,40 @@
 }
 
 #ifdef SUPPORT_MSI_REMAPPING
+/* called with d->event_lock held */
+static void __msi_pirq_eoi(struct domain *d, int pirq)
+{
+    struct hvm_irq_dpci *hvm_irq_dpci = d->arch.hvm_domain.irq.dpci;
+    irq_desc_t *desc;
+
+    if ( ( pirq >= 0 ) && ( pirq < NR_IRQS ) &&
+         test_bit(pirq, hvm_irq_dpci->mapping) &&
+         ( hvm_irq_dpci->mirq[pirq].flags & HVM_IRQ_DPCI_MACH_MSI) )
+    {
+         BUG_ON(!local_irq_is_enabled());
+         desc = domain_spin_lock_irq_desc(d, pirq, NULL);
+         if ( !desc )
+            return;
+
+         desc->status &= ~IRQ_INPROGRESS;
+         spin_unlock_irq(&desc->lock);
+
+         pirq_guest_eoi(d, pirq);
+    }
+}
+
 void hvm_dpci_msi_eoi(struct domain *d, int vector)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = d->arch.hvm_domain.irq.dpci;
-    irq_desc_t *desc;
     int pirq;
 
     if ( !iommu_enabled || (hvm_irq_dpci == NULL) )
        return;
 
     spin_lock(&d->event_lock);
+
     pirq = hvm_irq_dpci->msi_gvec_pirq[vector];
-
-    if ( ( pirq >= 0 ) && (pirq < NR_IRQS) &&
-          test_bit(pirq, hvm_irq_dpci->mapping) &&
-         (test_bit(_HVM_IRQ_DPCI_MSI, &hvm_irq_dpci->mirq[pirq].flags)))
-     {
-         BUG_ON(!local_irq_is_enabled());
-         desc = domain_spin_lock_irq_desc(d, pirq, NULL);
-         if (!desc)
-         {
-            spin_unlock(&d->event_lock);
-            return;
-         }
-
-         desc->status &= ~IRQ_INPROGRESS;
-         spin_unlock_irq(&desc->lock);
-
-         pirq_guest_eoi(d, pirq);
-     }
+    __msi_pirq_eoi(d, pirq);
 
     spin_unlock(&d->event_lock);
 }
@@ -336,14 +364,15 @@
 
         spin_lock(&d->event_lock);
 #ifdef SUPPORT_MSI_REMAPPING
-        if ( test_bit(_HVM_IRQ_DPCI_MSI, &hvm_irq_dpci->mirq[irq].flags)
)
+        if ( hvm_irq_dpci->mirq[irq].flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             hvm_pci_msi_assert(d, irq);
             spin_unlock(&d->event_lock);
             continue;
         }
 #endif
-        stop_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
irq)]);
+        if ( pt_irq_need_timer(hvm_irq_dpci->mirq[irq].flags) )
+            stop_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
irq)]);
 
         list_for_each_entry ( digl, &hvm_irq_dpci->mirq[irq].digl_list,
list )
         {
@@ -351,6 +380,14 @@
             intx = digl->intx;
             hvm_pci_intx_assert(d, device, intx);
             hvm_irq_dpci->mirq[irq].pending++;
+
+#ifdef SUPPORT_MSI_REMAPPING
+            if ( hvm_irq_dpci->mirq[irq].flags & HVM_IRQ_DPCI_TRANSLATE
)
+            {
+                /* for translated MSI to INTx interrupt, eoi as early as
possible */
+                __msi_pirq_eoi(d, irq);
+            }
+#endif
         }
 
         /*
@@ -360,8 +397,9 @@
          * guest will never deal with the irq, then the physical interrupt line
          * will never be deasserted.
          */
-        set_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
irq)],
-                  NOW() + PT_IRQ_TIME_OUT);
+        if ( pt_irq_need_timer(hvm_irq_dpci->mirq[irq].flags) )
+            set_timer(&hvm_irq_dpci->hvm_timer[domain_irq_to_vector(d,
irq)],
+                      NOW() + PT_IRQ_TIME_OUT);
         spin_unlock(&d->event_lock);
     }
 }
@@ -405,9 +443,12 @@
              * No need to get vector lock for timer
              * since interrupt is still not EOIed
              */
-            stop_timer(&hvm_irq_dpci->hvm_timer[
-                domain_irq_to_vector(d, machine_gsi)]);
-            pirq_guest_eoi(d, machine_gsi);
+            if ( pt_irq_need_timer(hvm_irq_dpci->mirq[machine_gsi].flags) )
+            {
+                stop_timer(&hvm_irq_dpci->hvm_timer[
+                    domain_irq_to_vector(d, machine_gsi)]);
+                pirq_guest_eoi(d, machine_gsi);
+            }
         }
     }
     spin_unlock(&d->event_lock);
diff -r e2f36d066b7b -r d11b806ed94f xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/include/public/domctl.h	Sun Jan 04 17:41:53 2009 +0800
@@ -466,6 +466,7 @@
     PT_IRQ_TYPE_PCI,
     PT_IRQ_TYPE_ISA,
     PT_IRQ_TYPE_MSI,
+    PT_IRQ_TYPE_MSI_TRANSLATE,
 } pt_irq_type_t;
 struct xen_domctl_bind_pt_irq {
     uint32_t machine_irq;
diff -r e2f36d066b7b -r d11b806ed94f xen/include/xen/hvm/irq.h
--- a/xen/include/xen/hvm/irq.h	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/include/xen/hvm/irq.h	Sun Jan 04 17:41:53 2009 +0800
@@ -35,7 +35,16 @@
     uint8_t link;
 };
 
-#define _HVM_IRQ_DPCI_MSI  0x1
+#define _HVM_IRQ_DPCI_MACH_PCI_SHIFT            0
+#define _HVM_IRQ_DPCI_MACH_MSI_SHIFT            1
+#define _HVM_IRQ_DPCI_GUEST_PCI_SHIFT           4
+#define _HVM_IRQ_DPCI_GUEST_MSI_SHIFT           5
+#define _HVM_IRQ_DPCI_TRANSLATE_SHIFT          15
+#define HVM_IRQ_DPCI_MACH_PCI        (1 << _HVM_IRQ_DPCI_MACH_PCI_SHIFT)
+#define HVM_IRQ_DPCI_MACH_MSI        (1 << _HVM_IRQ_DPCI_MACH_MSI_SHIFT)
+#define HVM_IRQ_DPCI_GUEST_PCI       (1 << _HVM_IRQ_DPCI_GUEST_PCI_SHIFT)
+#define HVM_IRQ_DPCI_GUEST_MSI       (1 << _HVM_IRQ_DPCI_GUEST_MSI_SHIFT)
+#define HVM_IRQ_DPCI_TRANSLATE       (1 << _HVM_IRQ_DPCI_TRANSLATE_SHIFT)
 
 struct hvm_gmsi_info {
     uint32_t gvec;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 2/6] ioemu:passthrough: MSI-INTx interrupt translation support

passthrough: MSI-INTx interrupt translation support
    
This patch enables Xen to use MSI for MSI-capable devices as the
underlying interrupt source even if the guest does not explicitly
use it. The guest will still see an IO-APIC based INTx interrupt
translated by Xen from the MSI irq.

If the guest enables MSI or MSI-X for the passthrough device, this
translation is automatically turned off. It can also be disabled in
the config file at domain startup time.

Signed-off-by: Qing He <qing.he@intel.com>
---

diff --git a/hw/pass-through.c b/hw/pass-through.c
index d280ff6..625e644 100644
--- a/hw/pass-through.c
+++ b/hw/pass-through.c
@@ -34,6 +34,7 @@ struct php_dev {
     uint8_t r_bus;
     uint8_t r_dev;
     uint8_t r_func;
+    char *opt;
 };
 struct dpci_infos {
 
@@ -492,7 +493,7 @@ static struct pt_reg_info_tbl pt_emu_reg_msi_tbl[] = {
         .size       = 2,
         .init_val   = 0x0000,
         .ro_mask    = 0x018E,
-        .emu_mask   = 0xFFFE,
+        .emu_mask   = 0xFFFF,
         .init       = pt_msgctrl_reg_init,
         .u.w.read   = pt_word_reg_read,
         .u.w.write  = pt_msgctrl_reg_write,
@@ -692,7 +693,7 @@ static int token_value(char *token)
     return strtol(token, NULL, 16);
 }
 
-static int next_bdf(char **str, int *seg, int *bus, int *dev, int *func)
+static int next_bdf(char **str, int *seg, int *bus, int *dev, int *func, char
**opt)
 {
     char *token;
     const char *delim = ":.-";
@@ -711,18 +712,60 @@ static int next_bdf(char **str, int *seg, int *bus, int
*dev, int *func)
     *dev  = token_value(token);
 
     token  = strsep(str, delim);
+    *opt = strchr(token, '','');
+    if (*opt)
+        *(*opt)++ = ''\0'';
+
     *func  = token_value(token);
 
     return 1;
 }
 
+static int get_next_keyval(char **option, char **key, char **val)
+{
+    char *opt, *k, *v;
+
+    k = *option;
+    opt = strchr(k, '','');
+    if (opt)
+        *opt++ = ''\0'';
+    v = strchr(k, ''='');
+    if (!v)
+        return -1;
+    *v++ = ''\0'';
+
+    *key = k;
+    *val = v;
+    *option = opt;
+
+    return 0;
+}
+
+static void msi_set_enable(struct pt_dev *ptdev, int en)
+{
+    uint16_t val;
+    uint32_t address;
+    if (!ptdev->msi)
+        return;
+
+    address = ptdev->msi->ctrl_offset;
+    if (!address)
+        return;
+
+    val = pci_read_word(ptdev->pci_dev, address);
+    val &= ~PCI_MSI_FLAGS_ENABLE;
+    val |= en & PCI_MSI_FLAGS_ENABLE;
+    pci_write_word(ptdev->pci_dev, address, val);
+}
+
 /* Insert a new pass-through device into a specific pci slot.
  * input  dom:bus:dev.func@slot, chose free one if slot == 0
  * return -1: required slot not available
  *         0: no free hotplug slots, but normal slot should okay
  *        >0: the new hotplug slot
  */
-static int __insert_to_pci_slot(int bus, int dev, int func, int slot)
+static int __insert_to_pci_slot(int bus, int dev, int func, int slot,
+                                char *opt)
 {
     int i, php_slot;
 
@@ -759,6 +802,7 @@ found:
     dpci_infos.php_devs[php_slot].r_bus  = bus;
     dpci_infos.php_devs[php_slot].r_dev  = dev;
     dpci_infos.php_devs[php_slot].r_func = func;
+    dpci_infos.php_devs[php_slot].opt = opt;
     return PHP_TO_PCI_SLOT(php_slot);
 }
 
@@ -768,19 +812,19 @@ found:
 int insert_to_pci_slot(char *bdf_slt)
 {
     int seg, bus, dev, func, slot;
-    char *bdf_str, *slt_str;
+    char *bdf_str, *slt_str, *opt;
     const char *delim="@";
 
     bdf_str = strsep(&bdf_slt, delim);
     slt_str = bdf_slt;
     slot = token_value(slt_str);
 
-    if ( !next_bdf(&bdf_str, &seg, &bus, &dev, &func))
+    if ( !next_bdf(&bdf_str, &seg, &bus, &dev, &func,
&opt))
     {
         return -1;
     }
 
-    return __insert_to_pci_slot(bus, dev, func, slot);
+    return __insert_to_pci_slot(bus, dev, func, slot, opt);
 
 }
 
@@ -807,8 +851,9 @@ int test_pci_slot(int slot)
 int bdf_to_slot(char *bdf_str)
 {
     int seg, bus, dev, func, i;
+    char *opt;
 
-    if ( !next_bdf(&bdf_str, &seg, &bus, &dev, &func))
+    if ( !next_bdf(&bdf_str, &seg, &bus, &dev, &func,
&opt))
     {
         return -1;
     }
@@ -1960,9 +2005,15 @@ static uint32_t pt_msgctrl_reg_init(struct pt_dev *ptdev,
         pci_write_word(pdev, real_offset, reg_field &
~PCI_MSI_FLAGS_ENABLE);
     }
     ptdev->msi->flags |= (reg_field | MSI_FLAG_UNINIT);
+    ptdev->msi->ctrl_offset = real_offset;
     
     /* All register is 0 after reset, except first 4 byte */
     reg_field &= reg->ro_mask;
+
+    if (ptdev->msi_trans_cap) {
+        PT_LOG("Turning on MSI-INTx translation\n");
+        ptdev->msi_trans_en = 1;
+    }
     
     return reg_field;
 }
@@ -2673,6 +2724,34 @@ static int pt_linkctrl2_reg_write(struct pt_dev *ptdev,
     return 0;
 }
 
+static void pt_unmap_msi_translate(struct pt_dev *ptdev)
+{
+    uint16_t e_device, e_intx;
+    int rc;
+
+    /* MSI_ENABLE bit should be disabed until the new handler is set */
+    msi_set_enable(ptdev, 0);
+
+    e_device = (ptdev->dev.devfn >> 3) & 0x1f;
+    /* fix virtual interrupt pin to INTA# */
+    e_intx = 0;
+    rc = xc_domain_unbind_pt_irq(xc_handle, domid, ptdev->msi->pirq,
+                                 PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                 e_device, e_intx, 0);
+    if (rc < 0)
+        PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!
rc=%d\n", rc);
+
+    if (ptdev->machine_irq)
+    {
+        rc = xc_domain_bind_pt_pci_irq(xc_handle, domid, ptdev->machine_irq,
+                                       0, e_device, e_intx);
+        if ( rc < 0 )
+            PT_LOG("Error: Rebinding of interrupt failed! rc=%d\n",
rc);
+    }
+
+    ptdev->msi_trans_en = 0;
+}
+
 /* write Message Control register */
 static int pt_msgctrl_reg_write(struct pt_dev *ptdev, 
     struct pt_reg_tbl *cfg_entry, 
@@ -2682,7 +2761,9 @@ static int pt_msgctrl_reg_write(struct pt_dev *ptdev,
     uint16_t writable_mask = 0;
     uint16_t throughable_mask = 0;
     uint16_t old_ctrl = cfg_entry->data;
+    uint8_t e_device, e_intx;
     PCIDevice *pd = (PCIDevice *)ptdev;
+    uint16_t val;
 
     /* Currently no support for multi-vector */
     if ((*value & PCI_MSI_FLAGS_QSIZE) != 0x0)
@@ -2699,21 +2780,29 @@ static int pt_msgctrl_reg_write(struct pt_dev *ptdev,
     PT_LOG("old_ctrl:%04xh new_ctrl:%04xh\n", old_ctrl,
cfg_entry->data);
     
     /* create value for writing to I/O device register */
+    val = *value;
     throughable_mask = ~reg->emu_mask & valid_mask;
     *value = ((*value & throughable_mask) | (dev_value &
~throughable_mask));
 
     /* update MSI */
-    if (*value & PCI_MSI_FLAGS_ENABLE)
+    if (val & PCI_MSI_FLAGS_ENABLE)
     {
         /* setup MSI pirq for the first time */
         if (ptdev->msi->flags & MSI_FLAG_UNINIT)
         {
-            /* Init physical one */
-            PT_LOG("setup msi for dev %x\n", pd->devfn);
-            if (pt_msi_setup(ptdev))
+            if (ptdev->msi_trans_en) {
+                PT_LOG("guest enabling MSI, disable MSI-INTx
translation\n");
+                pt_unmap_msi_translate(ptdev);
+            }
+            else
             {
-                PT_LOG("pt_msi_setup error!!!\n");
-                return -1;
+                /* Init physical one */
+                PT_LOG("setup msi for dev %x\n", pd->devfn);
+                if (pt_msi_setup(ptdev))
+                {
+                    PT_LOG("pt_msi_setup error!!!\n");
+                    return -1;
+                }
             }
             pt_msi_update(ptdev);
 
@@ -2725,6 +2814,12 @@ static int pt_msgctrl_reg_write(struct pt_dev *ptdev,
     else
         ptdev->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
 
+    /* pass through MSI_ENABLE bit when no MSI-INTx translation */
+    if (!ptdev->msi_trans_en) {
+        *value &= ~PCI_MSI_FLAGS_ENABLE;
+        *value |= val & PCI_MSI_FLAGS_ENABLE;
+    }
+
     return 0;
 }
 
@@ -2870,7 +2965,13 @@ static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
 
     /* update MSI-X */
     if ((*value & PCI_MSIX_ENABLE) && !(*value &
PCI_MSIX_MASK))
+    {
+        if (ptdev->msi_trans_en) {
+            PT_LOG("guest enabling MSI-X, disable MSI-INTx
translation\n");
+            pt_unmap_msi_translate(ptdev);
+        }
         pt_msix_update(ptdev);
+    }
 
     ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
 
@@ -2879,7 +2980,8 @@ static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
 
 struct pt_dev * register_real_device(PCIBus *e_bus,
         const char *e_dev_name, int e_devfn, uint8_t r_bus, uint8_t r_dev,
-        uint8_t r_func, uint32_t machine_irq, struct pci_access *pci_access)
+        uint8_t r_func, uint32_t machine_irq, struct pci_access *pci_access,
+        char *opt)
 {
     int rc = -1, i;
     struct pt_dev *assigned_device = NULL;
@@ -2887,6 +2989,8 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
     uint8_t e_device, e_intx;
     struct pci_config_cf8 machine_bdf;
     int free_pci_slot = -1;
+    char *key, *val;
+    int msi_translate;
 
     PT_LOG("Assigning real physical device %02x:%02x.%x ...\n",
         r_bus, r_dev, r_func);
@@ -2908,13 +3012,41 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
 
     if ( e_devfn == PT_VIRT_DEVFN_AUTO ) {
         /*indicate a static assignment(not hotplug), so find a free PCI hot
plug slot */
-        free_pci_slot = __insert_to_pci_slot(r_bus, r_dev, r_func, 0);
+        free_pci_slot = __insert_to_pci_slot(r_bus, r_dev, r_func, 0, NULL);
         if ( free_pci_slot > 0 )
             e_devfn = free_pci_slot  << 3;
         else
             PT_LOG("Error: no free virtual PCI hot plug slot, thus no live
migration.\n");
     }
 
+    msi_translate = direct_pci_msitranslate;
+    while (opt) {
+        if (get_next_keyval(&opt, &key, &val)) {
+            PT_LOG("Error: unrecognized PCI assignment option
\"%s\"\n", opt);
+            break;
+        }
+
+        if (strcmp(key, "msitranslate") == 0)
+        {
+            if (strcmp(val, "0") == 0 || strcmp(val, "no")
== 0)
+            {
+                PT_LOG("Disable MSI translation via per device
option\n");
+                msi_translate = 0;
+            }
+            else if (strcmp(val, "1") == 0 || strcmp(val,
"yes") == 0)
+            {
+                PT_LOG("Enable MSI translation via per device
option\n");
+                msi_translate = 1;
+            }
+            else
+                PT_LOG("Error: unrecognized value for
msitranslate=\n");
+        }
+        else
+            PT_LOG("Error: unrecognized PCI assignment option
\"%s=%s\"\n", key, val);
+
+    }
+
+
     /* Register device */
     assigned_device = (struct pt_dev *) pci_register_device(e_bus, e_dev_name,
                                 sizeof(struct pt_dev), e_devfn,
@@ -2929,6 +3061,7 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
         dpci_infos.php_devs[PCI_TO_PHP_SLOT(free_pci_slot)].pt_dev =
assigned_device;
 
     assigned_device->pci_dev = pci_dev;
+    assigned_device->msi_trans_cap = msi_translate;
 
     /* Assign device */
     machine_bdf.reg = 0;
@@ -2960,6 +3093,28 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
     /* fix virtual interrupt pin to INTA# */
     e_intx = 0;
 
+    while (assigned_device->msi_trans_en)
+    {
+        if (pt_msi_setup(assigned_device))
+        {
+            PT_LOG("Error: MSI-INTx translation MSI setup failed,
fallback\n");
+            assigned_device->msi_trans_en = 0;
+            break;
+        }
+
+        rc = xc_domain_bind_pt_irq(xc_handle, domid,
assigned_device->msi->pirq,
+                                   PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                   e_device, e_intx, 0);
+        if ( rc < 0)
+        {
+            PT_LOG("Error: MSI-INTx translation bind failed,
fallback\n");
+            assigned_device->msi_trans_en = 0;
+            break;
+        }
+        msi_set_enable(assigned_device, 1);
+        break;
+    }
+
     if ( PT_MACHINE_IRQ_AUTO == machine_irq )
     {
         int pirq = pci_dev->irq;
@@ -2973,9 +3125,15 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
             PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
         }
         else
+        {
             machine_irq = pirq;
+            assigned_device->machine_irq = pirq;
+        }
     }
 
+    if (assigned_device->msi_trans_en)
+        goto out;
+
     /* bind machine_irq to device */
     if ( 0 != machine_irq )
     {
@@ -2995,8 +3153,9 @@ struct pt_dev * register_real_device(PCIBus *e_bus,
     }
 
 out:
-    PT_LOG("Real physical device %02x:%02x.%x registered
successfuly!\n",
-        r_bus, r_dev, r_func);
+    PT_LOG("Real physical device %02x:%02x.%x registered
successfuly!\n"
+           "IRQ type = %s\n", r_bus, r_dev, r_func,
+           assigned_device->msi_trans_en?
"MSI-INTx":"INTx");
 
     return assigned_device;
 }
@@ -3029,9 +3188,9 @@ int unregister_real_device(int php_slot)
     e_device = (assigned_device->dev.devfn >> 3) & 0x1f;
     /* fix virtual interrupt pin to INTA# */
     e_intx = 0;
-    machine_irq = pci_dev->irq;
+    machine_irq = assigned_device->machine_irq;
 
-    if ( machine_irq != 0 ) {
+    if ( assigned_device->msi_trans_en == 0 && machine_irq ) {
         rc = xc_domain_unbind_pt_irq(xc_handle, domid, machine_irq,
PT_IRQ_TYPE_PCI, 0,
                                        e_device, e_intx, 0);
         if ( rc < 0 )
@@ -3040,6 +3199,16 @@ int unregister_real_device(int php_slot)
             PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n",
rc);
         }
     }
+    else if (assigned_device->msi_trans_en)
+    {
+        rc = xc_domain_unbind_pt_irq(xc_handle, domid,
assigned_device->msi->pirq,
+                                     PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                     e_device, e_intx, 0);
+        if (rc < 0)
+            PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!
rc=%d\n", rc);
+    }
+
+    /* TODO: unmap passthrough MSI and MSI-X irqs */
 
     /* delete all emulated config registers */
     pt_config_delete(assigned_device);
@@ -3075,7 +3244,10 @@ int power_on_php_slot(int php_slot)
             php_dev->r_dev,
             php_dev->r_func,
             PT_MACHINE_IRQ_AUTO,
-            dpci_infos.pci_access);
+            dpci_infos.pci_access,
+            php_dev->opt);
+
+    php_dev->opt = NULL;
 
     php_dev->pt_dev = pt_dev;
 
@@ -3097,6 +3269,7 @@ int pt_init(PCIBus *e_bus, const char *direct_pci)
     char slot_str[8];
     char *direct_pci_head = NULL;
     char *direct_pci_p = NULL;
+    char *opt;
 
     /* Initialize libpci */
     pci_access = pci_alloc();
@@ -3125,11 +3298,11 @@ int pt_init(PCIBus *e_bus, const char *direct_pci)
     vslots = qemu_mallocz ( strlen(direct_pci) / 3 );
 
     /* Assign given devices to guest */
-    while ( next_bdf(&direct_pci_p, &seg, &b, &d, &f) )
+    while ( next_bdf(&direct_pci_p, &seg, &b, &d, &f,
&opt) )
     {
         /* Register real device with the emulated bus */
         pt_dev = register_real_device(e_bus, "DIRECT PCI",
PT_VIRT_DEVFN_AUTO,
-            b, d, f, PT_MACHINE_IRQ_AUTO, pci_access);
+            b, d, f, PT_MACHINE_IRQ_AUTO, pci_access, opt);
         if ( pt_dev == NULL )
         {
             PT_LOG("Error: Registration failed (%02x:%02x.%x)\n", b,
d, f);
diff --git a/hw/pass-through.h b/hw/pass-through.h
index 8aa664b..a7d2727 100644
--- a/hw/pass-through.h
+++ b/hw/pass-through.h
@@ -121,6 +121,7 @@ struct pt_region {
 
 struct pt_msi_info {
     uint32_t flags;
+    uint32_t ctrl_offset; /* saved control offset */
     int pirq;          /* guest pirq corresponding */
     uint32_t addr_lo;  /* guest message address */
     uint32_t addr_hi;  /* guest message upper address */
@@ -158,6 +159,10 @@ struct pt_dev {
                                                 /* emul reg group list */
     struct pt_msi_info *msi;                    /* MSI virtualization */
     struct pt_msix_info *msix;                  /* MSI-X virtualization */
+    int machine_irq;                            /* saved pirq */
+    /* Physical MSI to guest INTx translation when possible */
+    int msi_trans_cap;
+    int msi_trans_en;
 };
 
 /* Used for formatting PCI BDF into cf8 format */
diff --git a/hw/pci.h b/hw/pci.h
index 4adc4d7..a527a39 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -64,6 +64,7 @@ struct PCIDevice {
 };
 
 extern char direct_pci_str[];
+extern int direct_pci_msitranslate;
 
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
                                int instance_size, int devfn,
diff --git a/xenstore.c b/xenstore.c
index 86e8b63..ff3d023 100644
--- a/xenstore.c
+++ b/xenstore.c
@@ -290,8 +290,10 @@ const char *xenstore_get_guest_uuid(void) {
 #endif
 }
 
-#define DIRECT_PCI_STR_LEN 160
+#define DIRECT_PCI_STR_LEN 512
+#define PT_PCI_MSITRANSLATE_DEFAULT 1
 char direct_pci_str[DIRECT_PCI_STR_LEN];
+int direct_pci_msitranslate;
 void xenstore_parse_domain_config(int hvm_domid)
 {
     char **e_danger = NULL;
@@ -556,20 +558,50 @@ void xenstore_parse_domain_config(int hvm_domid)
             free(dev);
             dev = xs_read(xsh, XBT_NULL, buf, &len);
 
-            if ( strlen(dev) + strlen(direct_pci_str) > DIRECT_PCI_STR_LEN )
{
+            if ( strlen(dev) + strlen(direct_pci_str) > DIRECT_PCI_STR_LEN -
1) {
                 fprintf(stderr, "qemu: too many pci pass-through
devices\n");
                 memset(direct_pci_str, 0, DIRECT_PCI_STR_LEN);
                 goto out;
             }
 
+
             /* append to direct_pci_str */
+            if ( !dev )
+                continue;
+
+            strcat(direct_pci_str, dev);
+
+            if (pasprintf(&buf,
"/local/domain/0/backend/pci/%u/%u/opts-%d",
+                          hvm_domid, pci_devid, i) != -1) {
+                free(dev);
+                dev = xs_read(xsh, XBT_NULL, buf, &len);
+            }
             if ( dev ) {
+                if ( strlen(dev) + strlen(direct_pci_str) >
DIRECT_PCI_STR_LEN - 2) {
+                    fprintf(stderr, "qemu: too many pci pass-through
devices\n");
+                    memset(direct_pci_str, 0, DIRECT_PCI_STR_LEN);
+                    goto out;
+                }
+                strcat(direct_pci_str, ",");
                 strcat(direct_pci_str, dev);
-                strcat(direct_pci_str, "-");
             }
+
+            strcat(direct_pci_str, "-");
         }
     }
 
+    /* get the pci pass-through parameter */
+    if (pasprintf(&buf,
"/local/domain/0/backend/pci/%u/%u/msitranslate",
+                  hvm_domid, pci_devid) == -1)
+        goto out;
+
+    free(params);
+    params = xs_read(xsh, XBT_NULL, buf, &len);
+    if (params)
+        direct_pci_msitranslate = atoi(params);
+    else
+        direct_pci_msitranslate = PT_PCI_MSITRANSLATE_DEFAULT;
+
  out:
     free(danger_type);
     free(params);

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 3/6] pci: add pci option support for XML-RPC server

pci: add pci option support for XML-RPC server

Allow the per-device options for passthrough pci devices
for exmaple, in domain config file:
    pci = [''01:00.0,opt1=val1,opt2=val2'',
''01:00.1'' ]
or in the PCI hotplug case:
    xm pci-attach -o opt1=val1 --options=opt2=val2 <domid> 01:00.0 6

This patch is for xml-rpc server

Signed-off-by: Qing He <qing.he@intel.com>
---

diff -r d11b806ed94f -r 02ebed7b45d0 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Sun Jan 04 17:41:53 2009 +0800
+++ b/tools/python/xen/xend/XendDomainInfo.py	Thu Jan 08 01:57:58 2009 +0800
@@ -645,10 +645,17 @@
                           " already been assigned to other domain, or
maybe"
                           " it doesn''t exist." % (bus, dev,
func))
 
-        bdf_str = "%s:%s:%s.%s@%s" %
(new_dev[''domain''],
+        opts = ''''
+        if ''opts'' in new_dev and
len(new_dev[''opts'']) > 0:
+            config_opts = new_dev[''opts'']
+            config_opts = map(lambda (x, y): x+''=''+y,
config_opts)
+            opts = '','' + reduce(lambda x, y:
x+'',''+y, config_opts)
+
+        bdf_str = "%s:%s:%s.%s%s@%s" %
(new_dev[''domain''],
                 new_dev[''bus''],
                 new_dev[''slot''],
                 new_dev[''func''],
+                opts,
                 new_dev[''vslt''])
         self.image.signalDeviceModel(''pci-ins'',
''pci-inserted'', bdf_str)
 
@@ -2154,7 +2161,11 @@
         xc.domain_max_vcpus(self.domid,
int(self.info[''VCPUs_max'']))
 
         # Test whether the devices can be assigned with VT-d
-        pci_str = str(self.info["platform"].get("pci"))
+        pci = self.info["platform"].get("pci")
+        pci_str = ''''
+        if pci and len(pci) > 0:
+            pci = map(lambda x: x[0:4], pci)  # strip options 
+            pci_str = str(pci)
         if hvm and pci_str:
             bdf = xc.test_assign_device(self.domid, pci_str)
             if bdf != 0:
diff -r d11b806ed94f -r 02ebed7b45d0 tools/python/xen/xend/server/pciif.py
--- a/tools/python/xen/xend/server/pciif.py	Sun Jan 04 17:41:53 2009 +0800
+++ b/tools/python/xen/xend/server/pciif.py	Thu Jan 08 01:57:58 2009 +0800
@@ -75,6 +75,12 @@
             slot = parse_hex(pci_config.get(''slot'', 0))
             func = parse_hex(pci_config.get(''func'', 0))
 
+            opts = pci_config.get(''opts'',
'''')
+            if len(opts) > 0:
+                opts = map(lambda (x, y): x+''=''+y, opts)
+                opts = reduce(lambda x, y: x+'',''+y, opts)
+                back[''opts-%i'' % pcidevid] = opts
+
             vslt = pci_config.get(''vslt'')
             if vslt is not None:
                 vslots = vslots + vslt + ";"
@@ -108,6 +114,9 @@
                 dev = back[''dev-%i'' % i]
                 state = states[i]
                 uuid = back[''uuid-%i'' %i]
+                opts = ''''
+                if ''opts-%i'' % i in back:
+                    opts = back[''opts-%i'' % i]
             except:
                 raise XendError(''Error reading config'')
 
@@ -129,6 +138,8 @@
                 self.writeBackend(devid, ''state-%i'' %
(num_olddevs + i),
                                  
str(xenbusState[''Initialising'']))
                 self.writeBackend(devid, ''uuid-%i'' %
(num_olddevs + i), uuid)
+                if len(opts) > 0:
+                    self.writeBackend(devid, ''opts-%i'' %
(num_olddevs + i), opts)
                 self.writeBackend(devid, ''num_devs'',
str(num_olddevs + i + 1))
 
                 # Update vslots
@@ -540,6 +551,9 @@
                 self.removeBackend(devid, ''vdev-%i'' % i)
                 self.removeBackend(devid, ''state-%i'' % i)
                 self.removeBackend(devid, ''uuid-%i'' % i)
+                tmpopts = self.readBackend(devid, ''opts-%i'' %
i)
+                if tmpopts is not None:
+                    self.removeBackend(devid, ''opts-%i'' % i)
             else:
                 if new_num_devs != i:
                     tmpdev = self.readBackend(devid, ''dev-%i''
% i)
@@ -556,6 +570,9 @@
                     tmpuuid = self.readBackend(devid,
''uuid-%i'' % i)
                     self.writeBackend(devid, ''uuid-%i'' %
new_num_devs, tmpuuid)
                     self.removeBackend(devid, ''uuid-%i'' % i)
+                    tmpopts = self.readBackend(devid,
''opts-%i'' % i)
+                    if tmpopts is not None:
+                        self.removeBackend(devid, ''opts-%i'' %
i)
                 new_num_devs = new_num_devs + 1
 
         self.writeBackend(devid, ''num_devs'',
str(new_num_devs))
diff -r d11b806ed94f -r 02ebed7b45d0 tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py	Sun Jan 04 17:41:53 2009 +0800
+++ b/tools/python/xen/xm/create.py	Thu Jan 08 01:57:58 2009 +0800
@@ -667,9 +667,20 @@
     """Create the config for pci devices.
     """
     config_pci = []
-    for (domain, bus, slot, func) in vals.pci:
-        config_pci.append([''dev'',
[''domain'', domain], [''bus'', bus], \
-                        [''slot'', slot],
[''func'', func]])
+    for (domain, bus, slot, func, opts) in vals.pci:
+        config_pci_opts = []
+        d = comma_sep_kv_to_dict(opts)
+
+        def f(k):
+            config_pci_opts.append([k, d[k]])
+
+        config_pci_bdf = [''dev'', [''domain'',
domain], [''bus'', bus], \
+                          [''slot'', slot],
[''func'', func]]
+        map(f, d.keys())
+        if len(config_pci_opts)>0:
+            config_pci_bdf.append([''opts'', config_pci_opts])
+
+        config_pci.append(config_pci_bdf)
 
     if len(config_pci)>0:
         config_pci.insert(0, ''pci'')
@@ -991,14 +1002,18 @@
         pci_match =
re.match(r"((?P<domain>[0-9a-fA-F]{1,4})[:,])?" + \
                 r"(?P<bus>[0-9a-fA-F]{1,2})[:,]" + \
                 r"(?P<slot>[0-9a-fA-F]{1,2})[.,]" + \
-                r"(?P<func>[0-7])$", pci_dev_str)
+                r"(?P<func>[0-7])" + \
+                r"(,(?P<opts>.*))?$", pci_dev_str)
         if pci_match!=None:
-            pci_dev_info = pci_match.groupdict(''0'')
+            pci_dev_info = pci_match.groupdict('''')
+            if pci_dev_info[''domain'']=='''':
+                pci_dev_info[''domain'']=''0''
             try:
                 pci.append(
(''0x''+pci_dev_info[''domain''], \
                        
''0x''+pci_dev_info[''bus''], \
                        
''0x''+pci_dev_info[''slot''], \
-                       
''0x''+pci_dev_info[''func'']))
+                       
''0x''+pci_dev_info[''func''], \
+                        pci_dev_info[''opts'']))
             except IndexError:
                 err(''Error in PCI slot syntax
"%s"''%(pci_dev_str))
     vals.pci = pci
diff -r d11b806ed94f -r 02ebed7b45d0 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Sun Jan 04 17:41:53 2009 +0800
+++ b/tools/python/xen/xm/main.py	Thu Jan 08 01:57:58 2009 +0800
@@ -187,7 +187,7 @@
     ''vnet-delete''   :  (''<VnetId>'',
''Delete a Vnet.''),
     ''vnet-list''     :  (''[-l|--long]'',
''List Vnets.''),
     ''vtpm-list''     :  (''<Domain>
[--long]'', ''List virtual TPM devices.''),
-    ''pci-attach''    :  (''<Domain>
<domain:bus:slot.func> [virtual slot]'',
+    ''pci-attach''    :  (''[-o|--options=<opt>]
<Domain> <domain:bus:slot.func> [virtual slot]'',
                         ''Insert a new pass-through pci
device.''),
     ''pci-detach''    :  (''<Domain>
<domain:bus:slot.func>'',
                         ''Remove a domain\''s pass-through pci
device.''),
@@ -2428,7 +2428,7 @@
             vif.append(vif_param)
         server.xend.domain.device_create(dom, vif)
 
-def parse_pci_configuration(args, state):
+def parse_pci_configuration(args, state, opts = ''''):
     dom = args[0]
     pci_dev_str = args[1]
     if len(args) == 3:
@@ -2443,12 +2443,17 @@
     if pci_match == None:
         raise OptionError("Invalid argument: %s %s" %
(pci_dev_str,vslt))
     pci_dev_info = pci_match.groupdict(''0'')
+
     try:
-        pci.append([''dev'', [''domain'',
''0x''+ pci_dev_info[''domain'']], \
+        pci_bdf =[''dev'', [''domain'',
''0x''+ pci_dev_info[''domain'']], \
                 [''bus'', ''0x''+
pci_dev_info[''bus'']],
                 [''slot'', ''0x''+
pci_dev_info[''slot'']],
                 [''func'', ''0x''+
pci_dev_info[''func'']],
-                [''vslt'', ''0x%x'' % int(vslt,
16)]])
+                [''vslt'', ''0x%x'' % int(vslt,
16)]]
+        if len(opts) > 0:
+            pci_bdf.append([''opts'', opts])
+        pci.append(pci_bdf)
+
     except:
         raise OptionError("Invalid argument: %s %s" %
(pci_dev_str,vslt))
     pci.append([''state'', state])
@@ -2456,8 +2461,22 @@
     return (dom, pci)
 
 def xm_pci_attach(args):
-    arg_check(args, ''pci-attach'', 2, 3)
-    (dom, pci) = parse_pci_configuration(args,
''Initialising'')
+    config_pci_opts = []
+    (options, params) = getopt.gnu_getopt(args, ''o:'',
[''options=''])
+    for (k, v) in options:
+        if k in (''-o'', ''--options''):
+            if len(v.split(''='')) != 2:
+                err("Invalid pci attach option: %s" % v)
+                usage(''pci-attach'')
+            config_pci_opts.append(v.split(''=''))
+
+    n = len([i for i in params if i != ''--''])
+    if n < 2 or n > 3:
+        err("Invalid argument for ''xm pci-attach''")
+        usage(''pci-attach'')
+
+    (dom, pci) = parse_pci_configuration(params,
''Initialising'',
+                     config_pci_opts)
 
     if serverType == SERVER_XEN_API:
 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 4/6] pci: add pci option support for XenAPI server

pci: add pci option support for XenAPI sever

Allow the per-device options for passthrough pci devices
This patch is for XenAPI server

A new key-value pair element of ''pci'' named
''pci_opt'' is added
to xml config file

Signed-off-by: Qing He <qing.he@intel.com>
---

diff -r 02ebed7b45d0 -r 82b05ac013fc tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Thu Jan 08 01:57:58 2009 +0800
+++ b/tools/python/xen/xend/XendConfig.py	Thu Jan 08 02:09:19 2009 +0800
@@ -1247,6 +1247,11 @@
                         ''PPCI'': ppci_uuid,
                         ''hotplug_slot'':
pci_dev.get(''vslot'', 0)
                     }
+
+                    dpci_opts = pci_dev.get(''opts'')
+                    if dpci_opts and len(dpci_opts) > 0:
+                        dpci_record[''options''] = dpci_opts
+
                     XendDPCI(dpci_uuid, dpci_record)
 
                 target[''devices''][pci_devs_uuid] = (dev_type,
@@ -1762,6 +1767,11 @@
                         ''PPCI'': ppci_uuid,
                         ''hotplug_slot'':
pci_dev.get(''vslot'', 0)
                     }
+
+                    dpci_opts = pci_dev.get(''opts'')
+                    if dpci_opts and len(dpci_opts) > 0:
+                        dpci_record[''options''] = dpci_opts
+
                     XendDPCI(dpci_uuid, dpci_record)
 
                 self[''devices''][dev_uuid] = (dev_type,
diff -r 02ebed7b45d0 -r 82b05ac013fc tools/python/xen/xend/XendDPCI.py
--- a/tools/python/xen/xend/XendDPCI.py	Thu Jan 08 01:57:58 2009 +0800
+++ b/tools/python/xen/xend/XendDPCI.py	Thu Jan 08 02:09:19 2009 +0800
@@ -41,7 +41,8 @@
                   ''virtual_name'',
                   ''VM'',
                   ''PPCI'',
-                  ''hotplug_slot'']
+                  ''hotplug_slot'',
+                  ''options'']
         return XendBase.getAttrRO() + attrRO
 
     def getAttrRW(self):
@@ -119,6 +120,8 @@
         self.VM = record[''VM'']
         self.PPCI = record[''PPCI'']
         self.hotplug_slot = record[''hotplug_slot'']
+        if ''options'' in record.keys():
+            self.options = record[''options'']
 
     def destroy(self):
         xendom = XendDomain.instance()
@@ -152,3 +155,5 @@
     def get_hotplug_slot(self):
         return self.hotplug_slot
 
+    def get_options(self):
+        return self.options
diff -r 02ebed7b45d0 -r 82b05ac013fc tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Thu Jan 08 01:57:58 2009 +0800
+++ b/tools/python/xen/xend/XendDomainInfo.py	Thu Jan 08 02:09:19 2009 +0800
@@ -3458,6 +3458,11 @@
 
         dpci_uuid = uuid.createString()
 
+        dpci_opts = []
+        opts_dict = xenapi_pci.get(''options'')
+        for k in opts_dict.keys():
+            dpci_opts.append([k, opts_dict[k]])
+
         # Convert xenapi to sxp
         ppci = XendAPIStore.get(xenapi_pci.get(''PPCI''),
''PPCI'')
 
@@ -3469,6 +3474,7 @@
                     [''slot'', ''0x%02x'' %
ppci.get_slot()],
                     [''func'', ''0x%1x'' %
ppci.get_func()],
                     [''vslt'', ''0x%02x'' %
xenapi_pci.get(''hotplug_slot'')],
+                    [''opts'', dpci_opts],
                     [''uuid'', dpci_uuid]
                 ],
                 [''state'', ''Initialising'']
diff -r 02ebed7b45d0 -r 82b05ac013fc tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Thu Jan 08 01:57:58 2009 +0800
+++ b/tools/python/xen/xm/main.py	Thu Jan 08 02:09:19 2009 +0800
@@ -2499,7 +2499,8 @@
         dpci_record = {
             "VM":           get_single_vm(dom),
             "PPCI":         target_ref,
-            "hotplug_slot": vslt
+            "hotplug_slot": vslt,
+            "options":      dict([k, v] for k, v in config_pci_opts)
         }
         server.xenapi.DPCI.create(dpci_record)
 
diff -r 02ebed7b45d0 -r 82b05ac013fc tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py	Thu Jan 08 01:57:58 2009 +0800
+++ b/tools/python/xen/xm/xenapi_create.py	Thu Jan 08 02:09:19 2009 +0800
@@ -533,7 +533,10 @@
             "PPCI":
                 target_ref,
             "hotplug_slot":
-                int(pci.attributes["func"].value, 16)
+                int(pci.attributes["func"].value, 16),
+            "options":
+                get_child_nodes_as_dict(pci,
+                  "pci_opt", "key", "value")
         }
 
         return server.xenapi.DPCI.create(dpci_record)
@@ -931,6 +934,12 @@
                     = get_child_by_name(dev_sxp, "func",
"0")
                 pci.attributes["vslt"] \
                     = get_child_by_name(dev_sxp, "vslt",
"0")
+                for opt in get_child_by_name(dev_sxp, "opts",
""):
+                    if len(opt) > 0:
+                        pci_opt = document.createElement("pci_opt")
+                        pci_opt.attributes["key"] = opt[0]
+                        pci_opt.attributes["value"] = opt[1]
+                        pci.appendChild(pci_opt)
 
                 pcis.append(pci)
 
diff -r b0a4862c9018 -r 6e84f349112d tools/python/xen/xm/create.dtd
--- a/tools/python/xen/xm/create.dtd	Thu Jan 08 02:13:31 2009 +0800
+++ b/tools/python/xen/xm/create.dtd	Thu Jan 08 02:15:01 2009 +0800
@@ -82,11 +82,12 @@
 <!ELEMENT vtpm   (name*)>
 <!ATTLIST vtpm   backend         CDATA #REQUIRED>
 
-<!ELEMENT pci    EMPTY>
+<!ELEMENT pci    (pci_opt*)>
 <!ATTLIST pci    domain          CDATA #REQUIRED
                  bus             CDATA #REQUIRED
                  slot            CDATA #REQUIRED
                  func            CDATA #REQUIRED
+                 opts_str        CDATA #IMPLIED
                  vslt            CDATA #IMPLIED>
 
 <!ELEMENT vscsi  EMPTY>
@@ -138,6 +139,10 @@
 <!ATTLIST vcpu_param key   CDATA #REQUIRED
                      value CDATA #REQUIRED>
 
+<!ELEMENT pci_opt    EMPTY>
+<!ATTLIST pci_opt    key   CDATA #REQUIRED
+                     value CDATA #REQUIRED>
+
 <!ELEMENT other_config EMPTY>
 <!ATTLIST other_config key   CDATA #REQUIRED
                        value CDATA #REQUIRED>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 5/6] pci: add config options for MSI-INTx translation in HVM

pci: add config options for MSI-INTx translation in HVM

Add a config file option ''pci_msitranslate'' to enable MSI-INTx
translation
in HVM, and also a per-device option ''msitranslate'' to allow
device based
overriden

Signed-off-by: Qing He <qing.he@intel.com>
---

diff -r 82b05ac013fc -r b0a4862c9018 tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Thu Jan 08 02:09:19 2009 +0800
+++ b/tools/python/xen/xend/XendConfig.py	Thu Jan 08 02:13:31 2009 +0800
@@ -166,6 +166,7 @@
     ''guest_os_type'': str,
     ''hap'': int,
     ''xen_extended_power_mgmt'': int,
+    ''pci_msitranslate'': int,
 }
 
 # Xen API console ''other_config'' keys.
diff -r 82b05ac013fc -r b0a4862c9018 tools/python/xen/xend/server/pciif.py
--- a/tools/python/xen/xend/server/pciif.py	Thu Jan 08 02:09:19 2009 +0800
+++ b/tools/python/xen/xend/server/pciif.py	Thu Jan 08 02:13:31 2009 +0800
@@ -95,6 +95,9 @@
 
         back[''num_devs'']=str(pcidevid)
         back[''uuid''] =
config.get(''uuid'','''')
+        if ''pci_msitranslate'' in
self.vm.info[''platform'']:
+           
back[''msitranslate'']=str(self.vm.info[''platform''][''pci_msitranslate''])
+
         return (0, back, {})
 
 
diff -r 82b05ac013fc -r b0a4862c9018 tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py	Thu Jan 08 02:09:19 2009 +0800
+++ b/tools/python/xen/xm/create.py	Thu Jan 08 02:13:31 2009 +0800
@@ -318,11 +318,14 @@
           backend driver domain to use for the disk.
           The option may be repeated to add more than one
disk.""")
 
-gopts.var(''pci'', val=''BUS:DEV.FUNC'',
+gopts.var(''pci'',
val=''BUS:DEV.FUNC[,msitranslate=0|1]'',
           fn=append_value, default=[],
           use="""Add a PCI device to a domain, using given
params (in hex).
-         For example ''pci=c0:02.1''.
-         The option may be repeated to add more than one pci
device.""")
+          For example ''pci=c0:02.1''.
+          If msitranslate is set, MSI-INTx translation is enabled if possible.
+          Guest that doesn''t support MSI will get IO-APIC type IRQs
+          translated from physical MSI, HVM only. Default is 1.
+          The option may be repeated to add more than one pci
device.""")
 
 gopts.var(''vscsi'', val=''PDEV,VDEV[,DOM]'',
           fn=append_value, default=[],
@@ -588,6 +591,11 @@
           fn=set_bool, default=None,
           use="""Do not inject spurious page faults into this
guest""")
 
+gopts.var(''pci_msitranslate'',
val=''TRANSLATE'',
+          fn=set_int, default=1,
+          use="""Global PCI MSI-INTx translation flag
(0=disable;
+          1=enable.""")
+
 def err(msg):
     """Print an error to stderr and exit.
     """
@@ -672,6 +680,9 @@
         d = comma_sep_kv_to_dict(opts)
 
         def f(k):
+            if k not in [''msitranslate'']:
+                err(''Invalid pci option: '' + k)
+
             config_pci_opts.append([k, d[k]])
 
         config_pci_bdf = [''dev'', [''domain'',
domain], [''bus'', bus], \
@@ -878,7 +889,7 @@
              ''sdl'', ''display'',
''xauthority'', ''rtc_timeoffset'',
''monitor'',
              ''acpi'', ''apic'',
''usb'', ''usbdevice'',
''keymap'', ''pci'', ''hpet'',
              ''guest_os_type'', ''hap'',
''opengl'', ''cpuid'',
''cpuid_check'',
-             ''viridian'',
''xen_extended_power_mgmt'' ]
+             ''viridian'',
''xen_extended_power_mgmt'',
''pci_msitranslate'' ]
 
     for a in args:
         if a in vals.__dict__ and vals.__dict__[a] is not None:
diff -r 82b05ac013fc -r b0a4862c9018 tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py	Thu Jan 08 02:09:19 2009 +0800
+++ b/tools/python/xen/xm/xenapi_create.py	Thu Jan 08 02:13:31 2009 +0800
@@ -1041,6 +1041,7 @@
             ''vhpt'',
             ''guest_os_type'',
             ''hap'',
+            ''pci_msitranslate'',
         ]
 
         platform_configs = []

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 09:06 UTC

head link

[Xen-devel] [PATCH 6/6] passthough: MSI-INTx translation documentation

passthough: MSI-INTx translation documentation

Signed-off-by: Qing He <qing.he@intel.com>
---

diff -r c4ef5731a992 -r 80fcb0b96801 docs/misc/vtd.txt
--- a/docs/misc/vtd.txt	Wed Jan 07 07:39:55 2009 +0800
+++ b/docs/misc/vtd.txt	Wed Jan 07 20:35:11 2009 +0800
@@ -38,6 +38,30 @@
 Add "msi=1" option in kernel line of host grub.
 
 
+MSI-INTx translation for passthrough devices in HVM
+---------------------------------------------------
+
+If the assigned device uses a physical IRQ that is shared by more than
+one device among multiple domains, there may be significant impact on
+device performance. Unfortunately, this is quite a common case if the
+IO-APIC (INTx) IRQ is used. MSI can avoid this issue, but was only
+available if the guest enables it.
+
+With MSI-INTx translation turned on, Xen enables device MSI if it''s
+available, regardless of whether the guest uses INTx or MSI. If the
+guest uses INTx IRQ, Xen will inject a translated INTx IRQ to guest''s
+virtual ioapic whenever an MSI message is received. This reduces the
+interrupt sharing of the system. If the guest OS enables MSI or MSI-X,
+the translation is automatically turned off.
+
+To enable or disable MSI-INTx translation globally, add
"pci_msitranslate"
+in the config file:
+	pci_msitranslate = 1         (default is 1)
+
+To override for a specific device:
+	pci = [ ''01:00.0,msitranslate=0'',
''03:00.0'' ]
+
+
 Caveat on Conventional PCI Device Passthrough
 ---------------------------------------------
 
@@ -79,6 +103,11 @@
  3. Attach a PCI device to the guest by the physical BDF and desired virtual
slot(optional). Following command would insert the physical device into
guest''s virtual slot 7
 
 	[root@vt-vtd ~]# xm pci-attach HVMDomainVtd 0:2:0.0 7
+
+    To specify options for the device, use -o or --options=. Following command
would disable MSI-INTx translation for the device
+
+	[root@vt-vtd ~]# xm pci-attach -o msitranslate=0 0:2:0.0 7
+
 
 VTd hotplug usage model:
 ------------------------
diff -r c4ef5731a992 -r 80fcb0b96801 tools/examples/xmexample.hvm
--- a/tools/examples/xmexample.hvm	Wed Jan 07 07:39:55 2009 +0800
+++ b/tools/examples/xmexample.hvm	Wed Jan 07 20:35:11 2009 +0800
@@ -288,6 +288,39 @@
 #  ''x'' -> we don''t care (do not check)
 #  ''s'' -> the bit must be the same as on the host that
started this VM
 
+#-----------------------------------------------------------------------------
+#   Configure passthrough PCI{,-X,e} devices:
+#
+#   pci=[ ''[SSSS:]BB:DD.F[,option1[,option2[...]]]'', ... ]
+#
+#   [SSSS]:BB:DD.F  "bus segment:bus:device.function"(1) of the
device to
+#                   be assigned, bus segment is optional. All fields are
+#                   in hexadecimal and no field should be longer than that
+#                   as shown in the pattern. Successful assignment may need
+#                   certain hardware support and additional configurations
+#                   (e.g. VT-d, see docs/misc/vtd.txt for more details).
+#
+#       (1) bus segment is sometimes also referred to as the PCI
"domain",
+#           not to be confused with Xen domain.
+#
+#
+#   optionN         per-device options in "key=val" format. Current
+#                   available options are:
+#                   - msitranslate=0|1
+#                      per-device overriden of pci_msitranslate, see below
+#
+#pci=[ ''07:00.0'', ''07:00.1'' ]
+
+#   MSI-INTx translation for MSI capable devices:
+#
+#   If it''s set, Xen will enable MSI for the device that supports it
even
+# if the guest don''t use MSI. In the case, an IO-APIC type interrupt
will
+# be injected to the guest every time a corresponding MSI message is
+# received.
+#   If the guest enables MSI or MSI-X, the translation is automatically
+# turned off.
+# 
+#pci_msitranslate=1
 
 #-----------------------------------------------------------------------------
 #   Configure PVSCSI devices:

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-08 10:44 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

I have one question.

MSI interrupt is edge-triggered, and INTx interrupt is level-triggered.
Guest OS handles translated interrupt as level-triggered, though physical
interrupt is edge-triggered. When two interrupts are raised during short
period, Guest OS might lose 2nd interrupt, I think.

What do you think?

Thanks,
--
Shohei Fujiwara

On Thu,  8 Jan 2009 17:06:43 +0800
Qing He <qing.he@intel.com> wrote:
> This patchset enables MSI-INTx interrupt translation for HVM
> domains. The intention of the patch is to use MSI as the physical
> interrupt mechanism for passthrough device as much as possible,
> thus reducing the pirq sharing among domains.
> 
> When MSI is globally enabled, if the device has the MSI capability
> but doesn''t used by the guest, hypervisor will try to user MSI as
> the underlying pirq and inject translated INTx irq to guest
> vioapic. When guest itself enabled MSI or MSI-X, the translation
> is automatically turned off.
> 
> Add a config file option to disable/enable this feature. Also, in
> order to allow the user to override the option per device, a
> per-device option mechanism is implemented and an MSI-INTx option
> is added
> 
> The patch set contains 6 patches and is organized as below:
>     [PATCH 1/6] hypervisor patch
>    *[PATCH 2/6] qemu patch
>     [PATCH 3/6] add pci options in domain config file
>     [PATCH 4/6] XenAPI version of the above patch
>     [PATCH 5/6] add global and per-device option for MSI-INTx
>     [PATCH 6/6] documentation and example
> 
>  * against ioemu-remote
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-08 14:52 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Thu, 2009-01-08 at 18:44 +0800, Shohei Fujiwara
wrote:> I have one question.
> 
> MSI interrupt is edge-triggered, and INTx interrupt is level-triggered.
> Guest OS handles translated interrupt as level-triggered, though physical
> interrupt is edge-triggered. When two interrupts are raised during short
> period, Guest OS might lose 2nd interrupt, I think.
This problem is handled by a different EOI timing. As soon as the
hypervisor receives an MSI, it issues the EOI ASAP, and the duration of
the injected level-triggered IRQ and guest EOI are all handled by the
virtual APICs. If a 2nd interrupt comes up at this time, the hypervisor
can receive the EOI and this results in a pending IRQ in virtual APICs
instead of lost.

Generally, it''s easy to "translate" an edged interrupt to a
level one,
but not the other way.

Thanks,
Qing> 
> What do you think?
> 
> Thanks,
> --
> Shohei Fujiwara
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-09 04:26 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Thank you for your reply.

On Thu, 8 Jan 2009 22:52:00 +0800
Qing He <qing.he@intel.com> wrote:
> On Thu, 2009-01-08 at 18:44 +0800, Shohei Fujiwara wrote:
> > I have one question.
> > 
> > MSI interrupt is edge-triggered, and INTx interrupt is
level-triggered.
> > Guest OS handles translated interrupt as level-triggered, though
physical
> > interrupt is edge-triggered. When two interrupts are raised during
short
> > period, Guest OS might lose 2nd interrupt, I think.
> 
> This problem is handled by a different EOI timing. As soon as the
> hypervisor receives an MSI, it issues the EOI ASAP, and the duration of
> the injected level-triggered IRQ and guest EOI are all handled by the
> virtual APICs. If a 2nd interrupt comes up at this time, the hypervisor
> can receive the EOI and this results in a pending IRQ in virtual APICs
> instead of lost.
There is the assumption that Guest OS handles all causes which happen
before Guest OS receives the interrupt. But is the assumption right for
all OS?

In the case of level-triggerd interrupt, I/O device asserts interrupt
line, when the cause of interrupt happens. OS handles the cause,
I/O device de-asserts interrupt, and OS sends EOI to APICs.

When I/O APIC receives EOI, I/O APIC re-transmits interrupt to Local APIC 
if some interrupt line is asserted.

Some OS might rely on this re-transmittion by I/O APIC. 

This is the part of "handle_IRQ_event" in linux 2.6.18:

	do {
		ret = action->handler(irq, action->dev_id, regs);
		if (ret == IRQ_HANDLED)
			status |= action->flags;
		retval |= ret;
		action = action->next;
	} while (action);

The assumption is right for linux.

But other OS might have the code like the following:

	do {
		ret = action->handler(irq, action->dev_id, regs);
		if (ret == IRQ_HANDLED) {
			status |= action->flags;
			retval |= ret;
			break;
			^^^^^^
		}
		action = action->next;
	} while (action);

This code will work on real machine, because I/O APIC re-transmits
interrupt, if the cause to be handled remains.  If some OS has the
code like the above, the assumption isn''t right.

Actually, my concern is whether the assumption is right for Windows,
or not. Do you know about this, or does your patch works well with
Windows guest?

Thanks,
--
Shohei Fujiwara
> Generally, it''s easy to "translate" an edged interrupt
to a level one,
> but not the other way.
> 
> Thanks,
> Qing
> > 
> > What do you think?
> > 
> > Thanks,
> > --
> > Shohei Fujiwara
> > 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-09 06:57 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Fri, 2009-01-09 at 12:26 +0800, Shohei Fujiwara
wrote:> There is the assumption that Guest OS handles all causes which happen
> before Guest OS receives the interrupt. But is the assumption right for
> all OS?
> 
> In the case of level-triggerd interrupt, I/O device asserts interrupt
> line, when the cause of interrupt happens. OS handles the cause,
> I/O device de-asserts interrupt, and OS sends EOI to APICs.
> 
> When I/O APIC receives EOI, I/O APIC re-transmits interrupt to Local APIC 
> if some interrupt line is asserted.
> 
> Some OS might rely on this re-transmittion by I/O APIC. 
> 
> 
> But other OS might have the code like the following:
> 
> 	do {
> 		ret = action->handler(irq, action->dev_id, regs);
> 		if (ret == IRQ_HANDLED) {
> 			status |= action->flags;
> 			retval |= ret;
> 			break;
> 			^^^^^^
> 		}
> 		action = action->next;
> 	} while (action);
> Hmm, I think now I understand what you mean. If the guest irq is shared by
a normal IRQ and a MSI-INTx translated IRQ, two sources may assert the
pin while they both get pending. When this irq is injected, if the guest
only handles one irq source each time, and issues EOI right after it
clears the normal IRQ, the MSI is lost. Is it what you mean?

There is logic to avoid this from happening, see
hvm_irq->gsi_assert_count[gsi]. Basically, it''s used to count how
many
sources have asserted a shared pin. And at the time of  EOI, after the
decrement of the counter, if it''s still not 0, the LAPIC is
re-asserted.
This may result in some spurious interrupts to guest, but that''s better
than losing interrupts.

The sharing of guest irq is generally not a good idea, in fact, this is
not even well supported in current Xen code. You may have seen something
like "girq[ggsi].mirq = mirq". That way, we are already stuck.
Currently, if the number of assigned devices is <= 8, there should be no
sharing, otherwise, very weird things may happen...

Uncommon devices of OS does have the possibility to fail MSI-INTx, for
example, if the device doesn''t behave the same way using INTx and MSI,
or the guest OS doesn''t always clear guest source before issuing EOI.
That''s why I add the per-device disable function: if a device or OS
doesn''t work properly, just turn it off. Fortunately, this is extremely
rare.

Btw, AFAIK, Windows handles all irq sources in one ISR, similar to
Linux.

Thanks,
Qing> 
> This code will work on real machine, because I/O APIC re-transmits
> interrupt, if the cause to be handled remains.  If some OS has the
> code like the above, the assumption isn''t right.
> 
> Actually, my concern is whether the assumption is right for Windows,
> or not. Do you know about this, or does your patch works well with
> Windows guest?
> 
> Thanks,
> --
> Shohei Fujiwara
> 
> > Generally, it''s easy to "translate" an edged
interrupt to a level one,
> > but not the other way.
> > 
> > Thanks,
> > Qing
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > --
> > > Shohei Fujiwara
> > > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-13 09:05 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

What I understand are the followings.

    - When guest gsi isn''t shared, MSI-INTx interrupt translation works
fine.
    - When guest gsi is shared between passthrough device and emulated
      device, MSI-INTx interrupt translation work, though guest OS
      receives spurious interrupts.
    - Sharing guest gsi among passthrough devices isn''t supported.
    - There are some unsuitable devices for MSI-INTx interrupt translation.


May I ask you any additional questions?

First, why can''t we assign more than 8 devices?
In the view of guest OS, assigned device is always single function
device. This means assigned devices use only INTA. And Interrupt
routing in hypervisor is shown as follows.
>From xen/include/asm-x86/hvm/irq.c:    #define hvm_pci_intx_gsi(dev, intx)  \
        (((((dev)<<2) + ((dev)>>3) + (intx)) & 31) + 16)

I think sharing guest gsi among passthrough devices doesn''t
occur if assigned device is <= 32.


Second, it is nice to create blacklist of unsuitable devices for MSI-INTx
interrupt, isn''t it? The reason is the problem seems device-specific
problem.

Thanks,
--
Shohei Fujiwara


On Fri, 9 Jan 2009 14:57:16 +0800
Qing He <qing.he@intel.com> wrote:
> On Fri, 2009-01-09 at 12:26 +0800, Shohei Fujiwara wrote:
> > There is the assumption that Guest OS handles all causes which happen
> > before Guest OS receives the interrupt. But is the assumption right
for
> > all OS?
> > 
> > In the case of level-triggerd interrupt, I/O device asserts interrupt
> > line, when the cause of interrupt happens. OS handles the cause,
> > I/O device de-asserts interrupt, and OS sends EOI to APICs.
> > 
> > When I/O APIC receives EOI, I/O APIC re-transmits interrupt to Local
APIC
> > if some interrupt line is asserted.
> > 
> > Some OS might rely on this re-transmittion by I/O APIC. 
> > 
> 
> > 
> > But other OS might have the code like the following:
> > 
> > 	do {
> > 		ret = action->handler(irq, action->dev_id, regs);
> > 		if (ret == IRQ_HANDLED) {
> > 			status |= action->flags;
> > 			retval |= ret;
> > 			break;
> > 			^^^^^^
> > 		}
> > 		action = action->next;
> > 	} while (action);
> > 
> Hmm, I think now I understand what you mean. If the guest irq is shared by
> a normal IRQ and a MSI-INTx translated IRQ, two sources may assert the
> pin while they both get pending. When this irq is injected, if the guest
> only handles one irq source each time, and issues EOI right after it
> clears the normal IRQ, the MSI is lost. Is it what you mean?
> 
> There is logic to avoid this from happening, see
> hvm_irq->gsi_assert_count[gsi]. Basically, it''s used to count
how many
> sources have asserted a shared pin. And at the time of  EOI, after the
> decrement of the counter, if it''s still not 0, the LAPIC is
re-asserted.
> This may result in some spurious interrupts to guest, but that''s
better
> than losing interrupts.
> 
> The sharing of guest irq is generally not a good idea, in fact, this is
> not even well supported in current Xen code. You may have seen something
> like "girq[ggsi].mirq = mirq". That way, we are already stuck.
> Currently, if the number of assigned devices is <= 8, there should be no
> sharing, otherwise, very weird things may happen...
> 
> Uncommon devices of OS does have the possibility to fail MSI-INTx, for
> example, if the device doesn''t behave the same way using INTx and
MSI,
> or the guest OS doesn''t always clear guest source before issuing
EOI.
> That''s why I add the per-device disable function: if a device or
OS
> doesn''t work properly, just turn it off. Fortunately, this is
extremely
> rare.
> 
> Btw, AFAIK, Windows handles all irq sources in one ISR, similar to
> Linux.
> 
> Thanks,
> Qing
> > 
> > This code will work on real machine, because I/O APIC re-transmits
> > interrupt, if the cause to be handled remains.  If some OS has the
> > code like the above, the assumption isn''t right.
> > 
> > Actually, my concern is whether the assumption is right for Windows,
> > or not. Do you know about this, or does your patch works well with
> > Windows guest?
> > 
> > Thanks,
> > --
> > Shohei Fujiwara
> > 
> > > Generally, it''s easy to "translate" an edged
interrupt to a level one,
> > > but not the other way.
> > > 
> > > Thanks,
> > > Qing
> > > > 
> > > > What do you think?
> > > > 
> > > > Thanks,
> > > > --
> > > > Shohei Fujiwara
> > > > 
> > > 
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.xensource.com
> > > http://lists.xensource.com/xen-devel
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-13 09:28 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Tue, 2009-01-13 at 17:05 +0800, Shohei Fujiwara
wrote:> What I understand are the followings.
> 
>     - When guest gsi isn''t shared, MSI-INTx interrupt translation
works fine.
>     - When guest gsi is shared between passthrough device and emulated
>       device, MSI-INTx interrupt translation work, though guest OS
>       receives spurious interrupts.
>     - Sharing guest gsi among passthrough devices isn''t supported.
Yes.
>     - There are some unsuitable devices for MSI-INTx interrupt translation.
I still don''t know an example of such unsuitable devices. And to be
exact, it should be device/guest OS driver combination. The improper
function may be caused by the guest driver (e.g. not cleaning irq source
before issuing EOI)
> 
> 
> May I ask you any additional questions?
> 
> First, why can''t we assign more than 8 devices?
> In the view of guest OS, assigned device is always single function
> device. This means assigned devices use only INTA. And Interrupt
> routing in hypervisor is shown as follows.
Well, I just checked the code, seems I''m using the stale knowledge.
The assigned devices didn''t fix to INTA a few months ago, at that time,
it was 8, since (0:3.1) collided with (0:11.0).

It''s now 32, and the situation is definitely much better.
> 
> From xen/include/asm-x86/hvm/irq.c:
>     #define hvm_pci_intx_gsi(dev, intx)  \
>         (((((dev)<<2) + ((dev)>>3) + (intx)) & 31) + 16)
> 
> I think sharing guest gsi among passthrough devices doesn''t
> occur if assigned device is <= 32.
> 
> 
> Second, it is nice to create blacklist of unsuitable devices for MSI-INTx
> interrupt, isn''t it? The reason is the problem seems
device-specific problem.
This is reasonable, however, I don''t know a real unsuitable at this
time. Isn''t it better to add the blacklist when there is the real need?

What do you think?

Thanks,
Qing
> 
> Thanks,
> --
> Shohei Fujiwara
> 
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-14 06:39 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Tue, 13 Jan 2009 17:28:13 +0800
Qing He <qing.he@intel.com> wrote:
> On Tue, 2009-01-13 at 17:05 +0800, Shohei Fujiwara wrote:
> > What I understand are the followings.
> > Second, it is nice to create blacklist of unsuitable devices for
MSI-INTx
> > interrupt, isn''t it? The reason is the problem seems
device-specific problem.
> 
> This is reasonable, however, I don''t know a real unsuitable at
this
> time. Isn''t it better to add the blacklist when there is the real
need?
> 
> What do you think?
I think adding blacklist now is better, because it is easy to specify
an unsuitable device when it is found. But I don''t know what other
developers think about this.

BTW, I found a issue when I read your code.

MSI-INTx translation is enabled when guest domain is started.
When guest OS enables MSI/MSI-X, MSI-INTx translation is disabled.
But when guest OS disables MSI/MSI-X, MSI-INTx translation isn''t
re-enabled.
This means normal INTx interrupt is used.

I think the results should be the same at all times,
so it is good to re-enable MSI-INTx translation.

Thanks,
--
Shohei Fujiwara

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-14 07:38 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Wed, 2009-01-14 at 14:39 +0800, Shohei Fujiwara
wrote:> I think adding blacklist now is better, because it is easy to specify
> an unsuitable device when it is found. But I don''t know what other
> developers think about this.
I''m OK either way
> 
> 
> BTW, I found a issue when I read your code.
> 
> MSI-INTx translation is enabled when guest domain is started.
> When guest OS enables MSI/MSI-X, MSI-INTx translation is disabled.
> But when guest OS disables MSI/MSI-X, MSI-INTx translation isn''t
re-enabled.
> This means normal INTx interrupt is used.
> 
> I think the results should be the same at all times,
> so it is good to re-enable MSI-INTx translation.
There is a complication here. Some guests, notably certain versions of
Linux (2.6.25 for instance) uses MSI enable bit as mask, like:
>>  static void msi_set_mask_bit(unsigned int irq, int flag)
>>  {
>>  	...
>>  	switch (entry->msi_attrib.type) {
>>  	case PCI_CAP_ID_MSI:
>>  		if (entry->msi_attrib.maskbit) {
>>  			...
>>  		} else {
>>  			msi_set_enable(entry->dev, !flag);
>>  		}
>>  		break;
>>  	...
(It is said not work for some devices since pending msi may be lost in
this way, so it got reverted in later versions)

If the guest extensively uses this technique, re-enabling MSI-INTx during
mask would be a pain, since it needs several hypercalls and thus hurts
performance.


Furthermore, why a guest driver would want to disable MSI if it can benefit
from it (Maybe because MSI doesn''t work:-)? The presumption behind not
re-enabling the translation is that it''s seldom that guest would want
to
disable-after-enable the MSI. And it''s more efficient for
guests like linux-2.6.25.
> 
> Thanks,
> --
> Shohei Fujiwara
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-14 08:26 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Wed, 14 Jan 2009 15:38:46 +0800
Qing He <qing.he@intel.com> wrote:
> On Wed, 2009-01-14 at 14:39 +0800, Shohei Fujiwara wrote:
> > I think adding blacklist now is better, because it is easy to specify
> > an unsuitable device when it is found. But I don''t know what
other
> > developers think about this.
> 
> I''m OK either way
> 
> > 
> > 
> > BTW, I found a issue when I read your code.
> > 
> > MSI-INTx translation is enabled when guest domain is started.
> > When guest OS enables MSI/MSI-X, MSI-INTx translation is disabled.
> > But when guest OS disables MSI/MSI-X, MSI-INTx translation
isn''t re-enabled.
> > This means normal INTx interrupt is used.
> > 
> > I think the results should be the same at all times,
> > so it is good to re-enable MSI-INTx translation.
> 
> There is a complication here. Some guests, notably certain versions of
> Linux (2.6.25 for instance) uses MSI enable bit as mask, like:
> 
> >>  static void msi_set_mask_bit(unsigned int irq, int flag)
> >>  {
> >>  	...
> >>  	switch (entry->msi_attrib.type) {
> >>  	case PCI_CAP_ID_MSI:
> >>  		if (entry->msi_attrib.maskbit) {
> >>  			...
> >>  		} else {
> >>  			msi_set_enable(entry->dev, !flag);
> >>  		}
> >>  		break;
> >>  	...
> 
> (It is said not work for some devices since pending msi may be lost in
> this way, so it got reverted in later versions)
> 
> If the guest extensively uses this technique, re-enabling MSI-INTx during
> mask would be a pain, since it needs several hypercalls and thus hurts
> performance.
>
> Furthermore, why a guest driver would want to disable MSI if it can benefit
> from it (Maybe because MSI doesn''t work:-)? The presumption behind
not
> re-enabling the translation is that it''s seldom that guest would
want to
> disable-after-enable the MSI. And it''s more efficient for
> guests like linux-2.6.25.
Basically, I would not like to assume any behavior of guest OS or
user. User might unload device driver at runtime, and load it again
disabling MSI. So we still need to use I/O APIC. we are not freed
from sharing machine GSI problem.

Thanks,
--
Shohei Fujiwara


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-14 09:17 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Wed, 2009-01-14 at 16:26 +0800, Shohei Fujiwara
wrote:> > If the guest extensively uses this technique, re-enabling MSI-INTx
during
> > mask would be a pain, since it needs several hypercalls and thus hurts
> > performance.
> >
> > Furthermore, why a guest driver would want to disable MSI if it can
benefit
> > from it (Maybe because MSI doesn''t work:-)? The presumption
behind not
> > re-enabling the translation is that it''s seldom that guest
would want to
> > disable-after-enable the MSI. And it''s more efficient for
> > guests like linux-2.6.25.
> 
> Basically, I would not like to assume any behavior of guest OS or
> user. User might unload device driver at runtime, and load it again
> disabling MSI. So we still need to use I/O APIC. we are not freed
> from sharing machine GSI problem.
For the user unloading and reloading issue, if the driver uses MSI for
both times, it works fine. If the driver uses INTx for both times
without ever enabling MSI, it also works fine with MSI-INTx translation
on at all time. Only if the user starts with driver in MSI mode, unloads
it, and then reloads it in legacy irq mode, he may go back to no
translation, but why does he want to do this? And even in this case,
it still works correctly, just without MSI-INTx translation.

This patch is never intended to be a perfect for all solution, as
mentioned early, if the device or guest driver does something nasty, it
would fail, that''s why the option is added in the first place.

Since INTx is considered leagacy regarding to MSI, the bottom line is
that the patch should not hurt guest passthrough MSI performance, which
is not the case if MSI-INTx is turned on when MSI is simply masked
(implemented as disabling by some guests).

Using INTx after enabling and disabling MSI may not be as fast as it
can be (it''s definitely no slower). But considering the rarity of this
situation, I would think the trade-off is just sound.

What do you think?

Thanks,
Qing

> 
> Thanks,
> --
> Shohei Fujiwara
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-15 02:35 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

I think MSI-INTx translation can be enhanced.

In case of using MSI enable bit as mask, guest OS need to set interrupt
disable bit to prevent device from asserting INTx. Actually, linux
2.6.25 set interrupt disable bit in
drivers/pci/msi.c:msi_capability_init.  If interrupt disable bit is 1,
we can assume guest OS might use MSI enable bit as mask. If interrupt
disable bit is 0 and MSI enable bit is 0, we can assume guest OS uses
INTx.

I think we can archive both of good performance and not-using INTx, if
we control MSI-INTx translation as follows.

MSI/MSI-X  | Interrupt   | MSI-INTx    | Note
enable bit | disable bit | translation |
-----------+-------------+-------------+------------------------------
0          | 0           | enable      | Re-enabling MSI-INTx 
           |             |             | translation is needed.
-----------+-------------+-------------+------------------------------
0          | 1           | disable     | Guest OS might use MSI enable
           |             |             | bit as mask.
-----------+-------------+-------------+------------------------------
1          | 0           | disable     | Guest OS use MSI/MSI-X.
-----------+-------------+-------------+------------------------------
1          | 1           | disable     | Guest OS use MSI/MSI-X.

It is great if we can guarantee INTx interrupt isn''t used.
We don''t need to consider sharing machine gsi. We don''t need
to consider
hot-pluging I/O APIC. In long term, It will be possible even to to remove
I/O APIC from real machine.

What do you think?

Thanks
--
Shohei Fujiwara

On Wed, 14 Jan 2009 17:17:52 +0800
Qing He <qing.he@intel.com> wrote:
> On Wed, 2009-01-14 at 16:26 +0800, Shohei Fujiwara wrote:
> > > If the guest extensively uses this technique, re-enabling
MSI-INTx during
> > > mask would be a pain, since it needs several hypercalls and thus
hurts
> > > performance.
> > >
> > > Furthermore, why a guest driver would want to disable MSI if it
can benefit
> > > from it (Maybe because MSI doesn''t work:-)? The
presumption behind not
> > > re-enabling the translation is that it''s seldom that
guest would want to
> > > disable-after-enable the MSI. And it''s more efficient
for
> > > guests like linux-2.6.25.
> > 
> > Basically, I would not like to assume any behavior of guest OS or
> > user. User might unload device driver at runtime, and load it again
> > disabling MSI. So we still need to use I/O APIC. we are not freed
> > from sharing machine GSI problem.
> 
> For the user unloading and reloading issue, if the driver uses MSI for
> both times, it works fine. If the driver uses INTx for both times
> without ever enabling MSI, it also works fine with MSI-INTx translation
> on at all time. Only if the user starts with driver in MSI mode, unloads
> it, and then reloads it in legacy irq mode, he may go back to no
> translation, but why does he want to do this? And even in this case,
> it still works correctly, just without MSI-INTx translation.
> 
> This patch is never intended to be a perfect for all solution, as
> mentioned early, if the device or guest driver does something nasty, it
> would fail, that''s why the option is added in the first place.
> 
> Since INTx is considered leagacy regarding to MSI, the bottom line is
> that the patch should not hurt guest passthrough MSI performance, which
> is not the case if MSI-INTx is turned on when MSI is simply masked
> (implemented as disabling by some guests).
> 
> Using INTx after enabling and disabling MSI may not be as fast as it
> can be (it''s definitely no slower). But considering the rarity of
this
> situation, I would think the trade-off is just sound.
> 
> What do you think?
> 
> 
> Thanks,
> Qing

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Jan-15 06:25 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Thu, 2009-01-15 at 10:35 +0800, Shohei Fujiwara
wrote:> I think MSI-INTx translation can be enhanced.
> 
> In case of using MSI enable bit as mask, guest OS need to set interrupt
> disable bit to prevent device from asserting INTx. Actually, linux
> 2.6.25 set interrupt disable bit in
> drivers/pci/msi.c:msi_capability_init.  If interrupt disable bit is 1,
> we can assume guest OS might use MSI enable bit as mask. If interrupt
> disable bit is 0 and MSI enable bit is 0, we can assume guest OS uses
> INTx.
This should be better, but cannot garauntee it. Since INTx disable bit
was added in PCI 2.3. Early guests may exist not using it. Also, some
devices have problems, that setting INTx disable bit also completely
disables MSI from signalling, so OS may not touch INTx disable bit of
these devices.

Theoretically, MSI-enable-as-masking is not correct and should not be
used in the first place. If that were true it would be much easier for
us. However, the existing (and widely used) guest complicates this matter...
> 
> I think we can archive both of good performance and not-using INTx, if
> we control MSI-INTx translation as follows.
> 
> MSI/MSI-X  | Interrupt   | MSI-INTx    | Note
> enable bit | disable bit | translation |
> -----------+-------------+-------------+------------------------------
> 0          | 0           | enable      | Re-enabling MSI-INTx 
>            |             |             | translation is needed.
> -----------+-------------+-------------+------------------------------
> 0          | 1           | disable     | Guest OS might use MSI enable
>            |             |             | bit as mask.
> -----------+-------------+-------------+------------------------------
> 1          | 0           | disable     | Guest OS use MSI/MSI-X.
> -----------+-------------+-------------+------------------------------
> 1          | 1           | disable     | Guest OS use MSI/MSI-X.
> 
> It is great if we can guarantee INTx interrupt isn''t used.
> We don''t need to consider sharing machine gsi. We don''t
need to consider
> hot-pluging I/O APIC. In long term, It will be possible even to to remove
> I/O APIC from real machine.
I am not against it, it does look more symmetric, though in the cost of
a little overhead (4 hypercalls for a mask operation) in the worst
(but rare) case. I''ll ack the patch if you add it now, or you can wait
for me to add it at some later time.

Thanks,
Qing
> 
> What do you think?
> 
> Thanks
> --
> Shohei Fujiwara
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Jan-16 04:34 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Thank you for long discussion.

On Thu, 15 Jan 2009 14:25:14 +0800
Qing He <qing.he@intel.com> wrote:
> On Thu, 2009-01-15 at 10:35 +0800, Shohei Fujiwara wrote:
> > I think MSI-INTx translation can be enhanced.
> > 
> > In case of using MSI enable bit as mask, guest OS need to set
interrupt
> > disable bit to prevent device from asserting INTx. Actually, linux
> > 2.6.25 set interrupt disable bit in
> > drivers/pci/msi.c:msi_capability_init.  If interrupt disable bit is 1,
> > we can assume guest OS might use MSI enable bit as mask. If interrupt
> > disable bit is 0 and MSI enable bit is 0, we can assume guest OS uses
> > INTx.
> 
> This should be better, but cannot garauntee it. Since INTx disable bit
> was added in PCI 2.3. Early guests may exist not using it. Also, some
> devices have problems, that setting INTx disable bit also completely
> disables MSI from signalling, so OS may not touch INTx disable bit of
> these devices.
I think we can guarantee it, even if OS doesn''t touch INTx disable bit.
When OS set 1 to MSI/MSI-X enable bit, qemu will disable MSI-INTx
translation.
When OS set 0 to MSI/MSI-X enable bit, qemu will re-enable MSI-INTx
translation.
> Theoretically, MSI-enable-as-masking is not correct and should not be
> used in the first place. If that were true it would be much easier for
> us. However, the existing (and widely used) guest complicates this
matter...
I agree with you that we need make MSI-enable-as-masking work on HVM
domain.
> > I think we can archive both of good performance and not-using INTx, if
> > we control MSI-INTx translation as follows.
> > 
> > MSI/MSI-X  | Interrupt   | MSI-INTx    | Note
> > enable bit | disable bit | translation |
> > -----------+-------------+-------------+------------------------------
> > 0          | 0           | enable      | Re-enabling MSI-INTx 
> >            |             |             | translation is needed.
> > -----------+-------------+-------------+------------------------------
> > 0          | 1           | disable     | Guest OS might use MSI enable
> >            |             |             | bit as mask.
> > -----------+-------------+-------------+------------------------------
> > 1          | 0           | disable     | Guest OS use MSI/MSI-X.
> > -----------+-------------+-------------+------------------------------
> > 1          | 1           | disable     | Guest OS use MSI/MSI-X.
> > 
> > It is great if we can guarantee INTx interrupt isn''t used.
> > We don''t need to consider sharing machine gsi. We
don''t need to consider
> > hot-pluging I/O APIC. In long term, It will be possible even to to
remove
> > I/O APIC from real machine.
> 
> I am not against it, it does look more symmetric, though in the cost of
> a little overhead (4 hypercalls for a mask operation) in the worst
> (but rare) case. I''ll ack the patch if you add it now, or you can
wait
> for me to add it at some later time.
Currently I''m trying to enabling PCI pass-through with stub domain.
I''m happy if you can add it.

Thanks,
--
Shohei Fujiwara


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Feb-27 02:41 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Thu, 8 Jan 2009 22:52:00 +0800
Qing He <qing.he@intel.com> wrote:
> On Thu, 2009-01-08 at 18:44 +0800, Shohei Fujiwara wrote:
> > I have one question.
> > 
> > MSI interrupt is edge-triggered, and INTx interrupt is
level-triggered.
> > Guest OS handles translated interrupt as level-triggered, though
physical
> > interrupt is edge-triggered. When two interrupts are raised during
short
> > period, Guest OS might lose 2nd interrupt, I think.
> 
> This problem is handled by a different EOI timing. As soon as the
> hypervisor receives an MSI, it issues the EOI ASAP, and the duration of
> the injected level-triggered IRQ and guest EOI are all handled by the
> virtual APICs. If a 2nd interrupt comes up at this time, the hypervisor
> can receive the EOI and this results in a pending IRQ in virtual APICs
> instead of lost.
I found hypervisor doesn''t issue EOI before injecting the
interrupt to guest domain, if MSI isn''t maskable. 
The 2nd interrupt may be lost. What do you think about this?

xen/arch/x86/irq.c:
asmlinkage void do_IRQ(struct cpu_user_regs *regs)
{
    unsigned int      vector = regs->entry_vector;
    irq_desc_t       *desc = &irq_desc[vector];
    struct irqaction *action;

    perfc_incr(irqs);

    spin_lock(&desc->lock);
    desc->handler->ack(vector);


xen/arch/x86/io_apic.c:
static void ack_msi_vector(unsigned int vector)
{
    if ( msi_maskable_irq(irq_desc[vector].msi_desc) )
        ack_APIC_irq(); /* ACKTYPE_NONE */
}


Thanks,
--
Shohei Fujiwara

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2009-Mar-01 14:55 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On 27/02/2009 02:41, "Shohei Fujiwara"
<fujiwara-sxa@necst.nec.co.jp> wrote:
> I found hypervisor doesn''t issue EOI before injecting the
> interrupt to guest domain, if MSI isn''t maskable.
> The 2nd interrupt may be lost. What do you think about this?
The second interrupt should not be lost! It should be marked in the local
APIC''s IRR bitmap and should hence cause an interrupt to fire as soon
as the
in-service interrupt is EOIed (and hence cleared from the local APIC''s
ISR
bitmap).

Are you thus *sure* this is really the root-cause of your problem?

Could it be that the MSI *is* maskable, and that masking/unmasking loses
interrupt messages, for example?

 -- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Mar-02 07:19 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Thank you for your reply.

On Sun, 01 Mar 2009 14:55:21 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 27/02/2009 02:41, "Shohei Fujiwara"
<fujiwara-sxa@necst.nec.co.jp> wrote:
> 
> > I found hypervisor doesn''t issue EOI before injecting the
> > interrupt to guest domain, if MSI isn''t maskable.
> > The 2nd interrupt may be lost. What do you think about this?
> 
> The second interrupt should not be lost! It should be marked in the local
> APIC''s IRR bitmap and should hence cause an interrupt to fire as
soon as the
> in-service interrupt is EOIed (and hence cleared from the local
APIC''s ISR
> bitmap).
I understand that we don''t need to issue EOI before injecting the
interrupt to guest domain.
> Are you thus *sure* this is really the root-cause of your problem?
When I was reading the code on Friday, I found the hypervisor doesn''t
issue EOI before injecting the interrupt to guest domain. This seemed
to be different from Qing''s explanation, and then I made the question.
So I have no real problem.


Today, I found the hypervisor issues EOI ASAP. So Qing''s explanation
reflected the implementation of the hypervisor.

xen/drivers/passthrough/io.c:
void hvm_dirq_assist(struct vcpu *v)
{
...
            if ( hvm_irq_dpci->mirq[irq].flags & HVM_IRQ_DPCI_TRANSLATE )
            {
                /* for translated MSI to INTx interrupt, eoi as early as
possible */
                __msi_pirq_eoi(d, irq);
            }
            
Thanks,
--
Shohei Fujiwara

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2009-Mar-02 08:47 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On 02/03/2009 07:19, "Shohei Fujiwara"
<fujiwara-sxa@necst.nec.co.jp> wrote:
> Today, I found the hypervisor issues EOI ASAP. So Qing''s
explanation
> reflected the implementation of the hypervisor.
> 
> xen/drivers/passthrough/io.c:
> void hvm_dirq_assist(struct vcpu *v)
> {
> ...
>             if ( hvm_irq_dpci->mirq[irq].flags &
HVM_IRQ_DPCI_TRANSLATE )
>             {
>                 /* for translated MSI to INTx interrupt, eoi as early as
> possible */
>                 __msi_pirq_eoi(d, irq);
>             }
Ah, well the passthrough-specific logic is not something I''m so
familiar
with! That above code indeed doesn''t look correct. :-)

 -- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Qing He

2009-Mar-02 09:24 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Mon, 2009-03-02 at 16:47 +0800, Keir Fraser wrote:> On 02/03/2009 07:19, "Shohei Fujiwara"
<fujiwara-sxa@necst.nec.co.jp> wrote:
> 
> > Today, I found the hypervisor issues EOI ASAP. So Qing''s
explanation
> > reflected the implementation of the hypervisor.
> > 
> > xen/drivers/passthrough/io.c:
> > void hvm_dirq_assist(struct vcpu *v)
> > {
> > ...
> >             if ( hvm_irq_dpci->mirq[irq].flags &
HVM_IRQ_DPCI_TRANSLATE )
> >             {
> >                 /* for translated MSI to INTx interrupt, eoi as early
as
> > possible */
> >                 __msi_pirq_eoi(d, irq);
> >             }
> 
> Ah, well the passthrough-specific logic is not something I''m so
familiar
> with! That above code indeed doesn''t look correct. :-)
I recently learned that some device may keep issuing MSI until it is explicitly
cleared by software. In that case it just behaves like level-triggered
interrupt, and doing something like the code above will cause interrupt storm.
This is reflected in cs. 19065.

Do you mean this issue?

Thanks,
Qing

> 
>  -- Keir
> 
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2009-Mar-02 09:40 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On 02/03/2009 09:24, "Qing He" <qing.he@intel.com> wrote:
>> Ah, well the passthrough-specific logic is not something I''m
so familiar
>> with! That above code indeed doesn''t look correct. :-)
> 
> I recently learned that some device may keep issuing MSI until it is
> explicitly
> cleared by software. In that case it just behaves like level-triggered
> interrupt, and doing something like the code above will cause interrupt
storm.
> This is reflected in cs. 19065.
> 
> Do you mean this issue?
That would be the expected result of early EOI, yes. But Shohei was walking
about losing interrupts. I don''t know how we would do that.

 -- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shohei Fujiwara

2009-Mar-02 10:27 UTC

head link

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

On Mon, 02 Mar 2009 09:40:40 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 02/03/2009 09:24, "Qing He" <qing.he@intel.com> wrote:
> 
> >> Ah, well the passthrough-specific logic is not something
I''m so familiar
> >> with! That above code indeed doesn''t look correct. :-)
> > 
> > I recently learned that some device may keep issuing MSI until it is
> > explicitly
> > cleared by software. In that case it just behaves like level-triggered
> > interrupt, and doing something like the code above will cause
interrupt storm.
> > This is reflected in cs. 19065.
> > 
> > Do you mean this issue?
> 
> That would be the expected result of early EOI, yes. But Shohei was walking
> about losing interrupts. I don''t know how we would do that.
Currently I don''t think early EOI is needed, because IRR in local APIC
keeps interrupts. I don''t think losing interrupts will occur.

Thanks,
--
Shohei Fujiwara

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Xen devel - Jan 2009 - [PATCH 0/6] MSI-INTx interrupt translation for HVM

[Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

[Xen-devel] [PATCH 1/6] passthrough: MSI-INTx translation for HVM

[Xen-devel] [PATCH 2/6] ioemu:passthrough: MSI-INTx interrupt translation support

[Xen-devel] [PATCH 3/6] pci: add pci option support for XML-RPC server

[Xen-devel] [PATCH 4/6] pci: add pci option support for XenAPI server

[Xen-devel] [PATCH 5/6] pci: add config options for MSI-INTx translation in HVM

[Xen-devel] [PATCH 6/6] passthough: MSI-INTx translation documentation

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM

Re: [Xen-devel] [PATCH 0/6] MSI-INTx interrupt translation for HVM