The second is msi.c. I don''t understand it very well, and need to figure what to do for PVH. Would appreciate suggestions if anyone knows. Thanks for your time, Mukesh diff -r ba5e9253d04d xen/arch/x86/msi.c --- a/xen/arch/x86/msi.c Thu Nov 01 16:53:31 2012 -0700 +++ b/xen/arch/x86/msi.c Fri Dec 07 17:45:07 2012 -0800 @@ -766,6 +766,9 @@ static int msix_capability_init(struct p WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, dev->msix_pba.first, dev->msix_pba.last)); +/* PVH: fixme: not a clue what to do here :) */ +if (is_pvh_domain(dev->domain) && dev->domain->domain_id != 0) +{ if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, dev->msix_table.last) ) WARN(); @@ -793,6 +796,7 @@ static int msix_capability_init(struct p /* XXX How to deal with existing mappings? */ } } +} } WARN_ON(dev->msix_nr_entries != nr_entries); WARN_ON(dev->msix_table.first != (table_paddr >> PAGE_SHIFT));
>>> On 08.12.12 at 02:46, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > The second is msi.c. I don''t understand it very well, and need to > figure what to do for PVH. Would appreciate suggestions if anyone knows.Why do you think you need to do something specially for PVH here in the first place? The only adjustment I would expect might be needed is address translation (depending on how PVH deal with MMIO addresses). In any case is checking the domain ID here pointless - this code can only be run in the context of Dom0 anyway. What the code here does is protect the MSI-X table of a device from Dom0 write accesses - Xen itself needs to be able to fully control the mask bits, and hence Dom0 shouldn''t (and qemu, even if running in Dom0) mustn''t write to it (which was happening in the past). Jan> --- a/xen/arch/x86/msi.c Thu Nov 01 16:53:31 2012 -0700 > +++ b/xen/arch/x86/msi.c Fri Dec 07 17:45:07 2012 -0800 > @@ -766,6 +766,9 @@ static int msix_capability_init(struct p > WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, dev->msix_pba.first, > dev->msix_pba.last)); > > +/* PVH: fixme: not a clue what to do here :) */ > +if (is_pvh_domain(dev->domain) && dev->domain->domain_id != 0) > +{ > if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > dev->msix_table.last) ) > WARN(); > @@ -793,6 +796,7 @@ static int msix_capability_init(struct p > /* XXX How to deal with existing mappings? */ > } > } > +} > } > WARN_ON(dev->msix_nr_entries != nr_entries); > WARN_ON(dev->msix_table.first != (table_paddr >> PAGE_SHIFT));
On Mon, 10 Dec 2012 09:43:34 +0000 "Jan Beulich" <JBeulich@suse.com> wrote:> >>> On 08.12.12 at 02:46, Mukesh Rathor <mukesh.rathor@oracle.com> > >>> wrote: > > The second is msi.c. I don''t understand it very well, and need to > > figure what to do for PVH. Would appreciate suggestions if anyone > > knows. > > Why do you think you need to do something specially for PVH here > in the first place? The only adjustment I would expect might be > needed is address translation (depending on how PVH deal with > MMIO addresses).Ok, thanks. Looks like I''m missing some address translation somewhere, getting EPT violation from dom0. Time to read up more on msi-x and debug. thanks Mukesh
On Tue, 11 Dec 2012, Mukesh Rathor wrote:> On Mon, 10 Dec 2012 09:43:34 +0000 > "Jan Beulich" <JBeulich@suse.com> wrote: > > > >>> On 08.12.12 at 02:46, Mukesh Rathor <mukesh.rathor@oracle.com> > > >>> wrote: > > > The second is msi.c. I don''t understand it very well, and need to > > > figure what to do for PVH. Would appreciate suggestions if anyone > > > knows. > > > > Why do you think you need to do something specially for PVH here > > in the first place? The only adjustment I would expect might be > > needed is address translation (depending on how PVH deal with > > MMIO addresses). > > Ok, thanks. Looks like I''m missing some address translation somewhere, > getting EPT violation from dom0. Time to read up more on msi-x and debug.That''s strange because AFAIK Linux is never editing the MSI-X entries directly: give a look at arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps MSIs into pirqs using hypercalls. Xen should be the only one to touch the real MSI-X table.
On Tue, 11 Dec 2012 12:10:19 +0000 Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:> On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > On Mon, 10 Dec 2012 09:43:34 +0000 > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > > >>> On 08.12.12 at 02:46, Mukesh Rathor <mukesh.rathor@oracle.com> > > > >>> wrote: > > > > The second is msi.c. I don''t understand it very well, and need > > > > to figure what to do for PVH. Would appreciate suggestions if > > > > anyone knows. > > > > > > Why do you think you need to do something specially for PVH here > > > in the first place? The only adjustment I would expect might be > > > needed is address translation (depending on how PVH deal with > > > MMIO addresses). > > > > Ok, thanks. Looks like I''m missing some address translation > > somewhere, getting EPT violation from dom0. Time to read up more on > > msi-x and debug. > > That''s strange because AFAIK Linux is never editing the MSI-X entries > directly: give a look at > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps MSIs > into pirqs using hypercalls. Xen should be the only one to touch the > real MSI-X table.So, this is what''s happening. The side effect of : if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, dev->msix_table.last) ) WARN(); if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, dev->msix_pba.last) ) WARN(); in msix_capability_init() in xen is that the dom0 EPT entries that I''ve mapped are going from RW to read only. Then when dom0 accesses it, I get EPT violation. In case of pure PV, the PTE entry to access the iomem is RW, and the above rangeset adding doesn''t affect it. I don''t understand why? Looking into that now... thanks Mukesh
On Wed, 12 Dec 2012 17:15:23 -0800 Mukesh Rathor <mukesh.rathor@oracle.com> wrote:> On Tue, 11 Dec 2012 12:10:19 +0000 > Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > > On Mon, 10 Dec 2012 09:43:34 +0000 > > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > That''s strange because AFAIK Linux is never editing the MSI-X > > entries directly: give a look at > > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > > MSIs into pirqs using hypercalls. Xen should be the only one to > > touch the real MSI-X table. > > So, this is what''s happening. The side effect of : > > if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > dev->msix_table.last) ) > WARN(); > if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > dev->msix_pba.last) ) > WARN(); > > in msix_capability_init() in xen is that the dom0 EPT entries that > I''ve mapped are going from RW to read only. Then when dom0 accesses > it, I get EPT violation. In case of pure PV, the PTE entry to access > the iomem is RW, and the above rangeset adding doesn''t affect it. I > don''t understand why? Looking into that now...Ah, it''s coming from: ept_p2m_type_to_flags(): case p2m_mmio_direct: entry->r = entry->x = 1; entry->w = !rangeset_contains_singleton(mmio_ro_ranges, entry->mfn); I suppose, the best fix would be to add a check here for dom0 and let it have w access? thanks mukesh
>>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > On Wed, 12 Dec 2012 17:15:23 -0800 > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > >> On Tue, 11 Dec 2012 12:10:19 +0000 >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: >> > > On Mon, 10 Dec 2012 09:43:34 +0000 >> > > "Jan Beulich" <JBeulich@suse.com> wrote: >> > >> > That''s strange because AFAIK Linux is never editing the MSI-X >> > entries directly: give a look at >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps >> > MSIs into pirqs using hypercalls. Xen should be the only one to >> > touch the real MSI-X table. >> >> So, this is what''s happening. The side effect of : >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, >> dev->msix_table.last) ) >> WARN(); >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, >> dev->msix_pba.last) ) >> WARN(); >> >> in msix_capability_init() in xen is that the dom0 EPT entries that >> I''ve mapped are going from RW to read only. Then when dom0 accesses >> it, I get EPT violation. In case of pure PV, the PTE entry to access >> the iomem is RW, and the above rangeset adding doesn''t affect it. I >> don''t understand why? Looking into that now...As far as I was able to tell back at the time when I implemented this, existing code shouldn''t have mappings for these tables in place at the time these ranges get added here. But I noted in the patch description that this is a potential issue (and may need fixing if deemed severe enough - back then, apparently nobody really cared, perhaps largely because passthrough to PV guests isn''t considered fully secure anyway). Now - did that change? I.e. can you describe where the mappings come from that cause the problem here? Because if that is possible even with non-hostile code, that might warrant properly dealing with that, even if pretty involved (I don''t think there''s a way to find all mappings of a given page other than a brute force scan of all L1 page tables the domain currently has).> Ah, it''s coming from: > > ept_p2m_type_to_flags(): > case p2m_mmio_direct: > entry->r = entry->x = 1; > entry->w = !rangeset_contains_singleton(mmio_ro_ranges, entry->mfn); > > I suppose, the best fix would be to add a check here for dom0 and let it > have w access?No, absolutely not. Nor would that help with the same issue in a DomU. Jan
On Thu, 13 Dec 2012, Jan Beulich wrote:> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > On Wed, 12 Dec 2012 17:15:23 -0800 > > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > >> On Tue, 11 Dec 2012 12:10:19 +0000 > >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > >> > >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > >> > > >> > That''s strange because AFAIK Linux is never editing the MSI-X > >> > entries directly: give a look at > >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > >> > MSIs into pirqs using hypercalls. Xen should be the only one to > >> > touch the real MSI-X table. > >> > >> So, this is what''s happening. The side effect of : > >> > >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > >> dev->msix_table.last) ) > >> WARN(); > >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > >> dev->msix_pba.last) ) > >> WARN(); > >> > >> in msix_capability_init() in xen is that the dom0 EPT entries that > >> I''ve mapped are going from RW to read only. Then when dom0 accesses > >> it, I get EPT violation. In case of pure PV, the PTE entry to access > >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > >> don''t understand why? Looking into that now... > > As far as I was able to tell back at the time when I implemented > this, existing code shouldn''t have mappings for these tables in > place at the time these ranges get added here. But I noted in > the patch description that this is a potential issue (and may need > fixing if deemed severe enough - back then, apparently nobody > really cared, perhaps largely because passthrough to PV guests > isn''t considered fully secure anyway). > > Now - did that change? I.e. can you describe where the mappings > come from that cause the problem here?The generic Linux MSI-X handling code does that, before calling the arch specific msi setup function, for which we have a xen version (xen_initdom_setup_msi_irqs): pci_enable_msix -> msix_capability_init -> msix_map_region
>>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com>wrote:> On Thu, 13 Dec 2012, Jan Beulich wrote: >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: >> > On Wed, 12 Dec 2012 17:15:23 -0800 >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: >> > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: >> >> >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: >> >> > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X >> >> > entries directly: give a look at >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to >> >> > touch the real MSI-X table. >> >> >> >> So, this is what''s happening. The side effect of : >> >> >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, >> >> dev->msix_table.last) ) >> >> WARN(); >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, >> >> dev->msix_pba.last) ) >> >> WARN(); >> >> >> >> in msix_capability_init() in xen is that the dom0 EPT entries that >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I >> >> don''t understand why? Looking into that now... >> >> As far as I was able to tell back at the time when I implemented >> this, existing code shouldn''t have mappings for these tables in >> place at the time these ranges get added here. But I noted in >> the patch description that this is a potential issue (and may need >> fixing if deemed severe enough - back then, apparently nobody >> really cared, perhaps largely because passthrough to PV guests >> isn''t considered fully secure anyway). >> >> Now - did that change? I.e. can you describe where the mappings >> come from that cause the problem here? > > The generic Linux MSI-X handling code does that, before calling the > arch specific msi setup function, for which we have a xen version > (xen_initdom_setup_msi_irqs): > > pci_enable_msix -> msix_capability_init -> msix_map_regionAh, okay, (of course?) I had looked only at the forward ported version of this. Is all that fiddling with the mask bits really being suppressed properly when running under Xen? Otherwise pv-ops is quite broken in this regard at present... And if it is, I don''t see what the respective ioremap() is good for here in the Xen case. Jan
On Thu, 13 Dec 2012, Jan Beulich wrote:> >>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > wrote: > > On Thu, 13 Dec 2012, Jan Beulich wrote: > >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > >> > On Wed, 12 Dec 2012 17:15:23 -0800 > >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > >> > > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 > >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > >> >> > >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > >> >> > > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X > >> >> > entries directly: give a look at > >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to > >> >> > touch the real MSI-X table. > >> >> > >> >> So, this is what''s happening. The side effect of : > >> >> > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > >> >> dev->msix_table.last) ) > >> >> WARN(); > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > >> >> dev->msix_pba.last) ) > >> >> WARN(); > >> >> > >> >> in msix_capability_init() in xen is that the dom0 EPT entries that > >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses > >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access > >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > >> >> don''t understand why? Looking into that now... > >> > >> As far as I was able to tell back at the time when I implemented > >> this, existing code shouldn''t have mappings for these tables in > >> place at the time these ranges get added here. But I noted in > >> the patch description that this is a potential issue (and may need > >> fixing if deemed severe enough - back then, apparently nobody > >> really cared, perhaps largely because passthrough to PV guests > >> isn''t considered fully secure anyway). > >> > >> Now - did that change? I.e. can you describe where the mappings > >> come from that cause the problem here? > > > > The generic Linux MSI-X handling code does that, before calling the > > arch specific msi setup function, for which we have a xen version > > (xen_initdom_setup_msi_irqs): > > > > pci_enable_msix -> msix_capability_init -> msix_map_region > > Ah, okay, (of course?) I had looked only at the forward ported > version of this. Is all that fiddling with the mask bits really > being suppressed properly when running under Xen? Otherwise > pv-ops is quite broken in this regard at present... And if it is, > I don''t see what the respective ioremap() is good for here in > the Xen case.Actually I think that you might be right: just looking at the code it seems that the mask bits get written to the table once as part of the initialization process: pci_enable_msix -> msix_capability_init -> msix_program_entries Unfortunately msix_program_entries is called few lines after arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as a pirq. However after that is done, all the masking/unmask is done via irq_mask that we handle properly masking/unmasking the corresponding event channels. Possible solutions on top of my head: - in msix_program_entries instead of writing to the table directly (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); - replace msix_program_entries with arch_msix_program_entries, but it would probably be unpopular.
On Thu, 13 Dec 2012 14:25:16 +0000 Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:> On Thu, 13 Dec 2012, Jan Beulich wrote: > > >>> On 13.12.12 at 13:19, Stefano Stabellini > > >>> <stefano.stabellini@eu.citrix.com> > > wrote: > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor > > >> >>> <mukesh.rathor@oracle.com> wrote: > > Actually I think that you might be right: just looking at the code it > seems that the mask bits get written to the table once as part of the > initialization process: > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > Unfortunately msix_program_entries is called few lines after > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI > as a pirq. > However after that is done, all the masking/unmask is done via > irq_mask that we handle properly masking/unmasking the corresponding > event channels. > > > Possible solutions on top of my head: > > - in msix_program_entries instead of writing to the table directly > (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); > > - replace msix_program_entries with arch_msix_program_entries, but it > would probably be unpopular.Can you or Jan or somebody please take that over? I can focus on other PVH things then and try to get a patch in asap. Thanks, Mukesh
On Fri, 14 Dec 2012, Mukesh Rathor wrote:> On Thu, 13 Dec 2012 14:25:16 +0000 > Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > >>> On 13.12.12 at 13:19, Stefano Stabellini > > > >>> <stefano.stabellini@eu.citrix.com> > > > wrote: > > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor > > > >> >>> <mukesh.rathor@oracle.com> wrote: > > > > Actually I think that you might be right: just looking at the code it > > seems that the mask bits get written to the table once as part of the > > initialization process: > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > Unfortunately msix_program_entries is called few lines after > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI > > as a pirq. > > However after that is done, all the masking/unmask is done via > > irq_mask that we handle properly masking/unmasking the corresponding > > event channels. > > > > > > Possible solutions on top of my head: > > > > - in msix_program_entries instead of writing to the table directly > > (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); > > > > - replace msix_program_entries with arch_msix_program_entries, but it > > would probably be unpopular. > > > Can you or Jan or somebody please take that over? I can focus on other > PVH things then and try to get a patch in asap.The following patch moves the MSI-X masking before arch_setup_msi_irqs, that is when Linux calls PHYSDEVOP_map_pirq that is the hypercall that causes Xen to execute msix_capability_init. I don''t have access to a machine with an MSI-X device right now so I have only tested the appended patch with MSI. --- diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index a825d78..ef73e80 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -652,11 +652,22 @@ static void msix_program_entries(struct pci_dev *dev, int i = 0; list_for_each_entry(entry, &dev->msi_list, list) { + entries[i].vector = entry->irq; + irq_set_msi_desc(entry->irq, entry); + i++; + } +} + +static void msix_mask_entries(struct pci_dev *dev, + struct msix_entry *entries) +{ + struct msi_desc *entry; + int i = 0; + + list_for_each_entry(entry, &dev->msi_list, list) { int offset = entries[i].entry * PCI_MSIX_ENTRY_SIZE + PCI_MSIX_ENTRY_VECTOR_CTRL; - entries[i].vector = entry->irq; - irq_set_msi_desc(entry->irq, entry); entry->masked = readl(entry->mask_base + offset); msix_mask_irq(entry, 1); i++; @@ -696,6 +707,8 @@ static int msix_capability_init(struct pci_dev *dev, if (ret) return ret; + msix_mask_entries(dev, entries); + ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); if (ret) goto error;
>>> On 17.12.12 at 13:42, Stefano Stabellini <stefano.stabellini@eu.citrix.com>wrote:> On Fri, 14 Dec 2012, Mukesh Rathor wrote: >> On Thu, 13 Dec 2012 14:25:16 +0000 >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: >> >> > On Thu, 13 Dec 2012, Jan Beulich wrote: >> > > >>> On 13.12.12 at 13:19, Stefano Stabellini >> > > >>> <stefano.stabellini@eu.citrix.com> >> > > wrote: >> > > > On Thu, 13 Dec 2012, Jan Beulich wrote: >> > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor >> > > >> >>> <mukesh.rathor@oracle.com> wrote: >> > >> > Actually I think that you might be right: just looking at the code it >> > seems that the mask bits get written to the table once as part of the >> > initialization process: >> > >> > pci_enable_msix -> msix_capability_init -> msix_program_entries >> > >> > Unfortunately msix_program_entries is called few lines after >> > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI >> > as a pirq. >> > However after that is done, all the masking/unmask is done via >> > irq_mask that we handle properly masking/unmasking the corresponding >> > event channels. >> > >> > >> > Possible solutions on top of my head: >> > >> > - in msix_program_entries instead of writing to the table directly >> > (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); >> > >> > - replace msix_program_entries with arch_msix_program_entries, but it >> > would probably be unpopular. >> >> >> Can you or Jan or somebody please take that over? I can focus on other >> PVH things then and try to get a patch in asap. > > The following patch moves the MSI-X masking before arch_setup_msi_irqs, > that is when Linux calls PHYSDEVOP_map_pirq that is the hypercall that > causes Xen to execute msix_capability_init.And in what way would that help? Jan> I don''t have access to a machine with an MSI-X device right now so I > have only tested the appended patch with MSI. > > --- > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index a825d78..ef73e80 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -652,11 +652,22 @@ static void msix_program_entries(struct pci_dev *dev, > int i = 0; > > list_for_each_entry(entry, &dev->msi_list, list) { > + entries[i].vector = entry->irq; > + irq_set_msi_desc(entry->irq, entry); > + i++; > + } > +} > + > +static void msix_mask_entries(struct pci_dev *dev, > + struct msix_entry *entries) > +{ > + struct msi_desc *entry; > + int i = 0; > + > + list_for_each_entry(entry, &dev->msi_list, list) { > int offset = entries[i].entry * PCI_MSIX_ENTRY_SIZE + > PCI_MSIX_ENTRY_VECTOR_CTRL; > > - entries[i].vector = entry->irq; > - irq_set_msi_desc(entry->irq, entry); > entry->masked = readl(entry->mask_base + offset); > msix_mask_irq(entry, 1); > i++; > @@ -696,6 +707,8 @@ static int msix_capability_init(struct pci_dev *dev, > if (ret) > return ret; > > + msix_mask_entries(dev, entries); > + > ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); > if (ret) > goto error;
On Mon, 17 Dec 2012, Jan Beulich wrote:> >>> On 17.12.12 at 13:42, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > wrote: > > On Fri, 14 Dec 2012, Mukesh Rathor wrote: > >> On Thu, 13 Dec 2012 14:25:16 +0000 > >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > >> > >> > On Thu, 13 Dec 2012, Jan Beulich wrote: > >> > > >>> On 13.12.12 at 13:19, Stefano Stabellini > >> > > >>> <stefano.stabellini@eu.citrix.com> > >> > > wrote: > >> > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > >> > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor > >> > > >> >>> <mukesh.rathor@oracle.com> wrote: > >> > > >> > Actually I think that you might be right: just looking at the code it > >> > seems that the mask bits get written to the table once as part of the > >> > initialization process: > >> > > >> > pci_enable_msix -> msix_capability_init -> msix_program_entries > >> > > >> > Unfortunately msix_program_entries is called few lines after > >> > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI > >> > as a pirq. > >> > However after that is done, all the masking/unmask is done via > >> > irq_mask that we handle properly masking/unmasking the corresponding > >> > event channels. > >> > > >> > > >> > Possible solutions on top of my head: > >> > > >> > - in msix_program_entries instead of writing to the table directly > >> > (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); > >> > > >> > - replace msix_program_entries with arch_msix_program_entries, but it > >> > would probably be unpopular. > >> > >> > >> Can you or Jan or somebody please take that over? I can focus on other > >> PVH things then and try to get a patch in asap. > > > > The following patch moves the MSI-X masking before arch_setup_msi_irqs, > > that is when Linux calls PHYSDEVOP_map_pirq that is the hypercall that > > causes Xen to execute msix_capability_init. > > And in what way would that help?I was working under the assumption that before the call to msix_capability_init (in particular before rangeset_add_range(mmio_ro_ranges, dev->msix_table...) in Xen, the table is actually writeable by the guest. If that is the case, then this scheme should work. If it is not the case, this patch is wrong.> > I don''t have access to a machine with an MSI-X device right now so I > > have only tested the appended patch with MSI. > > > > --- > > > > > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > index a825d78..ef73e80 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -652,11 +652,22 @@ static void msix_program_entries(struct pci_dev *dev, > > int i = 0; > > > > list_for_each_entry(entry, &dev->msi_list, list) { > > + entries[i].vector = entry->irq; > > + irq_set_msi_desc(entry->irq, entry); > > + i++; > > + } > > +} > > + > > +static void msix_mask_entries(struct pci_dev *dev, > > + struct msix_entry *entries) > > +{ > > + struct msi_desc *entry; > > + int i = 0; > > + > > + list_for_each_entry(entry, &dev->msi_list, list) { > > int offset = entries[i].entry * PCI_MSIX_ENTRY_SIZE + > > PCI_MSIX_ENTRY_VECTOR_CTRL; > > > > - entries[i].vector = entry->irq; > > - irq_set_msi_desc(entry->irq, entry); > > entry->masked = readl(entry->mask_base + offset); > > msix_mask_irq(entry, 1); > > i++; > > @@ -696,6 +707,8 @@ static int msix_capability_init(struct pci_dev *dev, > > if (ret) > > return ret; > > > > + msix_mask_entries(dev, entries); > > + > > ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX); > > if (ret) > > goto error; > > >
>>> On 17.12.12 at 15:16, Stefano Stabellini <stefano.stabellini@eu.citrix.com>wrote:> On Mon, 17 Dec 2012, Jan Beulich wrote: >> >>> On 17.12.12 at 13:42, Stefano Stabellini <stefano.stabellini@eu.citrix.com> >> wrote: >> > On Fri, 14 Dec 2012, Mukesh Rathor wrote: >> >> On Thu, 13 Dec 2012 14:25:16 +0000 >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: >> >> >> >> > On Thu, 13 Dec 2012, Jan Beulich wrote: >> >> > > >>> On 13.12.12 at 13:19, Stefano Stabellini >> >> > > >>> <stefano.stabellini@eu.citrix.com> >> >> > > wrote: >> >> > > > On Thu, 13 Dec 2012, Jan Beulich wrote: >> >> > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor >> >> > > >> >>> <mukesh.rathor@oracle.com> wrote: >> >> > >> >> > Actually I think that you might be right: just looking at the code it >> >> > seems that the mask bits get written to the table once as part of the >> >> > initialization process: >> >> > >> >> > pci_enable_msix -> msix_capability_init -> msix_program_entries >> >> > >> >> > Unfortunately msix_program_entries is called few lines after >> >> > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI >> >> > as a pirq. >> >> > However after that is done, all the masking/unmask is done via >> >> > irq_mask that we handle properly masking/unmasking the corresponding >> >> > event channels. >> >> > >> >> > >> >> > Possible solutions on top of my head: >> >> > >> >> > - in msix_program_entries instead of writing to the table directly >> >> > (__msix_mask_irq), call desc->irq_data.chip->irq_mask(); >> >> > >> >> > - replace msix_program_entries with arch_msix_program_entries, but it >> >> > would probably be unpopular. >> >> >> >> >> >> Can you or Jan or somebody please take that over? I can focus on other >> >> PVH things then and try to get a patch in asap. >> > >> > The following patch moves the MSI-X masking before arch_setup_msi_irqs, >> > that is when Linux calls PHYSDEVOP_map_pirq that is the hypercall that >> > causes Xen to execute msix_capability_init. >> >> And in what way would that help? > > I was working under the assumption that before the call to > msix_capability_init (in particular before > rangeset_add_range(mmio_ro_ranges, dev->msix_table...) in Xen, the table > is actually writeable by the guest. > > If that is the case, then this scheme should work. > If it is not the case, this patch is wrong.The question is not if or when the table is writable - the Dom0 kernel should _never_ try to write to the mask bits. Jan
On Thu, Dec 13, 2012 at 02:25:16PM +0000, Stefano Stabellini wrote:> On Thu, 13 Dec 2012, Jan Beulich wrote: > > >>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > > wrote: > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > >> > On Wed, 12 Dec 2012 17:15:23 -0800 > > >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > >> > > > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 > > >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > >> >> > > >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > > >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > > >> >> > > > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X > > >> >> > entries directly: give a look at > > >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > > >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to > > >> >> > touch the real MSI-X table. > > >> >> > > >> >> So, this is what''s happening. The side effect of : > > >> >> > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > > >> >> dev->msix_table.last) ) > > >> >> WARN(); > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > > >> >> dev->msix_pba.last) ) > > >> >> WARN(); > > >> >> > > >> >> in msix_capability_init() in xen is that the dom0 EPT entries that > > >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses > > >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access > > >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > > >> >> don''t understand why? Looking into that now... > > >> > > >> As far as I was able to tell back at the time when I implemented > > >> this, existing code shouldn''t have mappings for these tables in > > >> place at the time these ranges get added here. But I noted in > > >> the patch description that this is a potential issue (and may need > > >> fixing if deemed severe enough - back then, apparently nobody > > >> really cared, perhaps largely because passthrough to PV guests > > >> isn''t considered fully secure anyway). > > >> > > >> Now - did that change? I.e. can you describe where the mappings > > >> come from that cause the problem here? > > > > > > The generic Linux MSI-X handling code does that, before calling the > > > arch specific msi setup function, for which we have a xen version > > > (xen_initdom_setup_msi_irqs): > > > > > > pci_enable_msix -> msix_capability_init -> msix_map_region > > > > Ah, okay, (of course?) I had looked only at the forward ported > > version of this. Is all that fiddling with the mask bits really > > being suppressed properly when running under Xen? Otherwise > > pv-ops is quite broken in this regard at present... And if it is, > > I don''t see what the respective ioremap() is good for here in > > the Xen case. > > Actually I think that you might be right: just looking at the code it > seems that the mask bits get written to the table once as part of the > initialization process: > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > Unfortunately msix_program_entries is called few lines after > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > a pirq. > However after that is done, all the masking/unmask is done via irq_mask > that we handle properly masking/unmasking the corresponding event > channels. > > > Possible solutions on top of my head:There is also the potential to piggyback on Joerg''s patches that introduce a new x86_msi_ops: compose_msi_msg. See here: https://lkml.org/lkml/2012/8/20/432 (I think there was also a more recent one posted at some point).
On Tue, 18 Dec 2012, Konrad Rzeszutek Wilk wrote:> On Thu, Dec 13, 2012 at 02:25:16PM +0000, Stefano Stabellini wrote: > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > >>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > > > wrote: > > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > >> > On Wed, 12 Dec 2012 17:15:23 -0800 > > > >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > >> > > > > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 > > > >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > > >> >> > > > >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > > >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > > > >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > >> >> > > > > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X > > > >> >> > entries directly: give a look at > > > >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > > > >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to > > > >> >> > touch the real MSI-X table. > > > >> >> > > > >> >> So, this is what''s happening. The side effect of : > > > >> >> > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > > > >> >> dev->msix_table.last) ) > > > >> >> WARN(); > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > > > >> >> dev->msix_pba.last) ) > > > >> >> WARN(); > > > >> >> > > > >> >> in msix_capability_init() in xen is that the dom0 EPT entries that > > > >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses > > > >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access > > > >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > > > >> >> don''t understand why? Looking into that now... > > > >> > > > >> As far as I was able to tell back at the time when I implemented > > > >> this, existing code shouldn''t have mappings for these tables in > > > >> place at the time these ranges get added here. But I noted in > > > >> the patch description that this is a potential issue (and may need > > > >> fixing if deemed severe enough - back then, apparently nobody > > > >> really cared, perhaps largely because passthrough to PV guests > > > >> isn''t considered fully secure anyway). > > > >> > > > >> Now - did that change? I.e. can you describe where the mappings > > > >> come from that cause the problem here? > > > > > > > > The generic Linux MSI-X handling code does that, before calling the > > > > arch specific msi setup function, for which we have a xen version > > > > (xen_initdom_setup_msi_irqs): > > > > > > > > pci_enable_msix -> msix_capability_init -> msix_map_region > > > > > > Ah, okay, (of course?) I had looked only at the forward ported > > > version of this. Is all that fiddling with the mask bits really > > > being suppressed properly when running under Xen? Otherwise > > > pv-ops is quite broken in this regard at present... And if it is, > > > I don''t see what the respective ioremap() is good for here in > > > the Xen case. > > > > Actually I think that you might be right: just looking at the code it > > seems that the mask bits get written to the table once as part of the > > initialization process: > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > Unfortunately msix_program_entries is called few lines after > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > a pirq. > > However after that is done, all the masking/unmask is done via irq_mask > > that we handle properly masking/unmasking the corresponding event > > channels. > > > > > > Possible solutions on top of my head: > > There is also the potential to piggyback on Joerg''s patches > that introduce a new x86_msi_ops: compose_msi_msg. > > See here: https://lkml.org/lkml/2012/8/20/432 > (I think there was also a more recent one posted at some point).Given that dom0 should never write to the MSI-X table, introducing a new msi_ops that replaces msix_program_entries (or at least the part of msix_program_entries that masks all the entried) is the only solution left.
On Wed, Dec 19, 2012 at 11:19:22AM +0000, Stefano Stabellini wrote:> On Tue, 18 Dec 2012, Konrad Rzeszutek Wilk wrote: > > On Thu, Dec 13, 2012 at 02:25:16PM +0000, Stefano Stabellini wrote: > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > > >>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > > > > wrote: > > > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > > >> > On Wed, 12 Dec 2012 17:15:23 -0800 > > > > >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > > >> > > > > > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 > > > > >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > > > >> >> > > > > >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > > > >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > > > > >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > >> >> > > > > > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X > > > > >> >> > entries directly: give a look at > > > > >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > > > > >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to > > > > >> >> > touch the real MSI-X table. > > > > >> >> > > > > >> >> So, this is what''s happening. The side effect of : > > > > >> >> > > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > > > > >> >> dev->msix_table.last) ) > > > > >> >> WARN(); > > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > > > > >> >> dev->msix_pba.last) ) > > > > >> >> WARN(); > > > > >> >> > > > > >> >> in msix_capability_init() in xen is that the dom0 EPT entries that > > > > >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses > > > > >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access > > > > >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > > > > >> >> don''t understand why? Looking into that now... > > > > >> > > > > >> As far as I was able to tell back at the time when I implemented > > > > >> this, existing code shouldn''t have mappings for these tables in > > > > >> place at the time these ranges get added here. But I noted in > > > > >> the patch description that this is a potential issue (and may need > > > > >> fixing if deemed severe enough - back then, apparently nobody > > > > >> really cared, perhaps largely because passthrough to PV guests > > > > >> isn''t considered fully secure anyway). > > > > >> > > > > >> Now - did that change? I.e. can you describe where the mappings > > > > >> come from that cause the problem here? > > > > > > > > > > The generic Linux MSI-X handling code does that, before calling the > > > > > arch specific msi setup function, for which we have a xen version > > > > > (xen_initdom_setup_msi_irqs): > > > > > > > > > > pci_enable_msix -> msix_capability_init -> msix_map_region > > > > > > > > Ah, okay, (of course?) I had looked only at the forward ported > > > > version of this. Is all that fiddling with the mask bits really > > > > being suppressed properly when running under Xen? Otherwise > > > > pv-ops is quite broken in this regard at present... And if it is, > > > > I don''t see what the respective ioremap() is good for here in > > > > the Xen case. > > > > > > Actually I think that you might be right: just looking at the code it > > > seems that the mask bits get written to the table once as part of the > > > initialization process: > > > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > > > Unfortunately msix_program_entries is called few lines after > > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > > a pirq. > > > However after that is done, all the masking/unmask is done via irq_mask > > > that we handle properly masking/unmasking the corresponding event > > > channels. > > > > > > > > > Possible solutions on top of my head: > > > > There is also the potential to piggyback on Joerg''s patches > > that introduce a new x86_msi_ops: compose_msi_msg. > > > > See here: https://lkml.org/lkml/2012/8/20/432 > > (I think there was also a more recent one posted at some point). > > Given that dom0 should never write to the MSI-X table, introducing a newHow does this work with QEMU setting up MSI and MSI-X on behalf of guests? Or is that actually handled by Xen hypervisor?> msi_ops that replaces msix_program_entries (or at least the part of > msix_program_entries that masks all the entried) is the only solution > left.so this one (__msix_mask_irq): mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; 198 if (flag) 199 mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; 200 writel(mask_bits, desc->mask_base + offset);
On Wed, 19 Dec 2012, Konrad Rzeszutek Wilk wrote:> On Wed, Dec 19, 2012 at 11:19:22AM +0000, Stefano Stabellini wrote: > > On Tue, 18 Dec 2012, Konrad Rzeszutek Wilk wrote: > > > On Thu, Dec 13, 2012 at 02:25:16PM +0000, Stefano Stabellini wrote: > > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > > > >>> On 13.12.12 at 13:19, Stefano Stabellini <stefano.stabellini@eu.citrix.com> > > > > > wrote: > > > > > > On Thu, 13 Dec 2012, Jan Beulich wrote: > > > > > >> >>> On 13.12.12 at 02:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > > > >> > On Wed, 12 Dec 2012 17:15:23 -0800 > > > > > >> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > > > >> > > > > > > >> >> On Tue, 11 Dec 2012 12:10:19 +0000 > > > > > >> >> Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote: > > > > > >> >> > > > > > >> >> > On Tue, 11 Dec 2012, Mukesh Rathor wrote: > > > > > >> >> > > On Mon, 10 Dec 2012 09:43:34 +0000 > > > > > >> >> > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > > >> >> > > > > > > >> >> > That''s strange because AFAIK Linux is never editing the MSI-X > > > > > >> >> > entries directly: give a look at > > > > > >> >> > arch/x86/pci/xen.c:xen_initdom_setup_msi_irqs, Linux only remaps > > > > > >> >> > MSIs into pirqs using hypercalls. Xen should be the only one to > > > > > >> >> > touch the real MSI-X table. > > > > > >> >> > > > > > >> >> So, this is what''s happening. The side effect of : > > > > > >> >> > > > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first, > > > > > >> >> dev->msix_table.last) ) > > > > > >> >> WARN(); > > > > > >> >> if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first, > > > > > >> >> dev->msix_pba.last) ) > > > > > >> >> WARN(); > > > > > >> >> > > > > > >> >> in msix_capability_init() in xen is that the dom0 EPT entries that > > > > > >> >> I''ve mapped are going from RW to read only. Then when dom0 accesses > > > > > >> >> it, I get EPT violation. In case of pure PV, the PTE entry to access > > > > > >> >> the iomem is RW, and the above rangeset adding doesn''t affect it. I > > > > > >> >> don''t understand why? Looking into that now... > > > > > >> > > > > > >> As far as I was able to tell back at the time when I implemented > > > > > >> this, existing code shouldn''t have mappings for these tables in > > > > > >> place at the time these ranges get added here. But I noted in > > > > > >> the patch description that this is a potential issue (and may need > > > > > >> fixing if deemed severe enough - back then, apparently nobody > > > > > >> really cared, perhaps largely because passthrough to PV guests > > > > > >> isn''t considered fully secure anyway). > > > > > >> > > > > > >> Now - did that change? I.e. can you describe where the mappings > > > > > >> come from that cause the problem here? > > > > > > > > > > > > The generic Linux MSI-X handling code does that, before calling the > > > > > > arch specific msi setup function, for which we have a xen version > > > > > > (xen_initdom_setup_msi_irqs): > > > > > > > > > > > > pci_enable_msix -> msix_capability_init -> msix_map_region > > > > > > > > > > Ah, okay, (of course?) I had looked only at the forward ported > > > > > version of this. Is all that fiddling with the mask bits really > > > > > being suppressed properly when running under Xen? Otherwise > > > > > pv-ops is quite broken in this regard at present... And if it is, > > > > > I don''t see what the respective ioremap() is good for here in > > > > > the Xen case. > > > > > > > > Actually I think that you might be right: just looking at the code it > > > > seems that the mask bits get written to the table once as part of the > > > > initialization process: > > > > > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > > > > > Unfortunately msix_program_entries is called few lines after > > > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > > > a pirq. > > > > However after that is done, all the masking/unmask is done via irq_mask > > > > that we handle properly masking/unmasking the corresponding event > > > > channels. > > > > > > > > > > > > Possible solutions on top of my head: > > > > > > There is also the potential to piggyback on Joerg''s patches > > > that introduce a new x86_msi_ops: compose_msi_msg. > > > > > > See here: https://lkml.org/lkml/2012/8/20/432 > > > (I think there was also a more recent one posted at some point). > > > > Given that dom0 should never write to the MSI-X table, introducing a new > > How does this work with QEMU setting up MSI and MSI-X on behalf of > guests? Or is that actually handled by Xen hypervisor?In the case of HVM guests, QEMU emulates the PCI config space and the table, so it is OK for the guest to write to it.> > msi_ops that replaces msix_program_entries (or at least the part of > > msix_program_entries that masks all the entried) is the only solution > > left. > > so this one (__msix_mask_irq): > > mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; > 198 if (flag) > 199 mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; > 200 writel(mask_bits, desc->mask_base + offset); >Yes, that''s the one. Once could argue that __msix_mask_irq should call mask_irq rather than writing to the table directly.
> > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > > > > > > > Unfortunately msix_program_entries is called few lines after > > > > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > > > > a pirq. > > > > > However after that is done, all the masking/unmask is done via irq_mask > > > > > that we handle properly masking/unmasking the corresponding event > > > > > channels. > > > > > > > > > > > > > > > Possible solutions on top of my head: > > > > > > > > There is also the potential to piggyback on Joerg''s patches > > > > that introduce a new x86_msi_ops: compose_msi_msg. > > > > > > > > See here: https://lkml.org/lkml/2012/8/20/432 > > > > (I think there was also a more recent one posted at some point). > > > > > > Given that dom0 should never write to the MSI-X table, introducing a new > > > > How does this work with QEMU setting up MSI and MSI-X on behalf of > > guests? Or is that actually handled by Xen hypervisor? > > In the case of HVM guests, QEMU emulates the PCI config space and the > table, so it is OK for the guest to write to it. > > > > > msi_ops that replaces msix_program_entries (or at least the part of > > > msix_program_entries that masks all the entried) is the only solution > > > left. > > > > so this one (__msix_mask_irq): > > > > mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; > > 198 if (flag) > > 199 mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; > > 200 writel(mask_bits, desc->mask_base + offset); > > > > Yes, that''s the one. Once could argue that __msix_mask_irq should call > mask_irq rather than writing to the table directly.You mean ''irq_mask '' ? Not really - that is within the IOAPIC domain. To be more generic it should encompass then also the other usages - that is the ''readl'' and ''writel'' users. My understading of the reason we have been fortunate enough to have this working right now is b/c the hypercall we do beforehand writes the ''pirq'' in the MSI-X BAR and that is later what the Linux kernel does (by doing readl) - and we end up re-writing that value by the Linux kernel. The other thing we can do and entirely bypass the msi.c writes is xen_initdom_setup_msi_irqs make the desc->mask_base point to somewhere safe. Meaning point to an page we allocate when we setup the IRQs and we fill it with whatever we want (which I guess would be the pirq values we just got).
On Fri, 21 Dec 2012, Konrad Rzeszutek Wilk wrote:> > > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > > > > > > > > > Unfortunately msix_program_entries is called few lines after > > > > > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > > > > > a pirq. > > > > > > However after that is done, all the masking/unmask is done via irq_mask > > > > > > that we handle properly masking/unmasking the corresponding event > > > > > > channels. > > > > > > > > > > > > > > > > > > Possible solutions on top of my head: > > > > > > > > > > There is also the potential to piggyback on Joerg''s patches > > > > > that introduce a new x86_msi_ops: compose_msi_msg. > > > > > > > > > > See here: https://lkml.org/lkml/2012/8/20/432 > > > > > (I think there was also a more recent one posted at some point). > > > > > > > > Given that dom0 should never write to the MSI-X table, introducing a new > > > > > > How does this work with QEMU setting up MSI and MSI-X on behalf of > > > guests? Or is that actually handled by Xen hypervisor? > > > > In the case of HVM guests, QEMU emulates the PCI config space and the > > table, so it is OK for the guest to write to it. > > > > > > > > msi_ops that replaces msix_program_entries (or at least the part of > > > > msix_program_entries that masks all the entried) is the only solution > > > > left. > > > > > > so this one (__msix_mask_irq): > > > > > > mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; > > > 198 if (flag) > > > 199 mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; > > > 200 writel(mask_bits, desc->mask_base + offset); > > > > > > > Yes, that''s the one. Once could argue that __msix_mask_irq should call > > mask_irq rather than writing to the table directly. > > You mean ''irq_mask '' ? Not really - that is within the IOAPIC domain.The concept of IOAPIC domain is a bit blurry to me when it comes to MSI-X. But I see what you mean.> To be more generic it should encompass then also the other usages - > that is the ''readl'' and ''writel'' users. > > My understading of the reason we have been fortunate enough to have this > working right now is b/c the hypercall we do beforehand writes the > ''pirq'' in the MSI-X BAR and that is later what the Linux kernel > does (by doing readl) - and we end up re-writing that value > by the Linux kernel. > > The other thing we can do and entirely bypass the msi.c writes is > xen_initdom_setup_msi_irqs make the desc->mask_base point to > somewhere safe. Meaning point to an page we allocate when > we setup the IRQs and we fill it with whatever we want > (which I guess would be the pirq values we just got).Ah yes, that would work. Even though more code to work around generic Linux infrastructure is not ideal.
On Fri, Jan 04, 2013 at 04:57:13PM +0000, Stefano Stabellini wrote:> On Fri, 21 Dec 2012, Konrad Rzeszutek Wilk wrote: > > > > > > > pci_enable_msix -> msix_capability_init -> msix_program_entries > > > > > > > > > > > > > > Unfortunately msix_program_entries is called few lines after > > > > > > > arch_setup_msi_irqs, where we call PHYSDEVOP_map_pirq to map the MSI as > > > > > > > a pirq. > > > > > > > However after that is done, all the masking/unmask is done via irq_mask > > > > > > > that we handle properly masking/unmasking the corresponding event > > > > > > > channels. > > > > > > > > > > > > > > > > > > > > > Possible solutions on top of my head: > > > > > > > > > > > > There is also the potential to piggyback on Joerg''s patches > > > > > > that introduce a new x86_msi_ops: compose_msi_msg. > > > > > > > > > > > > See here: https://lkml.org/lkml/2012/8/20/432 > > > > > > (I think there was also a more recent one posted at some point). > > > > > > > > > > Given that dom0 should never write to the MSI-X table, introducing a new > > > > > > > > How does this work with QEMU setting up MSI and MSI-X on behalf of > > > > guests? Or is that actually handled by Xen hypervisor? > > > > > > In the case of HVM guests, QEMU emulates the PCI config space and the > > > table, so it is OK for the guest to write to it. > > > > > > > > > > > msi_ops that replaces msix_program_entries (or at least the part of > > > > > msix_program_entries that masks all the entried) is the only solution > > > > > left. > > > > > > > > so this one (__msix_mask_irq): > > > > > > > > mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT; > > > > 198 if (flag) > > > > 199 mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT; > > > > 200 writel(mask_bits, desc->mask_base + offset); > > > > > > > > > > Yes, that''s the one. Once could argue that __msix_mask_irq should call > > > mask_irq rather than writing to the table directly. > > > > You mean ''irq_mask '' ? Not really - that is within the IOAPIC domain. > > The concept of IOAPIC domain is a bit blurry to me when it comes to MSI-X. > But I see what you mean. > > > > To be more generic it should encompass then also the other usages - > > that is the ''readl'' and ''writel'' users. > > > > My understading of the reason we have been fortunate enough to have this > > working right now is b/c the hypercall we do beforehand writes the > > ''pirq'' in the MSI-X BAR and that is later what the Linux kernel > > does (by doing readl) - and we end up re-writing that value > > by the Linux kernel. > > > > The other thing we can do and entirely bypass the msi.c writes is > > xen_initdom_setup_msi_irqs make the desc->mask_base point to > > somewhere safe. Meaning point to an page we allocate when > > we setup the IRQs and we fill it with whatever we want > > (which I guess would be the pirq values we just got). > > Ah yes, that would work. Even though more code to work around generic > Linux infrastructure is not ideal.I concur. I am just thinking of a fail-safe mechanism. The abstraction of how to do the MSI-X read/write is not yet completly clear in my mind.> > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel >