Currently MSI is disabled because of some locking issues. This patch tries to
clean up the locking around MSI. Because I have no AMD IOMMU environment, I
didn't test the AMD platform. Wei Wang, can you please check the AMD code path?

Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>

The current issues include:
1) The ordering of the pci_dev lock and the irq_desc lock is not consistent.
   For example, pci_cleanup_msi takes the pci_dev lock before msi_free_vectors
   takes the irq_desc lock, while pci_disable_msi takes the irq_desc lock
   before the pci_dev lock.
2) The ordering of the pci_dev lock and pcidevs_lock is not consistent. For
   example, reassign_device_ownership takes the pci_dev lock before
   pcidevs_lock, while pci_lock_domain_pdev takes pcidevs_lock before the
   pci_dev lock.
3) The locking around bus2bridge is not always consistent.
4) xmalloc is called with interrupts disabled (under an irqsave lock) in
   several code paths, which makes the debug build fail.

Some minor issues also exist (they can be identified from the patch),
including: mixing spin_lock_irqsave and spin_lock on iommu->lock and
hd->mapping_lock; a possible race between XEN_DOMCTL_assign_device and
XEN_DOMCTL_deassign_device; and the rangeset being freed in
rangeset_domain_destroy() before unmap_domain_pirq tries to deny irq access
in arch_domain_destroy; etc.

This patch tries to clean up these issues:
1) The basic idea is to remove the pci_dev lock; instead, we use the big
   pcidevs_lock to protect all pci_dev state, including pci_dev add/remove
   and device assignment. We checked the code and there seems to be no
   performance-critical path here; a fine-grained lock would make the code
   very tricky.
2) Split pci_enable_msi into two steps: first it only constructs the
   msi_desc, without holding the irq_desc lock, and then it sets up the MSI
   through setup_msi_irq with the irq_desc lock held.
3) Change iommu->lock and hd->mapping_lock to be irqsave.
4) Fix some minor issues.

Now the lock sequence is: pcidevs_lock -> domain's event_lock -> iommu's lock
-> hvm_iommu's mapping_lock. The irq_desc lock is always the last lock to be
held, for performance reasons.

Some things to note:
1) It is tricky to support multi-vector MSI. The reason is that the mask bits
   for multiple vectors share the same register, so the ISR has to compete
   for that register and would need to hold another lock, which makes things
   very tricky (the support is buggy in the current code). So in fact I think
   the current code does not support it.
2) I'm not sure whether pci_remove_device needs some improvement to make sure
   its resources are freed, since the PCI device may still be owned by
   another domain when it is called. For example, it may need to call
   pci_clean_dpci_irqs(). But because I'm not sure how this function will be
   used, I didn't touch it.

One thing left is the IOMMU's page fault handler, which currently tries to
access the VT-d mapping table and context table. We can simply remove the
print_vtd_entries call in iommu_page_fault_do_one, or defer the printing to a
delayed work item.

I tried to split the patch into smaller ones, but the logic is interrelated,
so I gave up.

Thanks
Yunhong Jiang
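To illustrate the first ordering problem, here is a minimal, self-contained
sketch (plain pthreads; the lock and function names are simplified stand-ins,
not Xen code) of how two paths that take the same pair of locks in opposite
order can deadlock:

```c
#include <pthread.h>

/* Simplified stand-ins for the pci_dev lock and the irq_desc lock. */
static pthread_mutex_t pdev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t irq_desc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Path A (analogous to pci_cleanup_msi -> msi_free_vectors):
 * takes the pci_dev lock first, then the irq_desc lock. */
static void cleanup_path(void)
{
    pthread_mutex_lock(&pdev_lock);
    pthread_mutex_lock(&irq_desc_lock);   /* may block forever ...      */
    /* ... free the MSI vectors ... */
    pthread_mutex_unlock(&irq_desc_lock);
    pthread_mutex_unlock(&pdev_lock);
}

/* Path B (analogous to pci_disable_msi):
 * takes the irq_desc lock first, then the pci_dev lock. */
static void disable_path(void)
{
    pthread_mutex_lock(&irq_desc_lock);
    pthread_mutex_lock(&pdev_lock);       /* ... while A holds pdev_lock */
    /* ... disable MSI on the device ... */
    pthread_mutex_unlock(&pdev_lock);
    pthread_mutex_unlock(&irq_desc_lock);
}

/* If one CPU runs cleanup_path() while another runs disable_path(), each can
 * end up holding one lock and waiting for the other (ABBA deadlock).  The
 * patch avoids this by imposing a single global order, with the irq_desc
 * lock always taken last. */
```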
[Yunhong Jiang]
> This patch tries to clean up these issues:
> 1) The basic idea is to remove the pci_dev lock; instead, we use the big
>    pcidevs_lock to protect all pci_dev state, including pci_dev add/remove
>    and device assignment. We checked the code and there seems to be no
>    performance-critical path here; a fine-grained lock would make the code
>    very tricky.
> 2) Split pci_enable_msi into two steps: first it only constructs the
>    msi_desc, without holding the irq_desc lock, and then it sets up the MSI
>    through setup_msi_irq with the irq_desc lock held.
> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
> 4) Fix some minor issues.

> Now the lock sequence is: pcidevs_lock -> domain's event_lock -> iommu's
> lock -> hvm_iommu's mapping_lock. The irq_desc lock is always the last lock
> to be held, for performance reasons.

So what exactly is it that pcidevs_lock is supposed to "protect" now?
Does it indicate that someone is holding a reference to a pci_dev?
Does it indicate that someone will modify some pci_dev? Does it
indicate that someone is messing around with interrupts or MSI
descriptors?

Regarding pci_enable_msi: Can you not get into race conditions with
the split you're doing? pci_enable_msi() does after all poke around
in the MSI capability record.

	eSk
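For readers who want a concrete picture of the split being questioned here,
the following is a self-contained sketch only: all types, names, and locking
helpers are simplified stand-ins, not Xen's actual pci_enable_msi /
setup_msi_irq signatures.

```c
#include <pthread.h>
#include <stdlib.h>

struct msi_desc { int vector; };
struct pci_dev  { int bdf; };

static pthread_mutex_t pcidevs_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t irq_desc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Step 1: poke the MSI capability record and build the descriptor.
 * Runs under pcidevs_lock only; the irq_desc lock is NOT held here. */
static struct msi_desc *build_msi_desc(struct pci_dev *pdev, int vector)
{
    struct msi_desc *desc = malloc(sizeof(*desc));
    if ( desc )
        desc->vector = vector;
    return desc;
}

/* Step 2: bind the descriptor to the interrupt, with the irq_desc lock
 * taken last, in line with the documented lock order. */
static void bind_msi_desc(struct pci_dev *pdev, struct msi_desc *desc)
{
    pthread_mutex_lock(&irq_desc_lock);
    /* ... program the vector, attach desc to the irq ... */
    pthread_mutex_unlock(&irq_desc_lock);
}

static int assign_msi(struct pci_dev *pdev, int vector)
{
    struct msi_desc *desc;

    pthread_mutex_lock(&pcidevs_lock);
    desc = build_msi_desc(pdev, vector);   /* no irq_desc lock here */
    if ( desc )
        bind_msi_desc(pdev, desc);         /* irq_desc lock taken last */
    pthread_mutex_unlock(&pcidevs_lock);
    return desc ? 0 : -1;
}
```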
> From: Espen Skoglund
> Sent: Wednesday, December 10, 2008 7:34 PM
>
> [Yunhong Jiang]
>> This patch tries to clean up these issues:
>> 1) The basic idea is to remove the pci_dev lock; instead, we use the big
>>    pcidevs_lock to protect all pci_dev state, including pci_dev add/remove
>>    and device assignment. We checked the code and there seems to be no
>>    performance-critical path here; a fine-grained lock would make the code
>>    very tricky.
>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>    msi_desc, without holding the irq_desc lock, and then it sets up the
>>    MSI through setup_msi_irq with the irq_desc lock held.
>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>> 4) Fix some minor issues.
>
>> Now the lock sequence is: pcidevs_lock -> domain's event_lock -> iommu's
>> lock -> hvm_iommu's mapping_lock. The irq_desc lock is always the last
>> lock to be held, for performance reasons.
>
> So what exactly is it that pcidevs_lock is supposed to "protect" now?
> Does it indicate that someone is holding a reference to a pci_dev?
> Does it indicate that someone will modify some pci_dev? Does it
> indicate that someone is messing around with interrupts or MSI
> descriptors?

I think it protects all of the above. As those operations are all rare, such
a big lock avoids complex lock/unlock sequences across the different paths to
the different resources of an assigned device.

>
> Regarding pci_enable_msi: Can you not get into race conditions with
> the split you're doing? pci_enable_msi() does after all poke around
> in the MSI capability record.
>
> 	eSk

If I read it correctly, the only thing done outside the lock is
masking/unmasking the MSI vector. MSI-X has a per-vector mask dword, and thus
no contention; multi-vector MSI is not supported yet. All other accesses to
the MSI/MSI-X capabilities are protected by that lock.

Thanks,
Kevin
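For context on the mask/unmask point, here is a sketch of why per-vector
masking differs between the two capabilities. The config/MMIO accessors are
hypothetical stand-ins; the register layout follows the PCI spec as I
understand it (one shared Mask Bits register for MSI, a per-entry Vector
Control dword for MSI-X).

```c
#include <stdint.h>

/* Hypothetical accessors for config space -- stand-ins for illustration. */
extern uint32_t cfg_read32(unsigned int reg);
extern void cfg_write32(unsigned int reg, uint32_t val);

/* MSI: a single Mask Bits register in the capability holds the mask bits
 * for all vectors of the function, so masking vector n is a read-modify-
 * write on state shared with every other vector.  With more than one vector
 * enabled, concurrent mask/unmask would need its own serialization -- the
 * contention mentioned above. */
static void msi_mask_vector(unsigned int mask_bits_reg, int n, int masked)
{
    uint32_t v = cfg_read32(mask_bits_reg);      /* shared by all vectors */
    v = masked ? (v | (1u << n)) : (v & ~(1u << n));
    cfg_write32(mask_bits_reg, v);
}

/* MSI-X: each table entry carries its own Vector Control dword whose low
 * bit is the mask, so masking one entry never touches another entry. */
static void msix_mask_entry(volatile uint32_t *vector_ctrl, int masked)
{
    uint32_t v = *vector_ctrl;
    *vector_ctrl = masked ? (v | 1u) : (v & ~1u);
}
```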
Tian, Kevin <> wrote:
>> From: Espen Skoglund
>> Sent: Wednesday, December 10, 2008 7:34 PM
>>
>> [Yunhong Jiang]
>>> This patch tries to clean up these issues:
>>> 1) The basic idea is to remove the pci_dev lock; instead, we use the big
>>>    pcidevs_lock to protect all pci_dev state, including pci_dev
>>>    add/remove and device assignment. We checked the code and there seems
>>>    to be no performance-critical path here; a fine-grained lock would
>>>    make the code very tricky.
>>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>>    msi_desc, without holding the irq_desc lock, and then it sets up the
>>>    MSI through setup_msi_irq with the irq_desc lock held.
>>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>>> 4) Fix some minor issues.
>>
>>> Now the lock sequence is: pcidevs_lock -> domain's event_lock -> iommu's
>>> lock -> hvm_iommu's mapping_lock. The irq_desc lock is always the last
>>> lock to be held, for performance reasons.
>>
>> So what exactly is it that pcidevs_lock is supposed to "protect" now?
>> Does it indicate that someone is holding a reference to a pci_dev?
>> Does it indicate that someone will modify some pci_dev? Does it
>> indicate that someone is messing around with interrupts or MSI
>> descriptors?
>
> I think it protects all of the above. As those operations are all rare,
> such a big lock avoids complex lock/unlock sequences across the different
> paths to the different resources of an assigned device.

Exactly. Maybe the name could be changed to something more meaningful, but
I'm really bad at naming.

>
>> Regarding pci_enable_msi: Can you not get into race conditions with
>> the split you're doing? pci_enable_msi() does after all poke around
>> in the MSI capability record.
>>
>> 	eSk
>
> If I read it correctly, the only thing done outside the lock is
> masking/unmasking the MSI vector. MSI-X has a per-vector mask dword, and
> thus no contention; multi-vector MSI is not supported yet. All other
> accesses to the MSI/MSI-X capabilities are protected by that lock.

Exactly. Maybe I should have stated this more clearly in my original mail.
That's also the reason we can't support multi-entry MSI very easily.

>
> Thanks,
> Kevin
[Kevin Tian]
>> From: Espen Skoglund
>> Sent: Wednesday, December 10, 2008 7:34 PM
>>
>> [Yunhong Jiang]
>>> This patch tries to clean up these issues:
>>> 1) The basic idea is to remove the pci_dev lock; instead, we use the big
>>>    pcidevs_lock to protect all pci_dev state, including pci_dev
>>>    add/remove and device assignment. We checked the code and there seems
>>>    to be no performance-critical path here; a fine-grained lock would
>>>    make the code very tricky.
>>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>>    msi_desc, without holding the irq_desc lock, and then it sets up the
>>>    MSI through setup_msi_irq with the irq_desc lock held.
>>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>>> 4) Fix some minor issues.
>>
>>> Now the lock sequence is: pcidevs_lock -> domain's event_lock -> iommu's
>>> lock -> hvm_iommu's mapping_lock. The irq_desc lock is always the last
>>> lock to be held, for performance reasons.
>>
>> So what exactly is it that pcidevs_lock is supposed to "protect" now?
>> Does it indicate that someone is holding a reference to a pci_dev?
>> Does it indicate that someone will modify some pci_dev? Does it
>> indicate that someone is messing around with interrupts or MSI
>> descriptors?

> I think it protects all of the above. As those operations are all rare,
> such a big lock avoids complex lock/unlock sequences across the different
> paths to the different resources of an assigned device.

Except, since it is used as a read_lock most of the time, it does not
actually protect things in the way a spinlock would.

	eSk
Espen Skoglund <mailto:espen.skoglund@netronome.com> wrote:
> [Kevin Tian]
>>> From: Espen Skoglund
>>> Sent: Wednesday, December 10, 2008 7:34 PM
>>>
>>> [Yunhong Jiang]
>>>> This patch tries to clean up these issues:
>>>> 1) The basic idea is to remove the pci_dev lock; instead, we use the
>>>>    big pcidevs_lock to protect all pci_dev state, including pci_dev
>>>>    add/remove and device assignment. We checked the code and there
>>>>    seems to be no performance-critical path here; a fine-grained lock
>>>>    would make the code very tricky.
>>>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>>>    msi_desc, without holding the irq_desc lock, and then it sets up the
>>>>    MSI through setup_msi_irq with the irq_desc lock held.
>>>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>>>> 4) Fix some minor issues.
>>>
>>>> Now the lock sequence is: pcidevs_lock -> domain's event_lock ->
>>>> iommu's lock -> hvm_iommu's mapping_lock. The irq_desc lock is always
>>>> the last lock to be held, for performance reasons.
>>>
>>> So what exactly is it that pcidevs_lock is supposed to "protect" now?
>>> Does it indicate that someone is holding a reference to a pci_dev?
>>> Does it indicate that someone will modify some pci_dev? Does it
>>> indicate that someone is messing around with interrupts or MSI
>>> descriptors?
>
>> I think it protects all of the above. As those operations are all rare,
>> such a big lock avoids complex lock/unlock sequences across the different
>> paths to the different resources of an assigned device.
>
> Except, since it is used as a read_lock most of the time, it does not
> actually protect things in the way a spinlock would.

Ah, yes, I didn't realize this. So what do you think about changing it to a
spinlock, since it is not performance critical? Or do you know how much
benefit the read_lock gives?

Also, as for holding references to a pci_dev, do you plan to add such
support? For example, I'm not sure whether we can return failure from
pci_remove_device if there are still references to it. Will dom0 cope with
such a failure?

>
> 	eSk
On 10/12/2008 15:24, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Except, since it is used as a read_lock most of the time, it does not
>> actually protect things in the way a spinlock would.
>
> Ah, yes, I didn't realize this. So what do you think about changing it to
> a spinlock, since it is not performance critical? Or do you know how much
> benefit the read_lock gives?

A spinlock is actually faster unless the read sections are frequently
contended, in which case the lost parallelism counteracts the faster lock.

 -- Keir

> Also, as for holding references to a pci_dev, do you plan to add such
> support? For example, I'm not sure whether we can return failure from
> pci_remove_device if there are still references to it. Will dom0 cope with
> such a failure?
[Yunhong Jiang]
> Espen Skoglund <mailto:espen.skoglund@netronome.com> wrote:
>> [Kevin Tian]
>>>> From: Espen Skoglund
>>>> Sent: Wednesday, December 10, 2008 7:34 PM
>>>>
>>>> [Yunhong Jiang]
>>>>> This patch tries to clean up these issues:
>>>>> 1) The basic idea is to remove the pci_dev lock; instead, we use the
>>>>>    big pcidevs_lock to protect all pci_dev state, including pci_dev
>>>>>    add/remove and device assignment. We checked the code and there
>>>>>    seems to be no performance-critical path here; a fine-grained lock
>>>>>    would make the code very tricky.
>>>>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>>>>    msi_desc, without holding the irq_desc lock, and then it sets up
>>>>>    the MSI through setup_msi_irq with the irq_desc lock held.
>>>>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>>>>> 4) Fix some minor issues.
>>>>
>>>>> Now the lock sequence is: pcidevs_lock -> domain's event_lock ->
>>>>> iommu's lock -> hvm_iommu's mapping_lock. The irq_desc lock is always
>>>>> the last lock to be held, for performance reasons.
>>>>
>>>> So what exactly is it that pcidevs_lock is supposed to "protect" now?
>>>> Does it indicate that someone is holding a reference to a pci_dev?
>>>> Does it indicate that someone will modify some pci_dev? Does it
>>>> indicate that someone is messing around with interrupts or MSI
>>>> descriptors?
>>
>>> I think it protects all of the above. As those operations are all rare,
>>> such a big lock avoids complex lock/unlock sequences across the
>>> different paths to the different resources of an assigned device.
>>
>> Except, since it is used as a read_lock most of the time, it does not
>> actually protect things in the way a spinlock would.

> Ah, yes, I didn't realize this. So what do you think about changing it to
> a spinlock, since it is not performance critical? Or do you know how much
> benefit the read_lock gives?

It may not be performance critical, but I don't particularly like seeing big
global locks introduced. By the looks of it this lock would also protect
everything related to interrupt management.

> Also, as for holding references to a pci_dev, do you plan to add such
> support? For example, I'm not sure whether we can return failure from
> pci_remove_device if there are still references to it. Will dom0 cope with
> such a failure?

The reason I opted for a pdev->spinlock rather than a refcount was that
removing a pdev with a refcount is kind of messy. The pdev is not an
anonymous struct that can be released by, e.g., an RCU-like mechanism. It is
intrinsically tied to a BDF, and we cannot create a new pdev for that BDF
until the previous pdev struct has been completely purged.

The idea with pcidevs_lock was to only protect the pci_dev list itself. The
lock order of pcidevs_lock and pdev->lock only matters when write_lock is
involved. So, the following two combinations are fine:

    read_lock(pcidevs_lock) => spin_lock(pdev)
    spin_lock(pdev) => read_lock(pcidevs_lock)

The following combination is not:

    read_lock(pcidevs_lock) => spin_lock(pdev)
    spin_lock(pdev) => write_lock(pcidevs_lock)

There is only one place in Xen where the last case occurs (the location you
identified in reassign_device_ownership). However, this code can easily be
fixed so that it is deadlock free.
The idea behind pdev->lock was to protect against concurrent updates and uses
relating to a particular pdev --- including assignment to domains, IOMMU
assignment, and MSI assignment. My guess is that where things broke down was
where the MSI-related pdev->lock locking interfered with the other
interrupt-related locking. I don't have a clear enough picture of the
interrupt-related locks to pinpoint the problem more exactly.

	eSk
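For readers following the ordering argument above, here is a minimal,
self-contained sketch (plain pthreads, with simplified names; not Xen code)
of why the second combination can deadlock:

```c
#include <pthread.h>

/* Simplified stand-ins: pcidevs_lock as a reader/writer lock, pdev->lock as
 * a plain mutex standing in for a spinlock. */
static pthread_rwlock_t pcidevs_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  pdev_lock    = PTHREAD_MUTEX_INITIALIZER;

/* CPU A: read_lock(pcidevs_lock) => spin_lock(pdev) */
static void *cpu_a(void *arg)
{
    pthread_rwlock_rdlock(&pcidevs_lock);
    pthread_mutex_lock(&pdev_lock);        /* blocks: B holds pdev_lock */
    /* ... */
    pthread_mutex_unlock(&pdev_lock);
    pthread_rwlock_unlock(&pcidevs_lock);
    return NULL;
}

/* CPU B: spin_lock(pdev) => write_lock(pcidevs_lock) */
static void *cpu_b(void *arg)
{
    pthread_mutex_lock(&pdev_lock);
    pthread_rwlock_wrlock(&pcidevs_lock);  /* blocks: A holds the read lock */
    /* ... */
    pthread_rwlock_unlock(&pcidevs_lock);
    pthread_mutex_unlock(&pdev_lock);
    return NULL;
}

/* If A acquires the read lock and B acquires pdev_lock before either takes
 * its second lock, A waits for B (pdev_lock) while B waits for A (the write
 * lock), and neither can make progress. */
```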
[Espen Skoglund]
>> Also, as for holding references to a pci_dev, do you plan to add such
>> support? For example, I'm not sure whether we can return failure from
>> pci_remove_device if there are still references to it. Will dom0 cope
>> with such a failure?

> The reason I opted for a pdev->spinlock rather than a refcount was that
> removing a pdev with a refcount is kind of messy. The pdev is not an
> anonymous struct that can be released by, e.g., an RCU-like mechanism. It
> is intrinsically tied to a BDF, and we cannot create a new pdev for that
> BDF until the previous pdev struct has been completely purged.

> [...me rambling on...]

Forget what I just said about rw_lock ordering. I forgot about the
write_lock on list insert/delete.

	eSk
On 10/12/2008 17:01, "Espen Skoglund" <espen.skoglund@netronome.com> wrote:

>> Ah, yes, I didn't realize this. So what do you think about changing it to
>> a spinlock, since it is not performance critical? Or do you know how much
>> benefit the read_lock gives?
>
> It may not be performance critical, but I don't particularly like seeing
> big global locks introduced. By the looks of it this lock would also
> protect everything related to interrupt management.

A global lock on control-plane operations such as add/remove/assign seems
okay to me. On data-path operations such as interrupt handlers, obviously
not.

 -- Keir
xen-devel-bounces@lists.xensource.com <> wrote:
> [Yunhong Jiang]
>> Espen Skoglund <mailto:espen.skoglund@netronome.com> wrote:
>>> [Kevin Tian]
>>>>> From: Espen Skoglund
>>>>> Sent: Wednesday, December 10, 2008 7:34 PM
>>>>>
>>>>> [Yunhong Jiang]
>>>>>> This patch tries to clean up these issues:
>>>>>> 1) The basic idea is to remove the pci_dev lock; instead, we use the
>>>>>>    big pcidevs_lock to protect all pci_dev state, including pci_dev
>>>>>>    add/remove and device assignment. We checked the code and there
>>>>>>    seems to be no performance-critical path here; a fine-grained lock
>>>>>>    would make the code very tricky.
>>>>>> 2) Split pci_enable_msi into two steps: first it only constructs the
>>>>>>    msi_desc, without holding the irq_desc lock, and then it sets up
>>>>>>    the MSI through setup_msi_irq with the irq_desc lock held.
>>>>>> 3) Change iommu->lock and hd->mapping_lock to be irqsave.
>>>>>> 4) Fix some minor issues.
>>>>>
>>>>>> Now the lock sequence is: pcidevs_lock -> domain's event_lock ->
>>>>>> iommu's lock -> hvm_iommu's mapping_lock. The irq_desc lock is always
>>>>>> the last lock to be held, for performance reasons.
>>>>>
>>>>> So what exactly is it that pcidevs_lock is supposed to "protect" now?
>>>>> Does it indicate that someone is holding a reference to a pci_dev?
>>>>> Does it indicate that someone will modify some pci_dev? Does it
>>>>> indicate that someone is messing around with interrupts or MSI
>>>>> descriptors?
>>>
>>>> I think it protects all of the above. As those operations are all rare,
>>>> such a big lock avoids complex lock/unlock sequences across the
>>>> different paths to the different resources of an assigned device.
>>>
>>> Except, since it is used as a read_lock most of the time, it does not
>>> actually protect things in the way a spinlock would.
>
>> Ah, yes, I didn't realize this. So what do you think about changing it to
>> a spinlock, since it is not performance critical? Or do you know how much
>> benefit the read_lock gives?
>
> It may not be performance critical, but I don't particularly like seeing
> big global locks introduced. By the looks of it this lock would also
> protect everything related to interrupt management.

The interrupt path should be protected by the irq_desc lock. The state used
in interrupt context, i.e. the mask register and the message (data/address)
registers, is always protected by the irq_desc lock; everything else is not
touched by the ISR, so it is interrupt-safe and can be protected by
pcidevs_lock. Or did you notice anything in the code that violates this
rule?

>
>> Also, as for holding references to a pci_dev, do you plan to add such
>> support? For example, I'm not sure whether we can return failure from
>> pci_remove_device if there are still references to it. Will dom0 cope
>> with such a failure?
>
> The reason I opted for a pdev->spinlock rather than a refcount was that
> removing a pdev with a refcount is kind of messy. The pdev is not an
> anonymous struct that can be released by, e.g., an RCU-like mechanism. It
> is intrinsically tied to a BDF, and we cannot create a new pdev for that
> BDF until the previous pdev struct has been completely purged.
>
> The idea with pcidevs_lock was to only protect the pci_dev list itself.
> The lock order of pcidevs_lock and pdev->lock only matters when write_lock
> is involved.
> So, the following two combinations are fine:
>
>     read_lock(pcidevs_lock) => spin_lock(pdev)
>     spin_lock(pdev) => read_lock(pcidevs_lock)
>
> The following combination is not:
>
>     read_lock(pcidevs_lock) => spin_lock(pdev)
>     spin_lock(pdev) => write_lock(pcidevs_lock)
>
> There is only one place in Xen where the last case occurs (the location
> you identified in reassign_device_ownership). However, this code can
> easily be fixed so that it is deadlock free.

As you stated in another mail, more deadlocks may exist with pcidevs_lock.

> The idea behind pdev->lock was to protect against concurrent updates and
> uses relating to a particular pdev --- including assignment to domains,
> IOMMU assignment, and MSI assignment. My guess is that where things broke
> down was where the MSI-related pdev->lock locking interfered with the
> other interrupt-related locking. I don't have a clear enough picture of
> the interrupt-related locks to pinpoint the problem more exactly.

I suspect it is more about the MSI-related pdev->lock locking. Here is how
the information is protected currently:
1) the device list, i.e. alldevs_list, which is currently protected by
   pcidevs_lock;
2) device assignment (which domain owns the device), which is protected by
   pci_dev->lock, if I understand correctly;
3) IOMMU assignment, i.e. domain->arch.pdev_list, which is currently
   protected by pcidevs_lock;
4) MSI assignment, i.e. pci_dev->msi_list, which is protected by
   pci_dev->lock.

I think the issue is in items 3/4. For item 3, because the IOMMU assignment
is protected by pcidevs_lock, it causes the reassign_device_ownership
deadlock. If you do want a fine-grained lock, maybe you can consider a
per-domain lock like domain->arch.pdev_lock :-)

For item 4, I think the issue comes from multi-vector MSI support.

Thanks
Yunhong Jiang

>
> 	eSk
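To make the four items above easier to follow, here is a schematic C sketch
of the per-device state and the lock guarding each piece. The field and type
names are simplified stand-ins, not the actual Xen definitions.

```c
/* Schematic only: simplified stand-in types, not Xen's real declarations. */
struct list_head { struct list_head *next, *prev; };
struct domain;

struct pci_dev_sketch {
    /* (1) membership in the global device list (alldevs_list):
     *     protected by pcidevs_lock. */
    struct list_head alldevs_list;

    /* (2) device assignment, i.e. which domain currently owns the device:
     *     protected by pci_dev->lock today, by pcidevs_lock in the patch. */
    struct domain *owner;

    /* (3) membership in the owning domain's device list
     *     (domain->arch.pdev_list): protected by pcidevs_lock today; a
     *     per-domain lock such as domain->arch.pdev_lock is the
     *     finer-grained alternative suggested above. */
    struct list_head domain_list;

    /* (4) the list of msi_desc structures for the device:
     *     protected by pci_dev->lock today, by pcidevs_lock in the patch.
     *     The fields the ISR actually touches (mask bit, message
     *     address/data) are serialized by the irq_desc lock, taken last. */
    struct list_head msi_list;
};
```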