On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <mst at redhat.com> wrote:
> >
> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > > I have no problem with that. For example, can we teach
> > > > the DMA API on Intel x86 to use PT for virtio by default?
> > > > That would allow merging Andy's patches with
> > > > full compatibility with old guests and hosts.
> > >
> > > Well, the only incompatibility comes from an experimental qemu feature,
> > > more precisely from a bug in that feature's implementation. So why
> > > should we work around that in the kernel? I think it is not too hard to
> > > fix qemu to generate a correct DMAR table which excludes the virtio
> > > devices from iommu translation.
> > >
> > >
> > > 	Joerg
> >
> > It's not that easy - you'd have to dedicate some buses
> > for iommu bypass, and teach management tools to only put
> > virtio there - but it's possible.
> >
> > This will absolutely address guests that don't need to set up an IOMMU
> > for virtio devices, and virtio that bypasses the IOMMU.
> >
> > But the problem is that we do want to *allow* guests
> > to set up an IOMMU for virtio devices.
> > In that case, there are two other use cases:
> >
> > A- monolithic virtio within QEMU:
> >    iommu only needed for VFIO ->
> >    guest should always use iommu=pt
> >    iommu=on works but is just useless overhead.
> >
> > B- modular out-of-process virtio outside QEMU:
> >    iommu needed for VFIO or a kernel driver ->
> >    guest should use iommu=pt or iommu=on
> >    depending on security/performance requirements
> >
> > Note that there could easily be a mix of these in the same system.
> >
> > So for these cases we do need QEMU to specify to the guest that the
> > IOMMU covers the virtio devices. Also, once one does this, the default
> > on Linux is iommu=on and not pt, which works but ATM is very slow.
> >
> > This poses three problems:
> >
> > 1. How do we address the different needs of A and B?
> >    One way would be for virtio to pass the information to the guest
> >    in some virtio-specific way, and have drivers
> >    specify what kind of DMA access they want.
> >
> > 2. (Kind of a subset of 1) Once we do allow the IOMMU, how do we make
> >    sure most guests use the more sensible iommu=pt?
> >
> > 3. Once we do allow the IOMMU, how can we keep existing guests working
> >    in this configuration?
> >    Creating different hypervisor configurations depending on the guest
> >    is very nasty.
> >    Again, one way would be some virtio-specific interface.
> >
> > I'd rather we figured out the answers to this before merging Andy's
> > patches, because I'm concerned that instead of one broken configuration
> > (virtio always bypasses the IOMMU) we'll get two bad configurations
> > (in the second one, virtio uses the slow default with no
> > gain in security).
> >
> > Suggestions welcome.
>
> I think there's still no downside to using my patches, even on x86.
>
> Old kernels on new QEMU work unless the IOMMU is enabled on the host. I
> think that's the best we can possibly do.
> New kernels work at full speed on old QEMU.

Only if the IOMMU is disabled, right?

> New kernels with new QEMU and the iommu enabled work slower. Even newer
> kernels with default passthrough work at full speed, and there's no
> obvious downside to the existence of kernels with just my patches.
>
> --Andy
>

I tried to explain the possible downside. Let me try again.
Imagine that the guest kernel notifies the hypervisor that it wants the
IOMMU to actually work. This would make an old kernel on new QEMU work
even with the IOMMU enabled on the host - better than "the best we can
do" that you described above. Specifically, QEMU would assume that if
it didn't get the notification, it's an old kernel, so it should ignore
the IOMMU.

But if we apply your patches, this trick won't work.

Without implementing it all, I think the easiest incremental step would
be to teach Linux to make passthrough the default when running as a
guest on top of QEMU, and put your patches on top of that. If someone
specifies non-passthrough on the command line it'll still be broken,
but not too bad.

-- 
MST
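A minimal sketch of the incremental step suggested above - making
passthrough the default when Linux detects it is running as a guest.
The hypervisor CPUID check and the reuse of the existing x86
iommu_pass_through knob are assumptions for illustration, not a patch
from this thread:

/*
 * Illustrative sketch only: default the x86 IOMMU code to passthrough
 * when running under a hypervisor, so that an unmodified guest behaves
 * as if iommu=pt had been given on the command line.  Whether reusing
 * the existing iommu_pass_through knob this way is acceptable is an
 * open question.
 */
#include <linux/init.h>
#include <asm/cpufeature.h>	/* boot_cpu_has(), X86_FEATURE_HYPERVISOR */
#include <asm/iommu.h>		/* iommu_pass_through, set today by iommu=pt */

static void __init default_iommu_to_pt_in_guest(void)
{
	/* CPUID.1:ECX bit 31 is set by KVM/QEMU and other hypervisors. */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		iommu_pass_through = 1;
}

An explicit iommu=on or iommu=pt on the guest command line would still
override whatever default is chosen here, which matches the "if someone
specifies non-passthrough it'll still be broken" caveat above.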
On Wed, Nov 11, 2015 at 2:05 AM, Michael S. Tsirkin <mst at redhat.com> wrote:
> On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
>> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <mst at redhat.com> wrote:
>> >
>> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
>> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
>> > > > I have no problem with that. For example, can we teach
>> > > > the DMA API on Intel x86 to use PT for virtio by default?
>> > > > That would allow merging Andy's patches with
>> > > > full compatibility with old guests and hosts.
>> > >
>> > > Well, the only incompatibility comes from an experimental qemu feature,
>> > > more precisely from a bug in that feature's implementation. So why
>> > > should we work around that in the kernel? I think it is not too hard to
>> > > fix qemu to generate a correct DMAR table which excludes the virtio
>> > > devices from iommu translation.
>> > >
>> > >
>> > > 	Joerg
>> >
>> > It's not that easy - you'd have to dedicate some buses
>> > for iommu bypass, and teach management tools to only put
>> > virtio there - but it's possible.
>> >
>> > This will absolutely address guests that don't need to set up an IOMMU
>> > for virtio devices, and virtio that bypasses the IOMMU.
>> >
>> > But the problem is that we do want to *allow* guests
>> > to set up an IOMMU for virtio devices.
>> > In that case, there are two other use cases:
>> >
>> > A- monolithic virtio within QEMU:
>> >    iommu only needed for VFIO ->
>> >    guest should always use iommu=pt
>> >    iommu=on works but is just useless overhead.
>> >
>> > B- modular out-of-process virtio outside QEMU:
>> >    iommu needed for VFIO or a kernel driver ->
>> >    guest should use iommu=pt or iommu=on
>> >    depending on security/performance requirements
>> >
>> > Note that there could easily be a mix of these in the same system.
>> >
>> > So for these cases we do need QEMU to specify to the guest that the
>> > IOMMU covers the virtio devices. Also, once one does this, the default
>> > on Linux is iommu=on and not pt, which works but ATM is very slow.
>> >
>> > This poses three problems:
>> >
>> > 1. How do we address the different needs of A and B?
>> >    One way would be for virtio to pass the information to the guest
>> >    in some virtio-specific way, and have drivers
>> >    specify what kind of DMA access they want.
>> >
>> > 2. (Kind of a subset of 1) Once we do allow the IOMMU, how do we make
>> >    sure most guests use the more sensible iommu=pt?
>> >
>> > 3. Once we do allow the IOMMU, how can we keep existing guests working
>> >    in this configuration?
>> >    Creating different hypervisor configurations depending on the guest
>> >    is very nasty.
>> >    Again, one way would be some virtio-specific interface.
>> >
>> > I'd rather we figured out the answers to this before merging Andy's
>> > patches, because I'm concerned that instead of one broken configuration
>> > (virtio always bypasses the IOMMU) we'll get two bad configurations
>> > (in the second one, virtio uses the slow default with no
>> > gain in security).
>> >
>> > Suggestions welcome.
>>
>> I think there's still no downside to using my patches, even on x86.
>>
>> Old kernels on new QEMU work unless the IOMMU is enabled on the host. I
>> think that's the best we can possibly do.
>> New kernels work at full speed on old QEMU.
>
> Only if the IOMMU is disabled, right?
>
>> New kernels with new QEMU and the iommu enabled work slower.
>> Even newer kernels with default passthrough work at full speed, and
>> there's no obvious downside to the existence of kernels with just my
>> patches.
>>
>> --Andy
>>
>
> I tried to explain the possible downside. Let me try again.
>
> Imagine that the guest kernel notifies the hypervisor that it wants the
> IOMMU to actually work. This would make an old kernel on new QEMU work
> even with the IOMMU enabled on the host - better than "the best we can
> do" that you described above. Specifically, QEMU would assume that if
> it didn't get the notification, it's an old kernel, so it should ignore
> the IOMMU.

Can you flesh out this trick?

On x86, IIUC, the IOMMU more-or-less defaults to passthrough. If the
kernel wants, it can switch it to a non-passthrough mode. My patches
cause the virtio driver to do exactly this, except that the host
implementation doesn't actually exist yet, so the patches will instead
have no particular effect.

On powerpc and sparc, we *already* screwed up. The host already tells
the guest that there's an IOMMU and that it's *enabled*, because those
platforms don't have selective IOMMU coverage the way that x86 does.
So we need to work around it.

I think that, if we want fancy virt-friendly IOMMU stuff like you're
talking about, then the right thing to do is to create a virtio bus
instead of pretending to be PCI. That bus could have a virtio IOMMU
and its own cross-platform enumeration mechanism for devices on the
bus, and everything would be peachy.

In the meantime, there are existing mechanisms by which every PCI
driver is supposed to notify the host/platform of how it intends to map
DMA memory, and virtio gets it wrong.

>
> But if we apply your patches, this trick won't work.
>

I still don't understand what trick. If we want virtio devices to be
assignable, then they should be translated through the IOMMU, and the
DMA API is the right interface for that.

> Without implementing it all, I think the easiest incremental step would
> be to teach Linux to make passthrough the default when running as a
> guest on top of QEMU, and put your patches on top of that. If someone
> specifies non-passthrough on the command line it'll still be broken,
> but not too bad.

Can powerpc and sparc do exact 1:1 passthrough for a given device? If
so, that might be a reasonable way forward. After all, if a new powerpc
kernel asks for exact passthrough (dma addr = phys addr with no offset
at all), then old QEMU will just ignore it and therefore accidentally
get it right.

Ben?

--Andy
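As a concrete sketch of what "use the DMA API" means for a virtio
buffer (illustrative only, not Andy's actual patches): the driver asks
the DMA API for a bus address and gets either the physical address (no
IOMMU, or passthrough) or a translated IOVA, with identical driver code
either way. The helper names below are made up for the example.

/*
 * Sketch of mapping one virtio buffer through the DMA API instead of
 * handing the device virt_to_phys(buf) directly.
 */
#include <linux/dma-mapping.h>

static int map_virtio_buf(struct device *dev, void *buf, size_t len,
			  dma_addr_t *handle)
{
	/*
	 * With no IOMMU (or iommu=pt) this is essentially free and returns
	 * the physical address; behind a real IOMMU it allocates and maps
	 * an IOVA.  Either way the driver puts *handle in the descriptor.
	 */
	*handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *handle))
		return -ENOMEM;
	return 0;
}

/* Matching teardown once the device has consumed the buffer. */
static void unmap_virtio_buf(struct device *dev, dma_addr_t handle, size_t len)
{
	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
}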
On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
>
> Can you flesh out this trick?
>
> On x86, IIUC, the IOMMU more-or-less defaults to passthrough. If the
> kernel wants, it can switch it to a non-passthrough mode. My patches
> cause the virtio driver to do exactly this, except that the host
> implementation doesn't actually exist yet, so the patches will instead
> have no particular effect.

At some level, yes -- we're compatible with a 1982 IBM PC, and thus the
IOMMU is entirely disabled at boot until the kernel turns it on --
except in TXT mode, where we abandon that compatibility.

But no, the virtio driver has *nothing* to do with switching the device
out of passthrough mode. It is either in passthrough mode, or it isn't.

If the VMM *doesn't* expose an IOMMU to the guest, obviously the
devices are in passthrough mode. If the guest kernel doesn't have IOMMU
support enabled, then obviously the devices are in passthrough mode.
And if the ACPI tables exposed to the guest kernel *tell* it that the
virtio devices are not actually behind the IOMMU (which qemu gets
wrong), then they'll be in passthrough mode.

If the IOMMU is exposed, and enabled, and telling the guest kernel that
it *does* cover the virtio devices, then those virtio devices will
*not* be in passthrough mode.

Your choosing to use the DMA API in the virtio device drivers, instead
of being buggy, has nothing to do with whether the device is actually
in passthrough mode or not. Whether it's in passthrough mode or not,
using the DMA API is technically the right thing to do -- because it
should either *do* the translation, or return a 1:1 mapped IOVA, as
appropriate.

> On powerpc and sparc, we *already* screwed up. The host already tells
> the guest that there's an IOMMU and that it's *enabled*, because those
> platforms don't have selective IOMMU coverage the way that x86 does.
> So we need to work around it.

No, we need it on x86 too, because once we fix the virtio device driver
bug and make it start using the DMA API, we start to trip up on the
qemu bug where it lies about which devices are covered by the IOMMU.

Of course, we still have that same qemu bug w.r.t. assigned devices,
which it *also* claims are behind its IOMMU when they're not...

> I think that, if we want fancy virt-friendly IOMMU stuff like you're
> talking about, then the right thing to do is to create a virtio bus
> instead of pretending to be PCI. That bus could have a virtio IOMMU
> and its own cross-platform enumeration mechanism for devices on the
> bus, and everything would be peachy.

That doesn't really help very much for the x86 case, where the problem
is compatibility with *existing* (arguably broken) qemu
implementations.

Having said that, if this were real hardware I'd just be blacklisting
it and saying "Another BIOS with broken DMAR tables --> IOMMU
completely disabled". So perhaps we should just do that.

> I still don't understand what trick. If we want virtio devices to be
> assignable, then they should be translated through the IOMMU, and the
> DMA API is the right interface for that.

The DMA API is the right interface *regardless* of whether there's
actual translation to be done. The device driver itself should not be
involved in any way with that decision.

When you want to access MMIO, you use ioremap() and writel() instead of
doing random crap for yourself. When you want DMA, you use the DMA API
to get a bus address for your device *even* if you expect there to be
no IOMMU and you expect it to precisely match the physical address. No
excuses.
-- 
dwmw2