On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
> > Can you flesh out this trick?
>
> On x86 IIUC the IOMMU more-or-less defaults to passthrough. If the
> kernel wants, it can switch it to a non-passthrough mode. My patches
> cause the virtio driver to do exactly this, except that the host
> implementation doesn't actually exist yet, so the patches will instead
> have no particular effect.

At some level, yes - we're compatible with a 1982 IBM PC and thus the
IOMMU is entirely disabled at boot until the kernel turns it on -
except in TXT mode where we abandon that compatibility.

But no, the virtio driver has *nothing* to do with switching the device
out of passthrough mode. It is either in passthrough mode, or it isn't.

If the VMM *doesn't* expose an IOMMU to the guest, obviously the
devices are in passthrough mode. If the guest kernel doesn't have IOMMU
support enabled, then obviously the devices are in passthrough mode.
And if the ACPI tables exposed to the guest kernel *tell* it that the
virtio devices are not actually behind the IOMMU (which qemu gets
wrong), then it'll be in passthrough mode.

If the IOMMU is exposed, and enabled, and telling the guest kernel that
it *does* cover the virtio devices, then those virtio devices will
*not* be in passthrough mode.

You choosing to use the DMA API in the virtio device drivers instead of
being buggy, has nothing to do with whether it's actually in
passthrough mode or not. Whether it's in passthrough mode or not, using
the DMA API is technically the right thing to do - because it should
either *do* the translation, or return a 1:1 mapped IOVA, as
appropriate.

> On powerpc and sparc, we *already* screwed up. The host already tells
> the guest that there's an IOMMU and that it's *enabled* because those
> platforms don't have selective IOMMU coverage the way that x86 does.
> So we need to work around it.

No, we need it on x86 too because once we fix the virtio device driver
bug and make it start using the DMA API, then we start to trip up on
the qemu bug where it lies about which devices are covered by the
IOMMU.

Of course, we still have that same qemu bug w.r.t. assigned devices,
which it *also* claims are behind its IOMMU when they're not...

> I think that, if we want fancy virt-friendly IOMMU stuff like you're
> talking about, then the right thing to do is to create a virtio bus
> instead of pretending to be PCI. That bus could have a virtio IOMMU
> and its own cross-platform enumeration mechanism for devices on the
> bus, and everything would be peachy.

That doesn't really help very much for the x86 case where the problem
is compatibility with *existing* (arguably broken) qemu
implementations.

Having said that, if this were real hardware I'd just be blacklisting
it and saying "Another BIOS with broken DMAR tables --> IOMMU
completely disabled". So perhaps we should just do that.

> I still don't understand what trick. If we want virtio devices to be
> assignable, then they should be translated through the IOMMU, and the
> DMA API is the right interface for that.

The DMA API is the right interface *regardless* of whether there's
actual translation to be done. The device driver itself should not be
involved in any way with that decision.

When you want to access MMIO, you use ioremap() and writel() instead of
doing random crap for yourself. When you want DMA, you use the DMA API
to get a bus address for your device *even* if you expect there to be
no IOMMU and you expect it to precisely match the physical address. No
excuses.

-- 
dwmw2
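
The contract described in the message above (the driver always asks the
DMA API for a bus address, and the platform alone decides whether that
address is translated or 1:1) can be sketched as a toy userspace model
in C. This is only an illustration under stated assumptions: every
toy_* name, the fixed IOVA base and the tiny mapping table are
hypothetical and are not kernel interfaces. In a real driver the
corresponding calls would be dma_map_single() and dma_mapping_error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model; none of these names are kernel APIs. */
typedef uint64_t toy_phys_addr_t;
typedef uint64_t toy_bus_addr_t;

#define TOY_PAGE_SHIFT   12
#define TOY_PAGE_MASK    ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_IOVA_BASE    0x80000000ULL   /* arbitrary, for illustration */
#define TOY_MAX_MAPPINGS 16
#define TOY_MAP_ERROR    UINT64_MAX

struct toy_device {
    bool behind_iommu;  /* what the platform tables say about coverage */
    bool passthrough;   /* IOMMU programmed 1:1 for this device */
    toy_phys_addr_t iova_to_phys[TOY_MAX_MAPPINGS]; /* page-table stand-in */
    size_t next_slot;
};

/*
 * Platform side: decides how the bus address is produced.
 * Either returns the physical address unchanged (1:1), or allocates
 * an IOVA page and records the translation.
 */
toy_bus_addr_t toy_dma_map(struct toy_device *dev, toy_phys_addr_t phys)
{
    assert(dev != NULL);

    if (!dev->behind_iommu || dev->passthrough)
        return phys;                    /* 1:1 mapped address */

    if (dev->next_slot >= TOY_MAX_MAPPINGS)
        return TOY_MAP_ERROR;           /* out of IOVA space */

    size_t slot = dev->next_slot++;
    dev->iova_to_phys[slot] = phys & ~TOY_PAGE_MASK;
    return TOY_IOVA_BASE + ((toy_bus_addr_t)slot << TOY_PAGE_SHIFT)
           + (phys & TOY_PAGE_MASK);
}

/*
 * Driver side: identical for every device. It never inspects
 * behind_iommu or passthrough; it just uses whatever address it gets.
 */
toy_bus_addr_t toy_driver_queue_buffer(struct toy_device *dev,
                                       toy_phys_addr_t buf)
{
    return toy_dma_map(dev, buf);
}
```

The driver-side function is the same in both configurations; only the
platform-side state differs, which is the separation the message argues
for.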

On Wed, Nov 11, 2015 at 11:30:27PM +0100, David Woodhouse wrote:
> On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
> > > Can you flesh out this trick?
> >
> > On x86 IIUC the IOMMU more-or-less defaults to passthrough. If the
> > kernel wants, it can switch it to a non-passthrough mode. My patches
> > cause the virtio driver to do exactly this, except that the host
> > implementation doesn't actually exist yet, so the patches will instead
> > have no particular effect.
>
> At some level, yes - we're compatible with a 1982 IBM PC and thus the
> IOMMU is entirely disabled at boot until the kernel turns it on -
> except in TXT mode where we abandon that compatibility.
>
> But no, the virtio driver has *nothing* to do with switching the device
> out of passthrough mode. It is either in passthrough mode, or it isn't.
>
> If the VMM *doesn't* expose an IOMMU to the guest, obviously the
> devices are in passthrough mode. If the guest kernel doesn't have IOMMU
> support enabled, then obviously the devices are in passthrough mode.
> And if the ACPI tables exposed to the guest kernel *tell* it that the
> virtio devices are not actually behind the IOMMU (which qemu gets
> wrong), then it'll be in passthrough mode.
>
> If the IOMMU is exposed, and enabled, and telling the guest kernel that
> it *does* cover the virtio devices, then those virtio devices will
> *not* be in passthrough mode.

This we need to fix. Because in most configurations if you are
using kernel drivers, then you don't want IOMMU with virtio,
but if you are using VFIO then you do.

Intel's IOMMU can be programmed to still do a kind of passthrough (1:1)
mapping; it's just a matter of doing this for virtio devices when
not using VFIO.

> You choosing to use the DMA API in the virtio device drivers instead of
> being buggy, has nothing to do with whether it's actually in
> passthrough mode or not. Whether it's in passthrough mode or not, using
> the DMA API is technically the right thing to do -
> because it should either *do* the translation, or return a 1:1 mapped
> IOVA, as appropriate.

Right but first we need to actually make DMA API do the right thing
at least on x86, ppc and arm.

> > On powerpc and sparc, we *already* screwed up. The host already tells
> > the guest that there's an IOMMU and that it's *enabled* because those
> > platforms don't have selective IOMMU coverage the way that x86 does.
> > So we need to work around it.
>
> No, we need it on x86 too because once we fix the virtio device driver
> bug and make it start using the DMA API, then we start to trip up on
> the qemu bug where it lies about which devices are covered by the
> IOMMU.
>
> Of course, we still have that same qemu bug w.r.t. assigned devices,
> which it *also* claims are behind its IOMMU when they're not...

I'm not worried about qemu bugs that much. I am interested in being
able to use both VFIO and kernel drivers with virtio devices with good
performance and without tweaking kernel parameters.

> > I think that, if we want fancy virt-friendly IOMMU stuff like you're
> > talking about, then the right thing to do is to create a virtio bus
> > instead of pretending to be PCI. That bus could have a virtio IOMMU
> > and its own cross-platform enumeration mechanism for devices on the
> > bus, and everything would be peachy.
>
> That doesn't really help very much for the x86 case where the problem
> is compatibility with *existing* (arguably broken) qemu
> implementations.
>
> Having said that, if this were real hardware I'd just be blacklisting
> it and saying "Another BIOS with broken DMAR tables --> IOMMU
> completely disabled". So perhaps we should just do that.

Yes, once there is new QEMU where virtio is covered by the IOMMU,
that would be one way to address existing QEMU bugs.

> > I still don't understand what trick. If we want virtio devices to be
> > assignable, then they should be translated through the IOMMU, and the
> > DMA API is the right interface for that.
>
> The DMA API is the right interface *regardless* of whether there's
> actual translation to be done. The device driver itself should not be
> involved in any way with that decision.

With virt, each device can have different privileges:
some are part of hypervisor so with a kernel driver
trying to get protection from them using an IOMMU which is also
part of hypervisor makes no sense
- but when using a
userspace driver then getting protection from the userspace
driver does make sense. Others are real devices so
getting protection from them makes some sense.

Which is which? It's easiest for the device driver itself to
gain that knowledge. Please note this is *not* the same
question as whether a specific device is covered by an IOMMU.

> When you want to access MMIO, you use ioremap() and writel() instead of
> doing random crap for yourself. When you want DMA, you use the DMA API
> to get a bus address for your device *even* if you expect there to be
> no IOMMU and you expect it to precisely match the physical address. No
> excuses.

No problem, but the fact remains that virtio does need
per-device control over whether it's passthrough or not.

Forget the bugs, that's not the issue - the issue is that it's sometimes
part of hypervisor and sometimes isn't.

We just can't say it's always not a part of hypervisor so you always
want maximum protection - that drops performance to the floor.

Linux doesn't seem to support that use case at the moment; if this is a
generic problem then we need to teach Linux to solve it, but if virtio
is unique in this requirement, then we should just keep doing virtio
specific things to solve it.

> -- 
> dwmw2
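
The per-device distinction argued for in the message above can be
written down as a small decision function. The sketch below only
encodes that argument as stated (hypervisor-implemented device plus
kernel driver: no protection wanted; userspace driver or real hardware:
protection wanted). The enum and function names are hypothetical, and
this is not an existing Linux policy. Where such a decision should live
is exactly what the thread disputes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification; not taken from any kernel header. */
enum toy_device_origin {
    TOY_DEV_HYPERVISOR_EMULATED,  /* device is part of the hypervisor */
    TOY_DEV_REAL_HARDWARE         /* physical device */
};

enum toy_driver_kind {
    TOY_DRIVER_KERNEL,            /* in-kernel driver */
    TOY_DRIVER_USERSPACE          /* e.g. a VFIO-based userspace driver */
};

/*
 * Returns true when IOMMU translation (protection) is wanted for the
 * given device/driver pair, false when a 1:1 passthrough mapping is
 * enough according to the argument in the message.
 */
bool toy_wants_iommu_translation(enum toy_device_origin origin,
                                 enum toy_driver_kind driver)
{
    assert(driver == TOY_DRIVER_KERNEL || driver == TOY_DRIVER_USERSPACE);

    /* A userspace driver is untrusted, so protect against it. */
    if (driver == TOY_DRIVER_USERSPACE)
        return true;

    /*
     * Kernel driver: protecting against a device that is itself part
     * of the hypervisor, with an IOMMU that is also part of the
     * hypervisor, gains nothing. A real device is a different matter.
     */
    return origin == TOY_DEV_REAL_HARDWARE;
}
```

Note that the function takes no "is this device covered by an IOMMU"
input, which matches the remark that the two questions are separate.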

On Thu, 2015-11-12 at 13:09 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 11, 2015 at 11:30:27PM +0100, David Woodhouse wrote:
> >
> > If the IOMMU is exposed, and enabled, and telling the guest kernel that
> > it *does* cover the virtio devices, then those virtio devices will
> > *not* be in passthrough mode.
>
> This we need to fix. Because in most configurations if you are
> using kernel drivers, then you don't want IOMMU with virtio,
> but if you are using VFIO then you do.

This is *absolutely* not specific to virtio. There are *plenty* of
other users (especially networking) where we only really care about the
existence of the IOMMU for VFIO purposes and assigning devices to
guests, and we are willing to dispense with the protection that it
offers for native in-kernel drivers. For that, boot with iommu=pt.

There is no way, currently, to enable the passthrough mode on a
per-device basis. Although it has been discussed right here, very
recently.

Let's not conflate those issues.

> > You choosing to use the DMA API in the virtio device drivers instead of
> > being buggy, has nothing to do with whether it's actually in
> > passthrough mode or not. Whether it's in passthrough mode or not, using
> > the DMA API is technically the right thing to do - because it should
> > either *do* the translation, or return a 1:1 mapped IOVA, as
> > appropriate.
>
> Right but first we need to actually make DMA API do the right thing
> at least on x86, ppc and arm.

It already does the right thing on x86, modulo BIOS bugs (including the
qemu ACPI table but that you said you're not too worried about).

> I'm not worried about qemu bugs that much. I am interested in being
> able to use both VFIO and kernel drivers with virtio devices with good
> performance and without tweaking kernel parameters.

OK, then you are interested in the semi-orthogonal discussion about
DMA_ATTR_IOMMU_BYPASS.
Either way, device drivers SHALL use the DMA API.

> > Having said that, if this were real hardware I'd just be blacklisting
> > it and saying "Another BIOS with broken DMAR tables --> IOMMU
> > completely disabled". So perhaps we should just do that.
>
> Yes, once there is new QEMU where virtio is covered by the IOMMU,
> that would be one way to address existing QEMU bugs.

No, that's not required. All that's required is to fix the
currently-broken ACPI table so that it *admits* that the virtio devices
aren't covered by the IOMMU. And I've never waited for a fix to be
available before blacklisting *other* broken firmwares...

The only reason I'm holding off for now is because ARM and PPC also
need a quirk for their platform code to realise that certain devices
actually *aren't* covered by the IOMMU, and I might be able to just use
the same thing and still enable the IOMMU in the offending qemu
versions.

Although as noted, it would need to cover assigned devices as well as
virtio - qemu currently lies to us and tells us that the emulated IOMMU
in the guest does cover *those* too.

> With virt, each device can have different privileges:
> some are part of hypervisor so with a kernel driver
> trying to get protection from them using an IOMMU which is also
> part of hypervisor makes no sense
> - but when using a
> userspace driver then getting protection from the userspace
> driver does make sense. Others are real devices so
> getting protection from them makes some sense.
>
> Which is which? It's easiest for the device driver itself to
> gain that knowledge. Please note this is *not* the same
> question as whether a specific device is covered by an IOMMU.

OK.
How does your device driver know whether the virtio PCI device it's
talking to is actually implemented by the hypervisor, or whether it's
one of the real PCI implementations that apparently exist?

> Linux doesn't seem to support that usecase at the moment, if this is a
> generic problem then we need to teach Linux to solve it, but if virtio
> is unique in this requirement, then we should just keep doing virtio
> specific things to solve it.

It is a generic problem. There is a discussion elsewhere about how (or
indeed whether) to solve it. It absolutely isn't virtio-specific, and
we absolutely shouldn't be doing virtio-specific things to solve it.

Nothing excuses just eschewing the correct DMA API. That's just broken,
and only ever worked in conjunction with *other* bugs elsewhere in the
platform.

-- 
dwmw2
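
The two ways of coping with firmware tables that misreport IOMMU
coverage, as discussed in the message above (blacklist and disable the
IOMMU entirely, or apply a quirk so that only the misreported devices
are treated as uncovered), can be contrasted in a short C sketch. All
names here are hypothetical, and the notion of a "known broken" flag is
an assumption made for illustration. It does not describe how any
kernel release actually detects or handles such tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one device as described by firmware tables. */
struct toy_fw_device {
    bool fw_claims_iommu;  /* what the (possibly wrong) table says */
    bool is_virtio;        /* emulated virtio device */
    bool is_assigned;      /* host device assigned to the guest */
};

enum toy_policy {
    TOY_DISABLE_IOMMU,     /* blacklist: IOMMU completely disabled */
    TOY_QUIRK_DEVICES      /* keep IOMMU, override misreported devices */
};

/*
 * Returns whether the guest kernel should treat dev as translated by
 * the IOMMU, given whether the firmware is known to be broken and
 * which workaround policy is chosen.
 */
bool toy_device_translated(const struct toy_fw_device *dev,
                           bool fw_known_broken,
                           enum toy_policy policy)
{
    assert(dev != NULL);

    if (!fw_known_broken)
        return dev->fw_claims_iommu;   /* trust the table as-is */

    if (policy == TOY_DISABLE_IOMMU)
        return false;                  /* nobody is translated */

    /* Quirk: the table is wrong about virtio and assigned devices. */
    if (dev->is_virtio || dev->is_assigned)
        return false;

    return dev->fw_claims_iommu;
}
```

Under the quirk policy other emulated devices keep their IOMMU
coverage, which is the advantage over blacklisting that the message
mentions.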