Michael S. Tsirkin
2018-May-31 17:43 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Thu, May 31, 2018 at 09:09:24AM +0530, Anshuman Khandual wrote:
> On 05/24/2018 12:51 PM, Ram Pai wrote:
> > On Wed, May 23, 2018 at 09:50:02PM +0300, Michael S. Tsirkin wrote:
> > > subj: s/virito/virtio/
> >
> > ..snip..
> >
> > > > machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> > > > +
> > > > +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > +{
> > > > +	/*
> > > > +	 * On protected guest platforms, force virtio core to use DMA
> > > > +	 * MAP API for all virtio devices. But there can also be some
> > > > +	 * exceptions for individual devices like virtio balloon.
> > > > +	 */
> > > > +	return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> > > > +}
> > >
> > > Isn't this kind of slow? vring_use_dma_api is on the
> > > data path and is supposed to be very fast.
> >
> > Yes, it is slow and not ideal. This won't be the final code. The final
> > code will cache the information in a global variable and use it
> > in this function.
>
> Right, this will be a simple check based on a global variable.

Pls work on a long term solution. Short term needs can be served by
enabling the iommu platform in qemu.

-- MST
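A minimal sketch of the "cache it in a global variable" approach mentioned
above, assuming a pseries boot-time initcall; all identifiers and the
initcall level here are illustrative assumptions, not the actual follow-up
patch:

#include <linux/cache.h>
#include <linux/of.h>
#include <linux/virtio.h>
#include <asm/machdep.h>

/* Set once at boot; only read on the virtio data path. */
static bool ultravisor_platform __read_mostly;

static int __init detect_ultravisor(void)
{
	struct device_node *np;

	np = of_find_compatible_node(NULL, NULL, "ibm,ultravisor");
	if (np) {
		ultravisor_platform = true;
		of_node_put(np);
	}
	return 0;
}
machine_arch_initcall(pseries, detect_ultravisor);

bool platform_forces_virtio_dma(struct virtio_device *vdev)
{
	/* Plain flag test: no device-tree walk on the hot path. */
	return ultravisor_platform;
}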
Christoph Hellwig
2018-Jun-07 05:23 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> Pls work on a long term solution. Short term needs can be served by
> enabling the iommu platform in qemu.

So, I spent some time looking at converting virtio to dma ops overrides,
and the current virtio spec, and the sad truth I have to tell is that
both the spec and the Linux implementation are completely and utterly
fucked up.

Both in the flag naming and the implementation there is an implication
that DMA API == IOMMU, which is fundamentally wrong.

The DMA API does a few different things:

 a) address translation

	This does include IOMMUs. But it also includes random offsets
	between PCI bars and system memory that we see on various
	platforms.  Worse, some of these offsets might be based on
	banks, e.g. on the broadcom bmips platform.  It also deals
	with bitmasks in physical addresses related to memory encryption
	like AMD SEV.  I'd be really curious how, for example, the
	Intel virtio based NIC is going to work on any of those
	platforms.

 b) coherency

	On many architectures DMA is not cache coherent, and we need
	to invalidate and/or write back cache lines before doing
	DMA.  Again, I wonder how this is ever going to work with
	hardware based virtio implementations.  Even worse, I think this
	is actually broken at least for VIVT caches even for virtualized
	implementations.  E.g. a KVM guest is going to access memory
	using different virtual addresses than qemu, and vhost might throw
	in yet another address space.

 c) bounce buffering

	Many DMA implementations cannot address all physical memory
	due to addressing limitations.  In such cases we copy the
	DMA memory into a known addressable bounce buffer and DMA
	from there.

 d) flushing write combining buffers or similar

	On some hardware platforms we need workarounds, e.g. reading
	from a certain mmio address to make sure DMA can actually
	see memory written by the host.

All of this is bypassed by virtio by default despite generally being
platform issues, not particular to a given device.
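For reference, the bypass in question is vring_use_dma_api() in
drivers/virtio/virtio_ring.c; a simplified paraphrase of its logic at the
time (not the verbatim source, details approximate) looks like this:

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <xen/xen.h>

/* Simplified paraphrase of the current check. */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
	/* VIRTIO_F_IOMMU_PLATFORM negotiated: behave like a real device. */
	if (!virtio_has_iommu_quirk(vdev))
		return true;

	/* Xen grant mappings always need the DMA API. */
	if (xen_domain())
		return true;

	/* Default: skip the DMA API and hand qemu guest physical addresses. */
	return false;
}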
Michael S. Tsirkin
2018-Jun-07 16:28 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > Pls work on a long term solution. Short term needs can be served by
> > enabling the iommu platform in qemu.
>
> So, I spent some time looking at converting virtio to dma ops overrides,
> and the current virtio spec, and the sad truth I have to tell is that
> both the spec and the Linux implementation are completely and utterly
> fucked up.

Let me restate it: the DMA API has support for a wide range of hardware,
and hardware based virtio implementations likely won't benefit from all
of it.

And given that virtio right now is optimized for specific workloads,
improving portability without regressing performance isn't easy.

I think it's unsurprising since it started as strictly a guest/host
mechanism.  People did implement offloads on specific platforms though,
and they are known to work.  To improve portability even further,
we might need to make spec and code changes.

I'm not really sympathetic to people complaining that they can't even
set a flag in qemu though.  If that's the case the stack in question is
way too inflexible.

> Both in the flag naming and the implementation there is an implication
> that DMA API == IOMMU, which is fundamentally wrong.

Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.

It's possible that some setups will benefit from a more fine-grained
approach where some aspects of the DMA API are bypassed and others
aren't.

This seems to be what was being asked for in this thread, with comments
claiming the IOMMU flag adds too much overhead.

> The DMA API does a few different things:
>
>  a) address translation
>
> 	This does include IOMMUs. But it also includes random offsets
> 	between PCI bars and system memory that we see on various
> 	platforms.

I don't think you mean BARs.  That's unrelated to DMA.

> 	Worse, some of these offsets might be based on
> 	banks, e.g. on the broadcom bmips platform.  It also deals
> 	with bitmasks in physical addresses related to memory encryption
> 	like AMD SEV.  I'd be really curious how, for example, the
> 	Intel virtio based NIC is going to work on any of those
> 	platforms.

The SEV guys report that they just set the iommu flag and then it all
works.  I guess if there's translation we can think of this as a kind of
iommu.  Maybe we should rename PLATFORM_IOMMU to PLATFORM_TRANSLATION?

And apparently some people complain that just setting that flag makes
qemu check translation on each access, with an unacceptable performance
overhead.  Forcing the same behaviour on everyone on general principles,
even without the flag, is unlikely to make them happy.

>  b) coherency
>
> 	On many architectures DMA is not cache coherent, and we need
> 	to invalidate and/or write back cache lines before doing
> 	DMA.  Again, I wonder how this is ever going to work with
> 	hardware based virtio implementations.

You mean dma_Xmb and friends?  There's a new feature, VIRTIO_F_IO_BARRIER,
that's being proposed for that.

> 	Even worse, I think this
> 	is actually broken at least for VIVT caches even for virtualized
> 	implementations.  E.g. a KVM guest is going to access memory
> 	using different virtual addresses than qemu, and vhost might throw
> 	in yet another address space.

I don't really know what VIVT is.  Could you help me please?

>  c) bounce buffering
>
> 	Many DMA implementations cannot address all physical memory
> 	due to addressing limitations.  In such cases we copy the
> 	DMA memory into a known addressable bounce buffer and DMA
> 	from there.

Don't do it then?

>  d) flushing write combining buffers or similar
>
> 	On some hardware platforms we need workarounds, e.g. reading
> 	from a certain mmio address to make sure DMA can actually
> 	see memory written by the host.

I guess it isn't an issue as long as WC isn't actually used.  It will
become an issue when the virtio spec adds some WC capability -
I suspect we can ignore this for now.

> All of this is bypassed by virtio by default despite generally being
> platform issues, not particular to a given device.

It's both a device and a platform issue.  A PV device is often more like
another CPU than like a PCI device.

-- MST
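To be concrete, "setting that flag" on the qemu side means marking the
device as requiring platform translation, for example something like the
line below; iommu_platform is a virtio 1.0 feature bit, so it is typically
combined with a modern-only device (exact syntax depends on the qemu
version):

qemu-system-ppc64 ... \
    -device virtio-net-pci,netdev=net0,disable-legacy=on,iommu_platform=on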
Michael S. Tsirkin
2018-Jun-11 03:28 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Sun, Jun 10, 2018 at 07:39:09PM -0700, Ram Pai wrote:
> On Thu, Jun 07, 2018 at 07:28:35PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 06, 2018 at 10:23:06PM -0700, Christoph Hellwig wrote:
> > > On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> > > > Pls work on a long term solution. Short term needs can be served by
> > > > enabling the iommu platform in qemu.
> > >
> > > So, I spent some time looking at converting virtio to dma ops
> > > overrides, and the current virtio spec, and the sad truth I have to
> > > tell is that both the spec and the Linux implementation are
> > > completely and utterly fucked up.
> >
> > Let me restate it: the DMA API has support for a wide range of
> > hardware, and hardware based virtio implementations likely won't
> > benefit from all of it.
> >
> > ..snip..
> >
> > I'm not really sympathetic to people complaining that they can't even
> > set a flag in qemu though.  If that's the case the stack in question
> > is way too inflexible.
>
> We did consider your suggestion. But we can't see how it will work.
> Maybe you can guide us here.
>
> In our case qemu has absolutely no idea if the VM will switch itself to
> secure mode or not.  It's a dynamic decision made entirely by the VM
> through direct interaction with the hardware/firmware; no
> qemu/hypervisor involved.
>
> If the administrator who invokes qemu enables the flag, the DMA ops
> associated with the virtio devices will be called, and hence will be
> able to do the right things.  Yes, we might incur a performance hit due
> to the IOMMU translations, but let's ignore that for now; the
> functionality will work.  Good so far.
>
> However, if the administrator
> ignores/forgets/deliberately-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices.  Which means we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest we handle this case?

As step 1, ignore it as a user error.

Further, you can for example add per-device quirks in virtio so it can be
switched to the DMA API, and make extra decisions in platform code then.

> > > Both in the flag naming and the implementation there is an
> > > implication that DMA API == IOMMU, which is fundamentally wrong.
> >
> > Maybe we need to extend the meaning of PLATFORM_IOMMU or rename it.
> >
> > ..snip..
> >
> > > 	Worse, some of these offsets might be based on
> > > 	banks, e.g. on the broadcom bmips platform.  It also deals
> > > 	with bitmasks in physical addresses related to memory encryption
> > > 	like AMD SEV.  I'd be really curious how, for example, the
> > > 	Intel virtio based NIC is going to work on any of those
> > > 	platforms.
> >
> > The SEV guys report that they just set the iommu flag and then it all
> > works.
>
> This is one of the fundamental differences between the SEV architecture
> and the ultravisor architecture.  In SEV, qemu is aware of SEV.  In the
> ultravisor architecture, only the VM that runs within qemu is aware of
> the ultravisor; hypervisor/qemu/administrator are untrusted entities.

So one option is to teach qemu that it's on a platform with an
ultravisor; this might have more advantages.

> I hope we can make the virtio subsystem flexible enough to support
> various security paradigms.

So if you are worried about qemu attacking guests, I see more problems
than just passing an incorrect iommu flag.

> Apart from the above reason, Christoph and Ben point to so many other
> reasons to make it flexible. So why not make it happen?

I don't see a flexibility argument.  I just don't think new platforms
should use workarounds that we put in place for old ones.

..snip..
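A rough sketch of the "per-device quirks ... extra decisions in platform
code" direction; platform_forces_virtio_dma() is the hook from the RFC
being discussed, everything else below is illustrative only:

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <xen/xen.h>

/* Weak default so that all other platforms keep today's behaviour. */
bool __weak platform_forces_virtio_dma(struct virtio_device *vdev)
{
	return false;
}

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (!virtio_has_iommu_quirk(vdev))
		return true;

	/*
	 * Platform hook: e.g. a secure/ultravisor guest can demand the
	 * DMA API (and thus bounce buffering through shared pages) for
	 * this device even though qemu never set the IOMMU flag.
	 */
	if (platform_forces_virtio_dma(vdev))
		return true;

	if (xen_domain())
		return true;

	return false;
}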
Benjamin Herrenschmidt
2018-Jun-11 03:29 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
>
> However, if the administrator
> ignores/forgets/deliberately-decides/is-constrained to NOT enable the
> flag, virtio will not be able to pass control to the DMA ops associated
> with the virtio devices.  Which means we have no opportunity to share
> the I/O buffers with the hypervisor/qemu.
>
> How do you suggest we handle this case?

At the risk of repeating myself, let's just do the first pass, which is
to switch virtio over to always using the DMA API in the actual data
flow code, with a hook at initialization time that replaces the DMA ops
with some home-cooked "direct" ops in the case where the IOMMU flag
isn't set.

This will be equivalent to what we have today but avoids having 2
separate code paths all over the driver.

Then a second stage, I think, is to replace this "hook" so that the
architecture gets a say in the matter.

Basically, something like:

	arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode);

IE, virtio would tell the arch whether the "other side" is in fact QEMU
in a mode that bypasses the IOMMU and is cache coherent with the guest.
This is our legacy "qemu special" mode.  If the performance is
sufficient we may want to deprecate it over time and have qemu enable
the iommu by default, but we still need it.

A weak implementation of the above will be provided that just puts in
the direct ops when qemu_direct_mode is set, and thus provides today's
behaviour on any arch that doesn't override it.  If the flag is not set,
the ops are left to whatever the arch PCI layer already set.

This will provide the opportunity for architectures that want to do
something special, such as in our case, when we want to force even the
"qemu_direct_mode" to go via bounce buffers, to put our own ops in
there, while retaining the legacy behaviour otherwise.

It also means that the "gunk" is entirely localized in that one
function; the rest of virtio just uses the DMA API normally.

Christoph, are you actually hacking on "stage 1" above already, or
should we produce patches?

Cheers,
Ben.
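To make the shape of that hook concrete, a sketch under the naming used
above; virtio_direct_dma_ops and the probe-time caller are illustrative
assumptions, not existing code:

#include <linux/dma-mapping.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

extern const struct dma_map_ops virtio_direct_dma_ops;	/* hypothetical */

/*
 * Weak default: keep today's behaviour.  If the "other side" is qemu in
 * the legacy no-IOMMU mode, install home-cooked direct ops; otherwise
 * leave whatever the arch PCI layer already configured.
 */
void __weak arch_virtio_update_dma_ops(struct device *dev,
				       bool qemu_direct_mode)
{
	if (qemu_direct_mode)
		set_dma_ops(dev, &virtio_direct_dma_ops);
}

/* Called once by virtio core at probe time, before any ring is set up. */
static void virtio_setup_dma_ops(struct virtio_device *vdev)
{
	bool qemu_direct_mode =
		!virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM);

	arch_virtio_update_dma_ops(vdev->dev.parent, qemu_direct_mode);
}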
Benjamin Herrenschmidt
2018-Jun-11 03:34 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
> > However, if the administrator
> > ignores/forgets/deliberately-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops
> > associated with the virtio devices.  Which means we have no
> > opportunity to share the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest we handle this case?
>
> As step 1, ignore it as a user error.

Ugh ... not again. Ram, don't bring that subject back, we ALREADY
addressed it, and requiring the *user* to do special things is just
utterly and completely wrong.

The *user* has no bloody idea what that stuff is and will never know to
set whatever magic qemu flag etc...  The user will just start a VM
normally and expect things to work.  Requiring the *user* to know things
like that iommu virtio flag is complete nonsense.

If by "user" you mean libvirt, then you are now requesting about 4 or 5
different projects to be patched to add special cases for something they
know nothing about and that is completely irrelevant, while it can be
entirely addressed with a one-liner on the virtio kernel side to allow
the arch to plumb in alternate DMA ops.

So for some reason you seem to be dead set on a path that leads to a
mountain of user pain, changes to many different projects and overall
havoc, while there is a much, much simpler and more elegant solution at
hand, which I described (again) in the response to Ram I sent about 5
minutes ago.

> Further, you can for example add per-device quirks in virtio so it can
> be switched to the DMA API, and make extra decisions in platform code
> then.

..snip..
Christoph Hellwig
2018-Jun-13 07:41 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> At the risk of repeating myself, let's just do the first pass, which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home-cooked "direct" ops in the case where the IOMMU flag
> isn't set.
>
> This will be equivalent to what we have today but avoids having 2
> separate code paths all over the driver.
>
> Then a second stage, I think, is to replace this "hook" so that the
> architecture gets a say in the matter.

I don't think we can actually use dma_direct_ops.  It still allows
architectures to override parts of the dma setup, while virtio seems to
blindly assume phys == dma and no cache flushing.

I think the right way forward is to either add a new
VIRTIO_F_IS_PCI_DEVICE or redefine the existing iommu flag if deemed
possible, and then make sure recent qemu always sets it.
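In guest terms the idea would be roughly the following; the feature name
is the one floated above and the bit number is a placeholder, not an
allocated value:

#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Placeholder bit number, for illustration only. */
#define VIRTIO_F_IS_PCI_DEVICE		38

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	/*
	 * A device that claims to behave like real PCI hardware must go
	 * through the full DMA API: translation, cache maintenance,
	 * bounce buffering and so on.
	 */
	if (virtio_has_feature(vdev, VIRTIO_F_IS_PCI_DEVICE))
		return true;

	/* Legacy guest/host-only devices keep the historic bypass. */
	return !virtio_has_iommu_quirk(vdev);
}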
Michael S. Tsirkin
2018-Jun-13 14:03 UTC
[RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
> >
> > However, if the administrator
> > ignores/forgets/deliberately-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops
> > associated with the virtio devices.  Which means we have no
> > opportunity to share the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest we handle this case?
>
> At the risk of repeating myself, let's just do the first pass, which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home-cooked "direct" ops in the case where the IOMMU flag
> isn't set.

I'm not sure I understand all of the details and will have to see the
patch, but superficially it sounds good to me.

..snip..