thr3ads.net - Linux Virtualization - [PATCH RFC] fixup! virtio: convert to use DMA api [Apr 2016]

David Woodhouse

2016-Apr-18 15:51 UTC

[PATCH RFC] fixup! virtio: convert to use DMA api

On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > its own operating system's IOMMU code is expected to be broken, and
> > that the virtio driver should eschew the DMA API?
> 
> No - it tells guest that e.g. the ACPI tables (or whatever the
> equivalent is) do not match reality with respect to this device
> since IOMMU is ignored by hypervisor.
> Hypervisor has no idea what the guest IOMMU code does - hopefully
> it is not actually broken.

OK, that makes sense - thanks.

So where the platform *does* have a way to coherently tell the guest
that some devices are behind an IOMMU and some aren't, we should never
see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
the truth; qv.)
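
As an illustration only (this is not code from the patch under discussion), the rule above can be written as a small predicate. The function name and the `platform_describes_topology` input are hypothetical; the two booleans simply stand for whether the device offers VIRTIO_F_IOMMU_PASSTHROUGH and VIRTIO_F_IOMMU_PLATFORM as those bits are described in this thread.

```python
def combo_is_unexpected(passthrough: bool, platform: bool,
                        platform_describes_topology: bool) -> bool:
    """Return True for the feature combination that "should never" be seen.

    passthrough / platform: the device offers VIRTIO_F_IOMMU_PASSTHROUGH /
    VIRTIO_F_IOMMU_PLATFORM (semantics as discussed in this thread).
    platform_describes_topology: hypothetical input, True when the platform
    can coherently tell the guest which devices sit behind an IOMMU.
    """
    return platform_describes_topology and passthrough and not platform


# Enumerate all four combinations on a platform that can describe its topology.
for passthrough in (False, True):
    for platform in (False, True):
        unexpected = combo_is_unexpected(passthrough, platform, True)
        verdict = "unexpected" if unexpected else "ok"
        print(f"PASSTHROUGH={passthrough!s:5} PLATFORM={platform!s:5} -> {verdict}")
```

On a platform that cannot describe its topology the predicate never fires, which is the "crutch" case.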

This should *only* be a crutch for platforms which cannot properly
convey that information from the hypervisor to the guest. It should be
clearly documented "thou shalt not use this unless you've first
attempted to fix the broken platform to get it right for itself".

And if we look at it as such... does it make more sense for this to be
a more *generic* qemu<->guest interface? That way the software hacks can
live in the OS IOMMU code where they belong, and prevent assignment to
nested guests for example. And can cover cases like assigned PCI
devices in existing qemu/x86 which need the same treatment.

Put another way: if we're going to add code to the guest OS to look at
this information, why can't we add that code in the guest's IOMMU
support instead, to look at an out-of-band qemu-specific "ignore IOMMU
for these devices" list instead?

> The status quo is that the IOMMU might well be bypassed
> and then you need to program physical addresses into the device,
> but maybe not. If DMA API does not give you physical addresses, you
> need to bypass it, but hypervisor does not know or care.

Right. The status quo is that qemu doesn't provide correct information
about IOMMU topology to guests, and they have to have heuristics to
work out whether to eschew the IOMMU for a given device or not. This is
true for virtio and assigned PCI devices alike.

Furthermore, some platforms don't *have* a standard way for qemu to
'tell the truth' to the guests, and that's where the real fun comes
in.
But still, I'd like to see a generic solution for that lack instead of
a virtio-specific hack.

-- 
dwmw2

Michael S. Tsirkin

2016-Apr-18 16:27 UTC

[PATCH RFC] fixup! virtio: convert to use DMA api

On Mon, Apr 18, 2016 at 11:51:41AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> > > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > > its own operating system's IOMMU code is expected to be broken, and
> > > that the virtio driver should eschew the DMA API?
> > 
> > No - it tells guest that e.g. the ACPI tables (or whatever the
> > equivalent is) do not match reality with respect to this device
> > since IOMMU is ignored by hypervisor.
> > Hypervisor has no idea what the guest IOMMU code does - hopefully
> > it is not actually broken.
>
> OK, that makes sense - thanks.
> 
> So where the platform *does* have a way to coherently tell the guest
> that some devices are behind an IOMMU and some aren't, we should never
> see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
> perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
> the truth; qv.)
> 
> This should *only* be a crutch for platforms which cannot properly
> convey that information from the hypervisor to the guest. It should be
> clearly documented "thou shalt not use this unless you've first
> attempted to fix the broken platform to get it right for itself".
> 
> And if we look at it as such... does it make more sense for this to be
> a more *generic* qemu<->guest interface? That way the software hacks can
> live in the OS IOMMU code where they belong, and prevent assignment to
> nested guests for example. And can cover cases like assigned PCI
> devices in existing qemu/x86 which need the same treatment.
>
> Put another way: if we're going to add code to the guest OS to look at
> this information, why can't we add that code in the guest's IOMMU
> support instead, to look at an out-of-band qemu-specific "ignore IOMMU
> for these devices" list instead?

I balk at adding more hacks to a broken system. My goals are merely to
- make things work correctly with an IOMMU and new guests,
  so people can use userspace drivers with virtio devices
- prevent security risks when guest kernel mistakenly thinks
  it's protected by an IOMMU, but in fact isn't
- avoid breaking any working configurations

Looking at guest code, it looks like virtio was always
bypassing the IOMMU even if configured, but no other
guest driver did.

This makes me think the problem where guest drivers
ignore the IOMMU is virtio specific
and so a virtio specific solution seems cleaner.

The problem for assigned devices is IMHO different: they bypass
the guest IOMMU too but no guest driver knows about this,
so guests do not work. Seems cleaner to fix QEMU to make
existing guests work.

> > The status quo is that the IOMMU might well be bypassed
> > and then you need to program physical addresses into the device,
> > but maybe not. If DMA API does not give you physical addresses, you
> > need to bypass it, but hypervisor does not know or care.
> 
> Right. The status quo is that qemu doesn't provide correct information
> about IOMMU topology to guests, and they have to have heuristics to
> work out whether to eschew the IOMMU for a given device or not. This is
> true for virtio and assigned PCI devices alike.

True but I think we should fix QEMU to shadow IOMMU
page tables for assigned devices. This seems rather
possible with VT-d, and there are patches already on list.

It looks like this will fix all legacy guests which is
much nicer than what you suggest which will only help new guests.

> Furthermore, some platforms don't *have* a standard way for qemu to
> 'tell the truth' to the guests, and that's where the real fun comes in.
> But still, I'd like to see a generic solution for that lack instead of
> a virtio-specific hack.

But the issue is not just these holes.  E.g. with VT-d it is only easy
to emulate because there's a "caching mode" hook. It is fundamentally
paravirtualization.  So a completely generic solution would be a
paravirtualized IOMMU interface, replacing VT-d for VMs. It might be
justified if many platforms have hard to emulate interfaces.


> -- 
> dwmw2
> 
>

David Woodhouse

2016-Apr-18 18:29 UTC

[PATCH RFC] fixup! virtio: convert to use DMA api

On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:
> I balk at adding more hacks to a broken system. My goals are
> merely to
> - make things work correctly with an IOMMU and new guests,
>   so people can use userspace drivers with virtio devices
> - prevent security risks when guest kernel mistakenly thinks
>   it's protected by an IOMMU, but in fact isn't
> - avoid breaking any working configurations

AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
That's just an optimisation, for telling an OS "you don't really need
to bother with the IOMMU, even though it works".

There are two main reasons why an operating system might want to use
the IOMMU via the DMA API for native drivers:
 - To protect against driver bugs triggering rogue DMA.
 - To protect against hardware (or firmware) bugs.

With virtio, the first reason still exists. But the second is moot
because the device is part of the hypervisor and if the hypervisor is
untrustworthy then you're screwed anyway... but then again, in SoC
devices you could replace 'hypervisor' with 'chip' and the same is
true, isn't it? Is there *really* anything virtio-specific here?

Sure, I want my *external* network device on a PCIe card with software-
loadable firmware to be behind an IOMMU because I don't trust it as far
as I can throw it. But for on-SoC devices surely the situation is
*just* the same as devices provided by a hypervisor?

And some people want that external network device to use passthrough
anyway, for performance reasons.

On the whole, there are *plenty* of reasons why we might want to have a
passthrough mapping on a per-device basis, and I really struggle to
find justification for having this 'hint' in a virtio-specific way.

And it's complicating the discussion of the *actual* fix we're looking
at.

> Looking at guest code, it looks like virtio was always
> bypassing the IOMMU even if configured, but no other
> guest driver did.
> 
> This makes me think the problem where guest drivers
> ignore the IOMMU is virtio specific
> and so a virtio specific solution seems cleaner.
> 
> The problem for assigned devices is IMHO different: they bypass
> the guest IOMMU too but no guest driver knows about this,
> so guests do not work. Seems cleaner to fix QEMU to make
> existing guests work.

I certainly agree that it's better to fix QEMU. Whether devices are
behind an IOMMU or not, the DMAR tables we expose to a guest should
tell the truth.

Part of the issue here is virtio-specific; part isn't.

Basically, we have a conjunction of two separate bugs which happened to
work (for virtio) - the IOMMU support in QEMU wasn't working for virtio
(and assigned) devices even though it theoretically *should* have been,
and the virtio drivers weren't using the DMA API as they theoretically
should have been.

So there were corner cases like assigned PCI devices, and real hardware
implementations of virtio stuff (and perhaps virtio devices being
assigned to nested guests) which didn't work. But for the *common* use
case, one bug cancelled out the other.
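
To make the "two bugs cancel out" argument concrete, here is a toy model (an illustration, not kernel or QEMU code). It assumes the guest has an IOMMU enabled with non-identity mappings, so that addresses handed out by the DMA API differ from guest-physical addresses; the function and case names are made up for this sketch.

```python
def dma_works(device_honours_iommu: bool, driver_uses_dma_api: bool) -> bool:
    """Toy model: does DMA land where the driver intended?

    A driver that uses the DMA API programs IOMMU-translated (I/O virtual)
    addresses into the device; one that bypasses it programs guest-physical
    addresses. DMA only works when the device interprets the addresses the
    same way the driver produced them.
    """
    programmed = "iova" if driver_uses_dma_api else "physical"
    expected = "iova" if device_honours_iommu else "physical"
    return programmed == expected


# (device honours the IOMMU, driver uses the DMA API)
cases = {
    "virtio in QEMU, both bugs present": (False, False),
    "assigned PCI device": (False, True),
    "hardware implementation of virtio": (True, False),
    "both bugs fixed": (True, True),
}
for name, (honours, uses_api) in cases.items():
    print(f"{name}: {'works' if dma_works(honours, uses_api) else 'broken'}")
```

The two mismatched rows are the corner cases listed above; the common virtio case works only because both sides are wrong in the same direction.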

Now we want to fix both bugs, and of course that involves carefully
coordinating both fixes.

I *like* your idea of a flag from the hypervisor which essentially says
"trust me, I'm telling the truth now".

But I don't think that wants to be virtio-specific, because we actually
want it to cover *all* the corner cases, not just the common case which
*happened* to work before due to the alignment of the two previous
bugs.

An updated guest OS can look for this flag (in its generic IOMMU code)
and can apply a heuristic of its own to work out which devices *aren't*
behind the IOMMU, if the flag isn't present. And it can get that right
even for assigned devices, so that new kernels can run happily even on
today's QEMU instances. And the virtio driver in new kernels should
just use the DMA API and expect it to work. Just as the various drivers
for assigned PCI devices do.
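
A rough sketch of that decision, as it might sit in a guest's generic IOMMU code, follows. Everything here is hypothetical: the `Device` record, the `hypervisor_truthful` flag and the example heuristic are placeholders, since no such interface or heuristic is defined in this thread.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Device:
    name: str
    tables_say_translated: bool  # what the DMAR (or equivalent) tables claim


def behind_iommu(dev: Device, hypervisor_truthful: bool,
                 looks_bypassed: Callable[[Device], bool]) -> bool:
    """Decide whether the guest should treat dev as translated by the IOMMU."""
    if hypervisor_truthful:
        # Flag present: the platform tables can be taken at face value.
        return dev.tables_say_translated
    if looks_bypassed(dev):
        # Flag absent: the guest's own heuristic overrides the tables.
        return False
    return dev.tables_say_translated


def example_heuristic(dev: Device) -> bool:
    # Purely illustrative guess: treat virtio devices as bypassing the IOMMU.
    return dev.name.startswith("virtio")


nic = Device("virtio-net", tables_say_translated=True)
print(behind_iommu(nic, hypervisor_truthful=False, looks_bypassed=example_heuristic))
print(behind_iommu(nic, hypervisor_truthful=True, looks_bypassed=example_heuristic))
```

With this split the drivers never see the flag; they call the DMA API and the IOMMU layer decides whether a mapping is translated or passthrough.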

The other interesting case for compatibility is old kernels running in
a new QEMU. And for that case, things are likely to break if you
suddenly start putting the virtio devices behind an IOMMU. There's
nothing you can do on ARM and Power to stop that breakage, since they
don't *have* a way to tell legacy guests that certain devices aren't
translated. So I suspect you probably can't enable virtio-behind-IOMMU
in QEMU *ever* for those platforms as the default behaviour.

For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
the truth, and even legacy kernels ought to cope with that.
FSVO 'ought to' where I suspect some of them will actually crash with a
NULL pointer dereference if there's no "catch-all" DMAR unit in the
tables, which puts it back into the same camp as ARM and Power.

> True but I think we should fix QEMU to shadow IOMMU
> page tables for assigned devices. This seems rather
> possible with VT-d, and there are patches already on list.
> 
> It looks like this will fix all legacy guests which is
> much nicer than what you suggest which will only help new guests.

Yes, we should do that. And in the short term we should at *least* fix
the DMAR tables to tell the truth.

> > Furthermore, some platforms don't *have* a standard way for qemu to
> > 'tell the truth' to the guests, and that's where the real fun comes in.
> > But still, I'd like to see a generic solution for that lack instead of
> > a virtio-specific hack.
> But the issue is not just these holes.  E.g. with VT-d it is only easy
> to emulate because there's a "caching mode" hook. It is fundamentally
> paravirtualization.  So a completely generic solution would be a
> paravirtualized IOMMU interface, replacing VT-d for VMs. It might be
> justified if many platforms have hard to emulate interfaces.

Hm, I'm not sure I understand the point here.

Either there is a way for the hypervisor to expose an IOMMU to a guest
(be it full hardware virt, or paravirt). Or there isn't.

If there is, it doesn't matter *how* it's done. And if there isn't, the
whole discussion is moot anyway.

-- 
dwmw2
