Linus Torvalds
2022-Aug-05 22:57 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <will at kernel.org> wrote:
>
> [tl;dr a change from ~18 months ago breaks Android userspace and I don't
> know what to do about it]

Augh.

I had hoped that android being "closer" to upstream would have meant that somebody actually tests android with upstream kernels. People occasionally talk about it, but apparently it's not actually done.

Or maybe it's done only with a very limited android user space.

The whole "we notice that something that happened 18 months ago broke our environment" is kind of broken.

> After some digging, we narrowed this change in behaviour down to
> e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
> reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
> blame. Indeed, our tests once again pass if we revert that patch (there's
> a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
> but otherwise it reverts cleanly).

I have to say, this smells for *so* many reasons.

Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?

That seems insane, but seems fundamental in that commit e13a6915a03f ("vhost/vsock: add IOTLB API support")

This code

        if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
                if (vhost_init_device_iotlb(&vsock->dev, true))
                        goto err;
        }

just makes me go "What?" It makes no sense. Why isn't that feature called something-something-IOTLB?

Can we please just split that flag into two, and have that odd "platform access" be one bit, and the "enable iommu" be an entirely different bit?

Now, since clearly nobody runs Android on newer kernels, I do think that the actual bit number choice should probably be one that makes the non-android use case binaries continue to work. And then the android system binaries that use this could maybe be compiled to know about *both* bits, and work regardless?

I'm also hoping that maybe Google android people could actually do some *testing*? I know, that sounds like a lot to ask, but humor me. Even if the product team runs stuff that is 18 months old, how about the dev team have a machine or two that actually tests current kernels, so that it's not a "oh, a few years have passed, and now we notice that a change doesn't work for us" situation any more.

Is that really too much to ask for a big company like google?

And hey, it's possible that the bit encoding is *so* incestuous that it's really hard to split it into two. But it really sounds to me like somebody mindlessly re-used a feature bit for a *completely* different thing. Why?

Why have feature bits at all, when you then re-use the same bit for two different features? It kind of seems to defeat the whole purpose.

Linus
Stefano Garzarella
2022-Aug-06 08:17 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
Hi Linus,

On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
>On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <will at kernel.org> wrote:
>>
>> [tl;dr a change from ~18 months ago breaks Android userspace and I don't
>> know what to do about it]
>
>Augh.
>
>I had hoped that android being "closer" to upstream would have meant
>that somebody actually tests android with upstream kernels. People
>occasionally talk about it, but apparently it's not actually done.
>
>Or maybe it's done only with a very limited android user space.
>
>The whole "we notice that something that happened 18 months ago broke
>our environment" is kind of broken.
>
>> After some digging, we narrowed this change in behaviour down to
>> e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
>> reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
>> blame. Indeed, our tests once again pass if we revert that patch (there's
>> a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
>> but otherwise it reverts cleanly).
>
>I have to say, this smells for *so* many reasons.
>
>Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?
>
>That seems insane, but seems fundamental in that commit e13a6915a03f
>("vhost/vsock: add IOTLB API support")
>
>This code
>
>        if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
>                if (vhost_init_device_iotlb(&vsock->dev, true))
>                        goto err;
>        }
>
>just makes me go "What?" It makes no sense. Why isn't that feature
>called something-something-IOTLB?

I honestly don't know the reason for the name, but VIRTIO_F_ACCESS_PLATFORM comes from the virtio specification:

https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-6600006

    VIRTIO_F_ACCESS_PLATFORM(33)
    This feature indicates that the device can be used on a platform where device access to data in memory is limited and/or translated. E.g. this is the case if the device can be located behind an IOMMU that translates bus addresses from the device into physical addresses in memory, if the device can be limited to only access certain memory addresses or if special commands such as a cache flush can be needed to synchronise data in memory with the device. Whether accesses are actually limited or translated is described by platform-specific means. If this feature bit is set to 0, then the device has same access to memory addresses supplied to it as the driver has. In particular, the device will always use physical addresses matching addresses used by the driver (typically meaning physical addresses used by the CPU) and not translated further, and can access any address supplied to it by the driver. When clear, this overrides any platform-specific description of whether device access is limited or translated in any way, e.g. whether an IOMMU may be present.

>
>Can we please just split that flag into two, and have that odd
>"platform access" be one bit, and the "enable iommu" be an entirely
>different bit?

IIUC the problem here is that the VMM does the translation, so the device itself has no need to translate. This feature should therefore not be negotiated between crosvm and vhost-vsock, but only between the guest's driver and crosvm.

Perhaps the confusion is that we use VIRTIO_F_ACCESS_PLATFORM both between guest and VMM and between VMM and vhost device.

In fact, prior to commit e13a6915a03f ("vhost/vsock: add IOTLB API support"), vhost-vsock did not work when a VMM (e.g., QEMU) tried to negotiate translation with the device:

https://bugzilla.redhat.com/show_bug.cgi?id=1894101

The simplest solution is that crosvm doesn't negotiate VIRTIO_F_ACCESS_PLATFORM with the vhost-vsock device if it doesn't want to use translation and send the messages needed to set it up.

In fact, before commit e13a6915a03f ("vhost/vsock: add IOTLB API support") this feature was not exposed by the vhost-vsock device, so it was never negotiated. Now crosvm is enabling a new feature (not masking guest-negotiated features), so I don't think it's a break in user space if user space enables it.

I tried to explain what I understood when I made the change; Michael and Jason surely can add more information.

Thanks,
Stefano
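The workaround Stefano describes can be sketched from the VMM's side. This is only an illustration in C, not crosvm's actual code: the helper `vhost_backend_features` and its `vmm_wants_iotlb` parameter are hypothetical, and the bit number 33 is taken from the spec excerpt quoted in the message. The idea is that the VMM keeps VIRTIO_F_ACCESS_PLATFORM for the guest-facing negotiation but clears it from the set it acks to the vhost device (for example via the VHOST_SET_FEATURES ioctl) unless it really intends to drive the IOTLB.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit number as given in the virtio spec excerpt quoted in the thread. */
#define VIRTIO_F_ACCESS_PLATFORM 33

/*
 * Hypothetical helper (not a crosvm or kernel function): compute the
 * feature set a VMM acks to the vhost-vsock backend.
 *
 * guest_features:  bits the guest driver negotiated with the VMM
 * backend_offered: bits the vhost backend advertises
 * vmm_wants_iotlb: true only if the VMM will send IOTLB updates itself
 */
static uint64_t vhost_backend_features(uint64_t guest_features,
                                       uint64_t backend_offered,
                                       bool vmm_wants_iotlb)
{
    /* Never ack a bit the backend did not offer. */
    uint64_t acked = guest_features & backend_offered;

    /* Keep the guest-facing bit, but hide it from the backend unless
     * the VMM actually intends to use device-side translation. */
    if (!vmm_wants_iotlb)
        acked &= ~(1ULL << VIRTIO_F_ACCESS_PLATFORM);

    return acked;
}
```

With this shape, a guest that negotiates VIRTIO_F_ACCESS_PLATFORM still sees the bit on its side, while the backend is never asked to translate unless the VMM opts in.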
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <will at kernel.org> wrote:
> >
> > [tl;dr a change from ~18 months ago breaks Android userspace and I don't
> > know what to do about it]
>
> Augh.
>
> I had hoped that android being "closer" to upstream would have meant
> that somebody actually tests android with upstream kernels. People
> occasionally talk about it, but apparently it's not actually done.
>
> Or maybe it's done only with a very limited android user space.

We do actually test every -rc with Android (and run a whole bunch of regression tests), this is largely using x86 builds for convenience but we've been bringing up arm64 recently and are getting increasingly more coverage there. So this _will_ improve and relatively soon.

The kicker in this case is that we'd only catch it on systems using pKVM (arm64 host only; upstreaming ongoing) with restricted DMA (requires device-tree) and so it slipped through. This is made more challenging for CI because arm64 devices don't tend to have support for nested virtualisation and so we have to run bare-metal but, as I say, we're getting there.

> > After some digging, we narrowed this change in behaviour down to
> > e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging
> > reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to
> > blame. Indeed, our tests once again pass if we revert that patch (there's
> > a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET
> > but otherwise it reverts cleanly).
>
> I have to say, this smells for *so* many reasons.
>
> Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?

It was already renamed once (!) It used to be VIRTIO_F_IOMMU_PLATFORM...

> That seems insane, but seems fundamental in that commit e13a6915a03f
> ("vhost/vsock: add IOTLB API support")
>
> This code
>
>         if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
>                 if (vhost_init_device_iotlb(&vsock->dev, true))
>                         goto err;
>         }
>
> just makes me go "What?" It makes no sense. Why isn't that feature
> called something-something-IOTLB?
>
> Can we please just split that flag into two, and have that odd
> "platform access" be one bit, and the "enable iommu" be an entirely
> different bit?

Something along those lines makes sense to me, but it's fiddly because the bits being used here are part of the virtio spec and we can't freely allocate them in Linux. I reckon it would probably be better to have a separate mechanism to enable IOTLB and not repurpose this flag for it. Hindsight is a wonderful thing.

> And hey, it's possible that the bit encoding is *so* incestuous that
> it's really hard to split it into two. But it really sounds to me like
> somebody mindlessly re-used a feature bit for a *completely* different
> thing. Why?
>
> Why have feature bits at all, when you then re-use the same bit for
> two different features? It kind of seems to defeat the whole purpose.

No argument here, and it's a big part of the reason I made the effort to write this up. Yes, we hit this in Android. Yes, we should've hit it sooner. But is it specific to Android? No. Anybody wanting a guest to use the DMA API for its virtio devices is going to be setting this flag and if they implement the same algorithm as crosvm then they're going to hit exactly the same problem that we did.

Will
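A separate opt-in of the kind Will suggests might look roughly like the sketch below. Everything in it is hypothetical: `struct backend_config`, its `iotlb_requested` field and both helper functions are made up for illustration and do not exist in Linux; only the bit number 33 comes from the virtio spec. The first helper mirrors the check quoted in the thread, where the feature bit alone turns the IOTLB on; the second shows the alternative, where the IOTLB is enabled only on an explicit request that is independent of the feature bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit number from the virtio spec. */
#define VIRTIO_F_ACCESS_PLATFORM 33

/* Hypothetical per-device state; not a real vhost structure. */
struct backend_config {
    uint64_t acked_features; /* virtio feature bits acked by userspace */
    bool iotlb_requested;    /* hypothetical explicit opt-in, separate
                              * from the virtio feature bits */
};

/* Behaviour discussed in the thread: the feature bit alone decides
 * whether the device IOTLB is set up. */
static bool iotlb_enabled_current(const struct backend_config *cfg)
{
    return (cfg->acked_features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)) != 0;
}

/* Hypothetical alternative: the IOTLB is set up only when userspace
 * asked for it explicitly, whatever the feature bits say. */
static bool iotlb_enabled_split(const struct backend_config *cfg)
{
    return cfg->iotlb_requested;
}
```

Under the second scheme a VMM that simply forwards the guest's VIRTIO_F_ACCESS_PLATFORM bit would not end up with translation it never programmed. As MST notes later in the thread, any real change would still have to keep existing userspace working.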
Christoph Hellwig
2022-Aug-07 06:52 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"?

Because, as far as the virtio spec and virtio "guest" implementation is concerned it is not about IOMMU support at all. It is about treating virtio DMA as real DMA by the platform, which lets the platform apply whatever method of DMA mapping it needs to the virtio device. This is needed to make sure hardware virtio devices are treated like actual hardware and not like a magic thing bypassing the normal PCIe rules. Using an IOMMU if one is present for the bus is just one thing, others are using DMA offsets, which are very common on non-x86 platforms, or doing the horrible cache flushing needed on devices where PCIe is not cache coherent.

It really is vhost that seems to abuse it so that if the guest claims it can handle VIRTIO_F_ACCESS_PLATFORM (which every modern guest should) it enables magic behavior, which I don't think is what the virtio spec intended.
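Seen from the guest driver, the distinction Christoph draws amounts to a choice of addressing mode. The sketch below is a deliberate simplification with hypothetical names (`enum addr_mode`, `guest_addr_mode`); it is not the Linux virtio ring code. It assumes bit 33 for VIRTIO_F_ACCESS_PLATFORM, as in the virtio spec. When the bit is negotiated, buffers go through the platform's normal DMA mapping path, whatever that involves (IOMMU, DMA offsets, cache maintenance); when it is clear, the device is handed the same physical addresses the driver uses.

```c
#include <stdint.h>

/* Bit number from the virtio spec. */
#define VIRTIO_F_ACCESS_PLATFORM 33

/* Hypothetical names, for illustration only. */
enum addr_mode {
    ADDR_GUEST_PHYSICAL, /* device sees the driver's physical addresses */
    ADDR_PLATFORM_DMA    /* addresses come from the platform DMA mapping */
};

/* Pick the addressing mode from the negotiated feature bits. */
static enum addr_mode guest_addr_mode(uint64_t negotiated_features)
{
    if (negotiated_features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))
        return ADDR_PLATFORM_DMA;
    return ADDR_GUEST_PHYSICAL;
}
```

Nothing in this decision mentions an IOTLB; that association exists only on the vhost side.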
Michael S. Tsirkin
2022-Aug-07 13:27 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
On Fri, Aug 05, 2022 at 03:57:08PM -0700, Linus Torvalds wrote:
> And hey, it's possible that the bit encoding is *so* incestuous that
> it's really hard to split it into two. But it really sounds to me like
> somebody mindlessly re-used a feature bit for a *completely* different
> thing. Why?
>
> Why have feature bits at all, when you then re-use the same bit for
> two different features? It kind of seems to defeat the whole purpose.

What can I say? Hindsight is 20/20. The two things are *related* in that IOTLB in vhost is a way for userspace (the platform) to limit device access to guest memory. So we reused the feature bits (it's not the only one, just the one we changed most recently). It bothered me a bit but everyone seemed happy and was able to refer to virtio spec for documentation so there was less documentation to write for Linux.

It's not that it's hard to split it generally, it's just that it's been there like this for a while so it's hard to change now - we need to find a way that does not break existing userspace.

--
MST