Hi folks, [tl;dr a change from ~18 months ago breaks Android userspace and I don't know what to do about it] As part of the development work for next year's Android, we've recently been bringing up a 5.15 KVM host and have observed that vsock no longer works for communicating with a guest because crosvm gets an unexpected -EFAULT back from the VHOST_VSOCK_SET_RUNNING ioctl(): | E crosvm : vpipe worker thread exited with error: VhostVsockStart(IoctlError(Os { code: 14, kind: Uncategorized, message: "Bad address" })) The same guest payload works correctly on a 5.10 KVM host kernel. After some digging, we narrowed this change in behaviour down to e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to blame. Indeed, our tests once again pass if we revert that patch (there's a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET but otherwise it reverts cleanly). On Android, KVM runs in a mode where the host kernel is, by default, unable to access guest memory [1] and so memory used for virtio (e.g. the rings and descriptor data) needs to be shared explicitly with the host using hypercalls issued by the guest. We implement this on top of restricted DMA [2], whereby the guest allocates a pool of shared memory during boot and bounces all virtio transfers through this window. In order to get the guest to use the DMA API for virtio devices (which is required for the SWIOTLB code to kick in and do the aforementioned bouncing), crosvm sets the VIRTIO_F_ACCESS_PLATFORM feature flag on its emulated devices and this is picked up by virtio_has_dma_quirk() in the guest kernel. This has been working well for us so far. With e13a6915a03f, the vhost backend for vsock now advertises VIRTIO_F_ACCESS_PLATFORM in its response to the VHOST_GET_FEATURES ioctl() and accepts it in the VHOST_SET_FEATURES as a mechanism to enable the IOTLB feature (note: this is nothing to do with SWIOTLB!). This feature is used for emulation of a virtual IOMMU and requires explicit support for issuing IOTLB messages (see VHOST_IOTLB_MSG and VHOST_IOTLB_MSG_V2) from userspace to manage address translations for the virtio device. Looking at how crosvm uses these vhost ioctl()s, we can see: let avail_features = self .vhost_handle .get_features() .map_err(Error::VhostGetFeatures)?; let features: c_ulonglong = self.acked_features & avail_features; self.vhost_handle .set_features(features) .map_err(Error::VhostSetFeatures)?; where 'acked_features' is the feature set negotiated between crosvm and the guest, while 'avail_features' is the supported feature set advertised by vhost. The intersection of these now includes VIRTIO_F_ACCESS_PLATFORM in the 5.15 kernel and so we quietly start enabling IOTLB, despite not having any of the necessary support in crosvm and therefore the vsock thread effectively grinds to a halt and we cannot communicate with the guest. The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is being used for two very different things within the same device; for the guest it basically means "use the DMA API, it knows what to do" but for vhost it very specifically means "enable IOTLB". We've recently had other problems with this flag [3] but in this case it used to work reliably and now it doesn't anymore. So how should we fix this? One possibility is for us to hack crosvm to clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features, but others here have reasonably pointed out that they didn't expect a kernel change to break userspace. On the flip side, the offending commit in the kernel isn't exactly new (it's from the end of 2020!) and so it's likely that others (e.g. QEMU) are using this feature. I also strongly suspect that vhost net suffers from exactly the same issue, we just don't happen to be using that (yet) in Android. Thanks, Will [1] https://lwn.net/Articles/836693/ [2] https://lwn.net/Articles/841916/ [3] https://lore.kernel.org/lkml/YtkCQsSvE9AZyrys at google.com/
Linus Torvalds
2022-Aug-05 22:57 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
On Fri, Aug 5, 2022 at 11:11 AM Will Deacon <will at kernel.org> wrote:> > [tl;dr a change from ~18 months ago breaks Android userspace and I don't > know what to do about it]Augh. I had hoped that android being "closer" to upstream would have meant that somebody actually tests android with upstream kernels. People occasionally talk about it, but apparently it's not actually done. Or maybe it's done onl;y with a very limited android user space. The whole "we notice that something that happened 18 months ago broke our environment" is kind of broken.> After some digging, we narrowed this change in behaviour down to > e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging > reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to > blame. Indeed, our tests once again pass if we revert that patch (there's > a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET > but otherwise it reverts cleanly).I have to say, this smells for *so* many reasons. Why is "IOMMU support" called "VIRTIO_F_ACCESS_PLATFORM"? That seems insane, but seems fundamental in that commit e13a6915a03f ("vhost/vsock: add IOTLB API support") This code if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) { if (vhost_init_device_iotlb(&vsock->dev, true)) goto err; } just makes me go "What?" It makes no sense. Why isn't that feature called something-something-IOTLB? Can we please just split that flag into two, and have that odd "platform access" be one bit, and the "enable iommu" be an entirely different bit? Now, since clearly nobody runs Android on newer kernels, I do think that the actual bit number choice should probably be one that makes the non-android use case binaries continue to work. And then the android system binaries that use this could maybe be compiled to know about *both* bits,. and work regardless? I'm also hoping that maybe Google android people could actually do some *testing*? I know, that sounds like a lot to ask, but humor me. Even if the product team runs stuff that is 18 months old, how about the dev team have a machine or two that actually tests current kernels, so that it's not a "oh, a few years have passed, and now we notice that a change doesn't work for us" situation any more. Is that really too much to ask for a big company like google? And hey, it's possible that the bit encoding is *so* incestuous that it's really hard to split it into two. But it really sounds to me like somebody mindlessly re-used a feature bit for a *completely* different thing. Why? Why have feature bits at all, when you then re-use the same bit for two different features? It kind of seems to defeat the whole purpose. Linus
Stefano Garzarella
2022-Aug-06 07:48 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
Hi Will, On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:>Hi folks, > >[tl;dr a change from ~18 months ago breaks Android userspace and I don't > know what to do about it] > >As part of the development work for next year's Android, we've recently >been bringing up a 5.15 KVM host and have observed that vsock no longer >works for communicating with a guest because crosvm gets an unexpected >-EFAULT back from the VHOST_VSOCK_SET_RUNNING ioctl(): > > | E crosvm : vpipe worker thread exited with error: VhostVsockStart(IoctlError(Os { code: 14, kind: Uncategorized, message: "Bad address" })) > >The same guest payload works correctly on a 5.10 KVM host kernel. > >After some digging, we narrowed this change in behaviour down to >e13a6915a03f ("vhost/vsock: add IOTLB API support") and further digging >reveals that the infamous VIRTIO_F_ACCESS_PLATFORM feature flag is to >blame. Indeed, our tests once again pass if we revert that patch (there's >a trivial conflict with the later addition of VIRTIO_VSOCK_F_SEQPACKET >but otherwise it reverts cleanly). > >On Android, KVM runs in a mode where the host kernel is, by default, >unable to access guest memory [1] and so memory used for virtio (e.g. >the rings and descriptor data) needs to be shared explicitly with the >host using hypercalls issued by the guest. We implement this on top of >restricted DMA [2], whereby the guest allocates a pool of shared memory >during boot and bounces all virtio transfers through this window. In >order to get the guest to use the DMA API for virtio devices (which is >required for the SWIOTLB code to kick in and do the aforementioned >bouncing), crosvm sets the VIRTIO_F_ACCESS_PLATFORM feature flag on its >emulated devices and this is picked up by virtio_has_dma_quirk() in the >guest kernel. This has been working well for us so far. > >With e13a6915a03f, the vhost backend for vsock now advertises >VIRTIO_F_ACCESS_PLATFORM in its response to the VHOST_GET_FEATURES >ioctl() and accepts it in the VHOST_SET_FEATURES as a mechanism to >enable the IOTLB feature (note: this is nothing to do with SWIOTLB!). >This feature is used for emulation of a virtual IOMMU and requires >explicit support for issuing IOTLB messages (see VHOST_IOTLB_MSG and >VHOST_IOTLB_MSG_V2) from userspace to manage address translations for >the virtio device. > >Looking at how crosvm uses these vhost ioctl()s, we can see: > > let avail_features = self > .vhost_handle > .get_features() > .map_err(Error::VhostGetFeatures)?; > > let features: c_ulonglong = self.acked_features & avail_features; > self.vhost_handle > .set_features(features) > .map_err(Error::VhostSetFeatures)?;> >where 'acked_features' is the feature set negotiated between crosvm and >the guest, while 'avail_features' is the supported feature set >advertised by vhost. The intersection of these now includes >VIRTIO_F_ACCESS_PLATFORM in the 5.15 kernel and so we quietly start >enabling IOTLB, despite not having any of the necessary support in >crosvm and therefore the vsock thread effectively grinds to a halt and >we cannot communicate with the guest. > >The fundamental issue is, I think, that VIRTIO_F_ACCESS_PLATFORM is >being used for two very different things within the same device; for the >guest it basically means "use the DMA API, it knows what to do" but for >vhost it very specifically means "enable IOTLB". We've recently had >other problems with this flag [3] but in this case it used to work >reliably and now it doesn't anymore. > >So how should we fix this? One possibility is for us to hack crosvm to >clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhostWhy do you consider this a hack? If the VMM implements the translation feature, it is right in my opinion that it does not enable the feature for the vhost device. Otherwise, if it wants the vhost device to do the translation, enable the feature and send the IOTLB messages to set the translation. QEMU for example masks features when not required or supported. crosvm should negotiate only the features it supports. @Michael and @Jason can correct me, but if a vhost device negotiates VIRTIO_F_ACCESS_PLATFORM, then it expects the VMM to send IOTLB messages to set the translation. Indeed that patch was to fix https://bugzilla.redhat.com/show_bug.cgi?id=1894101>features, >but others here have reasonably pointed out that they didn't expect a >kernel change to break userspace. On the flip side, the offending commit >in the kernel isn't exactly new (it's from the end of 2020!) and so it's >likely that others (e.g. QEMU) are using this feature.Yep, QEMU can use it.>I also strongly >suspect that vhost net suffers from exactly the same issue, we just >don't happen to be using that (yet) in Android.I think so too, the implementation in vsock was done following vhost-net. Thanks, Stefano
Michael S. Tsirkin
2022-Aug-07 13:14 UTC
IOTLB support for vhost/vsock breaks crosvm on Android
Will, thanks very much for the analysis and the writeup! On Fri, Aug 05, 2022 at 07:11:06PM +0100, Will Deacon wrote:> So how should we fix this? One possibility is for us to hack crosvm to > clear the VIRTIO_F_ACCESS_PLATFORM flag when setting the vhost features, > but others here have reasonably pointed out that they didn't expect a > kernel change to break userspace. On the flip side, the offending commit > in the kernel isn't exactly new (it's from the end of 2020!) and so it's > likely that others (e.g. QEMU) are using this feature.Exactly, that's the problem. vhost is reusing the virtio bits and it's only natural that what you are doing would happen. To be precise, this is what we expected people to do (and what QEMU does): #define QEMU_VHOST_FEATURES ((1 << VIRTIO_F_VERSION_1) | (1 << VIRTIO_NET_F_RX_MRG) | .... ) VHOST_GET_FEATURES(... &host_features); host_features &= QEMU_VHOST_FEATURES VHOST_SET_FEATURES(host_features & guest_features) Here QEMU_VHOST_FEATURES are the bits userspace knows about. Our assumption was that whatever userspace enables, it knows what the effect on vhost is going to be. But yes, I understand absolutely how someone would instead just use the guest features. It is unfortunate that we did not catch this in time. In hindsight, we should have just created vhost level macros instead of reusing virtio ones. Would address the concern about naming: PLATFORM_ACCESS makes sense for the guest since there it means "whatever access rules platform has", but for vhost a better name would be VHOST_F_IOTLB. We should have also taken greater pains to document what we expect userspace to do. I remember now how I thought about something like this but after coding this up in QEMU I forgot to document this :( Also, I suspect given the history the GET/SET features ioctl and burned wrt extending it and we have to use a new when we add new features. All this we can do going forward. But what can we do about the specific issue? I am not 100% sure since as Will points out, QEMU and other userspace already rely on the current behaviour. Looking at QEMU specifically, it always sends some translations at startup, this in order to handle device rings. So, *maybe* we can get away with assuming that if no IOTLB ioctl was ever invoked then this userspace does not know about IOTLB and translation should ignore IOTLB completely. I am a bit nervous about breaking some *other* userspace which actually wants device to be blocked from accessing memory until IOTLB has been setup. If we get it wrong we are making guest and possibly even host vulnerable. And of course just revering is not an option either since there are now whole stacks depending on the feature. Will I'd like your input on whether you feel a hack in the kernel is justified here. Also yes, I think it's a good idea to change crosvm anyway. While the work around I describe might make sense upstream I don't think it's a reasonable thing to do in stable kernels. I think I'll prepare a patch documenting the legal vhost features as a 1st step even though crosvm is rust so it's not importing the header directly, right? -- MST