Andi Kleen
2021-Oct-10 22:11 UTC
[PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On 10/9/2021 1:39 PM, Dan Williams wrote:> On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst at redhat.com> wrote: >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: >>> From: Andi Kleen <ak at linux.intel.com> >>> >>> For Confidential VM guests like TDX, the host is untrusted and hence >>> the devices emulated by the host or any data coming from the host >>> cannot be trusted. So the drivers that interact with the outside world >>> have to be hardened by sharing memory with host on need basis >>> with proper hardening fixes. >>> >>> For the PCI driver case, to share the memory with the host add >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. >>> >>> Signed-off-by: Andi Kleen <ak at linux.intel.com> >>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy at linux.intel.com> >> So I proposed to make all pci mappings shared, eliminating the need >> to patch drivers. >> >> To which Andi replied >> One problem with removing the ioremap opt-in is that >> it's still possible for drivers to get at devices without going through probe. >> >> To which Greg replied: >> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ at kroah.com/ >> If there are in-kernel PCI drivers that do not do this, they need to be >> fixed today. >> >> Can you guys resolve the differences here? > I agree with you and Greg here. If a driver is accessing hardware > resources outside of the bind lifetime of one of the devices it > supports, and in a way that neither modrobe-policy nor > device-authorization -policy infrastructure can block, that sounds > like a bug report.The 5.15 tree has something like ~2.4k IO accesses (including MMIO and others) in init functions that also register drivers (thanks Elena for the number) Some are probably old drivers that could be fixed, but it's quite a few legitimate cases. For example for platform or ISA drivers that's the only way they can be implemented because they often have no other enumeration mechanism. For PCI drivers it's rarer, but also still can happen. One example that comes to mind here is the x86 Intel uncore drivers, which support a mix of MSR, ioremap and PCI config space accesses all from the same driver. This particular example can (and should be) fixed in other ways, but similar things also happen in other drivers, and they're not all broken. Even for the broken ones they're usually for some crufty old devices that has very few users, so it's likely untestable in practice. My point is just that the ecosystem of devices that Linux supports is messy enough that there are legitimate exceptions from the "First IO only in probe call only" rule. And we can't just fix them all. Even if we could it would be hard to maintain. Using a "firewall model" hooking into a few strategic points like we're proposing here is much saner for everyone. Now we can argue about the details. Right now what we're proposing has some redundancies: it has both a device model filter and low level filter for ioremap (this patch and some others). The low level filter is for catching issues that don't clearly fit into the "enumeration<->probe" model. You could call that redundant, but I would call it defense in depth or better safe than sorry. In theory it would be enough to have the low level opt-in only, but that would have the drawback that is something gets enumerated after all you would have all kind of weird device driver failures and in some cases even killed guests. So I think it makes sense to have> Fix those drivers instead of sprinkling > ioremap_shared in select places and with unclear rules about when a > driver is allowed to do "shared" mappings.Only add it when the driver has been audited and hardened. But I agree we need on a documented process for this. I will work on some documentation for a proposal. But essentially I think it should be some variant of what Elena has outlined in her talk at Security Summit. https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf That is using extra auditing/scrutiny at review time, supported with some static code analysis that points to the interaction points, and code needs to be fuzzed explicitly. However short term it's only three virtio drivers, so this is not a urgent problem.> Let the new > device-authorization mechanism (with policy in userspace)Default policy in user space just seems to be a bad idea here. Who should know if a driver is hardened other than the kernel? Maintaining the list somewhere else just doesn't make sense to me. Also there is the more practical problem that some devices are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early. I can see a point of having user space overrides though, but we need to have a sane kernel default that works early. -Andi
Dan Williams
2021-Oct-12 17:42 UTC
[PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen <ak at linux.intel.com> wrote:> > > On 10/9/2021 1:39 PM, Dan Williams wrote: > > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst at redhat.com> wrote: > >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > >>> From: Andi Kleen <ak at linux.intel.com> > >>> > >>> For Confidential VM guests like TDX, the host is untrusted and hence > >>> the devices emulated by the host or any data coming from the host > >>> cannot be trusted. So the drivers that interact with the outside world > >>> have to be hardened by sharing memory with host on need basis > >>> with proper hardening fixes. > >>> > >>> For the PCI driver case, to share the memory with the host add > >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > >>> > >>> Signed-off-by: Andi Kleen <ak at linux.intel.com> > >>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy at linux.intel.com> > >> So I proposed to make all pci mappings shared, eliminating the need > >> to patch drivers. > >> > >> To which Andi replied > >> One problem with removing the ioremap opt-in is that > >> it's still possible for drivers to get at devices without going through probe. > >> > >> To which Greg replied: > >> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ at kroah.com/ > >> If there are in-kernel PCI drivers that do not do this, they need to be > >> fixed today. > >> > >> Can you guys resolve the differences here? > > I agree with you and Greg here. If a driver is accessing hardware > > resources outside of the bind lifetime of one of the devices it > > supports, and in a way that neither modrobe-policy nor > > device-authorization -policy infrastructure can block, that sounds > > like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for > the number) > > Some are probably old drivers that could be fixed, but it's quite a few > legitimate cases. For example for platform or ISA drivers that's the > only way they can be implemented because they often have no other > enumeration mechanism. For PCI drivers it's rarer, but also still can > happen. One example that comes to mind here is the x86 Intel uncore > drivers, which support a mix of MSR, ioremap and PCI config space > accesses all from the same driver. This particular example can (and > should be) fixed in other ways, but similar things also happen in other > drivers, and they're not all broken. Even for the broken ones they're > usually for some crufty old devices that has very few users, so it's > likely untestable in practice. > > My point is just that the ecosystem of devices that Linux supports is > messy enough that there are legitimate exceptions from the "First IO > only in probe call only" rule. > > And we can't just fix them all. Even if we could it would be hard to > maintain. > > Using a "firewall model" hooking into a few strategic points like we're > proposing here is much saner for everyone. > > Now we can argue about the details. Right now what we're proposing has > some redundancies: it has both a device model filter and low level > filter for ioremap (this patch and some others). The low level filter is > for catching issues that don't clearly fit into the > "enumeration<->probe" model. You could call that redundant, but I would > call it defense in depth or better safe than sorry. In theory it would > be enough to have the low level opt-in only, but that would have the > drawback that is something gets enumerated after all you would have all > kind of weird device driver failures and in some cases even killed > guests. So I think it makes sense to haveThe "better safe-than-sorry" argument is hard to build consensus around. The spectre mitigations ran into similar problems where the community rightly wanted to see the details and instrument the problematic paths rather than blanket sprinkle lfence "just to be safe". In this case the rules about when a driver is suitably "hardened" are vague and the overlapping policy engines are confusing. I'd rather see more concerted efforts focused/limited core changes rather than leaf driver changes until there is a clearer definition of hardened. I.e. instead of jumping to the assertion that fixing up these init-path vulnerabilities are too big to fix, dig to the next level to provide more evidence that per-driver opt-in is the only viable option. For example, how many of these problematic paths are built-in to the average kernel config? A strawman might be to add a sprinkling error exits in the module_init() of the problematic drivers, and only fail if the module is built-in, and let modprobe policy handle the rest.> > > > Fix those drivers instead of sprinkling > > ioremap_shared in select places and with unclear rules about when a > > driver is allowed to do "shared" mappings. > > Only add it when the driver has been audited and hardened. > > But I agree we need on a documented process for this. I will work on > some documentation for a proposal. But essentially I think it should be > some variant of what Elena has outlined in her talk at Security Summit. > > https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf > > That is using extra auditing/scrutiny at review time, supported with > some static code analysis that points to the interaction points, and > code needs to be fuzzed explicitly. > > However short term it's only three virtio drivers, so this is not a > urgent problem. > > > Let the new > > device-authorization mechanism (with policy in userspace) > > > Default policy in user space just seems to be a bad idea here. Who > should know if a driver is hardened other than the kernel? Maintaining > the list somewhere else just doesn't make sense to me.I do not understand the maintenance burden correlation of where the policy is driven vs where the list is maintained? Even if I agreed with the contention that out-of-tree userspace would have a hard time tracking the "hardened" driver list there is still an in-tree userspace path to explore. E.g. perf maintains lists of things tightly coupled to the kernel, this authorized device list seems to be in the same category of data.> Also there is the more practical problem that some devices are needed > for booting. For example in TDX we can't print something to the console > with this mechanism, so you would never get any output before the > initrd. Just seems like a nightmare for debugging anything. There really > needs to be an authorization mechanism that works reasonably early. > > I can see a point of having user space overrides though, but we need to > have a sane kernel default that works early.Right, as I suggested [1], just enough early authorization to bootstrap/debug initramfs and then that can authorize the remainder. [1]: https://lore.kernel.org/all/CAPcyv4im4Tsj1SnxSWe=cAHBP1mQ=zgO-D81n2BpD+_HkpitbQ at mail.gmail.com/
Thomas Gleixner
2021-Oct-18 00:55 UTC
[PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
Andi, On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:> On 10/9/2021 1:39 PM, Dan Williams wrote: >> I agree with you and Greg here. If a driver is accessing hardware >> resources outside of the bind lifetime of one of the devices it >> supports, and in a way that neither modrobe-policy nor >> device-authorization -policy infrastructure can block, that sounds >> like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for > the number)These numbers are completely useless simply because they are based on nonsensical criteria. See: https://lore.kernel.org/r/87r1cj2uad.ffs at tglx> My point is just that the ecosystem of devices that Linux supports is > messy enough that there are legitimate exceptions from the "First IO > only in probe call only" rule.Your point is based on your outright refusal to actualy do a proper analysis and your outright refusal to help fixing the real problems. All you have provided so far is handwaving based on a completely useless analysis. Sure, your goal is to get this TDX problem solved, but it's not going to be solved by: 1) Providing a nonsensical analysis 2) Using #1 as an argument to hack some half baken interfaces into the kernel which allow you to tick off your checkbox and then leave the resulting mess for others to clean up. Try again when you have factual data to back up your claims and factual arguments which prove that the problem can't be fixed otherwise. I might be repeating myself, but kernel development works this way: 1) Hack your private POC - Yay! 2) Sit down and think hard about the problems you identified in step #1. Do a thorough analysis. 3) Come up with a sensible integration plan. 4) Do the necessary grump work of cleanups all over the place 5) Add sensible infrastructure which is understandable for the bulk of kernel/driver developers 6) Let your feature fall in place and not in the way you are insisting on: 1) Hack your private POC - Yay! 2) Define that this is the only way to do it and try to shove it down the throat of everyone. 3) Getting told that this is not the way it works 4) Insist on it forever and blame the grumpy maintainers who are just not understanding the great value of your approach. 5) Go back to #2 You should know that already, but I have no problem to give that lecture to you over and over again. I probably should create a form letter. And no, you can bitch about me as much as you want. These are not my personal rules and personal pet pieves. These are rules Linus cares about very much and aside of that they just reflect common sense. The kernel is a common good and not the dump ground for your personal brain waste. The kernel does not serve Intel. Quite the contrary Intel depends on the kernel to work nicely with it's hardware. Ergo, Intel should have a vested interest to serve the kernel and take responsibility for it as a whole. And so should you as an Intel employee. Just dumping your next half baken workaround does not cut it especially not when it is not backed up by sensible arguments. Please try again, but not before you have something substantial to back up your claims. Thanks, Thomas
Greg KH
2021-Oct-18 12:08 UTC
[PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
On Sun, Oct 10, 2021 at 03:11:23PM -0700, Andi Kleen wrote:> > On 10/9/2021 1:39 PM, Dan Williams wrote: > > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst at redhat.com> wrote: > > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > > > > From: Andi Kleen <ak at linux.intel.com> > > > > > > > > For Confidential VM guests like TDX, the host is untrusted and hence > > > > the devices emulated by the host or any data coming from the host > > > > cannot be trusted. So the drivers that interact with the outside world > > > > have to be hardened by sharing memory with host on need basis > > > > with proper hardening fixes. > > > > > > > > For the PCI driver case, to share the memory with the host add > > > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > > > > > > > > Signed-off-by: Andi Kleen <ak at linux.intel.com> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy at linux.intel.com> > > > So I proposed to make all pci mappings shared, eliminating the need > > > to patch drivers. > > > > > > To which Andi replied > > > One problem with removing the ioremap opt-in is that > > > it's still possible for drivers to get at devices without going through probe. > > > > > > To which Greg replied: > > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ at kroah.com/ > > > If there are in-kernel PCI drivers that do not do this, they need to be > > > fixed today. > > > > > > Can you guys resolve the differences here? > > I agree with you and Greg here. If a driver is accessing hardware > > resources outside of the bind lifetime of one of the devices it > > supports, and in a way that neither modrobe-policy nor > > device-authorization -policy infrastructure can block, that sounds > > like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for the > number) > > Some are probably old drivers that could be fixed, but it's quite a few > legitimate cases. For example for platform or ISA drivers that's the only > way they can be implemented because they often have no other enumeration > mechanism. For PCI drivers it's rarer, but also still can happen. One > example that comes to mind here is the x86 Intel uncore drivers, which > support a mix of MSR, ioremap and PCI config space accesses all from the > same driver. This particular example can (and should be) fixed in other > ways, but similar things also happen in other drivers, and they're not all > broken. Even for the broken ones they're usually for some crufty old devices > that has very few users, so it's likely untestable in practice. > > My point is just that the ecosystem of devices that Linux supports is messy > enough that there are legitimate exceptions from the "First IO only in probe > call only" rule.No, there should not be for PCI drivers. If there is, that is a bug that you can, and should, fix.> And we can't just fix them all. Even if we could it would be hard to > maintain.Not true at all, you can fix them, and write a simple coccinelle rule to prevent them from ever coming back in.> Using a "firewall model" hooking into a few strategic points like we're > proposing here is much saner for everyone.No it is not. It is "easier" for you because you all do not want to fix up all of the drivers and want to add additional code complexity on top of the current mess that we have and then you can claim that you have "hardened" the drivers you care about. Despite no one ever explaining exactly what "hardened" means to me. Again, fix the existing drivers, you have the whole source, if this is something that you all care about, it should not be hard to do. Stop making excuses. greg k-h