[v1: Initial post]

With confidential computing like TDX the guest doesn't trust the host
anymore. The host can of course still DoS the guest, but it is not allowed
to read or write any guest memory that is not explicitly shared with it.

This has implications for virtio. Traditionally virtio didn't assume the
other side of the communication channel is malicious, and therefore didn't
do any boundary checks on the virtio ring data structures.

This patchkit does hardening for virtio. In a TDX-like model the only host
memory accesses allowed are to the virtio ring and to the (forced) swiotlb
buffer. The patchkit makes various changes to ensure there can be no access
outside these two areas. The host can still break the communication, but
that should only result in an IO error in the guest, not in a memory safety
violation.

virtio is quite complicated and has many modes. To simplify the task we
enforce that virtio only runs in split mode without indirect descriptors
when running as a TDX guest. We also enforce use of the DMA API. These code
paths are then hardened against any corruption of the ring.

The patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio ring code
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
  malicious pointers. This requires an API change that needed a tree sweep.
- A single x86 patch to enable arch_has_restricted_virtio_memory_access
  for TDX

It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. That is only needed for the "am I running in TDX" part.
Andi Kleen
2021-Jun-03 00:41 UTC
[PATCH v1 1/8] virtio: Force only split mode with protected guest
When running under TDX the virtio host is untrusted. The bulk of the
kernel memory is encrypted and protected, but the virtio ring is in
special shared memory that is shared with the untrusted host.

This means virtio needs to be hardened against any attacks from the host
through the ring. Of course it's impossible to prevent DoS (the host can
choose at any time to stop doing IO), but there should be no buffer
overruns or similar that might give access to any private memory in the
guest.

virtio has a lot of modes, most of which are difficult to harden. The best
mode for hardening seems to be split mode without indirect descriptors.
This also simplifies the hardening job because it's only a single code
path.

Only allow split mode when in a protected guest. Follow-on patches harden
the split mode code paths, and we don't want a malicious host to force
anything else. Also disallow indirect mode for similar reasons.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..f35629fa47b1 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/hrtimer.h>
 #include <linux/dma-mapping.h>
+#include <linux/protected_guest.h>
 #include <xen/xen.h>
 
 #ifdef DEBUG
@@ -2221,8 +2222,16 @@ void vring_transport_features(struct virtio_device *vdev)
 	unsigned int i;
 
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
+
+		/*
+		 * In protected guest mode disallow packed or indirect
+		 * because they ain't hardened.
+		 */
+
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
+			if (protected_guest_has(VM_MEM_ENCRYPT))
+				goto clear;
 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
 			break;
@@ -2231,9 +2240,12 @@ void vring_transport_features(struct virtio_device *vdev)
 		case VIRTIO_F_ACCESS_PLATFORM:
 			break;
 		case VIRTIO_F_RING_PACKED:
+			if (protected_guest_has(VM_MEM_ENCRYPT))
+				goto clear;
 			break;
 		case VIRTIO_F_ORDER_PLATFORM:
 			break;
+		clear:
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
2.25.4
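[Editor's note] For context, protected_guest_has() and VM_MEM_ENCRYPT come
from Sathya's prerequisite series and are not defined in this patchkit. A
minimal sketch of what such a helper could look like is below; the helper
name, the flag semantics, and the use of is_tdx_guest()/mem_encrypt_active()
are assumptions for illustration, not the real interface:

/*
 * Hypothetical sketch only -- the real helper is provided by the
 * prerequisite TDX infrastructure series and may differ.
 * VM_MEM_ENCRYPT is assumed to mean "guest memory is protected from
 * the host", i.e. a TDX or memory-encrypted guest.
 */
static inline bool protected_guest_has(unsigned long flag)
{
	switch (flag) {
	case VM_MEM_ENCRYPT:
		return is_tdx_guest() || mem_encrypt_active();
	default:
		return false;
	}
}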
Andi Kleen
2021-Jun-03 00:41 UTC
[PATCH v1 2/8] virtio: Add boundary checks to virtio ring
In protected guest mode we don't trust the host. This means we need to
make sure the host cannot subvert us through virtio communication. In
general it can corrupt our virtio data and cause a DoS, but it should not
be able to access any data that is not explicitly under IO.

Also add boundary checking so that the free list (which is accessible to
the host) cannot point outside the virtio ring. Note it could still
contain loops or similar, but these should only cause a DoS, not a memory
corruption or leak.

When we detect any out-of-bounds descriptor, trigger an IO error. We also
use a WARN() (in case it was a software bug instead of an attack). This
implies that a malicious host can flood the guest kernel log, but that's
only a DoS and acceptable in the threat model.

This patch only hardens the initial consumption of the free list; the
freeing comes later. Any of these errors can cause DMA memory leaks, but
there is nothing we can do about that, and that would be just a DoS.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 46 ++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f35629fa47b1..d37ff5a0ff58 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -413,6 +413,15 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
 	return desc;
 }
 
+/* assumes no indirect mode */
+static inline bool inside_split_ring(struct vring_virtqueue *vq,
+				     unsigned index)
+{
+	return !WARN(index >= vq->split.vring.num,
+		     "desc index %u out of bounds (%u)\n",
+		     index, vq->split.vring.num);
+}
+
 static inline int virtqueue_add_split(struct virtqueue *_vq,
 				      struct scatterlist *sgs[],
 				      unsigned int total_sg,
@@ -428,6 +437,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	unsigned int i, n, avail, descs_used, prev, err_idx;
 	int head;
 	bool indirect;
+	int io_err;
 
 	START_USE(vq);
 
@@ -481,7 +491,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			dma_addr_t addr;
+
+			io_err = -EIO;
+			if (!inside_split_ring(vq, i))
+				goto unmap_release;
+			io_err = -ENOMEM;
+			addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
 			if (vring_mapping_error(vq, addr))
 				goto unmap_release;
 
@@ -494,7 +510,13 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	}
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			dma_addr_t addr;
+
+			io_err = -EIO;
+			if (!inside_split_ring(vq, i))
+				goto unmap_release;
+			io_err = -ENOMEM;
+			addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
 			if (vring_mapping_error(vq, addr))
 				goto unmap_release;
 
@@ -513,6 +535,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 		dma_addr_t addr = vring_map_single(
 			vq, desc, total_sg * sizeof(struct vring_desc),
 			DMA_TO_DEVICE);
+		io_err = -ENOMEM;
 		if (vring_mapping_error(vq, addr))
 			goto unmap_release;
 
@@ -528,6 +551,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	/* We're using some buffers from the free list. */
 	vq->vq.num_free -= descs_used;
 
+	io_err = -EIO;
+	if (!inside_split_ring(vq, head))
+		goto unmap_release;
+
 	/* Update free pointer */
 	if (indirect)
 		vq->free_head = virtio16_to_cpu(_vq->vdev,
@@ -545,6 +572,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
 	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
+
+	if (avail >= vq->split.vring.num)
+		goto unmap_release;
+
 	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
 
 	/* Descriptors and available array need to be set before we expose the
@@ -576,6 +607,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	for (n = 0; n < total_sg; n++) {
 		if (i == err_idx)
 			break;
+		if (!inside_split_ring(vq, i))
+			break;
 		vring_unmap_one_split(vq, &desc[i]);
 		i = virtio16_to_cpu(_vq->vdev, desc[i].next);
 	}
@@ -584,7 +617,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 		kfree(desc);
 
 	END_USE(vq);
-	return -ENOMEM;
+	return io_err;
 }
 
 static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
@@ -1146,7 +1179,12 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
 	c = 0;
 	for (n = 0; n < out_sgs + in_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
+			dma_addr_t addr;
+
+			if (curr >= vq->packed.vring.num)
+				goto unmap_release;
+
+			addr = vring_map_one_sg(vq, sg, n < out_sgs ?
 					DMA_TO_DEVICE : DMA_FROM_DEVICE);
 			if (vring_mapping_error(vq, addr))
 				goto unmap_release;
-- 
2.25.4
Harden the split buffer detachment path by adding boundary checking. Note
that when this fails we may fail to unmap some swiotlb mapping, which
could result in a leak and a DoS. But that's acceptable because a
malicious host can DoS us anyway.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d37ff5a0ff58..1e9aa1e95e1b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -651,12 +651,19 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
 	return needs_kick;
 }
 
-static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
-			     void **ctx)
+static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+			    void **ctx)
 {
 	unsigned int i, j;
 	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
 
+	/* We'll leak DMA mappings when this happens, but nothing
+	 * can be done about that. In the worst case the host
+	 * could DOS us, but it can of course do that anyways.
+	 */
+	if (!inside_split_ring(vq, head))
+		return -EIO;
+
 	/* Clear data ptr. */
 	vq->split.desc_state[head].data = NULL;
 
@@ -666,6 +673,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 	while (vq->split.vring.desc[i].flags & nextflag) {
 		vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
 		i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
+		if (!inside_split_ring(vq, i))
+			return -EIO;
 		vq->vq.num_free++;
 	}
 
@@ -684,7 +693,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 
 	/* Free the indirect table, if any, now that it's unmapped. */
 	if (!indir_desc)
-		return;
+		return 0;
 
 	len = virtio32_to_cpu(vq->vq.vdev,
 			      vq->split.vring.desc[head].len);
@@ -701,6 +710,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 	} else if (ctx) {
 		*ctx = vq->split.desc_state[head].indir_desc;
 	}
+	return 0;
 }
 
 static inline bool more_used_split(const struct vring_virtqueue *vq)
@@ -717,6 +727,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
 	void *ret;
 	unsigned int i;
 	u16 last_used;
+	int err;
 
 	START_USE(vq);
 
@@ -751,7 +762,12 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
 
 	/* detach_buf_split clears data, so grab it now. */
 	ret = vq->split.desc_state[i].data;
-	detach_buf_split(vq, i, ctx);
+	err = detach_buf_split(vq, i, ctx);
+	if (err) {
+		END_USE(vq);
+		return NULL;
+	}
+
 	vq->last_used_idx++;
 	/* If we expect an interrupt for the next entry, tell host
 	 * by writing event index and flush out the write before
@@ -863,6 +879,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
 		/* detach_buf_split clears data, so grab it now. */
 		buf = vq->split.desc_state[i].data;
 		detach_buf_split(vq, i, NULL);
+		/* Don't need to check for error because nothing is returned */
 		vq->split.avail_idx_shadow--;
 		vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
 				vq->split.avail_idx_shadow);
-- 
2.25.4
Andi Kleen
2021-Jun-03 00:41 UTC
[PATCH v1 4/8] x86/tdx: Add arch_has_restricted_memory_access for TDX
In virtio the host decides whether the guest uses the DMA API or not,
using the somewhat misleadingly named VIRTIO_F_ACCESS_PLATFORM feature
bit.

For hardened virtio on TDX we want to enforce that swiotlb is always used,
which requires using the DMA API. While IO wouldn't really work without
the swiotlb, it might be possible that an attacker forces swiotlb-less IO
to manipulate memory in the guest.

So we want to force the DMA API (which then forces swiotlb), but without
relying on the host.

There is already an arch_has_restricted_virtio_memory_access hook for
this, which is currently used only by s390. Enable the config option for
the hook on x86 and implement it for TDX.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 arch/x86/Kconfig                 | 1 +
 arch/x86/mm/mem_encrypt_common.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1531a0f905ed..3d804fce31b9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
 	select X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtenstions. TDX is a new Intel
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 24c9117547b4..2244d1f033ab 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,6 +9,7 @@
 
 #include <asm/mem_encrypt_common.h>
 #include <linux/dma-mapping.h>
+#include <linux/virtio_config.h>
 #include <linux/swiotlb.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
@@ -37,3 +38,9 @@ void __init mem_encrypt_init(void)
 		amd_mem_encrypt_init();
 }
 
+int arch_has_restricted_virtio_memory_access(void)
+{
+	return is_tdx_guest();
+}
+EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
+
-- 
2.25.4
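[Editor's note] For reference on how this hook takes effect: the generic
virtio core already consults arch_has_restricted_virtio_memory_access()
during feature negotiation (that is how the existing s390 user works). The
sketch below shows roughly the kind of check involved; the function name
and the exact feature bit and message are illustrative, not quoted from
the tree:

/*
 * Rough sketch of how the virtio core is expected to consume the hook
 * (illustrative; the real check lives in drivers/virtio/virtio.c and
 * may differ in detail).
 */
static int check_restricted_memory_access(struct virtio_device *dev)
{
	if (arch_has_restricted_virtio_memory_access() &&
	    !virtio_has_feature(dev, VIRTIO_F_ACCESS_PLATFORM)) {
		dev_warn(&dev->dev,
			 "device must provide VIRTIO_F_ACCESS_PLATFORM\n");
		/* Refuse the device instead of trusting the host. */
		return -ENODEV;
	}
	return 0;
}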
swiotlb currently only uses the start address of a DMA to check if
something is in the swiotlb or not. But with virtio and untrusted hosts
the host could give some DMA mapping that crosses the swiotlb boundaries,
potentially leaking or corrupting data. Add size checks to all the swiotlb
checks and reject any DMAs that cross the swiotlb buffer boundaries.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/iommu/dma-iommu.c   | 13 ++++++-------
 drivers/xen/swiotlb-xen.c   | 11 ++++++-----
 include/linux/dma-mapping.h |  4 ++--
 include/linux/swiotlb.h     |  8 +++++---
 kernel/dma/direct.c         |  8 ++++----
 kernel/dma/direct.h         |  8 ++++----
 kernel/dma/mapping.c        |  4 ++--
 net/xdp/xsk_buff_pool.c     |  2 +-
 8 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..7ef13198721b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
 
 	__iommu_dma_unmap(dev, dma_addr, size);
 
-	if (unlikely(is_swiotlb_buffer(phys)))
+	if (unlikely(is_swiotlb_buffer(phys, size)))
 		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
 	}
 
 	iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-	if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+	if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size))
 		swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
 	return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,
 	if (!dev_is_dma_coherent(dev))
 		arch_sync_dma_for_cpu(phys, size, dir);
 
-	if (is_swiotlb_buffer(phys))
+	if (is_swiotlb_buffer(phys, size))
 		swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev,
 		return;
 
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-	if (is_swiotlb_buffer(phys))
+	if (is_swiotlb_buffer(phys, size))
 		swiotlb_sync_single_for_device(dev, phys, size, dir);
 
 	if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
 		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-		if (is_swiotlb_buffer(sg_phys(sg)))
+		if (is_swiotlb_buffer(sg_phys(sg), sg->length))
 			swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
 						    sg->length, dir);
 	}
@@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 		return;
 
 	for_each_sg(sgl, sg, nelems, i) {
-		if (is_swiotlb_buffer(sg_phys(sg)))
+		if (is_swiotlb_buffer(sg_phys(sg), sg->length))
 			swiotlb_sync_single_for_device(dev, sg_phys(sg),
 						       sg->length, dir);
-
 		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
 	}
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 24d11861ac7d..333846af8d35 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -89,7 +89,8 @@ static inline int range_straddles_page_boundary(phys_addr_t p, size_t size)
 	return 0;
 }
 
-static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
+static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr,
+				 size_t size)
 {
 	unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
 	unsigned long xen_pfn = bfn_to_local_pfn(bfn);
@@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 	 * in our domain. Therefore _only_ check address within our domain.
 	 */
 	if (pfn_valid(PFN_DOWN(paddr)))
-		return is_swiotlb_buffer(paddr);
+		return is_swiotlb_buffer(paddr, size);
 	return 0;
 }
 
@@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 	}
 
 	/* NOTE: We use dev_addr here, not paddr! */
-	if (is_xen_swiotlb_buffer(hwdev, dev_addr))
+	if (is_xen_swiotlb_buffer(hwdev, dev_addr, size))
 		swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
 }
 
@@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr,
 		xen_dma_sync_for_cpu(dev, dma_addr, size, dir);
 	}
 
-	if (is_xen_swiotlb_buffer(dev, dma_addr))
+	if (is_xen_swiotlb_buffer(dev, dma_addr, size))
 		swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
 }
 
@@ -458,7 +459,7 @@ xen_swiotlb_sync_single_for_device(struct device *dev, dma_addr_t dma_addr,
 {
 	phys_addr_t paddr = xen_dma_to_phys(dev, dma_addr);
 
-	if (is_xen_swiotlb_buffer(dev, dma_addr))
+	if (is_xen_swiotlb_buffer(dev, dma_addr, size))
 		swiotlb_sync_single_for_device(dev, paddr, size, dir);
 
 	if (!dev_is_dma_coherent(dev)) {
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 183e7103a66d..37fbd12bd4ab 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -142,7 +142,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+bool dma_need_sync(struct device *dev, dma_addr_t dma_addr, size_t size);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
 		enum dma_data_direction dir, gfp_t gfp, unsigned long attrs);
@@ -258,7 +258,7 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
 	return 0;
 }
-static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr, size_t size)
 {
 	return false;
 }
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..3e447f722d81 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -101,11 +101,13 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(phys_addr_t paddr, size_t size)
 {
 	struct io_tlb_mem *mem = io_tlb_default_mem;
 
-	return mem && paddr >= mem->start && paddr < mem->end;
+	if (paddr + size <= paddr) /* wrapping */
+		return false;
+	return mem && paddr >= mem->start && paddr + size <= mem->end;
 }
 
 void __init swiotlb_exit(void);
@@ -115,7 +117,7 @@ bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(phys_addr_t paddr, size_t size)
 {
 	return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..9ae6f94e868f 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,
 	for_each_sg(sgl, sg, nents, i) {
 		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-		if (unlikely(is_swiotlb_buffer(paddr)))
+		if (unlikely(is_swiotlb_buffer(paddr, sg->length)))
 			swiotlb_sync_single_for_device(dev, paddr, sg->length,
 						       dir);
 
@@ -369,7 +369,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_cpu(paddr, sg->length, dir);
 
-		if (unlikely(is_swiotlb_buffer(paddr)))
+		if (unlikely(is_swiotlb_buffer(paddr, sg->length)))
 			swiotlb_sync_single_for_cpu(dev, paddr, sg->length,
 						    dir);
 
@@ -501,10 +501,10 @@ size_t dma_direct_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }
 
-bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr)
+bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr, size_t size)
 {
 	return !dev_is_dma_coherent(dev) ||
-		is_swiotlb_buffer(dma_to_phys(dev, dma_addr));
+		is_swiotlb_buffer(dma_to_phys(dev, dma_addr), size);
 }
 
 /**
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 50afc05b6f1d..4a17e431ae56 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -16,7 +16,7 @@ bool dma_direct_can_mmap(struct device *dev);
 int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,
 		void *cpu_addr, dma_addr_t dma_addr, size_t size,
 		unsigned long attrs);
-bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr);
+bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr, size_t size);
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs);
 size_t dma_direct_max_mapping_size(struct device *dev);
@@ -56,7 +56,7 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
 {
 	phys_addr_t paddr = dma_to_phys(dev, addr);
 
-	if (unlikely(is_swiotlb_buffer(paddr)))
+	if (unlikely(is_swiotlb_buffer(paddr, size)))
 		swiotlb_sync_single_for_device(dev, paddr, size, dir);
 
 	if (!dev_is_dma_coherent(dev))
@@ -73,7 +73,7 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
 		arch_sync_dma_for_cpu_all();
 	}
 
-	if (unlikely(is_swiotlb_buffer(paddr)))
+	if (unlikely(is_swiotlb_buffer(paddr, size)))
 		swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
 
 	if (dir == DMA_FROM_DEVICE)
@@ -113,7 +113,7 @@ static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		dma_direct_sync_single_for_cpu(dev, addr, size, dir);
 
-	if (unlikely(is_swiotlb_buffer(phys)))
+	if (unlikely(is_swiotlb_buffer(phys, size)))
 		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 #endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 2b06a809d0b9..9bf02c8d7d1b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -716,12 +716,12 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+bool dma_need_sync(struct device *dev, dma_addr_t dma_addr, size_t size)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	if (dma_map_direct(dev, ops))
-		return dma_direct_need_sync(dev, dma_addr);
+		return dma_direct_need_sync(dev, dma_addr, size);
 	return ops->sync_single_for_cpu || ops->sync_single_for_device;
 }
 EXPORT_SYMBOL_GPL(dma_need_sync);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..c1e404fe0cf4 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -399,7 +399,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 			__xp_dma_unmap(dma_map, attrs);
 			return -ENOMEM;
 		}
-		if (dma_need_sync(dev, dma))
+		if (dma_need_sync(dev, dma, PAGE_SIZE))
 			dma_map->dma_need_sync = true;
 		dma_map->dma_pages[i] = dma;
 	}
-- 
2.25.4
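[Editor's note] To make the new semantics concrete, here is a small
standalone illustration of the size-aware check; the bounce-buffer window
and the example mappings are made-up values, not taken from any real
configuration:

/*
 * Standalone illustration of the size-aware swiotlb check above.
 * The window (tlb_start/tlb_end) and the mappings are hypothetical,
 * chosen only to show which cases are now rejected.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static const uint64_t tlb_start = 0x40000000;  /* hypothetical mem->start */
static const uint64_t tlb_end   = 0x44000000;  /* hypothetical mem->end   */

static bool is_swiotlb_buffer(uint64_t paddr, uint64_t size)
{
	if (paddr + size <= paddr)             /* wrapping */
		return false;
	return paddr >= tlb_start && paddr + size <= tlb_end;
}

int main(void)
{
	/* Fully inside the bounce buffer: still accepted. */
	printf("%d\n", is_swiotlb_buffer(tlb_start + 0x1000, 0x1000));  /* 1 */
	/* Starts inside but crosses mem->end: rejected now, would have
	 * passed the old start-only check. */
	printf("%d\n", is_swiotlb_buffer(tlb_end - 0x800, 0x1000));     /* 0 */
	/* Size that wraps the address space: rejected. */
	printf("%d\n", is_swiotlb_buffer(tlb_end - 0x800, UINT64_MAX)); /* 0 */
	return 0;
}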
In some situations when we know swiotlb is forced and we have to deal with
untrusted hosts, it's useful to know if a mapping was in the swiotlb or
not. This allows us to abort any IO operation that would access memory
outside the swiotlb.

Otherwise it might be possible for a malicious host to inject any guest
page in a read operation. While it couldn't directly access the results of
the read() inside the guest, there might be scenarios where data is echoed
back with a write(), and that would then leak guest memory.

Add a return value to dma_unmap_single/page. Most users of course will
ignore it. The return value is set to -EIO if we're in forced swiotlb mode
and the buffer is not inside the swiotlb buffer. Otherwise it's always 0.

A new callback is used to avoid changing all the IOMMU drivers.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/iommu/dma-iommu.c   | 17 +++++++++++------
 include/linux/dma-map-ops.h |  3 +++
 include/linux/dma-mapping.h |  7 ++++---
 kernel/dma/mapping.c        |  6 +++++-
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7ef13198721b..babe46f2ae3a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 	iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
-static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
+static int __iommu_dma_unmap_swiotlb_check(struct device *dev,
+		dma_addr_t dma_addr,
 		size_t size, enum dma_data_direction dir,
 		unsigned long attrs)
 {
@@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
 
 	phys = iommu_iova_to_phys(domain, dma_addr);
 	if (WARN_ON(!phys))
-		return;
+		return -EIO;
 
 	__iommu_dma_unmap(dev, dma_addr, size);
 
 	if (unlikely(is_swiotlb_buffer(phys, size)))
 		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
+	else if (swiotlb_force == SWIOTLB_FORCE)
+		return -EIO;
+	return 0;
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	return dma_handle;
 }
 
-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t dma_handle,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
-	__iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
+	return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir,
+					       attrs);
 }
 
 /*
@@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *sg,
 	int i;
 
 	for_each_sg(sg, s, nents, i)
-		__iommu_dma_unmap_swiotlb(dev, sg_dma_address(s),
+		__iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s),
 				sg_dma_len(s), dir, attrs);
 }
 
@@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.mmap			= iommu_dma_mmap,
 	.get_sgtable		= iommu_dma_get_sgtable,
 	.map_page		= iommu_dma_map_page,
-	.unmap_page		= iommu_dma_unmap_page,
+	.unmap_page_check	= iommu_dma_unmap_page_check,
 	.map_sg			= iommu_dma_map_sg,
 	.unmap_sg		= iommu_dma_unmap_sg,
 	.sync_single_for_cpu	= iommu_dma_sync_single_for_cpu,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..0ed0190f7949 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,9 @@ struct dma_map_ops {
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
 	unsigned long (*get_merge_boundary)(struct device *dev);
+	int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle,
+			size_t size, enum dma_data_direction dir,
+			unsigned long attrs);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 37fbd12bd4ab..25b8382f8601 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 		size_t offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs);
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs);
 int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
 		enum dma_data_direction dir, unsigned long attrs);
@@ -160,9 +160,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev,
 {
 	return DMA_MAPPING_ERROR;
 }
-static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
+static inline int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
+	return 0;
 }
 static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
@@ -323,7 +324,7 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 			size, dir, attrs);
 }
 
-static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
+static inline int dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
 	return dma_unmap_page_attrs(dev, addr, size, dir, attrs);
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9bf02c8d7d1b..dc0ce649d1f9 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -162,18 +162,22 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 }
 EXPORT_SYMBOL(dma_map_page_attrs);
 
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
+	int ret = 0;
 
 	BUG_ON(!valid_dma_direction(dir));
 	if (dma_map_direct(dev, ops) ||
 	    arch_dma_unmap_page_direct(dev, addr + size))
 		dma_direct_unmap_page(dev, addr, size, dir, attrs);
+	else if (ops->unmap_page_check)
+		ret = ops->unmap_page_check(dev, addr, size, dir, attrs);
 	else if (ops->unmap_page)
 		ops->unmap_page(dev, addr, size, dir, attrs);
 	debug_dma_unmap_page(dev, addr, size, dir);
+	return ret;
 }
 EXPORT_SYMBOL(dma_unmap_page_attrs);
-- 
2.25.4
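[Editor's note] A hedged sketch of how a data-path caller could consume the
new return value. The mydev/mydev_req structures and the helper functions
are hypothetical illustration names; only dma_unmap_single() with its new
int return comes from this patch:

/*
 * Hypothetical caller sketch: abort the completion if the unmap reports
 * that the buffer was outside the forced swiotlb.
 */
static int mydev_complete_rx(struct mydev *md, struct mydev_req *req)
{
	int err;

	err = dma_unmap_single(md->dev, req->dma, req->len, DMA_FROM_DEVICE);
	if (err) {
		/* Don't hand potentially host-controlled data to the stack. */
		dev_warn_ratelimited(md->dev,
				     "untrusted DMA address, dropping\n");
		return err;		/* typically -EIO */
	}

	return mydev_pass_to_stack(md, req);
}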
Andi Kleen
2021-Jun-03 00:41 UTC
[PATCH v1 7/8] virtio: Abort IO when descriptor points outside forced swiotlb
Now that we have a return value for unmapping DMA mappings that are
outside the forced swiotlb, use that to abort the IO operation.

This prevents the host from subverting a read to access some data in the
guest address space, which it might then get access to somehow through
another IO operation. It can subvert reads to point to other reads or
other writes, but since it controls IO it can do that anyway.

This is only done for the split code path, which is the only one supported
with confidential guests.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e9aa1e95e1b..244a5b62d85c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -365,29 +365,31 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
  * Split ring specific functions - *_split().
  */
 
-static void vring_unmap_one_split(const struct vring_virtqueue *vq,
-				  struct vring_desc *desc)
+static int vring_unmap_one_split(const struct vring_virtqueue *vq,
+				 struct vring_desc *desc)
 {
 	u16 flags;
+	int ret;
 
 	if (!vq->use_dma_api)
-		return;
+		return 0;
 
 	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
 	if (flags & VRING_DESC_F_INDIRECT) {
-		dma_unmap_single(vring_dma_dev(vq),
+		ret = dma_unmap_single(vring_dma_dev(vq),
 				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
 				 virtio32_to_cpu(vq->vq.vdev, desc->len),
 				 (flags & VRING_DESC_F_WRITE) ?
 				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
 	} else {
-		dma_unmap_page(vring_dma_dev(vq),
+		ret = dma_unmap_page(vring_dma_dev(vq),
 			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
 			       virtio32_to_cpu(vq->vq.vdev, desc->len),
 			       (flags & VRING_DESC_F_WRITE) ?
 			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
 	}
+	return ret;
 }
 
 static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
@@ -609,6 +611,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 			break;
 		if (!inside_split_ring(vq, i))
 			break;
+		/*
+		 * Ignore unmapping errors since
+		 * we're aborting anyways.
+		 */
 		vring_unmap_one_split(vq, &desc[i]);
 		i = virtio16_to_cpu(_vq->vdev, desc[i].next);
 	}
@@ -671,7 +677,10 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 
 	i = head;
 	while (vq->split.vring.desc[i].flags & nextflag) {
-		vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+		int ret;
+		ret = vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
+		if (ret)
+			return ret;
 		i = virtio16_to_cpu(vq->vq.vdev, vq->split.vring.desc[i].next);
 		if (!inside_split_ring(vq, i))
 			return -EIO;
@@ -878,6 +887,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
 			continue;
 		/* detach_buf_split clears data, so grab it now. */
 		buf = vq->split.desc_state[i].data;
+		/* Ignore unmap errors because there is nothing to abort */
 		detach_buf_split(vq, i, NULL);
 		/* Don't need to check for error because nothing is returned */
 		vq->split.avail_idx_shadow--;
-- 
2.25.4
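[Editor's note] From a driver's point of view the abort is invisible beyond
the missing completion: virtqueue_get_buf() simply returns NULL for a
rejected chain. A tiny hypothetical consumer sketch (mydrv_complete() is an
illustrative name, not from this series):

	/*
	 * Hypothetical completion loop: a chain rejected by the hardening
	 * checks simply never shows up, so no host-controlled data is
	 * processed.
	 */
	unsigned int len;
	void *buf;

	while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
		mydrv_complete(buf, len);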
Error out with a warning when the free list loops for longer than the
maximum number of descriptors while freeing descriptors. While technically
we don't care about DoS, it is still better to abort such a loop early. We
ran into this problem while fuzzing the virtio interactions, where the
fuzzed code would get stuck for a long time.

Signed-off-by: Andi Kleen <ak at linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 244a5b62d85c..96adaa4c5404 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,6 +685,11 @@ static int detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 		if (!inside_split_ring(vq, i))
 			return -EIO;
 		vq->vq.num_free++;
+		if (WARN_ONCE(vq->vq.num_free >
+				vq->split.queue_size_in_bytes /
+					sizeof(struct vring_desc),
+				"Virtio freelist corrupted"))
+			return -EIO;
 	}
 
 	vring_unmap_one_split(vq, &vq->split.vring.desc[i]);
-- 
2.25.4
On 2021/6/3 8:41 PM, Andi Kleen wrote:
> [v1: Initial post]
>
> With confidential computing like TDX the guest doesn't trust the host
> anymore. The host can of course still DoS the guest, but it is not allowed
> to read or write any guest memory that is not explicitly shared with it.
>
> This has implications for virtio. Traditionally virtio didn't assume
> the other side of the communication channel is malicious, and therefore
> didn't do any boundary checks on the virtio ring data structures.
>
> This patchkit does hardening for virtio. In a TDX-like model the only
> host memory accesses allowed are to the virtio ring and to the (forced)
> swiotlb buffer. The patchkit makes various changes to ensure there can be
> no access outside these two areas. The host can still break the
> communication, but that should only result in an IO error in the guest,
> not in a memory safety violation.
>
> virtio is quite complicated and has many modes. To simplify the task we
> enforce that virtio only runs in split mode without indirect descriptors
> when running as a TDX guest. We also enforce use of the DMA API. These
> code paths are then hardened against any corruption of the ring.
>
> The patchkit has components in three subsystems:
> - Hardening changes to virtio, all in the generic virtio ring code
> - Hardening changes to kernel/dma swiotlb to harden swiotlb against
>   malicious pointers. This requires an API change that needed a tree sweep.
> - A single x86 patch to enable arch_has_restricted_virtio_memory_access
>   for TDX
>
> It depends on Sathya's earlier patchkit that adds the basic infrastructure
> for TDX. That is only needed for the "am I running in TDX" part.

Note that it's probably needed by other cases as well:

1) Other encrypted VM technologies
2) VDUSE[1]
3) Smart NICs

We have already had discussions and some patches have been posted[2][3][4].
I think the basic idea is similar; basically, we don't trust any metadata
provided by the device.

[2] is the series that uses metadata stored in private memory which can't
be accessed by swiotlb; that series aims to eliminate all possible attacks
via virtqueue metadata.

[3] is one example of used length validation.

[4] is the fix for a malicious config space.

Thanks

[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376