Jean-Philippe Brucker
2017-Apr-07  19:17 UTC
[RFC 0/3] virtio-iommu: a paravirtualized IOMMU
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with fewer context switches to
the host and the possibility of in-kernel emulation.
When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.
Scenario 1: a hardware device passed through twice via VFIO
   MEM____pIOMMU________PCI device________________________       HARDWARE
            |     (2b)                                    \
  ----------|-------------+-------------+------------------\-------------
            |             :     KVM     :                   \
            |             :             :                    \
       pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
            |             :        |    :          |           \
          VFIO            :        |    :        VFIO           \
            |             :        |    :          |             \
            |             :        |    :          |             /
  ----------|-------------+--------|----+----------|------------/--------
            |                      |    :          |           /
            | (1c)            (1b) |    :     (1a) |          / (2a)
            |                      |    :          |         /
            |                      |    :          |        /   USERSPACE
            |___virtio-iommu dev___|    :        net drv___/
                                        :
  --------------------------------------+--------------------------------
                 HOST                   :             GUEST
(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
       buffer with mmap, obtaining virtual address VA. It then sends a
       VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
    b. The mapping request is relayed to the host through virtio
       (VIRTIO_IOMMU_T_MAP).
    c. The mapping request is relayed to the physical IOMMU through VFIO.
(2) a. The guest userspace driver can now instruct the device to directly
       access the buffer at IOVA.
    b. IOVA accesses from the device are translated into physical
       addresses by the IOMMU.
Scenario 2: a virtual net device behind a virtual IOMMU.
  MEM__pIOMMU___PCI device                                     HARDWARE
         |         |
  -------|---------|------+-------------+-------------------------------
         |         |      :     KVM     :
         |         |      :             :
    pIOMMU drv     |      :             :
             \     |      :      _____________virtio-net drv      KERNEL
              \_net drv   :     |       :          / (1a)
                   |      :     |       :         /
                  tap     :     |    ________virtio-iommu drv
                   |      :     |   |   : (1b)
  -----------------|------+-----|---|---+-------------------------------
                   |            |   |   :
                   |_virtio-net_|   |   :
                         / (2)      |   :
                        /           |   :                      USERSPACE
              virtio-iommu dev______|   :
                                        :
  --------------------------------------+-------------------------------
                 HOST                   :             GUEST
(1) a. Guest virtio-net driver maps the virtio ring and a buffer.
    b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
    IOMMU.
Physical and virtual IOMMUs are completely dissociated. The net driver is
mapping its own buffers via DMA/IOMMU API, and buffers are copied between
virtio-net and tap.
The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.
	1. Firmware note,
	2. device operations (draft for the virtio specification),
	3. future work/possible improvements.
Just to be clear on the terms I'm using:
pIOMMU	physical IOMMU, controlling DMA accesses from physical devices
vIOMMU	virtual IOMMU (virtio-iommu), controlling DMA accesses from
	physical and virtual devices to guest memory.
GVA, GPA, HVA, HPA
	Guest/Host Virtual/Physical Address
IOVA	I/O Virtual Address, the address accessed by a device doing DMA
	through an IOMMU. In the context of a guest OS, IOVA is GVA.
Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.
This proposal may be unintentionally centered around ARM architectures at
times. Any feedback would be appreciated, especially regarding other IOMMU
architectures.
Thanks,
Jean-Philippe
Jean-Philippe Brucker
2017-Apr-07  19:17 UTC
[RFC 1/3] virtio-iommu: firmware description of the virtual topology
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.
The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
virtual device with a 32-bit ID, which we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they must
not overlap within a single virtual IOMMU. Device IDs of passed-through
devices do not need to match IDs seen by the physical IOMMU.
The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
because with PCI the IOMMU interface would itself be an endpoint, and
existing firmware interfaces cannot describe IOMMU<->master relations
between PCI endpoints.
The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bit requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates one PCI
domain and a collection of platform devices.
                       Device ID    Requester ID
                  /       0x0           0x0      \
                 /         |             |        PCI domain 1
                /      0xffff           0xffff   /
        vIOMMU 1
                \     0x10000           0x0      \
                 \         |             |        PCI domain 2
                  \   0x1ffff           0xffff   /
                  /       0x0                    \
                 /         |                      platform devices
                /      0x1fff                    /
        vIOMMU 2
                \      0x2000           0x0      \
                 \         |             |        PCI domain 3
                  \   0x11fff           0xffff   /
Device-tree already offers a way to describe the topology. Here's an
example description of vIOMMU 2 with its devices:
	/* The virtual IOMMU is described with a virtio-mmio node */
	viommu2: virtio@10000 {
		compatible = "virtio,mmio";
		reg = <0x10000 0x200>;
		dma-coherent;
		interrupts = <0x0 0x5 0x1>;
		
		#iommu-cells = <1>;
	};
	
	/* Some platform device has Device ID 0x5 */
	somedevice@20000 {
		...
		
		iommus = <&viommu2 0x5>;
	};
	
	/*
	 * PCI domain 3 is described by its host controller node, along
	 * with the complete relation to the IOMMU
	 */
	pci {
		...
		/* Linear map between RIDs and Device IDs for the whole bus */
		iommu-map = <0x0 &viommu2 0x2000 0x10000>;
	};
For more details, please refer to [DT-IOMMU].
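To illustrate the linear map, here is a minimal sketch of how a guest
could turn a PCI requester ID into a Device ID from one iommu-map entry
of the form <rid-base &iommu iommu-base length>. The structure and
function names are hypothetical, and the values in the usage note are the
PCI domain 3 range from the diagram above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One iommu-map entry: <rid-base &iommu iommu-base length>. */
struct iommu_map_entry {
	uint32_t rid_base;	/* first requester ID covered by the entry */
	uint32_t devid_base;	/* Device ID assigned to rid_base */
	uint32_t length;	/* number of consecutive IDs covered */
};

/*
 * Translate a requester ID into a vIOMMU Device ID.
 * Returns false when the RID is not covered by the entry.
 */
static bool rid_to_device_id(const struct iommu_map_entry *e, uint32_t rid,
			     uint32_t *devid)
{
	assert(e && devid);

	if (rid < e->rid_base || rid - e->rid_base >= e->length)
		return false;

	*devid = e->devid_base + (rid - e->rid_base);
	return true;
}
```

With an entry { 0x0, 0x2000, 0x10000 }, requester ID 0x0 becomes Device
ID 0x2000 and requester ID 0xffff becomes Device ID 0x11fff, matching the
PCI domain 3 row of the diagram.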
For ACPI, we expect to add a new node type to the IO Remapping Table
specification [IORT], providing a similar mechanism for describing
translations via ACPI tables. The following is *not* a specification,
simply an example of what the node could be.
         Field      | Len.  | Off.  | Description
    ----------------|-------|-------|---------------------------------
     Type           | 1     | 0     | 5: paravirtualized IOMMU
     Length         | 2     | 1     | The length of the node.
     Revision       | 1     | 3     | 0
     Reserved       | 4     | 4     | Must be zero.
     Number of ID   | 4     | 8     |
       mappings     |       |       |
     Reference to   | 4     | 12    | Offset from the start of the
       ID Array     |       |       | IORT node to the start of its
                    |       |       | array of ID mappings.
                    |       |       |
     Model          | 4     | 16    | 0: virtio-iommu
     Device object  | --    | 20    | ASCII Null terminated string
       name         |       |       | with the full path to the entry
                    |       |       | in the namespace for this IOMMU.
     Padding        | --    | --    | To keep 32-bit alignment and
                    |       |       | leave space for future models.
                    |       |       |
     Array of ID    |       |       |
       mappings     | 20xN  | --    | ID Array.
The OS parses the IORT table to build a map of ID relations between IOMMU
and devices. ID Array is used to find correspondence between IOMMU IDs and
PCI or platform devices. Later on, the virtio-iommu driver finds the
associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.
[DT-IOMMU]
    https://www.kernel.org/doc/Documentation/devicetree/bindings/iommu/iommu.txt
    https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt
[IORT] IO Remapping Table, DEN0049B
    http://infocenter.arm.com/help/topic/com.arm.doc.den0049b/DEN0049B_IO_Remapping_Table.pdf
Jean-Philippe Brucker
2017-Apr-07  19:17 UTC
[RFC 2/3] virtio-iommu: device probing and operations
After the virtio-iommu device has been probed and the driver is aware of
the devices translated by the IOMMU, it can start sending requests to the
virtio-iommu device. The operations described here are deliberately
minimal, so vIOMMU devices can be as simple as possible to implement,
and can be extended with feature bits.
	I.   Overview
	II.  Feature bits
	III. Device configuration layout
	IV.  Device initialization
	V.   Device operations
	     1. Attach device
	     2. Detach device
	     3. Map region
	     4. Unmap region
  I. Overview
  ==========
Requests are small buffers added by the guest to the request virtqueue.
The guest can add a batch of them to the queue and send a notification
(kick) to the device to have all of them handled.
Here is an example flow:
* attach(address space, device), kick: create a new address space and
  attach a device to it
* map(address space, virt, phys, size, flags): create a mapping between a
  guest-virtual and a guest-physical address
* map, map, map, kick
* ... here the guest device can perform DMA to the freshly mapped memory
* unmap(address space, virt, size), unmap, kick
* detach(address space, device), kick
The following description attempts to use the same format as other virtio
devices. We won't go into details of the virtio transport, please refer to
[VIRTIO-v1.0] for more information.
As a quick reminder, the virtio (1.0) transport can be described with the
following flow:
                             HOST  :  GUEST
                     (3)           :
                    .----- [available ring] <-----. (2)
                   /               :               \
                  v   (4)          :          (1)   \
            [device] <--- [descriptor table] <---- [driver]
                  \                :                 ^
                   \               :                /
                (5) '-------> [used ring] ---------'
                                   :            (6)
                                   :
(1) Driver has a buffer with a payload to send via virtio. It writes
    address and size of buffer in a descriptor. It can chain N sub-buffers
    by writing N descriptors and linking them together. The first
    descriptor of the chain is referred to as the head.
(2) Driver queues the head index into the 'available' ring.
(3) Driver notifies the device. Since virtio-iommu uses MMIO, notification
    is done by writing to a doorbell address. KVM traps it and forwards
    the notification to the virtio device. Device dequeues the head index
    from the 'available' ring.
(4) Device reads all descriptors in the chain, handles the payload.
(5) Device writes the head index into the 'used' ring and sends a
    notification to the guest, by injecting an interrupt.
(6) Driver pops the head from the used ring, and optionally reads the
    buffers that were updated by the device.
  II. Feature bits
  ===============
VIRTIO_IOMMU_F_INPUT_RANGE (0)
 Available range of virtual addresses is described in input_range
VIRTIO_IOMMU_F_IOASID_BITS (1)
 The number of address spaces supported is described in ioasid_bits
VIRTIO_IOMMU_F_MAP_UNMAP (2)
 Map and unmap requests are available. This is here to allow a device or
 driver to only implement page-table sharing, once we introduce the
 feature. Device will be able to only select one of F_MAP_UNMAP or
 F_PT_SHARING. For the moment, this bit must always be set.
 
VIRTIO_IOMMU_F_BYPASS (3)
 When not attached to an address space, devices behind the IOMMU can
 access the physical address space.
  III. Device configuration layout
  ===============================
	struct virtio_iommu_config {
		u64 page_size_mask;
		struct virtio_iommu_range {
			u64 start;
			u64 end;
		} input_range;
		u8 ioasid_bits;
	};
  IV. Device initialization
  ========================
1. page_size_mask contains the bitmask of all page sizes that can be
   mapped. The least significant bit set defines the page granularity of
   IOMMU mappings. Other bits in the mask are hints describing page sizes
   that the IOMMU can merge into a single mapping (page blocks).
   There is no lower limit for the smallest page granularity supported by
   the IOMMU. It is legal for the driver to map one byte at a time if the
   device advertises it.
   page_size_mask must have at least one bit set.
2. If the VIRTIO_IOMMU_F_IOASID_BITS feature is negotiated, ioasid_bits
   contains the number of bits supported in an I/O Address Space ID, the
   identifier used in map/unmap requests. A value of 0 is valid, and means
   that a single address space is supported.
   If the feature is not negotiated, address space identifiers can use up
   to 32 bits.
3. If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, input_range
   contains the virtual address range that the IOMMU is able to translate.
   Any mapping request to virtual addresses outside of this range will
   fail.
   If the feature is not negotiated, virtual mappings span over the whole
   64-bit address space (start = 0, end = 0xffffffffffffffff).
4. If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, devices behind the
   IOMMU not attached to an address space are allowed to access
   guest-physical addresses. Otherwise, accesses to guest-physical
   addresses may fault.
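As a rough sketch of how a driver might interpret these configuration
fields, the helpers below derive the page granularity from
page_size_mask, check the alignment of a request, and compute the largest
usable address space ID. The function names are made up for this example,
and passing ioasid_bits = 32 stands for the case where
VIRTIO_IOMMU_F_IOASID_BITS is not negotiated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Page granularity: the least significant bit set in page_size_mask. */
static uint64_t viommu_granule(uint64_t page_size_mask)
{
	assert(page_size_mask != 0);	/* at least one bit must be set */
	return page_size_mask & (~page_size_mask + 1);
}

/* True when both the address and the size are multiples of the granule. */
static bool viommu_is_aligned(uint64_t page_size_mask, uint64_t addr,
			      uint64_t size)
{
	uint64_t granule = viommu_granule(page_size_mask);

	return ((addr | size) & (granule - 1)) == 0;
}

/*
 * Highest valid address space ID. ioasid_bits == 0 means a single
 * address space (ID 0); 32 or more means the full 32-bit ID space.
 */
static uint32_t viommu_max_ioasid(unsigned int ioasid_bits)
{
	if (ioasid_bits >= 32)
		return UINT32_MAX;
	return ((uint32_t)1 << ioasid_bits) - 1;
}
```

For a mask of 0x40201000 (4kB, 2MB and 1GB), the granule is 0x1000, so a
request for address 0x2000 and size 0x3000 is aligned while address 0x2800
is not.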
  V. Device operations
  ===================
The driver sends requests on the request virtqueue (0), notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writeable. Each request must therefore be described with at least two
descriptors, as illustrated below.
	31                       7      0
	+--------------------------------+ <------- RO descriptor
	|      0 (reserved)     |  type  |
	+--------------------------------+
	|                                |
	|            payload             |
	|                                | <------- WO descriptor
	+--------------------------------+
	|      0 (reserved)     | status |
	+--------------------------------+
	struct virtio_iommu_req_head {
		u8	type;
		u8	reserved[3];
	};
	struct virtio_iommu_req_tail {
		u8	status;
		u8	reserved[3];
	};
(Note on the format choice: this format forces the payload to be split in
two - one read-only buffer, one write-only. It is necessary and sufficient
for our purpose, and does not close the door to future extensions with
more complex requests, such as a WO field sandwiched between two RO ones.
With virtio 1.0 ring requirements, such a request would need to be
described by two chains of descriptors, which might be more complex to
implement efficiently, but still possible. Both devices and drivers must
assume that requests are segmented anyway.)
Type may be one of:
VIRTIO_IOMMU_T_ATTACH			1
VIRTIO_IOMMU_T_DETACH			2
VIRTIO_IOMMU_T_MAP			3
VIRTIO_IOMMU_T_UNMAP			4
A few general-purpose status codes are defined here. Driver must not
assume a specific status to be returned for an invalid request. Except for
0 that always means "success", these values are hints to make
troubleshooting easier.
VIRTIO_IOMMU_S_OK			0
 All good! Carry on.
VIRTIO_IOMMU_S_IOERR			1
 Virtio communication error 
VIRTIO_IOMMU_S_UNSUPP			2
 Unsupported request
VIRTIO_IOMMU_S_DEVERR			3
 Internal device error
VIRTIO_IOMMU_S_INVAL			4
 Invalid parameters
VIRTIO_IOMMU_S_RANGE			5
 Out-of-range parameters
VIRTIO_IOMMU_S_NOENT			6
 Entry not found
VIRTIO_IOMMU_S_FAULT			7
 Bad address
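To make the request layout concrete, here is a minimal sketch that
serializes an attach request (payload defined in the next section) into
the two parts a driver would describe with one device-readable and one
device-writable descriptor. The req_buffers container and the helper
names are hypothetical; a real driver would hand these buffers to the
virtqueue rather than keep them in a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_IOMMU_T_ATTACH	1

/* Store a 32-bit value in little-endian order, whatever the host is. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

/* Hypothetical container for the two parts of one request. */
struct req_buffers {
	uint8_t	ro[16];		/* head (4 bytes) + attach payload (12 bytes) */
	size_t	ro_len;
	uint8_t	wo[4];		/* tail: status + reserved, written by device */
	size_t	wo_len;
};

static void build_attach_request(struct req_buffers *b,
				 uint32_t address_space, uint32_t device)
{
	assert(b);
	memset(b, 0, sizeof(*b));	/* reserved fields must be zero */

	b->ro[0] = VIRTIO_IOMMU_T_ATTACH;	/* head.type */
	put_le32(&b->ro[4], address_space);
	put_le32(&b->ro[8], device);
	/* flags/reserved at offset 12 stays zero */
	b->ro_len = sizeof(b->ro);
	b->wo_len = sizeof(b->wo);
}
```

After the device returns the request in the used ring, the driver would
read the status from the first byte of the device-writable part.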
  1. Attach device
  ----------------
struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	flags/reserved;
};
Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
device, it is created. 'device' is an identifier unique to the IOMMU. The
host communicates the unique device ID to the guest during boot. The method
used to communicate this ID is outside the scope of this specification,
but the following rules must apply:
* The device ID is unique from the IOMMU point of view. Multiple devices
  whose DMA transactions are not translated by the same IOMMU may have the
  same device ID. Devices whose DMA transactions may be translated by the
  same IOMMU must have different device IDs.
* Sometimes the host cannot completely isolate two devices from each
  other. For example on a legacy PCI bus, devices can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these devices from each
  other. The method used to communicate this is outside the scope of this
  specification. The IOMMU device must ensure that devices that cannot be
  isolated by the host have the same address spaces.
Multiple devices may be added to the same address space. A device cannot
be attached to multiple address spaces (that is, with the map/unmap
interface. For SVM, see page table and context table sharing proposal.)
If the device is already attached to another address space 'old', it is
detached from the old one and attached to the new one. The device cannot
access mappings from the old address space after this request completes.
The device either returns VIRTIO_IOMMU_S_OK, or an error status. We
suggest the following error statuses, which would help debug the driver.
NOENT: device not found.
RANGE: address space is outside the range allowed by ioasid_bits.
  2. Detach device
  ----------------
struct virtio_iommu_req_detach {
	le32	device;
	le32	flags/reserved;
};
Detach a device from its address space. When this request completes, the
device cannot access any mapping from that address space anymore. If the
device isn't attached to any address space, the request returns
successfully.
After all devices have been successfully detached from an address space,
its ID can be reused by the driver for another address space.
NOENT: device not found.
INVAL: device wasn't attached to any address space.
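The attach and detach rules can be sketched with a toy device-side model.
Everything prefixed with toy_ or TOY_ is invented for this example: the
model knows a fixed set of Device IDs, stores at most one address space
per device, and follows the prose above by letting detach succeed on a
device that is not attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_IOMMU_S_OK	0
#define VIRTIO_IOMMU_S_RANGE	5
#define VIRTIO_IOMMU_S_NOENT	6

#define TOY_NR_DEVICES	4	/* hypothetical: Device IDs 0..3 exist */

struct toy_iommu {
	unsigned int	ioasid_bits;	/* from the config space, 0..32 */
	bool		attached[TOY_NR_DEVICES];
	uint32_t	address_space[TOY_NR_DEVICES];
};

static int toy_attach(struct toy_iommu *iommu, uint32_t address_space,
		      uint32_t device)
{
	assert(iommu);

	if (device >= TOY_NR_DEVICES)
		return VIRTIO_IOMMU_S_NOENT;
	if (iommu->ioasid_bits < 32 &&
	    (address_space >> iommu->ioasid_bits) != 0)
		return VIRTIO_IOMMU_S_RANGE;

	/* Overwriting the old ID implicitly detaches from the old space. */
	iommu->attached[device] = true;
	iommu->address_space[device] = address_space;
	return VIRTIO_IOMMU_S_OK;
}

static int toy_detach(struct toy_iommu *iommu, uint32_t device)
{
	assert(iommu);

	if (device >= TOY_NR_DEVICES)
		return VIRTIO_IOMMU_S_NOENT;

	iommu->attached[device] = false;	/* no-op if not attached */
	return VIRTIO_IOMMU_S_OK;
}

/* An address space ID can be reused once no device is attached to it. */
static bool toy_address_space_in_use(const struct toy_iommu *iommu,
				     uint32_t address_space)
{
	for (uint32_t i = 0; i < TOY_NR_DEVICES; i++)
		if (iommu->attached[i] &&
		    iommu->address_space[i] == address_space)
			return true;
	return false;
}
```

A real device would also tear down the mappings of an address space once
its last device is detached, which this model leaves out.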
  3. Map region
  -------------
struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};
VIRTIO_IOMMU_MAP_F_READ		0x1
VIRTIO_IOMMU_MAP_F_WRITE	0x2
VIRTIO_IOMMU_MAP_F_EXEC		0x4
Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses. Size must always be a multiple of the
page granularity negotiated during initialization. Both phys_addr and
virt_addr must be aligned on the page granularity. The address space must
have been created with VIRTIO_IOMMU_T_ATTACH.
The range defined by (virt_addr, size) must be within the limits specified
by input_range. The range defined by (phys_addr, size) must be within the
guest-physical address space. This includes upper and lower limits, as
well as any carving of guest-physical addresses for use by the host (for
instance MSI doorbells). Guest physical boundaries are set by the host
using a firmware mechanism outside the scope of this specification.
(Note that this format prevents creating the identity mapping in a
single request (0x0 - 0xfff...fff) -> (0x0 - 0xfff...fff), since it would
result in a size of zero. Hopefully allowing VIRTIO_IOMMU_F_BYPASS
eliminates the need for issuing such a request. It would also be unlikely to
conform to the physical range restrictions from the previous paragraph.)
(Another note, on flags: it is unlikely that all possible combinations of
flags will be supported by the physical IOMMU. For instance, (W & !R) or
(E & W) might be invalid. I haven't taken time to devise a clever way to
advertise supported and implicit (for instance "W implies R") flags or
combination thereof for the moment, but I could at least try to research
common models. Keeping in mind that we might soon want to add more flags,
such as privileged, device, transient, shared, etc. whatever these would
mean)
This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.
INVAL: invalid flags
RANGE: virt_addr, phys_addr or range are not in the limits specified
       during negotiation. For instance, not aligned to page granularity.
NOENT: address space not found.
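The checks a device might run on a map request before touching any page
table can be sketched as follows. This is only an illustration: the
map_limits structure is hypothetical, the guest-physical space is
simplified to a single range [0, gpa_last] with no carve-outs, and
returning RANGE for a zero size is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_IOMMU_S_OK		0
#define VIRTIO_IOMMU_S_INVAL		4
#define VIRTIO_IOMMU_S_RANGE		5

#define VIRTIO_IOMMU_MAP_F_READ		0x1u
#define VIRTIO_IOMMU_MAP_F_WRITE	0x2u
#define VIRTIO_IOMMU_MAP_F_EXEC		0x4u
#define MAP_F_ALL	(VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE | \
			 VIRTIO_IOMMU_MAP_F_EXEC)

struct map_limits {
	uint64_t granule;	/* page granularity, a power of two */
	uint64_t input_start;	/* input_range.start */
	uint64_t input_end;	/* input_range.end, inclusive */
	uint64_t gpa_last;	/* hypothetical last guest-physical address */
};

/* True when [addr, addr + size - 1] lies within [first, last]; size >= 1. */
static bool range_fits(uint64_t addr, uint64_t size, uint64_t first,
		       uint64_t last)
{
	if (size - 1 > UINT64_MAX - addr)	/* would wrap around 2^64 */
		return false;
	return addr >= first && addr + (size - 1) <= last;
}

static int check_map(const struct map_limits *lim, uint64_t virt,
		     uint64_t phys, uint64_t size, uint32_t flags)
{
	assert(lim->granule && (lim->granule & (lim->granule - 1)) == 0);

	if (flags & ~MAP_F_ALL)
		return VIRTIO_IOMMU_S_INVAL;
	if (size == 0 || ((virt | phys | size) & (lim->granule - 1)))
		return VIRTIO_IOMMU_S_RANGE;
	if (!range_fits(virt, size, lim->input_start, lim->input_end))
		return VIRTIO_IOMMU_S_RANGE;
	if (!range_fits(phys, size, 0, lim->gpa_last))
		return VIRTIO_IOMMU_S_RANGE;
	return VIRTIO_IOMMU_S_OK;
}
```

The lookup of the address space itself (NOENT) is left out, since it
depends on how the device stores its address spaces.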
  4. Unmap region
  ---------------
struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};
Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. The range,
defined by virt_addr and size, must exactly cover one or more contiguous
mappings created with MAP requests. All mappings covered by the range are
removed. Driver should not send a request covering unmapped areas.
We define a mapping as a virtual region created with a single MAP request.
virt_addr should exactly match the start of an existing mapping. The end
of the range, (virt_addr + size - 1), should exactly match the end of an
existing mapping. Device must reject any request that would affect only
part of a mapping. If the requested range spills outside of mapped
regions, the device's behaviour is undefined.
These rules are illustrated with the following requests (with arguments
(va, size)), assuming each example sequence starts with a blank address
space:
	map(0, 10)
	unmap(0, 10) -> allowed
	map(0, 5)
	map(5, 5)
	unmap(0, 10) -> allowed
	map(0, 10)
	unmap(0, 5) -> forbidden
	map(0, 10)
	unmap(0, 15) -> undefined
	map(0, 5)
	map(10, 5)
	unmap(0, 15) -> undefined
(Note: the semantics of unmap are chosen to be compatible with VFIO's
type1 v2 IOMMU API. This way a device serving as intermediary between
guest and VFIO doesn't have to keep an internal tree of mappings. They are
a bit tighter than VFIO, in that they don't allow unmap spilling outside
mapped regions. Spilling is 'undefined' at the moment, because it should
work in most cases but I don't know if it's worth the added complexity in
devices that are not simply transmitting requests to VFIO. Splitting
mappings won't ever be allowed, but see the relaxed proposal in 3/3 for
more lenient semantics)
This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.
NOENT: address space not found.
FAULT: mapping not found.
RANGE: request would split a mapping.
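The unmap rules and the example sequences above can be checked with a
small model that keeps mappings in a flat array. This is a sketch under
assumptions, not the kvmtool implementation: all names are invented, sizes
are non-zero and do not wrap, mappings never overlap each other, and the
'undefined' spilling case is simply reported without changing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS	8

struct mapping {
	uint64_t va;
	uint64_t size;
};

struct addr_space {
	struct mapping	maps[MAX_MAPPINGS];
	size_t		nr;
};

enum unmap_result { UNMAP_ALLOWED, UNMAP_FORBIDDEN, UNMAP_UNDEFINED };

static void as_map(struct addr_space *as, uint64_t va, uint64_t size)
{
	assert(as->nr < MAX_MAPPINGS && size > 0);
	as->maps[as->nr++] = (struct mapping){ va, size };
}

static enum unmap_result as_unmap(struct addr_space *as, uint64_t va,
				  uint64_t size)
{
	uint64_t last, covered = 0;
	size_t i, kept = 0;

	assert(size > 0);
	last = va + size - 1;

	for (i = 0; i < as->nr; i++) {
		uint64_t m_first = as->maps[i].va;
		uint64_t m_last = m_first + as->maps[i].size - 1;

		if (m_last < va || m_first > last)
			continue;		/* no overlap */
		if (m_first < va || m_last > last)
			return UNMAP_FORBIDDEN;	/* would split a mapping */
		covered += as->maps[i].size;
	}

	/* Holes, or a range reaching past the mappings: spilling. */
	if (covered != size)
		return UNMAP_UNDEFINED;

	/* Remove every mapping inside the range, keep the others. */
	for (i = 0; i < as->nr; i++) {
		uint64_t m_first = as->maps[i].va;

		if (m_first < va || m_first > last)
			as->maps[kept++] = as->maps[i];
	}
	as->nr = kept;
	return UNMAP_ALLOWED;
}
```

Since mappings do not overlap, the sizes of the mappings fully inside the
range add up to the requested size only when the range is covered exactly,
which is what separates the allowed case from the spilling one.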
[VIRTIO-v1.0] Virtual I/O Device (VIRTIO) Version 1.0.  03 December 2013.
              Committee Specification Draft 01 / Public Review Draft 01.
              http://docs.oasis-open.org/virtio/virtio/v1.0/csprd01/virtio-v1.0-csprd01.html
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.
	I.   Linux host
	     1. vhost-iommu
	     2. VFIO nested translation
	II.  Page table sharing
	     1. Sharing IOMMU page tables
	     2. Sharing MMU page tables (SVM)
	     3. Fault reporting
	     4. Host implementation with VFIO
	III. Relaxed operations
	IV.  Misc
  I. Linux host
  ============
  1. vhost-iommu
  --------------
An advantage of virtualizing an IOMMU using virtio is that it allows
hoisting a lot of the emulation code into the kernel using vhost, avoiding
a return to userspace for each request. The mainline kernel already
implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
could be reused.
Introducing vhost in a simplified scenario 1 (removed guest userspace
pass-through, irrelevant to this example) gives us the following:
  MEM____pIOMMU________PCI device____________                    HARDWARE
            |                                \
  ----------|-------------+-------------+-----\--------------------------
            |             :     KVM     :      \
       pIOMMU drv         :             :       \                  KERNEL
            |             :             :     net drv
          VFIO            :             :       /
            |             :             :      /
       vhost-iommu_________________________virtio-iommu-drv
                          :             :
  --------------------------------------+-------------------------------
                 HOST                   :             GUEST
Introducing vhost in scenario 2, userspace now only handles the device
initialisation part, and most runtime communication is handled in kernel:
  MEM__pIOMMU___PCI device                                     HARDWARE
         |         |
  -------|---------|------+-------------+-------------------------------
         |         |      :     KVM     :
    pIOMMU drv     |      :             :                         KERNEL
             \__net drv   :             :
                   |      :             :
                  tap     :             :
                   |      :             :
              _vhost-net________________________virtio-net drv
         (2) /            :             :           / (1a)
            /             :             :          /
   vhost-iommu________________________________virtio-iommu drv
                          :             : (1b)
  ------------------------+-------------+-------------------------------
                 HOST                   :             GUEST
(1) a. Guest virtio driver maps ring and buffers
    b. Map requests are relayed to the host the same way.
(2) To access any guest memory, vhost-net must query the IOMMU. We can
    reuse the existing TLB protocol for this. TLB commands are written to
    and read from the vhost-net fd.
As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
has everything needed for map/unmap operations:
	struct vhost_iotlb_msg {
		__u64	iova;
		__u64	size;
		__u64	uaddr;
		__u8	perm; /* R/W */
		__u8	type;
	#define VHOST_IOTLB_MISS
	#define VHOST_IOTLB_UPDATE	/* MAP */
	#define VHOST_IOTLB_INVALIDATE	/* UNMAP */
	#define VHOST_IOTLB_ACCESS_FAIL
	};
	struct vhost_msg {
		int type;
		union {
			struct vhost_iotlb_msg iotlb;
			__u8 padding[64];
		};
	};
The vhost-iommu device associates a virtual device ID to a TLB fd. We
should be able to use the same commands for [vhost-net <-> virtio-iommu]
and [virtio-net <-> vhost-iommu] communication. A virtio-net device
would open a socketpair and hand one side to vhost-iommu.
If vhost_msg is ever used for a purpose other than TLB, we'll have some
trouble, as there will be multiple clients that want to read/write the
vhost fd. A multicast transport method will be needed. Until then, this
can work.
Details of operations would be:
(1) Userspace sets up vhost-iommu as with other vhost devices, by using
standard vhost ioctls. Userspace starts by describing the system topology
via ioctl:
	ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
	      vhost_iommu_add_device)
	#define VHOST_IOMMU_DEVICE_TYPE_VFIO
	#define VHOST_IOMMU_DEVICE_TYPE_TLB
	struct vhost_iommu_add_device {
		__u8 type;
		__u32 devid;
		union {
			struct vhost_iommu_device_vfio {
				int vfio_group_fd;
			};
			struct vhost_iommu_device_tlb {
				int fd;
			};
		};
	};
(2) VIRTIO_IOMMU_T_ATTACH(address space, devid)
vhost-iommu creates an address space if necessary, finds the device along
with the relevant operations. If type is VFIO, operations are done on a
container, otherwise they are done on single devices.
(3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)
Turn phys into an hva using the vhost mem table.
- If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
  mapping locally and wait for the TLB to ask for it with a
  VHOST_IOTLB_MISS.
- If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
  introduce a shortcut in the external user API of VFIO).
(4) VIRTIO_IOMMU_T_UNMAP(address space, virt, size)
- If type is TLB, send a VHOST_IOTLB_INVALIDATE.
- If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.
(5) VIRTIO_IOMMU_T_DETACH(address space, devid)
Undo whatever was done in (2).
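For the TLB case, steps (3) and (4) could look roughly like the sketch
below: look up the guest-physical range in a memory table to get a host
virtual address, then fill a TLB message. The toy_ structures and
constants are stand-ins invented for this example so that it compiles on
its own; real code would use struct vhost_iotlb_msg and the vhost memory
table from the vhost UAPI header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_IOMMU_MAP_F_READ		0x1
#define VIRTIO_IOMMU_MAP_F_WRITE	0x2

/* Stand-ins for this sketch only, not the real vhost definitions. */
#define TOY_ACCESS_RO	0x1
#define TOY_ACCESS_WO	0x2
enum toy_tlb_type { TOY_TLB_UPDATE, TOY_TLB_INVALIDATE };

struct toy_mem_region {		/* one entry of the memory table */
	uint64_t gpa;
	uint64_t size;
	uint64_t hva;
};

struct toy_iotlb_msg {
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;
	uint8_t	 perm;
	uint8_t	 type;
};

/* Find the region containing [gpa, gpa + size) and return its HVA. */
static bool toy_gpa_to_hva(const struct toy_mem_region *tbl, size_t n,
			   uint64_t gpa, uint64_t size, uint64_t *hva)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t off;

		if (gpa < tbl[i].gpa)
			continue;
		off = gpa - tbl[i].gpa;
		if (off < tbl[i].size && size <= tbl[i].size - off) {
			*hva = tbl[i].hva + off;
			return true;
		}
	}
	return false;
}

/* Step (3): turn a MAP request into a TLB update message. */
static bool toy_map_to_update(const struct toy_mem_region *tbl, size_t n,
			      uint64_t virt, uint64_t phys, uint64_t size,
			      uint32_t flags, struct toy_iotlb_msg *msg)
{
	uint64_t hva;

	assert(msg);
	if (!toy_gpa_to_hva(tbl, n, phys, size, &hva))
		return false;

	msg->iova = virt;
	msg->size = size;
	msg->uaddr = hva;
	msg->perm = 0;
	if (flags & VIRTIO_IOMMU_MAP_F_READ)
		msg->perm |= TOY_ACCESS_RO;
	if (flags & VIRTIO_IOMMU_MAP_F_WRITE)
		msg->perm |= TOY_ACCESS_WO;
	msg->type = TOY_TLB_UPDATE;
	return true;
}

/* Step (4): turn an UNMAP request into a TLB invalidation message. */
static void toy_unmap_to_invalidate(uint64_t virt, uint64_t size,
				    struct toy_iotlb_msg *msg)
{
	assert(msg);
	*msg = (struct toy_iotlb_msg){ .iova = virt, .size = size,
				       .type = TOY_TLB_INVALIDATE };
}
```

A mapping that straddles two memory regions is rejected here for
simplicity; a real implementation might split it into several messages.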
  2. VFIO nested translation
  --------------------------
For my current kvmtool implementation, I am putting each VFIO group in a
different container during initialization. We cannot detach a group from a
container at runtime without first resetting all devices in that group. So
the best way to provide dynamic address spaces right now is one container
per group. The drawback is that we need to maintain multiple sets of page
tables even if the guest wants to put all devices in the same address
space. Another disadvantage is when implementing bypass mode, we need to
map the whole address space at the beginning, then unmap everything on
attach. Adding nested support would be a nice way to provide dynamic
address spaces while keeping groups tied to a container at all times.
A physical IOMMU may offer nested translation. In this case, address
spaces are managed by two page directories instead of one. A guest-virtual
address is translated into a guest-physical one using what we'll call here
"stage-1" (s1) page tables, and the guest-physical address is translated
into a host-physical one using "stage-2" (s2) page tables.
                             s1      s2
                         GVA --> GPA --> HPA
There isn't a lot of support in Linux for nesting IOMMU page directories
at the moment (though SVM support is coming, see II). VFIO does have a
"nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
code uses this to decide whether to manage the container with s2 page
tables instead of s1, but even then we still only have a single stage and
it is assumed that IOVA=GPA.
Another model that would help with dynamically changing address spaces is
nesting VFIO containers:
                           Parent  <---------- map/unmap
                          container
                         /   |     \
                        /   group   \
                     Child         Child  <--- map/unmap
                   container     container
                    |   |             |
                 group group        group
At the beginning all groups are attached to the parent container, and
there is no child container. Doing map/unmap on the parent container maps
stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
be able to choose whether they want all devices attached to this container
to be able to access GPAs (bypass mode, as it currently is) or simply
block all DMA (in which case there is no need to pin pages here).
At some point the guest wants to create an address space and attaches
children to it. Using an ioctl (to be defined), we can derive a child
container from the parent container, and move groups from parent to child.
This returns a child fd. When the guest maps something in this new address
space, we can do a map ioctl on the child container, which maps stage-1
page tables (map GVA -> GPA).
A page table walk may access multiple levels of tables (pgd, p4d, pud,
pmd, pt). With nested translation, each access to a table during the
stage-1 walk requires a stage-2 walk. This makes a full translation costly
so it is preferable to use a single stage of translation when possible.
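To give a rough idea of that cost, here is a small sketch (not part of the
proposal) that counts memory accesses for one nested translation. It assumes
no TLB or walk-cache hit, and that every stage-1 table pointer as well as the
final output address is guest-physical and needs a full stage-2 walk. The
function name is made up for illustration.

```c
#include <assert.h>

/*
 * Illustrative only: worst-case number of memory accesses for one nested
 * translation, assuming no TLB or walk cache hits.
 *
 * Each of the s1_levels stage-1 table reads is preceded by a stage-2 walk
 * (s2_levels reads) to locate the table, and the final guest-physical
 * address needs one more stage-2 walk.
 */
static unsigned int nested_walk_accesses(unsigned int s1_levels,
					 unsigned int s2_levels)
{
	assert(s1_levels > 0 && s2_levels > 0);

	return s1_levels * (s2_levels + 1) + s2_levels;
}
```

With four levels at each stage this gives 24 accesses, against 4 for a
single-stage walk.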
Folding two stages into one is simple with a single container, as shown in
the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
fold the full GVA->HVA mapping before sending the VFIO request. With
nested containers however, the IOMMU driver would have to do the folding
work itself. Keeping a copy of stage-2 mapping created on the parent
container, it would fold them into the actual stage-2 page tables when
receiving a map request on the child container (note that software folding
is not possible when stage-1 pgd is managed by the guest, as described in
next section).
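A minimal sketch of the single-container folding, assuming the host keeps a
flat array of GPA->HVA regions. The structure and function names are
hypothetical and do not come from kvmtool, and a request that straddles two
regions is simply rejected here rather than split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a stage-2 mapping kept by the host (GPA -> HVA) */
struct s2_region {
	uint64_t gpa;
	uint64_t hva;
	uint64_t size;
};

/*
 * Fold a stage-1 request (GVA -> GPA, 'size' bytes) with the stage-2
 * regions. On success, store in *hva the address to use for the folded
 * GVA -> HVA mapping and return 0. Return -1 if [gpa, gpa + size) isn't
 * fully covered by a single region.
 */
static int fold_mapping(const struct s2_region *regions, size_t nr_regions,
			uint64_t gpa, uint64_t size, uint64_t *hva)
{
	size_t i;

	assert(hva != NULL);

	for (i = 0; i < nr_regions; i++) {
		const struct s2_region *r = &regions[i];

		/* Written this way to avoid overflowing gpa + size */
		if (gpa >= r->gpa && size <= r->size &&
		    gpa - r->gpa <= r->size - size) {
			*hva = r->hva + (gpa - r->gpa);
			return 0;
		}
	}

	return -1;
}
```

The resulting HVA would then be passed, together with the GVA as IOVA, in the
map request sent to VFIO.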
I don't know if nested VFIO containers are a desirable feature at all. I
find the concept cute on paper, and it would make it easier for userspace
to juggle with address spaces, but it might require some invasive changes
in VFIO, and people have been able to use the current API for IOMMU
virtualization so far.
  II. Page table sharing
  =====================
  1. Sharing IOMMU page tables
  ----------------------------
VIRTIO_IOMMU_F_PT_SHARING
This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.
When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
be using the MAP/UNMAP interface and PT_SHARING at the same time.
Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.
(1) Driver attaches devices to address spaces as usual, but a flag
    VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
    create page tables for use with the MAP/UNMAP API. The driver intends
    to manage the address space itself.
(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
    pg_format array.
	VIRTIO_IOMMU_T_PROBE_TABLE
	struct virtio_iommu_req_probe_table {
		le32	address_space;
		le32	flags;
		le32	len;
	
		le32	nr_contexts;
		struct {
			le32	model;
			u8	format[64];
		} pg_format[len];
	};
Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.
(3) Device responds success with all page table formats implemented by the
    physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
    initialize the array to 0 and deduce from there which entries have
    been filled by the device.
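On the driver side, steps (2) and (3) could look like the sketch below. It
uses a local stand-in for one pg_format entry, keeps fields in CPU byte order
for simplicity (the request itself would be little-endian), and all helper
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for one pg_format entry of the PROBE_TABLE request */
struct pg_format {
	uint32_t	model;		/* 0 is invalid */
	uint8_t		format[64];
};

/* Clear the array before sending PROBE_TABLE */
static void probe_prepare(struct pg_format *fmts, size_t len)
{
	assert(fmts != NULL);
	memset(fmts, 0, len * sizeof(*fmts));
}

/*
 * After the device replied, return the first entry it filled with the
 * given model, or NULL if the model isn't offered. Entries still set to
 * model 0 were not written by the device and never match.
 */
static const struct pg_format *probe_find_model(const struct pg_format *fmts,
						size_t len, uint32_t model)
{
	size_t i;

	if (model == 0)
		return NULL;

	for (i = 0; i < len; i++)
		if (fmts[i].model == model)
			return &fmts[i];

	return NULL;
}
```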
Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implemented by io-pgtable for example.)
(4) If the driver is able to use this format, it sends the ATTACH_TABLE
    request.
	VIRTIO_IOMMU_T_ATTACH_TABLE
	struct virtio_iommu_req_attach_table {
		le32	address_space;
		le32	flags;
		le64	table;
	
		le32	nr_contexts;
		/* Page-table format description */
	
		le32	model;
		u8	config[64];
	};
    'table' is a pointer to the page directory. 'nr_contexts' isn't used
    here.
    For both ATTACH and PROBE, 'flags' are the following (and will be
    explained later):
	VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT	(1 << 0)
	VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE	(1 << 1)
	VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT	(1 << 2)
Now 'model' is a bit tricky. We need to specify all possible page table
formats and their parameters. I'm not well-versed in x86, s390 or other
IOMMUs, so I'll just focus on the ARM world for this example. We basically
have two page table models, with a multitude of configuration bits:
	* ARM LPAE
	* ARM short descriptor
We could define a high-level identifier per page-table model, such as:
	#define PG_TABLE_ARM	0x1
	#define PG_TABLE_X86	0x2
	...
And each model would define its own structure. On ARM 'format' could be a
simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
also contain additional capabilities. Then depending on the variant,
'config' would be:
	struct pg_config_v7s {
		le32	tcr;
		le32	prrr;
		le32	nmrr;
		le32	asid;
	};
	
	struct pg_config_lpae {
		le64	tcr;
		le64	mair;
		le32	asid;
	
		/* And maybe TTB1? */
	};
	struct pg_config_arm {
		le32	variant;
		union ...;
	};
I am really uneasy with describing all those nasty architectural details
in the virtio-iommu specification. We certainly won't start describing the
content bit-by-bit of tcr or mair here, but just declaring these fields
might be sufficient.
(5) Once the table is attached, the driver can simply write the page
    tables and expect the physical IOMMU to observe the mappings without
    any additional request. When changing or removing a mapping, however,
    the driver must send an invalidate request.
	VIRTIO_IOMMU_T_INVALIDATE
	struct virtio_iommu_req_invalidate {
		le32	address_space;
		le32	context;
		le32	flags;
		le64	virt_addr;
		le64	range_size;
	
		u8	opaque[64];
	};
    'flags' may be:
    VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
      from 'context' (context is 0 when !F_INDIRECT).
    And with context tables only (explained below):
    VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
      'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
      are ignored.
    VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
      in the table that changed. Device reads the table again, compares it
      to previous values, and invalidates all mappings for contexts that
      changed. context, virt_addr and range_size are ignored.
IOMMUs may offer hints and quirks in their invalidation packets. The
opaque structure in invalidate would allow transporting those. This
depends on the page table format and as with architectural page-table
definitions, I really don't want to have those details in the spec itself.
  2. Sharing MMU page tables
  --------------------------
The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.
F_FAULT means that the host communicates page requests from device to the
guest, and the guest can handle them by mapping the faulting virtual
address to pages. It is only available with VIRTIO_IOMMU_F_EVENT_QUEUE
(see below.)
F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.
F_INDIRECT means that 'table' pointer is a context table, instead of a
page directory. Each slot in the context table points to a page directory:
                       64              2 1 0
          table ----> +---------------------+
                      |       pgd       |0|1|<--- context 0
                      |       ---       |0|0|<--- context 1
                      |       pgd       |0|1|
                      |       ---       |0|0|
                      |       ---       |0|0|
                      +---------------------+
                                         | \___Entry is valid
                                         |______reserved
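The entry layout above can be expressed with a few helpers. This is only a
sketch: the macro and function names are invented here, and entries are
handled in CPU byte order whereas the table itself would presumably hold
little-endian values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CTX_ENTRY_VALID		(1ULL << 0)	/* entry is valid */
#define CTX_ENTRY_RESERVED	(1ULL << 1)	/* must be zero */
#define CTX_ENTRY_PGD_MASK	(~3ULL)		/* bits 63:2 */

/* Build a valid context table entry pointing to 'pgd' */
static uint64_t ctx_entry_make(uint64_t pgd)
{
	/* The two low bits of the pgd address are used by the entry */
	assert((pgd & ~CTX_ENTRY_PGD_MASK) == 0);

	return pgd | CTX_ENTRY_VALID;
}

static bool ctx_entry_valid(uint64_t entry)
{
	return (entry & CTX_ENTRY_VALID) != 0;
}

/* Extract the page directory pointer from an entry */
static uint64_t ctx_entry_pgd(uint64_t entry)
{
	return entry & CTX_ENTRY_PGD_MASK;
}
```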
Question: do we want per-context page table format, or can it stay global
for the whole indirect table?
Having a context table allows to provide multiple address spaces for a
single device. In the simplest form, without F_INDIRECT we have a single
address space per device, but some devices may implement more, for
instance devices with the PCI PASID extension.
A slot's position in the context table gives an ID, between 0 and
nr_contexts. The guest can use this ID to have the device target a
specific address space with DMA. The mechanism to do that is
device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
define a specific way of using them for DMA, it's the device driver's
concern.
  3. Fault reporting
  ------------------
VIRTIO_IOMMU_F_EVENT_QUEUE
With this feature, an event virtqueue (1) is available. For now it will
only be used for fault handling, but I'm calling it eventq so that other
asynchronous features can piggy-back on it. Device may report faults and
page requests by sending buffers via the used ring.
	#define VIRTIO_IOMMU_T_FAULT	0x05
	struct virtio_iommu_evt_fault {
		struct virtio_iommu_evt_head {
			u8 type;
			u8 reserved[3];
		};
	
		u32 address_space;
		u32 context;
	
		u64 vaddr;
		u32 flags;	/* Access details: R/W/X */
	
		/* In the reply: */
		u32 reply;	/* Fault handled, or failure */
		u64 paddr;
	};
Driver must send the reply via the request queue, with the fault status
in 'reply', and the mapped page in 'paddr' on success.
Existing fault handling interfaces such as PRI have a tag (PRG) allowing
to identify a page request (or group thereof) when sending a reply. I
wonder if this would be useful to us, but it seems like the
(address_space, context, vaddr) tuple is sufficient to identify a page
fault, provided the device doesn't send duplicate faults. Duplicate faults
could be required if they have a side effect, for instance implementing a
poor man's doorbell. If this is desirable, we could add a fault_id field.
  4. Host implementation with VFIO
  --------------------------------
The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementations will most likely let
the guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.
For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using structures very
similar to those described above, whilst reusing the VFIO channel
developed for Intel vIOMMU.
	* VFIO_SVM_INFO: probe page table formats
	* VFIO_SVM_BIND: set pgd and arch-specific configuration
There is a drawback to letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).
  III. Relaxed operations
  ======================
VIRTIO_IOMMU_F_RELAXED
Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex to
implement. Given a MAP([start:end] -> phys, flags) request:
(1) If [start:end] isn't mapped, request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
    unmap [max(start, old_start):min(end, old_end)] and replace it with
    [start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
    request exactly (same flags, same phys address), the old mapping is
    kept.
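The three cases can be sketched as a classification of a new MAP request
against one existing mapping. This assumes [start:end] is an inclusive range,
which the text doesn't state explicitly, and all type and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct mapping {
	uint64_t	start;	/* first IOVA */
	uint64_t	end;	/* last IOVA, assumed inclusive */
	uint64_t	phys;
	uint32_t	flags;
};

enum relaxed_action {
	RELAXED_MAP_NEW,	/* (1) no overlap: map as usual */
	RELAXED_MAP_REPLACE,	/* (2) overlap: unmap intersection, then map */
	RELAXED_MAP_KEEP,	/* (3) identical mapping: keep the old one */
};

/* Decide what a relaxed MAP request 'req' does to existing mapping 'old' */
static enum relaxed_action relaxed_classify(const struct mapping *old,
					    const struct mapping *req)
{
	assert(old->start <= old->end && req->start <= req->end);

	if (req->end < old->start || req->start > old->end)
		return RELAXED_MAP_NEW;

	if (req->start == old->start && req->end == old->end &&
	    req->phys == old->phys && req->flags == old->flags)
		return RELAXED_MAP_KEEP;

	return RELAXED_MAP_REPLACE;
}
```

A real implementation would run this against every mapping that intersects
the request, for instance by walking an interval tree.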
This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduces the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].
This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit into a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.
We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. Driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.
  IV. Misc
  =======
I think we have enough to go on for a while. To improve MAP throughput, I
considered adding a MAP_SG request depending on a feature bit, with
variable size:
	struct virtio_iommu_req_map_sg {
		struct virtio_iommu_req_head;
		u32	address_space;
		u32	nr_elems;
		u64	virt_addr;
		u64	size;
		u64	phys_addr[nr_elems];
	};
Would create the following mappings:
	virt_addr		-> phys_addr[0]
	virt_addr + size	-> phys_addr[1]
	virt_addr + 2 * size	-> phys_addr[2]
	...
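As an illustration of how a device might expand such a request, here is a
small sketch. The sg_mapping type and map_sg_expand() are invented for the
example, and values are in CPU byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapping produced by expanding a MAP_SG request (hypothetical type) */
struct sg_mapping {
	uint64_t	virt_addr;
	uint64_t	phys_addr;
	uint64_t	size;
};

/*
 * Expand a MAP_SG request into individual mappings: element i maps
 * [virt_addr + i * size, virt_addr + (i + 1) * size) to phys_addr[i].
 * Writes at most out_len entries and returns the number written.
 */
static size_t map_sg_expand(uint64_t virt_addr, uint64_t size,
			    const uint64_t *phys_addr, size_t nr_elems,
			    struct sg_mapping *out, size_t out_len)
{
	size_t i;

	assert(phys_addr != NULL && out != NULL);

	for (i = 0; i < nr_elems && i < out_len; i++) {
		out[i].virt_addr = virt_addr + i * size;
		out[i].phys_addr = phys_addr[i];
		out[i].size = size;
	}

	return i;
}
```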
This would avoid the overhead of multiple map commands. We could try to
find a more cunning format to compress virtually-contiguous mappings with
different (phys, size) pairs as well. But Linux drivers rarely prefer
map_sg() functions over regular map(), so I don't know if the whole map_sg
feature is worth the effort. All we would gain is a few bytes anyway.
My current map_sg implementation in the virtio-iommu driver adds a batch
of map requests to the queue and kicks the host once. That might be enough
of an optimization.
Another invasive optimization would be adding grouped requests. By adding
two flags in the header, L and G, we can group sequences of requests
together, and have one status at the end, either 0 if all requests in the
group succeeded, or the status of the first request that failed. This is
all in-order. Requests in a group follow each other, there is no sequence
identifier.
	                       ___ L: request is last in the group
	                      /  _ G: request is part of a group
	                     |  /
	                     v v
	31                   9 8 7      0
	+--------------------------------+ <------- RO descriptor
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |1|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+ <------- WO descriptor
	|        res0           | status |
	+--------------------------------+
This adds some complexity on the device, since it must unroll whatever was
done by successful requests in a group as soon as one fails, and reject
all subsequent ones. A group of requests is an atomic operation. As with
map_sg, this change mostly allows to save space and virtio descriptors.
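A possible encoding of the header word drawn above, with type in bits 7:0, G
in bit 8 and L in bit 9, is sketched below together with the "first failure
wins" status rule. Names are made up for the example, values are in CPU byte
order, and the unrolling of already-applied requests is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REQ_HEAD_TYPE_MASK	0xffu
#define REQ_HEAD_F_GROUP	(1u << 8)	/* G: part of a group */
#define REQ_HEAD_F_LAST		(1u << 9)	/* L: last in the group */

/* Build the 32-bit request header; reserved bits 31:10 stay zero */
static uint32_t req_head_encode(uint8_t type, bool grouped, bool last)
{
	uint32_t head = type;

	/* L only makes sense on a request that belongs to a group */
	assert(grouped || !last);

	if (grouped)
		head |= REQ_HEAD_F_GROUP;
	if (last)
		head |= REQ_HEAD_F_LAST;

	return head;
}

static uint8_t req_head_type(uint32_t head)
{
	return head & REQ_HEAD_TYPE_MASK;
}

/* Group status: 0 if all requests succeeded, else the first failure */
static uint8_t group_status(const uint8_t *status, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (status[i])
			return status[i];

	return 0;
}
```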
[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
[2] vIOMMU: Efficient IOMMU Emulation
    N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster
Jean-Philippe Brucker
2017-Apr-07  19:23 UTC
[RFC PATCH linux] iommu: Add virtio-iommu driver
The virtio IOMMU is a para-virtualized device, allowing IOMMU requests
such as map/unmap to be sent over virtio-mmio transport. This driver
should illustrate the initial proposal for virtio-iommu, that you
hopefully received with it. It handles attach, detach, map and unmap
requests.
The bulk of the code is to create requests and send them through virtio.
Implementing the IOMMU API is fairly straightforward since the
virtio-iommu MAP/UNMAP interface is almost identical. I threw in a custom
map_sg() function which takes up some space, but is optional. The core
function would send a sequence of map requests, waiting for a reply
between each mapping. This optimization avoids yielding to the host after
each map, and instead prepares a batch of requests in the virtio ring and
kicks the host once.
It must be applied on top of the probe deferral work for IOMMU, currently
under discussion. This allows to dissociate early driver detection and
device probing: device-tree or ACPI is parsed early to find which devices
are translated by the IOMMU, but the IOMMU itself cannot be probed until
the core virtio module is loaded.
Enabling DEBUG makes it extremely verbose at the moment, but it should be
calmer in next versions.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 drivers/iommu/Kconfig             |  11 +
 drivers/iommu/Makefile            |   1 +
 drivers/iommu/virtio-iommu.c      | 980 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 142 ++++++
 6 files changed, 1136 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 37e204f3d9be..8cd56ee9a93a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -359,4 +359,15 @@ config MTK_IOMMU_V1
 
 	  if unsure, say N here.
 
+config VIRTIO_IOMMU
+	tristate "Virtio IOMMU driver"
+	depends on VIRTIO_MMIO
+	select IOMMU_API
+	select INTERVAL_TREE
+	select ARM_DMA_USE_IOMMU if ARM
+	help
+	  Para-virtualised IOMMU driver with virtio.
+
+	  Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 195f7b997d8e..1199d8475802 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
 obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index 000000000000..1cf4f57b7817
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,980 @@
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2017 ARM Limited
+ *
+ * Author: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/amba/bus.h>
+#include <linux/delay.h>
+#include <linux/dma-iommu.h>
+#include <linux/freezer.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/of_iommu.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
+#include <linux/wait.h>
+
+#include <uapi/linux/virtio_iommu.h>
+
+struct viommu_dev {
+	struct iommu_device		iommu;
+	struct device			*dev;
+	struct virtio_device		*vdev;
+
+	struct virtqueue		*vq;
+	struct list_head		pending_requests;
+	/* Serialize anything touching the vq and the request list */
+	spinlock_t			vq_lock;
+
+	struct list_head		list;
+
+	/* Device configuration */
+	u64				pgsize_bitmap;
+	u64				aperture_start;
+	u64				aperture_end;
+};
+
+struct viommu_mapping {
+	phys_addr_t			paddr;
+	struct interval_tree_node	iova;
+};
+
+struct viommu_domain {
+	struct iommu_domain		domain;
+	struct viommu_dev		*viommu;
+	struct mutex			mutex;
+	u64				id;
+
+	spinlock_t			mappings_lock;
+	struct rb_root			mappings;
+
+	/* Number of devices attached to this domain */
+	unsigned long			attached;
+};
+
+struct viommu_endpoint {
+	struct viommu_dev		*viommu;
+	struct viommu_domain		*vdomain;
+};
+
+struct viommu_request {
+	struct scatterlist		head;
+	struct scatterlist		tail;
+
+	int				written;
+	struct list_head		list;
+};
+
+/* TODO: use an IDA */
+static atomic64_t viommu_domain_ids_gen;
+
+#define to_viommu_domain(domain) container_of(domain, struct viommu_domain, domain)
+
+/* Virtio transport */
+
+static int viommu_status_to_errno(u8 status)
+{
+	switch (status) {
+	case VIRTIO_IOMMU_S_OK:
+		return 0;
+	case VIRTIO_IOMMU_S_UNSUPP:
+		return -ENOSYS;
+	case VIRTIO_IOMMU_S_INVAL:
+		return -EINVAL;
+	case VIRTIO_IOMMU_S_RANGE:
+		return -ERANGE;
+	case VIRTIO_IOMMU_S_NOENT:
+		return -ENOENT;
+	case VIRTIO_IOMMU_S_FAULT:
+		return -EFAULT;
+	case VIRTIO_IOMMU_S_IOERR:
+	case VIRTIO_IOMMU_S_DEVERR:
+	default:
+		return -EIO;
+	}
+}
+
+static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t *head,
+			       size_t *tail)
+{
+	size_t size;
+	union virtio_iommu_req r;
+
+	*tail = sizeof(struct virtio_iommu_req_tail);
+
+	switch (req->type) {
+	case VIRTIO_IOMMU_T_ATTACH:
+		size = sizeof(r.attach);
+		break;
+	case VIRTIO_IOMMU_T_DETACH:
+		size = sizeof(r.detach);
+		break;
+	case VIRTIO_IOMMU_T_MAP:
+		size = sizeof(r.map);
+		break;
+	case VIRTIO_IOMMU_T_UNMAP:
+		size = sizeof(r.unmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	*head = size - *tail;
+	return 0;
+}
+
+static int viommu_receive_resp(struct viommu_dev *viommu, int nr_expected)
+{
+
+	unsigned int len;
+	int nr_received = 0;
+	struct viommu_request *req, *pending, *next;
+
+	pending = list_first_entry_or_null(&viommu->pending_requests,
+					   struct viommu_request, list);
+	if (WARN_ON(!pending))
+		return 0;
+
+	while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) {
+		if (req != pending) {
+			dev_warn(viommu->dev, "discarding stale request\n");
+			continue;
+		}
+
+		pending->written = len;
+
+		if (++nr_received == nr_expected) {
+			list_del(&pending->list);
+			/*
+			 * In an ideal world, we'd wake up the waiter for this
+			 * group of requests here. But everything is painfully
+			 * synchronous, so waiter is the caller.
+			 */
+			break;
+		}
+
+		next = list_next_entry(pending, list);
+		list_del(&pending->list);
+
+		if (WARN_ON(list_empty(&viommu->pending_requests)))
+			return 0;
+
+		pending = next;
+	}
+
+	return nr_received;
+}
+
+/* Must be called with vq_lock held */
+static int _viommu_send_reqs_sync(struct viommu_dev *viommu,
+				  struct viommu_request *req, int nr,
+				  int *nr_sent)
+{
+	int i, ret;
+	ktime_t timeout;
+	int nr_received = 0;
+	struct scatterlist *sg[2];
+	/*
+	 * FIXME: as it stands, 1s timeout per request. This is a voluntary
+	 * exaggeration because I have no idea how real our ktime is. Are we
+	 * using a RTC? Are we aware of steal time? I don't know much about
+	 * this, need to do some digging.
+	 */
+	unsigned long timeout_ms = 1000;
+
+	*nr_sent = 0;
+
+	for (i = 0; i < nr; i++, req++) {
+		/*
+		 * The backend will allocate one indirect descriptor for each
+		 * request, which allows to double the ring consumption, but
+		 * might be slower.
+		 */
+		req->written = 0;
+
+		sg[0] = &req->head;
+		sg[1] = &req->tail;
+
+		ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req,
+					GFP_ATOMIC);
+		if (ret)
+			break;
+
+		list_add_tail(&req->list, &viommu->pending_requests);
+	}
+
+	if (i && !virtqueue_kick(viommu->vq))
+		return -EPIPE;
+
+	/*
+	 * Absolutely no wiggle room here. We're not allowed to sleep as callers
+	 * might be holding spinlocks, so we have to poll like savages until
+	 * something appears. Hopefully the host already handled the request
+	 * during the above kick and returned it to us.
+	 *
+	 * A nice improvement would be for the caller to tell us if we can sleep
+	 * whilst mapping, but this has to go through the IOMMU/DMA API.
+	 */
+	timeout = ktime_add_ms(ktime_get(), timeout_ms * i);
+	while (nr_received < i && ktime_before(ktime_get(), timeout)) {
+		nr_received += viommu_receive_resp(viommu, i - nr_received);
+		if (nr_received < i) {
+			/*
+			 * FIXME: what's a good way to yield to host? A second
+			 * virtqueue_kick won't have any effect since we haven't
+			 * added any descriptor.
+			 */
+			udelay(10);
+		}
+	}
+	dev_dbg(viommu->dev, "request took %lld us\n",
+		ktime_us_delta(ktime_get(), ktime_sub_ms(timeout, timeout_ms * i)));
+
+	if (nr_received != i)
+		ret = -ETIMEDOUT;
+
+	if (ret == -ENOSPC && nr_received)
+		/*
+		 * We've freed some space since virtio told us that the ring is
+		 * full, tell the caller to come back later (after releasing the
+		 * lock first, to be fair to other threads)
+		 */
+		ret = -EAGAIN;
+
+	*nr_sent = nr_received;
+
+	return ret;
+}
+
+/**
+ * viommu_send_reqs_sync - add a batch of requests, kick the host and wait for
+ *                         them to return
+ *
+ * @req: array of requests
+ * @nr: size of the array
+ * @nr_sent: contains the number of requests actually sent after this function
+ *           returns
+ *
+ * Return 0 on success, or an error if we failed to send some of the requests.
+ */
+static int viommu_send_reqs_sync(struct viommu_dev *viommu,
+				 struct viommu_request *req, int nr,
+				 int *nr_sent)
+{
+	int ret;
+	int sent = 0;
+	unsigned long flags;
+
+	*nr_sent = 0;
+	do {
+		spin_lock_irqsave(&viommu->vq_lock, flags);
+		ret = _viommu_send_reqs_sync(viommu, req, nr, &sent);
+		spin_unlock_irqrestore(&viommu->vq_lock, flags);
+
+		*nr_sent += sent;
+		req += sent;
+		nr -= sent;
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+/**
+ * viommu_send_req_sync - send one request and wait for reply
+ *
+ * @head_ptr: pointer to a virtio_iommu_req_* structure
+ *
+ * Returns 0 if the request was successful, or an error number otherwise. No
+ * distinction is done between transport and request errors.
+ */
+static int viommu_send_req_sync(struct viommu_dev *viommu, void *head_ptr)
+{
+	int ret;
+	int nr_sent;
+	struct viommu_request req;
+	size_t head_size, tail_size;
+	struct virtio_iommu_req_tail *tail;
+	struct virtio_iommu_req_head *head = head_ptr;
+
+	ret = viommu_get_req_size(head, &head_size, &tail_size);
+	if (ret)
+		return ret;
+
+	dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n", head->type,
+		head_size + tail_size);
+
+	tail = head_ptr + head_size;
+
+	sg_init_one(&req.head, head, head_size);
+	sg_init_one(&req.tail, tail, tail_size);
+
+	ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent);
+	if (ret || !req.written || nr_sent != 1) {
+		dev_err(viommu->dev, "failed to send command\n");
+		return -EIO;
+	}
+
+	ret = viommu_status_to_errno(tail->status);
+
+	if (ret)
+		dev_dbg(viommu->dev, " completed with %d\n", ret);
+
+	return ret;
+}
+
+static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned long iova,
+			  phys_addr_t paddr, size_t size)
+{
+	unsigned long flags;
+	struct viommu_mapping *mapping;
+
+	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+	if (!mapping)
+		return -ENOMEM;
+
+	mapping->paddr = paddr;
+	mapping->iova.start = iova;
+	mapping->iova.last = iova + size - 1;
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	interval_tree_insert(&mapping->iova, &vdomain->mappings);
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	return 0;
+}
+
+static size_t viommu_tlb_unmap(struct viommu_domain *vdomain,
+			       unsigned long iova, size_t size)
+{
+	size_t unmapped = 0;
+	unsigned long flags;
+	unsigned long last = iova + size - 1;
+	struct viommu_mapping *mapping = NULL;
+	struct interval_tree_node *node, *next;
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+	while (next) {
+		node = next;
+		mapping = container_of(node, struct viommu_mapping, iova);
+
+		next = interval_tree_iter_next(node, iova, last);
+
+		/*
+		 * Note that for a partial range, this will return the full
+		 * mapping so we avoid sending split requests to the device.
+		 */
+		unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+		interval_tree_remove(node, &vdomain->mappings);
+		kfree(mapping);
+	}
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	return unmapped;
+}
+
+/* IOMMU API */
+
+static bool viommu_capable(enum iommu_cap cap)
+{
+	return false; /* :( */
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+	struct viommu_domain *vdomain;
+
+	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+		return NULL;
+
+	vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL);
+	if (!vdomain)
+		return NULL;
+
+	vdomain->id = atomic64_inc_return_relaxed(&viommu_domain_ids_gen);
+
+	mutex_init(&vdomain->mutex);
+	spin_lock_init(&vdomain->mappings_lock);
+	vdomain->mappings = RB_ROOT;
+
+	pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id);
+
+	if (type == IOMMU_DOMAIN_DMA &&
+	    iommu_get_dma_cookie(&vdomain->domain)) {
+		kfree(vdomain);
+		return NULL;
+	}
+
+	return &vdomain->domain;
+}
+
+static void viommu_domain_free(struct iommu_domain *domain)
+{
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	pr_debug("free domain %llu\n", vdomain->id);
+
+	iommu_put_dma_cookie(domain);
+
+	/* Free all remaining mappings (size 2^64) */
+	viommu_tlb_unmap(vdomain, 0, 0);
+
+	kfree(vdomain);
+}
+
+static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
+{
+	int i;
+	int ret = 0;
+	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+	struct viommu_endpoint *vdev = fwspec->iommu_priv;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_attach req = {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
+	mutex_lock(&vdomain->mutex);
+	if (!vdomain->viommu) {
+		struct viommu_dev *viommu = vdev->viommu;
+
+		vdomain->viommu = viommu;
+
+		domain->pgsize_bitmap		= viommu->pgsize_bitmap;
+		domain->geometry.aperture_start	= viommu->aperture_start;
+		domain->geometry.aperture_end	= viommu->aperture_end;
+		domain->geometry.force_aperture	= true;
+
+	} else if (vdomain->viommu != vdev->viommu) {
+		dev_err(dev, "cannot attach to foreign VIOMMU\n");
+		ret = -EXDEV;
+	}
+	mutex_unlock(&vdomain->mutex);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * When attaching the device to a new domain, it will be detached from
+	 * the old one and, if as a result the old domain isn't attached to
+	 * any device, all mappings are removed from the old domain and it is
+	 * freed. (Note that we can't use get_domain_for_dev here, it returns
+	 * the default domain during initial attach.)
+	 *
+	 * Take note of the device disappearing, so we can ignore unmap request
+	 * on stale domains (that is, between this detach and the upcoming
+	 * free.)
+	 *
+	 * vdev->vdomain is protected by group->mutex
+	 */
+	if (vdev->vdomain) {
+		dev_dbg(dev, "detach from domain %llu\n", vdev->vdomain->id);
+		vdev->vdomain->attached--;
+	}
+
+	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);
+
+	for (i = 0; i < fwspec->num_ids; i++) {
+		req.device = cpu_to_le32(fwspec->ids[i]);
+
+		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		if (ret)
+			break;
+	}
+
+	vdomain->attached++;
+	vdev->vdomain = vdomain;
+
+	return ret;
+}
+
+static int viommu_map(struct iommu_domain *domain, unsigned long iova,
+		      phys_addr_t paddr, size_t size, int prot)
+{
+	int ret;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_map req = {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
+		 paddr, size);
+
+	if (!vdomain->attached)
+		return -ENODEV;
+
+	if (prot & IOMMU_READ)
+		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_tlb_map(vdomain, iova, paddr, size);
+	if (ret)
+		return ret;
+
+	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	if (ret)
+		viommu_tlb_unmap(vdomain, iova, size);
+
+	return ret;
+}
+
+static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
+			   size_t size)
+{
+	int ret;
+	size_t unmapped;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_unmap req = {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+	};
+
+	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);
+
+	/* Callers may unmap after detach, but device already took care of it. */
+	if (!vdomain->attached)
+		return size;
+
+	unmapped = viommu_tlb_unmap(vdomain, iova, size);
+	if (unmapped < size)
+		return 0;
+
+	req.size = cpu_to_le64(unmapped);
+
+	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	if (ret)
+		return 0;
+
+	return unmapped;
+}
+
+static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+			    struct scatterlist *sg, unsigned int nents, int prot)
+{
+	int i, ret = 0;
+	int nr_sent;
+	size_t mapped;
+	size_t min_pagesz;
+	size_t total_size;
+	struct scatterlist *s;
+	unsigned int flags = 0;
+	unsigned long cur_iova;
+	unsigned long mapped_iova;
+	size_t head_size, tail_size;
+	struct viommu_request reqs[nents];
+	struct virtio_iommu_req_map map_reqs[nents];
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	if (!vdomain->attached)
+		return 0;
+
+	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);
+
+	if (prot & IOMMU_READ)
+		flags |= VIRTIO_IOMMU_MAP_F_READ;
+
+	if (prot & IOMMU_WRITE)
+		flags |= VIRTIO_IOMMU_MAP_F_WRITE;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+	tail_size = sizeof(struct virtio_iommu_req_tail);
+	head_size = sizeof(*map_reqs) - tail_size;
+
+	cur_iova = iova;
+
+	for_each_sg(sg, s, nents, i) {
+		size_t size = s->length;
+		phys_addr_t paddr = sg_phys(s);
+		void *tail = (void *)&map_reqs[i] + head_size;
+
+		if (!IS_ALIGNED(paddr | size, min_pagesz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		/* TODO: merge physically-contiguous mappings if any */
+		map_reqs[i] = (struct virtio_iommu_req_map) {
+			.head.type	= VIRTIO_IOMMU_T_MAP,
+			.address_space	= cpu_to_le32(vdomain->id),
+			.flags		= cpu_to_le32(flags),
+			.virt_addr	= cpu_to_le64(cur_iova),
+			.phys_addr	= cpu_to_le64(paddr),
+			.size		= cpu_to_le64(size),
+		};
+
+		ret = viommu_tlb_map(vdomain, cur_iova, paddr, size);
+		if (ret)
+			break;
+
+		sg_init_one(&reqs[i].head, &map_reqs[i], head_size);
+		sg_init_one(&reqs[i].tail, tail, tail_size);
+
+		cur_iova += size;
+	}
+
+	total_size = cur_iova - iova;
+
+	if (ret) {
+		viommu_tlb_unmap(vdomain, iova, total_size);
+		return 0;
+	}
+
+	ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, &nr_sent);
+
+	if (nr_sent != nents)
+		goto err_rollback;
+
+	for (i = 0; i < nents; i++) {
+		if (!reqs[i].written || map_reqs[i].tail.status)
+			goto err_rollback;
+	}
+
+	return total_size;
+
+err_rollback:
+	/*
+	 * Any request in the range might have failed. Unmap what was
+	 * successful.
+	 */
+	cur_iova = iova;
+	mapped_iova = iova;
+	mapped = 0;
+	for_each_sg(sg, s, nents, i) {
+		size_t size = s->length;
+
+		cur_iova += size;
+
+		if (!reqs[i].written || map_reqs[i].tail.status) {
+			if (mapped)
+				viommu_unmap(domain, mapped_iova, mapped);
+
+			mapped_iova = cur_iova;
+			mapped = 0;
+		} else {
+			mapped += size;
+		}
+	}
+
+	viommu_tlb_unmap(vdomain, iova, total_size);
+
+	return 0;
+}
+
+static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain,
+				       dma_addr_t iova)
+{
+	u64 paddr = 0;
+	unsigned long flags;
+	struct viommu_mapping *mapping;
+	struct interval_tree_node *node;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	node = interval_tree_iter_first(&vdomain->mappings, iova, iova);
+	if (node) {
+		mapping = container_of(node, struct viommu_mapping, iova);
+		paddr = mapping->paddr + (iova - mapping->iova.start);
+	}
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova,
+		 paddr);
+
+	return paddr;
+}
+
+static struct iommu_ops viommu_ops;
+static struct virtio_driver virtio_iommu_drv;
+
+static int viommu_match_node(struct device *dev, void *data)
+{
+	return dev->parent->fwnode == data;
+}
+
+static struct viommu_dev *viommu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+	struct device *dev = driver_find_device(&virtio_iommu_drv.driver, NULL,
+						fwnode, viommu_match_node);
+	put_device(dev);
+
+	return dev ? dev_to_virtio(dev)->priv : NULL;
+}
+
+static int viommu_add_device(struct device *dev)
+{
+	struct iommu_group *group;
+	struct viommu_endpoint *vdev;
+	struct viommu_dev *viommu = NULL;
+	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+	if (!fwspec || fwspec->ops != &viommu_ops)
+		return -ENODEV;
+
+	viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode);
+	if (!viommu)
+		return -ENODEV;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->viommu = viommu;
+	fwspec->iommu_priv = vdev;
+
+	/*
+	 * Last step creates a default domain and attaches to it. Everything
+	 * must be ready.
+	 */
+	group = iommu_group_get_for_dev(dev);
+
+	return PTR_ERR_OR_ZERO(group);
+}
+
+static void viommu_remove_device(struct device *dev)
+{
+	kfree(dev->iommu_fwspec->iommu_priv);
+}
+
+static struct iommu_group *
+viommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+	else
+		return generic_device_group(dev);
+}
+
+static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
+{
+	u32 *id = args->args;
+
+	dev_dbg(dev, "of_xlate 0x%x\n", *id);
+	return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+/*
+ * (Maybe) temporary hack for device pass-through into guest userspace. On ARM
+ * with an ITS, VFIO will look for a region where to map the doorbell, even
+ * though the virtual doorbell is never written to by the device, and instead
+ * the host injects interrupts directly. TODO: sort this out in VFIO.
+ */
+#define MSI_IOVA_BASE			0x8000000
+#define MSI_IOVA_LENGTH			0x100000
+
+static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
+{
+	struct iommu_resv_region *region;
+	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
+					 IOMMU_RESV_MSI);
+	if (!region)
+		return;
+
+	list_add_tail(&region->list, head);
+}
+
+static void viommu_put_resv_regions(struct device *dev, struct list_head *head)
+{
+	struct iommu_resv_region *entry, *next;
+
+	list_for_each_entry_safe(entry, next, head, list)
+		kfree(entry);
+}
+
+static struct iommu_ops viommu_ops = {
+	.capable		= viommu_capable,
+	.domain_alloc		= viommu_domain_alloc,
+	.domain_free		= viommu_domain_free,
+	.attach_dev		= viommu_attach_dev,
+	.map			= viommu_map,
+	.unmap			= viommu_unmap,
+	.map_sg			= viommu_map_sg,
+	.iova_to_phys		= viommu_iova_to_phys,
+	.add_device		= viommu_add_device,
+	.remove_device		= viommu_remove_device,
+	.device_group		= viommu_device_group,
+	.of_xlate		= viommu_of_xlate,
+	.get_resv_regions	= viommu_get_resv_regions,
+	.put_resv_regions	= viommu_put_resv_regions,
+};
+
+static int viommu_init_vq(struct viommu_dev *viommu)
+{
+	struct virtio_device *vdev = dev_to_virtio(viommu->dev);
+	vq_callback_t *callback = NULL;
+	const char *name = "request";
+	int ret;
+
+	ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback,
+				     &name, NULL);
+	if (ret)
+		dev_err(viommu->dev, "cannot find VQ\n");
+
+	return ret;
+}
+
+static int viommu_probe(struct virtio_device *vdev)
+{
+	struct device *parent_dev = vdev->dev.parent;
+	struct viommu_dev *viommu = NULL;
+	struct device *dev = &vdev->dev;
+	int ret;
+
+	viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+	if (!viommu)
+		return -ENOMEM;
+
+	spin_lock_init(&viommu->vq_lock);
+	INIT_LIST_HEAD(&viommu->pending_requests);
+	viommu->dev = dev;
+	viommu->vdev = vdev;
+
+	ret = viommu_init_vq(viommu);
+	if (ret)
+		goto err_free_viommu;
+
+	virtio_cread(vdev, struct virtio_iommu_config, page_sizes,
+		     &viommu->pgsize_bitmap);
+
+	viommu->aperture_end = -1UL;
+
+	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+			     struct virtio_iommu_config, input_range.start,
+			     &viommu->aperture_start);
+
+	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+			     struct virtio_iommu_config, input_range.end,
+			     &viommu->aperture_end);
+
+	if (!viommu->pgsize_bitmap) {
+		ret = -EINVAL;
+		goto err_free_viommu;
+	}
+
+	viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap;
+
+	/*
+	 * Not strictly necessary, virtio would enable it later. This allows
+	 * the request queue to be used early.
+	 */
+	virtio_device_ready(vdev);
+
+	ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s",
+				     virtio_bus_name(vdev));
+	if (ret)
+		goto err_free_viommu;
+
+	iommu_device_set_ops(&viommu->iommu, &viommu_ops);
+	iommu_device_set_fwnode(&viommu->iommu, parent_dev->fwnode);
+
+	iommu_device_register(&viommu->iommu);
+
+#ifdef CONFIG_PCI
+	if (pci_bus_type.iommu_ops != &viommu_ops) {
+		pci_request_acs();
+		ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+#endif
+#ifdef CONFIG_ARM_AMBA
+	if (amba_bustype.iommu_ops != &viommu_ops) {
+		ret = bus_set_iommu(&amba_bustype, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+#endif
+	if (platform_bus_type.iommu_ops != &viommu_ops) {
+		ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+
+	vdev->priv = viommu;
+
+	dev_info(viommu->dev, "probe successful\n");
+
+	return 0;
+
+err_unregister:
+	iommu_device_unregister(&viommu->iommu);
+
+err_free_viommu:
+	kfree(viommu);
+
+	return ret;
+}
+
+static void viommu_remove(struct virtio_device *vdev)
+{
+	struct viommu_dev *viommu = vdev->priv;
+
+	iommu_device_unregister(&viommu->iommu);
+	kfree(viommu);
+
+	dev_info(&vdev->dev, "device removed\n");
+}
+
+static void viommu_config_changed(struct virtio_device *vdev)
+{
+	dev_warn(&vdev->dev, "config changed\n");
+}
+
+static unsigned int features[] = {
+	VIRTIO_IOMMU_F_INPUT_RANGE,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct virtio_driver virtio_iommu_drv = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.feature_table		= features,
+	.feature_table_size	= ARRAY_SIZE(features),
+	.probe			= viommu_probe,
+	.remove			= viommu_remove,
+	.config_changed		= viommu_config_changed,
+};
+
+module_virtio_driver(virtio_iommu_drv);
+
+IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL);
+
+MODULE_DESCRIPTION("virtio-iommu driver");
+MODULE_AUTHOR("Jean-Philippe Brucker <jean-philippe.brucker at arm.com>");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1f25c86374ad..c0cb0f173258 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -467,6 +467,7 @@ header-y += virtio_console.h
 header-y += virtio_gpu.h
 header-y += virtio_ids.h
 header-y += virtio_input.h
+header-y += virtio_iommu.h
 header-y += virtio_mmio.h
 header-y += virtio_net.h
 header-y += virtio_pci.h
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..934ed3d3cd3f 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU	    61216 /* virtio IOMMU (temporary) */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
new file mode 100644
index 000000000000..ec74c9a727d4
--- /dev/null
+++ b/include/uapi/linux/virtio_iommu.h
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
+#define _UAPI_LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE		0
+#define VIRTIO_IOMMU_F_IOASID_BITS		1
+#define VIRTIO_IOMMU_F_MAP_UNMAP		2
+#define VIRTIO_IOMMU_F_BYPASS			3
+
+__packed
+struct virtio_iommu_config {
+	/* Supported page sizes */
+	__u64					page_sizes;
+	struct virtio_iommu_range {
+		__u64				start;
+		__u64				end;
+	} input_range;
+	__u8 					ioasid_bits;
+};
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH			0x01
+#define VIRTIO_IOMMU_T_DETACH			0x02
+#define VIRTIO_IOMMU_T_MAP			0x03
+#define VIRTIO_IOMMU_T_UNMAP			0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK			0x00
+#define VIRTIO_IOMMU_S_IOERR			0x01
+#define VIRTIO_IOMMU_S_UNSUPP			0x02
+#define VIRTIO_IOMMU_S_DEVERR			0x03
+#define VIRTIO_IOMMU_S_INVAL			0x04
+#define VIRTIO_IOMMU_S_RANGE			0x05
+#define VIRTIO_IOMMU_S_NOENT			0x06
+#define VIRTIO_IOMMU_S_FAULT			0x07
+
+__packed
+struct virtio_iommu_req_head {
+	__u8					type;
+	__u8					reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_tail {
+	__u8					status;
+	__u8					reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_attach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+__packed
+struct virtio_iommu_req_detach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+#define VIRTIO_IOMMU_MAP_F_READ			(1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE		(1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC			(1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK			(VIRTIO_IOMMU_MAP_F_READ |	\
+						 VIRTIO_IOMMU_MAP_F_WRITE |	\
+						 VIRTIO_IOMMU_MAP_F_EXEC)
+
+__packed
+struct virtio_iommu_req_map {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					phys_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+__packed
+struct virtio_iommu_req_unmap {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+union virtio_iommu_req {
+	struct virtio_iommu_req_head		head;
+
+	struct virtio_iommu_req_attach		attach;
+	struct virtio_iommu_req_detach		detach;
+	struct virtio_iommu_req_map		map;
+	struct virtio_iommu_req_unmap		unmap;
+};
+
+#endif
-- 
2.12.1
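A note on the unmap semantics in the driver above: viommu_tlb_unmap()
removes every mapping that overlaps the requested range and returns the
full size of each one, even when the range covers only part of a mapping.
viommu_domain_free() relies on the same arithmetic, where iova 0 with
size 0 makes "last" wrap around to the top of the address space. The
following is a minimal userspace sketch of that behaviour, not the driver
code: it swaps the kernel interval tree for a flat array, and all
sketch_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_MAPPINGS 16

struct sketch_mapping {
	uint64_t start;	/* first IOVA of the mapping */
	uint64_t last;	/* last IOVA of the mapping, inclusive */
	int used;
};

struct sketch_domain {
	struct sketch_mapping maps[SKETCH_MAX_MAPPINGS];
};

/* Record [iova, iova + size - 1]. Returns 0 on success, -1 if the table is full. */
static int sketch_map(struct sketch_domain *d, uint64_t iova, uint64_t size)
{
	size_t i;

	assert(d != NULL);
	for (i = 0; i < SKETCH_MAX_MAPPINGS; i++) {
		if (!d->maps[i].used) {
			d->maps[i].start = iova;
			d->maps[i].last = iova + size - 1;
			d->maps[i].used = 1;
			return 0;
		}
	}
	return -1;
}

/*
 * Remove every mapping overlapping [iova, iova + size - 1] and return the
 * total size of the removed mappings. A partially covered mapping is
 * removed, and counted, in full. With iova == 0 and size == 0, "last"
 * wraps to UINT64_MAX, so the whole address space is covered.
 */
static uint64_t sketch_unmap(struct sketch_domain *d, uint64_t iova,
			     uint64_t size)
{
	uint64_t last = iova + size - 1;
	uint64_t unmapped = 0;
	size_t i;

	assert(d != NULL);
	for (i = 0; i < SKETCH_MAX_MAPPINGS; i++) {
		struct sketch_mapping *m = &d->maps[i];

		if (!m->used || m->start > last || m->last < iova)
			continue;

		unmapped += m->last - m->start + 1;
		m->used = 0;
	}
	return unmapped;
}
```

With mappings at 0x1000 (size 0x1000) and 0x2000 (size 0x2000), unmapping
0x100 bytes at 0x1800 drops the whole first mapping and reports 0x1000,
which is why a caller has to cope with a return value larger than the
size it asked for.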
Implement a virtio-iommu device and translate DMA traffic from vfio and
virtio devices. Virtio needed some rework to support scatter-gather
accesses to vring and buffers at page granularity. Patch 3 implements the
actual virtio-iommu device.
Adding --viommu on the command-line now inserts a virtual IOMMU in front
of all virtio and vfio devices:
  $ lkvm run -k Image --console virtio -p console=hvc0 \
         --viommu --vfio 0 --vfio 4 --irqchip gicv3-its
  ...
  [    2.998949] virtio_iommu virtio0: probe successful
  [    3.007739] virtio_iommu virtio1: probe successful
  ...
  [    3.165023] iommu: Adding device 0000:00:00.0 to group 0
  [    3.536480] iommu: Adding device 10200.virtio to group 1
  [    3.553643] iommu: Adding device 10600.virtio to group 2
  [    3.570687] iommu: Adding device 10800.virtio to group 3
  [    3.627425] iommu: Adding device 10a00.virtio to group 4
  [    7.823689] iommu: Adding device 0000:00:01.0 to group 5
  ...
Patches 13 and 14 add debug facilities. Some statistics are gathered for
each address space and can be queried via the debug builtin:
  $ lkvm debug -n guest-1210 --iommu stats
  iommu 0 "viommu-vfio"
    kicks                 1255
    requests              1256
    ioas 1
      maps                7
      unmaps              4
      resident            2101248
    ioas 6
      maps                623
      unmaps              620
      resident            16384
  iommu 1 "viommu-virtio"
    kicks                 11426
    requests              11431
    ioas 2
      maps                2836
      unmaps              2835
      resident            8192
      accesses            2836
  ...
This is based on the VFIO patchset[1], itself based on Andre's ITS work.
The VFIO bits have only been tested on a software model and are unlikely
to work on actual hardware, but I also tested virtio on an ARM Juno.
[1] http://www.spinics.net/lists/kvm/msg147624.html
Jean-Philippe Brucker (15):
  virtio: synchronize virtio-iommu headers with Linux
  FDT: (re)introduce a dynamic phandle allocator
  virtio: add virtio-iommu
  Add a simple IOMMU
  iommu: describe IOMMU topology in device-trees
  irq: register MSI doorbell addresses
  virtio: factor virtqueue initialization
  virtio: add vIOMMU instance for virtio devices
  virtio: access vring and buffers through IOMMU mappings
  virtio-pci: translate MSIs with the virtual IOMMU
  virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
  vfio: add support for virtual IOMMU
  virtio-iommu: debug via IPC
  virtio-iommu: implement basic debug commands
  virtio: use virtio-iommu when available
 Makefile                          |   3 +
 arm/gic.c                         |   4 +
 arm/include/arm-common/fdt-arch.h |   2 +-
 arm/pci.c                         |  49 ++-
 builtin-debug.c                   |   8 +-
 builtin-run.c                     |   2 +
 fdt.c                             |  35 ++
 include/kvm/builtin-debug.h       |   6 +
 include/kvm/devices.h             |   4 +
 include/kvm/fdt.h                 |  20 +
 include/kvm/iommu.h               | 105 +++++
 include/kvm/irq.h                 |   3 +
 include/kvm/kvm-config.h          |   1 +
 include/kvm/vfio.h                |   2 +
 include/kvm/virtio-iommu.h        |  15 +
 include/kvm/virtio-mmio.h         |   1 +
 include/kvm/virtio-pci.h          |   2 +
 include/kvm/virtio.h              | 137 +++++-
 include/linux/virtio_config.h     |  74 ++++
 include/linux/virtio_ids.h        |   4 +
 include/linux/virtio_iommu.h      | 142 ++++++
 iommu.c                           | 240 ++++++++++
 irq.c                             |  35 ++
 kvm-ipc.c                         |  43 +-
 mips/include/kvm/fdt-arch.h       |   2 +-
 powerpc/include/kvm/fdt-arch.h    |   2 +-
 vfio.c                            | 281 +++++++++++-
 virtio/9p.c                       |   7 +-
 virtio/balloon.c                  |   7 +-
 virtio/blk.c                      |  10 +-
 virtio/console.c                  |   7 +-
 virtio/core.c                     | 240 ++++++++--
 virtio/iommu.c                    | 902 ++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c                     |  44 +-
 virtio/net.c                      |   8 +-
 virtio/pci.c                      |  61 ++-
 virtio/rng.c                      |   6 +-
 virtio/scsi.c                     |   6 +-
 x86/include/kvm/fdt-arch.h        |   2 +-
 39 files changed, 2389 insertions(+), 133 deletions(-)
 create mode 100644 fdt.c
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/virtio-iommu.h
 create mode 100644 include/linux/virtio_config.h
 create mode 100644 include/linux/virtio_iommu.h
 create mode 100644 iommu.c
 create mode 100644 virtio/iommu.c
-- 
2.12.1
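To illustrate why page-granular scatter-gather access matters once an
IOMMU sits in front of a device: a buffer that is contiguous in I/O
virtual address space may land on unrelated pages after translation, so
each page has to be translated on its own. The sketch below is only an
illustration under stated assumptions, not the kvmtool implementation.
The 4KiB granule, the fixed-offset sketch_translate_page() standing in
for a real mapping lookup, and every sketch_* name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed translation granule */

struct sketch_seg {
	uint64_t addr;	/* translated address of the segment */
	uint64_t len;	/* segment length in bytes */
};

/* Hypothetical translation: a fixed offset stands in for a mapping lookup. */
static uint64_t sketch_translate_page(uint64_t iova_page)
{
	return iova_page + 0x100000ULL;
}

/*
 * Split [iova, iova + len) at page boundaries and translate each page
 * separately. Returns the number of segments written to out[], or -1 if
 * more than max segments would be needed.
 */
static int sketch_split(uint64_t iova, uint64_t len, struct sketch_seg *out,
			int max)
{
	int n = 0;

	assert(out != NULL);
	while (len) {
		uint64_t off = iova & (SKETCH_PAGE_SIZE - 1);
		uint64_t chunk = SKETCH_PAGE_SIZE - off;

		if (chunk > len)
			chunk = len;
		if (n == max)
			return -1;

		out[n].addr = sketch_translate_page(iova - off) + off;
		out[n].len = chunk;
		n++;

		iova += chunk;
		len -= chunk;
	}
	return n;
}
```

A 0x30-byte buffer at IOVA 0x1ff0 crosses a page boundary, so it comes
out as two segments of 0x10 and 0x20 bytes.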
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 01/15] virtio: synchronize virtio-iommu headers with Linux
Pull virtio-iommu header (initial proposal) from Linux. Also add
virtio_config.h because it defines VIRTIO_F_IOMMU_PLATFORM, which I'm
going to need soon, and it's not provided by my toolchain.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
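For reference, VIRTIO_F_IOMMU_PLATFORM is feature bit 33, so it does not
fit in a 32-bit feature word and has to be tested against the full 64-bit
feature set. A minimal sketch of such a test follows. The two bit numbers
are the ones defined in virtio_config.h; the sketch_* helpers are
hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1		32
#define VIRTIO_F_IOMMU_PLATFORM		33

static_assert(VIRTIO_F_IOMMU_PLATFORM < 64, "feature bit must fit in 64 bits");

/* Test a single bit in a 64-bit virtio feature set. */
static bool sketch_has_feature(uint64_t features, unsigned int bit)
{
	return bit < 64 && (features & (1ULL << bit)) != 0;
}

/*
 * Reverse polarity: when the bit is clear, the device bypasses the IOMMU.
 * DMA goes through the platform IOMMU only when the bit is set.
 */
static bool sketch_dma_uses_iommu(uint64_t negotiated)
{
	return sketch_has_feature(negotiated, VIRTIO_F_IOMMU_PLATFORM);
}
```

Note that a 32-bit shift such as (1 << 33) would be undefined here, hence
the 1ULL.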
 include/linux/virtio_config.h |  74 ++++++++++++++++++++++
 include/linux/virtio_ids.h    |   4 ++
 include/linux/virtio_iommu.h  | 142 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+)
 create mode 100644 include/linux/virtio_config.h
 create mode 100644 include/linux/virtio_iommu.h
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
new file mode 100644
index 00000000..648b688f
--- /dev/null
+++ b/include/linux/virtio_config.h
@@ -0,0 +1,74 @@
+#ifndef _LINUX_VIRTIO_CONFIG_H
+#define _LINUX_VIRTIO_CONFIG_H
+/* This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE. */
+
+/* Virtio devices use a standardized configuration space to define their
+ * features and pass configuration information, but each implementation can
+ * store and access that space differently. */
+#include <linux/types.h>
+
+/* Status byte for guest to report progress, and synchronize features. */
+/* We have seen device and processed generic fields (VIRTIO_CONFIG_F_VIRTIO) */
+#define VIRTIO_CONFIG_S_ACKNOWLEDGE	1
+/* We have found a driver for the device. */
+#define VIRTIO_CONFIG_S_DRIVER		2
+/* Driver has used its parts of the config, and is happy */
+#define VIRTIO_CONFIG_S_DRIVER_OK	4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK	8
+/* Device entered invalid state, driver must reset it */
+#define VIRTIO_CONFIG_S_NEEDS_RESET	0x40
+/* We've given up on this device. */
+#define VIRTIO_CONFIG_S_FAILED		0x80
+
+/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
+ * transport being used (eg. virtio_ring), the rest are per-device feature
+ * bits. */
+#define VIRTIO_TRANSPORT_F_START	28
+#define VIRTIO_TRANSPORT_F_END		34
+
+#ifndef VIRTIO_CONFIG_NO_LEGACY
+/* Do we get callbacks when the ring is completely used, even if we've
+ * suppressed them? */
+#define VIRTIO_F_NOTIFY_ON_EMPTY	24
+
+/* Can the device handle any descriptor layout? */
+#define VIRTIO_F_ANY_LAYOUT		27
+#endif /* VIRTIO_CONFIG_NO_LEGACY */
+
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1		32
+
+/*
+ * If clear - device has the IOMMU bypass quirk feature.
+ * If set - use platform tools to detect the IOMMU.
+ *
+ * Note the reverse polarity (compared to most other features),
+ * this is for compatibility with legacy systems.
+ */
+#define VIRTIO_F_IOMMU_PLATFORM		33
+#endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
index 5f60aa4b..934ed3d3 100644
--- a/include/linux/virtio_ids.h
+++ b/include/linux/virtio_ids.h
@@ -39,6 +39,10 @@
 #define VIRTIO_ID_9P		9 /* 9p virtio console */
 #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
 #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
+#define VIRTIO_ID_GPU          16 /* virtio GPU */
 #define VIRTIO_ID_INPUT        18 /* virtio input */
+#define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
+#define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU	    61216 /* virtio IOMMU (temporary) */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/linux/virtio_iommu.h b/include/linux/virtio_iommu.h
new file mode 100644
index 00000000..beb21d44
--- /dev/null
+++ b/include/linux/virtio_iommu.h
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _LINUX_VIRTIO_IOMMU_H
+#define _LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE		0
+#define VIRTIO_IOMMU_F_IOASID_BITS		1
+#define VIRTIO_IOMMU_F_MAP_UNMAP		2
+#define VIRTIO_IOMMU_F_BYPASS			3
+
+__attribute__((packed))
+struct virtio_iommu_config {
+	/* Supported page sizes */
+	__u64					page_sizes;
+	struct virtio_iommu_range {
+		__u64				start;
+		__u64				end;
+	} input_range;
+	__u8 					ioasid_bits;
+};
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH			0x01
+#define VIRTIO_IOMMU_T_DETACH			0x02
+#define VIRTIO_IOMMU_T_MAP			0x03
+#define VIRTIO_IOMMU_T_UNMAP			0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK			0x00
+#define VIRTIO_IOMMU_S_IOERR			0x01
+#define VIRTIO_IOMMU_S_UNSUPP			0x02
+#define VIRTIO_IOMMU_S_DEVERR			0x03
+#define VIRTIO_IOMMU_S_INVAL			0x04
+#define VIRTIO_IOMMU_S_RANGE			0x05
+#define VIRTIO_IOMMU_S_NOENT			0x06
+#define VIRTIO_IOMMU_S_FAULT			0x07
+
+__attribute__((packed))
+struct virtio_iommu_req_head {
+	__u8					type;
+	__u8					reserved[3];
+};
+
+__attribute__((packed))
+struct virtio_iommu_req_tail {
+	__u8					status;
+	__u8					reserved[3];
+};
+
+__attribute__((packed))
+struct virtio_iommu_req_attach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+__attribute__((packed))
+struct virtio_iommu_req_detach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+#define VIRTIO_IOMMU_MAP_F_READ			(1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE		(1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC			(1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK			(VIRTIO_IOMMU_MAP_F_READ |	\
+						 VIRTIO_IOMMU_MAP_F_WRITE |	\
+						 VIRTIO_IOMMU_MAP_F_EXEC)
+
+__attribute__((packed))
+struct virtio_iommu_req_map {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					phys_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+__attribute__((packed))
+struct virtio_iommu_req_unmap {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+};
+
+union virtio_iommu_req {
+	struct virtio_iommu_req_head		head;
+
+	struct virtio_iommu_req_attach		attach;
+	struct virtio_iommu_req_detach		detach;
+	struct virtio_iommu_req_map		map;
+	struct virtio_iommu_req_unmap		unmap;
+};
+
+#endif
-- 
2.12.1
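As an illustration of the request layout defined in the header above, here is a minimal sketch of how a driver might fill a MAP request before placing it on the virtqueue. The struct mirrors virtio_iommu_req_map but stores every field as raw bytes, so the little-endian encoding does not depend on the host. The names map_req, put_le and build_map_req are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_IOMMU_T_MAP		0x03
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)

/*
 * Same field order as struct virtio_iommu_req_map. Using byte arrays means
 * the compiler inserts no padding, so no packed attribute is needed here.
 */
struct map_req {
	uint8_t type;			/* head */
	uint8_t reserved[3];
	uint8_t address_space[4];	/* __le32 */
	uint8_t flags[4];		/* __le32 */
	uint8_t virt_addr[8];		/* __le64 */
	uint8_t phys_addr[8];		/* __le64 */
	uint8_t size[8];		/* __le64 */
	uint8_t status;			/* tail, written by the device */
	uint8_t tail_reserved[3];
};

/* Hypothetical helper: store the low nbytes of v in little-endian order. */
static void put_le(uint8_t *dst, uint64_t v, size_t nbytes)
{
	assert(nbytes <= 8);
	for (size_t i = 0; i < nbytes; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/* Hypothetical helper: fill a MAP request for one IOVA range. */
static void build_map_req(struct map_req *req, uint32_t ioasid,
			  uint64_t iova, uint64_t gpa, uint64_t size,
			  uint32_t flags)
{
	memset(req, 0, sizeof(*req));
	req->type = VIRTIO_IOMMU_T_MAP;
	put_le(req->address_space, ioasid, 4);
	put_le(req->flags, flags, 4);
	put_le(req->virt_addr, iova, 8);
	put_le(req->phys_addr, gpa, 8);
	put_le(req->size, size, 8);
}
```

With this layout, everything before the status field would go in the device-readable descriptor and the tail in the device-writable one, which matches the two-descriptor framing that the kvmtool device in this series assumes.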
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 02/15] FDT: (re)introduce a dynamic phandle allocator
The phandle allocator was removed because static values were sufficient
for creating a common irqchip. Now that multiple virtual IOMMUs are added
to the device tree, phandles need to be allocated dynamically. Add a
simple allocator that returns values above the static ones.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 Makefile                          |  1 +
 arm/include/arm-common/fdt-arch.h |  2 +-
 fdt.c                             | 15 +++++++++++++++
 include/kvm/fdt.h                 | 13 +++++++++++++
 mips/include/kvm/fdt-arch.h       |  2 +-
 powerpc/include/kvm/fdt-arch.h    |  2 +-
 x86/include/kvm/fdt-arch.h        |  2 +-
 7 files changed, 33 insertions(+), 4 deletions(-)
 create mode 100644 fdt.c
diff --git a/Makefile b/Makefile
index 6d5f5d9d..3e21c597 100644
--- a/Makefile
+++ b/Makefile
@@ -303,6 +303,7 @@ ifeq (y,$(ARCH_WANT_LIBFDT))
 		CFLAGS_STATOPT	+= -DCONFIG_HAS_LIBFDT
 		LIBS_DYNOPT	+= -lfdt
 		LIBS_STATOPT	+= -lfdt
+		OBJS		+= fdt.o
 	endif
 endif
 
diff --git a/arm/include/arm-common/fdt-arch.h b/arm/include/arm-common/fdt-arch.h
index 60c2d406..ed4ff3d4 100644
--- a/arm/include/arm-common/fdt-arch.h
+++ b/arm/include/arm-common/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef ARM__FDT_H
 #define ARM__FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, ARCH_PHANDLES_MAX};
 
 #endif /* ARM__FDT_H */
diff --git a/fdt.c b/fdt.c
new file mode 100644
index 00000000..6db03d4e
--- /dev/null
+++ b/fdt.c
@@ -0,0 +1,15 @@
+/*
+ * Commonly used FDT functions.
+ */
+
+#include "kvm/fdt.h"
+
+static u32 next_phandle = PHANDLE_RESERVED;
+
+u32 fdt_alloc_phandle(void)
+{
+	if (next_phandle == PHANDLE_RESERVED)
+		next_phandle = ARCH_PHANDLES_MAX;
+
+	return next_phandle++;
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index beadc7f3..503887f9 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -35,4 +35,17 @@ enum irq_type {
 		}							\
 	} while (0)
 
+#ifdef CONFIG_HAS_LIBFDT
+
+u32 fdt_alloc_phandle(void);
+
+#else
+
+static inline u32 fdt_alloc_phandle(void)
+{
+	return PHANDLE_RESERVED;
+}
+
+#endif /* CONFIG_HAS_LIBFDT */
+
 #endif /* KVM__FDT_H */
diff --git a/mips/include/kvm/fdt-arch.h b/mips/include/kvm/fdt-arch.h
index b0302457..3d004117 100644
--- a/mips/include/kvm/fdt-arch.h
+++ b/mips/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef KVM__KVM_FDT_H
 #define KVM__KVM_FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
diff --git a/powerpc/include/kvm/fdt-arch.h b/powerpc/include/kvm/fdt-arch.h
index d48c0554..4ae4d3a0 100644
--- a/powerpc/include/kvm/fdt-arch.h
+++ b/powerpc/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef KVM__KVM_FDT_H
 #define KVM__KVM_FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
diff --git a/x86/include/kvm/fdt-arch.h b/x86/include/kvm/fdt-arch.h
index eebd73f9..aba06ad8 100644
--- a/x86/include/kvm/fdt-arch.h
+++ b/x86/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef X86__FDT_ARCH_H
 #define X86__FDT_ARCH_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
-- 
2.12.1
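To make the allocation order concrete, here is a small standalone sketch of the same scheme, using the ARM enum from the patch. The type uint32_t stands in for kvmtool's u32, and the function is made static only so that the fragment is self-contained. Like the patch, it does no locking, so it assumes that phandles are allocated from a single thread.

```c
#include <assert.h>
#include <stdint.h>

/* Static phandles, as in arm/include/arm-common/fdt-arch.h after the patch. */
enum phandles {
	PHANDLE_RESERVED = 0,
	PHANDLE_GIC,
	PHANDLE_MSI,
	ARCH_PHANDLES_MAX,
};

/* PHANDLE_RESERVED doubles as the "not initialized yet" marker. */
static uint32_t next_phandle = PHANDLE_RESERVED;

static uint32_t fdt_alloc_phandle(void)
{
	/* The first call skips past every statically assigned value. */
	if (next_phandle == PHANDLE_RESERVED)
		next_phandle = ARCH_PHANDLES_MAX;

	return next_phandle++;
}
```

With the ARM enum, the first dynamic phandle is 3 (ARCH_PHANDLES_MAX) and later calls return 4, 5 and so on, so dynamic values never collide with PHANDLE_GIC or PHANDLE_MSI.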
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 03/15] virtio: add virtio-iommu
Implement a simple para-virtualized IOMMU for handling device address
spaces in guests.
Four operations are implemented:
* attach/detach: guest creates an address space, symbolized by a unique
  identifier (IOASID), and attaches the device to it.
* map/unmap: guest creates a GVA->GPA mapping in an address space. Devices
  attached to this address space can then access the GVA.
Each subsystem can register its own IOMMU by calling register/unregister.
A unique device-tree phandle is allocated for each IOMMU. The IOMMU
receives commands from the driver through the virtqueue, and has a set of
callbacks for each device, which allows different map/unmap operations to
be implemented for passed-through and emulated devices. Note that a single
virtual IOMMU per guest would be enough; this multi-instance model is only
here for experimentation, and to allow different subsystems to offer
different vIOMMU features.
Add a global --viommu parameter to enable the virtual IOMMU.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 Makefile                   |   1 +
 builtin-run.c              |   2 +
 include/kvm/devices.h      |   4 +
 include/kvm/iommu.h        |  64 +++++
 include/kvm/kvm-config.h   |   1 +
 include/kvm/virtio-iommu.h |  10 +
 virtio/iommu.c             | 628 +++++++++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c              |  11 +
 8 files changed, 721 insertions(+)
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/virtio-iommu.h
 create mode 100644 virtio/iommu.c
diff --git a/Makefile b/Makefile
index 3e21c597..67953870 100644
--- a/Makefile
+++ b/Makefile
@@ -68,6 +68,7 @@ OBJS	+= virtio/net.o
 OBJS	+= virtio/rng.o
 OBJS    += virtio/balloon.o
 OBJS	+= virtio/pci.o
+OBJS	+= virtio/iommu.o
 OBJS	+= disk/blk.o
 OBJS	+= disk/qcow.o
 OBJS	+= disk/raw.o
diff --git a/builtin-run.c b/builtin-run.c
index b4790ebc..7535b531 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -113,6 +113,8 @@ void kvm_run_set_wrapper_sandbox(void)
 	OPT_BOOLEAN('\0', "sdl", &(cfg)->sdl, "Enable SDL framebuffer"),\
 	OPT_BOOLEAN('\0', "rng", &(cfg)->virtio_rng, "Enable virtio"	\
 			" Random Number Generator"),			\
+	OPT_BOOLEAN('\0', "viommu", &(cfg)->viommu,			\
+			"Enable virtio IOMMU"),				\
 	OPT_CALLBACK('\0', "9p", NULL, "dir_to_share,tag_name",		\
 		     "Enable virtio 9p to share files between host and"	\
 		     " guest", virtio_9p_rootdir_parser, kvm),		\
diff --git a/include/kvm/devices.h b/include/kvm/devices.h
index 405f1952..70a00c5b 100644
--- a/include/kvm/devices.h
+++ b/include/kvm/devices.h
@@ -11,11 +11,15 @@ enum device_bus_type {
 	DEVICE_BUS_MAX,
 };
 
+struct iommu_ops;
+
 struct device_header {
 	enum device_bus_type	bus_type;
 	void			*data;
 	int			dev_num;
 	struct rb_node		node;
+	struct iommu_ops	*iommu_ops;
+	void			*iommu_data;
 };
 
 int device__register(struct device_header *dev);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
new file mode 100644
index 00000000..925e1993
--- /dev/null
+++ b/include/kvm/iommu.h
@@ -0,0 +1,64 @@
+#ifndef KVM_IOMMU_H
+#define KVM_IOMMU_H
+
+#include <stdlib.h>
+
+#include "devices.h"
+
+#define IOMMU_PROT_NONE		0x0
+#define IOMMU_PROT_READ		0x1
+#define IOMMU_PROT_WRITE	0x2
+#define IOMMU_PROT_EXEC		0x4
+
+struct iommu_ops {
+	const struct iommu_properties *(*get_properties)(struct device_header *);
+
+	void *(*alloc_address_space)(struct device_header *);
+	void (*free_address_space)(void *);
+
+	int (*attach)(void *, struct device_header *, int flags);
+	int (*detach)(void *, struct device_header *);
+	int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
+	int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+};
+
+struct iommu_properties {
+	const char			*name;
+	u32				phandle;
+
+	size_t				input_addr_size;
+	u64				pgsize_mask;
+};
+
+/*
+ * All devices presented to the system have a device ID, that allows the IOMMU
+ * to identify them. Since multiple buses can share an IOMMU, this device ID
+ * must be unique system-wide. We define it here as:
+ *
+ *	(bus_type << 16) + dev_num
+ *
+ * Where dev_num is the device number on the bus as allocated by devices.c
+ *
+ * TODO: enforce this limit, by checking that the device number allocator
+ * doesn't overflow BUS_SIZE.
+ */
+
+#define BUS_SIZE 0x10000
+
+static inline long device_to_iommu_id(struct device_header *dev)
+{
+	return dev->bus_type * BUS_SIZE + dev->dev_num;
+}
+
+#define iommu_id_to_bus(device_id)	((device_id) / BUS_SIZE)
+#define iommu_id_to_devnum(device_id)	((device_id) % BUS_SIZE)
+
+static inline struct device_header *iommu_get_device(u32 device_id)
+{
+	enum device_bus_type bus = iommu_id_to_bus(device_id);
+	u32 dev_num = iommu_id_to_devnum(device_id);
+
+	return device__find_dev(bus, dev_num);
+}
+
+#endif /* KVM_IOMMU_H */
diff --git a/include/kvm/kvm-config.h b/include/kvm/kvm-config.h
index 62dc6a2f..9678065b 100644
--- a/include/kvm/kvm-config.h
+++ b/include/kvm/kvm-config.h
@@ -60,6 +60,7 @@ struct kvm_config {
 	bool no_dhcp;
 	bool ioport_debug;
 	bool mmio_debug;
+	bool viommu;
 };
 
 #endif
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
new file mode 100644
index 00000000..5532c82b
--- /dev/null
+++ b/include/kvm/virtio-iommu.h
@@ -0,0 +1,10 @@
+#ifndef KVM_VIRTIO_IOMMU_H
+#define KVM_VIRTIO_IOMMU_H
+
+#include "virtio.h"
+
+const struct iommu_properties *viommu_get_properties(void *dev);
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
+void viommu_unregister(struct kvm *kvm, void *cookie);
+
+#endif
diff --git a/virtio/iommu.c b/virtio/iommu.c
new file mode 100644
index 00000000..c72e7322
--- /dev/null
+++ b/virtio/iommu.c
@@ -0,0 +1,628 @@
+#include <errno.h>
+#include <stdbool.h>
+
+#include <linux/compiler.h>
+
+#include <linux/bitops.h>
+#include <linux/byteorder.h>
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_iommu.h>
+
+#include "kvm/guest_compat.h"
+#include "kvm/iommu.h"
+#include "kvm/threadpool.h"
+#include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
+
+/* Max size */
+#define VIOMMU_DEFAULT_QUEUE_SIZE	256
+
+struct viommu_endpoint {
+	struct device_header		*dev;
+	struct viommu_ioas		*ioas;
+	struct list_head		list;
+};
+
+struct viommu_ioas {
+	u32				id;
+
+	struct mutex			devices_mutex;
+	struct list_head		devices;
+	size_t				nr_devices;
+	struct rb_node			node;
+
+	struct iommu_ops		*ops;
+	void				*priv;
+};
+
+struct viommu_dev {
+	struct virtio_device		vdev;
+	struct virtio_iommu_config	config;
+
+	const struct iommu_properties	*properties;
+
+	struct virt_queue		vq;
+	size_t				queue_size;
+	struct thread_pool__job		job;
+
+	struct rb_root			address_spaces;
+	struct kvm			*kvm;
+};
+
+static int compat_id = -1;
+
+static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
+					    u32 ioasid)
+{
+	struct rb_node *node;
+	struct viommu_ioas *ioas;
+
+	node = viommu->address_spaces.rb_node;
+	while (node) {
+		ioas = container_of(node, struct viommu_ioas, node);
+		if (ioas->id > ioasid)
+			node = node->rb_left;
+		else if (ioas->id < ioasid)
+			node = node->rb_right;
+		else
+			return ioas;
+	}
+
+	return NULL;
+}
+
+static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
+					     struct device_header *device,
+					     u32 ioasid)
+{
+	struct rb_node **node, *parent = NULL;
+	struct viommu_ioas *new_ioas, *ioas;
+	struct iommu_ops *ops = device->iommu_ops;
+
+	if (!ops || !ops->get_properties || !ops->alloc_address_space ||
+	    !ops->free_address_space || !ops->attach || !ops->detach ||
+	    !ops->map || !ops->unmap) {
+		/* Catch programming mistakes early */
+		pr_err("Invalid IOMMU ops");
+		return NULL;
+	}
+
+	new_ioas = calloc(1, sizeof(*new_ioas));
+	if (!new_ioas)
+		return NULL;
+
+	INIT_LIST_HEAD(&new_ioas->devices);
+	mutex_init(&new_ioas->devices_mutex);
+	new_ioas->id		= ioasid;
+	new_ioas->ops		= ops;
+	new_ioas->priv		= ops->alloc_address_space(device);
+
+	/* A NULL priv pointer is valid. */
+
+	node = &viommu->address_spaces.rb_node;
+	while (*node) {
+		ioas = container_of(*node, struct viommu_ioas, node);
+		parent = *node;
+
+		if (ioas->id > ioasid) {
+			node = &((*node)->rb_left);
+		} else if (ioas->id < ioasid) {
+			node = &((*node)->rb_right);
+		} else {
+			pr_err("IOAS exists!");
+			free(new_ioas);
+			return NULL;
+		}
+	}
+
+	rb_link_node(&new_ioas->node, parent, node);
+	rb_insert_color(&new_ioas->node, &viommu->address_spaces);
+
+	return new_ioas;
+}
+
+static void viommu_free_ioas(struct viommu_dev *viommu,
+			     struct viommu_ioas *ioas)
+{
+	if (ioas->priv)
+		ioas->ops->free_address_space(ioas->priv);
+
+	rb_erase(&ioas->node, &viommu->address_spaces);
+	free(ioas);
+}
+
+static int viommu_ioas_add_device(struct viommu_ioas *ioas,
+				  struct viommu_endpoint *vdev)
+{
+	mutex_lock(&ioas->devices_mutex);
+	list_add_tail(&vdev->list, &ioas->devices);
+	ioas->nr_devices++;
+	vdev->ioas = ioas;
+	mutex_unlock(&ioas->devices_mutex);
+
+	return 0;
+}
+
+static int viommu_ioas_del_device(struct viommu_ioas *ioas,
+				  struct viommu_endpoint *vdev)
+{
+	mutex_lock(&ioas->devices_mutex);
+	list_del(&vdev->list);
+	ioas->nr_devices--;
+	vdev->ioas = NULL;
+	mutex_unlock(&ioas->devices_mutex);
+
+	return 0;
+}
+
+static struct viommu_endpoint *viommu_alloc_device(struct device_header *device)
+{
+	struct viommu_endpoint *vdev = calloc(1, sizeof(*vdev));
+
+	device->iommu_data = vdev;
+	vdev->dev = device;
+
+	return vdev;
+}
+
+static int viommu_detach_device(struct viommu_dev *viommu,
+				struct viommu_endpoint *vdev)
+{
+	int ret;
+	struct viommu_ioas *ioas = vdev->ioas;
+	struct device_header *device = vdev->dev;
+
+	if (!ioas)
+		return -EINVAL;
+
+	pr_debug("detaching device %#lx from IOAS %u",
+		 device_to_iommu_id(device), ioas->id);
+
+	ret = device->iommu_ops->detach(ioas->priv, device);
+	if (!ret)
+		ret = viommu_ioas_del_device(ioas, vdev);
+
+	if (!ioas->nr_devices)
+		viommu_free_ioas(viommu, ioas);
+
+	return ret;
+}
+
+static int viommu_handle_attach(struct viommu_dev *viommu,
+				struct virtio_iommu_req_attach *attach)
+{
+	int ret;
+	struct viommu_ioas *ioas;
+	struct device_header *device;
+	struct viommu_endpoint *vdev;
+
+	u32 device_id	= le32_to_cpu(attach->device);
+	u32 ioasid	= le32_to_cpu(attach->address_space);
+
+	device = iommu_get_device(device_id);
+	if (IS_ERR_OR_NULL(device)) {
+		pr_err("could not find device %#x", device_id);
+		return -ENODEV;
+	}
+
+	pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
+
+	vdev = device->iommu_data;
+	if (!vdev) {
+		vdev = viommu_alloc_device(device);
+		if (!vdev)
+			return -ENOMEM;
+	}
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		ioas = viommu_alloc_ioas(viommu, device, ioasid);
+		if (!ioas)
+			return -ENOMEM;
+	} else if (ioas->ops->map != device->iommu_ops->map ||
+		   ioas->ops->unmap != device->iommu_ops->unmap) {
+		return -EINVAL;
+	}
+
+	if (vdev->ioas) {
+		ret = viommu_detach_device(viommu, vdev);
+		if (ret)
+			return ret;
+	}
+
+	ret = device->iommu_ops->attach(ioas->priv, device, 0);
+	if (!ret)
+		ret = viommu_ioas_add_device(ioas, vdev);
+
+	if (ret && ioas->nr_devices == 0)
+		viommu_free_ioas(viommu, ioas);
+
+	return ret;
+}
+
+static int viommu_handle_detach(struct viommu_dev *viommu,
+				struct virtio_iommu_req_detach *detach)
+{
+	struct device_header *device;
+	struct viommu_endpoint *vdev;
+
+	u32 device_id	= le32_to_cpu(detach->device);
+
+	device = iommu_get_device(device_id);
+	if (IS_ERR_OR_NULL(device)) {
+		pr_err("could not find device %#x", device_id);
+		return -ENODEV;
+	}
+
+	vdev = device->iommu_data;
+	if (!vdev)
+		return -ENODEV;
+
+	return viommu_detach_device(viommu, vdev);
+}
+
+static int viommu_handle_map(struct viommu_dev *viommu,
+			     struct virtio_iommu_req_map *map)
+{
+	int prot = 0;
+	struct viommu_ioas *ioas;
+
+	u32 ioasid	= le32_to_cpu(map->address_space);
+	u64 virt_addr	= le64_to_cpu(map->virt_addr);
+	u64 phys_addr	= le64_to_cpu(map->phys_addr);
+	u64 size	= le64_to_cpu(map->size);
+	u32 flags	= le32_to_cpu(map->flags);
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		pr_err("could not find address space %u", ioasid);
+		return -ESRCH;
+	}
+
+	if (flags & ~VIRTIO_IOMMU_MAP_F_MASK)
+		return -EINVAL;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_READ)
+		prot |= IOMMU_PROT_READ;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_WRITE)
+		prot |= IOMMU_PROT_WRITE;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
+		prot |= IOMMU_PROT_EXEC;
+
+	pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
+		 phys_addr, size, ioasid);
+
+	return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+}
+
+static int viommu_handle_unmap(struct viommu_dev *viommu,
+			       struct virtio_iommu_req_unmap *unmap)
+{
+	struct viommu_ioas *ioas;
+
+	u32 ioasid	= le32_to_cpu(unmap->address_space);
+	u64 virt_addr	= le64_to_cpu(unmap->virt_addr);
+	u64 size	= le64_to_cpu(unmap->size);
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		pr_err("could not find address space %u", ioasid);
+		return -ESRCH;
+	}
+
+	pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
+		 ioasid);
+
+	return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+}
+
+static size_t viommu_get_req_len(union virtio_iommu_req *req)
+{
+	switch (req->head.type) {
+	case VIRTIO_IOMMU_T_ATTACH:
+		return sizeof(req->attach);
+	case VIRTIO_IOMMU_T_DETACH:
+		return sizeof(req->detach);
+	case VIRTIO_IOMMU_T_MAP:
+		return sizeof(req->map);
+	case VIRTIO_IOMMU_T_UNMAP:
+		return sizeof(req->unmap);
+	default:
+		pr_err("unknown request type %x", req->head.type);
+		return 0;
+	}
+}
+
+static int viommu_errno_to_status(int err)
+{
+	switch (err) {
+	case 0:
+		return VIRTIO_IOMMU_S_OK;
+	case EIO:
+		return VIRTIO_IOMMU_S_IOERR;
+	case ENOSYS:
+		return VIRTIO_IOMMU_S_UNSUPP;
+	case ERANGE:
+		return VIRTIO_IOMMU_S_RANGE;
+	case EFAULT:
+		return VIRTIO_IOMMU_S_FAULT;
+	case EINVAL:
+		return VIRTIO_IOMMU_S_INVAL;
+	case ENOENT:
+	case ENODEV:
+	case ESRCH:
+		return VIRTIO_IOMMU_S_NOENT;
+	case ENOMEM:
+	case ENOSPC:
+	default:
+		return VIRTIO_IOMMU_S_DEVERR;
+	}
+}
+
+static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
+					struct iovec *iov, int nr_in, int nr_out)
+{
+	u32 op;
+	int i, ret;
+	ssize_t written_len = 0;
+	size_t len, expected_len;
+	union virtio_iommu_req *req;
+	struct virtio_iommu_req_tail *tail;
+
+	/*
+	 * Are we picking up in the middle of a request buffer? Keep a running
+	 * count.
+	 *
+	 * Here we assume that a request is always made of two descriptors, a
+	 * head and a tail. TODO: get rid of framing assumptions by keeping
+	 * track of request fragments.
+	 */
+	static bool is_head = true;
+	static int cur_status = 0;
+
+	for (i = 0; i < nr_in + nr_out; i++, is_head = !is_head) {
+		len = iov[i].iov_len;
+		if (is_head && len < sizeof(req->head)) {
+			pr_err("invalid command length (%zu)", len);
+			cur_status = EIO;
+			continue;
+		} else if (!is_head && len < sizeof(*tail)) {
+			pr_err("invalid tail length (%zu)", len);
+			cur_status = 0;
+			continue;
+		}
+
+		if (!is_head) {
+			int status = viommu_errno_to_status(cur_status);
+
+			tail = iov[i].iov_base;
+			tail->status = status;
+			written_len += sizeof(tail->status);
+			cur_status = 0;
+			continue;
+		}
+
+		req = iov[i].iov_base;
+		op = req->head.type;
+		expected_len = viommu_get_req_len(req) - sizeof(*tail);
+		if (expected_len != len) {
+			pr_err("invalid command %x length (%zu != %zu)", op,
+			       len, expected_len);
+			cur_status = EIO;
+			continue;
+		}
+
+		switch (op) {
+		case VIRTIO_IOMMU_T_ATTACH:
+			ret = viommu_handle_attach(viommu, &req->attach);
+			break;
+
+		case VIRTIO_IOMMU_T_DETACH:
+			ret = viommu_handle_detach(viommu, &req->detach);
+			break;
+
+		case VIRTIO_IOMMU_T_MAP:
+			ret = viommu_handle_map(viommu, &req->map);
+			break;
+
+		case VIRTIO_IOMMU_T_UNMAP:
+			ret = viommu_handle_unmap(viommu, &req->unmap);
+			break;
+
+		default:
+			pr_err("unhandled command %x", op);
+			ret = -ENOSYS;
+		}
+
+		if (ret)
+			cur_status = -ret;
+	}
+
+	return written_len;
+}
+
+static void viommu_command(struct kvm *kvm, void *dev)
+{
+	int len;
+	u16 head;
+	u16 out, in;
+
+	struct virt_queue *vq;
+	struct viommu_dev *viommu = dev;
+	struct iovec iov[VIOMMU_DEFAULT_QUEUE_SIZE];
+
+	vq = &viommu->vq;
+
+	while (virt_queue__available(vq)) {
+		head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+
+		len = viommu_dispatch_commands(viommu, iov, in, out);
+		if (len < 0) {
+			/* Critical error, abort everything */
+			pr_err("failed to dispatch viommu command");
+			return;
+		}
+
+		virt_queue__set_used_elem(vq, head, len);
+	}
+
+	if (virtio_queue__should_signal(vq))
+		viommu->vdev.ops->signal_vq(kvm, &viommu->vdev, 0);
+}
+
+/* Virtio API */
+static u8 *viommu_get_config(struct kvm *kvm, void *dev)
+{
+	struct viommu_dev *viommu = dev;
+
+	return (u8 *)&viommu->config;
+}
+
+static u32 viommu_get_host_features(struct kvm *kvm, void *dev)
+{
+	return 1ULL << VIRTIO_RING_F_EVENT_IDX
+	     | 1ULL << VIRTIO_RING_F_INDIRECT_DESC
+	     | 1ULL << VIRTIO_IOMMU_F_INPUT_RANGE;
+}
+
+static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+}
+
+static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
+			  u32 align, u32 pfn)
+{
+	void *ptr;
+	struct virt_queue *queue;
+	struct viommu_dev *viommu = dev;
+
+	if (vq != 0)
+		return -ENODEV;
+
+	compat__remove_message(compat_id);
+
+	queue = &viommu->vq;
+	queue->pfn = pfn;
+	ptr = virtio_get_vq(kvm, queue->pfn, page_size);
+
+	vring_init(&queue->vring, viommu->queue_size, ptr, align);
+	virtio_init_device_vq(&viommu->vdev, queue);
+
+	thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);
+
+	return 0;
+}
+
+static int viommu_get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->vq.pfn;
+}
+
+static int viommu_get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->queue_size;
+}
+
+static int viommu_set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+	struct viommu_dev *viommu = dev;
+
+	if (viommu->vq.pfn)
+		/* Already init, can't resize */
+		return viommu->queue_size;
+
+	viommu->queue_size = size;
+
+	return size;
+}
+
+static int viommu_notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	thread_pool__do_job(&viommu->job);
+
+	return 0;
+}
+
+static void viommu_notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+	/* TODO: when implementing vhost */
+}
+
+static void viommu_notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 fd)
+{
+	/* TODO: when implementing vhost */
+}
+
+static struct virtio_ops iommu_dev_virtio_ops = {
+	.get_config		= viommu_get_config,
+	.get_host_features	= viommu_get_host_features,
+	.set_guest_features	= viommu_set_guest_features,
+	.init_vq		= viommu_init_vq,
+	.get_pfn_vq		= viommu_get_pfn_vq,
+	.get_size_vq		= viommu_get_size_vq,
+	.set_size_vq		= viommu_set_size_vq,
+	.notify_vq		= viommu_notify_vq,
+	.notify_vq_gsi		= viommu_notify_vq_gsi,
+	.notify_vq_eventfd	= viommu_notify_vq_eventfd,
+};
+
+const struct iommu_properties *viommu_get_properties(void *dev)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->properties;
+}
+
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
+{
+	struct viommu_dev *viommu;
+	u64 pgsize_mask = ~(PAGE_SIZE - 1);
+
+	if (!kvm->cfg.viommu)
+		return NULL;
+
+	props->phandle = fdt_alloc_phandle();
+
+	viommu = calloc(1, sizeof(struct viommu_dev));
+	if (!viommu)
+		return NULL;
+
+	viommu->queue_size		= VIOMMU_DEFAULT_QUEUE_SIZE;
+	viommu->address_spaces		= (struct rb_root)RB_ROOT;
+	viommu->properties		= props;
+
+	viommu->config.page_sizes	= props->pgsize_mask ?: pgsize_mask;
+	viommu->config.input_range.end	= props->input_addr_size % BITS_PER_LONG ?
+					  (1UL << props->input_addr_size) - 1 :
+					  -1UL;
+
+	if (virtio_init(kvm, viommu, &viommu->vdev, &iommu_dev_virtio_ops,
+			VIRTIO_MMIO, 0, VIRTIO_ID_IOMMU, 0)) {
+		free(viommu);
+		return NULL;
+	}
+
+	pr_info("Loaded virtual IOMMU %s", props->name);
+
+	if (compat_id == -1)
+		compat_id = virtio_compat_add_message("virtio-iommu",
+						      "CONFIG_VIRTIO_IOMMU");
+
+	return viommu;
+}
+
+void viommu_unregister(struct kvm *kvm, void *viommu)
+{
+	free(viommu);
+}
diff --git a/virtio/mmio.c b/virtio/mmio.c
index f0af4bd1..b3dea51a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,14 +1,17 @@
 #include "kvm/devices.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/ioeventfd.h"
+#include "kvm/iommu.h"
 #include "kvm/ioport.h"
 #include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/kvm.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/irq.h"
 #include "kvm/fdt.h"
 
 #include <linux/virtio_mmio.h>
+#include <linux/virtio_ids.h>
 #include <string.h>
 
 static u32 virtio_mmio_io_space_blocks = KVM_VIRTIO_MMIO_AREA;
@@ -237,6 +240,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 							     u8 irq,
 							     enum irq_type))
 {
+	const struct iommu_properties *props;
 	char dev_name[DEVICE_NAME_MAX_LEN];
 	struct virtio_mmio *vmmio = container_of(dev_hdr,
 						 struct virtio_mmio,
@@ -254,6 +258,13 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 	_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
 	_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
 	generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+
+	if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
+		props = viommu_get_properties(vmmio->dev);
+		_FDT(fdt_property_cell(fdt, "phandle", props->phandle));
+		_FDT(fdt_property_cell(fdt, "#iommu-cells", 1));
+	}
+
 	_FDT(fdt_end_node(fdt));
 }
 #else
-- 
2.12.1
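The device ID scheme described in include/kvm/iommu.h, (bus_type << 16) + dev_num, can be sketched as a standalone round trip. The function names below and the bus number used in the usage note are hypothetical. The assert corresponds to the limit that the TODO in the header says is not enforced yet.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_SIZE 0x10000

/* Hypothetical encoder: combine a bus number and a per-bus device number. */
static uint32_t make_iommu_id(uint32_t bus, uint32_t dev_num)
{
	/* dev_num must fit in the low 16 bits, or it would spill into bus. */
	assert(dev_num < BUS_SIZE);
	return bus * BUS_SIZE + dev_num;
}

/* Same arithmetic as the iommu_id_to_bus() macro in the patch. */
static uint32_t iommu_id_bus(uint32_t device_id)
{
	return device_id / BUS_SIZE;
}

/* Same arithmetic as the iommu_id_to_devnum() macro in the patch. */
static uint32_t iommu_id_devnum(uint32_t device_id)
{
	return device_id % BUS_SIZE;
}
```

For example, device number 5 on a bus numbered 2 encodes to 0x20005, and decoding that value gives back bus 2 and device 5. This is the value a guest would pass in the device field of an ATTACH or DETACH request.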
Add an rb-tree based IOMMU with support for map, unmap and access
operations. It will be used to store mappings for virtio devices and MSI
doorbells. If needed, it could also be extended with a TLB implementation.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 Makefile            |   1 +
 include/kvm/iommu.h |   9 +++
 iommu.c             | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 172 insertions(+)
 create mode 100644 iommu.c
diff --git a/Makefile b/Makefile
index 67953870..0c369206 100644
--- a/Makefile
+++ b/Makefile
@@ -73,6 +73,7 @@ OBJS	+= disk/blk.o
 OBJS	+= disk/qcow.o
 OBJS	+= disk/raw.o
 OBJS	+= ioeventfd.o
+OBJS	+= iommu.o
 OBJS	+= net/uip/core.o
 OBJS	+= net/uip/arp.o
 OBJS	+= net/uip/icmp.o
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 925e1993..4164ba20 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -61,4 +61,13 @@ static inline struct device_header *iommu_get_device(u32 device_id)
 	return device__find_dev(bus, dev_num);
 }
 
+void *iommu_alloc_address_space(struct device_header *dev);
+void iommu_free_address_space(void *address_space);
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
+	      int prot);
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags);
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+		 int prot);
+
 #endif /* KVM_IOMMU_H */
diff --git a/iommu.c b/iommu.c
new file mode 100644
index 00000000..0a662404
--- /dev/null
+++ b/iommu.c
@@ -0,0 +1,162 @@
+/*
+ * Implement basic IOMMU operations - map, unmap and translate
+ */
+#include <errno.h>
+
+#include "kvm/iommu.h"
+#include "kvm/kvm.h"
+#include "kvm/mutex.h"
+#include "kvm/rbtree-interval.h"
+
+struct iommu_mapping {
+	struct rb_int_node	iova_range;
+	u64			phys;
+	int			prot;
+};
+
+struct iommu_ioas {
+	struct rb_root		mappings;
+	struct mutex		mutex;
+};
+
+void *iommu_alloc_address_space(struct device_header *unused)
+{
+	struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
+
+	if (!ioas)
+		return NULL;
+
+	ioas->mappings = (struct rb_root)RB_ROOT;
+	mutex_init(&ioas->mutex);
+
+	return ioas;
+}
+
+void iommu_free_address_space(void *address_space)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct rb_int_node *int_node;
+	struct rb_node *node, *next;
+	struct iommu_mapping *map;
+
+	/* Postorder traversal lets us free leaves first. */
+	node = rb_first_postorder(&ioas->mappings);
+	while (node) {
+		next = rb_next_postorder(node);
+
+		int_node = rb_int(node);
+		map = container_of(int_node, struct iommu_mapping, iova_range);
+		free(map);
+
+		node = next;
+	}
+
+	free(ioas);
+}
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr,
+	      u64 size, int prot)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct iommu_mapping *map;
+
+	if (!ioas)
+		return -ENODEV;
+
+	map = malloc(sizeof(struct iommu_mapping));
+	if (!map)
+		return -ENOMEM;
+
+	map->phys = phys_addr;
+	map->iova_range = RB_INT_INIT(virt_addr, virt_addr + size - 1);
+	map->prot = prot;
+
+	mutex_lock(&ioas->mutex);
+	rb_int_insert(&ioas->mappings, &map->iova_range);
+	mutex_unlock(&ioas->mutex);
+
+	return 0;
+}
+
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
+{
+	int ret = 0;
+	struct rb_int_node *node;
+	struct iommu_mapping *map;
+	struct iommu_ioas *ioas = address_space;
+
+	if (!ioas)
+		return -ENODEV;
+
+	mutex_lock(&ioas->mutex);
+	node = rb_int_search_single(&ioas->mappings, virt_addr);
+	while (node && size) {
+		struct rb_node *next = rb_next(&node->node);
+		size_t node_size = node->high - node->low + 1;
+		map = container_of(node, struct iommu_mapping, iova_range);
+
+		if (node_size > size) {
+			pr_debug("cannot split mapping");
+			ret = -EINVAL;
+			break;
+		}
+
+		size -= node_size;
+		virt_addr += node_size;
+
+		rb_erase(&node->node, &ioas->mappings);
+		free(map);
+		node = next ? container_of(next, struct rb_int_node, node) : NULL;
+	}
+
+	if (size && !ret) {
+		pr_debug("mapping not found");
+		ret = -ENXIO;
+	}
+	mutex_unlock(&ioas->mutex);
+
+	return ret;
+}
+
+/*
+ * Translate a virtual address into a physical one. Perform an access of @size
+ * bytes with protection @prot. If @addr isn't mapped in @address_space, return
+ * 0. If the permissions of the mapping don't match, return 0. If the access
+ * range specified by (addr, size) spans over multiple mappings, only access the
+ * first mapping and return the accessed size in @out_size. It is up to the
+ * caller to complete the access by calling the function again on the remaining
+ * range. Subsequent accesses are not guaranteed to succeed.
+ */
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+		 int prot)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct iommu_mapping *map;
+	struct rb_int_node *node;
+	u64 out_addr = 0;
+
+	mutex_lock(&ioas->mutex);
+	node = rb_int_search_single(&ioas->mappings, addr);
+	if (!node) {
+		pr_err("fault at IOVA %#llx %zu", addr, size);
+		errno = EFAULT;
+		goto out_unlock; /* Segv incoming */
+	}
+
+	map = container_of(node, struct iommu_mapping, iova_range);
+	if (prot & ~map->prot) {
+		pr_err("permission fault at IOVA %#llx", addr);
+		errno = EPERM;
+		goto out_unlock;
+	}
+
+	out_addr = map->phys + (addr - node->low);
+	*out_size = min_t(size_t, node->high - addr + 1, size);
+
+	pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
+		 prot, out_addr);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
+
+	return out_addr;
+}
-- 
2.12.1
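The comment above iommu_access() leaves it to the caller to finish an access that spans several mappings. Here is a minimal sketch of that calling pattern. It swaps the interval rb-tree for a plain array of mappings and drops the locking, and the names mapping, sketch_access and sketch_access_all are hypothetical. As in the patch, a return value of 0 signals a fault, so a mapping whose translated address is 0 cannot be told apart from a fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_PROT_READ		0x1
#define IOMMU_PROT_WRITE	0x2

/* One IOVA range [low, high] translated to phys, with permissions prot. */
struct mapping {
	uint64_t low, high, phys;
	int prot;
};

/* Translate addr; report in *out_size how much of size one mapping covers. */
static uint64_t sketch_access(const struct mapping *maps, size_t n,
			      uint64_t addr, size_t size, size_t *out_size,
			      int prot)
{
	for (size_t i = 0; i < n; i++) {
		const struct mapping *m = &maps[i];
		uint64_t last;

		if (addr < m->low || addr > m->high)
			continue;
		if (prot & ~m->prot)
			return 0;	/* permission fault */

		/* Bytes left in this mapping, minus one (avoids overflow). */
		last = m->high - addr;
		*out_size = (size && size - 1 > last) ? (size_t)(last + 1) : size;
		return m->phys + (addr - m->low);
	}

	return 0;	/* translation fault: addr is not mapped */
}

/* Repeat the access until size bytes are covered or a fault occurs. */
static size_t sketch_access_all(const struct mapping *maps, size_t n,
				uint64_t addr, size_t size, int prot)
{
	size_t done = 0;

	while (done < size) {
		size_t chunk = 0;

		if (!sketch_access(maps, n, addr + done, size - done,
				   &chunk, prot))
			break;
		done += chunk;
	}

	return done;	/* number of bytes translated before any fault */
}
```

A caller that gets back fewer bytes than it asked for from sketch_access_all() has hit an unmapped hole or a permission mismatch part of the way through the range, which is the case the original comment warns about when it says that subsequent accesses are not guaranteed to succeed.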
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 05/15] iommu: describe IOMMU topology in device-trees
Add an "iommu-map" property to the PCI host controller, describing which
IOMMUs translate which devices. We describe individual devices in
iommu-map, not ranges. This patch is incompatible with current mainline
Linux, which requires *all* devices under a host controller to be
described by the iommu-map property when present. Unfortunately all PCI
devices in kvmtool are under the same root complex, and we have to omit
RIDs of devices that aren't behind the virtual IOMMU in iommu-map. Fixing
this requires either a simple patch in Linux or multiple host controllers
in kvmtool.
Add an "iommus" property to platform devices that are behind an IOMMU.
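As a rough illustration of the per-device entry layout (a sketch under assumptions, not the code in this patch): each iommu-map entry is four big-endian 32-bit cells, and a single PCI device on bus 0 covers the eight requester IDs of its functions. The names iommu_map_entry and make_iommu_map_entry() are hypothetical, and htonl() stands in for cpu_to_fdt32().

```c
#include <arpa/inet.h>	/* htonl(): device-tree cells are big-endian */
#include <assert.h>
#include <stdint.h>

/* One "iommu-map" entry: (rid-base, iommu phandle, iommu-base, length). */
struct iommu_map_entry {
	uint32_t rid_base;
	uint32_t iommu_phandle;
	uint32_t iommu_base;
	uint32_t length;
};

/*
 * Describe one PCI device on bus 0. A requester ID is
 * (bus << 8 | device << 3 | function), so the device's eight functions
 * occupy the range [dev_num << 3, (dev_num << 3) + 7].
 * @phandle and @iommu_id are values chosen by the caller.
 */
static struct iommu_map_entry
make_iommu_map_entry(uint32_t dev_num, uint32_t phandle, uint32_t iommu_id)
{
	assert(dev_num < 32);	/* 5-bit PCI device number */

	struct iommu_map_entry e = {
		.rid_base	= htonl(dev_num << 3),
		.iommu_phandle	= htonl(phandle),
		.iommu_base	= htonl(iommu_id),
		.length		= htonl(1 << 3),
	};

	return e;
}
```

Emitting one entry per translated device, instead of one range for the whole bus, is what lets untranslated devices be left out of the property.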
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 arm/pci.c         | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fdt.c             | 20 ++++++++++++++++++++
 include/kvm/fdt.h |  7 +++++++
 virtio/mmio.c     |  1 +
 4 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/arm/pci.c b/arm/pci.c
index 557cfa98..968cbf5b 100644
--- a/arm/pci.c
+++ b/arm/pci.c
@@ -1,9 +1,11 @@
 #include "kvm/devices.h"
 #include "kvm/fdt.h"
+#include "kvm/iommu.h"
 #include "kvm/kvm.h"
 #include "kvm/of_pci.h"
 #include "kvm/pci.h"
 #include "kvm/util.h"
+#include "kvm/virtio-iommu.h"
 
 #include "arm-common/pci.h"
 
@@ -24,11 +26,20 @@ struct of_interrupt_map_entry {
 	struct of_gic_irq		gic_irq;
 } __attribute__((packed));
 
+struct of_iommu_map_entry {
+	u32				rid_base;
+	u32				iommu_phandle;
+	u32				iommu_base;
+	u32				length;
+} __attribute__((packed));
+
 void pci__generate_fdt_nodes(void *fdt)
 {
 	struct device_header *dev_hdr;
 	struct of_interrupt_map_entry irq_map[OF_PCI_IRQ_MAP_MAX];
-	unsigned nentries = 0;
+	struct of_iommu_map_entry *iommu_map;
+	unsigned nentries = 0, ntranslated = 0;
+	unsigned i;
 	/* Bus range */
 	u32 bus_range[] = { cpu_to_fdt32(0), cpu_to_fdt32(1), };
 	/* Configuration Space */
@@ -99,6 +110,9 @@ void pci__generate_fdt_nodes(void *fdt)
 			},
 		};
 
+		if (dev_hdr->iommu_ops)
+			ntranslated++;
+
 		nentries++;
 		dev_hdr = device__next_dev(dev_hdr);
 	}
@@ -121,5 +135,38 @@ void pci__generate_fdt_nodes(void *fdt)
 				  sizeof(irq_mask)));
 	}
 
+	if (ntranslated) {
+		const struct iommu_properties *props;
+
+		iommu_map = malloc(ntranslated * sizeof(struct of_iommu_map_entry));
+		if (!iommu_map) {
+			pr_err("cannot allocate iommu_map.");
+			return;
+		}
+
+		dev_hdr = device__first_dev(DEVICE_BUS_PCI);
+		for (i = 0; i < ntranslated; dev_hdr = device__next_dev(dev_hdr)) {
+			struct of_iommu_map_entry *entry = &iommu_map[i];
+
+			if (!dev_hdr->iommu_ops)
+				continue;
+
+			props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+			*entry = (struct of_iommu_map_entry) {
+				.rid_base	= cpu_to_fdt32(dev_hdr->dev_num << 3),
+				.iommu_phandle	= cpu_to_fdt32(props->phandle),
+				.iommu_base	= cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+				.length		= cpu_to_fdt32(1 << 3),
+			};
+
+			i++;
+		}
+
+		_FDT(fdt_property(fdt, "iommu-map", iommu_map,
+				  ntranslated * sizeof(struct of_iommu_map_entry)));
+		free(iommu_map);
+	}
+
 	_FDT(fdt_end_node(fdt));
 }
diff --git a/fdt.c b/fdt.c
index 6db03d4e..15d7bb29 100644
--- a/fdt.c
+++ b/fdt.c
@@ -2,7 +2,10 @@
  * Commonly used FDT functions.
  */
 
+#include "kvm/devices.h"
 #include "kvm/fdt.h"
+#include "kvm/iommu.h"
+#include "kvm/util.h"
 
 static u32 next_phandle = PHANDLE_RESERVED;
 
@@ -13,3 +16,20 @@ u32 fdt_alloc_phandle(void)
 
 	return next_phandle++;
 }
+
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev_hdr)
+{
+	const struct iommu_properties *props;
+
+	if (!dev_hdr->iommu_ops)
+		return;
+
+	props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+	u32 iommus[] = {
+		cpu_to_fdt32(props->phandle),
+		cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+	};
+
+	_FDT(fdt_property(fdt, "iommus", iommus, sizeof(iommus)));
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index 503887f9..c64fe8a3 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -37,7 +37,10 @@ enum irq_type {
 
 #ifdef CONFIG_HAS_LIBFDT
 
+struct device_header;
+
 u32 fdt_alloc_phandle(void);
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev);
 
 #else
 
@@ -46,6 +49,10 @@ static inline u32 fdt_alloc_phandle(void)
 	return PHANDLE_RESERVED;
 }
 
+static inline void fdt_generate_iommus_prop(void *fdt, struct device_header *dev)
+{
+}
+
 #endif /* CONFIG_HAS_LIBFDT */
 
 #endif /* KVM__FDT_H */
diff --git a/virtio/mmio.c b/virtio/mmio.c
index b3dea51a..16b44fbb 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -258,6 +258,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 	_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
 	_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
 	generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+	fdt_generate_iommus_prop(fdt, dev_hdr);
 
 	if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
 		props = viommu_get_properties(vmmio->dev);
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 06/15] irq: register MSI doorbell addresses
For passed-through devices behind a vIOMMU, we'll need to translate writes
to MSI vectors. Let the IRQ code register MSI doorbells, and add a simple
way for other systems to check if an address is a doorbell.
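A minimal sketch of the idea, assuming a simple singly linked list instead of kvmtool's list_head: regions are stored as inclusive [start, end] ranges and a lookup walks the list. The names doorbell_region, add_msi_doorbell() and addr_is_msi_doorbell() are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Inclusive address range occupied by one MSI doorbell frame. */
struct doorbell_region {
	uint64_t start;
	uint64_t end;
	struct doorbell_region *next;
};

static struct doorbell_region *doorbells;

/* Register [addr, addr + size - 1] as a doorbell region. */
static int add_msi_doorbell(uint64_t addr, uint64_t size)
{
	struct doorbell_region *db;

	assert(size > 0);	/* an empty region would make end < start */

	db = malloc(sizeof(*db));
	if (!db)
		return -1;

	db->start = addr;
	db->end = addr + size - 1;
	db->next = doorbells;
	doorbells = db;

	return 0;
}

/* Return true if @addr falls inside any registered doorbell region. */
static bool addr_is_msi_doorbell(uint64_t addr)
{
	const struct doorbell_region *db;

	for (db = doorbells; db; db = db->next) {
		if (addr >= db->start && addr <= db->end)
			return true;
	}

	return false;
}
```

A linear walk is adequate here because a guest typically has only a handful of doorbell frames.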
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 arm/gic.c         |  4 ++++
 include/kvm/irq.h |  3 +++
 irq.c             | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+)
diff --git a/arm/gic.c b/arm/gic.c
index bf7a22a9..c708031e 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -108,6 +108,10 @@ static int gic__create_its_frame(struct kvm *kvm, u64 its_frame_addr)
 	};
 	int err;
 
+	err = irq__add_msi_doorbell(kvm, its_frame_addr, KVM_VGIC_V3_ITS_SIZE);
+	if (err)
+		return err;
+
 	err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &its_device);
 	if (err) {
 		fprintf(stderr,
diff --git a/include/kvm/irq.h b/include/kvm/irq.h
index a188a870..2a59257e 100644
--- a/include/kvm/irq.h
+++ b/include/kvm/irq.h
@@ -24,6 +24,9 @@ int irq__allocate_routing_entry(void);
 int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg, u32 device_id);
 void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg);
 
+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size);
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr);
+
 /*
  * The function takes two eventfd arguments, trigger_fd and resample_fd. If
  * resample_fd is <= 0, resampling is disabled and the IRQ is edge-triggered
diff --git a/irq.c b/irq.c
index a4ef75e4..a04f4d37 100644
--- a/irq.c
+++ b/irq.c
@@ -8,6 +8,14 @@
 #include "kvm/irq.h"
 #include "kvm/kvm-arch.h"
 
+struct kvm_msi_doorbell_region {
+	u64			start;
+	u64			end;
+	struct list_head	head;
+};
+
+static LIST_HEAD(msi_doorbells);
+
 static u8 next_line = KVM_IRQ_OFFSET;
 static int allocated_gsis = 0;
 
@@ -147,6 +155,33 @@ void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg)
 		die_perror("KVM_SET_GSI_ROUTING");
 }
 
+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size)
+{
+	struct kvm_msi_doorbell_region *doorbell = malloc(sizeof(*doorbell));
+
+	if (!doorbell)
+		return -ENOMEM;
+
+	doorbell->start = addr;
+	doorbell->end = addr + size - 1;
+
+	list_add(&doorbell->head, &msi_doorbells);
+
+	return 0;
+}
+
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr)
+{
+	struct kvm_msi_doorbell_region *doorbell;
+
+	list_for_each_entry(doorbell, &msi_doorbells, head) {
+		if (addr >= doorbell->start && addr <= doorbell->end)
+			return true;
+	}
+
+	return false;
+}
+
 int irq__common_add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
 			   int resample_fd)
 {
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 07/15] virtio: factor virtqueue initialization
All virtio devices perform the same few operations when initializing
their virtqueues. Move these operations to the virtio core, since vring
initialization will become more complex once we implement a virtual IOMMU.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/virtio.h | 16 +++++++++-------
 virtio/9p.c          |  7 ++-----
 virtio/balloon.c     |  7 +++----
 virtio/blk.c         | 10 ++--------
 virtio/console.c     |  7 ++-----
 virtio/iommu.c       | 10 ++--------
 virtio/net.c         |  8 ++------
 virtio/rng.c         |  6 ++----
 virtio/scsi.c        |  6 ++----
 9 files changed, 26 insertions(+), 51 deletions(-)
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 00a791ac..24c0c487 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -169,15 +169,17 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 int virtio_compat_add_message(const char *device, const char *config);
 const char* virtio_trans_name(enum virtio_trans trans);
 
-static inline void *virtio_get_vq(struct kvm *kvm, u32 pfn, u32 page_size)
+static inline void virtio_init_device_vq(struct kvm *kvm,
+					 struct virtio_device *vdev,
+					 struct virt_queue *vq, size_t nr_descs,
+					 u32 page_size, u32 align, u32 pfn)
 {
-	return guest_flat_to_host(kvm, (u64)pfn * page_size);
-}
+	void *p		= guest_flat_to_host(kvm, (u64)pfn * page_size);
 
-static inline void virtio_init_device_vq(struct virtio_device *vdev,
-					 struct virt_queue *vq)
-{
-	vq->endian = vdev->endian;
+	vq->endian	= vdev->endian;
+	vq->pfn		= pfn;
+
+	vring_init(&vq->vring, nr_descs, p, align);
 }
 
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/9p.c b/virtio/9p.c
index 69fdc4be..acd09bdd 100644
--- a/virtio/9p.c
+++ b/virtio/9p.c
@@ -1388,17 +1388,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct p9_dev *p9dev = dev;
 	struct p9_dev_job *job;
 	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &p9dev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 	job		= &p9dev->jobs[vq];
 
-	vring_init(&queue->vring, VIRTQUEUE_NUM, p, align);
-	virtio_init_device_vq(&p9dev->vdev, queue);
+	virtio_init_device_vq(kvm, &p9dev->vdev, queue, VIRTQUEUE_NUM,
+			      page_size, align, pfn);
 
 	*job		= (struct p9_dev_job) {
 		.vq		= queue,
diff --git a/virtio/balloon.c b/virtio/balloon.c
index 9564aa39..9182cae6 100644
--- a/virtio/balloon.c
+++ b/virtio/balloon.c
@@ -198,16 +198,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 {
 	struct bln_dev *bdev = dev;
 	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &bdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
+
+	virtio_init_device_vq(kvm, &bdev->vdev, queue, VIRTIO_BLN_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	thread_pool__init_job(&bdev->jobs[vq], kvm, virtio_bln_do_io, queue);
-	vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, align);
 
 	return 0;
 }
diff --git a/virtio/blk.c b/virtio/blk.c
index c485e4fc..8c6e59ba 100644
--- a/virtio/blk.c
+++ b/virtio/blk.c
@@ -178,17 +178,11 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 		   u32 pfn)
 {
 	struct blk_dev *bdev = dev;
-	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
-	queue		= &bdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&bdev->vdev, queue);
+	virtio_init_device_vq(kvm, &bdev->vdev, &bdev->vqs[vq],
+			      VIRTIO_BLK_QUEUE_SIZE, page_size, align, pfn);
 
 	return 0;
 }
diff --git a/virtio/console.c b/virtio/console.c
index f1c0a190..610962c4 100644
--- a/virtio/console.c
+++ b/virtio/console.c
@@ -143,18 +143,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 		   u32 pfn)
 {
 	struct virt_queue *queue;
-	void *p;
 
 	BUG_ON(vq >= VIRTIO_CONSOLE_NUM_QUEUES);
 
 	compat__remove_message(compat_id);
 
 	queue		= &cdev.vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
-	vring_init(&queue->vring, VIRTIO_CONSOLE_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&cdev.vdev, queue);
+	virtio_init_device_vq(kvm, &cdev.vdev, queue, VIRTIO_CONSOLE_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	if (vq == VIRTIO_CONSOLE_TX_QUEUE) {
 		thread_pool__init_job(&cdev.jobs[vq], kvm, virtio_console_handle_callback, queue);
diff --git a/virtio/iommu.c b/virtio/iommu.c
index c72e7322..2e5a23ee 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -497,8 +497,6 @@ static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
 static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
 			  u32 align, u32 pfn)
 {
-	void *ptr;
-	struct virt_queue *queue;
 	struct viommu_dev *viommu = dev;
 
 	if (vq != 0)
@@ -506,12 +504,8 @@ static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
 
 	compat__remove_message(compat_id);
 
-	queue = &viommu->vq;
-	queue->pfn = pfn;
-	ptr = virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, viommu->queue_size, ptr, align);
-	virtio_init_device_vq(&viommu->vdev, queue);
+	virtio_init_device_vq(kvm, &viommu->vdev, &viommu->vq,
+			      viommu->queue_size, page_size, align, pfn);
 
 	thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);
 
diff --git a/virtio/net.c b/virtio/net.c
index 529b4111..957cca09 100644
--- a/virtio/net.c
+++ b/virtio/net.c
@@ -505,17 +505,13 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct vhost_vring_addr addr;
 	struct net_dev *ndev = dev;
 	struct virt_queue *queue;
-	void *p;
 	int r;
 
 	compat__remove_message(compat_id);
 
 	queue		= &ndev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&ndev->vdev, queue);
+	virtio_init_device_vq(kvm, &ndev->vdev, queue, VIRTIO_NET_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	mutex_init(&ndev->io_lock[vq]);
 	pthread_cond_init(&ndev->io_cond[vq], NULL);
diff --git a/virtio/rng.c b/virtio/rng.c
index 9b9e1283..5f525540 100644
--- a/virtio/rng.c
+++ b/virtio/rng.c
@@ -92,17 +92,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct rng_dev *rdev = dev;
 	struct virt_queue *queue;
 	struct rng_dev_job *job;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &rdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
 	job = &rdev->jobs[vq];
 
-	vring_init(&queue->vring, VIRTIO_RNG_QUEUE_SIZE, p, align);
+	virtio_init_device_vq(kvm, &rdev->vdev, queue, VIRTIO_RNG_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	*job = (struct rng_dev_job) {
 		.vq	= queue,
diff --git a/virtio/scsi.c b/virtio/scsi.c
index a429ac85..e0fd85f6 100644
--- a/virtio/scsi.c
+++ b/virtio/scsi.c
@@ -57,16 +57,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct vhost_vring_addr addr;
 	struct scsi_dev *sdev = dev;
 	struct virt_queue *queue;
-	void *p;
 	int r;
 
 	compat__remove_message(compat_id);
 
 	queue		= &sdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
-	vring_init(&queue->vring, VIRTIO_SCSI_QUEUE_SIZE, p, align);
+	virtio_init_device_vq(kvm, &sdev->vdev, queue, VIRTIO_SCSI_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	if (sdev->vhost_fd == 0)
 		return 0;
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 08/15] virtio: add vIOMMU instance for virtio devices
Virtio devices can now opt in to using an IOMMU by setting the use_iommu
field. None of this works in the current state, since virtio devices
still access memory linearly. A subsequent patch implements sg accesses.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/virtio-mmio.h |  1 +
 include/kvm/virtio-pci.h  |  1 +
 include/kvm/virtio.h      | 13 ++++++++++++
 virtio/core.c             | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c             | 27 ++++++++++++++++++++++++
 virtio/pci.c              | 26 ++++++++++++++++++++++++
 6 files changed, 120 insertions(+)
diff --git a/include/kvm/virtio-mmio.h b/include/kvm/virtio-mmio.h
index 835f421b..c25a4fd7 100644
--- a/include/kvm/virtio-mmio.h
+++ b/include/kvm/virtio-mmio.h
@@ -44,6 +44,7 @@ struct virtio_mmio_hdr {
 struct virtio_mmio {
 	u32			addr;
 	void			*dev;
+	struct virtio_device	*vdev;
 	struct kvm		*kvm;
 	u8			irq;
 	struct virtio_mmio_hdr	hdr;
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index b70cadd8..26772f74 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -22,6 +22,7 @@ struct virtio_pci {
 	struct pci_device_header pci_hdr;
 	struct device_header	dev_hdr;
 	void			*dev;
+	struct virtio_device	*vdev;
 	struct kvm		*kvm;
 
 	u16			port_addr;
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 24c0c487..9f2ff237 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <sys/uio.h>
 
+#include "kvm/iommu.h"
 #include "kvm/kvm.h"
 
 #define VIRTIO_IRQ_LOW		0
@@ -137,10 +138,12 @@ enum virtio_trans {
 };
 
 struct virtio_device {
+	bool			use_iommu;
 	bool			use_vhost;
 	void			*virtio;
 	struct virtio_ops	*ops;
 	u16			endian;
+	void			*iotlb;
 };
 
 struct virtio_ops {
@@ -182,4 +185,14 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
 	vring_init(&vq->vring, nr_descs, p, align);
 }
 
+/*
+ * These are callbacks for IOMMU operations on virtio devices. They are not
+ * operations on the virtio-iommu device. Confusing, I know.
+ */
+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev);
+
+int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
+int virtio__iommu_detach(void *, struct virtio_device *vdev);
+
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index d6ac289d..32bd4ebc 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -6,11 +6,16 @@
 #include "kvm/guest_compat.h"
 #include "kvm/barrier.h"
 #include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/virtio-pci.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
 
+static void *iommu = NULL;
+static struct iommu_properties iommu_props = {
+	.name		= "viommu-virtio",
+};
 
 const char* virtio_trans_name(enum virtio_trans trans)
 {
@@ -198,6 +203,41 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 	return false;
 }
 
+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev)
+{
+	return &iommu_props;
+}
+
+int virtio__iommu_attach(void *priv, struct virtio_device *vdev, int flags)
+{
+	struct virtio_tlb *iotlb = priv;
+
+	if (!iotlb)
+		return -ENOMEM;
+
+	if (vdev->iotlb) {
+		pr_err("device already attached");
+		return -EINVAL;
+	}
+
+	vdev->iotlb = iotlb;
+
+	return 0;
+}
+
+int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
+{
+	if (vdev->iotlb != priv) {
+		pr_err("wrong iotlb"); /* bug */
+		return -EINVAL;
+	}
+
+	vdev->iotlb = NULL;
+
+	return 0;
+}
+
 int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		struct virtio_ops *ops, enum virtio_trans trans,
 		int device_id, int subsys_id, int class)
@@ -233,6 +273,18 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		return -1;
 	};
 
+	if (!iommu && vdev->use_iommu) {
+		iommu_props.pgsize_mask = ~(PAGE_SIZE - 1);
+		/*
+		 * With legacy MMIO, we only have 32 bits to hold the vring PFN.
+		 * This limits the IOVA size to (32 + 12) = 44 bits, when using
+		 * 4k pages.
+		 */
+		iommu_props.input_addr_size = 44;
+		iommu = viommu_register(kvm, &iommu_props);
+	}
+
+
 	return 0;
 }
 
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 16b44fbb..24a14a71 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,4 +1,5 @@
 #include "kvm/devices.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/iommu.h"
@@ -286,6 +287,30 @@ void virtio_mmio_assign_irq(struct device_header *dev_hdr)
 	vmmio->irq = irq__alloc_line();
 }
 
+#define mmio_dev_to_virtio(dev_hdr)					\
+	container_of(dev_hdr, struct virtio_mmio, dev_hdr)->vdev
+
+static int virtio_mmio_iommu_attach(void *priv, struct device_header *dev_hdr,
+				    int flags)
+{
+	return virtio__iommu_attach(priv, mmio_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_mmio_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return virtio__iommu_detach(priv, mmio_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_mmio_iommu_ops = {
+	.get_properties		= virtio__iommu_get_properties,
+	.alloc_address_space	= iommu_alloc_address_space,
+	.free_address_space	= iommu_free_address_space,
+	.attach			= virtio_mmio_iommu_attach,
+	.detach			= virtio_mmio_iommu_detach,
+	.map			= iommu_map,
+	.unmap			= iommu_unmap,
+};
+
 int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		     int device_id, int subsys_id, int class)
 {
@@ -294,6 +319,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vmmio->addr	= virtio_mmio_get_io_space_block(VIRTIO_MMIO_IO_SIZE);
 	vmmio->kvm	= kvm;
 	vmmio->dev	= dev;
+	vmmio->vdev	= vdev;
 
 	kvm__register_mmio(kvm, vmmio->addr, VIRTIO_MMIO_IO_SIZE,
 			   false, virtio_mmio_mmio_callback, vdev);
@@ -309,6 +335,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vmmio->dev_hdr = (struct device_header) {
 		.bus_type	= DEVICE_BUS_MMIO,
 		.data		= generate_virtio_mmio_fdt_node,
+		.iommu_ops	= vdev->use_iommu ? &virtio_mmio_iommu_ops : NULL,
 	};
 
 	device__register(&vmmio->dev_hdr);
diff --git a/virtio/pci.c b/virtio/pci.c
index b6ef389e..674d5143 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -408,6 +408,30 @@ static void virtio_pci__io_mmio_callback(struct kvm_cpu *vcpu,
 	kvm__emulate_io(vcpu, port, data, direction, len, 1);
 }
 
+#define pci_dev_to_virtio(dev_hdr)				\
+	(container_of(dev_hdr, struct virtio_pci, dev_hdr)->vdev)
+
+static int virtio_pci_iommu_attach(void *priv, struct device_header *dev_hdr,
+				   int flags)
+{
+	return virtio__iommu_attach(priv, pci_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_pci_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return virtio__iommu_detach(priv, pci_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_pci_iommu_ops = {
+	.get_properties		= virtio__iommu_get_properties,
+	.alloc_address_space	= iommu_alloc_address_space,
+	.free_address_space	= iommu_free_address_space,
+	.attach			= virtio_pci_iommu_attach,
+	.detach			= virtio_pci_iommu_detach,
+	.map			= iommu_map,
+	.unmap			= iommu_unmap,
+};
+
 int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		     int device_id, int subsys_id, int class)
 {
@@ -416,6 +440,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 
 	vpci->kvm = kvm;
 	vpci->dev = dev;
+	vpci->vdev = vdev;
 
 	r = ioport__register(kvm, IOPORT_EMPTY, &virtio_pci__io_ops, IOPORT_SIZE, vdev);
 	if (r < 0)
@@ -461,6 +486,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vpci->dev_hdr = (struct device_header) {
 		.bus_type		= DEVICE_BUS_PCI,
 		.data			= &vpci->pci_hdr,
+		.iommu_ops		= vdev->use_iommu ? &virtio_pci_iommu_ops : NULL,
 	};
 
 	vpci->pci_hdr.msix.cap = PCI_CAP_ID_MSIX;
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 09/15] virtio: access vring and buffers through IOMMU mappings
Teach the virtio core how to access scattered vring structures. When
presenting a virtual IOMMU to the guest in front of virtio devices, the
virtio ring and buffers will be scattered across discontiguous guest-
physical pages. The device has to translate all IOVAs to host-virtual
addresses and gather the pages before accessing any structure.
Buffers described by vring.desc are already returned to the device via an
iovec. We simply have to fill them at a finer granularity and hope that:
1. The driver doesn't provide too many descriptors at a time, since the
   iovec is only as big as the number of descriptors and an overflow is now
   possible.
2. The device doesn't make assumptions about message framing from vectors
   (i.e. a message can now be contained in more vectors than before). This
   is forbidden by virtio 1.0 (and legacy with ANY_LAYOUT) but our
   virtio-net, for instance, assumes that the first vector always contains
   a full vnet header. In practice it's fine, but still extremely fragile.
For accessing vring and indirect descriptor tables, we now allocate an
iovec describing the IOMMU mappings of the structure, and make all
accesses via this iovec.
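The lookup this implies can be sketched as follows, assuming the structure's fragments have already been gathered into an iovec in IOVA order. sg_access() is a hypothetical name; it takes a byte offset from the start of the structure and returns the matching host address.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Find the host address of the byte at offset @off inside a structure
 * scattered over @nr iovec fragments. Returns NULL if @off lies beyond
 * the last fragment.
 */
static void *sg_access(const struct iovec *iov, size_t nr, size_t off)
{
	assert(iov || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (off < iov[i].iov_len)
			return (char *)iov[i].iov_base + off;
		/* Not in this fragment: skip it and keep walking. */
		off -= iov[i].iov_len;
	}

	return NULL;
}
```

Note that this only locates the first byte of an element. An element that straddles two fragments would still need to be copied out piecewise.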
                                  ***
A more elegant way to do it would be to create a subprocess per
address-space, and remap fragments of guest memory in a contiguous manner:
                                .---- virtio-blk process
                               /
           viommu process ----+------ virtio-net process
                               \
                                '---- some other device
(0) Initially, parent forks for each emulated device. Each child reserves
    a large chunk of virtual memory with mmap (base), representing the
    IOVA space, but doesn't populate it.
(1) virtio-dev wants to access guest memory, for instance read the vring.
    It sends a TLB miss for an IOVA to the parent via pipe or socket.
(2) Parent viommu checks its translation table, and returns an offset in
    guest memory.
(3) Child does a mmap in its IOVA space, using the fd that backs guest
    memory: mmap(base + iova, pgsize, SHARED|FIXED, fd, offset)
This would be really cool, but I suspect it adds a lot of complexity,
since it's not clear which devices are entirely self-contained and which
need to access parent memory. So we stick with scatter-gather accesses for
now.
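For reference, steps (0) and (3) of that alternative could look roughly like the sketch below. It is only an illustration under assumptions: iova_reserve() and iova_map_page() are hypothetical helpers, guest memory is assumed to be backed by a file descriptor, and the TLB-miss messaging of steps (1) and (2) is left out entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step (0): reserve an IOVA window of @size bytes without populating it. */
static void *iova_reserve(size_t size)
{
	void *base = mmap(NULL, size, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return base == MAP_FAILED ? NULL : base;
}

/*
 * Step (3): after the parent has resolved a TLB miss to @offset in guest
 * memory, map that page at @base + @iova using the fd backing guest memory.
 */
static void *iova_map_page(void *base, uint64_t iova, int guest_fd,
			   off_t offset, size_t pgsize)
{
	void *p;

	assert(iova % pgsize == 0);	/* mappings are page-granular */

	p = mmap((char *)base + iova, pgsize, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, guest_fd, offset);

	return p == MAP_FAILED ? NULL : p;
}
```

Once a page is mapped this way, the child can dereference base + iova directly, which is what would make the approach attractive compared with walking an iovec on every access.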
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/virtio.h | 108 +++++++++++++++++++++++++++++--
 virtio/core.c        | 179 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 252 insertions(+), 35 deletions(-)
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 9f2ff237..cdc960cd 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -29,12 +29,16 @@
 
 struct virt_queue {
 	struct vring	vring;
+	struct iovec	*vring_sg;
+	size_t		vring_nr_sg;
 	u32		pfn;
 	/* The last_avail_idx field is an index to ->ring of struct vring_avail.
 	   It's where we assume the next request index is at.  */
 	u16		last_avail_idx;
 	u16		last_used_signalled;
 	u16		endian;
+
+	struct virtio_device *vdev;
 };
 
 /*
@@ -96,26 +100,91 @@ static inline __u64 __virtio_h2g_u64(u16 endian, __u64 val)
 
 #endif
 
+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+			  u64 addr, size_t size, size_t *out_size, int prot);
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+		       size_t size, int prot, u16 cur_sg, u16 max_sg,
+		       struct iovec iov[]);
+
+/*
+ * Access element in a virtio structure. If @iov is NULL, access is linear and
+ * @ptr represents a Host-Virtual Address (HVA).
+ *
+ * Otherwise, the structure is scattered in the guest-physical space, and is
+ * made virtually-contiguous by the virtual IOMMU. @iov describes the
+ * structure's IOVA->HVA fragments, @base is the IOVA of the structure,
and @ptr
+ * an IOVA inside the structure. @max is the number of elements in @iov.
+ *
+ *                                        HVA
+ *                      IOVA      .----> +---+ iov[0].base
+ *              @base-> +---+ ----'      |   |
+ *                      |   |            +---+
+ *                      +---+ ----.      :   :
+ *                      |   |     '----> +---+ iov[1].base
+ *               @ptr-> |   |            |   |
+ *                      +---+            |   |--> out
+ *                                       +---+
+ */
+static void *virtio_access_sg(struct iovec *iov, int max, void *base, void *ptr)
+{
+	int i;
+	size_t off = ptr - base;
+
+	if (!iov)
+		return ptr;
+
+	for (i = 0; i < max; i++) {
+		size_t sz = iov[i].iov_len;
+		if (off < sz)
+			return iov[i].iov_base + off;
+		off -= sz;
+	}
+
+	pr_err("virtio_access_sg overflow");
+	return NULL;
+}
+
+/*
+ * We only implement legacy vhost, so vring is a single virtually-contiguous
+ * structure starting at the descriptor table. Differentiating the accesses
+ * eases a future move to virtio 1.0.
+ */
+#define vring_access_avail(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_desc(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_used(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+
 static inline u16 virt_queue__pop(struct virt_queue *queue)
 {
+	void *ptr;
 	__u16 guest_idx;
 
-	guest_idx = queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+	ptr = &queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+	guest_idx = *(u16 *)vring_access_avail(queue, ptr);
+
 	return virtio_guest_to_host_u16(queue, guest_idx);
 }
 
 static inline struct vring_desc *virt_queue__get_desc(struct virt_queue *queue, u16 desc_ndx)
 {
-	return &queue->vring.desc[desc_ndx];
+	return vring_access_desc(queue, &queue->vring.desc[desc_ndx]);
 }
 
 static inline bool virt_queue__available(struct virt_queue *vq)
 {
+	u16 *evt, *idx;
+
 	if (!vq->vring.avail)
 		return 0;
 
-	vring_avail_event(&vq->vring) = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
-	return virtio_guest_to_host_u16(vq, vq->vring.avail->idx) != vq->last_avail_idx;
+	/* Disgusting casts under the hood: &(*&used[size]) */
+	evt = vring_access_used(vq, &vring_avail_event(&vq->vring));
+	idx = vring_access_avail(vq, &vq->vring.avail->idx);
+
+	*evt = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
+	return virtio_guest_to_host_u16(vq, *idx) != vq->last_avail_idx;
 }
 
 void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump);
@@ -177,10 +246,39 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
 					 struct virt_queue *vq, size_t nr_descs,
 					 u32 page_size, u32 align, u32 pfn)
 {
-	void *p		= guest_flat_to_host(kvm, (u64)pfn * page_size);
+	void *p;
 
 	vq->endian	= vdev->endian;
 	vq->pfn		= pfn;
+	vq->vdev	= vdev;
+	vq->vring_sg	= NULL;
+
+	if (vdev->iotlb) {
+		u64 addr = (u64)pfn * page_size;
+		size_t size = vring_size(nr_descs, align);
+		/* Our IOMMU maps at PAGE_SIZE granularity */
+		size_t nr_sg = size / PAGE_SIZE;
+		int flags = IOMMU_PROT_READ | IOMMU_PROT_WRITE;
+
+		vq->vring_sg = calloc(nr_sg, sizeof(struct iovec));
+		if (!vq->vring_sg) {
+			pr_err("could not allocate vring_sg");
+			return; /* Explode later. */
+		}
+
+		vq->vring_nr_sg = virtio_populate_sg(kvm, vdev, addr, size,
+						     flags, 0, nr_sg,
+						     vq->vring_sg);
+		if (!vq->vring_nr_sg) {
+			pr_err("could not map vring");
+			free(vq->vring_sg);
+		}
+
+		/* vring is described with its IOVA */
+		p = (void *)addr;
+	} else {
+		p = guest_flat_to_host(kvm, (u64)pfn * page_size);
+	}
 
 	vring_init(&vq->vring, nr_descs, p, align);
 }
diff --git a/virtio/core.c b/virtio/core.c
index 32bd4ebc..ba35e5f1 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -28,7 +28,8 @@ const char* virtio_trans_name(enum virtio_trans trans)
 
 void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
 {
-	u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+	u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+	u16 idx = virtio_guest_to_host_u16(queue, *ptr);
 
 	/*
 	 * Use wmb to assure that used elem was updated with head and len.
@@ -37,7 +38,7 @@ void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
 	 */
 	wmb();
 	idx += jump;
-	queue->vring.used->idx = virtio_host_to_guest_u16(queue, idx);
+	*ptr = virtio_host_to_guest_u16(queue, idx);
 
 	/*
 	 * Use wmb to assure used idx has been increased before we signal the guest.
@@ -52,10 +53,12 @@ virt_queue__set_used_elem_no_update(struct virt_queue *queue, u32 head,
 				    u32 len, u16 offset)
 {
 	struct vring_used_elem *used_elem;
-	u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+	u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+	u16 idx = virtio_guest_to_host_u16(queue, *ptr);
 
-	idx += offset;
-	used_elem	= &queue->vring.used->ring[idx % queue->vring.num];
+	idx = (idx + offset) % queue->vring.num;
+
+	used_elem	= vring_access_used(queue, &queue->vring.used->ring[idx]);
 	used_elem->id	= virtio_host_to_guest_u32(queue, head);
 	used_elem->len	= virtio_host_to_guest_u32(queue, len);
 
@@ -84,16 +87,17 @@ static inline bool virt_desc__test_flag(struct virt_queue *vq,
  * at the end.
  */
 static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,
-			  unsigned int i, unsigned int max)
+			  unsigned int max)
 {
 	unsigned int next;
 
 	/* If this descriptor says it doesn't chain, we're done. */
-	if (!virt_desc__test_flag(vq, &desc[i], VRING_DESC_F_NEXT))
+	if (!virt_desc__test_flag(vq, desc, VRING_DESC_F_NEXT))
 		return max;
 
+	next = virtio_guest_to_host_u16(vq, desc->next);
 	/* Check they're not leading us off end of descriptors. */
-	next = virtio_guest_to_host_u16(vq, desc[i].next);
+	next = min(next, max);
 	/* Make sure compiler knows to grab that: we don't want it changing! */
 	wmb();
 
@@ -102,32 +106,76 @@ static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,
 
 u16 virt_queue__get_head_iov(struct virt_queue *vq, struct iovec iov[], u16 *out, u16 *in, u16 head, struct kvm *kvm)
 {
-	struct vring_desc *desc;
+	struct vring_desc *desc_base, *desc;
+	bool indirect, is_write;
+	struct iovec *desc_sg;
+	size_t len, nr_sg;
+	u64 addr;
 	u16 idx;
 	u16 max;
 
 	idx = head;
 	*out = *in = 0;
 	max = vq->vring.num;
-	desc = vq->vring.desc;
+	desc_base = vq->vring.desc;
+	desc_sg = vq->vring_sg;
+	nr_sg = vq->vring_nr_sg;
+
+	desc = vring_access_desc(vq, &desc_base[idx]);
+	indirect = virt_desc__test_flag(vq, desc, VRING_DESC_F_INDIRECT);
+	if (indirect) {
+		len = virtio_guest_to_host_u32(vq, desc->len);
+		max = len / sizeof(struct vring_desc);
+		addr = virtio_guest_to_host_u64(vq, desc->addr);
+		if (desc_sg) {
+			desc_sg = calloc(len / PAGE_SIZE + 1, sizeof(struct iovec));
+			if (!desc_sg)
+				return 0;
+
+			nr_sg = virtio_populate_sg(kvm, vq->vdev, addr, len,
+						   IOMMU_PROT_READ, 0, max,
+						   desc_sg);
+			if (!nr_sg) {
+				pr_err("failed to populate indirect table");
+				free(desc_sg);
+				return 0;
+			}
+
+			desc_base = (void *)addr;
+		} else {
+			desc_base = guest_flat_to_host(kvm, addr);
+		}
 
-	if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_INDIRECT)) {
-		max = virtio_guest_to_host_u32(vq, desc[idx].len) / sizeof(struct vring_desc);
-		desc = guest_flat_to_host(kvm, virtio_guest_to_host_u64(vq, desc[idx].addr));
 		idx = 0;
 	}
 
 	do {
+		u16 nr_io;
+
+		desc = virtio_access_sg(desc_sg, nr_sg, desc_base, &desc_base[idx]);
+		is_write = virt_desc__test_flag(vq, desc, VRING_DESC_F_WRITE);
+
 		/* Grab the first descriptor, and check it's OK. */
-		iov[*out + *in].iov_len = virtio_guest_to_host_u32(vq, desc[idx].len);
-		iov[*out + *in].iov_base = guest_flat_to_host(kvm,
-							      virtio_guest_to_host_u64(vq, desc[idx].addr));
+		len = virtio_guest_to_host_u32(vq, desc->len);
+		addr = virtio_guest_to_host_u64(vq, desc->addr);
+
+		/*
+		 * dodgy assumption alert: device uses vring.desc.num iovecs.
+		 * True in practice, but they are not obligated to do so.
+		 */
+		nr_io = virtio_populate_sg(kvm, vq->vdev, addr, len, is_write ?
+					   IOMMU_PROT_WRITE : IOMMU_PROT_READ,
+					   *out + *in, vq->vring.num, iov);
+
 		/* If this is an input descriptor, increment that count. */
-		if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_WRITE))
-			(*in)++;
+		if (is_write)
+			(*in) += nr_io;
 		else
-			(*out)++;
-	} while ((idx = next_desc(vq, desc, idx, max)) != max);
+			(*out) += nr_io;
+	} while ((idx = next_desc(vq, desc, max)) != max);
+
+	if (indirect && desc_sg)
+		free(desc_sg);
 
 	return head;
 }
@@ -147,23 +195,35 @@ u16 virt_queue__get_inout_iov(struct kvm *kvm, struct virt_queue *queue,
 			      u16 *in, u16 *out)
 {
 	struct vring_desc *desc;
+	struct iovec *iov;
 	u16 head, idx;
+	bool is_write;
+	size_t len;
+	u64 addr;
+	int prot;
+	u16 *cur;
 
 	idx = head = virt_queue__pop(queue);
 	*out = *in = 0;
 	do {
-		u64 addr;
 		desc = virt_queue__get_desc(queue, idx);
+		is_write = virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE);
+		len = virtio_guest_to_host_u32(queue, desc->len);
 		addr = virtio_guest_to_host_u64(queue, desc->addr);
-		if (virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE)) {
-			in_iov[*in].iov_base = guest_flat_to_host(kvm, addr);
-			in_iov[*in].iov_len = virtio_guest_to_host_u32(queue, desc->len);
-			(*in)++;
+		if (is_write) {
+			prot = IOMMU_PROT_WRITE;
+			iov = in_iov;
+			cur = in;
 		} else {
-			out_iov[*out].iov_base = guest_flat_to_host(kvm, addr);
-			out_iov[*out].iov_len = virtio_guest_to_host_u32(queue, desc->len);
-			(*out)++;
+			prot = IOMMU_PROT_READ;
+			iov = out_iov;
+			cur = out;
 		}
+
+		/* dodgy assumption alert: device uses vring.desc.num iovecs */
+		*cur += virtio_populate_sg(kvm, queue->vdev, addr, len, prot,
+					   *cur, queue->vring.num, iov);
+
 		if (virt_desc__test_flag(queue, desc, VRING_DESC_F_NEXT))
 			idx = virtio_guest_to_host_u16(queue, desc->next);
 		else
@@ -191,9 +251,12 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 {
 	u16 old_idx, new_idx, event_idx;
 
+	u16 *new_ptr	= vring_access_used(vq, &vq->vring.used->idx);
+	u16 *event_ptr	= vring_access_avail(vq, &vring_used_event(&vq->vring));
+
 	old_idx		= vq->last_used_signalled;
-	new_idx		= virtio_guest_to_host_u16(vq, vq->vring.used->idx);
-	event_idx	= virtio_guest_to_host_u16(vq, vring_used_event(&vq->vring));
+	new_idx		= virtio_guest_to_host_u16(vq, *new_ptr);
+	event_idx	= virtio_guest_to_host_u16(vq, *event_ptr);
 
 	if (vring_need_event(event_idx, new_idx, old_idx)) {
 		vq->last_used_signalled = new_idx;
@@ -238,6 +301,62 @@ int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
 	return 0;
 }
 
+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+			  u64 addr, size_t size, size_t *out_size, int prot)
+{
+	u64 paddr;
+
+	if (!vdev->iotlb) {
+		*out_size = size;
+		paddr = addr;
+	} else {
+		paddr = iommu_access(vdev->iotlb, addr, size, out_size, prot);
+	}
+
+	return guest_flat_to_host(kvm, paddr);
+}
+
+/*
+ * Fill @iov starting at index @cur_vec with translations of the (@addr, @size)
+ * range. If @vdev doesn't have a tlb, fill a single vector with the
+ * corresponding HVA. Otherwise, fill vectors with IOVA->GPA->HVA translations.
+ * Since the IOVA range may span over multiple IOMMU mappings, there may need to
+ * be multiple vectors. @nr_vec is the size of the @iov array.
+ */
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+		       size_t size, int prot, u16 cur_vec, u16 nr_vec,
+		       struct iovec iov[])
+{
+	void *ptr;
+	int vec = cur_vec;
+	size_t consumed = 0;
+
+	while (size > 0 && vec < nr_vec) {
+		ptr = virtio_guest_access(kvm, vdev, addr, size, &consumed,
+					  prot);
+		if (!ptr)
+			break;
+
+		iov[vec].iov_len = consumed;
+		iov[vec].iov_base = ptr;
+
+		size -= consumed;
+		addr += consumed;
+		vec++;
+	}
+
+	if (vec == nr_vec && size)
+		/*
+		 * This is bad. Devices used to offer as many iovecs as vring
+		 * descriptors, so there was no chance of filling up the array.
+		 * But with the IOMMU, buffers may be fragmented and use
+		 * multiple iovecs per descriptor.
+		 */
+		pr_err("reached end of iovec, incomplete buffer");
+
+	return vec - cur_vec;
+}
+
 int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		struct virtio_ops *ops, enum virtio_trans trans,
 		int device_id, int subsys_id, int class)
-- 
2.12.1
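To illustrate why virtio_populate_sg in the patch above may need several
iovecs for a single descriptor, here is a minimal standalone sketch. It
assumes an identity translation that is chunked at a hypothetical 4 KiB
mapping granularity (TOY_PAGE_SIZE). The names toy_vec and toy_populate_sg
are illustrative only and are not kvmtool code; the real function asks
iommu_access how much of the range each mapping covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mappings are contiguous only within one 4 KiB page. */
#define TOY_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for struct iovec, holding a translated address. */
struct toy_vec {
	uint64_t base;
	size_t len;
};

/*
 * Split (addr, size) into per-page chunks, starting at vec[cur_vec].
 * Returns the number of vectors filled; *remaining is the part of the
 * buffer that did not fit because the array was full.
 */
static int toy_populate_sg(uint64_t addr, size_t size, unsigned int cur_vec,
			   unsigned int nr_vec, struct toy_vec vec[],
			   size_t *remaining)
{
	unsigned int v = cur_vec;

	assert(cur_vec <= nr_vec);

	while (size > 0 && v < nr_vec) {
		/* Bytes left until the end of the current page */
		size_t chunk = (size_t)(TOY_PAGE_SIZE -
					(addr & (TOY_PAGE_SIZE - 1)));

		if (chunk > size)
			chunk = size;

		vec[v].base = addr;
		vec[v].len = chunk;

		addr += chunk;
		size -= chunk;
		v++;
	}

	*remaining = size;
	return (int)(v - cur_vec);
}
```

A buffer that straddles a page boundary consumes two vectors here, and a
non-zero *remaining corresponds to the "reached end of iovec" case that
the patch reports.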
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 10/15] virtio-pci: translate MSIs with the virtual IOMMU
When the virtio device is behind a virtual IOMMU, the doorbell address
written into the MSI-X table by the guest is an IOVA, not a physical
address. When injecting an MSI, KVM needs a physical address to recognize
the doorbell and the associated IRQ chip. Translate the address given by
the guest into a physical one, and store the result in a secondary table
for easy access.
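The translation step can be sketched in isolation as follows. This is a
simplified illustration, not the kvmtool implementation: toy_msi_msg,
toy_mapping and toy_translate_msi are hypothetical names, and a linear
array of mappings stands in for the interval tree used by iommu_access.
As in the patch, the doorbell write is assumed to be 4 bytes wide and must
be fully covered by one mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for kvmtool's struct msi_msg */
struct toy_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical IOVA -> physical mapping */
struct toy_mapping {
	uint64_t iova;
	uint64_t phys;
	uint64_t size;
};

/*
 * Rewrite the doorbell address in @msg from IOVA to physical.
 * Returns 0 on success, -1 if no mapping covers the 4-byte doorbell,
 * in which case @msg is left untouched.
 */
static int toy_translate_msi(const struct toy_mapping *maps, size_t nr_maps,
			     struct toy_msi_msg *msg)
{
	const uint64_t len = 4;
	uint64_t addr;
	size_t i;

	assert(msg != NULL);

	addr = ((uint64_t)msg->address_hi << 32) | msg->address_lo;

	for (i = 0; i < nr_maps; i++) {
		const struct toy_mapping *m = &maps[i];
		uint64_t phys;

		if (m->size < len || addr < m->iova ||
		    addr - m->iova > m->size - len)
			continue;

		phys = m->phys + (addr - m->iova);
		msg->address_lo = (uint32_t)(phys & 0xffffffff);
		msg->address_hi = (uint32_t)(phys >> 32);
		return 0;
	}

	return -1;
}
```

Working on a copy of the guest-written message, as the patch does with
msix_msgs, keeps the MSI-X table contents that the guest reads back
unchanged.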
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/iommu.h      |  4 ++++
 include/kvm/virtio-pci.h |  1 +
 iommu.c                  | 23 +++++++++++++++++++++++
 virtio/pci.c             | 33 ++++++++++++++++++++++++---------
 4 files changed, 52 insertions(+), 9 deletions(-)
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 4164ba20..8f87ce5a 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -70,4 +70,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size,
int flags);
 u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
 		 int prot);
 
+struct msi_msg;
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msi);
+
 #endif /* KVM_IOMMU_H */
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index 26772f74..cb5225d6 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -47,6 +47,7 @@ struct virtio_pci {
 	u32			msix_io_block;
 	u64			msix_pba;
 	struct msix_table	msix_table[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
+	struct msi_msg		msix_msgs[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
 
 	/* virtio queue */
 	u16			queue_selector;
diff --git a/iommu.c b/iommu.c
index 0a662404..c10a3f0b 100644
--- a/iommu.c
+++ b/iommu.c
@@ -5,6 +5,7 @@
 
 #include "kvm/iommu.h"
 #include "kvm/kvm.h"
+#include "kvm/msi.h"
 #include "kvm/mutex.h"
 #include "kvm/rbtree-interval.h"
 
@@ -160,3 +161,25 @@ out_unlock:
 
 	return out_addr;
 }
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msg)
+{
+	size_t size = 4, out_size;
+	u64 addr = ((u64)msg->address_hi << 32) | msg->address_lo;
+
+	if (!address_space)
+		return 0;
+
+	addr = iommu_access(address_space, addr, size, &out_size,
+			    IOMMU_PROT_WRITE);
+
+	if (!addr || out_size != size) {
+		pr_err("could not translate MSI doorbell");
+		return -EFAULT;
+	}
+
+	msg->address_lo = addr & 0xffffffff;
+	msg->address_hi = addr >> 32;
+
+	return 0;
+}
diff --git a/virtio/pci.c b/virtio/pci.c
index 674d5143..88b1a129 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -156,6 +156,7 @@ static void update_msix_map(struct virtio_pci *vpci,
 			    struct msix_table *msix_entry, u32 vecnum)
 {
 	u32 gsi, i;
+	struct msi_msg *msg;
 
 	/* Find the GSI number used for that vector */
 	if (vecnum == vpci->config_vector) {
@@ -172,14 +173,20 @@ static void update_msix_map(struct virtio_pci *vpci,
 	if (gsi == 0)
 		return;
 
-	msix_entry = &msix_entry[vecnum];
-	irq__update_msix_route(vpci->kvm, gsi, &msix_entry->msg);
+	msg = &vpci->msix_msgs[vecnum];
+	*msg = msix_entry[vecnum].msg;
+
+	if (iommu_translate_msi(vpci->vdev->iotlb, msg))
+		return;
+
+	irq__update_msix_route(vpci->kvm, gsi, msg);
 }
 
 static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *vdev, u16 port,
 					void *data, int size, int offset)
 {
 	struct virtio_pci *vpci = vdev->virtio;
+	struct msi_msg *msg;
 	u32 config_offset, vec;
 	int gsi;
 	int type = virtio__get_dev_specific_field(offset - 20, virtio_pci__msix_enabled(vpci),
@@ -191,8 +198,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
 			if (vec == VIRTIO_MSI_NO_VECTOR)
 				break;
 
-			gsi = irq__add_msix_route(kvm,
-						  &vpci->msix_table[vec].msg,
+			msg = &vpci->msix_msgs[vec];
+			*msg = vpci->msix_table[vec].msg;
+			if (iommu_translate_msi(vdev->iotlb, msg))
+				break;
+
+			gsi = irq__add_msix_route(kvm, msg,
 						  vpci->dev_hdr.dev_num << 3);
 			if (gsi >= 0) {
 				vpci->config_gsi = gsi;
@@ -210,8 +221,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
 			if (vec == VIRTIO_MSI_NO_VECTOR)
 				break;
 
-			gsi = irq__add_msix_route(kvm,
-						  &vpci->msix_table[vec].msg,
+			msg = &vpci->msix_msgs[vec];
+			*msg = vpci->msix_table[vec].msg;
+			if (iommu_translate_msi(vdev->iotlb, msg))
+				break;
+
+			gsi = irq__add_msix_route(kvm, msg,
 						  vpci->dev_hdr.dev_num << 3);
 			if (gsi < 0) {
 				if (gsi == -ENXIO &&
@@ -328,9 +343,9 @@ static void virtio_pci__signal_msi(struct kvm *kvm, struct virtio_pci *vpci,
 {
 	static int needs_devid = 0;
 	struct kvm_msi msi = {
-		.address_lo = vpci->msix_table[vec].msg.address_lo,
-		.address_hi = vpci->msix_table[vec].msg.address_hi,
-		.data = vpci->msix_table[vec].msg.data,
+		.address_lo = vpci->msix_msgs[vec].address_lo,
+		.address_hi = vpci->msix_msgs[vec].address_hi,
+		.data = vpci->msix_msgs[vec].data,
 	};
 
 	if (needs_devid == 0) {
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 11/15] virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
Advertise the VIRTIO_F_IOMMU_PLATFORM feature bit to tell the guest that a
device is behind an IOMMU.
Other feature bits in virtio do not depend on the device type and could be
factored the same way. For instance our vring implementation always
supports indirect descriptors (VIRTIO_RING_F_INDIRECT_DESC), so we could
advertise it for all devices at once (only net, scsi and blk do so at the
moment). However, this might modify guest behaviour: in Linux whenever the
driver attempts to add a chain of descriptors, it will allocate an
indirect table and use a single ring descriptor, which might slightly
reduce performance. Cowardly ignore this.
VIRTIO_RING_F_EVENT_IDX is another feature of the vring, but that one
needs the device to call virtio_queue__should_signal before signaling to
the guest. Arguably we could factor all calls to signal_vq, but let's keep
this patch simple.
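One detail worth keeping in mind when factoring feature bits: in the Linux
UAPI header <linux/virtio_config.h>, VIRTIO_F_IOMMU_PLATFORM is defined as
a bit number (33), not as a mask, so a transport that exposes features in
32-bit windows would report it in the second window. The sketch below
shows that arithmetic only. toy_common_features and toy_features_window
are hypothetical helpers, not kvmtool functions, and TOY_F_IOMMU_PLATFORM
simply mirrors the header's value.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit number as VIRTIO_F_IOMMU_PLATFORM in <linux/virtio_config.h> */
#define TOY_F_IOMMU_PLATFORM 33

/* Build a 64-bit feature mask from the bit number. */
static uint64_t toy_common_features(int use_iommu)
{
	return use_iommu ? 1ULL << TOY_F_IOMMU_PLATFORM : 0;
}

/*
 * Return the 32-bit window of @features selected by @sel, the way a
 * features-select register would: window 0 holds bits 0-31, window 1
 * holds bits 32-63.
 */
static uint32_t toy_features_window(uint64_t features, uint32_t sel)
{
	assert(sel < 2);

	return (uint32_t)(features >> (32 * sel));
}
```

With this layout the bit shows up as 0x2 in window 1 and leaves window 0
untouched.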
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/virtio.h | 2 ++
 virtio/core.c        | 6 ++++++
 virtio/mmio.c        | 4 +++-
 virtio/pci.c         | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index cdc960cd..97bd5bdb 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -293,4 +293,6 @@ virtio__iommu_get_properties(struct device_header *dev);
 int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
 int virtio__iommu_detach(void *, struct virtio_device *vdev);
 
+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev);
+
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index ba35e5f1..66e0cecb 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,3 +1,4 @@
+#include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/types.h>
 #include <sys/uio.h>
@@ -266,6 +267,11 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 	return false;
 }
 
+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev)
+{
+	return vdev->use_iommu ? VIRTIO_F_IOMMU_PLATFORM : 0;
+}
+
 const struct iommu_properties *
 virtio__iommu_get_properties(struct device_header *dev)
 {
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 24a14a71..699d4403 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -127,9 +127,11 @@ static void virtio_mmio_config_in(struct kvm_cpu *vcpu,
 		ioport__write32(data, *(u32 *)(((void *)&vmmio->hdr) + addr));
 		break;
 	case VIRTIO_MMIO_HOST_FEATURES:
-		if (vmmio->hdr.host_features_sel == 0)
+		if (vmmio->hdr.host_features_sel == 0) {
 			val = vdev->ops->get_host_features(vmmio->kvm,
 							   vmmio->dev);
+			val |= virtio_get_common_features(vmmio->kvm, vdev);
+		}
 		ioport__write32(data, val);
 		break;
 	case VIRTIO_MMIO_QUEUE_PFN:
diff --git a/virtio/pci.c b/virtio/pci.c
index 88b1a129..c9f0e558 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -126,6 +126,7 @@ static bool virtio_pci__io_in(struct ioport *ioport, struct kvm_cpu *vcpu, u16 p
 	switch (offset) {
 	case VIRTIO_PCI_HOST_FEATURES:
 		val = vdev->ops->get_host_features(kvm, vpci->dev);
+		val |= virtio_get_common_features(kvm, vdev);
 		ioport__write32(data, val);
 		break;
 	case VIRTIO_PCI_QUEUE_PFN:
-- 
2.12.1
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 12/15] vfio: add support for virtual IOMMU
Currently all passed-through devices must access the same guest-physical
address space. Register an IOMMU to offer individual address spaces to
devices. We do this by allocating one container per group and adding
mappings on demand.
Since the guest cannot access a device unless it is attached to a
container, and we cannot change containers at runtime without resetting
the device, this implementation is limited. To implement bypass mode, we'd
need to map the whole guest physical memory first, and unmap everything
when attaching to a new address space. It is also not possible to attach
several devices to the same address space: each one has its own page
tables.
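Unmapping everything, as mentioned above, runs into a small arithmetic
limit that vfio_viommu_free in this patch also notes: a full 64-bit
address space spans 2^64 bytes, which does not fit in a 64-bit size field.
A minimal sketch of the two-range split follows. toy_range and
toy_full_space_halves are illustrative names, not kvmtool or VFIO API, and
the sketch only computes the ranges without issuing any ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (iova, size) pair, as an unmap request would carry */
struct toy_range {
	uint64_t iova;
	uint64_t size;
};

/*
 * Describe the whole 64-bit IOVA space as two ranges of 2^63 bytes each,
 * since a single range of 2^64 bytes would overflow the size field.
 */
static void toy_full_space_halves(struct toy_range out[2])
{
	const uint64_t half = 1ULL << 63;

	assert(out != NULL);

	out[0] = (struct toy_range){ .iova = 0, .size = half };
	out[1] = (struct toy_range){ .iova = half, .size = half };
}
```

Each range would then be passed to one unmap request, lower half first.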
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/iommu.h |   6 ++
 include/kvm/vfio.h  |   2 +
 iommu.c             |   7 +-
 vfio.c              | 281 ++++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 273 insertions(+), 23 deletions(-)
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 8f87ce5a..45a20f3b 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -10,6 +10,12 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+/*
+ * Test if mapping is present. If not, return an error but do not report it to
+ * stderr
+ */
+#define IOMMU_UNMAP_SILENT	0x1
+
 struct iommu_ops {
 	const struct iommu_properties *(*get_properties)(struct device_header *);
 
diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index 71dfa8f7..84126eb9 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -55,6 +55,7 @@ struct vfio_device {
 	struct device_header		dev_hdr;
 
 	int				fd;
+	struct vfio_group		*group;
 	struct vfio_device_info		info;
 	struct vfio_irq_info		irq_info;
 	struct vfio_region		*regions;
@@ -65,6 +66,7 @@ struct vfio_device {
 struct vfio_group {
 	unsigned long			id; /* iommu_group number in sysfs */
 	int				fd;
+	struct vfio_guest_container	*container;
 };
 
 int vfio_group_parser(const struct option *opt, const char *arg, int unset);
diff --git a/iommu.c b/iommu.c
index c10a3f0b..2220e4b2 100644
--- a/iommu.c
+++ b/iommu.c
@@ -85,6 +85,7 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 	struct rb_int_node *node;
 	struct iommu_mapping *map;
 	struct iommu_ioas *ioas = address_space;
+	bool silent = flags & IOMMU_UNMAP_SILENT;
 
 	if (!ioas)
 		return -ENODEV;
@@ -97,7 +98,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 		map = container_of(node, struct iommu_mapping, iova_range);
 
 		if (node_size > size) {
-			pr_debug("cannot split mapping");
+			if (!silent)
+				pr_debug("cannot split mapping");
 			ret = -EINVAL;
 			break;
 		}
@@ -111,7 +113,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 	}
 
 	if (size && !ret) {
-		pr_debug("mapping not found");
+		if (!silent)
+			pr_debug("mapping not found");
 		ret = -ENXIO;
 	}
 	mutex_unlock(&ioas->mutex);
diff --git a/vfio.c b/vfio.c
index f4fd4090..406d0781 100644
--- a/vfio.c
+++ b/vfio.c
@@ -1,10 +1,13 @@
+#include "kvm/iommu.h"
 #include "kvm/irq.h"
 #include "kvm/kvm.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/pci.h"
 #include "kvm/util.h"
 #include "kvm/vfio.h"
+#include "kvm/virtio-iommu.h"
 
+#include <linux/bitops.h>
 #include <linux/kvm.h>
 #include <linux/pci_regs.h>
 
@@ -25,7 +28,16 @@ struct vfio_irq_eventfd {
 	int			fd;
 };
 
-static int vfio_container;
+struct vfio_guest_container {
+	struct kvm		*kvm;
+	int			fd;
+
+	void			*msi_doorbells;
+};
+
+static void *viommu = NULL;
+
+static int vfio_host_container;
 
 int vfio_group_parser(const struct option *opt, const char *arg, int unset)
 {
@@ -43,6 +55,7 @@ int vfio_group_parser(const struct option *opt, const char *arg, int unset)
 
 	cur = strtok(buf, ",");
 	group->id = strtoul(cur, NULL, 0);
+	group->container = NULL;
 
 	kvm->cfg.num_vfio_groups = ++idx;
 	free(buf);
@@ -68,11 +81,13 @@ static void vfio_pci_msix_pba_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 				       u32 len, u8 is_write, void *ptr)
 {
+	struct msi_msg msg;
 	struct kvm *kvm = vcpu->kvm;
 	struct vfio_pci_device *pdev = ptr;
 	struct vfio_pci_msix_entry *entry;
 	struct vfio_pci_msix_table *table = &pdev->msix_table;
 	struct vfio_device *device = container_of(pdev, struct vfio_device, pci);
+	struct vfio_guest_container *container = device->group->container;
 
 	u64 offset = addr - table->guest_phys_addr;
 
@@ -88,11 +103,16 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 
 	memcpy((void *)&entry->config + field, data, len);
 
-	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL)
+	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL || entry->config.ctrl & 1)
+		return;
+
+	msg = entry->config.msg;
+
+	if (container && iommu_translate_msi(container->msi_doorbells, &msg))
 		return;
 
 	if (entry->gsi < 0) {
-		int ret = irq__add_msix_route(kvm, &entry->config.msg,
+		int ret = irq__add_msix_route(kvm, &msg,
 					      device->dev_hdr.dev_num << 3);
 		if (ret < 0) {
 			pr_err("cannot create MSI-X route");
@@ -111,7 +131,7 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 		return;
 	}
 
-	irq__update_msix_route(kvm, entry->gsi, &entry->config.msg);
+	irq__update_msix_route(kvm, entry->gsi, &msg);
 }
 
 static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
@@ -122,6 +142,7 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
 	struct msi_msg msi;
 	struct vfio_pci_msix_entry *entry;
 	struct vfio_pci_device *pdev = &device->pci;
+	struct vfio_guest_container *container = device->group->container;
 	struct msi_cap_64 *msi_cap_64 = (void *)&pdev->hdr + pdev->msi.pos;
 
 	/* Only modify routes when guest sets the enable bit */
@@ -144,6 +165,9 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
 		msi.data = msi_cap_32->data;
 	}
 
+	if (container && iommu_translate_msi(container->msi_doorbells, &msi))
+		return;
+
 	for (i = 0; i < nr_vectors; i++) {
 		u32 devid = device->dev_hdr.dev_num << 3;
 
@@ -870,6 +894,154 @@ static int vfio_configure_dev_irqs(struct kvm *kvm, struct vfio_device *device)
 	return ret;
 }
 
+static struct iommu_properties vfio_viommu_props = {
+	.name				= "viommu-vfio",
+
+	.input_addr_size		= 64,
+};
+
+static const struct iommu_properties *
+vfio_viommu_get_properties(struct device_header *dev)
+{
+	return &vfio_viommu_props;
+}
+
+static void *vfio_viommu_alloc(struct device_header *dev_hdr)
+{
+	struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+						dev_hdr);
+	struct vfio_guest_container *container = vdev->group->container;
+
+	container->msi_doorbells = iommu_alloc_address_space(NULL);
+	if (!container->msi_doorbells) {
+		pr_err("Failed to create MSI address space");
+		return NULL;
+	}
+
+	return container;
+}
+
+static void vfio_viommu_free(void *priv)
+{
+	struct vfio_guest_container *container = priv;
+
+	/* Half the address space */
+	size_t size = 1UL << (BITS_PER_LONG - 1);
+	unsigned long virt_addr = 0;
+	int i;
+
+	/*
+	 * Remove all mappings in two passes, since 2^64 doesn't fit in
+	 * unmap.size
+	 */
+	for (i = 0; i < 2; i++, virt_addr += size) {
+		struct vfio_iommu_type1_dma_unmap unmap = {
+			.argsz	= sizeof(unmap),
+			.iova	= virt_addr,
+			.size	= size,
+		};
+	}
+
+	iommu_free_address_space(container->msi_doorbells);
+	container->msi_doorbells = NULL;
+}
+
+static int vfio_viommu_attach(void *priv, struct device_header *dev_hdr, int flags)
+{
+	struct vfio_guest_container *container = priv;
+	struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+						dev_hdr);
+
+	if (!container)
+		return -ENODEV;
+
+	if (container->fd != vdev->group->container->fd)
+		/*
+		 * TODO: We don't support multiple devices in the same address
+		 * space at the moment. It should be easy to implement, just
+		 * create an address space structure that holds multiple
+		 * container fds and multiplex map/unmap requests.
+		 */
+		return -EINVAL;
+
+	return 0;
+}
+
+static int vfio_viommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return 0;
+}
+
+static int vfio_viommu_map(void *priv, u64 virt_addr, u64 phys_addr, u64 size,
+			   int prot)
+{
+	int ret;
+	struct vfio_guest_container *container = priv;
+	struct vfio_iommu_type1_dma_map map = {
+		.argsz	= sizeof(map),
+		.iova	= virt_addr,
+		.size	= size,
+	};
+
+	map.vaddr = (u64)guest_flat_to_host(container->kvm, phys_addr);
+	if (!map.vaddr) {
+		if (irq__addr_is_msi_doorbell(container->kvm, phys_addr)) {
+			ret = iommu_map(container->msi_doorbells, virt_addr,
+					phys_addr, size, prot);
+			if (ret) {
+				pr_err("could not map MSI");
+				return ret;
+			}
+
+			// TODO: silence guest_flat_to_host
+			pr_info("Nevermind, all is well. Mapped MSI %llx->%llx",
+				virt_addr, phys_addr);
+			return 0;
+		} else {
+			return -ERANGE;
+		}
+	}
+
+	if (prot & IOMMU_PROT_READ)
+		map.flags |= VFIO_DMA_MAP_FLAG_READ;
+
+	if (prot & IOMMU_PROT_WRITE)
+		map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+
+	if (prot & IOMMU_PROT_EXEC) {
+		pr_err("VFIO does not support PROT_EXEC");
+		return -ENOSYS;
+	}
+
+	return ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map);
+}
+
+static int vfio_viommu_unmap(void *priv, u64 virt_addr, u64 size, int flags)
+{
+	struct vfio_guest_container *container = priv;
+	struct vfio_iommu_type1_dma_unmap unmap = {
+		.argsz	= sizeof(unmap),
+		.iova	= virt_addr,
+		.size	= size,
+	};
+
+	if (!iommu_unmap(container->msi_doorbells, virt_addr, size,
+			 flags | IOMMU_UNMAP_SILENT))
+		return 0;
+
+	return ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+}
+
+static struct iommu_ops vfio_iommu_ops = {
+	.get_properties		= vfio_viommu_get_properties,
+	.alloc_address_space	= vfio_viommu_alloc,
+	.free_address_space	= vfio_viommu_free,
+	.attach			= vfio_viommu_attach,
+	.detach			= vfio_viommu_detach,
+	.map			= vfio_viommu_map,
+	.unmap			= vfio_viommu_unmap,
+};
+
 static int vfio_configure_reserved_regions(struct kvm *kvm,
 					   struct vfio_group *group)
 {
@@ -912,6 +1084,8 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
 		return -ENOMEM;
 	}
 
+	device->group = group;
+
 	device->fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, dirent->d_name);
 	if (device->fd < 0) {
 		pr_err("Failed to get FD for device %s in group %lu",
@@ -945,6 +1119,7 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
 	device->dev_hdr = (struct device_header) {
 		.bus_type	= DEVICE_BUS_PCI,
 		.data		= &device->pci.hdr,
+		.iommu_ops	= viommu ? &vfio_iommu_ops : NULL,
 	};
 
 	ret = device__register(&device->dev_hdr);
@@ -1009,13 +1184,13 @@ static int vfio_configure_iommu_groups(struct kvm *kvm)
 /* TODO: this should be an arch callback, so arm can return HYP only if vsmmu */
 static int vfio_get_iommu_type(void)
 {
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
 		return VFIO_TYPE1_NESTING_IOMMU;
 
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
 		return VFIO_TYPE1v2_IOMMU;
 
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
 		return VFIO_TYPE1_IOMMU;
 
 	return -ENODEV;
@@ -1033,7 +1208,7 @@ static int vfio_map_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void *d
 	};
 
 	/* Map the guest memory for DMA (i.e. provide isolation) */
-	if (ioctl(vfio_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+	if (ioctl(vfio_host_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
 		ret = -errno;
 		pr_err("Failed to map 0x%llx -> 0x%llx (%llu) for DMA",
 		       dma_map.iova, dma_map.vaddr, dma_map.size);
@@ -1050,14 +1225,15 @@ static int vfio_unmap_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void
 		.iova = bank->guest_phys_addr,
 	};
 
-	ioctl(vfio_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	ioctl(vfio_host_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
 
 	return 0;
 }
 
 static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 {
-	int ret;
+	int ret = 0;
+	int container;
 	char group_node[VFIO_PATH_MAX_LEN];
 	struct vfio_group_status group_status = {
 		.argsz = sizeof(group_status),
@@ -1066,6 +1242,25 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 	snprintf(group_node, VFIO_PATH_MAX_LEN, VFIO_DEV_DIR "/%lu",
 		 group->id);
 
+	if (kvm->cfg.viommu) {
+		container = open(VFIO_DEV_NODE, O_RDWR);
+		if (container < 0) {
+			ret = -errno;
+			pr_err("cannot initialize private container\n");
+			return ret;
+		}
+
+		group->container = malloc(sizeof(struct vfio_guest_container));
+		if (!group->container)
+			return -ENOMEM;
+
+		group->container->fd = container;
+		group->container->kvm = kvm;
+		group->container->msi_doorbells = NULL;
+	} else {
+		container = vfio_host_container;
+	}
+
 	group->fd = open(group_node, O_RDWR);
 	if (group->fd == -1) {
 		ret = -errno;
@@ -1085,29 +1280,52 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 		return -EINVAL;
 	}
 
-	if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &vfio_container)) {
+	if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container)) {
 		ret = -errno;
 		pr_err("Failed to add IOMMU group %s to VFIO container",
 		       group_node);
 		return ret;
 	}
 
-	return 0;
+	if (container != vfio_host_container) {
+		struct vfio_iommu_type1_info info = {
+			.argsz = sizeof(info),
+		};
+
+		/* We really need v2 semantics for unmap-all */
+		ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
+		if (ret) {
+			ret = -errno;
+			pr_err("Failed to set IOMMU");
+			return ret;
+		}
+
+		ret = ioctl(container, VFIO_IOMMU_GET_INFO, &info);
+		if (ret)
+			pr_err("Failed to get IOMMU info");
+		else if (info.flags & VFIO_IOMMU_INFO_PGSIZES)
+			vfio_viommu_props.pgsize_mask = info.iova_pgsizes;
+	}
+
+	return ret;
 }
 
-static int vfio_container_init(struct kvm *kvm)
+static int vfio_groups_init(struct kvm *kvm)
 {
 	int api, i, ret, iommu_type;;
 
-	/* Create a container for our IOMMU groups */
-	vfio_container = open(VFIO_DEV_NODE, O_RDWR);
-	if (vfio_container == -1) {
+	/*
+	 * Create a container for our IOMMU groups. Even when using a viommu, we
+	 * still use this one for probing capabilities.
+	 */
+	vfio_host_container = open(VFIO_DEV_NODE, O_RDWR);
+	if (vfio_host_container == -1) {
 		ret = errno;
 		pr_err("Failed to open %s", VFIO_DEV_NODE);
 		return ret;
 	}
 
-	api = ioctl(vfio_container, VFIO_GET_API_VERSION);
+	api = ioctl(vfio_host_container, VFIO_GET_API_VERSION);
 	if (api != VFIO_API_VERSION) {
 		pr_err("Unknown VFIO API version %d", api);
 		return -ENODEV;
@@ -1119,15 +1337,20 @@ static int vfio_container_init(struct kvm *kvm)
 		return iommu_type;
 	}
 
-	/* Sanity check our groups and add them to the container */
 	for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
 		ret = vfio_group_init(kvm, &kvm->cfg.vfio_group[i]);
 		if (ret)
 			return ret;
 	}
 
+	if (kvm->cfg.viommu) {
+		close(vfio_host_container);
+		vfio_host_container = -1;
+		return 0;
+	}
+
 	/* Finalise the container */
-	if (ioctl(vfio_container, VFIO_SET_IOMMU, iommu_type)) {
+	if (ioctl(vfio_host_container, VFIO_SET_IOMMU, iommu_type)) {
 		ret = -errno;
 		pr_err("Failed to set IOMMU type %d for VFIO container",
 		       iommu_type);
@@ -1147,10 +1370,16 @@ static int vfio__init(struct kvm *kvm)
 	if (!kvm->cfg.num_vfio_groups)
 		return 0;
 
-	ret = vfio_container_init(kvm);
+	ret = vfio_groups_init(kvm);
 	if (ret)
 		return ret;
 
+	if (kvm->cfg.viommu) {
+		viommu = viommu_register(kvm, &vfio_viommu_props);
+		if (!viommu)
+			pr_err("could not register viommu");
+	}
+
 	ret = vfio_configure_iommu_groups(kvm);
 	if (ret)
 		return ret;
@@ -1162,17 +1391,27 @@ dev_base_init(vfio__init);
 static int vfio__exit(struct kvm *kvm)
 {
 	int i, fd;
+	struct vfio_guest_container *container;
 
 	if (!kvm->cfg.num_vfio_groups)
 		return 0;
 
 	for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
+		container = kvm->cfg.vfio_group[i].container;
 		fd = kvm->cfg.vfio_group[i].fd;
 		ioctl(fd, VFIO_GROUP_UNSET_CONTAINER);
 		close(fd);
+
+		if (container != NULL) {
+			close(container->fd);
+			free(container);
+		}
 	}
 
+	if (vfio_host_container == -1)
+		return 0;
+
 	kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM, vfio_unmap_mem_bank, NULL);
-	return close(vfio_container);
+	return close(vfio_host_container);
 }
 dev_base_exit(vfio__exit);
-- 
2.12.1
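The group setup above copies the host's iova_pgsizes bitmap into the viommu
pgsize_mask. As a minimal sketch (not part of the patch), assuming the usual
convention that a set bit N in the mask advertises support for pages of 2^N
bytes, a map request could be checked against the mask as follows. The helper
names smallest_granule() and map_is_aligned() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the lowest set bit of the mask is the smallest page. */
static uint64_t smallest_granule(uint64_t pgsize_mask)
{
	return pgsize_mask & (~pgsize_mask + 1);
}

/*
 * Hypothetical helper: a mapping is acceptable when both its IOVA and its
 * size are multiples of the smallest supported page size.
 */
static bool map_is_aligned(uint64_t pgsize_mask, uint64_t iova, uint64_t size)
{
	uint64_t granule = smallest_granule(pgsize_mask);

	/* An empty mask or an empty range cannot be mapped */
	if (!granule || !size)
		return false;

	return ((iova | size) & (granule - 1)) == 0;
}
```

With a mask advertising 4KiB, 2MiB and 1GiB pages, the granule is 4096, so a
request for IOVA 0x10000 and size 0x2000 passes while IOVA 0x10800 is rejected.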
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 13/15] virtio-iommu: debug via IPC
Add a new parameter to lkvm debug, '-i' or '--iommu'. Commands will be
added later. For the moment, rework the debug builtin to share dump
facilities with the '-d'/'--dump' parameter.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 builtin-debug.c             |  8 +++++++-
 include/kvm/builtin-debug.h |  6 ++++++
 include/kvm/iommu.h         |  5 +++++
 include/kvm/virtio-iommu.h  |  5 +++++
 kvm-ipc.c                   | 43 ++++++++++++++++++++++++-------------------
 virtio/iommu.c              | 14 ++++++++++++++
 6 files changed, 61 insertions(+), 20 deletions(-)
diff --git a/builtin-debug.c b/builtin-debug.c
index 4ae51d20..e39e2d09 100644
--- a/builtin-debug.c
+++ b/builtin-debug.c
@@ -5,6 +5,7 @@
 #include <kvm/parse-options.h>
 #include <kvm/kvm-ipc.h>
 #include <kvm/read-write.h>
+#include <kvm/virtio-iommu.h>
 
 #include <stdio.h>
 #include <string.h>
@@ -17,6 +18,7 @@ static int nmi = -1;
 static bool dump;
 static const char *instance_name;
 static const char *sysrq;
+static const char *iommu;
 
 static const char * const debug_usage[] = {
 	"lkvm debug [--all] [-n name] [-d] [-m vcpu]",
@@ -28,6 +30,7 @@ static const struct option debug_options[] = {
 	OPT_BOOLEAN('d', "dump", &dump, "Generate a debug dump from guest"),
 	OPT_INTEGER('m', "nmi", &nmi, "Generate NMI on VCPU"),
 	OPT_STRING('s', "sysrq", &sysrq, "sysrq", "Inject a sysrq"),
+	OPT_STRING('i', "iommu", &iommu, "params", "Debug virtual IOMMU"),
 	OPT_GROUP("Instance options:"),
 	OPT_BOOLEAN('a', "all", &all, "Debug all instances"),
 	OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
@@ -68,11 +71,14 @@ static int do_debug(const char *name, int sock)
 		cmd.sysrq = sysrq[0];
 	}
 
+	if (iommu && !viommu_parse_debug_string(iommu, &cmd.iommu))
+		cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_IOMMU;
+
 	r = kvm_ipc__send_msg(sock, KVM_IPC_DEBUG, sizeof(cmd), (u8 *)&cmd);
 	if (r < 0)
 		return r;
 
-	if (!dump)
+	if (!(cmd.dbg_type & KVM_DEBUG_CMD_DUMP_MASK))
 		return 0;
 
 	do {
diff --git a/include/kvm/builtin-debug.h b/include/kvm/builtin-debug.h
index efa02684..cd2155ae 100644
--- a/include/kvm/builtin-debug.h
+++ b/include/kvm/builtin-debug.h
@@ -2,16 +2,22 @@
 #define KVM__DEBUG_H
 
 #include <kvm/util.h>
+#include <kvm/iommu.h>
 #include <linux/types.h>
 
 #define KVM_DEBUG_CMD_TYPE_DUMP	(1 << 0)
 #define KVM_DEBUG_CMD_TYPE_NMI	(1 << 1)
 #define KVM_DEBUG_CMD_TYPE_SYSRQ (1 << 2)
+#define KVM_DEBUG_CMD_TYPE_IOMMU (1 << 3)
+
+#define KVM_DEBUG_CMD_DUMP_MASK \
+	(KVM_DEBUG_CMD_TYPE_IOMMU | KVM_DEBUG_CMD_TYPE_DUMP)
 
 struct debug_cmd_params {
 	u32 dbg_type;
 	u32 cpu;
 	char sysrq;
+	struct iommu_debug_params iommu;
 };
 
 int kvm_cmd_debug(int argc, const char **argv, const char *prefix);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 45a20f3b..60857fa5 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -1,6 +1,7 @@
 #ifndef KVM_IOMMU_H
 #define KVM_IOMMU_H
 
+#include <stdbool.h>
 #include <stdlib.h>
 
 #include "devices.h"
@@ -10,6 +11,10 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+struct iommu_debug_params {
+	bool				print_enabled;
+};
+
 /*
  * Test if mapping is present. If not, return an error but do not report it to
  * stderr
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
index 5532c82b..c9e36fb6 100644
--- a/include/kvm/virtio-iommu.h
+++ b/include/kvm/virtio-iommu.h
@@ -7,4 +7,9 @@ const struct iommu_properties *viommu_get_properties(void *dev);
 void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
 void viommu_unregister(struct kvm *kvm, void *cookie);
 
+struct iommu_debug_params;
+
+int viommu_parse_debug_string(const char *options, struct iommu_debug_params *);
+int viommu_debug(int fd, struct iommu_debug_params *);
+
 #endif
diff --git a/kvm-ipc.c b/kvm-ipc.c
index e07ad105..a8b56543 100644
--- a/kvm-ipc.c
+++ b/kvm-ipc.c
@@ -14,6 +14,7 @@
 #include "kvm/strbuf.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/8250-serial.h"
+#include "kvm/virtio-iommu.h"
 
 struct kvm_ipc_head {
 	u32 type;
@@ -424,31 +425,35 @@ static void handle_debug(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
 		pthread_kill(kvm->cpus[vcpu]->thread, SIGUSR1);
 	}
 
-	if (!(dbg_type & KVM_DEBUG_CMD_TYPE_DUMP))
-		return;
+	if (dbg_type & KVM_DEBUG_CMD_TYPE_IOMMU)
+		viommu_debug(fd, &params->iommu);
 
-	for (i = 0; i < kvm->nrcpus; i++) {
-		struct kvm_cpu *cpu = kvm->cpus[i];
+	if (dbg_type & KVM_DEBUG_CMD_TYPE_DUMP) {
+		for (i = 0; i < kvm->nrcpus; i++) {
+			struct kvm_cpu *cpu = kvm->cpus[i];
 
-		if (!cpu)
-			continue;
+			if (!cpu)
+				continue;
 
-		printout_done = 0;
+			printout_done = 0;
+
+			kvm_cpu__set_debug_fd(fd);
+			pthread_kill(cpu->thread, SIGUSR1);
+			/*
+			 * Wait for the vCPU to dump state before signalling
+			 * the next thread. Since this is debug code it does
+			 * not matter that we are burning CPU time a bit:
+			 */
+			while (!printout_done)
+				sleep(0);
+		}
 
-		kvm_cpu__set_debug_fd(fd);
-		pthread_kill(cpu->thread, SIGUSR1);
-		/*
-		 * Wait for the vCPU to dump state before signalling
-		 * the next thread. Since this is debug code it does
-		 * not matter that we are burning CPU time a bit:
-		 */
-		while (!printout_done)
-			sleep(0);
+		serial8250__inject_sysrq(kvm, 'p');
 	}
 
-	close(fd);
-
-	serial8250__inject_sysrq(kvm, 'p');
+	if (dbg_type & KVM_DEBUG_CMD_DUMP_MASK)
+		/* builtin-debug is reading, signal EOT */
+		close(fd);
 }
 
 int kvm_ipc__init(struct kvm *kvm)
diff --git a/virtio/iommu.c b/virtio/iommu.c
index 2e5a23ee..5973cef1 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -620,3 +620,17 @@ void viommu_unregister(struct kvm *kvm, void *viommu)
 {
 	free(viommu);
 }
+
+int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
+{
+	/* show instances numbers */
+	/* send command to instance */
+	/* - dump mappings */
+	/* - statistics */
+	return -ENOSYS;
+}
+
+int viommu_debug(int sock, struct iommu_debug_params *params)
+{
+	return -ENOSYS;
+}
-- 
2.12.1
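The patch above makes the debug client wait for a reply only when one of the
"dumping" command types is set. A minimal sketch of that decision, reusing the
flag values from include/kvm/builtin-debug.h; the helper
debug_cmd_expects_output() is hypothetical and only illustrates the mask test
performed in do_debug() and handle_debug().

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as in the patch */
#define KVM_DEBUG_CMD_TYPE_DUMP		(1 << 0)
#define KVM_DEBUG_CMD_TYPE_NMI		(1 << 1)
#define KVM_DEBUG_CMD_TYPE_SYSRQ	(1 << 2)
#define KVM_DEBUG_CMD_TYPE_IOMMU	(1 << 3)

#define KVM_DEBUG_CMD_DUMP_MASK \
	(KVM_DEBUG_CMD_TYPE_IOMMU | KVM_DEBUG_CMD_TYPE_DUMP)

/*
 * Hypothetical helper: true when the command produces output on the IPC
 * socket, in which case the client reads until the server closes the fd.
 */
static bool debug_cmd_expects_output(uint32_t dbg_type)
{
	return (dbg_type & KVM_DEBUG_CMD_DUMP_MASK) != 0;
}
```

An NMI or sysrq request on its own returns immediately, while any combination
that includes DUMP or IOMMU keeps the client reading until end of transmission.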
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 14/15] virtio-iommu: implement basic debug commands
Using debug printf with the virtual IOMMU can be extremely verbose. To
ease debugging, add a few commands that can be sent via IPC. Format for
commands is "cmd [iommu [address_space]]" (or cmd:[iommu:[address_space]]):
    $ lkvm debug -a -i list
    iommu 0 "viommu-vfio"
      ioas 1
        device 0x2                      # PCI bus
      ioas 2
        device 0x3
    iommu 1 "viommu-virtio"
      ioas 3
        device 0x10003                  # MMIO bus
      ioas 4
        device 0x6
    $ lkvm debug -a -i stats:1          # stats for viommu-virtio
    iommu 1 "viommu-virtio"
      kicks                 510         # virtio kicks from driver
      requests              510         # requests received
      ioas 3
        maps                1           # number of map requests
        unmaps              0           #     "    unmap   "
        resident            8192        # bytes currently mapped
        accesses            1           # number of device accesses
      ioas 4
        maps                290
        unmaps              4
        resident            1335296
        accesses            982
    $ lkvm debug -a -i "print 1, 2"     # Start debug print for
      ...                               # ioas 2 in iommu 1
      ...
      Info: VIOMMU map 0xffffffff000 -> 0x8f4e0000 (4096) to IOAS 2
      ...
    $ lkvm debug -a -i noprint          # Stop all debug print
We don't use atomics for statistics at the moment, since there is no
concurrent write on most of them. Only 'accesses' might be incremented
concurrently, so we might get imprecise values.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 include/kvm/iommu.h |  17 +++
 iommu.c             |  56 +++++++++-
 virtio/iommu.c      | 312 ++++++++++++++++++++++++++++++++++++++++++++++++----
 virtio/mmio.c       |   1 +
 virtio/pci.c        |   1 +
 5 files changed, 362 insertions(+), 25 deletions(-)
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 60857fa5..70a09306 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -11,7 +11,20 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+enum iommu_debug_action {
+	IOMMU_DEBUG_LIST,
+	IOMMU_DEBUG_STATS,
+	IOMMU_DEBUG_SET_PRINT,
+	IOMMU_DEBUG_DUMP,
+
+	IOMMU_DEBUG_NUM_ACTIONS,
+};
+
+#define IOMMU_DEBUG_SELECTOR_INVALID	((unsigned int)-1)
+
 struct iommu_debug_params {
+	enum iommu_debug_action		action;
+	unsigned int			selector[2];
 	bool				print_enabled;
 };
 
@@ -31,6 +44,8 @@ struct iommu_ops {
 	int (*detach)(void *, struct device_header *);
 	int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
 	int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+
+	int (*debug_address_space)(void *, int fd, struct iommu_debug_params *);
 };
 
 struct iommu_properties {
@@ -74,6 +89,8 @@ static inline struct device_header *iommu_get_device(u32 device_id)
 
 void *iommu_alloc_address_space(struct device_header *dev);
 void iommu_free_address_space(void *address_space);
+int iommu_debug_address_space(void *address_space, int fd,
+			      struct iommu_debug_params *params);
 
 int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
 	      int prot);
diff --git a/iommu.c b/iommu.c
index 2220e4b2..bc9fc631 100644
--- a/iommu.c
+++ b/iommu.c
@@ -9,6 +9,10 @@
 #include "kvm/mutex.h"
 #include "kvm/rbtree-interval.h"
 
+struct iommu_ioas_stats {
+	u64			accesses;
+};
+
 struct iommu_mapping {
 	struct rb_int_node	iova_range;
 	u64			phys;
@@ -18,8 +22,31 @@ struct iommu_mapping {
 struct iommu_ioas {
 	struct rb_root		mappings;
 	struct mutex		mutex;
+
+	struct iommu_ioas_stats	stats;
+	bool			debug_enabled;
 };
 
+static void iommu_dump(struct iommu_ioas *ioas, int fd)
+{
+	struct rb_node *node;
+	struct iommu_mapping *map;
+
+	mutex_lock(&ioas->mutex);
+
+	dprintf(fd, "START IOMMU DUMP [[[\n"); /* You did ask for it. */
+	for (node = rb_first(&ioas->mappings); node; node = rb_next(node)) {
+		struct rb_int_node *int_node = rb_int(node);
+		map = container_of(int_node, struct iommu_mapping, iova_range);
+
+		dprintf(fd, "%#llx-%#llx -> %#llx %#x\n", int_node->low,
+			int_node->high, map->phys, map->prot);
+	}
+	dprintf(fd, "]]] END IOMMU DUMP\n");
+
+	mutex_unlock(&ioas->mutex);
+}
+
 void *iommu_alloc_address_space(struct device_header *unused)
 {
 	struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
@@ -33,6 +60,27 @@ void *iommu_alloc_address_space(struct device_header *unused)
 	return ioas;
 }
 
+int iommu_debug_address_space(void *address_space, int fd,
+			      struct iommu_debug_params *params)
+{
+	struct iommu_ioas *ioas = address_space;
+
+	switch (params->action) {
+	case IOMMU_DEBUG_STATS:
+		dprintf(fd, "    accesses            %llu\n", ioas->stats.accesses);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		ioas->debug_enabled = params->print_enabled;
+		break;
+	case IOMMU_DEBUG_DUMP:
+		iommu_dump(ioas, fd);
+	default:
+		break;
+	}
+
+	return 0;
+}
+
 void iommu_free_address_space(void *address_space)
 {
 	struct iommu_ioas *ioas = address_space;
@@ -157,8 +205,12 @@ u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
 	out_addr = map->phys + (addr - node->low);
 	*out_size = min_t(size_t, node->high - addr + 1, size);
 
-	pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
-		 prot, out_addr);
+	if (ioas->debug_enabled)
+		pr_info("access %llx %zu/%zu %s%s -> %#llx", addr, *out_size,
+			size, prot & IOMMU_PROT_READ ? "R" : "",
+			prot & IOMMU_PROT_WRITE ? "W" : "", out_addr);
+
+	ioas->stats.accesses++;
 out_unlock:
 	mutex_unlock(&ioas->mutex);
 
diff --git a/virtio/iommu.c b/virtio/iommu.c
index 5973cef1..153b537a 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -20,6 +20,17 @@
 /* Max size */
 #define VIOMMU_DEFAULT_QUEUE_SIZE	256
 
+struct viommu_ioas_stats {
+	u64				map;
+	u64				unmap;
+	u64				resident;
+};
+
+struct viommu_stats {
+	u64				kicks;
+	u64				requests;
+};
+
 struct viommu_endpoint {
 	struct device_header		*dev;
 	struct viommu_ioas		*ioas;
@@ -36,9 +47,14 @@ struct viommu_ioas {
 
 	struct iommu_ops		*ops;
 	void				*priv;
+
+	bool				debug_enabled;
+	struct viommu_ioas_stats	stats;
 };
 
 struct viommu_dev {
+	u32				id;
+
 	struct virtio_device		vdev;
 	struct virtio_iommu_config	config;
 
@@ -49,29 +65,77 @@ struct viommu_dev {
 	struct thread_pool__job		job;
 
 	struct rb_root			address_spaces;
+	struct mutex			address_spaces_mutex;
 	struct kvm			*kvm;
+
+	struct list_head		list;
+
+	bool				debug_enabled;
+	struct viommu_stats		stats;
 };
 
 static int compat_id = -1;
 
+static long long viommu_ids;
+static LIST_HEAD(viommus);
+static DEFINE_MUTEX(viommus_mutex);
+
+#define ioas_debug(ioas, fmt, ...)					\
+	do {								\
+		if ((ioas)->debug_enabled)				\
+			pr_info("ioas[%d] " fmt, (ioas)->id, ##__VA_ARGS__); \
+	} while (0)
+
 static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
 					    u32 ioasid)
 {
 	struct rb_node *node;
-	struct viommu_ioas *ioas;
+	struct viommu_ioas *ioas, *found = NULL;
 
+	mutex_lock(&viommu->address_spaces_mutex);
 	node = viommu->address_spaces.rb_node;
 	while (node) {
 		ioas = container_of(node, struct viommu_ioas, node);
-		if (ioas->id > ioasid)
+		if (ioas->id > ioasid) {
 			node = node->rb_left;
-		else if (ioas->id < ioasid)
+		} else if (ioas->id < ioasid) {
 			node = node->rb_right;
-		else
-			return ioas;
+		} else {
+			found = ioas;
+			break;
+		}
 	}
+	mutex_unlock(&viommu->address_spaces_mutex);
 
-	return NULL;
+	return found;
+}
+
+static int viommu_for_each_ioas(struct viommu_dev *viommu,
+				int (*fun)(struct viommu_dev *viommu,
+					   struct viommu_ioas *ioas,
+					   void *data),
+				void *data)
+{
+	int ret = 0;
+	struct viommu_ioas *ioas;
+	struct rb_node *node, *next;
+
+	mutex_lock(&viommu->address_spaces_mutex);
+	node = rb_first(&viommu->address_spaces);
+	while (node) {
+		next = rb_next(node);
+		ioas = container_of(node, struct viommu_ioas, node);
+
+		ret = fun(viommu, ioas, data);
+		if (ret)
+			break;
+
+		node = next;
+	}
+
+	mutex_unlock(&viommu->address_spaces_mutex);
+
+	return ret;
 }
 
 static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
@@ -99,9 +163,12 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 	new_ioas->id		= ioasid;
 	new_ioas->ops		= ops;
 	new_ioas->priv		= ops->alloc_address_space(device);
+	new_ioas->debug_enabled	= viommu->debug_enabled;
 
 	/* A NULL priv pointer is valid. */
 
+	mutex_lock(&viommu->address_spaces_mutex);
+
 	node = &viommu->address_spaces.rb_node;
 	while (*node) {
 		ioas = container_of(*node, struct viommu_ioas, node);
@@ -114,6 +181,7 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 		} else {
 			pr_err("IOAS exists!");
 			free(new_ioas);
+			mutex_unlock(&viommu->address_spaces_mutex);
 			return NULL;
 		}
 	}
@@ -121,6 +189,8 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 	rb_link_node(&new_ioas->node, parent, node);
 	rb_insert_color(&new_ioas->node, &viommu->address_spaces);
 
+	mutex_unlock(&viommu->address_spaces_mutex);
+
 	return new_ioas;
 }
 
@@ -130,7 +200,9 @@ static void viommu_free_ioas(struct viommu_dev *viommu,
 	if (ioas->priv)
 		ioas->ops->free_address_space(ioas->priv);
 
+	mutex_lock(&viommu->address_spaces_mutex);
 	rb_erase(&ioas->node, &viommu->address_spaces);
+	mutex_unlock(&viommu->address_spaces_mutex);
 	free(ioas);
 }
 
@@ -178,8 +250,7 @@ static int viommu_detach_device(struct viommu_dev *viommu,
 	if (!ioas)
 		return -EINVAL;
 
-	pr_debug("detaching device %#lx from IOAS %u",
-		 device_to_iommu_id(device), ioas->id);
+	ioas_debug(ioas, "detaching device %#lx", device_to_iommu_id(device));
 
 	ret = device->iommu_ops->detach(ioas->priv, device);
 	if (!ret)
@@ -208,8 +279,6 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
 		return -ENODEV;
 	}
 
-	pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
-
 	vdev = device->iommu_data;
 	if (!vdev) {
 		vdev = viommu_alloc_device(device);
@@ -240,6 +309,9 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
 	if (ret && ioas->nr_devices == 0)
 		viommu_free_ioas(viommu, ioas);
 
+	if (!ret)
+		ioas_debug(ioas, "attached device %#x", device_id);
+
 	return ret;
 }
 
@@ -267,6 +339,7 @@ static int viommu_handle_detach(struct viommu_dev *viommu,
 static int viommu_handle_map(struct viommu_dev *viommu,
 			     struct virtio_iommu_req_map *map)
 {
+	int ret;
 	int prot = 0;
 	struct viommu_ioas *ioas;
 
@@ -294,15 +367,21 @@ static int viommu_handle_map(struct viommu_dev *viommu,
 	if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
 		prot |= IOMMU_PROT_EXEC;
 
-	pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
-		 phys_addr, size, ioasid);
+	ioas_debug(ioas, "map   %#llx -> %#llx (%llu)", virt_addr, phys_addr, size);
+
+	ret = ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+	if (!ret) {
+		ioas->stats.resident += size;
+		ioas->stats.map++;
+	}
 
-	return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+	return ret;
 }
 
 static int viommu_handle_unmap(struct viommu_dev *viommu,
 			       struct virtio_iommu_req_unmap *unmap)
 {
+	int ret;
 	struct viommu_ioas *ioas;
 
 	u32 ioasid	= le32_to_cpu(unmap->address_space);
@@ -315,10 +394,15 @@ static int viommu_handle_unmap(struct viommu_dev *viommu,
 		return -ESRCH;
 	}
 
-	pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
-		 ioasid);
+	ioas_debug(ioas, "unmap %#llx (%llu)", virt_addr, size);
+
+	ret = ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+	if (!ret) {
+		ioas->stats.resident -= size;
+		ioas->stats.unmap++;
+	}
 
-	return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+	return ret;
 }
 
 static size_t viommu_get_req_len(union virtio_iommu_req *req)
@@ -407,6 +491,8 @@ static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
 			continue;
 		}
 
+		viommu->stats.requests++;
+
 		req = iov[i].iov_base;
 		op = req->head.type;
 		expected_len = viommu_get_req_len(req) - sizeof(*tail);
@@ -458,6 +544,8 @@ static void viommu_command(struct kvm *kvm, void *dev)
 
 	vq = &viommu->vq;
 
+	viommu->stats.kicks++;
+
 	while (virt_queue__available(vq)) {
 		head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
 
@@ -594,6 +682,7 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 
 	viommu->queue_size		= VIOMMU_DEFAULT_QUEUE_SIZE;
 	viommu->address_spaces		= (struct rb_root)RB_ROOT;
+	viommu->address_spaces_mutex	= (struct mutex)MUTEX_INITIALIZER;
 	viommu->properties		= props;
 
 	viommu->config.page_sizes	= props->pgsize_mask ?: pgsize_mask;
@@ -607,6 +696,11 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 		return NULL;
 	}
 
+	mutex_lock(&viommus_mutex);
+	viommu->id = viommu_ids++;
+	list_add_tail(&viommu->list, &viommus);
+	mutex_unlock(&viommus_mutex);
+
 	pr_info("Loaded virtual IOMMU %s", props->name);
 
 	if (compat_id == -1)
@@ -616,21 +710,193 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 	return viommu;
 }
 
-void viommu_unregister(struct kvm *kvm, void *viommu)
+void viommu_unregister(struct kvm *kvm, void *dev)
 {
+	struct viommu_dev *viommu = dev;
+
+	mutex_lock(&viommus_mutex);
+	list_del(&viommu->list);
+	mutex_unlock(&viommus_mutex);
+
 	free(viommu);
 }
 
+const char *debug_usage =
+"  list [iommu [ioas]]            list iommus and address spaces\n"
+"  stats [iommu [ioas]]           display statistics\n"
+"  dump  [iommu [ioas]]           dump mappings\n"
+"  print [iommu [ioas]]           enable debug print\n"
+"  noprint [iommu [ioas]]         disable debug print\n"
+;
+
 int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
 {
-	/* show instances numbers */
-	/* send command to instance */
-	/* - dump mappings */
-	/* - statistics */
-	return -ENOSYS;
+	int pos = 0;
+	int ret = -EINVAL;
+	char *cur, *args = strdup(cmdline);
+	params->action = IOMMU_DEBUG_NUM_ACTIONS;
+
+	if (!args)
+		return -ENOMEM;
+
+	params->selector[0] = IOMMU_DEBUG_SELECTOR_INVALID;
+	params->selector[1] = IOMMU_DEBUG_SELECTOR_INVALID;
+
+	cur = strtok(args, " ,:");
+	while (cur) {
+		if (pos > 2)
+			break;
+
+		if (pos > 0) {
+			errno = 0;
+			params->selector[pos - 1] = strtoul(cur, NULL, 0);
+			if (errno) {
+				ret = -errno;
+				pr_err("Invalid number '%s'", cur);
+				break;
+			}
+		} else if (strncmp(cur, "list", 4) == 0) {
+			params->action = IOMMU_DEBUG_LIST;
+		} else if (strncmp(cur, "stats", 5) == 0) {
+			params->action = IOMMU_DEBUG_STATS;
+		} else if (strncmp(cur, "dump", 4) == 0) {
+			params->action = IOMMU_DEBUG_DUMP;
+		} else if (strncmp(cur, "print", 5) == 0) {
+			params->action = IOMMU_DEBUG_SET_PRINT;
+			params->print_enabled = true;
+		} else if (strncmp(cur, "noprint", 7) == 0) {
+			params->action = IOMMU_DEBUG_SET_PRINT;
+			params->print_enabled = false;
+		} else {
+			pr_err("Invalid command '%s'", cur);
+			break;
+		}
+
+		cur = strtok(NULL, " ,:");
+		pos++;
+		ret = 0;
+	}
+
+	free(args);
+
+	if (cur && cur[0])
+		pr_err("Ignoring argument '%s'", cur);
+
+	if (ret)
+		pr_info("Usage:\n%s", debug_usage);
+
+	return ret;
+}
+
+struct viommu_debug_context {
+	int				sock;
+	struct iommu_debug_params	*params;
+	bool				disp;
+};
+
+static int viommu_debug_ioas(struct viommu_dev *viommu,
+			     struct viommu_ioas *ioas,
+			     void *data)
+{
+	int ret = 0;
+	struct viommu_endpoint *vdev;
+	struct viommu_debug_context *ctx = data;
+
+	if (ctx->disp)
+		dprintf(ctx->sock, "  ioas %u\n", ioas->id);
+
+	switch (ctx->params->action) {
+	case IOMMU_DEBUG_LIST:
+		mutex_lock(&ioas->devices_mutex);
+		list_for_each_entry(vdev, &ioas->devices, list) {
+			dprintf(ctx->sock, "    device 0x%lx\n",
+				device_to_iommu_id(vdev->dev));
+		}
+		mutex_unlock(&ioas->devices_mutex);
+		break;
+	case IOMMU_DEBUG_STATS:
+		dprintf(ctx->sock, "    maps                %llu\n",
+			ioas->stats.map);
+		dprintf(ctx->sock, "    unmaps              %llu\n",
+			ioas->stats.unmap);
+		dprintf(ctx->sock, "    resident            %llu\n",
+			ioas->stats.resident);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		ioas->debug_enabled = ctx->params->print_enabled;
+		break;
+	default:
+		ret = -ENOSYS;
+
+	}
+
+	if (ioas->ops->debug_address_space)
+		ret = ioas->ops->debug_address_space(ioas->priv, ctx->sock,
+						     ctx->params);
+
+	return ret;
+}
+
+static int viommu_debug_iommu(struct viommu_dev *viommu,
+			      struct viommu_debug_context *ctx)
+{
+	struct viommu_ioas *ioas;
+
+	if (ctx->disp)
+		dprintf(ctx->sock, "iommu %u \"%s\"\n", viommu->id,
+			viommu->properties->name);
+
+	if (ctx->params->selector[1] != IOMMU_DEBUG_SELECTOR_INVALID) {
+		ioas = viommu_find_ioas(viommu, ctx->params->selector[1]);
+		return ioas ? viommu_debug_ioas(viommu, ioas, ctx) : -ESRCH;
+	}
+
+	switch (ctx->params->action) {
+	case IOMMU_DEBUG_STATS:
+		dprintf(ctx->sock, "  kicks                 %llu\n",
+			viommu->stats.kicks);
+		dprintf(ctx->sock, "  requests              %llu\n",
+			viommu->stats.requests);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		viommu->debug_enabled = ctx->params->print_enabled;
+		break;
+	default:
+		break;
+	}
+
+	return viommu_for_each_ioas(viommu, viommu_debug_ioas, ctx);
 }
 
 int viommu_debug(int sock, struct iommu_debug_params *params)
 {
-	return -ENOSYS;
+	int ret = -ESRCH;
+	bool match;
+	struct viommu_dev *viommu;
+	bool any = (params->selector[0] == IOMMU_DEBUG_SELECTOR_INVALID);
+
+	struct viommu_debug_context ctx = {
+		.sock		= sock,
+		.params		= params,
+	};
+
+	if (params->action == IOMMU_DEBUG_LIST ||
+	    params->action == IOMMU_DEBUG_STATS)
+		ctx.disp = true;
+
+	mutex_lock(&viommus_mutex);
+	list_for_each_entry(viommu, &viommus, list) {
+		match = (params->selector[0] == viommu->id);
+		if (match || any) {
+			ret = viommu_debug_iommu(viommu, &ctx);
+			if (ret || match)
+				break;
+		}
+	}
+	mutex_unlock(&viommus_mutex);
+
+	if (ret)
+		dprintf(sock, "error: %s\n", strerror(-ret));
+
+	return ret;
 }
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 699d4403..7d39120a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -307,6 +307,7 @@ static struct iommu_ops virtio_mmio_iommu_ops = {
 	.get_properties		= virtio__iommu_get_properties,
 	.alloc_address_space	= iommu_alloc_address_space,
 	.free_address_space	= iommu_free_address_space,
+	.debug_address_space	= iommu_debug_address_space,
 	.attach			= virtio_mmio_iommu_attach,
 	.detach			= virtio_mmio_iommu_detach,
 	.map			= iommu_map,
diff --git a/virtio/pci.c b/virtio/pci.c
index c9f0e558..c5d30eb2 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -442,6 +442,7 @@ static struct iommu_ops virtio_pci_iommu_ops = {
 	.get_properties		= virtio__iommu_get_properties,
 	.alloc_address_space	= iommu_alloc_address_space,
 	.free_address_space	= iommu_free_address_space,
+	.debug_address_space	= iommu_debug_address_space,
 	.attach			= virtio_pci_iommu_attach,
 	.detach			= virtio_pci_iommu_detach,
 	.map			= iommu_map,
-- 
2.12.1
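The "cmd [iommu [address_space]]" format parsed by viommu_parse_debug_string()
can be illustrated with a standalone sketch. This is not the patch's code: the
names debug_cmd, parse_debug_cmd() and the ACT_* values are hypothetical, and
unlike the patch it matches command names exactly (strcmp rather than strncmp)
and rejects trailing garbage in numbers instead of ignoring it.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SELECTOR_INVALID	((unsigned int)-1)

enum action { ACT_LIST, ACT_STATS, ACT_DUMP, ACT_PRINT, ACT_NOPRINT, ACT_NONE };

struct debug_cmd {
	enum action	action;
	unsigned int	selector[2];	/* [0] = iommu, [1] = address space */
};

static const struct {
	const char	*name;
	enum action	action;
} commands[] = {
	{ "list", ACT_LIST }, { "stats", ACT_STATS }, { "dump", ACT_DUMP },
	{ "print", ACT_PRINT }, { "noprint", ACT_NOPRINT },
};

/* Returns 0 on success, -EINVAL on any malformed input. */
static int parse_debug_cmd(const char *cmdline, struct debug_cmd *cmd)
{
	char buf[64], *tok, *end;
	size_t i;
	int pos = 0;

	cmd->action = ACT_NONE;
	cmd->selector[0] = cmd->selector[1] = SELECTOR_INVALID;

	/* strtok modifies its input, so work on a bounded local copy */
	if (strlen(cmdline) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, cmdline);

	for (tok = strtok(buf, " ,:"); tok; tok = strtok(NULL, " ,:"), pos++) {
		if (pos == 0) {
			/* First token is the command name */
			for (i = 0; i < sizeof(commands) / sizeof(commands[0]); i++)
				if (!strcmp(tok, commands[i].name))
					cmd->action = commands[i].action;
			if (cmd->action == ACT_NONE)
				return -EINVAL;
		} else if (pos <= 2) {
			/* Optional iommu and address space selectors */
			unsigned long val;

			errno = 0;
			val = strtoul(tok, &end, 0);
			if (errno || *end != '\0')
				return -EINVAL;
			cmd->selector[pos - 1] = val;
		} else {
			return -EINVAL;	/* too many arguments */
		}
	}

	return pos ? 0 : -EINVAL;
}
```

Both separators shown in the commit message are accepted: "stats:0" selects
IOMMU 0 and leaves the address space selector invalid (meaning "all"), while
"print 1, 2" selects address space 2 of IOMMU 1.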
Jean-Philippe Brucker
2017-Apr-07  19:24 UTC
[RFC PATCH kvmtool 15/15] virtio: use virtio-iommu when available
This is for development only. Virtual devices might blow up unexpectedly.
In general it seems to work (slowing devices down by a factor of two of
course). virtio-scsi, virtio-rng and virtio-balloon are still untested.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 virtio/core.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/virtio/core.c b/virtio/core.c
index 66e0cecb..4ca632f9 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,4 +1,5 @@
 #include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
 #include <linux/virtio_ring.h>
 #include <linux/types.h>
 #include <sys/uio.h>
@@ -369,6 +370,8 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 {
 	void *virtio;
 
+	vdev->use_iommu = kvm->cfg.viommu && subsys_id != VIRTIO_ID_IOMMU;
+
 	switch (trans) {
 	case VIRTIO_PCI:
 		virtio = calloc(sizeof(struct virtio_pci), 1);
-- 
2.12.1
On Fri, Apr 07, 2017 at 08:17:44PM +0100, Jean-Philippe Brucker wrote:
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.

Thanks, this is very interesting. I am ready to read it all, but I really
would like you to expand some more on the motivation for this work.
Productising this would be quite a bit of work. Spending just 6 lines on
motivation seems somewhat disproportionate. In particular, do you have
any specific efficiency measurements or estimates that you can share?

-- 
MST
Hi All,

We have drivers/vfio/vfio_iommu_type1.c. what is type1 iommu? Is it
w.r.t vfio layer it is being referred?

Is there type 2 IOMMU w.r.t vfio? If so what is it?

Regards,
Valmiki
On Mon, 10 Apr 2017 08:00:45 +0530 valmiki <valmikibow at gmail.com> wrote:
> Hi All,
>
> We have drivers/vfio/vfio_iommu_type1.c. what is type1 iommu? Is it
> w.r.t vfio layer it is being referred?
>
> Is there type 2 IOMMU w.r.t vfio? If so what is it?

type1 is the 1st type. It's an arbitrary name. There is no type2, yet.
Jean-Philippe Brucker
2017-Apr-10  18:39 UTC
[virtio-dev] Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
On 07/04/17 22:19, Michael S. Tsirkin wrote:
> On Fri, Apr 07, 2017 at 08:17:44PM +0100, Jean-Philippe Brucker wrote:
>> There are a number of advantages in a paravirtualized IOMMU over a full
>> emulation. It is portable and could be reused on different architectures.
>> It is easier to implement than a full emulation, with less state tracking.
>> It might be more efficient in some cases, with less context switches to
>> the host and the possibility of in-kernel emulation.
>
> Thanks, this is very interesting. I am ready to read it all, but I really
> would like you to expand some more on the motivation for this work.
> Productising this would be quite a bit of work. Spending just 6 lines on
> motivation seems somewhat disproportionate. In particular, do you have
> any specific efficiency measurements or estimates that you can share?

The main motivation for this work is to bring IOMMU virtualization to the
ARM world. We don't have any at the moment, and a full ARM SMMU
virtualization solution would be counter-productive. We would have to do
it for SMMUv2, for the completely orthogonal SMMUv3, and for any future
version of the architecture. Doing so in userspace might be acceptable,
but then for performance reasons people will want in-kernel emulation of
every IOMMU variant out there, which is a maintenance and security
nightmare.

A single generic vIOMMU is preferable because it reduces maintenance cost
and attack surface. The transport code is the same as any virtio device,
both for userspace and in-kernel implementations. So instead of rewriting
everything from scratch (and the lot of bugs that go with it) for each
IOMMU variation, we reuse well-tested code for transport and write the
emulation layer once and for all. Note that this work applies to any
architecture with an IOMMU, not only ARM and their partners'.

Introducing an IOMMU specially designed for virtualization allows us to
get rid of complex state tracking inherent to full IOMMU emulations. With
a full emulation, all guest accesses to page table and configuration
structures have to be trapped and interpreted. A Virtio interface provides
well-defined semantics and doesn't need to guess what the guest is trying
to do. It transmits requests made from guest device drivers to host IOMMU
almost unaltered, removing the intermediate layer of arch-specific
configuration structures and page tables.

Using a portable standard like Virtio also allows for efficient IOMMU
virtualization when guest and host are built for different architectures
(for instance when using Qemu TCG.) In-kernel emulation would still work
with vhost-iommu, but platform-specific vIOMMUs would have to stay in
userspace.

I don't have any measurements at the moment, it is a bit early for that.
The kvmtool example was developed on a software model and is mostly here
for illustrative purpose, a Qemu implementation would be more suitable for
performance analysis. I wouldn't be able to give meaning to these numbers
anyway, since on ARM we don't have any existing solution to compare it
against. One could compare the complexity of handling guest accesses and
parsing page tables in Qemu's VT-d emulation with reading a chain of
buffers in Virtio, for a very rough estimate.

Thanks,
Jean-Philippe
On 2017-04-08 03:17, Jean-Philippe Brucker wrote:
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
>
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow to bind page tables to devices.
>
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.

I like the idea. Considering the complexity of IOMMU hardware, I believe
we don't want to maintain, and chase bugs in, three or more different
IOMMU implementations in either userspace or the kernel.

Thanks

>
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
>
> Scenario 1: a hardware device passed through twice via VFIO
>
>    MEM____pIOMMU________PCI device________________________       HARDWARE
>             |     (2b)                                    \
>   ----------|-------------+-------------+------------------\-------------
>             |             :     KVM     :                   \
>             |             :             :                    \
>        pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>             |             :        |    :          |           \
>           VFIO            :        |    :        VFIO           \
>             |             :        |    :          |             \
>             |             :        |    :          |             /
>   ----------|-------------+--------|----+----------|------------/--------
>             |                      |    :          |           /
>             | (1c)            (1b) |    :     (1a) |          / (2a)
>             |                      |    :          |         /
>             |                      |    :          |        /   USERSPACE
>             |___virtio-iommu dev___|    :        net drv___/
>                                         :
>   --------------------------------------+--------------------------------
>                  HOST                   :             GUEST
>
> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>        (VIRTIO_IOMMU_T_MAP).
>     c. The mapping request is relayed to the physical IOMMU through VFIO.
>
> (2) a. The guest userspace driver can now instruct the device to directly
>        access the buffer at IOVA
>     b. IOVA accesses from the device are translated into physical
>        addresses by the IOMMU.
>
> Scenario 2: a virtual net device behind a virtual IOMMU.
>
>    MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>          |         |      :             :
>     pIOMMU drv     |      :             :
>              \     |      :      _____________virtio-net drv      KERNEL
>               \_net drv   :     |       :          / (1a)
>                    |      :     |       :         /
>                   tap     :     |    ________virtio-iommu drv
>                    |      :     |   |   : (1b)
>   -----------------|------+-----|---|---+-------------------------------
>                    |            |   |   :
>                    |_virtio-net_|   |   :
>                          / (2)      |   :
>                         /           |   :                     USERSPACE
>               virtio-iommu dev______|   :
>                                         :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
>
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>     b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
>     IOMMU.
>
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
> virtio-net and tap.
>
>
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
>
> 1. Firmware note,
> 2. device operations (draft for the virtio specification),
> 3. future work/possible improvements.
>
> Just to be clear on the terms I'm using:
>
>   pIOMMU   physical IOMMU, controlling DMA accesses from physical devices
>   vIOMMU   virtual IOMMU (virtio-iommu), controlling DMA accesses from
>            physical and virtual devices to guest memory.
>   GVA, GPA, HVA, HPA
>            Guest/Host Virtual/Physical Address
>   IOVA     I/O Virtual Address, the address accessed by a device doing DMA
>            through an IOMMU. In the context of a guest OS, IOVA is GVA.
>
> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
> specification draft in RFC 2/3 is also BSD 3-clause.
>
>
> This proposal may be involuntarily centered around ARM architectures at
> times. Any feedback would be appreciated, especially regarding other IOMMU
> architectures.
>
> Thanks,
> Jean-Philippe
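
As an illustration of steps (1) and (2b) in the quoted scenario 1, the
following is a minimal sketch, not code from the RFC, of what a map request
ultimately installs and how a device access is then resolved. The names
example_mapping and example_translate are hypothetical, and a linear table
stands in for whatever structure a real IOMMU uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one contiguous IOVA range -> physical range. */
struct example_mapping {
	uint64_t iova;	/* first I/O virtual address of the range */
	uint64_t phys;	/* physical address that 'iova' maps to */
	uint64_t size;	/* length of the range in bytes */
};

/*
 * Translate an IOVA by scanning the table. Returns true and stores the
 * physical address in *phys_out on a hit. Returns false when the address
 * is not mapped (a real IOMMU would report a fault instead).
 */
static bool example_translate(const struct example_mapping *tbl, size_t n,
			      uint64_t iova, uint64_t *phys_out)
{
	for (size_t i = 0; i < n; i++) {
		if (iova >= tbl[i].iova && iova - tbl[i].iova < tbl[i].size) {
			*phys_out = tbl[i].phys + (iova - tbl[i].iova);
			return true;
		}
	}
	return false;
}
```

With a single entry mapping IOVA 0x10000 to physical 0x80000000 for 0x2000
bytes, an access at IOVA 0x10010 resolves to 0x80000010, and an access at
0x12000 misses.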
> From: Jason Wang
> Sent: Wednesday, April 12, 2017 5:07 PM
>
> On 2017-04-08 03:17, Jean-Philippe Brucker wrote:
> > This is the initial proposal for a paravirtualized IOMMU device using
> > virtio transport. It contains a description of the device, a Linux driver,
> > and a toy implementation in kvmtool. With this prototype, you can
> > translate DMA to guest memory from emulated (virtio), or passed-through
> > (VFIO) devices.
> >
> > In its simplest form, implemented here, the device handles map/unmap
> > requests from the guest. Future extensions proposed in "RFC 3/3" should
> > allow to bind page tables to devices.
> >
> > There are a number of advantages in a paravirtualized IOMMU over a full
> > emulation. It is portable and could be reused on different architectures.
> > It is easier to implement than a full emulation, with less state tracking.
> > It might be more efficient in some cases, with less context switches to
> > the host and the possibility of in-kernel emulation.
>
> I like the idea. Considering the complexity of IOMMU hardware, I believe
> we don't want to maintain, and chase bugs in, three or more different
> IOMMU implementations in either userspace or the kernel.
>

Though there are definitely positive things about the pvIOMMU approach, it
also has some limitations:

- Existing IOMMU implementations have been in old distros for quite some
  time, while the pvIOMMU driver will only land in future distros. Doing
  pvIOMMU only means we completely drop support for old distros in VMs;

- There is a similar situation with supporting other guest OSes, e.g.
  Windows. The IOMMU is a key kernel component, and I'm not sure a pvIOMMU
  exposed through virtio can be recognized in those OSes (unlike an
  ordinary virtio device driver);

I would imagine both fully emulated IOMMUs and pvIOMMU would co-exist for
some time for the above reasons. Someday, when pvIOMMU is mature and
widespread enough in the ecosystem (and feature-wise comparable to fully
emulated IOMMUs for all vendors), then we may make a call.

Thanks,
Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
>
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
>
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow to bind page tables to devices.
>
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.
>
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
>
> Scenario 1: a hardware device passed through twice via VFIO
>
>    MEM____pIOMMU________PCI device________________________       HARDWARE
>             |     (2b)                                    \
>   ----------|-------------+-------------+------------------\-------------
>             |             :     KVM     :                   \
>             |             :             :                    \
>        pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>             |             :        |    :          |           \
>           VFIO            :        |    :        VFIO           \
>             |             :        |    :          |             \
>             |             :        |    :          |             /
>   ----------|-------------+--------|----+----------|------------/--------
>             |                      |    :          |           /
>             | (1c)            (1b) |    :     (1a) |          / (2a)
>             |                      |    :          |         /
>             |                      |    :          |        /   USERSPACE
>             |___virtio-iommu dev___|    :        net drv___/
>                                         :
>   --------------------------------------+--------------------------------
>                  HOST                   :             GUEST
>

Usually people draw such layers in reverse order, e.g. hw at the bottom,
then kernel in the middle, then user at the top. :-)

> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>        (VIRTIO_IOMMU_T_MAP).
>     c. The mapping request is relayed to the physical IOMMU through VFIO.
>
> (2) a. The guest userspace driver can now instruct the device to directly
>        access the buffer at IOVA
>     b. IOVA accesses from the device are translated into physical
>        addresses by the IOMMU.
>
> Scenario 2: a virtual net device behind a virtual IOMMU.
>
>    MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>          |         |      :             :
>     pIOMMU drv     |      :             :
>              \     |      :      _____________virtio-net drv      KERNEL
>               \_net drv   :     |       :          / (1a)
>                    |      :     |       :         /
>                   tap     :     |    ________virtio-iommu drv
>                    |      :     |   |   : (1b)
>   -----------------|------+-----|---|---+-------------------------------
>                    |            |   |   :
>                    |_virtio-net_|   |   :
>                          / (2)      |   :
>                         /           |   :                     USERSPACE
>               virtio-iommu dev______|   :
>                                         :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
>
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>     b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
>     IOMMU.
>
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
> virtio-net and tap.
>
>
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
>
> 1. Firmware note,
> 2. device operations (draft for the virtio specification),
> 3. future work/possible improvements.
>
> Just to be clear on the terms I'm using:
>
>   pIOMMU   physical IOMMU, controlling DMA accesses from physical devices
>   vIOMMU   virtual IOMMU (virtio-iommu), controlling DMA accesses from
>            physical and virtual devices to guest memory.

Maybe it is clearer to say it controls 'virtual' DMA accesses, since we're
essentially doing DMA virtualization here. Otherwise it reads a bit
confusingly, since DMA accesses from a physical device should be
controlled by the pIOMMU.

>   GVA, GPA, HVA, HPA
>            Guest/Host Virtual/Physical Address
>   IOVA     I/O Virtual Address, the address accessed by a device doing DMA
>            through an IOMMU. In the context of a guest OS, IOVA is GVA.

This statement is not accurate. For kernel DMA protection, it is a
per-device standalone address space (definitely nothing to do with GVA).
For user DMA protection, the user space driver decides how it wants to
construct the IOVA address space: it could be a standalone one, or reuse
GVA. In the virtualization case it is either GPA (w/o vIOMMU) or guest
IOVA (w/ IOMMU and guest creates IOVA space).

Anyway, the IOVA concept is clear. Simply removing the example would
probably still be clear. :-)

>
> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
> specification draft in RFC 2/3 is also BSD 3-clause.
>
>
> This proposal may be involuntarily centered around ARM architectures at
> times. Any feedback would be appreciated, especially regarding other IOMMU
> architectures.
>

Thanks for doing this. I will definitely look at them in detail and give
feedback.

Thanks
Kevin
Jean-Philippe Brucker
2017-Apr-13  13:12 UTC
[RFC 0/3] virtio-iommu: a paravirtualized IOMMU
On 13/04/17 09:41, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> This is the initial proposal for a paravirtualized IOMMU device using
>> virtio transport. It contains a description of the device, a Linux driver,
>> and a toy implementation in kvmtool. With this prototype, you can
>> translate DMA to guest memory from emulated (virtio), or passed-through
>> (VFIO) devices.
>>
>> In its simplest form, implemented here, the device handles map/unmap
>> requests from the guest. Future extensions proposed in "RFC 3/3" should
>> allow to bind page tables to devices.
>>
>> There are a number of advantages in a paravirtualized IOMMU over a full
>> emulation. It is portable and could be reused on different architectures.
>> It is easier to implement than a full emulation, with less state tracking.
>> It might be more efficient in some cases, with less context switches to
>> the host and the possibility of in-kernel emulation.
>>
>> When designing it and writing the kvmtool device, I considered two main
>> scenarios, illustrated below.
>>
>> Scenario 1: a hardware device passed through twice via VFIO
>>
>>    MEM____pIOMMU________PCI device________________________       HARDWARE
>>             |     (2b)                                    \
>>   ----------|-------------+-------------+------------------\-------------
>>             |             :     KVM     :                   \
>>             |             :             :                    \
>>        pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>>             |             :        |    :          |           \
>>           VFIO            :        |    :        VFIO           \
>>             |             :        |    :          |             \
>>             |             :        |    :          |             /
>>   ----------|-------------+--------|----+----------|------------/--------
>>             |                      |    :          |           /
>>             | (1c)            (1b) |    :     (1a) |          / (2a)
>>             |                      |    :          |         /
>>             |                      |    :          |        /   USERSPACE
>>             |___virtio-iommu dev___|    :        net drv___/
>>                                         :
>>   --------------------------------------+--------------------------------
>>                  HOST                   :             GUEST
>>
>
> Usually people draw such layers in reverse order, e.g. hw at the bottom,
> then kernel in the middle, then user at the top. :-)

Alright, I'll keep that in mind.

>> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>>        buffer with mmap, obtaining virtual address VA. It then sends a
>>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>>     b. The mapping request is relayed to the host through virtio
>>        (VIRTIO_IOMMU_T_MAP).
>>     c. The mapping request is relayed to the physical IOMMU through VFIO.
>>
>> (2) a. The guest userspace driver can now instruct the device to directly
>>        access the buffer at IOVA
>>     b. IOVA accesses from the device are translated into physical
>>        addresses by the IOMMU.
>>
>> Scenario 2: a virtual net device behind a virtual IOMMU.
>>
>>    MEM__pIOMMU___PCI device                                     HARDWARE
>>          |         |
>>   -------|---------|------+-------------+-------------------------------
>>          |         |      :     KVM     :
>>          |         |      :             :
>>     pIOMMU drv     |      :             :
>>              \     |      :      _____________virtio-net drv      KERNEL
>>               \_net drv   :     |       :          / (1a)
>>                    |      :     |       :         /
>>                   tap     :     |    ________virtio-iommu drv
>>                    |      :     |   |   : (1b)
>>   -----------------|------+-----|---|---+-------------------------------
>>                    |            |   |   :
>>                    |_virtio-net_|   |   :
>>                          / (2)      |   :
>>                         /           |   :                     USERSPACE
>>               virtio-iommu dev______|   :
>>                                         :
>>   --------------------------------------+-------------------------------
>>                  HOST                   :             GUEST
>>
>> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>>     b. The mapping requests are relayed to the host through virtio.
>> (2) The virtio-net device now needs to access any guest memory via the
>>     IOMMU.
>>
>> Physical and virtual IOMMUs are completely dissociated. The net driver is
>> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
>> virtio-net and tap.
>>
>>
>> The description itself seemed too long for a single email, so I split it
>> into three documents, and will attach Linux and kvmtool patches to this
>> email.
>>
>> 1. Firmware note,
>> 2. device operations (draft for the virtio specification),
>> 3. future work/possible improvements.
>>
>> Just to be clear on the terms I'm using:
>>
>>   pIOMMU   physical IOMMU, controlling DMA accesses from physical devices
>>   vIOMMU   virtual IOMMU (virtio-iommu), controlling DMA accesses from
>>            physical and virtual devices to guest memory.
>
> Maybe it is clearer to say it controls 'virtual' DMA accesses, since we're
> essentially doing DMA virtualization here. Otherwise it reads a bit
> confusingly, since DMA accesses from a physical device should be
> controlled by the pIOMMU.
>
>>   GVA, GPA, HVA, HPA
>>            Guest/Host Virtual/Physical Address
>>   IOVA     I/O Virtual Address, the address accessed by a device doing DMA
>>            through an IOMMU. In the context of a guest OS, IOVA is GVA.
>
> This statement is not accurate. For kernel DMA protection, it is a
> per-device standalone address space (definitely nothing to do with GVA).
> For user DMA protection, the user space driver decides how it wants to
> construct the IOVA address space: it could be a standalone one, or reuse
> GVA. In the virtualization case it is either GPA (w/o vIOMMU) or guest
> IOVA (w/ IOMMU and guest creates IOVA space).
>
> Anyway, the IOVA concept is clear. Simply removing the example would
> probably still be clear. :-)

Ok, I dropped most IOVA references from the RFC to avoid ambiguity anyway.
I'll tidy up my so-called clarifications next time :)

Thanks,
Jean-Philippe

>>
>> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
>> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
>> specification draft in RFC 2/3 is also BSD 3-clause.
>>
>>
>> This proposal may be involuntarily centered around ARM architectures at
>> times. Any feedback would be appreciated, especially regarding other IOMMU
>> architectures.
>>
>
> Thanks for doing this. I will definitely look at them in detail and give
> feedback.
>
> Thanks
> Kevin
>
>
Tian, Kevin
2017-Apr-18  09:51 UTC
[RFC 1/3] virtio-iommu: firmware description of the virtual topology
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
>
> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> it is linked to other virtual or assigned devices. So before jumping into
> device operations, we need to define a way for the guest to discover the
> virtual IOMMU and the devices it translates.
>
> The host must describe the relation between IOMMU and devices to the guest
> using either device-tree or ACPI. The virtual IOMMU identifies each

Do you plan to support both device tree and ACPI?

> virtual device with a 32-bit ID, that we will call "Device ID" in this
> document. Device IDs are not necessarily unique system-wide, but they may
> not overlap within a single virtual IOMMU. Device ID of passed-through
> devices do not need to match IDs seen by the physical IOMMU.
>
> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> because with PCI the IOMMU interface would itself be an endpoint, and
> existing firmware interfaces don't allow to describe IOMMU<->master
> relations between PCI endpoints.

I'm not familiar with the virtio-mmio mechanism. I'm curious how devices
in virtio-mmio are enumerated today. Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate the devices behind each vIOMMU?

I ask because each vendor has its own enumeration methods. ARM has device
tree and ACPI IORT. AMD has ACPI IVRS and device tree (same format as
ARM?). Intel has ACPI DMAR and sub-tables. Your current proposal looks
like it follows the ARM definitions, which I'm not sure are extensible
enough to cover features defined only in other vendors' structures.

Since the purpose of this series is to go para-virtualized, why not also
para-virtualize and simplify the enumeration method? For example, we may
define a query interface through vIOMMU registers to allow the guest to
query whether a device belongs to that vIOMMU. Then we could even remove
the use of any enumeration structure completely... Just a quick example,
for which I may not have thought through all the pros and cons. :-)

>
> The following diagram describes a situation where two virtual IOMMUs
> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> domains, in which each function has a 16-bits requester ID. In order for
> the vIOMMU to differentiate guest requests targeted at devices in each
> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> domains and a collection of platform devices.
>
>                    Device ID    Requester ID
>               /       0x0             0x0      \
>              /         |               |        PCI domain 1
>             /      0xffff          0xffff      /
>     vIOMMU 1
>             \     0x10000             0x0      \
>              \         |               |        PCI domain 2
>               \   0x1ffff          0xffff      /
>
>               /       0x0                      \
>              /         |                        platform devices
>             /      0x1fff                      /
>     vIOMMU 2
>             \      0x2000             0x0      \
>              \         |               |        PCI domain 3
>               \   0x11fff          0xffff      /
>

Shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that the
device ID is 16-bit?

Thanks
Kevin
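
One possible reading of the quoted diagram is that each range of Device
IDs is a base offset chosen per vIOMMU, plus the requester ID within the
PCI domain. The sketch below follows that reading only; it is not taken
from the RFC, and the names example_id_range, example_viommu2 and
example_rid_to_device_id are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one range of Device IDs behind a vIOMMU. */
struct example_id_range {
	uint32_t dev_base;	/* first Device ID of the range */
	uint32_t rid_base;	/* first requester ID (or platform index) */
	uint32_t count;		/* number of IDs in the range */
};

/* vIOMMU 2 from the diagram: platform devices, then PCI domain 3. */
static const struct example_id_range example_viommu2[] = {
	{ 0x0,    0x0, 0x2000  },	/* platform devices: 0x0..0x1fff */
	{ 0x2000, 0x0, 0x10000 },	/* PCI domain 3: 0x2000..0x11fff */
};

/*
 * Convert a requester ID into the Device ID seen by the vIOMMU.
 * Returns false if the requester ID falls outside the range.
 */
static bool example_rid_to_device_id(const struct example_id_range *r,
				     uint32_t rid, uint32_t *dev_id)
{
	if (rid < r->rid_base || rid - r->rid_base >= r->count)
		return false;
	*dev_id = r->dev_base + (rid - r->rid_base);
	return true;
}
```

Under this assumption, requester ID 0xffff in PCI domain 3 becomes Device
ID 0x2000 + 0xffff = 0x11fff, which matches the last row of the diagram.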
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
>

[...]

> II. Feature bits
> ================
>
> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>  Available range of virtual addresses is described in input_range

Usually only the maximum supported address bits are important. I'm curious
whether you see a situation where the low end of the address space is not
usable (since you have both start/end defined later)?

[...]

> 1. Attach device
> ----------------
>
> struct virtio_iommu_req_attach {
>         le32    address_space;
>         le32    device;
>         le32    flags/reserved;
> };
>
> Attach a device to an address space. 'address_space' is an identifier
> unique to the guest. If the address space doesn't exist in the IOMMU

Based on your description, this address space ID is per operation, right?
MAP/UNMAP and page-table sharing should have different ID spaces...

> device, it is created. 'device' is an identifier unique to the IOMMU. The
> host communicates unique device ID to the guest during boot. The method
> used to communicate this ID is outside the scope of this specification,
> but the following rules must apply:
>
> * The device ID is unique from the IOMMU point of view. Multiple devices
>   whose DMA transactions are not translated by the same IOMMU may have the
>   same device ID. Devices whose DMA transactions may be translated by the
>   same IOMMU must have different device IDs.
>
> * Sometimes the host cannot completely isolate two devices from each
>   others. For example on a legacy PCI bus, devices can snoop DMA
>   transactions from their neighbours. In this case, the host must
>   communicate to the guest that it cannot isolate these devices from each
>   others. The method used to communicate this is outside the scope of this
>   specification. The IOMMU device must ensure that devices that cannot be

"IOMMU device" -> "IOMMU driver"

>   isolated by the host have the same address spaces.
>

Thanks
Kevin
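
To make the wire layout of the quoted attach request concrete, here is a
minimal sketch that serializes the three le32 fields into a byte buffer,
independently of host endianness. Only the fields quoted above are
covered; any request header or status tail that the full specification
may define is left out. The names example_put_le32 and
example_encode_attach are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian order, whatever the host order. */
static void example_put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Serialize the body of an attach request: address_space, device, then
 * the flags/reserved word, which this sketch always writes as zero.
 * Returns the number of bytes written (12).
 */
static size_t example_encode_attach(uint8_t buf[12], uint32_t address_space,
				    uint32_t device)
{
	example_put_le32(buf, address_space);
	example_put_le32(buf + 4, device);
	example_put_le32(buf + 8, 0);
	return 12;
}
```

For instance, attaching Device ID 0x11fff to address space 1 produces the
bytes 01 00 00 00, ff 1f 01 00, 00 00 00 00.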
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
>
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]

>
> II. Page table sharing
> ======================
>
> 1. Sharing IOMMU page tables
> ----------------------------
>
> VIRTIO_IOMMU_F_PT_SHARING
>
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
>
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
>
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
>
> (1) Driver attaches devices to address spaces as usual, but a flag
>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>     create page tables for use with the MAP/UNMAP API. The driver intends
>     to manage the address space itself.
>
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>     pg_format array.
>
>     VIRTIO_IOMMU_T_PROBE_TABLE
>
>     struct virtio_iommu_req_probe_table {
>             le32    address_space;
>             le32    flags;
>             le32    len;
>
>             le32    nr_contexts;
>             struct {
>                     le32    model;
>                     u8      format[64];
>             } pg_format[len];
>     };
>
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
>
> (3) Device responds success with all page table formats implemented by the
>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>     initialize the array to 0 and deduce from there which entries have
>     been filled by the device.
>
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable for example.)

So essentially you need to modify all existing IOMMU drivers to support
page table sharing in pvIOMMU. After the abstraction is done, the core
pvIOMMU files can be kept vendor agnostic. But if we talk about the whole
pvIOMMU module, it actually includes vendor-specific logic, unlike typical
para-virtualized virtio drivers, which are completely vendor agnostic. Is
this understanding accurate?

It also means that the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the Qemu vIOMMU, meaning such format
definitions need to be consistently agreed across all those components.

[...]

>
> 2. Sharing MMU page tables
> --------------------------
>
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
>
> F_FAULT means that the host communicates page requests from device to the
> guest, and the guest can handle them by mapping virtual address in the
> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
> below.)
>
> F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
> pgtable format.
>
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
>
>               64               2 1 0
>   table ----> +---------------------+
>               |       pgd       |0|1|<--- context 0
>               |       ---       |0|0|<--- context 1
>               |       pgd       |0|1|
>               |       ---       |0|0|
>               |       ---       |0|0|
>               +---------------------+
>                                  | \___Entry is valid
>                                  |______reserved
>
> Question: do we want per-context page table format, or can it stay global
> for the whole indirect table?

Are you defining this context table format in software, or following a
hardware definition? At least for VT-d there is a strict hardware-defined
structure (PASID table) which must be used here.

[...]

>
> 4. Host implementation with VFIO
> --------------------------------
>
> The VFIO interface for sharing page tables is being worked on at the
> moment by Intel. Other virtual IOMMU implementation will most likely let
> guest manage full context tables (PASID tables) themselves, giving the
> context table pointer to the pIOMMU via a VFIO ioctl.
>
> For the architecture-agnostic virtio-iommu however, we shouldn't have to
> implement all possible formats of context table (they are at least
> different between ARM SMMU and Intel IOMMU, and will certainly be
> extended

Since you'll eventually require vendor-specific page table logic anyway,
why not also abstract this context table, which would then not require the
host-side changes below?

> in future physical IOMMU architectures.) In addition, most users might
> only care about having one page directory per device, as SVM is a luxury
> at the moment and few devices support it. For these reasons, we should
> allow to pass single page directories via VFIO, using very similar
> structures as described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
>
>   * VFIO_SVM_INFO: probe page table formats
>   * VFIO_SVM_BIND: set pgd and arch-specific configuration
>
> There is an inconvenience with letting the pIOMMU driver manage the guest's
> context table. During a page table walk, the pIOMMU translates the context
> table pointer using the stage-2 page tables. The context table must
> therefore be mapped in guest-physical space by the pIOMMU driver. One
> solution is to let the pIOMMU driver reserve some GPA space upfront using
> the iommu and sysfs resv API [1]. The host would then carve that region
> out of the guest-physical space using a firmware mechanism (for example DT
> reserved-memory node).

Can you elaborate on this flow? The pIOMMU driver doesn't directly manage
the GPA address space, so it's not reasonable for it to arbitrarily
specify a reserved range. It might make more sense for the GPA owner (e.g.
Qemu) to decide and then pass the information to the pIOMMU driver.

>
>
> III. Relaxed operations
> =======================
>
> VIRTIO_IOMMU_F_RELAXED
>
> Adding an IOMMU dramatically reduces performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
>
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
>
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
>
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
>
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit into a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
>
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. Driver might not accept a
> window at all, since it requires more infrastructure to keep delayed
> mappings. In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.

Even with the above optimization I'd imagine the performance drop is still
significant for kernel map/unmap usages, not to mention that such an
optimization is not possible when safety is required (actually I don't
know why an IOMMU is still required if safety can be compromised. Aren't
we using the IOMMU for security purposes?). I think we'd better focus on
higher-value usages, e.g. user space DMA protection (DPDK) and SVM, while
leaving kernel protection at a lower priority (mostly for functionality
verification). Is this strategy aligned with your thinking?

btw what about interrupt remapping/posting? Are they also in your plan for
pvIOMMU?

Last, thanks for the very informative write-up! It looks like a long
enabling path is required to get the pvIOMMU feature on par with a real
IOMMU. Starting with a minimal set is relatively easier. :-)

Thanks
Kevin
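
The three cases of the relaxed MAP semantics quoted above can be sketched
as a small classification function. This is only an illustration under
stated assumptions: ranges use inclusive end addresses, a single existing
mapping is checked, and the names example_range, example_map_action and
example_relaxed_map are hypothetical rather than part of the proposal.

```c
#include <stdint.h>

/* Hypothetical mapping description, with an inclusive end address. */
struct example_range {
	uint64_t start;
	uint64_t end;
	uint64_t phys;
	uint32_t flags;
};

enum example_map_action {
	EXAMPLE_MAP_NEW,	/* case (1): nothing mapped there yet */
	EXAMPLE_MAP_REPLACE,	/* case (2): unmap the overlap, then map */
	EXAMPLE_MAP_KEEP,	/* case (3): identical mapping, keep it */
};

/*
 * Decide what a relaxed MAP request does against one existing mapping.
 * For EXAMPLE_MAP_REPLACE, the overlapping region to unmap is returned
 * as [*unmap_start:*unmap_end], that is
 * [max(start, old_start):min(end, old_end)].
 */
static enum example_map_action
example_relaxed_map(const struct example_range *old,
		    const struct example_range *req,
		    uint64_t *unmap_start, uint64_t *unmap_end)
{
	if (req->end < old->start || req->start > old->end)
		return EXAMPLE_MAP_NEW;

	if (req->start == old->start && req->end == old->end &&
	    req->phys == old->phys && req->flags == old->flags)
		return EXAMPLE_MAP_KEEP;

	*unmap_start = req->start > old->start ? req->start : old->start;
	*unmap_end = req->end < old->end ? req->end : old->end;
	return EXAMPLE_MAP_REPLACE;
}
```

With an existing mapping [0x1000:0x1fff], a request for [0x1800:0x27ff]
falls under case (2) and the region to unmap is [0x1800:0x1fff].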
On Fri, Apr 07, 2017 at 08:17:47PM +0100, Jean-Philippe Brucker wrote:
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.
>
>     I.   Linux host
>          1. vhost-iommu

A qemu based implementation would be a first step.
Would allow validating the claim that it's much simpler to support
than e.g. VTD.

>          2. VFIO nested translation
>     II.  Page table sharing
>          1. Sharing IOMMU page tables
>          2. Sharing MMU page tables (SVM)
>          3. Fault reporting
>          4. Host implementation with VFIO
>     III. Relaxed operations
>     IV.  Misc
>
>
>   I. Linux host
>   =============
>
>   1. vhost-iommu
>   --------------
>
> An advantage of virtualizing an IOMMU using virtio is that it allows to
> hoist a lot of the emulation code into the kernel using vhost, and avoid
> returning to userspace for each request. The mainline kernel already
> implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
> could be reused.
>
> Introducing vhost in a simplified scenario 1 (removed guest userspace
> pass-through, irrelevant to this example) gives us the following:
>
>    MEM____pIOMMU________PCI device___________                    HARDWARE
>             |                                \
>   ----------|-------------+-------------+-----\--------------------------
>             |             :     KVM     :      \
>        pIOMMU drv         :             :       \                  KERNEL
>             |             :             :     net drv
>           VFIO            :             :       /
>             |             :             :      /
>        vhost-iommu_________________________virtio-iommu-drv
>                           :             :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
>
>
> Introducing vhost in scenario 2, userspace now only handles the device
> initialisation part, and most runtime communication is handled in kernel:
>
>   MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>     pIOMMU drv     |      :             :                      KERNEL
>              \__net drv   :             :
>                    |      :             :
>                   tap     :             :
>                    |      :             :
>              _vhost-net________________________virtio-net drv
>         (2) /             :             :           / (1a)
>            /              :             :          /
>   vhost-iommu________________________________virtio-iommu drv
>                           :             :                     (1b)
------------------------+-------------+------------------------------- > HOST : GUEST > > (1) a. Guest virtio driver maps ring and buffers > b. Map requests are relayed to the host the same way. > (2) To access any guest memory, vhost-net must query the IOMMU. We can > reuse the existing TLB protocol for this. TLB commands are written to > and read from the vhost-net fd. > > As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure > has everything needed for map/unmap operations: > > struct vhost_iotlb_msg { > __u64 iova; > __u64 size; > __u64 uaddr; > __u8 perm; /* R/W */ > __u8 type; > #define VHOST_IOTLB_MISS > #define VHOST_IOTLB_UPDATE /* MAP */ > #define VHOST_IOTLB_INVALIDATE /* UNMAP */ > #define VHOST_IOTLB_ACCESS_FAIL > }; > > struct vhost_msg { > int type; > union { > struct vhost_iotlb_msg iotlb; > __u8 padding[64]; > }; > }; > > The vhost-iommu device associates a virtual device ID to a TLB fd. We > should be able to use the same commands for [vhost-net <-> virtio-iommu] > and [virtio-net <-> vhost-iommu] communication. A virtio-net device > would open a socketpair and hand one side to vhost-iommu. > > If vhost_msg is ever used for another purpose than TLB, we'll have some > trouble, as there will be multiple clients that want to read/write the > vhost fd. A multicast transport method will be needed. Until then, this > can work. > > Details of operations would be: > > (1) Userspace sets up vhost-iommu as with other vhost devices, by using > standard vhost ioctls. 
Userspace starts by describing the system topology > via ioctl: > > ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct > vhost_iommu_add_device) > > #define VHOST_IOMMU_DEVICE_TYPE_VFIO > #define VHOST_IOMMU_DEVICE_TYPE_TLB > > struct vhost_iommu_add_device { > __u8 type; > __u32 devid; > union { > struct vhost_iommu_device_vfio { > int vfio_group_fd; > }; > struct vhost_iommu_device_tlb { > int fd; > }; > }; > }; > > (2) VIRTIO_IOMMU_T_ATTACH(address space, devid) > > vhost-iommu creates an address space if necessary, finds the device along > with the relevant operations. If type is VFIO, operations are done on a > container, otherwise they are done on single devices. > > (3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags) > > Turn phys into an hva using the vhost mem table. > > - If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the > mapping locally and wait for the TLB to ask for it with a > VHOST_IOTLB_MISS. > - If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to > introduce a shortcut in the external user API of VFIO). > > (4) VIRTIO_IOMMU_T_UNMAP(address space, virt, phys, size, flags) > > - If type is TLB, send a VHOST_IOTLB_INVALIDATE. > - If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA. > > (5) VIRTIO_IOMMU_T_DETACH(address space, devid) > > Undo whatever was done in (2). > > > 2. VFIO nested translation > -------------------------- > > For my current kvmtool implementation, I am putting each VFIO group in a > different container during initialization. We cannot detach a group from a > container at runtime without first resetting all devices in that group. So > the best way to provide dynamic address spaces right now is one container > per group. The drawback is that we need to maintain multiple sets of page > tables even if the guest wants to put all devices in the same address > space. 
Another disadvantage is when implementing bypass mode, we need to > map the whole address space at the beginning, then unmap everything on > attach. Adding nested support would be a nice way to provide dynamic > address spaces while keeping groups tied to a container at all times. > > A physical IOMMU may offer nested translation. In this case, address > spaces are managed by two page directories instead of one. A guest- > virtual address is translated into a guest-physical one using what we'll > call here "stage-1" (s1) page tables, and the guest-physical address is > translated into a host-physical one using "stage-2" (s2) page tables. > > s1 s2 > GVA --> GPA --> HPA > > There isn't a lot of support in Linux for nesting IOMMU page directories > at the moment (though SVM support is coming, see II). VFIO does have a > "nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU > code uses this to decide whether to manage the container with s2 page > tables instead of s1, but even then we still only have a single stage and > it is assumed that IOVA=GPA. > > Another model that would help with dynamically changing address spaces is > nesting VFIO containers: > > Parent <---------- map/unmap > container > / | \ > / group \ > Child Child <--- map/unmap > container container > | | | > group group group > > At the beginning all groups are attached to the parent container, and > there is no child container. Doing map/unmap on the parent container maps > stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should > be able to choose whether they want all devices attached to this container > to be able to access GPAs (bypass mode, as it currently is) or simply > block all DMA (in which case there is no need to pin pages here). > > At some point the guest wants to create an address space and attaches > children to it. Using an ioctl (to be defined), we can derive a child > container from the parent container, and move groups from parent to child. 
> > This returns a child fd. When the guest maps something in this new address > space, we can do a map ioctl on the child container, which maps stage-1 > page tables (map GVA -> GPA). > > A page table walk may access multiple levels of tables (pgd, p4d, pud, > pmd, pt). With nested translation, each access to a table during the > stage-1 walk requires a stage-2 walk. This makes a full translation costly > so it is preferable to use a single stage of translation when possible. > Folding two stages into one is simple with a single container, as shown in > the kvmtool example. The host keeps track of GPA->HVA mappings, so it can > fold the full GVA->HVA mapping before sending the VFIO request. With > nested containers however, the IOMMU driver would have to do the folding > work itself. Keeping a copy of stage-2 mapping created on the parent > container, it would fold them into the actual stage-2 page tables when > receiving a map request on the child container (note that software folding > is not possible when stage-1 pgd is managed by the guest, as described in > next section). > > I don't know if nested VFIO containers are a desirable feature at all. I > find the concept cute on paper, and it would make it easier for userspace > to juggle with address spaces, but it might require some invasive changes > in VFIO, and people have been able to use the current API for IOMMU > virtualization so far. > > > II. Page table sharing > =====================> > 1. Sharing IOMMU page tables > ---------------------------- > > VIRTIO_IOMMU_F_PT_SHARING > > This is independent of the nested mode described in I.2, but relies on a > similar feature in the physical IOMMU: having two stages of page tables, > one for the host and one for the guest. > > When this is supported, the guest can manage its own s1 page directory, to > avoid sending MAP/UNMAP requests. 
Feature VIRTIO_IOMMU_F_PT_SHARING allows > a driver to give a page directory pointer (pgd) to the host and send > invalidations when removing or changing a mapping. In this mode, three > requests are used: probe, attach and invalidate. An address space cannot > be using the MAP/UNMAP interface and PT_SHARING at the same time. > > Device and driver first need to negotiate which page table format they > will be using. This depends on the physical IOMMU, so the request contains > a negotiation part to probe the device capabilities. > > (1) Driver attaches devices to address spaces as usual, but a flag > VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to > create page tables for use with the MAP/UNMAP API. The driver intends > to manage the address space itself. > > (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of > pg_format array. > > VIRTIO_IOMMU_T_PROBE_TABLE > > struct virtio_iommu_req_probe_table { > le32 address_space; > le32 flags; > le32 len; > > le32 nr_contexts; > struct { > le32 model; > u8 format[64]; > } pg_format[len]; > }; > > Introducing a probe request is more flexible than advertising those > features in virtio config, because capabilities are dynamic, and depend on > which devices are attached to an address space. Within a single address > space, devices may support different numbers of contexts (PASIDs), and > some may not support recoverable faults. > > (3) Device responds success with all page table formats implemented by the > physical IOMMU in pg_format. 'model' 0 is invalid, so driver can > initialize the array to 0 and deduce from there which entries have > been filled by the device. > > Using a probe method seems preferable over trying to attach every possible > format until one sticks. 
For instance, with an ARM guest running on an x86 > host, PROBE_TABLE would return the Intel IOMMU page table format, and the > guest could use that page table code to handle its mappings, hidden behind > the IOMMU API. This requires that the page-table code is reasonably > abstracted from the architecture, as is done with drivers/iommu/io-pgtable > (an x86 guest could use any format implement by io-pgtable for example.) > > (4) If the driver is able to use this format, it sends the ATTACH_TABLE > request. > > VIRTIO_IOMMU_T_ATTACH_TABLE > > struct virtio_iommu_req_attach_table { > le32 address_space; > le32 flags; > le64 table; > > le32 nr_contexts; > /* Page-table format description */ > > le32 model; > u8 config[64] > }; > > > 'table' is a pointer to the page directory. 'nr_contexts' isn't used > here. > > For both ATTACH and PROBE, 'flags' are the following (and will be > explained later): > > VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT (1 << 0) > VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE (1 << 1) > VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT (1 << 2) > > Now 'model' is a bit tricky. We need to specify all possible page table > formats and their parameters. I'm not well-versed in x86, s390 or other > IOMMUs, so I'll just focus on the ARM world for this example. We basically > have two page table models, with a multitude of configuration bits: > > * ARM LPAE > * ARM short descriptor > > We could define a high-level identifier per page-table model, such as: > > #define PG_TABLE_ARM 0x1 > #define PG_TABLE_X86 0x2 > ... > > And each model would define its own structure. On ARM 'format' could be a > simple u32 defining a variant, LPAE 32/64 or short descriptor. It could > also contain additional capabilities. Then depending on the variant, > 'config' would be: > > struct pg_config_v7s { > le32 tcr; > le32 prrr; > le32 nmrr; > le32 asid; > }; > > struct pg_config_lpae { > le64 tcr; > le64 mair; > le32 asid; > > /* And maybe TTB1? 
*/ > }; > > struct pg_config_arm { > le32 variant; > union ...; > }; > > I am really uneasy with describing all those nasty architectural details > in the virtio-iommu specification. We certainly won't start describing the > content bit-by-bit of tcr or mair here, but just declaring these fields > might be sufficient. > > (5) Once the table is attached, the driver can simply write the page > tables and expect the physical IOMMU to observe the mappings without > any additional request. When changing or removing a mapping, however, > the driver must send an invalidate request. > > VIRTIO_IOMMU_T_INVALIDATE > > struct virtio_iommu_req_invalidate { > le32 address_space; > le32 context; > le32 flags; > le64 virt_addr; > le64 range_size; > > u8 opaque[64]; > }; > > 'flags' may be: > > VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range > from 'context' (context is 0 when !F_INDIRECT). > > And with context tables only (explained below): > > VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from > 'context' (context is 0 when !F_INDIRECT). virt_addr and range_size > are ignored. > > VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries > in the table that changed. Device reads the table again, compares it > to previous values, and invalidate all mappings for contexts that > changed. context, virt_addr and range_size are ignored. > > IOMMUs may offer hints and quirks in their invalidation packets. The > opaque structure in invalidate would allow to transport those. This > depends on the page table format and as with architectural page-table > definitions, I really don't want to have those details in the spec itself. > > > 2. Sharing MMU page tables > -------------------------- > > The guest can share process page-tables with the physical IOMMU. To do > that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). 
The > page table format is implicit, so the pg_format array can be empty (unless > the guest wants to query some specific property, e.g. number of levels > supported by the pIOMMU?). If the host answers with success, guest can > send its MMU page table details with ATTACH_TABLE and (F_NATIVE | > F_INDIRECT | F_FAULT) flags. > > F_FAULT means that the host communicates page requests from device to the > guest, and the guest can handle them by mapping virtual address in the > fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see > below.) > > F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU > pgtable format. > > F_INDIRECT means that 'table' pointer is a context table, instead of a > page directory. Each slot in the context table points to a page directory: > > 64 2 1 0 > table ----> +---------------------+ > | pgd |0|1|<--- context 0 > | --- |0|0|<--- context 1 > | pgd |0|1| > | --- |0|0| > | --- |0|0| > +---------------------+ > | \___Entry is valid > |______reserved > > Question: do we want per-context page table format, or can it stay global > for the whole indirect table? > > Having a context table allows to provide multiple address spaces for a > single device. In the simplest form, without F_INDIRECT we have a single > address space per device, but some devices may implement more, for > instance devices with the PCI PASID extension. > > A slot's position in the context table gives an ID, between 0 and > nr_contexts. The guest can use this ID to have the device target a > specific address space with DMA. The mechanism to do that is > device-specific. For a PCI device, the ID is a PASID, and PCI doesn't > define a specific way of using them for DMA, it's the device driver's > concern. > > > 3. Fault reporting > ------------------ > > VIRTIO_IOMMU_F_EVENT_QUEUE > > With this feature, an event virtqueue (1) is available. 
For now it will > only be used for fault handling, but I'm calling it eventq so that other > asynchronous features can piggy-back on it. Device may report faults and > page requests by sending buffers via the used ring. > > #define VIRTIO_IOMMU_T_FAULT 0x05 > > struct virtio_iommu_evt_fault { > struct virtio_iommu_evt_head { > u8 type; > u8 reserved[3]; > }; > > u32 address_space; > u32 context; > > u64 vaddr; > u32 flags; /* Access details: R/W/X */ > > /* In the reply: */ > u32 reply; /* Fault handled, or failure */ > u64 paddr; > }; > > Driver must send the reply via the request queue, with the fault status > in 'reply', and the mapped page in 'paddr' on success. > > Existing fault handling interfaces such as PRI have a tag (PRG) allowing > to identify a page request (or group thereof) when sending a reply. I > wonder if this would be useful to us, but it seems like the > (address_space, context, vaddr) tuple is sufficient to identify a page > fault, provided the device doesn't send duplicate faults. Duplicate faults > could be required if they have a side effect, for instance implementing a > poor man's doorbell. If this is desirable, we could add a fault_id field. > > > 4. Host implementation with VFIO > -------------------------------- > > The VFIO interface for sharing page tables is being worked on at the > moment by Intel. Other virtual IOMMU implementation will most likely let > guest manage full context tables (PASID tables) themselves, giving the > context table pointer to the pIOMMU via a VFIO ioctl. > > For the architecture-agnostic virtio-iommu however, we shouldn't have to > implement all possible formats of context table (they are at least > different between ARM SMMU and Intel IOMMU, and will certainly be extended > in future physical IOMMU architectures.) In addition, most users might > only care about having one page directory per device, as SVM is a luxury > at the moment and few devices support it. 
> For these reasons, we should
> allow passing single page directories via VFIO, using very similar
> structures as described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
>
>   * VFIO_SVM_INFO: probe page table formats
>   * VFIO_SVM_BIND: set pgd and arch-specific configuration
>
> There is an inconvenience with letting the pIOMMU driver manage the
> guest's context table. During a page table walk, the pIOMMU translates
> the context table pointer using the stage-2 page tables. The context
> table must therefore be mapped in guest-physical space by the pIOMMU
> driver. One solution is to let the pIOMMU driver reserve some GPA space
> upfront using the iommu and sysfs resv API [1]. The host would then carve
> that region out of the guest-physical space using a firmware mechanism
> (for example DT reserved-memory node).
>
>
>   III. Relaxed operations
>   =======================
>
> VIRTIO_IOMMU_F_RELAXED
>
> Adding an IOMMU dramatically reduces performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
>
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
>
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
>
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
>
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit into a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
>
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. Driver might not accept a
> window at all, since it requires more infrastructure to keep delayed
> mappings. In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.
>
>
>   IV. Misc
>   ========
>
> I think we have enough to go on for a while.
> To improve MAP throughput, I
> considered adding a MAP_SG request depending on a feature bit, with
> variable size:
>
>     struct virtio_iommu_req_map_sg {
>         struct virtio_iommu_req_head;
>         u32 address_space;
>         u32 nr_elems;
>         u64 virt_addr;
>         u64 size;
>         u64 phys_addr[nr_elems];
>     };
>
> Would create the following mappings:
>
>     virt_addr            -> phys_addr[0]
>     virt_addr + size     -> phys_addr[1]
>     virt_addr + 2 * size -> phys_addr[2]
>     ...
>
> This would avoid the overhead of multiple map commands. We could try to
> find a more cunning format to compress virtually-contiguous mappings with
> different (phys, size) pairs as well. But Linux drivers rarely prefer
> map_sg() functions over regular map(), so I don't know if the whole map_sg
> feature is worth the effort. All we would gain is a few bytes anyway.
>
> My current map_sg implementation in the virtio-iommu driver adds a batch
> of map requests to the queue and kicks the host once. That might be enough
> of an optimization.
>
>
> Another invasive optimization would be adding grouped requests. By adding
> two flags in the header, L and G, we can group sequences of requests
> together, and have one status at the end, either 0 if all requests in the
> group succeeded, or the status of the first request that failed. This is
> all in-order. Requests in a group follow each other, there is no sequence
> identifier.
>
>                         ___ L: request is last in the group
>                        /  _ G: request is part of a group
>                       |  /
>                       v v
>    31                 9 8 7       0
>   +--------------------------------+ <------- RO descriptor
>   |       res0       |0|1|  type   |
>   +--------------------------------+
>   |            payload             |
>   +--------------------------------+
>   |       res0       |0|1|  type   |
>   +--------------------------------+
>   |            payload             |
>   +--------------------------------+
>   |       res0       |0|1|  type   |
>   +--------------------------------+
>   |            payload             |
>   +--------------------------------+
>   |       res0       |1|1|  type   |
>   +--------------------------------+
>   |            payload             |
>   +--------------------------------+ <------- WO descriptor
>   |       res0           | status  |
>   +--------------------------------+
>
> This adds some complexity on the device, since it must unroll whatever was
> done by successful requests in a group as soon as one fails, and reject
> all subsequent ones. A group of requests is an atomic operation. As with
> map_sg, this change mostly allows to save space and virtio descriptors.
>
>
> [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
> [2] vIOMMU: Efficient IOMMU Emulation
>     N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster
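
The grouped-request semantics quoted above (in-order processing, a single
status, and unrolling on failure) can be sketched as follows. This is only
an illustration under stated assumptions: the bit positions follow the
diagram (type in bits 7:0, G in bit 8, L in bit 9), while the TOY_* request
types, the status value, struct toy_as, struct toy_req and
toy_process_group() are hypothetical and not part of the proposed
virtio-iommu interface. The "address space" is reduced to one flag per page.

```c
#include <stddef.h>
#include <stdint.h>

#define REQ_TYPE_MASK	0xffu		/* bits 7:0: request type */
#define REQ_F_GROUP	(1u << 8)	/* G: request is part of a group */
#define REQ_F_LAST	(1u << 9)	/* L: request is last in the group */

/* Hypothetical request types and error status, for illustration only. */
#define TOY_T_MAP	1u
#define TOY_T_UNMAP	2u
#define TOY_S_INVAL	1

#define TOY_NR_PAGES	16

/* Toy address space: one flag per page, nonzero when mapped. */
struct toy_as {
	uint8_t mapped[TOY_NR_PAGES];
};

/* Toy request: header word as in the diagram, payload is a page index. */
struct toy_req {
	uint32_t head;
	uint32_t page;
};

/* Apply one request; returns 0 on success or TOY_S_INVAL. */
static int toy_handle(struct toy_as *as, const struct toy_req *r)
{
	uint32_t type = r->head & REQ_TYPE_MASK;

	if (r->page >= TOY_NR_PAGES)
		return TOY_S_INVAL;
	if (type == TOY_T_MAP && !as->mapped[r->page]) {
		as->mapped[r->page] = 1;
		return 0;
	}
	if (type == TOY_T_UNMAP && as->mapped[r->page]) {
		as->mapped[r->page] = 0;
		return 0;
	}
	return TOY_S_INVAL;
}

/* Revert a request that toy_handle() previously applied successfully. */
static void toy_undo(struct toy_as *as, const struct toy_req *r)
{
	as->mapped[r->page] = (r->head & REQ_TYPE_MASK) == TOY_T_UNMAP;
}

/*
 * Process requests in order up to the one flagged L (or 'nr' entries).
 * Returns 0 if every request succeeded, otherwise the status of the first
 * failing request, after undoing the ones that had already succeeded.
 */
int toy_process_group(struct toy_as *as, const struct toy_req *reqs, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		int status = toy_handle(as, &reqs[i]);

		if (status) {
			/* Unroll in reverse order; the rest is rejected. */
			while (i-- > 0)
				toy_undo(as, &reqs[i]);
			return status;
		}
		if (reqs[i].head & REQ_F_LAST)
			break;
	}
	return 0;
}
```

A group of two MAP requests where the second one targets an already mapped
page returns TOY_S_INVAL and leaves the first page unmapped again, which is
the all-or-nothing behaviour the proposal describes.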
Hi Jean,

I am trying to run and review this on my side, but I see the Linux patches
are not based on the latest kernel version. Would it be possible for you to
share references to your Linux and kvmtool git repositories?

Thanks
-Bharat

> -----Original Message-----
> From: virtualization-bounces at lists.linux-foundation.org
> [mailto:virtualization-bounces at lists.linux-foundation.org] On Behalf Of
> Jean-Philippe Brucker
> Sent: Saturday, April 08, 2017 12:55 AM
> To: iommu at lists.linux-foundation.org; kvm at vger.kernel.org;
> virtualization at lists.linux-foundation.org; virtio-dev at lists.oasis-open.org
> Cc: cdall at linaro.org; lorenzo.pieralisi at arm.com; mst at redhat.com;
> marc.zyngier at arm.com; joro at 8bytes.org; will.deacon at arm.com;
> robin.murphy at arm.com
> Subject: [RFC PATCH kvmtool 00/15] Add virtio-iommu
>
> Implement a virtio-iommu device and translate DMA traffic from vfio and
> virtio devices. Virtio needed some rework to support scatter-gather accesses
> to vring and buffers at page granularity. Patch 3 implements the actual
> virtio-iommu device.
>
> Adding --viommu on the command-line now inserts a virtual IOMMU in front
> of all virtio and vfio devices:
>
> $ lkvm run -k Image --console virtio -p console=hvc0 \
>         --viommu --vfio 0 --vfio 4 --irqchip gicv3-its
> ...
> [    2.998949] virtio_iommu virtio0: probe successful
> [    3.007739] virtio_iommu virtio1: probe successful
> ...
> [    3.165023] iommu: Adding device 0000:00:00.0 to group 0
> [    3.536480] iommu: Adding device 10200.virtio to group 1
> [    3.553643] iommu: Adding device 10600.virtio to group 2
> [    3.570687] iommu: Adding device 10800.virtio to group 3
> [    3.627425] iommu: Adding device 10a00.virtio to group 4
> [    7.823689] iommu: Adding device 0000:00:01.0 to group 5
> ...
>
> Patches 13 and 14 add debug facilities.
Some statistics are gathered for each > address space and can be queried via the debug builtin: > > $ lkvm debug -n guest-1210 --iommu stats > iommu 0 "viommu-vfio" > kicks 1255 > requests 1256 > ioas 1 > maps 7 > unmaps 4 > resident 2101248 > ioas 6 > maps 623 > unmaps 620 > resident 16384 > iommu 1 "viommu-virtio" > kicks 11426 > requests 11431 > ioas 2 > maps 2836 > unmaps 2835 > resident 8192 > accesses 2836 > ... > > This is based on the VFIO patchset[1], itself based on Andre's ITS work. > The VFIO bits have only been tested on a software model and are unlikely to > work on actual hardware, but I also tested virtio on an ARM Juno. > > [1] http://www.spinics.net/lists/kvm/msg147624.html > > Jean-Philippe Brucker (15): > virtio: synchronize virtio-iommu headers with Linux > FDT: (re)introduce a dynamic phandle allocator > virtio: add virtio-iommu > Add a simple IOMMU > iommu: describe IOMMU topology in device-trees > irq: register MSI doorbell addresses > virtio: factor virtqueue initialization > virtio: add vIOMMU instance for virtio devices > virtio: access vring and buffers through IOMMU mappings > virtio-pci: translate MSIs with the virtual IOMMU > virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary > vfio: add support for virtual IOMMU > virtio-iommu: debug via IPC > virtio-iommu: implement basic debug commands > virtio: use virtio-iommu when available > > Makefile | 3 + > arm/gic.c | 4 + > arm/include/arm-common/fdt-arch.h | 2 +- > arm/pci.c | 49 ++- > builtin-debug.c | 8 +- > builtin-run.c | 2 + > fdt.c | 35 ++ > include/kvm/builtin-debug.h | 6 + > include/kvm/devices.h | 4 + > include/kvm/fdt.h | 20 + > include/kvm/iommu.h | 105 +++++ > include/kvm/irq.h | 3 + > include/kvm/kvm-config.h | 1 + > include/kvm/vfio.h | 2 + > include/kvm/virtio-iommu.h | 15 + > include/kvm/virtio-mmio.h | 1 + > include/kvm/virtio-pci.h | 2 + > include/kvm/virtio.h | 137 +++++- > include/linux/virtio_config.h | 74 ++++ > include/linux/virtio_ids.h | 4 + > 
include/linux/virtio_iommu.h | 142 ++++++ > iommu.c | 240 ++++++++++ > irq.c | 35 ++ > kvm-ipc.c | 43 +- > mips/include/kvm/fdt-arch.h | 2 +- > powerpc/include/kvm/fdt-arch.h | 2 +- > vfio.c | 281 +++++++++++- > virtio/9p.c | 7 +- > virtio/balloon.c | 7 +- > virtio/blk.c | 10 +- > virtio/console.c | 7 +- > virtio/core.c | 240 ++++++++-- > virtio/iommu.c | 902 > ++++++++++++++++++++++++++++++++++++++ > virtio/mmio.c | 44 +- > virtio/net.c | 8 +- > virtio/pci.c | 61 ++- > virtio/rng.c | 6 +- > virtio/scsi.c | 6 +- > x86/include/kvm/fdt-arch.h | 2 +- > 39 files changed, 2389 insertions(+), 133 deletions(-) create mode 100644 > fdt.c create mode 100644 include/kvm/iommu.h create mode 100644 > include/kvm/virtio-iommu.h create mode 100644 > include/linux/virtio_config.h create mode 100644 > include/linux/virtio_iommu.h create mode 100644 iommu.c create mode > 100644 virtio/iommu.c > > -- > 2.12.1 > > _______________________________________________ > Virtualization mailing list > Virtualization at lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Bharat Bhushan
2017-Jun-16  08:48 UTC
[virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
Hi Jean

> -----Original Message-----
> From: virtio-dev at lists.oasis-open.org
> [mailto:virtio-dev at lists.oasis-open.org] On Behalf Of
> Jean-Philippe Brucker
> Sent: Saturday, April 08, 2017 12:53 AM
> To: iommu at lists.linux-foundation.org; kvm at vger.kernel.org;
> virtualization at lists.linux-foundation.org; virtio-dev at lists.oasis-open.org
> Cc: cdall at linaro.org; will.deacon at arm.com; robin.murphy at arm.com;
> lorenzo.pieralisi at arm.com; joro at 8bytes.org; mst at redhat.com;
> jasowang at redhat.com; alex.williamson at redhat.com;
> marc.zyngier at arm.com
> Subject: [virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
>
> The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
> requests such as map/unmap over virtio-mmio transport. This driver should
> illustrate the initial proposal for virtio-iommu, that you hopefully received
> with it. It handles attach, detach, map and unmap requests.
>
> The bulk of the code is to create requests and send them through virtio.
> Implementing the IOMMU API is fairly straightforward since the virtio-iommu
> MAP/UNMAP interface is almost identical. I threw in a custom
> map_sg() function which takes up some space, but is optional. The core
> function would send a sequence of map requests, waiting for a reply
> between each mapping. This optimization avoids yielding to the host after
> each map, and instead prepares a batch of requests in the virtio ring and
> kicks the host once.
>
> It must be applied on top of the probe deferral work for IOMMU, currently
> under discussion. This allows to dissociate early driver detection and device
> probing: device-tree or ACPI is parsed early to find which devices are
> translated by the IOMMU, but the IOMMU itself cannot be probed until the
> core virtio module is loaded.
>
> Enabling DEBUG makes it extremely verbose at the moment, but it should be
> calmer in next versions.
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com> > --- > drivers/iommu/Kconfig | 11 + > drivers/iommu/Makefile | 1 + > drivers/iommu/virtio-iommu.c | 980 > ++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/virtio_ids.h | 1 + > include/uapi/linux/virtio_iommu.h | 142 ++++++ > 6 files changed, 1136 insertions(+) > create mode 100644 drivers/iommu/virtio-iommu.c create mode 100644 > include/uapi/linux/virtio_iommu.h > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index > 37e204f3d9be..8cd56ee9a93a 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -359,4 +359,15 @@ config MTK_IOMMU_V1 > > if unsure, say N here. > > +config VIRTIO_IOMMU > + tristate "Virtio IOMMU driver" > + depends on VIRTIO_MMIO > + select IOMMU_API > + select INTERVAL_TREE > + select ARM_DMA_USE_IOMMU if ARM > + help > + Para-virtualised IOMMU driver with virtio. > + > + Say Y here if you intend to run this kernel as a guest. > + > endif # IOMMU_SUPPORT > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index > 195f7b997d8e..1199d8475802 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra- > smmu.o > obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o > obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o > obj-$(CONFIG_S390_IOMMU) += s390-iommu.o > +obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c > new file mode 100644 index 000000000000..1cf4f57b7817 > --- /dev/null > +++ b/drivers/iommu/virtio-iommu.c > @@ -0,0 +1,980 @@ > +/* > + * Virtio driver for the paravirtualized IOMMU > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. 
> + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to the Free Software > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, > USA. > + * > + * Copyright (C) 2017 ARM Limited > + * > + * Author: Jean-Philippe Brucker <jean-philippe.brucker at arm.com> */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include <linux/amba/bus.h> > +#include <linux/delay.h> > +#include <linux/dma-iommu.h> > +#include <linux/freezer.h> > +#include <linux/interval_tree.h> > +#include <linux/iommu.h> > +#include <linux/module.h> > +#include <linux/of_iommu.h> > +#include <linux/of_platform.h> > +#include <linux/platform_device.h> > +#include <linux/virtio.h> > +#include <linux/virtio_config.h> > +#include <linux/virtio_ids.h> > +#include <linux/wait.h> > + > +#include <uapi/linux/virtio_iommu.h> > + > +struct viommu_dev { > + struct iommu_device iommu; > + struct device *dev; > + struct virtio_device *vdev; > + > + struct virtqueue *vq; > + struct list_head pending_requests; > + /* Serialize anything touching the vq and the request list */ > + spinlock_t vq_lock; > + > + struct list_head list; > + > + /* Device configuration */ > + u64 pgsize_bitmap; > + u64 aperture_start; > + u64 aperture_end; > +}; > + > +struct viommu_mapping { > + phys_addr_t paddr; > + struct interval_tree_node iova; > +}; > + > +struct viommu_domain { > + struct iommu_domain domain; > + struct viommu_dev *viommu; > + struct mutex mutex; > + u64 id; > + > + spinlock_t mappings_lock; > + struct rb_root mappings; > + > + /* Number of devices attached to this domain */ > + unsigned long attached; > +}; > + > +struct viommu_endpoint { > + struct viommu_dev 
*viommu; > + struct viommu_domain *vdomain; > +}; > + > +struct viommu_request { > + struct scatterlist head; > + struct scatterlist tail; > + > + int written; > + struct list_head list; > +}; > + > +/* TODO: use an IDA */ > +static atomic64_t viommu_domain_ids_gen; > + > +#define to_viommu_domain(domain) container_of(domain, struct > +viommu_domain, domain) > + > +/* Virtio transport */ > + > +static int viommu_status_to_errno(u8 status) { > + switch (status) { > + case VIRTIO_IOMMU_S_OK: > + return 0; > + case VIRTIO_IOMMU_S_UNSUPP: > + return -ENOSYS; > + case VIRTIO_IOMMU_S_INVAL: > + return -EINVAL; > + case VIRTIO_IOMMU_S_RANGE: > + return -ERANGE; > + case VIRTIO_IOMMU_S_NOENT: > + return -ENOENT; > + case VIRTIO_IOMMU_S_FAULT: > + return -EFAULT; > + case VIRTIO_IOMMU_S_IOERR: > + case VIRTIO_IOMMU_S_DEVERR: > + default: > + return -EIO; > + } > +} > + > +static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t > *head, > + size_t *tail) > +{ > + size_t size; > + union virtio_iommu_req r; > + > + *tail = sizeof(struct virtio_iommu_req_tail); > + > + switch (req->type) { > + case VIRTIO_IOMMU_T_ATTACH: > + size = sizeof(r.attach); > + break; > + case VIRTIO_IOMMU_T_DETACH: > + size = sizeof(r.detach); > + break; > + case VIRTIO_IOMMU_T_MAP: > + size = sizeof(r.map); > + break; > + case VIRTIO_IOMMU_T_UNMAP: > + size = sizeof(r.unmap); > + break; > + default: > + return -EINVAL; > + } > + > + *head = size - *tail; > + return 0; > +} > + > +static int viommu_receive_resp(struct viommu_dev *viommu, int > +nr_expected) { > + > + unsigned int len; > + int nr_received = 0; > + struct viommu_request *req, *pending, *next; > + > + pending = list_first_entry_or_null(&viommu->pending_requests, > + struct viommu_request, list); > + if (WARN_ON(!pending)) > + return 0; > + > + while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) { > + if (req != pending) { > + dev_warn(viommu->dev, "discarding stale > request\n"); > + continue; > + } > + > + 
pending->written = len; > + > + if (++nr_received == nr_expected) { > + list_del(&pending->list); > + /* > + * In an ideal world, we'd wake up the waiter for this > + * group of requests here. But everything is painfully > + * synchronous, so waiter is the caller. > + */ > + break; > + } > + > + next = list_next_entry(pending, list); > + list_del(&pending->list); > + > + if (WARN_ON(list_empty(&viommu->pending_requests))) > + return 0; > + > + pending = next; > + } > + > + return nr_received; > +} > + > +/* Must be called with vq_lock held */ > +static int _viommu_send_reqs_sync(struct viommu_dev *viommu, > + struct viommu_request *req, int nr, > + int *nr_sent) > +{ > + int i, ret; > + ktime_t timeout; > + int nr_received = 0; > + struct scatterlist *sg[2]; > + /* > + * FIXME: as it stands, 1s timeout per request. This is a voluntary > + * exaggeration because I have no idea how real our ktime is. Are we > + * using a RTC? Are we aware of steal time? I don't know much about > + * this, need to do some digging. > + */ > + unsigned long timeout_ms = 1000; > + > + *nr_sent = 0; > + > + for (i = 0; i < nr; i++, req++) { > + /* > + * The backend will allocate one indirect descriptor for each > + * request, which allows to double the ring consumption, but > + * might be slower. > + */ > + req->written = 0; > + > + sg[0] = &req->head; > + sg[1] = &req->tail; > + > + ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req, > + GFP_ATOMIC); > + if (ret) > + break; > + > + list_add_tail(&req->list, &viommu->pending_requests); > + } > + > + if (i && !virtqueue_kick(viommu->vq)) > + return -EPIPE; > + > + /* > + * Absolutely no wiggle room here. We're not allowed to sleep as > callers > + * might be holding spinlocks, so we have to poll like savages until > + * something appears. Hopefully the host already handled the > request > + * during the above kick and returned it to us. 
> + * > + * A nice improvement would be for the caller to tell us if we can > sleep > + * whilst mapping, but this has to go through the IOMMU/DMA API. > + */ > + timeout = ktime_add_ms(ktime_get(), timeout_ms * i); > + while (nr_received < i && ktime_before(ktime_get(), timeout)) { > + nr_received += viommu_receive_resp(viommu, i - > nr_received); > + if (nr_received < i) { > + /* > + * FIXME: what's a good way to yield to host? A > second > + * virtqueue_kick won't have any effect since we > haven't > + * added any descriptor. > + */ > + udelay(10); > + } > + } > + dev_dbg(viommu->dev, "request took %lld us\n", > + ktime_us_delta(ktime_get(), ktime_sub_ms(timeout, > timeout_ms * i))); > + > + if (nr_received != i) > + ret = -ETIMEDOUT; > + > + if (ret == -ENOSPC && nr_received) > + /* > + * We've freed some space since virtio told us that the ring is > + * full, tell the caller to come back later (after releasing the > + * lock first, to be fair to other threads) > + */ > + ret = -EAGAIN; > + > + *nr_sent = nr_received; > + > + return ret; > +} > + > +/** > + * viommu_send_reqs_sync - add a batch of requests, kick the host and > wait for > + * them to return > + * > + * @req: array of requests > + * @nr: size of the array > + * @nr_sent: contains the number of requests actually sent after this > function > + * returns > + * > + * Return 0 on success, or an error if we failed to send some of the > requests. 
> + */ > +static int viommu_send_reqs_sync(struct viommu_dev *viommu, > + struct viommu_request *req, int nr, > + int *nr_sent) > +{ > + int ret; > + int sent = 0; > + unsigned long flags; > + > + *nr_sent = 0; > + do { > + spin_lock_irqsave(&viommu->vq_lock, flags); > + ret = _viommu_send_reqs_sync(viommu, req, nr, &sent); > + spin_unlock_irqrestore(&viommu->vq_lock, flags); > + > + *nr_sent += sent; > + req += sent; > + nr -= sent; > + } while (ret == -EAGAIN); > + > + return ret; > +} > + > +/** > + * viommu_send_req_sync - send one request and wait for reply > + * > + * @head_ptr: pointer to a virtio_iommu_req_* structure > + * > + * Returns 0 if the request was successful, or an error number > +otherwise. No > + * distinction is done between transport and request errors. > + */ > +static int viommu_send_req_sync(struct viommu_dev *viommu, void > +*head_ptr) { > + int ret; > + int nr_sent; > + struct viommu_request req; > + size_t head_size, tail_size; > + struct virtio_iommu_req_tail *tail; > + struct virtio_iommu_req_head *head = head_ptr; > + > + ret = viommu_get_req_size(head, &head_size, &tail_size); > + if (ret) > + return ret; > + > + dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n", > head->type, > + head_size + tail_size); > + > + tail = head_ptr + head_size; > + > + sg_init_one(&req.head, head, head_size); > + sg_init_one(&req.tail, tail, tail_size); > + > + ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent); > + if (ret || !req.written || nr_sent != 1) { > + dev_err(viommu->dev, "failed to send command\n"); > + return -EIO; > + } > + > + ret = -viommu_status_to_errno(tail->status); > + > + if (ret) > + dev_dbg(viommu->dev, " completed with %d\n", ret); > + > + return ret; > +} > + > +static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned > long iova, > + phys_addr_t paddr, size_t size) > +{ > + unsigned long flags; > + struct viommu_mapping *mapping; > + > + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC); > + if (!mapping) > + 
return -ENOMEM; > + > + mapping->paddr = paddr; > + mapping->iova.start = iova; > + mapping->iova.last = iova + size - 1; > + > + spin_lock_irqsave(&vdomain->mappings_lock, flags); > + interval_tree_insert(&mapping->iova, &vdomain->mappings); > + spin_unlock_irqrestore(&vdomain->mappings_lock, flags); > + > + return 0; > +} > + > +static size_t viommu_tlb_unmap(struct viommu_domain *vdomain, > + unsigned long iova, size_t size) { > + size_t unmapped = 0; > + unsigned long flags; > + unsigned long last = iova + size - 1; > + struct viommu_mapping *mapping = NULL; > + struct interval_tree_node *node, *next; > + > + spin_lock_irqsave(&vdomain->mappings_lock, flags); > + next = interval_tree_iter_first(&vdomain->mappings, iova, last); > + while (next) { > + node = next; > + mapping = container_of(node, struct viommu_mapping, > iova); > + > + next = interval_tree_iter_next(node, iova, last); > + > + /* > + * Note that for a partial range, this will return the full > + * mapping so we avoid sending split requests to the device. 
> + */ > + unmapped += mapping->iova.last - mapping->iova.start + 1; > + > + interval_tree_remove(node, &vdomain->mappings); > + kfree(mapping); > + } > + spin_unlock_irqrestore(&vdomain->mappings_lock, flags); > + > + return unmapped; > +} > + > +/* IOMMU API */ > + > +static bool viommu_capable(enum iommu_cap cap) { > + return false; /* :( */ > +} > + > +static struct iommu_domain *viommu_domain_alloc(unsigned type) { > + struct viommu_domain *vdomain; > + > + if (type != IOMMU_DOMAIN_UNMANAGED && type != > IOMMU_DOMAIN_DMA) > + return NULL; > + > + vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL); > + if (!vdomain) > + return NULL; > + > + vdomain->id = > atomic64_inc_return_relaxed(&viommu_domain_ids_gen); > + > + mutex_init(&vdomain->mutex); > + spin_lock_init(&vdomain->mappings_lock); > + vdomain->mappings = RB_ROOT; > + > + pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id); > + > + if (type == IOMMU_DOMAIN_DMA && > + iommu_get_dma_cookie(&vdomain->domain)) { > + kfree(vdomain); > + return NULL; > + } > + > + return &vdomain->domain; > +} > + > +static void viommu_domain_free(struct iommu_domain *domain) { > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + > + pr_debug("free domain %llu\n", vdomain->id); > + > + iommu_put_dma_cookie(domain); > + > + /* Free all remaining mappings (size 2^64) */ > + viommu_tlb_unmap(vdomain, 0, 0); > + > + kfree(vdomain); > +} > + > +static int viommu_attach_dev(struct iommu_domain *domain, struct > device > +*dev) { > + int i; > + int ret = 0; > + struct iommu_fwspec *fwspec = dev->iommu_fwspec; > + struct viommu_endpoint *vdev = fwspec->iommu_priv; > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + struct virtio_iommu_req_attach req = { > + .head.type = VIRTIO_IOMMU_T_ATTACH, > + .address_space = cpu_to_le32(vdomain->id), > + }; > + > + mutex_lock(&vdomain->mutex); > + if (!vdomain->viommu) { > + struct viommu_dev *viommu = vdev->viommu; > + > + vdomain->viommu = viommu; >
+ > + domain->pgsize_bitmap = viommu- > >pgsize_bitmap; > + domain->geometry.aperture_start = viommu- > >aperture_start; > + domain->geometry.aperture_end = viommu- > >aperture_end; > + domain->geometry.force_aperture = true; > + > + } else if (vdomain->viommu != vdev->viommu) { > + dev_err(dev, "cannot attach to foreign VIOMMU\n"); > + ret = -EXDEV; > + } > + mutex_unlock(&vdomain->mutex); > + > + if (ret) > + return ret; > + > + /* > + * When attaching the device to a new domain, it will be detached > from > + * the old one and, if as as a result the old domain isn't attached to > + * any device, all mappings are removed from the old domain and it is > + * freed. (Note that we can't use get_domain_for_dev here, it > returns > + * the default domain during initial attach.) > + * > + * Take note of the device disappearing, so we can ignore unmap > request > + * on stale domains (that is, between this detach and the upcoming > + * free.) > + * > + * vdev->vdomain is protected by group->mutex > + */ > + if (vdev->vdomain) { > + dev_dbg(dev, "detach from domain %llu\n", vdev- > >vdomain->id); > + vdev->vdomain->attached--; > + } > + > + dev_dbg(dev, "attach to domain %llu\n", vdomain->id); > + > + for (i = 0; i < fwspec->num_ids; i++) { > + req.device = cpu_to_le32(fwspec->ids[i]); > + > + ret = viommu_send_req_sync(vdomain->viommu, &req); > + if (ret) > + break; > + } > + > + vdomain->attached++; > + vdev->vdomain = vdomain; > + > + return ret; > +} > + > +static int viommu_map(struct iommu_domain *domain, unsigned long iova, > + phys_addr_t paddr, size_t size, int prot) { > + int ret; > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + struct virtio_iommu_req_map req = { > + .head.type = VIRTIO_IOMMU_T_MAP, > + .address_space = cpu_to_le32(vdomain->id), > + .virt_addr = cpu_to_le64(iova), > + .phys_addr = cpu_to_le64(paddr), > + .size = cpu_to_le64(size), > + }; > + > + pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova, > + paddr, size);

A query: when tracing the prints above, I see the same physical address mapped at two different virtual addresses. Do you know why the kernel does this?

Thanks
-Bharat

> + > + if (!vdomain->attached) > + return -ENODEV; > + > + if (prot & IOMMU_READ) > + req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ); > + > + if (prot & IOMMU_WRITE) > + req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE); > + > + ret = viommu_tlb_map(vdomain, iova, paddr, size); > + if (ret) > + return ret; > + > + ret = viommu_send_req_sync(vdomain->viommu, &req); > + if (ret) > + viommu_tlb_unmap(vdomain, iova, size); > + > + return ret; > +} > + > +static size_t viommu_unmap(struct iommu_domain *domain, unsigned > long iova, > + size_t size) > +{ > + int ret; > + size_t unmapped; > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + struct virtio_iommu_req_unmap req = { > + .head.type = VIRTIO_IOMMU_T_UNMAP, > + .address_space = cpu_to_le32(vdomain->id), > + .virt_addr = cpu_to_le64(iova), > + }; > + > + pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size); > + > + /* Callers may unmap after detach, but device already took care of it.
> */ > + if (!vdomain->attached) > + return size; > + > + unmapped = viommu_tlb_unmap(vdomain, iova, size); > + if (unmapped < size) > + return 0; > + > + req.size = cpu_to_le64(unmapped); > + > + ret = viommu_send_req_sync(vdomain->viommu, &req); > + if (ret) > + return 0; > + > + return unmapped; > +} > + > +static size_t viommu_map_sg(struct iommu_domain *domain, unsigned > long iova, > + struct scatterlist *sg, unsigned int nents, int prot) { > + int i, ret; > + int nr_sent; > + size_t mapped; > + size_t min_pagesz; > + size_t total_size; > + struct scatterlist *s; > + unsigned int flags = 0; > + unsigned long cur_iova; > + unsigned long mapped_iova; > + size_t head_size, tail_size; > + struct viommu_request reqs[nents]; > + struct virtio_iommu_req_map map_reqs[nents]; > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + > + if (!vdomain->attached) > + return 0; > + > + pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova); > + > + if (prot & IOMMU_READ) > + flags |= VIRTIO_IOMMU_MAP_F_READ; > + > + if (prot & IOMMU_WRITE) > + flags |= VIRTIO_IOMMU_MAP_F_WRITE; > + > + min_pagesz = 1 << __ffs(domain->pgsize_bitmap); > + tail_size = sizeof(struct virtio_iommu_req_tail); > + head_size = sizeof(*map_reqs) - tail_size; > + > + cur_iova = iova; > + > + for_each_sg(sg, s, nents, i) { > + size_t size = s->length; > + phys_addr_t paddr = sg_phys(s); > + void *tail = (void *)&map_reqs[i] + head_size; > + > + if (!IS_ALIGNED(paddr | size, min_pagesz)) { > + ret = -EFAULT; > + break; > + } > + > + /* TODO: merge physically-contiguous mappings if any */ > + map_reqs[i] = (struct virtio_iommu_req_map) { > + .head.type = VIRTIO_IOMMU_T_MAP, > + .address_space = cpu_to_le32(vdomain->id), > + .flags = cpu_to_le32(flags), > + .virt_addr = cpu_to_le64(cur_iova), > + .phys_addr = cpu_to_le64(paddr), > + .size = cpu_to_le64(size), > + }; > + > + ret = viommu_tlb_map(vdomain, cur_iova, paddr, size); > + if (ret) > + break; > + > + sg_init_one(&reqs[i].head, 
&map_reqs[i], head_size); > + sg_init_one(&reqs[i].tail, tail, tail_size); > + > + cur_iova += size; > + } > + > + total_size = cur_iova - iova; > + > + if (ret) { > + viommu_tlb_unmap(vdomain, iova, total_size); > + return 0; > + } > + > + ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, > &nr_sent); > + > + if (nr_sent != nents) > + goto err_rollback; > + > + for (i = 0; i < nents; i++) { > + if (!reqs[i].written || map_reqs[i].tail.status) > + goto err_rollback; > + } > + > + return total_size; > + > +err_rollback: > + /* > + * Any request in the range might have failed. Unmap what was > + * successful. > + */ > + cur_iova = iova; > + mapped_iova = iova; > + mapped = 0; > + for_each_sg(sg, s, nents, i) { > + size_t size = s->length; > + > + cur_iova += size; > + > + if (!reqs[i].written || map_reqs[i].tail.status) { > + if (mapped) > + viommu_unmap(domain, mapped_iova, > mapped); > + > + mapped_iova = cur_iova; > + mapped = 0; > + } else { > + mapped += size; > + } > + } > + > + viommu_tlb_unmap(vdomain, iova, total_size); > + > + return 0; > +} > + > +static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain, > + dma_addr_t iova) > +{ > + u64 paddr = 0; > + unsigned long flags; > + struct viommu_mapping *mapping; > + struct interval_tree_node *node; > + struct viommu_domain *vdomain = to_viommu_domain(domain); > + > + spin_lock_irqsave(&vdomain->mappings_lock, flags); > + node = interval_tree_iter_first(&vdomain->mappings, iova, iova); > + if (node) { > + mapping = container_of(node, struct viommu_mapping, > iova); > + paddr = mapping->paddr + (iova - mapping->iova.start); > + } > + spin_unlock_irqrestore(&vdomain->mappings_lock, flags); > + > + pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova, > + paddr); > + > + return paddr; > +} > + > +static struct iommu_ops viommu_ops; > +static struct virtio_driver virtio_iommu_drv; > + > +static int viommu_match_node(struct device *dev, void *data) { > + return dev->parent->fwnode == data; 
> +} > + > +static struct viommu_dev *viommu_get_by_fwnode(struct > fwnode_handle > +*fwnode) { > + struct device *dev = driver_find_device(&virtio_iommu_drv.driver, > NULL, > + fwnode, > viommu_match_node); > + put_device(dev); > + > + return dev ? dev_to_virtio(dev)->priv : NULL; } > + > +static int viommu_add_device(struct device *dev) { > + struct iommu_group *group; > + struct viommu_endpoint *vdev; > + struct viommu_dev *viommu = NULL; > + struct iommu_fwspec *fwspec = dev->iommu_fwspec; > + > + if (!fwspec || fwspec->ops != &viommu_ops) > + return -ENODEV; > + > + viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode); > + if (!viommu) > + return -ENODEV; > + > + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); > + if (!vdev) > + return -ENOMEM; > + > + vdev->viommu = viommu; > + fwspec->iommu_priv = vdev; > + > + /* > + * Last step creates a default domain and attaches to it. Everything > + * must be ready. > + */ > + group = iommu_group_get_for_dev(dev); > + > + return PTR_ERR_OR_ZERO(group); > +} > + > +static void viommu_remove_device(struct device *dev) { > + kfree(dev->iommu_fwspec->iommu_priv); > +} > + > +static struct iommu_group * > +viommu_device_group(struct device *dev) { > + if (dev_is_pci(dev)) > + return pci_device_group(dev); > + else > + return generic_device_group(dev); > +} > + > +static int viommu_of_xlate(struct device *dev, struct of_phandle_args > +*args) { > + u32 *id = args->args; > + > + dev_dbg(dev, "of_xlate 0x%x\n", *id); > + return iommu_fwspec_add_ids(dev, args->args, 1); } > + > +/* > + * (Maybe) temporary hack for device pass-through into guest userspace. > +On ARM > + * with an ITS, VFIO will look for a region where to map the doorbell, > +even > + * though the virtual doorbell is never written to by the device, and > +instead > + * the host injects interrupts directly. TODO: sort this out in VFIO. 
> + */ > +#define MSI_IOVA_BASE 0x8000000 > +#define MSI_IOVA_LENGTH 0x100000 > + > +static void viommu_get_resv_regions(struct device *dev, struct > +list_head *head) { > + struct iommu_resv_region *region; > + int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO; > + > + region = iommu_alloc_resv_region(MSI_IOVA_BASE, > MSI_IOVA_LENGTH, prot, > + IOMMU_RESV_MSI); > + if (!region) > + return; > + > + list_add_tail(&region->list, head); > +} > + > +static void viommu_put_resv_regions(struct device *dev, struct > +list_head *head) { > + struct iommu_resv_region *entry, *next; > + > + list_for_each_entry_safe(entry, next, head, list) > + kfree(entry); > +} > + > +static struct iommu_ops viommu_ops = { > + .capable = viommu_capable, > + .domain_alloc = viommu_domain_alloc, > + .domain_free = viommu_domain_free, > + .attach_dev = viommu_attach_dev, > + .map = viommu_map, > + .unmap = viommu_unmap, > + .map_sg = viommu_map_sg, > + .iova_to_phys = viommu_iova_to_phys, > + .add_device = viommu_add_device, > + .remove_device = viommu_remove_device, > + .device_group = viommu_device_group, > + .of_xlate = viommu_of_xlate, > + .get_resv_regions = viommu_get_resv_regions, > + .put_resv_regions = viommu_put_resv_regions, > +}; > + > +static int viommu_init_vq(struct viommu_dev *viommu) { > + struct virtio_device *vdev = dev_to_virtio(viommu->dev); > + vq_callback_t *callback = NULL; > + const char *name = "request"; > + int ret; > + > + ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback, > + &name, NULL); > + if (ret) > + dev_err(viommu->dev, "cannot find VQ\n"); > + > + return ret; > +} > + > +static int viommu_probe(struct virtio_device *vdev) { > + struct device *parent_dev = vdev->dev.parent; > + struct viommu_dev *viommu = NULL; > + struct device *dev = &vdev->dev; > + int ret; > + > + viommu = kzalloc(sizeof(*viommu), GFP_KERNEL); > + if (!viommu) > + return -ENOMEM; > + > + spin_lock_init(&viommu->vq_lock); > + INIT_LIST_HEAD(&viommu->pending_requests); > +
viommu->dev = dev; > + viommu->vdev = vdev; > + > + ret = viommu_init_vq(viommu); > + if (ret) > + goto err_free_viommu; > + > + virtio_cread(vdev, struct virtio_iommu_config, page_sizes, > + &viommu->pgsize_bitmap); > + > + viommu->aperture_end = -1UL; > + > + virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE, > + struct virtio_iommu_config, input_range.start, > + &viommu->aperture_start); > + > + virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE, > + struct virtio_iommu_config, input_range.end, > + &viommu->aperture_end); > + > + if (!viommu->pgsize_bitmap) { > + ret = -EINVAL; > + goto err_free_viommu; > + } > + > + viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap; > + > + /* > + * Not strictly necessary, virtio would enable it later. This allows to > + * start using the request queue early. > + */ > + virtio_device_ready(vdev); > + > + ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s", > + virtio_bus_name(vdev)); > + if (ret) > + goto err_free_viommu; > + > + iommu_device_set_ops(&viommu->iommu, &viommu_ops); > + iommu_device_set_fwnode(&viommu->iommu, parent_dev- > >fwnode); > + > + iommu_device_register(&viommu->iommu); > + > +#ifdef CONFIG_PCI > + if (pci_bus_type.iommu_ops != &viommu_ops) { > + pci_request_acs(); > + ret = bus_set_iommu(&pci_bus_type, &viommu_ops); > + if (ret) > + goto err_unregister; > + } > +#endif > +#ifdef CONFIG_ARM_AMBA > + if (amba_bustype.iommu_ops != &viommu_ops) { > + ret = bus_set_iommu(&amba_bustype, &viommu_ops); > + if (ret) > + goto err_unregister; > + } > +#endif > + if (platform_bus_type.iommu_ops != &viommu_ops) { > + ret = bus_set_iommu(&platform_bus_type, &viommu_ops); > + if (ret) > + goto err_unregister; > + } > + > + vdev->priv = viommu; > + > + dev_info(viommu->dev, "probe successful\n"); > + > + return 0; > + > +err_unregister: > + iommu_device_unregister(&viommu->iommu); > + > +err_free_viommu: > + kfree(viommu); > + > + return ret; > +} > + > +static void viommu_remove(struct virtio_device 
*vdev) { > + struct viommu_dev *viommu = vdev->priv; > + > + iommu_device_unregister(&viommu->iommu); > + kfree(viommu); > + > + dev_info(&vdev->dev, "device removed\n"); } > + > +static void viommu_config_changed(struct virtio_device *vdev) { > + dev_warn(&vdev->dev, "config changed\n"); } > + > +static unsigned int features[] = { > + VIRTIO_IOMMU_F_INPUT_RANGE, > +}; > + > +static struct virtio_device_id id_table[] = { > + { VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID }, > + { 0 }, > +}; > + > +static struct virtio_driver virtio_iommu_drv = { > + .driver.name = KBUILD_MODNAME, > + .driver.owner = THIS_MODULE, > + .id_table = id_table, > + .feature_table = features, > + .feature_table_size = ARRAY_SIZE(features), > + .probe = viommu_probe, > + .remove = viommu_remove, > + .config_changed = viommu_config_changed, > +}; > + > +module_virtio_driver(virtio_iommu_drv); > + > +IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL); > + > +MODULE_DESCRIPTION("virtio-iommu driver"); MODULE_AUTHOR("Jean- > Philippe > +Brucker <jean-philippe.brucker at arm.com>"); > +MODULE_LICENSE("GPL v2"); > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index > 1f25c86374ad..c0cb0f173258 100644 > --- a/include/uapi/linux/Kbuild > +++ b/include/uapi/linux/Kbuild > @@ -467,6 +467,7 @@ header-y += virtio_console.h header-y += > virtio_gpu.h header-y += virtio_ids.h header-y += virtio_input.h > +header-y += virtio_iommu.h > header-y += virtio_mmio.h > header-y += virtio_net.h > header-y += virtio_pci.h > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h > index 6d5c3b2d4f4d..934ed3d3cd3f 100644 > --- a/include/uapi/linux/virtio_ids.h > +++ b/include/uapi/linux/virtio_ids.h > @@ -43,5 +43,6 @@ > #define VIRTIO_ID_INPUT 18 /* virtio input */ > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */ > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */ > +#define VIRTIO_ID_IOMMU 61216 /* virtio IOMMU (temporary) */ > > #endif /* _LINUX_VIRTIO_IDS_H */ > diff --git
a/include/uapi/linux/virtio_iommu.h > b/include/uapi/linux/virtio_iommu.h > new file mode 100644 > index 000000000000..ec74c9a727d4 > --- /dev/null > +++ b/include/uapi/linux/virtio_iommu.h > @@ -0,0 +1,142 @@ > +/* > + * Copyright (C) 2017 ARM Ltd. > + * > + * This header is BSD licensed so anyone can use the definitions > + * to implement compatible drivers/servers: > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions > + * are met: > + * 1. Redistributions of source code must retain the above copyright > + * notice, this list of conditions and the following disclaimer. > + * 2. Redistributions in binary form must reproduce the above copyright > + * notice, this list of conditions and the following disclaimer in the > + * documentation and/or other materials provided with the distribution. > + * 3. Neither the name of ARM Ltd. nor the names of its contributors > + * may be used to endorse or promote products derived from this > software > + * without specific prior written permission. > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND > CONTRIBUTORS > + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT > NOT > + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND > FITNESS > + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL IBM > OR > + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, > + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT > NOT > + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS > OF > + * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER > CAUSED AND > + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, > + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY > OUT > + * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > + * SUCH DAMAGE. 
> + */ > +#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H > +#define _UAPI_LINUX_VIRTIO_IOMMU_H > + > +/* Feature bits */ > +#define VIRTIO_IOMMU_F_INPUT_RANGE 0 > +#define VIRTIO_IOMMU_F_IOASID_BITS 1 > +#define VIRTIO_IOMMU_F_MAP_UNMAP 2 > +#define VIRTIO_IOMMU_F_BYPASS 3 > + > +__packed > +struct virtio_iommu_config { > + /* Supported page sizes */ > + __u64 page_sizes; > + struct virtio_iommu_range { > + __u64 start; > + __u64 end; > + } input_range; > + __u8 ioasid_bits; > +}; > + > +/* Request types */ > +#define VIRTIO_IOMMU_T_ATTACH 0x01 > +#define VIRTIO_IOMMU_T_DETACH 0x02 > +#define VIRTIO_IOMMU_T_MAP 0x03 > +#define VIRTIO_IOMMU_T_UNMAP 0x04 > + > +/* Status types */ > +#define VIRTIO_IOMMU_S_OK 0x00 > +#define VIRTIO_IOMMU_S_IOERR 0x01 > +#define VIRTIO_IOMMU_S_UNSUPP 0x02 > +#define VIRTIO_IOMMU_S_DEVERR 0x03 > +#define VIRTIO_IOMMU_S_INVAL 0x04 > +#define VIRTIO_IOMMU_S_RANGE 0x05 > +#define VIRTIO_IOMMU_S_NOENT 0x06 > +#define VIRTIO_IOMMU_S_FAULT 0x07 > + > +__packed > +struct virtio_iommu_req_head { > + __u8 type; > + __u8 reserved[3]; > +}; > + > +__packed > +struct virtio_iommu_req_tail { > + __u8 status; > + __u8 reserved[3]; > +}; > + > +__packed > +struct virtio_iommu_req_attach { > + struct virtio_iommu_req_head head; > + > + __le32 address_space; > + __le32 device; > + __le32 reserved; > + > + struct virtio_iommu_req_tail tail; > +}; > + > +__packed > +struct virtio_iommu_req_detach { > + struct virtio_iommu_req_head head; > + > + __le32 device; > + __le32 reserved; > + > + struct virtio_iommu_req_tail tail; > +}; > + > +#define VIRTIO_IOMMU_MAP_F_READ (1 << 0) > +#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1) > +#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2) > + > +#define VIRTIO_IOMMU_MAP_F_MASK > (VIRTIO_IOMMU_MAP_F_READ | \ > + > VIRTIO_IOMMU_MAP_F_WRITE | \ > + > VIRTIO_IOMMU_MAP_F_EXEC) > + > +__packed > +struct virtio_iommu_req_map { > + struct virtio_iommu_req_head head; > + > + __le32 address_space; > + __le32 flags; > + __le64 virt_addr; > + __le64 
phys_addr; > + __le64 size; > + > + struct virtio_iommu_req_tail tail; > +}; > + > +__packed > +struct virtio_iommu_req_unmap { > + struct virtio_iommu_req_head head; > + > + __le32 address_space; > + __le32 flags; > + __le64 virt_addr; > + __le64 size; > + > + struct virtio_iommu_req_tail tail; > +}; > + > +union virtio_iommu_req { > + struct virtio_iommu_req_head head; > + > + struct virtio_iommu_req_attach attach; > + struct virtio_iommu_req_detach detach; > + struct virtio_iommu_req_map map; > + struct virtio_iommu_req_unmap unmap; > +}; > + > +#endif > -- > 2.12.1
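
For anyone following the question about one physical address showing up under two virtual addresses: in the quoted driver, viommu_tlb_map() records each mapping keyed by its IOVA range only, and viommu_iova_to_phys() resolves an address as the mapping's base paddr plus the offset into the range. Nothing in that bookkeeping compares physical addresses, so two IOVA ranges backed by the same page are accepted as two independent entries. Whether that is what the trace shows depends on the callers, which this thread does not cover. Below is a minimal userspace sketch of that bookkeeping, under stated assumptions: the demo_* names are hypothetical, and a flat array stands in for the kernel interval tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the driver's struct viommu_mapping. */
struct demo_mapping {
	uint64_t iova_start;	/* first IOVA covered by the mapping */
	uint64_t iova_last;	/* last IOVA covered (inclusive) */
	uint64_t paddr;		/* physical address backing iova_start */
};

#define DEMO_MAX_MAPPINGS 16

/* Assumption: a flat array replaces the interval tree used by the patch. */
struct demo_domain {
	struct demo_mapping maps[DEMO_MAX_MAPPINGS];
	size_t nr;
};

/*
 * Record a mapping keyed by IOVA range only. Physical addresses are
 * never compared, so the same paddr may back several IOVA ranges.
 * Returns 0 on success, -1 if the size is zero or the table is full.
 */
static int demo_map(struct demo_domain *d, uint64_t iova, uint64_t paddr,
		    uint64_t size)
{
	assert(d != NULL);

	if (size == 0 || d->nr == DEMO_MAX_MAPPINGS)
		return -1;

	d->maps[d->nr++] = (struct demo_mapping) {
		.iova_start = iova,
		.iova_last = iova + size - 1,
		.paddr = paddr,
	};
	return 0;
}

/*
 * Same arithmetic as viommu_iova_to_phys() in the patch: base paddr
 * plus the offset into the range. Returns 0 when nothing is mapped.
 */
static uint64_t demo_iova_to_phys(const struct demo_domain *d, uint64_t iova)
{
	size_t i;

	assert(d != NULL);

	for (i = 0; i < d->nr; i++) {
		const struct demo_mapping *m = &d->maps[i];

		if (iova >= m->iova_start && iova <= m->iova_last)
			return m->paddr + (iova - m->iova_start);
	}
	return 0;
}
```

Calling demo_map() twice with different IOVAs and the same paddr yields two entries, and demo_iova_to_phys() then resolves both IOVAs to the same physical address. That matches what the "map" debug print would show for such a pair of requests.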