This is the continuation of my proposal for virtio-iommu, the para- virtualized IOMMU. Here is a summary of the changes since last time [1]: * The virtio-iommu document now resembles an actual specification. It is split into a formal description of the virtio device, and implementation notes. Please find sources and binaries at [2]. * Added a probe request to describe to the guest different properties that do not fit in firmware or in the virtio config space. This is a necessary stepping stone for extending the virtio-iommu. * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat Bhushan. You can find the Linux driver and kvmtool device at [4] and [5]. I plan to rework driver and kvmtool device slightly before sending the patches. To understand the virtio-iommu, I advise to first read introduction and motivation, then skim through implementation notes and finally look at the device specification. I wasn't sure how to organize the review. For those who prefer to comment inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex to this thread. They are the biggest chunks of the document. But LaTeX isn't very pleasant to read, so you can simply send a list of comments in relation to section numbers and a few words of context, we'll manage. --- Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare more easily differences since the RFC (see [6]), but haven't been made public so far. This is the first public posting since initial proposal [1], and the following describes all changes. ## v0.1 ## Content is the same as the RFC, but formatted to LaTeX. 'make' generates one PDF and one HTML document. ## v0.2 ## Add introductions, improve topology example and firmware description based on feedback and a number of useful discussions. ## v0.3 ## Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten the device and driver behaviour. Unmap semantics are consolidated; they are now closer to VFIO Type1 v2 semantics. ## v0.4 ## Introduce PROBE requests. They provide per-endpoint information to the driver that couldn't be described otherwise. For the moment, they allow to handle MSIs on x86 virtual platforms (see 3.2). To do that we communicate reserved IOVA regions, that will also be useful for describing regions that cannot be mapped for a given endpoint, for instance addresses that correspond to a PCI bridge window. Introducing such a large framework for this tiny feature may seem overkill, but it is needed for future extensions of the virtio-iommu and I believe it really is worth the effort. ## Future ## Other extensions are in preparation. I won't detail them here because v0.4 already is a lot to digest, but in short, building on top of PROBE: * First, since the IOMMU is paravirtualized, the device can expose some properties of the physical topology to the guest, and let it allocate resources more efficiently. For example, when the virtio-iommu manages both physical and emulated endpoints, with different underlying IOMMUs, we now have a way to describe multiple page and block granularities, instead of forcing the guest to use the most restricted one for all endpoints. This will most likely be in v0.5. * Then on top of that, a major improvement will describe hardware acceleration features available to the guest. There is what I call "Page Table Handover" (or simply, from the host POV, "Nested"), the ability for the guest to manipulate its own page tables instead of sending MAP/UNMAP requests to the host. This, along with IO Page Fault reporting, will also permit SVM virtualization on different platforms. Thanks, Jean [1] http://www.spinics.net/lists/kvm/msg147990.html [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf I reiterate the disclaimers: don't use this document as a reference, it's a draft. It's also not an OASIS document yet. It may be riddled with mistakes. As this is a working draft, it is unstable and I do not guarantee backward compatibility of future versions. [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4 Warning: UAPI headers have changed! They didn't follow the spec, please update. (Use branch v0.1, that has the old headers, for the Qemu prototype [3]) [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4 Warning: command-line has changed! Use --viommu vfio[,opts] and --viommu virtio[,opts] to instantiate a device. [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
The following is roughly the content of device-operations.tex --- \section{IOMMU device}\label{sec:Device Types / IOMMU Device} The virtio-iommu device manages Direct Memory Access (DMA) from one or more endpoints. It may act as a proxy for multiple physical IOMMUs managing devices assigned to the guest, and as standalone IOMMU for virtual devices. The driver first discovers endpoints managed by the virtio-iommu device using standard firmware mechanisms. It then sends requests to create virtual address spaces and virtual-to-physical mappings for these endpoints. In its simplest form, the virtio-iommu supports four request types: \begin{enumerate} \item Create an address space and attach an endpoint to it. \\ \texttt{attach(device = 0x104, address space = 1)} \item Create a mapping between a range of guest-virtual and guest-physical address. \\ \texttt{map(address space = 1, virt = 0x1000, phys = 0xa000, size = 0x1000, flags = READ)} Endpoint 0x104, for example a hardware PCI endpoint, can now read at addresses 0x1000-0x1fff. These accesses are translated into system-physical addresses by the IOMMU. \item Remove the mapping.\\ \texttt{unmap(address space = 1, virt = 0x1000, size = 0x1000)} Any access to addresses 0x1000-0x1fff by endpoint 0x104 would now be rejected. \item Detach the device and remove the address space.\\ \texttt{detach(device = 0x104)} \end{enumerate} \subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID} TBD. During development, use 61216. \subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues} \begin{description} \item[0] requestq \end{description} \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \begin{description} \item[VIRTIO_IOMMU_F_INPUT_RANGE (0)] Available range of virtual addresses is described in \field{input_range} \item[VIRTIO_IOMMU_F_IOASID_BITS (1)] The number of address spaces supported is described in \field{ioasid_bits} \item[VIRTIO_IOMMU_F_MAP_UNMAP (2)] Map and unmap requests are available.\footnote{Future extensions may add different modes of operations. At the moment, only VIRTIO_IOMMU_F_MAP_UNMAP is supported.} \item[VIRTIO_IOMMU_F_BYPASS (3)] When not attached to an address space, endpoints downstream of the IOMMU can access the guest-physical address space. \item[VIRTIO_IOMMU_F_PROBE (4)] Probe request is available. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE, VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and VIRTIO_IOMMU_F_PROBE feature bits if offered by the device. % XXX F_MAP_UNMAP will be optional when introducing PTH. But a 0.2 driver % must implement it (otherwise the device is useless) \devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE, VIRTIO_IOMMU_F_IOASID_BITS or VIRTIO_IOMMU_F_PROBE feature bits, and if the driver did not accept this feature bit, then the device MAY signal failure by failing to set FEATURES_OK \field{device status} bit when the driver writes it. If the device offers the VIRTIO_IOMMU_F_MAP_UNMAP feature bit, and if the driver did not accept this feature bit, then the device SHOULD behave as if the feature was negotiated. % This takes into account all the following "If the F_MAP_UNMAP feature % was negotiated..." % If the driver supports F_PTH but not F_MAP_UNMAP, then the driver MUST % give up upon seeing that the 0.2 device doesn't support F_PTH. If it % supports F_MAP_UNMAP, then it SHOULD use F_MAP_UNMAP (this is described % in "Driver Requirements: Feature Bits") % If the driver supports F_PTH but the device doesn't, and the driver % stupidly sets F_PTH but not F_MAP_UNMAP, then the device SHOULD reject % any PTH request and PTH flag in attach. % When not using the legacy interface, if the driver doesn't negotiate % F_MAP_UNMAP, then the device may disable it an reject any request. \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits / Legacy Interface: Feature bits} When using the legacy interface, transitional devices MUST support guests which do not negotiate any of the VIRTIO_IOMMU_F_INPUT_RANGE, VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP or VIRTIO_IOMMU_F_PROBE features, and MUST behave as if the feature was negotiated. %\subsection{Feature Bits Requirements}\label{sec:Device Types / IOMMU Device / Feature bits requirements} \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout} The \field{page_size_mask} field is always present. Availability of the others depend on various feature bits as indicated above. \begin{lstlisting} struct virtio_iommu_config { u64 page_size_mask; struct virtio_iommu_range { u64 start; u64 end; } input_range; u8 ioasid_bits; u8 padding[3]; u32 probe_size; }; \end{lstlisting} \drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout} The driver MUST NOT write to device configuration fields. \devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout} The device SHOULD set \field{padding} to zero. The device MUST set at least one bit in \field{page_size_mask}, describing the page granularity. The device MAY set more than one bit in \field{page_size_mask}. \subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout / Legacy Interface: Device configuration layout} When using the legacy interface, transitional devices and drivers MUST format the fields in struct virtio_iommu_config according to the native endian of the guest rather than (necessarily when not using the legacy interface) little-endian. \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Device initialization} When the device is reset, endpoints are not attached to any address space. If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can access guest-physical addresses ("bypass mode"). If the feature is not negotiated, then any memory access from endpoints will fault. Upon attaching an endpoint in bypass mode to a new address space, any memory access from the endpoint will fault, since the address space does not contain any mapping. The driver chooses operating mode depending on its capabilities. In this revision of the virtio-iommu specification, the only supported mode is VIRTIO_IOMMU_F_MAP_UNMAP. \drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization} The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests. If the VIRTIO_IOMMU_F_PROBE feature is offered, the driver SHOULD send a VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the endpoint to an address space. \devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization} If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the device SHOULD NOT let endpoints access the guest-physical address space. However, the device MAY let endpoints access memory regions negotiated with VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (see \ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}). \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations} Driver send requests on the request virtqueue, notifies the device and waits for the device to return the request with a status in the used ring. All requests are split in two parts: one device-readable, one device- writable. Each request is therefore described with at least two descriptors, as illustrated below. \begin{figure}[htb] \centering \includegraphics[width=0.7\textwidth]{img/request-wrapping.png} \caption{Anatomy of a virtio-iommu request} \end{figure} \begin{lstlisting} struct virtio_iommu_req_head { u8 type; u8 reserved[3]; }; struct virtio_iommu_req_tail { u8 status; u8 reserved[3]; }; \end{lstlisting} \rfc{% TODO There is a problem with framing using multiple chains... If a request isn't recognized by the device and is scattered across multiple descriptor chains, how would the device know where this request ends and where the next one begins? It assumes that a transition from WO descriptors to RO descriptors is a new request, ok. But if this unknown request is from a future extension that uses interleaved RO, WO, RO, WO descriptors for a single request, then the device is doomed.\\ The extension will have to force that third buffer to start with 0 so our device, that thinks it's a new request, doesn't recognize the request type and waits for the next one.\\ Alternatively, we could add a 16-bit 'size' field into virtio_iommu_req_head. We'd loose some space for future flags, and force requests to be at most 64k, or 256k if we drop size bits [1:0]. (head couldn't be extended in the future, with a field to describe a bigger size, because our legacy device here would ignore it.) We'd have to resort to multiple "container" requests transporting a single big one.\\ Personally I think we can let future extensions deal with interleaved RO/WO/RO descriptors, but do you think a 'size' field would be better? } Type may be one of: \begin{lstlisting} #define VIRTIO_IOMMU_T_ATTACH 1 #define VIRTIO_IOMMU_T_DETACH 2 #define VIRTIO_IOMMU_T_MAP 3 #define VIRTIO_IOMMU_T_UNMAP 4 #define VIRTIO_IOMMU_T_PROBE 5 \end{lstlisting} A few general-purpose status codes are defined here. Unless explicitly described in a \textbf{Requirements} section, these values are hints to make troubleshooting easier. When the device fails to parse a request, for instance if a request seems too small for its type and the device cannot find the tail, then it will be unable to set \field{status}. In that case, it should return the buffers without writing in them. \begin{lstlisting} /* All good! Carry on. */ #define VIRTIO_IOMMU_S_OK 0 /* Virtio communication error */ #define VIRTIO_IOMMU_S_IOERR 1 /* Unsupported request */ #define VIRTIO_IOMMU_S_UNSUPP 2 /* Internal device error */ #define VIRTIO_IOMMU_S_DEVERR 3 /* Invalid parameters */ #define VIRTIO_IOMMU_S_INVAL 4 /* Out-of-range parameters */ #define VIRTIO_IOMMU_S_RANGE 5 /* Entry not found */ #define VIRTIO_IOMMU_S_NOENT 6 /* Bad address */ #define VIRTIO_IOMMU_S_FAULT 7 \end{lstlisting} Range limits of some request fields are described in the device configuration: \begin{itemize} \item \field{page_size_mask} contains the bitmask of all page sizes that can be mapped. The least significant bit set defines the page granularity of IOMMU mappings. Other bits in the mask are hints describing page sizes that the IOMMU can merge into a single mapping (page blocks). The smallest page granularity supported by the IOMMU is one byte. It is legal for the driver to map one byte at a time if bit 0 of \field{page_size_mask} is set. \item If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, \field{ioasid_bits} contains the number of bits supported in an I/O Address Space ID, the identifier used in most requests. A value of 0 is valid, and means that a single address space is supported. If the feature is not negotiated, address space identifiers can use up to 32 bits. \item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, \field{input_range} contains the virtual address range that the IOMMU is able to translate. Any mapping request to virtual addresses outside of this range will fail. If the feature is not negotiated, virtual mappings span over the whole 64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff}) \end{itemize} \drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations} The driver SHOULD set reserved fields of the head and the tail of a request to zero. When a device returns a complete request in the used queue without having written to it, the driver SHOULD interpret it as a failure from the device to parse the request. If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the driver SHOULD NOT send requests with \field{virt_addr} less than \field{input_range.start} or greater than \field{input_range.end}. If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, the driver SHOULD NOT send requests with \field{address_space} greater than the size described by \field{ioasid_bits}. % We mandate truncation to allow a future extension X.Y that would store % information in addresses and address space IDs. % % If device is 0.2 and driver is X.Y, then device ignores ext. bits. But % if device is X.Y and device is 0.2, then driver *might* set ext. bits to % garbage. But this extension would be negotiated with a feature bit % anyway. If it's not, then device must assume that driver is 0.2 and must % keep truncating the fields. \devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations} The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request didn't succeed. \footnote{% For IMPLEMENTATION DEFINED values of 'succeed'... For example, virtio_iommu_req_detach.reserved is allowed to be non-zero. If it is non-zero, the device may consider it to be a failure and abort the request. Or it may go on with the detach and return OK.} If a request \field{type} is not recognized, the device SHOULD return the buffers on the used ring and set the \field{len} field of the used element to zero. The device MUST ignore reserved fields of the head and the tail of a request. If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST truncate the range described by \field{virt_addr} and \field{size} in requests to fit in the range described by \field{input_range}. If the VIRTIO_IOMMU_F_IOASID_BITS is offered, the device MUST ignore bits above \field{ioasid_bits} in field \field{address_space} of requests. \subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request} \begin{lstlisting} struct virtio_iommu_req_attach { le32 address_space; le32 device; le32 reserved; }; \end{lstlisting} Attach an endpoint to an address space. \field{address_space} is an identifier unique to the virtio-iommu device. If the address space doesn't exist in the device, it is created. \field{device} is an endpoint identifier unique to the virtio-iommu device. The host communicates unique device IDs to the guest using methods outside the scope of this specification, but the following rules apply: \begin{itemize} \item The device ID is unique from the virtio-iommu point of view. Multiple endpoints whose DMA transactions are not translated by the same virtio-iommu may have the same device ID. Endpoints whose DMA transactions may be translated by the same virtio-iommu must have different device IDs. \item Sometimes the host cannot completely isolate two endpoints from each others. For example on a legacy PCI bus, endpoints can snoop DMA transactions from their neighbours. In this case, the host must communicate to the guest that it cannot isolate these endpoints from each others, or that the physical IOMMU cannot distinguish transactions coming from these endpoints. The method used to communicate this is outside the scope of this specification. \end{itemize} Multiple endpoints may be added to the same address space. An endpoint cannot be attached to multiple address spaces in VIRTIO_IOMMU_F_MAP_UNMAP mode. \drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request} The driver SHOULD set \field{reserved} to zero. The driver SHOULD ensure that endpoints that cannot be isolated by the host are attached to the same address space. \devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request} If the \field{reserved} field of an ATTACH request is not zero, the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD NOT attach the endpoint to the address space. \footnote{The device should validate input of ATTACH requests in case the driver attempts to attach in a mode that is unimplemented by the device, and would be incompatible with the modes implemented by the device.} If the endpoint identified by \field{device} doesn't exist, then the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT. If another endpoint is already attached to the address space identified by \field{address_space}, then the device MAY attach the endpoint identified by \field{device} to the address space. If it cannot do so, the device MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP. If the endpoint identified by \field{device} is already attached to another address space, then the device SHOULD first detach it from that address space and attach it to the one identified by \field{address_space}. In that case the device behaves as if the driver issued a DETACH request with this \field{device}, followed by the ATTACH request. If the device cannot do so, it MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP. \subsubsection{DETACH request} \begin{lstlisting} struct virtio_iommu_req_detach { le32 device; le32 reserved; }; \end{lstlisting} Detach an endpoint from its address space. When this request completes, the endpoint cannot access any mapping from that address space anymore. After all endpoints have been successfully detached from an address space, it ceases to exist and its ID can be reused by the driver for another address space. \drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request} The driver SHOULD set \field{reserved} to zero. \devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request} If the \field{reserved} field of an DETACH request is not zero, the device MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case the device MAY perform the DETACH operation. % If it returns OK, the device SHOULD go on with the detach, as required % by the VIRTIO_IOMMU_S_OK rule. If the endpoint identified by \field{device} doesn't exist, then the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT. If the endpoint identified by \field{device} wasn't attached to any address space, then the device MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL. The device MUST ensure that after being detached from an address space, the endpoint cannot access any mapping from that address space. \subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device operations / MAP request} \begin{lstlisting} struct virtio_iommu_req_map { le32 address_space; le64 phys_addr; le64 virt_addr; le64 size; le32 flags; }; /* Flags are: */ #define VIRTIO_IOMMU_MAP_F_READ (1 << 0) #define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1) #define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2) \end{lstlisting} Map a range of virtually-contiguous addresses to a range of physically-contiguous addresses of the same size. After the request succeeds, all endpoints attached to this address space can access memory in the range $[phys\_addr; phys\_addr + size[$. For example, if an endpoint accesses address $VA \in [virt\_addr; virt\_addr + size[$, the device (or the physical IOMMU) translates the address: $PA = VA - virt\_addr + phys\_addr$. If the access parameters are compatible with \field{flags} (for instance, the access is write and \field{flags} are VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows the access to reach $PA$. The range defined by (\field{virt_addr}, \field{size}) must be within the limits specified by \field{input_range}. The range defined by (\field{phys_addr}, \field{size}) must be within the guest-physical address space. This includes upper and lower limits, as well as any carving of guest-physical addresses for use by the host (for instance MSI doorbells). Guest physical boundaries are set by the host using a firmware mechanism outside the scope of this specification. \begin{note} This format prevents from creating the identity mapping in a single request \texttt{[0x0; 0xfff....fff] $\rightarrow$ [0x0; 0xfff...fff]}, since it would result in a size of zero. Hopefully allowing VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such request. It would also be unlikely to conform to the physical range restrictions from the previous paragraph. \end{note} \begin{note} On flags: it is unlikely that all possible combinations of flags will be supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might be invalid. We do not have a way to advertise supported and implicit (for instance $W \rightarrow R$) flags or combination thereof for the moment, you are free to send any suggestions for describing this. Please keep in mind that we might soon want to add more flags, such as privileged, device, transient, shared, etc. (whatever these would mean). \end{note} This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been negotiated. \drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request} The driver SHOULD set undefined \field{flags} bits to zero. \devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request} If \field{virt_addr}, \field{phys_addr} or \field{size} is not aligned on the page granularity, the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping. If the device doesn't recognize a \field{flags} bit, it SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device SHOULD NOT create the mapping. \footnote{Validating the input is important here, because the driver might be attempting to map with special flags that the device doesn't recognize. Creating the mapping with incompatible flags may introduce a security hazard.} If \field{address_space} does not exist, the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT. \subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device operations / UNMAP request} \begin{lstlisting} struct virtio_iommu_req_unmap { le32 address_space; le64 virt_addr; le64 size; le32 reserved; }; \end{lstlisting} Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here a mapping as a virtual region created with a single MAP request. All mappings covered by the range $[virt\_addr; virt\_addr + size [$ are removed. The semantics of unmapping are specified below, and illustrated with the following requests, assuming each example sequence starts with a blank address space. We define two pseudocode functions \texttt{map(virt\_addr, size) -> mapping} and \texttt{unmap(virt\_addr, size)}. \begin{lstlisting} (1) unmap(addr=0, size=5) -> succeeds, doesn't unmap anything (2) a = map(addr=0, size=10); unmap(0, 10) -> succeeds, unmaps a (3) a = map(0, 5); b = map(5, 5); unmap(0, 10) -> succeeds, unmaps a and b (4) a = map(0, 10); unmap(0, 5) -> faults, doesn't unmap anything (5) a = map(0, 5); b = map(5, 5); unmap(0, 5) -> succeeds, unmaps a (6) a = map(0, 5); unmap(0, 10) -> succeeds, unmaps a (7) a = map(0, 5); b = map(10, 5); unmap(0, 15) -> succeeds, unmaps a and b \end{lstlisting} This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been negotiated. \drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request} The driver SHOULD set the \field{reserved} field to zero. The range, defined by \field{virt_addr} and \field{size}, SHOULD cover one or more contiguous mappings created with MAP requests. The range MAY spill over unmapped virtual addresses. The first address of a range SHOULD either be the first address of a mapping or be outside any mapping. The last address of a range SHOULD either be the last address of a mapping or be outside any mapping. \devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device operations / UNMAP request} If the \field{reserved} field of an UNMAP request is not zero, the device MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case the device MAY perform the UNMAP operation. % If it returns OK, the device SHOULD go on with the unmap, as required by % the VIRTIO_IOMMU_S_OK rule. If \field{address_space} does not exist, the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT. If a mapping affected by the range is not covered in its entirety by the range (the UNMAP request would split the mapping), then the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT remove any mapping. If part of the range or the full range is not covered by an existing mapping, then the device SHOULD remove all mappings affected by the range and set the request \field{status} to VIRTIO_IOMMU_S_OK. \subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device operations / PROBE request} If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu device manages. This probe is performed before attaching the endpoint to an address space. \begin{lstlisting} struct virtio_iommu_req_probe { /* Device-readable */ le32 device; le32 flags; u8 reserved[60]; /* Device-writable when not ACK */ u8 properties[]; }; /* Flags are: */ #define VIRTIO_IOMMU_PROBE_F_ACK (1 << 0) \end{lstlisting} \begin{description} \item[\field{device}] has the same meaning as in ATTACH and DETACH requests. \item[\field{flags}] contain additional information about the request. The VIRTIO_IOMMU_PROBE_F_ACK flag changes the descriptor chain layout: when ACK is clear, the \field{properties} field is device-writable; when it is set, the \field{properties} field is device-readable. \item[\field{reserved}] is used as padding, so that future extensions can add fields to the device-readable part. \item[\field{properties}] contains a list of properties of endpoint \field{device}, filled by the device. This field is exactly \field{probe_size} bytes. Each property is described with a type, four flag bits, a length, and a value: \begin{lstlisting} #define VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK 0xfff #define VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK (1 << 12) struct virtio_iommu_probe_property { le16 type; le16 length; u8 value[]; }; \end{lstlisting} \end{description} The driver allocates a buffer of adequate size for the probe request, writes \field{device} and adds it to the request queue. The device fills the \field{properties} field with a list of properties for this endpoint. The driver parses the first property by reading \field{type}, then \field{length}. If the driver recognizes \field{type}, it reads and handles \field{value}. The driver then reads the next property, that is located $(\field{length} + 4)$ bytes after the beginning of the first one, and so on. The driver parses all properties until it reaches a NONE property or the end of \field{properties}. The upper nibble of property \field{type} is reserved for flags. Therefore only 4096 types are available. The actual type of a property is extracted like this: \begin{lstlisting} u16 type = le16_to_cpu(property.type) & VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK; \end{lstlisting} If a property is correctly understood by the driver, then it sets the ACK bit in \field{type}: \begin{lstlisting} property.type |= cpu_to_le16(VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK); \end{lstlisting} Then, to signal to the device which properties are understood, the device sends the probe again with the VIRTIO_IOMMU_PROBE_F_ACK flag. In all properties understood and accepted by the driver, \field{type} has the VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit set. The other properties are left as is. This second phase of the probe request allows the device to ensure that all properties crucial for good operations are recognized and handled by the driver. This is analogous to the initial feature negotiation of virtio devices: an endpoint property is \emph{offered} by the device to the driver during the first PROBE, and it is \emph{negotiated} after the driver acknowledges it during the second PROBE. Available property types are described in section \ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}. When attaching multiple devices to the same address space, their properties are combined. \emph{Combination Rules} are given for each property, and describe the rules to apply when combining properties obtained during probe. \drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request} The size of \field{properties} MUST be \field{probe_size} bytes. The driver SHOULD set undefined \field{flags} to zero. The driver SHOULD set \field{reserved} to zero. If the driver doesn't recognize the \field{type} of a property, it SHOULD ignore the property and continue parsing the list. The driver SHOULD NOT deduce the property length from \field{type}. If the driver recognizes a property \field{type} and is able to handle{\footnotemark} the property, then the driver SHOULD set the VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit of that property. \footnotetext{A driver's ability to handle a property depends on the property type. Without a specific definition of the ACK requirements for a given property type, it simply means that the driver read all fields of that property.} The driver SHOULD resend the PROBE request with the VIRTIO_IOMMU_PROBE_F_ACK bit set after parsing and updating the \field{properties} list. Depending on the properties encountered in the list, the driver MAY modify some of their fields between the first and second probe, but it SHOULD NOT modify the \field{length} field or bits [11:0] of field \field{type}. \devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device operations / PROBE request} If an undefined bit is set in \field{flags}, the device MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL. If the \field{reserved} field of a PROBE request is not zero, the device MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL. If the endpoint identified by \field{device} doesn't exist, then the device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT. If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return the buffers on the used ring and set the \field{len} field of the used element to zero. The device SHOULD set bits [15:13] of property \field{type} to zero. The device MUST write the size of \field{value}, in bytes, into \field{length}. When two properties follow each others, the device MUST put the second property exactly $(\field{length} + 4)$ bytes after the beginning of the first one. If the device doesn't fill all \field{probe_size} bytes with properties, it SHOULD terminate the list with a property of type NONE and size 0. The device MAY fill the remaining bytes of \field{properties}, if any, with zeroes. If there isn't enough space remaining in \field{properties} to terminate the list with a complete NONE property (4 bytes), then the device SHOULD fill the remaining bytes with zeroes. If the PROBE request has VIRTIO_IOMMU_PROBE_F_ACK bit set, the device MAY ignore the request and set the request \field{status} to VIRTIO_IOMMU_S_OK. \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties} \begin{lstlisting} #define VIRTIO_IOMMU_PROBE_T_NONE 0 #define VIRTIO_IOMMU_PROBE_T_RESV_MEM 2 \end{lstlisting} \paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / NONE} Marks the end of the property list. This property doesn't have any value, and should have \field{length} 0. \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM} The RESV_MEM property describes a chunk of reserved virtual memory. It may be used by the device to describe virtual address ranges that shouldn't be allocated by the driver, or that are special. \begin{lstlisting} struct virtio_iommu_probe_resv_mem { u8 subtype; u8 reserved[3]; le64 addr; le64 size; le32 flags; }; \end{lstlisting} Fields \field{addr} and \field{size} describe the range of reserved addresses. \field{subtype} may be one of: \begin{description} \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT (0)] Accesses to this region are aborted. This subtype does not accept any flag. \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (1)] Accesses to this region behave as if the IOMMU was bypassed, and reach the bus upstream of the IOMMU untranslated. The following \field{flags} are defined for BYPASS regions: \begin{description} \item[VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI (1)] Provides a hint to the guest that this is a doorbell for Message Signaled Interrupts. If the device doesn't provide such a region, then MSIs are normal write accesses from the IOMMU point of view, and arbitrary virtual addresses should be allocated by the driver to map MSI doorbells. Otherwise, the guest should use the guest-physical doorbell address when programming MSIs for this endpoint. \end{description} %\item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_IDENTITY (2)] % This region should be identity-mapped by the guest. TODO: is this % useful for anyone? \end{description} \propcombination{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM} Multiple overlapping RESV_MEM properties are merged together. Difference in subtype on the intersecting range doesn't make a difference from the driver point of view. \drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM} The driver SHOULD NOT map any virtual address described by a VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT or VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS property. % An old driver that doesn't find or understand this property will % allocate and map virtual addresses. We really can't do anything about % that. We're not introducing a regression, MSIs never worked for x86 % before we introduced the F_MSI flag. The driver SHOULD ignore \field{reserved}. For a given \field{subtype}, the driver SHOULD ignore undefined \field{flags} bits. The driver SHOULD treat any \field{subtype} it doesn't recognize as if it was VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT. \devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM} The device SHOULD set \field{reserved} to zero. For a given \field{subtype}, the device SHOULD set undefined \field{flags} bits to zero. The device MAY abort any transaction targeting a VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT region. If an endpoint is attached to an address space, the device SHOULD leave any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS regions pass through untranslated. In other words, the device SHOULD handle such a region as if it was identity-mapped (virtual address equal to physical address). If the endpoint is not attached to any address space, then the device MAY abort the transaction. The device MAY abort any transaction that isn't a write access and that targets a VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS region with flag VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI.
Jean-Philippe Brucker
2017-Aug-04 18:19 UTC
[RFC] virtio-iommu v0.4 - Implementation notes
The following is roughly the content of topology.tex and MSI.tex --- \section{Implementation notes}\label{sec:viommu} \subsection{Virtual system topology}\label{sec:viommu / Virtual topology} \subsubsection{Example virtual topology}\label{sec:viommu / Virtual topology / Example} \begin{figure}[htb] \centering \includegraphics[width=\textwidth]{img/virtual-topology.png} \caption{An example IOMMU topology} \label{fig:viommu / Virtual topology / Topology} \end{figure} Diagram~\ref{fig:viommu / Virtual topology / Topology} shows an example system topology centered around an IOMMU. On the left, the IOMMU manages traffic from two PCI root complexes. On the right, the IOMMU manages traffic from three platform devices (or "integrated devices"). Within a PCI domain, devices are identified by a Requester ID. It is a 16-bit identifier also called Bus/Device/Function (BDF). In a BDF, Bus is 8 bits, Device is 5 bits, and Function is 3 bits. The bottom PCI domain has four endpoints connected to the root complex via two bridges. The first endpoint is identified by BDF 01:04.0. On the other bus, the first endpoint is identified by BDF 02:00.0, and the two other endpoints are two functions of the same device, identified by BDFs 02:01.0 and 02:01.1. The bridges and the root complex may also issue transactions with BDFs 00:00.0, 00:01.0 and 00:02.0. In order for the IOMMU to differentiate devices in multiple PCI domains, the root bridge expands the BDF with a domain ID. In example \ref{fig:viommu / Virtual topology / Topology}, the PCI domain on top gets ID 0 and the one on the bottom gets ID 1. Therefore when reaching the IOMMU, a transaction coming from endpoint 01:04.0 (= 0x0120) is identified by Device ID 0x10120. We define here "platform" devices as endpoints that are on the system bus, as opposed to behind a PCI host bridge. Unlike PCI devices, platform devices do not have a standardized identifier scheme to be used with the IOMMU. Their Device IDs are chosen arbitrarily during system integration in such way that they don't overlap PCI domains or each others. \subsubsection{Firmware description}\label{sec:viommu / Virtual topology / Firmware description} The host describes the relation between IOMMU and devices to the guest using either device-tree or ACPI. Topology description is outside the scope of virtio-iommu, because the virtio-iommu does not and should not need to know about vendor-specific buses. The virtual IOMMU identifies each virtual endpoint with an abstract 32-bit ID, that is called "Device ID" in this document\footnote{Other IOMMU architectures use different names, such as "stream ID" on ARM SMMU or "source ID" on Intel VT-d}. Device IDs are not necessarily unique system-wide, but they should not overlap within a single virtio-iommu. Device IDs of physical endpoints do not need to match IDs seen by the physical IOMMU. We strongly advise to implement the virtio-iommu using virtio-mmio transport. Nothing prevents an implementation to use virtio-pci instead, but existing firmware interfaces do not easily allow to describe an IOMMU $\leftrightarrow$ master relations between PCI endpoints. Device models in Operating Systems might not be designed to support such complicated system. Device-tree offers a way to describe the IOMMU topology for PCI and platform devices. Here's an excerpt of the device-tree describing examples \ref{fig:viommu / Virtual topology / Topology}. \begin{lstlisting} /* The virtual IOMMU is described with a virtio-mmio node */ viommu: virtio at 9050000 { compatible = "virtio,mmio"; reg = <0x09050000 0x200>; dma-coherent; interrupts = <0x0 0x5 0x1>; #iommu-cells = <1> }; /* PCI domain 0 */ pcie at 3eff0000 { ... /* Identity map */ iommu-map = <0x0 &viommu 0x0 0x10000>; }; /* PCI domain 1 */ pcie at 3f000000 { ... /* Linear map: deviceID = RID + 0x10000 */ iommu-map = <0x0 &viommu 0x10000 0x10000>; }; someplatformdevice at a000000 { ... iommus = <&viommu 0x20000>; }; \end{lstlisting} For more details, please refer to \hyperref[intro:IOMMU DT Bindings]{[IOMMU DT]}. In ACPI, the plan would be to add a new node type to the IO Remapping Table specification \hyperref[intro:ACPI IORT]{[ACPI IORT]}, that provides a mechanism similar to DT for describing IOMMU topology. The OS would parse the IORT table to build a map of ID relations between IOMMU and devices. ID Array is used to find correspondence between IOMMU IDs and PCI or platform devices. Later on, the virtio-iommu driver finds the associated LNRO0005 descriptor via the "Device object name" field, and probes the virtio device to find out more about its capabilities. Since all properties of the IOMMU will be obtained during virtio probing, the IORT node can stay simple. The following table shows the possible\protect\footnotemark\ format for a paravirtualized IOMMU IORT node. \footnotetext{This table IS NOT authoritative, only a suggestion. Such a node would be described in \hyperref[intro:ACPI IORT]{[ACPI IORT]}}. \begin{center} \begin{tabular}{| l | l | l | p{.4\textwidth} |} \hline \textbf{Field} & \textbf{Length} & \textbf{Offset} & \textbf{Description} \\ \hline Type & 1 & 0 & 5: Paravirtualized IOMMU \\ \hline Length & 2 & 1 & The length of the node. \\ \hline Revision & 1 & 3 & 0 \\ \hline Reserved & 4 & 4 & Must be zero. \\ \hline Number of ID mapping & 4 & 8 & \\ \hline Reference to ID Array & 4 & 12 & Offset from the start of the ID Array IORT node to the start of its Array ID mappings.\\ \hline Model & 4 & 16 & 0: virtio-iommu \\ \hline Device object name & & 20 & ASCII Null terminated string with the full path to the entry in the namespace for this IOMMU. \\ \hline Padding & & & To keep 32-bit alignment and leave space for future models. \\ \hline Array of ID mappings & 20xN & & ID Array. \\ \hline \end{tabular} \end{center} --- \subsection{Message Signaled Interrupts}\label{sec:viommu / MSI} Some buses, such as PCI, implement Message Signaled Interrupts. Instead of requesting an interrupt via a wire that runs from the endpoint to the irqchip, the endpoint can request interrupts by performing a memory write to a specific register (the "doorbell"). By combining the data written to the doorbell, the address itself, and the originator of the write, the IRQ chip deduces the destination interrupt number and destination processing units. Additional devices between the endpoint and the IRQ chip may translate the doorbell address, the IRQ number and verify that the endpoint is allowed to send this interrupt. Different platforms implement IRQ remapping and routing in different ways. This section describes three ways of dealing with Message Signaled Interrupts in virtio-iommu devices and drivers. In simplest systems, the endpoint writes the plain interrupt number to the doorbell, and the IRQ chip signals the interruption to destination CPUs programmed by software. Section \ref{sec:viommu / MSI / Address bypass} describes how to implement a simple system with virtio-iommu. Section \ref{sec:viommu / MSI / Address translation} describes the added complexity (from the host point of view) of translating the IRQ chip doorbell. More complex systems add a level of indirection in the MSI message. The address or data contains an index into a remapping table, that describes interrupt delivery in details and is programmed by software either into the IRQ chip or the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use the remapping feature of virtio-iommu. \subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass} \begin{figure}[htb] \centering \includegraphics{img/MSI-addr-noremap.png} \caption{MSI remapping with address bypass} \end{figure} Bypassing translation for MSIs is the simplest implementation from the host perspective. The virtio-iommu device has a special IOVA window that it does not translate. Any access from devices to that region is forwarded upstream of the IOMMU without being translated or even checked. The IRQ chip may or may not have an IRQ remapping component. It may be as simple as generating the interrupt number described in data, without checking if the device was allowed to send that interrupt. If there is another component performing the isolation, one might consider translating the doorbell address superfluous. With virtio-iommu, the device can advertise the doorbell address as untranslated by using the PROBE request with a reserved region (see \ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESV_MEM}). For example, if the virtual platform has an IRQ remapping module with a doorbell in the physical address range 0xfee00000-0xfeefffff, then the device can present the following property to the driver: \begin{lstlisting} struct __attribute__((packed)) { struct virtio_iommu_probe_property head; struct virtio_iommu_probe_resv_mem mem; } doorbell = { .head = { .type = VIRTIO_IOMMU_PROBE_T_RESV_MEM, .length = sizeof(doorbell.mem), }, .mem = { .subtype = VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS, .flags = VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI, .addr = 0xfee00000, .size = 0x00100000, }, }; \end{lstlisting} \subsubsection{Address translation}\label{sec:viommu / MSI / Address translation} \begin{figure}[htb] \centering \includegraphics{img/MSI-addr-remap.png} \caption{MSI remapping with address translation} \end{figure} On some systems (e.g. ARM-based platforms) the IOMMU does not have a special MSI window, and MSIs are treated like any other memory write. The MSI address therefore has to be translated by the IOMMU before reaching the IRQ chip. Address translation may be used as a rudimentary form of MSI isolation, but multiple endpoints will typically access the same doorbell. Address translation can only forbid an endpoint from sending interrupts. If it is allowed to send MSIs, the endpoint can easily spoof another endpoint by sending interrupts that were not assigned to it.>From the virtio-iommu point of view, this is the simplest to implement, becausethere is no special address range. The whole address space is treated the same by the virtio-iommu device. However, this mode of operations may add significant complexity in the host implementation. \subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping} Some IOMMUs (e.g. Intel and AMD IOMMUs) are able to remap IRQs themselves. \begin{figure}[htb] \centering \includegraphics{img/MSI-irq-remap.png} \caption{MSI remapping with address bypass} \end{figure} This version of virtio-iommu doesn't support IRQ remapping.
Hi jean, On 04/08/2017 20:19, Jean-Philippe Brucker wrote:> This is the continuation of my proposal for virtio-iommu, the para- > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > * The virtio-iommu document now resembles an actual specification. It is > split into a formal description of the virtio device, and implementation > notes. Please find sources and binaries at [2]. > > * Added a probe request to describe to the guest different properties that > do not fit in firmware or in the virtio config space. This is a > necessary stepping stone for extending the virtio-iommu. > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > Bhushan. > > You can find the Linux driver and kvmtool device at [4] and [5]. I > plan to rework driver and kvmtool device slightly before sending the > patches. > > To understand the virtio-iommu, I advise to first read introduction and > motivation, then skim through implementation notes and finally look at the > device specification. > > I wasn't sure how to organize the review. For those who prefer to comment > inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex > to this thread. They are the biggest chunks of the document. But LaTeX > isn't very pleasant to read, so you can simply send a list of comments in > relation to section numbers and a few words of context, we'll manage. > > --- > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare > more easily differences since the RFC (see [6]), but haven't been made > public so far. This is the first public posting since initial proposal > [1], and the following describes all changes. > > ## v0.1 ## > > Content is the same as the RFC, but formatted to LaTeX. 'make' generates > one PDF and one HTML document. > > ## v0.2 ## > > Add introductions, improve topology example and firmware description based > on feedback and a number of useful discussions. > > ## v0.3 ## > > Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten > the device and driver behaviour. Unmap semantics are consolidated; they > are now closer to VFIO Type1 v2 semantics. > > ## v0.4 ## > > Introduce PROBE requests. They provide per-endpoint information to the > driver that couldn't be described otherwise. > > For the moment, they allow to handle MSIs on x86 virtual platforms (see > 3.2). To do that we communicate reserved IOVA regions, that will also be > useful for describing regions that cannot be mapped for a given endpoint, > for instance addresses that correspond to a PCI bridge window. > > Introducing such a large framework for this tiny feature may seem > overkill, but it is needed for future extensions of the virtio-iommu and I > believe it really is worth the effort.2.6.7 - As I am currently integrating v0.4 in QEMU here are some other comments: At the moment struct virtio_iommu_req_probe flags is missing in your header. As such I understood the ACK protocol was not implemented by the driver in your branch. - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your header too. 2.6.8.2: - I am really confused about what the device should report as resv regions depending on the PE nature (VFIO or not VFIO) In other iommu drivers, the resv regions are populated by the iommu driver through its get_resv_regions callback. They are usually composed of an iommu specific MSI region (mapped or bypassed) and non IOMMU specific (device specific) reserved regions: iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those are the guest reserved regions. First in the current virtio-iommu driver I don't see the iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu driver should compute the non IOMMU specific MSI regions. ie. this is not the responsability of the virtio-iommu device. Then why is it more the job of the device to return the guest iommu specific region rather than the driver itself? Then I understand this is the responsability of the virtio-iommu device to gather information about the host resv regions in case of VFIO EP. Typically the host PCIe host bridge windows cannot be used for IOVA. Also the host MSI reserved IOVA window cannot be used. Do you agree. I really think the spec should clarify what exact resv regions the device should return in case of VFIO device and non VFIO device. Thanks Eric> > ## Future ## > > Other extensions are in preparation. I won't detail them here because v0.4 > already is a lot to digest, but in short, building on top of PROBE: > > * First, since the IOMMU is paravirtualized, the device can expose some > properties of the physical topology to the guest, and let it allocate > resources more efficiently. For example, when the virtio-iommu manages > both physical and emulated endpoints, with different underlying IOMMUs, > we now have a way to describe multiple page and block granularities, > instead of forcing the guest to use the most restricted one for all > endpoints. This will most likely be in v0.5. > > * Then on top of that, a major improvement will describe hardware > acceleration features available to the guest. There is what I call "Page > Table Handover" (or simply, from the host POV, "Nested"), the ability > for the guest to manipulate its own page tables instead of sending > MAP/UNMAP requests to the host. This, along with IO Page Fault > reporting, will also permit SVM virtualization on different platforms. > > Thanks, > Jean > > [1] http://www.spinics.net/lists/kvm/msg147990.html > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 > http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf > I reiterate the disclaimers: don't use this document as a reference, > it's a draft. It's also not an OASIS document yet. It may be riddled > with mistakes. As this is a working draft, it is unstable and I do not > guarantee backward compatibility of future versions. > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4 > Warning: UAPI headers have changed! They didn't follow the spec, > please update. (Use branch v0.1, that has the old headers, for the > Qemu prototype [3]) > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4 > Warning: command-line has changed! Use --viommu vfio[,opts] and > --viommu virtio[,opts] to instantiate a device. > [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs >
Hi Eric,> -----Original Message----- > From: Auger Eric [mailto:eric.auger at redhat.com] > Sent: Tuesday, September 12, 2017 10:43 PM > To: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>; > iommu at lists.linux-foundation.org; kvm at vger.kernel.org; > virtualization at lists.linux-foundation.org; virtio-dev at lists.oasis-open.org > Cc: will.deacon at arm.com; robin.murphy at arm.com; > lorenzo.pieralisi at arm.com; mst at redhat.com; jasowang at redhat.com; > marc.zyngier at arm.com; eric.auger.pro at gmail.com; Bharat Bhushan > <bharat.bhushan at nxp.com>; peterx at redhat.com; kevin.tian at intel.com > Subject: Re: [RFC] virtio-iommu version 0.4 > > Hi jean, > > On 04/08/2017 20:19, Jean-Philippe Brucker wrote: > > This is the continuation of my proposal for virtio-iommu, the para- > > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > > > * The virtio-iommu document now resembles an actual specification. It is > > split into a formal description of the virtio device, and implementation > > notes. Please find sources and binaries at [2]. > > > > * Added a probe request to describe to the guest different properties that > > do not fit in firmware or in the virtio config space. This is a > > necessary stepping stone for extending the virtio-iommu. > > > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > > Bhushan. > > > > You can find the Linux driver and kvmtool device at [4] and [5]. I > > plan to rework driver and kvmtool device slightly before sending the > > patches. > > > > To understand the virtio-iommu, I advise to first read introduction > > and motivation, then skim through implementation notes and finally > > look at the device specification. > > > > I wasn't sure how to organize the review. For those who prefer to > > comment inline, I attached v0.4 of device-operations.tex and > > topology.tex+MSI.tex to this thread. They are the biggest chunks of > > the document. But LaTeX isn't very pleasant to read, so you can simply > > send a list of comments in relation to section numbers and a few words of > context, we'll manage. > > > > --- > > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to > > compare more easily differences since the RFC (see [6]), but haven't > > been made public so far. This is the first public posting since > > initial proposal [1], and the following describes all changes. > > > > ## v0.1 ## > > > > Content is the same as the RFC, but formatted to LaTeX. 'make' > > generates one PDF and one HTML document. > > > > ## v0.2 ## > > > > Add introductions, improve topology example and firmware description > > based on feedback and a number of useful discussions. > > > > ## v0.3 ## > > > > Add normative sections (MUST, SHOULD, etc). Clarify some things, > > tighten the device and driver behaviour. Unmap semantics are > > consolidated; they are now closer to VFIO Type1 v2 semantics. > > > > ## v0.4 ## > > > > Introduce PROBE requests. They provide per-endpoint information to the > > driver that couldn't be described otherwise. > > > > For the moment, they allow to handle MSIs on x86 virtual platforms > > (see 3.2). To do that we communicate reserved IOVA regions, that will > > also be useful for describing regions that cannot be mapped for a > > given endpoint, for instance addresses that correspond to a PCI bridge > window. > > > > Introducing such a large framework for this tiny feature may seem > > overkill, but it is needed for future extensions of the virtio-iommu > > and I believe it really is worth the effort. > > 2.6.7 > - As I am currently integrating v0.4 in QEMU here are some other comments: > At the moment struct virtio_iommu_req_probe flags is missing in your > header. As such I understood the ACK protocol was not implemented by the > driver in your branch. > - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is > VIRTIO_IOMMU_T_MASK in your header too. > 2.6.8.2: > - I am really confused about what the device should report as resv regions > depending on the PE nature (VFIO or not VFIO) > > In other iommu drivers, the resv regions are populated by the iommu driver > through its get_resv_regions callback. They are usually composed of an > iommu specific MSI region (mapped or bypassed) and non IOMMU specific > (device specific) reserved regions: > iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those > are the guest reserved regions. > > First in the current virtio-iommu driver I don't see the > iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu > driver should compute the non IOMMU specific MSI regions. ie. this is not > the responsability of the virtio-iommu device. > > Then why is it more the job of the device to return the guest iommu specific > region rather than the driver itself? > > Then I understand this is the responsability of the virtio-iommu device to > gather information about the host resv regions in case of VFIO EP. > Typically the host PCIe host bridge windows cannot be used for IOVA. > Also the host MSI reserved IOVA window cannot be used. Do you agree. > > I really think the spec should clarify what exact resv regions the device should > return in case of VFIO device and non VFIO device.My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well) Does this make sense to have reserved regions per address-space/domain? Thanks -Bharat> > Thanks > > Eric > > > > > ## Future ## > > > > Other extensions are in preparation. I won't detail them here because > > v0.4 already is a lot to digest, but in short, building on top of PROBE: > > > > * First, since the IOMMU is paravirtualized, the device can expose some > > properties of the physical topology to the guest, and let it allocate > > resources more efficiently. For example, when the virtio-iommu manages > > both physical and emulated endpoints, with different underlying > IOMMUs, > > we now have a way to describe multiple page and block granularities, > > instead of forcing the guest to use the most restricted one for all > > endpoints. This will most likely be in v0.5. > > > > * Then on top of that, a major improvement will describe hardware > > acceleration features available to the guest. There is what I call "Page > > Table Handover" (or simply, from the host POV, "Nested"), the ability > > for the guest to manipulate its own page tables instead of sending > > MAP/UNMAP requests to the host. This, along with IO Page Fault > > reporting, will also permit SVM virtualization on different platforms. > > > > Thanks, > > Jean > > > > [1] http://www.spinics.net/lists/kvm/msg147990.html > > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 > > http://www.linux-arm.org/git?p=virtio- > iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf > > I reiterate the disclaimers: don't use this document as a reference, > > it's a draft. It's also not an OASIS document yet. It may be riddled > > with mistakes. As this is a working draft, it is unstable and I do not > > guarantee backward compatibility of future versions. > > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html > > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4 > > Warning: UAPI headers have changed! They didn't follow the spec, > > please update. (Use branch v0.1, that has the old headers, for the > > Qemu prototype [3]) > > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4 > > Warning: command-line has changed! Use --viommu vfio[,opts] and > > --viommu virtio[,opts] to instantiate a device. > > [6] > > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs > >
Hi Eric, On 12/09/17 18:13, Auger Eric wrote:> 2.6.7 > - As I am currently integrating v0.4 in QEMU here are some other comments: > At the moment struct virtio_iommu_req_probe flags is missing in your > header. As such I understood the ACK protocol was not implemented by the > driver in your branch.Uh indeed. And yet I could swear I've written that code... somewhere. I will add it to the batch of v0.5 changes, it shouldn't be too invasive.> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your > header too.Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best (though it is a mouthful).> 2.6.8.2: > - I am really confused about what the device should report as resv > regions depending on the PE nature (VFIO or not VFIO) > > In other iommu drivers, the resv regions are populated by the iommu > driver through its get_resv_regions callback. They are usually composed > of an iommu specific MSI region (mapped or bypassed) and non IOMMU > specific (device specific) reserved regions: > iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those > are the guest reserved regions. > > First in the current virtio-iommu driver I don't see the > iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu > driver should compute the non IOMMU specific MSI regions. ie. this is > not the responsability of the virtio-iommu device.For SW_MSI, certainly. The driver allocates a fixed IOVA region for mapping the MSI doorbell. But the driver has to know whether the doorbell region is translated or bypassed.> Then why is it more the job of the device to return the guest iommu > specific region rather than the driver itself?The MSI region is architectural on x86 IOMMUs, but implementation-defined on virtio-iommu. It depends which platform the host is emulating. In Linux, x86 IOMMU drivers register the bypass region because there always is an IOAPIC on the other end, with a fixed MSI address. But virtio-iommu may be either behind a GIC, an APIC or some other IRQ chip. The driver *could* go over all the irqchips/platforms it knows and try to guess if there is a fixed doorbell or if it needs to reserve an IOVA for them, but it would look horrible. I much prefer having a well-defined way of doing this, so a description from the device.> Then I understand this is the responsability of the virtio-iommu device > to gather information about the host resv regions in case of VFIO EP. > Typically the host PCIe host bridge windows cannot be used for IOVA. > Also the host MSI reserved IOVA window cannot be used. Do you agree.Yes, all regions reported in sysfs reserved_regions in the host would be reported as RESV_T_RESERVED by virtio-iommu.> I really think the spec should clarify what exact resv regions the > device should return in case of VFIO device and non VFIO device.Agreed. I will add something about RESV_T_RESERVED with the PCI bridge example in Implementation Notes. Do you think the MSI examples at the end need improvement as well? I can try to explain that RESV_MSI regions in virtio-iommu are only those of the emulated platform, not the HW or SW MSI regions from the host. Thanks, Jean
Hi Jean, On 04/08/2017 20:19, Jean-Philippe Brucker wrote:> This is the continuation of my proposal for virtio-iommu, the para- > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > * The virtio-iommu document now resembles an actual specification. It is > split into a formal description of the virtio device, and implementation > notes. Please find sources and binaries at [2]. > > * Added a probe request to describe to the guest different properties that > do not fit in firmware or in the virtio config space. This is a > necessary stepping stone for extending the virtio-iommu. > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > Bhushan. > > You can find the Linux driver and kvmtool device at [4] and [5]. I > plan to rework driver and kvmtool device slightly before sending the > patches. > > To understand the virtio-iommu, I advise to first read introduction and > motivation, then skim through implementation notes and finally look at the > device specification. > > I wasn't sure how to organize the review. For those who prefer to comment > inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex > to this thread. They are the biggest chunks of the document. But LaTeX > isn't very pleasant to read, so you can simply send a list of comments in > relation to section numbers and a few words of context, we'll manage. > > --- > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare > more easily differences since the RFC (see [6]), but haven't been made > public so far. This is the first public posting since initial proposal > [1], and the following describes all changes. > > ## v0.1 ## > > Content is the same as the RFC, but formatted to LaTeX. 'make' generates > one PDF and one HTML document. > > ## v0.2 ## > > Add introductions, improve topology example and firmware description based > on feedback and a number of useful discussions. > > ## v0.3 ## > > Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten > the device and driver behaviour. Unmap semantics are consolidated; they > are now closer to VFIO Type1 v2 semantics. > > ## v0.4 ## > > Introduce PROBE requests. They provide per-endpoint information to the > driver that couldn't be described otherwise. > > For the moment, they allow to handle MSIs on x86 virtual platforms (see > 3.2). To do that we communicate reserved IOVA regions, that will also be > useful for describing regions that cannot be mapped for a given endpoint, > for instance addresses that correspond to a PCI bridge window. > > Introducing such a large framework for this tiny feature may seem > overkill, but it is needed for future extensions of the virtio-iommu and I > believe it really is worth the effort. > > ## Future ## > > Other extensions are in preparation. I won't detail them here because v0.4 > already is a lot to digest, but in short, building on top of PROBE: > > * First, since the IOMMU is paravirtualized, the device can expose some > properties of the physical topology to the guest, and let it allocate > resources more efficiently. For example, when the virtio-iommu manages > both physical and emulated endpoints, with different underlying IOMMUs, > we now have a way to describe multiple page and block granularities, > instead of forcing the guest to use the most restricted one for all > endpoints. This will most likely be in v0.5. > > * Then on top of that, a major improvement will describe hardware > acceleration features available to the guest. There is what I call "Page > Table Handover" (or simply, from the host POV, "Nested"), the ability > for the guest to manipulate its own page tables instead of sending > MAP/UNMAP requests to the host. This, along with IO Page Fault > reporting, will also permit SVM virtualization on different platforms. > > Thanks, > Jean > > [1] http://www.spinics.net/lists/kvm/msg147990.html > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 > http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf > I reiterate the disclaimers: don't use this document as a reference, > it's a draft. It's also not an OASIS document yet. It may be riddled > with mistakes. As this is a working draft, it is unstable and I do not > guarantee backward compatibility of future versions. > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4 > Warning: UAPI headers have changed! They didn't follow the spec, > please update. (Use branch v0.1, that has the old headers, for the > Qemu prototype [3])When rebasing the v0.4 driver on master I observe a regression: commands are not received properly by QEMU (typically an attach command is received with a type of 0). After a bisection of the guest kernel the first commit the problem appears is: commit e3067861ba6650a566a6273738c23c956ad55c02 arm64: add basic VMAP_STACK support After reverting this patch, things resume working. I observe the problem with a 4kB page guest kernel. Thanks Eric> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4 > Warning: command-line has changed! Use --viommu vfio[,opts] and > --viommu virtio[,opts] to instantiate a device. > [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org >
Hi Eric, On 03/10/17 14:04, Auger Eric wrote:> When rebasing the v0.4 driver on master I observe a regression: commands > are not received properly by QEMU (typically an attach command is > received with a type of 0). After a bisection of the guest kernel the > first commit the problem appears is: > > commit e3067861ba6650a566a6273738c23c956ad55c02 > arm64: add basic VMAP_STACK supportIndeed, thanks for the report. I can reproduce the problem with 4.14 and kvmtool. We should allocate buffers on the heap for any request (see for example 9472fe704 or e37e2ff35). I rebased onto 4.14 and pushed the following patch at: git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4.1 It's not very nice, but should fix the issue. I didn't manage to measure the difference though, my benchmark isn't precise enough. So I don't know if we should try to use a kmem cache instead of kmalloc. Thanks, Jean --- 8< --->From 3fc957560e1e6f070a0468bf75ebc4862d37ff82 Mon Sep 17 00:00:00 2001From: Jean-Philippe Brucker <jean-philippe.brucker at arm.com> Date: Mon, 9 Oct 2017 20:13:57 +0100 Subject: [PATCH] iommu/virtio-iommu: Allocate all requests on the heap When CONFIG_VMAP_STACK is enabled, DMA on the stack isn't allowed. Instead of allocating virtio-iommu requests on the stack, use kmalloc. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com> --- drivers/iommu/virtio-iommu.c | 86 +++++++++++++++++++++++++++++--------------- 1 file changed, 58 insertions(+), 28 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index ec1cfaba5997..2d8fd5e99fa7 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -473,13 +473,10 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev) { int i; int ret = 0; + struct virtio_iommu_req_attach *req; struct iommu_fwspec *fwspec = dev->iommu_fwspec; struct viommu_endpoint *vdev = fwspec->iommu_priv; struct viommu_domain *vdomain = to_viommu_domain(domain); - struct virtio_iommu_req_attach req = { - .head.type = VIRTIO_IOMMU_T_ATTACH, - .address_space = cpu_to_le32(vdomain->id), - }; mutex_lock(&vdomain->mutex); if (!vdomain->viommu) { @@ -531,14 +528,25 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev) dev_dbg(dev, "attach to domain %llu\n", vdomain->id); + req = kzalloc(sizeof(*req), GFP_KERNEL); + if (!req) + return -ENOMEM; + + *req = (struct virtio_iommu_req_attach) { + .head.type = VIRTIO_IOMMU_T_ATTACH, + .address_space = cpu_to_le32(vdomain->id), + }; + for (i = 0; i < fwspec->num_ids; i++) { - req.device = cpu_to_le32(fwspec->ids[i]); + req->device = cpu_to_le32(fwspec->ids[i]); - ret = viommu_send_req_sync(vdomain->viommu, &req); + ret = viommu_send_req_sync(vdomain->viommu, req); if (ret) break; } + kfree(req); + vdomain->attached++; vdev->vdomain = vdomain; @@ -550,13 +558,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, { int ret; struct viommu_domain *vdomain = to_viommu_domain(domain); - struct virtio_iommu_req_map req = { - .head.type = VIRTIO_IOMMU_T_MAP, - .address_space = cpu_to_le32(vdomain->id), - .virt_addr = cpu_to_le64(iova), - .phys_addr = cpu_to_le64(paddr), - .size = cpu_to_le64(size), - }; + struct virtio_iommu_req_map *req; pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova, paddr, size); @@ -564,17 +566,30 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova, if (!vdomain->attached) return -ENODEV; - if (prot & IOMMU_READ) - req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ); - - if (prot & IOMMU_WRITE) - req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE); - ret = viommu_tlb_map(vdomain, iova, paddr, size); if (ret) return ret; - ret = viommu_send_req_sync(vdomain->viommu, &req); + req = kzalloc(sizeof(*req), GFP_KERNEL); + if (!req) + return -ENOMEM; + + *req = (struct virtio_iommu_req_map) { + .head.type = VIRTIO_IOMMU_T_MAP, + .address_space = cpu_to_le32(vdomain->id), + .virt_addr = cpu_to_le64(iova), + .phys_addr = cpu_to_le64(paddr), + .size = cpu_to_le64(size), + }; + + if (prot & IOMMU_READ) + req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ); + + if (prot & IOMMU_WRITE) + req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE); + + ret = viommu_send_req_sync(vdomain->viommu, req); + kfree(req); if (ret) viommu_tlb_unmap(vdomain, iova, size); @@ -587,11 +602,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova, int ret; size_t unmapped; struct viommu_domain *vdomain = to_viommu_domain(domain); - struct virtio_iommu_req_unmap req = { - .head.type = VIRTIO_IOMMU_T_UNMAP, - .address_space = cpu_to_le32(vdomain->id), - .virt_addr = cpu_to_le64(iova), - }; + struct virtio_iommu_req_unmap *req; pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size); @@ -603,9 +614,19 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova, if (unmapped < size) return 0; - req.size = cpu_to_le64(unmapped); + req = kzalloc(sizeof(*req), GFP_KERNEL); + if (!req) + return -ENOMEM; - ret = viommu_send_req_sync(vdomain->viommu, &req); + *req = (struct virtio_iommu_req_unmap) { + .head.type = VIRTIO_IOMMU_T_UNMAP, + .address_space = cpu_to_le32(vdomain->id), + .virt_addr = cpu_to_le64(iova), + .size = cpu_to_le64(unmapped), + }; + + ret = viommu_send_req_sync(vdomain->viommu, req); + kfree(req); if (ret) return 0; @@ -626,12 +647,16 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova, unsigned long mapped_iova; size_t top_size, bottom_size; struct viommu_request reqs[nents]; - struct virtio_iommu_req_map map_reqs[nents]; + struct virtio_iommu_req_map *map_reqs; struct viommu_domain *vdomain = to_viommu_domain(domain); if (!vdomain->attached) return 0; + map_reqs = kcalloc(nents, sizeof(*map_reqs), GFP_KERNEL); + if (!map_reqs) + return -ENOMEM; + pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova); if (prot & IOMMU_READ) @@ -679,6 +704,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova, if (ret) { viommu_tlb_unmap(vdomain, iova, total_size); + kfree(map_reqs); return 0; } @@ -692,6 +718,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova, goto err_rollback; } + kfree(map_reqs); return total_size; err_rollback: @@ -719,6 +746,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova, } viommu_tlb_unmap(vdomain, iova, total_size); + kfree(map_reqs); return 0; } @@ -863,6 +891,8 @@ static int viommu_probe_device(struct viommu_dev *viommu, type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK; } + kfree(probe); + return 0; } -- 2.13.3