thr3ads.net - Linux Virtualization - [RFC] virtio-iommu version 0.4 [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Jean-Philippe Brucker

2017-Aug-04 18:19 UTC

[RFC] virtio-iommu version 0.4

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the
ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
   
http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

Jean-Philippe Brucker

2017-Aug-04 18:19 UTC

head link

[RFC] virtio-iommu v0.4 - IOMMU Device

The following is roughly the content of device-operations.tex

---
\section{IOMMU device}\label{sec:Device Types / IOMMU Device}

The virtio-iommu device manages Direct Memory Access (DMA) from one or
more endpoints. It may act as a proxy for multiple physical IOMMUs
managing devices assigned to the guest, and as standalone IOMMU for
virtual devices.

The driver first discovers endpoints managed by the virtio-iommu device
using standard firmware mechanisms. It then sends requests to create
virtual address spaces and virtual-to-physical mappings for these
endpoints. In its simplest form, the virtio-iommu supports four request
types:

\begin{enumerate}
\item Create an address space and attach an endpoint to it.  \\
  \texttt{attach(device = 0x104, address space = 1)}
\item Create a mapping between a range of guest-virtual and guest-physical
  address. \\
  \texttt{map(address space = 1, virt = 0x1000, phys = 0xa000,
          size = 0x1000, flags = READ)}

  Endpoint 0x104, for example a hardware PCI endpoint, can now read at
  addresses 0x1000-0x1fff. These accesses are translated into
  system-physical addresses by the IOMMU.

\item Remove the mapping.\\
  \texttt{unmap(address space = 1, virt = 0x1000, size = 0x1000)}

  Any access to addresses 0x1000-0x1fff by endpoint 0x104 would now be
  rejected.
\item Detach the device and remove the address space.\\
  \texttt{detach(device = 0x104)}
\end{enumerate}

\subsection{Device ID}\label{sec:Device Types / IOMMU Device / Device ID}

TBD. During development, use 61216.

\subsection{Virtqueues}\label{sec:Device Types / IOMMU Device / Virtqueues}

\begin{description}
\item[0] requestq
\end{description}

\subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}

\begin{description}
\item[VIRTIO_IOMMU_F_INPUT_RANGE (0)]
  Available range of virtual addresses is described in \field{input_range}

\item[VIRTIO_IOMMU_F_IOASID_BITS (1)]
  The number of address spaces supported is described in \field{ioasid_bits}

\item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
  Map and unmap requests are available.\footnote{Future extensions may add
  different modes of operations. At the moment, only
  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}

\item[VIRTIO_IOMMU_F_BYPASS (3)]
  When not attached to an address space, endpoints downstream of the IOMMU
  can access the guest-physical address space.

\item[VIRTIO_IOMMU_F_PROBE (4)]
  Probe request is available.
\end{description}

\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device /
Feature bits}

The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.

% XXX F_MAP_UNMAP will be optional when introducing PTH. But a 0.2 driver
% must implement it (otherwise the device is useless)

\devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device /
Feature bits}

If the device offers any of VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS or VIRTIO_IOMMU_F_PROBE feature bits, and if
the driver did not accept this feature bit, then the device MAY signal
failure by failing to set FEATURES_OK \field{device status} bit when the
driver writes it.

If the device offers the VIRTIO_IOMMU_F_MAP_UNMAP feature bit, and if the
driver did not accept this feature bit, then the device SHOULD behave as
if the feature was negotiated.
% This takes into account all the following "If the F_MAP_UNMAP feature
% was negotiated..."

% If the driver supports F_PTH but not F_MAP_UNMAP, then the driver MUST
% give up upon seeing that the 0.2 device doesn't support F_PTH. If it
% supports F_MAP_UNMAP, then it SHOULD use F_MAP_UNMAP (this is described
% in "Driver Requirements: Feature Bits")

% If the driver supports F_PTH but the device doesn't, and the driver
% stupidly sets F_PTH but not F_MAP_UNMAP, then the device SHOULD reject
% any PTH request and PTH flag in attach.

% When not using the legacy interface, if the driver doesn't negotiate
% F_MAP_UNMAP, then the device may disable it an reject any request.

\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / IOMMU
Device / Feature bits / Legacy Interface: Feature bits}

When using the legacy interface, transitional devices MUST support guests
which do not negotiate any of the VIRTIO_IOMMU_F_INPUT_RANGE,
VIRTIO_IOMMU_F_IOASID_BITS, VIRTIO_IOMMU_F_MAP_UNMAP or
VIRTIO_IOMMU_F_PROBE features, and MUST behave as if the feature was
negotiated.

%\subsection{Feature Bits Requirements}\label{sec:Device Types / IOMMU Device /
Feature bits requirements}

\subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device /
Device configuration layout}

The \field{page_size_mask} field is always present. Availability of the
others depend on various feature bits as indicated above.

\begin{lstlisting}
struct virtio_iommu_config {
	u64 page_size_mask;
	struct virtio_iommu_range {
		u64 start;
		u64 end;
	} input_range;
	u8 ioasid_bits;
	u8 padding[3];
	u32 probe_size;
};
\end{lstlisting}

\drivernormative{\subsubsection}{Device configuration layout}{Device Types /
IOMMU Device / Device configuration layout}

The driver MUST NOT write to device configuration fields.

\devicenormative{\subsubsection}{Device configuration layout}{Device Types /
IOMMU Device / Device configuration layout}

The device SHOULD set \field{padding} to zero.

The device MUST set at least one bit in \field{page_size_mask}, describing
the page granularity. The device MAY set more than one bit in
\field{page_size_mask}.

\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device
Types / IOMMU Device / Device configuration layout / Legacy Interface: Device
configuration layout}
When using the legacy interface, transitional devices and drivers
MUST format the fields in struct virtio_iommu_config
according to the native endian of the guest rather than
(necessarily when not using the legacy interface) little-endian.


\subsection{Device initialization}\label{sec:Device Types / IOMMU Device /
Device initialization}

When the device is reset, endpoints are not attached to any address space.
If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
access guest-physical addresses ("bypass mode"). If the feature is not
negotiated, then any memory access from endpoints will fault. Upon
attaching an endpoint in bypass mode to a new address space, any memory
access from the endpoint will fault, since the address space does not
contain any mapping.

The driver chooses operating mode depending on its capabilities. In this
revision of the virtio-iommu specification, the only supported mode is
VIRTIO_IOMMU_F_MAP_UNMAP.

\drivernormative{\subsubsection}{Device Initialization}{Device Types / IOMMU
Device / Device Initialization}

The driver MUST NOT negotiate VIRTIO_IOMMU_F_MAP_UNMAP if it is incapable
of sending VIRTIO_IOMMU_T_MAP and VIRTIO_IOMMU_T_UNMAP requests.

If the VIRTIO_IOMMU_F_PROBE feature is offered, the driver SHOULD send a
VIRTIO_IOMMU_T_PROBE request for each endpoint before attaching the
endpoint to an address space.

\devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU
Device / Device Initialization}

If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
device SHOULD NOT let endpoints access the guest-physical address space.
However, the device MAY let endpoints access memory regions negotiated
with VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (see
\ref{devicenormative:Device Types / IOMMU Device / Device operations / PROBE
properties / RESV_MEM}).

\subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device
operations}

Driver send requests on the request virtqueue, notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writable. Each request is therefore described with at least two
descriptors, as illustrated below.

\begin{figure}[htb]
  \centering
  \includegraphics[width=0.7\textwidth]{img/request-wrapping.png}
  \caption{Anatomy of a virtio-iommu request}
\end{figure}

\begin{lstlisting}
struct virtio_iommu_req_head {
	u8	type;
	u8	reserved[3];
};

struct virtio_iommu_req_tail {
	u8	status;
	u8	reserved[3];
};
\end{lstlisting}

\rfc{% TODO
There is a problem with framing using multiple chains... If a request
isn't recognized by the device and is scattered across multiple descriptor
chains, how would the device know where this request ends and where the
next one begins? It assumes that a transition from WO descriptors to RO
descriptors is a new request, ok. But if this unknown request is from a
future extension that uses interleaved RO, WO, RO, WO descriptors for a
single request, then the device is doomed.\\
The extension will have to force that third buffer to start with 0 so our
device, that thinks it's a new request, doesn't recognize the request
type
and waits for the next one.\\
Alternatively, we could add a 16-bit 'size' field into
virtio_iommu_req_head. We'd loose some space for future flags, and force
requests to be at most 64k, or 256k if we drop size bits [1:0]. (head
couldn't be extended in the future, with a field to describe a bigger
size, because our legacy device here would ignore it.) We'd have to resort
to multiple "container" requests transporting a single big one.\\
Personally I think we can let future extensions deal with interleaved
RO/WO/RO descriptors, but do you think a 'size' field would be better?
}

Type may be one of:

\begin{lstlisting}
#define VIRTIO_IOMMU_T_ATTACH			1
#define VIRTIO_IOMMU_T_DETACH			2
#define VIRTIO_IOMMU_T_MAP			3
#define VIRTIO_IOMMU_T_UNMAP			4
#define VIRTIO_IOMMU_T_PROBE			5
\end{lstlisting}

A few general-purpose status codes are defined here. Unless explicitly
described in a \textbf{Requirements} section, these values are hints to
make troubleshooting easier.

When the device fails to parse a request, for instance if a request seems
too small for its type and the device cannot find the tail, then it will
be unable to set \field{status}. In that case, it should return the
buffers without writing in them.

\begin{lstlisting}
/* All good! Carry on. */
#define VIRTIO_IOMMU_S_OK			0
/* Virtio communication error */
#define VIRTIO_IOMMU_S_IOERR			1
/* Unsupported request */
#define VIRTIO_IOMMU_S_UNSUPP			2
/* Internal device error */
#define VIRTIO_IOMMU_S_DEVERR			3
/* Invalid parameters */
#define VIRTIO_IOMMU_S_INVAL			4
/* Out-of-range parameters */
#define VIRTIO_IOMMU_S_RANGE			5
/* Entry not found */
#define VIRTIO_IOMMU_S_NOENT			6
/* Bad address */
#define VIRTIO_IOMMU_S_FAULT			7
\end{lstlisting}

Range limits of some request fields are described in the device
configuration:

\begin{itemize}
\item \field{page_size_mask} contains the bitmask of all page sizes that
  can be mapped. The least significant bit set defines the page
  granularity of IOMMU mappings. Other bits in the mask are hints
  describing page sizes that the IOMMU can merge into a single mapping
  (page blocks).

  The smallest page granularity supported by the IOMMU is one byte. It is
  legal for the driver to map one byte at a time if bit 0 of
  \field{page_size_mask} is set.

\item If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered,
  \field{ioasid_bits} contains the number of bits supported in an I/O
  Address Space ID, the identifier used in most requests. A value of 0 is
  valid, and means that a single address space is supported.

  If the feature is not negotiated, address space identifiers can use up
  to 32 bits.

\item If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
  \field{input_range} contains the virtual address range that the IOMMU is
  able to translate. Any mapping request to virtual addresses outside of
  this range will fail.

  If the feature is not negotiated, virtual mappings span over the whole
  64-bit address space (\texttt{start = 0, end = 0xffffffff ffffffff})
\end{itemize}

\drivernormative{\subsubsection}{Device operations}{Device Types / IOMMU Device
/ Device operations}

The driver SHOULD set reserved fields of the head and the tail of a
request to zero.

When a device returns a complete request in the used queue without having
written to it, the driver SHOULD interpret it as a failure from the device
to parse the request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the driver SHOULD
NOT send requests with \field{virt_addr} less than
\field{input_range.start} or greater than \field{input_range.end}.

If the VIRTIO_IOMMU_F_IOASID_BITS feature is offered, the driver SHOULD
NOT send requests with \field{address_space} greater than the size
described by \field{ioasid_bits}.

% We mandate truncation to allow a future extension X.Y that would store
% information in addresses and address space IDs.
%
% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
% garbage. But this extension would be negotiated with a feature bit
% anyway. If it's not, then device must assume that driver is 0.2 and must
% keep truncating the fields.

\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device
/ Device operations}

The device SHOULD NOT set \field{status} to VIRTIO_IOMMU_S_OK if a request
didn't succeed. \footnote{%
For IMPLEMENTATION DEFINED values of 'succeed'... For example,
virtio_iommu_req_detach.reserved is allowed to be non-zero. If it is
non-zero, the device may consider it to be a failure and abort the
request. Or it may go on with the detach and return OK.}

If a request \field{type} is not recognized, the device SHOULD return the
buffers on the used ring and set the \field{len} field of the used element
to zero.

The device MUST ignore reserved fields of the head and the tail of a
request.

If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
truncate the range described by \field{virt_addr} and \field{size} in
requests to fit in the range described by \field{input_range}.

If the VIRTIO_IOMMU_F_IOASID_BITS is offered, the device MUST ignore bits
above \field{ioasid_bits} in field \field{address_space} of requests.

\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device
operations / ATTACH request}

\begin{lstlisting}
struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Attach an endpoint to an address space. \field{address_space} is an
identifier unique to the virtio-iommu device. If the address space doesn't
exist in the device, it is created. \field{device} is an endpoint
identifier unique to the virtio-iommu device. The host communicates unique
device IDs to the guest using methods outside the scope of this
specification, but the following rules apply:

\begin{itemize}
\item The device ID is unique from the virtio-iommu point of view. Multiple
  endpoints whose DMA transactions are not translated by the same
  virtio-iommu may have the same device ID. Endpoints whose DMA
  transactions may be translated by the same virtio-iommu must have
  different device IDs.

\item Sometimes the host cannot completely isolate two endpoints from each
  others. For example on a legacy PCI bus, endpoints can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these endpoints from
  each others, or that the physical IOMMU cannot distinguish transactions
  coming from these endpoints. The method used to communicate this is
  outside the scope of this specification.
\end{itemize}

Multiple endpoints may be added to the same address space. An endpoint
cannot be attached to multiple address spaces in VIRTIO_IOMMU_F_MAP_UNMAP
mode.

\drivernormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device /
Device operations / ATTACH request}

The driver SHOULD set \field{reserved} to zero.

The driver SHOULD ensure that endpoints that cannot be isolated by the
host are attached to the same address space.

\devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device /
Device operations / ATTACH request}

If the \field{reserved} field of an ATTACH request is not zero, the device
SHOULD set the request \field{status} to VIRTIO_IOMMU_S_INVAL and SHOULD
NOT attach the endpoint to the address space. \footnote{The device should
validate input of ATTACH requests in case the driver attempts to attach in
a mode that is unimplemented by the device, and would be incompatible with
the modes implemented by the device.}

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If another endpoint is already attached to the address space identified by
\field{address_space}, then the device MAY attach the endpoint identified
by \field{device} to the address space. If it cannot do so, the device
MUST set the request \field{status} to VIRTIO_IOMMU_S_UNSUPP.

If the endpoint identified by \field{device} is already attached to
another address space, then the device SHOULD first detach it from that
address space and attach it to the one identified by
\field{address_space}. In that case the device behaves as if the driver
issued a DETACH request with this \field{device}, followed by the ATTACH
request. If the device cannot do so, it MUST set the request
\field{status} to VIRTIO_IOMMU_S_UNSUPP.

\subsubsection{DETACH request}

\begin{lstlisting}
struct virtio_iommu_req_detach {
	le32	device;
	le32	reserved;
};
\end{lstlisting}

Detach an endpoint from its address space. When this request completes,
the endpoint cannot access any mapping from that address space anymore.

After all endpoints have been successfully detached from an address space,
it ceases to exist and its ID can be reused by the driver for another
address space.

\drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device /
Device operations / DETACH request}

The driver SHOULD set \field{reserved} to zero.

\devicenormative{\paragraph}{DETACH request}{Device Types / IOMMU Device /
Device operations / DETACH request}

If the \field{reserved} field of an DETACH request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the DETACH operation.
% If it returns OK, the device SHOULD go on with the detach, as required
% by the VIRTIO_IOMMU_S_OK rule.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the endpoint identified by \field{device} wasn't attached to any
address space, then the device MAY set the request \field{status} to
VIRTIO_IOMMU_S_INVAL.

The device MUST ensure that after being detached from an address space,
the endpoint cannot access any mapping from that address space.

\subsubsection{MAP request}\label{sec:Device Types / IOMMU Device / Device
operations / MAP request}

\begin{lstlisting}
struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};

/* Flags are: */
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)
#define VIRTIO_IOMMU_MAP_F_EXEC		(1 << 2)
\end{lstlisting}

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses of the same size. After the request
succeeds, all endpoints attached to this address space can access memory
in the range $[phys\_addr; phys\_addr + size[$. For example, if an endpoint
accesses address $VA \in [virt\_addr; virt\_addr + size[$, the device (or the
physical IOMMU) translates the address: $PA = VA - virt\_addr +
phys\_addr$. If the access parameters are compatible with \field{flags}
(for instance, the access is write and \field{flags} are
VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE) then the IOMMU allows
the access to reach $PA$.

The range defined by (\field{virt_addr}, \field{size}) must be within the
limits specified by \field{input_range}. The range defined by
(\field{phys_addr}, \field{size}) must be within the guest-physical
address space. This includes upper and lower limits, as well as any
carving of guest-physical addresses for use by the host (for instance MSI
doorbells). Guest physical boundaries are set by the host using a firmware
mechanism outside the scope of this specification.

\begin{note}
This format prevents from creating the identity mapping in a single
request \texttt{[0x0; 0xfff....fff] $\rightarrow$ [0x0; 0xfff...fff]},
since it would result in a size of zero. Hopefully allowing
VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such request. It
would also be unlikely to conform to the physical range restrictions
from the previous paragraph.
\end{note}

\begin{note}
On flags: it is unlikely that all possible combinations of flags will be
supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$
might
be invalid. We do not have a way to advertise supported and implicit (for
instance $W \rightarrow R$) flags or combination thereof for the moment,
you are free to send any suggestions for describing this. Please keep in
mind that we might soon want to add more flags, such as privileged,
device, transient, shared, etc. (whatever these would mean).
\end{note}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device
operations / MAP request}

The driver SHOULD set undefined \field{flags} bits to zero.

\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device
operations / MAP request}

If \field{virt_addr}, \field{phys_addr} or \field{size} is not aligned on
the page granularity, the device SHOULD set the request \field{status} to
VIRTIO_IOMMU_S_RANGE and SHOULD NOT create the mapping.

If the device doesn't recognize a \field{flags} bit, it SHOULD set the
request \field{status} to VIRTIO_IOMMU_S_INVAL. In this case the device
SHOULD NOT create the mapping. \footnote{Validating the input is important
here, because the driver might be attempting to map with special flags
that the device doesn't recognize. Creating the mapping with incompatible
flags may introduce a security hazard.}

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

\subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device
operations / UNMAP request}

\begin{lstlisting}
struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};
\end{lstlisting}

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. We define here
a mapping as a virtual region created with a single MAP request. All
mappings covered by the range $[virt\_addr; virt\_addr + size [$ are
removed.

The semantics of unmapping are specified below, and illustrated with the
following requests, assuming each example sequence starts with a blank
address space. We define two pseudocode functions \texttt{map(virt\_addr,
size) -> mapping} and \texttt{unmap(virt\_addr, size)}.

\begin{lstlisting}
(1) unmap(addr=0, size=5)        -> succeeds, doesn't unmap anything

(2) a = map(addr=0, size=10);
    unmap(0, 10)                 -> succeeds, unmaps a

(3) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 10)                 -> succeeds, unmaps a and b

(4) a = map(0, 10);
    unmap(0, 5)                  -> faults, doesn't unmap anything

(5) a = map(0, 5);
    b = map(5, 5);
    unmap(0, 5)                  -> succeeds, unmaps a

(6) a = map(0, 5);
    unmap(0, 10)                 -> succeeds, unmaps a

(7) a = map(0, 5);
    b = map(10, 5);
    unmap(0, 15)                 -> succeeds, unmaps a and b
\end{lstlisting}

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

\drivernormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device
operations / UNMAP request}

The driver SHOULD set the \field{reserved} field to zero.

The range, defined by \field{virt_addr} and \field{size}, SHOULD cover one
or more contiguous mappings created with MAP requests. The range MAY spill
over unmapped virtual addresses.

The first address of a range SHOULD either be the first address of a
mapping or be outside any mapping. The last address of a range SHOULD
either be the last address of a mapping or be outside any mapping.

\devicenormative{\paragraph}{UNMAP request}{Device Types / IOMMU Device / Device
operations / UNMAP request}

If the \field{reserved} field of an UNMAP request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL, in which case
the device MAY perform the UNMAP operation.
% If it returns OK, the device SHOULD go on with the unmap, as required by
% the VIRTIO_IOMMU_S_OK rule.

If \field{address_space} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.

If a mapping affected by the range is not covered in its entirety by the
range (the UNMAP request would split the mapping), then the device SHOULD
set the request \field{status} to VIRTIO_IOMMU_S_RANGE, and SHOULD NOT
remove any mapping.

If part of the range or the full range is not covered by an existing
mapping, then the device SHOULD remove all mappings affected by the range
and set the request \field{status} to VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device
operations / PROBE request}

If the VIRTIO_IOMMU_F_PROBE feature bit is present, the driver sends a
VIRTIO_IOMMU_T_PROBE request for each endpoint that the virtio-iommu
device manages. This probe is performed before attaching the endpoint to
an address space.

\begin{lstlisting}
struct virtio_iommu_req_probe {
	/* Device-readable */
	le32	device;
	le32	flags;
	u8	reserved[60];

	/* Device-writable when not ACK */
	u8	properties[];
};

/* Flags are: */
#define VIRTIO_IOMMU_PROBE_F_ACK	(1 << 0)
\end{lstlisting}

\begin{description}
\item[\field{device}] has the same meaning as in ATTACH and DETACH
  requests.

\item[\field{flags}] contain additional information about the request.
  The VIRTIO_IOMMU_PROBE_F_ACK flag changes the descriptor chain layout:
  when ACK is clear, the \field{properties} field is device-writable;
  when it is set, the \field{properties} field is device-readable.

\item[\field{reserved}] is used as padding, so that future extensions can
  add fields to the device-readable part.

\item[\field{properties}] contains a list of properties of endpoint
  \field{device}, filled by the device. This field is exactly
  \field{probe_size} bytes. Each property is described with a type, four
    flag bits, a length, and a value:
\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK	0xfff
#define VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK	(1 << 12)

struct virtio_iommu_probe_property {
	le16	type;
	le16	length;
	u8	value[];
};
\end{lstlisting}

\end{description}

The driver allocates a buffer of adequate size for the probe request,
writes \field{device} and adds it to the request queue. The device fills
the \field{properties} field with a list of properties for this endpoint.

The driver parses the first property by reading \field{type}, then
\field{length}. If the driver recognizes \field{type}, it reads and
handles \field{value}. The driver then reads the next property, that is
located $(\field{length} + 4)$ bytes after the beginning of the first one,
and so on. The driver parses all properties until it reaches a NONE
property or the end of \field{properties}.

The upper nibble of property \field{type} is reserved for flags.
Therefore only 4096 types are available. The actual type of a property is
extracted like this:

\begin{lstlisting}
u16 type = le16_to_cpu(property.type) &
VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK;
\end{lstlisting}

If a property is correctly understood by the driver, then it sets the ACK
bit in \field{type}:

\begin{lstlisting}
property.type |= cpu_to_le16(VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK);
\end{lstlisting}

Then, to signal to the device which properties are understood, the device
sends the probe again with the VIRTIO_IOMMU_PROBE_F_ACK flag. In all
properties understood and accepted by the driver, \field{type} has the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit set. The other properties are left
as is.

This second phase of the probe request allows the device to ensure that
all properties crucial for good operations are recognized and handled by
the driver. This is analogous to the initial feature negotiation of virtio
devices: an endpoint property is \emph{offered} by the device to the
driver during the first PROBE, and it is \emph{negotiated} after the
driver acknowledges it during the second PROBE.

Available property types are described in section
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties}.
When attaching multiple devices to the same address space, their
properties are combined. \emph{Combination Rules} are given for each
property, and describe the rules to apply when combining properties
obtained during probe.

\drivernormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device
operations / PROBE request}

The size of \field{properties} MUST be \field{probe_size} bytes.

The driver SHOULD set undefined \field{flags} to zero.

The driver SHOULD set \field{reserved} to zero.

If the driver doesn't recognize the \field{type} of a property, it SHOULD
ignore the property and continue parsing the list.

The driver SHOULD NOT deduce the property length from \field{type}.

If the driver recognizes a property \field{type} and is able to
handle{\footnotemark} the property, then the driver SHOULD set the
VIRTIO_IOMMU_PROBE_PROPERTY_F_ACK bit of that property.

\footnotetext{A driver's ability to handle a property depends on the
property type. Without a specific definition of the ACK requirements for a
given property type, it simply means that the driver read all fields of
that property.}

The driver SHOULD resend the PROBE request with the
VIRTIO_IOMMU_PROBE_F_ACK bit set after parsing and updating the
\field{properties} list. Depending on the properties encountered in the
list, the driver MAY modify some of their fields between the first and
second probe, but it SHOULD NOT modify the \field{length} field or bits
[11:0] of field \field{type}.

\devicenormative{\paragraph}{PROBE request}{Device Types / IOMMU Device / Device
operations / PROBE request}

If an undefined bit is set in \field{flags}, the device MAY set the
request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the \field{reserved} field of a PROBE request is not zero, the device
MAY set the request \field{status} to VIRTIO_IOMMU_S_INVAL.

If the endpoint identified by \field{device} doesn't exist, then the
device SHOULD set the request \field{status} to VIRTIO_IOMMU_S_NOENT.

If the device does not offer the VIRTIO_IOMMU_F_PROBE feature, and if the
driver sends a VIRTIO_IOMMU_T_PROBE request, then the device SHOULD return
the buffers on the used ring and set the \field{len} field of the used
element to zero.

The device SHOULD set bits [15:13] of property \field{type} to zero.

The device MUST write the size of \field{value}, in bytes, into
\field{length}.

When two properties follow each others, the device MUST put the second
property exactly $(\field{length} + 4)$ bytes after the beginning of the
first one.

If the device doesn't fill all \field{probe_size} bytes with properties,
it SHOULD terminate the list with a property of type NONE and size 0. The
device MAY fill the remaining bytes of \field{properties}, if any, with
zeroes. If there isn't enough space remaining in \field{properties} to
terminate the list with a complete NONE property (4 bytes), then the
device SHOULD fill the remaining bytes with zeroes.

If the PROBE request has VIRTIO_IOMMU_PROBE_F_ACK bit set, the device MAY
ignore the request and set the request \field{status} to
VIRTIO_IOMMU_S_OK.

\subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device
operations / PROBE properties}

\begin{lstlisting}
#define VIRTIO_IOMMU_PROBE_T_NONE		0
#define VIRTIO_IOMMU_PROBE_T_RESV_MEM		2
\end{lstlisting}

\paragraph{Property NONE}\label{sec:Device Types / IOMMU Device / Device
operations / PROBE properties / NONE}

Marks the end of the property list. This property doesn't have any value,
and should have \field{length} 0.

\paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device
operations / PROBE properties / RESV_MEM}

The RESV_MEM property describes a chunk of reserved virtual memory. It may
be used by the device to describe virtual address ranges that shouldn't be
allocated by the driver, or that are special.

\begin{lstlisting}
struct virtio_iommu_probe_resv_mem {
	u8	subtype;
	u8	reserved[3];
	le64	addr;
	le64	size;
	le32	flags;
};
\end{lstlisting}

Fields \field{addr} and \field{size} describe the range of reserved
addresses. \field{subtype} may be one of:

\begin{description}
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT (0)]
    Accesses to this region are aborted. This subtype does not accept any
    flag.
  \item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS (1)]
    Accesses to this region behave as if the IOMMU was bypassed, and reach
    the bus upstream of the IOMMU untranslated.

    The following \field{flags} are defined for BYPASS regions:
    \begin{description}
      \item[VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI (1)]
        Provides a hint to the guest that this is a doorbell for Message
        Signaled Interrupts.

        If the device doesn't provide such a region, then MSIs are normal
        write accesses from the IOMMU point of view, and arbitrary virtual
        addresses should be allocated by the driver to map MSI doorbells.
        Otherwise, the guest should use the guest-physical doorbell
        address when programming MSIs for this endpoint.
    \end{description}

  %\item[VIRTIO_IOMMU_PROBE_RESV_MEM_T_IDENTITY (2)]
  %  This region should be identity-mapped by the guest. TODO: is this
  %  useful for anyone?
\end{description}

\propcombination{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device /
Device operations / PROBE properties / RESV_MEM}

Multiple overlapping RESV_MEM properties are merged together. Difference
in subtype on the intersecting range doesn't make a difference from the
driver point of view.

\drivernormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device /
Device operations / PROBE properties / RESV_MEM}

The driver SHOULD NOT map any virtual address described by a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT or
VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS property.

% An old driver that doesn't find or understand this property will
% allocate and map virtual addresses. We really can't do anything about
% that. We're not introducing a regression, MSIs never worked for x86
% before we introduced the F_MSI flag.

The driver SHOULD ignore \field{reserved}.

For a given \field{subtype}, the driver SHOULD ignore undefined
\field{flags} bits.

The driver SHOULD treat any \field{subtype} it doesn't recognize as if it
was VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT.

\devicenormative{\subparagraph}{Property RESV_MEM}{Device Types / IOMMU Device /
Device operations / PROBE properties / RESV_MEM}

The device SHOULD set \field{reserved} to zero.

For a given \field{subtype}, the device SHOULD set undefined \field{flags}
bits to zero.

The device MAY abort any transaction targeting a
VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT region.

If an endpoint is attached to an address space, the device SHOULD leave
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
regions pass through untranslated. In other words, the device SHOULD
handle such a region as if it was identity-mapped (virtual address equal
to physical address). If the endpoint is not attached to any address
space, then the device MAY abort the transaction.

The device MAY abort any transaction that isn't a write access and that
targets a VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS region with flag
VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI.

Jean-Philippe Brucker

2017-Aug-04 18:19 UTC

head link

[RFC] virtio-iommu v0.4 - Implementation notes

The following is roughly the content of topology.tex and MSI.tex

---
\section{Implementation notes}\label{sec:viommu}

\subsection{Virtual system topology}\label{sec:viommu / Virtual topology}

\subsubsection{Example virtual topology}\label{sec:viommu / Virtual topology /
Example}

\begin{figure}[htb]
  \centering
  \includegraphics[width=\textwidth]{img/virtual-topology.png}
  \caption{An example IOMMU topology}
  \label{fig:viommu / Virtual topology / Topology}
\end{figure}

Diagram~\ref{fig:viommu / Virtual topology / Topology} shows an example
system topology centered around an IOMMU. On the left, the IOMMU manages
traffic from two PCI root complexes. On the right, the IOMMU manages
traffic from three platform devices (or "integrated devices").

Within a PCI domain, devices are identified by a Requester ID. It is a
16-bit identifier also called Bus/Device/Function (BDF). In a BDF, Bus is
8 bits, Device is 5 bits, and Function is 3 bits.

The bottom PCI domain has four endpoints connected to the root complex via
two bridges. The first endpoint is identified by BDF 01:04.0. On the other
bus, the first endpoint is identified by BDF 02:00.0, and the two other
endpoints are two functions of the same device, identified by BDFs 02:01.0
and 02:01.1. The bridges and the root complex may also issue transactions
with BDFs 00:00.0, 00:01.0 and 00:02.0.

In order for the IOMMU to differentiate devices in multiple PCI domains,
the root bridge expands the BDF with a domain ID. In example
\ref{fig:viommu / Virtual topology / Topology},
the PCI domain on top gets ID 0 and the one on the bottom gets ID 1.
Therefore when reaching the IOMMU, a transaction coming from endpoint
01:04.0 (= 0x0120) is identified by Device ID 0x10120.

We define here "platform" devices as endpoints that are on the system
bus,
as opposed to behind a PCI host bridge. Unlike PCI devices, platform
devices do not have a standardized identifier scheme to be used with the
IOMMU. Their Device IDs are chosen arbitrarily during system integration
in such way that they don't overlap PCI domains or each others.

\subsubsection{Firmware description}\label{sec:viommu / Virtual topology /
Firmware description}

The host describes the relation between IOMMU and devices to the guest
using either device-tree or ACPI. Topology description is outside the
scope of virtio-iommu, because the virtio-iommu does not and should not
need to know about vendor-specific buses. The virtual IOMMU identifies
each virtual endpoint with an abstract 32-bit ID, that is called "Device
ID" in this document\footnote{Other IOMMU architectures use different
names, such as "stream ID" on ARM SMMU or "source ID" on
Intel VT-d}.
Device IDs are not necessarily unique system-wide, but they should not
overlap within a single virtio-iommu. Device IDs of physical endpoints do
not need to match IDs seen by the physical IOMMU.

We strongly advise to implement the virtio-iommu using virtio-mmio
transport. Nothing prevents an implementation to use virtio-pci instead,
but existing firmware interfaces do not easily allow to describe an IOMMU
$\leftrightarrow$ master relations between PCI endpoints. Device models in
Operating Systems might not be designed to support such complicated
system.

Device-tree offers a way to describe the IOMMU topology for PCI and
platform devices. Here's an excerpt of the device-tree describing examples
\ref{fig:viommu / Virtual topology / Topology}.

\begin{lstlisting}
/* The virtual IOMMU is described with a virtio-mmio node */
viommu: virtio at 9050000 {
	compatible = "virtio,mmio";
	reg = <0x09050000 0x200>;
	dma-coherent;
	interrupts = <0x0 0x5 0x1>;

	#iommu-cells = <1>
};

/* PCI domain 0 */
pcie at 3eff0000 {
	...
	/* Identity map */
	iommu-map = <0x0 &viommu 0x0 0x10000>;
};

/* PCI domain 1 */
pcie at 3f000000 {
	...
	/* Linear map: deviceID = RID + 0x10000 */
	iommu-map = <0x0 &viommu 0x10000 0x10000>;
};

someplatformdevice at a000000 {
	...
	iommus = <&viommu 0x20000>;
};
\end{lstlisting}

For more details, please refer to \hyperref[intro:IOMMU DT Bindings]{[IOMMU
DT]}.

In ACPI, the plan would be to add a new node type to the IO Remapping
Table specification \hyperref[intro:ACPI IORT]{[ACPI IORT]}, that provides
a mechanism similar to DT for describing IOMMU topology.

The OS would parse the IORT table to build a map of ID relations between
IOMMU and devices. ID Array is used to find correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field,
and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.

The following table shows the possible\protect\footnotemark\ format for a
paravirtualized IOMMU IORT node.

\footnotetext{This table IS NOT authoritative, only a suggestion.
  Such a node would be described in \hyperref[intro:ACPI IORT]{[ACPI IORT]}}.

\begin{center}
\begin{tabular}{| l | l | l | p{.4\textwidth} |}
\hline
\textbf{Field} & \textbf{Length} & \textbf{Offset} &
\textbf{Description} \\
\hline
Type                  & 1     & 0     & 5: Paravirtualized IOMMU \\
\hline
Length                & 2     & 1     & The length of the node. \\
\hline
Revision              & 1     & 3     & 0 \\
\hline
Reserved              & 4     & 4     & Must be zero. \\
\hline
Number of ID mapping  & 4     & 8     & \\
\hline
Reference to ID Array & 4     & 12    &
  Offset from the start of the ID Array IORT node to the start of its
  Array ID mappings.\\
\hline
Model                 & 4     & 16    & 0: virtio-iommu \\
\hline
Device object name    &       & 20    &
  ASCII Null terminated string with the full path to the entry in the
  namespace for this IOMMU. \\
\hline
Padding & & & To keep 32-bit alignment and leave space for future
models. \\
\hline
Array of ID mappings  & 20xN  &       & ID Array. \\
\hline
\end{tabular}
\end{center}

---
\subsection{Message Signaled Interrupts}\label{sec:viommu / MSI}

Some buses, such as PCI, implement Message Signaled Interrupts. Instead of
requesting an interrupt via a wire that runs from the endpoint to the irqchip,
the endpoint can request interrupts by performing a memory write to a specific
register (the "doorbell").

By combining the data written to the doorbell, the address itself, and the
originator of the write, the IRQ chip deduces the destination interrupt
number and destination processing units. Additional devices between the
endpoint and the IRQ chip may translate the doorbell address, the IRQ
number and verify that the endpoint is allowed to send this interrupt.

Different platforms implement IRQ remapping and routing in different ways.
This section describes three ways of dealing with Message Signaled
Interrupts in virtio-iommu devices and drivers.

In simplest systems, the endpoint writes the plain interrupt number to the
doorbell, and the IRQ chip signals the interruption to destination CPUs
programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
describes how to implement a simple system with virtio-iommu. Section
\ref{sec:viommu / MSI / Address translation} describes the added complexity
(from the host point of view) of translating the IRQ chip doorbell.

More complex systems add a level of indirection in the MSI message. The address
or data contains an index into a remapping table, that describes interrupt
delivery in details and is programmed by software either into the IRQ chip or
the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
the remapping feature of virtio-iommu.

\subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-noremap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

Bypassing translation for MSIs is the simplest implementation from the host
perspective. The virtio-iommu device has a special IOVA window that it does not
translate. Any access from devices to that region is forwarded upstream of the
IOMMU without being translated or even checked.

The IRQ chip may or may not have an IRQ remapping component. It may be as
simple as generating the interrupt number described in data, without checking
if the device was allowed to send that interrupt. If there is another
component performing the isolation, one might consider translating the
doorbell address superfluous.

With virtio-iommu, the device can advertise the doorbell address as
untranslated by using the PROBE request with a reserved region (see
\ref{sec:Device Types / IOMMU Device / Device operations / PROBE properties /
RESV_MEM}).

For example, if the virtual platform has an IRQ remapping module with a
doorbell in the physical address range 0xfee00000-0xfeefffff, then the
device can present the following property to the driver:

\begin{lstlisting}
struct __attribute__((packed)) {
	struct virtio_iommu_probe_property	head;
	struct virtio_iommu_probe_resv_mem	mem;
} doorbell = {
	.head = {
		.type		= VIRTIO_IOMMU_PROBE_T_RESV_MEM,
		.length		= sizeof(doorbell.mem),
	},
	.mem = {
		.subtype	= VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS,
		.flags		= VIRTIO_IOMMU_PROBE_RESV_MEM_F_MSI,
		.addr		= 0xfee00000,
		.size		= 0x00100000,
	},
};
\end{lstlisting}

\subsubsection{Address translation}\label{sec:viommu / MSI / Address
translation}

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-addr-remap.png}
  \caption{MSI remapping with address translation}
\end{figure}

On some systems (e.g. ARM-based platforms) the IOMMU does not have a special
MSI window, and MSIs are treated like any other memory write. The MSI address
therefore has to be translated by the IOMMU before reaching the IRQ chip.

Address translation may be used as a rudimentary form of MSI isolation,
but multiple endpoints will typically access the same doorbell. Address
translation can only forbid an endpoint from sending interrupts. If it is
allowed to send MSIs, the endpoint can easily spoof another endpoint by
sending interrupts that were not assigned to it.
>From the virtio-iommu point of view, this is the simplest to implement,
becausethere is no special address range. The whole address space is treated the same
by the virtio-iommu device.

However, this mode of operations may add significant complexity in the host
implementation.


\subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}

Some IOMMUs (e.g. Intel and AMD IOMMUs) are able to remap IRQs themselves. 

\begin{figure}[htb]
  \centering
  \includegraphics{img/MSI-irq-remap.png}
  \caption{MSI remapping with address bypass}
\end{figure}

This version of virtio-iommu doesn't support IRQ remapping.

Auger Eric

2017-Sep-12 17:13 UTC

head link

[RFC] virtio-iommu version 0.4

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to
comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments
in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make'
generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because
v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call
"Page
>   Table Handover" (or simply, from the host POV, "Nested"),
the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>    
http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a
reference,
>     it's a draft. It's also not an OASIS document yet. It may be
riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
>

Bharat Bhushan

2017-Sep-13 03:47 UTC

head link

[RFC] virtio-iommu version 0.4

Hi Eric,
> -----Original Message-----
> From: Auger Eric [mailto:eric.auger at redhat.com]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>;
> iommu at lists.linux-foundation.org; kvm at vger.kernel.org;
> virtualization at lists.linux-foundation.org; virtio-dev at
lists.oasis-open.org
> Cc: will.deacon at arm.com; robin.murphy at arm.com;
> lorenzo.pieralisi at arm.com; mst at redhat.com; jasowang at redhat.com;
> marc.zyngier at arm.com; eric.auger.pro at gmail.com; Bharat Bhushan
> <bharat.bhushan at nxp.com>; peterx at redhat.com; kevin.tian at
intel.com
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time
[1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It
is
> >   split into a formal description of the virtio device, and
implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties
that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and
Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can
simply
> > send a list of comments in relation to section numbers and a few words
of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but
haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu
specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device
should
> return in case of VFIO device and non VFIO device.
My understanding is that reserved regions are 1) Host reserved regions ( This is
applicable for VFIO) 2) Guest MSI reserved region (This is applicable for
emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat
> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here
because
> > v0.4 already is a lot to digest, but in short, building on top of
PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose
some
> >   properties of the physical topology to the guest, and let it
allocate
> >   resources more efficiently. For example, when the virtio-iommu
manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call
"Page
> >   Table Handover" (or simply, from the host POV,
"Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different
platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a
reference,
> >     it's a draft. It's also not an OASIS document yet. It may
be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do
not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the
spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

Jean-Philippe Brucker

2017-Sep-19 10:47 UTC

head link

[RFC] virtio-iommu version 0.4

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.
Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).
> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.
For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.
> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?
The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.
> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.
> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.
Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

Auger Eric

2017-Oct-03 13:04 UTC

head link

[virtio-dev] [RFC] virtio-iommu version 0.4

Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to
comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments
in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make'
generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because
v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call
"Page
>   Table Handover" (or simply, from the host POV, "Nested"),
the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>    
http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a
reference,
>     it's a draft. It's also not an OASIS document yet. It may be
riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])When rebasing the v0.4 driver on master I observe a regression: commands
are not received properly by QEMU (typically an attach command is
received with a type of 0). After a bisection of the guest kernel the
first commit the problem appears is:

commit e3067861ba6650a566a6273738c23c956ad55c02
arm64: add basic VMAP_STACK support

After reverting this patch, things resume working.

I observe the problem with a 4kB page guest kernel.

Thanks

Eric

> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
>

Jean-Philippe Brucker

2017-Oct-09 19:46 UTC

head link

[virtio-dev] [RFC] virtio-iommu version 0.4

Hi Eric,

On 03/10/17 14:04, Auger Eric wrote:> When rebasing the v0.4 driver on master I observe a regression: commands
> are not received properly by QEMU (typically an attach command is
> received with a type of 0). After a bisection of the guest kernel the
> first commit the problem appears is:
> 
> commit e3067861ba6650a566a6273738c23c956ad55c02
> arm64: add basic VMAP_STACK support
Indeed, thanks for the report. I can reproduce the problem with 4.14 and
kvmtool. We should allocate buffers on the heap for any request (see for
example 9472fe704 or e37e2ff35). I rebased onto 4.14 and pushed the
following patch at: git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4.1

It's not very nice, but should fix the issue. I didn't manage to measure
the difference though, my benchmark isn't precise enough. So I don't
know if we should try to use a kmem cache instead of kmalloc.

Thanks,
Jean

--- 8< --->From 3fc957560e1e6f070a0468bf75ebc4862d37ff82 Mon Sep 17 00:00:00 2001From: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
Date: Mon, 9 Oct 2017 20:13:57 +0100
Subject: [PATCH] iommu/virtio-iommu: Allocate all requests on the heap

When CONFIG_VMAP_STACK is enabled, DMA on the stack isn't allowed. Instead
of allocating virtio-iommu requests on the stack, use kmalloc.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker at arm.com>
---
 drivers/iommu/virtio-iommu.c | 86 +++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index ec1cfaba5997..2d8fd5e99fa7 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -473,13 +473,10 @@ static int viommu_attach_dev(struct iommu_domain *domain,
struct device *dev)
 {
 	int i;
 	int ret = 0;
+	struct virtio_iommu_req_attach *req;
 	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
 	struct viommu_endpoint *vdev = fwspec->iommu_priv;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_attach req = {
-		.head.type	= VIRTIO_IOMMU_T_ATTACH,
-		.address_space	= cpu_to_le32(vdomain->id),
-	};

 	mutex_lock(&vdomain->mutex);
 	if (!vdomain->viommu) {
@@ -531,14 +528,25 @@ static int viommu_attach_dev(struct iommu_domain *domain,
struct device *dev)

 	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);

+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_attach) {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
 	for (i = 0; i < fwspec->num_ids; i++) {
-		req.device = cpu_to_le32(fwspec->ids[i]);
+		req->device = cpu_to_le32(fwspec->ids[i]);

-		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		ret = viommu_send_req_sync(vdomain->viommu, req);
 		if (ret)
 			break;
 	}

+	kfree(req);
+
 	vdomain->attached++;
 	vdev->vdomain = vdomain;

@@ -550,13 +558,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned
long iova,
 {
 	int ret;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_map req = {
-		.head.type	= VIRTIO_IOMMU_T_MAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-		.phys_addr	= cpu_to_le64(paddr),
-		.size		= cpu_to_le64(size),
-	};
+	struct virtio_iommu_req_map *req;

 	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id,
iova,
 		 paddr, size);
@@ -564,17 +566,30 @@ static int viommu_map(struct iommu_domain *domain,
unsigned long iova,
 	if (!vdomain->attached)
 		return -ENODEV;

-	if (prot & IOMMU_READ)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
-
-	if (prot & IOMMU_WRITE)
-		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
-
 	ret = viommu_tlb_map(vdomain, iova, paddr, size);
 	if (ret)
 		return ret;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	*req = (struct virtio_iommu_req_map) {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	if (prot & IOMMU_READ)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req->flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		viommu_tlb_unmap(vdomain, iova, size);

@@ -587,11 +602,7 @@ static size_t viommu_unmap(struct iommu_domain *domain,
unsigned long iova,
 	int ret;
 	size_t unmapped;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);
-	struct virtio_iommu_req_unmap req = {
-		.head.type	= VIRTIO_IOMMU_T_UNMAP,
-		.address_space	= cpu_to_le32(vdomain->id),
-		.virt_addr	= cpu_to_le64(iova),
-	};
+	struct virtio_iommu_req_unmap *req;

 	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);

@@ -603,9 +614,19 @@ static size_t viommu_unmap(struct iommu_domain *domain,
unsigned long iova,
 	if (unmapped < size)
 		return 0;

-	req.size = cpu_to_le64(unmapped);
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;

-	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	*req = (struct virtio_iommu_req_unmap) {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.size		= cpu_to_le64(unmapped),
+	};
+
+	ret = viommu_send_req_sync(vdomain->viommu, req);
+	kfree(req);
 	if (ret)
 		return 0;

@@ -626,12 +647,16 @@ static size_t viommu_map_sg(struct iommu_domain *domain,
unsigned long iova,
 	unsigned long mapped_iova;
 	size_t top_size, bottom_size;
 	struct viommu_request reqs[nents];
-	struct virtio_iommu_req_map map_reqs[nents];
+	struct virtio_iommu_req_map *map_reqs;
 	struct viommu_domain *vdomain = to_viommu_domain(domain);

 	if (!vdomain->attached)
 		return 0;

+	map_reqs = kcalloc(nents, sizeof(*map_reqs), GFP_KERNEL);
+	if (!map_reqs)
+		return -ENOMEM;
+
 	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);

 	if (prot & IOMMU_READ)
@@ -679,6 +704,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain,
unsigned long iova,

 	if (ret) {
 		viommu_tlb_unmap(vdomain, iova, total_size);
+		kfree(map_reqs);
 		return 0;
 	}

@@ -692,6 +718,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain,
unsigned long iova,
 			goto err_rollback;
 	}

+	kfree(map_reqs);
 	return total_size;

 err_rollback:
@@ -719,6 +746,7 @@ static size_t viommu_map_sg(struct iommu_domain *domain,
unsigned long iova,
 	}

 	viommu_tlb_unmap(vdomain, iova, total_size);
+	kfree(map_reqs);

 	return 0;
 }
@@ -863,6 +891,8 @@ static int viommu_probe_device(struct viommu_dev *viommu,
 		type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
 	}

+	kfree(probe);
+
 	return 0;
 }

--
2.13.3

Seemingly Similar Threads

Search for more possibly parallel threads

Linux Virtualization - Aug 2017 - [RFC] virtio-iommu version 0.4

[RFC] virtio-iommu version 0.4

[RFC] virtio-iommu v0.4 - IOMMU Device

[RFC] virtio-iommu v0.4 - Implementation notes

[RFC] virtio-iommu version 0.4

[RFC] virtio-iommu version 0.4

[RFC] virtio-iommu version 0.4

[virtio-dev] [RFC] virtio-iommu version 0.4

[virtio-dev] [RFC] virtio-iommu version 0.4

Seemingly Similar Threads