Hi,

Here is a design for a FIFO-based event channel ABI for Xen.  It has
a number of benefits over the 3-level design currently being proposed
by Wei Liu.

* The use of FIFOs for events ensures fairness.
* 16 event priorities.
* The same ABI regardless of the native word size (a 32-bit or 64-bit
  hypervisor provides the same ABI to both 32- and 64-bit guests).
* Reduced memory requirements (only 1 additional page per domU).
* Room for even more event channels with trivial ABI changes.

This is an update of the original draft A and includes clarifications
and corrections for almost all the comments received so far.  The key
changes are summarized at the beginning of the document.

Except for a lack of information on the required memory barriers (to
be included in draft C), this should now be complete enough to start
on a prototype (any volunteers? ;).

The PDF version is easier to read and has diagrams and readable
maths, but the original markdown format document is included below
(for ease of commenting).

http://xenbits.xen.org/people/dvrabel/event-channels-B.pdf

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  4 Feb 2013   Initial draft.

Draft B  15 Feb 2013  Clarified that the event array is per-domain.

                      Control block is no longer part of the
                      vcpu_info but does reside in the same page.

                      Hypercall changes: structures are now 32/64-bit
                      clean, added notes on handling failures,
                      `expand_array` has its `vcpu` field removed,
                      use `expand_array` to add the first page.

                      Added an upcall section.

                      Added a READY field to the control block to
                      make finding the highest priority non-empty
                      event queue more efficient.

                      Note that memory barriers will be required but
                      leave the details to a future draft.
--------------------------------------------------------------------

Purpose
-------

Xen uses event channels to signal events (interrupts) to (fully or
partially) paravirtualized guests.

The current event channel ABI provided by Xen only supports up to
1024 (for 32-bit guests) or 4096 (for 64-bit guests) event channels.
This is limiting scalability as support for more VMs, VCPUs and
devices is required.

Events also cannot be serviced fairly as information on the ordering
of events is lost.  This can result in events from some VMs
experiencing (potentially significantly) longer than average latency.

The existing ABI does not easily allow events to have different
priorities.  Current Linux kernels prioritize the timer event by
special-casing it, but this is not generalizable to more events.
Event priorities may be useful for prioritizing MMIO emulation
requests over bulk data traffic (such as network or disk).

This design replaces the existing event channel ABI with one that:

- is scalable to more than 100,000 event channels, with scope for
  increasing this further with minimal ABI changes.

- allows events to be serviced fairly.

- allows guests to use up to 16 different event priorities.

- has an ABI that is the same regardless of the natural word size.

System Overview
---------------

[FIXME: diagram showing Xen and guest and shared memory block for
events?]

Design Map
----------

A new event channel ABI requires changes to Xen and the guest
kernels.

References
----------

[FIXME: link to alternate proposal?]

Design Considerations
=====================

Assumptions
-----------

* Atomic read-modify-write of 32-bit words is possible on all
  supported platforms.  This can be with a linked-load /
  store-conditional (e.g., ARMv8's ldrx/strx) or a compare-and-swap
  (e.g., x86's cmpxchg).

Constraints
-----------

* The existing ABI must continue to be usable.  Compatibility with
  existing guests is mandatory.
Risks and Volatile Areas
------------------------

* Should the 3-level proposal be merged into Xen, then this design
  does not offer enough improvements to warrant the cost of
  maintaining three different event channel ABIs in Xen and guest
  kernels.

* The performance of some operations may be decreased.  Specifically,
  retriggering an event now always requires a hypercall.

Architecture
============

Overview
--------

The event channel ABI uses a data structure that is shared between
Xen and the guest.  Access to the structure is done with lock-less
operations (except for some less common operations where the guest
must use a hypercall).  The guest is responsible for allocating this
structure and registering it with Xen during VCPU bring-up.

Events are reported to a guest's VCPU using a FIFO _event queue_.
There is a queue for each priority level and each VCPU.

Each event has a _pending_ and a _masked_ bit.  The pending bit
indicates the event has been raised.  The masked bit is used by the
guest to prevent delivery of that specific event.

High Level Design
=================

Shared Event Data Structure
---------------------------

The shared event data structure has a per-domain _event array_, and a
per-VCPU _control block_.

- _event array_: A logical array of _event words_ (one per event
  channel) which contains the pending and mask bits and the link
  index for the next event in the queue.  The event array is shared
  between all of the guest's VCPUs.

- _control block_: This contains the meta data for the event queues:
  the _ready bits_ and the _head index_ and _tail index_ for each
  priority level.  Each VCPU has its own control block and this is
  contained in the same page as the existing `struct vcpu_info`.

### Event Array

![\label{fig_event-word}Event Array Word](event-word.pdf)

The pages within the event array need not be physically nor virtually
contiguous, but the guest or Xen may make them virtually contiguous
for ease of implementation,
e.g., by using vmap() in Xen or vmalloc() in Linux.  Pages are added
by the guest as required to accommodate the event with the highest
port number.

Only 17 bits are currently defined for the LINK field, allowing
2^17^ (131,072) events.  This limit can be trivially increased
without any other changes to the ABI.  Bits [28:17] are reserved for
future expansion or for other uses.

Instead of the L bit, a magic value for the LINK field could be used
to indicate whether an event is in a queue.  However, using the L bit
has two advantages: a) the guest may clear it with a single bit clear
operation; and b) it does not require changing a magic value if the
size of the LINK field changes.

### Control Block

![\label{fig_control-block}Control Block](control-block.pdf)

The READY field contains a bit for each priority's queue.  A set bit
indicates that there are events pending on that queue.  A queue's
ready bit is set by Xen when an event is placed on an empty queue and
cleared by the guest when it empties the queue.

There are N HEAD and TAIL indexes, one for each priority.

The HEAD index is the first event in the queue or zero if the queue
is empty.  HEAD is set by the guest as it consumes events and only
set by Xen when adding an event to an empty queue.

The TAIL index is the last event in the queue.  It is undefined if
the queue is empty.  TAIL is only set by Xen.

Event State Machine
-------------------

Event channels are bound to a port in the domain using the existing
ABI.  A bound event may be in one of three main states.

State    Abbrev.  PML Bits  Meaning
-----    -------  --------  -------
BOUND    B        000       The event is bound but not pending.
PENDING  P        100       The event has been raised and not yet
                            acknowledged.
LINKED   L        101       The event is on an event queue.

Additionally, events may be UNMASKED or MASKED (M).

![\label{events_sm}Event State Machine](event-sm.pdf)

The state of an event is tracked using 3 bits within the event word:
the P (pending), M (masked), and L (linked) bits.
Only state transitions that change a single bit are valid.

Event Queues
------------

The event queues use a singly-linked list of event array words (see
figures \ref{fig_event-word} and \ref{fig_event-queue}).

Each VCPU has an event queue for each priority.

![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)

Each event queue has a head and tail index stored in the control
block.  The head index is the index of the first element in the
queue.  The tail index is the last element in the queue.  Every
element within the queue has the L bit set.

The LINK field in the event word indexes the next event in the
queue.  LINK is zero for the last word in the queue.

The queue is empty when the head index is zero (zero is not a valid
event channel).

Hypercalls
----------

Four new EVTCHNOP hypercall sub-operations are added:

* `EVTCHNOP_init_control`

* `EVTCHNOP_expand_array`

* `EVTCHNOP_set_priority`

* `EVTCHNOP_set_limit`

### `EVTCHNOP_init_control`

This call initializes a single VCPU's control block.

A guest should call this during initial VCPU bring-up.  The guest
must have already successfully registered a vcpu_info structure and
the control block must be in the same page.

If this call fails on the boot VCPU, the guest should continue to use
the 2-level event channel ABI for all VCPUs.  If this call fails on
any non-boot VCPU then the VCPU will be unable to receive events and
the guest should offline the VCPU.

> Note: This only initializes the control block.  At least one page
> needs to be added to the event array with `EVTCHNOP_expand_array`.

    struct evtchnop_init_control {
        uint64_t control_pfn;
        uint32_t offset;
        uint32_t vcpu;
    };

Field           Purpose
-----           -------
`control_pfn`   [in] The PFN or GMFN of the page containing the
                control block.
`offset`        [in] Offset in bytes from the start of the page to
                the beginning of the control block.
`vcpu`          [in] The VCPU number.

Error code  Reason
----------  ------
EINVAL      `vcpu` is invalid or already initialized.
EINVAL      `control_pfn` is not a valid frame for the domain.
EINVAL      `control_pfn` is not the same frame as the vcpu_info
            structure.
EINVAL      `offset` is not a multiple of 8 or the control block
            would cross a page boundary.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_expand_array`

This call expands the event array by appending an additional page.

A guest should call this when a new event channel is required and
there is insufficient space in the current event array.

It is not possible to shrink the event array once it has been
expanded.

If this call fails, then subsequent attempts to bind event channels
may fail with -ENOSPC.  If the first page cannot be added then the
guest cannot receive any events and it should panic.

    struct evtchnop_expand_array {
        uint64_t array_pfn;
    };

Field        Purpose
-----        -------
`array_pfn`  [in] The PFN or GMFN of a page to be used for the next
             page of the event array.

Error code  Reason
----------  ------
EINVAL      `array_pfn` is not a valid frame for the domain.
EINVAL      The event array already has the maximum number of pages.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_set_priority`

This call sets the priority for an event channel.  The event channel
may be bound or unbound.

The meaning and the use of the priority are up to the guest.  Valid
priorities are 0 - 15 and the default is 7.  0 is the highest
priority.

If the priority is changed on a bound event channel then at most one
event may be signalled at the previous priority.

    struct evtchnop_set_priority {
        uint32_t port;
        uint32_t priority;
    };

Field       Purpose
-----       -------
`port`      [in] The event channel.
`priority`  [in] The priority for the event channel.

Error code  Reason
----------  ------
EINVAL      `port` is invalid.
EINVAL      `priority` is outside the range 0 - 15.

### `EVTCHNOP_set_limit`

This privileged call sets the highest port number a domain can bind
an event channel to.  The default for dom0 is the maximum supported
($2^{17}-1$).
Other domains default to 1023 (requiring only a single page for
their event array).

The limit only affects future attempts to bind event channels.
Event channels that are already bound are not affected.

It is recommended that the toolstack only calls this during domain
creation, before the guest is started.

    struct evtchnop_set_limit {
        uint32_t domid;
        uint32_t max_port;
    };

Field       Purpose
-----       -------
`domid`     [in] The domain ID.
`max_port`  [in] The highest port number that the domain may bind an
            event channel to.

Error code  Reason
----------  ------
EINVAL      `domid` is invalid.
EPERM       The calling domain has insufficient privileges.

Memory Usage
------------

### Event Arrays

Xen needs to map every domain's event array into its address space.
The space reserved for these global mappings is limited to 1 GiB on
x86-64 (262144 pages) and is shared with other users.

It is non-trivial to calculate the maximum number of VMs that can be
supported as this depends on the system configuration (how many
driver domains etc.) and the VM configuration.  We can make some
assumptions and derive an approximate limit.

Each page of the event array has space for 1024 events ($E_P$) so a
regular domU will only require a single page.  Since event channels
have two ends, the upper bound on the total number of pages is
$2 \times \text{number of VMs}$.

If the guests are further restricted in the number of event channels
($E_V$) then this upper bound can be reduced further.  By assuming
that each event channel has one end in a domU and the other in dom0
(or a small number of driver domains), the ends in dom0 will be
packed together within the event array.
The number of VMs ($V$) with a limit of $P$ total event array pages
is approximately:

\begin{equation*}
V = P \div \left(1 + \frac{E_V}{E_P}\right)
\end{equation*}

Using only half the available pages and limiting guests to only 64
events gives:

\begin{eqnarray*}
V & = & (262144 / 2) \div (1 + 64 / 1024) \\
  & = & 123 \times 10^3\text{ VMs}
\end{eqnarray*}

Alternatively, we can consider a system with $D$ driver domains,
each of which requires $E_D$ events, and a dom0 using the maximum
number of pages (128).  The number of pages left over, and hence the
number of guests, is:

\begin{eqnarray*}
V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
\end{eqnarray*}

With, for example, 16 driver domains each using the maximum number
of pages:

\begin{eqnarray*}
V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
  & = & 129 \times 10^3\text{ VMs}
\end{eqnarray*}

In summary, there is space to map the event arrays for over 100,000
VMs.  This is more than the limit imposed by the 16-bit domain ID
(~32,000 VMs).

### Control Block

With $L$ priority levels and two 32-bit words for the head and tail
indexes, the amount of space ($S$) required for the control block is:

\begin{eqnarray*}
S & = & L \times 2 \times 4 + 8 \\
  & = & 16 \times 2 \times 4 + 8 \\
  & = & 136\text{ bytes}
\end{eqnarray*}

This allows the `struct vcpu_info` and the control block to be
comfortably packed into a single page.

Low Level Design
================

In the pseudocode in this section, all memory accesses are atomic,
including those to bit-fields within the event word.  All memory
accesses are considered to be strongly ordered.  The required memory
barriers for real processors will be considered in a future draft.

The following variables are used for parts of the shared data
structures.  Lowercase variables are local.

Variable  Purpose
--------  -------
E         Event array.
C         Per-VCPU control block.
H         Head index for a specific queue.
T         Tail index for a specific queue.
Raising an Event
----------------

When Xen raises an event it marks it pending and (if it is not
masked) adds it to the tail of the event queue.

    E[p].pending = 1
    if not E[p].linked and not E[p].masked
        E[p].linked = 1
        E[p].link = 0
        if H == 0
            H = p
        else
            E[T].link = p
        T = p

Concurrent access by Xen to the event queue must be protected by a
per-event queue spin lock.

Consuming Events
----------------

The guest consumes events starting at the head until it reaches the
tail.  Events in the queue that are not pending or are masked are
consumed but not handled.

To consume a single event:

    p = H
    H = E[p].link
    if H == 0
        H = E[p].link
    E[p].linked = 0
    if not E[p].masked
        handle(p)

handle() clears `E[p].pending` and EOIs level-triggered PIRQs.

> Note: When the event queue contains a single event, removing the
> event may race with Xen appending another event because the load
> of `E[p].link` and the store of `H` is not atomic.  To avoid this
> race, the guest must recheck `E[p].link` if the list appeared
> empty.

Upcall
------

When Xen places an event on an empty queue it sets the queue as
ready in the control block.  If the ready bit transitions from 0 to
1, a new event is signalled to the guest.

The guest uses the control block's ready field to find the highest
priority queue with pending events.  The bit in the ready word is
cleared when the queue is emptied.

Higher priority events do not need to preempt lower priority event
handlers, so the guest can handle events by taking one event off the
currently ready queue with the highest priority.

    r = C.ready
    while r
        q = find_first_set_bit(r)
        consume_one_event_from_queue(q)
        if C.queue[q].H == 0
            C.ready[q] = 0
            r[b] = 0
            if C.queue[q].H != 0
                C.ready[q] = 1
        r |= C.ready

> Note: The clear of the `C.ready[q]` bit may race with Xen placing
> another event on the queue.  After clearing the bit the queue must
> be checked again.

Since the upcall is reentrant, the guest should ensure that nested
upcalls return immediately without processing any events.
A per-VCPU nesting count may be used for this.

Masking Events
--------------

Events are masked by setting the masked bit.  If the event is
pending and linked it does not need to be unlinked.

    E[p].masked = 1

Unmasking Events
----------------

Events are unmasked by the guest by clearing the masked bit.  If the
event is pending the guest must call the event channel unmask
hypercall so Xen can link the event into the correct event queue.

    E[p].masked = 0
    if E[p].pending
        hypercall(EVTCHN_unmask)

The expectation here is that unmasking a pending event will be rare,
so the performance hit of the hypercall is minimal.

> Note: After clearing the mask bit, the event may be raised and
> thus it may already be linked by the time the hypercall is done.
> The mask must be cleared before testing the pending bit to avoid
> racing with the event becoming pending.
>>> On 15.02.13 at 15:32, David Vrabel <david.vrabel@citrix.com> wrote:
> ### Control Block
>
> ![\label{fig_control-block}Control Block](control-block.pdf)
>
> The READY field contains a bit for each priority's queue.  A set bit
> indicates that there are events pending on that queue.  A queue's
> ready bit is set by Xen when an event is placed on an empty queue and
> cleared by the guest when it empties the queue.
>
> There are N HEAD and TAIL indexes, one for each priority.
>
> The HEAD index is the first event in the queue or zero if the queue is
> empty.  HEAD is set by the guest as it consumes events and only set by
> Xen when adding an event to an empty queue.
>
> The TAIL index is the last event in the queue.  It is undefined if the
> queue is empty.  TAIL is only set by Xen.

One thing I continue to wonder is why the TAIL is part of the control
block at all - from all I can tell, it's not consumed by the guest
(the end of the list being indicated by a zero LINK field), and hence
is an implementation detail of the hypervisor.  And given that the
hypervisor reads that field, not exposing the field also avoids the
need to validate it each time it gets read from the control block.

Additionally, I think the number of significant bits of the LINK
field should actually be obtained from the hypervisor, rather than
being baked in at 17 (for the time being).  This again is an
implementation detail (and the value could easily be command line
controllable).

Jan
On 15/02/13 15:05, Jan Beulich wrote:
>>>> On 15.02.13 at 15:32, David Vrabel <david.vrabel@citrix.com> wrote:
>> ### Control Block
>>
>> ![\label{fig_control-block}Control Block](control-block.pdf)
>>
>> The READY field contains a bit for each priority's queue.  A set bit
>> indicates that there are events pending on that queue.  A queue's
>> ready bit is set by Xen when an event is placed on an empty queue and
>> cleared by the guest when it empties the queue.
>>
>> There are N HEAD and TAIL indexes, one for each priority.
>>
>> The HEAD index is the first event in the queue or zero if the queue is
>> empty.  HEAD is set by the guest as it consumes events and only set by
>> Xen when adding an event to an empty queue.
>>
>> The TAIL index is the last event in the queue.  It is undefined if the
>> queue is empty.  TAIL is only set by Xen.
>
> One thing I continue to wonder is why the TAIL is part of the control
> block at all - from all I can tell, it's not consumed by the guest
> (the end of the list being indicated by a zero LINK field), and hence
> is an implementation detail of the hypervisor.  And given that the
> hypervisor reads that field, not exposing the field also avoids the
> need to validate it each time it gets read from the control block.

Some of my initial back-of-envelope designs used it and I left it in
"just in case".  But the validation point is good, so I'll remove the
TAIL field.

> Additionally, I think the number of significant bits of the LINK
> field should actually be obtained from the hypervisor, rather than
> being baked in at 17 (for the time being).  This again is an
> implementation detail (and the value could easily be command line
> controllable).

I'd considered something like this but decided to defer it to a
future date when it needed to be extended.

How about adding an EVTCHNOP_get_limit which gets the maximum
permitted port?  The guest can then round up the result of this to
get the maximum number of LINK bits.

David
>>> On 15.02.13 at 16:17, David Vrabel <david.vrabel@citrix.com> wrote:
> On 15/02/13 15:05, Jan Beulich wrote:
>> Additionally, I think the number of significant bits of the LINK
>> field should actually be obtained from the hypervisor, rather than
>> being baked in at 17 (for the time being).  This again is an
>> implementation detail (and the value could easily be command line
>> controllable).
>
> I'd considered something like this but decided to defer it to a
> future date when it needed to be extended.
>
> How about adding an EVTCHNOP_get_limit which gets the maximum
> permitted port?  The guest can then round up the result of this to
> get the maximum number of LINK bits.

I'd actually want to make this an output of the setup operation.
Whether to return a bit count or the number of ports is secondary to
me.

Jan
Apart from the TAIL field, I have some comments on the hypercall
interface.  Please see inline comments.

On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
>
> Hypercalls
> ----------
>
> Four new EVTCHNOP hypercall sub-operations are added:
>
> * `EVTCHNOP_init_control`
>
> * `EVTCHNOP_expand_array`
>
> * `EVTCHNOP_set_priority`
>
> * `EVTCHNOP_set_limit`
>
> ### `EVTCHNOP_init_control`
>
> This call initializes a single VCPU's control block.
>
> A guest should call this during initial VCPU bring-up.  The guest must
> have already successfully registered a vcpu_info structure and the
> control block must be in the same page.
>
> If this call fails on the boot VCPU, the guest should continue to use
> the 2-level event channel ABI for all VCPUs.  If this call fails on
> any non-boot VCPU then the VCPU will be unable to receive events and
> the guest should offline the VCPU.
>

Why offline this CPU?  This call only registers the control block; we
can switch back to using 2-level for all CPUs.

This is not a right or wrong question, but consider: how many VCPUs
will a guest have?  If the number is small, users would probably
choose to have all their CPUs online over the new event interface.

> > Note: This only initializes the control block.  At least one page
> > needs to be added to the event array with `EVTCHNOP_expand_array`.
>
>     struct evtchnop_init_control {
>         uint64_t control_pfn;
>         uint32_t offset;
>         uint32_t vcpu;
>     };
>
> Field           Purpose
> -----           -------
> `control_pfn`   [in] The PFN or GMFN of the page containing the
>                 control block.
> `offset`        [in] Offset in bytes from the start of the page to
>                 the beginning of the control block.
> `vcpu`          [in] The VCPU number.
>
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `control_pfn` is not a valid frame for the domain.
> EINVAL      `control_pfn` is not the same frame as the vcpu_info
>             structure.

Hmm...
then you can omit `control_pfn` entirely?

> EINVAL      `offset` is not a multiple of 8 or the control block
>             would cross a page boundary.
> ENOMEM      Insufficient memory to allocate internal structures.
>
> ### `EVTCHNOP_expand_array`
>
> This call expands the event array by appending an additional page.
>

A bit of an implementation detail: what state should the guest /
hypervisor be in when issuing this call?

> A guest should call this when a new event channel is required and
> there is insufficient space in the current event array.
>
> It is not possible to shrink the event array once it has been
> expanded.
>
> If this call fails, then subsequent attempts to bind event channels
> may fail with -ENOSPC.  If the first page cannot be added then the
> guest cannot receive any events and it should panic.
>
>     struct evtchnop_expand_array {
>         uint64_t array_pfn;
>     };
>
> Field        Purpose
> -----        -------
> `array_pfn`  [in] The PFN or GMFN of a page to be used for the next
>              page of the event array.
>
> Error code  Reason
> ----------  ------
> EINVAL      `array_pfn` is not a valid frame for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.

[snip]

> ### `EVTCHNOP_set_limit`
>
> This privileged call sets the highest port number a domain can bind
> an event channel to.  The default for dom0 is the maximum supported
> ($2^{17}-1$).  Other domains default to 1023 (requiring only a
> single page for their event array).
>
> The limit only affects future attempts to bind event channels.
> Event channels that are already bound are not affected.
>
> It is recommended that the toolstack only calls this during domain
> creation before the guest is started.
>
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_port;
>     };
>
> Field       Purpose
> -----       -------
> `domid`     [in] The domain ID.
> `max_port`  [in] The highest port number that the domain may bind an
>             event channel to.
>
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
>

Is it OK to shrink the port limit?

> Upcall
> ------
>
> When Xen places an event on an empty queue it sets the queue as ready
> in the control block.  If the ready bit transitions from 0 to 1, a
> new event is signalled to the guest.
>
> The guest uses the control block's ready field to find the highest
> priority queue with pending events.  The bit in the ready word is
> cleared when the queue is emptied.
>
> Higher priority events do not need to preempt lower priority event
> handlers so the guest can handle events by taking one event off the
> currently ready queue with the highest priority.
>
>     r = C.ready
>     while r
>         q = find_first_set_bit(r)
>         consume_one_event_from_queue(q)
>         if C.queue[q].H == 0
>             C.ready[q] = 0
>             r[b] = 0

What is b?  Could it be clear_bit(r, q)?

Wei.
On 15/02/13 18:19, Wei Liu wrote:
> Apart from the TAIL field, I have some comments on the hypercall
> interface.  Please see inline comments.
>
> On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
>>
>> ### `EVTCHNOP_init_control`
>>
>> This call initializes a single VCPU's control block.
>>
>> A guest should call this during initial VCPU bring-up.  The guest
>> must have already successfully registered a vcpu_info structure and
>> the control block must be in the same page.
>>
>> If this call fails on the boot VCPU, the guest should continue to use
>> the 2-level event channel ABI for all VCPUs.  If this call fails on
>> any non-boot VCPU then the VCPU will be unable to receive events and
>> the guest should offline the VCPU.
>>
>
> Why offline this CPU?  This call only registers the control block; we
> can switch back to using 2-level for all CPUs.

Because switching back for the other VCPUs is potentially impossible
(we may have hotplugged this new VCPU and more than 4096 event
channels may be currently in use).  It would also require migrating
events between the new and old data structures, which is hard.

I would expect this call to never fail in normal operation (except on
the boot VCPU where support may be missing), i.e., it will only fail
if the xenheap or map space is exhausted.  Both the xenheap and the
map space should be large enough that we run out of other resources
(e.g., domheap) first.

> This is not a right or wrong question, but consider: how many VCPUs
> will a guest have?  If the number is small, users would probably
> choose to have all their CPUs online over the new event interface.
>
>>> Note: This only initializes the control block.  At least one page
>>> needs to be added to the event array with `EVTCHNOP_expand_array`.
>>
>>     struct evtchnop_init_control {
>>         uint64_t control_pfn;
>>         uint32_t offset;
>>         uint32_t vcpu;
>>     };
>>
>> Field           Purpose
>> -----           -------
>> `control_pfn`   [in] The PFN or GMFN of the page containing the
>>                 control block.
>> `offset`        [in] Offset in bytes from the start of the page to
>>                 the beginning of the control block.
>> `vcpu`          [in] The VCPU number.
>>
>> Error code  Reason
>> ----------  ------
>> EINVAL      `vcpu` is invalid or already initialized.
>> EINVAL      `control_pfn` is not a valid frame for the domain.
>> EINVAL      `control_pfn` is not the same frame as the vcpu_info
>>             structure.
>
> Hmm... then you can omit control_pfn entirely?

I left it in so the requirement that the two structures share a page
can be relaxed in the future.

And it will catch callers that haven't laid out their structures
correctly, which is useful.

>> EINVAL      `offset` is not a multiple of 8 or the control block
>>             would cross a page boundary.
>> ENOMEM      Insufficient memory to allocate internal structures.
>>
>> ### `EVTCHNOP_expand_array`
>>
>> This call expands the event array by appending an additional page.
>>
>
> A bit of an implementation detail: what state should the guest /
> hypervisor be in when issuing this call?

The boot VCPU shall have successfully initialized its control block.
I don't think there are any other restrictions on this call.

>> ### `EVTCHNOP_set_limit`
>>
>> This privileged call sets the highest port number a domain can bind
>> an event channel to.  The default for dom0 is the maximum supported
>> ($2^{17}-1$).  Other domains default to 1023 (requiring only a
>> single page for their event array).
>>
>> The limit only affects future attempts to bind event channels.
>> Event channels that are already bound are not affected.
>>
>> It is recommended that the toolstack only calls this during domain
>> creation before the guest is started.
>>
>
> Is it OK to shrink the port limit?

Yes, but note the effect this has: "The limit only affects future
attempts to bind event channels.  Event channels that are already
bound are not affected."

>> Upcall
>> ------
>>
>> When Xen places an event on an empty queue it sets the queue as ready
>> in the control block.
>> If the ready bit transitions from 0 to 1, a new event is signalled
>> to the guest.
>>
>> The guest uses the control block's ready field to find the highest
>> priority queue with pending events.  The bit in the ready word is
>> cleared when the queue is emptied.
>>
>> Higher priority events do not need to preempt lower priority event
>> handlers so the guest can handle events by taking one event off the
>> currently ready queue with the highest priority.
>>
>>     r = C.ready
>>     while r
>>         q = find_first_set_bit(r)
>>         consume_one_event_from_queue(q)
>>         if C.queue[q].H == 0
>>             C.ready[q] = 0
>>             r[b] = 0
>
> What is b?  Could it be clear_bit(r, q)?

Typo, oops.  It should be:

    r[q] = 0

i.e., clear the q'th bit of r, so clear_bit(q, r), yes.

Typos notwithstanding, would the pseudocode be clearer if it used
Linux-style bit ops?

David
On Fri, 2013-02-15 at 18:46 +0000, David Vrabel wrote:
> On 15/02/13 18:19, Wei Liu wrote:
> > Apart from the TAIL field, I have some comments on the hypercall
> > interface.  Please see inline comments.
> >
> > On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
> >>
> >> ### `EVTCHNOP_init_control`
> >>
> >> This call initializes a single VCPU's control block.
> >>
> >> A guest should call this during initial VCPU bring-up.  The guest
> >> must have already successfully registered a vcpu_info structure and
> >> the control block must be in the same page.
> >>
> >> If this call fails on the boot VCPU, the guest should continue to use
> >> the 2-level event channel ABI for all VCPUs.  If this call fails on
> >> any non-boot VCPU then the VCPU will be unable to receive events and
> >> the guest should offline the VCPU.
> >>
> >
> > Why offline this CPU?  This call only registers the control block; we
> > can switch back to using 2-level for all CPUs.
>
> Because switching back for the other VCPUs is potentially impossible
> (we may have hotplugged this new VCPU and more than 4096 event
> channels may be currently in use).  It would also require migrating
> events between the new and old data structures, which is hard.
>

If you hot-plug a VCPU, can it possibly bypass the cpu_possible_map
limit?  The VCPU info placement call is issued for all possible CPUs;
in this case, you can have the new ABI set up for all possible CPUs
at start of day, so no migration is needed.

> I would expect this call to never fail in normal operation (except on
> the boot VCPU where support may be missing), i.e., it will only fail
> if the xenheap or map space is exhausted.  Both the xenheap and the
> map space should be large enough that we run out of other resources
> (e.g., domheap) first.
>

[snip]

> The boot VCPU shall have successfully initialized its control block.
> I don't think there are any other restrictions on this call.
>

So do you mean that this is only called when the guest starts?

Wei.
>>> On 15.02.13 at 19:46, David Vrabel <david.vrabel@citrix.com> wrote:
> I would expect this call to never fail in normal operation (except on
> the boot VCPU where support may be missing), i.e., it will only fail
> if xenheap or map space is exhausted.  Both the xenheap and the map
> space should be large enough that we run out of other resources
> (e.g., domheap) first.

This is a questionable assumption, considering that the Xen heap is a
subset of the domain one (on x86-64 at least), and the map space size
has no correlation to either.

Jan
On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
> Except for a lack of information of the required memory barriers (to
> be included in draft C), this should now be completely enough to
> start on a prototype (any volunteers? ;).

I think the real question is whether you, as creator of the design, are
going to be working on the implementation.  That would be the usual
progression of things.

As I see it we have on the one hand a substantially complete
implementation of one solution and a design draft of the other with
nobody signed up to even implement a prototype at the moment.

Given that, I think we should surely go with the available solution for
4.3, which does meet our current needs AFAICT (most of the stuff which
the other design adds is gravy).  If something better comes along (be
that the FIFO design or something else) in the future then we can
consider it on its merits then.

Ian.
On 18/02/13 15:19, Ian Campbell wrote:
> On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
>> Except for a lack of information of the required memory barriers (to
>> be included in draft C), this should now be completely enough to
>> start on a prototype (any volunteers? ;).
>
> I think the real question is whether you as creator of the design are
> going to be working on the implementation.  That would be the usual
> progression of things.

Yes, but I can't work on it full-time.

David
On 18/02/2013 15:19, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
>> Except for a lack of information of the required memory barriers (to
>> be included in draft C), this should now be completely enough to
>> start on a prototype (any volunteers? ;).
>
> I think the real question is whether you as creator of the design are
> going to be working on the implementation.  That would be the usual
> progression of things.
>
> As I see it we have on the one hand a substantially complete
> implementation of one solution and a design draft of the other with
> nobody signed up to even implement a prototype at the moment.
>
> Given that I think we should surely go with the available solution
> for 4.3, which does meet our current needs AFAICT (most of the stuff
> which the other design adds is gravy AFAICT).  If something better
> comes along (be that the FIFO design or something else) in the future
> then we can consider it on its merits then.

Oh, this is certainly true.  I assumed David was planning to implement
his stuff, but if that's not the case, it's nice but not really going
to affect the forward progress of the existing 3-level patches into
xen-unstable.

 -- Keir

> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
On Mon, Feb 18, 2013 at 8:16 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 18/02/13 15:19, Ian Campbell wrote:
> > On Fri, 2013-02-15 at 14:32 +0000, David Vrabel wrote:
> >> Except for a lack of information of the required memory barriers
> >> (to be included in draft C), this should now be completely enough
> >> to start on a prototype (any volunteers? ;).
> >
> > I think the real question is whether you as creator of the design
> > are going to be working on the implementation.  That would be the
> > usual progression of things.
>
> Yes, but I can't work on it full-time.
>
> David

I would really like to help, but I'm mainly occupied by Xen network
improvement at the moment.  I will be happy to join the discussion on
the new design and share my humble two cents.

One thing I am thinking about is the registration interface.  The
naming for now mainly focuses on the N-level event channel, but there
might be other models in the future, so I will rename it to
EVTCHNOP_register_extended_event_channel and make registering each
model a subcommand of that interface.  The toolstack will also need
adjustment.

AFAICT the FIFO design will be enabled for all guests while the 3-level
design only aims for Dom0 and driver domains.

Anyway, let's discuss this later, after I post my new series.

Wei.
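[For concreteness, a sketch of what the proposed
EVTCHNOP_register_extended_event_channel interface with per-model
subcommands could look like.  All constant names, field names, and the
union layout here are assumptions for illustration; the actual
interface would be settled by the patch series.]

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical subcommand values, one per extended event channel
 * model; further models could be added later. */
#define EVTCHN_EXTENDED_3_LEVEL  1
#define EVTCHN_EXTENDED_FIFO     2

/* Hypothetical argument structure: the subcommand selects the model,
 * and model-specific registration parameters live in the union.
 * Explicit padding and a uint64_t member keep the layout identical
 * for 32-bit and 64-bit guests. */
struct evtchn_register_extended {
    uint32_t cmd;        /* which event channel model to register */
    uint32_t pad;        /* explicit padding, keeps layout 32/64-bit clean */
    union {
        /* per-model registration parameters would go here */
        uint64_t dummy;  /* forces 8-byte size and alignment of the union */
    } u;
};
```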
David Vrabel writes ("[Xen-devel] FIFO-based event channel ABI design (draft B)"):
> Event State Machine
> -------------------
>
> Event channels are bound to a port in the domain using the existing
> ABI.
>
> A bound event may be in one of three main states.
>
> State    Abbrev.  PML Bits  Meaning
> -----    -------  --------  -------
> BOUND    B        000       The event is bound but not pending.
> PENDING  P        100       The event has been raised and not yet
>                             acknowledged.
> LINKED   L        101       The event is on an event queue.
>
> Additionally, events may be UNMASKED or MASKED (M).
>
> ![\label{events_sm}Event State Machine](event-sm.pdf)

This high-level state machine description doesn't capture the transient
states which exist when Xen and/or the guest need to update multiple
words at once.

In order to check that this protocol is correct, I think you should
draw a state diagram which includes these transient states, and verify
that the edges all operate as intended.

Ian.
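[As an illustration of the PML encoding in the quoted table, here is a
sketch of decoding an event word's main state.  The bit positions are
assumptions for this example; the draft specifies only the P/M/L
encoding, not where the bits live in the event word.  The transient
multi-word-update states Ian raises are deliberately not modelled.]

```c
#include <stdint.h>

/* Hypothetical bit positions for the P, M and L flags. */
#define EVT_PENDING (1u << 2)  /* P: raised, not yet acknowledged */
#define EVT_MASKED  (1u << 1)  /* M: delivery suppressed */
#define EVT_LINKED  (1u << 0)  /* L: on an event queue */

enum event_state { EVENT_BOUND, EVENT_PENDING, EVENT_LINKED };

/* Decode the three main states from the table: BOUND is PML == 000,
 * PENDING is PML == 100, LINKED is PML == 101.  MASKED is orthogonal
 * to the main state and so is ignored here. */
enum event_state event_main_state(uint32_t word)
{
    if (word & EVT_LINKED)     /* PML == 101 */
        return EVENT_LINKED;
    if (word & EVT_PENDING)    /* PML == 100 */
        return EVENT_PENDING;
    return EVENT_BOUND;        /* PML == 000 */
}
```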