Hi,

Here is a design for a scalable event channel ABI for Xen. It has a
number of benefits over the design currently being proposed by Wei Liu.

* More event channels (>100,000).
* 16 event priorities.
* Reduced memory requirements (only 1 additional page per domU).
* The use of FIFOs for events ensures fairness, whereas it is difficult
  to reason about the fairness with the current bitmap system.

The PDF version is easier to read and has diagrams and readable maths
but the original markdown format document is included below (for ease
of commenting).

http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf

Introduction
============

Purpose
-------

Xen uses event channels to signal events (interrupts) to (fully or
partially) paravirtualized guests. The current event channel ABI
provided by Xen only supports up to 1024 (for 32-bit guests) or 4096
(for 64-bit guests) event channels. This limits scalability as
support for more VMs, VCPUs and devices is required.

The existing ABI does not easily allow events to have different
priorities. Current Linux kernels prioritize the timer event by
special-casing it, but this is not generalizable to more events.
Event priorities may be useful for prioritizing MMIO emulation
requests over bulk data traffic (such as network or disk).

This design replaces the existing event channel ABI with one that:

- is scalable to more than 100,000 event channels, with scope for
  increasing this further with minimal ABI changes.

- allows guests to use up to 16 different event priorities.

System Overview
---------------

[FIXME: diagram showing Xen and guest and shared memory block for events?]

Design Map
----------

A new event channel ABI requires changes to Xen and the guest kernels.

References
----------

[FIXME: link to alternate proposal?]

Design Considerations
=====================

Assumptions
-----------

* Atomic read-modify-write of 32-bit words is possible on all
  supported platforms. This can be with a linked-load /
  store-conditional (e.g., ARMv8's ldrx/strx) or a compare-and-swap
  (e.g., x86's cmpxchg). A minimal sketch of such an update is shown
  below.
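As an illustration only (not part of the proposed ABI), the snippet
below shows such an atomic update of an event word using a
compare-and-swap loop in C11; the flag bit position is a placeholder.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Placeholder flag bit -- the real layout is given by the event
     * array word figure later in this document. */
    #define EVTWORD_PENDING (1u << 31)

    /* Atomically set a flag in a 32-bit event word shared between Xen
     * and the guest, using a compare-and-swap loop (cmpxchg or
     * ldrex/strex style). */
    static inline void evtword_set_flag(_Atomic uint32_t *word,
                                        uint32_t flag)
    {
        uint32_t old = atomic_load(word);

        /* Retry until the CAS succeeds; 'old' is refreshed on failure. */
        while (!atomic_compare_exchange_weak(word, &old, old | flag))
            ;
    }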
Constraints
-----------

* The existing ABI must continue to be usable. Compatibility with
  existing guests is mandatory.

Risks and Volatile Areas
------------------------

* Should the 3-level proposal be merged into Xen then this design does
  not offer enough improvements to warrant the cost of maintaining
  three different event channel ABIs in Xen and guest kernels.

* The performance of some operations may be decreased. Specifically,
  retriggering an event now always requires a hypercall.

Architecture
============

Overview
--------

The event channel ABI uses a data structure that is shared between Xen
and the guest. Access to the structure is done with lock-less
operations (except for some less common operations where the guest
must use a hypercall). The guest is responsible for allocating this
structure and registering it with Xen during VCPU bring-up.

Events are reported to a guest's VCPU using a FIFO queue. There is a
queue for each priority level and each VCPU.

Each event has a _pending_ and a _masked_ bit. The pending bit
indicates the event has been raised. The masked bit is used by the
guest to prevent delivery of that specific event.

High Level Design
=================

Shared Event Data Structure
---------------------------

The shared event data structure has a per-domain _event array_, and a
per-VCPU _control block_.

- _event array_: A logical array of event words (one per event
  channel) which contains the pending and mask bits and the link index
  for the next event in the queue.

![\label{fig_event-word}Event Array Word](event-word.pdf)

- _control block_: This contains the head and tail indexes for each
  priority level. Each VCPU has its own control block and this is
  contained in the existing `struct vcpu_info`.

The pages within the event array need not be physically nor virtually
contiguous, but the guest or Xen may make them virtually contiguous
for ease of implementation, e.g., by using vmap() in Xen or vmalloc()
in Linux. Pages are added by the guest as required by the number of
bound event channels.

Only 17 bits are currently defined for the LINK field, allowing 2^17^
(131,072) events. This limit can be trivially increased without any
other changes to the ABI. Bits [28:17] are reserved for future
expansion or for other uses.

Event State Machine
-------------------

Event channels are bound to a domain using the existing ABI.

A bound event may be in one of three main states.

State    Abbrev.  Meaning
-----    -------  -------
BOUND    B        The event is bound but not pending.
PENDING  P        The event has been raised and not yet acknowledged.
LINKED   L        The event is on an event queue.

Additionally, events may be UNMASKED or MASKED (M).

![\label{events_sm}Event State Machine](event-sm.pdf)

The state of an event is tracked using 3 bits within the event word:
the M (masked), P (pending) and L (linked) bits. Only state
transitions that change a single bit are valid.

Event Queues
------------

The event queues use a singly-linked list of event array words (see
figures \ref{fig_event-word} and \ref{fig_event-queue}).

![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)

Each event queue has a head and tail index stored in the control
block. The head index is the index of the first element in the queue.
The tail index is the index of the last element in the queue. Every
element within the queue has the L bit set.

The LINK field in the event word indexes the next event in the queue.
LINK is zero for the last word in the queue.

The queue is empty when the head index is zero (zero is not a valid
event channel).
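To make the shared structures concrete, the sketch below renders them
in C. The bit positions of the M, P and L flags are not fixed by the
text above (only the 17-bit LINK field and the reserved bits [28:17]
are), so the positions and names used here are illustrative
assumptions, not part of the proposed ABI.

    #include <stdint.h>

    /* Event array word layout (illustrative -- see the event word
     * figure). LINK occupies bits [16:0]; bits [28:17] are reserved;
     * the M, P and L flags are assumed here to live in the top bits. */
    #define EVTWORD_LINK_MASK  0x0001ffffu   /* 17-bit LINK field       */
    #define EVTWORD_LINKED     (1u << 29)    /* L: event is on a queue  */
    #define EVTWORD_MASKED     (1u << 30)    /* M: delivery suppressed  */
    #define EVTWORD_PENDING    (1u << 31)    /* P: event has been raised */

    #define EVTCHN_FIFO_PRIORITIES 16

    /* Event array: a logical array of event words, one per event
     * channel, built from guest pages of 1024 words each. */
    typedef uint32_t event_word_t;

    /* Per-VCPU control block: head and tail indexes for each of the
     * 16 priority queues (16 * 2 * 4 = 128 bytes), embedded in the
     * existing struct vcpu_info. */
    struct evtchn_fifo_control_block {
        struct {
            uint32_t head;   /* index of first event word, 0 == empty */
            uint32_t tail;   /* index of last event word              */
        } queue[EVTCHN_FIFO_PRIORITIES];
    };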
Hypercalls
----------

Four new EVTCHNOP hypercall sub-operations are added:

* `EVTCHNOP_init`

* `EVTCHNOP_expand`

* `EVTCHNOP_set_priority`

* `EVTCHNOP_set_limit`

### `EVTCHNOP_init`

This call initializes a single VCPU's event channel data structures,
adding one page for the event array.

A guest should call this during initial VCPU bring up (and on resume?).

    struct evtchnop_init {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

Field        Purpose
-----        -------
`vcpu`       [in] The VCPU number.
`array_pfn`  [in] The PFN or GMFN of a page to be used for the first
             page of the event array.

Error code  Reason
----------  ------
EINVAL      `vcpu` is invalid or already initialized.
EINVAL      `array_pfn` is not a valid frame for the domain.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_expand`

This call expands the event array for a VCPU by appending an
additional page.

A guest should call this for all VCPUs when a new event channel is
required and there is insufficient space in the current event array.

It is not possible to shrink the event array once it has been
expanded.

    struct evtchnop_expand {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

Field        Purpose
-----        -------
`vcpu`       [in] The VCPU number.
`array_pfn`  [in] The PFN or GMFN of a page to be used for the next
             page of the event array.

Error code  Reason
----------  ------
EINVAL      `vcpu` is invalid or already initialized.
EINVAL      `array_pfn` is not a valid frame for the domain.
EINVAL      The event array already has the maximum number of pages.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_set_priority`

This call sets the priority for an event channel. The event must be
unbound.

A guest may call this prior to binding an event channel. The meaning
and the use of the priority are up to the guest. Valid priorities are
0 - 15 and the default is 7.

    struct evtchnop_set_priority {
        uint32_t port;
        uint32_t priority;
    };

Field       Purpose
-----       -------
`port`      [in] The event channel.
`priority`  [in] The priority for the event channel.

Error code  Reason
----------  ------
EINVAL      `port` is invalid.
EINVAL      `port` is currently bound.
EINVAL      `priority` is outside the range 0 - 15.

### `EVTCHNOP_set_limit`

This privileged call sets the maximum number of event channels a
domain can bind. The default for dom0 is unlimited. Other domains
default to 1024 events (requiring only a single page for their event
array).

    struct evtchnop_set_limit {
        uint32_t domid;
        uint32_t max_events;
    };

Field         Purpose
-----         -------
`domid`       [in] The domain ID.
`max_events`  [in] The maximum number of event channels that may be
              bound to the domain.

Error code  Reason
----------  ------
EINVAL      `domid` is invalid.
EPERM       The calling domain has insufficient privileges.
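For illustration, the sketch below shows how a guest kernel might
drive these sub-ops: `EVTCHNOP_init` at VCPU bring-up and
`EVTCHNOP_expand` when a newly bound port no longer fits in the
current array. The sub-op numbers, the `event_channel_op()` wrapper
and the page allocation helper are placeholders, not defined by this
proposal (in Linux the wrapper would sit behind
HYPERVISOR_event_channel_op()).

    #include <stdint.h>

    /* Placeholder hypercall wrapper and page allocation helper. */
    extern int event_channel_op(int cmd, void *arg);
    extern uint64_t alloc_shared_page_gfn(void);

    /* Proposed sub-ops -- numbers are placeholders, not yet allocated. */
    #define EVTCHNOP_init   100
    #define EVTCHNOP_expand 101

    struct evtchnop_init {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

    struct evtchnop_expand {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

    /* Called during initial VCPU bring-up: register the first event
     * array page for this VCPU. */
    static int evtchn_fifo_init_vcpu(uint32_t vcpu)
    {
        struct evtchnop_init init = {
            .vcpu = vcpu,
            .array_pfn = alloc_shared_page_gfn(),
        };

        return event_channel_op(EVTCHNOP_init, &init);
    }

    /* Called when a newly bound port does not fit in the current event
     * array: append one more page (1024 more event words). */
    static int evtchn_fifo_expand_vcpu(uint32_t vcpu)
    {
        struct evtchnop_expand expand = {
            .vcpu = vcpu,
            .array_pfn = alloc_shared_page_gfn(),
        };

        return event_channel_op(EVTCHNOP_expand, &expand);
    }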
Memory Usage
------------

### Event Arrays

Xen needs to map every domain's event array into its address space.
The space reserved for these global mappings is limited to 1 GiB on
x86-64 (262144 pages) and is shared with other users.

It is non-trivial to calculate the maximum number of VMs that can be
supported as this depends on the system configuration (how many driver
domains etc.) and VM configuration. We can make some assumptions and
derive an approximate limit.

Each page of the event array has space for 1024 events ($E_P$) so a
regular domU will only require a single page. Since event channels
typically come in pairs, the upper bound on the total number of pages
is $2 \times \text{number of VMs}$.

If the guests are further restricted in the number of event channels
($E_V$) then this upper bound can be reduced further.

The number of VMs ($V$) with a limit of $P$ total event array pages is:
\begin{equation*}
V = P \div \left(1 + \frac{E_V}{E_P}\right)
\end{equation*}

Using only half the available pages and limiting guests to only 64
events gives:
\begin{eqnarray*}
V & = & (262144 / 2) \div (1 + 64 / 1024) \\
  & = & 123 \times 10^3\text{ VMs}
\end{eqnarray*}

Alternatively, we can consider a system with $D$ driver domains, each
of which requires $E_D$ events, and a dom0 using the maximum number of
pages (128).

\begin{eqnarray*}
V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
\end{eqnarray*}

With, for example, 16 driver domains each using the maximum number of
pages:
\begin{eqnarray*}
V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
  & = & 129 \times 10^3\text{ VMs}
\end{eqnarray*}

In summary, there is space to map the event arrays for over 100,000
VMs. This is more than the limit imposed by the 16 bit domain ID
(~32,000 VMs).

### Control Block

With $L$ priority levels and two 32-bit words for the head and tail
indexes, the amount of space ($S$) required in the `struct vcpu_info`
for the control block is:
\begin{eqnarray*}
S & = & L \times 2 \times 4 \\
  & = & 16 \times 2 \times 4 \\
  & = & 128\text{ bytes}
\end{eqnarray*}

There is more than enough space within `struct vcpu_info` for the
control block while still keeping plenty of space for future use.

Low Level Design
================

In the pseudo code in this section, all memory accesses are atomic,
including those to bit-fields within the event word.

These variables have a standard meaning:

Variable  Purpose
--------  -------
E         Event array.
p         A specific event.
H         The head index for a specific priority.
T         The tail index for a specific priority.

These memory barriers are required:

Function  Purpose
--------  -------
mb()      Full (read/write) memory barrier

Raising an Event
----------------

When Xen raises an event it marks it pending and (if it is not masked)
adds it to the tail of the event queue.

    E[p].pending = 1
    if not E[p].linked and not E[n].masked
        E[p].linked = 1
        E[p].link = 0
        mb()
        if H == 0
            H = p
        else
            E[T].link = p
        T = p

Concurrent access by Xen to the event queue must be protected by a
per-event queue spin lock.

Consuming Events
----------------

The guest consumes events starting at the head until it reaches the
tail. Events in the queue that are not pending or are masked are
consumed but not handled.

    while H != 0
        p = H
        H = E[p].link
        if H == 0
            mb()
            H = E[p].link
        E[H].linked = 0
        if not E[p].masked
            handle(p)

handle() clears `E[p].pending` and EOIs level-triggered PIRQs.

> Note: When the event queue contains a single event, removing the
> event may race with Xen appending another event because the load of
> `E[p].link` and the store of `H` is not atomic. To avoid this race,
> the guest must recheck `E[p].link` if the list appeared empty.

Masking Events
--------------

Events are masked by setting the masked bit. If the event is pending
and linked it does not need to be unlinked.

    E[p].masked = 1

Unmasking Events
----------------

Events are unmasked by the guest by clearing the masked bit. If the
event is pending the guest must call the event channel unmask
hypercall so Xen can link the event into the correct event queue.

    E[p].masked = 0
    if E[p].pending
        hypercall(EVTCHN_unmask)

The expectation here is that unmasking a pending event will be rare,
so the performance hit of the hypercall is minimal.

> Note that after clearing the mask bit, the event may be raised and
> thus it may already be linked by the time the hypercall is done.
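To tie the pseudo code together, here is a hedged C rendering of the
guest-side consume loop and unmask path, reusing the illustrative bit
layout sketched earlier. `handle_port()` and the unmask hypercall
wrapper are placeholders, and the unlink step is interpreted here as
clearing the L bit of the consumed word.

    #include <stdatomic.h>
    #include <stdint.h>

    #define EVTWORD_LINK_MASK 0x0001ffffu
    #define EVTWORD_LINKED    (1u << 29)
    #define EVTWORD_MASKED    (1u << 30)
    #define EVTWORD_PENDING   (1u << 31)

    extern _Atomic uint32_t event_array[];      /* shared with Xen */
    extern void handle_port(uint32_t port);     /* clears P, EOIs PIRQs */
    extern void evtchn_unmask_hypercall(uint32_t port); /* placeholder */

    /* Consume all events on one priority queue; 'head' points at that
     * priority's head index in the control block (also shared). */
    static void consume_queue(_Atomic uint32_t *head)
    {
        uint32_t h = atomic_load(head);

        while (h != 0) {
            uint32_t p = h;
            uint32_t w = atomic_load(&event_array[p]);

            h = w & EVTWORD_LINK_MASK;
            if (h == 0) {
                /* Queue looked empty: re-read the link in case Xen
                 * appended another event after we loaded the word (see
                 * the note in "Consuming Events"). */
                atomic_thread_fence(memory_order_seq_cst);
                h = atomic_load(&event_array[p]) & EVTWORD_LINK_MASK;
            }
            atomic_store(head, h);

            /* Unlink the consumed word: clear its L bit. */
            atomic_fetch_and(&event_array[p], ~EVTWORD_LINKED);

            if (!(atomic_load(&event_array[p]) & EVTWORD_MASKED))
                handle_port(p);
        }
    }

    /* Unmask a port: clear M and, if the event is pending, ask Xen to
     * link it onto the correct queue (see "Unmasking Events"). */
    static void unmask_port(uint32_t port)
    {
        uint32_t w = atomic_fetch_and(&event_array[port], ~EVTWORD_MASKED);

        if (w & EVTWORD_PENDING)
            evtchn_unmask_hypercall(port);
    }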
On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:

> Hi,
>
> Here is a design for a scalable event channel ABI for Xen. It has a
> number of benefits over the design currently being proposed by Wei Liu.
>
> * More event channels (>100,000).
> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
>   to reason about the fairness with the current bitmap system.

I have some sympathy for this design. Its primary downside compared
with the 3-level alternative is its greater space cost (32*#vcpus).
However, as you say, the fairness and prioritisation features are
rather nice. Also, having the data structures be per vcpu may well
help avoid cacheline contention on busy multi-vcpu guests.

Interested in what others think, but I may actually prefer a ground-up
redesign like this.

 -- Keir
Nice work and interesting reading for the night! Please see inline
comments.

On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> Hi,
>
> Here is a design for a scalable event channel ABI for Xen. It has a
> number of benefits over the design currently being proposed by Wei Liu.
>
> * More event channels (>100,000).

This is not a benefit over the 3-level design. ;-)

> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
>   to reason about the fairness with the current bitmap system.

These are! ;-)

[...]

> The event channel ABI uses a data structure that is shared between Xen
> and the guest. Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall). The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.

Just for completeness, legacy guests might not register vcpu info -
they just use the vcpu info inside shared info.
And a modern guest might use the vcpu info inside shared info along
with registered per-vcpu vcpu info if registration fails half way. But
this will not be a big problem for this design.

[...]

> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits. Only state
> transitions that change a single bit are valid.

Where is the "priority" information stored? I think it should be
somewhere inside Xen, right? I presume a field in struct evtchn?

[...]

> ### `EVTCHNOP_init`
>
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
I think the registration for new event channels should always be done
in a batch. If not, then you need to provide another OP to roll back
previously registered ones.

[...]

> ### `EVTCHNOP_set_limit`
>
> This privileged call sets the maximum number of event channels a
> domain can bind. The default for dom0 is unlimited. Other domains
> default to 1024 events (requiring only a single page for their event
> array).

Is shrinking max_events legal? This should also deal with the
max_events > design_limit case.

[...]

> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.
> Since event channels
> typically come in pairs, the upper bound on the total number of pages
                               ^^^^^ upper or lower?
> is $2 \times \text{number of VMs}$.
>
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.

Can this bound really be reduced? Can you map memory at non-page
granularity?

[...]

> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.

I presume "E[n]" in the pseudo code is "E[p]"?

Is this spin lock really a good idea? How many threads / CPUs will
spin on this lock? As [0] shows, contention on spin locks incurs a
heavy performance penalty.

[0] https://lwn.net/Articles/530458/

[...]

> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.

How do we synchronize access to the array and control blocks between
Xen and the guest? I'm afraid I have no knowledge of a Xen-guest spin
lock...
AFAICT at least the inter-domain event channel is a corner case in your
design - CPU A is handling events in the Linux kernel while CPU B tries
to raise an inter-domain port to A. If this is not a problem, please
also clarify a bit. (Sorry, I'm a bit fuzzy after a day's work.)

[...]

> Unmasking Events
> ----------------
>
> Events are unmasked by the guest by clearing the masked bit. If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
>
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
>
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.

Currently unmasking a "remote" port requires issuing a hypercall as
well, so if unmasking is not very frequent, this is not a big problem.

But please take some interrupt-intensive workloads into consideration.
For example, a 1G NIC (e1000), even with NAPI enabled, can generate
27k+ interrupts per second under high packet load [1]; a 10G NIC can
surely generate more. Can you give an estimate of the performance hit
on the context switch?

[1] http://www.linuxfoundation.org/collaborate/workgroups/networking/napi

> > Note that after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.

Even if the event has already been linked before you finish the
hypercall, you would still need to get hold of a lock to serialize
access to the event structure for checking, right? Or is a test_bit on
the linked field sufficient? I think you need to write some pseudo code
for this as well.


Thanks
Wei.
On 04/02/2013 21:07, "Wei Liu" <wei.liu2@citrix.com> wrote:

>> Concurrent access by Xen to the event queue must be protected by a
>> per-event queue spin lock.
>
> I presume "E[n]" in the pseudo code is "E[p]"?
>
> Is this spin lock really a good idea? How many threads / CPUs will
> spin on this lock? As [0] shows, contention on spin locks incurs a
> heavy performance penalty.
>
> [0] https://lwn.net/Articles/530458/

Given that the critical region is small, the extra cache line
contention for the spinlock is probably not a big deal. Even in the
current event-channel design, we would get cache ping-pong on the
event-channel bitmaps.

Consider that 10k interrupts to a CPU would be a heavy amount. That's
one every 100us. The event-channel delivery code described probably
runs in less than 1us, even if memory accesses are horrible cache
misses. The really highly contended case shouldn't happen.

 -- Keir
On 04/02/13 19:59, Keir Fraser wrote:
> On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:
>
>> Hi,
>>
>> Here is a design for a scalable event channel ABI for Xen. It has a
>> number of benefits over the design currently being proposed by Wei Liu.
>>
>> * More event channels (>100,000).
>> * 16 event priorities.
>> * Reduced memory requirements (only 1 additional page per domU).
>> * The use of FIFOs for events ensures fairness, whereas it is difficult
>>   to reason about the fairness with the current bitmap system.
>
> I have some sympathy for this design. Its primary downside compared
> with the 3-level alternative is its greater space cost (32*#vcpus).
> However, as you say, the fairness and prioritisation features are
> rather nice. Also, having the data structures be per vcpu may well
> help avoid cacheline contention on busy multi-vcpu guests.

This design originally (before I posted it) did have per-VCPU event
arrays but I changed it to per-domain to reduce the memory footprint.

A hybrid approach might be useful. Busy guests like dom0 or driver
domains could use per-VCPU event arrays but other guests could be
per-domain. This would be controlled by the toolstack.

> Interested in what others think, but I may actually prefer a ground-up
> redesign like this.

Since the ABI needs to be changed to support more event channels
anyway, it seems an ideal point to revisit the design.

David
On Tue, 2013-02-05 at 14:48 +0000, David Vrabel wrote:
> A hybrid approach might be useful. Busy guests like dom0 or driver
> domains could use per-VCPU event arrays but other guests could be
> per-domain. This would be controlled by the toolstack.
>
> > Interested in what others think, but I may actually prefer a
> > ground-up redesign like this.
>
> Since the ABI needs to be changed to support more event channels
> anyway, it seems an ideal point to revisit the design.

Right. I do care about better design and good implementation. Can we
build a prototype of this design? We are less than two months away
from the 4.3 feature freeze, and the event channel scalability work is
planned for that release, which means we need to hurry. :-)


Wei.

> David
On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:

>> I have some sympathy for this design. Its primary downside compared
>> with the 3-level alternative is its greater space cost (32*#vcpus).
>> However, as you say, the fairness and prioritisation features are
>> rather nice. Also, having the data structures be per vcpu may well
>> help avoid cacheline contention on busy multi-vcpu guests.
>
> This design originally (before I posted it) did have per-VCPU event
> arrays but I changed it to per-domain to reduce the memory footprint.

Okay, I wonder how much it actually matters anyhow...

Oh, by the way, you say the control block is 128 bytes and will easily
fit in the existing struct vcpu_info. That existing structure is 64
bytes in total. So how does that work then?

 -- Keir
On 05/02/13 15:49, Keir Fraser wrote:
> On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:
>
>> This design originally (before I posted it) did have per-VCPU event
>> arrays but I changed it to per-domain to reduce the memory footprint.
>
> Okay, I wonder how much it actually matters anyhow...
>
> Oh, by the way, you say the control block is 128 bytes and will easily
> fit in the existing struct vcpu_info. That existing structure is 64
> bytes in total. So how does that work then?

I meant struct vcpu_info can be extended without it growing to more
than a page, i.e., it fits into the guest page provided in the
VCPUOP_register_vcpu_info call so no additional pages need to be
globally mapped for the control block.

David
On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make them virtually contiguous
> for ease of implementation, e.g., by using vmap() in Xen or vmalloc()
> in Linux. Pages are added by the guest as required by the number of
> bound event channels.

Strictly speaking the size is related to the maximum bound evtchn,
which need not be the same as the number of bound evtchns.

> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits. Only state
> transitions that change a single bit are valid.

The diagram shows transitions B<->P and P<->L, which involve two bits
in each case. So do BM<->PM and LM<->BM, now I think about it.

Is the L bit redundant with the LINK field == 0 or != 0?

[...]

> ### `EVTCHNOP_init`
>
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.

Isn't the event array per-domain?

> A guest should call this during initial VCPU bring up (and on resume?).
>
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };

I think this will have a different layout on 32 and 64 bit x86, if you
care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).

[...]

> ### `EVTCHNOP_expand`
>
> This call expands the event array for a VCPU by appending an
> additional page.

This doesn't seem all that different to _init, except the former
handles the transition from 0->1 event pages and this handles N->N+1?

> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.

[...]

> ### `EVTCHNOP_set_priority`
>
> This call sets the priority for an event channel. The event must be
> unbound.
>
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest. Valid priorities are
> 0 - 15 and the default is 7.
>
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.

EBUSY?

> EINVAL      `priority` is outside the range 0 - 15.

[...]

> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page. Since event channels
> typically come in pairs,

I'm not sure this is the case, since evtchns are bidirectional (i.e.
either end can use it to signal the other) so one is often shared for
both rqst and rsp notifications.

> the upper bound on the total number of pages
> is $2 \times \text{number of VMs}$.
>
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.

[...]

> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
>
> With, for example, 16 driver domains each using the maximum number of
> pages:
> \begin{eqnarray*}
> V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>   & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}

This accounts for the driver domains and dom0 but not the domains which
they are serving, doesn't it?

> In summary, there is space to map the event arrays for over 100,000
> VMs. This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).

Is there scope to reduce the maximum then?

> ### Control Block
>
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
>
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.

There is? I can see 7 bytes of padding and 4 bytes of
evtchn_pending_sel which I suppose becomes redundant now.
I don't think it would be a problem to predicate use of this new
interface on the use of the VCPU_placement API and therefore give scope
to expand the vcpu_info.

[...]

> Raising an Event
> ----------------
>
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.

What are the conditions for actually performing the upcall when
returning from the hypervisor to the guest?

>     E[p].pending = 1
>     if not E[p].linked and not E[n].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p

Do you not need a barrier towards the end here to ensure that a
consumer who is currently processing interrupts sees the updated link
when they get there?

> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
>
> Consuming Events
> ----------------
>
> The guest consumes events starting at the head until it reaches the
> tail. Events in the queue that are not pending or are masked are
> consumed but not handled.
>
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()

What is this mb() needed for?

>             H = E[p].link
>         E[H].linked = 0

Did you mean E[p].linked here?

If at this point the interrupt is reraised then the "if" in the raising
pseudo code becomes true, linked will be set again and don't we also
race with the clearing of E[p].pending below? Is a barrier needed
around here due to Xen also reading E[p].pending?

>         if not E[p].masked
>             handle(p)
>
> handle() clears `E[p].pending`

What barriers are required for this and where? Is it done at the start
or the end of the handler? (early or late EOI)

> and EOIs level-triggered PIRQs.
>
> > Note: When the event queue contains a single event, removing the
> > event may race with Xen appending another event because the load of
> > `E[p].link` and the store of `H` is not atomic. To avoid this race,
> > the guest must recheck `E[p].link` if the list appeared empty.

It appears that both "Raising an Event" and "Consuming Events" can
write H? Is that correct? Likewise for the linked bit.

It's a bit unclear because the pseudo-code doesn't make it explicit
which variables are part of the shared data structure and which are
private to the local routine.

> Masking Events
> --------------
>
> Events are masked by setting the masked bit. If the event is pending
> and linked it does not need to be unlinked.
>
>     E[p].masked = 1
>
> Unmasking Events
> ----------------
>
> Events are unmasked by the guest by clearing the masked bit. If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
>
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)

Can the hypercall do the E[p].masked = 0?

> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
>
> > Note that after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.
On Tue, 2013-02-05 at 15:54 +0000, David Vrabel wrote:> On 05/02/13 15:49, Keir Fraser wrote: > > On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote: > > > >>> I have some sympathy for this design. It''s primary downside compared with > >>> the 3-level alternative is its greater space cost (32*#vcpus). However, as > >>> you say the fairness and prioritisation features are rather nice. Also > >>> having the data structures be per vcpu may well help avoid cacheline > >>> contention on busy multi-vcpu guests. > >> > >> This design originally (before I posted it) did have per-VCPU event > >> arrays but I changed it to per-domain to reduce the memory footprint. > > > > Okay, I wonder how much it actually matters anyhow... > > > > Oh by the way you say the control block is 128 bytes and will easily fit in > > the existing struct vcpu_info. That existing structure is 64 bytes in total. > > So how does that work then? > > I meant struct vcpu_info can be extended without it growing to more than > a page. i.e., it fits into the guest page provided in the > VCPUOP_register_vcpu_info call so no additional pages need to be > globally mapped for the control block.VCPUOP_register_vcpu_info doesn''t require each vcpu_info to be on a page by itself, even if that happens to be the Linux implementation today (I haven''t checked that). Ian.
On 05/02/2013 15:54, "David Vrabel" <david.vrabel@citrix.com> wrote:> On 05/02/13 15:49, Keir Fraser wrote: >> On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote: >> >>>> I have some sympathy for this design. It''s primary downside compared with >>>> the 3-level alternative is its greater space cost (32*#vcpus). However, as >>>> you say the fairness and prioritisation features are rather nice. Also >>>> having the data structures be per vcpu may well help avoid cacheline >>>> contention on busy multi-vcpu guests. >>> >>> This design originally (before I posted it) did have per-VCPU event >>> arrays but I changed it to per-domain to reduce the memory footprint. >> >> Okay, I wonder how much it actually matters anyhow... >> >> Oh by the way you say the control block is 128 bytes and will easily fit in >> the existing struct vcpu_info. That existing structure is 64 bytes in total. >> So how does that work then? > > I meant struct vcpu_info can be extended without it growing to more than > a page. i.e., it fits into the guest page provided in the > VCPUOP_register_vcpu_info call so no additional pages need to be > globally mapped for the control block.Oh, I see so any guest that uses the new event-channel interface will understand that vcpu_info is extended to contain it. Well, that makes sense. It''s not what your document says though. -- Keir> David
On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> Okay, I wonder how much it actually matters anyhow...
>>>
>>> Oh by the way you say the control block is 128 bytes and will easily fit in
>>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
>>> So how does that work then?
>>
>> I meant struct vcpu_info can be extended without it growing to more than
>> a page. i.e., it fits into the guest page provided in the
>> VCPUOP_register_vcpu_info call so no additional pages need to be
>> globally mapped for the control block.
>
> VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> by itself, even if that happens to be the Linux implementation today (I
> haven't checked that).

Having guest agree that vcpu_info grows by size of the per-vcpu control
block, if using this new event-channel interface, is reasonable though.

 -- Keir
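As a concrete illustration of the 128-byte figure being discussed, a
minimal sketch of what the per-VCPU control block could look like.  The
field names and grouping are assumptions; only the sizes (16 priorities,
two 32-bit indexes each) come from the design.

    #include <stdint.h>

    #define EVTCHN_FIFO_MAX_QUEUES 16           /* one event queue per priority */

    /* Hypothetical layout: a head/tail pair per priority level. */
    struct evtchn_fifo_control_block {
        struct {
            uint32_t head;                      /* index of first event in the queue */
            uint32_t tail;                      /* index of last event in the queue */
        } queue[EVTCHN_FIFO_MAX_QUEUES];
    };

    /* 16 * 2 * 4 bytes, matching the S = 128 bytes calculation in the design. */
    _Static_assert(sizeof(struct evtchn_fifo_control_block) == 128,
                   "control block should be 128 bytes");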
On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com> wrote:

> > Since the ABI needs to be changed to support more event channels anyway,
> > it seems an ideal point to revisit the design.
>
> Right. I do care about better design and good implementation. Can we
> build a prototype of this design? We are less than two months away from
> 4.3 feature freeze, and the event channel scalability is planned for
> that release, which means we need to be hurried. :-)

I think the general consensus is that scaling event channels is pretty
important -- probably important enough to slip the release a bit if
necessary.  (Although it would certainly be better not to.)

 -George
On 05/02/13 16:10, Ian Campbell wrote:> On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote: >> The pages within the event array need not be physically nor virtually >> contiguous, but the guest or Xen may make the virtually contiguous for >> ease of implementation. e.g., by using vmap() in Xen or vmalloc() in >> Linux. Pages are added by the guest as required by the number of >> bound event channels. > > Strictly speaking the size is related to the maximum bound evtchn, which > need not be the same as the number of bound evtchns.Yes.>> The state of an event is tracked using 3 bits within the event word. >> the M (masked), P (pending) and L (linked) bits. Only state >> transitions that change a single bit are valid. > > The diagram shows transitions B<->P and P<->L, which involve two bits in > each case. So do BM<->PM and LM<->BM now I think about it.B == 000b, P == 100b, L == 101b. Similarly for the masked transitions: BM == 010b, PM == 110b, LM == 111b.> Is the L bit redundant with the LINK field == 0 or != 0?LINK == 0 is the end of queue marker. We could use all ones to mean ''unlinked'' but using the L bit allows the guest to remove the event from the queue by clearing a single bit, instead of writing the LINK field.>> ### `EVTCHNOP_init` >> >> This call initializes a single VCPU''s event channel data structures, >> adding one page for the event array. > > Isn''t the event array per-domain?Er, yes. I changed this from per-VCPU and it looks like I didn''t make all the updates required.>> A guest should call this during initial VCPU bring up (and on resume?). >> >> struct evtchnop_init { >> uint32_t vcpu; >> uint64_t array_pfn; >> }; > > I think this will have a different layout on 32 and 64 bit x86, if you > care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).Ok. I thought uint64_t was always 8 byte aligned on both.>> Field Purpose >> ----- ------- >> `vcpu` [in] The VCPU number. >> `array_pfn` [in] The PFN or GMFN of a page to be used for the first page >> of the event array. >> >> Error code Reason >> ---------- ------ >> EINVAL `vcpu` is invalid or already initialized. >> EINVAL `array_pfn` is not a valid frame for the domain. >> ENOMEM Insufficient memory to allocate internal structures. >> >> ### `EVTCHNOP_expand` >> >> This call expands the event array for a VCPU by appended an additional >> page. > > This doesn''t seem all that different to _init, except the former handles > the transition from 0->1 event pages and this handles N->N+1?Agreed, I''ll fold the two together.>> ### `EVTCHNOP_set_priority` >> >> This call sets the priority for an event channel. The event must be >> unbound. >> >> A guest may call this prior to binding an event channel. The meaning >> and the use of the priority are up to the guest. Valid priorities are >> 0 - 15 and the default is 7. >> >> struct evtchnop_set_priority { >> uint32_t port; >> uint32_t priority; >> }; >> >> Field Purpose >> ----- ------- >> `port` [in] The event channel. >> `priority` [in] The priority for the event channel. >> >> Error code Reason >> ---------- ------ >> EINVAL `port` is invalid. >> EINVAL `port` is currently bound. > > EBUSY?Sure.>> Memory Usage >> ------------ >> >> Alternatively, we can consider a system with $D$ driver domains, each >> of which requires $E_D$ events, and a dom0 using the maximum number of >> pages (128). 
>> >> \begin{eqnarray*} >> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right) >> \end{eqnarray*} >> >> With, for example, 16 driver domains each using the maximum number of pages: >> \begin{eqnarray*} >> V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\ >> & = & 129 \times 10^3\text{ VMs} >> \end{eqnarray*} > > This accounts for the driver domains and dom0 but not the domains which > they are serving, doesn''t it?This is calculating the number of pages left once those used for dom0 and the driver domains are used. This is the same as the number of supportable VMs since the other VMs only require a page.>> In summary, there is space to map the event arrays for over 100,000 >> VMs. This is more than the limit imposed by the 16 bit domain ID >> (~32,000 VMs). > > Is there scope to reduce the maximum then?Maximum of what?>> ### Control Block >> >> With $L$ priority levels and two 32-bit words for the head and tail >> indexes, the amount of space ($S$) required in the `struct vcpu_info` >> for the control block is: >> \begin{eqnarray*} >> S & = & L \times 2 \times 4 \\ >> & = & 16 \times 2 \times 4 \\ >> & = & 128\text{ bytes} >> \end{eqnarray*} >> >> There is more than enough space within `struct vcpu_info` for the >> control block while still keeping plenty of space for future use. > > There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel > which I suppose becomes redundant now. > > I don''t think it would be a problem to predicate use of this new > interface on the use of the VCPU_placement API and therefore give scope > to expand the vcpu_info.I should have taken a proper look at where the vcpu_info comes from... Oops. We can add a new VCPUOP_register_vcpu_info_and_control_block which will add a struct vcpu_info and the control block.>> Low Level Design >> ===============>> >> In the pseudo code in this section, all memory accesses are atomic, >> including those to bit-fields within the event word. > >> These variables have a standard meaning: >> >> Variable Purpose >> -------- ------- >> E Event array. >> p A specific event. >> H The head index for a specific priority. >> T The tail index for a specific priority. >> >> These memory barriers are required: >> >> Function Purpose >> -------- ------- >> mb() Full (read/write) memory barrier >> >> Raising an Event >> ---------------- >> >> When Xen raises an event it marks it pending and (if it is not masked) >> adds it tail of event queue. > > What are the conditions for actually performing the upcall when > returning from the hypervisor to the guest?Any H != 0 for that VCPU. This means we should have a per-VCPU bitmap of non empty queues.>> E[p].pending = 1 >> if not E[p].linked and not E[n].masked >> E[p].linked = 1 >> E[p].link = 0 >> mb() >> if H == 0 >> H = p >> else >> E[T].link = p >> T = p > > Do you not need a barrier towards the end here to ensure that a consumer > who is currently processing interrupts sees the updated link when they > get there?I need to take another look at the barriers -- I''ll get back to you on this and the others you highlighted.>> Concurrent access by Xen to the event queue must be protected by a >> per-event queue spin lock. >> >> Consuming Events >> ---------------- >> >> The guests consumes events starting at the head until it reaches the >> tail. Events in the queue that are not pending or are masked are >> consumed but not handled. >> >> while H != 0 >> p = H >> H = E[p].link >> if H == 0 >> mb() >> H = E[p].link >> E[H].linked = 0 > > Did you mean E[p].linked here? 
> If at this point the interrupt is reraised then the if in the raising
> pseudo code becomes true, linked will be set again and don't we also race
> with the clearing of E[p].pending below?

The event will be correctly linked and then we clear P, when we consume
the linked event it will be ignored as P is already clear.

Perhaps we could also avoid linking the event if it is already pending?

>>> Note: When the event queue contains a single event, removing the
>>> event may race with Xen appending another event because the load of
>>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
>>> the guest must recheck `E[p].link` if the list appeared empty.
>
> It appears that both "Raising an Event" and "Consuming Events" can write
> H?  Is that correct?  Likewise for the linked bit.

Xen only writes H when the queue is empty, the VM only writes H if it is
non-empty.

The linked bit is set by Xen and cleared by the guest.

Do you see a problem with this?

> It's a bit unclear because the pseudo-code doesn't make it explicit
> which variables are part of the shared data structure and which are
> private to the local routine.

Capital variables are in the shared structures.  Lower-case are local.

>> Masking Events
>> --------------
>>
>> Events are masked by setting the masked bit.  If the event is pending
>> and linked it does not need to be unlinked.
>>
>>     E[p].masked = 1
>>
>> Unmasking Events
>> ----------------
>>
>> Events are unmasked by the guest by clearing the masked bit.  If the
>> event is pending the guest must call the event channel unmask
>> hypercall so Xen can link the event into the correct event queue.
>>
>>     E[p].masked = 0
>>     if E[p].pending
>>         hypercall(EVTCHN_unmask)
>
> Can the hypercall do the E[p].masked = 0?

Sure.

David
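As a way of pinning down who writes what, here is a small self-contained
model (ordinary user-space C, not hypervisor code) of the "Raising an
Event" pseudo code together with the per-queue lock and the "Xen only
writes H when the queue is empty" rule stated above.  The bit positions,
names and the pthread mutex are modelling assumptions; only the control
flow comes from the draft.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    #define EVTCHN_PENDING  (1u << 31)          /* assumed position of P */
    #define EVTCHN_MASKED   (1u << 30)          /* assumed position of M */
    #define EVTCHN_LINKED   (1u << 29)          /* assumed position of L */
    #define LINK_MASK       0x0001ffffu         /* 17-bit LINK field */

    static _Atomic uint32_t event_array[1024];  /* E: one word per event */

    struct queue {
        uint32_t head, tail;                    /* H and T; H == 0 means empty */
        pthread_mutex_t lock;                   /* models Xen's per-queue spin lock */
    };

    /* Append event p to queue q; caller holds q->lock and has checked L and M. */
    static void link_event(struct queue *q, uint32_t p)
    {
        uint32_t w = atomic_load(&event_array[p]);

        /* E[p].linked = 1; E[p].link = 0 */
        atomic_store(&event_array[p], (w & ~LINK_MASK) | EVTCHN_LINKED);
        atomic_thread_fence(memory_order_seq_cst);      /* the mb() */

        if (q->head == 0) {
            q->head = p;            /* Xen only writes H when the queue was empty */
        } else {
            uint32_t t = atomic_load(&event_array[q->tail]);
            atomic_store(&event_array[q->tail],
                         (t & ~LINK_MASK) | (p & LINK_MASK));   /* E[T].link = p */
        }
        q->tail = p;
    }

    static void raise_event(struct queue *q, uint32_t p)
    {
        atomic_fetch_or(&event_array[p], EVTCHN_PENDING);       /* E[p].pending = 1 */

        pthread_mutex_lock(&q->lock);
        uint32_t w = atomic_load(&event_array[p]);
        if (!(w & EVTCHN_LINKED) && !(w & EVTCHN_MASKED))
            link_event(q, p);
        pthread_mutex_unlock(&q->lock);
    }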
On 04/02/13 21:07, Wei Liu wrote:> > Where is the "priority" information stored? I think it should be > somewhere inside Xen, right? I presume a field in struct evtchn?I think so. I''ve not really thought too much about the internal design.>> ### `EVTCHNOP_init` >> >> This call initializes a single VCPU''s event channel data structures, >> adding one page for the event array. >> > > I think the registration for new event channels should always be done in > a batch. If not then you need to provide another OP to rollback previous > registered ones.Hmmm. That''s an interesting point. I''d be inclined to have the guest take the VCPU offline if it cannot initialize it fully.>> Each page of the event array has space for 1024 events ($E_P$) so a >> regular domU will only require a single page. Since event channels >> typically come in pairs, the upper bound on the total number of pages > ^^^^^ > upper or lower?I meant upper, but I am playing fast-and-loose with the maths here since I was aiming for an estimate rather than a real upper bound.>> is $2 \times\text{number of VMs}$. >> >> If the guests are further restricted in the number of event channels >> ($E_V$) then this upper bound can be reduced further. >> > > Can this bound really be reduced? Can you map memory on non-page > granularity?The reasoning here is that event channels are in pairs (or rather they have two ends). One event is the domU, the other in dom0. The events in dom0 are packed into pages, whereas the domU events use 1 page no matter how few events there are. I wasn''t entirely happy with this way of doing the estimate which is why I did the second method, which gave a similar figure.>> The number of VMs ($V$) with a limit of $P$ total event array pages is: >> \begin{equation*} >> V = P \div \left(1 + \frac{E_V}{E_P}\right)^ ^^^^^^^^^ The page in domU The fraction of a page in dom0.>> Raising an Event >> ---------------- >> >> When Xen raises an event it marks it pending and (if it is not masked) >> adds it tail of event queue. >> >> E[p].pending = 1 >> if not E[p].linked and not E[n].masked >> E[p].linked = 1 >> E[p].link = 0 >> mb() >> if H == 0 >> H = p >> else >> E[T].link = p >> T = p >> >> Concurrent access by Xen to the event queue must be protected by a >> per-event queue spin lock. >> > > I presume "E[n]" in the pseudo code is "E[p]"?Yes.> Is this spin lock really a good idea? How many threads / cpus will spin > on this lock? As [0] shows, contention on spin lock incurs heavy > performance penalty.In addition to Keir''s comment, the spinlock itself won''t reside in the same cache line as the control block or event array so this will reduce cache line bouncing.> [0] https://lwn.net/Articles/530458/ > >> Consuming Events >> ---------------- >> >> The guests consumes events starting at the head until it reaches the >> tail. Events in the queue that are not pending or are masked are >> consumed but not handled. >> >> while H != 0 >> p = H >> H = E[p].link >> if H == 0 >> mb() >> H = E[p].link >> E[H].linked = 0 >> if not E[p].masked >> handle(p) >> >> handle() clears `E[p].pending` and EOIs level-triggered PIRQs. >> > > How to synchronize access to the array and control blocks between Xen > and guest? I''m afraid I have no knowledge of a xen-guest spin lock...It''s a lockless data structure on the consumer side (provided there is only one consumer).>> Unmasking Events >> ---------------- >> >> Events are unmasked by the guest by clearing the masked bit. 
>> If the event is pending the guest must call the event channel unmask
>> hypercall so Xen can link the event into the correct event queue.
>>
>>     E[p].masked = 0
>>     if E[p].pending
>>         hypercall(EVTCHN_unmask)
>>
>> The expectation here is that unmasking a pending event will be rare,
>> so the performance hit of the hypercall is minimal.
>
> Currently unmasking a "remote" port requires issuing a hypercall as well, so
> if unmasking is not very frequent, this is not a big problem.
>
> But please take some interrupt-intensive workloads into consideration.
> For example, a 1G NIC (e1000) even with NAPI enabled can generate 27k+
> intrs per second under high packet load [1], a 10G NIC can surely generate
> more.  Can you give an estimation of the performance hit on the context
> switch?

I'm not sure how I would give an estimate, this is something that would
need to be measured I think.

Also, whilst the upcall itself might be reentrant, the processing of
each queue cannot be, so the mask/unmask done by the irq_chip callbacks
isn't needed.  mask/unmask is then only needed occasionally (e.g., for
irq migration) and thus isn't so performance critical.

>>> Note: that after clearing the mask bit, the event may be raised and
>>> thus it may already be linked by the time the hypercall is done.
>
> Even if the event has already been linked before you finish the
> hypercall, you would still need to get hold of a lock to serialize
> access to the event structure for checking, right?  Or is a test_bit on
> the linked field sufficient?  I think you need to write some pseudo code
> for this as well.

Xen will need to take the per-queue lock for this, yes.

David
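Following on from the per-queue lock point, a sketch of what the Xen
side of the unmask hypercall could do for the new ABI, reusing the
modelling definitions (event_array, struct queue, link_event()) from the
raise_event() sketch earlier in the thread.  The relink-if-pending rule
is from the draft; everything else is an assumption.

    static void unmask_event(struct queue *q, uint32_t p)
    {
        pthread_mutex_lock(&q->lock);
        uint32_t w = atomic_load(&event_array[p]);

        /* The guest has already cleared M; relink the event if it was
         * raised while masked and is not already on a queue. */
        if ((w & EVTCHN_PENDING) && !(w & EVTCHN_MASKED) && !(w & EVTCHN_LINKED))
            link_event(q, p);
        pthread_mutex_unlock(&q->lock);
    }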
On 05/02/13 18:05, George Dunlap wrote:

> On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com
> <mailto:wei.liu2@citrix.com>> wrote:
>
> > > Since the ABI needs to be changed to support more event channels
> > > anyway, it seems an ideal point to revisit the design.
> >
> > Right. I do care about better design and good implementation. Can we
> > build a prototype of this design? We are less than two months away from
> > 4.3 feature freeze, and the event channel scalability is planned for
> > that release, which means we need to be hurried. :-)

Two months doesn't seem possible even if I could work on this full time.

> I think the general consensus is that scaling event channels is pretty
> important -- probably important enough to slip the release a bit if
> necessary.  (Although it would certainly be better not to.)

What to do here is a non-trivial decision.  Possible options include:

1. Merge the 3-level ABI for 4.3.  Work on the FIFO-based ABI in
   parallel, aiming to get this in for 4.4.  This means maintaining 3
   event channel ABIs in Xen.

2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in.

3. Status quo.  Defer extending the event channel ABI to 4.4.

The preferable option may be to:

4. Get the 3-level ABI to a mergable state.  In parallel develop a
   prototype of the FIFO-based ABI.  When the prototype is ready or the
   4.3 freeze is here, evaluate it and make a decision then.

David
On Tue, 2013-02-05 at 18:57 +0000, David Vrabel wrote:> On 05/02/13 18:05, George Dunlap wrote: > > On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com > > <mailto:wei.liu2@citrix.com>> wrote: > > > > > Since the ABI needs to be changed to support more event channels > > anyway, > > > it seems an ideal point to revisit the design. > > > > > > > Right. I do care about better design and good implementation. Can we > > build a prototype of this design? We are less than two months away from > > 4.3 feature freeze, and the event channel scalability is planned for > > that release, which means we need to be hurried. :-) > > Two months doesn''t seem possible even if I could work on this full time. > > > I think the general consensus is that scaling event channels is pretty > > important -- probably important enough to slip the release a bit if > > necessary. (Although it would certainly be better not to.) > > What to do here is a non-trivial decision. Possible options include: > > 1. Merge the 3-level ABI for 4.3. Work on the FIFO-based ABI in > parallel, aiming to get this in for 4.4. This means maintaining 3 event > channel ABIs in Xen. > > 2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in. > 3. Status quo. Defer extending the event channel ABI to 4.4. > > The preferable option may be to: > > 4. Get the 3-level ABI to a mergable state. In parallel develop a > prototype of the FIFO-based ABI. When the prototype is ready or the 4.3 > freeze is here, evaluate it and make a decision then. >Actually this is what I''m thinking now. Wei.> David
On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:

> ### `EVTCHNOP_set_priority`
>
> This call sets the priority for an event channel.  The event must be
> unbound.

I suppose this restriction is because it is hard to change the priority
of a pending interrupt from either Xen or the guest in a lock-free
manner?

The problem is that while on one end you call EVTCHNOP_alloc_unbound and
can then change the priority, on the other you call
EVTCHNOP_bind_interdomain passing (remote-domid, remote-evtchn) and
receive a local-evtchn which is already bound, which means you never get
the opportunity to set the priority.  Likewise when binding virqs and
such you never get to see the unbound evtchn.

I don't think we want to go around adding priority to all of those
existing binding hypercalls.

Perhaps it would be acceptable to say that after having called this
hypercall the domain may still observe at most one upcall of the evtchn
with the old priority, which corresponds with the interrupt being raised
right before the priority change takes effect, but being delivered
after.

> A guest may call this prior to binding an event channel.  The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
>
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
>
> Field        Purpose
> -----        -------
> `port`       [in] The event channel.
> `priority`   [in] The priority for the event channel.
>
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
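To illustrate the point about which end can actually set the priority:
only the end that allocates the unbound port ever sees it unbound, so
that is where the priority would be set before the port is advertised.
EVTCHNOP_alloc_unbound and HYPERVISOR_event_channel_op are the existing
interfaces; EVTCHNOP_set_priority and struct evtchnop_set_priority are
the draft sub-op quoted above; the error handling and the chosen
priority value are assumptions.

    static int alloc_port_with_priority(domid_t remote_domid, evtchn_port_t *port)
    {
        struct evtchn_alloc_unbound alloc = {
            .dom        = DOMID_SELF,
            .remote_dom = remote_domid,
        };
        struct evtchnop_set_priority prio;
        int rc;

        rc = HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, &alloc);
        if (rc)
            return rc;

        /* The port is still unbound here, so setting its priority is allowed. */
        prio.port     = alloc.port;
        prio.priority = 3;                  /* non-default; the default is 7 */
        rc = HYPERVISOR_event_channel_op(EVTCHNOP_set_priority, &prio);
        if (rc)
            return rc;

        /* The other end binds with EVTCHNOP_bind_interdomain and, as noted
         * above, never sees the port unbound. */
        *port = alloc.port;
        return 0;
    }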
On Tue, 2013-02-05 at 18:18 +0000, David Vrabel wrote:> On 05/02/13 16:10, Ian Campbell wrote: > > On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote: > >> The pages within the event array need not be physically nor virtually > >> contiguous, but the guest or Xen may make the virtually contiguous for > >> ease of implementation. e.g., by using vmap() in Xen or vmalloc() in > >> Linux. Pages are added by the guest as required by the number of > >> bound event channels. > > > > Strictly speaking the size is related to the maximum bound evtchn, which > > need not be the same as the number of bound evtchns. > > Yes. > > >> The state of an event is tracked using 3 bits within the event word. > >> the M (masked), P (pending) and L (linked) bits. Only state > >> transitions that change a single bit are valid. > > > > The diagram shows transitions B<->P and P<->L, which involve two bits in > > each case. So do BM<->PM and LM<->BM now I think about it. > > B == 000b, P == 100b, L == 101b. > > Similarly for the masked transitions: > > BM == 010b, PM == 110b, LM == 111b.Ah, it wasn''t clear that the states in the diagram didn''t relate to the bits directly, reusing the letters in that way is a bit confusing.> > > Is the L bit redundant with the LINK field == 0 or != 0? > > LINK == 0 is the end of queue marker. We could use all ones to mean > ''unlinked'' but using the L bit allows the guest to remove the event from > the queue by clearing a single bit, instead of writing the LINK field.So you can use test/clear-bit instead of cmpxchg?> >> A guest should call this during initial VCPU bring up (and on resume?). > >> > >> struct evtchnop_init { > >> uint32_t vcpu; > >> uint64_t array_pfn; > >> }; > > > > I think this will have a different layout on 32 and 64 bit x86, if you > > care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit). > > Ok. I thought uint64_t was always 8 byte aligned on both.Sadly not, it''s the cause of a fair few 32- vs 64-bit ABI differences :-(> >> Memory Usage > >> ------------ > >> > >> Alternatively, we can consider a system with $D$ driver domains, each > >> of which requires $E_D$ events, and a dom0 using the maximum number of > >> pages (128). > >> > >> \begin{eqnarray*} > >> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right) > >> \end{eqnarray*} > >> > >> With, for example, 16 driver domains each using the maximum number of pages: > >> \begin{eqnarray*} > >> V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\ > >> & = & 129 \times 10^3\text{ VMs} > >> \end{eqnarray*} > > > > This accounts for the driver domains and dom0 but not the domains which > > they are serving, doesn''t it? > > This is calculating the number of pages left once those used for dom0 > and the driver domains are used. This is the same as the number of > supportable VMs since the other VMs only require a page.Ah, makes sense.> >> In summary, there is space to map the event arrays for over 100,000 > >> VMs. This is more than the limit imposed by the 16 bit domain ID > >> (~32,000 VMs). > > > > Is there scope to reduce the maximum then? 
> > Maximum of what?I suppose the maximum number of pages a domain can use for evtchn queues.> > >> ### Control Block > >> > >> With $L$ priority levels and two 32-bit words for the head and tail > >> indexes, the amount of space ($S$) required in the `struct vcpu_info` > >> for the control block is: > >> \begin{eqnarray*} > >> S & = & L \times 2 \times 4 \\ > >> & = & 16 \times 2 \times 4 \\ > >> & = & 128\text{ bytes} > >> \end{eqnarray*} > >> > >> There is more than enough space within `struct vcpu_info` for the > >> control block while still keeping plenty of space for future use. > > > > There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel > > which I suppose becomes redundant now. > > > > I don''t think it would be a problem to predicate use of this new > > interface on the use of the VCPU_placement API and therefore give scope > > to expand the vcpu_info. > > I should have taken a proper look at where the vcpu_info comes from... > Oops. > > We can add a new VCPUOP_register_vcpu_info_and_control_block which will > add a struct vcpu_info and the control block.Better to either say that using the new ABI requires the guest to have reserved space after vcpu_info or to have a separate call for the control block (which could have the constraint that it is on the same page as vcpu_nfo)> >> Low Level Design > >> ===============> >> > >> In the pseudo code in this section, all memory accesses are atomic, > >> including those to bit-fields within the event word. > > > >> These variables have a standard meaning: > >> > >> Variable Purpose > >> -------- ------- > >> E Event array. > >> p A specific event. > >> H The head index for a specific priority. > >> T The tail index for a specific priority. > >> > >> These memory barriers are required: > >> > >> Function Purpose > >> -------- ------- > >> mb() Full (read/write) memory barrier > >> > >> Raising an Event > >> ---------------- > >> > >> When Xen raises an event it marks it pending and (if it is not masked) > >> adds it tail of event queue. > > > > What are the conditions for actually performing the upcall when > > returning from the hypervisor to the guest? > > Any H != 0 for that VCPU. This means we should have a per-VCPU bitmap > of non empty queues.You need to be careful here since if two evtchn''s are raised near simultaneously then you will return to the guest with H == +2, and do an upcall. If in the course of handling the first interrupt you make a hypercall you will be returning with H == +1, but you won''t want to reinject. Which makes me wonder about handling of IRQ priority, is the intention that if a higher priority interrupt occurs while you are processing an existing upcall then it will be interrupted for the new higher priority one? In which case we need to have an IRQL concept which allows the domain to re-enable interrupts while masking interrupts at the current or lower priority. Or perhaps it is acceptable to say that a domain will always finish handling the current IRQ before it takes the higher priority one. Should we have a bitmask of pending priorities so the guest doesn''t have to check all the queues? It''s only 15/16 queues but since most evtchns probably remain at level 7 that''s still a bit wasteful, especially if you need to restart the scan after every interrupt to implement prioritisation.> >> Concurrent access by Xen to the event queue must be protected by a > >> per-event queue spin lock. 
> >> Consuming Events
> >> ----------------
> >>
> >> The guests consumes events starting at the head until it reaches the
> >> tail.  Events in the queue that are not pending or are masked are
> >> consumed but not handled.
> >>
> >>     while H != 0
> >>         p = H
> >>         H = E[p].link
> >>         if H == 0
> >>             mb()
> >>             H = E[p].link
> >>         E[H].linked = 0
> >
> > Did you mean E[p].linked here?
> >
> > If at this point the interrupt is reraised then the if in the raising
> > pseudo code becomes true, linked will be set again and don't we also race
> > with the clearing of E[p].pending below?
>
> The event will be correctly linked and then we clear P,

By P you mean E[p].pending in the pseudo-code terminology?

> when we consume the linked event it will be ignored as P is already clear.

So we will miss the second interrupt?

> Perhaps we could also avoid linking the event if it is already pending?
>
> >>> Note: When the event queue contains a single event, removing the
> >>> event may race with Xen appending another event because the load of
> >>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> >>> the guest must recheck `E[p].link` if the list appeared empty.
> >
> > It appears that both "Raising an Event" and "Consuming Events" can write
> > H?  Is that correct?  Likewise for the linked bit.
>
> Xen only writes H when the queue is empty, the VM only writes H if it is
> non-empty.
>
> The linked bit is set by Xen and cleared by the guest.
>
> Do you see a problem with this?

You'd need to be careful about the barriers and think hard about the
case where the VM is emptying the queue at the same time as Xen is
injecting a new interrupt and therefore unemptying it.  And vice-versa.

> > It's a bit unclear because the pseudo-code doesn't make it explicit
> > which variables are part of the shared data structure and which are
> > private to the local routine.
>
> Capital variables are in the shared structures.  Lower-case are local.
>
> >> Masking Events
> >> --------------
> >>
> >> Events are masked by setting the masked bit.  If the event is pending
> >> and linked it does not need to be unlinked.
> >>
> >>     E[p].masked = 1
> >>
> >> Unmasking Events
> >> ----------------
> >>
> >> Events are unmasked by the guest by clearing the masked bit.  If the
> >> event is pending the guest must call the event channel unmask
> >> hypercall so Xen can link the event into the correct event queue.
> >>
> >>     E[p].masked = 0
> >>     if E[p].pending
> >>         hypercall(EVTCHN_unmask)
> >
> > Can the hypercall do the E[p].masked = 0?
>
> Sure.

Perhaps I should have said "Should..." :-)

Ian.
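To make the consuming side and the single-event recheck concrete, here
is a guest-side sketch using the same modelling definitions as the
raise_event() sketch earlier in the thread.  handle_port() is a
placeholder, the place where P is cleared is just one possible choice,
and the exact barrier placement is still the open question being
discussed above.

    static void handle_port(uint32_t p);        /* guest-specific, assumed */

    static void consume_events(struct queue *q)
    {
        while (q->head != 0) {
            uint32_t p = q->head;
            uint32_t h = atomic_load(&event_array[p]) & LINK_MASK;  /* H = E[p].link */

            if (h == 0) {
                /* Queue looked empty: Xen may be appending concurrently,
                 * so re-read the link after a full barrier. */
                atomic_thread_fence(memory_order_seq_cst);
                h = atomic_load(&event_array[p]) & LINK_MASK;
            }
            q->head = h;

            /* Clear the linked bit of the consumed event (the draft writes
             * E[H].linked = 0; the thread above asks whether that should
             * be E[p]). */
            atomic_fetch_and(&event_array[p], ~EVTCHN_LINKED);

            if (!(atomic_load(&event_array[p]) & EVTCHN_MASKED)) {
                /* handle() clears E[p].pending; done here before the
                 * handler as one possible (early) placement. */
                atomic_fetch_and(&event_array[p], ~EVTCHN_PENDING);
                handle_port(p);
            }
        }
    }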
On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:> On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote: > > >>> Okay, I wonder how much it actually matters anyhow... > >>> > >>> Oh by the way you say the control block is 128 bytes and will easily fit in > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total. > >>> So how does that work then? > >> > >> I meant struct vcpu_info can be extended without it growing to more than > >> a page. i.e., it fits into the guest page provided in the > >> VCPUOP_register_vcpu_info call so no additional pages need to be > >> globally mapped for the control block. > > > > VCPUOP_register_vcpu_info doesn''t require each vcpu_info to be on a page > > by itself, even if that happens to be the Linux implementation today (I > > haven''t checked that). > > Having guest agree that vcpu_info grows by size of the per-vcpu control > block, if using this new event-channel interface, is reasonable though.Can only use this trick once though, so it might be blocking ourselves into a future ABI corner. Is there a downside to registering the control block separately? The guest can always arrange for them to be contiguous if it wants, or if we are worried about the number of global mappings then the hypervisor could require it shares a page with the vcpu_info but allow the offset to be specified separately. Ian.
On 06/02/2013 09:38, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:>>> VCPUOP_register_vcpu_info doesn''t require each vcpu_info to be on a page >>> by itself, even if that happens to be the Linux implementation today (I >>> haven''t checked that). >> >> Having guest agree that vcpu_info grows by size of the per-vcpu control >> block, if using this new event-channel interface, is reasonable though. > > Can only use this trick once though, so it might be blocking ourselves > into a future ABI corner. > > Is there a downside to registering the control block separately? The > guest can always arrange for them to be contiguous if it wants, or if we > are worried about the number of global mappings then the hypervisor > could require it shares a page with the vcpu_info but allow the offset > to be specified separately.I would say we consider vcpu_info to be versioned, and that using the new event-channel interface requires use of at least v2 of the vcpu_info structure. There is a rsvd field in register_vcpu_info hypercall that could be used for specifying such a thing, although sadly it is not currently checked to be zero, so we may not actually be able to make use of those bits. -- Keir
On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote:> On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote: > > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote: > > > > >>> Okay, I wonder how much it actually matters anyhow... > > >>> > > >>> Oh by the way you say the control block is 128 bytes and will easily fit in > > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total. > > >>> So how does that work then? > > >> > > >> I meant struct vcpu_info can be extended without it growing to more than > > >> a page. i.e., it fits into the guest page provided in the > > >> VCPUOP_register_vcpu_info call so no additional pages need to be > > >> globally mapped for the control block. > > > > > > VCPUOP_register_vcpu_info doesn''t require each vcpu_info to be on a page > > > by itself, even if that happens to be the Linux implementation today (I > > > haven''t checked that). > > > > Having guest agree that vcpu_info grows by size of the per-vcpu control > > block, if using this new event-channel interface, is reasonable though. > > Can only use this trick once though, so it might be blocking ourselves > into a future ABI corner. >In practice embedding control block in vcpu_info might not be feasible because there is a legacy array of vcpu_info in shared_info page. It is quite easy to bloat shared_info to exceed size limit.> Is there a downside to registering the control block separately? The > guest can always arrange for them to be contiguous if it wants, or if we > are worried about the number of global mappings then the hypervisor > could require it shares a page with the vcpu_info but allow the offset > to be specified separately. >IMHO the global mapping space is the main concern. Regarding sharing page with vcpu_info, this requires us to control the way kernel handles its per-cpu section. But how? Wei.> Ian. >
On Wed, 2013-02-06 at 10:42 +0000, Wei Liu wrote:

> On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote:
> > On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:
> > > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> > >
> > > >>> Okay, I wonder how much it actually matters anyhow...
> > > >>>
> > > >>> Oh by the way you say the control block is 128 bytes and will easily fit in
> > > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> > > >>> So how does that work then?
> > > >>
> > > >> I meant struct vcpu_info can be extended without it growing to more than
> > > >> a page. i.e., it fits into the guest page provided in the
> > > >> VCPUOP_register_vcpu_info call so no additional pages need to be
> > > >> globally mapped for the control block.
> > > >
> > > > VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> > > > by itself, even if that happens to be the Linux implementation today (I
> > > > haven't checked that).
> > >
> > > Having guest agree that vcpu_info grows by size of the per-vcpu control
> > > block, if using this new event-channel interface, is reasonable though.
> >
> > Can only use this trick once though, so it might be blocking ourselves
> > into a future ABI corner.
>
> In practice embedding control block in vcpu_info might not be feasible
> because there is a legacy array of vcpu_info in shared_info page. It is
> quite easy to bloat shared_info to exceed size limit.

I don't think we need to literally add the control block to struct
vcpu_info, just require that it immediately follows the vpcu_info.  This
new interface requires the use of vcpu_info placement so the legacy
array is not a concern.

> > Is there a downside to registering the control block separately? The
> > guest can always arrange for them to be contiguous if it wants, or if we
> > are worried about the number of global mappings then the hypervisor
> > could require it shares a page with the vcpu_info but allow the offset
> > to be specified separately.
>
> IMHO the global mapping space is the main concern. Regarding sharing
> page with vcpu_info, this requires us to control the way kernel handles
> its per-cpu section. But how?

We simply require that the kernel do it right...

In this instance you would probably do:

    struct per_vcpu_info {
        struct vcpu_info ...;
        struct evtchn_control ...;
    };

and allocate that in per-cpu space.  Or is there concern that this
allocation might cross a page boundary?  I think it is required to be
naturally aligned (i.e. aligned to at least its own size) so that is ok.

Ian.
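A slightly fuller sketch of this placement, in Linux-guest terms, under
the following assumptions: struct evtchn_fifo_control_block is the
hypothetical 128-byte layout sketched earlier in the thread, the usual
Xen guest headers and arbitrary_virt_to_mfn() are available, and whether
the control block is registered implicitly (by following vcpu_info) or
by a separate call is exactly the open question in this sub-thread.

    #include <linux/percpu.h>
    #include <asm/xen/hypercall.h>
    #include <xen/interface/xen.h>
    #include <xen/interface/vcpu.h>

    /* Power-of-two alignment >= sizeof() means it cannot straddle a page. */
    struct per_vcpu_info {
        struct vcpu_info vcpu_info;
        struct evtchn_fifo_control_block control_block;   /* 128 bytes, sketched earlier */
    } __attribute__((__aligned__(256)));

    static DEFINE_PER_CPU(struct per_vcpu_info, xen_per_vcpu_info);

    static int register_per_vcpu_info(unsigned int cpu)
    {
        struct per_vcpu_info *pvi = &per_cpu(xen_per_vcpu_info, cpu);
        struct vcpu_register_vcpu_info info = {
            .mfn    = arbitrary_virt_to_mfn(pvi),
            .offset = offset_in_page(pvi),
        };

        /* Registers the vcpu_info; under the scheme above the control
         * block is simply the bytes immediately following it. */
        return HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);
    }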
On Wed, 2013-02-06 at 10:52 +0000, Ian Campbell wrote:> On Wed, 2013-02-06 at 10:42 +0000, Wei Liu wrote: > > On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote: > > > On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote: > > > > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote: > > > > > > > > >>> Okay, I wonder how much it actually matters anyhow... > > > > >>> > > > > >>> Oh by the way you say the control block is 128 bytes and will easily fit in > > > > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total. > > > > >>> So how does that work then? > > > > >> > > > > >> I meant struct vcpu_info can be extended without it growing to more than > > > > >> a page. i.e., it fits into the guest page provided in the > > > > >> VCPUOP_register_vcpu_info call so no additional pages need to be > > > > >> globally mapped for the control block. > > > > > > > > > > VCPUOP_register_vcpu_info doesn''t require each vcpu_info to be on a page > > > > > by itself, even if that happens to be the Linux implementation today (I > > > > > haven''t checked that). > > > > > > > > Having guest agree that vcpu_info grows by size of the per-vcpu control > > > > block, if using this new event-channel interface, is reasonable though. > > > > > > Can only use this trick once though, so it might be blocking ourselves > > > into a future ABI corner. > > > > > > > In practice embedding control block in vcpu_info might not be feasible > > because there is a legacy array of vcpu_info in shared_info page. It is > > quite easy to bloat shared_info to exceed size limit. > > I don''t think we need to literally add the control block to struct > vcpu_info, just require that it immediately follows the vpcu_info. This > new interface requires the use of vcpu_info placement so the legacy > array is not a concern. > > > > Is there a downside to registering the control block separately? The > > > guest can always arrange for them to be contiguous if it wants, or if we > > > are worried about the number of global mappings then the hypervisor > > > could require it shares a page with the vcpu_info but allow the offset > > > to be specified separately. > > > > > > > IMHO the global mapping space is the main concern. Regarding sharing > > page with vcpu_info, this requires us to control the way kernel handles > > its per-cpu section. But how? > > We simply require that the kernel do it right... > > In this instance you would probably do: > struct per_vcpu_info { > struct vcpu_info ...; > struct evtchn_control ...; > } > and allocate that in per-cpu space. Or is there concern that this > allocation might cross a page boundary? I think it is required to be > naturally aligned (i.e. aligned to at least its own size) so that is ok. >I get what you mean. ;-) My other concern is, along this path we can save global mapping address space, but now the burden of enabling the new interface somewhat slip to the vcpu placement hypercall other than HYPERVISOR_event_channel_op. If you''re fine with this I will be OK too. Wei.> Ian. >
On 05/02/13 18:57, David Vrabel wrote:> What to do here is a non-trivial decision. Possible options include: > > 1. Merge the 3-level ABI for 4.3. Work on the FIFO-based ABI in > parallel, aiming to get this in for 4.4. This means maintaining 3 event > channel ABIs in Xen. > > 2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in. > > 3. Status quo. Defer extending the event channel ABI to 4.4. > > The preferable option may be to: > > 4. Get the 3-level ABI to a mergable state. In parallel develop a > prototype of the FIFO-based ABI. When the prototype is ready or the 4.3 > freeze is here, evaluate it and make a decision then.Just to clarify, the difference between #1 and #4 is that in #4 we hold off on the merge, to delay committing to a specific course of action until later? That seems at first blush to be a pretty safe option. But I think it''s worth pointing out that in practice the end result is likely to be that we just go with #1 eventually anyway: if the FIFO ABI can''t be finished in 4 months giving it all our effort, can we really expect it to be finished in any less time while polishing up the 3-level ABI? I was going to say, "There''s no particular reason to merge the 3-level ABI sooner rather than later", but of course there is: it allows considerably longer and wider testing. No conclusion here, just adding to the mix of things to consider. :-) -George
>>> On 04.02.13 at 20:59, Keir Fraser <keir.xen@gmail.com> wrote: > On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote: >> Here is a design for a scalable event channel ABI for Xen. It has a >> number of benefits over the design currently being proposed by Wei Lui. >> >> * More event channels (>100,000). >> * 16 event priorities. >> * Reduced memory requirements (only 1 additional page per domU). >> * The use of FIFOs for events ensures fairness, whereas it is difficult >> to reason about the fairness with the current bitmap system. > > I have some sympathy for this design. It''s primary downside compared with > the 3-level alternative is its greater space cost (32*#vcpus). However, as > you say the fairness and prioritisation features are rather nice. Also > having the data structures be per vcpu may well help avoid cacheline > contention on busy multi-vcpu guests. > > Interested in what others think, but I may actually prefer a ground-up > redesign like this.So do I, fwiw. Jan
On 06/02/2013 11:32, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:>> 4. Get the 3-level ABI to a mergable state. In parallel develop a >> prototype of the FIFO-based ABI. When the prototype is ready or the 4.3 >> freeze is here, evaluate it and make a decision then. > > Just to clarify, the difference between #1 and #4 is that in #4 we hold > off on the merge, to delay committing to a specific course of action > until later? > > That seems at first blush to be a pretty safe option. But I think it''s > worth pointing out that in practice the end result is likely to be that > we just go with #1 eventually anyway: if the FIFO ABI can''t be finished > in 4 months giving it all our effort, can we really expect it to be > finished in any less time while polishing up the 3-level ABI? > > I was going to say, "There''s no particular reason to merge the 3-level > ABI sooner rather than later", but of course there is: it allows > considerably longer and wider testing. > > No conclusion here, just adding to the mix of things to consider. :-)How many man-weeks do we think David''s design would take to get to draft implementation? I mean honestly I would have thought that a straight two-week run at it would be a reasonable estimate -- the places it plugs in in hypervisor and guest kernel are pretty clean and well defined. This depends on a man having the weeks to spend on it of course! -- Keir
On 06/02/13 13:53, Keir Fraser wrote:

> On 06/02/2013 11:32, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>>> 4. Get the 3-level ABI to a mergable state.  In parallel develop a
>>> prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
>>> freeze is here, evaluate it and make a decision then.
>>
>> Just to clarify, the difference between #1 and #4 is that in #4 we hold
>> off on the merge, to delay committing to a specific course of action
>> until later?
>>
>> That seems at first blush to be a pretty safe option.  But I think it's
>> worth pointing out that in practice the end result is likely to be that
>> we just go with #1 eventually anyway: if the FIFO ABI can't be finished
>> in 4 months giving it all our effort, can we really expect it to be
>> finished in any less time while polishing up the 3-level ABI?
>>
>> I was going to say, "There's no particular reason to merge the 3-level
>> ABI sooner rather than later", but of course there is: it allows
>> considerably longer and wider testing.
>>
>> No conclusion here, just adding to the mix of things to consider. :-)
>
> How many man-weeks do we think David's design would take to get to draft
> implementation? I mean honestly I would have thought that a straight
> two-week run at it would be a reasonable estimate -- the places it plugs in
> in hypervisor and guest kernel are pretty clean and well defined.

Your estimate wasn't far off.  After 6 days I now have my first
prototype implementation sending and receiving its first events.

There needs to be some more work on the Xen side to add an explicit hook
for binding new event channels to VCPU0 (so we can hook them up to the
right queue) and the Linux side is limited to 1024 events as Linux isn't
expanding the event array yet.  And it needs more testing of course.

Should be able to post a patch series of the prototype of both the Xen
and Linux sides within a week or so.

David