Guys,

Attached is the first draft of the design for making sure that B/W
control features and fanout across CPUs can work for NICs/VNICs/flows
without much overhead, and that we can still poll the H/W where possible.

Cheers,
Sunay

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow: http://www.opensolaris.org/os/project/crossbow

NIC/VNIC/Flows - Bandwidth Control & Fanout

Sunay Tripathi (Sunay at Sun.Com)
For Crossbow Team (crossbow-discuss at opensolaris.org)
Dated: Feb 24th, 2007
===================================================

Preface
=======

This document attempts to define and design one of the most complex
parts of Project Crossbow. Read it in its entirety or you will get lost.
It is not meant to be a standalone document; it supplements the Crossbow
Architecture document available at:
http://opensolaris.org/os/project/crossbow/Docs/crossbow_architecture.txt

Scope of the document
=====================

This document explores the relationship between the NIC, VNICs, and
flows in terms of how they use hardware resources such as Rx rings and
software resources such as soft rings to get dynamic polling and
bandwidth control. It also defines how the traffic for the various
elements (NICs, VNICs, and flows) can be spread out to multiple CPUs
specified via a CPU list (the '-C' option of dladm and flowadm).

The complexity arises because bandwidth control and fanout can be
applied to NICs, VNICs, and flows independently of each other, while a
hierarchical relationship exists between flows and VNICs/NICs.

The goal of this architecture is to allow the squeue to still
dynamically poll an Rx ring or a soft ring for performance reasons, or
to do bandwidth control where specified.
The second goal is that if B/W control is specified (with or without
fanout), we should still try to get an Rx ring assigned and dynamically
poll it, so that packets in excess of the configured bandwidth stay on
the H/W Rx ring and are brought into main memory only when they are
ready to be processed.

Rx Rings, Soft Rings and Set of Soft Rings
==========================================

To achieve polling, bandwidth control, and fanout (and any combination
of these), we use the following resources:

Rx ring:

A H/W receive ring which receives incoming packets based on the rules
described in the H/W classifier, e.g. packets for incoming MAC address A
go to Rx ring 1, or packets for incoming TCP port B go to Rx ring 2.

Soft Ring:

A data structure for a pseudo H/W layer that mimics an Rx ring when one
is not available (either the H/W doesn't have it or has run out of
them). It consists of a queue for packets, state flags, and a worker
thread to do the processing. A soft ring can be used on both the receive
side and the transmit side, and soft rings normally come in pairs (one
for receive and one for transmit).

Soft Ring Set:

A data structure that contains a queue, state flags, a polling thread, a
worker thread, and a pointer to a set of soft rings. The concept is
useful when a NIC or VNIC wants to do polling over an Rx ring (for
performance or bandwidth control) and yet spread out the incoming
processing to multiple CPUs, to flows, or to a combination of the two.
To avoid multiple context switches, the queue and worker thread in the
soft ring set (called 'SRS' from now on) are used only if the SRS is
doing bandwidth control but is not allowed to do polling (because no Rx
ring is available).
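As a rough illustration of the rule above (this is not the Crossbow
code), the way an SRS ends up being driven could be sketched in C as
follows. The names srs_cfg_t, srs_path_t and srs_pick_path() are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what is available/configured for a NIC or VNIC. */
typedef struct srs_cfg {
        bool has_rx_ring;       /* a dedicated H/W Rx ring is assigned */
        bool bw_control;        /* a bandwidth limit was specified */
} srs_cfg_t;

/* Hypothetical names for the three ways an SRS can be driven. */
typedef enum srs_path {
        SRS_PATH_POLL,          /* poll thread or interrupt on a dedicated Rx ring */
        SRS_PATH_INTR_ONLY,     /* interrupt thread via the S/W classifier */
        SRS_PATH_WORKER_QUEUE   /* SRS queue + worker thread (extra context switch) */
} srs_path_t;

/*
 * The SRS queue and worker thread are needed only when B/W control is
 * requested and there is no Rx ring to poll.
 */
static srs_path_t
srs_pick_path(const srs_cfg_t *cfg)
{
        assert(cfg != NULL);

        if (cfg->has_rx_ring)
                return (SRS_PATH_POLL);
        if (cfg->bw_control)
                return (SRS_PATH_WORKER_QUEUE);
        return (SRS_PATH_INTR_ONLY);
}
```

With an Rx ring the excess packets can simply stay on the ring, which is
why the internal queue is only needed in the no-ring, B/W-limited case.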
[Figure 1: ASCII diagram, flattened in this archive copy. From top to
bottom it shows: Squeues X, A, Z, 1 and n (IP and above); the Data Link
and VNIC layer; the MAC Pseudo H/W layer, where a Soft Ring Set feeds
Soft Rings A and Z and the S/W Classifier feeds Soft Rings 1 and n; the
NIC Driver; and the NIC Hardware with Rx/Tx Rings 1 and n, the Default
Rx/Tx Ring, and the H/W Classifier. Rx/Tx Rings 1 and n each have an
interrupt path and a polling path; the Default Rx/Tx Ring has an
interrupt-only path into the S/W Classifier. The path for Squeue X is
labelled as a direct interrupt and polling path to/from IP, bypassing
the Data Link layer.]

Figure 1

Figure 1 above shows the complex relationship between the three data
structures.

Squeue X:

This can be a data path for a VNIC or a flow based on IP address,
transport, or port.
There is *no* fanout involved, but bandwidth can be specified for flows
based on transport or ports (services). The squeue directly controls the
Rx rings and does dynamic polling for performance or for B/W control. It
is also important to note that in such a case we can bypass the entire
driver and datalink processing for better performance and lower latency.
This is the best-case scenario for the Crossbow architecture, where B/W
control etc. can be maintained for a NIC/VNIC/flow and yet there are
performance improvements.

This case doesn't work when bandwidth is specified for a VNIC (MAC
address or IP address based) or when fanout is specified for the VNIC (a
CPU list of more than one CPU via the '-C' option; one CPU is OK).

Squeue A&Z:

This can be the data path when bandwidth or fanout is specified for a
VNIC or a NIC. The soft ring set does the polling for performance or
enforces B/W control for the overall NIC/VNIC, and then fans out the
incoming traffic either to a TCP and a UDP soft ring (if bandwidth
control was specified) or to one soft ring per CPU specified in the
fanout list (via the '-C' option of dladm/flowadm). If both bandwidth
control and fanout are specified, the SRS creates a TCP and a UDP soft
ring for each CPU in the fanout list.

Flows can be specified on top of a VNIC/NIC that has fanout specified,
but the flows themselves cannot have any fanout (they are allowed to
have B/W control, and their B/W comes out of any B/W specified for the
NIC/VNIC). As part of creating the flow, the sysadmin can specify a
single CPU via the '-C' option that will do the processing for that flow
(the CPU must be part of the CPU list specified for the NIC/VNIC, else
flowadm will fail).

If the NIC is not capable of having Rx rings (or none are available),
the SRS has to use the default Rx ring and S/W classification, and in
this case the SRS cannot do dynamic polling.
If B/W control is specified, the SRS is forced to use its internal queue
and worker thread to maintain B/W control. This leads to one extra
context switch and is generally not a good scenario for
performance-sensitive cases.

Squeue 1&N:

This is the same scenario as the first one (Squeue X), except that it is
used when Rx rings are not available and things are driven via the S/W
classifier.

Polling & Bandwidth Control without Squeue: Xen & Forwarding path
=================================================================

This is the scenario used to make sure that dynamic polling (and
bandwidth control where necessary) is always enabled, even for traffic
that is not meant for the host. The two most common cases are when the
machine is forwarding packets and needs to perform better (a per-packet
interrupt in the forwarding path has a very heavy negative performance
impact), or when it is a Xen dom0 where the NIC/VNIC is dealing with
packets meant for a domU guest. In either of these cases, the IP squeues
are not available to provide dynamic polling or bandwidth control.
[Figure 2: ASCII diagram, flattened in this archive copy. From top to
bottom it shows: the IP forwarding path (IP layer); the Data Link and
VNIC layer; the MAC Pseudo H/W layer with the path to Xen domU, Soft
Rings Y1 and Y2 fed by Soft Ring Set Y, Soft Rings Z1 and Z2 fed by Soft
Ring Set Z, and the S/W Classifier; the NIC Driver; and the NIC Hardware
with Rx/Tx Ring1 (IP A-X), Rx/Tx Ring2 (IP Y), Default Rx/Tx Ring3, and
the H/W Classifier. Ring1 carries the path for local traffic as per
Fig 1; Ring2 has an interrupt path and a polling path; Ring3 has an
interrupt-only path into the S/W Classifier.]

IP Addresses A-X are local for the machine
IP Address Y is for Xen domU instance
The remaining traffic (not for local machine) goes to the default ring

Figure 2.

As shown in Figure 2, we start by putting a specific entry in the H/W or
S/W classifier for each local IP address.
In the figure, they map to Rx Ring 1, and the data path for these is
similar to that shown in Figure 1. The classifier maps the MAC address
or IP address for Xen domU to Rx Ring 2, which goes to Soft Ring Set Y,
which controls the dynamic polling, bandwidth control and fanout for the
Xen domU traffic. Any remaining traffic maps to Rx Ring 3, which goes to
the S/W classifier. Soft Ring Set Z gets the traffic that is meant for
forwarding, and it can enforce fanout and bandwidth control before
sending the traffic to the IP layer for route table lookup and
forwarding. In this example, Soft Ring Set Z does not do any dynamic
polling.

It is important to note that this is just an example; we could just as
easily have assigned all the MAC addresses and IP addresses for Xen domU
to go to Rx ring 3 and let the forwarded traffic go to Rx ring 2, where
soft ring set Y can now do dynamic polling for better performance. The
example above tries to show the possibilities with and without Rx rings.
If Rx rings are available, we should always use them.

Soft Ring Set
=============

SRS is represented by the following data structure.
typedef void (*s_ring_proc_t)(void *, void *, mblk_t *);
typedef mblk_t *(*srs_poll_func_t)(void *, size_t);
typedef void (*srs_blank_t)(void *, time_t, uint_t, int);

typedef struct mac_soft_ring_set_s {
        kmutex_t        srs_lock;       /* lock before using any member */
        uint16_t        srs_type;       /* processing model of the sq */
        uint16_t        srs_state;      /* state flags and message count */
        int             srs_count;      /* # of mblocks in mac_soft_ring */
        mblk_t          *srs_first;     /* first mblk chain or NULL */
        mblk_t          *srs_last;      /* last mblk chain or NULL */
        kthread_t       *srs_worker;    /* worker thread id */
        kcondvar_t      srs_async;      /* cv for worker thread */

        /* Upcall Function for fanout, processing etc */
        s_ring_proc_t   srs_rx_func;
        void            *s_ring_rx_arg1;
        void            *s_ring_rx_arg2;

        /* List of soft rings */
        mac_soft_ring_t **srs_soft_ring_list;

        /* Polling the Rx ring related members */
        srs_blank_t     srs_blank_func; /* Interrupt blank function */
        srs_poll_func_t srs_poll_func;  /* Function to call for polling */
        void            *srs_rx_handle; /* Cookie to pass for polling */
        kthread_t       *srs_poll_thr;  /* polling thread */
        kcondvar_t      srs_cv;         /* cond variable poll_thr waits on */

        /* Bandwidth control related members */
        size_t          srs_size;       /* Size of packets queued in bytes */
        size_t          srs_bw_limit;   /* Max bytes to process per tick */
        size_t          srs_bw_used;    /* Bytes processed in current tick */
        clock_t         srs_curr_time;  /* Current tick (lbolt) */

        processorid_t   srs_bind;       /* processor to bind to */
} mac_soft_ring_set_t;

Note that srs_first, srs_last, and srs_worker (the queue and worker
thread) are used only when there is no dedicated Rx ring for the SRS to
poll and it has been asked to do bandwidth control as well. In all other
cases, including fanout, either the interrupt thread from below will
process the SRS or the poll thread will process the SRS.
As long as there is only one interrupt designated for the Rx ring
(pointed to by srs_rx_handle) and the interrupt and poll threads run
mutually exclusive of each other, there is no need for any queuing (this
part is still under prototyping).

NICs and VNICs
==============

VNICs are the same as regular NICs. Both are built on top of Rx rings or
soft rings and work independently of each other, i.e. the bandwidth
limits set on them are independent of each other and work as long as the
sum of all bandwidth assigned to the one plumbed NIC and all VNICs is
less than or equal to the total physical bandwidth configured (link
speed).

Rx Rings - If enough Rx rings are available, the VNICs use the Rx rings
and the H/W classifier, while one Rx ring is reserved for the plumbed
physical NIC.

Soft rings - When enough Rx rings are not available, VNICs use the
software classifier and soft rings.

Fanout for NICs and VNICs
=========================

When a CPU list is provided, the NIC or VNIC spreads traffic out to all
the CPUs using a set of soft rings called a 'soft ring set' or 'SRS'.

With Rx rings - The SRS uses polling over Rx rings for performance, and
for B/W control if necessary. The upcall function and cookie (the
'pointer to SRS') get the fanout done and use one of the soft rings to
deliver the packet.

Without Rx ring - The SRS doesn't use polling. The S/W classifier points
to the lower proc function and the 'SRS', which does the fanout.

B/W control and/or Fanout for NICs and VNICs
============================================

Each SRS will have one soft ring each for TCP and UDP if bandwidth
control is specified. If fanout is specified as well, a TCP/UDP soft
ring pair is created for each CPU specified in the CPU list.

With Rx ring - Use the 'SRS' over a dedicated Rx ring for B/W control
and/or fanout. No queuing (or SRS worker thread processing) is required
at the SRS, only B/W accounting.
The SRS poll thread or interrupt does the fanout and queues the packet
on one of the soft rings, where the soft ring worker thread picks up the
processing (only one context switch).

Without Rx ring - If no dedicated Rx ring is present, the 'SRS' is not
allowed to do polling, and packets can get queued at the 'SRS' level to
enforce B/W. They are then processed by the SRS worker thread, which in
turn selects one of the soft rings to queue the packet on (resulting in
two context switches).

Flows on NICs and VNICs
=======================

Flows are built on top of VNICs or NICs, and any B/W dedicated to a flow
is taken from the NIC or VNIC it is built on top of.

If the NIC or VNIC doesn't have any B/W control or fanout on top of it,
the flow can use a dedicated Rx ring or soft ring and have B/W control.
If the flow needs to have fanout and/or B/W control as well, it uses an
SRS on an Rx ring or via the S/W classifier. As above, if no dedicated
Rx ring is present, the 'SRS' doesn't use polling.

If the NIC/VNIC already has B/W control configured and a flow is then
configured, the NIC/VNIC switches to using an 'SRS' where one of the
soft rings from the soft ring set is dedicated to the flow. The flow in
this case can have its own B/W control, which is enforced between the
soft ring and the squeue.

Similarly, the VNIC/NIC can have B/W control + fanout configured and a
flow created on top of it. The flow takes one of the soft rings from the
set and can have its own B/W (part of the NIC/VNIC bandwidth) enforced
by the soft ring managing the flow.

If a flow is configured on top of a VNIC or NIC which already has a
bandwidth limit and/or fanout specified, then the flow *cannot* have
fanout. It can still have a single CPU specified via the '-C' option
(which needs to be from the set of CPUs specified for the NIC or VNIC).
The only way a flow can have its own bandwidth limit specified with or
without fanout is by creating the flow on a NIC itself where no fanout
or bandwidth limit is specified for the NIC.
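The B/W accounting done by an SRS, or by the soft ring managing a flow,
is not spelled out in this design. One plausible reading of the
srs_bw_limit, srs_bw_used and srs_curr_time members is a per-tick byte
budget, sketched below. The type bw_acct_t and the function bw_admit()
are hypothetical, and the tick is passed in as a plain integer rather
than read from lbolt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bandwidth members of the SRS. */
typedef struct bw_acct {
        size_t  bw_limit;       /* max bytes to process per tick */
        size_t  bw_used;        /* bytes processed in the current tick */
        long    curr_tick;      /* tick that bw_used refers to */
} bw_acct_t;

/*
 * Returns true if a packet of pkt_len bytes fits in the budget for tick
 * 'now' and charges it; returns false if the packet has to wait.
 */
static bool
bw_admit(bw_acct_t *bw, size_t pkt_len, long now)
{
        assert(bw != NULL);

        /* A new tick starts with a fresh budget. */
        if (now != bw->curr_tick) {
                bw->curr_tick = now;
                bw->bw_used = 0;
        }

        /* bw_used never exceeds bw_limit, so this cannot underflow. */
        if (pkt_len > bw->bw_limit - bw->bw_used)
                return (false);

        bw->bw_used += pkt_len;
        return (true);
}
```

In this sketch a packet that is refused simply stays where it is (on the
Rx ring when polling, otherwise on the queue) until a later tick. How a
packet larger than the whole per-tick limit should be treated is left
open here.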
Latency:
========

Improve latency via a tunable 'tune_for_latency' which, when set,
ignores the normal interrupt blank time and packet count and instead
sets these values for per-packet interrupts. For bonus points, this
should be an option of dladm/flowadm so that it can be applied per VNIC
or per flow.

Currently, for performance reasons (tuned towards overall throughput),
we set the interrupt to the normal blank time and normal packet count
(i.e. whatever the default interrupt moderation scheme is) when the NIC
or Rx ring is in interrupt mode. Based on the tunable above, we should
ignore the default interrupt moderation values supplied by the driver
and force a per-packet interrupt when in interrupt mode.
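A minimal sketch of the tunable's effect follows. It assumes that "per
packet interrupt" means a blank time of zero and a packet count of one,
which the text above does not state explicitly. The type intr_mod_t and
the function intr_mod_select() are hypothetical names for this example
only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pair of interrupt moderation values. */
typedef struct intr_mod {
        unsigned int    blank_time;     /* interrupt blanking time; 0 = none */
        unsigned int    pkt_count;      /* packets coalesced per interrupt */
} intr_mod_t;

/*
 * Pick the values to program when the NIC or Rx ring is put back into
 * interrupt mode: the driver defaults normally, or per-packet interrupts
 * (assumed here to be blank_time 0, pkt_count 1) when tuning for latency.
 */
static intr_mod_t
intr_mod_select(const intr_mod_t *drv_default, bool tune_for_latency)
{
        intr_mod_t per_packet = { 0, 1 };

        assert(drv_default != NULL);

        return (tune_for_latency ? per_packet : *drv_default);
}
```

Passing the tunable in as an argument rather than reading a global makes
it straightforward to keep one setting per VNIC or flow, as the bonus
item above suggests.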
David Edmondson
2007-Mar-21 10:42 UTC
[crossbow-discuss] Design for B/W control and fanout
On Tue, Mar 13, 2007 at 04:29:49PM -0700, Sunay Tripathi wrote:
> Squeue X:
>
> This can be a data path for a VNIC or a flow based on IP address, transport,
> or port. There is *no* fanout involved but bandwidth can be specified for
> flows based on transport or ports (services). The squeue directly controls
> the Rx rings and does dynamic polling for performance or do B/W control. Its
> also important to note that in such a case, we can bypass the entire driver
> and datalink processing for better performance and lower latencies. This is
> the best case scenario for Crossbow architecture where B/W control etc. can
> be maintained for NIC/VNIC/flow and yet there is performance
> improvements.
>
> This case doesn't work when bandwidth is specified for a VNIC (mac address
> or IP address based) or a fanout is specified for the VNIC (a cpu list of
> more than 1 CPU via '-C' option - 1 CPU is OK).

Can you explain why this case doesn't work for a mac address based
VNIC with bandwidth limits?

dme.
David Edmondson wrote:
> On Tue, Mar 13, 2007 at 04:29:49PM -0700, Sunay Tripathi wrote:
>> Squeue X:
>>
>> This can be a data path for a VNIC or a flow based on IP address, transport,
>> or port. There is *no* fanout involved but bandwidth can be specified for
>> flows based on transport or ports (services). The squeue directly controls
>> the Rx rings and does dynamic polling for performance or do B/W control. Its
>> also important to note that in such a case, we can bypass the entire driver
>> and datalink processing for better performance and lower latencies. This is
>> the best case scenario for Crossbow architecture where B/W control etc. can
>> be maintained for NIC/VNIC/flow and yet there is performance
>> improvements.
>>
>> This case doesn't work when bandwidth is specified for a VNIC (mac address
>> or IP address based) or a fanout is specified for the VNIC (a cpu list of
>> more than 1 CPU via '-C' option - 1 CPU is OK).
>
> Can you explain why this case doesn't work for a mac address based
> VNIC with bandwidth limits?
>
> dme.

That's a limitation of the architecture. We have squeues meant for TCP
and non-TCP (typically UDP, which has fewer constraints for mutual
exclusion etc. than TCP). If B/W control is specified for the entire
VNIC, we need to make sure that the limit or guarantee gets honoured for
the entire VNIC before the traffic gets split into TCP and non-TCP. We
are kind of replicating the behaviour as if there was a real wire of the
specified B/W connected to the VNIC. Once we can honour the limit for
the entire VNIC (for any traffic flowing), we split it up into TCP and
non-TCP soft rings, where the TCP and non-TCP squeues can poll their
respective soft rings (see figure).

In terms of data path, we will end up taking an extra context switch
compared to the best case of the squeue polling the Rx ring directly.
The better way would be to express the bandwidth limit for a VNIC in
terms of TCP and non-TCP.
We will end up using two Rx rings but performance will be better than
Nevada today. We are planning to document these in the best practices
and probably the sysadmin books.

Cheers,
Sunay
On Mar 21, 2007, at 4:52 AM, Sunay Tripathi wrote:
>
> Thats a limitation of the architecture. We have squeues meant for TCP
> and non-TCP (typically UDP which has lesser constraints for mutual
> exclusion etc than TCP). If B/W control is specified for the entire
> VNIC, we need to make sure that limit or guarantee gets honoured for
> entire VNIC before the traffic gets split into TCP and non-TCP. We are
> kind of replicating the behaviour as if there was a real wire of
> specified B/W connected to the VNIC. Once we can honour the limit for
> the entire VNIC (for any traffic flowing), we split it up into TCP
> and non-TCP soft rings where TCP and non-TCP squeues can poll their
> respective soft rings (see figure)

This is not specific to VNICs, it's a general problem for all registered
MACs, since we want to be able to specify bandwidth limits for all NICs,
not just VNICs.

Nicolas.

--
Nicolas Droux - Solaris Networking - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux