Yunsong (Roamer) Lu
2007-Sep-28 22:59 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Hi,

We've posted a new design document, Crossbow Hardware Resources Management and Virtualization, here:
http://dlc.sun.com/osol/netvirt/downloads/docs/virtual_resources.pdf
You're welcome to review and comment.

In this document, we introduce the new driver interfaces for virtualizing NIC resources, such as multiple rings and multiple MAC addresses, and we discuss how to organize various hardware implementations into the new framework. We'd be glad to hear any comments from IHVs who are designing new NICs.

What's not included in the document? The new design evolves the GLDv3 interfaces, but before finalizing the changed interfaces, such as (*mc_tx)(), (*mc_unicast)(), (*mc_resources)(), mac_rx(), etc., we would like to hear your opinions first. Also, support for PCI-SIG IOV will be added afterwards.

Please feel free to comment and share your opinions!

Thanks,

Kais & Roamer

--
# telnet (650)-786-6759 (x86759)
Connected to Solaris.Sun.COM.
login: Lu, Yunsong
Last login: January 2, 2007 from beyond.sfbay
Yunsong.Lu at Sun.COM v1.03 Since Mon Dec. 22, 2003
[Roamer at Solaris Networking]# cd ..
Kais Belgaied
2007-Sep-29 19:43 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
The doc is also now available from the Crossbow docs page, at
http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf

Kais
blogs.sun.com/kais
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-02 21:53 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Hi Kais and Roamer,

This is a neat doc. I have a few questions/comments.

1) This document describes polling a single ring. Can you poll a group of rings if they all share a common interrupt number? mrg_intr is described as a "nice to have" per-group common interrupt number. Is it driver dependent? Or can the MAC framework have a virtual interrupt that masks the individual interrupts of each ring in a given group?

2) If a hardware/network driver does not have ring support, can Crossbow emulate channel/FIFO/ring behavior in software for such drivers? Would SRS serve that purpose?

3) I see there are mac_rx_ring_info_t and mac_rx_ring_group_info_t. How about having common info structures for the rx and tx sides? Instead of having mac_rx_ring_info_t and mac_rx_ring_group_info_t, would it make sense to have mac_ring_info_t and mac_ring_group_info_t, usable for both rx and tx rings or ring groups?

For example, to implement the above you could have a mac_cb function pointer in mac_ring_info_t, and:
   1) on the rx side, "mac_cb" can be initialized as "mr_poll", and
   2) on the tx side, "mac_cb" can be initialized as "mr_send",
since mr_driver, mr_intr, mr_start, and mr_stop are members of mac_rx_ring_info_t as well as mac_tx_ring_info_t; only the mr_poll and mr_send routines differ between the rx and tx ring_info respectively.

4) As far as I understand, these hardware resource capabilities can help with load balancing/packet classification. How can they help virtualization? As I see it, virtual machines are identified using MAC+IP addresses. Are there any userland utilities that can help classify and administer a VM's traffic and steer it across multiple rings by programming a policy/rule on the ring(s)? If so, what are they, and how can a user enforce a policy dynamically on a given set of rings or ring groups? I know flowadm can program a ring, but can it program a ring group?

5) Can you group rings of different physical NICs? If yes, what is the interface for that?

-Deepti
Yunsong (Roamer) Lu
2007-Oct-02 22:40 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Thank you for the comments!

> 1)
> This document describes polling a single ring.
> Can you poll a group of rings if they all share a common interrupt number?

Polling can only happen on a single ring, whether or not the interrupt is shared. In practice, if a shared group interrupt is used, polling a ring group just means polling each of its rings separately.

> mrg_intr is described as a "nice to have" per-group common
> interrupt number. Is it driver dependent? Or can the MAC framework
> have a virtual interrupt that masks the individual interrupts of each
> ring in a given group?

The driver needs to decide whether to enable a per-group interrupt when there are not enough interrupts for all rings, because this depends on how the hardware and driver organize interrupt sources.

> 2)
> If a hardware/network driver does not have ring support,
> can Crossbow emulate channel/FIFO/ring behavior in software
> for such drivers?

The term "ring" is used to describe a software data structure that is mapped to a hardware traffic channel, receive or transmit, so any NIC has at least one receive "ring" and one transmit "ring".

> Would SRS serve that purpose?

SRS sits on top of the hardware ring.

> 3)
> I see there are mac_rx_ring_info_t and mac_rx_ring_group_info_t.
> How about having common info structures for the rx and tx sides?
> Instead of having mac_rx_ring_info_t and mac_rx_ring_group_info_t,
> would it make sense to have mac_ring_info_t and mac_ring_group_info_t,
> usable for both rx and tx rings or ring groups?
>
> For example, to implement the above you could have a mac_cb function
> pointer in mac_ring_info_t, and:
> 1) on the rx side, "mac_cb" can be initialized as "mr_poll", and
> 2) on the tx side, "mac_cb" can be initialized as "mr_send",
> since mr_driver, mr_intr, mr_start, and mr_stop are members of
> mac_rx_ring_info_t as well as mac_tx_ring_info_t; only the
> mr_poll and mr_send routines differ between the rx and tx ring_info
> respectively.

(*mr_poll)() and (*mr_send)() have different prototypes, and they could evolve in the future. More functions may be added to either data structure.

The two different data structures, mac_rx_ring_info_t and mac_tx_ring_info_t, are used to simplify the interface. But if we can make sure the two data structures are only slightly different, we may consider combining them.

You may have seen that the transmit ring group and the receive ring group behave very differently, so mac_rx_ring_group_info_t should be very different from mac_tx_ring_group_info_t. So we'd better keep them separate.

In the MAC framework, we're using common data structures, mac_ring_t/mac_ring_group_t, to simplify the resource management logic.

> 4)
> As far as I understand, these hardware resource capabilities can help
> with load balancing/packet classification. How can they help virtualization?

Ring grouping isolates VNICs from the hardware layer, and the virtualization is done at layer 2 instead of L3/L4. The ring group will be the basic entity for virtualizing hardware resources.

> As I see it, virtual machines are identified using MAC+IP addresses.
> Are there any userland utilities that can help classify and administer
> a VM's traffic and steer it across multiple rings by programming a
> policy/rule on the ring(s)?
> If so, what are they, and how can a user enforce a policy dynamically on
> a given set of rings or ring groups? I know flowadm can program a ring,
> but can it program a ring group?

The updated Flow design will answer this.

> 5)
> Can you group rings of different physical NICs? If yes, what is the
> interface for that?

No. You may check out the ring group definition in the document. :)

Thanks,

Roamer
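Two points from this reply can be sketched in C: the rx and tx ring info structures share most members but differ in the prototype of their last entry point, and "polling a group" reduces to polling each member ring in turn. This is only an illustration under stated assumptions. The member names mr_driver, mr_intr, mr_start, mr_stop, mr_poll and mr_send come from the thread, but every prototype, the pkt_t type, the toy ring, and poll_group() are hypothetical stand-ins rather than the interfaces defined in the design document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real MAC framework types. */
typedef struct pkt {
	struct pkt *next;
	int len;
} pkt_t;

typedef void *ring_handle_t;	/* opaque driver handle */

/* Receive side: the distinguishing entry point pulls packets out. */
typedef struct rx_ring_info {
	ring_handle_t mr_driver;
	int mr_intr;
	int (*mr_start)(ring_handle_t);
	void (*mr_stop)(ring_handle_t);
	pkt_t *(*mr_poll)(ring_handle_t, int max_bytes);	/* assumed prototype */
} rx_ring_info_t;

/* Transmit side: the distinguishing entry point pushes packets in. */
typedef struct tx_ring_info {
	ring_handle_t mr_driver;
	int mr_intr;
	int (*mr_start)(ring_handle_t);
	void (*mr_stop)(ring_handle_t);
	pkt_t *(*mr_send)(ring_handle_t, pkt_t *chain);	/* assumed: returns unsent part */
} tx_ring_info_t;

/* Toy ring: a linked queue standing in for a hardware descriptor ring. */
typedef struct toy_ring {
	pkt_t *head;
} toy_ring_t;

/* Dequeue packets from the toy ring until the byte budget is exhausted. */
pkt_t *
toy_poll(ring_handle_t h, int max_bytes)
{
	toy_ring_t *r = h;
	pkt_t *first = NULL, **tailp = &first;

	while (r->head != NULL && r->head->len <= max_bytes) {
		pkt_t *p = r->head;
		r->head = p->next;
		p->next = NULL;
		max_bytes -= p->len;
		*tailp = p;
		tailp = &p->next;
	}
	return (first);
}

/* "Polling a group" is just polling each member ring separately. */
pkt_t *
poll_group(rx_ring_info_t *rings, size_t nrings, int max_bytes_per_ring)
{
	pkt_t *first = NULL, **tailp = &first;

	assert(rings != NULL || nrings == 0);
	for (size_t i = 0; i < nrings; i++) {
		*tailp = rings[i].mr_poll(rings[i].mr_driver, max_bytes_per_ring);
		while (*tailp != NULL)		/* advance to the end of the chain */
			tailp = &(*tailp)->next;
	}
	return (first);
}
```

Note that the chain returned by poll_group() is ordered by ring, not by arrival time, which is the ordering limitation discussed elsewhere in this thread.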
Kais Belgaied
2007-Oct-02 23:09 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Hi Deepti,

Thanks for the review. Answers below.

deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
> Hi Kais and Roamer,
>
> This is a neat doc. I have a few questions/comments.

Thanks!

> 1)
> This document describes polling a single ring.
> Can you poll a group of rings if they all share a common interrupt number?

The issue when polling multiple rings is deciding which one to choose when some rings have received packets and some are empty. It's not possible to know the order in which packets arrived on the various rings, so we don't have enough information to honor that order up in the stack. In this case we let the interrupt deposit the packets up to the SRS, then a worker thread polls from the SRS's queue at a rate prescribed by the bandwidth share.

> mrg_intr is described as a "nice to have" per-group common
> interrupt number. Is it driver dependent? Or can the MAC framework
> have a virtual interrupt that masks the individual interrupts of each
> ring in a given group?

It is both driver and system dependent. The preference is of course for the finest level of granularity, which is an interrupt per ring. If the device doesn't know how to generate a per-ring interrupt, or if the device driver failed to allocate an MSI-X interrupt number for each ring, then it is expected to fall back to the next best granularity, which is a per-group interrupt. If that fails, then interrupts can be shared between multiple groups.

> 2)
> If a hardware/network driver does not have ring support,
> can Crossbow emulate channel/FIFO/ring behavior in software
> for such drivers?
> Would SRS serve that purpose?

Yes, SRS and the ring members of an SRS will serve that purpose. Note that the driver will expose one singleton group in that case.

> 3)
> I see there are mac_rx_ring_info_t and mac_rx_ring_group_info_t.
> How about having common info structures for the rx and tx sides?
> Instead of having mac_rx_ring_info_t and mac_rx_ring_group_info_t,
> would it make sense to have mac_ring_info_t and mac_ring_group_info_t,
> usable for both rx and tx rings or ring groups?

mac_rx_ring_info_t and mac_tx_ring_info_t are objects of a different nature. Different functions act on them. The actions are different, and the arguments are different. Roamer and I discussed this quite a bit during the design, and it didn't feel natural to force a commonality of types on them just for the sake of having compact code.

We do have a common mac_capab_rings_t on the other hand, because that object is used the same way for both the rx and tx directions, simply for exchanging the opaque handles for rx and tx rings, and pointers to their more specific info structs. We opted for type commonality in that case. The first paragraph of the Provider Interface section was an attempt to capture that rationale.

> For example, to implement the above you could have a mac_cb function
> pointer in mac_ring_info_t, and:
> 1) on the rx side, "mac_cb" can be initialized as "mr_poll", and
> 2) on the tx side, "mac_cb" can be initialized as "mr_send",
> since mr_driver, mr_intr, mr_start, and mr_stop are members of
> mac_rx_ring_info_t as well as mac_tx_ring_info_t; only the
> mr_poll and mr_send routines differ between the rx and tx ring_info
> respectively.
>
> 4)
> As far as I understand, these hardware resource capabilities can help
> with load balancing/packet classification. How can they help virtualization?

Good question. The ability to split traffic into independent lanes helps share access to the hardware resources in an isolated manner. When you have a ring group that has its own MAC address and interrupt(s), you get to assign that interrupt to a CPU that was given to a virtual machine. That's isolation in terms of scheduling resources, because even an avalanche of interrupts targeting that VM's address will have little effect on the CPU resources allocated to others. On the transmit side, the core MAC framework will be submitting packets to the right tx ring associated with a specific MAC client (e.g. a VNIC given to a VM), and not using other clients' tx rings.

I think some elaboration is needed in the text here.

> As I see it, virtual machines are identified using MAC+IP addresses.
> Are there any userland utilities that can help classify and administer
> a VM's traffic and steer it across multiple rings by programming a
> policy/rule on the ring(s)?

Yes, at the end of the day that is flowadm(1m), which may result in programming the hardware classifier for steering based on a rule (e.g. IP address or port) or a policy (hash function).

> If so, what are they, and how can a user enforce a policy dynamically on
> a given set of rings or ring groups? I know flowadm can program a ring,
> but can it program a ring group?

The generalized load balancing policy (generalized from the existing aggr policy) is currently the only way to alter the behavior at the level of the ring group, and that's done using dladm(1m).

> 5)
> Can you group rings of different physical NICs? If yes, what is the
> interface for that?

I need to think about this one. The scope of the question is actually how to make the aggr driver work efficiently and best utilize the virtualization capabilities of its members. Maybe we can have an open Crossbow design meeting about this, if members of this audience wish to participate.

Thanks,

Kais
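The interrupt fallback order described in this reply (per ring, then per group, then shared between groups) can be written down as a small decision function. The sketch below is a toy model, not driver code: the enum, both function names, and the idea of passing in a count of successfully allocated vectors are assumptions made for illustration, and a real driver would obtain that count from the platform's interrupt allocation interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the thread only describes the fallback order. */
typedef enum intr_granularity {
	INTR_PER_RING,		/* one vector per ring (preferred) */
	INTR_PER_GROUP,		/* all rings of a group share one vector */
	INTR_SHARED		/* several groups share vectors */
} intr_granularity_t;

/*
 * Pick the finest granularity the device and the allocated vectors allow.
 * nvectors is however many vectors the driver managed to allocate.
 */
intr_granularity_t
choose_intr_granularity(size_t nvectors, size_t ngroups,
    size_t rings_per_group, int hw_per_ring_capable)
{
	assert(nvectors > 0 && ngroups > 0 && rings_per_group > 0);

	if (hw_per_ring_capable && nvectors >= ngroups * rings_per_group)
		return (INTR_PER_RING);
	if (nvectors >= ngroups)
		return (INTR_PER_GROUP);
	return (INTR_SHARED);
}

/* Map (group, ring) to a vector index under the chosen granularity. */
size_t
ring_vector(intr_granularity_t g, size_t nvectors, size_t rings_per_group,
    size_t group, size_t ring)
{
	switch (g) {
	case INTR_PER_RING:
		return (group * rings_per_group + ring);
	case INTR_PER_GROUP:
		return (group);
	default:
		return (group % nvectors);	/* groups wrap onto the vectors */
	}
}
```

For example, with 4 groups of 4 rings, 16 vectors allow a vector per ring, 8 vectors force one vector per group, and 2 vectors leave groups sharing.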
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-02 23:55 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Yunsong (Roamer) Lu wrote:
> Thank you for the comments!
>
>> 1)
>> This document describes polling a single ring.
>> Can you poll a group of rings if they all share a common interrupt number?
>
> Polling can only happen on a single ring, whether or not the interrupt
> is shared. In practice, if a shared group interrupt is used, polling a
> ring group just means polling each of its rings separately.

I recall some conversations about whether rings belonging to the same group have the same interrupt number, so that blanking that shared interrupt can emulate polling of a group. My original question also asked whether you can mask it in the framework, so that even if individual rings generate different interrupts, they can be delivered as one common virtual interrupt to the upper layer, and polling on a ring group could work. If the driver implementation doesn't support it, can we implement it in the MAC framework?

>> mrg_intr is described as a "nice to have" per-group common
>> interrupt number. Is it driver dependent? Or can the MAC framework
>> have a virtual interrupt that masks the individual interrupts of each
>> ring in a given group?
>
> The driver needs to decide

So you mean it is a NIC/hardware limitation, or is it specific to the driver implementation?

> whether to enable a per-group interrupt
> when there are not enough interrupts for all rings,

So the interrupts may be limited in number. In that case, sharing the same interrupt among multiple rings of the group should help, shouldn't it?

> because this depends on
> how the hardware and driver organize interrupt sources.

I guess the driver should have the ability to program the interrupt vector and assign interrupts. I may be missing something?

>> 2)
>> If a hardware/network driver does not have ring support,
>> can Crossbow emulate channel/FIFO/ring behavior in software
>> for such drivers?
>
> The term "ring" is used to describe a software data structure that is
> mapped to a hardware traffic channel, receive or transmit,

I understand that.

> so any NIC has
> at least one receive "ring" and one transmit "ring".

Yeah.

>> Would SRS serve that purpose?
>
> SRS sits on top of the hardware ring.

In the absence of support for multiple RX/TX rings in hardware, does Crossbow emulate the behavior of multiple RX/TX rings/groups? Or are there plans to extend it in future phases? Can SRS (soft ring set) help us achieve that in the future?

>> 3)
>> I see there are mac_rx_ring_info_t and mac_rx_ring_group_info_t.
>> How about having common info structures for the rx and tx sides?
>> Instead of having mac_rx_ring_info_t and mac_rx_ring_group_info_t,
>> would it make sense to have mac_ring_info_t and mac_ring_group_info_t,
>> usable for both rx and tx rings or ring groups?
>>
>> For example, to implement the above you could have a mac_cb function
>> pointer in mac_ring_info_t, and:
>> 1) on the rx side, "mac_cb" can be initialized as "mr_poll", and
>> 2) on the tx side, "mac_cb" can be initialized as "mr_send",
>> since mr_driver, mr_intr, mr_start, and mr_stop are members of
>> mac_rx_ring_info_t as well as mac_tx_ring_info_t; only the
>> mr_poll and mr_send routines differ between the rx and tx ring_info
>> respectively.
>
> (*mr_poll)() and (*mr_send)() have different prototypes, and they could
> evolve in the future. More functions may be added to either
> data structure.
>
> The two different data structures, mac_rx_ring_info_t and
> mac_tx_ring_info_t, are used to simplify the interface.

They have 4 out of 5 members in common.

> But if we can
> make sure the two data structures are only slightly different, we may
> consider combining them.

Don't they differ in just one structure member, i.e. mr_poll versus mr_send?

> You may have seen that the transmit ring group and the receive ring group
> behave very differently, so mac_rx_ring_group_info_t should be very
> different from mac_tx_ring_group_info_t. So we'd better keep them
> separate.

OK.

> In the MAC framework, we're using common data structures,
> mac_ring_t/mac_ring_group_t, to simplify the resource management logic.

Yeah, I see that, and also mac_capab_ring_t is used on the TX as well as the RX side. That's why I wondered why there are separate mac_rx_ring_info_t and mac_tx_ring_info_t, and why not just a mac_ring_info_t with a member mac_cb that can hold mac_poll for rx rings and mac_send for tx rings.

>> 4)
>> As far as I understand, these hardware resource capabilities can help
>> with load balancing/packet classification. How can they help virtualization?
>
> Ring grouping isolates VNICs from the hardware layer, and the
> virtualization is done at layer 2 instead of L3/L4.

Having user-programmed Rx rings (flows) with L3/L4 classification on top of a VNIC/NIC would differentiate Crossbow from other vendors' Crossbow-like virtualization capabilities. It would be especially useful when you want to run multiple applications, such as a streaming server (RTP) and a DHCP server (UDP), on the same virtual machine with a VNIC, so that no single app can monopolize the VNIC's already capped bandwidth. Rx-ring grouping gives many benefits, and I am not sure we want to limit it to just VNIC/L2 classification. So I am thinking out loud here about how it could be exploited further in line with VNICs.

> The ring group will be
> the basic entity for virtualizing hardware resources.

Programming a ring group for L3/L4 classification can help the performance of network virtual machines, I suppose.

>> As I see it, virtual machines are identified using MAC+IP addresses.
>> Are there any userland utilities that can help classify and administer
>> a VM's traffic and steer it across multiple rings by programming a
>> policy/rule on the ring(s)?
>> If so, what are they, and how can a user enforce a policy dynamically on
>> a given set of rings or ring groups? I know flowadm can program a ring,
>> but can it program a ring group?
>
> The updated Flow design will answer this.

OK. :-)

>> 5)
>> Can you group rings of different physical NICs? If yes, what is the
>> interface for that?
>
> No. You may check out the ring group definition in the document. :)

So is it a hardware design limitation or a driver implementation limitation? By the way, I don't see the definition; can you help me with a page number?

-Deepti
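The L3/L4 steering idea raised in this message, where explicit rules place particular services on particular rings and a hash spreads the remaining flows, can be illustrated with a short C sketch. This is a toy software model under stated assumptions, not the Crossbow classifier or any NIC's hardware classifier: flow_key_t, steer_rule_t, steer(), and the choice of an FNV-1a hash are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key: the usual L3/L4 5-tuple. */
typedef struct flow_key {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t proto;
} flow_key_t;

/* Hypothetical rule: protocol + destination port pinned to a ring. */
typedef struct steer_rule {
	uint8_t proto;
	uint16_t dst_port;
	size_t ring;
} steer_rule_t;

/* Fold the low nbytes of v into an FNV-1a hash, byte by byte. */
static uint32_t
fnv1a_mix(uint32_t h, uint32_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return (h);
}

/* Hash the tuple field by field so struct padding is never hashed. */
uint32_t
flow_hash(const flow_key_t *k)
{
	uint32_t h = 2166136261u;

	h = fnv1a_mix(h, k->src_ip, 4);
	h = fnv1a_mix(h, k->dst_ip, 4);
	h = fnv1a_mix(h, k->src_port, 2);
	h = fnv1a_mix(h, k->dst_port, 2);
	h = fnv1a_mix(h, k->proto, 1);
	return (h);
}

/* Rule match first; otherwise spread flows over the group by hash. */
size_t
steer(const steer_rule_t *rules, size_t nrules, const flow_key_t *k,
    size_t nrings)
{
	assert(k != NULL && nrings > 0);

	for (size_t i = 0; i < nrules; i++) {
		if (rules[i].proto == k->proto &&
		    rules[i].dst_port == k->dst_port) {
			assert(rules[i].ring < nrings);
			return (rules[i].ring);
		}
	}
	return (flow_hash(k) % nrings);
}
```

With one rule for UDP port 67 and another for UDP port 5004, DHCP and RTP traffic land on separate rings while every other flow is placed by hash, so packets of a given flow always reach the same ring.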
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-03 00:20 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Kais Belgaied wrote:
> Hi Deepti,
>
> Thanks for the review.

You are welcome!

> Answers below.
>
>> 1)
>> This document describes polling a single ring.
>> Can you poll a group of rings if they all share a common interrupt number?
>
> The issue when polling multiple rings is deciding which one to choose
> when some rings have received packets and some are empty.

Aha, so which ring in a group should be drained is an issue. But given the rate at which these 10GbE NICs are loaded in data centers, I think the common case would be that we always have some packets on each ring of the group.

> It's not possible to know the order in which packets arrived on the
> various rings,

So out-of-order packets can be an issue, I agree.

> so we don't have
> enough information to honor that order up in the stack.

Can we drain the rings of a group in round-robin fashion? That is, we remember the last ring drained and, the next time we serve the group's interrupt, drain packets off the next ring of the same group: 1, 2, 3, 4, 1, 2, etc. But yeah, that won't solve the out-of-order packets issue. I would like to think about this a bit more.

> In this case we let the interrupt deposit the packets up to the SRS,
> then a worker thread polls
> from the SRS's queue at a rate prescribed by the bandwidth share.

This is neat.

>> mrg_intr is described as a "nice to have" per-group common
>> interrupt number. Is it driver dependent? Or can the MAC framework
>> have a virtual interrupt that masks the individual interrupts of each
>> ring in a given group?
>
> It is both driver and system dependent.
> The preference is of course for the finest level of granularity, which is
> an interrupt per ring.

I think that would be complex, and I don't see how useful it is, because the purpose of grouping would be to load-balance the same traffic.

Although I see your point: since we have no reliable way of knowing which ring of the group has packets to be drained, an individual interrupt per ring gives finer knowledge. But again, as per my comment above, it's likely that each ring of the group would always have some packets. Also, grouping will be used in heavy network load situations.

> If the device doesn't know how to generate a per-ring interrupt, or if
> the device driver failed to
> allocate an MSI-X interrupt number for each ring, then it is expected to
> fall back to the next best granularity,
> which is a per-group interrupt. If that fails, then interrupts can be
> shared between multiple groups.

A per-group interrupt is something I am convinced about. Sharing the same interrupts across multiple groups (each group with multiple rings) seems less useful to me.

>> 2)
>> If a hardware/network driver does not have ring support,
>> can Crossbow emulate channel/FIFO/ring behavior in software
>> for such drivers?
>> Would SRS serve that purpose?
>
> Yes, SRS and the ring members of an SRS will serve that purpose.
> Note that the driver will expose one singleton group in that case.

Great.

>> 3)
>> I see there are mac_rx_ring_info_t and mac_rx_ring_group_info_t.
>> How about having common info structures for the rx and tx sides?
>> Instead of having mac_rx_ring_info_t and mac_rx_ring_group_info_t,
>> would it make sense to have mac_ring_info_t and mac_ring_group_info_t,
>> usable for both rx and tx rings or ring groups?
>
> mac_rx_ring_info_t and mac_tx_ring_info_t
> are objects of a different nature. Different functions act on them. The
> actions are different,
> and the arguments are different.

I may be missing something. I think mac_rx_ring_info_t and mac_tx_ring_info_t differ only in the last member.

> Roamer and I discussed this quite a bit
> during the design,
> and it didn't feel natural to force a commonality of types on them
> just for the sake of having
> compact code.
> We do have a common mac_capab_rings_t on the other hand, because that
> object is used the same way

Yeah, that's why I wondered why ring_info_t cannot be the same for the TX and RX sides.

> for both the rx and tx directions, simply for exchanging the opaque handles
> for rx and tx rings, and
> pointers to their more specific info structs. We opted for type
> commonality in that case.
> The first paragraph of the Provider Interface section was an attempt to
> capture that rationale.

Yes, that's what got me thinking: why not the same ring_info_t?

>> 5)
>> Can you group rings of different physical NICs? If yes, what is the
>> interface for that?
>
> I need to think about this one. The scope of the question is actually
> how to make the aggr driver
> work efficiently and best utilize the virtualization capabilities of its
> members.
> Maybe we can have an open Crossbow design meeting about this, if members
> of this audience wish to
> participate.

I was thinking about a use case for why we would need such a thing. Say a multicast streaming/video application is running in different virtual network machines (all listening to the same multicast address). The stream of multicast packets could be classified and deposited on the rings of a group made up of rings from different physical NICs/VNICs. High streaming quality and high throughput could be achieved, and all multicast listener apps in all VMs could be programmed at once by programming that ring group, if ring groups are programmable using flowadm.

But this is a view from 100 feet up and is on my wish list :-) with very peculiar requirements and corner-case usage. If it's easily doable, already available in hardware+driver, and just one API away in the MAC framework, I wondered whether we could have that in Crossbow.

Thanks,
Deepti
Kais Belgaied
2007-Oct-03 02:19 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
>> It is both driver and system dependent.
>> The preference is for the finest level of granularity of course, which
>> is an interrupt per ring.
>
> I think that would be complex, and I don't see how it is useful.

When rings are assigned to sub-flows, we want to toggle the interrupts on and off on each ring independently, and poll packets off them individually.

Even in interrupt-only mode, the interrupts from different rings can be spread across multiple CPUs.

> because the purpose of grouping would be to load-balance the same traffic.
>
> Although, I see your point: as we don't have a correct way of knowing
> which ring of the group has packets to be drained, an individual interrupt
> per ring gives finer knowledge.
> But again, as per my comment in the earlier paragraph, it's likely that
> each ring of the group would always have some packets.
> Also, grouping will be used in heavy network situations.

Kais
deepti dhokte
2007-Oct-03 04:39 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Kais Belgaied wrote:
> deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
>>> It is both driver and system dependent.
>>> The preference is for the finest level of granularity of course,
>>> which is an interrupt per ring.
>>
>> I think that would be complex, and I don't see how it is useful.
>
> When rings are assigned to sub-flows, we want to toggle the interrupts
> on and off on each ring independently, and poll packets off them
> individually.
>
> Even in interrupt-only mode, the interrupts from different rings can
> be spread across multiple CPUs.

Understood. It's good that many corner cases are also well taken care of.

-Deepti

>> because the purpose of grouping would be to load-balance the same traffic.
>>
>> Although, I see your point: as we don't have a correct way of knowing
>> which ring of the group has packets to be drained, an individual
>> interrupt per ring gives finer knowledge.
>> But again, as per my comment in the earlier paragraph, it's likely
>> that each ring of the group would always have some packets.
>> Also, grouping will be used in heavy network situations.
>
> Kais
Yunsong (Roamer) Lu
2007-Oct-03 05:13 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
> I recall some conversations about whether rings belonging to the same
> group have the same interrupt number, so that blanking that shared
> interrupt can emulate polling of a group.
> My original question also asked whether you can mask it in the framework,
> so that even if individual rings generate different interrupts, they can
> be delivered as some common virtual interrupt to the upper layer,
> so polling on a ring group could work.
> Because if the driver implementation doesn't support it, can we implement
> it in the mac framework?

I don't understand why you want to do this. If you look at the document, you will see that the per-group interrupt is a nice-to-have feature *ONLY* when there are not enough per-ring interrupts. What is your case that needs to poll a ring group even though per-ring interrupts are implemented?

> In the absence of multiple RX/TX rings support in hardware,
> does crossbow emulate RX/TX multiple rings/groups behavior?
> Or are there plans to extend it in future phases?
> Can SRS (soft ring set) help us to achieve that in future?

Why would the framework need to pretend to have multiple hardware rings and/or ring groups? What behavior does the framework need to emulate?

> aren't they just slightly different in just one structure member,
> i.e. mr_poll and mr_send?

Firstly, combining these two data structures is unnatural for now. The two function prototypes are different, so I don't see why they need to share one function pointer.

>> The ring grouping isolates VNICs from the hardware layer, and the
>> virtualization is done at layer 2 instead of L3/L4.
>
> Having user-programmed Rx rings (flows) with L3/L4 classification
> on top of a VNIC/NIC would differentiate crossbow features from
> other vendors' crossbow-like virtualization capabilities.
>
> It would be especially useful when you want to run multiple
> applications, such as a streaming server (RTP) and a DHCP server (UDP),
> on the same virtual machine with a VNIC,
> so that no single app can monopolize the given VNIC's already capped
> bandwidth.
>
> rx-ring grouping gives many benefits and I am not sure if we want to limit
> it to just VNIC/L2 classification.
>
> So I am trying to think out loud here and wonder about its further
> exploitation in line with VNICs.
>
>> Ring group will be
>> the basic entity of virtualizing hardware resources.
>
> Programming a ring group for L3/L4 classification can help the
> performance of network virtual machines, I suppose.

It looks like the document is not clear enough in explaining this. We'll update it accordingly.

Briefly, device virtualization is done at layer 2, through MAC addresses and VLANs, and we'll create VNICs on top of this type of virtualization. L3/L4 hardware classification is still important, but it will happen only within a receive ring group, and a FLOW is created with L3/L4 classification rules. We're not dropping any helpful feature, but organizing the resources in a natural way.

>>> 5)
>>> Can you group rings of different physical NICs? If yes, what is the
>>> interface for the same?
>> No. You may check out the ring group definition in the document. :)
>
> So is it a hardware design limitation or a driver implementation limitation?
> btw, I don't see the definition, can you help me with a page number?

Page 3.

In this design, the ring group is mapped to hardware resources, so it's impossible to group rings from different devices.

But as mentioned by Kais, when creating a VNIC on top of aggr, the aggr driver may re-group those hardware groups for its client, say a VNIC. From the VNIC's point of view, the ring group exposed by the aggr driver may include rings from different hardware instances.

We need to discuss this further, as Kais suggested.

Thanks,

Roamer
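The split Roamer describes in the message above can be sketched as a two-level lookup: the destination MAC address (L2) selects a ring group, and an optional L3/L4 flow rule then selects a ring inside that group only. This is a minimal illustration in C under stated assumptions; every type and function name in it (`ring_group_t`, `flow_rule_t`, `classify_l2`, `classify_l3l4`) is hypothetical and is not taken from the design document or from GLDv3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RULES 4

/* Hypothetical L3/L4 rule: match an IP protocol and a destination port. */
typedef struct flow_rule {
    uint8_t  proto;   /* e.g. 17 for UDP */
    uint16_t dport;   /* destination port, host byte order */
    int      ring;    /* index of the target ring inside the owning group */
} flow_rule_t;

/* Hypothetical ring group: one MAC address, some rings, some flow rules. */
typedef struct ring_group {
    uint8_t     mac[6];
    int         nrings;
    int         nrules;
    flow_rule_t rules[MAX_RULES];
} ring_group_t;

/* Level 1 (L2): pick the group by destination MAC; -1 if nothing matches. */
int classify_l2(const ring_group_t *groups, int ngroups,
                const uint8_t dst_mac[6])
{
    assert(groups != NULL);
    for (int g = 0; g < ngroups; g++) {
        if (memcmp(groups[g].mac, dst_mac, 6) == 0)
            return g;
    }
    return -1;
}

/*
 * Level 2 (L3/L4): pick a ring, but only among the rings of one group.
 * Traffic that matches no rule falls back to ring 0 of that group.
 */
int classify_l3l4(const ring_group_t *grp, uint8_t proto, uint16_t dport)
{
    assert(grp != NULL);
    for (int r = 0; r < grp->nrules; r++) {
        if (grp->rules[r].proto == proto && grp->rules[r].dport == dport)
            return grp->rules[r].ring;
    }
    return 0;
}
```

In this sketch a flow rule can never move a packet into another group's rings, which is the property that keeps one VNIC's L3/L4 flows from touching another VNIC's hardware resources.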
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-03 18:16 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Yunsong (Roamer) Lu wrote:
> deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
>
>> I recall some conversations about whether rings belonging to the same
>> group have the same interrupt number, so that blanking that shared
>> interrupt can emulate polling of a group.
>> My original question also asked whether you can mask it in the framework,
>> so that even if individual rings generate different interrupts, they can
>> be delivered as some common virtual interrupt to the upper layer,
>> so polling on a ring group could work.
>> Because if the driver implementation doesn't support it, can we implement
>> it in the mac framework?
>
> I don't understand why you want to do this.

Explained below.

> If you look at the document, you may see that the per-group interrupt
> is nice-to-have feature *ONLY* when there are not enough per-ring
> interrupts.

My point is that a single per-group interrupt may not be just "nice-to-have"; it has potential and use cases that should evolve it into an "important" feature.

> What's your case that need to poll a ring group even though there are
> per-ring interrupts implemented?

The particular case for which I see a per-group interrupt being useful is this: ring groups are used for load balancing in heavy network conditions, so all the rings of a group are buffering the same packet flow. To make interrupt blanking work, so that this heavy network traffic doesn't keep thrashing interrupts on the bound CPUs, it would be good to have a single interrupt across all rings of the same group. Otherwise you will have to blank the individual interrupts of each ring of the group, which seems to me to be overhead.

If a single interrupt across multiple rings of the same group is supported in hardware and abstracted from the L2 framework, then no virtual masking would be needed, and it would be cleaner and simpler. But if there is no such support, I was wondering if there can be something like virtual masking, i.e. either

1) the way I had pictured and explained in the first paragraph of my earlier email, or
2) one in which, although interrupts 1, 2, 3 and 4 belong to rings of the same group, blanking 4 would always blank 1, 2 and 3 as well,

unless you guys have some other ways in mind.

>> In the absence of multiple RX/TX rings support in hardware,
>> does crossbow emulate RX/TX multiple rings/groups behavior?
>> Or are there plans to extend it in future phases?
>> Can SRS (soft ring set) help us to achieve that in future?
>
> Why the framework need to pretend having multiple hardware ring and/or
> ring group? What's the behavior the framework need to emulate?

Why does a network interface card care to have multiple rings or groups? That is the same reason why, in the absence of such a feature in hardware, we would want to emulate it in software. AFAIK, current crossbow has soft rings for the case where there are no multiple RX rings/channels/FIFOs on which to program the rule/policy. That is the same reason you would want to emulate such behavior (ring groups) in the mac framework, and I wondered if SRS does it.

>> aren't they just slightly different in just one structure member,
>> i.e. mr_poll and mr_send?
>
> Firstly, to combine these two data structures is unnatural for now.
> Two function prototypes are different, so I don't know what's the
> necessity to let them share one functional pointer.

I leave it up to you, but I still think you may want to remove redundancy in the code. The members of these two structures are the same except for the last member, unless I am missing something. If I am, please correct me.

>>> The ring grouping isolates VNICs from the hardware layer, and the
>>> virtualization is done at layer 2 instead of L3/L4.
>>
>> Having user-programmed Rx rings (flows) with L3/L4 classification
>> on top of a VNIC/NIC would differentiate crossbow features from
>> other vendors' crossbow-like virtualization capabilities.
>>
>> It would be especially useful when you want to run multiple
>> applications, such as a streaming server (RTP) and a DHCP server (UDP),
>> on the same virtual machine with a VNIC,
>> so that no single app can monopolize the given VNIC's already capped
>> bandwidth.
>>
>> rx-ring grouping gives many benefits and I am not sure if we want to
>> limit it to just VNIC/L2 classification.
>>
>> So I am trying to think out loud here and wonder about its further
>> exploitation in line with VNICs.
>>
>>> Ring group will be the basic entity of virtualizing hardware resources.
>>
>> Programming a ring group for L3/L4 classification can help the
>> performance of network virtual machines, I suppose.
>
> Looks the document is not clear enough on explaining this. We'll
> update it accordingly.

ok.

> Briefly, the device virtualization is done at layer 2, through MAC
> addresses and VLAN, and we'll create VNIC on top of this type of
> virtualization. L3/L4 hardware classifications are still important, but
> that will happen only within a receive ring group, and FLOW is created
> with L3/L4 classification rules. We're not dropping any helpful
> feature, but organizing the resources in a natural way.

ok.

>>>> 5)
>>>> Can you group rings of different physical NICs? If yes, what is the
>>>> interface for the same?
>>>
>>> No. You may check out the ring group definition in the document. :)
>>
>> So is it a hardware design limitation or a driver implementation limitation?
>> btw, I don't see the definition, can you help me with a page number?
>
> Page 3.

Thanks.

> In this design, the ring group is mapped to hardware resources, so
> it's impossible to group rings from different devices.

I don't know if it's impossible, but as network interfaces/devices and their driver implementations evolve, there might be one in the making. Wishful thinking ;-)

> But as mentioned by Kais, when creating VNIC on top of aggr, the aggr
> driver may re-group those hardware groups for its client, say VNIC.
> From the VNIC's point of view, the ring group exposed by aggr driver
> may include rings from different hardware instances.

Yeah, he explained that use case in his email.

> We need to further discuss this as Kais suggested.

ok,

-Deepti

> Thanks,
>
> Roamer
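The disagreement above is about whether the rx and tx ring info structures should be merged. As a rough sketch of the trade-off in C: the thread names the members (`mr_driver`, `mr_intr`, `mr_start`, `mr_stop`, `mr_poll`, `mr_send`) but not their types, so every prototype below is an assumption made only for illustration, as are `pkt_t`, `ring_dir_t` and the `demo_*` stubs. If the two callbacks really have different prototypes, a merged structure needs a union plus a direction tag, because a single plain function pointer cannot safely hold both.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a packet chain; not a real kernel type. */
typedef struct pkt { int len; } pkt_t;
typedef void *ring_handle_t;

/* Assumed prototypes: the thread only says that they differ. */
typedef pkt_t *(*ring_poll_fn)(ring_handle_t ring, int bytes_to_pickup);
typedef pkt_t *(*ring_send_fn)(ring_handle_t ring, pkt_t *chain);

typedef enum { RING_RX, RING_TX } ring_dir_t;

/* One merged info structure: common members plus a tagged union. */
typedef struct ring_info {
    ring_handle_t mr_driver;            /* opaque driver handle */
    void         *mr_intr;              /* interrupt info; real type unknown */
    int         (*mr_start)(ring_handle_t);
    void        (*mr_stop)(ring_handle_t);
    ring_dir_t    mr_dir;               /* says which union member is valid */
    union {
        ring_poll_fn mr_poll;           /* valid when mr_dir == RING_RX */
        ring_send_fn mr_send;           /* valid when mr_dir == RING_TX */
    } mr_cb;
} ring_info_t;

/* Every caller has to check the tag before using the shared slot. */
pkt_t *ring_poll(const ring_info_t *ri, int bytes)
{
    assert(ri->mr_dir == RING_RX);
    return ri->mr_cb.mr_poll(ri->mr_driver, bytes);
}

pkt_t *ring_send(const ring_info_t *ri, pkt_t *chain)
{
    assert(ri->mr_dir == RING_TX);
    return ri->mr_cb.mr_send(ri->mr_driver, chain);
}

/* Demo stubs: poll finds nothing, send hands the whole chain back unsent. */
pkt_t *demo_poll(ring_handle_t ring, int bytes_to_pickup)
{
    (void)ring;
    (void)bytes_to_pickup;
    return NULL;
}

pkt_t *demo_send(ring_handle_t ring, pkt_t *chain)
{
    (void)ring;
    return chain;
}
```

The merged form removes the duplicated common members, which is Deepti's point, at the cost of a runtime tag check where two separate structures would get a compile-time type check, which is closer to Roamer's objection.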
Yunsong (Roamer) Lu
2007-Oct-03 18:52 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
>> What's your case that need to poll a ring group even though there are
>> per-ring interrupts implemented?
>
> The particular case for which I see a per-group interrupt being useful
> is this:
>
> ring groups are used for load balancing in heavy network conditions,
> so all the rings of a group are buffering the same packet flow.
> To make interrupt blanking work, so that this heavy network traffic
> doesn't keep thrashing interrupts on the bound CPUs,
> it would be good to have a single interrupt across all rings of the same
> group.
> Otherwise you will have to blank the individual interrupts of each ring
> of the group, which seems to me to be overhead.
> If a single interrupt across multiple rings of the same group is supported
> in hardware and abstracted from the L2 framework,
> then no virtual masking would be needed, and it would be cleaner and simpler.
> But if there is no such support, I was wondering if there can be something
> like virtual masking,
> i.e. either
> 1) the way I had pictured and explained in the first paragraph of my
> earlier email,
> or
> 2) one in which, although interrupts 1, 2, 3 and 4 belong to rings of the
> same group, blanking 4 would always blank 1, 2 and 3 as well,
>
> unless you guys have some other ways in mind.

You may check out the polling design published by Sunay early this year:
http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
Your question reminds us to add this reference to the new document. Thanks a lot!

>> In this design, the ring group is mapped to hardware resources, so
>> it's impossible to group rings from different devices.
>
> I don't know if it's impossible, but as network interfaces/devices and
> their driver implementations
> evolve, there might be one in the making. Wishful thinking ;-)

I cannot agree with you, but your comments are very helpful!

Thanks,

Roamer
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-03 19:17 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Yunsong (Roamer) Lu wrote:
> deepti dhokte - Sun Microsystems - Menlo Park United States wrote:
>
>>> What's your case that need to poll a ring group even though there
>>> are per-ring interrupts implemented?
>>
>> The particular case for which I see a per-group interrupt being useful
>> is this:
>>
>> ring groups are used for load balancing in heavy network conditions,
>> so all the rings of a group are buffering the same packet flow.
>> To make interrupt blanking work, so that this heavy network traffic
>> doesn't keep thrashing interrupts on the bound CPUs,
>> it would be good to have a single interrupt across all rings of the
>> same group.
>> Otherwise you will have to blank the individual interrupts of each ring
>> of the group, which seems to me to be overhead.
>> If a single interrupt across multiple rings of the same group is
>> supported in hardware and abstracted from the L2 framework,
>> then no virtual masking would be needed, and it would be cleaner and
>> simpler.
>> But if there is no such support, I was wondering if there can be
>> something like virtual masking,
>> i.e. either
>> 1) the way I had pictured and explained in the first paragraph of my
>> earlier email,
>> or
>> 2) one in which, although interrupts 1, 2, 3 and 4 belong to rings of
>> the same group, blanking 4 would always blank 1, 2 and 3 as well,
>>
>> unless you guys have some other ways in mind.
>
> You may check out the polling design published by Sunay in early this
> year:
> http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt

The linked document is a very neat, complete crossbow guide.

But we are reviewing your design document for hardware resource management, and

1) our discussion was about using the same interrupt number across a ring group and how that is useful;
2) if ring groups aren't supported in hardware, are you planning to emulate them in software? If we already do it, how?

Pointing to this link/doc is not relevant to what we were discussing. If 1) and/or 2) above, or any equivalent that achieves the same goal, is not in the current scope of crossbow, I can understand that.

> Your question reminds us to add this reference in the new document.
> Thanks a lot!
>
>>> In this design, the ring group is mapped to hardware resources, so
>>> it's impossible to group rings from different devices.
>>
>> I don't know if it's impossible, but as network interfaces/devices and
>> their driver implementations
>> evolve, there might be one in the making. Wishful thinking ;-)
>
> I can not agree with you,

ok, it doesn't matter to me.

> but your comments are very helpful!

Thanks,
Deepti

> Thanks,
>
> Roamer
Kais Belgaied
2007-Oct-03 20:17 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Hi Deepti,

>> You may check out the polling design published by Sunay in early this
>> year:
>> http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
>
> The linked document is a very neat, complete crossbow guide.
>
> But we are reviewing your design document for hardware resource
> management, and
> 1)
> our discussion was about using the same interrupt number across a ring
> group and how that is useful.
>
> 2)
> if ring groups aren't supported in hardware, are you planning to emulate
> them in software? If we already do it, how?

The L2-level steering (classification based on MAC address or VLAN) is what determines the grouping. If the hardware supports steering to N individual rings, then the driver should register N groups with 1 ring in each. If the hardware cannot offer L2 steering at all, then there is a single group encompassing all the rings on the system. Back to the specific question: there is always at least 1 group. The software in the framework will not emulate multiple rings.

Kais.

> Pointing to this link/doc is not relevant to what we were discussing.
> If 1) and/or 2) above, or any equivalent that achieves the same goal,
> is not in the current scope of crossbow, I can understand that.
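Kais's registration rule above can be written down as a small decision function. This is only a sketch of the two cases he states (hardware that can L2-steer to each individual ring, and hardware with no L2 steering at all); hardware that steers to sets of several rings is not covered. The names `group_layout_t` and `plan_rx_groups` are made up for this example and are not part of any driver interface.

```c
#include <assert.h>

/* Hypothetical result: how a driver would lay out its rx rings. */
typedef struct group_layout {
    int ngroups;           /* number of ring groups to register */
    int rings_per_group;   /* rings placed in each group */
} group_layout_t;

/*
 * nrings:           rx rings the hardware provides (at least 1)
 * l2_steer_per_ring: nonzero if the hardware can steer by MAC/VLAN to
 *                   each individual ring, zero if it has no L2 steering
 */
group_layout_t plan_rx_groups(int nrings, int l2_steer_per_ring)
{
    group_layout_t layout;

    assert(nrings >= 1);

    if (l2_steer_per_ring) {
        /* N individually steerable rings: N groups with 1 ring each. */
        layout.ngroups = nrings;
        layout.rings_per_group = 1;
    } else {
        /* No L2 steering: a single group encompassing all the rings. */
        layout.ngroups = 1;
        layout.rings_per_group = nrings;
    }
    return layout;
}
```

Either way the function never returns fewer than one group, matching the statement that there is always at least 1 group and that the framework does not emulate extra rings in software.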
deepti dhokte - Sun Microsystems - Menlo Park United States
2007-Oct-03 21:15 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Kais Belgaied wrote:
>Hi Deepti,
>
>>>You may check out the polling design published by Sunay in early this
>>>year:
>>>http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
>>
>>The linked document is a very neat, complete crossbow guide.
>>
>>But we are reviewing your design document for hardware resource
>>management, and
>>1)
>>our discussion was about using the same interrupt number across a ring
>>group and how that is useful.
>>
>>2)
>>if ring groups aren't supported in hardware, are you planning to emulate
>>them in software? If we already do it, how?
>
>The L2-level steering (classification based on MAC address or VLAN) is
>what determines the grouping.
>If the hardware supports steering to N individual rings, then the
>driver should register N groups with 1 ring in each. If the hardware
>cannot offer L2 steering at all, then there is a single group
>encompassing all the rings on the system.
>Back to the specific question: there is always at least 1 group. The
>software in the framework will not emulate multiple rings.

This explains it well to me.

Thanks,
-Deepti

> Kais.
>
>>Pointing to this link/doc is not relevant to what we were discussing.
>>If 1) and/or 2) above, or any equivalent that achieves the same goal,
>>is not in the current scope of crossbow, I can understand that.
Paul Durrant
2007-Oct-04 12:47 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
On 28/09/2007, Yunsong (Roamer) Lu <roamer at sun.com> wrote:
> We've posted a new design document, Crossbow Hardware Resources
> Management and Virtualization, here:
> http://dlc.sun.com/osol/netvirt/downloads/docs/virtual_resources.pdf
> you're welcome to review and comment.

WRT load balancing/traffic steering: Microsoft has very specific requirements for supporting load-balancing NICs under Windows (using a Toeplitz hash and a re-programmable table), and you can bet that most NICs will be supported under Windows! Hence, perhaps you should look at the way Windows handles load balancing when deciding how you may want to do it for GLDv3.

Paul

--
Paul Durrant
http://www.linkedin.com/in/pdurrant
Yunsong (Roamer) Lu
2007-Oct-05 05:29 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Paul Durrant wrote:
> On 28/09/2007, Yunsong (Roamer) Lu <roamer at sun.com> wrote:
>> We've posted a new design document, Crossbow Hardware Resources
>> Management and Virtualization, here:
>> http://dlc.sun.com/osol/netvirt/downloads/docs/virtual_resources.pdf
>> you're welcome to review and comment.
>
> WRT load balancing/traffic steering: Microsoft has very specific
> requirements for supporting load-balancing NICs under Windows (using a
> Toeplitz hash and a re-programmable table), and you can bet that most
> NICs will be supported under Windows! Hence, perhaps you should look
> at the way Windows handles load balancing when deciding how you may
> want to do it for GLDv3.

Thank you very much for the suggestion!

We investigated some hardware and found that the current receive load balancing implementations, also known as traffic steering or receive side scaling, are very different from one another. Not only have IHVs adopted different hash algorithms, they have also implemented various hardware classification schemes, even though Microsoft suggested Toeplitz-hash-based RSS.

Instead, we prefer to leave the flexibility to hardware and drivers by defining a high-level load balancing interface; the driver may even decide not to expose the interface for changing the load balancing policy.

Ideally, the driver and hardware would implement these classification features:
 * Initially, when a group of rx rings is enabled, a default load balancing policy (we prefer 5-tuple) should be enabled. It doesn't matter which type of hash is used or whether the load is evenly balanced. The framework also doesn't care about the hash output. As long as the traffic is steered to multiple channels, we assume the performance is acceptable. Then
 * If capable, the framework may decide to change the load balancing policy, for example from socket-pair-based classification to IP-protocol-based classification, according to the traffic type. This is an optional, nice-to-have interface. Then
 * A specific classification rule can be added with higher priority, so that a dedicated hardware channel can be used exclusively for this type of flow. For example, you may want to steer all UDP/IPv4 packets to one ring for some reason. Load balancing should still happen among all the remaining rings.

Any comments?

Thanks,

Roamer
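The three stages listed above can be sketched as one ring-selection function: a high-priority rule is checked first, and everything else is hashed across the remaining rings under the current policy. This is a minimal illustration in C, not the proposed driver interface. All names (`rx_group_t`, `pkt_key_t`, `pick_ring`, `mix`) are hypothetical, the rule matches on IP protocol only, and the hash is an arbitrary placeholder, in line with the statement that the framework does not care which hash is used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 5-tuple of a received packet. */
typedef struct pkt_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
    uint8_t  proto;
} pkt_key_t;

typedef enum { LB_5TUPLE, LB_PROTO } lb_policy_t;

/* Hypothetical rx ring group with at most one dedicated-ring rule. */
typedef struct rx_group {
    int         nrings;
    lb_policy_t policy;      /* stage 2: the policy can be switched */
    int         has_rule;    /* stage 3: nonzero if a rule is installed */
    uint8_t     rule_proto;  /* protocol steered to the dedicated ring */
    int         rule_ring;   /* index of the dedicated ring */
} rx_group_t;

/* Placeholder mixing step; any reasonable hash would do here. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 16777619u;
}

int pick_ring(const rx_group_t *g, const pkt_key_t *k)
{
    uint32_t h;
    int idx;

    assert(g->nrings >= 1);

    /* Stage 3: the specific rule has priority over load balancing. */
    if (g->has_rule && k->proto == g->rule_proto)
        return g->rule_ring;

    /* Stages 1 and 2: hash according to the current policy. */
    if (g->policy == LB_PROTO) {
        h = k->proto;
    } else {
        h = 2166136261u;
        h = mix(h, k->saddr);
        h = mix(h, k->daddr);
        h = mix(h, ((uint32_t)k->sport << 16) | k->dport);
        h = mix(h, k->proto);
    }

    /* No dedicated ring to avoid: spread over every ring. */
    if (!g->has_rule || g->nrings == 1)
        return (int)(h % (uint32_t)g->nrings);

    /* Spread over the remaining rings, skipping the dedicated one. */
    idx = (int)(h % (uint32_t)(g->nrings - 1));
    return (idx >= g->rule_ring) ? idx + 1 : idx;
}
```

With four rings and a UDP rule on ring 2, for example, UDP always lands on ring 2 while all other traffic is spread over rings 0, 1 and 3.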
Paul Durrant
2007-Oct-05 09:59 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
On 05/10/2007, Yunsong (Roamer) Lu <Roamer at sun.com> wrote:
> We investigated some hardwares and found that the current receive load
> balance implementations, also known as traffic steering or receive side
> scaling, are very different. Not only different hash algorithms have
> been adopted by IHVs, but also various hardware classification scenarios
> are implemented. Even though Microsoft suggested Toeplitz hash based RSS.

Microsoft *requires* Toeplitz-hash-based RSS; it's not just a suggestion!

> Alternatively, we prefer to leave the flexibility to hardware/drivers by
> defining a high level load balance interface, even the driver may decide
> not to expose the interface changing load balance policy.
>
> Ideally, the driver and hardware may implement these classification
> features:
> * Initially, when a group of rx rings are enabled, a default load
> balance policy (we prefer 5-tuple) should be enabled. It doesn't matter
> which type of hash is used and whether the load is evenly balanced. The
> framework also doesn't care the hash output. As soon as the traffics are
> steered to multiple channels, we assume the performance is acceptable. Then
> * If capable, the framework may decide to change the load balance
> policy, like from socket-pair based classification to IP protocol based
> classification, according to the traffic type. This is an optional
> nice-to-have interface. Then
> * A specific classification rule can be added with higher priority,
> so that a dedicated hardware channel can be used exclusively for this
> type of flow. For example, you want to steer all UDP/IPv4 packets to a
> ring for some reason. The load balance should still happen among all
> rest rings.
>
> Any comments?

Microsoft requires that the hardware support an indirection table based on a Toeplitz hash using a key supplied by Windows. The hash is used to index into the table, and the entry in the table determines the CPU on which the hardware should generate the interrupt. This table is periodically updated by Windows (about every 40 seconds or so) to prevent any particular CPU from becoming overloaded.

Given, as I said, that Microsoft *requires* hardware to support this kind of RSS, you should consider it as an option for Solaris; it's a pretty safe bet that all future hardware will support it.

Paul

--
Paul Durrant
http://www.linkedin.com/in/pdurrant
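The mechanism Paul describes (a keyed Toeplitz hash feeding an indirection table of CPUs) can be sketched in C as follows. The hash is the standard bitwise Toeplitz construction; the function names `toeplitz_hash` and `rss_cpu`, the power-of-two table size, and the choice of the low-order hash bits as the table index are assumptions of this sketch. In a real NIC the input would be the packet's address/port tuple and the key would come from the operating system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: for every set bit of the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit
 * position. Key bits beyond the end of the key are treated as zero.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                       const uint8_t *data, size_t datalen)
{
    uint32_t hash = 0;
    uint32_t window;
    size_t bit = 32;   /* index of the next key bit to shift in */

    assert(keylen >= 4);
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (size_t i = 0; i < datalen; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            /* Slide the key window left by one bit. */
            window <<= 1;
            if (bit / 8 < keylen &&
                (key[bit / 8] & (0x80u >> (bit % 8))))
                window |= 1u;
            bit++;
        }
    }
    return hash;
}

/*
 * Indirection table lookup: the low-order bits of the hash select an
 * entry, and the entry names the CPU that should take the interrupt.
 * The table size is assumed to be a power of two.
 */
unsigned rss_cpu(const uint8_t *table, size_t table_size, uint32_t hash)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return table[hash & (table_size - 1)];
}
```

Rebalancing in this scheme means rewriting table entries, which moves whole hash buckets from one CPU to another without changing the key or the hash of any flow.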
Yunsong (Roamer) Lu
2007-Oct-05 22:44 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Paul Durrant wrote:
>> Ideally, the driver and hardware may implement these classification
>> features:
>> * Initially, when a group of rx rings are enabled, a default load
>> balance policy (we prefer 5-tuple) should be enabled. It doesn't matter
>> which type of hash is used and whether the load is evenly balanced. The
>> framework also doesn't care the hash output. As soon as the traffics are
>> steered to multiple channels, we assume the performance is acceptable. Then
>> * If capable, the framework may decide to change the load balance
>> policy, like from socket-pair based classification to IP protocol based
>> classification, according to the traffic type. This is an optional
>> nice-to-have interface. Then
>> * A specific classification rule can be added with higher priority,
>> so that a dedicated hardware channel can be used exclusively for this
>> type of flow. For example, you want to steer all UDP/IPv4 packets to a
>> ring for some reason. The load balance should still happen among all
>> rest rings.
>>
>> Any comments?
>
> Microsoft requires that the hardware support an indirection table
> based on a Toeplitz hash using a key supplied by Windows. The hash
> is used to index into the table, and the entry in the table
> determines the CPU on which the hardware should generate the
> interrupt. This table is periodically updated by Windows (about every
> 40 seconds or so) to prevent any particular CPU from becoming
> overloaded.

Crossbow is implementing polling threads, each of which serves a single receive ring. Consequently, we prefer to adjust the traffic distribution slightly by moving heavy flows to a dedicated ring, rather than randomly changing the hash seeds to redistribute traffic. Also, precise flow control done through the polling interface will reduce the chance that a few heavy flows use up all the CPU time.

Your advice is very helpful! We need to figure out how to leave an extensible interface so that Solaris can support this feature as soon as it becomes really popular (provided there is no legal issue).

Does this make sense?

Thanks,

Roamer

> Given, as I said, that Microsoft *requires* hardware to support this
> kind of RSS, you should consider it as an option for Solaris; it's
> a pretty safe bet that all future hardware will support it.
>
> Paul
Jason Jiang - Solaris China Team
2007-Oct-12 02:14 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Hi Roamer and all,

Here are my comments [resent from our internal aliases to sync with people here :-) ]

After a long dig into this document, I find it a very decent document. I have a few questions about it (maybe related, maybe not :-) , but I ask for help from your expertise on this).

1. Multiple Interrupts Support: When the device driver is capable of MSI-X, *how about MSI?*
2. For DRR and DRRG, is there any parameter to record this in your ring structure?
3. Is it possible that remove/add ring group functions will be required? Especially for a TX ring group, if some error happens on one ring (or group), may I remove it dynamically? For a group, I think mr_rem_ring is not enough. Is mrg_stop used for this purpose?
4. *The administrator may choose to maximize the number of groups, in order to increase the number of virtual NICs that can be built over this NIC.* If a NIC only has 1 rx ring and 1 tx ring, does it mean that only one VNIC can be built over this NIC? Or can I simply increase the number of groups to increase the number of VNICs, with every VNIC sharing the only rx/tx ring?
5. Do status bits need to be added to the mac_rx_ring_info_t/mac_rx_ring_group_info_t/mac_tx_ring_info_t structures?
6. This document is about hardware resources, but I cannot see any words about hardware checksum. Will you merge the rings capability into m_getcapab or add another item to mac_callbacks_t?
7. VNIC is introduced by Crossbow. Is it possible to use a uniform name for all NIC drivers when doing the configuration (such as ifconfig vnic0 plumb up)? Maybe it is related to Brussels. :-)

Thanks,
Jason

Yunsong (Roamer) Lu wrote:
> Hi,
> We've posted a new design document, Crossbow Hardware Resources
> Management and Virtualization, here:
> http://dlc.sun.com/osol/netvirt/downloads/docs/virtual_resources.pdf
> you're welcome to review and comment.
>
> In this document, we introduce the new driver interfaces of virtualizing
> NIC resources, like multiple rings and multiple MAC addresses, and we
> talk about the ideas about organizing various hardware implementations
> into the new framework. We're glad to know any comments from IHVs who
> are designing new NICs.
>
> What's not included in the document? The new design evolves GLDv3
> interfaces, but before finalizing those changed interfaces, like
> (*mc_tx)(), (*mc_unicast)(), (*mc_resources)(), mac_rx(), etc., we
> would like to listen to your opinions first. Also, support for PCI-SIG
> IOV will be added afterwards.
>
> Please feel free to comment and share your opinions!
>
> Thanks,
>
> Kais & Roamer
Yunsong (Roamer) Lu
2007-Oct-12 06:52 UTC
[crossbow-discuss] Crossbow Hardware Resources Management Design
Jason Jiang - Solaris China Team wrote:
> 1. Multiple Interrupts Support: When the device driver is capable of
> MSI-X, *how about MSI?*

The flexibility of MSI-X, such as its larger maximum number of interrupt vectors and per-vector masking, is what Crossbow is interested in. Of course, MSI is also a nice feature when MSI-X is not implemented. Almost all new 10GbE NICs support MSI-X. The Crossbow interface doesn't reject MSI, but it prefers MSI-X.

> 2. For DRR and DRRG, is there any parameter to record this in your ring
> structure?

Good question! We think the simplest way is to register the default ring as the first one. This works for current hardware implementations. Some hardware allows software to set the default ring, but other hardware doesn't. We're considering a common interface that can handle both while staying simple. :)

> 3. Is it possible that remove/add ring group functions were required?
> Especially for TX ring group, if some error happens on one ring (or
> group), may I remove it dynamically?

Yes. The new notification interface will address this. When a hardware failure happens on a ring, the driver may call the new interface to report a status change, like link_up/down.

> For group, I think mr_rem_ring is not enough. Is the mrg_stop used for
> this purpose?

All those optional interfaces are checked and called. To be safe, the framework may call mrg_stop(), mr_rem_ring(), and mrg_start() in that order to complete the regrouping. But a driver may choose to provide only mr_rem_ring() and handle all conflicts itself. A driver need not support regrouping if the hardware is not capable of it.

> 4. *The administrator may choose to maximize the number of groups, in
> order to increase the number of virtual NICs that can be built over this
> NIC.*
> If a NIC only has 1 rx ring and 1 tx ring, does it mean that only one
> VNIC can be built over this NIC? Or I can simply increase the number of
> groups to increase the number of VNICs but every VNIC share the only 1
> rx/tx ring?

Yes, it's possible. But two things need to be balanced:
  * Throughput of a single VNIC
  * Virtualization and independence
Even if the hardware supports it, a group with only one rx ring cannot reach line rate on 10GbE. Also, it's hard to tell how many VNICs system users will need. I think we're still looking for the balance point.

> 5. Does it need to add status bit into
> mac_rx_ring_info_t/mac_rx_ring_group_info_t/mac_tx_ring_info_t structures?

Probably not. In the MAC layer, ring statuses are recorded and updated as needed. Meanwhile, we're trying to simplify the interface to the driver. For example, adding a status to indicate whether a ring is the default ring was proposed, but we didn't think it was necessary.

> 6. For this document is about hardware resources, but I can not see any
> words about hardware checksum. Will you merge the rings capability into
> m_getcapab or add another item into mac_callbacks_t?

It should be part of this document, but we left it out to keep the document focused on resource virtualization. Eventually, hardware checksum could be covered by the ring capability; in fact, rx hcksum and tx hcksum could be split into two parts. Anyway, this needs more investigation and discussion.

> 7. For VNIC is introduced by Crossbow. Is it possible that we can use a
> uniform name for all the NICs drivers when doing the configuration(,
> such as ifconfig vnic0 plumb up)? Maybe it is related to Brussels. :-)

This has been addressed by Vanity Naming in the ClearView project; you can find more information at http://www.opensolaris.org/os/project/clearview/uv/

Thanks,

Roamer
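
The regrouping order described in the answer to question 3 can be modelled in a few lines. This is only a sketch of the control flow, written in Python rather than as real GLDv3 driver code: the callback names mrg_stop, mr_rem_ring and mrg_start come from the thread, but their signatures, the RingGroup class, framework_remove_ring() and the drv_* functions are assumptions made up for illustration. It shows the framework checking each optional entry point before calling it, and treating a missing mr_rem_ring as "regrouping not supported".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RingGroup:
    """Hypothetical stand-in for what a driver registers for one ring group."""
    rings: List[int]
    # Optional driver entry points; None means the driver did not provide one.
    mr_rem_ring: Optional[Callable[["RingGroup", int], None]] = None
    mrg_stop: Optional[Callable[["RingGroup"], None]] = None
    mrg_start: Optional[Callable[["RingGroup"], None]] = None
    running: bool = True
    log: List[str] = field(default_factory=list)


def framework_remove_ring(group: RingGroup, ring: int) -> bool:
    """Remove `ring` from `group`; return False if that cannot be done."""
    if group.mr_rem_ring is None or ring not in group.rings:
        return False              # regrouping unsupported, or ring not in group
    if group.mrg_stop is not None:
        group.mrg_stop(group)     # quiesce the group first, if the driver allows
    group.mr_rem_ring(group, ring)
    if group.mrg_start is not None:
        group.mrg_start(group)    # resume the group after the change
    return True


# Hypothetical driver callbacks that just record what was called.
def drv_stop(group: RingGroup) -> None:
    group.running = False
    group.log.append("mrg_stop")


def drv_rem_ring(group: RingGroup, ring: int) -> None:
    group.rings.remove(ring)
    group.log.append(f"mr_rem_ring({ring})")


def drv_start(group: RingGroup) -> None:
    group.running = True
    group.log.append("mrg_start")


if __name__ == "__main__":
    group = RingGroup(rings=[0, 1, 2], mr_rem_ring=drv_rem_ring,
                      mrg_stop=drv_stop, mrg_start=drv_start)
    framework_remove_ring(group, 1)
    print(group.log)
```

A driver that supplies only mr_rem_ring gets a single call and must resolve any conflicts itself, which matches the second option described above; a driver that supplies none of the three simply opts out of regrouping.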