Hi Nicolas

Thanks for the response ..

> In general it seems that the issues being discussed are related to either (1) the port of the existing LDOM functionality to the updated MAC client interfaces introduced by Crossbow, or (2) the addition of new LDOM functionality based on these new APIs.

The primary goal for us is to address the first case - the port of the existing LDom functionality already released to customers. But there is new LDoms functionality on the horizon that must be addressed as well. The new features are expected to show up in the same time frame as Crossbow.

> It's critical that we resolve the issues falling into the first category as soon as possible. This will allow the port of LDOM to the new APIs to start as soon as possible, maintain LDOM functionality, and allow Crossbow to meet its integration schedule. For the class of issues related to future LDOM functionality, this discussion will have to take place with a clear understanding of the new functional requirements.

We should absolutely be able to do the above changes soon. As I said above, the only thing that falls slightly outside the above definition is the set of changes we are currently working on. These are not yet in the OpenSolaris tree. They may rely on either existing or new Crossbow interfaces, which we will need to resolve.

> So given my latest response below, what are in your opinion the remaining issues which are directly related to the port of the existing LDOM functionality to the new MAC API?

A quick summary of the issues that still need resolving for the port:

- How does the ring-to-CPU mapping change when DR happens?
- How do multiple MAC addresses assigned to a single client correspond to the rings owned by the client?
- Usage of HW MAC address slots in the NIC, and automatic switching to layer-2 filtering and promiscuous mode.
- Separation of incoming traffic and its association with Rx rings.
- Tx ring allocation, its relation to Rx rings, and the level of parallelism.

See more questions/comments below.

<..snip..>

>>>> Q1.3) Are there any other flags other than the following ones?
>>>>
>>>>     MAC_OPEN_FLAGS_FORCE_MULTI_RINGS
>>>>     MAC_OPEN_FLAGS_FORCE_ONE_RING
>>>
>>> No.
>>
>> Is there a reason this is tied to hardware rings? We would like the mac client open request to be extended so that it can get either all software rings, a mix of hardware and software rings, or all HW rings, in order to match the number of cpus specified in the client_open call. A flag can be specified for this ..
>
> Why do you want to do this? What is the functional requirement here?

What I initially did not understand is the association of hardware rings with the parallelism requested. I think your explanation below on when SW rings are created and when HW rings are used clarifies this issue.

>> Also, according to the explanation in the doc at page 38, there is also a case where no flags are specified. It seems like, if no flags are specified, then it will attempt to reserve one hardware ring. It seems not to fail even if such a reservation fails, but this is not clearly specified.
>
> If you don't specify the ONE_RING flag, and a hardware ring cannot be reserved for the MAC client, then the MAC client will share the default ring with other MAC clients.

In the current design, if there are N rings and a client open is done, and it requests a HW ring, it will get one assigned from the N-1 rings. If it does not request one, it will still be mapped to a HW ring (if available), else be assigned a SW ring fanned out from the default HW ring - correct?

>>>> - Is there a way to force a software ring?
>>>
>>> Do you mean not assign a hardware ring? I think this is something we could add, yes.
>>
>> This is related to the above. Can you add a flag that we can use to indicate that a client wants to use a single ring or multiple rings, but does not force hardware rings? That way, even when the underlying device does not have enough hardware rings, a client can get a soft ring per CPU.
>
> How is that different than the currently documented behavior?

This is OK. I think the documentation is clear.

>> The above comment applies to this one also. The behavior without any flags seems to be to attempt to reserve one h/w ring. What is the failure case?
>
> Without any flag, we try to allocate N hardware rings. If that fails, then we try to allocate 1 hardware ring and do fanout to N soft rings; if that fails, then we share the default hardware ring and do fanout to N soft rings.

This is the part that I was missing. So when a client requests NCPUS it does get N rings - either all SW or all HW rings. In the case of SW rings it might be a fanout from an allocated HW ring or from the default HW ring. This is clear from chapter 4.x ..
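Just so we are talking about the same thing, here is the fallback as I understand it, in pseudocode (the helper names are invented for illustration; this is not the actual MAC layer code):

    /*
     * Sketch of the RX ring allocation fallback described above, for
     * a client asking for N-way parallelism.
     */
    static void
    alloc_rx_rings(mac_client_t *mcip, uint_t n)
    {
            if (reserve_hw_rings(mcip, n) == 0)
                    return;                 /* N dedicated HW rings */

            if (reserve_hw_rings(mcip, 1) == 0) {
                    /* 1 dedicated HW ring, fanned out to N soft rings */
                    create_soft_rings(mcip, n);
                    return;
            }

            /* share the default HW ring, fanned out to N soft rings */
            share_default_ring(mcip);
            create_soft_rings(mcip, n);
    }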
A couple more questions:

- If a NIC has only 1 HW ring, this will essentially become the default HW ring. All other requests are fanned out from this HW ring to SW rings?
- If a NIC has N HW rings, only (N-1) rings are available for allotment; 1 is always reserved as the default HW ring?
- If a NIC has 2 free HW rings, and a client requests 3 rings, the mapping today will be 1 HW ring to 3 SW rings, correct? This will still leave the other HW ring free?

>>> The first one is correct. If mbc_cpus is non-NULL, the MAC layer will assign the CPUs provided by the caller.
>>
>> When mbc_cpus is NULL, what determines how many CPUs and hence the number of rings available to this client?
>
> mbc_ncpus.

Can you document this? It is not clear from the doc that mbc_ncpus still controls the ring allotment and the degree of parallelism even when mbc_cpus is NULL .. Can we set the ncpus value to a number greater than the actual number of cpus in the domain? Will the MAC layer create the requested number of rings to match ncpus, or will it limit the rings to the number of actual cpus in the domain?
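For the record, this is how I understand the two cases (a sketch against the draft interfaces in Crossbow-virt.pdf; the exact mac_client_open() argument list is assumed here and may differ):

    mac_handle_t            mh;     /* assumed: handle for the NIC */
    mac_client_handle_t     mch;
    mac_bind_cpu_t          mbc;
    processorid_t           cpus[4] = { 0, 1, 2, 3 };
    int                     err;

    /* Case 1: let the MAC layer pick the CPUs; 4-way parallelism. */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = NULL;

    /* Case 2: bind explicitly to the four CPUs above (no duplicates). */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = cpus;

    err = mac_client_open(mh, &mch, "vnet0", &mbc, 0 /* no flags */);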
>>>> Q1.6) What is the relationship between Unicast addresses (multiple unicast set via mac_unicast_add()), Rings and CPUs?
>>>>
>>>>     - Is there a 1:1 relation between a unicast address and a ring?
>>>>     - Is there a 1:1 relation between a ring and a CPU?
>>>
>>> Neither. The MAC addresses will share the same rings and CPUs.
>>
>> But since you are allowing multiple mac addresses to be associated with a client, can we add support as part of the unicast_add call to indicate that each of these addresses should be associated with a ring (either HW or SW)?
>
> No, each client is associated with a group of hardware rings or soft rings. Each group or set of rings corresponds to a set of unicast MAC addresses. The bandwidth limits are set on a per MAC client basis. This maps to how hardware NICs do their classification and fanout.
>
> If you want a separate set of rings for different MAC addresses, then you create a new MAC client.

If this is the case, what is the real value in being able to assign multiple addresses to a client? Especially when a single client has multiple MAC addresses, coalescing the pkts into a single stream has less benefit than separating the traffic for each address onto its own ring. Since a client has many rings and addresses, being able to treat these as something other than a group of addresses associated with a group of rings will be useful.

For instance on N2-NIU, if you assign an RDC group to a mac_client, this group can still contain one or more rings. The group can also be assigned multiple MAC addresses. Will the number of groups limit the mac clients that can be created for the specific device? In that case we will want the ability to have traffic from separate MAC addresses spread across the rings in this mac_client.

>>>> - The Rings and CPUs are tightly coupled in this interface. How can we allocate multiple rings even when there is one CPU (or a smaller number of CPUs)?
>>>
>>> You don't allocate rings explicitly, you express a level of parallelism instead; the framework distributes the hardware rings transparently.
>>
>> But the only way we can control this parallelism is by specifying the number of CPUs in the domain. In a system capable of adding and removing CPUs dynamically, we might want to change the parallelism level too. The current APIs don't allow changing this. We will need a way to specify this as an extension to client_open or via a new API call.
>
> So you want an API which allows you to change the actual mac_bind_cpu_t for a client which has been already opened? I think we can do that.

Exactly. As a result it should also allocate more rings to correspond to the current set of CPUs. The reverse should also be true. We should similarly be able to reduce mbc_ncpus when CPUs are removed from the system.
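Something along these lines is what we need (the name and signature are purely hypothetical, since this API is still to be defined):

    /*
     * Hypothetical call to re-bind an already-open client when CPUs
     * are DR'ed in or out; nothing like this exists today.
     */
    int mac_client_bind_cpus(mac_client_handle_t mch, mac_bind_cpu_t *mbc);

    /* e.g. after CPUs are added, grow from 4-way to 8-way fanout */
    mbc.mbc_ncpus = 8;
    mbc.mbc_cpus = NULL;
    err = mac_client_bind_cpus(mch, &mbc);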
<..snip..>

>> Also, in terms of parallelism, is this specified by the number of CPUs or by unique CPU IDs in the array? What happens if I specify ncpus where all IDs are the same - do I get ncpus HW rings if they are available? Also, can we then change the ring-to-CPU mapping when more CPUs are added/removed to/from the domain?
>
> There should be no duplicate CPU ids in the array.

Will this be checked and an error returned from client_open? Can you also note in the doc that this is an error?

<..snip..>

>>>> Q1.7) How is the binding of CPUs via mac_bind_cpus_t co-ordinated with CPU DR (on the platforms that support it)?
>>>
>>> The MAC layer will be notified of the removal of the CPU and will stop using it for its worker threads and interrupts.
>>
>> That is purely error handling. We need the ability to use more CPUs and improve the level of parallelism when CPUs are added. The reverse is true when CPUs are removed. When the MAC layer is notified about CPUs going away, does it remove the rings associated with those CPUs?
>
> I was not talking specifically about error handling. If the MAC layer bound a ring worker thread or interrupt to a CPU and that CPU is going away, the MAC layer will move that thread or interrupt to a different CPU.

So if a client_open was done with only one CPU in the mbc_cpus array and this CPU is going away, the alternate CPU will be picked in the same manner as if the client_open had been done with mbc_cpus=NULL.

> The API discussed in Q1.6 above would allow a MAC client to increase the number of CPUs if it detects that CPUs were added to the system.

Does this only allow specifying an increased CPU count, or does it also allow the client to specify the CPUs to use for the mapping when more CPUs are added?

>>>> Q1.10) Can the mac client interface be extended to support creating a client based on ether_type? This is required for mac clients like fibre channel over ethernet.
>>>
>>> No, each MAC client corresponds to a MAC level entity which is defined by its MAC address. Multiple ether types can be supported on top of a MAC client.
>>
>> Devices like the Niagara2 NIU allow classification of packets using parameters like the ether_type. How can a mac_client take advantage of such functionality?
>
> The fact that a particular hardware implementation can do classification on a specific header field of a packet doesn't necessarily mean that a MAC client needs to be associated with that field.
>
> Today the SAP demultiplexing is done by DLS on top of MAC clients. At some point in the future we may make use of hardware classification to offload that demultiplexing, but that can be done at a level above the MAC layer, maintaining the separation between MAC clients and what defines them (MAC addresses and VLANs), and SAP demultiplexing.

Agreed, that makes sense for SAP demultiplexing. In the near future, opening clients based on ether_type will be important, particularly for FCoE. Interfaces this time next year will be supporting FCoE, and the Leadville stack will need to open a client based on the FCoE ether_type.

>>>> Q2.2) Is there an impact to the multiaddress_capab_t.maddr_add()/maddr_remove() interfaces? Are these being obsoleted or going away?
>>>
>>> The capability will stay, and the framework will continue to use that capability to query and control the allocation of MAC address slots. However that interface is not intended to be used by drivers, which should use the MAC client interfaces instead.
>>
>> OK.
>
> Since my last reply Kais and Roamer have been working on the design for the new driver interface. Their proposal removes the multiple MAC address capability as it is known today. You should read their design document, which is available at http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf

Thanks. I saw the email too and am reviewing the doc now ..

>>>> Q2.3) In a system with many domains (aka LDoms) with virtual network devices, a large number of layer-2 addresses is required; this will exhaust the h/w slots available on most standard NICs. How can a client take advantage of layer-2 filtering provided by NICs like N2-NIU/Neptune? Specifically, this will help in avoiding programming the device into PROMISCuous mode etc. Currently there are no interfaces that seem to provide such an ability.
>>>
>>> Yes, this is a situation we are aware of. We've talked on this list about having multiple VNICs sharing the same MAC address, and identified by their IP address instead. However this needs to be scoped and defined further before we can commit on providing that functionality.
>>
>> The current APIs only allow adding as many addresses as the number of slots available. Beyond this it will put the adapter in promisc mode. Instead, can you add the capability to specify when to use a filter and when to take up a slot in the HW?
>
> Do you mean that you want to be able to specify that a mac_unicast_add() should put the NIC in promiscuous mode even though there are MAC address slots available? What is the use case for this?

No, that does not make any sense; I am not asking for that. The number of mac addresses that can be added across all mac clients is restricted to the total number of HW slots in the NIC - correct? If this is not the case, does the MAC layer put the card in promisc mode and filter the MAC addresses in SW? In the case of HW that allows layer-2 filtering, is there a way the MAC layer takes advantage of this instead of putting the NIC in promisc mode, especially when we run out of HW slots on the NIC?

<..snip..>

>>>> Q2.7) How are the multiple addresses per client maintained? Is it done in the MAC layer, or does it bypass the MAC layer and get passed to the h/w directly?
>>>
>>> Since the action of reserving the MAC address is triggered by a call to the MAC layer, the MAC layer cannot be bypassed. The MAC layer will use the multiple MAC address capability exposed by the driver to reserve a new MAC address slot.
>>
>> What if the driver does not expose that capability? Will the unicast_add call fail? Is the MAC layer essentially reflecting the capability of the underlying hardware, or does it provide the ability to have multiple addresses irrespective of whether the HW has multiple slots or not?
>
> The request will still succeed if the number of MAC address slots is exhausted, or if the underlying NIC doesn't support the multiple MAC address capability. However, in these cases the MAC layer will transparently put the NIC in promiscuous mode in order to receive traffic for that new MAC unicast address.

Can we take advantage of other HW capabilities like address filtering? See the comment above wrt Q2.3. Also, there are cases where we don't want to switch to promisc mode automatically. Can we add a flag to the unicast_add call and get an error instead of the automatic switch? Since there is an API for forcing promisc mode, we can explicitly request promisc mode using that API when needed.
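Concretely, something like this (the flag name is invented; the point is just to get an error back rather than an implicit switch to promiscuous mode):

    /*
     * Hypothetical flag for mac_unicast_add(); today the MAC layer
     * silently falls back to promiscuous mode when HW slots run out.
     * The argument list shown here is abbreviated.
     */
    err = mac_unicast_add(mch, addr, MAC_UNICAST_HW_ONLY /* invented */);
    if (err == ENOSPC) {
            /* out of HW address slots; the client decides what to do */
    }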
>>>> Q2.8) Can an unlimited number of mac addresses be assigned to a MAC client? What are the software/hardware features that limit this?
>>>
>>> Memory that can be allocated by the kernel.
>>
>> So even if the underlying device runs out of slots, the MAC layer will maintain all the addresses associated with that client. How does it then manage and associate these addresses with the rings allocated for this client? What does it do in both software and hardware to filter the addresses for this client? Also, which addresses get HW slots and which don't? And if you run out of slots, does the HW go into promisc mode?
>
> Each MAC client is associated with a group of rings. Each group of rings is therefore associated with a set of MAC addresses. If a client needs to be associated with more than one MAC address, then the corresponding group needs to be associated with the same set of addresses. If the hardware runs out of MAC addresses, then the NIC is put in promiscuous mode. The allocation of slots is on a first come, first served basis.

So HW slots are global across all MAC clients. Since allocation is FCFS, one client can potentially consume all HW slots? Also, since transitioning the NIC to promisc mode has an impact on all clients, I think the mac layer should try to do slightly better than FCFS and do something like fair-share, so that it does not give one client all the HW slots. Also, add a flag to prevent automatic switching to promisc mode.
>>>> 3) Rings related:
>>>> (Crossbow-virt.pdf Section 5.3 Pg 43)
>>>> mac_ring_t *mac_rings_list_get(mac_client_handle_t mch, uint_t nrings);
>>>> void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>>> uint16_t mac_ring_get_flags(mac_ring_t ring);
>>>>
>>>> QUESTIONS:
>>>>
>>>> Q3.1) All of these interfaces are now categorized as project-private API. What motivated this change? These interfaces need to be more open.
>>>
>>> The MAC layer will do the allocation of hardware resources to the various MAC clients and their flows. Instead of having each MAC client manage its own set of resources, the resources are allocated to MAC clients based on their needs, for example the degree of parallelism expressed through mac_client_open(). If you have specific functional requirements that are not satisfied by the current document, please list them.
>>
>> Currently rings are hidden resources entirely managed by the mac layer, and clients have no visibility. All the client gets to do is request a degree of parallelism. Providing APIs that allow clients to see how rings were allocated will be useful.
>
> Why? What is the functional requirement?

A client otherwise does not know whether its parallelism request is met using HW rings or SW rings. HW is obviously better than SW. In the case where it gets the latter, it might choose to reduce the degree of parallelism so that it gets all HW rings. Having said that, since the current APIs allow for requesting only HW rings, we can always try HW first and then ask for SW rings only if the first attempt fails and the client is OK with SW rings. Some visibility into this in the future will positively help with optimizations.

>>>> Q3.2) mac_rings_list_get() is only for h/w rings; is there an equivalent interface to obtain s/w ring information? Or can this interface be extended to return both h/w and s/w ring information?
>>>
>>> The interface will evolve to provide that information, but it will remain project private. It is provided here FYI but will change in future revisions of the document.
>>
>> So the expectation is that the ring APIs should not be used by clients, and rings are an internal MAC layer resource managed by it?
>
> Yes, the MAC layer does the allocation of resources to MAC clients and their flows.

Some visibility into this will help in both perf monitoring and policy correction. Instead of looking at this from a single OS instance point of view, if we look at it from the perspective of different OSs, having more info can help better tune for varying traffic loads. Can some of this be made available via some kind of stats-like interface?

>>>> Q3.3) Are the mac_resource_set() and mac_resources() interfaces going away?
>>>
>>> Yes, they will be replaced by different interfaces. But note that they are already project private in Nevada and were not supposed to be used by other ON components.
>> Agreed, but there is no other way in the new Crossbow API to take advantage of multiple rings. There is one generic RX callback, but no other way to associate a callback with a specific ring. This is a limitation of the existing API, and support should be added to the open API list so that we can process traffic independently. Having some additional APIs to expose this would be very useful.
>
> But you can process the traffic independently, since packets will be sent up to the MAC client concurrently for the multiple rings associated with the client. What else do you need to do here specifically that is prevented by the API?

I think this is sufficient for now. In the future, as the flow classification interfaces mature, the ability to associate a ring with a specific classification might be useful. Subsequently, the ability to specify a unique handle for each flow will help the rx_callback process pkts better.

<..snip..>

>>>> Q3.5) Are there any interfaces other than the above mac_rings_xxx interfaces that are available to deal with MAC rings?
>>>
>>> Not available to MAC clients. The set of project private interfaces might evolve as we refine the design.
>>
>> I would like to see some of the functionality exported via these private mac interfaces promoted to an open API. Even if the API cannot be moved over, can we extend the APIs to provide hints ..
>
> Again, what is the functional requirement? What functionality do you need to provide with this information? What hints do you think are missing from the API?

See my response to Q3.1 ..

>>>> Q3.6) Does mac_rings_list_get() return the list of mac rings assigned to the client at the time of client open? How can this be changed after the client is open?
>>>
>>> The set of assigned rings may change. The details of the APIs needed to support this still need to be defined, but they will remain project private.
>>
>> So you are saying there is no way to rely on how many rings are available to a particular client. This will change without the client's control? Is CPUs being removed from the system a case under which this will happen?
>
> The flags taken by mac_client_open() allow some control by the MAC client, see Q1.3. If the client specified that a given CPU be assigned to the client, we could block the DR'ing out of the CPU until the MAC client releases that CPU. What is your requirement here?

I don't think you want to block DR. DR of CPUs happens outside the scope of the kernel, normally from an external control point like a data mgmt center. This control point has little visibility into which CPU in a domain is being used by a MAC client. So instead of preventing DR from happening, the mac_client should be notified that it might lose some of its rings. Alternatively, you can handle this in the same way as when the client_open is done with mbc_cpus=NULL and ncpus > 0: the mac layer can redistribute the rings across the remaining CPUs in the system instead of reducing the number of rings the client currently has.

>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the bandwidth to the number of rings that are assigned to that address. Is there a way to not bind h/w rings to a specific MAC address, so that the bandwidth could be used by any mac client depending on the traffic?
>>>
>>> See Q1.3.
>>
>> Not sure what you mean. Are you suggesting that some mac addresses will have SW rings and others will be associated with HW rings?
>
> Between different MAC clients, that's possible. But within the same MAC client, all unicast addresses of that client will share the same group of hardware rings or SRS.

So when packets for a specific address arrive, will they be processed by a different ring each time? So each ring is synonymous with a CPU resource and handles whichever packet arrives - it has no affinity to specific mac addresses?

>>>> 4) Receive callback related:
>>>> (Crossbow-virt.pdf Section 5.2.5 Pg 40)
>>>> int mac_rx_set(mac_client_handle_t mch, mac_rx_fn_t rx_fn, void *arg);
>>>> int mac_rx_clear(mac_client_handle_t mch);
>>>>
>>>> QUESTIONS:
>>>>
>>>> Q4.1) How can a client get an rx callback per ring that is assigned to the mac client? This will allow parallel processing and improve performance. Such a feature is already being used in the current implementation of the LDoms vSwitch driver, and the mac_xxx interfaces should support such an ability.
>>>
>>> The parallel processing will still happen. I.e. if multiple hardware rings or software rings are assigned to a MAC client, multiple connections associated with that MAC client will be spread across these rings.
>>
>> So with multiple rings, there will be concurrent callbacks to the rx_fn, each with packets from the corresponding ring? Also, will each callback be able to determine which ring did the callback?
>
> Why do you need that information, what is your functional requirement?

I think today the requirement is not relevant, especially because the API does not allow associating each mac address with a separate ring, unless the ring and address are associated with a unique mac_client.
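For reference, the single-callback model we would code to looks like this (a sketch using the documented mac_rx_set(); the callback prototype and the vsw_* names are assumptions, not from the doc):

    /*
     * One RX callback per MAC client; it may be invoked concurrently,
     * once per ring being drained.
     */
    static void
    vsw_rx(void *arg, mblk_t *mp_chain)
    {
            vsw_t *vswp = arg;

            vsw_process_chain(vswp, mp_chain);      /* our own code */
    }

    (void) mac_rx_set(mch, vsw_rx, vswp);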
>>>> Q4.2) How can a client get a separate callback for a defined type of traffic, such as different SAP numbers etc.? This would be useful for providing out-of-band packet processing or related services.
>>>
>>> This will be supported by a MAC flow API built on top of the MAC client API. The flow API will be described by a separate document.
>>
>> So if a client wants to use the flow API, will it need to layer itself on the flow API and not the mac client API directly? Can you give me more information on what this layering will look like? Also, when do you expect the flow API doc to be available?
>
> The flow API will be an addition to the MAC client API. A MAC client will be able to use that flow API. Such a flow operation would be of the form mac_flow_xxx(mac_client_handle_t mch, <flow description>, <bandwidth properties>, etc). Kais is working on defining that API, I'll let him comment on expected availability.

Thanks -- some of the requirements / comments above are tied to the flow API, so clarification on the flow API will help better define the requirements.

>>>> Q5.2) If NULL is specified as a 'hint', how is the tx ring selected?
>>>
>>> In this case mac_tx() will parse the packet headers and hash on the header information to select a transmit ring.
>>
>> Is the goal here to somehow bifurcate traffic being sent by a client via the interface?
>
> The goal is to spread the traffic among the transmit rings assigned to the client while maintaining packet ordering for individual connections, without exposing the details of the assignment of transmit rings to MAC clients.

Another related question ..

Are Tx and Rx rings assigned as a pair to a mac client? Can a client have more Tx rings than Rx rings? What controls this? Does the ncpus parameter control how many Tx rings a client is assigned?

>>>> Q7.2) From the explanation of mac_promisc_add(), it seems like mac_promisc_add() could be called without setting a MAC address via mac_unicast_add(). Is this correct? If so, what is the expected behaviour?
>>>
>>> Currently we provide the same semantics as a switched environment, i.e. a MAC client will see the same traffic that would be seen by a NIC connected to a switch.
>>
>> Is there a way to see only the multicast traffic associated with all mac clients - the union of all mac_client multicast_add addresses? The MULTI promisc option seems more a way to weed out unicast and broadcast traffic on the wire and pass all wire multicast traffic up - including traffic the system may not be interested in. Is this the case?
>
> These broadcast flags apply not only to the incoming received traffic but also to the traffic sent by MAC clients of the same underlying MAC. I.e. a MAC client PROMISC_MULTI callback will also see all multicast traffic sent by the other MAC clients defined on top of the same MAC. In order to preserve the semantics that are implemented by a real physical switch, this applies to *all* multicast traffic, not just the multicast groups that were "joined" by the individual MAC clients.

Ok - thanks for the clarification .. Can you add some text to this effect to the doc also ..

>>> Another option would be to generalize this with the shared ethernet model, and allow a MAC client to specify that it wants to observe all traffic via a separate promiscuous type. I need to see how this can be added to the API.
>>
>> This would be very useful. How about something like MAC_PROMISC_CLIENTS?
>
> The new modes will be:
>
> * ALL: all traffic, including all traffic sent by MAC clients and traffic seen by the hardware
> * FILTERED: all multicast and broadcast traffic, plus the traffic to the unicast MAC addresses associated with the MAC client
> * MULTI: all multicast and broadcast traffic received on the wire and sent by the MAC clients
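If I read that right, usage would be something like this (a sketch only; the constant names are taken from your list above, and the mac_promisc_add() signature and callback type are assumed):

    /*
     * Register a callback for all multicast/broadcast traffic, from
     * the wire and from the other MAC clients on the same MAC.
     */
    err = mac_promisc_add(mch, MAC_PROMISC_MULTI, promisc_cb, arg, &pmh);
    /* ... */
    mac_promisc_remove(pmh);        /* assumed counterpart */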
OK.

Thanks
-Narayan
Hi Narayan,

On Oct 17, 2007, at 5:53 PM, Narayan Venkat wrote:

> Hi Nicolas
>
> Thanks for the response ..
>
>> In general it seems that the issues being discussed are related to either (1) the port of the existing LDOM functionality to the updated MAC client interfaces introduced by Crossbow, or (2) the addition of new LDOM functionality based on these new APIs.
>
> The primary goal for us is to address the first case - the port of the existing LDom functionality already released to customers.

Yes, these are the issues that we need to address for the initial putback.

> But there is new LDoms functionality on the horizon that must be addressed as well. The new features are expected to show up in the same time frame as Crossbow.

If there's additional functionality which is needed by LDOM and which has an impact on Crossbow, then the functional and schedule requirements for these new features need to be clearly communicated to us. If additional work is needed on Crossbow, then we need to discuss the vehicle for these changes, and the potential impact of that work on the existing Crossbow schedule, other existing projects, and future funded projects needs to be evaluated, tracked, and the effort staffed. Unless you give us that information, we can't plan for that work. And we cannot get into detailed design discussions if the functional requirements are not clearly communicated first.

>> So given my latest response below, what are in your opinion the remaining issues which are directly related to the port of the existing LDOM functionality to the new MAC API?
>
> A quick summary of the issues that still need resolving for the port:
>
> - How does the ring-to-CPU mapping change when DR happens?

I think this was answered at length below. Do you have any remaining issues on this topic?

> - How do multiple MAC addresses assigned to a single client correspond to the rings owned by the client?

I don't agree that this is required for the port as part of the initial Crossbow putback. From what I could tell, LDOMs today doesn't allow multiple MAC addresses to be assigned to vnets in the first place.

> - Usage of HW MAC address slots in the NIC, and automatic switching to layer-2 filtering and promiscuous mode.

I think I answered your questions about this point below. Also, today there's no classification done in hardware for LDOMs, whether in promiscuous mode or not. So I don't understand why you consider this a requirement for the initial port.

> - Separation of incoming traffic and its association with Rx rings.

"Association with Rx rings" is a bit vague. As I described in my previous emails and below, individual connections will be fanned out to the rings that are members of the group associated with a MAC client, or the fanout will be done in software between soft rings.

> - Tx ring allocation, its relation to Rx rings, and the level of parallelism.

See Q5.2 below.

<snip>

>>> Also, according to the explanation in the doc at page 38, there is also a case where no flags are specified. It seems like, if no flags are specified, then it will attempt to reserve one hardware ring. It seems not to fail even if such a reservation fails, but this is not clearly specified.
>>
>> If you don't specify the ONE_RING flag, and a hardware ring cannot be reserved for the MAC client, then the MAC client will share the default ring with other MAC clients.
>
> In the current design, if there are N rings and a client open is done, and it requests a HW ring, it will get one assigned from the N-1 rings. If it does not request one, it will still be mapped to a HW ring (if available), else be assigned a SW ring fanned out from the default HW ring - correct?

Correct.

<snip>

>>> The above comment applies to this one also. The behavior without any flags seems to be to attempt to reserve one h/w ring. What is the failure case?
>>
>> Without any flag, we try to allocate N hardware rings. If that fails, then we try to allocate 1 hardware ring and do fanout to N soft rings; if that fails, then we share the default hardware ring and do fanout to N soft rings.
>
> This is the part that I was missing. So when a client requests NCPUS it does get N rings - either all SW or all HW rings. In the case of SW rings it might be a fanout from an allocated HW ring or from the default HW ring. This is clear from chapter 4.x ..

Correct.
> A couple more questions:
>
> - If a NIC has only 1 HW ring, this will essentially become the default HW ring. All other requests are fanned out from this HW ring to SW rings?

Yes.

> - If a NIC has N HW rings, only (N-1) rings are available for allotment; 1 is always reserved as the default HW ring?

Yes.

> - If a NIC has 2 free HW rings, and a client requests 3 rings, the mapping today will be 1 HW ring to 3 SW rings, correct? This will still leave the other HW ring free?

Yes.

>>>> The first one is correct. If mbc_cpus is non-NULL, the MAC layer will assign the CPUs provided by the caller.
>>>
>>> When mbc_cpus is NULL, what determines how many CPUs and hence the number of rings available to this client?
>>
>> mbc_ncpus.
>
> Can you document this? It is not clear from the doc that mbc_ncpus still controls the ring allotment and the degree of parallelism even when mbc_cpus is NULL ..

I will update the document to make it clearer.

> Can we set the ncpus value to a number greater than the actual number of cpus in the domain? Will the MAC layer create the requested number of rings to match ncpus, or will it limit the rings to the number of actual cpus in the domain?

We use the specified ncpus.

>>>>> Q1.6) What is the relationship between Unicast addresses (multiple unicast set via mac_unicast_add()), Rings and CPUs?
>>>>>
>>>>>     - Is there a 1:1 relation between a unicast address and a ring?
>>>>>     - Is there a 1:1 relation between a ring and a CPU?
>>>>
>>>> Neither. The MAC addresses will share the same rings and CPUs.
>>>
>>> But since you are allowing multiple mac addresses to be associated with a client, can we add support as part of the unicast_add call to indicate that each of these addresses should be associated with a ring (either HW or SW)?
>>
>> No, each client is associated with a group of hardware rings or soft rings. Each group or set of rings corresponds to a set of unicast MAC addresses. The bandwidth limits are set on a per MAC client basis. This maps to how hardware NICs do their classification and fanout.
>>
>> If you want a separate set of rings for different MAC addresses, then you create a new MAC client.
>
> If this is the case, what is the real value in being able to assign multiple addresses to a client? Especially when a single client has multiple MAC addresses, coalescing the pkts into a single stream has less benefit than separating the traffic for each address onto its own ring. Since a client has many rings and addresses, being able to treat these as something other than a group of addresses associated with a group of rings will be useful.

Of course there's value in assigning multiple MAC addresses to a single client, even if you share one or more rings within that client. If such a MAC client maps to a front-end driver instance in another domain (vnet in your case, xnf for Xen), that domain can then create multiple VNICs on top of that front-end driver and assign a MAC address to each of these VNICs without having to turn the underlying hardware to promiscuous mode, and establish a path between the VNICs on the domain itself without having to cross the hypervisor boundary.

That said, I don't think anything forces you to have a 1:1 mapping between a vnet instance and a MAC client in the service domain. If what you are trying to do is have separate sets of rings for the clients of a vnet based on their MAC addresses, you could have a vnet map to multiple MAC clients in the service domain, each with their own MAC addresses and separate groups of rings. That vnet instance would then register groups of rings corresponding to the MAC clients in the service domain. These groups would have their own set of rings, and the groups would be assigned by the MAC layer to the MAC clients of vnet.
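In other words, on the service-domain side it would look something like this (a sketch only; the argument lists are abbreviated, not the exact draft signatures):

    /*
     * One vnet backed by two MAC clients: each client gets its own
     * unicast address and therefore its own group of rings.
     */
    mac_client_open(mh, &mch_a, ...);       /* client A */
    mac_unicast_add(mch_a, addr_a, ...);

    mac_client_open(mh, &mch_b, ...);       /* client B */
    mac_unicast_add(mch_b, addr_b, ...);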
> For instance on N2-NIU, if you assign an RDC group to a mac_client, this group can still contain one or more rings. The group can also be assigned multiple MAC addresses. Will the number of groups limit the mac clients that can be created for the specific device? In that case we will want the ability to have traffic from separate MAC addresses spread across the rings in this mac_client.

We will create one group per MAC client, as long as hardware resources are available. We reserve one group as the default group, and software classification will be used on top of the default group to spread the traffic to multiple software rings assigned to the MAC clients sharing the same group. The hardware associates multiple MAC addresses with each group, and each group maps to a MAC client. Once the hardware finds the group associated with a MAC address, it spreads traffic across the rings assigned to that group according to a computed hash on the inbound packet headers.
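Conceptually, per inbound packet the hardware does something like this (pure pseudocode to illustrate the two-level lookup; none of these names are real):

    group = lookup_group_by_dest_mac(pkt);          /* MAC addr -> group */
    ring = group->rings[hash(pkt_headers(pkt)) % group->nrings];
    deliver(ring, pkt);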
>>>>> - The Rings and CPUs are tightly coupled in this interface. How can we allocate multiple rings even when there is one CPU (or a smaller number of CPUs)?
>>>>
>>>> You don't allocate rings explicitly, you express a level of parallelism instead; the framework distributes the hardware rings transparently.
>>>
>>> But the only way we can control this parallelism is by specifying the number of CPUs in the domain. In a system capable of adding and removing CPUs dynamically, we might want to change the parallelism level too. The current APIs don't allow changing this. We will need a way to specify this as an extension to client_open or via a new API call.
>>
>> So you want an API which allows you to change the actual mac_bind_cpu_t for a client which has been already opened? I think we can do that.
>
> Exactly. As a result it should also allocate more rings to correspond to the current set of CPUs. The reverse should also be true. We should similarly be able to reduce mbc_ncpus when CPUs are removed from the system.

OK.

> <..snip..>
>
>>> Also, in terms of parallelism, is this specified by the number of CPUs or by unique CPU IDs in the array? What happens if I specify ncpus where all IDs are the same - do I get ncpus HW rings if they are available? Also, can we then change the ring-to-CPU mapping when more CPUs are added/removed to/from the domain?
>>
>> There should be no duplicate CPU ids in the array.
>
> Will this be checked and an error returned from client_open? Can you also note in the doc that this is an error?

Yes. I will update the doc.

>>>>> Q1.7) How is the binding of CPUs via mac_bind_cpus_t co-ordinated with CPU DR (on the platforms that support it)?
>>>>
>>>> The MAC layer will be notified of the removal of the CPU and will stop using it for its worker threads and interrupts.
>>>
>>> That is purely error handling. We need the ability to use more CPUs and improve the level of parallelism when CPUs are added. The reverse is true when CPUs are removed. When the MAC layer is notified about CPUs going away, does it remove the rings associated with those CPUs?
>>
>> I was not talking specifically about error handling. If the MAC layer bound a ring worker thread or interrupt to a CPU and that CPU is going away, the MAC layer will move that thread or interrupt to a different CPU.
>
> So if a client_open was done with only one CPU in the mbc_cpus array and this CPU is going away, the alternate CPU will be picked in the same manner as if the client_open had been done with mbc_cpus=NULL.

Yes.

>> The API discussed in Q1.6 above would allow a MAC client to increase the number of CPUs if it detects that CPUs were added to the system.
>
> Does this only allow specifying an increased CPU count, or does it also allow the client to specify the CPUs to use for the mapping when more CPUs are added?

It would allow the new CPU set to be specified, but we need to figure out the details of that API.

>>>>> Q1.10) Can the mac client interface be extended to support creating a client based on ether_type? This is required for mac clients like fibre channel over ethernet.
>>>>
>>>> No, each MAC client corresponds to a MAC level entity which is defined by its MAC address. Multiple ether types can be supported on top of a MAC client.
>>>
>>> Devices like the Niagara2 NIU allow classification of packets using parameters like the ether_type. How can a mac_client take advantage of such functionality?
>>
>> The fact that a particular hardware implementation can do classification on a specific header field of a packet doesn't necessarily mean that a MAC client needs to be associated with that field.
>>
>> Today the SAP demultiplexing is done by DLS on top of MAC clients. At some point in the future we may make use of hardware classification to offload that demultiplexing, but that can be done at a level above the MAC layer, maintaining the separation between MAC clients and what defines them (MAC addresses and VLANs), and SAP demultiplexing.
>
> Agreed, that makes sense for SAP demultiplexing. In the near future, opening clients based on ether_type will be important, particularly for FCoE. Interfaces this time next year will be supporting FCoE, and the Leadville stack will need to open a client based on the FCoE ether_type.

The fact that there's a need to do demultiplexing based on SAPs doesn't necessarily mean that the SAP needs to be associated with the MAC client directly. I think this falls into the "future projects" category where the functional requirements need to be clarified. We'll need to work with the FCoE folks on this. At least our current design doesn't prevent that demultiplexing from being added to the MAC layer in the future.
>>>>> Q2.2) Is there an impact to the multiaddress_capab_t.maddr_add()/maddr_remove() interfaces? Are these being obsoleted or going away?
>>>>
>>>> The capability will stay, and the framework will continue to use that capability to query and control the allocation of MAC address slots. However that interface is not intended to be used by drivers, which should use the MAC client interfaces instead.
>>>
>>> OK.
>>
>> Since my last reply Kais and Roamer have been working on the design for the new driver interface. Their proposal removes the multiple MAC address capability as it is known today. You should read their design document, which is available at http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf
>
> Thanks. I saw the email too and am reviewing the doc now ..

Actually, since we're covering this topic, can you clarify for us how LDOMs obtains factory MAC addresses from the interfaces? It doesn't seem that you go through the multiple MAC address capability to do this. We will change that part of the multiple MAC address capability as well, so we need to know if you depend on that interface to obtain the factory MAC addresses.

>>>>> Q2.3) In a system with many domains (aka LDoms) with virtual network devices, a large number of layer-2 addresses is required; this will exhaust the h/w slots available on most standard NICs. How can a client take advantage of layer-2 filtering provided by NICs like N2-NIU/Neptune? Specifically, this will help in avoiding programming the device into PROMISCuous mode etc. Currently there are no interfaces that seem to provide such an ability.
>>>>
>>>> Yes, this is a situation we are aware of. We've talked on this list about having multiple VNICs sharing the same MAC address, and identified by their IP address instead. However this needs to be scoped and defined further before we can commit on providing that functionality.
>>>
>>> The current APIs only allow adding as many addresses as the number of slots available. Beyond this it will put the adapter in promisc mode. Instead, can you add the capability to specify when to use a filter and when to take up a slot in the HW?
>>
>> Do you mean that you want to be able to specify that a mac_unicast_add() should put the NIC in promiscuous mode even though there are MAC address slots available? What is the use case for this?
>
> No, that does not make any sense; I am not asking for that. The number of mac addresses that can be added across all mac clients is restricted to the total number of HW slots in the NIC - correct? If this is not the case, does the MAC layer put the card in promisc mode and filter the MAC addresses in SW?

Yes, we put the card in promiscuous mode if we run out of MAC address slots.

> In the case of HW that allows layer-2 filtering, is there a way the MAC layer takes advantage of this instead of putting the NIC in promisc mode, especially when we run out of HW slots on the NIC?

By layer-2 filtering I guess you mean hardware classification. You still need to put the NIC in promiscuous mode so that it starts receiving traffic for the MAC addresses which do not fit in the hardware MAC address slots. In the case of the NIU, I believe that the packets will still be classified to the right groups even if the card is in promiscuous mode.
>>>>> Q2.7) How are the multiple addresses per client maintained? Is it done in the MAC layer, or does it bypass the MAC layer and get passed to the h/w directly?
>>>>
>>>> Since the action of reserving the MAC address is triggered by a call to the MAC layer, the MAC layer cannot be bypassed. The MAC layer will use the multiple MAC address capability exposed by the driver to reserve a new MAC address slot.
>>>
>>> What if the driver does not expose that capability? Will the unicast_add call fail? Is the MAC layer essentially reflecting the capability of the underlying hardware, or does it provide the ability to have multiple addresses irrespective of whether the HW has multiple slots or not?
>>
>> The request will still succeed if the number of MAC address slots is exhausted, or if the underlying NIC doesn't support the multiple MAC address capability. However, in these cases the MAC layer will transparently put the NIC in promiscuous mode in order to receive traffic for that new MAC unicast address.
>
> Can we take advantage of other HW capabilities like address filtering? See the comment above wrt Q2.3. Also, there are cases where we don't want to switch to promisc mode automatically. Can we add a flag to the unicast_add call and get an error instead of the automatic switch?

Yes, I thought I had already agreed to add it but forgot to document the flag.

> Since there is an API for forcing promisc mode, we can explicitly request promisc mode using that API when needed.

You lost me here. There's no separate API to force promisc mode. There's an API to add promiscuous callbacks; it's different than the older API.

>>>>> Q2.8) Can an unlimited number of mac addresses be assigned to a MAC client? What are the software/hardware features that limit this?
>>>>
>>>> Memory that can be allocated by the kernel.
>>>
>>> So even if the underlying device runs out of slots, the MAC layer will maintain all the addresses associated with that client. How does it then manage and associate these addresses with the rings allocated for this client? What does it do in both software and hardware to filter the addresses for this client? Also, which addresses get HW slots and which don't? And if you run out of slots, does the HW go into promisc mode?
>>
>> Each MAC client is associated with a group of rings. Each group of rings is therefore associated with a set of MAC addresses. If a client needs to be associated with more than one MAC address, then the corresponding group needs to be associated with the same set of addresses. If the hardware runs out of MAC addresses, then the NIC is put in promiscuous mode. The allocation of slots is on a first come, first served basis.
>
> So HW slots are global across all MAC clients. Since allocation is FCFS, one client can potentially consume all HW slots? Also, since transitioning the NIC to promisc mode has an impact on all clients, I think the mac layer should try to do slightly better than FCFS and do something like fair-share, so that it does not give one client all the HW slots. Also, add a flag to prevent automatic switching to promisc mode.

There are cases where internally we might come up with some algorithms to distribute the slots fairly across the clients; however, we want to be able to tune these algorithms as we gain experience with the framework, and avoid pushing the complexity of managing these shared resources to the clients.
>>>>> 3) Rings related:
>>>>> (Crossbow-virt.pdf Section 5.3 Pg 43)
>>>>> mac_ring_t *mac_rings_list_get(mac_client_handle_t mch, uint_t nrings);
>>>>> void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>>>> uint16_t mac_ring_get_flags(mac_ring_t ring);
>>>>>
>>>>> QUESTIONS:
>>>>>
>>>>> Q3.1) All of these interfaces are now categorized as project-private API. What motivated this change? These interfaces need to be more open.
>>>>
>>>> The MAC layer will do the allocation of hardware resources to the various MAC clients and their flows. Instead of having each MAC client manage its own set of resources, the resources are allocated to MAC clients based on their needs, for example the degree of parallelism expressed through mac_client_open(). If you have specific functional requirements that are not satisfied by the current document, please list them.
>>>
>>> Currently rings are hidden resources entirely managed by the mac layer, and clients have no visibility. All the client gets to do is request a degree of parallelism. Providing APIs that allow clients to see how rings were allocated will be useful.
>>
>> Why? What is the functional requirement?
>
> A client otherwise does not know whether its parallelism request is met using HW rings or SW rings. HW is obviously better than SW. In the case where it gets the latter, it might choose to reduce the degree of parallelism so that it gets all HW rings. Having said that, since the current APIs allow for requesting only HW rings, we can always try HW first and then ask for SW rings only if the first attempt fails and the client is OK with SW rings. Some visibility into this in the future will positively help with optimizations.

OK, if you can go with the algorithm that you just described for now, that would be great. We'll look into how we can improve the visibility of resource availability in the future; however, we won't be able to add this feature for our initial putback.
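I.e. something like this on your side (a sketch; only the flag name comes from the document, the rest of the mac_client_open() argument list is assumed, as is the assumption that FORCE_MULTI_RINGS fails rather than falling back when N hardware rings are not available):

    /* First try to get all-hardware rings. */
    err = mac_client_open(mh, &mch, name, &mbc,
        MAC_OPEN_FLAGS_FORCE_MULTI_RINGS);
    if (err != 0 && sw_rings_acceptable) {
            /* Fall back to the default behavior (HW and/or SW rings). */
            err = mac_client_open(mh, &mch, name, &mbc, 0);
    }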
>>>>> Q3.2) mac_rings_list_get() is only for h/w rings; is there an equivalent interface to obtain s/w ring information? Or can this interface be extended to return both h/w and s/w ring information?
>>>>
>>>> The interface will evolve to provide that information, but it will remain project private. It is provided here FYI but will change in future revisions of the document.
>>>
>>> So the expectation is that the ring APIs should not be used by clients, and rings are an internal MAC layer resource managed by it?
>>
>> Yes, the MAC layer does the allocation of resources to MAC clients and their flows.
>
> Some visibility into this will help in both perf monitoring and policy correction. Instead of looking at this from a single OS instance point of view, if we look at it from the perspective of different OSs, having more info can help better tune for varying traffic loads. Can some of this be made available via some kind of stats-like interface?

That's a very complex problem. Even today, in the single OS instance case, we don't fully self-tune according to the workload. Providing this type of capability falls outside the scope of our initial putback. We need an architecture first before starting to export kstats.

<snip>

>>>>> Q3.6) Does mac_rings_list_get() return the list of mac rings assigned to the client at the time of client open? How can this be changed after the client is open?
>>>>
>>>> The set of assigned rings may change. The details of the APIs needed to support this still need to be defined, but they will remain project private.
>>>
>>> So you are saying there is no way to rely on how many rings are available to a particular client. This will change without the client's control? Is CPUs being removed from the system a case under which this will happen?
>>
>> The flags taken by mac_client_open() allow some control by the MAC client, see Q1.3. If the client specified that a given CPU be assigned to the client, we could block the DR'ing out of the CPU until the MAC client releases that CPU. What is your requirement here?
>
> I don't think you want to block DR. DR of CPUs happens outside the scope of the kernel, normally from an external control point like a data mgmt center. This control point has little visibility into which CPU in a domain is being used by a MAC client. So instead of preventing DR from happening, the mac_client should be notified that it might lose some of its rings. Alternatively, you can handle this in the same way as when the client_open is done with mbc_cpus=NULL and ncpus > 0: the mac layer can redistribute the rings across the remaining CPUs in the system instead of reducing the number of rings the client currently has.

Thanks for your input on this, we will rebind the thread to one of the remaining CPUs.

>>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the bandwidth to the number of rings that are assigned to that address. Is there a way to not bind h/w rings to a specific MAC address, so that the bandwidth could be used by any mac client depending on the traffic?
>>>>
>>>> See Q1.3.
>>>
>>> Not sure what you mean. Are you suggesting that some mac addresses will have SW rings and others will be associated with HW rings?
>>
>> Between different MAC clients, that's possible. But within the same MAC client, all unicast addresses of that client will share the same group of hardware rings or SRS.
>
> So when packets for a specific address arrive, will they be processed by a different ring each time? So each ring is synonymous with a CPU resource and handles whichever packet arrives - it has no affinity to specific mac addresses?

For the rings assigned to a MAC client, yes. Note that the hardware is required to use the same RX ring of a group for a given connection, in order to maintain locality and prevent reordering. However, each MAC client will get its own group of rings, and traffic for one client will not spill over to the set of rings of another client. See also Q1.6 above.

<snip>

>>>>> Q4.2) How can a client get a separate callback for a defined type of traffic, such as different SAP numbers etc.? This would be useful for providing out-of-band packet processing or related services.
>>>>
>>>> This will be supported by a MAC flow API built on top of the MAC client API. The flow API will be described by a separate document.
>>>
>>> So if a client wants to use the flow API, will it need to layer itself on the flow API and not the mac client API directly? Can you give me more information on what this layering will look like? Also, when do you expect the flow API doc to be available?
>>
>> The flow API will be an addition to the MAC client API. A MAC client will be able to use that flow API. Such a flow operation would be of the form mac_flow_xxx(mac_client_handle_t mch, <flow description>, <bandwidth properties>, etc). Kais is working on defining that API, I'll let him comment on expected availability.
>
> Thanks -- some of the requirements / comments above are tied to the flow API, so clarification on the flow API will help better define the requirements.

I would think that this should be the other way around :-) you provide the functional requirements, then we can discuss whether the APIs satisfy these requirements.
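For what it's worth, the general shape would be something like this (purely hypothetical names, extrapolated from the mac_flow_xxx() form quoted above; Kais' document will define the real interface):

    /*
     * Hypothetical: attach a flow (e.g. an ether_type or SAP match)
     * with optional bandwidth properties and its own RX callback.
     */
    err = mac_flow_add(mch, &flow_desc, &flow_props, flow_rx_cb, arg, &fh);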
>>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the
>>>>>       bandwidth to the number of rings assigned to that address.
>>>>>       Is there a way not to bind h/w rings to a specific MAC
>>>>>       address, so that the bandwidth can be used by any mac
>>>>>       client depending on the traffic?
>>>>
>>>> See Q1.3.
>>>
>>> Not sure what you mean. Are you suggesting that some mac addresses
>>> will have SW rings and others will be associated with HW rings?
>>
>> Between different MAC clients, that's possible. But within the same
>> MAC client, all unicast addresses of that client will share the same
>> group of hardware rings or SRS.
>
> So when packets for a specific address arrive, will they be processed
> by a different ring each time? Is each ring synonymous with a CPU
> resource, handling whichever packets arrive, with no affinity to
> specific mac addresses?

For the rings assigned to a MAC client, yes. Note that the hardware is required to use the same RX ring of a group for a given connection, in order to maintain locality and prevent reordering. However, each MAC client will get its own group of rings, and traffic for one client will not spill over to the set of rings of the other clients. See also Q1.6 above.

<snip>

>>>>> Q4.2) How can a client get a separate callback for a defined type
>>>>>       of traffic, such as different SAP numbers, etc.? This would
>>>>>       be useful for providing out-of-band packet processing or
>>>>>       related services.
>>>>
>>>> This will be supported by a MAC flow API built on top of the MAC
>>>> client API. The flow API will be described by a separate document.
>>>
>>> So if a client wants to use the flow API, will it need to layer
>>> itself on the flow API rather than on the mac client API directly?
>>> Can you give me more information on what this layering will look
>>> like? Also, when do you expect the flow API doc to be available?
>>
>> The flow API will be an addition to the MAC client API. A MAC client
>> will be able to use that flow API. Such a flow operation would be of
>> the form mac_flow_xxx(mac_client_handle_t mch, <flow description>,
>> <bandwidth properties>, etc.). Kais is working on defining that API;
>> I'll let him comment on expected availability.
>
> Thanks -- some of the requirements / comments above are tied to the
> flow API, so clarification on the flow API will help better define
> the requirements.

I would think it should be the other way around :-) You provide the functional requirements, and then we can discuss whether the APIs satisfy them.

>>>>> Q5.2) If NULL is specified as a 'hint', how is the tx ring
>>>>>       selected?
>>>>
>>>> In this case mac_tx() will parse the packet headers and hash on
>>>> the header information to select a transmit ring.
>>>
>>> Is the goal here to somehow bifurcate traffic being sent by a
>>> client via the interface?
>>
>> The goal is to spread the traffic among the transmit rings assigned
>> to the client while maintaining packet ordering for individual
>> connections, without exposing the details of the assignment of
>> transmit rings to MAC clients.
>
> Another related question: are Tx and Rx rings assigned as a pair to a
> mac client? Can a client have more Tx rings than Rx rings? What
> controls this? Does the ncpus parameter control how many Tx rings a
> client is assigned?

First we try to allocate ncpus hardware TX rings. If fewer than ncpus are available, it's the same algorithm as for receive rings, i.e. we assign one TX ring to the client. If we run out of TX rings, we fall back to a default TX ring. We'll add flags to mac_client_open() similar to the ones we already have for receive ring allocation. Note that since the hardware cannot guarantee that the number of TX rings is always the same as the number of RX rings, it is not possible to guarantee that each RX ring will map to a TX ring.
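To illustrate why hashing on packet headers both spreads load and preserves per-connection ordering, here is a toy version of the selection step. This is illustrative only, not the actual mac_tx() code, and the header structure is hypothetical:

#include <sys/types.h>

/*
 * Toy illustration: the same connection always hashes to the same
 * ring (so its packets are never reordered), while different
 * connections are spread across the available TX rings.
 */
typedef struct conn_hdrs {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
} conn_hdrs_t;

static uint_t
tx_ring_select(const conn_hdrs_t *h, uint_t nrings)
{
	uint32_t hash;

	/* Same connection => same hash => same ring. */
	hash = h->src_ip ^ h->dst_ip ^ h->src_port ^ h->dst_port;
	return (hash % nrings);
}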
>>>>> Q7.2) From the explanation of mac_promisc_add(), it seems that
>>>>>       mac_promisc_add() could be called without setting a MAC
>>>>>       address via mac_unicast_add(). Is this correct? If so, what
>>>>>       is the expected behaviour?
>>>>
>>>> Currently we provide the same semantics as a switched environment,
>>>> i.e. a MAC client will see the same traffic that would be seen by
>>>> a NIC connected to a switch.
>>>
>>> Is there a way to see only the multicast traffic associated with
>>> all mac clients, i.e. the union of all mac_client multicast_add
>>> addresses? The MULTI promisc option seems more a way to weed out
>>> unicast and broadcast traffic on the wire and pass all wire
>>> multicast traffic up, including groups the system may not be
>>> interested in. Is that the case?
>>
>> These promisc flags apply not only to incoming received traffic but
>> also to the traffic sent by MAC clients of the same underlying MAC.
>> I.e. a MAC client PROMISC_MULTI callback will also see all multicast
>> traffic sent by the other MAC clients defined on top of the same
>> MAC. In order to preserve the semantics that are implemented by a
>> real physical switch, this applies to *all* multicast traffic, not
>> just the multicast groups that were "joined" by the individual MAC
>> clients.
>
> Ok - thanks for the clarification. Can you add some text to the doc
> to this effect?

Will do.
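As a rough illustration of the semantics above, registering a MULTI-promisc callback might look like the following. The mac_promisc_add() prototype and the callback signature are assumptions based on the draft, not the final interface:

#include <sys/stream.h>
#include <sys/mac_client.h>	/* assumed header for the draft API */

/*
 * Assumed callback shape: sees *all* multicast on the MAC, including
 * frames sent by sibling MAC clients (loopback is B_TRUE for those).
 */
static void
vsw_mcast_snoop(void *arg, mac_resource_handle_t mrh, mblk_t *mp,
    boolean_t loopback)
{
	mblk_t *next;

	while (mp != NULL) {
		next = mp->b_next;
		mp->b_next = NULL;
		/* ... inspect the multicast frame here ... */
		freemsg(mp);
		mp = next;
	}
}

static int
vsw_watch_multicast(mac_client_handle_t mch, mac_promisc_handle_t *mphp)
{
	/* Prototype assumed from the draft under review. */
	return (mac_promisc_add(mch, MAC_CLIENT_PROMISC_MULTI,
	    vsw_mcast_snoop, NULL, mphp, 0));
}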
<snip>

Thanks,
Nicolas.

--
Nicolas Droux - Solaris Core OS - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux


Nicolas,

On Oct 19, 2007, at 3:33 PM, Nicolas Droux wrote:

> <snip>
>
>> - How multiple mac addresses assigned to a single client correspond
>>   to the rings owned by the client
>
> I don't agree that this is required for the port as part of the
> initial Crossbow putback. From what I could tell, LDOMs today doesn't
> allow multiple MAC addresses to be assigned to vnets in the first
> place.

The issue is not multiple MAC addresses for a vnet. It's that the vSwitch fronts multiple MAC addresses for many vnet clients. Hence, if a vswitch is a single MAC client, it needs to be assigned multiple MAC addresses in order to distribute packets to multiple L2 destinations. This is existing functionality in the vSwitch.

>> - Usage of HW mac addr slots in the NIC and automatic switching to
>>   layer-2 filtering and promisc mode.
>
> I think I answered your questions about this point below. Also, today
> there's no classification done in hardware for LDOMs, whether in
> promiscuous mode or not. So I don't understand why you consider this
> a requirement for the initial port.

There may be a terminology issue here. The existing vSwitch uses L2 hardware classification in nxge, bge, and e1000g, for example, to filter incoming packets. Each domain's L2 address is placed in a hardware address slot. If the slots are exhausted, the switch then decides whether to put the interface into promiscuous mode or not.

Regards,
Michael
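The slot-exhaustion policy Michael describes could be sketched as below. This is a simplified illustration, not actual vSwitch code; the slot count and the structure are hypothetical:

#include <sys/types.h>
#include <sys/systm.h>

#define	NSLOTS	16	/* hypothetical number of HW addr slots */

typedef struct l2_filter {
	uint8_t		slots[NSLOTS][6];	/* programmed HW slots */
	uint_t		nused;
	boolean_t	promisc;	/* fell back to promisc + SW filter */
} l2_filter_t;

static void
l2_filter_add(l2_filter_t *f, const uint8_t mac[6])
{
	if (f->nused < NSLOTS) {
		/* A free slot remains: let the NIC filter in hardware. */
		bcopy(mac, f->slots[f->nused++], 6);
	} else {
		/* Slots exhausted: accept everything, filter in software. */
		f->promisc = B_TRUE;
	}
}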
Michael,

On Oct 19, 2007, at 4:50 PM, Michael Speer wrote:

> Nicolas,
>
> On Oct 19, 2007, at 3:33 PM, Nicolas Droux wrote:
>
>> <snip>
>>
>>> - How multiple mac addresses assigned to a single client correspond
>>>   to the rings owned by the client
>>
>> I don't agree that this is required for the port as part of the
>> initial Crossbow putback. From what I could tell, LDOMs today
>> doesn't allow multiple MAC addresses to be assigned to vnets in the
>> first place.
>
> The issue is not multiple MAC addresses for a vnet. It's that the
> vSwitch fronts multiple MAC addresses for many vnet clients. Hence,
> if a vswitch is a single MAC client, it needs to be assigned multiple
> MAC addresses in order to distribute packets to multiple L2
> destinations. This is existing functionality in the vSwitch.

There are two things being discussed here. One is associating more than one unicast MAC address with a MAC client; the interface being reviewed allows this. The other is, within a MAC client, assigning individual hardware rings to the unicast addresses of that same MAC client; that we cannot support. To do this, you need to create multiple MAC clients, each with its own set of rings.

The MAC layer is designed to do the multiplexing and virtualization among multiple MAC clients, not within a MAC client. Having the vswitch be one MAC client and reimplement some of the same functionality as the MAC layer would be inefficient, and wouldn't allow it to take advantage of the features now provided by the MAC layer. In order to take advantage of Crossbow, the vswitch should associate one or more MAC clients in the service domain with each vnet.

>>> - Usage of HW mac addr slots in the NIC and automatic switching to
>>>   layer-2 filtering and promisc mode.
>>
>> I think I answered your questions about this point below. Also,
>> today there's no classification done in hardware for LDOMs, whether
>> in promiscuous mode or not. So I don't understand why you consider
>> this a requirement for the initial port.
>
> There may be a terminology issue here. The existing vSwitch uses L2
> hardware classification in nxge, bge, and e1000g, for example, to
> filter incoming packets.

That's not what we call hardware classification in Crossbow terminology. What we refer to as hardware classification is directing traffic to one or more rings based on the content of the packet headers.

> Each domain's L2 address is placed in a hardware address slot. If the
> slots are exhausted, the switch then decides whether to put the
> interface into promiscuous mode or not.

This is what the new MAC layer does as well. Did Narayan mean "switching *between* L2 filtering and promisc mode" rather than "switching *to* L2 filtering and promisc mode", then?

Nicolas.

--
Nicolas Droux - Solaris Core OS - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux
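A hedged sketch of the per-vnet structure recommended above, reusing the hypothetical prototypes from the earlier ring-allocation sketch (mac_client_open(), mac_client_bind_t) plus an assumed mac_unicast_add() taking the client handle and the address — none of these signatures are the committed API:

#include <sys/types.h>
#include <sys/systm.h>
#include <sys/mac_client.h>	/* assumed header for the draft API */

/*
 * One MAC client per vnet, each with its own unicast address and
 * hence its own group of rings, instead of one client for the whole
 * vswitch.  Prototypes are assumptions based on this thread.
 */
static int
vsw_add_vnet(const char *dev, const uint8_t *vnet_mac, uint_t ncpus,
    mac_client_handle_t *mchp)
{
	mac_client_bind_t mbc;
	int err;

	bzero(&mbc, sizeof (mbc));
	mbc.mbc_ncpus = ncpus;	/* per-vnet degree of parallelism */
	mbc.mbc_cpus = NULL;

	/* A dedicated MAC client for this vnet. */
	if ((err = mac_client_open(dev, &mbc, 0, mchp)) != 0)
		return (err);

	/* The vnet's address inherits this client's ring group. */
	return (mac_unicast_add(*mchp, vnet_mac));
}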