Hi Nicolas

Thanks for the response ..

> In general it seems that the issues being discussed are related to either (1) the port of the existing LDOM functionality to the updated MAC client interfaces introduced by Crossbow, or (2) the addition of new LDOM functionality based on these new APIs.

The primary goal for us is to address the first case - the port of the existing LDom functionality already released to customers. But there is new LDoms functionality on the horizon that must be addressed as well. The new features are expected to show up in the same time frame as Crossbow.

> It's critical that we resolve the issues falling into the first category as soon as possible. This will allow the port of LDOM to the new APIs to start as soon as possible, maintain LDOM functionality, and allow Crossbow to meet its integration schedule. For the class of issues related to future LDOM functionality, this discussion will have to take place with a clear understanding of the new functional requirements.

We should absolutely be able to do the above changes soon. As I said above, the only thing that falls slightly outside the above definition is the set of changes we are currently working on. These are not yet in the OpenSolaris tree. They may rely on either existing or new Crossbow interfaces, which we will need to resolve.

> So given my latest response below, what are in your opinion the remaining issues which are directly related to the port of the existing LDOM functionality to the new MAC API?

A quick summary of the issues that still need resolving for the port:

- How does the ring-to-CPU mapping change when DR happens?
- How do multiple MAC addresses assigned to a single client correspond to the rings owned by the client?
- Usage of HW MAC address slots in the NIC, and automatic switching to layer-2 filtering and promiscuous mode.
- Separation of incoming traffic and its association with Rx rings.
- Tx ring allocation, its relation to Rx rings, and the level of parallelism.

See more questions/comments below.

<..snip..>

>>>> Q1.3) Are there any other flags other than the following ones?
>>>>
>>>>     MAC_OPEN_FLAGS_FORCE_MULTI_RINGS
>>>>     MAC_OPEN_FLAGS_FORCE_ONE_RING
>>>
>>> No.
>>
>> Is there a reason this is tied to hardware rings? We would like the mac client open request to be extended so that it can get either all software rings, a mix of hardware and software rings, or all HW rings, in order to match the number of cpus specified in the client_open call. A flag can be specified for this ..
>
> Why do you want to do this? What is the functional requirement here?

What I initially did not understand is the association of hardware rings with the parallelism requested. I think your explanation below on when SW rings are created and when HW rings are used clarifies this issue.

>> Also, according to the explanation in the doc at page 38, there is also a case where no flags are specified. It seems like, if no flags are specified, then it will attempt to reserve one hardware ring. It seems not to fail even if such a reservation fails, but this is not clearly specified.
>
> If you don't specify the ONE_RING flag, and a hardware ring cannot be reserved for the MAC client, then the MAC client will share the default ring with other MAC clients.

In the current design, if there are N rings and a client open is done, and it requests a HW ring, it will get one assigned from the N-1 rings. If it does not request one, it will still be mapped to a HW ring (if available), else be assigned a SW ring fanned out from the default HW ring - correct?

>>>> - Is there a way to force a software ring?
>>>
>>> Do you mean not assign a hardware ring? I think this is something we could add, yes.
>>
>> This is related to the above. Can you add a flag that we can use to indicate that a client wants to use a single ring or multiple rings, but does not force hardware rings? That way, even when the underlying device does not have enough hardware rings, a client can get a soft ring per CPU.
>
> How is that different than the currently documented behavior?

This is OK. I think the documentation is clear.

>> The above comment applies to this one also. The behavior without any flags seems to be to attempt to reserve one h/w ring. What is the failure case?
>
> Without any flag, we try to allocate N hardware rings. If that fails, then we try to allocate 1 hardware ring and do fanout to N soft rings; if that fails, then we share the default hardware ring and do fanout to N soft rings.

This is the part that I was missing. So when a client requests NCPUS it does get N rings - either all SW or all HW rings. In the case of SW rings it might be a fanout from an allocated HW ring or from the default HW ring. This is clear from chapter 4.x ..
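Just so we are talking about the same thing, here is the fallback as I understand it, in pseudocode (the helper names are invented for illustration; this is not the actual MAC layer code):

    /*
     * Sketch of the RX ring allocation fallback described above, for
     * a client asking for N-way parallelism.
     */
    static void
    alloc_rx_rings(mac_client_t *mcip, uint_t n)
    {
            if (reserve_hw_rings(mcip, n) == 0)
                    return;                 /* N dedicated HW rings */

            if (reserve_hw_rings(mcip, 1) == 0) {
                    /* 1 dedicated HW ring, fanned out to N soft rings */
                    create_soft_rings(mcip, n);
                    return;
            }

            /* share the default HW ring, fanned out to N soft rings */
            share_default_ring(mcip);
            create_soft_rings(mcip, n);
    }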
A couple more questions:

- If a NIC has only 1 HW ring, this will essentially become the default HW ring. All other requests are fanned out from this HW ring to SW rings?
- If a NIC has N HW rings, only (N-1) rings are available for allotment; 1 is always reserved as the default HW ring?
- If a NIC has 2 free HW rings, and a client requests 3 rings, the mapping today will be 1 HW ring to 3 SW rings, correct? This will still leave the other HW ring free?

>>> The first one is correct. If mbc_cpus is non-NULL, the MAC layer will assign the CPUs provided by the caller.
>>
>> When mbc_cpus is NULL, what determines how many CPUs and hence the number of rings available to this client?
>
> mbc_ncpus.

Can you document this? It is not clear from the doc that mbc_ncpus still controls the ring allotment and the degree of parallelism even when mbc_cpus is NULL .. Can we set the ncpus value to a number greater than the actual number of cpus in the domain? Will the MAC layer create the requested number of rings to match ncpus, or will it limit the rings to the number of actual cpus in the domain?
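For the record, this is how I understand the two cases (a sketch against the draft interfaces in Crossbow-virt.pdf; the exact mac_client_open() argument list is assumed here and may differ):

    mac_handle_t            mh;     /* assumed: handle for the NIC */
    mac_client_handle_t     mch;
    mac_bind_cpu_t          mbc;
    processorid_t           cpus[4] = { 0, 1, 2, 3 };
    int                     err;

    /* Case 1: let the MAC layer pick the CPUs; 4-way parallelism. */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = NULL;

    /* Case 2: bind explicitly to the four CPUs above (no duplicates). */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = cpus;

    err = mac_client_open(mh, &mch, "vnet0", &mbc, 0 /* no flags */);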
>>>> Q1.6) What is the relationship between Unicast addresses (multiple unicast set via mac_unicast_add()), Rings and CPUs?
>>>>
>>>>     - Is there a 1:1 relation between a unicast address and a ring?
>>>>     - Is there a 1:1 relation between a ring and a CPU?
>>>
>>> Neither. The MAC addresses will share the same rings and CPUs.
>>
>> But since you are allowing multiple mac addresses to be associated with a client, can we add support as part of the unicast_add call to indicate that each of these addresses should be associated with a ring (either HW or SW)?
>
> No, each client is associated with a group of hardware rings or soft rings. Each group or set of rings corresponds to a set of unicast MAC addresses. The bandwidth limits are set on a per MAC client basis. This maps to how hardware NICs do their classification and fanout.
>
> If you want a separate set of rings for different MAC addresses, then you create a new MAC client.

If this is the case, what is the real value in being able to assign multiple addresses to a client? Especially when a single client has multiple MAC addresses, coalescing the pkts into a single stream has less benefit than separating the traffic for each address onto its own ring. Since a client has many rings and addresses, being able to treat these as something other than a group of addresses associated with a group of rings will be useful.

For instance on N2-NIU, if you assign an RDC group to a mac_client, this group can still contain one or more rings. The group can also be assigned multiple MAC addresses. Will the number of groups limit the mac clients that can be created for the specific device? In that case we will want the ability to have traffic from separate MAC addresses spread across the rings in this mac_client.

>>>> - The Rings and CPUs are tightly coupled in this interface. How can we allocate multiple rings even when there is one CPU (or a smaller number of CPUs)?
>>>
>>> You don't allocate rings explicitly, you express a level of parallelism instead; the framework distributes the hardware rings transparently.
>>
>> But the only way we can control this parallelism is by specifying the number of CPUs in the domain. In a system capable of adding and removing CPUs dynamically, we might want to change the parallelism level too. The current APIs don't allow changing this. We will need a way to specify this as an extension to client_open or via a new API call.
>
> So you want an API which allows you to change the actual mac_bind_cpu_t for a client which has been already opened? I think we can do that.

Exactly. As a result it should also allocate more rings to correspond to the current set of CPUs. The reverse should also be true. We should similarly be able to reduce mbc_ncpus when CPUs are removed from the system.
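Something along these lines is what we need (the name and signature are purely hypothetical, since this API is still to be defined):

    /*
     * Hypothetical call to re-bind an already-open client when CPUs
     * are DR'ed in or out; nothing like this exists today.
     */
    int mac_client_bind_cpus(mac_client_handle_t mch, mac_bind_cpu_t *mbc);

    /* e.g. after CPUs are added, grow from 4-way to 8-way fanout */
    mbc.mbc_ncpus = 8;
    mbc.mbc_cpus = NULL;
    err = mac_client_bind_cpus(mch, &mbc);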
<..snip..>

>> Also, in terms of parallelism, is this specified by the number of CPUs or by unique CPU IDs in the array? What happens if I specify ncpus where all IDs are the same - do I get ncpus HW rings if they are available? Also, can we then change the ring-to-CPU mapping when more CPUs are added/removed to/from the domain?
>
> There should be no duplicate CPU ids in the array.

Will this be checked and an error returned from client_open? Can you also note in the doc that this is an error?

<..snip..>

>>>> Q1.7) How is the binding of CPUs via mac_bind_cpus_t co-ordinated with CPU DR (on the platforms that support it)?
>>>
>>> The MAC layer will be notified of the removal of the CPU and will stop using it for its worker threads and interrupts.
>>
>> That is purely error handling. We need the ability to use more CPUs and improve the level of parallelism when CPUs are added. The reverse is true when CPUs are removed. When the MAC layer is notified about CPUs going away, does it remove the rings associated with those CPUs?
>
> I was not talking specifically about error handling. If the MAC layer bound a ring worker thread or interrupt to a CPU and that CPU is going away, the MAC layer will move that thread or interrupt to a different CPU.

So if a client_open was done with only one CPU in the mbc_cpus array and this CPU is going away, the alternate CPU will be picked in the same manner as if the client_open had been done with mbc_cpus=NULL.

> The API discussed in Q1.6 above would allow a MAC client to increase the number of CPUs if it detects that CPUs were added to the system.

Does this only allow specifying an increased CPU count, or does it also allow the client to specify the CPUs to use for the mapping when more CPUs are added?

>>>> Q1.10) Can the mac client interface be extended to support creating a client based on ether_type? This is required for mac clients like fibre channel over ethernet.
>>>
>>> No, each MAC client corresponds to a MAC level entity which is defined by its MAC address. Multiple ether types can be supported on top of a MAC client.
>>
>> Devices like the Niagara2 NIU allow classification of packets using parameters like the ether_type. How can a mac_client take advantage of such functionality?
>
> The fact that a particular hardware implementation can do classification on a specific header field of a packet doesn't necessarily mean that a MAC client needs to be associated with that field.
>
> Today the SAP demultiplexing is done by DLS on top of MAC clients. At some point in the future we may make use of hardware classification to offload that demultiplexing, but that can be done at a level above the MAC layer, maintaining the separation between MAC clients and what defines them (MAC addresses and VLANs), and SAP demultiplexing.

Agreed, that makes sense for SAP demultiplexing. In the near future, opening clients based on ether_type will be important, particularly for FCoE. Interfaces this time next year will be supporting FCoE, and the Leadville stack will need to open a client based on the FCoE ether_type.

>>>> Q2.2) Is there an impact to the multiaddress_capab_t.maddr_add()/maddr_remove() interfaces? Are these being obsoleted or going away?
>>>
>>> The capability will stay, and the framework will continue to use that capability to query and control the allocation of MAC address slots. However that interface is not intended to be used by drivers, which should use the MAC client interfaces instead.
>>
>> OK.
>
> Since my last reply Kais and Roamer have been working on the design for the new driver interface. Their proposal removes the multiple MAC address capability as it is known today. You should read their design document, which is available at http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf

Thanks. I saw the email too and am reviewing the doc now ..

>>>> Q2.3) In a system with many domains (aka LDoms) with virtual network devices, a large number of layer-2 addresses is required; this will exhaust the h/w slots available on most standard NICs. How can a client take advantage of layer-2 filtering provided by NICs like N2-NIU/Neptune? Specifically, this will help in avoiding programming the device into PROMISCuous mode etc. Currently there are no interfaces that seem to provide such an ability.
>>>
>>> Yes, this is a situation we are aware of. We've talked on this list about having multiple VNICs sharing the same MAC address, and identified by their IP address instead. However this needs to be scoped and defined further before we can commit on providing that functionality.
>>
>> The current APIs only allow adding as many addresses as the number of slots available. Beyond this it will put the adapter in promisc mode. Instead, can you add the capability to specify when to use a filter and when to take up a slot in the HW?
>
> Do you mean that you want to be able to specify that a mac_unicast_add() should put the NIC in promiscuous mode even though there are MAC address slots available? What is the use case for this?

No, that does not make any sense; I am not asking for that. The number of mac addresses that can be added across all mac clients is restricted to the total number of HW slots in the NIC - correct? If this is not the case, does the MAC layer put the card in promisc mode and filter the MAC addresses in SW? In the case of HW that allows layer-2 filtering, is there a way the MAC layer takes advantage of this instead of putting the NIC in promisc mode, especially when we run out of HW slots on the NIC?

<..snip..>

>>>> Q2.7) How are the multiple addresses per client maintained? Is it done in the MAC layer, or does it bypass the MAC layer and get passed to the h/w directly?
>>>
>>> Since the action of reserving the MAC address is triggered by a call to the MAC layer, the MAC layer cannot be bypassed. The MAC layer will use the multiple MAC address capability exposed by the driver to reserve a new MAC address slot.
>>
>> What if the driver does not expose that capability? Will the unicast_add call fail? Is the MAC layer essentially reflecting the capability of the underlying hardware, or does it provide the ability to have multiple addresses irrespective of whether the HW has multiple slots or not?
>
> The request will still succeed if the number of MAC address slots is exhausted, or if the underlying NIC doesn't support the multiple MAC address capability. However, in these cases the MAC layer will transparently put the NIC in promiscuous mode in order to receive traffic for that new MAC unicast address.

Can we take advantage of other HW capabilities like address filtering? See the comment above wrt Q2.3. Also, there are cases where we don't want to switch to promisc mode automatically. Can we add a flag to the unicast_add call and get an error instead of the automatic switch? Since there is an API for forcing promisc mode, we can explicitly request promisc mode using that API when needed.
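Concretely, something like this (the flag name is invented; the point is just to get an error back rather than an implicit switch to promiscuous mode):

    /*
     * Hypothetical flag for mac_unicast_add(); today the MAC layer
     * silently falls back to promiscuous mode when HW slots run out.
     * The argument list shown here is abbreviated.
     */
    err = mac_unicast_add(mch, addr, MAC_UNICAST_HW_ONLY /* invented */);
    if (err == ENOSPC) {
            /* out of HW address slots; the client decides what to do */
    }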
>>>> Q2.8) Can an unlimited number of mac addresses be assigned to a MAC client? What are the software/hardware features that limit this?
>>>
>>> Memory that can be allocated by the kernel.
>>
>> So even if the underlying device runs out of slots, the MAC layer will maintain all the addresses associated with that client. How does it then manage and associate these addresses with the rings allocated for this client? What does it do in both software and hardware to filter the addresses for this client? Also, which addresses get HW slots and which don't? And if you run out of slots, does the HW go into promisc mode?
>
> Each MAC client is associated with a group of rings. Each group of rings is therefore associated with a set of MAC addresses. If a client needs to be associated with more than one MAC address, then the corresponding group needs to be associated with the same set of addresses. If the hardware runs out of MAC addresses, then the NIC is put in promiscuous mode. The allocation of slots is on a first come, first served basis.

So HW slots are global across all MAC clients. Since allocation is FCFS, one client can potentially consume all HW slots? Also, since transitioning the NIC to promisc mode has an impact on all clients, I think the mac layer should try to do slightly better than FCFS and do something like fair-share, so that it does not give one client all the HW slots. Also, add a flag to prevent automatic switching to promisc mode.
>>>> 3) Rings related:
>>>> (Crossbow-virt.pdf Section 5.3 Pg 43)
>>>> mac_ring_t *mac_rings_list_get(mac_client_handle_t mch, uint_t nrings);
>>>> void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>>> uint16_t mac_ring_get_flags(mac_ring_t ring);
>>>>
>>>> QUESTIONS:
>>>>
>>>> Q3.1) All of these interfaces are now categorized as project-private API. What motivated this change? These interfaces need to be more open.
>>>
>>> The MAC layer will do the allocation of hardware resources to the various MAC clients and their flows. Instead of having each MAC client manage its own set of resources, the resources are allocated to MAC clients based on their needs, for example the degree of parallelism expressed through mac_client_open(). If you have specific functional requirements that are not satisfied by the current document, please list them.
>>
>> Currently rings are hidden resources entirely managed by the mac layer, and clients have no visibility. All the client gets to do is request a degree of parallelism. Providing APIs that allow clients to see how rings were allocated will be useful.
>
> Why? What is the functional requirement?

A client otherwise does not know whether its parallelism request is met using HW rings or SW rings. HW is obviously better than SW. In the case where it gets the latter, it might choose to reduce the degree of parallelism so that it gets all HW rings. Having said that, since the current APIs allow for requesting only HW rings, we can always try HW first and then ask for SW rings only if the first attempt fails and the client is OK with SW rings. Some visibility into this in the future will positively help with optimizations.

>>>> Q3.2) mac_rings_list_get() is only for h/w rings; is there an equivalent interface to obtain s/w ring information? Or can this interface be extended to return both h/w and s/w ring information?
>>>
>>> The interface will evolve to provide that information, but it will remain project private. It is provided here FYI but will change in future revisions of the document.
>>
>> So the expectation is that the ring APIs should not be used by clients, and rings are an internal MAC layer resource managed by it?
>
> Yes, the MAC layer does the allocation of resources to MAC clients and their flows.

Some visibility into this will help in both perf monitoring and policy correction. Instead of looking at this from a single OS instance point of view, if we look at it from the perspective of different OSs, having more info can help better tune for varying traffic loads. Can some of this be made available via some kind of stats-like interface?

>>>> Q3.3) Are the mac_resource_set() and mac_resources() interfaces going away?
>>>
>>> Yes, they will be replaced by different interfaces. But note that they are already project private in Nevada and were not supposed to be used by other ON components.
>> Agreed, but there is no other way in the new Crossbow API to take advantage of multiple rings. There is one generic RX callback, but no other way to associate a callback with a specific ring. This is a limitation of the existing API, and support should be added to the open API list so that we can process traffic independently. Having some additional APIs to expose this would be very useful.
>
> But you can process the traffic independently, since packets will be sent up to the MAC client concurrently for the multiple rings associated with the client. What else do you need to do here specifically that is prevented by the API?

I think this is sufficient for now. In the future, as the flow classification interfaces mature, the ability to associate a ring with a specific classification might be useful. Subsequently, the ability to specify a unique handle for each flow will help the rx_callback process pkts better.

<..snip..>

>>>> Q3.5) Are there any interfaces other than the above mac_rings_xxx interfaces that are available to deal with MAC rings?
>>>
>>> Not available to MAC clients. The set of project private interfaces might evolve as we refine the design.
>>
>> I would like to see some of the functionality exported via these private mac interfaces promoted to an open API. Even if the API cannot be moved over, can we extend the APIs to provide hints ..
>
> Again, what is the functional requirement? What functionality do you need to provide with this information? What hints do you think are missing from the API?

See my response to Q3.1 ..

>>>> Q3.6) Does mac_rings_list_get() return the list of mac rings assigned to the client at the time of client open? How can this be changed after the client is open?
>>>
>>> The set of assigned rings may change. The details of the APIs needed to support this still need to be defined, but they will remain project private.
>>
>> So you are saying there is no way to rely on how many rings are available to a particular client. This will change without the client's control? Is CPUs being removed from the system a case under which this will happen?
>
> The flags taken by mac_client_open() allow some control by the MAC client, see Q1.3. If the client specified that a given CPU be assigned to the client, we could block the DR'ing out of the CPU until the MAC client releases that CPU. What is your requirement here?

I don't think you want to block DR. DR of CPUs happens outside the scope of the kernel, normally from an external control point like a data mgmt center. This control point has little visibility into which CPU in a domain is being used by a MAC client. So instead of preventing DR from happening, the mac_client should be notified that it might lose some of its rings. Alternatively, you can handle this in the same way as when the client_open is done with mbc_cpus=NULL and ncpus > 0: the mac layer can redistribute the rings across the remaining CPUs in the system instead of reducing the number of rings the client currently has.

>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the bandwidth to the number of rings that are assigned to that address. Is there a way to not bind h/w rings to a specific MAC address, so that the bandwidth could be used by any mac client depending on the traffic?
>>>
>>> See Q1.3.
>>
>> Not sure what you mean. Are you suggesting that some mac addresses will have SW rings and others will be associated with HW rings?
>
> Between different MAC clients, that's possible. But within the same MAC client, all unicast addresses of that client will share the same group of hardware rings or SRS.

So when packets for a specific address arrive, will they be processed by a different ring each time? So each ring is synonymous with a CPU resource and handles whichever packet arrives - it has no affinity to specific mac addresses?

>>>> 4) Receive callback related:
>>>> (Crossbow-virt.pdf Section 5.2.5 Pg 40)
>>>> int mac_rx_set(mac_client_handle_t mch, mac_rx_fn_t rx_fn, void *arg);
>>>> int mac_rx_clear(mac_client_handle_t mch);
>>>>
>>>> QUESTIONS:
>>>>
>>>> Q4.1) How can a client get an rx callback per ring that is assigned to the mac client? This will allow parallel processing and improve performance. Such a feature is already being used in the current implementation of the LDoms vSwitch driver, and the mac_xxx interfaces should support such an ability.
>>>
>>> The parallel processing will still happen. I.e. if multiple hardware rings or software rings are assigned to a MAC client, multiple connections associated with that MAC client will be spread across these rings.
>>
>> So with multiple rings, there will be concurrent callbacks to the rx_fn, each with packets from the corresponding ring? Also, will each callback be able to determine which ring did the callback?
>
> Why do you need that information, what is your functional requirement?

I think today the requirement is not relevant, especially because the API does not allow associating each mac address with a separate ring, unless the ring and address are associated with a unique mac_client.
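For reference, the single-callback model we would code to looks like this (a sketch using the documented mac_rx_set(); the callback prototype and the vsw_* names are assumptions, not from the doc):

    /*
     * One RX callback per MAC client; it may be invoked concurrently,
     * once per ring being drained.
     */
    static void
    vsw_rx(void *arg, mblk_t *mp_chain)
    {
            vsw_t *vswp = arg;

            vsw_process_chain(vswp, mp_chain);      /* our own code */
    }

    (void) mac_rx_set(mch, vsw_rx, vswp);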
>>>> Q4.2) How can a client get a separate callback for a defined type of traffic, such as different SAP numbers etc.? This would be useful for providing out-of-band packet processing or related services.
>>>
>>> This will be supported by a MAC flow API built on top of the MAC client API. The flow API will be described by a separate document.
>>
>> So if a client wants to use the flow API, will it need to layer itself on the flow API and not the mac client API directly? Can you give me more information on what this layering will look like? Also, when do you expect the flow API doc to be available?
>
> The flow API will be an addition to the MAC client API. A MAC client will be able to use that flow API. Such a flow operation would be of the form mac_flow_xxx(mac_client_handle_t mch, <flow description>, <bandwidth properties>, etc). Kais is working on defining that API, I'll let him comment on expected availability.

Thanks -- some of the requirements / comments above are tied to the flow API, so clarification on the flow API will help better define the requirements.

>>>> Q5.2) If NULL is specified as a 'hint', how is the tx ring selected?
>>>
>>> In this case mac_tx() will parse the packet headers and hash on the header information to select a transmit ring.
>>
>> Is the goal here to somehow bifurcate traffic being sent by a client via the interface?
>
> The goal is to spread the traffic among the transmit rings assigned to the client while maintaining packet ordering for individual connections, without exposing the details of the assignment of transmit rings to MAC clients.

Another related question ..

Are Tx and Rx rings assigned as a pair to a mac client? Can a client have more Tx rings than Rx rings? What controls this? Does the ncpus parameter control how many Tx rings a client is assigned?

>>>> Q7.2) From the explanation of mac_promisc_add(), it seems like mac_promisc_add() could be called without setting a MAC address via mac_unicast_add(). Is this correct? If so, what is the expected behaviour?
>>>
>>> Currently we provide the same semantics as a switched environment, i.e. a MAC client will see the same traffic that would be seen by a NIC connected to a switch.
>>
>> Is there a way to see only the multicast traffic associated with all mac clients - the union of all mac_client multicast_add addresses? The MULTI promisc option seems more a way to weed out unicast and broadcast traffic on the wire and pass all wire multicast traffic up - including traffic the system may not be interested in. Is this the case?
>
> These broadcast flags apply not only to the incoming received traffic but also to the traffic sent by MAC clients of the same underlying MAC. I.e. a MAC client PROMISC_MULTI callback will also see all multicast traffic sent by the other MAC clients defined on top of the same MAC. In order to preserve the semantics that are implemented by a real physical switch, this applies to *all* multicast traffic, not just the multicast groups that were "joined" by the individual MAC clients.

Ok - thanks for the clarification .. Can you add some text to this effect to the doc also ..

>>> Another option would be to generalize this with the shared ethernet model, and allow a MAC client to specify that it wants to observe all traffic via a separate promiscuous type. I need to see how this can be added to the API.
>>
>> This would be very useful. How about something like MAC_PROMISC_CLIENTS?
>
> The new modes will be:
>
> * ALL: all traffic, including all traffic sent by MAC clients and traffic seen by the hardware
> * FILTERED: all multicast and broadcast traffic, plus the traffic to the unicast MAC addresses associated with the MAC client
> * MULTI: all multicast and broadcast traffic received on the wire and sent by the MAC clients
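If I read that right, usage would be something like this (a sketch only; the constant names are taken from your list above, and the mac_promisc_add() signature and callback type are assumed):

    /*
     * Register a callback for all multicast/broadcast traffic, from
     * the wire and from the other MAC clients on the same MAC.
     */
    err = mac_promisc_add(mch, MAC_PROMISC_MULTI, promisc_cb, arg, &pmh);
    /* ... */
    mac_promisc_remove(pmh);        /* assumed counterpart */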
OK.

Thanks
-Narayan
Hi Narayan,

On Oct 17, 2007, at 5:53 PM, Narayan Venkat wrote:

> Hi Nicolas
>
> Thanks for the response ..
>
>> In general it seems that the issues being discussed are related to either (1) the port of the existing LDOM functionality to the updated MAC client interfaces introduced by Crossbow, or (2) the addition of new LDOM functionality based on these new APIs.
>
> The primary goal for us is to address the first case - the port of the existing LDom functionality already released to customers.

Yes, these are the issues that we need to address for the initial putback.

> But there is new LDoms functionality on the horizon that must be addressed as well. The new features are expected to show up in the same time frame as Crossbow.

If there's additional functionality which is needed by LDOM and which has an impact on Crossbow, then the functional and schedule requirements for these new features need to be clearly communicated to us. If additional work is needed on Crossbow, then we need to discuss the vehicle for these changes, and the potential impact of that work on the existing Crossbow schedule, other existing projects, and future funded projects needs to be evaluated, tracked, and the effort staffed. Unless you give us that information, we can't plan for that work. And we cannot get into detailed design discussions if the functional requirements are not clearly communicated first.

>> So given my latest response below, what are in your opinion the remaining issues which are directly related to the port of the existing LDOM functionality to the new MAC API?
>
> A quick summary of the issues that still need resolving for the port:
>
> - How does the ring-to-CPU mapping change when DR happens?

I think this was answered at length below. Do you have any remaining issues on this topic?

> - How do multiple MAC addresses assigned to a single client correspond to the rings owned by the client?

I don't agree that this is required for the port as part of the initial Crossbow putback. From what I could tell, LDOMs today doesn't allow multiple MAC addresses to be assigned to vnets in the first place.

> - Usage of HW MAC address slots in the NIC, and automatic switching to layer-2 filtering and promiscuous mode.

I think I answered your questions about this point below. Also, today there's no classification done in hardware for LDOMs, whether in promiscuous mode or not. So I don't understand why you consider this a requirement for the initial port.

> - Separation of incoming traffic and its association with Rx rings.

"Association with Rx rings" is a bit vague. As I described in my previous emails and below, individual connections will be fanned out to the rings that are members of the group associated with a MAC client, or the fanout will be done in software between soft rings.

> - Tx ring allocation, its relation to Rx rings, and the level of parallelism.

See Q5.2 below.

<snip>

>>> Also, according to the explanation in the doc at page 38, there is also a case where no flags are specified. It seems like, if no flags are specified, then it will attempt to reserve one hardware ring. It seems not to fail even if such a reservation fails, but this is not clearly specified.
>>
>> If you don't specify the ONE_RING flag, and a hardware ring cannot be reserved for the MAC client, then the MAC client will share the default ring with other MAC clients.
>
> In the current design, if there are N rings and a client open is done, and it requests a HW ring, it will get one assigned from the N-1 rings. If it does not request one, it will still be mapped to a HW ring (if available), else be assigned a SW ring fanned out from the default HW ring - correct?

Correct.

<snip>

>>> The above comment applies to this one also. The behavior without any flags seems to be to attempt to reserve one h/w ring. What is the failure case?
>>
>> Without any flag, we try to allocate N hardware rings. If that fails, then we try to allocate 1 hardware ring and do fanout to N soft rings; if that fails, then we share the default hardware ring and do fanout to N soft rings.
>
> This is the part that I was missing. So when a client requests NCPUS it does get N rings - either all SW or all HW rings. In the case of SW rings it might be a fanout from an allocated HW ring or from the default HW ring. This is clear from chapter 4.x ..

Correct.
> A couple more questions:
>
> - If a NIC has only 1 HW ring, this will essentially become the default HW ring. All other requests are fanned out from this HW ring to SW rings?

Yes.

> - If a NIC has N HW rings, only (N-1) rings are available for allotment; 1 is always reserved as the default HW ring?

Yes.

> - If a NIC has 2 free HW rings, and a client requests 3 rings, the mapping today will be 1 HW ring to 3 SW rings, correct? This will still leave the other HW ring free?

Yes.

>>>> The first one is correct. If mbc_cpus is non-NULL, the MAC layer will assign the CPUs provided by the caller.
>>>
>>> When mbc_cpus is NULL, what determines how many CPUs and hence the number of rings available to this client?
>>
>> mbc_ncpus.
>
> Can you document this? It is not clear from the doc that mbc_ncpus still controls the ring allotment and the degree of parallelism even when mbc_cpus is NULL ..

I will update the document to make it clearer.

> Can we set the ncpus value to a number greater than the actual number of cpus in the domain? Will the MAC layer create the requested number of rings to match ncpus, or will it limit the rings to the number of actual cpus in the domain?

We use the specified ncpus.

>>>>> Q1.6) What is the relationship between Unicast addresses (multiple unicast set via mac_unicast_add()), Rings and CPUs?
>>>>>
>>>>>     - Is there a 1:1 relation between a unicast address and a ring?
>>>>>     - Is there a 1:1 relation between a ring and a CPU?
>>>>
>>>> Neither. The MAC addresses will share the same rings and CPUs.
>>>
>>> But since you are allowing multiple mac addresses to be associated with a client, can we add support as part of the unicast_add call to indicate that each of these addresses should be associated with a ring (either HW or SW)?
>>
>> No, each client is associated with a group of hardware rings or soft rings. Each group or set of rings corresponds to a set of unicast MAC addresses. The bandwidth limits are set on a per MAC client basis. This maps to how hardware NICs do their classification and fanout.
>>
>> If you want a separate set of rings for different MAC addresses, then you create a new MAC client.
>
> If this is the case, what is the real value in being able to assign multiple addresses to a client? Especially when a single client has multiple MAC addresses, coalescing the pkts into a single stream has less benefit than separating the traffic for each address onto its own ring. Since a client has many rings and addresses, being able to treat these as something other than a group of addresses associated with a group of rings will be useful.

Of course there's value in assigning multiple MAC addresses to a single client, even if you share one or more rings within that client. If such a MAC client maps to a front-end driver instance in another domain (vnet in your case, xnf for Xen), that domain can then create multiple VNICs on top of that front-end driver and assign a MAC address to each of these VNICs without having to turn the underlying hardware to promiscuous mode, and establish a path between the VNICs on the domain itself without having to cross the hypervisor boundary.

That said, I don't think anything forces you to have a 1:1 mapping between a vnet instance and a MAC client in the service domain. If what you are trying to do is have separate sets of rings for the clients of a vnet based on their MAC addresses, you could have a vnet map to multiple MAC clients in the service domain, each with their own MAC addresses and separate groups of rings. That vnet instance would then register groups of rings corresponding to the MAC clients in the service domain. These groups would have their own set of rings, and the groups would be assigned by the MAC layer to the MAC clients of vnet.
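In other words, on the service-domain side it would look something like this (a sketch only; the argument lists are abbreviated, not the exact draft signatures):

    /*
     * One vnet backed by two MAC clients: each client gets its own
     * unicast address and therefore its own group of rings.
     */
    mac_client_open(mh, &mch_a, ...);       /* client A */
    mac_unicast_add(mch_a, addr_a, ...);

    mac_client_open(mh, &mch_b, ...);       /* client B */
    mac_unicast_add(mch_b, addr_b, ...);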
> For instance on N2-NIU, if you assign an RDC group to a mac_client, this group can still contain one or more rings. The group can also be assigned multiple MAC addresses. Will the number of groups limit the mac clients that can be created for the specific device? In that case we will want the ability to have traffic from separate MAC addresses spread across the rings in this mac_client.

We will create one group per MAC client, as long as hardware resources are available. We reserve one group as the default group, and software classification will be used on top of the default group to spread the traffic to multiple software rings assigned to the MAC clients sharing the same group. The hardware associates multiple MAC addresses with each group, and each group maps to a MAC client. Once the hardware finds the group associated with a MAC address, it spreads traffic across the rings assigned to that group according to a computed hash on the inbound packet headers.
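Conceptually, per inbound packet the hardware does something like this (pure pseudocode to illustrate the two-level lookup; none of these names are real):

    group = lookup_group_by_dest_mac(pkt);          /* MAC addr -> group */
    ring = group->rings[hash(pkt_headers(pkt)) % group->nrings];
    deliver(ring, pkt);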
>>>>> - The Rings and CPUs are tightly coupled in this interface. How can we allocate multiple rings even when there is one CPU (or a smaller number of CPUs)?
>>>>
>>>> You don't allocate rings explicitly, you express a level of parallelism instead; the framework distributes the hardware rings transparently.
>>>
>>> But the only way we can control this parallelism is by specifying the number of CPUs in the domain. In a system capable of adding and removing CPUs dynamically, we might want to change the parallelism level too. The current APIs don't allow changing this. We will need a way to specify this as an extension to client_open or via a new API call.
>>
>> So you want an API which allows you to change the actual mac_bind_cpu_t for a client which has been already opened? I think we can do that.
>
> Exactly. As a result it should also allocate more rings to correspond to the current set of CPUs. The reverse should also be true. We should similarly be able to reduce mbc_ncpus when CPUs are removed from the system.

OK.

> <..snip..>
>
>>> Also, in terms of parallelism, is this specified by the number of CPUs or by unique CPU IDs in the array? What happens if I specify ncpus where all IDs are the same - do I get ncpus HW rings if they are available? Also, can we then change the ring-to-CPU mapping when more CPUs are added/removed to/from the domain?
>>
>> There should be no duplicate CPU ids in the array.
>
> Will this be checked and an error returned from client_open? Can you also note in the doc that this is an error?

Yes. I will update the doc.

>>>>> Q1.7) How is the binding of CPUs via mac_bind_cpus_t co-ordinated with CPU DR (on the platforms that support it)?
>>>>
>>>> The MAC layer will be notified of the removal of the CPU and will stop using it for its worker threads and interrupts.
>>>
>>> That is purely error handling. We need the ability to use more CPUs and improve the level of parallelism when CPUs are added. The reverse is true when CPUs are removed. When the MAC layer is notified about CPUs going away, does it remove the rings associated with those CPUs?
>>
>> I was not talking specifically about error handling. If the MAC layer bound a ring worker thread or interrupt to a CPU and that CPU is going away, the MAC layer will move that thread or interrupt to a different CPU.
>
> So if a client_open was done with only one CPU in the mbc_cpus array and this CPU is going away, the alternate CPU will be picked in the same manner as if the client_open had been done with mbc_cpus=NULL.

Yes.

>> The API discussed in Q1.6 above would allow a MAC client to increase the number of CPUs if it detects that CPUs were added to the system.
>
> Does this only allow specifying an increased CPU count, or does it also allow the client to specify the CPUs to use for the mapping when more CPUs are added?

It would allow the new CPU set to be specified, but we need to figure out the details of that API.

>>>>> Q1.10) Can the mac client interface be extended to support creating a client based on ether_type? This is required for mac clients like fibre channel over ethernet.
>>>>
>>>> No, each MAC client corresponds to a MAC level entity which is defined by its MAC address. Multiple ether types can be supported on top of a MAC client.
>>>
>>> Devices like the Niagara2 NIU allow classification of packets using parameters like the ether_type. How can a mac_client take advantage of such functionality?
>>
>> The fact that a particular hardware implementation can do classification on a specific header field of a packet doesn't necessarily mean that a MAC client needs to be associated with that field.
>>
>> Today the SAP demultiplexing is done by DLS on top of MAC clients. At some point in the future we may make use of hardware classification to offload that demultiplexing, but that can be done at a level above the MAC layer, maintaining the separation between MAC clients and what defines them (MAC addresses and VLANs), and SAP demultiplexing.
>
> Agreed, that makes sense for SAP demultiplexing. In the near future, opening clients based on ether_type will be important, particularly for FCoE. Interfaces this time next year will be supporting FCoE, and the Leadville stack will need to open a client based on the FCoE ether_type.

The fact that there's a need to do demultiplexing based on SAPs doesn't necessarily mean that the SAP needs to be associated with the MAC client directly. I think this falls into the "future projects" category where the functional requirements need to be clarified. We'll need to work with the FCoE folks on this. At least our current design doesn't prevent that demultiplexing from being added to the MAC layer in the future.
>>>>> Q2.2) Is there an impact to the multiaddress_capab_t.maddr_add()/maddr_remove() interfaces? Are these being obsoleted or going away?
>>>>
>>>> The capability will stay, and the framework will continue to use that capability to query and control the allocation of MAC address slots. However that interface is not intended to be used by drivers, which should use the MAC client interfaces instead.
>>>
>>> OK.
>>
>> Since my last reply Kais and Roamer have been working on the design for the new driver interface. Their proposal removes the multiple MAC address capability as it is known today. You should read their design document, which is available at http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf
>
> Thanks. I saw the email too and am reviewing the doc now ..

Actually, since we're covering this topic, can you clarify for us how LDOMs obtains factory MAC addresses from the interfaces? It doesn't seem that you go through the multiple MAC address capability to do this. We will change that part of the multiple MAC address capability as well, so we need to know if you depend on that interface to obtain the factory MAC addresses.

>>>>> Q2.3) In a system with many domains (aka LDoms) with virtual network devices, a large number of layer-2 addresses is required; this will exhaust the h/w slots available on most standard NICs. How can a client take advantage of layer-2 filtering provided by NICs like N2-NIU/Neptune? Specifically, this will help in avoiding programming the device into PROMISCuous mode etc. Currently there are no interfaces that seem to provide such an ability.
>>>>
>>>> Yes, this is a situation we are aware of. We've talked on this list about having multiple VNICs sharing the same MAC address, and identified by their IP address instead. However this needs to be scoped and defined further before we can commit on providing that functionality.
>>>
>>> The current APIs only allow adding as many addresses as the number of slots available. Beyond this it will put the adapter in promisc mode. Instead, can you add the capability to specify when to use a filter and when to take up a slot in the HW?
>>
>> Do you mean that you want to be able to specify that a mac_unicast_add() should put the NIC in promiscuous mode even though there are MAC address slots available? What is the use case for this?
>
> No, that does not make any sense; I am not asking for that. The number of mac addresses that can be added across all mac clients is restricted to the total number of HW slots in the NIC - correct? If this is not the case, does the MAC layer put the card in promisc mode and filter the MAC addresses in SW?

Yes, we put the card in promiscuous mode if we run out of MAC address slots.

> In the case of HW that allows layer-2 filtering, is there a way the MAC layer takes advantage of this instead of putting the NIC in promisc mode, especially when we run out of HW slots on the NIC?

By layer-2 filtering I guess you mean hardware classification. You still need to put the NIC in promiscuous mode so that it starts receiving traffic for the MAC addresses which do not fit in the hardware MAC address slots. In the case of the NIU, I believe that the packets will still be classified to the right groups even if the card is in promiscuous mode.
>>>>> Q2.7) How are the multiple addresses per client maintained? Is it done in the MAC layer, or does it bypass the MAC layer and get passed to the h/w directly?
>>>>
>>>> Since the action of reserving the MAC address is triggered by a call to the MAC layer, the MAC layer cannot be bypassed. The MAC layer will use the multiple MAC address capability exposed by the driver to reserve a new MAC address slot.
>>>
>>> What if the driver does not expose that capability? Will the unicast_add call fail? Is the MAC layer essentially reflecting the capability of the underlying hardware, or does it provide the ability to have multiple addresses irrespective of whether the HW has multiple slots or not?
>>
>> The request will still succeed if the number of MAC address slots is exhausted, or if the underlying NIC doesn't support the multiple MAC address capability. However, in these cases the MAC layer will transparently put the NIC in promiscuous mode in order to receive traffic for that new MAC unicast address.
>
> Can we take advantage of other HW capabilities like address filtering? See the comment above wrt Q2.3. Also, there are cases where we don't want to switch to promisc mode automatically. Can we add a flag to the unicast_add call and get an error instead of the automatic switch?

Yes, I thought I had already agreed to add it but forgot to document the flag.

> Since there is an API for forcing promisc mode, we can explicitly request promisc mode using that API when needed.

You lost me here. There's no separate API to force promisc mode. There's an API to add promiscuous callbacks; it's different than the older API.

>>>>> Q2.8) Can an unlimited number of mac addresses be assigned to a MAC client? What are the software/hardware features that limit this?
>>>>
>>>> Memory that can be allocated by the kernel.
>>>
>>> So even if the underlying device runs out of slots, the MAC layer will maintain all the addresses associated with that client. How does it then manage and associate these addresses with the rings allocated for this client? What does it do in both software and hardware to filter the addresses for this client? Also, which addresses get HW slots and which don't? And if you run out of slots, does the HW go into promisc mode?
>>
>> Each MAC client is associated with a group of rings. Each group of rings is therefore associated with a set of MAC addresses. If a client needs to be associated with more than one MAC address, then the corresponding group needs to be associated with the same set of addresses. If the hardware runs out of MAC addresses, then the NIC is put in promiscuous mode. The allocation of slots is on a first come, first served basis.
>
> So HW slots are global across all MAC clients. Since allocation is FCFS, one client can potentially consume all HW slots? Also, since transitioning the NIC to promisc mode has an impact on all clients, I think the mac layer should try to do slightly better than FCFS and do something like fair-share, so that it does not give one client all the HW slots. Also, add a flag to prevent automatic switching to promisc mode.

There are cases where internally we might come up with some algorithms to distribute the slots fairly across the clients; however, we want to be able to tune these algorithms as we gain experience with the framework, and avoid pushing the complexity of managing these shared resources to the clients.
>>>>> 3) Rings related:
>>>>> (Crossbow-virt.pdf Section 5.3 Pg 43)
>>>>> mac_ring_t *mac_rings_list_get(mac_client_handle_t mch, uint_t nrings);
>>>>> void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>>>> uint16_t mac_ring_get_flags(mac_ring_t ring);
>>>>>
>>>>> QUESTIONS:
>>>>>
>>>>> Q3.1) All of these interfaces are now categorized as project-private API. What motivated this change? These interfaces need to be more open.
>>>>
>>>> The MAC layer will do the allocation of hardware resources to the various MAC clients and their flows. Instead of having each MAC client manage its own set of resources, the resources are allocated to MAC clients based on their needs, for example the degree of parallelism expressed through mac_client_open(). If you have specific functional requirements that are not satisfied by the current document, please list them.
>>>
>>> Currently rings are hidden resources entirely managed by the mac layer, and clients have no visibility. All the client gets to do is request a degree of parallelism. Providing APIs that allow clients to see how rings were allocated will be useful.
>>
>> Why? What is the functional requirement?
>
> A client otherwise does not know whether its parallelism request is met using HW rings or SW rings. HW is obviously better than SW. In the case where it gets the latter, it might choose to reduce the degree of parallelism so that it gets all HW rings. Having said that, since the current APIs allow for requesting only HW rings, we can always try HW first and then ask for SW rings only if the first attempt fails and the client is OK with SW rings. Some visibility into this in the future will positively help with optimizations.

OK, if you can go with the algorithm that you just described for now, that would be great. We'll look into how we can improve the visibility of resource availability in the future; however, we won't be able to add this feature for our initial putback.
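I.e. something like this on your side (a sketch; only the flag name comes from the document, the rest of the mac_client_open() argument list is assumed, as is the assumption that FORCE_MULTI_RINGS fails rather than falling back when N hardware rings are not available):

    /* First try to get all-hardware rings. */
    err = mac_client_open(mh, &mch, name, &mbc,
        MAC_OPEN_FLAGS_FORCE_MULTI_RINGS);
    if (err != 0 && sw_rings_acceptable) {
            /* Fall back to the default behavior (HW and/or SW rings). */
            err = mac_client_open(mh, &mch, name, &mbc, 0);
    }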
>>>>> Q3.2) mac_rings_list_get() is only for h/w rings; is there an equivalent interface to obtain s/w ring information? Or can this interface be extended to return both h/w and s/w ring information?
>>>>
>>>> The interface will evolve to provide that information, but it will remain project private. It is provided here FYI but will change in future revisions of the document.
>>>
>>> So the expectation is that the ring APIs should not be used by clients, and rings are an internal MAC layer resource managed by it?
>>
>> Yes, the MAC layer does the allocation of resources to MAC clients and their flows.
>
> Some visibility into this will help in both perf monitoring and policy correction. Instead of looking at this from a single OS instance point of view, if we look at it from the perspective of different OSs, having more info can help better tune for varying traffic loads. Can some of this be made available via some kind of stats-like interface?

That's a very complex problem. Even today, in the single OS instance case, we don't fully self-tune according to the workload. Providing this type of capability falls outside the scope of our initial putback. We need an architecture first before starting to export kstats.

<snip>

>>>>> Q3.6) Does mac_rings_list_get() return the list of mac rings assigned to the client at the time of client open? How can this be changed after the client is open?
>>>>
>>>> The set of assigned rings may change. The details of the APIs needed to support this still need to be defined, but they will remain project private.
>>>
>>> So you are saying there is no way to rely on how many rings are available to a particular client. This will change without the client's control? Is CPUs being removed from the system a case under which this will happen?
>>
>> The flags taken by mac_client_open() allow some control by the MAC client, see Q1.3. If the client specified that a given CPU be assigned to the client, we could block the DR'ing out of the CPU until the MAC client releases that CPU. What is your requirement here?
>
> I don't think you want to block DR. DR of CPUs happens outside the scope of the kernel, normally from an external control point like a data mgmt center. This control point has little visibility into which CPU in a domain is being used by a MAC client. So instead of preventing DR from happening, the mac_client should be notified that it might lose some of its rings. Alternatively, you can handle this in the same way as when the client_open is done with mbc_cpus=NULL and ncpus > 0: the mac layer can redistribute the rings across the remaining CPUs in the system instead of reducing the number of rings the client currently has.

Thanks for your input on this, we will rebind the thread to one of the remaining CPUs.

>>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the bandwidth to the number of rings that are assigned to that address. Is there a way to not bind h/w rings to a specific MAC address, so that the bandwidth could be used by any mac client depending on the traffic?
>>>>
>>>> See Q1.3.
>>>
>>> Not sure what you mean. Are you suggesting that some mac addresses will have SW rings and others will be associated with HW rings?
>>
>> Between different MAC clients, that's possible. But within the same MAC client, all unicast addresses of that client will share the same group of hardware rings or SRS.
>
> So when packets for a specific address arrive, will they be processed by a different ring each time? So each ring is synonymous with a CPU resource and handles whichever packet arrives - it has no affinity to specific mac addresses?

For the rings assigned to a MAC client, yes. Note that the hardware is required to use the same RX ring of a group for a given connection, in order to maintain locality and prevent reordering. However, each MAC client will get its own group of rings, and traffic for one client will not spill over to the set of rings of another client. See also Q1.6 above.

<snip>

>>>>> Q4.2) How can a client get a separate callback for a defined type of traffic, such as different SAP numbers etc.? This would be useful for providing out-of-band packet processing or related services.
>>>>
>>>> This will be supported by a MAC flow API built on top of the MAC client API. The flow API will be described by a separate document.
>>>
>>> So if a client wants to use the flow API, will it need to layer itself on the flow API and not the mac client API directly? Can you give me more information on what this layering will look like? Also, when do you expect the flow API doc to be available?
>>
>> The flow API will be an addition to the MAC client API. A MAC client will be able to use that flow API. Such a flow operation would be of the form mac_flow_xxx(mac_client_handle_t mch, <flow description>, <bandwidth properties>, etc). Kais is working on defining that API, I'll let him comment on expected availability.
>
> Thanks -- some of the requirements / comments above are tied to the flow API, so clarification on the flow API will help better define the requirements.

I would think that this should be the other way around :-) you provide the functional requirements, then we can discuss whether the APIs satisfy these requirements.
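For what it's worth, the general shape would be something like this (purely hypothetical names, extrapolated from the mac_flow_xxx() form quoted above; Kais' document will define the real interface):

    /*
     * Hypothetical: attach a flow (e.g. an ether_type or SAP match)
     * with optional bandwidth properties and its own RX callback.
     */
    err = mac_flow_add(mch, &flow_desc, &flow_props, flow_rx_cb, arg, &fh);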
>>>>> Q3.7) Assigning h/w rings to a specific MAC address limits the
>>>>>       bandwidth to the number of rings assigned to that address.
>>>>>       Is there a way not to bind h/w rings to a specific MAC
>>>>>       address, so that the bandwidth can be used by any mac
>>>>>       client depending on the traffic?
>>>>
>>>> See Q1.3.
>>>
>>> Not sure what you mean. Are you suggesting that some mac addresses
>>> will have SW rings and others will be associated with HW rings?
>>
>> Between different MAC clients, that's possible. But within the same
>> MAC client, all unicast addresses of that client will share the same
>> group of hardware rings or SRS.
>
> So when packets for a specific address arrive, will they be processed
> by a different ring each time? Is each ring synonymous with a CPU
> resource, handling whichever packets arrive, with no affinity to
> specific mac addresses?

For the rings assigned to a MAC client, yes. Note that the hardware is required to use the same RX ring of a group for a given connection, in order to maintain locality and prevent reordering. However, each MAC client will get its own group of rings, and traffic for one client will not spill over to the set of rings of the other clients. See also Q1.6 above.

<snip>

>>>>> Q4.2) How can a client get a separate callback for a defined type
>>>>>       of traffic, such as different SAP numbers, etc.? This would
>>>>>       be useful for providing out-of-band packet processing or
>>>>>       related services.
>>>>
>>>> This will be supported by a MAC flow API built on top of the MAC
>>>> client API. The flow API will be described by a separate document.
>>>
>>> So if a client wants to use the flow API, will it need to layer
>>> itself on the flow API rather than on the mac client API directly?
>>> Can you give me more information on what this layering will look
>>> like? Also, when do you expect the flow API doc to be available?
>>
>> The flow API will be an addition to the MAC client API. A MAC client
>> will be able to use that flow API. Such a flow operation would be of
>> the form mac_flow_xxx(mac_client_handle_t mch, <flow description>,
>> <bandwidth properties>, etc.). Kais is working on defining that API;
>> I'll let him comment on expected availability.
>
> Thanks -- some of the requirements / comments above are tied to the
> flow API, so clarification on the flow API will help better define
> the requirements.

I would think it should be the other way around :-) You provide the functional requirements, and then we can discuss whether the APIs satisfy them.

>>>>> Q5.2) If NULL is specified as a 'hint', how is the tx ring
>>>>>       selected?
>>>>
>>>> In this case mac_tx() will parse the packet headers and hash on
>>>> the header information to select a transmit ring.
>>>
>>> Is the goal here to somehow bifurcate traffic being sent by a
>>> client via the interface?
>>
>> The goal is to spread the traffic among the transmit rings assigned
>> to the client while maintaining packet ordering for individual
>> connections, without exposing the details of the assignment of
>> transmit rings to MAC clients.
>
> Another related question: are Tx and Rx rings assigned as a pair to a
> mac client? Can a client have more Tx rings than Rx rings? What
> controls this? Does the ncpus parameter control how many Tx rings a
> client is assigned?

First we try to allocate ncpus hardware TX rings. If fewer than ncpus are available, it's the same algorithm as for receive rings, i.e. we assign one TX ring to the client. If we run out of TX rings, we fall back to a default TX ring. We'll add flags to mac_client_open() similar to the ones we already have for receive ring allocation. Note that since the hardware cannot guarantee that the number of TX rings is always the same as the number of RX rings, it is not possible to guarantee that each RX ring will map to a TX ring.
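To illustrate why hashing on packet headers both spreads load and preserves per-connection ordering, here is a toy version of the selection step. This is illustrative only, not the actual mac_tx() code, and the header structure is hypothetical:

#include <sys/types.h>

/*
 * Toy illustration: the same connection always hashes to the same
 * ring (so its packets are never reordered), while different
 * connections are spread across the available TX rings.
 */
typedef struct conn_hdrs {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
} conn_hdrs_t;

static uint_t
tx_ring_select(const conn_hdrs_t *h, uint_t nrings)
{
	uint32_t hash;

	/* Same connection => same hash => same ring. */
	hash = h->src_ip ^ h->dst_ip ^ h->src_port ^ h->dst_port;
	return (hash % nrings);
}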
>>>>> Q7.2) From the explanation of mac_promisc_add(), it seems that
>>>>>       mac_promisc_add() could be called without setting a MAC
>>>>>       address via mac_unicast_add(). Is this correct? If so, what
>>>>>       is the expected behaviour?
>>>>
>>>> Currently we provide the same semantics as a switched environment,
>>>> i.e. a MAC client will see the same traffic that would be seen by
>>>> a NIC connected to a switch.
>>>
>>> Is there a way to see only the multicast traffic associated with
>>> all mac clients, i.e. the union of all mac_client multicast_add
>>> addresses? The MULTI promisc option seems more a way to weed out
>>> unicast and broadcast traffic on the wire and pass all wire
>>> multicast traffic up, including groups the system may not be
>>> interested in. Is that the case?
>>
>> These promisc flags apply not only to incoming received traffic but
>> also to the traffic sent by MAC clients of the same underlying MAC.
>> I.e. a MAC client PROMISC_MULTI callback will also see all multicast
>> traffic sent by the other MAC clients defined on top of the same
>> MAC. In order to preserve the semantics that are implemented by a
>> real physical switch, this applies to *all* multicast traffic, not
>> just the multicast groups that were "joined" by the individual MAC
>> clients.
>
> Ok - thanks for the clarification. Can you add some text to the doc
> to this effect?

Will do.
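As a rough illustration of the semantics above, registering a MULTI-promisc callback might look like the following. The mac_promisc_add() prototype and the callback signature are assumptions based on the draft, not the final interface:

#include <sys/stream.h>
#include <sys/mac_client.h>	/* assumed header for the draft API */

/*
 * Assumed callback shape: sees *all* multicast on the MAC, including
 * frames sent by sibling MAC clients (loopback is B_TRUE for those).
 */
static void
vsw_mcast_snoop(void *arg, mac_resource_handle_t mrh, mblk_t *mp,
    boolean_t loopback)
{
	mblk_t *next;

	while (mp != NULL) {
		next = mp->b_next;
		mp->b_next = NULL;
		/* ... inspect the multicast frame here ... */
		freemsg(mp);
		mp = next;
	}
}

static int
vsw_watch_multicast(mac_client_handle_t mch, mac_promisc_handle_t *mphp)
{
	/* Prototype assumed from the draft under review. */
	return (mac_promisc_add(mch, MAC_CLIENT_PROMISC_MULTI,
	    vsw_mcast_snoop, NULL, mphp, 0));
}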
<snip>

Thanks,
Nicolas.

--
Nicolas Droux - Solaris Core OS - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux


Nicolas,

On Oct 19, 2007, at 3:33 PM, Nicolas Droux wrote:

> <snip>
>
>> - How multiple mac addresses assigned to a single client correspond
>>   to the rings owned by the client
>
> I don't agree that this is required for the port as part of the
> initial Crossbow putback. From what I could tell, LDOMs today doesn't
> allow multiple MAC addresses to be assigned to vnets in the first
> place.

The issue is not multiple MAC addresses for a vnet. It's that the vSwitch fronts multiple MAC addresses for many vnet clients. Hence, if a vswitch is a single MAC client, it needs to be assigned multiple MAC addresses in order to distribute packets to multiple L2 destinations. This is existing functionality in the vSwitch.

>> - Usage of HW mac addr slots in the NIC and automatic switching to
>>   layer-2 filtering and promisc mode.
>
> I think I answered your questions about this point below. Also, today
> there's no classification done in hardware for LDOMs, whether in
> promiscuous mode or not. So I don't understand why you consider this
> a requirement for the initial port.

There may be a terminology issue here. The existing vSwitch uses L2 hardware classification in nxge, bge, and e1000g, for example, to filter incoming packets. Each domain's L2 address is placed in a hardware address slot. If the slots are exhausted, the switch then decides whether to put the interface into promiscuous mode or not.

Regards,
Michael
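The slot-exhaustion policy Michael describes could be sketched as below. This is a simplified illustration, not actual vSwitch code; the slot count and the structure are hypothetical:

#include <sys/types.h>
#include <sys/systm.h>

#define	NSLOTS	16	/* hypothetical number of HW addr slots */

typedef struct l2_filter {
	uint8_t		slots[NSLOTS][6];	/* programmed HW slots */
	uint_t		nused;
	boolean_t	promisc;	/* fell back to promisc + SW filter */
} l2_filter_t;

static void
l2_filter_add(l2_filter_t *f, const uint8_t mac[6])
{
	if (f->nused < NSLOTS) {
		/* A free slot remains: let the NIC filter in hardware. */
		bcopy(mac, f->slots[f->nused++], 6);
	} else {
		/* Slots exhausted: accept everything, filter in software. */
		f->promisc = B_TRUE;
	}
}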
Michael,

On Oct 19, 2007, at 4:50 PM, Michael Speer wrote:

> Nicolas,
>
> On Oct 19, 2007, at 3:33 PM, Nicolas Droux wrote:
>
>> <snip>
>>
>>> - How multiple mac addresses assigned to a single client correspond
>>>   to the rings owned by the client
>>
>> I don't agree that this is required for the port as part of the
>> initial Crossbow putback. From what I could tell, LDOMs today
>> doesn't allow multiple MAC addresses to be assigned to vnets in the
>> first place.
>
> The issue is not multiple MAC addresses for a vnet. It's that the
> vSwitch fronts multiple MAC addresses for many vnet clients. Hence,
> if a vswitch is a single MAC client, it needs to be assigned multiple
> MAC addresses in order to distribute packets to multiple L2
> destinations. This is existing functionality in the vSwitch.

There are two things being discussed here. One is associating more than one unicast MAC address with a MAC client; the interface being reviewed allows this. The other is, within a MAC client, assigning individual hardware rings to the unicast addresses of that same MAC client; that we cannot support. To do this, you need to create multiple MAC clients, each with its own set of rings.

The MAC layer is designed to do the multiplexing and virtualization among multiple MAC clients, not within a MAC client. Having the vswitch be one MAC client and reimplement some of the same functionality as the MAC layer would be inefficient, and wouldn't allow it to take advantage of the features now provided by the MAC layer. In order to take advantage of Crossbow, the vswitch should associate one or more MAC clients in the service domain with each vnet.

>>> - Usage of HW mac addr slots in the NIC and automatic switching to
>>>   layer-2 filtering and promisc mode.
>>
>> I think I answered your questions about this point below. Also,
>> today there's no classification done in hardware for LDOMs, whether
>> in promiscuous mode or not. So I don't understand why you consider
>> this a requirement for the initial port.
>
> There may be a terminology issue here. The existing vSwitch uses L2
> hardware classification in nxge, bge, and e1000g, for example, to
> filter incoming packets.

That's not what we call hardware classification in Crossbow terminology. What we refer to as hardware classification is directing traffic to one or more rings based on the content of the packet headers.

> Each domain's L2 address is placed in a hardware address slot. If the
> slots are exhausted, the switch then decides whether to put the
> interface into promiscuous mode or not.

This is what the new MAC layer does as well. Did Narayan mean "switching *between* L2 filtering and promisc mode" rather than "switching *to* L2 filtering and promisc mode", then?

Nicolas.

--
Nicolas Droux - Solaris Core OS - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux
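A hedged sketch of the per-vnet structure recommended above, reusing the hypothetical prototypes from the earlier ring-allocation sketch (mac_client_open(), mac_client_bind_t) plus an assumed mac_unicast_add() taking the client handle and the address — none of these signatures are the committed API:

#include <sys/types.h>
#include <sys/systm.h>
#include <sys/mac_client.h>	/* assumed header for the draft API */

/*
 * One MAC client per vnet, each with its own unicast address and
 * hence its own group of rings, instead of one client for the whole
 * vswitch.  Prototypes are assumptions based on this thread.
 */
static int
vsw_add_vnet(const char *dev, const uint8_t *vnet_mac, uint_t ncpus,
    mac_client_handle_t *mchp)
{
	mac_client_bind_t mbc;
	int err;

	bzero(&mbc, sizeof (mbc));
	mbc.mbc_ncpus = ncpus;	/* per-vnet degree of parallelism */
	mbc.mbc_cpus = NULL;

	/* A dedicated MAC client for this vnet. */
	if ((err = mac_client_open(dev, &mbc, 0, mchp)) != 0)
		return (err);

	/* The vnet's address inherits this client's ring group. */
	return (mac_unicast_add(*mchp, vnet_mac));
}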