I've been thinking about hardware that has multiple transmit rings ("tx
resources").

We really should have a way to expose this up to the stack. And
ideally, the stack should guarantee that a given flow will always be
sent down using the same hardware tx resource.

I've heard that crossbow will deliver this, but I can't find evidence of
it in the crossbow gate. Am I missing something? Is it functionality
yet to be added, or is it not planned?

The other problem I've heard about from PAE is that one potential
approach drivers could use today, which is to map the flow by hashing
the sending CPU (which one would expect not to change for a given flow),
is doomed to suffer packet reordering. Apparently the problem is that
application threads can get bounced around between CPUs by the
scheduler pretty freely (more so than one would think), and the result
is that you can't assume that the sending CPU will be reasonably static
for a given flow. (I gotta think this wreaks havoc on the caches
involved... but that's a different problem.)

_If_ transmitted packets are sent to the stack and always land in a
delivery queue, then perhaps the outbound queue (squeue?) can have a
worker thread that doesn't migrate around. But in order for that to
happen, we have to stop having sending threads deliver right to the
driver when intervening queues are empty.

I _think_ this will work better for throughput. It may hurt latency
slightly though. I haven't measured the latencies involved with queuing
as opposed to direct delivery through the driver's xxx_send/xxx_start
routine, but I'd be curious to know if others here have.

Anyway, let me know your thoughts.

-- Garrett
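To make the two approaches concrete, here is a minimal C sketch (illustrative
only; the flow_t structure and helper names are hypothetical, not Solaris
interfaces) contrasting ring selection by sending CPU with ring selection by
a hash of the flow's addresses and ports, which stays stable even when the
scheduler migrates the sending thread:

#include <sys/types.h>

#define	NUM_TX_RINGS	8

/* Hypothetical connection state; a real stack would use its own conn_t. */
typedef struct flow {
	uint32_t	f_saddr;	/* source IPv4 address */
	uint32_t	f_daddr;	/* destination IPv4 address */
	uint16_t	f_sport;	/* source port */
	uint16_t	f_dport;	/* destination port */
} flow_t;

/*
 * CPU-hash approach: pick the ring from the CPU the sender happens to be
 * running on.  If the scheduler migrates the thread mid-stream, packets of
 * the same flow can land on different rings and go out of order.
 */
static uint_t
tx_ring_by_cpu(uint_t cpu_id)
{
	return (cpu_id % NUM_TX_RINGS);
}

/*
 * Flow-hash approach: derive the ring from the flow's addresses and ports.
 * The hash is a property of the flow itself, so every packet of that flow
 * maps to the same ring no matter which CPU the sending thread runs on.
 */
static uint_t
tx_ring_by_flow(const flow_t *fp)
{
	uint32_t hash;

	hash = fp->f_saddr ^ fp->f_daddr ^
	    (((uint32_t)fp->f_sport << 16) | fp->f_dport);
	hash ^= hash >> 16;		/* fold high bits down */

	return (hash % NUM_TX_RINGS);
}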
Garrett,

Garrett D'Amore wrote:
> I've been thinking about hardware that has multiple transmit rings ("tx
> resources").
>
> We really should have a way to expose this up to the stack. And
> ideally, the stack should guarantee that a given flow will always be
> sent down using the same hardware tx resource.
>
> I've heard that crossbow will deliver this, but I can't find evidence of
> it in the crossbow gate. Am I missing something? Is it functionality
> yet to be added, or is it not planned?

It's designed in, but the code has yet to make it into the Crossbow gate.
I think parts of it are sitting in Roamer and Gopi's workspaces.

> The other problem I've heard about from PAE is that one potential
> approach drivers could use today, which is to map the flow by hashing
> the sending CPU (which one would expect not to change for a given flow),
> is doomed to suffer packet reordering. Apparently the problem is that
> application threads can get bounced around between CPUs by the
> scheduler pretty freely (more so than one would think), and the result
> is that you can't assume that the sending CPU will be reasonably static
> for a given flow. (I gotta think this wreaks havoc on the caches
> involved... but that's a different problem.)
>
> _If_ transmitted packets are sent to the stack and always land in a
> delivery queue, then perhaps the outbound queue (squeue?) can have a
> worker thread that doesn't migrate around. But in order for that to
> happen, we have to stop having sending threads deliver right to the
> driver when intervening queues are empty.

This doesn't really apply to forwarding traffic, and in the case of
traffic terminating on the host, the application thread is very rarely
able to reach the driver directly (it's about 17-18% of the time on web
workloads). The times it does, it means that there was nothing else to
do anyway, and it's better to let the thread go through instead of doing
a context switch.

> I _think_ this will work better for throughput. It may hurt latency
> slightly though. I haven't measured the latencies involved with queuing
> as opposed to direct delivery through the driver's xxx_send/xxx_start
> routine, but I'd be curious to know if others here have.

Yes, you are discussing the FireEngine design here. The ARC case has a
detailed document which discusses all these things. Can't remember the
case number, but search for FireEngine.

Cheers,
Sunay

> Anyway, let me know your thoughts.
>
> -- Garrett

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow: http://www.opensolaris.org/os/project/crossbow
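As a rough illustration of the drain-or-queue decision Sunay describes, here
is a simplified sketch; the real squeue machinery is considerably more
involved, and sq_queue_t and the helper routines below are hypothetical
stand-ins rather than actual kernel interfaces:

#include <stdbool.h>
#include <stddef.h>

typedef struct packet packet_t;

typedef struct sq_queue {
	packet_t	*sq_head;	/* packets awaiting the worker */
	bool		sq_worker_busy;	/* worker thread currently draining */
} sq_queue_t;

void	driver_tx(packet_t *);		/* hand packet to xxx_send/xxx_start */
void	sq_enqueue(sq_queue_t *, packet_t *);
void	sq_wake_worker(sq_queue_t *);

void
stack_send(sq_queue_t *sqp, packet_t *pkt)
{
	/*
	 * If nothing is queued and no worker is active, the sending thread
	 * goes straight to the driver: no context switch, lowest latency.
	 * Sunay's point is that this path is taken only a minority of the
	 * time (about 17-18% on web workloads), and when it is taken the
	 * link was idle anyway, so ordering is not at risk.
	 */
	if (sqp->sq_head == NULL && !sqp->sq_worker_busy) {
		driver_tx(pkt);
		return;
	}

	/*
	 * Otherwise queue the packet and let the (non-migrating) worker
	 * thread drain it, which is the path Garrett prefers for keeping
	 * a flow on a single tx ring.
	 */
	sq_enqueue(sqp, pkt);
	sq_wake_worker(sqp);
}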
Sunay Tripathi wrote:
> Garrett,
>
> Garrett D'Amore wrote:
>> I've been thinking about hardware that has multiple transmit rings
>> ("tx resources").
>>
>> We really should have a way to expose this up to the stack. And
>> ideally, the stack should guarantee that a given flow will always be
>> sent down using the same hardware tx resource.
>>
>> I've heard that crossbow will deliver this, but I can't find evidence
>> of it in the crossbow gate. Am I missing something? Is it
>> functionality yet to be added, or is it not planned?
>
> It's designed in, but the code has yet to make it into the Crossbow gate.
> I think parts of it are sitting in Roamer and Gopi's workspaces.

Okay. Are there any design documents which provide the overall view of
this? I've read bits and pieces of crossbow, and the marketing
literature, but I'd really like to have details all the way down to the
driver API level.

>> The other problem I've heard about from PAE is that one potential
>> approach drivers could use today, which is to map the flow by hashing
>> the sending CPU (which one would expect not to change for a given
>> flow), is doomed to suffer packet reordering. Apparently the problem
>> is that application threads can get bounced around between CPUs
>> by the scheduler pretty freely (more so than one would think), and
>> the result is that you can't assume that the sending CPU will be
>> reasonably static for a given flow. (I gotta think this wreaks havoc
>> on the caches involved... but that's a different problem.)
>>
>> _If_ transmitted packets are sent to the stack and always land in a
>> delivery queue, then perhaps the outbound queue (squeue?) can have a
>> worker thread that doesn't migrate around. But in order for that to
>> happen, we have to stop having sending threads deliver right to the
>> driver when intervening queues are empty.
>
> This doesn't really apply to forwarding traffic

Agreed. Although if we use multiple rings for forwarding, we still have
to be careful to minimize reordering of the forwarded streams.

> and in the case of traffic
> terminating on the host, the application thread is very rarely able to
> reach the driver directly (it's about 17-18% of the time on web
> workloads). The times it does, it means that there was nothing else to
> do anyway, and it's better to let the thread go through instead of
> doing a context switch.

I think this is a fallacy, even if you have observed it, because it
ignores another potential location of queuing, which is the device
driver (and the hardware) itself. For example, some of the hardware
rings have fairly deep tx queues -- up to 1,000 packets or more in some
cases -- which can lead to incorrect assumptions about just how busy the
link really is. And if you have multiple such rings, it's really,
really important to get the ordering right.

I also fear that the attempt to "let the packet pass thru" is an
optimization for the case of a lightly loaded environment, without
regard to the impact it places upon the driver.

Essentially, what I'm saying is, I am concerned that a design that
requires the NIC driver to consider load balancing and flow management
is inherently busted. It's much, much better, I think, if the ordering
and ring scheduling considerations are handled by the stack, without
any brains whatsoever on the part of the driver. Anything else leads to
either a lot of wasted driver cycles, or drivers that make poor
decisions because they don't have sufficient information.
I think we can see a bit of both in at least two of the drivers that
support multiple tx rings: nxge and ce.

This also leads, I think, to some of the craziness that PAE has to do to
manually tune the device drivers. We really, I think, should be looking
at ways to remove driver tuning from the steps that customers have to
take to get good performance.

>> I _think_ this will work better for throughput. It may hurt latency
>> slightly though. I haven't measured the latencies involved with
>> queuing as opposed to direct delivery through the driver's
>> xxx_send/xxx_start routine, but I'd be curious to know if others here
>> have.
>
> Yes, you are discussing the FireEngine design here. The ARC case has a
> detailed document which discusses all these things. Can't remember the
> case number, but search for FireEngine.

Thanks, I'll investigate further.

-- Garrett

> Cheers,
> Sunay
>
>> Anyway, let me know your thoughts.
>>
>> -- Garrett
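To illustrate the division of labor Garrett is arguing for, here is a
hypothetical sketch of a "dumb" per-ring transmit interface in which the
stack owns the flow-to-ring mapping and the driver only exposes a transmit
entry point per ring; none of these structures correspond to the actual
GLDv3 or Crossbow driver APIs:

#include <sys/types.h>

typedef struct packet packet_t;

/* Per-ring transmit entry point registered by the driver. */
typedef int (*ring_tx_fn_t)(void *ring_private, packet_t *pkt);

typedef struct tx_ring {
	void		*tr_private;	/* driver's per-ring state */
	ring_tx_fn_t	tr_tx;		/* driver's xxx_send for this ring */
} tx_ring_t;

typedef struct nic {
	uint_t		n_nrings;	/* rings advertised by the driver */
	tx_ring_t	*n_rings;	/* array of n_nrings rings */
} nic_t;

/*
 * The stack computes the flow hash once (e.g. from the 5-tuple) and keeps
 * it in its per-connection state, so every packet of a flow lands on the
 * same ring regardless of which CPU the sending thread runs on.  The
 * driver never sees the hash and makes no scheduling decision.
 */
int
stack_tx(nic_t *nicp, uint32_t flow_hash, packet_t *pkt)
{
	tx_ring_t *ringp = &nicp->n_rings[flow_hash % nicp->n_nrings];

	return (ringp->tr_tx(ringp->tr_private, pkt));
}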
Sunay Tripathi wrote:
>> I _think_ this will work better for throughput. It may hurt latency
>> slightly though. I haven't measured the latencies involved with
>> queuing as opposed to direct delivery through the driver's
>> xxx_send/xxx_start routine, but I'd be curious to know if others here
>> have.
>
> Yes, you are discussing the FireEngine design here. The ARC case has a
> detailed document which discusses all these things. Can't remember the
> case number, but search for FireEngine.

PSARC/2002/433 FireEngine: A new architecture in Networking

Kais

> Cheers,
> Sunay