Santos, Jose Renato G
2009-Feb-17  03:23 UTC
[Xen-devel] [PATCH] Netchannel2 optimizations [2/4]
This applies to the latest netchannel2 tree.

This patch uses the new packet message flag created in the previous patch to request an event only every N fragments. N needs to be less than the maximum number of fragments that we can send, or we may get stuck. The default N in this patch is 192, while the maximum number of fragments that we can send is 256.

There is a small issue with this code. If we have a single UDP stream and the maximum TX socket buffer size allowed by the kernel in the sender guest is not sufficient to consume N fragments (192 for now), the communication may stop until some other stream sends packets in either the TX or RX direction. This should not be an issue with TCP, since we will always have ACKs being received, which will cause events to be generated. We will need to fix this sometime soon, but it is such an unlikely scenario in practice that we may let the code get into the netchannel2 tree for now, especially because the code is still experimental. But Steven has the final word on that.

A possible fix for this issue is to set the event request flag when we send a packet and the sender socket buffer is full. I just did not have the time to look into the Linux socket buffer code to figure out how to do that, but it should not be difficult once we understand the code.

Renato

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
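A minimal sketch of the policy this message describes, written as a toy Python model rather than the kernel C of the actual patch. The function names and the idea of measuring the socket buffer in fragments are assumptions made for illustration; only the values 192 and 256 come from the message.

```python
# Toy model: the sender asks the peer for an event only once every N
# fragments. All names here are hypothetical, not taken from netchannel2.
EVENT_EVERY_N_FRAGS = 192   # default N quoted in the message
MAX_FRAGS_IN_FLIGHT = 256   # maximum number of fragments that can be sent


def event_flags(frags_per_packet, n=EVENT_EVERY_N_FRAGS):
    """Return one bool per packet: True if that packet requests an event."""
    flags = []
    frags_since_event = 0
    for frags in frags_per_packet:
        frags_since_event += frags
        if frags_since_event >= n:
            flags.append(True)       # this packet carries the event request
            frags_since_event = 0
        else:
            flags.append(False)
    return flags


def can_stall(socket_buffer_frags, n=EVENT_EVERY_N_FRAGS):
    """A lone sender can stall if its socket buffer fills before N fragments
    have been queued, because no packet ever carries the event request."""
    return socket_buffer_frags < n
```

With single-fragment packets, `event_flags([1] * 400)` marks only the 192nd and 384th packets, and `can_stall(100)` is true, which is the single-UDP-stream case described in the message.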
Steven Smith
2009-Feb-17  17:56 UTC
[Xen-devel] Re: [PATCH] Netchannel2 optimizations [2/4]
> This patch uses the new packet message flag created in the previous patch to request an event only every N fragments. N needs to be less than the maximum number of fragments that we can send or we may get stuck. The default number of fragments in this patch is 192 while the maximum number of fragments that we can send is 256.
>
> There is a small issue with this code. If we have a single UDP stream and the maximum TX socket buffer size limited by the kernel in the sender guest is not sufficient to consume N fragments (192 for now) the communication may stop until some other stream sends packets in either the TX or RX direction. This should not be an issue with TCP since we will always have ACKs being received, which will cause events to be generated. We will need to fix this sometime soon, but it is such an unlikely scenario in practice that we may let the code get into the netchannel2 tree for now, especially because the code is still experimental. But Steven has the final word on that.

I've applied the patch, along with the others in the series. As you say, this isn't really good enough for a final solution, as it stands, but it'll do for now.

> A possible fix to this issue is to set the event request flag when we send a packet and the sender socket buffer is full. I just did not have the time to look into the linux socket buffer code to figure out how to do that, but it should not be difficult once we understand the code.

I'm not convinced by this fix. It'll certainly solve the particular case of a UDP blast, but I'd be worried that there might be some other buffering somewhere, in e.g. the queueing discipline or somewhere in iptables. Fixing any particular instance probably wouldn't be very tricky, but it'd be hard to be confident you'd got all of them, and it just sounds like a bit of a rat hole of complicated and hard-to-reproduce bugs.

Since this is likely to be a rare case, I'd almost be happy just using e.g. a 1Hz ticker to catch things when they look like they've gone south. Performance will suck, but this should be a very rare workload, so that's not too much of a problem.

Does that sound plausible?

Steven.
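The effect of such a ticker can be illustrated with a small discrete-time model. This is only a sketch under stated assumptions: time is counted in abstract steps, the sender queues at most one fragment per step, the socket buffer is measured in fragments, and the ticker resets the fragment counter when it fires. The function and its parameters are hypothetical and are not taken from the netchannel2 code.

```python
def simulate(steps, socket_buffer_frags, n=192, ticker_period=None):
    """Count fragments completed by a single sender over `steps` time steps.

    The sender may have at most `socket_buffer_frags` fragments outstanding.
    Outstanding fragments complete either when a fragment carries the event
    request (every `n` fragments) or when the ticker prods the ring.
    """
    completed = outstanding = since_event = 0
    for step in range(1, steps + 1):
        if outstanding < socket_buffer_frags:
            outstanding += 1
            since_event += 1
            if since_event >= n:
                # This fragment requests an event, so everything completes.
                completed += outstanding
                outstanding = 0
                since_event = 0
        if ticker_period and step % ticker_period == 0 and outstanding:
            # The ticker only acts when messages are still outstanding.
            # Resetting the counter here is a simplification of this model.
            completed += outstanding
            outstanding = 0
            since_event = 0
    return completed
```

In this model, a buffer of 100 fragments with no ticker completes nothing in 10000 steps, while a ticker firing every 1000 steps lets 1000 fragments through: slow, but no longer stuck.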
Santos, Jose Renato G
2009-Feb-17  18:16 UTC
[Xen-devel] RE: [PATCH] Netchannel2 optimizations [2/4]
> -----Original Message-----
> From: Steven Smith [mailto:steven.smith@citrix.com]
> Sent: Tuesday, February 17, 2009 9:56 AM
> To: Santos, Jose Renato G
> Cc: xen-devel@lists.xensource.com; Steven Smith
> Subject: Re: [PATCH] Netchannel2 optimizations [2/4]
>
> > This patch uses the new packet message flag created in the previous patch to request an event only every N fragments. N needs to be less than the maximum number of fragments that we can send or we may get stuck. The default number of fragments in this patch is 192 while the maximum number of fragments that we can send is 256.
> >
> > There is a small issue with this code. If we have a single UDP stream and the maximum TX socket buffer size limited by the kernel in the sender guest is not sufficient to consume N fragments (192 for now) the communication may stop until some other stream sends packets in either the TX or RX direction. This should not be an issue with TCP since we will always have ACKs being received, which will cause events to be generated. We will need to fix this sometime soon, but it is such an unlikely scenario in practice that we may let the code get into the netchannel2 tree for now, especially because the code is still experimental. But Steven has the final word on that.
>
> I've applied the patch, along with the others in the series. As you say, this isn't really good enough for a final solution, as it stands, but it'll do for now.
>
> > A possible fix to this issue is to set the event request flag when we send a packet and the sender socket buffer is full. I just did not have the time to look into the linux socket buffer code to figure out how to do that, but it should not be difficult once we understand the code.
>
> I'm not convinced by this fix. It'll certainly solve the particular case of a UDP blast, but I'd be worried that there might be some other buffering somewhere, in e.g. the queueing discipline or somewhere in iptables. Fixing any particular instance probably wouldn't be very tricky, but it'd be hard to be confident you'd got all of them, and it just sounds like a bit of a rat hole of complicated and hard-to-reproduce bugs.
>
> Since this is likely to be a rare case, I'd almost be happy just using e.g. a 1Hz ticker to catch things when they look like they've gone south. Performance will suck, but this should be a very rare workload, so that's not too much of a problem.
>
> Does that sound plausible?

Yes, a low-frequency periodic timer is a good idea. We could also make the number of fragments that generate an event a configurable parameter, so that it can be adjusted (right now it is a constant). A sysadmin would then have the option to configure it with a value compatible with the default socket buffer. What about combining the timer with a configurable parameter?

Renato

> Steven.
Steven Smith
2009-Feb-18  12:35 UTC
[Xen-devel] Re: [PATCH] Netchannel2 optimizations [2/4]
> > > A possible fix to this issue is to set the event request flag when we send a packet and the sender socket buffer is full. I just did not have the time to look into the linux socket buffer code to figure out how to do that, but it should not be difficult once we understand the code.
> >
> > I'm not convinced by this fix. It'll certainly solve the particular case of a UDP blast, but I'd be worried that there might be some other buffering somewhere, in e.g. the queueing discipline or somewhere in iptables. Fixing any particular instance probably wouldn't be very tricky, but it'd be hard to be confident you'd got all of them, and it just sounds like a bit of a rat hole of complicated and hard-to-reproduce bugs.
> >
> > Since this is likely to be a rare case, I'd almost be happy just using e.g. a 1Hz ticker to catch things when they look like they've gone south. Performance will suck, but this should be a very rare workload, so that's not too much of a problem.
> >
> > Does that sound plausible?
>
> Yes, a low frequency periodic timer is a good idea.

Okay, there's now a 1Hz ticker which just goes and prods the ring if there are any messages outstanding. As expected, performance is dire if you're relying on it to actually force packets out (~180 packets a second), but it does avoid the deadlock.

I've also added a (very stupid) adaptation scheme which tries to adjust the max_count_frags_no_event parameter to avoid hitting the deadlock too often in the first place. It seems to do broadly the right thing for both UDP floods and TCP stream tests, but it probably wouldn't be very hard to come up with some workload for which it falls over.

> We could also make the number of fragments that generate an event a configurable parameter that could be adjusted (right now it is a constant). So a sys admin would have an option to configure it with a value compatible with the default socket buffer. What about combining the timer with a configurable parameter?

I guess it wouldn't hurt to make this stuff configurable, although I think you may be overestimating the average sysadmin if you think they're going to know the default socket buffer size (hell, *I* don't know the default socket buffer size).

Steven.
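The message does not say how the adaptation scheme works, so the following is purely a hypothetical illustration of one way a threshold such as max_count_frags_no_event could be adjusted: halve it whenever the ticker has to rescue stuck messages, and grow it by one after each event delivered through the normal path. The class name, the method names, and the halve/increment rule are all assumptions; this is not the scheme that was committed.

```python
class FragThreshold:
    """Hypothetical adaptive stand-in for max_count_frags_no_event."""

    def __init__(self, initial=192, floor=1, ceiling=192):
        self.value = initial
        self.floor = floor        # never request events less often than this
        self.ceiling = ceiling    # must stay below the 256-fragment maximum

    def on_ticker_rescue(self):
        """The ticker found stuck messages: request events more often."""
        self.value = max(self.floor, self.value // 2)

    def on_normal_event(self):
        """An event arrived through the normal path: relax slowly."""
        self.value = min(self.ceiling, self.value + 1)
```

A rule of this shape backs off quickly when the deadlock is hit and recovers slowly, so a workload that alternates between the two regimes would keep the threshold low, which is one plausible way such a scheme could "fall over" on performance.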
Santos, Jose Renato G
2009-Feb-19  07:32 UTC
[Xen-devel] RE: [PATCH] Netchannel2 optimizations [2/4]
> I've also added a (very stupid) adaptation scheme which tries to adjust the max_count_frags_no_event parameter to avoid hitting the deadlock too often in the first place. It seems to do broadly the right thing for both UDP floods and TCP stream tests, but it probably wouldn't be very hard to come up with some workload for which it falls over.

OK, I will test how this works on 10 gig NICs when I have some time. I am currently doing some tests on Intel 10 gig ixgbe NICs and I am seeing some behaviour that I cannot explain (without this adaptation patch). Netperf is not able to saturate the link, and at the same time neither the guest nor dom0 saturates the CPU (I made sure the client is not the bottleneck either). So some other factor is limiting throughput. (I disabled the netchannel2 rate limiter, but this did not seem to have any effect either.) I will spend some time looking into that.

Regards

Renato

> > We could also make the number of fragments that generate an event a configurable parameter that could be adjusted (right now it is a constant). So a sys admin would have an option to configure it with a value compatible with the default socket buffer. What about combining the timer with a configurable parameter?
>
> I guess it wouldn't hurt to make this stuff configurable, although I think you may be overestimating the average sysadmin if you think they're going to know the default socket buffer size (hell, *I* don't know the default socket buffer size).
>
> Steven.
Steven Smith
2009-Feb-20  09:58 UTC
[Xen-devel] Re: [PATCH] Netchannel2 optimizations [2/4]
> > I've also added a (very stupid) adaptation scheme which tries to adjust the max_count_frags_no_event parameter to avoid hitting the deadlock too often in the first place. It seems to do broadly the right thing for both UDP floods and TCP stream tests, but it probably wouldn't be very hard to come up with some workload for which it falls over.
>
> OK, I will test how this works on 10 gig NICs when I have some time. I am currently doing some tests on Intel 10 gig ixgbe NICs and I am seeing some behaviour that I cannot explain (without this adaptation patch). Netperf is not able to saturate the link, and at the same time neither the guest nor dom0 saturates the CPU (I made sure the client is not the bottleneck either). So some other factor is limiting throughput. (I disabled the netchannel2 rate limiter, but this did not seem to have any effect either.) I will spend some time looking into that.

Is it possible that we're seeing some kind of semi-synchronous bouncing between the domU and dom0? Something like this:

-- DomU sends some messages to dom0, wakes it up, and then goes to sleep.
-- Dom0 wakes up, processes the messages, sends the responses, wakes the domU, and then goes to sleep.
-- Repeat.

So that both domains are spending significant time just waiting for the other one to do something, and neither can saturate their CPU. That should be fairly obvious in a xentrace trace if you run it while you're observing the bad behaviour.

If that is the problem, there are a couple of easy-ish things we could do which might help a bit:

-- Re-arrange the tasklet a bit so that it sends outgoing messages before checking for incoming ones. The risk is that processing an incoming message is likely to generate further outgoing ones, so we risk splitting the messages into two flights.
-- Arrange to kick after N messages, even if we still have more messages to send, so that the domain which is receiving the messages runs in parallel with the sending one.

Both approaches would risk sending more batches of messages, and hence more event channel notifications, trips through the Xen scheduler, etc., and hence would only ever increase the number of cycles per packet, but if they stop CPUs going idle then they might increase the actual throughput.

Ideally, we'd only do this kind of thing if the receiving domain is idle, but figuring that out from the transmitting domain in an efficient way sounds tricky. You could imagine some kind of scoreboard showing which domains are running, maintained by Xen and readable by all domains, but I really don't think we want to go down that route.

Steven.
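The trade-off in the "kick after N messages" idea comes down to simple arithmetic, sketched here in Python. The function names are hypothetical, and the model assumes one notification per batch with nothing else triggering a kick.

```python
import math


def kicks_needed(messages, kick_every=None):
    """Notifications needed to hand over `messages` queued messages.

    With kick_every=None the sender kicks once, after the whole batch.
    """
    if messages == 0:
        return 0
    if kick_every is None:
        return 1
    return math.ceil(messages / kick_every)


def messages_before_first_kick(messages, kick_every=None):
    """How many messages are queued before the receiver is first woken."""
    if kick_every is None:
        return messages
    return min(messages, kick_every)
```

For a batch of 64 messages, kicking every 16 wakes the receiver after 16 messages instead of 64, at the price of 4 notifications instead of 1. That matches the concern in the message: more cycles per packet, but possibly less idle time on the receiving side.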
Santos, Jose Renato G
2009-Feb-20  15:39 UTC
[Xen-devel] RE: [PATCH] Netchannel2 optimizations [2/4]
> -----Original Message-----
> From: Steven Smith [mailto:steven.smith@citrix.com]
> Sent: Friday, February 20, 2009 1:58 AM
> To: Santos, Jose Renato G
> Cc: Steven Smith; xen-devel@lists.xensource.com
> Subject: Re: [PATCH] Netchannel2 optimizations [2/4]
>
> > > I've also added a (very stupid) adaptation scheme which tries to adjust the max_count_frags_no_event parameter to avoid hitting the deadlock too often in the first place. It seems to do broadly the right thing for both UDP floods and TCP stream tests, but it probably wouldn't be very hard to come up with some workload for which it falls over.
> >
> > OK, I will test how this works on 10 gig NICs when I have some time. I am currently doing some tests on Intel 10 gig ixgbe NICs and I am seeing some behaviour that I cannot explain (without this adaptation patch). Netperf is not able to saturate the link, and at the same time neither the guest nor dom0 saturates the CPU (I made sure the client is not the bottleneck either). So some other factor is limiting throughput. (I disabled the netchannel2 rate limiter, but this did not seem to have any effect either.) I will spend some time looking into that.
>
> Is it possible that we're seeing some kind of semi-synchronous bouncing between the domU and dom0? Something like this:
>
> -- DomU sends some messages to dom0, wakes it up, and then goes to sleep.
> -- Dom0 wakes up, processes the messages, sends the responses, wakes the domU, and then goes to sleep.
> -- Repeat.
>
> So that both domains are spending significant time just waiting for the other one to do something, and neither can saturate their CPU.

Yes, that is what I thought as well. I still need to do a careful xentrace analysis though.

> That should be fairly obvious in a xentrace trace if you run it while you're observing the bad behaviour.
>
> If that is the problem, there are a couple of easy-ish things we could do which might help a bit:
>
> -- Re-arrange the tasklet a bit so that it sends outgoing messages before checking for incoming ones. The risk is that processing an incoming message is likely to generate further outgoing ones, so we risk splitting the messages into two flights.
> -- Arrange to kick after N messages, even if we still have more messages to send, so that the domain which is receiving the messages runs in parallel with the sending one.

I tried limiting the messages in each run and some other things, but it did not help. I can try doing TX before RX, but I think there is something else going on and I will need to spend more time analysing. I will postpone this until after the Xen summit, as I will be busy in the next few days writing my slides.

Thanks for the suggestions

Renato

> Both approaches would risk sending more batches of messages, and hence more event channel notifications, trips through the Xen scheduler, etc., and hence would only ever increase the number of cycles per packet, but if they stop CPUs going idle then they might increase the actual throughput.
>
> Ideally, we'd only do this kind of thing if the receiving domain is idle, but figuring that out from the transmitting domain in an efficient way sounds tricky. You could imagine some kind of scoreboard showing which domains are running, maintained by Xen and readable by all domains, but I really don't think we want to go down that route.
>
> Steven.