Xu, Dongxiao
2009-Sep-10 07:42 UTC
[Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
Hi,

This is a VNIF optimization patch; comments are welcome. Thanks!

[Background]:

One of the VNIF driver's scalability issues is the high event channel frequency. It is closely tied to the physical NIC's interrupt frequency in dom0, which can reach 20 kHz in some situations. The high-frequency event channel notification keeps both guest and dom0 CPU utilization high. For the HVM PV driver in particular, it brings a high rate of interrupts, which costs a lot of CPU cycles.

The attached patches have two parts: one for netback and the other for netfront. The netback part is based on the latest PV-Ops Dom0, and the netfront part is based on the 2.6.18 HVM unmodified driver.

This patch uses a timer in netfront to poll the ring instead of relying on event channel notification. While the guest is transferring data, the timer runs and periodically sends/receives data from the ring. If the guest is idle and no data is being transferred, the timer stops automatically; it restarts as soon as data transfer begins again.

We set a feature flag in xenstore to indicate whether netfront/netback supports this feature. If only one side supports it, the communication mechanism falls back to the default and the new feature is not used. The feature is enabled only when both sides have the flag set in xenstore.

One open problem is the timer polling frequency. The netfront part of the patch is based on the 2.6.18 HVM unmodified driver, and in that kernel version the guest hrtimer is not accurate, so I use the classical timer. The polling frequency is 1 kHz. If the netfront part is rebased to the latest pv-ops, we could use an hrtimer instead.
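To make the idea concrete without the attachment, the netfront side is roughly the following (a simplified sketch only, not the patch itself; netif_poll_rx()/netif_poll_tx_completions(), poll_timer and ring_active are illustrative stand-ins for the real handlers and state):

static void netfront_poll_timer_fn(unsigned long data)
{
	struct netfront_info *np = (struct netfront_info *)data;

	/* Drain whatever netback has produced since the last tick. */
	netif_poll_rx(np);               /* hypothetical rx handler */
	netif_poll_tx_completions(np);   /* hypothetical tx-completion handler */

	if (np->ring_active) {
		/* Traffic is still flowing: tick again on the next jiffy
		 * (HZ == 1000 gives the 1 kHz polling frequency). */
		np->ring_active = 0;
		mod_timer(&np->poll_timer, jiffies + 1);
	}
	/* Otherwise the timer simply stops; the transmit path (or the first
	 * new response) re-arms it with mod_timer() when traffic resumes. */
}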
[Testing Result]:

We used a 4-core Intel Q9550 for the tests. The test cases cover combinations of 1/3/6/9 VMs (all UP guests) and 50/1472/1500-byte packet sizes; we measured throughput, Dom0 CPU utilization, and total guest CPU utilization. We pinned each guest vcpu to one pcpu and pinned dom0's vcpu to one pcpu. Taking the 9-guest case as an example, we pinned Dom0's vcpu to pcpu0, guest 1~3 vcpus to pcpu1, guest 4~6 vcpus to pcpu2, and guest 7~9 vcpus to pcpu3. We used netperf, and the results below are with the HVM VNIF driver. The packet sizes mean: 50 (small packets), 1472 (big packets, near MTU), 1500 (a mix of big and small packets).

From the tables below we can see that host/guest CPU utilization decreases after applying the patch, especially when multiple VMs are launched, while throughput is not affected.

All tables use the columns: Packet Size (bytes), Test Case, Throughput (Mbps), Dom0 CPU Util, Guest CPU Total Util.

VM RECEIVE CASES:

Guest UDP Receive (Single Guest VM)
  50    w/o patch   83.25    100.00%   26.10%
  50    w/ patch    79.56    100.00%   23.80%
  1472  w/o patch   950.30   44.80%    22.40%
  1472  w/ patch    949.32   46.00%    17.90%
  1500  w/o patch   915.84   84.70%    42.40%
  1500  w/ patch    908.94   88.30%    28.70%

Guest TCP Receive (Single Guest VM)
  50    w/o patch   506.57   43.30%    70.30%
  50    w/ patch    521.52   34.50%    57.70%
  1472  w/o patch   926.19   69.00%    32.90%
  1472  w/ patch    928.23   63.00%    24.40%
  1500  w/o patch   935.12   68.60%    33.70%
  1500  w/ patch    926.11   63.80%    24.80%

Guest UDP Receive (Three Guest VMs)
  1472  w/o patch   963.43   50.70%    41.10%
  1472  w/ patch    964.47   51.00%    25.00%
  1500  w/o patch   859.96   99.50%    73.40%
  1500  w/ patch    861.19   97.40%    39.90%

Guest TCP Receive (Three Guest VMs)
  1472  w/o patch   939.68   78.40%    64.00%
  1472  w/ patch    926.04   65.90%    31.80%
  1500  w/o patch   933.00   78.10%    63.30%
  1500  w/ patch    927.14   66.90%    31.90%

Guest UDP Receive (Six Guest VMs)
  1472  w/o patch   978.85   56.90%    59.20%
  1472  w/ patch    975.05   53.80%    33.50%
  1500  w/o patch   886.92   100.00%   87.20%
  1500  w/ patch    902.02   96.90%    46.00%

Guest TCP Receive (Six Guest VMs)
  1472  w/o patch   962.04   90.30%    104.00%
  1472  w/ patch    958.94   69.40%    43.70%
  1500  w/o patch   960.35   90.10%    103.70%
  1500  w/ patch    957.75   68.70%    42.80%

Guest UDP Receive (Nine Guest VMs)
  1472  w/o patch   987.91   60.50%    70.00%
  1472  w/ patch    988.30   56.60%    42.70%
  1500  w/o patch   953.48   100.00%   93.80%
  1500  w/ patch    904.17   98.60%    53.50%

Guest TCP Receive (Nine Guest VMs)
  1472  w/o patch   974.89   90.00%    110.60%
  1472  w/ patch    980.03   73.70%    55.40%
  1500  w/o patch   971.34   89.80%    109.60%
  1500  w/ patch    973.63   73.90%    54.70%

VM SEND CASES:

Guest UDP Send (Single Guest VM)
  1472  w/o patch   949.84   56.50%    21.70%
  1472  w/ patch    946.25   51.20%    20.10%
  1500  w/o patch   912.46   87.00%    26.60%
  1500  w/ patch    899.29   86.70%    26.20%

Guest TCP Send (Single Guest VM)
  1472  w/o patch   932.16   71.50%    35.60%
  1472  w/ patch    932.09   66.90%    29.50%
  1500  w/o patch   929.91   72.60%    35.90%
  1500  w/ patch    931.63   66.70%    29.50%

Guest UDP Send (Three Guest VMs)
  1472  w/o patch   972.66   57.60%    24.00%
  1472  w/ patch    970.07   56.30%    23.30%
  1500  w/o patch   943.87   93.50%    32.50%
  1500  w/ patch    933.61   93.90%    30.00%

Guest TCP Send (Three Guest VMs)
  1472  w/o patch   955.92   70.40%    36.10%
  1472  w/ patch    946.39   72.90%    32.90%
  1500  w/o patch   966.06   73.00%    38.00%
  1500  w/ patch    947.23   72.50%    33.60%

Best Regards,
-- Dongxiao
James Harper
2009-Sep-10 08:03 UTC
RE: [Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
> Hi,
>
> This is a VNIF optimization patch; comments are welcome. Thanks!
>
> [Background]:
>
> One of the VNIF driver's scalability issues is the high event channel
> frequency. It is closely tied to the physical NIC's interrupt frequency in
> dom0, which can reach 20 kHz in some situations. The high-frequency event
> channel notification keeps both guest and dom0 CPU utilization high. For the
> HVM PV driver in particular, it brings a high rate of interrupts, which
> costs a lot of CPU cycles.
>
> The attached patches have two parts: one for netback and the other for
> netfront. The netback part is based on the latest PV-Ops Dom0, and the
> netfront part is based on the 2.6.18 HVM unmodified driver.
>
> This patch uses a timer in netfront to poll the ring instead of relying on
> event channel notification. While the guest is transferring data, the timer
> runs and periodically sends/receives data from the ring. If the guest is
> idle and no data is being transferred, the timer stops automatically; it
> restarts as soon as data transfer begins again.
>
> We set a feature flag in xenstore to indicate whether netfront/netback
> supports this feature. If only one side supports it, the communication
> mechanism falls back to the default and the new feature is not used. The
> feature is enabled only when both sides have the flag set in xenstore.
>
> One open problem is the timer polling frequency. The netfront part of the
> patch is based on the 2.6.18 HVM unmodified driver, and in that kernel
> version the guest hrtimer is not accurate, so I use the classical timer.
> The polling frequency is 1 kHz. If the netfront part is rebased to the
> latest pv-ops, we could use an hrtimer instead.

I experimented with this in Windows too, but the timer resolution is too poor. I think you should also look at setting the 'event' parameter. The current driver tells the backend to notify it as soon as there is a single packet ready (np->rx.sring->rsp_event = np->rx.rsp_cons + 1), but you could set it to a higher number and also use the timer, e.g. "tell me when there are 32 ring slots filled, or when the timer elapses". That way you should have fewer problems with overflows.

Also, I don't think you need to tell the backend to stop notifying you; just don't set the 'event' field in the frontend, and then RING_PUSH_RESPONSES_AND_CHECK_NOTIFY in the backend will not report that a notification is required.

James
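For concreteness, the 'event' trick James describes amounts to something like this on the netfront side (a sketch only; the 32-slot batch is an arbitrary example and netif_poll_rx() is a hypothetical stand-in for the real receive handler):

#define RX_NOTIFY_BATCH 32   /* example value */

static void netfront_set_rx_event(struct netfront_info *np)
{
	/* Ask netback to send an event only after RX_NOTIFY_BATCH more
	 * responses are queued, instead of after a single one; a local
	 * polling timer covers the interval in between. */
	np->rx.sring->rsp_event = np->rx.rsp_cons + RX_NOTIFY_BATCH;
	mb();   /* publish the new threshold before re-checking the ring */

	/* If responses raced in while the threshold was being raised,
	 * process them now rather than waiting for the timer. */
	if (RING_HAS_UNCONSUMED_RESPONSES(&np->rx))
		netif_poll_rx(np);   /* hypothetical rx handler */
}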
Keir Fraser
2009-Sep-10 08:14 UTC
Re: [Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
On 10/09/2009 09:03, "James Harper" <james.harper@bendigoit.com.au> wrote:

> I experimented with this in Windows too, but the timer resolution is too
> poor. I think you should also look at setting the 'event' parameter. The
> current driver tells the backend to notify it as soon as there is a single
> packet ready (np->rx.sring->rsp_event = np->rx.rsp_cons + 1), but you could
> set it to a higher number and also use the timer, e.g. "tell me when there
> are 32 ring slots filled, or when the timer elapses". That way you should
> have fewer problems with overflows.
>
> Also, I don't think you need to tell the backend to stop notifying you;
> just don't set the 'event' field in the frontend, and then
> RING_PUSH_RESPONSES_AND_CHECK_NOTIFY in the backend will not report that a
> notification is required.

Yes, the ring event mechanism is already set up to support event holdoff/mitigation. It's just not used smartly by netfront right now.

-- Keir
Xu, Dongxiao
2009-Sep-10 09:09 UTC
RE: [Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
Thanks for the comments!

The solution I presented in my last mail has the advantage that it can reduce the event channel notification frequency almost to zero, which saves a lot of CPU cycles, especially for the HVM PV driver.

Following James's suggestion, we actually have another solution which works in that style; see the attachment. It only modifies netback and keeps netfront unchanged. The patch is based on PV-Ops Dom0, so the hrtimer is accurate. We set a timer in netback: if the timer elapses or there are RING_SIZE/2 data slots in the ring, netback notifies netfront (of course, we could modify the 'event' parameter instead of counting the slots in the ring). The patch contains auto-adjustment logic for each netfront's event channel frequency according to the packet rate and size within a timer period. The user can also assign a specific timer frequency to a particular netfront through the standard coalesce interface. Setting the event notification frequency to 1000 Hz also brings a large decrease in CPU utilization, similar to the previous test results. (A rough sketch of this netback-side notification logic is given after the result tables below.)

Here are the detailed results for the two solutions. I think the two solutions could coexist, and we can use a macro to select which one is the default.

In the tables, "w/ FE patch" means applying the first solution (the patch attached to my last mail), and "w/ BE patch" means applying the second solution (the patch attached to this mail). All tables use the columns: Packet Size (bytes), Test Case, Throughput (Mbps), Dom0 CPU Util, Guest CPU Total Util.

VM receive results:

UDP Receive (Single Guest VM)
  50    w/o patch     83.25    100.00%   26.10%
  50    w/ FE patch   79.56    100.00%   23.80%
  50    w/ BE patch   72.43    100.00%   21.90%
  1472  w/o patch     950.30   44.80%    22.40%
  1472  w/ FE patch   949.32   46.00%    17.90%
  1472  w/ BE patch   951.57   51.10%    18.50%
  1500  w/o patch     915.84   84.70%    42.40%
  1500  w/ FE patch   908.94   88.30%    28.70%
  1500  w/ BE patch   904.00   88.90%    27.30%

TCP Receive (Single Guest VM)
  50    w/o patch     506.57   43.30%    70.30%
  50    w/ FE patch   521.52   34.50%    57.70%
  50    w/ BE patch   512.78   38.50%    54.40%
  1472  w/o patch     926.19   69.00%    32.90%
  1472  w/ FE patch   928.23   63.00%    24.40%
  1472  w/ BE patch   928.59   67.50%    24.80%
  1500  w/o patch     935.12   68.60%    33.70%
  1500  w/ FE patch   926.11   63.80%    24.80%
  1500  w/ BE patch   927.00   68.80%    24.60%

UDP Receive (Three Guest VMs)
  1472  w/o patch     963.43   50.70%    41.10%
  1472  w/ FE patch   964.47   51.00%    25.00%
  1472  w/ BE patch   963.07   55.60%    27.80%
  1500  w/o patch     859.96   99.50%    73.40%
  1500  w/ FE patch   861.19   97.40%    39.90%
  1500  w/ BE patch   860.92   98.90%    40.00%

TCP Receive (Three Guest VMs)
  1472  w/o patch     939.68   78.40%    64.00%
  1472  w/ FE patch   926.04   65.90%    31.80%
  1472  w/ BE patch   930.61   71.60%    34.80%
  1500  w/o patch     933.00   78.10%    63.30%
  1500  w/ FE patch   927.14   66.90%    31.90%
  1500  w/ BE patch   930.76   71.10%    34.80%

UDP Receive (Six Guest VMs)
  1472  w/o patch     978.85   56.90%    59.20%
  1472  w/ FE patch   975.05   53.80%    33.50%
  1472  w/ BE patch   974.71   59.50%    40.00%
  1500  w/o patch     886.92   100.00%   87.20%
  1500  w/ FE patch   902.02   96.90%    46.00%
  1500  w/ BE patch   894.57   98.90%    49.60%

TCP Receive (Six Guest VMs)
  1472  w/o patch     962.04   90.30%    104.00%
  1472  w/ FE patch   958.94   69.40%    43.70%
  1472  w/ BE patch   958.08   68.30%    48.00%
  1500  w/o patch     960.35   90.10%    103.70%
  1500  w/ FE patch   957.75   68.70%    42.80%
  1500  w/ BE patch   956.42   68.20%    48.50%

UDP Receive (Nine Guest VMs)
  1472  w/o patch     987.91   60.50%    70.00%
  1472  w/ FE patch   988.30   56.60%    42.70%
  1472  w/ BE patch   986.58   61.80%    50.00%
  1500  w/o patch     953.48   100.00%   93.80%
  1500  w/ FE patch   904.17   98.60%    53.50%
  1500  w/ BE patch   905.52   100.00%   56.80%

TCP Receive (Nine Guest VMs)
  1472  w/o patch     974.89   90.00%    110.60%
  1472  w/ FE patch   980.03   73.70%    55.40%
  1472  w/ BE patch   968.29   72.30%    60.20%
  1500  w/o patch     971.34   89.80%    109.60%
  1500  w/ FE patch   973.63   73.90%    54.70%
  1500  w/ BE patch   971.08   72.30%    61.00%

VM send results:

UDP Send (Single Guest VM)
  1472  w/o patch     949.84   56.50%    21.70%
  1472  w/ FE patch   946.25   51.20%    20.10%
  1472  w/ BE patch   948.73   51.60%    19.70%
  1500  w/o patch     912.46   87.00%    26.60%
  1500  w/ FE patch   899.29   86.70%    26.20%
  1500  w/ BE patch   909.31   86.90%    25.90%

TCP Send (Single Guest VM)
  1472  w/o patch     932.16   71.50%    35.60%
  1472  w/ FE patch   932.09   66.90%    29.50%
  1472  w/ BE patch   932.54   66.20%    25.30%
  1500  w/o patch     929.91   72.60%    35.90%
  1500  w/ FE patch   931.63   66.70%    29.50%
  1500  w/ BE patch   932.83   66.20%    26.20%

UDP Send (Three Guest VMs)
  1472  w/o patch     972.66   57.60%    24.00%
  1472  w/ FE patch   970.07   56.30%    23.30%
  1472  w/ BE patch   971.05   59.10%    23.10%
  1500  w/o patch     943.87   93.50%    32.50%
  1500  w/ FE patch   933.61   93.90%    30.00%
  1500  w/ BE patch   937.08   95.10%    31.00%

TCP Send (Three Guest VMs)
  1472  w/o patch     955.92   70.40%    36.10%
  1472  w/ FE patch   946.39   72.90%    32.90%
  1472  w/ BE patch   949.80   70.30%    33.20%
  1500  w/o patch     966.06   73.00%    38.00%
  1500  w/ FE patch   947.23   72.50%    33.60%
  1500  w/ BE patch   948.74   72.20%    34.50%
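A rough sketch of the netback-side notification logic described above (illustrative names only; rx_last_notified, coalesce_timer and rx_coalesce_usecs are hypothetical fields, not code from the attached patch):

static void netbk_rx_maybe_notify(struct xen_netif *netif)
{
	RING_IDX pending = netif->rx.rsp_prod_pvt - netif->rx_last_notified;

	/* Notify immediately once half the ring has filled since the last
	 * notification; otherwise leave it to the coalescing timer below. */
	if (pending >= NET_RX_RING_SIZE / 2) {
		notify_remote_via_irq(netif->irq);
		netif->rx_last_notified = netif->rx.rsp_prod_pvt;
	}
}

static enum hrtimer_restart netbk_coalesce_timer_fn(struct hrtimer *t)
{
	struct xen_netif *netif = container_of(t, struct xen_netif,
					       coalesce_timer);

	if (netif->rx.rsp_prod_pvt != netif->rx_last_notified) {
		notify_remote_via_irq(netif->irq);
		netif->rx_last_notified = netif->rx.rsp_prod_pvt;
	}

	/* rx_coalesce_usecs is what the standard coalesce interface would
	 * set; 1000 us corresponds to the 1000 Hz case mentioned above. */
	hrtimer_forward_now(t, ns_to_ktime(netif->rx_coalesce_usecs * 1000ULL));
	return HRTIMER_RESTART;
}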
Best Regards,
-- Dongxiao
James Harper
2009-Sep-10 09:47 UTC
RE: [Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
> In the tables, "w/ FE patch" means applying the first solution (the patch
> attached to my last mail), and "w/ BE patch" means applying the second
> solution (the patch attached to this mail).

I think you also need to measure dropped packets (which presumably happen due to buffer overflow), latency, and maybe jitter. The latter two might be hard to measure with enough resolution to be significant, though.

What I saw when I was testing this sort of thing under Windows was that instead of receiving a constant stream of a few packets at a time, I received less frequent but much larger chunks of packets, which caused more work to be done per DPC.

Does Xen give Windows any high-resolution timers to play with? The best I could find (I didn't look that hard) was the standard Windows timer, which has >10ms resolution.

I think it might be good, as you have suggested, to push all the smarts back to the back end. Have some tunable figures (either via xenbus or via ring comms) to give the parameters of when a notify is needed, e.g.:
. maximum time since last notification
. maximum time since last packet
. number of packets to notify at, regardless of timeout (this is the event setting in the ring, although maybe not using that and having a separate backend-driven auto-scaling algorithm might be worthwhile)

Sounds like a bit of work though...

James
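As a sketch of what that set of tunables could look like in the backend (all names hypothetical; none of this exists in the current drivers):

struct notify_policy {
	unsigned int max_notify_delay_us;   /* max time since last notification */
	unsigned int max_packet_delay_us;   /* max time since last packet */
	unsigned int notify_batch;          /* slots filled before forcing a notify */
};

static bool need_notify(const struct notify_policy *p,
			unsigned int slots_pending,
			u64 us_since_notify, u64 us_since_packet)
{
	if (!slots_pending)
		return false;

	return slots_pending >= p->notify_batch ||
	       us_since_notify >= p->max_notify_delay_us ||
	       us_since_packet >= p->max_packet_delay_us;
}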
Xu, Dongxiao
2009-Sep-10 10:19 UTC
RE: [Xen-devel][PATCH][RFC] Using data polling mechanism in netfront to replace event notification between netback and netfront
Yes, dropped packets should be related to the OS socket buffer size, but the user could enlarge that via the proc interface.

Latency and event channel frequency are a tradeoff against each other; this is why the second solution provides the coalesce interface, so that users can balance that tradeoff themselves.

I think it is the low-resolution timer in Windows that caused the packets to arrive in chunks; 10 ms is really too long a timer slot.

On the tunable figures you mentioned: the maximum time since the last notification is set through the standard coalesce interface in netback. For the maximum number of packets to notify at, we count the slots in the ring and, if the count exceeds half the ring size, netback sends a notification. The maximum time since the last packet is, I think, a bit hard to measure and perhaps unnecessary, because we already have a time parameter (the timer frequency) that adjusts the tradeoff between latency and notification frequency.

Thanks very much for the comments!

Best Regards,
-- Dongxiao
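For reference, exposing the per-vif timer period through the standard coalesce interface could be done with ethtool_ops hooks roughly like the following (a sketch only; struct xen_netif and rx_coalesce_usecs are illustrative names, not the attached patch):

static int netbk_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
{
	struct xen_netif *netif = netdev_priv(dev);   /* per-vif private data */

	if (ec->rx_coalesce_usecs == 0)
		return -EINVAL;

	/* Used as the period of the netback coalescing timer, so e.g.
	 * "ethtool -C vifX.Y rx-usecs 1000" gives 1000 Hz notification. */
	netif->rx_coalesce_usecs = ec->rx_coalesce_usecs;
	return 0;
}

static int netbk_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
{
	struct xen_netif *netif = netdev_priv(dev);

	ec->rx_coalesce_usecs = netif->rx_coalesce_usecs;
	return 0;
}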