>>> Maybe I was wrong, but according to Michael's comment it looks like he
>>> wants to check affinity_hint_set just for speculative tx polling on rx
>>> napi instead of disabling it at all.
>>>
>>> And I'm not convinced this is really needed; the driver only provides an
>>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>>> interrupts are on the same vcpus.
>>
>> You're right. I made the restriction broader than the request, to really
>> err on the side of caution for the initial merge of napi tx. And enabling
>> the optimization is always a win over keeping it off, even without irq
>> affinity.
>>
>> The cycle cost is significant without affinity regardless of whether the
>> optimization is used.
>
> Yes, I noticed this in the past too.
>
>> Though this is not limited to napi-tx, it is more
>> pronounced in that mode than without napi.
>>
>> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>>
>> upstream:
>>
>> 1,1,1: 28985 Mbps, 278 Gcyc
>> 1,0,2: 30067 Mbps, 402 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 34492 Mbps, 269 Gcyc
>> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> 1,0,1: 36269 Mbps, 394 Gcyc
>> 1,0,0: 34674 Mbps, 402 Gcyc
>>
>> This is a particularly strong example. It is also representative
>> of most RR tests. It is less pronounced in other streaming tests.
>> 10x TCP_RR, for instance:
>>
>> upstream:
>>
>> 1,1,1: 42267 Mbps, 301 Gcyc
>> 1,0,2: 40663 Mbps, 445 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 42420 Mbps, 303 Gcyc
>> 1,0,2: 42267 Mbps, 431 Gcyc
>>
>> These numbers were obtained with the virtqueue_enable_cb_delayed
>> optimization after xmit_skb, btw. It turns out that moving that call to
>> before xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of
>> reducing 100x TCP_RR a bit.
>
> I see, so I think we can leave the affinity hint optimization/check for
> future investigation:
>
> - to avoid endless optimization (e.g. we may want to share a single
>   vector/napi for tx/rx queue pairs in the future) for this series.
> - tx napi is disabled by default, which means we can do optimization on top.

Okay. I'll drop the vi->affinity_hint_set check from the patch set for now.
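For readers joining the thread here: the "speculative tx polling on rx napi" being discussed is the rx napi handler opportunistically freeing completed tx buffers for its queue pair, which is only cheap when the rx and tx interrupts of that pair are handled on the same cpu. A rough sketch of that path follows; helper names (free_old_xmit_skbs, vq2rxq, netdev_get_tx_queue) follow the virtio_net driver, but this is an illustration rather than the exact patch text.

static void virtnet_poll_cleantx(struct receive_queue *rq)
{
        /* Sketch only: clean the tx queue of this queue pair from the rx
         * napi handler. Helper names follow drivers/net/virtio_net.c, but
         * this is not the literal patch.
         */
        struct virtnet_info *vi = rq->vq->vdev->priv;
        unsigned int index = vq2rxq(rq->vq);
        struct send_queue *sq = &vi->sq[index];
        struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);

        /* Nothing to do when tx napi is disabled. */
        if (!sq->napi.weight)
                return;

        /* Best effort: skip cleaning if the tx lock is contended. */
        if (__netif_tx_trylock(txq)) {
                free_old_xmit_skbs(sq);
                __netif_tx_unlock(txq);
        }

        if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
                netif_tx_wake_queue(txq);
}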
Michael S. Tsirkin
2017-Apr-24 16:40 UTC
[PATCH net-next v2 2/5] virtio-net: transmit napi
On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
> >>> Maybe I was wrong, but according to Michael's comment it looks like he
> >>> wants to check affinity_hint_set just for speculative tx polling on rx
> >>> napi instead of disabling it at all.
> >>>
> >>> And I'm not convinced this is really needed; the driver only provides an
> >>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
> >>> interrupts are on the same vcpus.
> >>
> >> You're right. I made the restriction broader than the request, to really
> >> err on the side of caution for the initial merge of napi tx. And enabling
> >> the optimization is always a win over keeping it off, even without irq
> >> affinity.
> >>
> >> The cycle cost is significant without affinity regardless of whether the
> >> optimization is used.
> >
> > Yes, I noticed this in the past too.
> >
> >> Though this is not limited to napi-tx, it is more
> >> pronounced in that mode than without napi.
> >>
> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
> >>
> >> upstream:
> >>
> >> 1,1,1: 28985 Mbps, 278 Gcyc
> >> 1,0,2: 30067 Mbps, 402 Gcyc
> >>
> >> napi tx:
> >>
> >> 1,1,1: 34492 Mbps, 269 Gcyc
> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
> >> 1,0,1: 36269 Mbps, 394 Gcyc
> >> 1,0,0: 34674 Mbps, 402 Gcyc
> >>
> >> This is a particularly strong example. It is also representative
> >> of most RR tests. It is less pronounced in other streaming tests.
> >> 10x TCP_RR, for instance:
> >>
> >> upstream:
> >>
> >> 1,1,1: 42267 Mbps, 301 Gcyc
> >> 1,0,2: 40663 Mbps, 445 Gcyc
> >>
> >> napi tx:
> >>
> >> 1,1,1: 42420 Mbps, 303 Gcyc
> >> 1,0,2: 42267 Mbps, 431 Gcyc
> >>
> >> These numbers were obtained with the virtqueue_enable_cb_delayed
> >> optimization after xmit_skb, btw. It turns out that moving that call to
> >> before xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of
> >> reducing 100x TCP_RR a bit.
> >
> > I see, so I think we can leave the affinity hint optimization/check for
> > future investigation:
> >
> > - to avoid endless optimization (e.g. we may want to share a single
> >   vector/napi for tx/rx queue pairs in the future) for this series.
> > - tx napi is disabled by default, which means we can do optimization on top.
>
> Okay. I'll drop the vi->affinity_hint_set check from the patch set for now.

I kind of like it, let's be conservative. But I'd prefer a comment
near it explaining why it's there.

--
MST
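One possible shape for the check plus the requested comment, as a sketch only: gate the enabling of tx napi on vi->affinity_hint_set and explain why right next to it. The helper virtnet_napi_enable and the fields used here come from this series and the existing driver, but the exact placement and wording in the eventual patch may differ.

static void virtnet_napi_tx_enable(struct virtnet_info *vi,
                                   struct virtqueue *vq,
                                   struct napi_struct *napi)
{
        /* Sketch only: tx napi already disabled (weight 0), nothing to do. */
        if (!napi->weight)
                return;

        /* Tx napi touches cachelines on the cpu handling tx interrupts.
         * Only keep it enabled when the irq affinity hints are set, so the
         * rx and tx interrupts of a queue pair are likely handled on the
         * same cpu and cleaning tx from rx napi does not bounce cachelines.
         */
        if (!vi->affinity_hint_set) {
                napi->weight = 0;
                return;
        }

        virtnet_napi_enable(vq, napi);
}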
On Mon, Apr 24, 2017 at 12:40 PM, Michael S. Tsirkin <mst at redhat.com> wrote:
> On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
>> >>> Maybe I was wrong, but according to Michael's comment it looks like he
>> >>> wants to check affinity_hint_set just for speculative tx polling on rx
>> >>> napi instead of disabling it at all.
>> >>>
>> >>> And I'm not convinced this is really needed; the driver only provides an
>> >>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>> >>> interrupts are on the same vcpus.
>> >>
>> >> You're right. I made the restriction broader than the request, to really
>> >> err on the side of caution for the initial merge of napi tx. And enabling
>> >> the optimization is always a win over keeping it off, even without irq
>> >> affinity.
>> >>
>> >> The cycle cost is significant without affinity regardless of whether the
>> >> optimization is used.
>> >
>> > Yes, I noticed this in the past too.
>> >
>> >> Though this is not limited to napi-tx, it is more
>> >> pronounced in that mode than without napi.
>> >>
>> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>> >>
>> >> upstream:
>> >>
>> >> 1,1,1: 28985 Mbps, 278 Gcyc
>> >> 1,0,2: 30067 Mbps, 402 Gcyc
>> >>
>> >> napi tx:
>> >>
>> >> 1,1,1: 34492 Mbps, 269 Gcyc
>> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> >> 1,0,1: 36269 Mbps, 394 Gcyc
>> >> 1,0,0: 34674 Mbps, 402 Gcyc
>> >>
>> >> This is a particularly strong example. It is also representative
>> >> of most RR tests. It is less pronounced in other streaming tests.
>> >> 10x TCP_RR, for instance:
>> >>
>> >> upstream:
>> >>
>> >> 1,1,1: 42267 Mbps, 301 Gcyc
>> >> 1,0,2: 40663 Mbps, 445 Gcyc
>> >>
>> >> napi tx:
>> >>
>> >> 1,1,1: 42420 Mbps, 303 Gcyc
>> >> 1,0,2: 42267 Mbps, 431 Gcyc
>> >>
>> >> These numbers were obtained with the virtqueue_enable_cb_delayed
>> >> optimization after xmit_skb, btw. It turns out that moving that call to
>> >> before xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of
>> >> reducing 100x TCP_RR a bit.
>> >
>> > I see, so I think we can leave the affinity hint optimization/check for
>> > future investigation:
>> >
>> > - to avoid endless optimization (e.g. we may want to share a single
>> >   vector/napi for tx/rx queue pairs in the future) for this series.
>> > - tx napi is disabled by default, which means we can do optimization on top.
>>
>> Okay. I'll drop the vi->affinity_hint_set check from the patch set for now.
>
> I kind of like it, let's be conservative. But I'd prefer a comment
> near it explaining why it's there.

I don't feel strongly. I was minutes away from sending a v3 with this
code reverted, but I'll reinstate it and add a comment.

Other planned changes based on Jason's feedback to v2:

v2 -> v3:
  - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll,
    to ensure that the handler always cleans and to avoid deadlock
    (a sketch of the resulting handler follows below)
  - unconditionally clean in start_xmit,
    to avoid adding an unnecessary "if (use_napi)" branch
  - remove virtqueue_disable_cb in patch 5/5,
    a noop in the common event_idx based loop
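A sketch of what the first item above could look like: the tx napi poll handler takes the tx lock unconditionally with __netif_tx_lock, so it always cleans instead of skipping work when the lock is contended. Helper names (free_old_xmit_skbs, virtqueue_napi_complete, vq2txq) follow the driver and this series; the actual v3 patch may differ in detail.

static int virtnet_poll_tx(struct napi_struct *napi, int budget)
{
        struct send_queue *sq = container_of(napi, struct send_queue, napi);
        struct virtnet_info *vi = sq->vq->vdev->priv;
        struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));

        /* Take the lock unconditionally (not trylock), so this handler
         * always frees completed tx buffers and cannot leave work behind.
         */
        __netif_tx_lock(txq, raw_smp_processor_id());
        free_old_xmit_skbs(sq);
        __netif_tx_unlock(txq);

        /* Tx cleaning is not budgeted; report zero work and re-enable
         * callbacks via the series' napi completion helper.
         */
        virtqueue_napi_complete(napi, sq->vq, 0);

        if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
                netif_tx_wake_queue(txq);

        return 0;
}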
On 2017-04-25 00:40, Michael S. Tsirkin wrote:
> On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
>>>>> Maybe I was wrong, but according to Michael's comment it looks like he
>>>>> wants to check affinity_hint_set just for speculative tx polling on rx
>>>>> napi instead of disabling it at all.
>>>>>
>>>>> And I'm not convinced this is really needed; the driver only provides an
>>>>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>>>>> interrupts are on the same vcpus.
>>>>
>>>> You're right. I made the restriction broader than the request, to really
>>>> err on the side of caution for the initial merge of napi tx. And enabling
>>>> the optimization is always a win over keeping it off, even without irq
>>>> affinity.
>>>>
>>>> The cycle cost is significant without affinity regardless of whether the
>>>> optimization is used.
>>>
>>> Yes, I noticed this in the past too.
>>>
>>>> Though this is not limited to napi-tx, it is more
>>>> pronounced in that mode than without napi.
>>>>
>>>> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>>>>
>>>> upstream:
>>>>
>>>> 1,1,1: 28985 Mbps, 278 Gcyc
>>>> 1,0,2: 30067 Mbps, 402 Gcyc
>>>>
>>>> napi tx:
>>>>
>>>> 1,1,1: 34492 Mbps, 269 Gcyc
>>>> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>>>> 1,0,1: 36269 Mbps, 394 Gcyc
>>>> 1,0,0: 34674 Mbps, 402 Gcyc
>>>>
>>>> This is a particularly strong example. It is also representative
>>>> of most RR tests. It is less pronounced in other streaming tests.
>>>> 10x TCP_RR, for instance:
>>>>
>>>> upstream:
>>>>
>>>> 1,1,1: 42267 Mbps, 301 Gcyc
>>>> 1,0,2: 40663 Mbps, 445 Gcyc
>>>>
>>>> napi tx:
>>>>
>>>> 1,1,1: 42420 Mbps, 303 Gcyc
>>>> 1,0,2: 42267 Mbps, 431 Gcyc
>>>>
>>>> These numbers were obtained with the virtqueue_enable_cb_delayed
>>>> optimization after xmit_skb, btw. It turns out that moving that call to
>>>> before xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of
>>>> reducing 100x TCP_RR a bit.
>>>
>>> I see, so I think we can leave the affinity hint optimization/check for
>>> future investigation:
>>>
>>> - to avoid endless optimization (e.g. we may want to share a single
>>>   vector/napi for tx/rx queue pairs in the future) for this series.
>>> - tx napi is disabled by default, which means we can do optimization on top.
>>
>> Okay. I'll drop the vi->affinity_hint_set check from the patch set for now.
>
> I kind of like it, let's be conservative. But I'd prefer a comment
> near it explaining why it's there.

Another issue for affinity_hint_set is that it could be changed when
setting channels. I think we're already conservative enough (e.g. it
was disabled by default).

Thanks
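To illustrate the point about the hint changing when channels are set: the driver only sets the affinity hints when queue pairs can be mapped 1:1 onto online cpus, and it re-evaluates this whenever the queue configuration changes (e.g. via ethtool set_channels). A rough sketch of that logic, modeled on the driver of this era; the exact helper signatures (virtnet_clean_affinity, virtqueue_set_affinity) are assumptions and may not match the tree byte for byte.

static void virtnet_set_affinity(struct virtnet_info *vi)
{
        int i = 0;
        int cpu;

        /* Only hint a private cpu per queue pair when the counts match;
         * otherwise clear any previous hints. This runs again on
         * set_channels, so vi->affinity_hint_set can flip at runtime.
         */
        if (vi->curr_queue_pairs == 1 ||
            vi->max_queue_pairs != num_online_cpus()) {
                virtnet_clean_affinity(vi, -1);
                return;
        }

        for_each_online_cpu(cpu) {
                virtqueue_set_affinity(vi->rq[i].vq, cpu);
                virtqueue_set_affinity(vi->sq[i].vq, cpu);
                netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
                i++;
        }

        vi->affinity_hint_set = true;
}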