On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang <jasowang at redhat.com> wrote:
>
> On 2017-04-19 04:21, Willem de Bruijn wrote:
>>
>> +static void virtnet_napi_tx_enable(struct virtnet_info *vi,
>> +                                   struct virtqueue *vq,
>> +                                   struct napi_struct *napi)
>> +{
>> +       if (!napi->weight)
>> +               return;
>> +
>> +       if (!vi->affinity_hint_set) {
>> +               napi->weight = 0;
>> +               return;
>> +       }
>> +
>> +       return virtnet_napi_enable(vq, napi);
>> +}
>> +
>>  static void refill_work(struct work_struct *work)
>
> Maybe I was wrong, but according to Michael's comment it looks like he
> wants to check affinity_hint_set just for speculative tx polling on rx
> napi instead of disabling it entirely.
>
> And I'm not convinced this is really needed: the driver only provides an
> affinity hint instead of affinity, so it's not guaranteed that tx and rx
> interrupts are on the same vcpus.

You're right. I made the restriction broader than the request, to really err
on the side of caution for the initial merge of napi tx. And enabling
the optimization is always a win over keeping it off, even without irq
affinity.

The cycle cost is significant without affinity regardless of whether the
optimization is used. Though this is not limited to napi-tx, it is more
pronounced in that mode than without napi.

1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:

upstream:

1,1,1: 28985 Mbps, 278 Gcyc
1,0,2: 30067 Mbps, 402 Gcyc

napi tx:

1,1,1: 34492 Mbps, 269 Gcyc
1,0,2: 36527 Mbps, 537 Gcyc (!)
1,0,1: 36269 Mbps, 394 Gcyc
1,0,0: 34674 Mbps, 402 Gcyc

This is a particularly strong example. It is also representative
of most RR tests. It is less pronounced in other streaming tests.
10x TCP_RR, for instance:

upstream:

1,1,1: 42267 Mbps, 301 Gcyc
1,0,2: 40663 Mbps, 445 Gcyc

napi tx:

1,1,1: 42420 Mbps, 303 Gcyc
1,0,2: 42267 Mbps, 431 Gcyc

These numbers were obtained with the virtqueue_enable_cb_delayed
optimization after xmit_skb, btw. It turns out that moving it before
xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
100x TCP_RR a bit.
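[For context, a minimal sketch of the ordering the numbers above were
measured with: re-arm the delayed tx completion callback only after the new
packet has been queued. The helper name xmit_one_and_rearm is hypothetical;
xmit_skb() and struct send_queue are the driver-internal names from
virtio_net.c. Illustrative only, not code from the posted series.]

static int xmit_one_and_rearm(struct send_queue *sq, struct sk_buff *skb)
{
	int err;

	/* queue the new packet first ... */
	err = xmit_skb(sq, skb);
	if (unlikely(err))
		return err;

	/* ... then re-arm the delayed tx completion callback. Moving this
	 * call before xmit_skb() is the variant said to help 1x TCP_RR at
	 * some cost to 100x TCP_RR.
	 */
	virtqueue_enable_cb_delayed(sq->vq);

	return 0;
}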
On 2017-04-20 21:58, Willem de Bruijn wrote:
> On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang <jasowang at redhat.com> wrote:
>>
>> On 2017-04-19 04:21, Willem de Bruijn wrote:
>>> +static void virtnet_napi_tx_enable(struct virtnet_info *vi,
>>> +                                   struct virtqueue *vq,
>>> +                                   struct napi_struct *napi)
>>> +{
>>> +       if (!napi->weight)
>>> +               return;
>>> +
>>> +       if (!vi->affinity_hint_set) {
>>> +               napi->weight = 0;
>>> +               return;
>>> +       }
>>> +
>>> +       return virtnet_napi_enable(vq, napi);
>>> +}
>>> +
>>>  static void refill_work(struct work_struct *work)
>>
>> Maybe I was wrong, but according to Michael's comment it looks like he
>> wants to check affinity_hint_set just for speculative tx polling on rx
>> napi instead of disabling it entirely.
>>
>> And I'm not convinced this is really needed: the driver only provides an
>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>> interrupts are on the same vcpus.
>
> You're right. I made the restriction broader than the request, to really err
> on the side of caution for the initial merge of napi tx. And enabling
> the optimization is always a win over keeping it off, even without irq
> affinity.
>
> The cycle cost is significant without affinity regardless of whether the
> optimization is used.

Yes, I noticed this in the past too.

> Though this is not limited to napi-tx, it is more
> pronounced in that mode than without napi.
>
> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>
> upstream:
>
> 1,1,1: 28985 Mbps, 278 Gcyc
> 1,0,2: 30067 Mbps, 402 Gcyc
>
> napi tx:
>
> 1,1,1: 34492 Mbps, 269 Gcyc
> 1,0,2: 36527 Mbps, 537 Gcyc (!)
> 1,0,1: 36269 Mbps, 394 Gcyc
> 1,0,0: 34674 Mbps, 402 Gcyc
>
> This is a particularly strong example. It is also representative
> of most RR tests. It is less pronounced in other streaming tests.
> 10x TCP_RR, for instance:
>
> upstream:
>
> 1,1,1: 42267 Mbps, 301 Gcyc
> 1,0,2: 40663 Mbps, 445 Gcyc
>
> napi tx:
>
> 1,1,1: 42420 Mbps, 303 Gcyc
> 1,0,2: 42267 Mbps, 431 Gcyc
>
> These numbers were obtained with the virtqueue_enable_cb_delayed
> optimization after xmit_skb, btw. It turns out that moving it before
> xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
> 100x TCP_RR a bit.

I see, so I think we can leave the affinity hint optimization/check for
future investigation:

- to avoid endless optimization (e.g. we may want to share a single
  vector/napi for tx/rx queue pairs in the future) for this series.
- tx napi is disabled by default, which means we can do the optimization
  on top.

Thanks
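[If the narrower check Michael suggested is revisited later, it might look
roughly like the sketch below: gate only the speculative tx cleaning done
from the rx napi handler on the affinity hint, rather than disabling tx
napi as a whole. The helper name and exact shape are hypothetical and not
taken from the posted patches; sq->napi, free_old_xmit_skbs() and
__netif_tx_trylock() are assumed from the napi-tx series and the core
netdev API.]

static void virtnet_rx_cleantx_sketch(struct virtnet_info *vi,
				      struct send_queue *sq,
				      struct netdev_queue *txq)
{
	/* tx napi not enabled on this queue */
	if (!sq->napi.weight)
		return;

	/* Speculatively cleaning the tx ring from rx napi only pays off
	 * when tx and rx interrupts are likely to hit the same cpu, which
	 * the affinity hint merely suggests, not guarantees.
	 */
	if (!vi->affinity_hint_set)
		return;

	if (__netif_tx_trylock(txq)) {
		free_old_xmit_skbs(sq);
		__netif_tx_unlock(txq);
	}
}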
>>> Maybe I was wrong, but according to Michael's comment it looks like he
>>> wants to check affinity_hint_set just for speculative tx polling on rx
>>> napi instead of disabling it entirely.
>>>
>>> And I'm not convinced this is really needed: the driver only provides an
>>> affinity hint instead of affinity, so it's not guaranteed that tx and rx
>>> interrupts are on the same vcpus.
>>
>> You're right. I made the restriction broader than the request, to really
>> err on the side of caution for the initial merge of napi tx. And enabling
>> the optimization is always a win over keeping it off, even without irq
>> affinity.
>>
>> The cycle cost is significant without affinity regardless of whether the
>> optimization is used.
>
> Yes, I noticed this in the past too.
>
>> Though this is not limited to napi-tx, it is more
>> pronounced in that mode than without napi.
>>
>> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>>
>> upstream:
>>
>> 1,1,1: 28985 Mbps, 278 Gcyc
>> 1,0,2: 30067 Mbps, 402 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 34492 Mbps, 269 Gcyc
>> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> 1,0,1: 36269 Mbps, 394 Gcyc
>> 1,0,0: 34674 Mbps, 402 Gcyc
>>
>> This is a particularly strong example. It is also representative
>> of most RR tests. It is less pronounced in other streaming tests.
>> 10x TCP_RR, for instance:
>>
>> upstream:
>>
>> 1,1,1: 42267 Mbps, 301 Gcyc
>> 1,0,2: 40663 Mbps, 445 Gcyc
>>
>> napi tx:
>>
>> 1,1,1: 42420 Mbps, 303 Gcyc
>> 1,0,2: 42267 Mbps, 431 Gcyc
>>
>> These numbers were obtained with the virtqueue_enable_cb_delayed
>> optimization after xmit_skb, btw. It turns out that moving it before
>> xmit_skb increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
>> 100x TCP_RR a bit.
>
> I see, so I think we can leave the affinity hint optimization/check for
> future investigation:
>
> - to avoid endless optimization (e.g. we may want to share a single
>   vector/napi for tx/rx queue pairs in the future) for this series.
> - tx napi is disabled by default, which means we can do the optimization
>   on top.

Okay. I'll drop the vi->affinity_hint_set check from the patch set for now.
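[With the check dropped, the enable helper from the quoted hunk would
presumably reduce to something like the following. Illustrative sketch
only; the next revision of the patch may differ.]

static void virtnet_napi_tx_enable(struct virtnet_info *vi,
				   struct virtqueue *vq,
				   struct napi_struct *napi)
{
	/* napi->weight == 0 means tx napi is disabled */
	if (!napi->weight)
		return;

	virtnet_napi_enable(vq, napi);
}

[Whether the now-unused vi parameter stays would depend on the rest of the
series and its call sites.]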