Michael S. Tsirkin
2014-Dec-02  09:55 UTC
[PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts
On Tue, Dec 02, 2014 at 09:59:48AM +0008, Jason Wang wrote:
>
> On Tue, Dec 2, 2014 at 5:43 PM, Michael S. Tsirkin <mst at redhat.com> wrote:
> >On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
> >> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <jasowang at redhat.com> wrote:
> >> >
> >> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <mst at redhat.com> wrote:
> >> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> >> >>> Hello:
> >> >>>
> >> >>> We used to orphan packets before transmission for virtio-net. This
> >> >>> breaks socket accounting and can cause several functions to stop
> >> >>> working, e.g.:
> >> >>>
> >> >>> - Byte Queue Limit depends on tx completion notification to work.
> >> >>> - Packet Generator depends on tx completion notification for the last
> >> >>>   transmitted packet to complete.
> >> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> >> >>>   work.
> >> >>>
> >> >>> This series tries to solve the issue by enabling tx interrupts. To
> >> >>> minimize the performance impact of this, several optimizations were
> >> >>> used:
> >> >>>
> >> >>> - On the guest side, virtqueue_enable_cb_delayed() was used to delay
> >> >>>   the tx interrupt until 3/4 of the pending packets were sent.
> >> >>> - On the host side, interrupt coalescing was used to reduce tx
> >> >>>   interrupts.
> >> >>>
> >> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) show:
> >> >>>
> >> >>> - For guest receiving: no obvious regression in throughput was
> >> >>>   noticed. More cpu utilization was noticed in a few cases.
> >> >>> - For guest transmission: a very large improvement in throughput for
> >> >>>   small packet transmission was noticed. This is expected since TSQ
> >> >>>   and other optimizations for small packet transmission work after
> >> >>>   tx interrupt. But it will use more cpu for large packets.
> >> >>> - For TCP_RR, a regression (10% on transaction rate and cpu
> >> >>>   utilization) was found. Tx interrupt won't help but causes overhead
> >> >>>   in this case. Using more aggressive coalescing parameters may help
> >> >>>   to reduce the regression.
> >> >>
> >> >>OK, you do have posted coalescing patches - does it help any?
> >> >
> >> >Helps a lot.
> >> >
> >> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> >> >For small packet TX, it increases throughput by 33% - 245%. (reduce
> >> >about 60% intrs)
> >> >For TCP_RR, it increases the trans. rate by 3%-10%. (reduce 40%-80% tx
> >> >intrs)
> >> >
> >> >>I'm not sure the regression is due to interrupts.
> >> >>It would make sense for CPU but why would it
> >> >>hurt transaction rate?
> >> >
> >> >Anyway, the guest needs to take some cycles to handle tx interrupts.
> >> >And the transaction rate does increase if we coalesce more tx interrupts.
> >> >>
> >> >>It's possible that we are deferring kicks too much due to BQL.
> >> >>
> >> >>As an experiment: do we get any of it back if we do
> >> >>-        if (kick || netif_xmit_stopped(txq))
> >> >>-                virtqueue_kick(sq->vq);
> >> >>+        virtqueue_kick(sq->vq);
> >> >>?
> >> >
> >> >I will try, but during TCP_RR, at most 1 packet was pending,
> >> >I doubt BQL can help in this case.
> >> Looks like this helps a lot in multiple sessions of TCP_RR.
> >
> >so what's faster
> >	BQL + kick each packet
> >	no BQL
> >?
>
> Quick and manual tests (TCP_RR 64, TCP_STREAM 512) do not show obvious
> differences.
>
> May need a complete benchmark to see.

Okay so going forward something like BQL + kick each packet
might be a good solution.
The advantage of BQL is that it works without GSO.
For example, now that we don't do UFO, you might
see significant gains with UDP.

> >> How about moving the BQL patch out of this series?
> >> Let's first converge tx interrupt and then introduce it?
> >> (e.g. with kicking after queuing X bytes?)
> >
> >Sounds good.
Pankaj Gupta
2014-Dec-02  10:08 UTC
[PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts
> [...]
> Okay so going forward something like BQL + kick each packet
> might be a good solution.
> The advantage of BQL is that it works without GSO.
> For example, now that we don't do UFO, you might
> see significant gains with UDP.

If I understand correctly, it can also help for small packet
regr. in multiqueue scenario? Would be nice to see the perf. numbers
with multi-queue for small packets streams.
Michael S. Tsirkin
2014-Dec-02  10:11 UTC
[PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts
On Tue, Dec 02, 2014 at 05:08:35AM -0500, Pankaj Gupta wrote:
> [...]
> > Okay so going forward something like BQL + kick each packet
> > might be a good solution.
> > The advantage of BQL is that it works without GSO.
> > For example, now that we don't do UFO, you might
> > see significant gains with UDP.
>
> If I understand correctly, it can also help for small packet
> regr. in multiqueue scenario?

Well BQL generally should only be active for 1:1 mappings.

> Would be nice to see the perf. numbers
> with multi-queue for small packets streams.