Jason Wang
2018-Jul-16 09:46 UTC
[PATCH net-next V2 0/8] Packed virtqueue support for vhost
On 2018?07?16? 16:39, Michael S. Tsirkin wrote:> On Mon, Jul 16, 2018 at 11:28:03AM +0800, Jason Wang wrote: >> Hi all: >> >> This series implements packed virtqueues. The code were tested with >> Tiwei's guest driver series at https://patchwork.ozlabs.org/cover/942297/ >> >> >> Pktgen test for both RX and TX does not show obvious difference with >> split virtqueues. The main bottleneck is the guest Linux driver, since >> it can not stress vhost for a 100% CPU utilization. A full TCP >> benchmark is ongoing. Will test virtio-net pmd as well when it was >> ready. > Well the question then is why we should bother merging this > if this doesn't give a performance gain.We meet bottlenecks at other places. I can only test Linux driver which has lots of overheads e.g interrupts. And perf show only a small fraction of time were spent on e.g virtqueue manipulation. I hope virtio-net pmd can give us different result but we don't have one ready for testing now. (Jen's V4 have bugs thus can not work with this series).> Do you see > a gain in CPU utilization maybe?Unfortunately not.> > If not - let's wait for that TCP benchmark result?We can, but you know TCP_STREAM result is sometime misleading. A brunch of other patches of mine were rebased on this and then blocked on this series. Consider we don't meet regression, maybe we can merge this first and try optimizations or fixups on top? Thanks> >> Notes: >> - This version depends on Tiwei's series at https://patchwork.ozlabs.org/cover/942297/ >> >> This version were tested with: >> >> - Zerocopy (Out of Order) support >> - vIOMMU support >> - mergeable buffer on/off >> - busy polling on/off >> - vsock (nc-vsock) >> >> Changes from V1: >> - drop uapi patch and use Tiwei's >> - split the enablement of packed virtqueue into a separate patch >> >> Changes from RFC V5: >> >> - save unnecessary barriers during vhost_add_used_packed_n() >> - more compact math for event idx >> - fix failure of SET_VRING_BASE when avail_wrap_counter is true >> - fix not copy avail_wrap_counter during GET_VRING_BASE >> - introduce SET_VRING_USED_BASE/GET_VRING_USED_BASE for syncing last_used_idx >> - rename used_wrap_counter to last_used_wrap_counter >> - rebase to net-next >> >> Changes from RFC V4: >> >> - fix signalled_used index recording >> - track avail index correctly >> - various minor fixes >> >> Changes from RFC V3: >> >> - Fix math on event idx checking >> - Sync last avail wrap counter through GET/SET_VRING_BASE >> - remove desc_event prefix in the driver/device structure >> >> Changes from RFC V2: >> >> - do not use & in checking desc_event_flags >> - off should be most significant bit >> - remove the workaround of mergeable buffer for dpdk prototype >> - id should be in the last descriptor in the chain >> - keep _F_WRITE for write descriptor when adding used >> - device flags updating should use ADDR_USED type >> - return error on unexpected unavail descriptor in a chain >> - return false in vhost_ve_avail_empty is descriptor is available >> - track last seen avail_wrap_counter >> - correctly examine available descriptor in get_indirect_packed() >> - vhost_idx_diff should return u16 instead of bool >> >> Changes from RFC V1: >> >> - Refactor vhost used elem code to avoid open coding on used elem >> - Event suppression support (compile test only). >> - Indirect descriptor support (compile test only). >> - Zerocopy support. >> - vIOMMU support. >> - SCSI/VSOCK support (compile test only). >> - Fix several bugs >> >> Jason Wang (8): >> vhost: move get_rx_bufs to vhost.c >> vhost: hide used ring layout from device >> vhost: do not use vring_used_elem >> vhost_net: do not explicitly manipulate vhost_used_elem >> vhost: vhost_put_user() can accept metadata type >> vhost: packed ring support >> vhost: event suppression for packed ring >> vhost: enable packed virtqueues >> >> drivers/vhost/net.c | 143 ++----- >> drivers/vhost/scsi.c | 62 +-- >> drivers/vhost/vhost.c | 994 ++++++++++++++++++++++++++++++++++++++++----- >> drivers/vhost/vhost.h | 55 ++- >> drivers/vhost/vsock.c | 42 +- >> include/uapi/linux/vhost.h | 7 + >> 6 files changed, 1035 insertions(+), 268 deletions(-) >> >> -- >> 2.7.4
Michael S. Tsirkin
2018-Jul-16 12:49 UTC
[PATCH net-next V2 0/8] Packed virtqueue support for vhost
On Mon, Jul 16, 2018 at 05:46:33PM +0800, Jason Wang wrote:> > > On 2018?07?16? 16:39, Michael S. Tsirkin wrote: > > On Mon, Jul 16, 2018 at 11:28:03AM +0800, Jason Wang wrote: > > > Hi all: > > > > > > This series implements packed virtqueues. The code were tested with > > > Tiwei's guest driver series at https://patchwork.ozlabs.org/cover/942297/ > > > > > > > > > Pktgen test for both RX and TX does not show obvious difference with > > > split virtqueues. The main bottleneck is the guest Linux driver, since > > > it can not stress vhost for a 100% CPU utilization. A full TCP > > > benchmark is ongoing. Will test virtio-net pmd as well when it was > > > ready. > > Well the question then is why we should bother merging this > > if this doesn't give a performance gain. > > We meet bottlenecks at other places. I can only test Linux driver which has > lots of overheads e.g interrupts. And perf show only a small fraction of > time were spent on e.g virtqueue manipulation. I hope virtio-net pmd can > give us different result but we don't have one ready for testing now. (Jen's > V4 have bugs thus can not work with this series).Can't linux busy poll? And how about testing loopback with XDP?> > Do you see > > a gain in CPU utilization maybe? > > Unfortunately not. > > > > > If not - let's wait for that TCP benchmark result? > > We can, but you know TCP_STREAM result is sometime misleading. > > A brunch of other patches of mine were rebased on this and then blocked on > this series. Consider we don't meet regression, maybe we can merge this > first and try optimizations or fixups on top? > > ThanksI'm not sure I understand this approach. Packed ring is just an optimization. What value is there in merging it if it does not help speed?> > > > > Notes: > > > - This version depends on Tiwei's series at https://patchwork.ozlabs.org/cover/942297/ > > > > > > This version were tested with: > > > > > > - Zerocopy (Out of Order) support > > > - vIOMMU support > > > - mergeable buffer on/off > > > - busy polling on/off > > > - vsock (nc-vsock) > > > > > > Changes from V1: > > > - drop uapi patch and use Tiwei's > > > - split the enablement of packed virtqueue into a separate patch > > > > > > Changes from RFC V5: > > > > > > - save unnecessary barriers during vhost_add_used_packed_n() > > > - more compact math for event idx > > > - fix failure of SET_VRING_BASE when avail_wrap_counter is true > > > - fix not copy avail_wrap_counter during GET_VRING_BASE > > > - introduce SET_VRING_USED_BASE/GET_VRING_USED_BASE for syncing last_used_idx > > > - rename used_wrap_counter to last_used_wrap_counter > > > - rebase to net-next > > > > > > Changes from RFC V4: > > > > > > - fix signalled_used index recording > > > - track avail index correctly > > > - various minor fixes > > > > > > Changes from RFC V3: > > > > > > - Fix math on event idx checking > > > - Sync last avail wrap counter through GET/SET_VRING_BASE > > > - remove desc_event prefix in the driver/device structure > > > > > > Changes from RFC V2: > > > > > > - do not use & in checking desc_event_flags > > > - off should be most significant bit > > > - remove the workaround of mergeable buffer for dpdk prototype > > > - id should be in the last descriptor in the chain > > > - keep _F_WRITE for write descriptor when adding used > > > - device flags updating should use ADDR_USED type > > > - return error on unexpected unavail descriptor in a chain > > > - return false in vhost_ve_avail_empty is descriptor is available > > > - track last seen avail_wrap_counter > > > - correctly examine available descriptor in get_indirect_packed() > > > - vhost_idx_diff should return u16 instead of bool > > > > > > Changes from RFC V1: > > > > > > - Refactor vhost used elem code to avoid open coding on used elem > > > - Event suppression support (compile test only). > > > - Indirect descriptor support (compile test only). > > > - Zerocopy support. > > > - vIOMMU support. > > > - SCSI/VSOCK support (compile test only). > > > - Fix several bugs > > > > > > Jason Wang (8): > > > vhost: move get_rx_bufs to vhost.c > > > vhost: hide used ring layout from device > > > vhost: do not use vring_used_elem > > > vhost_net: do not explicitly manipulate vhost_used_elem > > > vhost: vhost_put_user() can accept metadata type > > > vhost: packed ring support > > > vhost: event suppression for packed ring > > > vhost: enable packed virtqueues > > > > > > drivers/vhost/net.c | 143 ++----- > > > drivers/vhost/scsi.c | 62 +-- > > > drivers/vhost/vhost.c | 994 ++++++++++++++++++++++++++++++++++++++++----- > > > drivers/vhost/vhost.h | 55 ++- > > > drivers/vhost/vsock.c | 42 +- > > > include/uapi/linux/vhost.h | 7 + > > > 6 files changed, 1035 insertions(+), 268 deletions(-) > > > > > > -- > > > 2.7.4
Jason Wang
2018-Jul-17 00:45 UTC
[PATCH net-next V2 0/8] Packed virtqueue support for vhost
On 2018?07?16? 20:49, Michael S. Tsirkin wrote:> On Mon, Jul 16, 2018 at 05:46:33PM +0800, Jason Wang wrote: >> >> On 2018?07?16? 16:39, Michael S. Tsirkin wrote: >>> On Mon, Jul 16, 2018 at 11:28:03AM +0800, Jason Wang wrote: >>>> Hi all: >>>> >>>> This series implements packed virtqueues. The code were tested with >>>> Tiwei's guest driver series at https://patchwork.ozlabs.org/cover/942297/ >>>> >>>> >>>> Pktgen test for both RX and TX does not show obvious difference with >>>> split virtqueues. The main bottleneck is the guest Linux driver, since >>>> it can not stress vhost for a 100% CPU utilization. A full TCP >>>> benchmark is ongoing. Will test virtio-net pmd as well when it was >>>> ready. >>> Well the question then is why we should bother merging this >>> if this doesn't give a performance gain. >> We meet bottlenecks at other places. I can only test Linux driver which has >> lots of overheads e.g interrupts. And perf show only a small fraction of >> time were spent on e.g virtqueue manipulation. I hope virtio-net pmd can >> give us different result but we don't have one ready for testing now. (Jen's >> V4 have bugs thus can not work with this series). > Can't linux busy poll?For vhost busy polling, there's no difference since guest can not give vhost enough stress. For guest busy polling, it does not work for the packets generated by pktgen.> And how about testing loopback with XDP?No difference, I even shortcut both the tun_get_user() on host and netif_receive_skb() in guest.>>> Do you see >>> a gain in CPU utilization maybe? >> Unfortunately not. >> >>> If not - let's wait for that TCP benchmark result? >> We can, but you know TCP_STREAM result is sometime misleading. >> >> A brunch of other patches of mine were rebased on this and then blocked on >> this series. Consider we don't meet regression, maybe we can merge this >> first and try optimizations or fixups on top? >> >> Thanks > I'm not sure I understand this approach. Packed ring is just an optimization. > What value is there in merging it if it does not help speed?If you want to support migration from dpdk or vDPA backend. And we still have the chance to see the performance with virito-net pmd in the future. If this does not make sense for you, I will leave this series until we can get results from virtio-net pmd (or find a way that packed virtqueue outperform). And I will start to post other optimizations on vhost. Thanks> >>>> Notes: >>>> - This version depends on Tiwei's series at https://patchwork.ozlabs.org/cover/942297/ >>>> >>>> This version were tested with: >>>> >>>> - Zerocopy (Out of Order) support >>>> - vIOMMU support >>>> - mergeable buffer on/off >>>> - busy polling on/off >>>> - vsock (nc-vsock) >>>> >>>> Changes from V1: >>>> - drop uapi patch and use Tiwei's >>>> - split the enablement of packed virtqueue into a separate patch >>>> >>>> Changes from RFC V5: >>>> >>>> - save unnecessary barriers during vhost_add_used_packed_n() >>>> - more compact math for event idx >>>> - fix failure of SET_VRING_BASE when avail_wrap_counter is true >>>> - fix not copy avail_wrap_counter during GET_VRING_BASE >>>> - introduce SET_VRING_USED_BASE/GET_VRING_USED_BASE for syncing last_used_idx >>>> - rename used_wrap_counter to last_used_wrap_counter >>>> - rebase to net-next >>>> >>>> Changes from RFC V4: >>>> >>>> - fix signalled_used index recording >>>> - track avail index correctly >>>> - various minor fixes >>>> >>>> Changes from RFC V3: >>>> >>>> - Fix math on event idx checking >>>> - Sync last avail wrap counter through GET/SET_VRING_BASE >>>> - remove desc_event prefix in the driver/device structure >>>> >>>> Changes from RFC V2: >>>> >>>> - do not use & in checking desc_event_flags >>>> - off should be most significant bit >>>> - remove the workaround of mergeable buffer for dpdk prototype >>>> - id should be in the last descriptor in the chain >>>> - keep _F_WRITE for write descriptor when adding used >>>> - device flags updating should use ADDR_USED type >>>> - return error on unexpected unavail descriptor in a chain >>>> - return false in vhost_ve_avail_empty is descriptor is available >>>> - track last seen avail_wrap_counter >>>> - correctly examine available descriptor in get_indirect_packed() >>>> - vhost_idx_diff should return u16 instead of bool >>>> >>>> Changes from RFC V1: >>>> >>>> - Refactor vhost used elem code to avoid open coding on used elem >>>> - Event suppression support (compile test only). >>>> - Indirect descriptor support (compile test only). >>>> - Zerocopy support. >>>> - vIOMMU support. >>>> - SCSI/VSOCK support (compile test only). >>>> - Fix several bugs >>>> >>>> Jason Wang (8): >>>> vhost: move get_rx_bufs to vhost.c >>>> vhost: hide used ring layout from device >>>> vhost: do not use vring_used_elem >>>> vhost_net: do not explicitly manipulate vhost_used_elem >>>> vhost: vhost_put_user() can accept metadata type >>>> vhost: packed ring support >>>> vhost: event suppression for packed ring >>>> vhost: enable packed virtqueues >>>> >>>> drivers/vhost/net.c | 143 ++----- >>>> drivers/vhost/scsi.c | 62 +-- >>>> drivers/vhost/vhost.c | 994 ++++++++++++++++++++++++++++++++++++++++----- >>>> drivers/vhost/vhost.h | 55 ++- >>>> drivers/vhost/vsock.c | 42 +- >>>> include/uapi/linux/vhost.h | 7 + >>>> 6 files changed, 1035 insertions(+), 268 deletions(-) >>>> >>>> -- >>>> 2.7.4
David Miller
2018-Jul-18 04:09 UTC
[PATCH net-next V2 0/8] Packed virtqueue support for vhost
From: "Michael S. Tsirkin" <mst at redhat.com> Date: Mon, 16 Jul 2018 15:49:04 +0300> I'm not sure I understand this approach. Packed ring is just an > optimization. What value is there in merging it if it does not help > speed?So it seems that both Tiwei's and Jason's packed patch sets are kind of in limbo due to this discussion. If I understand Jason correctly, he's trying to say that although this work doesn't show improvements by itself, however it paves the way such that optimizaations done in the future will be more visible. I kind of can see Michael's viewpoint too, in that we should put this stuff in later when it does actually show some difference. Therefore, I'll mark both patch sets as "deferred" for now. Let me know if I should do something else. Thanks!