thr3ads.net - Virtualization - [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput [Apr 2019]


Stefano Garzarella

2019-Apr-04 16:47 UTC

[PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> I simply love it that you have analysed the individual impact of
> each patch! Great job!

Thanks! I followed Stefan's suggestions!

> 
> For comparison's sake, it could be IMHO beneficial to add a column
> with virtio-net+vhost-net performance.
> 
> This will both give us an idea about whether the vsock layer introduces
> inefficiencies, and whether the virtio-net idea has merit.
> 
Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
this way:
  $ qemu-system-x86_64 ... \
      -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
      -device virtio-net-pci,netdev=net0

I also ran a test with TCP_NODELAY set, just to be fair, because VSOCK
doesn't implement the kind of send buffering that TCP_NODELAY disables.
In both cases I set the MTU to the maximum allowed (65520).
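
As a minimal sketch of what the TCP_NODELAY test toggles: the benchmark tool used for these numbers is not named in this message, so this is only an illustration in Python, not the actual test code. The helper name make_nodelay_socket is made up for the example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the stack to send each write as soon as
    # possible instead of coalescing small writes into larger segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm it was applied.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

Without this option, TCP may merge several small writes before sending, which is the buffering effect discussed for the small packet sizes.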

                        VSOCK                        TCP + virtio-net + vhost
                  host -> guest [Gbps]                 host -> guest [Gbps]
pkt_size  before opt. patch 1 patches 2+3 patch 4     TCP_NODELAY
  64          0.060     0.102     0.102     0.096         0.16        0.15
  256         0.22      0.40      0.40      0.36          0.32        0.57
  512         0.42      0.82      0.85      0.74          1.2         1.2
  1K          0.7       1.6       1.6       1.5           2.1         2.1
  2K          1.5       3.0       3.1       2.9           3.5         3.4
  4K          2.5       5.2       5.3       5.3           5.5         5.3
  8K          3.9       8.4       8.6       8.8           8.0         7.9
  16K         6.6      11.1      11.3      12.8           9.8        10.2
  32K         9.9      15.8      15.8      18.1          11.8        10.7
  64K        13.5      17.4      17.7      21.4          11.4        11.3
  128K       17.9      19.0      19.0      23.6          11.2        11.0
  256K       18.0      19.4      19.8      24.4          11.1        11.0
  512K       18.4      19.6      20.1      25.3          10.1        10.7

For small packet sizes (< 4K) I think we should implement some kind of
batching/merging, which could come for free if we use virtio-net as a
transport.

Note: maybe I have something misconfigured, because TCP on virtio-net
tops out at around 11 Gbps in the host -> guest case.

                        VSOCK                        TCP + virtio-net + vhost
                  guest -> host [Gbps]                 guest -> host [Gbps]
pkt_size  before opt. patch 1 patches 2+3             TCP_NODELAY
  64          0.088     0.100     0.101                   0.24        0.24
  256         0.35      0.36      0.41                    0.36        1.03
  512         0.70      0.74      0.73                    0.69        1.6
  1K          1.1       1.3       1.3                     1.1         3.0
  2K          2.4       2.4       2.6                     2.1         5.5
  4K          4.3       4.3       4.5                     3.8         8.8
  8K          7.3       7.4       7.6                     6.6        20.0
  16K         9.2       9.6      11.1                    12.3        29.4
  32K         8.3       8.9      18.1                    19.3        28.2
  64K         8.3       8.9      25.4                    20.6        28.7
  128K        7.2       8.7      26.7                    23.1        27.9
  256K        7.7       8.4      24.9                    28.5        29.4
  512K        7.7       8.5      25.0                    28.3        29.3

For guest -> host I think the TCP_NODELAY test is important, because TCP
buffering increases the throughput a lot.
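
To make the per-patch gains easier to read, here is a small sketch that computes speedup ratios from a few rows of the VSOCK columns in the two tables above. The numbers are copied from those tables. The dictionary layout and the speedup helper are only illustrative.

```python
# Throughput in Gbps, copied from the VSOCK columns of the tables above.
# host -> guest: pkt_size -> (before opt., patch 4)
host_to_guest = {
    "64": (0.060, 0.096),
    "4K": (2.5, 5.3),
    "512K": (18.4, 25.3),
}
# guest -> host: pkt_size -> (before opt., patches 2+3)
guest_to_host = {
    "64": (0.088, 0.101),
    "4K": (4.3, 4.5),
    "64K": (8.3, 25.4),
    "512K": (7.7, 25.0),
}


def speedup(before: float, after: float) -> float:
    """Ratio of optimized to baseline throughput."""
    return after / before


for name, rows in (("host -> guest", host_to_guest),
                   ("guest -> host", guest_to_host)):
    for pkt_size, (before, after) in rows.items():
        print(f"{name} {pkt_size:>5}: {speedup(before, after):.2f}x")
```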
> One other comment: it makes sense to test with disabling smap
> mitigations (boot host and guest with nosmap).  No problem with also
> testing the default smap path, but I think you will discover that the
> performance impact of smap hardening being enabled is often severe for
> such benchmarks.

Thanks for this valuable suggestion, I'll redo all the tests with nosmap!

Cheers,
Stefano

Michael S. Tsirkin

2019-Apr-04 18:04 UTC


[PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > I simply love it that you have analysed the individual impact of
> > each patch! Great job!
> 
> Thanks! I followed Stefan's suggestions!
> 
> > 
> > For comparison's sake, it could be IMHO beneficial to add a column
> > with virtio-net+vhost-net performance.
> > 
> > This will both give us an idea about whether the vsock layer introduces
> > inefficiencies, and whether the virtio-net idea has merit.
> > 
> 
> Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> this way:
>   $ qemu-system-x86_64 ... \
>       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
>       -device virtio-net-pci,netdev=net0
> 
> I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> doesn't implement something like this.

Why not?

> In both cases I set the MTU to the maximum allowed (65520).
> 
>                         VSOCK                        TCP + virtio-net + vhost
>                   host -> guest [Gbps]                 host -> guest [Gbps]
> pkt_size  before opt. patch 1 patches 2+3 patch 4     TCP_NODELAY
>   64          0.060     0.102     0.102     0.096         0.16        0.15
>   256         0.22      0.40      0.40      0.36          0.32        0.57
>   512         0.42      0.82      0.85      0.74          1.2         1.2
>   1K          0.7       1.6       1.6       1.5           2.1         2.1
>   2K          1.5       3.0       3.1       2.9           3.5         3.4
>   4K          2.5       5.2       5.3       5.3           5.5         5.3
>   8K          3.9       8.4       8.6       8.8           8.0         7.9
>   16K         6.6      11.1      11.3      12.8           9.8        10.2
>   32K         9.9      15.8      15.8      18.1          11.8        10.7
>   64K        13.5      17.4      17.7      21.4          11.4        11.3
>   128K       17.9      19.0      19.0      23.6          11.2        11.0
>   256K       18.0      19.4      19.8      24.4          11.1        11.0
>   512K       18.4      19.6      20.1      25.3          10.1        10.7
> 
> For small packet size (< 4K) I think we should implement some kind of
> batching/merging, that could be for free if we use virtio-net as a
> transport.
> 
> Note: Maybe I have something miss configured because TCP on virtio-net
> for host -> guest case doesn't exceed 11 Gbps.
> 
>                         VSOCK                        TCP + virtio-net + vhost
>                   guest -> host [Gbps]                 guest -> host [Gbps]
> pkt_size  before opt. patch 1 patches 2+3             TCP_NODELAY
>   64          0.088     0.100     0.101                   0.24        0.24
>   256         0.35      0.36      0.41                    0.36        1.03
>   512         0.70      0.74      0.73                    0.69        1.6
>   1K          1.1       1.3       1.3                     1.1         3.0
>   2K          2.4       2.4       2.6                     2.1         5.5
>   4K          4.3       4.3       4.5                     3.8         8.8
>   8K          7.3       7.4       7.6                     6.6        20.0
>   16K         9.2       9.6      11.1                    12.3        29.4
>   32K         8.3       8.9      18.1                    19.3        28.2
>   64K         8.3       8.9      25.4                    20.6        28.7
>   128K        7.2       8.7      26.7                    23.1        27.9
>   256K        7.7       8.4      24.9                    28.5        29.4
>   512K        7.7       8.5      25.0                    28.3        29.3
> 
> For guest -> host I think is important the TCP_NODELAY test, because TCP
> buffering increases a lot the throughput.
> 
> > One other comment: it makes sense to test with disabling smap
> > mitigations (boot host and guest with nosmap).  No problem with also
> > testing the default smap path, but I think you will discover that the
> > performance impact of smap hardening being enabled is often severe for
> > such benchmarks.
> 
> Thanks for this valuable suggestion, I'll redo all the tests with nosmap!
> 
> Cheers,
> Stefano

Stefano Garzarella

2019-Apr-05 07:49 UTC


[PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > I simply love it that you have analysed the individual impact of
> > > each patch! Great job!
> > 
> > Thanks! I followed Stefan's suggestions!
> > 
> > > 
> > > For comparison's sake, it could be IMHO beneficial to add a column
> > > with virtio-net+vhost-net performance.
> > > 
> > > This will both give us an idea about whether the vsock layer introduces
> > > inefficiencies, and whether the virtio-net idea has merit.
> > > 
> > 
> > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > this way:
> >   $ qemu-system-x86_64 ... \
> >       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> >       -device virtio-net-pci,netdev=net0
> > 
> > I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> > doesn't implement something like this.
> 
> Why not?
> 
I think because originally VSOCK was designed to be simple and
low-latency, but of course we can introduce something like that.

The current implementation copies the buffer directly from user space
into a virtio_vsock_pkt and enqueues it for transmission.

Maybe we can introduce a per-socket buffer where we accumulate bytes and
send them when the buffer is full or when a timer fires. We can also
introduce a VSOCK_NODELAY option (maybe using the same value as
TCP_NODELAY for compatibility) to send the buffer immediately for
low-latency use cases.
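
A rough sketch of that idea, in Python rather than kernel C for brevity. Everything here is hypothetical: the CoalescingBuffer class, its parameters, and the poll() call that stands in for a kernel timer are illustrative names, not part of the existing vsock code, and VSOCK_NODELAY does not exist yet.

```python
import time
from typing import Callable


class CoalescingBuffer:
    """Accumulate small writes and hand them to `send` in larger chunks."""

    def __init__(self, send: Callable[[bytes], None], capacity: int = 4096,
                 max_delay: float = 0.001, nodelay: bool = False,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._capacity = capacity
        self._max_delay = max_delay
        self._nodelay = nodelay      # stands in for a VSOCK_NODELAY option
        self._clock = clock
        self._pending = bytearray()
        self._oldest = 0.0           # start of the current delay window

    def write(self, data: bytes) -> None:
        if self._nodelay:
            # Low-latency mode: bypass the buffer entirely.
            self._send(bytes(data))
            return
        if not self._pending:
            self._oldest = self._clock()
        self._pending += data
        # Send every full chunk; any remainder waits for more data
        # or for the timer.
        while len(self._pending) >= self._capacity:
            self._send(bytes(self._pending[:self._capacity]))
            del self._pending[:self._capacity]
            # Restart the delay window for whatever remains.
            self._oldest = self._clock()

    def poll(self) -> None:
        """Timer stand-in: flush if pending data has waited too long."""
        if self._pending and self._clock() - self._oldest >= self._max_delay:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(bytes(self._pending))
            self._pending.clear()


# Usage with a fake clock so the timer behaviour is deterministic.
sent = []
fake_now = [0.0]
buf = CoalescingBuffer(sent.append, capacity=8, max_delay=1.0,
                       clock=lambda: fake_now[0])
buf.write(b"abc")        # stays buffered
buf.write(b"defghij")    # 10 bytes pending: one full 8-byte chunk goes out
fake_now[0] = 2.0
buf.poll()               # delay exceeded: the 2-byte remainder is flushed
```

The trade-off is the usual one: a larger capacity or delay merges more small writes per packet, while the nodelay flag keeps the current send-immediately behaviour for latency-sensitive users.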

What do you think?

Thanks,
Stefano
