Stefano Garzarella
2022-May-17 15:14 UTC
[RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
Hi Arseniy,

On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>                              INTRODUCTION
>
>    Hello, this is an experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API
>uses the same idea: call 'mmap()' on the socket's descriptor, then every
>'getsockopt()' will fill the provided vma area with pages of virtio RX
>buffers. After the received data has been processed by the user, the pages
>must be freed by an 'madvise()' call with the MADV_DONTNEED flag set (if the
>user doesn't call 'madvise()', the next 'getsockopt()' will fail).

Sounds cool, but maybe we would need some socket/net experts here for
review.

Could we do something similar for the sending path as well?

>
>                                 DETAILS
>
>    Here is how the mapping with the mapped pages looks exactly: the first
>page of the mapping contains an array of trimmed virtio vsock packet headers
>(each contains only the length of the data on the corresponding page and a
>'flags' field):
>
>    struct virtio_vsock_usr_hdr {
>        uint32_t length;
>        uint32_t flags;
>    };
>
>The 'length' field lets the user know the exact size of the payload within
>each sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags
>(such as message bounds or record bounds). All other pages are data pages
>from the RX queue.
>
>         Page 0      Page 1      Page N
>
>    [ hdr1 .. hdrN ][ data ] .. [ data ]
>       |       |       ^           ^
>       |       |       |           |
>       |       *-------------------*
>       |               |
>       |               |
>       *---------------*
>
>    Of course, a single header could represent an array of pages (when the
>packet's buffer is bigger than one page). So here is an example of a detailed
>mapping layout for some set of packets. Let's consider that we have the
>following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All
>pages 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is
>large enough).
>
>    Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>    Page 1: [ 56 ]
>    Page 2: [ 4096 ]
>    Page 3: [ 4096 ]
>    Page 4: [ 4096 ]
>    Page 5: [ 8 ]
>
>    Page 0 contains only the array of headers:
>    'hdr0' has 56 in its length field.
>    'hdr1' has 4096 in its length field.
>    'hdr2' has 8200 in its length field.
>    'hdr3' has 0 in its length field (this is the end-of-data marker).
>
>    Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>    Page 2 corresponds to 'hdr1' and is filled with data.
>    Page 3 corresponds to 'hdr2' and is filled with data.
>    Page 4 corresponds to 'hdr2' and is filled with data.
>    Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>    This patchset also changes the way packets are allocated: today's
>implementation uses only 'kmalloc()' to create the data buffer. A problem
>appears when we try to map such buffers into the user's vma - the kernel
>forbids mapping slab pages into a user's vma (as pages of "not large"
>'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>could be bigger than one page). So to avoid this, data buffers are now
>allocated with 'alloc_pages()'.
>
>                                  TESTS
>
>    This patchset updates the 'vsock_test' utility: two tests for the new
>feature were added. The first test covers invalid cases. The second checks a
>valid transmission case.

Thanks for adding the test!

>
>                               BENCHMARKING
>
>    For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>client/server mode. When the client connects to the server, the server starts
>sending an exact amount of data to the client (the amount is set as an input
>argument). The client reads the data and waits for the next portion of it.
>The client works in two modes: copy and zero-copy. In copy mode the client
>uses the 'read()' call, while in zerocopy mode the sequence of 'mmap()'/
>'getsockopt()'/'madvise()' is used. A smaller transmission time is better.
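
Just to check that I'm reading the zerocopy flow correctly, the client loop
would look roughly like the sketch below, right? (Only a sketch: I left out
the actual getsockopt() level/optname since those come from your patches, the
mmap() prot/flags are a guess, and the header walk just follows the layout
you described above.)

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define RX_PAGES 64UL                   /* size of the RX mapping, in pages */
#define PAGE_SZ  4096UL

/* Trimmed header array placed in page 0 of the mapping. */
struct virtio_vsock_usr_hdr {
        uint32_t length;
        uint32_t flags;
};

static void zerocopy_read_once(int fd)
{
        size_t map_len = RX_PAGES * PAGE_SZ;

        /* 1) Map the socket; the kernel will insert RX pages here.
         *    (prot/flags are a guess, the exact values come from the patchset)
         */
        void *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return;

        /* 2) getsockopt(fd, <level>, <zerocopy rx optname>, map, &len)
         *    fills the mapping: page 0 gets the header array, the following
         *    pages get the RX buffer data. Omitted here because the real
         *    level/optname are defined by the patchset. */

        /* 3) Walk the header array until the zero-length end marker. */
        struct virtio_vsock_usr_hdr *hdr = map;
        char *data = (char *)map + PAGE_SZ;

        while (hdr->length != 0) {
                /* Real code would consume data[0..length) here. */
                printf("packet at offset %zu: %u bytes, flags 0x%x\n",
                       (size_t)(data - (char *)map), hdr->length, hdr->flags);
                /* A single header may cover several data pages. */
                data += (hdr->length + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
                hdr++;
        }

        /* 4) Give the pages back before the next getsockopt(). */
        madvise(map, map_len, MADV_DONTNEED);
        munmap(map, map_len);
}

If that is the intended sequence, it could be worth documenting it like this
in the cover letter or in the test utility.
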
>    For the server, we can set the size of the tx buffer, and for the client
>we can set the size of the rx buffer or the rx mapping size (in zerocopy
>mode). Usage of this utility is quite simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets the number of megabytes to transfer.
>[--rx] sets the size of the receive buffer/mapping in pages.
>[--tx] sets the size of the transmit buffer in pages.
>
>I checked the transmission of 4000mb of data. Here are some results:
>
>                          size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |   8    |    32    |   64    |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>| zerocopy     |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*--------*----------*---------*----------*----------* process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*--------*----------*---------*----------*----------*
>
>I think the results are not so impressive, but at least they are not worse
>than copy mode, and there is no need to allocate memory to process the data.

Why is it twice as slow in the first column?

>
>                                 PROBLEMS
>
>    The updated packet allocation logic creates a problem: when the host gets
>data from the guest (in vhost-vsock), it allocates at least one page for each
>packet (even if the packet has a 1 byte payload). I think this could be
>resolved in several ways:

Can we somehow copy the incoming packets into the payload of the
already queued packet?

This reminds me that we have yet to fix a similar problem with
kmalloc() as well...

https://bugzilla.kernel.org/show_bug.cgi?id=215329

>    1) Make zerocopy rx mode disabled by default, so if the user didn't
>enable it, the current 'kmalloc()' way will be used.

That sounds reasonable to me, I guess also TCP needs a setsockopt()
call to enable the feature, right?

>    2) Use 'kmalloc()' for "small" packets, otherwise call the page
>allocator. But in this case we have a mix of packets allocated in two
>different ways, so during zerocopying to the user (e.g. mapping pages into
>the vma) such small packets will be handled in a rather clumsy way: we need
>to allocate one page for the user, copy the data to it and then insert the
>page into the user's vma.

It seems more difficult to me, but at the same time doable.
I would lean more towards option 1, though.

>
>P.S: of course this is an experimental RFC, so what do you think, guys?

It seems cool :-)

But I would like some feedback from the net guys, since we are reusing
some TCP-like ideas here.

Thanks,
Stefano
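
P.S. about option 2: just to be sure we are picturing the same thing, I mean
something like the sketch below on the allocation side (only a sketch, not
code from the patchset; the threshold is arbitrary, and the free path would
of course need to know whether to call kfree() or put the pages back):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Sketch only: keep kmalloc() for small payloads, fall back to whole
 * pages for anything bigger, and report which one was used so the
 * free/mapping paths can tell the two kinds of buffers apart. */
static void *vsock_pkt_buf_alloc(size_t len, bool *page_backed)
{
        struct page *pages;

        if (len <= PAGE_SIZE / 2) {             /* arbitrary threshold */
                *page_backed = false;
                return kmalloc(len, GFP_KERNEL);
        }

        pages = alloc_pages(GFP_KERNEL, get_order(len));
        if (!pages)
                return NULL;

        *page_backed = true;
        return page_address(pages);
}

But, as you say, the mapping path would then have to special-case the
kmalloc()'ed buffers, which is why option 1 still looks simpler to me.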