Stefano Garzarella
2022-May-24 07:32 UTC
[RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 19.05.2022 10:42, Stefano Garzarella wrote:
>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>> Hello Stefano,
>>>
>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>> Hi Arseniy,
>>>>
>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>                              INTRODUCTION
>>>>>
>>>>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>>>>> receive, inspired by TCP zerocopy receive by Eric Dumazet. The API uses the
>>>>> same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
>>>>> will fill the provided vma area with pages of virtio RX buffers. After the
>>>>> received data has been processed by the user, the pages must be freed by an
>>>>> 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't call
>>>>> 'madvise()', the next 'getsockopt()' will fail).
>>>>
>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>
>>> Yes, that would be great.
>>>
>>>> Could we do something similar for the sending path as well?
>>>
>>> Here are my thoughts about zerocopy transmission:
>>>
>>> I tried to implement this feature in the following way: the user creates a
>>> page-aligned buffer, then during tx packet allocation, instead of creating
>>> the data buffer with 'kmalloc()', I tried to add the user's buffer to the
>>> virtio queue. But I found a problem: since the kernel virtio API uses virtual
>>> addresses to add new buffers, deep in the virtio subsystem 'virt_to_phys()'
>>> is called to get the physical address of the buffer, so the user's virtual
>>> address won't be translated correctly to a physical address. (In theory, I
>>> could perform a page walk for such a user va, get the physical address and
>>> pass some "fake" virtual address to the virtio API in order to make
>>> 'virt_to_phys()' return a valid physical address, but I think this is ugly.)
>>
>> And maybe we should also pin the pages to prevent them from being replaced.
>>
>> I think we should do something similar to what we do in vhost-vdpa.
>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>
>Hm, ok. I'll read about vdpa...
>
>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>> the user calls 'mmap()' on the socket, and the kernel fills the newly created
>>> mapping with allocated pages (all pages have rw permissions). Now the user
>>> can use the pages of this mapping (e.g. fill them with data). Finally, to
>>> start transmission, the user calls 'getsockopt()' or some 'ioctl()' and the
>>> kernel processes the data of this mapping. Since this call returns
>>> immediately (i.e. it is asynchronous), some completion logic must be
>>> implemented. For example, use the same way as MSG_ZEROCOPY does - poll the
>>> socket's error queue to get a message that the pages can be reused - or
>>> don't allow the user to work with these pages at all: unmap them, perform
>>> the transmission and finally free the pages. To start a new transmission the
>>> user needs to call 'mmap()' again.
>>>
>>>                                  OR
>>>
>>> I think there is another, unusual way for zerocopy tx: let's use
>>> 'vmsplice()'/'splice()'. In this approach, to transmit something the user
>>> does the following steps:
>>> 1) Creates a pipe.
>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe to insert the data pages
>>>    into it. SPLICE_F_GIFT allows the user to forget about the allocated
>>>    pages - the kernel will free them.
>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>>>    will move the pages from the pipe to the socket (i.e. a special socket
>>>    callback gets the set of the pipe's pages as an input argument, and all
>>>    pages will be inserted into the virtio queue).
>>>
>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
>>
>> Splice seems interesting, but it would be nice if we do something similar to
>> TCP. IIUC they use a flag for send(2):
>>
>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>
>
>Yes, but with this approach I think:
>1) What is 'buf'? It can't be the user's address, since this buffer must be
>   inserted into the tx queue. I.e. it must be allocated by the kernel and
>   then returned to the user for tx purposes. In the TCP case, 'buf' is the
>   user's address (page aligned, of course) because the TCP logic uses
>   sk_buff, which allows such memory to be used as a data buffer.

IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.

>2) To wait until the tx process is done (i.e. the pages can be used again),
>   such an API (send + MSG_ZEROCOPY) uses the socket's error queue - poll for
>   events signalling that tx is finished. So the same mechanism must be
>   implemented for virtio vsock.

Yeah, I think so.
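For reference, this is roughly how the existing TCP/UDP MSG_ZEROCOPY flow looks
from user space, using the constants those protocols already provide
(SO_ZEROCOPY, MSG_ZEROCOPY, SO_EE_ORIGIN_ZEROCOPY). A vsock variant would
presumably mirror it, but nothing in this sketch is part of this patchset:

    #include <errno.h>
    #include <poll.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60              /* in case libc headers are old */
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    static int send_zerocopy(int fd, const void *buf, size_t len)
    {
            int one = 1;
            char control[128];
            struct msghdr msg = { .msg_control = control,
                                  .msg_controllen = sizeof(control) };
            struct pollfd pfd = { .fd = fd };
            struct cmsghdr *cm;

            /* Opt in once per socket. */
            if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                    return -errno;

            /* The pages backing 'buf' are used directly for transmission. */
            if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
                    return -errno;

            /* Completion is reported on the error queue; only after the
             * notification arrives may 'buf' be reused. */
            if (poll(&pfd, 1, -1) < 0 || recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                    return -errno;

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

                    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                            return 0;   /* ee_info..ee_data: completed sends */
            }
            return -1;
    }

As discussed above, the open question for vsock is how 'buf' ends up in the
virtqueue: pinning the user pages (as vhost-vdpa does) versus handing out a
kernel-allocated buffer.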
>>>>>                                DETAILS
>>>>>
>>>>>     Here is how the mapping with the mapped pages looks exactly: the first
>>>>> page of the mapping contains an array of trimmed virtio vsock packet headers
>>>>> (each contains only the length of the data on the corresponding page and a
>>>>> 'flags' field):
>>>>>
>>>>>     struct virtio_vsock_usr_hdr {
>>>>>         uint32_t length;
>>>>>         uint32_t flags;
>>>>>     };
>>>>>
>>>>> The 'length' field allows the user to know the exact size of the payload
>>>>> within each sequence of pages, and 'flags' allows the user to handle
>>>>> SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>>>>> pages are data pages from the RX queue.
>>>>>
>>>>>          Page 0      Page 1      Page N
>>>>>
>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>         |       |      ^           ^
>>>>>         |       |      |           |
>>>>>         |       *------------------*
>>>>>         |              |
>>>>>         |              |
>>>>>         *--------------*
>>>>>
>>>>>     Of course, a single header can represent an array of pages (when the
>>>>> packet's buffer is bigger than one page). So here is an example of a
>>>>> detailed mapping layout for a set of packets. Let's consider that we have
>>>>> the following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes.
>>>>> Pages 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is
>>>>> large enough).
>>>>>
>>>>>     Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
>>>>>     Page 1: [ 56 ]
>>>>>     Page 2: [ 4096 ]
>>>>>     Page 3: [ 4096 ]
>>>>>     Page 4: [ 4096 ]
>>>>>     Page 5: [ 8 ]
>>>>>
>>>>>     Page 0 contains only the array of headers:
>>>>>     'hdr0' has 56 in its length field.
>>>>>     'hdr1' has 4096 in its length field.
>>>>>     'hdr2' has 8200 in its length field.
>>>>>     'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>
>>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>     Page 2 corresponds to 'hdr1' and is filled with data.
>>>>>     Page 3 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 4 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>>
>>>>>     This patchset also changes the way packets are allocated: today's
>>>>> implementation uses only 'kmalloc()' to create the data buffer. A problem
>>>>> happens when we try to map such buffers into the user's vma - the kernel
>>>>> forbids mapping slab pages into a user's vma (pages of "not large"
>>>>> 'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>>>>> can be bigger than one page). So to avoid this, data buffers are now
>>>>> allocated with 'alloc_pages()'.
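The allocation change described above is easy to picture; a minimal kernel-side
sketch of switching an RX data buffer from 'kmalloc()' to 'alloc_pages()', with
purely illustrative function names (not the patchset's actual code):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Illustrative only: back the RX data buffer with alloc_pages() so its
     * pages can later be remapped into a user vma (slab pages cannot be). */
    static void *vsock_alloc_rx_buf(size_t len, struct page **pg_out)
    {
            unsigned int order = get_order(len);
            struct page *page = alloc_pages(GFP_KERNEL, order);

            if (!page)
                    return NULL;

            *pg_out = page;
            return page_address(page);  /* linear address used for the buffer */
    }

    static void vsock_free_rx_buf(struct page *page, size_t len)
    {
            __free_pages(page, get_order(len));
    }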
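Putting the receive flow from the INTRODUCTION and DETAILS sections together, a
user-space consumer might look roughly like the sketch below. The getsockopt()
level and option name are placeholders (this excerpt of the cover letter does
not name them), and exactly how getsockopt() identifies the mapping and reports
how much of it was filled is assumed; only the mmap()/getsockopt()/parse
headers/madvise(MADV_DONTNEED) sequence is taken from the description above:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Placeholders for the level/option introduced by the patchset's uAPI. */
    #define ZC_SOCKOPT_LEVEL 0          /* placeholder */
    #define ZC_SOCKOPT_NAME  0          /* placeholder */

    struct virtio_vsock_usr_hdr {       /* page 0: array of these headers */
            uint32_t length;
            uint32_t flags;
    };

    static int read_zerocopy(int fd, size_t map_pages)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t map_len = map_pages * page;
            socklen_t optlen = map_len;
            struct virtio_vsock_usr_hdr *hdr;
            char *map, *data;
            int ret = -1;

            map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    return -1;

            /* Each getsockopt() fills the mapping with RX buffer pages. */
            if (getsockopt(fd, ZC_SOCKOPT_LEVEL, ZC_SOCKOPT_NAME, map, &optlen))
                    goto out;

            /* Page 0 holds the header array; data pages follow. A header
             * with length == 0 is the end-of-data marker. */
            hdr = (struct virtio_vsock_usr_hdr *)map;
            data = map + page;

            for (; hdr->length; hdr++) {
                    /* A real consumer would process hdr->length bytes at
                     * 'data'; each payload occupies whole pages. */
                    printf("packet: %u bytes, flags 0x%x, first byte 0x%x\n",
                           hdr->length, hdr->flags, (unsigned char)data[0]);
                    data += ((hdr->length + page - 1) / page) * page;
            }

            /* Give the pages back; without this the next getsockopt() fails. */
            madvise(map, map_len, MADV_DONTNEED);
            ret = 0;
    out:
            munmap(map, map_len);
            return ret;
    }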
>>>>>                                 TESTS
>>>>>
>>>>>     This patchset updates the 'vsock_test' utility: two tests for the new
>>>>> feature were added. The first test covers invalid cases, the second checks
>>>>> a valid transmission case.
>>>>
>>>> Thanks for adding the test!
>>>>
>>>>>                              BENCHMARKING
>>>>>
>>>>>     For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>>>>> client/server mode. When the client connects to the server, the server
>>>>> starts sending an exact amount of data to the client (the amount is set as
>>>>> an input argument). The client reads the data and waits for the next
>>>>> portion of it. The client works in two modes: copy and zero-copy. In copy
>>>>> mode the client uses the 'read()' call, while in zerocopy mode the sequence
>>>>> 'mmap()'/'getsockopt()'/'madvise()' is used. A smaller transmission time is
>>>>> better. For the server we can set the size of the tx buffer, and for the
>>>>> client we can set the size of the rx buffer or the rx mapping size (in
>>>>> zerocopy mode). Usage of this utility is quite simple:
>>>>>
>>>>> For client mode:
>>>>>
>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>
>>>>> For server mode:
>>>>>
>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>
>>>>> [--mb] sets the number of megabytes to transfer.
>>>>> [--rx] sets the size of the receive buffer/mapping in pages.
>>>>> [--tx] sets the size of the transmit buffer in pages.
>>>>>
>>>>> I checked transmission of 4000 MB of data. Here are some results, in
>>>>> seconds to process the 4000 MB (lower is better):
>>>>>
>>>>>                      size of rx/tx buffers in pages
>>>>>                *--------*--------*--------*--------*--------*
>>>>>                |    8   |   32   |   64   |  256   |  512   |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>> |   zerocopy   |   24   |  10.6  |  12.2  |  23.6  |   21   |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>> | non-zerocopy |   13   |  16.4  |  24.7  |  27.2  |  23.9  |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>>
>>>>> I think the results are not that impressive, but at least it is not worse
>>>>> than copy mode, and there is no need to allocate memory for processing the
>>>>> data.
>>>>
>>>> Why is it twice as slow in the first column?
>>>
>>> Maybe this is because memory copying of small buffers is very fast... I'll
>>> analyze it more deeply.
>>
>> Maybe I misunderstood, by small buffers here what do you mean?
>>
>> I thought 8 was the number of pages, so 32KB buffers.
>
>Yes, 8 is the size in pages. Anyway, I need to check it more deeply.

Okay, thanks!

Stefano