Stefano Garzarella
2023-May-03 13:47 UTC
[RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support
On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
>On 03.05.2023 15:52, Stefano Garzarella wrote:
>> Hi Arseniy,
>> Sorry for the delay, but I have been very busy.
>
>Hello, no problem!
>
>> I can't apply this series on master or net-next, can you share with me
>> the base commit?
>
>Here is my base:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54

Thanks, it worked!

>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>>                            DESCRIPTION
>>>
>>> This is MSG_ZEROCOPY feature support for virtio/vsock. I tried to
>>> follow the current implementation for TCP as much as possible:
>>>
>>> 1) The sender must enable the SO_ZEROCOPY option to use this feature.
>>>    Without it, data is sent in the "classic" copy manner and the
>>>    MSG_ZEROCOPY flag is ignored (e.g. no completion is generated).
>>>
>>> 2) The kernel uses completions from the socket's error queue: a single
>>>    completion for a single tx syscall (or several completions may be
>>>    merged into one). I reused the already implemented MSG_ZEROCOPY
>>>    logic: 'msg_zerocopy_realloc()' etc.
>>>
>>> The difference from the copy path is not significant. During packet
>>> allocation a non-linear skb is created, then 'pin_user_pages()' is
>>> called for each page from the user's iov iterator and each returned
>>> page is added to the skb as a fragment. There are also some updates
>>> for the vhost and guest parts of the transport - in both cases I've
>>> added handling of non-linear skbs for the virtio part. vhost copies
>>> data from such an skb to the guest's rx virtio buffers. In the guest,
>>> the virtio transport fills the tx virtio queue with pages from the skb.
>>>
>>> This version has several limits/problems:
>>>
>>> 1) As this feature totally depends on the transport, there is no way
>>>    (or it is difficult) to check whether the transport is able to
>>>    handle it while SO_ZEROCOPY is being set. It seems I need to call
>>>    the AF_VSOCK-specific setsockopt callback from the SOL_SOCKET
>>>    setsockopt callback, but this leads to a locking problem, because
>>>    the AF_VSOCK and SOL_SOCKET callbacks are not designed to be called
>>>    from each other. So in the current version SO_ZEROCOPY is set
>>>    successfully on any type (i.e. transport) of AF_VSOCK socket, but
>>>    if the transport does not support MSG_ZEROCOPY, the tx routine
>>>    fails with EOPNOTSUPP.
>>
>> Do you plan to fix this in the next versions?
>>
>> If it is too complicated, I think we can have this limitation until we
>> find a good solution.
>>
>
>I'll try to fix it again, I just didn't pay attention to it in v2.
>
>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to
>>>    enqueue one completion. Each completion carries a flag which shows
>>>    how the tx was performed: zerocopy or copy. This means that the
>>>    whole message must be sent either in zerocopy or in copy mode - we
>>>    can't send part of a message with copying and the rest in zerocopy
>>>    mode (or vice versa). We also need to account for the vsock credit
>>>    logic, i.e. we can't send all the data at once - only the allowed
>>>    number of bytes may be sent at any moment. In the copy case there
>>>    is no problem, as in the worst case we can send single bytes, but
>>>    zerocopy is more complex because the smallest transmission unit is
>>>    a single page. So if there is not enough space at the peer's side
>>>    to send an integer number of pages (at least one), we will wait,
>>>    thus stalling the tx side. To overcome this problem I've added a
>>>    simple rule: zerocopy is possible only when there is enough space
>>>    at the other side for the whole message (to check that the current
>>>    'msghdr' was already used in previous tx iterations, I use the
>>>    'iov_offset' field of its iov iter).
>>
>> So, IIUC if MSG_ZEROCOPY is set but there isn't enough space in the
>> destination, we temporarily fall back to copying, even though
>> MSG_ZEROCOPY is set. Right?
>
>Exactly, the user still gets a completion (because SO_ZEROCOPY is enabled
>and the MSG_ZEROCOPY flag was used), but the completion structure contains
>the information that copying was done instead of zerocopy.

Got it.
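For other readers following along: I assume the completion path follows the
existing TCP MSG_ZEROCOPY UAPI (the reuse of 'msg_zerocopy_realloc()'
suggests so), so consuming these notifications from userspace would look
roughly like the sketch below. This is only my illustration, not code from
this series: the sock_extended_err fields come from the generic
<linux/errqueue.h> UAPI, and since the cover letter doesn't spell out which
cmsg level/type the vsock transport will use, the sketch keys only on
ee_origin.

/* Sender setup, per points 1) and 2) of the description:
 *
 *   int one = 1;
 *   setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
 *   send(fd, buf, len, MSG_ZEROCOPY);
 *
 * then poll the error queue and drain the completions:
 */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static int drain_zerocopy_completions(int fd)
{
	char control[128];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	/* Completions live on the error queue, not on the data path. */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -errno;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr =
			(struct sock_extended_err *)CMSG_DATA(cm);

		if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			continue;

		/* [ee_info, ee_data] is the range of completed send() calls;
		 * ee_code says whether the kernel fell back to copying.
		 */
		printf("sends %u..%u done (%s)\n",
		       serr->ee_info, serr->ee_data,
		       (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ?
		       "copied" : "zerocopy");
	}
	return 0;
}

The copied/zerocopy flag is read per completion range here, which matches
the "single completion per tx syscall, possibly merged" behaviour described
in point 2).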
>>
>> If it is the case it seems reasonable to me.
>>
>>> 3) The loopback transport is not supported, because it requires
>>>    implementing non-linear skb handling in the dequeue logic (as we
>>>    "send" a fragged skb and "receive" it from the same queue). I'm
>>>    going to implement it in the next versions.
>>>
>>>    ^^^ fixed in v2
>>>
>>> 4) The current implementation sets the max packet length to 64KB. IIUC
>>>    this is due to the 'kmalloc()' allocated data buffers. I think in
>>>    the MSG_ZEROCOPY case this value could be increased, because
>>>    'kmalloc()' is not used for data - user space pages are used as
>>>    buffers. This limit also trims every message which is > 64KB, so
>>>    such messages will be sent in copy mode due to the 'iov_offset'
>>>    check in 2).
>>>
>>>    ^^^ fixed in v2
>>>
>>>                        PATCHSET STRUCTURE
>>>
>>> The patchset has the following structure:
>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>>> 2) Handle non-linear skbuff on send in virtio/vhost.
>>> 3) Updates for AF_VSOCK.
>>> 4) Enable MSG_ZEROCOPY support on transports.
>>> 5) Tests/tools/docs updates.
>>>
>>>                           PERFORMANCE
>>>
>>> It is a little bit tricky to compare performance between copy and
>>> zerocopy transmissions. In the zerocopy case we need to wait until the
>>> user buffers are released by the kernel, so it is something like a
>>> synchronous path (wait until the device driver has processed it),
>>> while in the copy case we can feed data to the kernel as fast as we
>>> want, without caring about the device driver. So I compared only the
>>> time spent in the 'send()' syscall; combining this value with the
>>> total number of transmitted bytes gives a Gbit/s figure. Also, to
>>> avoid tx stalls due to insufficient credit, the receiver allocates the
>>> same amount of space as the sender needs.
>>>
>>> Sender:
>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>>
>>> Receiver:
>>> ./vsock_perf --vsk-size 256M
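A note for readers on how the Gbit/s numbers below are obtained: following
the methodology quoted above, only the time spent inside send() is counted.
A rough, self-contained sketch of that measurement (my own illustration
with made-up names, not the actual vsock_perf code; socket setup, error
handling and the completion draining from the previous sketch are omitted):

#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* flags is 0 for the copy runs, MSG_ZEROCOPY for the zerocopy runs */
static double measure_gbps(int fd, const char *buf, size_t buf_size,
			   uint64_t total_bytes, int flags)
{
	uint64_t sent = 0;
	double in_send_ns = 0.0;

	while (sent < total_bytes) {
		double t0 = now_ns();
		ssize_t ret = send(fd, buf, buf_size, flags);

		in_send_ns += now_ns() - t0;
		if (ret <= 0)
			break;
		sent += ret;
	}

	/* bits transmitted per nanosecond spent in send() == Gbit/s */
	return (sent * 8.0) / in_send_ns;
}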
>>>
>>> G2H transmission (values are Gbit/s):
>>>
>>> *-------------------------------*
>>> | buf size |   copy  | zerocopy |
>>> *-------------------------------*
>>> |   4KB    |    3    |    10    |
>>> *-------------------------------*
>>> |   32KB   |    9    |    45    |
>>> *-------------------------------*
>>> |   256KB  |   24    |   195    |
>>> *-------------------------------*
>>> |    1M    |   27    |   270    |
>>> *-------------------------------*
>>> |    8M    |   22    |   277    |
>>> *-------------------------------*
>>>
>>> H2G:
>>>
>>> *-------------------------------*
>>> | buf size |   copy  | zerocopy |
>>> *-------------------------------*
>>> |   4KB    |   17    |    11    |
>>
>> Do you know why zerocopy is slower in this case?
>> Could it be the cost of pinning/unpinning the pages?
>
>Maybe, I think I need to analyze such an enormous difference more. Also
>about pin/unpin: I found that there is an already implemented function to
>fill a non-linear skb with pages from the user's iov,
>__zerocopy_sg_from_iter() in net/core/datagram.c. It uses
>'get_user_pages()' instead of 'pin_user_pages()'. Maybe in my case it is
>also valid to use 'get_XXX()' instead of 'pin_XXX()', because it is used
>by TCP MSG_ZEROCOPY and io_uring MSG_ZEROCOPY.

If we can reuse them, it will be great!

>>> *-------------------------------*
>>> |   32KB   |   30    |    66    |
>>> *-------------------------------*
>>> |   256KB  |   38    |   179    |
>>> *-------------------------------*
>>> |    1M    |   38    |   234    |
>>> *-------------------------------*
>>> |    8M    |   28    |   279    |
>>> *-------------------------------*
>>>
>>> Loopback:
>>>
>>> *-------------------------------*
>>> | buf size |   copy  | zerocopy |
>>> *-------------------------------*
>>> |   4KB    |    8    |    7     |
>>> *-------------------------------*
>>> |   32KB   |   34    |    42    |
>>> *-------------------------------*
>>> |   256KB  |   43    |    83    |
>>> *-------------------------------*
>>> |    1M    |   40    |   109    |
>>> *-------------------------------*
>>> |    8M    |   40    |   171    |
>>> *-------------------------------*
>>>
>>> I suppose the huge difference above between the two modes has two
>>> reasons:
>>> 1) We don't need to copy data.
>>> 2) We don't need to allocate a buffer for data, only for the header.
>>>
>>> Zerocopy is faster than the classic copy mode, but of course it
>>> requires a specific application architecture due to user page pinning,
>>> buffer size and alignment.
>>>
>>> If the host fails to send data with "Cannot allocate memory", check the
>>> value of /proc/sys/net/core/optmem_max - it is accounted during
>>> completion skb allocation.
>>
>> What does the user need to do? Increase it?
>>
>Yes, I'll update it.
>
>>>                              TESTING
>>>
>>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
>>> tried to cover the new code as much as possible, so there are different
>>> cases for MSG_ZEROCOPY transmissions: with SO_ZEROCOPY disabled and
>>> with several io vector types (different sizes, alignments, with
>>> unmapped pages). I also ran the tests with the loopback transport and
>>> with vsockmon running.
>>
>> Thanks for the tests again :-)
>>
>> This cover letter is very good, with a lot of details, but please add
>> more details in each single patch, explaining the reason for the
>> changes, otherwise it is very difficult to review, because it is a very
>> big change.
>>
>> I'll do a per-patch review in the next days.
>
>Sure, thanks! In v3 I'm also working on an io_uring test, because io_uring
>also supports MSG_ZEROCOPY, so we can test virtio/vsock + MSG_ZEROCOPY +
>io_uring.

That would be cool!

Do you want me to review these patches, or is it better to wait for v3?

Thanks,
Stefano