Stefano Garzarella
2023-Feb-28 10:23 UTC
[RFC PATCH v1 00/12] vsock: MSG_ZEROCOPY flag support
On Mon, Feb 20, 2023 at 08:59:32AM +0000, Krasnov Arseniy wrote:
>On 16.02.2023 16:33, Stefano Garzarella wrote:
>> Hi Arseniy,
>> sorry for the delay, but I was offline.
>
>Hello! Sure, no problem! I was also offline a little bit, so I'm replying
>just now.
>
>>
>> On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>>                            DESCRIPTION
>>>
>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>>> the current implementation for TCP as much as possible:
>>>
>>> 1) Sender must enable the SO_ZEROCOPY flag to use this feature. Without
>>>    this flag, data will be sent in the "classic" copy manner and the
>>>    MSG_ZEROCOPY flag will be ignored (e.g. without completion).
>>>
>>> 2) Kernel uses completions from the socket's error queue. A single
>>>    completion for a single tx syscall (or it can merge several
>>>    completions into a single one). I used the already implemented logic
>>>    for MSG_ZEROCOPY support: 'msg_zerocopy_realloc()' etc.
>>
>> I will review from the vsock point of view. Hope some net maintainers can
>> comment about SO_ZEROCOPY.
>>
>> Anyway, I think it is a good idea to keep it as close as possible to the
>> TCP implementation.
>>
>>>
>>> The difference from the copy way is not significant. During packet
>>> allocation, a non-linear skb is created, then I call 'get_user_pages()'
>>> for each page from the user's iov iterator (I think I don't need
>>> 'pin_user_pages()' as
>>
>> Are these pages exposed to the host via virtqueues? If so, I think we
>> should pin them. What happens if the host accesses them but these pages
>> have been unmapped?
>
>Yes, user pages with data will be used by the virtio device.
>'pin' - you mean use 'pin_user_pages()'? Unmapped - you mean the guest
>will unmap it, while the host must access it to copy the packet? Such
>pages have their refcount incremented by 'get_user_pages()', so the page
>and its memory stay locked until the host finishes copying the data.

Yep, I got it while reviewing patch 7 ;-)

>I think it is better to discuss things related to 'get/pin_user_pages()'
>in one place, for example in 07/12 where this function is called.

Agree!

>
>>
>>> there is no backing storage for these pages) and add each returned page
>>> to the skb as a fragment. There are also some updates for the vhost and
>>> guest parts of the transport - in both cases I've added handling of
>>> non-linear skbs for the virtio part. vhost copies data from such an skb
>>> to the guest's rx virtio buffers. In the guest, the virtio transport
>>> fills the virtio queue with pages from the skb.
>>>
>>> I think the doc in Documentation/networking/msg_zerocopy.rst could also
>>> be updated in next versions.
>>
>> Yep, good idea.
>
>Ack, I'll do it in v2.
>
>>
>>>
>>> This version has several limits/problems:
>>>
>>> 1) As this feature totally depends on the transport, there is no way (or
>>>    it is difficult) to check whether the transport is able to handle it
>>>    or not during SO_ZEROCOPY setting. It seems I need to call the
>>>    AF_VSOCK specific setsockopt callback from the setsockopt callback
>>>    for SOL_SOCKET, but this leads to a lock problem, because the
>>>    AF_VSOCK and SOL_SOCKET callbacks are not designed to be called from
>>>    each other. So in the current version SO_ZEROCOPY is set successfully
>>>    on any type (e.g. transport) of AF_VSOCK socket, but if the transport
>>>    does not support MSG_ZEROCOPY, the tx routine will fail with
>>>    EOPNOTSUPP.
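As a side note for readers less familiar with the API: the userspace flow
being mirrored here is the standard MSG_ZEROCOPY one from
Documentation/networking/msg_zerocopy.rst. A minimal sketch, assuming
vsock keeps the same opt-in and error-queue conventions as TCP (the
completion cmsg level/type is transport specific, so it is not checked
here), looks roughly like this:

/* Minimal sketch: opt in with SO_ZEROCOPY, send with MSG_ZEROCOPY, then
 * reap the completion from the socket error queue. Error handling is
 * trimmed; the completion cmsg level/type (IP_RECVERR for TCP) is
 * transport specific and not checked here.
 */
#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60			/* value from asm-generic/socket.h */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000		/* value from linux/socket.h */
#endif

static int send_zerocopy(int fd, const void *buf, size_t len)
{
	int one = 1;

	/* 1) Opt in once per socket; without it MSG_ZEROCOPY is ignored. */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -errno;

	/* 2) Each MSG_ZEROCOPY send queues one completion (may be merged). */
	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return -errno;

	/* 3) Reap the completion; a real app would poll() for POLLERR
	 *    instead of busy-looping on EAGAIN.
	 */
	for (;;) {
		char control[128];
		struct msghdr msg = {
			.msg_control = control,
			.msg_controllen = sizeof(control),
		};
		struct cmsghdr *cm;
		struct sock_extended_err *serr;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
			if (errno == EAGAIN)
				continue;
			return -errno;
		}

		cm = CMSG_FIRSTHDR(&msg);
		if (!cm)
			continue;

		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			continue;

		/* ee_code tells whether the kernel fell back to copying. */
		return (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ? 1 : 0;
	}
}

The SO_EE_CODE_ZEROCOPY_COPIED bit is the per-completion flag that tells
whether the kernel fell back to copying the data.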
>>
>> I'll take a look, but if we have no alternative, I think it's okay to
>> make tx fail.
>
>Thanks
>
>>>
>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>>>    one completion. In each completion there is a flag which shows how tx
>>>    was performed: zerocopy or copy. This means that the whole message
>>>    must be sent in zerocopy or copy way - we can't send part of the
>>>    message by copying and the rest of the message in zerocopy mode (or
>>>    vice versa). Now, we need to account for the vsock credit logic, e.g.
>>>    we can't send all the data at once - only the allowed number of bytes
>>>    can be sent at any moment. In the copy case there is no problem, as
>>>    in the worst case we can send single bytes, but zerocopy is more
>>>    complex because the smallest transmission unit is a single page. So
>>>    if there is not enough space at the peer's side to send an integer
>>>    number of pages (at least one) - we will wait, thus stalling the tx
>>>    side. To overcome this problem I've added a simple rule - zerocopy is
>>>    possible only when there is enough space at the other side for the
>>>    whole message (to check that the current 'msghdr' was already used in
>>>    previous tx iterations I use the 'iov_offset' field of its iov iter).
>>
>> I see the problem and I think your approach is the right one.
>>
>>>
>>> 3) The loopback transport is not supported, because it requires
>>>    implementing non-linear skb handling in the dequeue logic (as we
>>>    "send" a fragged skb and "receive" it from the same queue). I'm going
>>>    to implement it in next versions.
>>
>> loopback is useful for testing and debugging, so it would be great to
>> have the support, but if it's too complicated, we can do it later.
>>
>
>Ok, I'll implement it in v2.
>
>>>
>>> 4) The current implementation sets the max packet length to 64KB. IIUC
>>>    this is due to 'kmalloc()' allocated data buffers. I think, in case of
>>
>> Yep, I think so.
>> When I started touching this code, the limit was already there.
>> As you said, I think it was introduced to have a limit on (host/device
>> side?) allocation, but buf_alloc might be enough, so maybe we could
>> also remove it for copy mode.
>> The only problem I see is compatibility with old devices/drivers, so
>> maybe we need a feature in the spec to say the limit is gone, or have a
>> field in the virtio config space where the device specifies its limit
>> (for the driver, the limit is implicitly that of the buffer allocated
>> and put in the virtqueue).
>>
>> This reminded me that Laura had proposed something similar before,
>> maybe we should take it up again:
>> https://markmail.org/message/3el4ckeakfilg5wo
>>
>>>    MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>>    not touched for data - user space pages are used as buffers. Also
>>>    this limit trims every message which is > 64KB, thus such messages
>>>    will be sent in copy mode due to the 'iov_offset' check in 2).
>>
>> The host still needs to allocate and copy, so maybe the limitation
>> could be to avoid large allocations in the host, but actually the host
>> can use vmalloc because it doesn't need them to be contiguous.
>>
>
>Hmmm, I think it is possible to solve this situation in the following
>way - I can keep the 64KB limitation for copy mode and remove it for
>zerocopy, but limit each packet size to 64KB (of course technically it is
>not exactly 64KB, it is min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE)
>where max packet size is 64KB, but for simplicity let's call this limit
>64KB :) ). E.g. when a zerocopy transmission needs to send, for example,
>129KB (of course the peer's free space is big enough and this check is
>passed), I won't trim 129KB to 64KB + 64KB + 1KB in the current manner -
>by returning to af_vsock.c after sending every skb. Instead, I'll allocate
>several skbs (3 in this case - 64KB + 64KB + 1KB) in a single call to the
>transport. The completion arg will be attached only to the last skb, and
>then these 3 skbs are sent. The host still processes 64KB (let it be 64KB
>for simplicity again :) ) packets - no big allocations.

Makes sense to me!
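Just to double-check that I'm reading the idea correctly, I'd expect the
loop to look roughly like this (only a sketch: the vsock_zc_* helpers are
hypothetical placeholders, not the current virtio_transport code):

/* Hypothetical sketch of the splitting idea above, not code from the
 * patchset: split one big zerocopy send into several skbs no larger than
 * MAX_SKB_FRAGS * PAGE_SIZE inside the transport, and attach the
 * completion uarg only to the last one.
 */
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/uio.h>

/* Hypothetical transport helpers, only to show the loop structure. */
struct sk_buff *vsock_zc_build_skb(struct iov_iter *iter, size_t len);
void vsock_zc_queue_skb(struct sk_buff *skb);

static int vsock_zc_send_msg(struct iov_iter *iter, size_t len,
			     struct ubuf_info *uarg)
{
	size_t rest = len;

	while (rest) {
		size_t chunk = min_t(size_t, rest,
				     MAX_SKB_FRAGS * PAGE_SIZE);
		struct sk_buff *skb;

		/* Pins 'chunk' bytes of user pages into skb frags. */
		skb = vsock_zc_build_skb(iter, chunk);
		if (!skb)
			return -ENOMEM;

		rest -= chunk;
		if (!rest)
			/* Only the last skb carries the completion. */
			skb_zcopy_init(skb, uarg);

		vsock_zc_queue_skb(skb);
	}

	return 0;
}

That way each packet the device sees stays within the
min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE) bound, and only one
completion is queued per sendmsg().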
>Moreover, I think that this logic could be a little optimization for
>copy mode - why do we allocate a single skb and always return to
>af_vsock.c?

@Bobby, can you help us here?

>Maybe we can iterate the needed number of skbs in a loop and send them.

Yep, I think it is doable.

>
>Also about vmalloc(), IIUC there is already this idea which replaces
>'kmalloc()' with 'kvmalloc()'.

Yep, I think it is already merged:
0e3f72931fc4 ("vhost/vsock: Use kvmalloc/kvfree for larger packets.")

But this is in the vhost transport (device emulation running in the
host), where we don't need the pages to be pinned.

>
>>>
>>>                            PERFORMANCE
>>>
>>> Performance: it is a little bit tricky to compare performance between
>>> copy and zerocopy transmissions. In the zerocopy way we need to wait
>>> until the user buffers are released by the kernel, so it is something
>>> like a synchronous path (wait until the device driver has processed
>>> it), while in the copy way we can feed data to the kernel as fast as we
>>> want, without caring about the device driver. So I compared only the
>>> time we spend in the 'sendmsg()' syscall. Also there is the limit from
>>> 4) above, so the max buffer size is 64KB. I've tested this patchset in
>>> a nested VM, but I think for V1 it is not a big deal.
>>>
>>> Sender:
>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc]
>>>
>>> Receiver:
>>> ./vsock_perf --vsk-size 256M
>>>
>>> The number in each cell is the seconds the sender spends inside the tx
>>> syscall.
>>>
>>> Guest to host transmission:
>>>
>>> *------------------------------*
>>> | buf size |  copy  | zerocopy |
>>> *------------------------------*
>>> |   4KB    |  0.26  |  0.042   |
>>> *------------------------------*
>>> |   16KB   |  0.11  |  0.014   |
>>> *------------------------------*
>>> |   32KB   |  0.05  |  0.009   |
>>> *------------------------------*
>>> |   64KB   |  0.04  |  0.005   |
>>> *------------------------------*
>>>
>>> Host to guest transmission:
>>>
>>> *------------------------------*
>>> | buf size |  copy  | zerocopy |
>>> *------------------------------*
>>> |   4KB    | 0.049  |  0.034   |
>>> *------------------------------*
>>> |   16KB   | 0.03   |  0.024   |
>>> *------------------------------*
>>> |   32KB   | 0.025  |  0.01    |
>>> *------------------------------*
>>> |   64KB   | 0.028  |  0.01    |
>>> *------------------------------*
>>>
>>> If the host fails to send data with "Cannot allocate memory", check the
>>> value of /proc/sys/net/core/optmem_max - it is accounted during
>>> completion skb allocation.
>>>
>>> Zerocopy is faster than classic copy mode, but of course it requires a
>>> specific application architecture due to user page pinning, buffer size
>>> and alignment. In next versions I'm going to fix the 64KB barrier to
>>> perform tests with bigger buffer sizes.
>>
>> Yep, I see.
>> Adjusting vsock_perf to also compare Gbps (can io_uring help in this
>> case to have more requests in flight?) would be great.
>>
>
>Yes, I'll add Gbps. Also I thought about adding an io_uring test to
>the current test suite because io_uring also has the MSG_ZEROCOPY
>type of request, and the interesting thing is that io_uring uses its
>own already allocated uarg.

Cool!
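For the Gbps column, something simple like this should be enough (just a
sketch of the measurement the current tables are based on: time only the
send() calls and derive the throughput from that; vsock_perf's real
options and internals may differ):

/* Sketch only: accumulate the time spent inside send() and derive the
 * throughput from it. Not vsock_perf code.
 */
#include <stdint.h>
#include <time.h>
#include <sys/socket.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Send 'total' bytes in 'buf_size' chunks; returns ns spent in send(). */
static uint64_t tx_time_ns(int fd, const char *buf, size_t buf_size,
			   size_t total, int flags /* 0 or MSG_ZEROCOPY */)
{
	uint64_t spent = 0;
	size_t sent = 0;

	while (sent < total) {
		uint64_t t0 = now_ns();
		ssize_t n = send(fd, buf, buf_size, flags);

		spent += now_ns() - t0;
		if (n <= 0)
			break;
		sent += n;
	}

	/* Gbit/s = 8.0 * sent / spent, since bytes per ns == GB per second. */
	return spent;
}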
>
>>>
>>>                             TESTING
>>>
>>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
>>> tried to cover the new code as much as possible, so there are different
>>> cases for MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and
>>> several io vector types (different sizes, alignments, with unmapped
>>> pages).
>>
>> Great! Thanks for adding the tests!
>>
>> I'll go through the patches between today and Monday.
>> Sorry again for taking so long!
>
>Sure, thanks for the review! I think we are not in a hurry :)

Yep, thank you for this work!

Stefano