Stefano Garzarella
2023-May-03 13:54 UTC
[RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support
On Wed, May 3, 2023 at 3:50 PM Arseniy Krasnov <avkrasnov at sberdevices.ru> wrote:
>
>
> On 03.05.2023 16:47, Stefano Garzarella wrote:
> > On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
> >>
> >> On 03.05.2023 15:52, Stefano Garzarella wrote:
> >>> Hi Arseniy,
> >>> Sorry for the delay, but I have been very busy.
> >>
> >> Hello, no problem!
> >>
> >>> I can't apply this series on master or net-next, can you share with me
> >>> the base commit?
> >>
> >> Here is my base:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
> >
> > Thanks, it worked!
> >
> >>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
> >>>> Hello,
> >>>>
> >>>> DESCRIPTION
> >>>>
> >>>> This is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
> >>>> the current TCP implementation as much as possible:
> >>>>
> >>>> 1) The sender must enable the SO_ZEROCOPY socket option to use this
> >>>> feature. Without it, data is sent in the "classic" copy manner and the
> >>>> MSG_ZEROCOPY flag is ignored (e.g. no completion is generated).
> >>>>
> >>>> 2) The kernel uses completions from the socket's error queue: a single
> >>>> completion for a single tx syscall (or several completions may be merged
> >>>> into one). I reused the already implemented MSG_ZEROCOPY logic:
> >>>> 'msg_zerocopy_realloc()' etc.
> >>>>
> >>>> The difference from the copy path is not significant. During packet
> >>>> allocation a non-linear skb is created, then 'pin_user_pages()' is called
> >>>> for each page from the user's iov iterator and each returned page is
> >>>> added to the skb as a fragment. There are also some updates for the vhost
> >>>> and guest parts of the transport - in both cases I've added handling of
> >>>> non-linear skbs for the virtio part: vhost copies data from such an skb
> >>>> to the guest's rx virtio buffers, and in the guest the virtio transport
> >>>> fills the tx virtio queue with pages from the skb.
> >>>>
> >>>> This version has several limitations/problems:
> >>>>
> >>>> 1) As this feature totally depends on the transport, there is no way (or
> >>>> it is difficult) to check whether the transport can handle it while
> >>>> SO_ZEROCOPY is being set. It seems I would need to call the
> >>>> AF_VSOCK-specific setsockopt callback from the SOL_SOCKET setsockopt
> >>>> callback, but this leads to a locking problem, because neither callback
> >>>> is designed to be called from the other. So in the current version
> >>>> SO_ZEROCOPY is set successfully on any type (i.e. transport) of AF_VSOCK
> >>>> socket, but if the transport does not support MSG_ZEROCOPY, the tx
> >>>> routine fails with EOPNOTSUPP.
> >>>
> >>> Do you plan to fix this in the next versions?
> >>>
> >>> If it is too complicated, I think we can keep this limitation until we
> >>> find a good solution.
> >>
> >> I'll try to fix it again; I just didn't pay attention to it in v2.
> >>
> >>>> 2) When MSG_ZEROCOPY is used, we need to enqueue one completion for each
> >>>> tx system call. Each completion carries a flag which shows how the tx was
> >>>> performed: zerocopy or copy. This means the whole message must be sent
> >>>> either in zerocopy or in copy mode - we can't send part of a message by
> >>>> copying and the rest in zerocopy mode (or vice versa). We also need to
> >>>> account for vsock credit logic, i.e. we can't send all the data at once -
> >>>> only the allowed number of bytes can be sent at any moment.
> >>>> In the copy case there is no problem, as in the worst case we can send
> >>>> single bytes, but zerocopy is more complex because the smallest
> >>>> transmission unit is a single page. So if there is not enough space at
> >>>> the peer's side to send an integer number of pages (at least one), we
> >>>> will wait, thus stalling the tx side. To overcome this problem I've
> >>>> added a simple rule - zerocopy is possible only when there is enough
> >>>> space at the other side for the whole message (to check whether the
> >>>> current 'msghdr' was already used in previous tx iterations, I use the
> >>>> 'iov_offset' field of its iov iterator).
> >>>
> >>> So, IIUC, if there isn't enough space in the destination we temporarily
> >>> disable zerocopy, even if MSG_ZEROCOPY is set. Right?
> >>
> >> Exactly; the user still gets a completion (because SO_ZEROCOPY is enabled
> >> and the MSG_ZEROCOPY flag was used), but the completion structure carries
> >> the information that copying was done instead of zerocopy.
> >
> > Got it.
> >
> >>> If that is the case, it seems reasonable to me.
> >>>
> >>>> 3) The loopback transport is not supported, because it requires
> >>>> implementing non-linear skb handling in the dequeue logic (as we "send"
> >>>> a fragmented skb and "receive" it from the same queue). I'm going to
> >>>> implement it in the next versions.
> >>>>
> >>>> ^^^ fixed in v2
> >>>>
> >>>> 4) The current implementation sets the max packet length to 64KB. IIUC
> >>>> this is due to the 'kmalloc()'-allocated data buffers. I think in the
> >>>> MSG_ZEROCOPY case this value could be increased, because 'kmalloc()' is
> >>>> not used for data - user-space pages are used as buffers. This limit
> >>>> also trims every message which is > 64KB, so such messages are sent in
> >>>> copy mode due to the 'iov_offset' check in 2).
> >>>>
> >>>> ^^^ fixed in v2
> >>>>
> >>>> PATCHSET STRUCTURE
> >>>>
> >>>> The patchset has the following structure:
> >>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
> >>>> 2) Handle non-linear skbuff on send in virtio/vhost.
> >>>> 3) Updates for AF_VSOCK.
> >>>> 4) Enable MSG_ZEROCOPY support on transports.
> >>>> 5) Tests/tools/docs updates.
> >>>>
> >>>> PERFORMANCE
> >>>>
> >>>> It is a little bit tricky to compare performance between copy and
> >>>> zerocopy transmissions. In the zerocopy case we need to wait until the
> >>>> user buffers are released by the kernel, so it is something like a
> >>>> synchronous path (wait until the device driver has processed it), while
> >>>> in the copy case we can feed data to the kernel as fast as we want,
> >>>> without caring about the device driver. So I compared only the time
> >>>> spent in the 'send()' syscall; combined with the total number of
> >>>> transmitted bytes, this gives a Gbit/s figure. Also, to avoid tx stalls
> >>>> due to insufficient credit, the receiver allocates the same amount of
> >>>> space as the sender needs.
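
A minimal sketch of the measurement described above (accumulating only the
time spent inside send() and deriving Gbit/s from the total byte count),
assuming 'fd' is an already-connected AF_VSOCK socket; this only illustrates
the methodology and is not the vsock_perf source:

#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Send 'total_bytes' in chunks of 'buf_size', counting only the time spent
 * inside send(). Bits per nanosecond is the same as Gbit/s. */
static double tx_gbps(int fd, const char *buf, size_t buf_size,
                      size_t total_bytes, int msg_flags)
{
    uint64_t in_send_ns = 0;
    size_t sent = 0;

    while (sent < total_bytes) {
        uint64_t t0 = now_ns();
        /* msg_flags may or may not include MSG_ZEROCOPY */
        ssize_t n = send(fd, buf, buf_size, msg_flags);

        in_send_ns += now_ns() - t0;
        if (n <= 0)
            break;
        sent += n;
    }

    return in_send_ns ? (double)sent * 8 / in_send_ns : 0;
}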
> >>>>
> >>>> Sender:
> >>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
> >>>>
> >>>> Receiver:
> >>>> ./vsock_perf --vsk-size 256M
> >>>>
> >>>> G2H transmission (values are Gbit/s):
> >>>>
> >>>> *-------------------------------*
> >>>> | buf size |  copy   | zerocopy |
> >>>> *-------------------------------*
> >>>> |   4KB    |    3    |    10    |
> >>>> *-------------------------------*
> >>>> |   32KB   |    9    |    45    |
> >>>> *-------------------------------*
> >>>> |   256KB  |   24    |   195    |
> >>>> *-------------------------------*
> >>>> |    1M    |   27    |   270    |
> >>>> *-------------------------------*
> >>>> |    8M    |   22    |   277    |
> >>>> *-------------------------------*
> >>>>
> >>>> H2G:
> >>>>
> >>>> *-------------------------------*
> >>>> | buf size |  copy   | zerocopy |
> >>>> *-------------------------------*
> >>>> |   4KB    |   17    |    11    |
> >>>
> >>> Do you know why zerocopy is slower in this case?
> >>> Could it be the cost of pinning/unpinning pages?
> >>
> >> Maybe; I think I need to analyze such an enormous difference further.
> >> Also, about pin/unpin: I found that there is an already implemented
> >> function to fill a non-linear skb with pages from a user's iov:
> >> __zerocopy_sg_from_iter() in net/core/datagram.c. It uses
> >> 'get_user_pages()' instead of 'pin_user_pages()'. Maybe in my case it is
> >> also valid to use 'get_XXX()' instead of 'pin_XXX()', because it is used
> >> by TCP MSG_ZEROCOPY and io_uring MSG_ZEROCOPY.
> >
> > If we can reuse them, it will be great!
> >
> >>>> *-------------------------------*
> >>>> |   32KB   |   30    |    66    |
> >>>> *-------------------------------*
> >>>> |   256KB  |   38    |   179    |
> >>>> *-------------------------------*
> >>>> |    1M    |   38    |   234    |
> >>>> *-------------------------------*
> >>>> |    8M    |   28    |   279    |
> >>>> *-------------------------------*
> >>>>
> >>>> Loopback:
> >>>>
> >>>> *-------------------------------*
> >>>> | buf size |  copy   | zerocopy |
> >>>> *-------------------------------*
> >>>> |   4KB    |    8    |    7     |
> >>>> *-------------------------------*
> >>>> |   32KB   |   34    |    42    |
> >>>> *-------------------------------*
> >>>> |   256KB  |   43    |    83    |
> >>>> *-------------------------------*
> >>>> |    1M    |   40    |   109    |
> >>>> *-------------------------------*
> >>>> |    8M    |   40    |   171    |
> >>>> *-------------------------------*
> >>>>
> >>>> I suppose the huge difference between the two modes above has two
> >>>> reasons:
> >>>> 1) We don't need to copy data.
> >>>> 2) We don't need to allocate a buffer for data, only for the header.
> >>>>
> >>>> Zerocopy is faster than the classic copy mode, but of course it
> >>>> requires a specific application architecture due to user page pinning,
> >>>> buffer size and alignment.
> >>>>
> >>>> If the host fails to send data with "Cannot allocate memory", check the
> >>>> value of /proc/sys/net/core/optmem_max - it is accounted against during
> >>>> completion skb allocation.
> >>>
> >>> What does the user need to do? Increase it?
> >>
> >> Yes, I'll update it.
> >>
> >>>> TESTING
> >>>>
> >>>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
> >>>> tried to cover the new code as much as possible, so there are different
> >>>> cases for MSG_ZEROCOPY transmissions: with SO_ZEROCOPY disabled and with
> >>>> several io vector types (different sizes, alignments, with unmapped
> >>>> pages). I also ran tests with the loopback transport and with vsockmon
> >>>> running.
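
For reference, a minimal user-space sketch of the flow those tests exercise,
following the generic Linux MSG_ZEROCOPY API that the cover letter says is
reused from TCP (SO_ZEROCOPY, MSG_ZEROCOPY, completions on the error queue).
It is not taken from the patch set; 'fd' is assumed to be a connected
AF_VSOCK socket, and since the cmsg level/type of the completion is
transport-specific, only ee_origin is checked here:

#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/errqueue.h>

static char chunk[4096];

static int send_zc(int fd)
{
    struct iovec iov[2] = {
        { .iov_base = chunk, .iov_len = sizeof(chunk) },
        { .iov_base = chunk, .iov_len = sizeof(chunk) },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
    int one = 1;

    /* Without SO_ZEROCOPY, MSG_ZEROCOPY is ignored and no completion is
     * queued (point 1 of the cover letter). */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        return -1;

    return sendmsg(fd, &msg, MSG_ZEROCOPY) < 0 ? -1 : 0;
}

/* Read one completion from the error queue and report whether the kernel
 * did zerocopy or fell back to copying (point 2 of the cover letter). */
static void check_completion(int fd)
{
    char control[128];
    struct msghdr msg = {
        .msg_control = control,
        .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cm;

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
        return;

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
            continue;

        printf("completion %u-%u: %s\n", serr->ee_info, serr->ee_data,
               serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED ?
               "copied (fallback)" : "zerocopy");
    }
}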
> >>>
> >>> Thanks for the tests again :-)
> >>>
> >>> This cover letter is very good, with a lot of detail, but please add
> >>> more details to each single patch, explaining the reason for the
> >>> changes, otherwise it is very difficult to review, because it is a very
> >>> big change.
> >>>
> >>> I'll do a per-patch review in the next few days.
> >>
> >> Sure, thanks! For v3 I'm also working on an io_uring test, because
> >> io_uring also supports MSG_ZEROCOPY, so we can do virtio/vsock +
> >> MSG_ZEROCOPY + io_uring.
> >
> > That would be cool!
> >
> > Do you want me to review these patches, or is it better to wait for v3?
>
> I think it is ok to wait for v3, as I'm going to reduce the amount of new
> kernel source code, especially by reusing already implemented functions
> instead of my own.

Okay, great! I'll wait for it ;-)

Thanks,
Stefano
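
As for the io_uring test mentioned for v3, a rough sketch of how the
combination could look with liburing's zerocopy send helper, assuming a
liburing recent enough to provide io_uring_prep_send_zc() and 'fd' being a
connected AF_VSOCK socket with SO_ZEROCOPY enabled. The two-CQE completion
model shown here is the generic io_uring zerocopy-send behaviour, not
something introduced by this series:

#include <liburing.h>
#include <stdio.h>

static int send_zc_uring(int fd, const void *buf, size_t len)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int sent = -1;

    if (io_uring_queue_init(8, &ring, 0))
        return -1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_send_zc(sqe, fd, buf, len, 0, 0);
    io_uring_submit(&ring);

    /* The first CQE carries the send result; IORING_CQE_F_MORE means a
     * second, notification CQE follows once the buffer may be reused
     * (the io_uring analogue of the errqueue completion). */
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        int more = cqe->flags & IORING_CQE_F_MORE;

        sent = cqe->res;
        io_uring_cqe_seen(&ring, cqe);

        if (more && io_uring_wait_cqe(&ring, &cqe) == 0) {
            if (cqe->flags & IORING_CQE_F_NOTIF)
                printf("buffer released, sent %d bytes\n", sent);
            io_uring_cqe_seen(&ring, cqe);
        }
    }

    io_uring_queue_exit(&ring);
    return sent;
}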