Stefano Garzarella
2022-May-24 07:32 UTC
[RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 19.05.2022 10:42, Stefano Garzarella wrote:
>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>> Hello Stefano,
>>>
>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>> Hi Arseniy,
>>>>
>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>                              INTRODUCTION
>>>>>
>>>>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>>>>> receive, inspired by TCP zerocopy receive by Eric Dumazet. The API uses the
>>>>> same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
>>>>> will fill the provided vma area with pages of virtio RX buffers. After the
>>>>> received data has been processed by the user, the pages must be freed by an
>>>>> 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't call
>>>>> 'madvise()', the next 'getsockopt()' will fail).
>>>>
>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>
>>> Yes, that would be great.
>>>
>>>> Could we do something similar for the sending path as well?
>>>
>>> Here are my thoughts about zerocopy transmission:
>>>
>>> I tried to implement this feature in the following way: the user creates a
>>> page-aligned buffer, then during tx packet allocation, instead of creating
>>> the data buffer with 'kmalloc()', I tried to add the user's buffer to the
>>> virtio queue. But I found a problem: since the kernel virtio API uses virtual
>>> addresses to add new buffers, deep in the virtio subsystem 'virt_to_phys()'
>>> is called to get the physical address of the buffer, so the user's virtual
>>> address won't be translated correctly to a physical address. (In theory, I
>>> could perform a page walk for such a user va, get the physical address and
>>> pass some "fake" virtual address to the virtio API in order to make
>>> 'virt_to_phys()' return a valid physical address, but I think this is ugly.)
>>
>> And maybe we should also pin the pages to prevent them from being replaced.
>>
>> I think we should do something similar to what we do in vhost-vdpa.
>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>
>Hm, ok. I'll read about vdpa...
>
>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>> the user calls 'mmap()' on the socket, and the kernel fills the newly created
>>> mapping with allocated pages (all pages have rw permissions). Now the user
>>> can use the pages of this mapping (e.g. fill them with data). Finally, to
>>> start transmission, the user calls 'getsockopt()' or some 'ioctl()' and the
>>> kernel processes the data of this mapping. Since this call returns
>>> immediately (i.e. it is asynchronous), some completion logic must be
>>> implemented. For example, use the same way as MSG_ZEROCOPY does - poll the
>>> socket's error queue to get a message that the pages can be reused - or
>>> don't allow the user to work with these pages at all: unmap them, perform
>>> the transmission and finally free the pages. To start a new transmission the
>>> user needs to call 'mmap()' again.
>>>
>>>                                  OR
>>>
>>> I think there is another, unusual way for zerocopy tx: let's use
>>> 'vmsplice()'/'splice()'. In this approach, to transmit something the user
>>> does the following steps:
>>> 1) Creates a pipe.
>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe to insert the data pages
>>>    into it. SPLICE_F_GIFT allows the user to forget about the allocated
>>>    pages - the kernel will free them.
>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>>>    will move the pages from the pipe to the socket (i.e. a special socket
>>>    callback gets the set of the pipe's pages as an input argument, and all
>>>    pages will be inserted into the virtio queue).
>>>
>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
>>
>> Splice seems interesting, but it would be nice if we do something similar to
>> TCP. IIUC they use a flag for send(2):
>>
>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>
>
>Yes, but with this approach I think:
>1) What is 'buf'? It can't be the user's address, since this buffer must be
>   inserted into the tx queue. I.e. it must be allocated by the kernel and
>   then returned to the user for tx purposes. In the TCP case, 'buf' is the
>   user's address (page aligned, of course) because the TCP logic uses
>   sk_buff, which allows such memory to be used as a data buffer.

IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.

>2) To wait until the tx process is done (i.e. the pages can be used again),
>   such an API (send + MSG_ZEROCOPY) uses the socket's error queue - poll for
>   events signalling that tx is finished. So the same mechanism must be
>   implemented for virtio vsock.

Yeah, I think so.
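For reference, this is roughly how the existing TCP/UDP MSG_ZEROCOPY flow looks
from user space, using the constants those protocols already provide
(SO_ZEROCOPY, MSG_ZEROCOPY, SO_EE_ORIGIN_ZEROCOPY). A vsock variant would
presumably mirror it, but nothing in this sketch is part of this patchset:

    #include <errno.h>
    #include <poll.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60              /* in case libc headers are old */
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    static int send_zerocopy(int fd, const void *buf, size_t len)
    {
            int one = 1;
            char control[128];
            struct msghdr msg = { .msg_control = control,
                                  .msg_controllen = sizeof(control) };
            struct pollfd pfd = { .fd = fd };
            struct cmsghdr *cm;

            /* Opt in once per socket. */
            if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                    return -errno;

            /* The pages backing 'buf' are used directly for transmission. */
            if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
                    return -errno;

            /* Completion is reported on the error queue; only after the
             * notification arrives may 'buf' be reused. */
            if (poll(&pfd, 1, -1) < 0 || recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                    return -errno;

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

                    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                            return 0;   /* ee_info..ee_data: completed sends */
            }
            return -1;
    }

As discussed above, the open question for vsock is how 'buf' ends up in the
virtqueue: pinning the user pages (as vhost-vdpa does) versus handing out a
kernel-allocated buffer.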
>>>>>                                DETAILS
>>>>>
>>>>>     Here is how the mapping with the mapped pages looks exactly: the first
>>>>> page of the mapping contains an array of trimmed virtio vsock packet headers
>>>>> (each contains only the length of the data on the corresponding page and a
>>>>> 'flags' field):
>>>>>
>>>>>     struct virtio_vsock_usr_hdr {
>>>>>         uint32_t length;
>>>>>         uint32_t flags;
>>>>>     };
>>>>>
>>>>> The 'length' field allows the user to know the exact size of the payload
>>>>> within each sequence of pages, and 'flags' allows the user to handle
>>>>> SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>>>>> pages are data pages from the RX queue.
>>>>>
>>>>>          Page 0      Page 1      Page N
>>>>>
>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>         |       |      ^           ^
>>>>>         |       |      |           |
>>>>>         |       *------------------*
>>>>>         |              |
>>>>>         |              |
>>>>>         *--------------*
>>>>>
>>>>>     Of course, a single header can represent an array of pages (when the
>>>>> packet's buffer is bigger than one page). So here is an example of a
>>>>> detailed mapping layout for a set of packets. Let's consider that we have
>>>>> the following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes.
>>>>> Pages 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is
>>>>> large enough).
>>>>>
>>>>>     Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
>>>>>     Page 1: [ 56 ]
>>>>>     Page 2: [ 4096 ]
>>>>>     Page 3: [ 4096 ]
>>>>>     Page 4: [ 4096 ]
>>>>>     Page 5: [ 8 ]
>>>>>
>>>>>     Page 0 contains only the array of headers:
>>>>>     'hdr0' has 56 in its length field.
>>>>>     'hdr1' has 4096 in its length field.
>>>>>     'hdr2' has 8200 in its length field.
>>>>>     'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>
>>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>     Page 2 corresponds to 'hdr1' and is filled with data.
>>>>>     Page 3 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 4 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>>
>>>>>     This patchset also changes the way packets are allocated: today's
>>>>> implementation uses only 'kmalloc()' to create the data buffer. A problem
>>>>> happens when we try to map such buffers into the user's vma - the kernel
>>>>> forbids mapping slab pages into a user's vma (pages of "not large"
>>>>> 'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>>>>> can be bigger than one page). So to avoid this, data buffers are now
>>>>> allocated with 'alloc_pages()'.
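The allocation change described above is easy to picture; a minimal kernel-side
sketch of switching an RX data buffer from 'kmalloc()' to 'alloc_pages()', with
purely illustrative function names (not the patchset's actual code):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Illustrative only: back the RX data buffer with alloc_pages() so its
     * pages can later be remapped into a user vma (slab pages cannot be). */
    static void *vsock_alloc_rx_buf(size_t len, struct page **pg_out)
    {
            unsigned int order = get_order(len);
            struct page *page = alloc_pages(GFP_KERNEL, order);

            if (!page)
                    return NULL;

            *pg_out = page;
            return page_address(page);  /* linear address used for the buffer */
    }

    static void vsock_free_rx_buf(struct page *page, size_t len)
    {
            __free_pages(page, get_order(len));
    }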
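Putting the receive flow from the INTRODUCTION and DETAILS sections together, a
user-space consumer might look roughly like the sketch below. The getsockopt()
level and option name are placeholders (this excerpt of the cover letter does
not name them), and exactly how getsockopt() identifies the mapping and reports
how much of it was filled is assumed; only the mmap()/getsockopt()/parse
headers/madvise(MADV_DONTNEED) sequence is taken from the description above:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Placeholders for the level/option introduced by the patchset's uAPI. */
    #define ZC_SOCKOPT_LEVEL 0          /* placeholder */
    #define ZC_SOCKOPT_NAME  0          /* placeholder */

    struct virtio_vsock_usr_hdr {       /* page 0: array of these headers */
            uint32_t length;
            uint32_t flags;
    };

    static int read_zerocopy(int fd, size_t map_pages)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t map_len = map_pages * page;
            socklen_t optlen = map_len;
            struct virtio_vsock_usr_hdr *hdr;
            char *map, *data;
            int ret = -1;

            map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    return -1;

            /* Each getsockopt() fills the mapping with RX buffer pages. */
            if (getsockopt(fd, ZC_SOCKOPT_LEVEL, ZC_SOCKOPT_NAME, map, &optlen))
                    goto out;

            /* Page 0 holds the header array; data pages follow. A header
             * with length == 0 is the end-of-data marker. */
            hdr = (struct virtio_vsock_usr_hdr *)map;
            data = map + page;

            for (; hdr->length; hdr++) {
                    /* A real consumer would process hdr->length bytes at
                     * 'data'; each payload occupies whole pages. */
                    printf("packet: %u bytes, flags 0x%x, first byte 0x%x\n",
                           hdr->length, hdr->flags, (unsigned char)data[0]);
                    data += ((hdr->length + page - 1) / page) * page;
            }

            /* Give the pages back; without this the next getsockopt() fails. */
            madvise(map, map_len, MADV_DONTNEED);
            ret = 0;
    out:
            munmap(map, map_len);
            return ret;
    }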
>>>>>                                 TESTS
>>>>>
>>>>>     This patchset updates the 'vsock_test' utility: two tests for the new
>>>>> feature were added. The first test covers invalid cases, the second checks
>>>>> a valid transmission case.
>>>>
>>>> Thanks for adding the test!
>>>>
>>>>>                              BENCHMARKING
>>>>>
>>>>>     For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>>>>> client/server mode. When the client connects to the server, the server
>>>>> starts sending an exact amount of data to the client (the amount is set as
>>>>> an input argument). The client reads the data and waits for the next
>>>>> portion of it. The client works in two modes: copy and zero-copy. In copy
>>>>> mode the client uses the 'read()' call, while in zerocopy mode the sequence
>>>>> 'mmap()'/'getsockopt()'/'madvise()' is used. A smaller transmission time is
>>>>> better. For the server we can set the size of the tx buffer, and for the
>>>>> client we can set the size of the rx buffer or the rx mapping size (in
>>>>> zerocopy mode). Usage of this utility is quite simple:
>>>>>
>>>>> For client mode:
>>>>>
>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>
>>>>> For server mode:
>>>>>
>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>
>>>>> [--mb] sets the number of megabytes to transfer.
>>>>> [--rx] sets the size of the receive buffer/mapping in pages.
>>>>> [--tx] sets the size of the transmit buffer in pages.
>>>>>
>>>>> I checked transmission of 4000 MB of data. Here are some results, in
>>>>> seconds to process the 4000 MB (lower is better):
>>>>>
>>>>>                      size of rx/tx buffers in pages
>>>>>                *--------*--------*--------*--------*--------*
>>>>>                |    8   |   32   |   64   |  256   |  512   |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>> |   zerocopy   |   24   |  10.6  |  12.2  |  23.6  |   21   |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>> | non-zerocopy |   13   |  16.4  |  24.7  |  27.2  |  23.9  |
>>>>> *--------------*--------*--------*--------*--------*--------*
>>>>>
>>>>> I think the results are not that impressive, but at least it is not worse
>>>>> than copy mode, and there is no need to allocate memory for processing the
>>>>> data.
>>>>
>>>> Why is it twice as slow in the first column?
>>>
>>> Maybe this is because memory copying of small buffers is very fast... I'll
>>> analyze it more deeply.
>>
>> Maybe I misunderstood, by small buffers here what do you mean?
>>
>> I thought 8 was the number of pages, so 32KB buffers.
>
>Yes, 8 is the size in pages. Anyway, I need to check it more deeply.

Okay, thanks!

Stefano