Stefano Garzarella
2022-May-17 15:14 UTC
[RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
Hi Arseniy,

On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>                              INTRODUCTION
>
>    Hello, this is an experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API
>uses the same idea: call 'mmap()' on the socket's descriptor, then every
>'getsockopt()' will fill the provided vma area with pages of virtio RX
>buffers. After the received data has been processed by the user, the pages
>must be freed by an 'madvise()' call with the MADV_DONTNEED flag set (if the
>user doesn't call 'madvise()', the next 'getsockopt()' will fail).

Sounds cool, but maybe we would need some socket/net experts here for
review.

Could we do something similar for the sending path as well?

>
>                                 DETAILS
>
>    Here is how the mapping with the mapped pages looks exactly: the first
>page of the mapping contains an array of trimmed virtio vsock packet headers
>(each contains only the length of the data on the corresponding page and a
>'flags' field):
>
>    struct virtio_vsock_usr_hdr {
>        uint32_t length;
>        uint32_t flags;
>    };
>
>The 'length' field lets the user know the exact size of the payload within
>each sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags
>(such as message bounds or record bounds). All other pages are data pages
>from the RX queue.
>
>         Page 0      Page 1      Page N
>
>    [ hdr1 .. hdrN ][ data ] .. [ data ]
>       |       |       ^           ^
>       |       |       |           |
>       |       *-------------------*
>       |               |
>       |               |
>       *---------------*
>
>    Of course, a single header could represent an array of pages (when the
>packet's buffer is bigger than one page). So here is an example of a detailed
>mapping layout for some set of packets. Let's consider that we have the
>following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All
>pages 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is
>large enough).
>
>    Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>    Page 1: [ 56 ]
>    Page 2: [ 4096 ]
>    Page 3: [ 4096 ]
>    Page 4: [ 4096 ]
>    Page 5: [ 8 ]
>
>    Page 0 contains only the array of headers:
>    'hdr0' has 56 in its length field.
>    'hdr1' has 4096 in its length field.
>    'hdr2' has 8200 in its length field.
>    'hdr3' has 0 in its length field (this is the end-of-data marker).
>
>    Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>    Page 2 corresponds to 'hdr1' and is filled with data.
>    Page 3 corresponds to 'hdr2' and is filled with data.
>    Page 4 corresponds to 'hdr2' and is filled with data.
>    Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>    This patchset also changes the way packets are allocated: today's
>implementation uses only 'kmalloc()' to create the data buffer. A problem
>appears when we try to map such buffers into the user's vma - the kernel
>forbids mapping slab pages into a user's vma (as pages of "not large"
>'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>could be bigger than one page). So to avoid this, data buffers are now
>allocated with 'alloc_pages()'.
>
>                                  TESTS
>
>    This patchset updates the 'vsock_test' utility: two tests for the new
>feature were added. The first test covers invalid cases. The second checks a
>valid transmission case.

Thanks for adding the test!

>
>                               BENCHMARKING
>
>    For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>client/server mode. When the client connects to the server, the server starts
>sending an exact amount of data to the client (the amount is set as an input
>argument). The client reads the data and waits for the next portion of it.
>The client works in two modes: copy and zero-copy. In copy mode the client
>uses the 'read()' call, while in zerocopy mode the sequence of 'mmap()'/
>'getsockopt()'/'madvise()' is used. A smaller transmission time is better.
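
Just to check that I'm reading the zerocopy flow correctly, the client loop
would look roughly like the sketch below, right? (Only a sketch: I left out
the actual getsockopt() level/optname since those come from your patches, the
mmap() prot/flags are a guess, and the header walk just follows the layout
you described above.)

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define RX_PAGES 64UL                   /* size of the RX mapping, in pages */
#define PAGE_SZ  4096UL

/* Trimmed header array placed in page 0 of the mapping. */
struct virtio_vsock_usr_hdr {
        uint32_t length;
        uint32_t flags;
};

static void zerocopy_read_once(int fd)
{
        size_t map_len = RX_PAGES * PAGE_SZ;

        /* 1) Map the socket; the kernel will insert RX pages here.
         *    (prot/flags are a guess, the exact values come from the patchset)
         */
        void *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return;

        /* 2) getsockopt(fd, <level>, <zerocopy rx optname>, map, &len)
         *    fills the mapping: page 0 gets the header array, the following
         *    pages get the RX buffer data. Omitted here because the real
         *    level/optname are defined by the patchset. */

        /* 3) Walk the header array until the zero-length end marker. */
        struct virtio_vsock_usr_hdr *hdr = map;
        char *data = (char *)map + PAGE_SZ;

        while (hdr->length != 0) {
                /* Real code would consume data[0..length) here. */
                printf("packet at offset %zu: %u bytes, flags 0x%x\n",
                       (size_t)(data - (char *)map), hdr->length, hdr->flags);
                /* A single header may cover several data pages. */
                data += (hdr->length + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
                hdr++;
        }

        /* 4) Give the pages back before the next getsockopt(). */
        madvise(map, map_len, MADV_DONTNEED);
        munmap(map, map_len);
}

If that is the intended sequence, it could be worth documenting it like this
in the cover letter or in the test utility.
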
>    For the server, we can set the size of the tx buffer, and for the client
>we can set the size of the rx buffer or the rx mapping size (in zerocopy
>mode). Usage of this utility is quite simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets the number of megabytes to transfer.
>[--rx] sets the size of the receive buffer/mapping in pages.
>[--tx] sets the size of the transmit buffer in pages.
>
>I checked the transmission of 4000mb of data. Here are some results:
>
>                          size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |   8    |    32    |   64    |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>| zerocopy     |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*--------*----------*---------*----------*----------* process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*--------*----------*---------*----------*----------*
>
>I think the results are not so impressive, but at least they are not worse
>than copy mode, and there is no need to allocate memory to process the data.

Why is it twice as slow in the first column?

>
>                                 PROBLEMS
>
>    The updated packet allocation logic creates a problem: when the host gets
>data from the guest (in vhost-vsock), it allocates at least one page for each
>packet (even if the packet has a 1 byte payload). I think this could be
>resolved in several ways:

Can we somehow copy the incoming packets into the payload of the
already queued packet?

This reminds me that we have yet to fix a similar problem with
kmalloc() as well...

https://bugzilla.kernel.org/show_bug.cgi?id=215329

>    1) Make zerocopy rx mode disabled by default, so if the user didn't
>enable it, the current 'kmalloc()' way will be used.

That sounds reasonable to me, I guess also TCP needs a setsockopt()
call to enable the feature, right?

>    2) Use 'kmalloc()' for "small" packets, otherwise call the page
>allocator. But in this case we have a mix of packets allocated in two
>different ways, so during zerocopying to the user (e.g. mapping pages into
>the vma) such small packets will be handled in a rather clumsy way: we need
>to allocate one page for the user, copy the data to it and then insert the
>page into the user's vma.

It seems more difficult to me, but at the same time doable.
I would lean more towards option 1, though.

>
>P.S: of course this is an experimental RFC, so what do you think, guys?

It seems cool :-)

But I would like some feedback from the net guys, since we are reusing
some TCP-like ideas here.

Thanks,
Stefano
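
P.S. about option 2: just to be sure we are picturing the same thing, I mean
something like the sketch below on the allocation side (only a sketch, not
code from the patchset; the threshold is arbitrary, and the free path would
of course need to know whether to call kfree() or put the pages back):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Sketch only: keep kmalloc() for small payloads, fall back to whole
 * pages for anything bigger, and report which one was used so the
 * free/mapping paths can tell the two kinds of buffers apart. */
static void *vsock_pkt_buf_alloc(size_t len, bool *page_backed)
{
        struct page *pages;

        if (len <= PAGE_SIZE / 2) {             /* arbitrary threshold */
                *page_backed = false;
                return kmalloc(len, GFP_KERNEL);
        }

        pages = alloc_pages(GFP_KERNEL, get_order(len));
        if (!pages)
                return NULL;

        *page_backed = true;
        return page_address(pages);
}

But, as you say, the mapping path would then have to special-case the
kmalloc()'ed buffers, which is why option 1 still looks simpler to me.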