Stefano Garzarella
2022-Jun-13 08:54 UTC
[RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
On Thu, Jun 09, 2022 at 12:33:32PM +0000, Arseniy Krasnov wrote:
>On 09.06.2022 11:54, Stefano Garzarella wrote:
>> Hi Arseniy,
>> I left some comments in the patches, and I'm adding something also here:
>Thanks for the comments
>>
>> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>>                              INTRODUCTION
>>>
>>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> the same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
>>> will fill the provided vma area with pages of virtio RX buffers. After the received
>>> data has been processed by the user, the pages must be freed by a 'madvise()' call
>>> with the MADV_DONTNEED flag set (if the user doesn't call 'madvise()', the next
>>> 'getsockopt()' will fail).
>>
>> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
>You mean compare the API in more detail?

Yes, maybe a comparison from the user's point of view to do zero-copy with TCP and VSOCK.

>>
>>>
>>>                                 DETAILS
>>>
>>>     Here is exactly how the mapping with mapped pages looks: the first page of
>>> the mapping contains an array of trimmed virtio vsock packet headers (it contains
>>> only the length of data on the corresponding page and the 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>         uint32_t copy_len;
>>>     };
>>>
>>> Field 'length' allows the user to know the exact size of the payload within each
>>> sequence of pages and 'flags' allows the user to handle SOCK_SEQPACKET flags (such
>>> as message bounds or record bounds). Field 'copy_len' is described below in the
>>> 'v1->v2' part. All other pages are data pages from the RX queue.
>>>
>>>             Page 0      Page 1      Page N
>>>
>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>           |        |       ^           ^
>>>           |        |       |           |
>>>           |        *-------------------*
>>>           |                |
>>>           |                |
>>>           *----------------*
>>>
>>>     Of course, a single header could represent an array of pages (when the packet's
>>> buffer is bigger than one page). So here is an example of the detailed mapping
>>> layout for some set of packets. Let's consider that we have the following sequence
>>> of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>> be inserted into the user's vma (the vma is large enough).
>>
>> In order to have a "userspace polling-friendly approach" and reduce the number of syscalls, can we allow, for example, the userspace to mmap at least the first header before packets arrive?
>> Then the userspace can poll a flag or other fields in the header to understand that there are new packets.
>You mean that, to avoid the 'poll()' syscall, the user will spin on some flag provided by the kernel on some mapped page? I think yes. This is ok. Also I think that I can avoid the 'madvise()' call
>to clear the memory mapping before each 'getsockopt()' - let 'getsockopt()' do the 'madvise()' job by removing the pages from the previous data. In this case only one system call is needed - 'getsockopt()'.

Yes, that's right. I mean to support both: poll() for interrupt-based applications, and the ability to actively poll a variable in the shared memory for applications that want to minimize latency.

Thanks,
Stefano
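
The page accounting in the quoted layout example (packets of 56, 4096 and 8200 bytes ending up in pages 0 to 5) can be sketched in userspace C as below. This is only an illustration under stated assumptions, not code from the patch set: it assumes 4096-byte pages and that every packet starts on a fresh data page, and the names SKETCH_PAGE_SIZE, sketch_pages_for() and sketch_layout() are hypothetical. Only the struct layout is taken from the cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this illustration; real code would query
 * sysconf(_SC_PAGESIZE) instead of hard-coding it. */
#define SKETCH_PAGE_SIZE 4096u

/* Header layout as quoted in the cover letter. */
struct virtio_vsock_usr_hdr {
    uint32_t length;
    uint32_t flags;
    uint32_t copy_len;
};

/* Hypothetical helper: number of data pages one packet occupies,
 * i.e. length rounded up to whole pages (written to avoid overflow). */
static uint32_t sketch_pages_for(uint32_t length)
{
    return length / SKETCH_PAGE_SIZE + (length % SKETCH_PAGE_SIZE != 0);
}

/* Hypothetical helper: walk n headers and record, for each packet, the
 * index of its first data page inside the mapping. Page 0 is assumed to
 * hold the header array, so data starts at page 1. Returns the total
 * number of mapped pages, header page included. */
static size_t sketch_layout(const struct virtio_vsock_usr_hdr *hdrs,
                            size_t n, size_t *first_page)
{
    size_t next = 1; /* page 0 is the header page */

    assert(hdrs != NULL && first_page != NULL);

    for (size_t i = 0; i < n; i++) {
        first_page[i] = next;
        next += sketch_pages_for(hdrs[i].length);
    }
    return next;
}
```

Under these assumptions, calling sketch_layout() on headers with lengths 56, 4096 and 8200 places the packets at data pages 1, 2 and 3 (the last one spanning pages 3 to 5) and returns 6 pages in total, which agrees with the "pages 0,1,2,3,4 and 5" in the quoted example.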