Stefano Garzarella
2023-Feb-28 10:26 UTC
[RFC PATCH v1 07/12] vsock/virtio: MGS_ZEROCOPY flag support
On Mon, Feb 20, 2023 at 09:04:04AM +0000, Krasnov Arseniy wrote:>On 16.02.2023 18:16, Stefano Garzarella wrote: >> On Mon, Feb 06, 2023 at 07:00:35AM +0000, Arseniy Krasnov wrote: >>> This adds main logic of MSG_ZEROCOPY flag processing for packet >>> creation. When this flag is set and user's iov iterator fits for >>> zerocopy transmission, call 'get_user_pages()' and add returned >>> pages to the newly created skb. >>> >>> Signed-off-by: Arseniy Krasnov <AVKrasnov at sberdevices.ru> >>> --- >>> net/vmw_vsock/virtio_transport_common.c | 212 ++++++++++++++++++++++-- >>> 1 file changed, 195 insertions(+), 17 deletions(-) >>> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c >>> index 05ce97b967ad..69e37f8a68a6 100644 >>> --- a/net/vmw_vsock/virtio_transport_common.c >>> +++ b/net/vmw_vsock/virtio_transport_common.c >>> @@ -37,6 +37,169 @@ virtio_transport_get_ops(struct vsock_sock *vsk) >>> ????return container_of(t, struct virtio_transport, transport); >>> } >>> >> >> I'd use bool if we don't need to return an error value in the following >> new functions. >> >>> +static int virtio_transport_can_zcopy(struct iov_iter *iov_iter, >>> +????????????????????? size_t free_space) >>> +{ >>> +??? size_t pages; >>> +??? int i; >>> + >>> +??? if (!iter_is_iovec(iov_iter)) >>> +??????? return -1; >>> + >>> +??? if (iov_iter->iov_offset) >>> +??????? return -1; >>> + >>> +??? /* We can't send whole iov. */ >>> +??? if (free_space < iov_iter->count) >>> +??????? return -1; >>> + >>> +??? for (pages = 0, i = 0; i < iov_iter->nr_segs; i++) { >>> +??????? const struct iovec *iovec; >>> +??????? int pages_in_elem; >>> + >>> +??????? iovec = &iov_iter->iov[i]; >>> + >>> +??????? /* Base must be page aligned. */ >>> +??????? if (offset_in_page(iovec->iov_base)) >>> +??????????? return -1; >>> + >>> +??????? /* Only last element could have not page aligned size.? */ >>> +??????? if (i != (iov_iter->nr_segs - 1)) { >>> +??????????? if (offset_in_page(iovec->iov_len)) >>> +??????????????? return -1; >>> + >>> +??????????? pages_in_elem = iovec->iov_len >> PAGE_SHIFT; >>> +??????? } else { >>> +??????????? pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE); >>> +??????????? pages_in_elem >>= PAGE_SHIFT; >>> +??????? } >>> + >>> +??????? /* In case of user's pages - one page is one frag. */ >>> +??????? if (pages + pages_in_elem > MAX_SKB_FRAGS) >>> +??????????? return -1; >>> + >>> +??????? pages += pages_in_elem; >>> +??? } >>> + >>> +??? return 0; >>> +} >>> + >>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk, >>> +?????????????????????? struct sk_buff *skb, >>> +?????????????????????? struct iov_iter *iter, >>> +?????????????????????? bool zerocopy) >>> +{ >>> +??? struct ubuf_info_msgzc *uarg_zc; >>> +??? struct ubuf_info *uarg; >>> + >>> +??? uarg = msg_zerocopy_realloc(sk_vsock(vsk), >>> +??????????????????? iov_length(iter->iov, iter->nr_segs), >>> +??????????????????? NULL); >>> + >>> +??? if (!uarg) >>> +??????? return -1; >>> + >>> +??? uarg_zc = uarg_to_msgzc(uarg); >>> +??? uarg_zc->zerocopy = zerocopy ? 1 : 0; >>> + >>> +??? skb_zcopy_init(skb, uarg); >>> + >>> +??? return 0; >>> +} >>> + >>> +static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb, >>> +?????????????????????????? struct vsock_sock *vsk, >>> +?????????????????????????? struct virtio_vsock_pkt_info *info) >>> +{ >>> +??? struct iov_iter *iter; >>> +??? int frag_idx; >>> +??? int seg_idx; >>> + >>> +??? iter = &info->msg->msg_iter; >>> +??? frag_idx = 0; >>> +??? VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0; >>> +??? VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0; >>> + >>> +??? /* At this moment: >>> +???? * 1) 'iov_offset' is zero. >>> +???? * 2) Every 'iov_base' and 'iov_len' are also page aligned >>> +???? *??? (except length of the last element). >>> +???? * 3) Number of pages in this iov <= MAX_SKB_FRAGS. >>> +???? * 4) Length of the data fits in current credit space. >>> +???? */ >>> +??? for (seg_idx = 0; seg_idx < iter->nr_segs; seg_idx++) { >>> +??????? struct page *user_pages[MAX_SKB_FRAGS]; >>> +??????? const struct iovec *iovec; >>> +??????? size_t last_frag_len; >>> +??????? size_t pages_in_seg; >>> +??????? int page_idx; >>> + >>> +??????? iovec = &iter->iov[seg_idx]; >>> +??????? pages_in_seg = iovec->iov_len >> PAGE_SHIFT; >>> + >>> +??????? if (iovec->iov_len % PAGE_SIZE) { >>> +??????????? last_frag_len = iovec->iov_len % PAGE_SIZE; >>> +??????????? pages_in_seg++; >>> +??????? } else { >>> +??????????? last_frag_len = PAGE_SIZE; >>> +??????? } >>> + >>> +??????? if (get_user_pages((unsigned long)iovec->iov_base, >>> +?????????????????? pages_in_seg, FOLL_GET, user_pages, >>> +?????????????????? NULL) != pages_in_seg) >>> +??????????? return -1; >> >> Reading the get_user_pages() documentation, this should pin the user >> pages, so we should be fine if we then expose them in the virtqueue. >> >> But reading Documentation/core-api/pin_user_pages.rst it seems that >> drivers should use "pin_user_pages*() for DMA-pinned pages", so I'm not >> sure what we should do. >> >That is really interesting question for me too. IIUC 'pin_user_pages()' >sets special value to ref counter of page, so we can distinguish such >pages from the others. I've grepped for pinned pages check and found, >the it is used in mm/vmscan.c by calling 'folio_maybe_dma_pinned()' during >page lists processing. Seems 'pin_user_pages()' is more strict version of >'get_user_pages()' and it is recommended to use 'pin_' when data on these >pages will be accessed. >I think, i'll check which API is used in the TCP implementation for zerocopy >transmission. > >> Additional advice would be great! >> >> Anyway, when we are done using the pages, we should call put_page() or >> unpin_user_page() depending on how we pin them. >> >In case of 'get_user_pages()' everything is ok here: when such skb >will be released, 'put_page()' will be called for every frag page >of it, so there is no page leak.Got it!>But in case of 'pin_user_pages()', >i will need to unpin in manually before calling 'consume_skb()' >after it is processed by virtio device. But anyway - it is not a >problem.Yep. Thanks, Stefano