Michael S. Tsirkin
2018-Dec-26 15:02 UTC
[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:> > On 2018/12/25 ??8:50, Michael S. Tsirkin wrote: > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote: > > > On 2018/12/25 ??2:10, Michael S. Tsirkin wrote: > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote: > > > > > On 2018/12/14 ??8:36, Michael S. Tsirkin wrote: > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote: > > > > > > > On 2018/12/13 ??11:44, Michael S. Tsirkin wrote: > > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote: > > > > > > > > > It was noticed that the copy_user() friends that was used to access > > > > > > > > > virtqueue metdata tends to be very expensive for dataplane > > > > > > > > > implementation like vhost since it involves lots of software check, > > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The > > > > > > > > > extra cost will be more obvious when transferring small packets. > > > > > > > > > > > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages > > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup > > > > > > > > > those mappings and memory accessors are modified to use pointers to > > > > > > > > > access the metadata directly. > > > > > > > > > > > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could > > > > > > > > > use similar method to optimize it in the future. > > > > > > > > > > > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user + > > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled): > > > > > > > > > > > > > > > > > > Before: ~5.0Mpps > > > > > > > > > After: ~6.1Mpps > > > > > > > > > > > > > > > > > > Signed-off-by: Jason Wang<jasowang at redhat.com> > > > > > > > > > --- > > > > > > > > > drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > drivers/vhost/vhost.h | 11 +++ > > > > > > > > > 2 files changed, 189 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > > > index bafe39d2e637..1bd24203afb6 100644 > > > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev, > > > > > > > > > vq->indirect = NULL; > > > > > > > > > vq->heads = NULL; > > > > > > > > > vq->dev = dev; > > > > > > > > > + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring)); > > > > > > > > > + memset(&vq->used_ring, 0, sizeof(vq->used_ring)); > > > > > > > > > + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring)); > > > > > > > > > mutex_init(&vq->mutex); > > > > > > > > > vhost_vq_reset(dev, vq); > > > > > > > > > if (vq->handle_kick) > > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev) > > > > > > > > > spin_unlock(&dev->iotlb_lock); > > > > > > > > > } > > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr, > > > > > > > > > + size_t size, int write) > > > > > > > > > +{ > > > > > > > > > + struct page **pages; > > > > > > > > > + int npages = DIV_ROUND_UP(size, PAGE_SIZE); > > > > > > > > > + int npinned; > > > > > > > > > + void *vaddr; > > > > > > > > > + > > > > > > > > > + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); > > > > > > > > > + if (!pages) > > > > > > > > > + return -ENOMEM; > > > > > > > > > + > > > > > > > > > + npinned = get_user_pages_fast(uaddr, npages, write, pages); > > > > > > > > > + if (npinned != npages) > > > > > > > > > + goto err; > > > > > > > > > + > > > > > > > > As I said I have doubts about the whole approach, but this > > > > > > > > implementation in particular isn't a good idea > > > > > > > > as it keeps the page around forever. > > > > > The pages wil be released during set features. > > > > > > > > > > > > > > > > > > So no THP, no NUMA rebalancing, > > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter > > > > > consider the gain we have? > > > > We as in vhost? networking isn't the only thing guest does. > > > > We don't even know if this guest does a lot of networking. > > > > You don't > > > > know what else is in this huge page. Can be something very important > > > > that guest touches all the time. > > > > > > Well, the probability should be very small consider we usually give several > > > gigabytes to guest. The rest of the pages that doesn't sit in the same > > > hugepage with metadata can still be merged by THP.? Anyway, I can test the > > > differences. > > Thanks! > > > > > > > For NUMA rebalancing, I'm even not quite sure if > > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it > > > > > may cause page to be thrash between nodes if vhost and userspace are running > > > > > in two nodes. > > > > So again it's a gain for vhost but has a completely unpredictable effect on > > > > other functionality of the guest. > > > > > > > > That's what bothers me with this approach. > > > > > > So: > > > > > > - The rest of the pages could still be balanced to other nodes, no? > > > > > > - try to balance metadata pages (belongs to co-operate processes) itself is > > > still questionable > > I am not sure why. It should be easy enough to force the VCPU and vhost > > to move (e.g. start them pinned to 1 cpu, then pin them to another one). > > Clearly sometimes this would be necessary for load balancing reasons. > > > Yes, but it looks to me the part of motivation of auto NUMA is to avoid > manual pinning.... of memory. Yes.> > > With autonuma after a while (could take seconds but it will happen) the > > memory will migrate. > > > > Yes. As you mentioned during the discuss, I wonder we could do it similarly > through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86: > Unpin and remove kvm_arch->apic_access_page")That would be a possible approach.> > > > > > > > > > > > > > > > > > > > > > > > This is the price of all GUP users not only vhost itself. > > > > > > Yes. GUP is just not a great interface for vhost to use. > > > > > Zerocopy codes (enabled by defualt) use them for years. > > > > But only for TX and temporarily. We pin, read, unpin. > > > > > > Probably not. For several reasons that the page will be not be released soon > > > or held for a very long period of time or even forever. > > > > With zero copy? Well it's pinned until transmit. Takes a while > > but could be enough for autocopy to work esp since > > its the packet memory so not reused immediately. > > > > > > Your patch is different > > > > > > > > - it writes into memory and GUP has known issues with file > > > > backed memory > > > > > > The ordinary user for vhost is anonymous pages I think? > > > > It's not the most common scenario and not the fastest one > > (e.g. THP does not work) but file backed is useful sometimes. > > It would not be nice at all to corrupt guest memory in that case. > > > Ok. > > > > > > > > - it keeps pages pinned forever > > > > > > > > > > > > > > > > > > > What's more > > > > > > > important, the goal is not to be left too much behind for other backends > > > > > > > like DPDK or AF_XDP (all of which are using GUP). > > > > > > So these guys assume userspace knows what it's doing. > > > > > > We can't assume that. > > > > > What kind of assumption do you they have? > > > > > > > > > > > > > > > > > > userspace-controlled > > > > > > > > amount of memory locked up and not accounted for. > > > > > > > It's pretty easy to add this since the slow path was still kept. If we > > > > > > > exceeds the limitation, we can switch back to slow path. > > > > > > > > > > > > > > > Don't get me wrong it's a great patch in an ideal world. > > > > > > > > But then in an ideal world no barriers smap etc are necessary at all. > > > > > > > Again, this is only for metadata accessing not the data which has been used > > > > > > > for years for real use cases. > > > > > > > > > > > > > > For SMAP, it makes senses for the address that kernel can not forcast. But > > > > > > > it's not the case for the vhost metadata since we know the address will be > > > > > > > accessed very frequently. For speculation barrier, it helps nothing for the > > > > > > > data path of vhost which is a kthread. > > > > > > I don't see how a kthread makes any difference. We do have a validation > > > > > > step which makes some difference. > > > > > The problem is not kthread but the address of userspace address. The > > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows > > > > > they will be frequently. SMAP doesn't help too much in this case. > > > > > > > > > > Thanks. > > > > It's true for a real life applications but a malicious one > > > > can call the setup ioctls any number of times. And SMAP is > > > > all about malcious applications. > > > > > > We don't do this in the path of ioctl, there's no context switch between > > > userspace and kernel in the worker thread. SMAP is used to prevent kernel > > > from accessing userspace pages unexpectedly which is not the case for > > > metadata access. > > > > > > Thanks > > OK let's forget smap for now. > > > Some numbers I measured: > > On an old Sandy bridge machine without SMAP support. Remove speculation > barrier boost the performance from 4.6Mpps to 5.1Mpps > > On a newer Broadwell machine with SMAP support. Remove speculation barrier > only gives 2%-5% improvement, disable SMAP completely through Kconfig boost > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it > only bypass SMAP for metadata). > > So it looks like for recent machine, SMAP becomes pain point when the copy > is short (e.g 64B) for high PPS. > > ThanksThanks a lot for looking into this! So first of all users can just boot with nosmap, right? What's wrong with that? Yes it's not fine-grained but OTOH it's easy to understand. And I guess this confirms that if we are going to worry about smap enabled, we need to look into packet copies too, not just meta-data. Vaguely could see a module option (off by default) where vhost basically does user_access_begin when it starts running, then uses unsafe accesses in vhost and tun and then user_access_end.> > > > > > > > > > Packet or AF_XDP benefit from > > > > > > > accessing metadata directly, we should do it as well. > > > > > > > > > > > > > > Thanks
Jason Wang
2018-Dec-27 09:39 UTC
[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
On 2018/12/26 ??11:02, Michael S. Tsirkin wrote:> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote: >> On 2018/12/25 ??8:50, Michael S. Tsirkin wrote: >>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote: >>>> On 2018/12/25 ??2:10, Michael S. Tsirkin wrote: >>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote: >>>>>> On 2018/12/14 ??8:36, Michael S. Tsirkin wrote: >>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote: >>>>>>>> On 2018/12/13 ??11:44, Michael S. Tsirkin wrote: >>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote: >>>>>>>>>> It was noticed that the copy_user() friends that was used to access >>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane >>>>>>>>>> implementation like vhost since it involves lots of software check, >>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The >>>>>>>>>> extra cost will be more obvious when transferring small packets. >>>>>>>>>> >>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages >>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup >>>>>>>>>> those mappings and memory accessors are modified to use pointers to >>>>>>>>>> access the metadata directly. >>>>>>>>>> >>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could >>>>>>>>>> use similar method to optimize it in the future. >>>>>>>>>> >>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user + >>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled): >>>>>>>>>> >>>>>>>>>> Before: ~5.0Mpps >>>>>>>>>> After: ~6.1Mpps >>>>>>>>>> >>>>>>>>>> Signed-off-by: Jason Wang<jasowang at redhat.com> >>>>>>>>>> --- >>>>>>>>>> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++ >>>>>>>>>> drivers/vhost/vhost.h | 11 +++ >>>>>>>>>> 2 files changed, 189 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644 >>>>>>>>>> --- a/drivers/vhost/vhost.c >>>>>>>>>> +++ b/drivers/vhost/vhost.c >>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev, >>>>>>>>>> vq->indirect = NULL; >>>>>>>>>> vq->heads = NULL; >>>>>>>>>> vq->dev = dev; >>>>>>>>>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring)); >>>>>>>>>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring)); >>>>>>>>>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring)); >>>>>>>>>> mutex_init(&vq->mutex); >>>>>>>>>> vhost_vq_reset(dev, vq); >>>>>>>>>> if (vq->handle_kick) >>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev) >>>>>>>>>> spin_unlock(&dev->iotlb_lock); >>>>>>>>>> } >>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr, >>>>>>>>>> + size_t size, int write) >>>>>>>>>> +{ >>>>>>>>>> + struct page **pages; >>>>>>>>>> + int npages = DIV_ROUND_UP(size, PAGE_SIZE); >>>>>>>>>> + int npinned; >>>>>>>>>> + void *vaddr; >>>>>>>>>> + >>>>>>>>>> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); >>>>>>>>>> + if (!pages) >>>>>>>>>> + return -ENOMEM; >>>>>>>>>> + >>>>>>>>>> + npinned = get_user_pages_fast(uaddr, npages, write, pages); >>>>>>>>>> + if (npinned != npages) >>>>>>>>>> + goto err; >>>>>>>>>> + >>>>>>>>> As I said I have doubts about the whole approach, but this >>>>>>>>> implementation in particular isn't a good idea >>>>>>>>> as it keeps the page around forever. >>>>>> The pages wil be released during set features. >>>>>> >>>>>> >>>>>>>>> So no THP, no NUMA rebalancing, >>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter >>>>>> consider the gain we have? >>>>> We as in vhost? networking isn't the only thing guest does. >>>>> We don't even know if this guest does a lot of networking. >>>>> You don't >>>>> know what else is in this huge page. Can be something very important >>>>> that guest touches all the time. >>>> Well, the probability should be very small consider we usually give several >>>> gigabytes to guest. The rest of the pages that doesn't sit in the same >>>> hugepage with metadata can still be merged by THP.? Anyway, I can test the >>>> differences. >>> Thanks! >>> >>>>>> For NUMA rebalancing, I'm even not quite sure if >>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it >>>>>> may cause page to be thrash between nodes if vhost and userspace are running >>>>>> in two nodes. >>>>> So again it's a gain for vhost but has a completely unpredictable effect on >>>>> other functionality of the guest. >>>>> >>>>> That's what bothers me with this approach. >>>> So: >>>> >>>> - The rest of the pages could still be balanced to other nodes, no? >>>> >>>> - try to balance metadata pages (belongs to co-operate processes) itself is >>>> still questionable >>> I am not sure why. It should be easy enough to force the VCPU and vhost >>> to move (e.g. start them pinned to 1 cpu, then pin them to another one). >>> Clearly sometimes this would be necessary for load balancing reasons. >> >> Yes, but it looks to me the part of motivation of auto NUMA is to avoid >> manual pinning. > ... of memory. Yes. > > >>> With autonuma after a while (could take seconds but it will happen) the >>> memory will migrate. >>> >> Yes. As you mentioned during the discuss, I wonder we could do it similarly >> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86: >> Unpin and remove kvm_arch->apic_access_page") > That would be a possible approach.Yes, this looks possible, and the conversion seems not hard. Let me have a try with this. [...]>>>>>>> I don't see how a kthread makes any difference. We do have a validation >>>>>>> step which makes some difference. >>>>>> The problem is not kthread but the address of userspace address. The >>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows >>>>>> they will be frequently. SMAP doesn't help too much in this case. >>>>>> >>>>>> Thanks. >>>>> It's true for a real life applications but a malicious one >>>>> can call the setup ioctls any number of times. And SMAP is >>>>> all about malcious applications. >>>> We don't do this in the path of ioctl, there's no context switch between >>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel >>>> from accessing userspace pages unexpectedly which is not the case for >>>> metadata access. >>>> >>>> Thanks >>> OK let's forget smap for now. >> >> Some numbers I measured: >> >> On an old Sandy bridge machine without SMAP support. Remove speculation >> barrier boost the performance from 4.6Mpps to 5.1Mpps >> >> On a newer Broadwell machine with SMAP support. Remove speculation barrier >> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost >> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it >> only bypass SMAP for metadata). >> >> So it looks like for recent machine, SMAP becomes pain point when the copy >> is short (e.g 64B) for high PPS. >> >> Thanks > Thanks a lot for looking into this! > > So first of all users can just boot with nosmap, right? > What's wrong with that?Nothing wrong, just realize we had this kernel parameter.> Yes it's not fine-grained but OTOH > it's easy to understand. > > And I guess this confirms that if we are going to worry > about smap enabled, we need to look into packet copies > too, not just meta-data.For packet copies, we can do batch copy which is pretty simple for the case of XDP. I've already had patches for this.> > Vaguely could see a module option (off by default) > where vhost basically does user_access_begin > when it starts running, then uses unsafe accesses > in vhost and tun and then user_access_end.Using user_access_begin() is more tricky than imaged. E.g it requires: - userspace address to be validated before through access_ok() [1] - It doesn't support calling a function that does explicit schedule since SMAP/PAN state is not maintained through schedule() [2] [1] https://lwn.net/Articles/736348/ [2] https://lkml.org/lkml/2018/11/23/430 So calling user_access_begin() all the time when vhost is running seems pretty dangerous. For a better batched datacopy, I tend to build not only XDP but also skb in vhost in the future. Thanks> > >>>>>>>> Packet or AF_XDP benefit from >>>>>>>> accessing metadata directly, we should do it as well. >>>>>>>> >>>>>>>> Thanks
Michael S. Tsirkin
2018-Dec-30 18:30 UTC
[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:> > On 2018/12/26 ??11:02, Michael S. Tsirkin wrote: > > On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote: > > > On 2018/12/25 ??8:50, Michael S. Tsirkin wrote: > > > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote: > > > > > On 2018/12/25 ??2:10, Michael S. Tsirkin wrote: > > > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote: > > > > > > > On 2018/12/14 ??8:36, Michael S. Tsirkin wrote: > > > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote: > > > > > > > > > On 2018/12/13 ??11:44, Michael S. Tsirkin wrote: > > > > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote: > > > > > > > > > > > It was noticed that the copy_user() friends that was used to access > > > > > > > > > > > virtqueue metdata tends to be very expensive for dataplane > > > > > > > > > > > implementation like vhost since it involves lots of software check, > > > > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The > > > > > > > > > > > extra cost will be more obvious when transferring small packets. > > > > > > > > > > > > > > > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages > > > > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup > > > > > > > > > > > those mappings and memory accessors are modified to use pointers to > > > > > > > > > > > access the metadata directly. > > > > > > > > > > > > > > > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could > > > > > > > > > > > use similar method to optimize it in the future. > > > > > > > > > > > > > > > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user + > > > > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled): > > > > > > > > > > > > > > > > > > > > > > Before: ~5.0Mpps > > > > > > > > > > > After: ~6.1Mpps > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jason Wang<jasowang at redhat.com> > > > > > > > > > > > --- > > > > > > > > > > > drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++ > > > > > > > > > > > drivers/vhost/vhost.h | 11 +++ > > > > > > > > > > > 2 files changed, 189 insertions(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > > > > > index bafe39d2e637..1bd24203afb6 100644 > > > > > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev, > > > > > > > > > > > vq->indirect = NULL; > > > > > > > > > > > vq->heads = NULL; > > > > > > > > > > > vq->dev = dev; > > > > > > > > > > > + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring)); > > > > > > > > > > > + memset(&vq->used_ring, 0, sizeof(vq->used_ring)); > > > > > > > > > > > + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring)); > > > > > > > > > > > mutex_init(&vq->mutex); > > > > > > > > > > > vhost_vq_reset(dev, vq); > > > > > > > > > > > if (vq->handle_kick) > > > > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev) > > > > > > > > > > > spin_unlock(&dev->iotlb_lock); > > > > > > > > > > > } > > > > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr, > > > > > > > > > > > + size_t size, int write) > > > > > > > > > > > +{ > > > > > > > > > > > + struct page **pages; > > > > > > > > > > > + int npages = DIV_ROUND_UP(size, PAGE_SIZE); > > > > > > > > > > > + int npinned; > > > > > > > > > > > + void *vaddr; > > > > > > > > > > > + > > > > > > > > > > > + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); > > > > > > > > > > > + if (!pages) > > > > > > > > > > > + return -ENOMEM; > > > > > > > > > > > + > > > > > > > > > > > + npinned = get_user_pages_fast(uaddr, npages, write, pages); > > > > > > > > > > > + if (npinned != npages) > > > > > > > > > > > + goto err; > > > > > > > > > > > + > > > > > > > > > > As I said I have doubts about the whole approach, but this > > > > > > > > > > implementation in particular isn't a good idea > > > > > > > > > > as it keeps the page around forever. > > > > > > > The pages wil be released during set features. > > > > > > > > > > > > > > > > > > > > > > > > So no THP, no NUMA rebalancing, > > > > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter > > > > > > > consider the gain we have? > > > > > > We as in vhost? networking isn't the only thing guest does. > > > > > > We don't even know if this guest does a lot of networking. > > > > > > You don't > > > > > > know what else is in this huge page. Can be something very important > > > > > > that guest touches all the time. > > > > > Well, the probability should be very small consider we usually give several > > > > > gigabytes to guest. The rest of the pages that doesn't sit in the same > > > > > hugepage with metadata can still be merged by THP.? Anyway, I can test the > > > > > differences. > > > > Thanks! > > > > > > > > > > > For NUMA rebalancing, I'm even not quite sure if > > > > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it > > > > > > > may cause page to be thrash between nodes if vhost and userspace are running > > > > > > > in two nodes. > > > > > > So again it's a gain for vhost but has a completely unpredictable effect on > > > > > > other functionality of the guest. > > > > > > > > > > > > That's what bothers me with this approach. > > > > > So: > > > > > > > > > > - The rest of the pages could still be balanced to other nodes, no? > > > > > > > > > > - try to balance metadata pages (belongs to co-operate processes) itself is > > > > > still questionable > > > > I am not sure why. It should be easy enough to force the VCPU and vhost > > > > to move (e.g. start them pinned to 1 cpu, then pin them to another one). > > > > Clearly sometimes this would be necessary for load balancing reasons. > > > > > > Yes, but it looks to me the part of motivation of auto NUMA is to avoid > > > manual pinning. > > ... of memory. Yes. > > > > > > > > With autonuma after a while (could take seconds but it will happen) the > > > > memory will migrate. > > > > > > > Yes. As you mentioned during the discuss, I wonder we could do it similarly > > > through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86: > > > Unpin and remove kvm_arch->apic_access_page") > > That would be a possible approach. > > > Yes, this looks possible, and the conversion seems not hard. Let me have a > try with this. > > > [...] > > > > > > > > > > I don't see how a kthread makes any difference. We do have a validation > > > > > > > > step which makes some difference. > > > > > > > The problem is not kthread but the address of userspace address. The > > > > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows > > > > > > > they will be frequently. SMAP doesn't help too much in this case. > > > > > > > > > > > > > > Thanks. > > > > > > It's true for a real life applications but a malicious one > > > > > > can call the setup ioctls any number of times. And SMAP is > > > > > > all about malcious applications. > > > > > We don't do this in the path of ioctl, there's no context switch between > > > > > userspace and kernel in the worker thread. SMAP is used to prevent kernel > > > > > from accessing userspace pages unexpectedly which is not the case for > > > > > metadata access. > > > > > > > > > > Thanks > > > > OK let's forget smap for now. > > > > > > Some numbers I measured: > > > > > > On an old Sandy bridge machine without SMAP support. Remove speculation > > > barrier boost the performance from 4.6Mpps to 5.1Mpps > > > > > > On a newer Broadwell machine with SMAP support. Remove speculation barrier > > > only gives 2%-5% improvement, disable SMAP completely through Kconfig boost > > > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it > > > only bypass SMAP for metadata). > > > > > > So it looks like for recent machine, SMAP becomes pain point when the copy > > > is short (e.g 64B) for high PPS. > > > > > > Thanks > > Thanks a lot for looking into this! > > > > So first of all users can just boot with nosmap, right? > > What's wrong with that? > > > Nothing wrong, just realize we had this kernel parameter. > > > > Yes it's not fine-grained but OTOH > > it's easy to understand. > > > > And I guess this confirms that if we are going to worry > > about smap enabled, we need to look into packet copies > > too, not just meta-data. > > > For packet copies, we can do batch copy which is pretty simple for the case > of XDP. I've already had patches for this. > > > > > > Vaguely could see a module option (off by default) > > where vhost basically does user_access_begin > > when it starts running, then uses unsafe accesses > > in vhost and tun and then user_access_end. > > > Using user_access_begin() is more tricky than imaged. E.g it requires: > > - userspace address to be validated before through access_ok() [1]This part is fine I think - addresses come from the memory map and when userspace supplies the memory map we validate everything with access_ok. Well do we validate with the iotlb too? Don't see it right now so maybe not but it's easy to add.> - It doesn't support calling a function that does explicit schedule since > SMAP/PAN state is not maintained through schedule() [2] > > [1] https://lwn.net/Articles/736348/ > > [2] https://lkml.org/lkml/2018/11/23/430 > > So calling user_access_begin() all the time when vhost is running seems > pretty dangerous.Yes it requires some rework e.g. to try getting memory with GFP_ATOMIC. We could then do a slow path with GFP_KERNEL if that fails.> For a better batched datacopy, I tend to build not only XDP but also skb in > vhost in the future. > > ThanksSure, why not.> > > > > > > > > > > > > > Packet or AF_XDP benefit from > > > > > > > > > accessing metadata directly, we should do it as well. > > > > > > > > > > > > > > > > > > Thanks