thr3ads.net - Virtualization - [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address [Dec 2018]

If this information is useful, please help other people find it:
Share via:

Michael S. Tsirkin

2018-Dec-26 15:02 UTC

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang
wrote:> 
> On 2018/12/25 ??8:50, Michael S. Tsirkin wrote:
> > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > On 2018/12/25 ??2:10, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > On 2018/12/14 ??8:36, Michael S. Tsirkin wrote:
> > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason
Wang wrote:
> > > > > > > On 2018/12/13 ??11:44, Michael S. Tsirkin
wrote:
> > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM
+0800, Jason Wang wrote:
> > > > > > > > > It was noticed that the copy_user()
friends that was used to access
> > > > > > > > > virtqueue metdata tends to be very
expensive for dataplane
> > > > > > > > > implementation like vhost since it
involves lots of software check,
> > > > > > > > > speculation barrier, hardware
feature toggling (e.g SMAP). The
> > > > > > > > > extra cost will be more obvious
when transferring small packets.
> > > > > > > > > 
> > > > > > > > > This patch tries to eliminate those
overhead by pin vq metadata pages
> > > > > > > > > and access them through vmap().
During SET_VRING_ADDR, we will setup
> > > > > > > > > those mappings and memory accessors
are modified to use pointers to
> > > > > > > > > access the metadata directly.
> > > > > > > > > 
> > > > > > > > > Note, this was only done when
device IOTLB is not enabled. We could
> > > > > > > > > use similar method to optimize it
in the future.
> > > > > > > > > 
> > > > > > > > > Tests shows about ~24% improvement
on TX PPS when using virtio-user +
> > > > > > > > > vhost_net + xdp1 on TAP
(CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > 
> > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jason
Wang<jasowang at redhat.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/vhost/vhost.c | 178
++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >      drivers/vhost/vhost.h |  11
+++
> > > > > > > > >      2 files changed, 189
insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/vhost/vhost.c
b/drivers/vhost/vhost.c
> > > > > > > > > index bafe39d2e637..1bd24203afb6
100644
> > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > @@ -443,6 +443,9 @@ void
vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > >      		vq->indirect = NULL;
> > > > > > > > >      		vq->heads = NULL;
> > > > > > > > >      		vq->dev = dev;
> > > > > > > > > +		memset(&vq->avail_ring,
0, sizeof(vq->avail_ring));
> > > > > > > > > +		memset(&vq->used_ring, 0,
sizeof(vq->used_ring));
> > > > > > > > > +		memset(&vq->desc_ring, 0,
sizeof(vq->desc_ring));
> > > > > > > > >      	
mutex_init(&vq->mutex);
> > > > > > > > >      		vhost_vq_reset(dev, vq);
> > > > > > > > >      		if (vq->handle_kick)
> > > > > > > > > @@ -614,6 +617,102 @@ static void
vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > >      
spin_unlock(&dev->iotlb_lock);
> > > > > > > > >      }
> > > > > > > > > +static int vhost_init_vmap(struct
vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > +			   size_t size, int write)
> > > > > > > > > +{
> > > > > > > > > +	struct page **pages;
> > > > > > > > > +	int npages = DIV_ROUND_UP(size,
PAGE_SIZE);
> > > > > > > > > +	int npinned;
> > > > > > > > > +	void *vaddr;
> > > > > > > > > +
> > > > > > > > > +	pages = kmalloc_array(npages,
sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > +	if (!pages)
> > > > > > > > > +		return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +	npinned =
get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > +	if (npinned != npages)
> > > > > > > > > +		goto err;
> > > > > > > > > +
> > > > > > > > As I said I have doubts about the whole
approach, but this
> > > > > > > > implementation in particular isn't a
good idea
> > > > > > > > as it keeps the page around forever.
> > > > > The pages wil be released during set features.
> > > > > 
> > > > > 
> > > > > > > > So no THP, no NUMA rebalancing,
> > > > > For THP, we will probably miss 2 or 4 pages, but does
this really matter
> > > > > consider the gain we have?
> > > > We as in vhost? networking isn't the only thing guest
does.
> > > > We don't even know if this guest does a lot of
networking.
> > > > You don't
> > > > know what else is in this huge page. Can be something very
important
> > > > that guest touches all the time.
> > > 
> > > Well, the probability should be very small consider we usually
give several
> > > gigabytes to guest. The rest of the pages that doesn't sit in
the same
> > > hugepage with metadata can still be merged by THP.? Anyway, I can
test the
> > > differences.
> > Thanks!
> > 
> > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > it can helps for the case of IPC (vhost). It looks to
me the worst case it
> > > > > may cause page to be thrash between nodes if vhost and
userspace are running
> > > > > in two nodes.
> > > > So again it's a gain for vhost but has a completely
unpredictable effect on
> > > > other functionality of the guest.
> > > > 
> > > > That's what bothers me with this approach.
> > > 
> > > So:
> > > 
> > > - The rest of the pages could still be balanced to other nodes,
no?
> > > 
> > > - try to balance metadata pages (belongs to co-operate processes)
itself is
> > > still questionable
> > I am not sure why. It should be easy enough to force the VCPU and
vhost
> > to move (e.g. start them pinned to 1 cpu, then pin them to another
one).
> > Clearly sometimes this would be necessary for load balancing reasons.
> 
> 
> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> manual pinning.
... of memory. Yes.

> 
> > With autonuma after a while (could take seconds but it will happen)
the
> > memory will migrate.
> > 
> 
> Yes. As you mentioned during the discuss, I wonder we could do it similarly
> through mmu notifier like APIC access page in commit c24ae0dcd3e
("kvm: x86:
> Unpin and remove kvm_arch->apic_access_page")
That would be a possible approach.
> 
> > 
> > 
> > > > 
> > > > 
> > > > 
> > > > > > > This is the price of all GUP users not only
vhost itself.
> > > > > > Yes. GUP is just not a great interface for vhost
to use.
> > > > > Zerocopy codes (enabled by defualt) use them for years.
> > > > But only for TX and temporarily. We pin, read, unpin.
> > > 
> > > Probably not. For several reasons that the page will be not be
released soon
> > > or held for a very long period of time or even forever.
> > 
> > With zero copy? Well it's pinned until transmit. Takes a while
> > but could be enough for autocopy to work esp since
> > its the packet memory so not reused immediately.
> > 
> > > > Your patch is different
> > > > 
> > > > - it writes into memory and GUP has known issues with file
> > > >     backed memory
> > > 
> > > The ordinary user for vhost is anonymous pages I think?
> > 
> > It's not the most common scenario and not the fastest one
> > (e.g. THP does not work) but file backed is useful sometimes.
> > It would not be nice at all to corrupt guest memory in that case.
> 
> 
> Ok.
> 
> 
> > 
> > > > - it keeps pages pinned forever
> > > > 
> > > > 
> > > > 
> > > > > > > What's more
> > > > > > > important, the goal is not to be left too
much behind for other backends
> > > > > > > like DPDK or AF_XDP (all of which are using
GUP).
> > > > > > So these guys assume userspace knows what it's
doing.
> > > > > > We can't assume that.
> > > > > What kind of assumption do you they have?
> > > > > 
> > > > > 
> > > > > > > > userspace-controlled
> > > > > > > > amount of memory locked up and not
accounted for.
> > > > > > > It's pretty easy to add this since the
slow path was still kept. If we
> > > > > > > exceeds the limitation, we can switch back to
slow path.
> > > > > > > 
> > > > > > > > Don't get me wrong it's a great
patch in an ideal world.
> > > > > > > > But then in an ideal world no barriers
smap etc are necessary at all.
> > > > > > > Again, this is only for metadata accessing
not the data which has been used
> > > > > > > for years for real use cases.
> > > > > > > 
> > > > > > > For SMAP, it makes senses for the address
that kernel can not forcast. But
> > > > > > > it's not the case for the vhost metadata
since we know the address will be
> > > > > > > accessed very frequently. For speculation
barrier, it helps nothing for the
> > > > > > > data path of vhost which is a kthread.
> > > > > > I don't see how a kthread makes any
difference. We do have a validation
> > > > > > step which makes some difference.
> > > > > The problem is not kthread but the address of userspace
address. The
> > > > > addresses of vq metadata tends to be consistent for a
while, and vhost knows
> > > > > they will be frequently. SMAP doesn't help too much
in this case.
> > > > > 
> > > > > Thanks.
> > > > It's true for a real life applications but a malicious
one
> > > > can call the setup ioctls any number of times. And SMAP is
> > > > all about malcious applications.
> > > 
> > > We don't do this in the path of ioctl, there's no context
switch between
> > > userspace and kernel in the worker thread. SMAP is used to
prevent kernel
> > > from accessing userspace pages unexpectedly which is not the case
for
> > > metadata access.
> > > 
> > > Thanks
> > OK let's forget smap for now.
> 
> 
> Some numbers I measured:
> 
> On an old Sandy bridge machine without SMAP support. Remove speculation
> barrier boost the performance from 4.6Mpps to 5.1Mpps
> 
> On a newer Broadwell machine with SMAP support. Remove speculation barrier
> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> only bypass SMAP for metadata).
> 
> So it looks like for recent machine, SMAP becomes pain point when the copy
> is short (e.g 64B) for high PPS.
> 
> Thanks
Thanks a lot for looking into this!

So first of all users can just boot with nosmap, right?
What's wrong with that? Yes it's not fine-grained but OTOH
it's easy to understand.

And I guess this confirms that if we are going to worry
about smap enabled, we need to look into packet copies
too, not just meta-data.

Vaguely could see a module option (off by default)
where vhost basically does user_access_begin
when it starts running, then uses unsafe accesses
in vhost and tun and then user_access_end.

> 
> > 
> > > > > > > Packet or AF_XDP benefit from
> > > > > > > accessing metadata directly, we should do it
as well.
> > > > > > > 
> > > > > > > Thanks

Jason Wang

2018-Dec-27 09:39 UTC

head link

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

On 2018/12/26 ??11:02, Michael S. Tsirkin wrote:> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>> On 2018/12/25 ??8:50, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 ??2:10, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 ??8:36, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason
Wang wrote:
>>>>>>>> On 2018/12/13 ??11:44, Michael S. Tsirkin
wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800,
Jason Wang wrote:
>>>>>>>>>> It was noticed that the copy_user()
friends that was used to access
>>>>>>>>>> virtqueue metdata tends to be very
expensive for dataplane
>>>>>>>>>> implementation like vhost since it
involves lots of software check,
>>>>>>>>>> speculation barrier, hardware feature
toggling (e.g SMAP). The
>>>>>>>>>> extra cost will be more obvious when
transferring small packets.
>>>>>>>>>>
>>>>>>>>>> This patch tries to eliminate those
overhead by pin vq metadata pages
>>>>>>>>>> and access them through vmap(). During
SET_VRING_ADDR, we will setup
>>>>>>>>>> those mappings and memory accessors are
modified to use pointers to
>>>>>>>>>> access the metadata directly.
>>>>>>>>>>
>>>>>>>>>> Note, this was only done when device
IOTLB is not enabled. We could
>>>>>>>>>> use similar method to optimize it in
the future.
>>>>>>>>>>
>>>>>>>>>> Tests shows about ~24% improvement on
TX PPS when using virtio-user +
>>>>>>>>>> vhost_net + xdp1 on TAP
(CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>
>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jason Wang<jasowang
at redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/vhost/vhost.c | 178
++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>       2 files changed, 189
insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vhost/vhost.c
b/drivers/vhost/vhost.c
>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>> @@ -443,6 +443,9 @@ void
vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>       		vq->indirect = NULL;
>>>>>>>>>>       		vq->heads = NULL;
>>>>>>>>>>       		vq->dev = dev;
>>>>>>>>>> +		memset(&vq->avail_ring, 0,
sizeof(vq->avail_ring));
>>>>>>>>>> +		memset(&vq->used_ring, 0,
sizeof(vq->used_ring));
>>>>>>>>>> +		memset(&vq->desc_ring, 0,
sizeof(vq->desc_ring));
>>>>>>>>>>       		mutex_init(&vq->mutex);
>>>>>>>>>>       		vhost_vq_reset(dev, vq);
>>>>>>>>>>       		if (vq->handle_kick)
>>>>>>>>>> @@ -614,6 +617,102 @@ static void
vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>       
spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>       }
>>>>>>>>>> +static int vhost_init_vmap(struct
vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>> +{
>>>>>>>>>> +	struct page **pages;
>>>>>>>>>> +	int npages = DIV_ROUND_UP(size,
PAGE_SIZE);
>>>>>>>>>> +	int npinned;
>>>>>>>>>> +	void *vaddr;
>>>>>>>>>> +
>>>>>>>>>> +	pages = kmalloc_array(npages,
sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> +	if (!pages)
>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr,
npages, write, pages);
>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>> +		goto err;
>>>>>>>>>> +
>>>>>>>>> As I said I have doubts about the whole
approach, but this
>>>>>>>>> implementation in particular isn't a
good idea
>>>>>>>>> as it keeps the page around forever.
>>>>>> The pages wil be released during set features.
>>>>>>
>>>>>>
>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>> For THP, we will probably miss 2 or 4 pages, but does
this really matter
>>>>>> consider the gain we have?
>>>>> We as in vhost? networking isn't the only thing guest
does.
>>>>> We don't even know if this guest does a lot of
networking.
>>>>> You don't
>>>>> know what else is in this huge page. Can be something very
important
>>>>> that guest touches all the time.
>>>> Well, the probability should be very small consider we usually
give several
>>>> gigabytes to guest. The rest of the pages that doesn't sit
in the same
>>>> hugepage with metadata can still be merged by THP.? Anyway, I
can test the
>>>> differences.
>>> Thanks!
>>>
>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>> it can helps for the case of IPC (vhost). It looks to
me the worst case it
>>>>>> may cause page to be thrash between nodes if vhost and
userspace are running
>>>>>> in two nodes.
>>>>> So again it's a gain for vhost but has a completely
unpredictable effect on
>>>>> other functionality of the guest.
>>>>>
>>>>> That's what bothers me with this approach.
>>>> So:
>>>>
>>>> - The rest of the pages could still be balanced to other nodes,
no?
>>>>
>>>> - try to balance metadata pages (belongs to co-operate
processes) itself is
>>>> still questionable
>>> I am not sure why. It should be easy enough to force the VCPU and
vhost
>>> to move (e.g. start them pinned to 1 cpu, then pin them to another
one).
>>> Clearly sometimes this would be necessary for load balancing
reasons.
>>
>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>> manual pinning.
> ... of memory. Yes.
>
>
>>> With autonuma after a while (could take seconds but it will happen)
the
>>> memory will migrate.
>>>
>> Yes. As you mentioned during the discuss, I wonder we could do it
similarly
>> through mmu notifier like APIC access page in commit c24ae0dcd3e
("kvm: x86:
>> Unpin and remove kvm_arch->apic_access_page")
> That would be a possible approach.

Yes, this looks possible, and the conversion seems not hard. Let me have 
a try with this.


[...]

>>>>>>> I don't see how a kthread makes any difference.
We do have a validation
>>>>>>> step which makes some difference.
>>>>>> The problem is not kthread but the address of userspace
address. The
>>>>>> addresses of vq metadata tends to be consistent for a
while, and vhost knows
>>>>>> they will be frequently. SMAP doesn't help too much
in this case.
>>>>>>
>>>>>> Thanks.
>>>>> It's true for a real life applications but a malicious
one
>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>> all about malcious applications.
>>>> We don't do this in the path of ioctl, there's no
context switch between
>>>> userspace and kernel in the worker thread. SMAP is used to
prevent kernel
>>>> from accessing userspace pages unexpectedly which is not the
case for
>>>> metadata access.
>>>>
>>>> Thanks
>>> OK let's forget smap for now.
>>
>> Some numbers I measured:
>>
>> On an old Sandy bridge machine without SMAP support. Remove speculation
>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>
>> On a newer Broadwell machine with SMAP support. Remove speculation
barrier
>> only gives 2%-5% improvement, disable SMAP completely through Kconfig
boost
>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps,
it
>> only bypass SMAP for metadata).
>>
>> So it looks like for recent machine, SMAP becomes pain point when the
copy
>> is short (e.g 64B) for high PPS.
>>
>> Thanks
> Thanks a lot for looking into this!
>
> So first of all users can just boot with nosmap, right?
> What's wrong with that?

Nothing wrong, just realize we had this kernel parameter.

>   Yes it's not fine-grained but OTOH
> it's easy to understand.
>
> And I guess this confirms that if we are going to worry
> about smap enabled, we need to look into packet copies
> too, not just meta-data.

For packet copies, we can do batch copy which is pretty simple for the 
case of XDP. I've already had patches for this.

>
> Vaguely could see a module option (off by default)
> where vhost basically does user_access_begin
> when it starts running, then uses unsafe accesses
> in vhost and tun and then user_access_end.

Using user_access_begin() is more tricky than imaged. E.g it requires:

- userspace address to be validated before through access_ok() [1]

- It doesn't support calling a function that does explicit schedule 
since SMAP/PAN state is not maintained through schedule() [2]

[1] https://lwn.net/Articles/736348/

[2] https://lkml.org/lkml/2018/11/23/430

So calling user_access_begin() all the time when vhost is running seems 
pretty dangerous.

For a better batched datacopy, I tend to build not only XDP but also skb 
in vhost in the future.

Thanks

>
>
>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>> accessing metadata directly, we should do it as
well.
>>>>>>>>
>>>>>>>> Thanks

Michael S. Tsirkin

2018-Dec-30 18:30 UTC

head link

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang
wrote:> 
> On 2018/12/26 ??11:02, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> > > On 2018/12/25 ??8:50, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > > > On 2018/12/25 ??2:10, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason
Wang wrote:
> > > > > > > On 2018/12/14 ??8:36, Michael S. Tsirkin
wrote:
> > > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM
+0800, Jason Wang wrote:
> > > > > > > > > On 2018/12/13 ??11:44, Michael S.
Tsirkin wrote:
> > > > > > > > > > On Thu, Dec 13, 2018 at
06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > > > It was noticed that the
copy_user() friends that was used to access
> > > > > > > > > > > virtqueue metdata tends
to be very expensive for dataplane
> > > > > > > > > > > implementation like vhost
since it involves lots of software check,
> > > > > > > > > > > speculation barrier,
hardware feature toggling (e.g SMAP). The
> > > > > > > > > > > extra cost will be more
obvious when transferring small packets.
> > > > > > > > > > > 
> > > > > > > > > > > This patch tries to
eliminate those overhead by pin vq metadata pages
> > > > > > > > > > > and access them through
vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > > > those mappings and memory
accessors are modified to use pointers to
> > > > > > > > > > > access the metadata
directly.
> > > > > > > > > > > 
> > > > > > > > > > > Note, this was only done
when device IOTLB is not enabled. We could
> > > > > > > > > > > use similar method to
optimize it in the future.
> > > > > > > > > > > 
> > > > > > > > > > > Tests shows about ~24%
improvement on TX PPS when using virtio-user +
> > > > > > > > > > > vhost_net + xdp1 on TAP
(CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > > > 
> > > > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > > > After:  ~6.1Mpps
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Jason
Wang<jasowang at redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >      
drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >      
drivers/vhost/vhost.h |  11 +++
> > > > > > > > > > >       2 files changed,
189 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git
a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > > index
bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > > > ---
a/drivers/vhost/vhost.c
> > > > > > > > > > > +++
b/drivers/vhost/vhost.c
> > > > > > > > > > > @@ -443,6 +443,9 @@ void
vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > > > >       		vq->indirect =
NULL;
> > > > > > > > > > >       		vq->heads =
NULL;
> > > > > > > > > > >       		vq->dev = dev;
> > > > > > > > > > > +	
memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > > > +	
memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > > > +	
memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > > > >       	
mutex_init(&vq->mutex);
> > > > > > > > > > >       	
vhost_vq_reset(dev, vq);
> > > > > > > > > > >       		if
(vq->handle_kick)
> > > > > > > > > > > @@ -614,6 +617,102 @@
static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > > > >       
spin_unlock(&dev->iotlb_lock);
> > > > > > > > > > >       }
> > > > > > > > > > > +static int
vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > > > +			   size_t size, int
write)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct page **pages;
> > > > > > > > > > > +	int npages =
DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > > > +	int npinned;
> > > > > > > > > > > +	void *vaddr;
> > > > > > > > > > > +
> > > > > > > > > > > +	pages =
kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > > > +	if (!pages)
> > > > > > > > > > > +		return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +	npinned =
get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > > > +	if (npinned != npages)
> > > > > > > > > > > +		goto err;
> > > > > > > > > > > +
> > > > > > > > > > As I said I have doubts about
the whole approach, but this
> > > > > > > > > > implementation in particular
isn't a good idea
> > > > > > > > > > as it keeps the page around
forever.
> > > > > > > The pages wil be released during set
features.
> > > > > > > 
> > > > > > > 
> > > > > > > > > > So no THP, no NUMA
rebalancing,
> > > > > > > For THP, we will probably miss 2 or 4 pages,
but does this really matter
> > > > > > > consider the gain we have?
> > > > > > We as in vhost? networking isn't the only
thing guest does.
> > > > > > We don't even know if this guest does a lot of
networking.
> > > > > > You don't
> > > > > > know what else is in this huge page. Can be
something very important
> > > > > > that guest touches all the time.
> > > > > Well, the probability should be very small consider we
usually give several
> > > > > gigabytes to guest. The rest of the pages that
doesn't sit in the same
> > > > > hugepage with metadata can still be merged by THP.?
Anyway, I can test the
> > > > > differences.
> > > > Thanks!
> > > > 
> > > > > > > For NUMA rebalancing, I'm even not quite
sure if
> > > > > > > it can helps for the case of IPC (vhost). It
looks to me the worst case it
> > > > > > > may cause page to be thrash between nodes if
vhost and userspace are running
> > > > > > > in two nodes.
> > > > > > So again it's a gain for vhost but has a
completely unpredictable effect on
> > > > > > other functionality of the guest.
> > > > > > 
> > > > > > That's what bothers me with this approach.
> > > > > So:
> > > > > 
> > > > > - The rest of the pages could still be balanced to
other nodes, no?
> > > > > 
> > > > > - try to balance metadata pages (belongs to co-operate
processes) itself is
> > > > > still questionable
> > > > I am not sure why. It should be easy enough to force the
VCPU and vhost
> > > > to move (e.g. start them pinned to 1 cpu, then pin them to
another one).
> > > > Clearly sometimes this would be necessary for load balancing
reasons.
> > > 
> > > Yes, but it looks to me the part of motivation of auto NUMA is to
avoid
> > > manual pinning.
> > ... of memory. Yes.
> > 
> > 
> > > > With autonuma after a while (could take seconds but it will
happen) the
> > > > memory will migrate.
> > > > 
> > > Yes. As you mentioned during the discuss, I wonder we could do it
similarly
> > > through mmu notifier like APIC access page in commit c24ae0dcd3e
("kvm: x86:
> > > Unpin and remove kvm_arch->apic_access_page")
> > That would be a possible approach.
> 
> 
> Yes, this looks possible, and the conversion seems not hard. Let me have a
> try with this.
> 
> 
> [...]
> 
> 
> > > > > > > > I don't see how a kthread makes any
difference. We do have a validation
> > > > > > > > step which makes some difference.
> > > > > > > The problem is not kthread but the address of
userspace address. The
> > > > > > > addresses of vq metadata tends to be
consistent for a while, and vhost knows
> > > > > > > they will be frequently. SMAP doesn't
help too much in this case.
> > > > > > > 
> > > > > > > Thanks.
> > > > > > It's true for a real life applications but a
malicious one
> > > > > > can call the setup ioctls any number of times. And
SMAP is
> > > > > > all about malcious applications.
> > > > > We don't do this in the path of ioctl, there's
no context switch between
> > > > > userspace and kernel in the worker thread. SMAP is used
to prevent kernel
> > > > > from accessing userspace pages unexpectedly which is
not the case for
> > > > > metadata access.
> > > > > 
> > > > > Thanks
> > > > OK let's forget smap for now.
> > > 
> > > Some numbers I measured:
> > > 
> > > On an old Sandy bridge machine without SMAP support. Remove
speculation
> > > barrier boost the performance from 4.6Mpps to 5.1Mpps
> > > 
> > > On a newer Broadwell machine with SMAP support. Remove
speculation barrier
> > > only gives 2%-5% improvement, disable SMAP completely through
Kconfig boost
> > > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps -
6.1Mpps, it
> > > only bypass SMAP for metadata).
> > > 
> > > So it looks like for recent machine, SMAP becomes pain point when
the copy
> > > is short (e.g 64B) for high PPS.
> > > 
> > > Thanks
> > Thanks a lot for looking into this!
> > 
> > So first of all users can just boot with nosmap, right?
> > What's wrong with that?
> 
> 
> Nothing wrong, just realize we had this kernel parameter.
> 
> 
> >   Yes it's not fine-grained but OTOH
> > it's easy to understand.
> > 
> > And I guess this confirms that if we are going to worry
> > about smap enabled, we need to look into packet copies
> > too, not just meta-data.
> 
> 
> For packet copies, we can do batch copy which is pretty simple for the case
> of XDP. I've already had patches for this.
> 
> 
> > 
> > Vaguely could see a module option (off by default)
> > where vhost basically does user_access_begin
> > when it starts running, then uses unsafe accesses
> > in vhost and tun and then user_access_end.
> 
> 
> Using user_access_begin() is more tricky than imaged. E.g it requires:
> 
> - userspace address to be validated before through access_ok() [1]
This part is fine I think - addresses come from the memory
map and when userspace supplies the memory map
we validate everything with access_ok.
Well do we validate with the iotlb too? Don't see it right now
so maybe not but it's easy to add.
> - It doesn't support calling a function that does explicit schedule
since
> SMAP/PAN state is not maintained through schedule() [2]
> 
> [1] https://lwn.net/Articles/736348/
> 
> [2] https://lkml.org/lkml/2018/11/23/430
> 
> So calling user_access_begin() all the time when vhost is running seems
> pretty dangerous.
Yes it requires some rework e.g. to try getting memory with
GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
if that fails.
> For a better batched datacopy, I tend to build not only XDP but also skb in
> vhost in the future.
> 
> Thanks
Sure, why not.
> 
> > 
> > 
> > > > > > > > > Packet or AF_XDP benefit from
> > > > > > > > > accessing metadata directly, we
should do it as well.
> > > > > > > > > 
> > > > > > > > > Thanks

Virtualization - Dec 2018 - [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

[PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address