thr3ads.net - Linux Virtualization - [PATCH net-next 0/3] vhost: accelerate metadata access through vmap() [Dec 2018]

Jason Wang

2018-Dec-14 03:42 UTC

[PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

On 2018/12/13 11:27 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
> Userspace accesses through remapping tricks and next time there's a need
> for a new barrier we are left to figure it out by ourselves.

I don't follow. Do you mean speculation barriers? They are completely
unnecessary for vhost, which is a kernel thread. And even if you're right,
vhost is not the only place: there is a lot of vmap()-based access in the
kernel. Looking at it from another direction, this means kthreads like
vhost won't suffer from unnecessary barriers in the future; we will
manually pick the ones we really need (though that should rarely be
necessary).

Please note that we only access metadata through remapping, not the data
itself. This idea has been used by high-speed userspace backends for
years, e.g. packet sockets or the more recent AF_XDP. The only difference
is that there the pages are remapped from kernel to userspace.
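
To make the cost argument concrete, here is a minimal userspace sketch (not vhost code) that contrasts validating every access, in the style of copy_from_user(), with validating a range once and then reading through a direct pointer, in the style of a vmap()ed metadata page. `struct region`, `range_ok`, `checked_read16`, `map_range` and `mapped_read16` are hypothetical names invented for this illustration. The real kernel paths also involve speculation barriers and hardware feature toggling, which this model does not attempt to reproduce; it only counts range checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a range of guest memory; not a vhost structure. */
struct region {
    uint8_t *base;
    size_t size;
    unsigned long checks; /* number of range validations performed */
};

/* Validate [off, off + len) against the region, counting each validation. */
int range_ok(struct region *r, size_t off, size_t len)
{
    r->checks++;
    return off <= r->size && len <= r->size - off;
}

/* copy_from_user()-like path: validate and copy on every single access. */
int checked_read16(struct region *r, size_t off, uint16_t *out)
{
    if (!range_ok(r, off, sizeof(*out)))
        return -1;
    memcpy(out, r->base + off, sizeof(*out));
    return 0;
}

/* vmap()-like path: validate the whole range once, then hand out a pointer. */
const uint8_t *map_range(struct region *r, size_t off, size_t len)
{
    return range_ok(r, off, len) ? r->base + off : NULL;
}

/* Read the i-th 16-bit entry through the mapping with no further checks. */
uint16_t mapped_read16(const uint8_t *map, size_t i)
{
    uint16_t v;

    assert(map != NULL);
    memcpy(&v, map + i * sizeof(v), sizeof(v));
    return v;
}
```

Reading four entries with `checked_read16` performs four validations, while `map_range` followed by four `mapped_read16` calls performs one. The trade-off discussed in this thread is that the mapped path also skips whatever hardening the checked path gains in the future.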

>    I don't
> like the idea I have to say.  As a first step, why don't we switch to
> unsafe_put_user/unsafe_get_user etc?

Several reasons:

- They only have an x86 variant, so it won't make any difference on other
architectures.

- unsafe_put_user/unsafe_get_user are not sufficient for accessing
structures (e.g. a descriptor) or arrays (batching).

- Unless we can batch accesses to at least two of the three areas (avail,
used and descriptor) in one run, there will be no difference. E.g. we can
batch updates to the used ring, but that alone won't make any difference in
this case.
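
The points about structures and arrays can be illustrated with the split-virtqueue metadata layout. The following is a userspace sketch, not vhost code: the struct fields follow the virtio split-ring layout, but endianness handling is ignored, `QUEUE_SIZE` is a hypothetical fixed ring size (real rings are sized at runtime), and `desc_read` and `used_add_n` are names made up for this example. It shows that a descriptor is a 16-byte structure fetched as a whole, and that batched completions are an array write followed by one index update.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 4 /* hypothetical power-of-two ring size for this sketch */

/* One descriptor: 16 bytes, so a single-field accessor cannot fetch it whole. */
struct vring_desc {
    uint64_t addr;  /* buffer address */
    uint32_t len;   /* buffer length */
    uint16_t flags;
    uint16_t next;  /* index of the chained descriptor, if any */
};

struct vring_used_elem {
    uint32_t id;  /* head index of the completed descriptor chain */
    uint32_t len; /* bytes written into the buffers */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx; /* free-running producer index */
    struct vring_used_elem ring[QUEUE_SIZE];
};

/* Whole-structure read: one plain struct copy through a direct pointer. */
struct vring_desc desc_read(const struct vring_desc *table, uint16_t i)
{
    assert(i < QUEUE_SIZE);
    return table[i];
}

/* Batched array write: publish n completions, wrapping at the ring end. */
void used_add_n(struct vring_used *used, const struct vring_used_elem *heads,
                unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        used->ring[(used->idx + i) & (QUEUE_SIZE - 1)] = heads[i];
    /* A real implementation needs a write barrier before updating idx. */
    used->idx = (uint16_t)(used->idx + n);
}
```

With a direct pointer both operations are ordinary memory accesses. With per-field user accessors, each descriptor field and each used element would be a separate checked access.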

> That would be more of an apples to apples comparison, would it not?

An apples-to-apples comparison only helps if we are No. 1, but the fact is
we are not. If we want to compete with e.g. DPDK or AF_XDP, vmap() is the
fastest method AFAIK.

Thanks

>
>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
>>
>> Jason Wang (3):
>>    vhost: generalize adding used elem
>>    vhost: fine grain userspace memory accessors
>>    vhost: access vq metadata through kernel virtual address
>>
>>   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>   drivers/vhost/vhost.h |  11 ++
>>   2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> -- 
>> 2.17.1

Michael S. Tsirkin

2018-Dec-14 12:33 UTC

[PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
> 
> On 2018/12/13 11:27 PM, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends since they had too much
> > > overheads like checks, spec barriers or even hardware feature
> > > toggling.
> > Userspace accesses through remapping tricks and next time there's a need
> > for a new barrier we are left to figure it out by ourselves.
> 
> 
> I don't get here, do you mean spec barriers?
I mean the next barrier people decide to put into userspace
memory accesses.
> It's completely unnecessary for
> vhost which is kernel thread.
It's defence in depth. Take a look at the commit that added them.
And yes quite possibly in most cases we actually have a spec
barrier in the validation phase. If we do let's use the
unsafe variants so they can be found.
> And even if you're right, vhost is not the
> only place, there's lots of vmap() based accessing in kernel.
For sure. But if one can get by without get_user_pages(), one
really should. Witness the recently uncovered mess with
file-backed storage.
> Think in
> another direction, this means we won't suffer from unnecessary barriers for
> kthread like vhost in the future, we will manually pick the one we really
> need
I personally think we should err on the side of caution, not on the side of
performance.
> (but it should have little possibility).
History seems to teach otherwise.
> Please notice we only access metadata through remapping not the data itself.
> This idea has been used for high speed userspace backend for years, e.g
> packet socket or recent AF_XDP.
I think their justification for the higher risk is that they are mostly
designed for privileged userspace.
> The only difference is the page was remap to
> from kernel to userspace.
At least that avoids the g.u.p mess.
> 
> >    I don't
> > like the idea I have to say.  As a first step, why don't we switch to
> > unsafe_put_user/unsafe_get_user etc?
> 
> 
> Several reasons:
> 
> - They only have x86 variant, it won't have any difference for the rest of
> architecture.
Is there an issue on other architectures? If yes they can be extended
there.
> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> (e.g accessing descriptor) or arrays (batching).
So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
> - Unless we can batch at least the accessing of two places in three of
> avail, used and descriptor in one run. There will be no difference. E.g we
> can batch updating used ring, but it won't make any difference in this case.
> 
So let's batch them all?

> > That would be more of an apples to apples comparison, would it not?
> 
> 
> Apples to apples comparison only help if we are the No.1. But the fact is we
> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
> fastest method AFAIK.
> 
> 
> Thanks
We need to speed up the packet access itself too though.
You can't vmap all of guest memory.

> 
> > 
> > 
> > > Test shows about 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > > 
> > > Please review
> > > 
> > > Jason Wang (3):
> > >    vhost: generalize adding used elem
> > >    vhost: fine grain userspace memory accessors
> > >    vhost: access vq metadata through kernel virtual address
> > > 
> > >   drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >   drivers/vhost/vhost.h |  11 ++
> > >   2 files changed, 266 insertions(+), 26 deletions(-)
> > > 
> > > -- 
> > > 2.17.1

Michael S. Tsirkin

2018-Dec-14 15:31 UTC

[PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

On Fri, Dec 14, 2018 at 07:33:20AM -0500, Michael S. Tsirkin wrote:
> > - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
> > (e.g accessing descriptor) or arrays (batching).
> 
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
Like this basically? Warning: completely untested.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>


diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index b5e58cc0c5e7..d2afd70793ca 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -728,5 +728,10 @@ do {										\
 	if (unlikely(__gu_err)) goto err_label;					\
 } while (0)
 
+#define unsafe_copy_from_user(to, from, n)					\
+	 __copy_user_ll(to, (__force const void *)from, n)
+#define unsafe_copy_to_user(to, from, n)					\
+	 __copy_user_ll((__force void *)to, from, n)
+
 #endif /* _ASM_X86_UACCESS_H */
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index efe79c1cdd47..b9d515ba2920 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -271,6 +271,8 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 #define user_access_end() do { } while (0)
 #define unsafe_get_user(x, ptr, err) do { if (unlikely(__get_user(x, ptr))) goto err; } while (0)
 #define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
+#define unsafe_copy_to_user(to, from, n) copy_to_user(to, from, n)
+#define unsafe_copy_from_user(to, from, n) copy_from_user(to, from, n)
 #endif
 
 #ifdef CONFIG_HARDENED_USERCOPY

Jason Wang

2018-Dec-24 08:32 UTC

[PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

On 2018/12/14 8:33 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 14, 2018 at 11:42:18AM +0800, Jason Wang wrote:
>> On 2018/12/13 11:27 PM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to access virtqueue metadata through kernel virtual
>>>> address instead of copy_user() friends since they had too much
>>>> overheads like checks, spec barriers or even hardware feature
>>>> toggling.
>>> Userspace accesses through remapping tricks and next time there's a need
>>> for a new barrier we are left to figure it out by ourselves.
>>
>> I don't get here, do you mean spec barriers?
> I mean the next barrier people decide to put into userspace
> memory accesses.
>
>> It's completely unnecessary for
>> vhost which is kernel thread.
> It's defence in depth. Take a look at the commit that added them.
> And yes quite possibly in most cases we actually have a spec
> barrier in the validation phase. If we do let's use the
> unsafe variants so they can be found.

The unsafe variants only help if you can batch userspace accesses, which
is not necessarily the case under light load.
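
As a rough way to see why batching only pays off under load, consider a toy cost model in which opening and closing a user-access section (as user_access_begin()/user_access_end() do around the unsafe accessors) has a fixed cost shared by every access in the batch. The struct, the function names and the numbers used with them are hypothetical and in arbitrary units; they are not measurements from this series.

```c
#include <assert.h>

/* Hypothetical per-operation costs in arbitrary units; not measurements. */
struct cost_model {
    unsigned long toggle; /* open + close of one user-access section */
    unsigned long access; /* one load or store inside that section */
};

/* Unbatched: every access pays for its own open/close. */
unsigned long cost_unbatched(const struct cost_model *m, unsigned long n)
{
    return n * (m->toggle + m->access);
}

/* Batched: one open/close amortized over n accesses (n >= 1). */
unsigned long cost_batched(const struct cost_model *m, unsigned long n)
{
    assert(n >= 1);
    return m->toggle + n * m->access;
}
```

With a single access per section, as under light load, both functions return the same cost. The saving appears only as n grows, and the model says nothing about the bookkeeping cost of batching itself.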

>
>> And even if you're right, vhost is not the
>> only place, there's lots of vmap() based accessing in kernel.
> For sure. But if one can get by without get user pages, one
> really should. Witness recently uncovered mess with file
> backed storage.

We only pin metadata pages; I don't believe they will be used for any DMA.

>
>> Think in
>> another direction, this means we won't suffer from unnecessary barriers for
>> kthread like vhost in the future, we will manually pick the one we really
>> need
> I personally think we should err on the side of caution not on the side of
> performance.

So what you suggest may lead to an unnecessary performance regression
(10%-20%), and avoiding that is part of the goal of this series. We should
audit and only use the barriers we really need instead of depending on
copy_user() and friends.

If we do it on our own, we may be slower to pick up security fixes, but it
is no less safe than before, and performance is kept.

>
>> (but it should have little possibility).
> History seems to teach otherwise.

What case did you mean here?

>
>> Please notice we only access metadata through remapping not the data itself.
>> This idea has been used for high speed userspace backend for years, e.g
>> packet socket or recent AF_XDP.
> I think their justification for the higher risk is that they are mostly
> designed for privileged userspace.

I think it's the same with TUN/TAP: a privileged process can pass them to
unprivileged ones.

>
>> The only difference is the page was remap to
>> from kernel to userspace.
> At least that avoids the g.u.p mess.

I'm still not clear on this point. We only pin 2 or 4 pages; there are
several other cases that pin many more.

>
>>>     I don't
>>> like the idea I have to say.  As a first step, why don't we switch to
>>> unsafe_put_user/unsafe_get_user etc?
>>
>> Several reasons:
>>
>> - They only have x86 variant, it won't have any difference for the rest of
>> architecture.
> Is there an issue on other architectures? If yes they can be extended
> there.

Consider the unexpected amount of work, and that in the best case it can
only match the performance of vmap(). I'm not sure it's worth it.

>
>> - unsafe_put_user/unsafe_get_user is not sufficient for accessing structures
>> (e.g accessing descriptor) or arrays (batching).
> So you want unsafe_copy_xxx_user? I can do this. Hang on will post.
>
>> - Unless we can batch at least the accessing of two places in three of
>> avail, used and descriptor in one run. There will be no difference. E.g we
>> can batch updating used ring, but it won't make any difference in this case.
>>
> So let's batch them all?

Batching might not help under light load, and we need to measure the gain
and cost of batching itself.

>
>
>>> That would be more of an apples to apples comparison, would it not?
>>
>> Apples to apples comparison only help if we are the No.1. But the fact is we
>> are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is the
>> fastest method AFAIK.
>>
>>
>> Thanks
> We need to speed up the packet access itself too though.
> You can't vmap all of guest memory.

This series only pins and vmaps a very small number of pages (the metadata).

Thanks

>
>
>>>
>>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>>> cases as well.
>>>>
>>>> Please review
>>>>
>>>> Jason Wang (3):
>>>>     vhost: generalize adding used elem
>>>>     vhost: fine grain userspace memory accessors
>>>>     vhost: access vq metadata through kernel virtual address
>>>>
>>>>    drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>>>>    drivers/vhost/vhost.h |  11 ++
>>>>    2 files changed, 266 insertions(+), 26 deletions(-)
>>>>
>>>> -- 
>>>> 2.17.1
