Michael S. Tsirkin
2022-Dec-27 14:37 UTC
[RFC PATCH 4/9] vringh: unify the APIs for all accessors
On Tue, Dec 27, 2022 at 07:22:36PM +0900, Shunsuke Mie wrote:> 2022?12?27?(?) 16:49 Shunsuke Mie <mie at igel.co.jp>: > > > > 2022?12?27?(?) 16:04 Michael S. Tsirkin <mst at redhat.com>: > > > > > > On Tue, Dec 27, 2022 at 11:25:26AM +0900, Shunsuke Mie wrote: > > > > Each vringh memory accessors that are for user, kern and iotlb has own > > > > interfaces that calls common code. But some codes are duplicated and that > > > > becomes loss extendability. > > > > > > > > Introduce a struct vringh_ops and provide a common APIs for all accessors. > > > > It can bee easily extended vringh code for new memory accessor and > > > > simplified a caller code. > > > > > > > > Signed-off-by: Shunsuke Mie <mie at igel.co.jp> > > > > --- > > > > drivers/vhost/vringh.c | 667 +++++++++++------------------------------ > > > > include/linux/vringh.h | 100 +++--- > > > > 2 files changed, 225 insertions(+), 542 deletions(-) > > > > > > > > diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c > > > > index aa3cd27d2384..ebfd3644a1a3 100644 > > > > --- a/drivers/vhost/vringh.c > > > > +++ b/drivers/vhost/vringh.c > > > > @@ -35,15 +35,12 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...) > > > > } > > > > > > > > /* Returns vring->num if empty, -ve on error. */ > > > > -static inline int __vringh_get_head(const struct vringh *vrh, > > > > - int (*getu16)(const struct vringh *vrh, > > > > - u16 *val, const __virtio16 *p), > > > > - u16 *last_avail_idx) > > > > +static inline int __vringh_get_head(const struct vringh *vrh, u16 *last_avail_idx) > > > > { > > > > u16 avail_idx, i, head; > > > > int err; > > > > > > > > - err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx); > > > > + err = vrh->ops.getu16(vrh, &avail_idx, &vrh->vring.avail->idx); > > > > if (err) { > > > > vringh_bad("Failed to access avail idx at %p", > > > > &vrh->vring.avail->idx); > > > > > > I like that this patch removes more lines of code than it adds. > > > > > > However one of the design points of vringh abstractions is that they were > > > carefully written to be very low overhead. > > > This is why we are passing function pointers to inline functions - > > > compiler can optimize that out. > > > > > > I think that introducing ops indirect functions calls here is going to break > > > these assumptions and hurt performance. > > > Unless compiler can somehow figure it out and optimize? > > > I don't see how it's possible with ops pointer in memory > > > but maybe I'm wrong. > > I think your concern is correct. I have to understand the compiler > > optimization and redesign this approach If it is needed. > > > Was any effort taken to test effect of these patches on performance? > > I just tested vringh_test and already faced little performance reduction. > > I have to investigate that, as you said. > I attempted to test with perf. I found that the performance of patched code > is almost the same as the upstream one. However, I have to investigate way > this patch leads to this result, also the profiling should be run on > more powerful > machines too. > > environment: > $ grep 'model name' /proc/cpuinfo > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > > results: > * for patched code > Performance counter stats for 'nice -n -20 ./vringh_test_patched > --parallel --eventidx --fast-vringh --indirect --virtio-1' (20 runs): > > 3,028.05 msec task-clock # 0.995 CPUs > utilized ( +- 0.12% ) > 78,150 context-switches # 25.691 K/sec > ( +- 0.00% ) > 5 cpu-migrations # 1.644 /sec > ( +- 3.33% ) > 190 page-faults # 62.461 /sec > ( +- 0.41% ) > 6,919,025,222 cycles # 2.275 GHz > ( +- 0.13% ) > 8,990,220,160 instructions # 1.29 insn per > cycle ( +- 0.04% ) > 1,788,326,786 branches # 587.899 M/sec > ( +- 0.05% ) > 4,557,398 branch-misses # 0.25% of all > branches ( +- 0.43% ) > > 3.04359 +- 0.00378 seconds time elapsed ( +- 0.12% ) > > * for upstream code > Performance counter stats for 'nice -n -20 ./vringh_test_base > --parallel --eventidx --fast-vringh --indirect --virtio-1' (10 runs): > > 3,058.41 msec task-clock # 0.999 CPUs > utilized ( +- 0.14% ) > 78,149 context-switches # 25.545 K/sec > ( +- 0.00% ) > 5 cpu-migrations # 1.634 /sec > ( +- 2.67% ) > 194 page-faults # 63.414 /sec > ( +- 0.43% ) > 6,988,713,963 cycles # 2.284 GHz > ( +- 0.14% ) > 8,512,533,269 instructions # 1.22 insn per > cycle ( +- 0.04% ) > 1,638,375,371 branches # 535.549 M/sec > ( +- 0.05% ) > 4,428,866 branch-misses # 0.27% of all > branches ( +- 22.57% ) > > 3.06085 +- 0.00420 seconds time elapsed ( +- 0.14% )How you compiled it also matters. ATM we don't enable retpolines and it did not matter since we didn't have indirect calls, but we should. Didn't yet investigate how to do that for virtio tools.> > Thank you for your comments. > > > Thanks! > > > > > > > > Best, > > Shunsuke.
Shunsuke Mie
2022-Dec-28 02:24 UTC
[RFC PATCH 4/9] vringh: unify the APIs for all accessors
2022?12?27?(?) 23:37 Michael S. Tsirkin <mst at redhat.com>:> > On Tue, Dec 27, 2022 at 07:22:36PM +0900, Shunsuke Mie wrote: > > 2022?12?27?(?) 16:49 Shunsuke Mie <mie at igel.co.jp>: > > > > > > 2022?12?27?(?) 16:04 Michael S. Tsirkin <mst at redhat.com>: > > > > > > > > On Tue, Dec 27, 2022 at 11:25:26AM +0900, Shunsuke Mie wrote: > > > > > Each vringh memory accessors that are for user, kern and iotlb has own > > > > > interfaces that calls common code. But some codes are duplicated and that > > > > > becomes loss extendability. > > > > > > > > > > Introduce a struct vringh_ops and provide a common APIs for all accessors. > > > > > It can bee easily extended vringh code for new memory accessor and > > > > > simplified a caller code. > > > > > > > > > > Signed-off-by: Shunsuke Mie <mie at igel.co.jp> > > > > > --- > > > > > drivers/vhost/vringh.c | 667 +++++++++++------------------------------ > > > > > include/linux/vringh.h | 100 +++--- > > > > > 2 files changed, 225 insertions(+), 542 deletions(-) > > > > > > > > > > diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c > > > > > index aa3cd27d2384..ebfd3644a1a3 100644 > > > > > --- a/drivers/vhost/vringh.c > > > > > +++ b/drivers/vhost/vringh.c > > > > > @@ -35,15 +35,12 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...) > > > > > } > > > > > > > > > > /* Returns vring->num if empty, -ve on error. */ > > > > > -static inline int __vringh_get_head(const struct vringh *vrh, > > > > > - int (*getu16)(const struct vringh *vrh, > > > > > - u16 *val, const __virtio16 *p), > > > > > - u16 *last_avail_idx) > > > > > +static inline int __vringh_get_head(const struct vringh *vrh, u16 *last_avail_idx) > > > > > { > > > > > u16 avail_idx, i, head; > > > > > int err; > > > > > > > > > > - err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx); > > > > > + err = vrh->ops.getu16(vrh, &avail_idx, &vrh->vring.avail->idx); > > > > > if (err) { > > > > > vringh_bad("Failed to access avail idx at %p", > > > > > &vrh->vring.avail->idx); > > > > > > > > I like that this patch removes more lines of code than it adds. > > > > > > > > However one of the design points of vringh abstractions is that they were > > > > carefully written to be very low overhead. > > > > This is why we are passing function pointers to inline functions - > > > > compiler can optimize that out. > > > > > > > > I think that introducing ops indirect functions calls here is going to break > > > > these assumptions and hurt performance. > > > > Unless compiler can somehow figure it out and optimize? > > > > I don't see how it's possible with ops pointer in memory > > > > but maybe I'm wrong. > > > I think your concern is correct. I have to understand the compiler > > > optimization and redesign this approach If it is needed. > > > > Was any effort taken to test effect of these patches on performance? > > > I just tested vringh_test and already faced little performance reduction. > > > I have to investigate that, as you said. > > I attempted to test with perf. I found that the performance of patched code > > is almost the same as the upstream one. However, I have to investigate way > > this patch leads to this result, also the profiling should be run on > > more powerful > > machines too. > > > > environment: > > $ grep 'model name' /proc/cpuinfo > > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > > model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz > > > > results: > > * for patched code > > Performance counter stats for 'nice -n -20 ./vringh_test_patched > > --parallel --eventidx --fast-vringh --indirect --virtio-1' (20 runs): > > > > 3,028.05 msec task-clock # 0.995 CPUs > > utilized ( +- 0.12% ) > > 78,150 context-switches # 25.691 K/sec > > ( +- 0.00% ) > > 5 cpu-migrations # 1.644 /sec > > ( +- 3.33% ) > > 190 page-faults # 62.461 /sec > > ( +- 0.41% ) > > 6,919,025,222 cycles # 2.275 GHz > > ( +- 0.13% ) > > 8,990,220,160 instructions # 1.29 insn per > > cycle ( +- 0.04% ) > > 1,788,326,786 branches # 587.899 M/sec > > ( +- 0.05% ) > > 4,557,398 branch-misses # 0.25% of all > > branches ( +- 0.43% ) > > > > 3.04359 +- 0.00378 seconds time elapsed ( +- 0.12% ) > > > > * for upstream code > > Performance counter stats for 'nice -n -20 ./vringh_test_base > > --parallel --eventidx --fast-vringh --indirect --virtio-1' (10 runs): > > > > 3,058.41 msec task-clock # 0.999 CPUs > > utilized ( +- 0.14% ) > > 78,149 context-switches # 25.545 K/sec > > ( +- 0.00% ) > > 5 cpu-migrations # 1.634 /sec > > ( +- 2.67% ) > > 194 page-faults # 63.414 /sec > > ( +- 0.43% ) > > 6,988,713,963 cycles # 2.284 GHz > > ( +- 0.14% ) > > 8,512,533,269 instructions # 1.22 insn per > > cycle ( +- 0.04% ) > > 1,638,375,371 branches # 535.549 M/sec > > ( +- 0.05% ) > > 4,428,866 branch-misses # 0.27% of all > > branches ( +- 22.57% ) > > > > 3.06085 +- 0.00420 seconds time elapsed ( +- 0.14% ) > > > How you compiled it also matters. ATM we don't enable retpolines > and it did not matter since we didn't have indirect calls, > but we should. Didn't yet investigate how to do that for virtio tools.I think the retpolines certainly affect performance. Thank you for pointing it out. I'd like to start the investigation that how to apply the retpolines to the virtio tools.> > > Thank you for your comments. > > > > Thanks! > > > > > > > > > > > Best, > > > Shunsuke. >