Venkatesh Srinivas
2015-Nov-18 04:08 UTC
[PATCH] virtio_ring: Shadow available ring flags & index
On Mon, Nov 16, 2015 at 7:46 PM, Xie, Huawei <huawei.xie at intel.com> wrote:

> On 11/14/2015 7:41 AM, Venkatesh Srinivas wrote:
> > On Wed, Nov 11, 2015 at 02:34:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
> >>> Improves cacheline transfer flow of available ring header.
> >>>
> >>> Virtqueues are implemented as a pair of rings, one producer->consumer
> >>> avail ring and one consumer->producer used ring; preceding the
> >>> avail ring in memory are two contiguous u16 fields -- avail->flags
> >>> and avail->idx. A producer posts work by writing to avail->idx and
> >>> a consumer reads avail->idx.
> >>>
> >>> The flags and idx fields only need to be written by a producer CPU
> >>> and only read by a consumer CPU; when the producer and consumer are
> >>> running on different CPUs and the virtio_ring code is structured to
> >>> only have source writes/sink reads, we can continuously transfer the
> >>> avail header cacheline between 'M' states between cores. This flow
> >>> optimizes core -> core bandwidth on certain CPUs.
> >>>
> >>> (see: "Software Optimization Guide for AMD Family 15h Processors",
> >>> Section 11.6; similar language appears in the 10h guide and should
> >>> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
> >>>
> >>> Unfortunately the existing virtio_ring code issued reads to the
> >>> avail->idx and read-modify-writes to avail->flags on the producer.
> >>>
> >>> This change shadows the flags and index fields in producer memory;
> >>> the vring code now reads from the shadows and only ever writes to
> >>> avail->flags and avail->idx, allowing the cacheline to transfer
> >>> core -> core optimally.
> >> Sounds logical, I'll apply this after a bit of testing
> >> of my own, thanks!
> > Thanks!
>
> Venkatesh:
> Is it that your patch only applies to CPUs w/ exclusive caches?

No --- it applies when the inter-cache coherence flow is optimized by
'M' -> 'M' transfers and when producer reads might interfere w/
consumer prefetchw/reads. The AMD Optimization guides have specific
language on this subject, but other platforms may benefit.
(see Intel #'s below)
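To make the access pattern concrete, here is a minimal sketch of the
shadowing idea (illustrative only -- the names and layout are simplified
and memory barriers are omitted; this is not the actual virtio_ring.c
patch):

    #include <stdint.h>

    #define RING_SIZE 256                   /* illustrative ring size */
    #define VRING_AVAIL_F_NO_INTERRUPT 1

    /* Shared avail-ring header, simplified from struct vring_avail. */
    struct vring_avail {
            uint16_t flags;
            uint16_t idx;
            uint16_t ring[RING_SIZE];
    };

    /* Producer-side state: the shadows live in producer-private memory. */
    struct producer {
            struct vring_avail *avail;      /* cacheline shared w/ consumer */
            uint16_t avail_flags_shadow;    /* private copy of avail->flags */
            uint16_t avail_idx_shadow;      /* private copy of avail->idx */
    };

    /* Post one descriptor. Every read hits the private shadow; the shared
     * header only ever sees stores, so its cacheline can move core -> core
     * 'M' -> 'M' instead of bouncing back for producer reads. */
    static void producer_post(struct producer *p, uint16_t desc_head)
    {
            p->avail->ring[p->avail_idx_shadow % RING_SIZE] = desc_head;
            p->avail_idx_shadow++;
            p->avail->idx = p->avail_idx_shadow;    /* store, never a load */
    }

    /* Suppress notifications. The old code read-modify-wrote avail->flags
     * in place; here the RMW happens on the shadow instead. */
    static void producer_disable_cb(struct producer *p)
    {
            if (!(p->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
                    p->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
                    p->avail->flags = p->avail_flags_shadow;   /* store only */
            }
    }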
> Do you have perf data on Intel CPUs?

Good idea -- I ran some tests on a couple of Intel platforms:

(these are perf data from sample runs; for each I ran many runs, the
numbers were pretty stable except for Haswell-EP cross-socket)

One-socket Intel Xeon W3690 ("Westmere"), 3.46 GHz; core turbo disabled
=======================================================================
(note -- w/ core turbo disabled, performance is _very_ stable; variance
of < 0.5% run-to-run; figure of merit is "seconds elapsed" here)

* Producer / consumer bound to Hyperthread pairs:

  Performance counter stats for './vring_bench_noshadow 1000000000':

      343,425,166,916 L1-dcache-loads
      21,393,148 L1-dcache-load-misses       # 0.01% of all L1-dcache hits
      61,709,640,363 L1-dcache-stores
      5,745,690 L1-dcache-store-misses
      10,186,932,553 L1-dcache-prefetches
      1,491 L1-dcache-prefetch-misses

      121.335699344 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 1000000000':

      334,766,413,861 L1-dcache-loads
      15,787,778 L1-dcache-load-misses       # 0.00% of all L1-dcache hits
      62,735,792,799 L1-dcache-stores
      3,252,113 L1-dcache-store-misses
      9,018,273,596 L1-dcache-prefetches
      819 L1-dcache-prefetch-misses

      121.206339656 seconds time elapsed

  Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

  Performance counter stats for './vring_bench_noshadow 1000000000':

      399,943,384,509 L1-dcache-loads
      8,868,334,693 L1-dcache-load-misses    # 2.22% of all L1-dcache hits
      62,721,376,685 L1-dcache-stores
      2,786,806,982 L1-dcache-store-misses
      10,915,046,967 L1-dcache-prefetches
      328,508 L1-dcache-prefetch-misses

      146.585969976 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 1000000000':

      425,123,067,750 L1-dcache-loads
      6,689,318,709 L1-dcache-load-misses    # 1.57% of all L1-dcache hits
      62,747,525,005 L1-dcache-stores
      2,496,274,505 L1-dcache-store-misses
      8,627,873,397 L1-dcache-prefetches
      146,729 L1-dcache-prefetch-misses

      142.657327765 seconds time elapsed

  A 2.6% reduction in runtime; note that L1-dcache-load-misses dropped
  dramatically -- roughly 2 billion L1d misses saved.

Two-socket Intel Sandy Bridge(-EP) Xeon, 2.6 GHz; core turbo disabled
=====================================================================

* Producer / consumer bound to Hyperthread pairs:

  Performance counter stats for './vring_bench_noshadow 100000000':

      37,129,070,402 L1-dcache-loads
      6,416,246 L1-dcache-load-misses        # 0.02% of all L1-dcache hits
      6,207,794,675 L1-dcache-stores
      2,800,094 L1-dcache-store-misses

      17.029790809 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 100000000':

      36,799,559,391 L1-dcache-loads
      10,241,080 L1-dcache-load-misses       # 0.03% of all L1-dcache hits
      6,312,252,458 L1-dcache-stores
      2,742,239 L1-dcache-store-misses

      16.941001709 seconds time elapsed

  Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

  Performance counter stats for './vring_bench_noshadow 100000000':

      27,684,883,046 L1-dcache-loads
      809,933,091 L1-dcache-load-misses      # 2.93% of all L1-dcache hits
      6,219,598,352 L1-dcache-stores
      1,758,503 L1-dcache-store-misses

      15.020511218 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 100000000':

      28,092,111,012 L1-dcache-loads
      716,687,011 L1-dcache-load-misses      # 2.55% of all L1-dcache hits
      6,290,821,211 L1-dcache-stores
      1,565,583 L1-dcache-store-misses

      15.208420297 seconds time elapsed

  Effectively performance-neutral.

* Producer / consumer bound to separate cores, cross socket:
  (Sandy Bridge-EP appears to have less cross-socket variance than
  Haswell-EP)

  Performance counter stats for './vring_bench_noshadow 100000000':

      35,857,245,449 L1-dcache-loads
      821,746,755 L1-dcache-load-misses      # 2.29% of all L1-dcache hits
      6,252,551,550 L1-dcache-stores
      4,665,405 L1-dcache-store-misses

      46.340035651 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 100000000':

      39,044,022,857 L1-dcache-loads
      711,731,527 L1-dcache-load-misses      # 1.82% of all L1-dcache hits
      6,349,051,557 L1-dcache-stores
      4,292,362 L1-dcache-store-misses

      42.593259436 seconds time elapsed

  Runtimes for the cross-socket test have somewhat higher variance, but
  the pattern in counts of L1-dcache-loads and L1-dcache-load-misses for
  noshadow vs. shadow code is very stable. noshadow (w/o this patch)
  reliably clocks in at ~46 seconds; shadow ranges from ~48 to ~42
  (-2.8% to +8.0%).

Two-socket Intel Haswell(-EP) Xeon, 2.3 GHz; core turbo disabled
================================================================

* Producer / consumer bound to Hyperthread pairs:

  Performance counter stats for './vring_bench_noshadow 10000000000':

      474,856,463,271 L1-dcache-loads
      74,223,784 L1-dcache-load-misses       # 0.02% of all L1-dcache hits
      87,274,898,671 L1-dcache-stores
      31,869,448 L1-dcache-store-misses

      243.290969318 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 10000000000':

      466,891,993,302 L1-dcache-loads
      80,859,208 L1-dcache-load-misses       # 0.02% of all L1-dcache hits
      88,760,627,355 L1-dcache-stores
      35,727,720 L1-dcache-store-misses

      242.146970822 seconds time elapsed

  Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

  Performance counter stats for './vring_bench_noshadow 10000000000':

      357,657,891,797 L1-dcache-loads
      8,760,549,978 L1-dcache-load-misses    # 2.45% of all L1-dcache hits
      87,357,651,103 L1-dcache-stores
      10,166,431 L1-dcache-store-misses

      229.733047436 seconds time elapsed

  Performance counter stats for './vring_bench_shadow 10000000000':

      382,508,881,516 L1-dcache-loads
      8,348,013,630 L1-dcache-load-misses    # 2.18% of all L1-dcache hits
      88,756,639,931 L1-dcache-stores
      9,842,999 L1-dcache-store-misses

      230.850697668 seconds time elapsed

  Effectively performance-neutral.

* Producer / consumer bound to separate cores, different sockets:

  Unfortunately I don't have useful numbers for this case -- even with
  core turbo disabled, runtime variance is very high (10 - 30%
  run-to-run).

> For the perf metric you provide, why not L1-dcache-load-misses, which
> is more meaningful?

L1-dcache-load-misses is a better metric, you're right; for the original
AMD Piledriver run I posted:

  Performance counter stats for './vring_bench_noshadow':

      5,451,082,016 L1-dcache-loads
      31,690,398 L1-dcache-load-misses
      60,288,052 L1-dcache-stores
      60,517,840 LLC-loads
      9,726 LLC-load-misses

      2.221477739 seconds time elapsed

  Performance counter stats for './vring_bench_shadow':

      5,405,701,361 L1-dcache-loads
      31,157,235 L1-dcache-load-misses
      59,172,380 L1-dcache-stores
      59,398,269 LLC-loads
      10,944 LLC-load-misses

      2.168405376 seconds time elapsed

There is a 1.6% reduction in L1-dcache-load-misses, which lines up with
the roughly 2% reduction in runtime.

Summary:
* No workload on Westmere 1S, Sandy Bridge 2S, or Haswell 2S got worse;
* Westmere 1S cross-core improved by ~2.5% reliably;
* Sandy Bridge 2S cross-socket may have improved (cross-socket run
  variance makes it hard to tell);
* AMD Piledriver tests improved by ~2%;
* Other virtio implementations (over PCIe, for example) should benefit.
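(A note on methodology: the binding in the runs above uses standard
Linux CPU affinity. The vring_bench source is not part of this thread,
so the snippet below is only a sketch of how the producer and consumer
threads would be pinned, assuming a pthreads-based benchmark -- it is
not the benchmark itself.)

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin a thread to one logical CPU. For the "Hyperthread pairs" runs,
     * pass two HT siblings of the same core; for "same socket", two cores
     * on one package; for "cross socket", cores on different packages.
     * CPU numbering is machine-specific -- see
     * /sys/devices/system/cpu/cpu*/topology/thread_siblings_list. */
    static int pin_to_cpu(pthread_t thread, int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            return pthread_setaffinity_np(thread, sizeof(set), &set);
    }

The counter blocks above read like perf(1) output, e.g.:

    perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
    L1-dcache-stores,L1-dcache-store-misses ./vring_bench_shadow 1000000000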
HTH,
-- vs;
Venkatesh Srinivas
2015-Nov-18 04:28 UTC
[PATCH] virtio_ring: Shadow available ring flags & index
On Tue, Nov 17, 2015 at 08:08:18PM -0800, Venkatesh Srinivas wrote:

> [snip -- full quote of the previous message, identical to the post
> above]
I'm sorry -- I appear to have added an unintentional HTML draft part to
my reply. This would prevent the message from appearing on the kvm@
mailing list at the minimum.

Re-posting with the HTML part scrubbed.

Sorry,
-- vs;
Xie, Huawei
2015-Nov-19 16:15 UTC
[PATCH] virtio_ring: Shadow available ring flags & index
On 11/18/2015 12:28 PM, Venkatesh Srinivas wrote:

> [snip -- quoted patch description and earlier discussion]
>
>> No --- it applies when the inter-cache coherence flow is optimized by
>> 'M' -> 'M' transfers and when producer reads might interfere w/
>> consumer prefetchw/reads. The AMD Optimization guides have specific
>> language on this subject, but other platforms may benefit.
>> (see Intel #'s below)

For the core-to-core case (not HT pairs): after the consumer reads that
'M'-state cache line to fetch avail_idx, is the line still in the
producer core's L1 data cache, with its state changing from 'M' to 'O'?