We need a full barrier after writing out the event index; using
smp_store_mb there seems better than open-coding. As usual, we need a
wrapper to account for strong barriers/non-SMP.

It's tempting to use this in vhost as well; for that, we'll need a
variant of smp_store_mb that works on __user pointers.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---

Seems to give a speedup on my box but I'm less sure about this one.
E.g. is xchg faster than mfence on all/most Intel CPUs? Does anyone
have an opinion?

 include/linux/virtio_ring.h  | 14 ++++++++++++++
 drivers/virtio/virtio_ring.c | 15 +++++++++------
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 0135c16..8912189 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -47,6 +47,20 @@ static inline void virtio_wmb(bool weak_barriers)
 		wmb();
 }
 
+static inline void virtio_store_mb(bool weak_barriers,
+				   __virtio16 *p, __virtio16 v)
+{
+#ifdef CONFIG_SMP
+	if (weak_barriers)
+		smp_store_mb(*p, v);
+	else
+#endif
+	{
+		WRITE_ONCE(*p, v);
+		mb();
+	}
+}
+
 static inline __virtio16 virtio_load_acquire(bool weak_barriers, __virtio16 *p)
 {
 	if (!weak_barriers) {
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f822cab..b0aea67 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -517,10 +517,10 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	/* If we expect an interrupt for the next entry, tell host
 	 * by writing event index and flush out the write before
 	 * the read in the next get_buf call. */
-	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
-		vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, vq->last_used_idx);
-		virtio_mb(vq->weak_barriers);
-	}
+	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+		virtio_store_mb(vq->weak_barriers,
+				&vring_used_event(&vq->vring),
+				cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
 
 #ifdef DEBUG
 	vq->last_add_time_valid = false;
@@ -653,8 +653,11 @@ bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
 	}
 	/* TODO: tune this threshold */
 	bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-	vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs);
-	virtio_mb(vq->weak_barriers);
+
+	virtio_store_mb(vq->weak_barriers,
+			&vring_used_event(&vq->vring),
+			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
 	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
 		END_USE(vq);
 		return false;

--
MST
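
For readers following the thread: the reason this wants a full barrier
rather than virtio_wmb() is that the store to used_event must be ordered
against a later *load* of used->idx (the "read in the next get_buf call"
that the comment mentions). Here is the virtqueue_enable_cb_delayed()
hunk again with that store/load pair called out; the explanatory comment
is added for this note and is not part of the patch:

	virtio_store_mb(vq->weak_barriers,
			&vring_used_event(&vq->vring),
			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));

	/* The full barrier above orders the used_event store against the
	 * used->idx load below; a write barrier alone would only order it
	 * against later stores, so the host's progress could be missed. */
	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) -
			   vq->last_used_idx) > bufs)) {
		END_USE(vq);
		return false;
	}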
On Thu, Dec 17, 2015 at 12:32:53PM +0200, Michael S. Tsirkin wrote:
> +static inline void virtio_store_mb(bool weak_barriers,
> +				   __virtio16 *p, __virtio16 v)
> +{
> +#ifdef CONFIG_SMP
> +	if (weak_barriers)
> +		smp_store_mb(*p, v);
> +	else
> +#endif
> +	{
> +		WRITE_ONCE(*p, v);
> +		mb();
> +	}
> +}

This is a different barrier depending on SMP, that seems wrong.

smp_mb(), as (should be) used by smp_store_mb() does not provide a
barrier against IO. mb() otoh does.

Since this is virtIO I would expect you always want mb().
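
To make the SMP/UP split concrete, here is the proposed helper expanded
by hand for both configurations; this is simply the #ifdef from the
patch written out, with comments added for this note:

	/* CONFIG_SMP build: */
	if (weak_barriers)
		smp_store_mb(*p, v);	/* store plus SMP-level full barrier */
	else {
		WRITE_ONCE(*p, v);
		mb();			/* mandatory barrier, also orders IO */
	}

	/* !CONFIG_SMP build: the #ifdef drops the branch, so every caller gets: */
	WRITE_ONCE(*p, v);
	mb();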
On Thu, Dec 17, 2015 at 12:32:53PM +0200, Michael S. Tsirkin wrote:
> Seems to give a speedup on my box but I'm less sure about this one. E.g. is
> xchg faster than mfence on all/most Intel CPUs? Does anyone have an opinion?

Would help if you Cc people who would actually know this :-)

Yes, we've recently established that xchg is indeed faster than mfence
on at least recent machines, see:

  lkml.kernel.org/r/CA+55aFynbkeuUGs9s-q+fLY6MeRBA6MjEyWWbbe7A5AaqsAknw at mail.gmail.com

> +static inline void virtio_store_mb(bool weak_barriers,
> +				   __virtio16 *p, __virtio16 v)
> +{
> +#ifdef CONFIG_SMP
> +	if (weak_barriers)
> +		smp_store_mb(*p, v);
> +	else
> +#endif
> +	{
> +		WRITE_ONCE(*p, v);
> +		mb();
> +	}
> +}

Note that virtio_mb() is weirdly inconsistent with virtio_[rw]mb() in
that they use dma_* ops for weak_barriers, while virtio_mb() uses
smp_mb().

As previously stated, smp_mb() does not cover the same memory domains
as dma_mb() would.
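
For readers without a tree handy, the inconsistency described above
looks roughly like this; this is a sketch reconstructed from the
description in this mail and from the virtio_wmb() context visible in
the patch, not a verbatim copy of include/linux/virtio_ring.h:

static inline void virtio_mb(bool weak_barriers)
{
	if (weak_barriers)
		smp_mb();	/* CPU-CPU ordering only */
	else
		mb();
}

static inline void virtio_rmb(bool weak_barriers)
{
	if (weak_barriers)
		dma_rmb();	/* also orders accesses to coherent DMA memory */
	else
		rmb();
}

static inline void virtio_wmb(bool weak_barriers)
{
	if (weak_barriers)
		dma_wmb();
	else
		wmb();
}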
On Thu, Dec 17, 2015 at 11:52:38AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 17, 2015 at 12:32:53PM +0200, Michael S. Tsirkin wrote:
> > +static inline void virtio_store_mb(bool weak_barriers,
> > +				   __virtio16 *p, __virtio16 v)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (weak_barriers)
> > +		smp_store_mb(*p, v);
> > +	else
> > +#endif
> > +	{
> > +		WRITE_ONCE(*p, v);
> > +		mb();
> > +	}
> > +}
>
> This is a different barrier depending on SMP, that seems wrong.

Of course it's wrong in the sense that it's suboptimal on UP. What we
would really like is to have, on UP, exactly the same barrier as on
SMP. This is because a UP guest can run on an SMP host. But Linux
doesn't provide this ability: if CONFIG_SMP is not defined, it
optimizes most barriers out to a compiler barrier.

Consider for example x86: what we want is xchg (NOT mfence - see below
for why), but if built without CONFIG_SMP, smp_store_mb does not
include this.

> smp_mb(), as (should be) used by smp_store_mb() does not provide a
> barrier against IO. mb() otoh does.
>
> Since this is virtIO I would expect you always want mb().

No, because it's VIRTio, not real IO :)  It just switches to the
hypervisor mode - kind of like a function call, really. The
weak_barriers flag is cleared when it's used with real devices doing
real IO.

All this is explained in some detail at the top of
include/linux/virtio.h

--
MST
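
To spell out the x86 example: the definitions of that era looked
roughly like the following (a from-memory paraphrase, so the exact
spellings may differ; the argument above only depends on the shape):

	/* x86 with CONFIG_SMP: full barrier implied by the locked xchg */
	#define smp_store_mb(var, value) \
		do { (void)xchg(&var, value); } while (0)

	/* without CONFIG_SMP: degrades to a compiler barrier only, which
	 * is not enough for a UP guest running on an SMP host - hence the
	 * explicit mb() fallback in virtio_store_mb() */
	#define smp_store_mb(var, value) \
		do { WRITE_ONCE(var, value); barrier(); } while (0)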
On Thu, Dec 17, 2015 at 12:22:22PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 17, 2015 at 12:32:53PM +0200, Michael S. Tsirkin wrote:
> > Seems to give a speedup on my box but I'm less sure about this one. E.g. is
> > xchg faster than mfence on all/most Intel CPUs? Does anyone have an opinion?
>
> Would help if you Cc people who would actually know this :-)

Good point. Glad you still saw this. Thanks!

> Yes, we've recently established that xchg is indeed faster than mfence
> on at least recent machines, see:
>
>   lkml.kernel.org/r/CA+55aFynbkeuUGs9s-q+fLY6MeRBA6MjEyWWbbe7A5AaqsAknw at mail.gmail.com
>
> > +static inline void virtio_store_mb(bool weak_barriers,
> > +				   __virtio16 *p, __virtio16 v)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (weak_barriers)
> > +		smp_store_mb(*p, v);
> > +	else
> > +#endif
> > +	{
> > +		WRITE_ONCE(*p, v);
> > +		mb();
> > +	}
> > +}
>
> Note that virtio_mb() is weirdly inconsistent with virtio_[rw]mb() in
> that they use dma_* ops for weak_barriers, while virtio_mb() uses
> smp_mb().

It's a hack, really. I think I'll clean it up a bit to make it more
consistent.

To simplify things, you may want to look at how this was before the
optimization brought in by

    commit 9e1a27ea42691429e31f158cce6fc61bc79bb2e9
    Author: Alexander Duyck <alexander.h.duyck at redhat.com>
    Date:   Mon Apr 13 21:03:49 2015 +0930

        virtio_ring: Update weak barriers to use dma_wmb/rmb

> As previously stated, smp_mb() does not cover the same memory domains as
> dma_mb() would.

I know. We used to consistently do the right thing on SMP, but on UP
Linux does not have good portable APIs for us to use, so we hack around
with what's available, which is typically stronger than what's really
needed. I guess no one cares about UP that much.

Then Alexander came along and tried to optimize UP using
dma_wmb/dma_rmb. I guess he did not find a dma_mb, so he left
virtio_mb() as is.

Maybe we should make virtio depend on SMP and be done with it, but the
amount of code needed to maintain !SMP is small enough that dropping it
would not be worth the potential pain to users (if any).

--
MST