Michael S. Tsirkin
2019-Jul-31 18:29 UTC
[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.

Unfortunately this needs a lot of review and testing, so this can't make
rc2, and I don't think this is the kind of patch I can merge after rc3.
Subtle memory barrier tricks like this can introduce new bugs while they
are fixing old ones.

> Consider the read critical section is pretty small the synchronization
> should be done very fast.
>
> Note the patch lead about 3% PPS dropping.

Sorry what do you mean by this last sentence? This degrades performance
compared to what?

> Reported-by: Michael S. Tsirkin <mst at redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang at redhat.com>
> ---
> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
> drivers/vhost/vhost.h |   7 +-
> 2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>
>  	spin_lock(&vq->mmu_lock);
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		map[i] = rcu_dereference_protected(vq->maps[i],
> -				  lockdep_is_held(&vq->mmu_lock));
> +		map[i] = vq->maps[i];
>  		if (map[i]) {
>  			vhost_set_map_dirty(vq, map[i], i);
> -			rcu_assign_pointer(vq->maps[i], NULL);
> +			vq->maps[i] = NULL;
>  		}
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> -	 * serialized with memory accessors (e.g vq mutex held).
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
>  	 */
>
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	smp_store_release(&vq->ref, ref + 1);
> +	/* Make sure ref counter is visible before accessing the map */
> +	smp_load_acquire(&vq->ref);

The map access is after this sequence, correct?

Just going by the rules in Documentation/memory-barriers.txt,
I think that this pair will not order following accesses with ref store.

Documentation/memory-barriers.txt says:

+ In addition, a RELEASE+ACQUIRE
+ pair is -not- guaranteed to act as a full memory barrier.

The guarantee that is made is this: after an ACQUIRE on a given variable,
all memory accesses preceding any prior RELEASE on that same variable are
guaranteed to be visible.

And if we also had the reverse rule we'd end up with a full barrier,
won't we?

Cc Paul in case I missed something here. And if I'm right, maybe we
should call this out, adding

	"The opposite is not true: a prior RELEASE is not
	guaranteed to be visible before memory accesses following
	the subsequent ACQUIRE".

> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	/* Make sure vq access is done before increasing ref counter */
> +	smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> +	int ref;
> +
> +	/* Make sure map change was done before checking ref counter */
> +	smp_mb();
> +
> +	ref = READ_ONCE(vq->ref);
> +	if (ref & 0x1) {

Please document the even/odd trick here too, not just in the commit log.

> +		/* When ref change,

changes

> we are sure no reader can see
> +		 * previous map */
> +		while (READ_ONCE(vq->ref) == ref) {

what is the below line in aid of?

> +			set_current_state(TASK_RUNNING);
> +			schedule();

	if (need_resched())
		schedule();

?

> +		}

On an interruptible kernel, there's a risk here is that
a task got preempted with an odd ref.
So I suspect we'll have to disable preemption when we
make ref odd.

> +	}
> +	/* Make sure ref counter was checked before any other
> +	 * operations that was dene on map. */

was dene -> were done?

> +	smp_mb();
> +}
> +
>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  				      int index,
>  				      unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  	spin_lock(&vq->mmu_lock);
>  	++vq->invalidate_count;
>
> -	map = rcu_dereference_protected(vq->maps[index],
> -					lockdep_is_held(&vq->mmu_lock));
> +	map = vq->maps[index];
>  	if (map) {
>  		vhost_set_map_dirty(vq, map, index);
> -		rcu_assign_pointer(vq->maps[index], NULL);
> +		vq->maps[index] = NULL;
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
>  	if (map) {
> -		synchronize_rcu();
> +		vhost_vq_sync_access(vq);
>  		vhost_map_unprefetch(map);
>  	}
>  }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		vq = dev->vqs[i];
>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> +			vq->maps[j] = NULL;
>  	}
>  }
>  #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		vq->ref = 0;
>  		mutex_init(&vq->mutex);
>  		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>  	map->npages = npages;
>  	map->pages = pages;
>
> -	rcu_assign_pointer(vq->maps[index], map);
> +	vq->maps[index] = map;
>  	/* No need for a synchronize_rcu(). This function should be
>  	 * called by dev->worker so we are serialized with all
>  	 * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > *((__virtio16 *)&used->ring[vq->num]) > cpu_to_vhost16(vq, vq->avail_idx); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, > size_t size; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > size = count * sizeof(*head); > memcpy(used->ring + idx, head, size); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > used->flags = cpu_to_vhost16(vq, vq->used_flags); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > used->idx = cpu_to_vhost16(vq, vq->last_used_idx); > 
- rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *idx = avail->idx; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *head = avail->ring[idx & (vq->num - 1)]; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *flags = avail->flags; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + vhost_vq_access_map_begin(vq); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > 
avail = map->addr; > *event = (__virtio16)avail->ring[vq->num]; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > *idx = used->idx; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, > struct vring_desc *d; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); > + map = vq->maps[VHOST_ADDR_DESC]; > if (likely(map)) { > d = map->addr; > *desc = *(d + idx); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, > #if VHOST_ARCH_CAN_ACCEL_UACCESS > static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) > { > - struct vhost_map __rcu *map; > + struct vhost_map *map; > int i; > > for (i = 0; i < VHOST_NUM_ADDRS; i++) { > - rcu_read_lock(); > - map = rcu_dereference(vq->maps[i]); > - rcu_read_unlock(); > + map = vq->maps[i]; > if (unlikely(!map)) > vhost_map_prefetch(vq, i); > } > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > index a9a2a93857d2..f9e9558a529d 100644 > --- a/drivers/vhost/vhost.h > +++ b/drivers/vhost/vhost.h > @@ -115,16 +115,17 @@ struct vhost_virtqueue { > #if VHOST_ARCH_CAN_ACCEL_UACCESS > /* Read by memory accessors, modified by meta data > * prefetching, MMU notifier and vring ioctl(). 
> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> -	 * and readers).
> +	 * Synchonrized through mmu_lock (writers) and ref counters,
> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>  	 */
> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>  	/* Read by MMU notifier, modified by vring ioctl(),
>  	 * synchronized through MMU notifier
>  	 * registering/unregistering.
>  	 */
>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>  #endif
> +	int ref;

Is it important that this is signed? If not I'd do unsigned here:
even though kernel does compile with 2s complement sign overflow,
it seems cleaner not to depend on that.

>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>
>  	struct file *kick;
> --
> 2.18.1
Jason Wang
2019-Aug-01 08:06 UTC
[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>> We used to use RCU to synchronize MMU notifier with worker. This leads
>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>> system, there would be many factors that may slow down the
>> synchronize_rcu() which makes it unsuitable to be called in MMU
>> notifier.
>>
>> A solution is SRCU but its overhead is obvious with the expensive full
>> memory barrier. Another choice is to use seqlock, but it doesn't
>> provide a synchronization method between readers and writers. The last
>> choice is to use vq mutex, but it need to deal with the worst case
>> that MMU notifier must be blocked and wait for the finish of swap in.
>>
>> So this patch switches use a counter to track whether or not the map
>> was used. The counter was increased when vq try to start or finish
>> uses the map. This means, when it was even, we're sure there's no
>> readers and MMU notifier is synchronized. When it was odd, it means
>> there's a reader we need to wait it to be even again then we are
>> synchronized. To avoid full memory barrier, store_release +
>> load_acquire on the counter is used.
>
> Unfortunately this needs a lot of review and testing, so this can't make
> rc2, and I don't think this is the kind of patch I can merge after rc3.
> Subtle memory barrier tricks like this can introduce new bugs while they
> are fixing old ones.

I admit the patch is tricky. Some questions:

- Do we must address the case of e.g swap in? If not, a simple
  vhost_work_flush() instead of synchronize_rcu() may work.

- Having some hard thought, I think we can use seqlock, it looks
  to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
  care vq->map read before smp_wmb(), and for the other we all have
  good data devendency so smp_wmb() in the write_seqbegin_end() is
  sufficient.
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index db2c81cb1e90..6d9501303258 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) { - int ref = READ_ONCE(vq->ref); - - smp_store_release(&vq->ref, ref + 1); - /* Make sure ref counter is visible before accessing the map */ - smp_load_acquire(&vq->ref); + write_seqcount_begin(&vq->seq); } static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) { - int ref = READ_ONCE(vq->ref); - - /* Make sure vq access is done before increasing ref counter */ - smp_store_release(&vq->ref, ref + 1); + write_seqcount_end(&vq->seq); } static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) { - int ref; + unsigned int ret; /* Make sure map change was done before checking ref counter */ smp_mb(); - - ref = READ_ONCE(vq->ref); - if (ref & 0x1) { - /* When ref change, we are sure no reader can see + ret = raw_read_seqcount(&vq->seq); + if (ret & 0x1) { + /* When seq changes, we are sure no reader can see * previous map */ - while (READ_ONCE(vq->ref) == ref) { - set_current_state(TASK_RUNNING); + while (raw_read_seqcount(&vq->seq) == ret) schedule(); - } } - /* Make sure ref counter was checked before any other - * operations that was dene on map. */ + /* Make sure seq was checked before any other operations that + * was dene on map. 
*/ smp_mb(); } @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev, vq->indirect = NULL; vq->heads = NULL; vq->dev = dev; - vq->ref = 0; + seqcount_init(&vq->seq); mutex_init(&vq->mutex); spin_lock_init(&vq->mmu_lock); vhost_vq_reset(dev, vq); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 3d10da0ae511..1a705e181a84 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -125,7 +125,7 @@ struct vhost_virtqueue { */ struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; #endif - int ref; + seqcount_t seq; const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; struct file *kick; -- 2.18.1> > > > > >> >> Consider the read critical section is pretty small the synchronization >> should be done very fast. >> >> Note the patch lead about 3% PPS dropping. > > Sorry what do you mean by this last sentence? This degrades performance > compared to what?Compare to without this patch.> >> >> Reported-by: Michael S. Tsirkin <mst at redhat.com> >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") >> Signed-off-by: Jason Wang <jasowang at redhat.com> >> --- >> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- >> drivers/vhost/vhost.h | 7 +- >> 2 files changed, 94 insertions(+), 58 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index cfc11f9ed9c9..db2c81cb1e90 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) >> >> spin_lock(&vq->mmu_lock); >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >> - map[i] = rcu_dereference_protected(vq->maps[i], >> - lockdep_is_held(&vq->mmu_lock)); >> + map[i] = vq->maps[i]; >> if (map[i]) { >> vhost_set_map_dirty(vq, map[i], i); >> - rcu_assign_pointer(vq->maps[i], NULL); >> + vq->maps[i] = NULL; >> } >> } >> spin_unlock(&vq->mmu_lock); >> >> - /* No need for synchronize_rcu() or kfree_rcu() since we are >> - * serialized with memory 
accessors (e.g vq mutex held). >> + /* No need for synchronization since we are serialized with >> + * memory accessors (e.g vq mutex held). >> */ >> >> for (i = 0; i < VHOST_NUM_ADDRS; i++) >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, >> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); >> } >> >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) >> +{ >> + int ref = READ_ONCE(vq->ref); >> + >> + smp_store_release(&vq->ref, ref + 1); >> + /* Make sure ref counter is visible before accessing the map */ >> + smp_load_acquire(&vq->ref); > > The map access is after this sequence, correct?Yes.> > Just going by the rules in Documentation/memory-barriers.txt, > I think that this pair will not order following accesses with ref store. > > Documentation/memory-barriers.txt says: > > > + In addition, a RELEASE+ACQUIRE > + pair is -not- guaranteed to act as a full memory barrier. > > > > The guarantee that is made is this: > after > an ACQUIRE on a given variable, all memory accesses preceding any prior > RELEASE on that same variable are guaranteed to be visible.Yes, but it's not clear about the order of ACQUIRE the same location of previous RELEASE. And it only has a example like: " *A = a; RELEASE M ACQUIRE N *B = b; could occur as: ACQUIRE N, STORE *B, STORE *A, RELEASE M " But it doesn't explain what happen when *A = a RELEASE M ACQUIRE M *B = b; And tools/memory-model/Documentation said " First, when a lock-acquire reads from a lock-release, the LKMM requires that every instruction po-before the lock-release must execute before any instruction po-after the lock-acquire. " Is this a hint that I was correct?> > > And if we also had the reverse rule we'd end up with a full barrier, > won't we? > > Cc Paul in case I missed something here. 
And if I'm right, > maybe we should call this out, adding > > "The opposite is not true: a prior RELEASE is not > guaranteed to be visible before memory accesses following > the subsequent ACQUIRE".That kinds of violates the RELEASE? " This also acts as a one-way permeable barrier. It guarantees that all memory operations before the RELEASE operation will appear to happen before the RELEASE operation with respect to the other components of the "> > > >> +} >> + >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) >> +{ >> + int ref = READ_ONCE(vq->ref); >> + >> + /* Make sure vq access is done before increasing ref counter */ >> + smp_store_release(&vq->ref, ref + 1); >> +} >> + >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) >> +{ >> + int ref; >> + >> + /* Make sure map change was done before checking ref counter */ >> + smp_mb(); >> + >> + ref = READ_ONCE(vq->ref); >> + if (ref & 0x1) { > > Please document the even/odd trick here too, not just in the commit log. >Ok.>> + /* When ref change, > > changes > >> we are sure no reader can see >> + * previous map */ >> + while (READ_ONCE(vq->ref) == ref) { > > > what is the below line in aid of? > >> + set_current_state(TASK_RUNNING); >> + schedule(); > > if (need_resched()) > schedule(); > > ?Yes, better.> >> + } > > On an interruptible kernel, there's a risk here is that > a task got preempted with an odd ref. > So I suspect we'll have to disable preemption when we > make ref odd.I'm not sure I get, if the odd is not the original value we read, we're sure it won't read the new map here I believe.> > >> + } >> + /* Make sure ref counter was checked before any other >> + * operations that was dene on map. */ > > was dene -> were done? 
>Yes.>> + smp_mb(); >> +} >> + >> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> int index, >> unsigned long start, >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> spin_lock(&vq->mmu_lock); >> ++vq->invalidate_count; >> >> - map = rcu_dereference_protected(vq->maps[index], >> - lockdep_is_held(&vq->mmu_lock)); >> + map = vq->maps[index]; >> if (map) { >> vhost_set_map_dirty(vq, map, index); >> - rcu_assign_pointer(vq->maps[index], NULL); >> + vq->maps[index] = NULL; >> } >> spin_unlock(&vq->mmu_lock); >> >> if (map) { >> - synchronize_rcu(); >> + vhost_vq_sync_access(vq); >> vhost_map_unprefetch(map); >> } >> } >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) >> for (i = 0; i < dev->nvqs; ++i) { >> vq = dev->vqs[i]; >> for (j = 0; j < VHOST_NUM_ADDRS; j++) >> - RCU_INIT_POINTER(vq->maps[j], NULL); >> + vq->maps[j] = NULL; >> } >> } >> #endif >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, >> vq->indirect = NULL; >> vq->heads = NULL; >> vq->dev = dev; >> + vq->ref = 0; >> mutex_init(&vq->mutex); >> spin_lock_init(&vq->mmu_lock); >> vhost_vq_reset(dev, vq); >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, >> map->npages = npages; >> map->pages = pages; >> >> - rcu_assign_pointer(vq->maps[index], map); >> + vq->maps[index] = map; >> /* No need for a synchronize_rcu(). This function should be >> * called by dev->worker so we are serialized with all >> * readers. 
>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> *((__virtio16 *)&used->ring[vq->num]) >> cpu_to_vhost16(vq, vq->avail_idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> size_t size; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> size = count * sizeof(*head); >> memcpy(used->ring + idx, head, size); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> used->flags = cpu_to_vhost16(vq, vq->used_flags); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if 
(likely(map)) { >> used = map->addr; >> used->idx = cpu_to_vhost16(vq, vq->last_used_idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *idx = avail->idx; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *head = avail->ring[idx & (vq->num - 1)]; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *flags = avail->flags; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> 
- rcu_read_lock(); >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + vhost_vq_access_map_begin(vq); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *event = (__virtio16)avail->ring[vq->num]; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> *idx = used->idx; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, >> struct vring_desc *d; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); >> + map = vq->maps[VHOST_ADDR_DESC]; >> if (likely(map)) { >> d = map->addr; >> *desc = *(d + idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, >> #if VHOST_ARCH_CAN_ACCEL_UACCESS >> static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) >> { >> - struct vhost_map __rcu *map; >> + struct vhost_map *map; >> int i; >> >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >> - rcu_read_lock(); >> - map = rcu_dereference(vq->maps[i]); >> - rcu_read_unlock(); >> + map = vq->maps[i]; >> if (unlikely(!map)) >> vhost_map_prefetch(vq, i); >> } >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h >> index a9a2a93857d2..f9e9558a529d 100644 
>> --- a/drivers/vhost/vhost.h >> +++ b/drivers/vhost/vhost.h >> @@ -115,16 +115,17 @@ struct vhost_virtqueue { >> #if VHOST_ARCH_CAN_ACCEL_UACCESS >> /* Read by memory accessors, modified by meta data >> * prefetching, MMU notifier and vring ioctl(). >> - * Synchonrized through mmu_lock (writers) and RCU (writers >> - * and readers). >> + * Synchonrized through mmu_lock (writers) and ref counters, >> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). >> */ >> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; >> + struct vhost_map *maps[VHOST_NUM_ADDRS]; >> /* Read by MMU notifier, modified by vring ioctl(), >> * synchronized through MMU notifier >> * registering/unregistering. >> */ >> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; >> #endif >> + int ref; > > Is it important that this is signed? If not I'd do unsigned here: > even though kernel does compile with 2s complement sign overflow, > it seems cleaner not to depend on that.Not a must, let me fix. Thanks> >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; >> >> struct file *kick; >> -- >> 2.18.1
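[Editor's note: the ordering question debated in the message above — does a RELEASE followed by an ACQUIRE on the same variable order the surrounding accesses? — can be posed mechanically to the Linux kernel memory model. Below is a hypothetical herd7 litmus sketch in the style of tools/memory-model/litmus-tests; it encodes the question (can P1 see the map access without seeing the ref store?), not the answer, and the names are illustrative.]

```
C vhost-rel-acq-same-var

(* P0 models vhost_vq_access_map_begin(): store-release to ref,
 * load-acquire of ref back, then the map access.  The exists-clause
 * asks whether another CPU can observe the map access while the ref
 * store is still invisible -- if allowed, the scheme is unsound. *)

{}

P0(int *ref, int *map)
{
	int r0;

	smp_store_release(ref, 1);	/* begin: make ref odd */
	r0 = smp_load_acquire(ref);
	WRITE_ONCE(*map, 1);		/* the map access */
}

P1(int *ref, int *map)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*map);
	smp_rmb();
	r2 = READ_ONCE(*ref);
}

exists (1:r1=1 /\ 1:r2=0)
```

Running this through herd7 with the LKMM configuration would settle the "Is this a hint that I was correct?" question above without waiting on list discussion.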
Michael S. Tsirkin
2019-Aug-03 21:54 UTC
[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote:> On 2019/8/1 ??2:29, Michael S. Tsirkin wrote: > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: > >> We used to use RCU to synchronize MMU notifier with worker. This leads > >> calling synchronize_rcu() in invalidate_range_start(). But on a busy > >> system, there would be many factors that may slow down the > >> synchronize_rcu() which makes it unsuitable to be called in MMU > >> notifier. > >> > >> A solution is SRCU but its overhead is obvious with the expensive full > >> memory barrier. Another choice is to use seqlock, but it doesn't > >> provide a synchronization method between readers and writers. The last > >> choice is to use vq mutex, but it need to deal with the worst case > >> that MMU notifier must be blocked and wait for the finish of swap in. > >> > >> So this patch switches use a counter to track whether or not the map > >> was used. The counter was increased when vq try to start or finish > >> uses the map. This means, when it was even, we're sure there's no > >> readers and MMU notifier is synchronized. When it was odd, it means > >> there's a reader we need to wait it to be even again then we are > >> synchronized. To avoid full memory barrier, store_release + > >> load_acquire on the counter is used. > > > > Unfortunately this needs a lot of review and testing, so this can't make > > rc2, and I don't think this is the kind of patch I can merge after rc3. > > Subtle memory barrier tricks like this can introduce new bugs while they > > are fixing old ones. > > I admit the patch is tricky. Some questions: > > - Do we must address the case of e.g swap in? If not, a simple > vhost_work_flush() instead of synchronize_rcu() may work. 
> - Having had some hard thought, I think we can use seqlock, it looks
>   to me the smp_wmb() in write_seqcount_begin() is sufficient, we don't
>   care about vq->map reads before the smp_wmb(), and for the others we all
>   have good data dependency, so the smp_wmb() in write_seqcount_end() is
>   sufficient.

If we need an mb in the begin() we can switch to dependent_ptr_mb.
If you need me to fix it up and repost, let me know.

Why isn't it a problem if the map is accessed outside the lock?

> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index db2c81cb1e90..6d9501303258 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>
>  static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	smp_store_release(&vq->ref, ref + 1);
> -	/* Make sure ref counter is visible before accessing the map */
> -	smp_load_acquire(&vq->ref);
> +	write_seqcount_begin(&vq->seq);
>  }
>
>  static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	/* Make sure vq access is done before increasing ref counter */
> -	smp_store_release(&vq->ref, ref + 1);
> +	write_seqcount_end(&vq->seq);
>  }
>
>  static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>  {
> -	int ref;
> +	unsigned int ret;
>
>  	/* Make sure map change was done before checking ref counter */
>  	smp_mb();
> -
> -	ref = READ_ONCE(vq->ref);
> -	if (ref & 0x1) {
> -		/* When ref change, we are sure no reader can see
> +	ret = raw_read_seqcount(&vq->seq);
> +	if (ret & 0x1) {
> +		/* When seq changes, we are sure no reader can see
>  		 * previous map */
> -		while (READ_ONCE(vq->ref) == ref) {
> -			set_current_state(TASK_RUNNING);
> +		while (raw_read_seqcount(&vq->seq) == ret)
>  			schedule();

So why do we set state here?
And should we not check need_resched()?

> -		}
> 	}
> -	/* Make sure ref counter was checked before any other
> -	 * operations that was dene on map. */
> +	/* Make sure seq was checked before any other operations that
> +	 * was dene on map. */
> 	smp_mb();
> }
>
> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> 	vq->indirect = NULL;
> 	vq->heads = NULL;
> 	vq->dev = dev;
> -	vq->ref = 0;
> +	seqcount_init(&vq->seq);
> 	mutex_init(&vq->mutex);
> 	spin_lock_init(&vq->mmu_lock);
> 	vhost_vq_reset(dev, vq);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3d10da0ae511..1a705e181a84 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -125,7 +125,7 @@ struct vhost_virtqueue {
> 	 */
> 	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
> #endif
> -	int ref;
> +	seqcount_t seq;
> 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>
> 	struct file *kick;
> --
> 2.18.1
>
>
> >>
> >> Consider the read critical section is pretty small the synchronization
> >> should be done very fast.
> >>
> >> Note the patch lead about 3% PPS dropping.
> >
> > Sorry what do you mean by this last sentence? This degrades performance
> > compared to what?
>
> Compared to without this patch.

OK, is the feature still a performance win? Or should we drop it for now?

> >
> >>
> >> Reported-by: Michael S.
Tsirkin <mst at redhat.com> > >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") > >> Signed-off-by: Jason Wang <jasowang at redhat.com> > >> --- > >> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- > >> drivers/vhost/vhost.h | 7 +- > >> 2 files changed, 94 insertions(+), 58 deletions(-) > >> > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > >> index cfc11f9ed9c9..db2c81cb1e90 100644 > >> --- a/drivers/vhost/vhost.c > >> +++ b/drivers/vhost/vhost.c > >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) > >> > >> spin_lock(&vq->mmu_lock); > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { > >> - map[i] = rcu_dereference_protected(vq->maps[i], > >> - lockdep_is_held(&vq->mmu_lock)); > >> + map[i] = vq->maps[i]; > >> if (map[i]) { > >> vhost_set_map_dirty(vq, map[i], i); > >> - rcu_assign_pointer(vq->maps[i], NULL); > >> + vq->maps[i] = NULL; > >> } > >> } > >> spin_unlock(&vq->mmu_lock); > >> > >> - /* No need for synchronize_rcu() or kfree_rcu() since we are > >> - * serialized with memory accessors (e.g vq mutex held). > >> + /* No need for synchronization since we are serialized with > >> + * memory accessors (e.g vq mutex held). > >> */ > >> > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) > >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, > >> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); > >> } > >> > >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) > >> +{ > >> + int ref = READ_ONCE(vq->ref); > >> + > >> + smp_store_release(&vq->ref, ref + 1); > >> + /* Make sure ref counter is visible before accessing the map */ > >> + smp_load_acquire(&vq->ref); > > > > The map access is after this sequence, correct? > > Yes. > > > > > Just going by the rules in Documentation/memory-barriers.txt, > > I think that this pair will not order following accesses with ref store. 
> >
> > Documentation/memory-barriers.txt says:
> >
> > +	In addition, a RELEASE+ACQUIRE
> > +	pair is -not- guaranteed to act as a full memory barrier.
> >
> > The guarantee that is made is this: after
> > an ACQUIRE on a given variable, all memory accesses preceding any prior
> > RELEASE on that same variable are guaranteed to be visible.
>
> Yes, but it's not clear about the ordering when an ACQUIRE is on the
> same location as a previous RELEASE. And it only has an example like:
>
> "
> 	*A = a;
> 	RELEASE M
> 	ACQUIRE N
> 	*B = b;
>
> could occur as:
>
> 	ACQUIRE N, STORE *B, STORE *A, RELEASE M
> "
>
> But it doesn't explain what happens with
>
> 	*A = a
> 	RELEASE M
> 	ACQUIRE M
> 	*B = b;
>
> And tools/memory-model/Documentation says
>
> "
> First, when a lock-acquire reads from a lock-release, the LKMM
> requires that every instruction po-before the lock-release must
> execute before any instruction po-after the lock-acquire.
> "
>
> Is this a hint that I was correct?

I don't think it's correct, since by this logic
memory barriers could be nops on x86.

> >
> > And if we also had the reverse rule we'd end up with a full barrier,
> > won't we?
> >
> > Cc Paul in case I missed something here. And if I'm right,
> > maybe we should call this out, adding
> >
> > "The opposite is not true: a prior RELEASE is not
> > guaranteed to be visible before memory accesses following
> > the subsequent ACQUIRE".
>
> That kind of violates the RELEASE?
>
> "
> This also acts as a one-way permeable barrier.
> It guarantees that all
> memory operations before the RELEASE operation will appear to happen
> before the RELEASE operation with respect to the other components of the
> "

Yes, but we are talking about the RELEASE itself versus stuff
that comes after it.

> >
> >
> >> +}
> >> +
> >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref = READ_ONCE(vq->ref);
> >> +
> >> +	/* Make sure vq access is done before increasing ref counter */
> >> +	smp_store_release(&vq->ref, ref + 1);
> >> +}
> >> +
> >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref;
> >> +
> >> +	/* Make sure map change was done before checking ref counter */
> >> +	smp_mb();
> >> +
> >> +	ref = READ_ONCE(vq->ref);
> >> +	if (ref & 0x1) {
> >
> > Please document the even/odd trick here too, not just in the commit log.
>
> Ok.
>
> >> +		/* When ref change,
> >
> > changes
> >
> >> we are sure no reader can see
> >> +		 * previous map */
> >> +		while (READ_ONCE(vq->ref) == ref) {
> >
> >
> > what is the below line in aid of?
> >
> >> +			set_current_state(TASK_RUNNING);

any answers here?

> >> +			schedule();
> >
> > 	if (need_resched())
> > 		schedule();
> >
> > ?
>
> Yes, better.
>
> >
> >> +		}
> >
> > On an interruptible kernel, the risk here is that
> > a task got preempted with an odd ref.
> > So I suspect we'll have to disable preemption when we
> > make ref odd.
>
> I'm not sure I get it: if the odd value is not the original value we
> read, we're sure it won't read the new map here, I believe.

But we will spin for a very long time in this case.

> >
> >
> >> +	}
> >> +	/* Make sure ref counter was checked before any other
> >> +	 * operations that was dene on map. */
> >
> > was dene -> were done?
>
> Yes.
> > >> + smp_mb(); > >> +} > >> + > >> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > >> int index, > >> unsigned long start, > >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > >> spin_lock(&vq->mmu_lock); > >> ++vq->invalidate_count; > >> > >> - map = rcu_dereference_protected(vq->maps[index], > >> - lockdep_is_held(&vq->mmu_lock)); > >> + map = vq->maps[index]; > >> if (map) { > >> vhost_set_map_dirty(vq, map, index); > >> - rcu_assign_pointer(vq->maps[index], NULL); > >> + vq->maps[index] = NULL; > >> } > >> spin_unlock(&vq->mmu_lock); > >> > >> if (map) { > >> - synchronize_rcu(); > >> + vhost_vq_sync_access(vq); > >> vhost_map_unprefetch(map); > >> } > >> } > >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) > >> for (i = 0; i < dev->nvqs; ++i) { > >> vq = dev->vqs[i]; > >> for (j = 0; j < VHOST_NUM_ADDRS; j++) > >> - RCU_INIT_POINTER(vq->maps[j], NULL); > >> + vq->maps[j] = NULL; > >> } > >> } > >> #endif > >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, > >> vq->indirect = NULL; > >> vq->heads = NULL; > >> vq->dev = dev; > >> + vq->ref = 0; > >> mutex_init(&vq->mutex); > >> spin_lock_init(&vq->mmu_lock); > >> vhost_vq_reset(dev, vq); > >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, > >> map->npages = npages; > >> map->pages = pages; > >> > >> - rcu_assign_pointer(vq->maps[index], map); > >> + vq->maps[index] = map; > >> /* No need for a synchronize_rcu(). This function should be > >> * called by dev->worker so we are serialized with all > >> * readers. 
> >> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> *((__virtio16 *)&used->ring[vq->num]) > >> cpu_to_vhost16(vq, vq->avail_idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, > >> size_t size; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> size = count * sizeof(*head); > >> memcpy(used->ring + idx, head, size); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> used->flags = cpu_to_vhost16(vq, vq->used_flags); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> 
+ vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> used->idx = cpu_to_vhost16(vq, vq->last_used_idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *idx = avail->idx; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *head = avail->ring[idx & (vq->num - 1)]; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *flags = avail->flags; > >> - rcu_read_unlock(); > 
>> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + vhost_vq_access_map_begin(vq); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *event = (__virtio16)avail->ring[vq->num]; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> *idx = used->idx; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, > >> struct vring_desc *d; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); > >> + map = vq->maps[VHOST_ADDR_DESC]; > >> if (likely(map)) { > >> d = map->addr; > >> *desc = *(d + idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, > >> #if VHOST_ARCH_CAN_ACCEL_UACCESS > >> static void 
vhost_vq_map_prefetch(struct vhost_virtqueue *vq) > >> { > >> - struct vhost_map __rcu *map; > >> + struct vhost_map *map; > >> int i; > >> > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { > >> - rcu_read_lock(); > >> - map = rcu_dereference(vq->maps[i]); > >> - rcu_read_unlock(); > >> + map = vq->maps[i]; > >> if (unlikely(!map)) > >> vhost_map_prefetch(vq, i); > >> } > >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > >> index a9a2a93857d2..f9e9558a529d 100644 > >> --- a/drivers/vhost/vhost.h > >> +++ b/drivers/vhost/vhost.h > >> @@ -115,16 +115,17 @@ struct vhost_virtqueue { > >> #if VHOST_ARCH_CAN_ACCEL_UACCESS > >> /* Read by memory accessors, modified by meta data > >> * prefetching, MMU notifier and vring ioctl(). > >> - * Synchonrized through mmu_lock (writers) and RCU (writers > >> - * and readers). > >> + * Synchonrized through mmu_lock (writers) and ref counters, > >> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). > >> */ > >> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; > >> + struct vhost_map *maps[VHOST_NUM_ADDRS]; > >> /* Read by MMU notifier, modified by vring ioctl(), > >> * synchronized through MMU notifier > >> * registering/unregistering. > >> */ > >> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; > >> #endif > >> + int ref; > > > > Is it important that this is signed? If not I'd do unsigned here: > > even though kernel does compile with 2s complement sign overflow, > > it seems cleaner not to depend on that. > > Not a must, let me fix. > > Thanks > > > > >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; > >> > >> struct file *kick; > >> -- > >> 2.18.1