Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
Hi all,

this series adds multiqueue support to the virtio-scsi driver, based
on Jason Wang's work on virtio-net.  It uses a simple queue steering
algorithm that expects one queue per CPU.  LUNs in the same target always
use the same queue (so that commands are not reordered); queue switching
occurs when the request being queued is the only one for the target.
Also based on Jason's patches, the virtqueue affinity is set so that
each CPU is associated to one virtqueue.

I tested the patches with fio, using up to 32 virtio-scsi disks backed
by tmpfs on the host.  These numbers are with 1 LUN per target.

FIO configuration
-----------------
[global]
rw=read
bsrange=4k-64k
ioengine=libaio
direct=1
iodepth=4
loops=20

overall bandwidth (MB/s)
------------------------

# of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
      1             540                 626                    599
      2             795                 965                    925
      4             997                1376                   1500
      8            1136                2130                   2060
     16            1440                2269                   2474
     24            1408                2179                   2436
     32            1515                1978                   2319

(These numbers for single-queue are with 4 VCPUs, but the impact of adding
more VCPUs is very limited.)

avg bandwidth per LUN (MB/s)
----------------------------

# of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
      1             540                 626                    599
      2             397                 482                    462
      4             249                 344                    375
      8             142                 266                    257
     16              90                 141                    154
     24              58                  90                    101
     32              47                  61                     72

Patch 1 adds a new API for piecewise addition of buffers, which enables
various simplifications in virtio-scsi (patches 2-3) and a small
performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.

I'm mostly looking for comments on the new API of patch 1 for inclusion
into the 3.9 kernel.

Thanks to Wanlong Gao for help rebasing and benchmarking these patches.

Paolo Bonzini (5):
  virtio: add functions for piecewise addition of buffers
  virtio-scsi: use functions for piecewise composition of buffers
  virtio-scsi: redo allocation of target data
  virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  virtio-scsi: introduce multiqueue support

 drivers/scsi/virtio_scsi.c   | 374 +++++++++++++++++++++++++++++-------------
 drivers/virtio/virtio_ring.c | 205 ++++++++++++++++++++++++
 include/linux/virtio.h       |  21 +++
 3 files changed, 485 insertions(+), 115 deletions(-)
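
As a rough illustration of the queue steering policy described above, a
simplified sketch might look like this.  This is not the actual driver code:
the helper name is made up, and the tgt_lock serialization and memory barrier
used by the real code are omitted for brevity; see virtscsi_queuecommand_multi
in patch 5 for the real implementation.

static struct virtio_scsi_vq *pick_request_queue(struct virtio_scsi *vscsi,
						 struct virtio_scsi_target_state *tgt)
{
	/* Requests already in flight for this target: keep using the
	 * same queue so that commands are not reordered. */
	if (atomic_inc_return(&tgt->reqs) > 1)
		return tgt->req_vq;

	/* Target was idle: switch to the queue of the current CPU
	 * (the driver expects one request queue per VCPU). */
	tgt->req_vq = &vscsi->req_vqs[smp_processor_id() % vscsi->num_queues];
	return tgt->req_vq;
}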
Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
The virtqueue_add_buf function has two limitations: 1) it requires the caller to provide all the buffers in a single call; 2) it does not support chained scatterlists: the buffers must be provided as an array of struct scatterlist; Because of these limitations, virtio-scsi has to copy each request into a scatterlist internal to the driver. It cannot just use the one that was prepared by the upper SCSI layers. This patch adds a different set of APIs for adding a buffer to a virtqueue. The new API lets you pass the buffers piecewise, wrapping multiple calls to virtqueue_add_sg between virtqueue_start_buf and virtqueue_end_buf. virtio-scsi can then call virtqueue_add_sg 3/4 times: for the request header, for the write buffer (if present), for the response header, and finally for the read buffer (again if present). It saves the copying and the related locking. Note that this API is not needed in virtio-blk, because it does all the work of the upper SCSI layers itself in the blk_map_rq_sg call. Then it simply hands the resulting scatterlist to virtqueue_add_buf. Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> --- v1->v2: new drivers/virtio/virtio_ring.c | 205 ++++++++++++++++++++++++++++++++++++++++++ include/linux/virtio.h | 21 ++++ 2 files changed, 226 insertions(+), 0 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index ffd7e7d..ccfa97c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -394,6 +394,211 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head) vq->vq.num_free++; } +/** + * virtqueue_start_buf - start building buffer for the other end + * @vq: the struct virtqueue we're talking about. + * @buf: a struct keeping the state of the buffer + * @data: the token identifying the buffer. + * @count: the number of buffers that will be added + * @count_sg: the number of sg lists that will be added + * @gfp: how to do memory allocations (if necessary). + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted), and that a successful call is + * followed by one or more calls to virtqueue_add_sg, and finally a call + * to virtqueue_end_buf. + * + * Returns zero or a negative error (ie. ENOSPC). + */ +int virtqueue_start_buf(struct virtqueue *_vq, + struct virtqueue_buf *buf, + void *data, + unsigned int count, + unsigned int count_sg, + gfp_t gfp) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + struct vring_desc *desc = NULL; + int head; + int ret = -ENOMEM; + + START_USE(vq); + + BUG_ON(data == NULL); + +#ifdef DEBUG + { + ktime_t now = ktime_get(); + + /* No kick or get, with .1 second between? Warn. */ + if (vq->last_add_time_valid) + WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time)) + > 100); + vq->last_add_time = now; + vq->last_add_time_valid = true; + } +#endif + + BUG_ON(count < count_sg); + BUG_ON(count_sg == 0); + + /* If the host supports indirect descriptor tables, and there is + * no space for direct buffers or there are multi-item scatterlists, + * go indirect. 
+ */ + head = vq->free_head; + if (vq->indirect && (count > count_sg || vq->vq.num_free < count)) { + if (vq->vq.num_free == 0) + goto no_space; + + desc = kmalloc(count * sizeof(struct vring_desc), gfp); + if (!desc) + goto error; + + /* We're about to use a buffer */ + vq->vq.num_free--; + + /* Use a single buffer which doesn't continue */ + vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT; + vq->vring.desc[head].addr = virt_to_phys(desc); + vq->vring.desc[head].len = count * sizeof(struct vring_desc); + + /* Update free pointer */ + vq->free_head = vq->vring.desc[head].next; + } + + /* Set token. */ + vq->data[head] = data; + + pr_debug("Started buffer head %i for %p\n", head, vq); + + buf->vq = _vq; + buf->indirect = desc; + buf->tail = NULL; + buf->head = head; + return 0; + +no_space: + ret = -ENOSPC; +error: + pr_debug("Can't add buf (%d) - count = %i, avail = %i\n", + ret, count, vq->vq.num_free); + END_USE(vq); + return ret; +} +EXPORT_SYMBOL_GPL(virtqueue_start_buf); + +/** + * virtqueue_add_sg - add sglist to buffer + * @buf: the struct that was passed to virtqueue_start_buf + * @sgl: the description of the buffer(s). + * @count: the number of items to process in sgl + * @dir: whether the sgl is read or written (DMA_TO_DEVICE/DMA_FROM_DEVICE only) + * + * Note that, unlike virtqueue_add_buf, this function follows chained + * scatterlists, and stops before the @count-th item if a scatterlist item + * has a marker. + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + */ +void virtqueue_add_sg(struct virtqueue_buf *buf, + struct scatterlist sgl[], + unsigned int count, + enum dma_data_direction dir) +{ + struct vring_virtqueue *vq = to_vvq(buf->vq); + unsigned int i, uninitialized_var(prev), n; + struct scatterlist *sg; + struct vring_desc *tail; + u32 flags; + +#ifdef DEBUG + BUG_ON(!vq->in_use); +#endif + + BUG_ON(dir != DMA_FROM_DEVICE && dir != DMA_TO_DEVICE); + BUG_ON(count == 0); + + flags = (dir == DMA_FROM_DEVICE ? VRING_DESC_F_WRITE : 0); + flags |= VRING_DESC_F_NEXT; + + /* If using indirect descriptor tables, fill in the buffers + * at buf->indirect. */ + if (buf->indirect != NULL) { + i = 0; + if (likely(buf->tail != NULL)) + i = buf->tail - buf->indirect + 1; + + for_each_sg(sgl, sg, count, n) { + tail = &buf->indirect[i]; + tail->flags = flags; + tail->addr = sg_phys(sg); + tail->len = sg->length; + tail->next = ++i; + } + } else { + BUG_ON(vq->vq.num_free < count); + + i = vq->free_head; + for_each_sg(sgl, sg, count, n) { + tail = &vq->vring.desc[i]; + tail->flags = flags; + tail->addr = sg_phys(sg); + tail->len = sg->length; + i = vq->vring.desc[i].next; + vq->vq.num_free--; + } + + vq->free_head = i; + } + buf->tail = tail; +} +EXPORT_SYMBOL_GPL(virtqueue_add_sg); + +/** + * virtqueue_end_buf - expose buffer to other end + * @buf: the struct that was passed to virtqueue_start_buf + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + */ +void virtqueue_end_buf(struct virtqueue_buf *buf) +{ + struct vring_virtqueue *vq = to_vvq(buf->vq); + unsigned int avail; + int head = buf->head; + struct vring_desc *tail = buf->tail; + +#ifdef DEBUG + BUG_ON(!vq->in_use); +#endif + BUG_ON(tail == NULL); + + /* The last one does not have the next flag set. */ + tail->flags &= ~VRING_DESC_F_NEXT; + + /* Put entry in available array (but don't update avail->idx until + * virtqueue_end_buf). 
*/ + avail = (vq->vring.avail->idx & (vq->vring.num-1)); + vq->vring.avail->ring[avail] = head; + + /* Descriptors and available array need to be set before we expose the + * new available array entries. */ + virtio_wmb(vq); + vq->vring.avail->idx++; + vq->num_added++; + + /* This is very unlikely, but theoretically possible. Kick + * just in case. */ + if (unlikely(vq->num_added == (1 << 16) - 1)) + virtqueue_kick(buf->vq); + + pr_debug("Added buffer head %i to %p\n", head, vq); + END_USE(vq); +} +EXPORT_SYMBOL_GPL(virtqueue_end_buf); + static inline bool more_used(const struct vring_virtqueue *vq) { return vq->last_used_idx != vq->vring.used->idx; diff --git a/include/linux/virtio.h b/include/linux/virtio.h index cf8adb1..39d56c4 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -7,6 +7,7 @@ #include <linux/spinlock.h> #include <linux/device.h> #include <linux/mod_devicetable.h> +#include <linux/dma-direction.h> #include <linux/gfp.h> /** @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq, void *data, gfp_t gfp); +struct virtqueue_buf { + struct virtqueue *vq; + struct vring_desc *indirect, *tail; + int head; +}; + +int virtqueue_start_buf(struct virtqueue *_vq, + struct virtqueue_buf *buf, + void *data, + unsigned int count, + unsigned int count_sg, + gfp_t gfp); + +void virtqueue_add_sg(struct virtqueue_buf *buf, + struct scatterlist sgl[], + unsigned int count, + enum dma_data_direction dir); + +void virtqueue_end_buf(struct virtqueue_buf *buf); + void virtqueue_kick(struct virtqueue *vq); bool virtqueue_kick_prepare(struct virtqueue *vq); -- 1.7.1
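
For illustration, a caller drives the new API roughly as follows.  This is a
sketch only: vq, token, req_hdr, resp_hdr and out_table are placeholder names,
and the data-in direction is omitted; the real user is virtio-scsi in the next
patch.

	struct virtqueue_buf vbuf;
	struct scatterlist hdr, status;
	int err;

	sg_init_one(&hdr, &req_hdr, sizeof(req_hdr));
	sg_init_one(&status, &resp_hdr, sizeof(resp_hdr));

	/* Reserve 2 + nents descriptors, laid out as 3 scatterlists. */
	err = virtqueue_start_buf(vq, &vbuf, token, 2 + out_table->nents, 3,
				  GFP_ATOMIC);
	if (err < 0)
		return err;

	/* Read-only request header. */
	virtqueue_add_sg(&vbuf, &hdr, 1, DMA_TO_DEVICE);
	/* Data-out payload, taken directly from the upper layer's sg table. */
	virtqueue_add_sg(&vbuf, out_table->sgl, out_table->nents, DMA_TO_DEVICE);
	/* Response header, written by the device. */
	virtqueue_add_sg(&vbuf, &status, 1, DMA_FROM_DEVICE);
	/* Update the available ring and expose the buffer to the device. */
	virtqueue_end_buf(&vbuf);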
Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
Using the new virtio_scsi_add_sg function lets us simplify the queueing path. In particular, all data protected by the tgt_lock is just gone (multiqueue will find a new use for the lock). The speedup is relatively small (2-4%) but it is worthwhile because of the code simplification---both in this patches and in the next ones. Signed-off-by: Wanlong Gao <gaowanlong at cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> --- v1->v2: new drivers/scsi/virtio_scsi.c | 94 +++++++++++++++++++------------------------ 1 files changed, 42 insertions(+), 52 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 74ab67a..2b93b6e 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -59,11 +59,8 @@ struct virtio_scsi_vq { /* Per-target queue state */ struct virtio_scsi_target_state { - /* Protects sg. Lock hierarchy is tgt_lock -> vq_lock. */ + /* Never held at the same time as vq_lock. */ spinlock_t tgt_lock; - - /* For sglist construction when adding commands to the virtqueue. */ - struct scatterlist sg[]; }; /* Driver instance state */ @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq) spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); }; -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx, - struct scsi_data_buffer *sdb) -{ - struct sg_table *table = &sdb->table; - struct scatterlist *sg_elem; - unsigned int idx = *p_idx; - int i; - - for_each_sg(table->sgl, sg_elem, table->nents, i) - sg[idx++] = *sg_elem; - - *p_idx = idx; -} - /** - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue * @vscsi : virtio_scsi state * @cmd : command structure - * @out_num : number of read-only elements - * @in_num : number of write-only elements * @req_size : size of the request buffer * @resp_size : size of the response buffer - * - * Called with tgt_lock held. + * @gfp : flags to use for memory allocations */ -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt, - struct virtio_scsi_cmd *cmd, - unsigned *out_num, unsigned *in_num, - size_t req_size, size_t resp_size) +static int virtscsi_add_cmd(struct virtqueue *vq, + struct virtio_scsi_cmd *cmd, + size_t req_size, size_t resp_size, gfp_t gfp) { struct scsi_cmnd *sc = cmd->sc; - struct scatterlist *sg = tgt->sg; - unsigned int idx = 0; + struct scatterlist sg; + unsigned int count, count_sg; + struct sg_table *out, *in; + struct virtqueue_buf buf; + int ret; + + out = in = NULL; + + if (sc && sc->sc_data_direction != DMA_NONE) { + if (sc->sc_data_direction != DMA_FROM_DEVICE) + out = &scsi_out(sc)->table; + if (sc->sc_data_direction != DMA_TO_DEVICE) + in = &scsi_in(sc)->table; + } + + count_sg = 2 + (out ? 1 : 0) + (in ? 1 : 0); + count = 2 + (out ? out->nents : 0) + (in ? in->nents : 0); + ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp); + if (ret < 0) + return ret; /* Request header. */ - sg_set_buf(&sg[idx++], &cmd->req, req_size); + sg_init_one(&sg, &cmd->req, req_size); + virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE); /* Data-out buffer. */ - if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) - virtscsi_map_sgl(sg, &idx, scsi_out(sc)); - - *out_num = idx; + if (out) + virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE); /* Response header. 
*/ - sg_set_buf(&sg[idx++], &cmd->resp, resp_size); + sg_init_one(&sg, &cmd->resp, resp_size); + virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE); /* Data-in buffer */ - if (sc && sc->sc_data_direction != DMA_TO_DEVICE) - virtscsi_map_sgl(sg, &idx, scsi_in(sc)); + if (in) + virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE); - *in_num = idx - *out_num; + virtqueue_end_buf(&buf); + return 0; } static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, struct virtio_scsi_cmd *cmd, size_t req_size, size_t resp_size, gfp_t gfp) { - unsigned int out_num, in_num; unsigned long flags; - int err; + int ret; bool needs_kick = false; - spin_lock_irqsave(&tgt->tgt_lock, flags); - virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size); - - spin_lock(&vq->vq_lock); - err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp); - spin_unlock(&tgt->tgt_lock); - if (!err) + spin_lock_irqsave(&vq->vq_lock, flags); + ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp); + if (!ret) needs_kick = virtqueue_kick_prepare(vq->vq); spin_unlock_irqrestore(&vq->vq_lock, flags); if (needs_kick) virtqueue_notify(vq->vq); - return err; + return ret; } static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt( gfp_t gfp_mask = GFP_KERNEL; /* We need extra sg elements at head and tail. */ - tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2), - gfp_mask); - + tgt = kmalloc(sizeof(*tgt), gfp_mask); if (!tgt) return NULL; spin_lock_init(&tgt->tgt_lock); - sg_init_table(tgt->sg, sg_elems + 2); return tgt; } -- 1.7.1
Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 3/5] virtio-scsi: redo allocation of target data
virtio_scsi_target_state is now empty, but we will find new uses for it in the next few patches. However, dropping the sglist lets us turn the array-of-pointers into a simple array, which simplifies the allocation. However, we do not leave the virtio_scsi_target_state structs in the flexible array member at the end of struct virtio_scsi, because we will place the virtqueues there in the next patches. Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> --- v1->v2: new drivers/scsi/virtio_scsi.c | 43 ++++++++++++++----------------------------- 1 files changed, 14 insertions(+), 29 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 2b93b6e..4a3abaf 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -74,7 +74,7 @@ struct virtio_scsi { /* Get some buffers ready for event vq */ struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN]; - struct virtio_scsi_target_state *tgt[]; + struct virtio_scsi_target_state *tgt; }; static struct kmem_cache *virtscsi_cmd_cache; @@ -426,7 +426,7 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) { struct virtio_scsi *vscsi = shost_priv(sh); - struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id]; + struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; struct virtio_scsi_cmd *cmd; int ret; @@ -474,7 +474,7 @@ out: static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd) { DECLARE_COMPLETION_ONSTACK(comp); - struct virtio_scsi_target_state *tgt = vscsi->tgt[cmd->sc->device->id]; + struct virtio_scsi_target_state *tgt = &vscsi->tgt[cmd->sc->device->id]; int ret = FAILED; cmd->comp = ∁ @@ -578,19 +578,9 @@ static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, virtscsi_vq->vq = vq; } -static struct virtio_scsi_target_state *virtscsi_alloc_tgt( - struct virtio_device *vdev, int sg_elems) +static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt) { - struct virtio_scsi_target_state *tgt; - gfp_t gfp_mask = GFP_KERNEL; - - /* We need extra sg elements at head and tail. */ - tgt = kmalloc(sizeof(*tgt), gfp_mask); - if (!tgt) - return NULL; - spin_lock_init(&tgt->tgt_lock); - return tgt; } static void virtscsi_scan(struct virtio_device *vdev) @@ -604,16 +594,12 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev) { struct Scsi_Host *sh = virtio_scsi_host(vdev); struct virtio_scsi *vscsi = shost_priv(sh); - u32 i, num_targets; /* Stop all the virtqueues. */ vdev->config->reset(vdev); - num_targets = sh->max_id; - for (i = 0; i < num_targets; i++) { - kfree(vscsi->tgt[i]); - vscsi->tgt[i] = NULL; - } + kfree(vscsi->tgt); + vscsi->tgt = NULL; vdev->config->del_vqs(vdev); } @@ -654,13 +640,14 @@ static int virtscsi_init(struct virtio_device *vdev, /* We need to know how many segments before we allocate. */ sg_elems = virtscsi_config_get(vdev, seg_max) ?: 1; - for (i = 0; i < num_targets; i++) { - vscsi->tgt[i] = virtscsi_alloc_tgt(vdev, sg_elems); - if (!vscsi->tgt[i]) { - err = -ENOMEM; - goto out; - } + vscsi->tgt = kmalloc(num_targets * sizeof(vscsi->tgt[0]), GFP_KERNEL); + if (!vscsi->tgt) { + err = -ENOMEM; + goto out; } + for (i = 0; i < num_targets; i++) + virtscsi_init_tgt(&vscsi->tgt[i]); + err = 0; out: @@ -679,9 +666,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev) /* Allocate memory and link the structs together. 
*/ num_targets = virtscsi_config_get(vdev, max_target) + 1; - shost = scsi_host_alloc(&virtscsi_host_template, - sizeof(*vscsi) - + num_targets * sizeof(struct virtio_scsi_target_state)); + shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi)); if (!shost) return -ENOMEM; -- 1.7.1
Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function
This will be needed soon in order to retrieve the per-target struct. Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> --- drivers/scsi/virtio_scsi.c | 17 +++++++++-------- 1 files changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 4a3abaf..4f6c6a3 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -104,7 +104,7 @@ static void virtscsi_compute_resid(struct scsi_cmnd *sc, u32 resid) * * Called with vq_lock held. */ -static void virtscsi_complete_cmd(void *buf) +static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_cmd *cmd = buf; struct scsi_cmnd *sc = cmd->sc; @@ -165,7 +165,8 @@ static void virtscsi_complete_cmd(void *buf) sc->scsi_done(sc); } -static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf)) +static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq, + void (*fn)(struct virtio_scsi *vscsi, void *buf)) { void *buf; unsigned int len; @@ -173,7 +174,7 @@ static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf)) do { virtqueue_disable_cb(vq); while ((buf = virtqueue_get_buf(vq, &len)) != NULL) - fn(buf); + fn(vscsi, buf); } while (!virtqueue_enable_cb(vq)); } @@ -184,11 +185,11 @@ static void virtscsi_req_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_cmd); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd); spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags); }; -static void virtscsi_complete_free(void *buf) +static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_cmd *cmd = buf; @@ -205,7 +206,7 @@ static void virtscsi_ctrl_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->ctrl_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_free); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_free); spin_unlock_irqrestore(&vscsi->ctrl_vq.vq_lock, flags); }; @@ -329,7 +330,7 @@ static void virtscsi_handle_event(struct work_struct *work) virtscsi_kick_event(vscsi, event_node); } -static void virtscsi_complete_event(void *buf) +static void virtscsi_complete_event(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_event_node *event_node = buf; @@ -344,7 +345,7 @@ static void virtscsi_event_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_event); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_event); spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); }; -- 1.7.1
Paolo Bonzini
2012-Dec-18 12:32 UTC
[PATCH v2 5/5] virtio-scsi: introduce multiqueue support
This patch adds queue steering to virtio-scsi. When a target is sent multiple requests, we always drive them to the same queue so that FIFO processing order is kept. However, if a target was idle, we can choose a queue arbitrarily. In this case the queue is chosen according to the current VCPU, so the driver expects the number of request queues to be equal to the number of VCPUs. This makes it easy and fast to select the queue, and also lets the driver optimize the IRQ affinity for the virtqueues (each virtqueue's affinity is set to the CPU that "owns" the queue). The speedup comes from improving cache locality and giving CPU affinity to the virtqueues, which is why this scheme was selected. Assuming that the thread that is sending requests to the device is I/O-bound, it is likely to be sleeping at the time the ISR is executed, and thus executing the ISR on the same processor that sent the requests is cheap. However, the kernel will not execute the ISR on the "best" processor unless you explicitly set the affinity. This is because in practice you will have many such I/O-bound processes and thus many otherwise idle processors. Then the kernel will execute the ISR on a random processor, rather than the one that is sending requests to the device. The alternative to per-CPU virtqueues is per-target virtqueues. To achieve the same locality, we could dynamically choose the virtqueue's affinity based on the CPU of the last task that sent a request. This is less appealing because we do not set the affinity directly---we only provide a hint to the irqbalanced running in userspace. Dynamically changing the affinity only works if the userspace applies the hint fast enough. Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> --- v1->v2: improved comments and commit messages, added memory barriers drivers/scsi/virtio_scsi.c | 234 +++++++++++++++++++++++++++++++++++++------ 1 files changed, 201 insertions(+), 33 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 4f6c6a3..ca9d29d 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -26,6 +26,7 @@ #define VIRTIO_SCSI_MEMPOOL_SZ 64 #define VIRTIO_SCSI_EVENT_LEN 8 +#define VIRTIO_SCSI_VQ_BASE 2 /* Command queue element */ struct virtio_scsi_cmd { @@ -57,24 +58,57 @@ struct virtio_scsi_vq { struct virtqueue *vq; }; -/* Per-target queue state */ +/* + * Per-target queue state. + * + * This struct holds the data needed by the queue steering policy. When a + * target is sent multiple requests, we need to drive them to the same queue so + * that FIFO processing order is kept. However, if a target was idle, we can + * choose a queue arbitrarily. In this case the queue is chosen according to + * the current VCPU, so the driver expects the number of request queues to be + * equal to the number of VCPUs. This makes it easy and fast to select the + * queue, and also lets the driver optimize the IRQ affinity for the virtqueues + * (each virtqueue's affinity is set to the CPU that "owns" the queue). + * + * An interesting effect of this policy is that only writes to req_vq need to + * take the tgt_lock. Read can be done outside the lock because: + * + * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1. + * In that case, no other CPU is reading req_vq: even if they were in + * virtscsi_queuecommand_multi, they would be spinning on tgt_lock. + * + * - reads of req_vq only occur when the target is not idle (reqs != 0). 
+ * A CPU that enters virtscsi_queuecommand_multi will not modify req_vq. + * + * Similarly, decrements of reqs are never concurrent with writes of req_vq. + * Thus they can happen outside the tgt_lock, provided of course we make reqs + * an atomic_t. + */ struct virtio_scsi_target_state { - /* Never held at the same time as vq_lock. */ + /* This spinlock ever held at the same time as vq_lock. */ spinlock_t tgt_lock; + + /* Count of outstanding requests. */ + atomic_t reqs; + + /* Currently active virtqueue for requests sent to this target. */ + struct virtio_scsi_vq *req_vq; }; /* Driver instance state */ struct virtio_scsi { struct virtio_device *vdev; - struct virtio_scsi_vq ctrl_vq; - struct virtio_scsi_vq event_vq; - struct virtio_scsi_vq req_vq; - /* Get some buffers ready for event vq */ struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN]; struct virtio_scsi_target_state *tgt; + + u32 num_queues; + + struct virtio_scsi_vq ctrl_vq; + struct virtio_scsi_vq event_vq; + struct virtio_scsi_vq req_vqs[]; }; static struct kmem_cache *virtscsi_cmd_cache; @@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf) struct virtio_scsi_cmd *cmd = buf; struct scsi_cmnd *sc = cmd->sc; struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd; + struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; dev_dbg(&sc->device->sdev_gendev, "cmd %p response %u status %#02x sense_len %u\n", @@ -163,6 +198,8 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf) mempool_free(cmd, virtscsi_cmd_pool); sc->scsi_done(sc); + + atomic_dec(&tgt->reqs); } static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq, @@ -182,11 +219,45 @@ static void virtscsi_req_done(struct virtqueue *vq) { struct Scsi_Host *sh = virtio_scsi_host(vq->vdev); struct virtio_scsi *vscsi = shost_priv(sh); + int index = vq->index - VIRTIO_SCSI_VQ_BASE; + struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index]; unsigned long flags; - spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags); + /* + * Read req_vq before decrementing the reqs field in + * virtscsi_complete_cmd. + * + * With barriers: + * + * CPU #0 virtscsi_queuecommand_multi (CPU #1) + * ------------------------------------------------------------ + * lock vq_lock + * read req_vq + * read reqs (reqs = 1) + * write reqs (reqs = 0) + * increment reqs (reqs = 1) + * write req_vq + * + * Possible reordering without barriers: + * + * CPU #0 virtscsi_queuecommand_multi (CPU #1) + * ------------------------------------------------------------ + * lock vq_lock + * read reqs (reqs = 1) + * write reqs (reqs = 0) + * increment reqs (reqs = 1) + * write req_vq + * read (wrong) req_vq + * + * We do not need a full smp_rmb, because req_vq is required to get + * to tgt->reqs: tgt is &vscsi->tgt[sc->device->id], where sc is stored + * in the virtqueue as the user token. 
+ */ + smp_read_barrier_depends(); + + spin_lock_irqsave(&req_vq->vq_lock, flags); virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd); - spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags); + spin_unlock_irqrestore(&req_vq->vq_lock, flags); }; static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf) @@ -424,11 +495,12 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, return ret; } -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) +static int virtscsi_queuecommand(struct virtio_scsi *vscsi, + struct virtio_scsi_target_state *tgt, + struct scsi_cmnd *sc) { - struct virtio_scsi *vscsi = shost_priv(sh); - struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; struct virtio_scsi_cmd *cmd; + struct virtio_scsi_vq *req_vq; int ret; struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev); @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE); memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len); - if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd, + req_vq = ACCESS_ONCE(tgt->req_vq); + if (virtscsi_kick_cmd(tgt, req_vq, cmd, sizeof cmd->req.cmd, sizeof cmd->resp.cmd, GFP_ATOMIC) == 0) ret = 0; @@ -472,6 +545,48 @@ out: return ret; } +static int virtscsi_queuecommand_single(struct Scsi_Host *sh, + struct scsi_cmnd *sc) +{ + struct virtio_scsi *vscsi = shost_priv(sh); + struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; + + atomic_inc(&tgt->reqs); + return virtscsi_queuecommand(vscsi, tgt, sc); +} + +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh, + struct scsi_cmnd *sc) +{ + struct virtio_scsi *vscsi = shost_priv(sh); + struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; + unsigned long flags; + u32 queue_num; + + /* + * Using an atomic_t for tgt->reqs lets the virtqueue handler + * decrement it without taking the spinlock. + * + * We still need a critical section to prevent concurrent submissions + * from picking two different req_vqs. + */ + spin_lock_irqsave(&tgt->tgt_lock, flags); + if (atomic_inc_return(&tgt->reqs) == 1) { + queue_num = smp_processor_id(); + while (unlikely(queue_num >= vscsi->num_queues)) + queue_num -= vscsi->num_queues; + + /* + * Write reqs before writing req_vq, matching the + * smp_read_barrier_depends() in virtscsi_req_done. 
+ */ + smp_wmb(); + tgt->req_vq = &vscsi->req_vqs[queue_num]; + } + spin_unlock_irqrestore(&tgt->tgt_lock, flags); + return virtscsi_queuecommand(vscsi, tgt, sc); +} + static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd) { DECLARE_COMPLETION_ONSTACK(comp); @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc) return virtscsi_tmf(vscsi, cmd); } -static struct scsi_host_template virtscsi_host_template = { +static struct scsi_host_template virtscsi_host_template_single = { .module = THIS_MODULE, .name = "Virtio SCSI HBA", .proc_name = "virtio_scsi", - .queuecommand = virtscsi_queuecommand, .this_id = -1, + .queuecommand = virtscsi_queuecommand_single, + .eh_abort_handler = virtscsi_abort, + .eh_device_reset_handler = virtscsi_device_reset, + + .can_queue = 1024, + .dma_boundary = UINT_MAX, + .use_clustering = ENABLE_CLUSTERING, +}; + +static struct scsi_host_template virtscsi_host_template_multi = { + .module = THIS_MODULE, + .name = "Virtio SCSI HBA", + .proc_name = "virtio_scsi", + .this_id = -1, + .queuecommand = virtscsi_queuecommand_multi, .eh_abort_handler = virtscsi_abort, .eh_device_reset_handler = virtscsi_device_reset, @@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = { &__val, sizeof(__val)); \ }) + static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, - struct virtqueue *vq) + struct virtqueue *vq, bool affinity) { spin_lock_init(&virtscsi_vq->vq_lock); virtscsi_vq->vq = vq; + if (affinity) + virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE); } -static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt) +static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i) { + struct virtio_scsi_target_state *tgt = &vscsi->tgt[i]; spin_lock_init(&tgt->tgt_lock); + atomic_set(&tgt->reqs, 0); + + /* + * The default is unused for multiqueue, but with a single queue + * or target we use it in virtscsi_queuecommand. + */ + tgt->req_vq = &vscsi->req_vqs[0]; } static void virtscsi_scan(struct virtio_device *vdev) @@ -609,28 +749,41 @@ static int virtscsi_init(struct virtio_device *vdev, struct virtio_scsi *vscsi, int num_targets) { int err; - struct virtqueue *vqs[3]; u32 i, sg_elems; + u32 num_vqs; + vq_callback_t **callbacks; + const char **names; + struct virtqueue **vqs; - vq_callback_t *callbacks[] = { - virtscsi_ctrl_done, - virtscsi_event_done, - virtscsi_req_done - }; - const char *names[] = { - "control", - "event", - "request" - }; + num_vqs = vscsi->num_queues + VIRTIO_SCSI_VQ_BASE; + vqs = kmalloc(num_vqs * sizeof(struct virtqueue *), GFP_KERNEL); + callbacks = kmalloc(num_vqs * sizeof(vq_callback_t *), GFP_KERNEL); + names = kmalloc(num_vqs * sizeof(char *), GFP_KERNEL); + + if (!callbacks || !vqs || !names) { + err = -ENOMEM; + goto out; + } + + callbacks[0] = virtscsi_ctrl_done; + callbacks[1] = virtscsi_event_done; + names[0] = "control"; + names[1] = "event"; + for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) { + callbacks[i] = virtscsi_req_done; + names[i] = "request"; + } /* Discover virtqueues and write information to configuration. 
*/ - err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names); + err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names); if (err) return err; - virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]); - virtscsi_init_vq(&vscsi->event_vq, vqs[1]); - virtscsi_init_vq(&vscsi->req_vq, vqs[2]); + virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false); + virtscsi_init_vq(&vscsi->event_vq, vqs[1], false); + for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) + virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE], + vqs[i], vscsi->num_queues > 1); virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE); virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE); @@ -647,11 +800,14 @@ static int virtscsi_init(struct virtio_device *vdev, goto out; } for (i = 0; i < num_targets; i++) - virtscsi_init_tgt(&vscsi->tgt[i]); + virtscsi_init_tgt(vscsi, i); err = 0; out: + kfree(names); + kfree(callbacks); + kfree(vqs); if (err) virtscsi_remove_vqs(vdev); return err; @@ -664,11 +820,22 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev) int err; u32 sg_elems, num_targets; u32 cmd_per_lun; + u32 num_queues; + struct scsi_host_template *hostt; + + /* We need to know how many queues before we allocate. */ + num_queues = virtscsi_config_get(vdev, num_queues) ?: 1; /* Allocate memory and link the structs together. */ num_targets = virtscsi_config_get(vdev, max_target) + 1; - shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi)); + if (num_queues == 1) + hostt = &virtscsi_host_template_single; + else + hostt = &virtscsi_host_template_multi; + + shost = scsi_host_alloc(hostt, + sizeof(*vscsi) + sizeof(vscsi->req_vqs[0]) * num_queues); if (!shost) return -ENOMEM; @@ -676,6 +843,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev) shost->sg_tablesize = sg_elems; vscsi = shost_priv(shost); vscsi->vdev = vdev; + vscsi->num_queues = num_queues; vdev->priv = shost; err = virtscsi_init(vdev, vscsi, num_targets); -- 1.7.1
Paolo Bonzini
2012-Dec-18 13:35 UTC
[PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
Il 18/12/2012 14:37, Michael S. Tsirkin ha scritto:> On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote: >> Using the new virtio_scsi_add_sg function lets us simplify the queueing >> path. In particular, all data protected by the tgt_lock is just gone >> (multiqueue will find a new use for the lock). > > vq access still needs some protection: virtio is not reentrant > by itself. with tgt_lock gone what protects vq against > concurrent add_buf calls?vq_lock. Paolo>> The speedup is relatively small (2-4%) but it is worthwhile because of >> the code simplification---both in this patches and in the next ones. >> >> Signed-off-by: Wanlong Gao <gaowanlong at cn.fujitsu.com> >> Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> >> --- >> v1->v2: new >> >> drivers/scsi/virtio_scsi.c | 94 +++++++++++++++++++------------------------ >> 1 files changed, 42 insertions(+), 52 deletions(-) >> >> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c >> index 74ab67a..2b93b6e 100644 >> --- a/drivers/scsi/virtio_scsi.c >> +++ b/drivers/scsi/virtio_scsi.c >> @@ -59,11 +59,8 @@ struct virtio_scsi_vq { >> >> /* Per-target queue state */ >> struct virtio_scsi_target_state { >> - /* Protects sg. Lock hierarchy is tgt_lock -> vq_lock. */ >> + /* Never held at the same time as vq_lock. */ >> spinlock_t tgt_lock; >> - >> - /* For sglist construction when adding commands to the virtqueue. */ >> - struct scatterlist sg[]; >> }; >> >> /* Driver instance state */ >> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq) >> spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); >> }; >> >> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx, >> - struct scsi_data_buffer *sdb) >> -{ >> - struct sg_table *table = &sdb->table; >> - struct scatterlist *sg_elem; >> - unsigned int idx = *p_idx; >> - int i; >> - >> - for_each_sg(table->sgl, sg_elem, table->nents, i) >> - sg[idx++] = *sg_elem; >> - >> - *p_idx = idx; >> -} >> - >> /** >> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist >> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue >> * @vscsi : virtio_scsi state >> * @cmd : command structure >> - * @out_num : number of read-only elements >> - * @in_num : number of write-only elements >> * @req_size : size of the request buffer >> * @resp_size : size of the response buffer >> - * >> - * Called with tgt_lock held. >> + * @gfp : flags to use for memory allocations >> */ >> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt, >> - struct virtio_scsi_cmd *cmd, >> - unsigned *out_num, unsigned *in_num, >> - size_t req_size, size_t resp_size) >> +static int virtscsi_add_cmd(struct virtqueue *vq, >> + struct virtio_scsi_cmd *cmd, >> + size_t req_size, size_t resp_size, gfp_t gfp) >> { >> struct scsi_cmnd *sc = cmd->sc; >> - struct scatterlist *sg = tgt->sg; >> - unsigned int idx = 0; >> + struct scatterlist sg; >> + unsigned int count, count_sg; >> + struct sg_table *out, *in; >> + struct virtqueue_buf buf; >> + int ret; >> + >> + out = in = NULL; >> + >> + if (sc && sc->sc_data_direction != DMA_NONE) { >> + if (sc->sc_data_direction != DMA_FROM_DEVICE) >> + out = &scsi_out(sc)->table; >> + if (sc->sc_data_direction != DMA_TO_DEVICE) >> + in = &scsi_in(sc)->table; >> + } >> + >> + count_sg = 2 + (out ? 1 : 0) + (in ? 1 : 0); >> + count = 2 + (out ? out->nents : 0) + (in ? in->nents : 0); >> + ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp); >> + if (ret < 0) >> + return ret; >> >> /* Request header. 
*/ >> - sg_set_buf(&sg[idx++], &cmd->req, req_size); >> + sg_init_one(&sg, &cmd->req, req_size); >> + virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE); >> >> /* Data-out buffer. */ >> - if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) >> - virtscsi_map_sgl(sg, &idx, scsi_out(sc)); >> - >> - *out_num = idx; >> + if (out) >> + virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE); >> >> /* Response header. */ >> - sg_set_buf(&sg[idx++], &cmd->resp, resp_size); >> + sg_init_one(&sg, &cmd->resp, resp_size); >> + virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE); >> >> /* Data-in buffer */ >> - if (sc && sc->sc_data_direction != DMA_TO_DEVICE) >> - virtscsi_map_sgl(sg, &idx, scsi_in(sc)); >> + if (in) >> + virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE); >> >> - *in_num = idx - *out_num; >> + virtqueue_end_buf(&buf); >> + return 0; >> } >> >> static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, >> @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, >> struct virtio_scsi_cmd *cmd, >> size_t req_size, size_t resp_size, gfp_t gfp) >> { >> - unsigned int out_num, in_num; >> unsigned long flags; >> - int err; >> + int ret; >> bool needs_kick = false; >> >> - spin_lock_irqsave(&tgt->tgt_lock, flags); >> - virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size); >> - >> - spin_lock(&vq->vq_lock); >> - err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp); >> - spin_unlock(&tgt->tgt_lock); >> - if (!err) >> + spin_lock_irqsave(&vq->vq_lock, flags); >> + ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp); >> + if (!ret) >> needs_kick = virtqueue_kick_prepare(vq->vq); >> >> spin_unlock_irqrestore(&vq->vq_lock, flags); >> >> if (needs_kick) >> virtqueue_notify(vq->vq); >> - return err; >> + return ret; >> } >> >> static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) >> @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt( >> gfp_t gfp_mask = GFP_KERNEL; >> >> /* We need extra sg elements at head and tail. */ >> - tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2), >> - gfp_mask); >> - >> + tgt = kmalloc(sizeof(*tgt), gfp_mask); >> if (!tgt) >> return NULL; >> >> spin_lock_init(&tgt->tgt_lock); >> - sg_init_table(tgt->sg, sg_elems + 2); >> return tgt; >> } >> >> -- >> 1.7.1 >>
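
Concretely, the vq_lock protection referred to above is the per-virtqueue
critical section in virtscsi_kick_cmd, visible in the quoted hunk and
reproduced here slightly abridged: every add_buf path now runs under vq_lock.

	spin_lock_irqsave(&vq->vq_lock, flags);
	ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp);
	if (!ret)
		needs_kick = virtqueue_kick_prepare(vq->vq);
	spin_unlock_irqrestore(&vq->vq_lock, flags);

	if (needs_kick)
		virtqueue_notify(vq->vq);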
Michael S. Tsirkin
2012-Dec-18 13:37 UTC
[PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote:> Using the new virtio_scsi_add_sg function lets us simplify the queueing > path. In particular, all data protected by the tgt_lock is just gone > (multiqueue will find a new use for the lock).vq access still needs some protection: virtio is not reentrant by itself. with tgt_lock gone what protects vq against concurrent add_buf calls?> The speedup is relatively small (2-4%) but it is worthwhile because of > the code simplification---both in this patches and in the next ones. > > Signed-off-by: Wanlong Gao <gaowanlong at cn.fujitsu.com> > Signed-off-by: Paolo Bonzini <pbonzini at redhat.com> > --- > v1->v2: new > > drivers/scsi/virtio_scsi.c | 94 +++++++++++++++++++------------------------ > 1 files changed, 42 insertions(+), 52 deletions(-) > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 74ab67a..2b93b6e 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -59,11 +59,8 @@ struct virtio_scsi_vq { > > /* Per-target queue state */ > struct virtio_scsi_target_state { > - /* Protects sg. Lock hierarchy is tgt_lock -> vq_lock. */ > + /* Never held at the same time as vq_lock. */ > spinlock_t tgt_lock; > - > - /* For sglist construction when adding commands to the virtqueue. */ > - struct scatterlist sg[]; > }; > > /* Driver instance state */ > @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq) > spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); > }; > > -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx, > - struct scsi_data_buffer *sdb) > -{ > - struct sg_table *table = &sdb->table; > - struct scatterlist *sg_elem; > - unsigned int idx = *p_idx; > - int i; > - > - for_each_sg(table->sgl, sg_elem, table->nents, i) > - sg[idx++] = *sg_elem; > - > - *p_idx = idx; > -} > - > /** > - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist > + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue > * @vscsi : virtio_scsi state > * @cmd : command structure > - * @out_num : number of read-only elements > - * @in_num : number of write-only elements > * @req_size : size of the request buffer > * @resp_size : size of the response buffer > - * > - * Called with tgt_lock held. > + * @gfp : flags to use for memory allocations > */ > -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt, > - struct virtio_scsi_cmd *cmd, > - unsigned *out_num, unsigned *in_num, > - size_t req_size, size_t resp_size) > +static int virtscsi_add_cmd(struct virtqueue *vq, > + struct virtio_scsi_cmd *cmd, > + size_t req_size, size_t resp_size, gfp_t gfp) > { > struct scsi_cmnd *sc = cmd->sc; > - struct scatterlist *sg = tgt->sg; > - unsigned int idx = 0; > + struct scatterlist sg; > + unsigned int count, count_sg; > + struct sg_table *out, *in; > + struct virtqueue_buf buf; > + int ret; > + > + out = in = NULL; > + > + if (sc && sc->sc_data_direction != DMA_NONE) { > + if (sc->sc_data_direction != DMA_FROM_DEVICE) > + out = &scsi_out(sc)->table; > + if (sc->sc_data_direction != DMA_TO_DEVICE) > + in = &scsi_in(sc)->table; > + } > + > + count_sg = 2 + (out ? 1 : 0) + (in ? 1 : 0); > + count = 2 + (out ? out->nents : 0) + (in ? in->nents : 0); > + ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp); > + if (ret < 0) > + return ret; > > /* Request header. */ > - sg_set_buf(&sg[idx++], &cmd->req, req_size); > + sg_init_one(&sg, &cmd->req, req_size); > + virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE); > > /* Data-out buffer. 
*/ > - if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) > - virtscsi_map_sgl(sg, &idx, scsi_out(sc)); > - > - *out_num = idx; > + if (out) > + virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE); > > /* Response header. */ > - sg_set_buf(&sg[idx++], &cmd->resp, resp_size); > + sg_init_one(&sg, &cmd->resp, resp_size); > + virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE); > > /* Data-in buffer */ > - if (sc && sc->sc_data_direction != DMA_TO_DEVICE) > - virtscsi_map_sgl(sg, &idx, scsi_in(sc)); > + if (in) > + virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE); > > - *in_num = idx - *out_num; > + virtqueue_end_buf(&buf); > + return 0; > } > > static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, > @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, > struct virtio_scsi_cmd *cmd, > size_t req_size, size_t resp_size, gfp_t gfp) > { > - unsigned int out_num, in_num; > unsigned long flags; > - int err; > + int ret; > bool needs_kick = false; > > - spin_lock_irqsave(&tgt->tgt_lock, flags); > - virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size); > - > - spin_lock(&vq->vq_lock); > - err = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp); > - spin_unlock(&tgt->tgt_lock); > - if (!err) > + spin_lock_irqsave(&vq->vq_lock, flags); > + ret = virtscsi_add_cmd(vq->vq, cmd, req_size, resp_size, gfp); > + if (!ret) > needs_kick = virtqueue_kick_prepare(vq->vq); > > spin_unlock_irqrestore(&vq->vq_lock, flags); > > if (needs_kick) > virtqueue_notify(vq->vq); > - return err; > + return ret; > } > > static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) > @@ -592,14 +585,11 @@ static struct virtio_scsi_target_state *virtscsi_alloc_tgt( > gfp_t gfp_mask = GFP_KERNEL; > > /* We need extra sg elements at head and tail. */ > - tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2), > - gfp_mask); > - > + tgt = kmalloc(sizeof(*tgt), gfp_mask); > if (!tgt) > return NULL; > > spin_lock_init(&tgt->tgt_lock); > - sg_init_table(tgt->sg, sg_elems + 2); > return tgt; > } > > -- > 1.7.1 >
Michael S. Tsirkin
2012-Dec-18 13:42 UTC
[PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
> Hi all,
>
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.
>
> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> by tmpfs on the host.  These numbers are with 1 LUN per target.
>
> FIO configuration
> -----------------
> [global]
> rw=read
> bsrange=4k-64k
> ioengine=libaio
> direct=1
> iodepth=4
> loops=20
>
> overall bandwidth (MB/s)
> ------------------------
>
> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>       1             540                 626                    599
>       2             795                 965                    925
>       4             997                1376                   1500
>       8            1136                2130                   2060
>      16            1440                2269                   2474
>      24            1408                2179                   2436
>      32            1515                1978                   2319
>
> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
> more VCPUs is very limited.)
>
> avg bandwidth per LUN (MB/s)
> ----------------------------
>
> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>       1             540                 626                    599
>       2             397                 482                    462
>       4             249                 344                    375
>       8             142                 266                    257
>      16              90                 141                    154
>      24              58                  90                    101
>      32              47                  61                     72

Could you please try and measure host CPU utilization?  Without this data
it is possible that your host is undersubscribed and you are drinking up
more host CPU.

Another thing to note is that ATM you might need to test with idle=poll on
the host, otherwise we have a strange interaction with power management
where reducing the overhead switches to a lower power state and gives you
worse IOPS.

> Patch 1 adds a new API for piecewise addition of buffers, which enables
> various simplifications in virtio-scsi (patches 2-3) and a small
> performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.
>
> I'm mostly looking for comments on the new API of patch 1 for inclusion
> into the 3.9 kernel.
>
> Thanks to Wanlong Gao for help rebasing and benchmarking these patches.
>
> Paolo Bonzini (5):
>   virtio: add functions for piecewise addition of buffers
>   virtio-scsi: use functions for piecewise composition of buffers
>   virtio-scsi: redo allocation of target data
>   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
>   virtio-scsi: introduce multiqueue support
>
>  drivers/scsi/virtio_scsi.c   | 374 +++++++++++++++++++++++++++++-------------
>  drivers/virtio/virtio_ring.c | 205 ++++++++++++++++++++++++
>  include/linux/virtio.h       |  21 +++
>  3 files changed, 485 insertions(+), 115 deletions(-)
Rolf Eike Beer
2012-Dec-18 22:18 UTC
[PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
Paolo Bonzini wrote:
> Hi all,
>
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target always
> use the same queue (so that commands are not reordered); queue switching
> occurs when the request being queued is the only one for the target.
> Also based on Jason's patches, the virtqueue affinity is set so that
> each CPU is associated to one virtqueue.
>
> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> by tmpfs on the host.  These numbers are with 1 LUN per target.
>
> FIO configuration
> -----------------
> [global]
> rw=read
> bsrange=4k-64k
> ioengine=libaio
> direct=1
> iodepth=4
> loops=20
>
> overall bandwidth (MB/s)
> ------------------------
>
> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>       1             540                 626                    599
>       2             795                 965                    925
>       4             997                1376                   1500
>       8            1136                2130                   2060
>      16            1440                2269                   2474
>      24            1408                2179                   2436
>      32            1515                1978                   2319
>
> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
> more VCPUs is very limited.)
>
> avg bandwidth per LUN (MB/s)
> ----------------------------
>
> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>       1             540                 626                    599
>       2             397                 482                    462
>       4             249                 344                    375
>       8             142                 266                    257
>      16              90                 141                    154
>      24              58                  90                    101
>      32              47                  61                     72

Is there an explanation why 8x8 is slower than 4x8 in both cases?  8x1 and 8x2
being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against
4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.

Eike
Paolo Bonzini
2012-Dec-19 08:52 UTC
[PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
Il 18/12/2012 23:18, Rolf Eike Beer ha scritto:
> Paolo Bonzini wrote:
>> Hi all,
>>
>> this series adds multiqueue support to the virtio-scsi driver, based
>> on Jason Wang's work on virtio-net.  It uses a simple queue steering
>> algorithm that expects one queue per CPU.  LUNs in the same target always
>> use the same queue (so that commands are not reordered); queue switching
>> occurs when the request being queued is the only one for the target.
>> Also based on Jason's patches, the virtqueue affinity is set so that
>> each CPU is associated to one virtqueue.
>>
>> I tested the patches with fio, using up to 32 virtio-scsi disks backed
>> by tmpfs on the host.  These numbers are with 1 LUN per target.
>>
>> FIO configuration
>> -----------------
>> [global]
>> rw=read
>> bsrange=4k-64k
>> ioengine=libaio
>> direct=1
>> iodepth=4
>> loops=20
>>
>> overall bandwidth (MB/s)
>> ------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>>       1             540                 626                    599
>>       2             795                 965                    925
>>       4             997                1376                   1500
>>       8            1136                2130                   2060
>>      16            1440                2269                   2474
>>      24            1408                2179                   2436
>>      32            1515                1978                   2319
>>
>> (These numbers for single-queue are with 4 VCPUs, but the impact of adding
>> more VCPUs is very limited.)
>>
>> avg bandwidth per LUN (MB/s)
>> ----------------------------
>>
>> # of targets    single-queue    multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>>       1             540                 626                    599
>>       2             397                 482                    462
>>       4             249                 344                    375
>>       8             142                 266                    257
>>      16              90                 141                    154
>>      24              58                  90                    101
>>      32              47                  61                     72
>
> Is there an explanation why 8x8 is slower than 4x8 in both cases?

Regarding the "in both cases" part, it's because the second table has the
same data as the first, but divided by the first column.

In general, the "strangenesses" you find are probably within statistical
noise or due to other effects such as host CPU utilization or contention
on the big QEMU lock.

Paolo

> 8x1 and 8x2
> being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against
> 4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.
>
> Eike
>
Stefan Hajnoczi
2012-Dec-19 11:27 UTC
[PATCH v2 5/5] virtio-scsi: introduce multiqueue support
On Tue, Dec 18, 2012 at 01:32:52PM +0100, Paolo Bonzini wrote:
>  struct virtio_scsi_target_state {
> -	/* Never held at the same time as vq_lock. */
> +	/* This spinlock ever held at the same time as vq_lock. */

s/ever/is never/
Wanlong Gao
2013-Jan-15 09:48 UTC
[PATCH 1/2] virtio-scsi: split out request queue set affinity function
These two patches are based on the multi-queue virtio-scsi patch set. We set cpu affinity when the num_queues equals to the number of VCPUs. Split out the set affinity function, this also fix the bug when CPU IDs are not consecutive. Signed-off-by: Wanlong Gao <gaowanlong at cn.fujitsu.com> --- drivers/scsi/virtio_scsi.c | 50 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 44 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 3641d5f..16b0ef2 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -106,6 +106,9 @@ struct virtio_scsi { u32 num_queues; + /* Does the affinity hint is set for virtqueues? */ + bool affinity_hint_set; + struct virtio_scsi_vq ctrl_vq; struct virtio_scsi_vq event_vq; struct virtio_scsi_vq req_vqs[]; @@ -701,14 +704,45 @@ static struct scsi_host_template virtscsi_host_template_multi = { &__val, sizeof(__val)); \ }) +static void virtscsi_set_affinity(struct virtio_scsi *vscsi, bool affinity) +{ + int i; + int cpu; + + /* In multiqueue mode, when the number of cpu is equal + * to the number of request queues, we let the qeueues + * to be private to one cpu by setting the affinity hint + * to eliminate the contention. + */ + if ((vscsi->num_queues == 1 || + vscsi->num_queues != num_online_cpus()) && affinity) { + if (vscsi->affinity_hint_set) + affinity = false; + else + return; + } + + if (affinity) { + i = 0; + for_each_online_cpu(cpu) { + virtqueue_set_affinity(vscsi->req_vqs[i].vq, cpu); + i++; + } + + vscsi->affinity_hint_set = true; + } else { + for (i = 0; i < vscsi->num_queues - VIRTIO_SCSI_VQ_BASE; i++) + virtqueue_set_affinity(vscsi->req_vqs[i].vq, -1); + + vscsi->affinity_hint_set = false; + } +} static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, - struct virtqueue *vq, bool affinity) + struct virtqueue *vq) { spin_lock_init(&virtscsi_vq->vq_lock); virtscsi_vq->vq = vq; - if (affinity) - virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE); } static void virtscsi_init_tgt(struct virtio_scsi *vscsi, int i) @@ -736,6 +770,8 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev) struct Scsi_Host *sh = virtio_scsi_host(vdev); struct virtio_scsi *vscsi = shost_priv(sh); + virtscsi_set_affinity(vscsi, false); + /* Stop all the virtqueues. */ vdev->config->reset(vdev); @@ -779,11 +815,13 @@ static int virtscsi_init(struct virtio_device *vdev, if (err) return err; - virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false); - virtscsi_init_vq(&vscsi->event_vq, vqs[1], false); + virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]); + virtscsi_init_vq(&vscsi->event_vq, vqs[1]); for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++) virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE], - vqs[i], vscsi->num_queues > 1); + vqs[i]); + + virtscsi_set_affinity(vscsi, true); virtscsi_config_set(vdev, cdb_size, VIRTIO_SCSI_CDB_SIZE); virtscsi_config_set(vdev, sense_size, VIRTIO_SCSI_SENSE_SIZE); -- 1.8.1
Wanlong Gao
2013-Jan-15 09:50 UTC
[PATCH 2/2] virtio-scsi: reset virtqueue affinity when doing cpu hotplug
Add hot cpu notifier to reset the request virtqueue affinity when doing cpu hotplug. Signed-off-by: Wanlong Gao <gaowanlong at cn.fujitsu.com> --- drivers/scsi/virtio_scsi.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 16b0ef2..a3b5f12 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -20,6 +20,7 @@ #include <linux/virtio_ids.h> #include <linux/virtio_config.h> #include <linux/virtio_scsi.h> +#include <linux/cpu.h> #include <scsi/scsi_host.h> #include <scsi/scsi_device.h> #include <scsi/scsi_cmnd.h> @@ -109,6 +110,9 @@ struct virtio_scsi { /* Does the affinity hint is set for virtqueues? */ bool affinity_hint_set; + /* CPU hotplug notifier */ + struct notifier_block nb; + struct virtio_scsi_vq ctrl_vq; struct virtio_scsi_vq event_vq; struct virtio_scsi_vq req_vqs[]; @@ -738,6 +742,23 @@ static void virtscsi_set_affinity(struct virtio_scsi *vscsi, bool affinity) } } +static int virtscsi_cpu_callback(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + struct virtio_scsi *vscsi = container_of(nfb, struct virtio_scsi, nb); + switch(action) { + case CPU_ONLINE: + case CPU_ONLINE_FROZEN: + case CPU_DEAD: + case CPU_DEAD_FROZEN: + virtscsi_set_affinity(vscsi, true); + break; + default: + break; + } + return NOTIFY_OK; +} + static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, struct virtqueue *vq) { @@ -888,6 +909,13 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev) if (err) goto virtscsi_init_failed; + vscsi->nb.notifier_call = &virtscsi_cpu_callback; + err = register_hotcpu_notifier(&vscsi->nb); + if (err) { + pr_err("virtio_scsi: registering cpu notifier failed\n"); + goto scsi_add_host_failed; + } + cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1; shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF; @@ -925,6 +953,8 @@ static void __devexit virtscsi_remove(struct virtio_device *vdev) scsi_remove_host(shost); + unregister_hotcpu_notifier(&vscsi->nb); + virtscsi_remove_vqs(vdev); scsi_host_put(shost); } -- 1.8.1