Stefan Hajnoczi
2012-Jun-01 09:13 UTC
[PATCH v3] virtio_blk: unlock vblk->lock during kick
Holding the vblk->lock across kick causes poor scalability in SMP
guests.  If one CPU is doing a virtqueue kick and another CPU touches the
vblk->lock it will have to spin until the virtqueue kick completes.

This patch reduces system% CPU utilization in SMP guests that are
running multithreaded I/O-bound workloads.  The improvements are small
but become more pronounced as iops and the number of vcpus increase.

Khoa Huynh <khoa at us.ibm.com> provided initial performance data that
indicates this optimization is worthwhile at high iops.

Asias He <asias at redhat.com> reports the following fio results:

Host: Linux 3.4.0+ #302 SMP x86_64 GNU/Linux
Guest: same as host kernel

Average of 3 runs:

with locked kick
 read  iops=119907.50 bw=59954.00 runt=35018.50 io=2048.00
 write iops=217187.00 bw=108594.00 runt=19312.00 io=2048.00
 read  iops=33948.00 bw=16974.50 runt=186820.50 io=3095.70
 write iops=35014.00 bw=17507.50 runt=181151.00 io=3095.70
 clat (usec) max=3484.10 avg=121085.38 stdev=174416.11 min=0.00
 clat (usec) max=3438.30 avg=59863.35 stdev=116607.69 min=0.00
 clat (usec) max=3745.65 avg=454501.30 stdev=332699.00 min=0.00
 clat (usec) max=4089.75 avg=442374.99 stdev=304874.62 min=0.00
 cpu sys=615.12 majf=24080.50 ctx=64253616.50 usr=68.08 minf=17907363.00
 cpu sys=1235.95 majf=23389.00 ctx=59788148.00 usr=98.34 minf=20020008.50
 cpu sys=764.96 majf=28414.00 ctx=848279274.00 usr=36.39 minf=19737254.00
 cpu sys=714.13 majf=21853.50 ctx=854608972.00 usr=33.56 minf=18256760.50

with unlocked kick
 read  iops=118559.00 bw=59279.66 runt=35400.66 io=2048.00
 write iops=227560.00 bw=113780.33 runt=18440.00 io=2048.00
 read  iops=34567.66 bw=17284.00 runt=183497.33 io=3095.70
 write iops=34589.33 bw=17295.00 runt=183355.00 io=3095.70
 clat (usec) max=3485.56 avg=121989.58 stdev=197355.15 min=0.00
 clat (usec) max=3222.33 avg=57784.11 stdev=141002.89 min=0.00
 clat (usec) max=4060.93 avg=447098.65 stdev=315734.33 min=0.00
 clat (usec) max=3656.30 avg=447281.70 stdev=314051.33 min=0.00
 cpu sys=683.78 majf=24501.33 ctx=64435364.66 usr=68.91 minf=17907893.33
 cpu sys=1218.24 majf=25000.33 ctx=60451475.00 usr=101.04 minf=19757720.00
 cpu sys=740.39 majf=24809.00 ctx=845290443.66 usr=37.25 minf=19349958.33
 cpu sys=723.63 majf=27597.33 ctx=850199927.33 usr=35.35 minf=19092343.00

FIO config file:

[global]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
group_reporting
norandommap
ioscheduler=noop
thread
bs=512
size=4MB
direct=1
filename=/dev/vdb
numjobs=256
ioengine=aio
iodepth=64
loops=3

Signed-off-by: Stefan Hajnoczi <stefanha at linux.vnet.ibm.com>
---
Other block drivers (cciss, rbd, nbd) use spin_unlock_irq() so I followed
that.  To me this seems wrong: blk_run_queue() uses spin_lock_irqsave() but
we enable irqs with spin_unlock_irq().  If the caller of blk_run_queue() had
irqs disabled and we enable them again this could be a problem, right?  Can
someone more familiar with kernel locking comment?

 drivers/block/virtio_blk.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 774c31d..d674977 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -199,8 +199,14 @@ static void do_virtblk_request(struct request_queue *q)
 		issued++;
 	}
 
-	if (issued)
-		virtqueue_kick(vblk->vq);
+	if (!issued)
+		return;
+
+	if (virtqueue_kick_prepare(vblk->vq)) {
+		spin_unlock_irq(vblk->disk->queue->queue_lock);
+		virtqueue_notify(vblk->vq);
+		spin_lock_irq(vblk->disk->queue->queue_lock);
+	}
 }
 
 /* return id (s/n) string for *disk to *id_str
-- 
1.7.10
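[Editor's note: for context on why the split API makes this possible,
virtqueue_kick() is, in essence, virtqueue_kick_prepare() followed by
virtqueue_notify().  The prepare step only publishes the avail ring updates
and decides whether the host needs to be poked, so it must stay under the
lock protecting the virtqueue; the notify step is the expensive vmexit and
touches no driver state that vblk->lock protects.  A minimal sketch of that
decomposition, assuming the 3.4-era API -- this is an illustration, not the
actual virtio_ring.c code:]

#include <linux/virtio.h>

/* Sketch only: what virtqueue_kick() amounts to, split into the two
 * halves the patch uses so that notify can run without vblk->lock. */
static void kick_split_sketch(struct virtqueue *vq)
{
	/* Cheap: update shared ring indices and check whether the host
	 * actually needs a notification.  Callers keep their lock held. */
	if (virtqueue_kick_prepare(vq))
		/* Expensive: trap to the hypervisor.  Safe to call with the
		 * driver lock dropped, which is exactly what the patch does. */
		virtqueue_notify(vq);
}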
On 06/01/2012 05:13 PM, Stefan Hajnoczi wrote:
> [...]
> Other block drivers (cciss, rbd, nbd) use spin_unlock_irq() so I followed
> that.  To me this seems wrong: blk_run_queue() uses spin_lock_irqsave() but
> we enable irqs with spin_unlock_irq().  If the caller of blk_run_queue() had
> irqs disabled and we enable them again this could be a problem, right?  Can
> someone more familiar with kernel locking comment?

blk_run_queue() is not used in our code path.  We use __blk_run_queue().
The code path is:

  generic_make_request() -> q->make_request_fn() -> blk_queue_bio()
    -> __blk_run_queue() -> q->request_fn() -> do_virtblk_request()

__blk_run_queue() is called with interrupts disabled and the queue lock
held.  In blk_queue_bio(), __blk_run_queue() is protected by
spin_lock_irq(q->queue_lock).

The locking in the block layer seems a bit confusing, e.g. block/blk-core.c
mixes use of the spin_lock_irq() and spin_lock_irqsave() pairs.

-- 
Asias
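[Editor's note: for readers who have not looked at the pre-multiqueue block
layer, here is a heavily abridged sketch of the path Asias describes.  The
names and structure below are an illustration; the real code in
block/blk-core.c also handles plugging, merging and error paths:]

#include <linux/blkdev.h>

/* Abridged illustration of the 3.4-era submission path: the queue lock is
 * taken with irqs disabled before __blk_run_queue(), and q->request_fn()
 * (do_virtblk_request() for virtio-blk) runs entirely under that lock. */
static void blk_queue_bio_sketch(struct request_queue *q, struct bio *bio)
{
	spin_lock_irq(q->queue_lock);	/* irqs off, q->queue_lock held */

	/* ... turn the bio into a request and add it to the queue ... */

	__blk_run_queue(q);		/* invokes q->request_fn(), i.e.
					 * do_virtblk_request(), still under
					 * q->queue_lock with irqs disabled */

	spin_unlock_irq(q->queue_lock);
}

[This is why spin_unlock_irq()/spin_lock_irq() in the patch is at least
consistent with how the lock was taken on this path, even though
blk_run_queue() itself uses the irqsave variant.]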
Michael S. Tsirkin
2012-Jun-04 11:11 UTC
[PATCH v3] virtio_blk: unlock vblk->lock during kick
On Fri, Jun 01, 2012 at 10:13:06AM +0100, Stefan Hajnoczi wrote:
> Holding the vblk->lock across kick causes poor scalability in SMP
> guests.  If one CPU is doing a virtqueue kick and another CPU touches the
> vblk->lock it will have to spin until the virtqueue kick completes.
> [...]
> @@ -199,8 +199,14 @@ static void do_virtblk_request(struct request_queue *q)
> 		issued++;
> 	}
> 
> -	if (issued)
> -		virtqueue_kick(vblk->vq);
> +	if (!issued)
> +		return;
> +
> +	if (virtqueue_kick_prepare(vblk->vq)) {
> +		spin_unlock_irq(vblk->disk->queue->queue_lock);
> +		virtqueue_notify(vblk->vq);

If blk_done runs and completes the request at this point, can hot unplug
then remove the queue?  If yes, will we get a use after free?

> +		spin_lock_irq(vblk->disk->queue->queue_lock);
> +	}
>  }
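[Editor's note: to make the window Michael is asking about concrete, here is
the patched hunk again with his scenario annotated in comments.  The
annotation illustrates the question being raised, not a confirmed bug:]

	if (virtqueue_kick_prepare(vblk->vq)) {
		spin_unlock_irq(vblk->disk->queue->queue_lock);
		/*
		 * Window: with the queue lock dropped and irqs enabled,
		 * another CPU can take the vring interrupt, complete every
		 * in-flight request, and -- in the hot-unplug scenario --
		 * begin tearing the device down while this CPU is still
		 * about to dereference vblk->vq below.
		 */
		virtqueue_notify(vblk->vq);
		spin_lock_irq(vblk->disk->queue->queue_lock);
	}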
Michael S. Tsirkin
2012-Jun-04 11:15 UTC
[PATCH v3] virtio_blk: unlock vblk->lock during kick
On Fri, Jun 01, 2012 at 10:13:06AM +0100, Stefan Hajnoczi wrote:
> Other block drivers (cciss, rbd, nbd) use spin_unlock_irq() so I followed
> that.  To me this seems wrong: blk_run_queue() uses spin_lock_irqsave() but
> we enable irqs with spin_unlock_irq().  If the caller of blk_run_queue() had
> irqs disabled and we enable them again this could be a problem, right?  Can
> someone more familiar with kernel locking comment?

Why take the risk?  What's the advantage of enabling them here?  The vcpu is
not running while the hypervisor is processing the notification anyway.  And
the next line returns from the function, so interrupts will get enabled
shortly in any case.
Stefan Hajnoczi <stefanha at linux.vnet.ibm.com> wrote on 06/01/2012 04:13:06 AM:
> [...]
> Khoa Huynh <khoa at us.ibm.com> provided initial performance data that
> indicates this optimization is worthwhile at high iops.

Hi Stefan,

I was out on vacation last week, so I could not post the details of my
performance testing with this unlocked-kick patch.  Anyway, here are more
details:

Test environment:
- Host: IBM x3850 X5 server (40 cores, 256GB) running RHEL6.2
- Guest: 20 vcpus, 16GB, running a Linux 3.4 kernel (with and without this patch)
- Storage: 12 block devices (from 3 SCSI target servers)
- Workload: FIO benchmark with 8KB random reads and writes (50% split)

With RHEL6.2 qemu-kvm:
- Locked kick   = 44,529 IOPS
- Unlocked kick = 44,869 IOPS  ==> 0.7% gain

With Stefan's "data-plane" qemu-kvm (experimental qemu that gets around the
qemu mutex):
- Locked kick   = 497,943 IOPS
- Unlocked kick = 532,446 IOPS ==> 6.9% gain

I also tested on a smaller host and guest (4 vcpus, 4GB, 4 virtual block
devices) and applied some performance tuning (e.g. smaller data sets so most
data fit into the host's cache, 1KB reads) to deliver the highest I/O rates
possible to the guest with the existing RHEL6.2 qemu-kvm:
- Locked kick   = 140,533 IOPS
- Unlocked kick = 143,415 IOPS ==> 2.1% gain

As you can see, the higher the I/O rate, the more performance benefit we get
from Stefan's unlocked-kick patch.  In my testing, the highest I/O rate I
could achieve with the existing KVM/QEMU was ~140,000+ IOPS, so the most we
could gain from this patch in that configuration was about 2-3%.  Going
forward, as we relieve the qemu mutex more and more, we should be able to
reach much higher I/O rates (500,000 IOPS or higher), and the benefit of the
unlocked-kick patch should become correspondingly more apparent.

If you need further information, please let me know.
Thanks,
-Khoa