Christian Borntraeger
2017-Nov-21 10:14 UTC
4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
On 11/21/2017 10:50 AM, Christian Borntraeger wrote:> > > On 11/21/2017 09:35 AM, Christian Borntraeger wrote: >> >> >> On 11/20/2017 09:52 PM, Jens Axboe wrote: >>> On 11/20/2017 01:49 PM, Christian Borntraeger wrote: >>>> >>>> >>>> On 11/20/2017 08:42 PM, Jens Axboe wrote: >>>>> On 11/20/2017 12:29 PM, Christian Borntraeger wrote: >>>>>> >>>>>> >>>>>> On 11/20/2017 08:20 PM, Bart Van Assche wrote: >>>>>>> On Fri, 2017-11-17 at 15:42 +0100, Christian Borntraeger wrote: >>>>>>>> This is >>>>>>>> >>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1141) * are mapped to it. >>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1142) */ >>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1143) WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) && >>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1144) cpu_online(hctx->next_cpu)); >>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1145) >>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1146) /* >>>>>>> >>>>>>> Did you really try to figure out when the code that reported the warning >>>>>>> was introduced? I think that warning was introduced through the following >>>>>>> commit: >>>>>> >>>>>> This was more a cut'n'paste to show which warning triggered since line numbers are somewhat volatile. >>>>>> >>>>>>> >>>>>>> commit fd1270d5df6a005e1248e87042159a799cc4b2c9 >>>>>>> Date: Wed Apr 16 09:23:48 2014 -0600 >>>>>>> >>>>>>> blk-mq: don't use preempt_count() to check for right CPU >>>>>>> >>>>>>> UP or CONFIG_PREEMPT_NONE will return 0, and what we really >>>>>>> want to check is whether or not we are on the right CPU. >>>>>>> So don't make PREEMPT part of this, just test the CPU in >>>>>>> the mask directly. >>>>>>> >>>>>>> Anyway, I think that warning is appropriate and useful. So the next step >>>>>>> is to figure out what work item was involved and why that work item got >>>>>>> executed on the wrong CPU. >>>>>> >>>>>> It seems to be related to virtio-blk (is triggered by fio on such disks). Your comment basically >>>>>> says: "no this is not a known issue" then :-) >>>>>> I will try to take a dump to find out the work item >>>>> >>>>> blk-mq does not attempt to freeze/sync existing work if a CPU goes away, >>>>> and we reconfigure the mappings. So I don't think the above is unexpected, >>>>> if you are doing CPU hot unplug while running a fio job. >>>> >>>> I did a cpu hot plug (adding a CPU) and I started fio AFTER that. >>> >>> OK, that's different, we should not be triggering a warning for that. >>> What does your machine/virtblk topology look like in terms of CPUS, >>> nr of queues for virtblk, etc? >> >> FWIW, 4.11 does work, 4.12 and later is broken. > > In fact: 4.12 is fine, 4.12.14 is broken.Bisect points to 1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1 Author: Christoph Hellwig <hch at lst.de> Date: Mon Jun 26 12:20:57 2017 +0200 blk-mq: Create hctx for each present CPU commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. Currently we only create hctx for online CPUs, which can lead to a lot of churn due to frequent soft offline / online operations. Instead allocate one for each present CPU to avoid this and dramatically simplify the code. Signed-off-by: Christoph Hellwig <hch at lst.de> Reviewed-by: Jens Axboe <axboe at kernel.dk> Cc: Keith Busch <keith.busch at intel.com> Cc: linux-block at vger.kernel.org Cc: linux-nvme at lists.infradead.org Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch at lst.de Signed-off-by: Thomas Gleixner <tglx at linutronix.de> Cc: Oleksandr Natalenko <oleksandr at natalenko.name> Cc: Mike Galbraith <efault at gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org> :040000 040000 a61cb023014a7b7a6b9f24ea04fe8ab22299e706 059ba6dc3290c74e0468937348e580cd53f963e7 M block :040000 040000 432e719d7e738ffcddfb8fc964544d3b3e0a68f7 f4572aa21b249a851a1b604c148eea109e93b30d M include adding Christoph FWIW, your patch triggers the following on 4.14 when doing a cpu hotplug (adding a CPU) and then accessing a virtio-blk device. 747.652408] ------------[ cut here ]------------ [ 747.652410] WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 __blk_mq_run_hw_queue+0xd4/0x100 [ 747.652410] Modules linked in: dm_multipath [ 747.652412] CPU: 4 PID: 2895 Comm: kworker/4:1H Tainted: G W 4.14.0+ #191 [ 747.652412] Hardware name: IBM 2964 NC9 704 (KVM/Linux) [ 747.652414] Workqueue: kblockd blk_mq_run_work_fn [ 747.652414] task: 0000000060680000 task.stack: 000000005ea30000 [ 747.652415] Krnl PSW : 0704f00180000000 0000000000505864 (__blk_mq_run_hw_queue+0xd4/0x100) [ 747.652417] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [ 747.652417] Krnl GPRS: 0000000000000010 00000000000000ff 000000005cbec400 0000000000000000 [ 747.652418] 0000000063709120 0000000000000000 0000000063709500 0000000059fa44b0 [ 747.652418] 0000000059fa4480 0000000000000000 000000006370f700 0000000063709100 [ 747.652419] 000000005cbec500 0000000000970948 000000005ea33d80 000000005ea33d48 [ 747.652423] Krnl Code: 0000000000505854: ebaff0a00004 lmg %r10,%r15,160(%r15) 000000000050585a: c0f4ffe690d3 brcl 15,1d7a00 #0000000000505860: a7f40001 brc 15,505862 >0000000000505864: 581003b0 l %r1,944 0000000000505868: c01b001fff00 nilf %r1,2096896 000000000050586e: a784ffdb brc 8,505824 0000000000505872: a7f40001 brc 15,505874 0000000000505876: 9120218f tm 399(%r2),32 [ 747.652435] Call Trace: [ 747.652435] ([<0000000063709600>] 0x63709600) [ 747.652436] [<0000000000187bcc>] process_one_work+0x264/0x4b8 [ 747.652438] [<0000000000187e78>] worker_thread+0x58/0x4f8 [ 747.652439] [<000000000018ee94>] kthread+0x144/0x168 [ 747.652439] [<00000000008f8a62>] kernel_thread_starter+0x6/0xc [ 747.652440] [<00000000008f8a5c>] kernel_thread_starter+0x0/0xc [ 747.652440] Last Breaking-Event-Address: [ 747.652441] [<0000000000505860>] __blk_mq_run_hw_queue+0xd0/0x100 [ 747.652442] ---[ end trace 4a001a80379b18ba ]--- [ 747.652450] ------------[ cut here ]------------
Jens Axboe
2017-Nov-21 17:27 UTC
4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
On 11/21/2017 03:14 AM, Christian Borntraeger wrote:> Bisect points to > > 1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit > commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1 > Author: Christoph Hellwig <hch at lst.de> > Date: Mon Jun 26 12:20:57 2017 +0200 > > blk-mq: Create hctx for each present CPU > > commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. > > Currently we only create hctx for online CPUs, which can lead to a lot > of churn due to frequent soft offline / online operations. Instead > allocate one for each present CPU to avoid this and dramatically simplify > the code. > > Signed-off-by: Christoph Hellwig <hch at lst.de> > Reviewed-by: Jens Axboe <axboe at kernel.dk> > Cc: Keith Busch <keith.busch at intel.com> > Cc: linux-block at vger.kernel.org > Cc: linux-nvme at lists.infradead.org > Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch at lst.de > Signed-off-by: Thomas Gleixner <tglx at linutronix.de> > Cc: Oleksandr Natalenko <oleksandr at natalenko.name> > Cc: Mike Galbraith <efault at gmx.de> > Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>I wonder if we're simply not getting the masks updated correctly. I'll take a look. -- Jens Axboe
Jens Axboe
2017-Nov-21 18:09 UTC
4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
On 11/21/2017 10:27 AM, Jens Axboe wrote:> On 11/21/2017 03:14 AM, Christian Borntraeger wrote: >> Bisect points to >> >> 1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit >> commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1 >> Author: Christoph Hellwig <hch at lst.de> >> Date: Mon Jun 26 12:20:57 2017 +0200 >> >> blk-mq: Create hctx for each present CPU >> >> commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream. >> >> Currently we only create hctx for online CPUs, which can lead to a lot >> of churn due to frequent soft offline / online operations. Instead >> allocate one for each present CPU to avoid this and dramatically simplify >> the code. >> >> Signed-off-by: Christoph Hellwig <hch at lst.de> >> Reviewed-by: Jens Axboe <axboe at kernel.dk> >> Cc: Keith Busch <keith.busch at intel.com> >> Cc: linux-block at vger.kernel.org >> Cc: linux-nvme at lists.infradead.org >> Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch at lst.de >> Signed-off-by: Thomas Gleixner <tglx at linutronix.de> >> Cc: Oleksandr Natalenko <oleksandr at natalenko.name> >> Cc: Mike Galbraith <efault at gmx.de> >> Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org> > > I wonder if we're simply not getting the masks updated correctly. I'll > take a look.Can't make it trigger here. We do init for each present CPU, which means that if I offline a few CPUs here and register a queue, those still show up as present (just offline) and get mapped accordingly.>From the looks of it, your setup is different. If the CPU doesn't showup as present and it gets hotplugged, then I can see how this condition would trigger. What environment are you running this in? We might have to re-introduce the cpu hotplug notifier, right now we just monitor for a dead cpu and handle that. -- Jens Axboe
Possibly Parallel Threads
- 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
- 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
- 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
- 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
- 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)