thr3ads.net - Virtualization - [PATCH net] virtio-net: suppress bad irq warning for tx napi [Feb 2021]


Michael S. Tsirkin

2021-Feb-03 10:38 UTC

[PATCH net] virtio-net: suppress bad irq warning for tx napi

On Tue, Feb 02, 2021 at 07:06:53PM -0500, Willem de Bruijn wrote:
> On Tue, Feb 2, 2021 at 6:53 PM Willem de Bruijn <willemb at google.com> wrote:
> >
> > On Tue, Feb 2, 2021 at 6:47 PM Wei Wang <weiwan at google.com> wrote:
> > >
> > > On Tue, Feb 2, 2021 at 3:12 PM Michael S. Tsirkin <mst at redhat.com> wrote:
> > > >
> > > > On Thu, Jan 28, 2021 at 04:21:36PM -0800, Wei Wang wrote:
> > > > > With the implementation of napi-tx in the virtio driver, we clean tx
> > > > > descriptors from the rx napi handler, to reduce the number of tx
> > > > > completion interrupts. But this can introduce a race where a tx
> > > > > completion interrupt has been raised, but the handler finds there is
> > > > > no work to do because the work was already done in the previous rx
> > > > > interrupt handler. This can lead to the following warning message:
> > > > > [ 3588.010778] irq 38: nobody cared (try booting with the "irqpoll" option)
> > > > > [ 3588.017938] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.3.0-19-generic #20~18.04.2-Ubuntu
> > > > > [ 3588.017940] Call Trace:
> > > > > [ 3588.017942]  <IRQ>
> > > > > [ 3588.017951]  dump_stack+0x63/0x85
> > > > > [ 3588.017953]  __report_bad_irq+0x35/0xc0
> > > > > [ 3588.017955]  note_interrupt+0x24b/0x2a0
> > > > > [ 3588.017956]  handle_irq_event_percpu+0x54/0x80
> > > > > [ 3588.017957]  handle_irq_event+0x3b/0x60
> > > > > [ 3588.017958]  handle_edge_irq+0x83/0x1a0
> > > > > [ 3588.017961]  handle_irq+0x20/0x30
> > > > > [ 3588.017964]  do_IRQ+0x50/0xe0
> > > > > [ 3588.017966]  common_interrupt+0xf/0xf
> > > > > [ 3588.017966]  </IRQ>
> > > > > [ 3588.017989] handlers:
> > > > > [ 3588.020374] [<000000001b9f1da8>] vring_interrupt
> > > > > [ 3588.025099] Disabling IRQ #38
> > > > >
> > > > > This patch adds a new param to struct vring_virtqueue, and we set it
> > > > > for tx virtqueues if napi-tx is enabled, to suppress the warning in
> > > > > that case.
> > > > >
> > > > > Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
> > > > > Reported-by: Rick Jones <jonesrick at google.com>
> > > > > Signed-off-by: Wei Wang <weiwan at google.com>
> > > > > Signed-off-by: Willem de Bruijn <willemb at google.com>
> > > >
> > > >
> > > > This description does not make sense to me.
> > > >
> > > > irq X: nobody cared
> > > > only triggers after an interrupt is unhandled repeatedly.
> > > >
> > > > So something causes a storm of useless tx interrupts here.
> > > >
> > > > Let's find out what it was please. What you are doing is
> > > > just preventing linux from complaining.
> > >
> > > The traffic that causes this warning is a netperf tcp_stream with at
> > > least 128 flows between 2 hosts. The warning gets triggered on the
> > > receiving host, which has a lot of rx interrupts firing on all queues,
> > > and a few tx interrupts.
> > > I think the scenario is: when the tx interrupt fires, it gets
> > > coalesced with the rx interrupt. Basically, the rx and tx interrupts
> > > get triggered very close to each other, and get handled in one round
> > > of do_IRQ(). The rx irq handler gets called first, which calls
> > > virtnet_poll(). However, virtnet_poll() calls virtnet_poll_cleantx()
> > > to try to do the work on the corresponding tx queue as well. That's
> > > why, when the tx interrupt handler gets called, it sees no work to do.
> > > The reason for the rx handler to handle the tx work is here:
> > > https://lists.linuxfoundation.org/pipermail/virtualization/2017-April/034740.html
> >
> > Indeed. It's not necessarily a storm. The warning occurs after one
> > hundred such events since boot, which is a small number compared to
> > real interrupt load.
> 
> Sorry, this is wrong. It is the other call to __report_bad_irq from
> note_interrupt that applies here.
> 
> > Occasionally seeing an interrupt with no work is expected after
> > 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi"). As
> > long as this rate of events is very low compared to useful interrupts,
> > and total interrupt count is greatly reduced vs not having work
> > stealing, it is a net win.

Right, but if 99900 out of 100000 interrupts were wasted, then it is
surely an even greater win to disable interrupts while polling like
this. Might be tricky to detect: disabling/enabling aggressively every
time, even if there's nothing in the queue, is sure to cause lots of cache
line bounces, and we don't want to enable callbacks if they were not
enabled e.g. by start_xmit ... Some kind of counter?


-- 
MST

Willem de Bruijn

2021-Feb-03 18:24 UTC


[PATCH net] virtio-net: suppress bad irq warning for tx napi

On Wed, Feb 3, 2021 at 5:42 AM Michael S. Tsirkin <mst at redhat.com> wrote:
> Right, but if 99900 out of 100000 interrupts were wasted, then it is
> surely an even greater win to disable interrupts while polling like
> this. Might be tricky to detect: disabling/enabling aggressively every
> time, even if there's nothing in the queue, is sure to cause lots of cache
> line bounces, and we don't want to enable callbacks if they were not
> enabled e.g. by start_xmit ... Some kind of counter?

Yes. It was known that the work stealing is more effective in some
workloads than others. But a 99% spurious rate I had not anticipated.

Most interesting is the number of interrupts suppressed as a result of
the feature. That is not captured by this statistic.

In any case, we'll take a step back to better understand the behavior,
and especially why this workload with many concurrent flows exhibits
such a high spurious rate.
