Christian Borntraeger
2020-Jan-07 11:34 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On 07.01.20 10:39, Michael S. Tsirkin wrote:> On Tue, Jan 07, 2020 at 09:59:16AM +0100, Christian Borntraeger wrote: >> >> >> On 06.01.20 11:50, Michael S. Tsirkin wrote: >>> On Wed, Dec 18, 2019 at 04:59:02PM +0100, Christian Borntraeger wrote: >>>> On 18.12.19 16:10, Michael S. Tsirkin wrote: >>>>> On Wed, Dec 18, 2019 at 03:43:43PM +0100, Christian Borntraeger wrote: >>>>>> Michael, >>>>>> >>>>>> with >>>>>> commit db7286b100b503ef80612884453bed53d74c9a16 (refs/bisect/skip-db7286b100b503ef80612884453bed53d74c9a16) >>>>>> vhost: use batched version by default >>>>>> plus >>>>>> commit 6bd262d5eafcdf8cdfae491e2e748e4e434dcda6 (HEAD, refs/bisect/bad) >>>>>> Revert "vhost/net: add an option to test new code" >>>>>> to make things compile (your next tree is not easily bisectable, can you fix that as well?). >>>>> >>>>> I'll try. >>>>> >>>>>> >>>>>> I get random crashes in my s390 KVM guests after reboot. >>>>>> Reverting both patches together with commit decd9b8 "vhost: use vhost_desc instead of vhost_log" to >>>>>> make it compile again) on top of linux-next-1218 makes the problem go away. >>>>>> >>>>>> Looks like the batched version is not yet ready for prime time. Can you drop these patches until >>>>>> we have fixed the issues? >>>>>> >>>>>> Christian >>>>>> >>>>> >>>>> Will do, thanks for letting me know. >>>> >>>> I have confirmed with the initial reporter (internal test team) that <driver name='qemu'/> >>>> with a known to be broken linux next kernel also fixes the problem, so it is really the >>>> vhost changes. >>> >>> OK I'm back and trying to make it more bisectable. >>> >>> I pushed a new tag "batch-v2". >>> It's same code but with this bisect should get more information. >> >> I get the following with this tag >> >> drivers/vhost/net.c: In function ?vhost_net_tx_get_vq_desc?: >> drivers/vhost/net.c:574:7: error: implicit declaration of function ?vhost_get_vq_desc_batch?; did you mean ?vhost_get_vq_desc?? [-Werror=implicit-function-declaration] >> 574 | r = vhost_get_vq_desc_batch(tvq, tvq->iov, ARRAY_SIZE(tvq->iov), >> | ^~~~~~~~~~~~~~~~~~~~~~~ >> | vhost_get_vq_desc >> > > Not sure why but I pushed a wrong commit. Sorry. Should be good now. >during bisect: drivers/vhost/vhost.c: In function ?vhost_get_vq_desc_batch?: drivers/vhost/vhost.c:2634:8: error: ?id? undeclared (first use in this function); did you mean ?i?? 2634 | ret = id; | ^~ | i I changed that to i The last step then gave me (on commit 50297a8480b439efc5f3f23088cb2d90b799acef vhost: use batched version by default) net enc1: Unexpected TXQ (0) queue failure: -5 in the guest. bisect log so far: [cborntra at m83lp52 linux]$ git bisect log git bisect start # bad: [3131e79bb9e9892a5a6bd33513de9bc90b20e867] vhost: use vhost_desc instead of vhost_log git bisect bad 3131e79bb9e9892a5a6bd33513de9bc90b20e867 # good: [d1281e3a562ec6a08f944a876481dd043ba739b9] virtio-blk: remove VIRTIO_BLK_F_SCSI support git bisect good d1281e3a562ec6a08f944a876481dd043ba739b9 # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 # bad: [1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79] vhost: batching fetches git bisect bad 1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79
Michael S. Tsirkin
2020-Jan-07 11:47 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On Tue, Jan 07, 2020 at 12:34:50PM +0100, Christian Borntraeger wrote:> > > On 07.01.20 10:39, Michael S. Tsirkin wrote: > > On Tue, Jan 07, 2020 at 09:59:16AM +0100, Christian Borntraeger wrote: > >> > >> > >> On 06.01.20 11:50, Michael S. Tsirkin wrote: > >>> On Wed, Dec 18, 2019 at 04:59:02PM +0100, Christian Borntraeger wrote: > >>>> On 18.12.19 16:10, Michael S. Tsirkin wrote: > >>>>> On Wed, Dec 18, 2019 at 03:43:43PM +0100, Christian Borntraeger wrote: > >>>>>> Michael, > >>>>>> > >>>>>> with > >>>>>> commit db7286b100b503ef80612884453bed53d74c9a16 (refs/bisect/skip-db7286b100b503ef80612884453bed53d74c9a16) > >>>>>> vhost: use batched version by default > >>>>>> plus > >>>>>> commit 6bd262d5eafcdf8cdfae491e2e748e4e434dcda6 (HEAD, refs/bisect/bad) > >>>>>> Revert "vhost/net: add an option to test new code" > >>>>>> to make things compile (your next tree is not easily bisectable, can you fix that as well?). > >>>>> > >>>>> I'll try. > >>>>> > >>>>>> > >>>>>> I get random crashes in my s390 KVM guests after reboot. > >>>>>> Reverting both patches together with commit decd9b8 "vhost: use vhost_desc instead of vhost_log" to > >>>>>> make it compile again) on top of linux-next-1218 makes the problem go away. > >>>>>> > >>>>>> Looks like the batched version is not yet ready for prime time. Can you drop these patches until > >>>>>> we have fixed the issues? > >>>>>> > >>>>>> Christian > >>>>>> > >>>>> > >>>>> Will do, thanks for letting me know. > >>>> > >>>> I have confirmed with the initial reporter (internal test team) that <driver name='qemu'/> > >>>> with a known to be broken linux next kernel also fixes the problem, so it is really the > >>>> vhost changes. > >>> > >>> OK I'm back and trying to make it more bisectable. > >>> > >>> I pushed a new tag "batch-v2". > >>> It's same code but with this bisect should get more information. > >> > >> I get the following with this tag > >> > >> drivers/vhost/net.c: In function ?vhost_net_tx_get_vq_desc?: > >> drivers/vhost/net.c:574:7: error: implicit declaration of function ?vhost_get_vq_desc_batch?; did you mean ?vhost_get_vq_desc?? [-Werror=implicit-function-declaration] > >> 574 | r = vhost_get_vq_desc_batch(tvq, tvq->iov, ARRAY_SIZE(tvq->iov), > >> | ^~~~~~~~~~~~~~~~~~~~~~~ > >> | vhost_get_vq_desc > >> > > > > Not sure why but I pushed a wrong commit. Sorry. Should be good now. > > > > during bisect: > > drivers/vhost/vhost.c: In function ?vhost_get_vq_desc_batch?: > drivers/vhost/vhost.c:2634:8: error: ?id? undeclared (first use in this function); did you mean ?i?? > 2634 | ret = id; > | ^~ > | i > > I changed that to iHmm no that's wrong I think. Sorry about all the errors. Let me push a fixed v3.> > The last step then gave me (on commit 50297a8480b439efc5f3f23088cb2d90b799acef vhost: use batched version by default) > net enc1: Unexpected TXQ (0) queue failure: -5 > in the guest. > > bisect log so far: > [cborntra at m83lp52 linux]$ git bisect log > git bisect start > # bad: [3131e79bb9e9892a5a6bd33513de9bc90b20e867] vhost: use vhost_desc instead of vhost_log > git bisect bad 3131e79bb9e9892a5a6bd33513de9bc90b20e867 > # good: [d1281e3a562ec6a08f944a876481dd043ba739b9] virtio-blk: remove VIRTIO_BLK_F_SCSI support > git bisect good d1281e3a562ec6a08f944a876481dd043ba739b9 > # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct > git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 > # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct > git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 > # bad: [1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79] vhost: batching fetches > git bisect bad 1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79 > > >
Michael S. Tsirkin
2020-Jan-07 11:55 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On Tue, Jan 07, 2020 at 12:34:50PM +0100, Christian Borntraeger wrote:> > > On 07.01.20 10:39, Michael S. Tsirkin wrote: > > On Tue, Jan 07, 2020 at 09:59:16AM +0100, Christian Borntraeger wrote: > >> > >> > >> On 06.01.20 11:50, Michael S. Tsirkin wrote: > >>> On Wed, Dec 18, 2019 at 04:59:02PM +0100, Christian Borntraeger wrote: > >>>> On 18.12.19 16:10, Michael S. Tsirkin wrote: > >>>>> On Wed, Dec 18, 2019 at 03:43:43PM +0100, Christian Borntraeger wrote: > >>>>>> Michael, > >>>>>> > >>>>>> with > >>>>>> commit db7286b100b503ef80612884453bed53d74c9a16 (refs/bisect/skip-db7286b100b503ef80612884453bed53d74c9a16) > >>>>>> vhost: use batched version by default > >>>>>> plus > >>>>>> commit 6bd262d5eafcdf8cdfae491e2e748e4e434dcda6 (HEAD, refs/bisect/bad) > >>>>>> Revert "vhost/net: add an option to test new code" > >>>>>> to make things compile (your next tree is not easily bisectable, can you fix that as well?). > >>>>> > >>>>> I'll try. > >>>>> > >>>>>> > >>>>>> I get random crashes in my s390 KVM guests after reboot. > >>>>>> Reverting both patches together with commit decd9b8 "vhost: use vhost_desc instead of vhost_log" to > >>>>>> make it compile again) on top of linux-next-1218 makes the problem go away. > >>>>>> > >>>>>> Looks like the batched version is not yet ready for prime time. Can you drop these patches until > >>>>>> we have fixed the issues? > >>>>>> > >>>>>> Christian > >>>>>> > >>>>> > >>>>> Will do, thanks for letting me know. > >>>> > >>>> I have confirmed with the initial reporter (internal test team) that <driver name='qemu'/> > >>>> with a known to be broken linux next kernel also fixes the problem, so it is really the > >>>> vhost changes. > >>> > >>> OK I'm back and trying to make it more bisectable. > >>> > >>> I pushed a new tag "batch-v2". > >>> It's same code but with this bisect should get more information. > >> > >> I get the following with this tag > >> > >> drivers/vhost/net.c: In function ?vhost_net_tx_get_vq_desc?: > >> drivers/vhost/net.c:574:7: error: implicit declaration of function ?vhost_get_vq_desc_batch?; did you mean ?vhost_get_vq_desc?? [-Werror=implicit-function-declaration] > >> 574 | r = vhost_get_vq_desc_batch(tvq, tvq->iov, ARRAY_SIZE(tvq->iov), > >> | ^~~~~~~~~~~~~~~~~~~~~~~ > >> | vhost_get_vq_desc > >> > > > > Not sure why but I pushed a wrong commit. Sorry. Should be good now. > > > > during bisect: > > drivers/vhost/vhost.c: In function ?vhost_get_vq_desc_batch?: > drivers/vhost/vhost.c:2634:8: error: ?id? undeclared (first use in this function); did you mean ?i?? > 2634 | ret = id; > | ^~ > | i > > I changed that to i > > > The last step then gave me (on commit 50297a8480b439efc5f3f23088cb2d90b799acef vhost: use batched version by default) > net enc1: Unexpected TXQ (0) queue failure: -5 > in the guest. > > bisect log so far: > [cborntra at m83lp52 linux]$ git bisect log > git bisect start > # bad: [3131e79bb9e9892a5a6bd33513de9bc90b20e867] vhost: use vhost_desc instead of vhost_log > git bisect bad 3131e79bb9e9892a5a6bd33513de9bc90b20e867 > # good: [d1281e3a562ec6a08f944a876481dd043ba739b9] virtio-blk: remove VIRTIO_BLK_F_SCSI support > git bisect good d1281e3a562ec6a08f944a876481dd043ba739b9 > # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct > git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 > # good: [5b00aab5b6332a67e32dace1dcd3a198ab94ed56] vhost: option to fetch descriptors through an independent struct > git bisect good 5b00aab5b6332a67e32dace1dcd3a198ab94ed56 > # bad: [1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79] vhost: batching fetches > git bisect bad 1414d7ee3d10d2ec2bc4ee652d1d90ec91da1c79 > >I pushed batched-v3 - same head but bisect should work now. -- MST
Christian Borntraeger
2020-Jan-07 12:16 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On 07.01.20 12:55, Michael S. Tsirkin wrote:> > I pushed batched-v3 - same head but bisect should work now. >With commit 38ced0208491103b50f1056f0d1c8f28e2e13d08 (HEAD) Author: Michael S. Tsirkin <mst at redhat.com> AuthorDate: Wed Dec 11 12:19:26 2019 -0500 Commit: Michael S. Tsirkin <mst at redhat.com> CommitDate: Tue Jan 7 06:52:42 2020 -0500 vhost: use batched version by default I have exactly one successful ping and then the network inside the guest is broken (no packet anymore). So you could consider this commit broken (but in a different way and also without any guest reboot necessary). bisect log: git bisect start # bad: [d2f6175f52062ee51ee69754a6925608213475d2] vhost: use vhost_desc instead of vhost_log git bisect bad d2f6175f52062ee51ee69754a6925608213475d2 # good: [d1281e3a562ec6a08f944a876481dd043ba739b9] virtio-blk: remove VIRTIO_BLK_F_SCSI support git bisect good d1281e3a562ec6a08f944a876481dd043ba739b9 # good: [fac7c0f46996e32d996f5c46121df24a6b95ec3b] vhost: option to fetch descriptors through an independent struct git bisect good fac7c0f46996e32d996f5c46121df24a6b95ec3b # bad: [539eb9d738f048cd7be61f404e8f9c7d9d2ff3cc] vhost: batching fetches git bisect bad 539eb9d738f048cd7be61f404e8f9c7d9d2ff3cc
Christian Borntraeger
2020-Feb-06 15:12 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On 06.02.20 15:22, eperezma at redhat.com wrote:> Hi Christian. > > Could you try this patch on top of ("38ced0208491 vhost: use batched version by default")? > > It will not solve your first random crash but it should help with the lost of network connectivity. > > Please let me know how does it goes.38ced0208491 + this seem to be ok. Not sure if you can make out anything of this (and the previous git bisect log)
Michael S. Tsirkin
2020-Feb-06 22:07 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On Thu, Feb 06, 2020 at 03:22:39PM +0100, eperezma at redhat.com wrote:> Hi Christian. > > Could you try this patch on top of ("38ced0208491 vhost: use batched version by default")? > > It will not solve your first random crash but it should help with the lost of network connectivity. > > Please let me know how does it goes. > > Thanks! > > >From 99f0f543f3939dbe803988c9153a95616ccccacd Mon Sep 17 00:00:00 2001 > From: =?UTF-8?q?Eugenio=20P=C3=A9rez?= <eperezma at redhat.com> > Date: Thu, 6 Feb 2020 15:13:42 +0100 > Subject: [PATCH] vhost: filter valid vhost descriptors flags > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > Previous commit copy _NEXT flag, and it complains if a copied descriptor > contains it. > > Signed-off-by: Eugenio P??rez <eperezma at redhat.com> > --- > drivers/vhost/vhost.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 27ae5b4872a0..56c5253056ee 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -2125,6 +2125,8 @@ static void pop_split_desc(struct vhost_virtqueue *vq) > --vq->ndescs; > } > > +#define VHOST_DESC_FLAGS (VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE | \ > + VRING_DESC_F_NEXT) > static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, u16 id) > { > struct vhost_desc *h; > @@ -2134,7 +2136,7 @@ static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, > h = &vq->descs[vq->ndescs++]; > h->addr = vhost64_to_cpu(vq, desc->addr); > h->len = vhost32_to_cpu(vq, desc->len); > - h->flags = vhost16_to_cpu(vq, desc->flags); > + h->flags = vhost16_to_cpu(vq, desc->flags) & VHOST_DESC_FLAGS; > h->id = id; > > return 0;> @@ -2343,7 +2345,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > struct vhost_desc *desc = &vq->descs[i]; > int access; > > - if (desc->flags & ~(VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE)) { > + if (desc->flags & ~VHOST_DESC_FLAGS) { > vq_err(vq, "Unexpected flags: 0x%x at descriptor id 0x%x\n", > desc->flags, desc->id); > ret = -EINVAL; > -- > 2.18.1Thanks for catching this! Do we need the 1st chunk though? It seems preferable to just muck with flags in 1 place, when we validate them ...> > On Wed, 2020-01-22 at 20:32 +0100, Christian Borntraeger wrote: > > > > On 20.01.20 07:27, Michael S. Tsirkin wrote: > > > On Tue, Jan 07, 2020 at 01:16:50PM +0100, Christian Borntraeger wrote: > > > > On 07.01.20 12:55, Michael S. Tsirkin wrote: > > > > > > > > > I pushed batched-v3 - same head but bisect should work now. > > > > > > > > > > > > > With > > > > commit 38ced0208491103b50f1056f0d1c8f28e2e13d08 (HEAD) > > > > Author: Michael S. Tsirkin <mst at redhat.com> > > > > AuthorDate: Wed Dec 11 12:19:26 2019 -0500 > > > > Commit: Michael S. Tsirkin <mst at redhat.com> > > > > CommitDate: Tue Jan 7 06:52:42 2020 -0500 > > > > > > > > vhost: use batched version by default > > > > > > > > > > > > I have exactly one successful ping and then the network inside the guest is broken (no packet > > > > anymore). > > > > > > Does anything appear in host's dmesg when this happens? > > > > I think there was nothing, but I am not sure. I would need to redo the test if this is important to know. > > > > > > > > > So you could consider this commit broken (but in a different way and also without any > > > > guest reboot necessary). > > > > > > > > > > > > bisect log: > > > > git bisect start > > > > # bad: [d2f6175f52062ee51ee69754a6925608213475d2] vhost: use vhost_desc instead of vhost_log > > > > git bisect bad d2f6175f52062ee51ee69754a6925608213475d2 > > > > # good: [d1281e3a562ec6a08f944a876481dd043ba739b9] virtio-blk: remove VIRTIO_BLK_F_SCSI support > > > > git bisect good d1281e3a562ec6a08f944a876481dd043ba739b9 > > > > # good: [fac7c0f46996e32d996f5c46121df24a6b95ec3b] vhost: option to fetch descriptors through an independent > > > > struct > > > > git bisect good fac7c0f46996e32d996f5c46121df24a6b95ec3b > > > > # bad: [539eb9d738f048cd7be61f404e8f9c7d9d2ff3cc] vhost: batching fetches > > > > git bisect bad 539eb9d738f048cd7be61f404e8f9c7d9d2ff3cc
Michael S. Tsirkin
2020-Feb-06 22:17 UTC
vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
On Thu, Feb 06, 2020 at 04:12:21PM +0100, Christian Borntraeger wrote:> > > On 06.02.20 15:22, eperezma at redhat.com wrote: > > Hi Christian. > > > > Could you try this patch on top of ("38ced0208491 vhost: use batched version by default")? > > > > It will not solve your first random crash but it should help with the lost of network connectivity. > > > > Please let me know how does it goes. > > > 38ced0208491 + this seem to be ok. > > Not sure if you can make out anything of this (and the previous git bisect log)Yes it does - that this is just bad split-up of patches, and there's still a real bug that caused worse crashes :) So I just pushed batch-v4. I expect that will fail, and bisect to give us vhost: batching fetches Can you try that please?
Apparently Analagous Threads
- vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
- vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
- vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
- vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot
- vhost changes (batched) in linux-next after 12/13 trigger random crashes in KVM guests after reboot