Hi Linus, On Thu, Nov 11, 2021 at 2:03 PM Sudip Mukherjee <sudipm.mukherjee at gmail.com> wrote:> > Hi Linus, > > My testing has been failing for the last few days. Last good test was > with 6f2b76a4a384 and I started seeing the failure with ce840177930f5 > where boot timeout. > > Last good test - https://openqa.qa.codethink.co.uk/tests/323 > Failing test - https://openqa.qa.codethink.co.uk/tests/335 > > Saw a similar issue with 5.10.79-rc1 today and bisect showed the > problem with 8615ff6dd1ac but that was already in the last good test I > had.Did a bisect and this is a separate issue than the one we saw in 5.10.79-rc1. The bisect log: # bad: [ce840177930f591a181f55515fc6ac9e1f56b84a] Merge tag 'defconfig-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc # good: [6f2b76a4a384e05ac8d3349831f29dff5de1e1e2] Merge tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next git bisect start 'ce840177930f5' '6f2b76a4a384e05ac8d3349831f29dff5de1e1e2' # good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f # bad: [dcd68326d29b62f3039e4f4d23d3e38f24d37360] Merge tag 'devicetree-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect bad dcd68326d29b62f3039e4f4d23d3e38f24d37360 # bad: [c7c774fe09389fc806bbe4b487c18e45f576c1ae] Merge tag 'drm-intel-next-2021-10-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next git bisect bad c7c774fe09389fc806bbe4b487c18e45f576c1ae # good: [8017ecb11ebbcdfcbdff14c5edbdf1efc14991f4] drm/amd/display: Added root clock optimization flags git bisect good 8017ecb11ebbcdfcbdff14c5edbdf1efc14991f4 # good: [8a1ec3f3275479292613273a7be2ac87f2a7f6e6] drm/i915: Remove DP_PORT_EN stuff from link training code git bisect good 8a1ec3f3275479292613273a7be2ac87f2a7f6e6 # bad: [9962601ca5719050906915c3c33a63744ac7b15c] drm/bridge: dw-hdmi-cec: Make use of the helper function devm_add_action_or_reset() git bisect bad 9962601ca5719050906915c3c33a63744ac7b15c # bad: [606b102876e3741851dfb09d53f3ee57f650a52c] drm: fb_helper: fix CONFIG_FB dependency git bisect bad 606b102876e3741851dfb09d53f3ee57f650a52c # good: [c43da06c24a485308e80d709737b446e8cad175d] dt-bindings: drm/panel: boe-tv101wum-nl6: Support enabling a 3.3V rail git bisect good c43da06c24a485308e80d709737b446e8cad175d # good: [8d6b006e1f51c99016aa39ca9e03947cbdd024e3] drm/virtio: implement context init: handle VIRTGPU_CONTEXT_PARAM_POLL_RINGS_MASK git bisect good 8d6b006e1f51c99016aa39ca9e03947cbdd024e3 # bad: [d0f5d790ae863079025398015eb59347b01db455] drm/ttm: remove TTM_PAGE_FLAG_NO_RETRY git bisect bad d0f5d790ae863079025398015eb59347b01db455 # bad: [f5d28856b89baab4232a9f841e565763fcebcdf9] drm/ttm: stop calling tt_swapin in vm_access git bisect bad f5d28856b89baab4232a9f841e565763fcebcdf9 # bad: [78aa20fa4381623cf59a85d053486f98784ca3a0] drm/virtio: implement context init: advertise feature to userspace git bisect bad 78aa20fa4381623cf59a85d053486f98784ca3a0 # bad: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0] drm/virtio: implement context init: add virtio_gpu_fence_event git bisect bad cd7f5ca33585918febe5e2f6dc090a21cfa775b0 # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0] drm/virtio: implement context init: add virtio_gpu_fence_event And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed the problem I was seeing on my qemu test of x86_64. The qemu image is based on Ubuntu. Will be happy to test any fix or more debugging if needed. -- Regards Sudip
On Thu, Nov 11, 2021 at 8:51 PM Sudip Mukherjee <sudipm.mukherjee at gmail.com> wrote:> > Hi Linus, > > On Thu, Nov 11, 2021 at 2:03 PM Sudip Mukherjee > <sudipm.mukherjee at gmail.com> wrote: > > > > Hi Linus, > > > > My testing has been failing for the last few days. Last good test was > > with 6f2b76a4a384 and I started seeing the failure with ce840177930f5 > > where boot timeout.Last night's test on 66f4beaa6c1d worked fine. So I guess this has now been fixed. Thanks. -- Regards Sudip
[ Hmm. This email was marked as spam for me. I see no obvious reason for it being marked as spam, but it happens.. ] On Thu, Nov 11, 2021 at 12:52 PM Sudip Mukherjee <sudipm.mukherjee at gmail.com> wrote:> > # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0] > drm/virtio: implement context init: add virtio_gpu_fence_eventHmm. Judging from your automated screenshots, the login never appears.> And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed > the problem I was seeing on my qemu test of x86_64. The qemu image is > based on Ubuntu.Presumably either that commit is somehow buggy in itself - or it does exactly what it means to do, and the new poll() semantics just confuses the heck out of the X server (or wayland or whatever). And honestly, if I read that thing correctly, the patch is entirely broken. The new poll function (virtio_gpu_poll()) will unconditionally remove the first event from the event list, and then report "Yeah, I had events". This is completely bogus for a few reasons: - poll() really should be idempotent, because the poll function gets called multiple times - it doesn't even seem to check that the event that it removes is the new VIRTGPU_EVENT_FENCE_SIGNALED_INTERNAL kind of event, so it will unconditionally just remove random events. - it does seem to check the "vfpriv->ring_idx_mask" and do the old thing if that is zero, but I see absolutely no reason for that (and that check itself has caused problems, see below) Honestly, my reaction to this all is that that commit is fundamentally broken and probably should be reverted regardless as "this commit does bad things". HOWEVER - it has had a fix for a NULL pointer dereference in the meantime - can you check whether the current top of tree happens to work for you? Maybe your problem isn't due to "that commit does unnatural things", but simply due to the bug fixed in d89c0c8322ec ("drm/virtio: Fix NULL dereference error in virtio_gpu_poll"). And if it's still broken with that commit, I'll happily revert it and people need to go back to the drawing board. In fact, I would really suggest that people look at that virtio_gpu_poll() function regardless. That odd "let's unconditionally just drop events in the poll function is really REALLY broken behavior. Linus