Noralf Trønnes
2022-Sep-07 16:44 UTC
[Nouveau] [PATCH v2 00/41] drm: Analog TV Improvements
Den 07.09.2022 12.36, skrev Stefan Wahren:> Hi Maxime, > > Am 05.09.22 um 16:57 schrieb Maxime Ripard: >> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Tr?nnes wrote: >>> >>> Den 01.09.2022 21.35, skrev Noralf Tr?nnes: >>>> >>>> I have finally found a workaround for my kernel hangs. >>>> >>>> Dom had a look at my kernel and found that the VideoCore was fine, and >>>> he said this: >>>> >>>>> That suggests cause of lockup was on arm side rather than VC side. >>>>> >>>>> But it's hard to diagnose further. Once you've had a peripheral not >>>>> respond, the AXI bus locks up and no further operations are possible. >>>>> Usual causes of this are required clocks being stopped or domains >>>>> disabled and then trying to access the hardware. >>>>> >>>> So when I got this on my 64-bit build: >>>> >>>> [? 166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- >>>> SError >>>> [? 166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G??????? W >>>> ???? 5.19.0-rc6-00096-gba7973977976-dirty #1 >>>> [? 166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT) >>>> [? 166.702206] Workqueue: events_freezable_power_ >>>> thermal_zone_device_check >>>> [? 166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS >>>> BTYPE=--) >>>> [? 166.702242] pc : regmap_mmio_read32le+0x10/0x28 >>>> [? 166.702261] lr : regmap_mmio_read+0x44/0x70 >>>> ... >>>> [? 166.702606]? bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal] >>>> >>>> I wondered if that reg read was stalled due to a clock being stopped. >>>> >>>> Lo and behold, disabling runtime pm and keeping the vec clock running >>>> all the time fixed it[1]. >>>> >>>> I don't know what the problem is, but at least I can now test this >>>> patchset. >>>> >>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065 >>>> >>> It turns out I didn't have to disable runtime pm: >>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2 >> If the bcm2711_thermal IP needs that clock to be enabled, it should grab >> a reference itself, but it looks like even the device tree binding >> doesn't ask for one. > The missing clock in the device tree binding is expected, because > despite of the code there is not much information about the BCM2711 > clock tree. But i'm skeptical that the AVS IP actually needs the VEC > clock, i think it's more likely that the VEC clock parent is changed and > that cause this issue. I could take care of the bcm2711 binding & driver > if i know which clock is really necessary.Seems you're right, keeping the parent always enabled is enough: clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per I tried enabling just the grandparent clock as well, but that didn't help. Without the clock hack it seems the hang occurs when switching between NTSC and PAL, at most I've been able to do that 4-5 times before it hangs. For a while it looked like fbdev/fbcon had a play in this, but then I realised that it just gave me a NTSC mode to start from and to go back to when qutting modetest. Noralf.
Noralf Trønnes
2022-Sep-10 15:34 UTC
[Nouveau] [PATCH v2 00/41] drm: Analog TV Improvements
Den 07.09.2022 18.44, skrev Noralf Tr?nnes:> > > Den 07.09.2022 12.36, skrev Stefan Wahren: >> Hi Maxime, >> >> Am 05.09.22 um 16:57 schrieb Maxime Ripard: >>> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Tr?nnes wrote: >>>> >>>> Den 01.09.2022 21.35, skrev Noralf Tr?nnes: >>>>> >>>>> I have finally found a workaround for my kernel hangs. >>>>> >>>>> Dom had a look at my kernel and found that the VideoCore was fine, and >>>>> he said this: >>>>> >>>>>> That suggests cause of lockup was on arm side rather than VC side. >>>>>> >>>>>> But it's hard to diagnose further. Once you've had a peripheral not >>>>>> respond, the AXI bus locks up and no further operations are possible. >>>>>> Usual causes of this are required clocks being stopped or domains >>>>>> disabled and then trying to access the hardware. >>>>>> >>>>> So when I got this on my 64-bit build: >>>>> >>>>> [? 166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- >>>>> SError >>>>> [? 166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G??????? W >>>>> ???? 5.19.0-rc6-00096-gba7973977976-dirty #1 >>>>> [? 166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT) >>>>> [? 166.702206] Workqueue: events_freezable_power_ >>>>> thermal_zone_device_check >>>>> [? 166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS >>>>> BTYPE=--) >>>>> [? 166.702242] pc : regmap_mmio_read32le+0x10/0x28 >>>>> [? 166.702261] lr : regmap_mmio_read+0x44/0x70 >>>>> ... >>>>> [? 166.702606]? bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal] >>>>> >>>>> I wondered if that reg read was stalled due to a clock being stopped. >>>>> >>>>> Lo and behold, disabling runtime pm and keeping the vec clock running >>>>> all the time fixed it[1]. >>>>> >>>>> I don't know what the problem is, but at least I can now test this >>>>> patchset. >>>>> >>>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065 >>>>> >>>> It turns out I didn't have to disable runtime pm: >>>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2 >>> If the bcm2711_thermal IP needs that clock to be enabled, it should grab >>> a reference itself, but it looks like even the device tree binding >>> doesn't ask for one. >> The missing clock in the device tree binding is expected, because >> despite of the code there is not much information about the BCM2711 >> clock tree. But i'm skeptical that the AVS IP actually needs the VEC >> clock, i think it's more likely that the VEC clock parent is changed and >> that cause this issue. I could take care of the bcm2711 binding & driver >> if i know which clock is really necessary. > > Seems you're right, keeping the parent always enabled is enough: > > clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per > > I tried enabling just the grandparent clock as well, but that didn't help. > > Without the clock hack it seems the hang occurs when switching between > NTSC and PAL, at most I've been able to do that 4-5 times before it hangs. > > For a while it looked like fbdev/fbcon had a play in this, but then I > realised that it just gave me a NTSC mode to start from and to go back > to when qutting modetest. >I've looked some more into this problem and I see that downstream is using a firmware clock for vec: clk: Move vec clock to clk-raspberrypi https://github.com/raspberrypi/linux/pull/4639 If I do the same my problem goes away. It's interesting to note that on downstream 5.10.103-v7l+ #1530, pllc_per is enabled even if tvout is not enabled: $ sudo cat /sys/kernel/debug/clk/pllc_per/regdump cm = 0x00000000 a2w = 0x00000004 (disable bit(8) is not set) It's when mainline vc4_vec disables this vec parent clock that the crash occurs. Sidenote: Another downstream fw clock change with a vec reference[1]: Another issue not related to the clock crash problem: I assumed that unloading the vc4 module would release the clocks, but this didn't happen. When I looked at it I remembered that there's a catch in the DRM unplug machinery when it comes to unloading a driver and the DRM disable hooks. static void vc4_drm_unbind(struct device *dev) { struct drm_device *drm = dev_get_drvdata(dev); drm_dev_unplug(drm); drm_atomic_helper_shutdown(drm); } Here the drm_device is first marked as unplugged and then the pipeline is disabled. Since vc4_vec_encoder_disable() is protected by drm_dev_enter() the VEC is not disabled, clocks are not released and PM is left on. In the drivers that I have written where the hardware is not expected to have gone away on device unbind (SPI), I've just left out the drm_dev_enter() check in the disable hook. Noralf. [1] https://github.com/raspberrypi/linux/pull/4706
Maxime Ripard
2022-Sep-21 14:03 UTC
[Nouveau] [PATCH v2 00/41] drm: Analog TV Improvements
On Wed, Sep 07, 2022 at 06:44:53PM +0200, Noralf Tr?nnes wrote:> > > Den 07.09.2022 12.36, skrev Stefan Wahren: > > Hi Maxime, > > > > Am 05.09.22 um 16:57 schrieb Maxime Ripard: > >> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Tr?nnes wrote: > >>> > >>> Den 01.09.2022 21.35, skrev Noralf Tr?nnes: > >>>> > >>>> I have finally found a workaround for my kernel hangs. > >>>> > >>>> Dom had a look at my kernel and found that the VideoCore was fine, and > >>>> he said this: > >>>> > >>>>> That suggests cause of lockup was on arm side rather than VC side. > >>>>> > >>>>> But it's hard to diagnose further. Once you've had a peripheral not > >>>>> respond, the AXI bus locks up and no further operations are possible. > >>>>> Usual causes of this are required clocks being stopped or domains > >>>>> disabled and then trying to access the hardware. > >>>>> > >>>> So when I got this on my 64-bit build: > >>>> > >>>> [? 166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- > >>>> SError > >>>> [? 166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G??????? W > >>>> ???? 5.19.0-rc6-00096-gba7973977976-dirty #1 > >>>> [? 166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT) > >>>> [? 166.702206] Workqueue: events_freezable_power_ > >>>> thermal_zone_device_check > >>>> [? 166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS > >>>> BTYPE=--) > >>>> [? 166.702242] pc : regmap_mmio_read32le+0x10/0x28 > >>>> [? 166.702261] lr : regmap_mmio_read+0x44/0x70 > >>>> ... > >>>> [? 166.702606]? bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal] > >>>> > >>>> I wondered if that reg read was stalled due to a clock being stopped. > >>>> > >>>> Lo and behold, disabling runtime pm and keeping the vec clock running > >>>> all the time fixed it[1]. > >>>> > >>>> I don't know what the problem is, but at least I can now test this > >>>> patchset. > >>>> > >>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065 > >>>> > >>> It turns out I didn't have to disable runtime pm: > >>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2 > >> If the bcm2711_thermal IP needs that clock to be enabled, it should grab > >> a reference itself, but it looks like even the device tree binding > >> doesn't ask for one. > > The missing clock in the device tree binding is expected, because > > despite of the code there is not much information about the BCM2711 > > clock tree. But i'm skeptical that the AVS IP actually needs the VEC > > clock, i think it's more likely that the VEC clock parent is changed and > > that cause this issue. I could take care of the bcm2711 binding & driver > > if i know which clock is really necessary. > > Seems you're right, keeping the parent always enabled is enough: > > clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per > > I tried enabling just the grandparent clock as well, but that didn't help.Yeah, adding tracing to the clock framework shows that it indeed disables PLLC_PER. So there's probably some other device that depends on it but doesn't take a reference to it. I had a look, but it's not really obvious what that might be. This patch makes sure that the PLL*_PER are never disabled, could you test it? It seems to work for me. diff --git a/drivers/clk/bcm/clk-bcm2835.c b/drivers/clk/bcm/clk-bcm2835.c index 48a1eb9f2d55..3839261ee419 100644 --- a/drivers/clk/bcm/clk-bcm2835.c +++ b/drivers/clk/bcm/clk-bcm2835.c @@ -1675,7 +1675,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = { .load_mask = CM_PLLA_LOADPER, .hold_mask = CM_PLLA_HOLDPER, .fixed_divider = 1, - .flags = CLK_SET_RATE_PARENT), + .flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT), [BCM2835_PLLA_DSI0] = REGISTER_PLL_DIV( SOC_ALL, .name = "plla_dsi0", @@ -1784,7 +1784,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = { .load_mask = CM_PLLC_LOADPER, .hold_mask = CM_PLLC_HOLDPER, .fixed_divider = 1, - .flags = CLK_SET_RATE_PARENT), + .flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT), /* * PLLD is the display PLL, used to drive DSI display panels. @@ -1891,7 +1891,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = { .load_mask = CM_PLLH_LOADAUX, .hold_mask = 0, .fixed_divider = 1, - .flags = CLK_SET_RATE_PARENT), + .flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT), [BCM2835_PLLH_PIX] = REGISTER_PLL_DIV( SOC_BCM2835, .name = "pllh_pix", Maxime -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: not available URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20220921/69ebd37e/attachment.sig>