Alexandre Courbot

2015-Aug-12 03:35 UTC

[Nouveau] [REGRESSION] nouveau: Crash in gk104_fifo_intr_runlist()

Mmm, in that case it is probably best to revert that commit for the
time being. It was targeting GM20B (and maybe other Maxwells too), so
reverting it should not hurt anyone at the moment. I think Ben is on
holiday for now; is there anyone else who can send a pull request to
Dave Airlie for this? We don't want 4.2 to ship with a crash every
other reboot...

On Wed, Aug 12, 2015 at 10:01 AM, Eric Biggers <ebiggers3 at gmail.com> wrote:
> Hi,
>
> I think I've done about 10 reboots with the commit reverted and I never
> experienced the crash.  But with 4.2.0-rc6 I get the crash on about every
> other reboot.
>
> Probably relevant: the computer on which the crash occurs has two GPUs (one
> Intel and one Nvidia).  The Intel one is actually being used, whereas I
> presume the Nvidia one is being automatically disabled shortly after boot,
> perhaps when the crash occurs...
>
> Eric
>
> On Mon, Aug 10, 2015 at 11:28 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>>
>> Indeed, and I am actually surprised to see one here. I will
>> double-check that patch.
>>
>> Eric, would you be able to give an estimate of the repro rate for this
>> issue? More testing with and without the patch would be welcome, it'd
>> be good to know whether it is actually the culprit or not.
>>
>> On Mon, Aug 10, 2015 at 2:28 AM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>> > Alexandre, could you take a look? 0xbad* generally comes from bad mmio
>> > reads.
>> >
>> > On Aug 9, 2015 1:08 PM, "Eric Biggers" <ebiggers3 at gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am testing Linux v4.2-rc5 and I am sporadically getting crashes shortly
>> >> after startup in gk104_fifo_intr_runlist().  What I've found is that the
>> >> 'mask' value read from offset 0x2a00 comes back as '0xbad0da00'.  This
>> >> causes the 'engn' variable to be assigned the value 9, which is invalid;
>> >> then wake_up() is called on an uninitialized waitqueue which causes the
>> >> crash.
>> >>
>> >> Reverting commit 1addc12648521d ("drm/nouveau/fifo/gk104: kick channels
>> >> when deactivating them") seemed to make the problem go away, although I
>> >> can't be 100% sure because the problem is sporadic.
>> >>
>> >> Attached an example of the kernel log up to the crash.
>> >>
>> >> Eric
>> >>
>> >> _______________________________________________
>> >> Nouveau mailing list
>> >> Nouveau at lists.freedesktop.org
>> >> http://lists.freedesktop.org/mailman/listinfo/nouveau
>> >>
>> >
>
>
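
Editorial sketch of the arithmetic in Eric's report. It assumes 'engn' is taken as the index of the lowest set bit of 'mask', which is consistent with the reported value of 9: 0xbad0da00 has bits 0 to 8 clear and bit 9 set. The helper names and the NUM_ENGINES bound are hypothetical, chosen for illustration only; this is plain user-space C, not the driver code.

```c
#include <stdint.h>

/* Hypothetical engine count, assumed here only to illustrate a bounds check. */
#define NUM_ENGINES 7u

/* Index of the lowest set bit. The caller must pass a non-zero mask. */
static unsigned lowest_set_bit(uint32_t mask)
{
    unsigned bit = 0;

    while ((mask & 1u) == 0) {
        mask >>= 1;
        bit++;
    }
    return bit;
}

/*
 * Walk the mask bit by bit, as an interrupt handler might, and count
 * how many set bits would index past the (assumed) engine array.
 */
static unsigned count_out_of_range_bits(uint32_t mask)
{
    unsigned invalid = 0;

    while (mask) {
        unsigned engn = lowest_set_bit(mask);

        if (engn >= NUM_ENGINES)
            invalid++;
        mask &= ~(UINT32_C(1) << engn);
    }
    return invalid;
}
```

With these definitions, lowest_set_bit(0xbad0da00) evaluates to 9, and every set bit of that value lies outside the assumed range, so a bounds check on 'engn' before indexing would reject it rather than touch an uninitialized waitqueue.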

Alexandre Courbot

2015-Aug-12 03:53 UTC

Sending the revert patch to Dave after receiving his green light for
this, and will investigate the issue on my side. I should be able to find a
gk107 somewhere...


Ilia Mirkin

2015-Aug-12 04:00 UTC

I'm guessing that Optimus is the operative difference, not the
specific chip. Basically something that can be put to sleep via
ACPI...
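
Editorial sketch of how a handler could defend against the situation discussed in this thread, where a register read on a device that may be asleep returns a 0xbad* value. It assumes, based only on the "0xbad*" remark quoted earlier, that a failed read carries 0xbad in its top 12 bits. Both function names and the 'valid_bits' parameter are hypothetical; this is not the fix that was applied to nouveau.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a failed MMIO read shows 0xbad in the top 12 bits. */
static bool looks_like_bad_mmio_read(uint32_t val)
{
    return (val & UINT32_C(0xfff00000)) == UINT32_C(0xbad00000);
}

/*
 * Hypothetical filter for an interrupt mask: drop the whole value if it
 * carries the failure signature, otherwise keep only the bits that map
 * to engines the caller actually initialized.
 */
static uint32_t sanitize_intr_mask(uint32_t raw, uint32_t valid_bits)
{
    if (looks_like_bad_mmio_read(raw))
        return 0;
    return raw & valid_bits;
}
```

Masking with 'valid_bits' is the more robust half of this sketch, since it does not depend on recognizing any particular error pattern.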

