thr3ads.net - Nouveau - [Nouveau] nouveau TRAP_M2MF still there on G98 [Apr 2018]

Adam Borowski

2018-Apr-04 22:58 UTC

[Nouveau] nouveau TRAP_M2MF still there on G98

On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
> 2018-04-03 23:00 GMT+03:00, Adam Borowski <kilobyte at angband.pl>:
> > In commit da5e45e619b3f101420c38b3006a9ae4f3ad19b0
> >
> > yet it is still reproducible for me on 4.16-rc7 and 4.16.0, which already
> > have your fix.  I don't know about earlier versions -- my newer card went
> > into flames just a few days ago, and I replaced it a brand new 8400GS (G98)
> > I happened to have in a dusty closet.  Obviously, I can bisect if that
> > would be helpful, but the error looks the same thus I'm reporting first.
> 
> Unfortunately I will not be able to help you, as patch fixed issue on
> my system and thus I have no means to test anything more. My card is
> G98M [Quadro NVS 160M]. Besides – I'm a geographer not a programmer
> ;-)

And I'm, it seems, servant of a particular cat, all else being secondary. :p

> Still your report makes to question the original commit I was fixing
> (mmu: swap out round for ALIGN). Could you test if going back to
> rounddown fixes problem on your side?
> 
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
> 
>                 tail = this->addr + this->size;
>                 if (vmm->func->page_block && next && next->page != p)
> -                       tail = ALIGN_DOWN(tail, vmm->func->page_block);
> +                       tail = rounddown(tail, vmm->func->page_block);
> 
>                 if (addr <= tail && tail - addr >= size) {
>                         rb_erase(&this->tree, &vmm->free);
> 
Alas, it did work for a few hours, then a total display freeze:

[29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
[29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040
[29982.044136] nouveau 0000:01:00.0: gr: DATA_ERROR 00000004 [INVALID_VALUE]
[29982.050934] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0218 data ff000000
[29982.061866] nouveau 0000:01:00.0: gr: DATA_ERROR 00000004 [INVALID_VALUE]
[29982.068658] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 021c data ff000000
[29982.079584] nouveau 0000:01:00.0: gr: DATA_ERROR 0000000c [INVALID_BITFIELD]
[29982.086651] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0220 data ff000000
[29982.097517] nouveau 0000:01:00.0: fb: trapped write at 00ff000000 on channel 2 [1fa31000 Xorg[2667]] engine 00 [PGRAPH] client 0b [PROP] subclient 0c [DST2D] reason 00000000 [PT_NOT_PRESENT]
[29982.114491] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000010 [DST2D_FAULT] - Address 00ff000000
[29982.123620] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00000000, e20: 00000011, e24: 0c030000
[29982.135365] nouveau 0000:01:00.0: gr: 00200000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0860 data ff2e2e2e

I did not observe a TRAP_M2MF, but the above were present in previous
errors, thus it's probably random what happens first.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ ... what's the frequency of that 5V DC?
⠈⠳⣄⠀⠀⠀⠀

Ilia Mirkin

2018-Apr-04 23:03 UTC

On Wed, Apr 4, 2018 at 6:58 PM, Adam Borowski <kilobyte at angband.pl> wrote:
> On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
>> 2018-04-03 23:00 GMT+03:00, Adam Borowski <kilobyte at angband.pl>:
>> > In commit da5e45e619b3f101420c38b3006a9ae4f3ad19b0
>> >
>> > yet it is still reproducible for me on 4.16-rc7 and 4.16.0, which already
>> > have your fix.  I don't know about earlier versions -- my newer card went
>> > into flames just a few days ago, and I replaced it a brand new 8400GS (G98)
>> > I happened to have in a dusty closet.  Obviously, I can bisect if that
>> > would be helpful, but the error looks the same thus I'm reporting first.
>>
>> Unfortunately I will not be able to help you, as patch fixed issue on
>> my system and thus I have no means to test anything more. My card is
>> G98M [Quadro NVS 160M]. Besides – I'm a geographer not a programmer
>> ;-)
>
> And I'm, it seems, servant of a particular cat, all else being secondary. :p
>
>> Still your report makes to question the original commit I was fixing
>> (mmu: swap out round for ALIGN). Could you test if going back to
>> rounddown fixes problem on your side?
>>
>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>>
>>                 tail = this->addr + this->size;
>>                 if (vmm->func->page_block && next && next->page != p)
>> -                       tail = ALIGN_DOWN(tail, vmm->func->page_block);
>> +                       tail = rounddown(tail, vmm->func->page_block);
>>
>>                 if (addr <= tail && tail - addr >= size) {
>>                         rb_erase(&this->tree, &vmm->free);
>>
>
> Alas, it did work for a few hours, then a total display freeze:
>
> [29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
> [29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040

These, as I call them, 406040 errors, have been around on Tesla for
ages. We have no idea what leads to them, but generally some kind of
fifo desync appears to follow.

  -ilia

Māris Nartišs

2018-Apr-06 08:57 UTC

2018-04-05 2:03 GMT+03:00, Ilia Mirkin <imirkin at alum.mit.edu>:
> On Wed, Apr 4, 2018 at 6:58 PM, Adam Borowski <kilobyte at angband.pl> wrote:
>> On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
>>> Still your report makes to question the original commit I was fixing
>>> (mmu: swap out round for ALIGN). Could you test if going back to
>>> rounddown fixes problem on your side?
>>>
>>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>>> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>>>
>>>                 tail = this->addr + this->size;
>>>                 if (vmm->func->page_block && next && next->page != p)
>>> -                       tail = ALIGN_DOWN(tail, vmm->func->page_block);
>>> +                       tail = rounddown(tail, vmm->func->page_block);
>>>
>>>                 if (addr <= tail && tail - addr >= size) {
>>>                         rb_erase(&this->tree, &vmm->free);
>>>
>>
>> Alas, it did work for a few hours, then a total display freeze:
>>
>> [29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
>> [29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040
>
> These, as I call them, 406040 errors, have been around on Tesla for
> ages. We have no idea what leads to them, but generally some kind of
> fifo desync appears to follow.
>
>   -ilia

Taking this into account, going back to rounddown from ALIGN_DOWN
seems to fix breakage on some systems. Let's wait for Ben's input on
this matter, as he swapped rounddown with ALIGN_DOWN to fix some kind
of build problems on 32-bit systems.
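
For readers following the patch, the difference between the two helpers can be sketched in user space. This is a minimal illustration, not the kernel's code: `round_down_mod` and `align_down_mask` are hypothetical stand-ins for a modulo-based rounddown and a mask-based ALIGN_DOWN, and the values are made up, since the thread does not say what page_block is on G98. The two forms agree whenever the alignment is a power of two and can differ otherwise. A 64-bit modulo is also the kind of operation that may need a helper routine on 32-bit targets, which could be related to the build problem mentioned above, though the thread does not confirm that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a modulo-based round-down (works for any a > 0). */
static uint64_t round_down_mod(uint64_t x, uint64_t a)
{
	return x - (x % a);
}

/* Hypothetical stand-in for a mask-based align-down (valid only if a is a power of two). */
static uint64_t align_down_mask(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Illustrative values only; none are taken from the driver. */
static void demo_alignment(void)
{
	/* Power-of-two alignment: both forms give the same result. */
	assert(round_down_mod(0x20001234ULL, 1ULL << 29) ==
	       align_down_mask(0x20001234ULL, 1ULL << 29));

	/* Non-power-of-two alignment: the mask result is not a multiple of a. */
	assert(round_down_mod(0x5000ULL, 0x3000ULL) == 0x3000ULL);
	assert(align_down_mask(0x5000ULL, 0x3000ULL) == 0x5000ULL);
}
```

Calling demo_alignment() from a small main function runs the checks; they pass silently when the stated relationships hold.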

Ilia, is there anything we could add to our kernels to shed some light
on 406040 errors? I am not certain whether I have seen those on my
hardware, but, as you say, they might be rare enough that I would not
remember them.

Māris.
