On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
> 2018-04-03 23:00 GMT+03:00, Adam Borowski <kilobyte at angband.pl>:
> > In commit da5e45e619b3f101420c38b3006a9ae4f3ad19b0
> >
> > yet it is still reproducible for me on 4.16-rc7 and 4.16.0, which already
> > have your fix. I don't know about earlier versions -- my newer card went
> > into flames just a few days ago, and I replaced it a brand new 8400GS (G98)
> > I happened to have in a dusty closet. Obviously, I can bisect if that
> > would be helpful, but the error looks the same thus I'm reporting first.
>
> Unfortunately I will not be able to help you, as patch fixed issue on
> my system and thus I have no means to test anything more. My card is
> G98M [Quadro NVS 160M]. Besides – I'm a geographer not a programmer
> ;-)

And I'm, it seems, servant of a particular cat, all else being secondary. :p

> Still your report makes to question the original commit I was fixing
> (mmu: swap out round for ALIGN). Could you test if going back to
> rounddown fixes problem on your side?
>
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>
>  	tail = this->addr + this->size;
>  	if (vmm->func->page_block && next && next->page != p)
> -		tail = ALIGN_DOWN(tail, vmm->func->page_block);
> +		tail = rounddown(tail, vmm->func->page_block);
>
>  	if (addr <= tail && tail - addr >= size) {
>  		rb_erase(&this->tree, &vmm->free);
>

Alas, it did work for a few hours, then a total display freeze:

[29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
[29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040
[29982.044136] nouveau 0000:01:00.0: gr: DATA_ERROR 00000004 [INVALID_VALUE]
[29982.050934] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0218 data ff000000
[29982.061866] nouveau 0000:01:00.0: gr: DATA_ERROR 00000004 [INVALID_VALUE]
[29982.068658] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 021c data ff000000
[29982.079584] nouveau 0000:01:00.0: gr: DATA_ERROR 0000000c [INVALID_BITFIELD]
[29982.086651] nouveau 0000:01:00.0: gr: 00100000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0220 data ff000000
[29982.097517] nouveau 0000:01:00.0: fb: trapped write at 00ff000000 on channel 2 [1fa31000 Xorg[2667]] engine 00 [PGRAPH] client 0b [PROP] subclient 0c [DST2D] reason 00000000 [PT_NOT_PRESENT]
[29982.114491] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - 00000010 [DST2D_FAULT] - Address 00ff000000
[29982.123620] nouveau 0000:01:00.0: gr: TRAP_PROP - TP 0 - e0c: 00000000, e18: 00000000, e1c: 00000000, e20: 00000011, e24: 0c030000
[29982.135365] nouveau 0000:01:00.0: gr: 00200000 [] ch 2 [001fa31000 Xorg[2667]] subc 2 class 502d mthd 0860 data ff2e2e2e

I did not observe a TRAP_M2MF, but the above were present in previous errors,
thus it's probably random what happens first.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ ... what's the frequency of that 5V DC?
⠈⠳⣄⠀⠀⠀⠀

On Wed, Apr 4, 2018 at 6:58 PM, Adam Borowski <kilobyte at angband.pl> wrote:
> On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
>> 2018-04-03 23:00 GMT+03:00, Adam Borowski <kilobyte at angband.pl>:
>> > In commit da5e45e619b3f101420c38b3006a9ae4f3ad19b0
>> >
>> > yet it is still reproducible for me on 4.16-rc7 and 4.16.0, which already
>> > have your fix. I don't know about earlier versions -- my newer card went
>> > into flames just a few days ago, and I replaced it a brand new 8400GS (G98)
>> > I happened to have in a dusty closet. Obviously, I can bisect if that
>> > would be helpful, but the error looks the same thus I'm reporting first.
>>
>> Unfortunately I will not be able to help you, as patch fixed issue on
>> my system and thus I have no means to test anything more. My card is
>> G98M [Quadro NVS 160M]. Besides – I'm a geographer not a programmer
>> ;-)
>
> And I'm, it seems, servant of a particular cat, all else being secondary. :p
>
>> Still your report makes to question the original commit I was fixing
>> (mmu: swap out round for ALIGN). Could you test if going back to
>> rounddown fixes problem on your side?
>>
>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>>
>>  	tail = this->addr + this->size;
>>  	if (vmm->func->page_block && next && next->page != p)
>> -		tail = ALIGN_DOWN(tail, vmm->func->page_block);
>> +		tail = rounddown(tail, vmm->func->page_block);
>>
>>  	if (addr <= tail && tail - addr >= size) {
>>  		rb_erase(&this->tree, &vmm->free);
>>
>
> Alas, it did work for a few hours, then a total display freeze:
>
> [29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
> [29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040

These, as I call them, 406040 errors, have been around on Tesla for
ages. We have no idea what leads to them, but generally some kind of
fifo desync appears to follow.

  -ilia

2018-04-05 2:03 GMT+03:00, Ilia Mirkin <imirkin at alum.mit.edu>:
> On Wed, Apr 4, 2018 at 6:58 PM, Adam Borowski <kilobyte at angband.pl> wrote:
>> On Wed, Apr 04, 2018 at 03:48:39PM +0300, Māris Nartišs wrote:
>>> Still your report makes to question the original commit I was fixing
>>> (mmu: swap out round for ALIGN). Could you test if going back to
>>> rounddown fixes problem on your side?
>>>
>>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
>>> @@ -1354,7 +1354,7 @@ nvkm_vmm_get_locked(struct nvkm_vmm *vmm, bool getref, bool mapref, bool sparse,
>>>
>>>  	tail = this->addr + this->size;
>>>  	if (vmm->func->page_block && next && next->page != p)
>>> -		tail = ALIGN_DOWN(tail, vmm->func->page_block);
>>> +		tail = rounddown(tail, vmm->func->page_block);
>>>
>>>  	if (addr <= tail && tail - addr >= size) {
>>>  		rb_erase(&this->tree, &vmm->free);
>>>
>>
>> Alas, it did work for a few hours, then a total display freeze:
>>
>> [29982.011795] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 0000037d90 put 000003a2cc ib_get 000001dc ib_put 000001dd state 80004861 (err: INVALID_CMD) push 00704031
>> [29982.027959] nouveau 0000:01:00.0: fifo: DMA_PUSHER - ch 2 [Xorg[2667]] get 000003a2cc put 000003a2cc ib_get 000001dc ib_put 000001f9 state 80000000 (err: INVALID_CMD) push 00406040
>
> These, as I call them, 406040 errors, have been around on Tesla for
> ages. We have no idea what leads to them, but generally some kind of
> fifo desync appears to follow.
>
>   -ilia

Taking this into account, going back to rounddown from ALIGN_DOWN
seems to fix breakage on some systems. Let's wait for Ben's input on
this matter, as he swapped rounddown with ALIGN_DOWN to fix some kind
of build problems on 32-bit systems.

Ilia, is there anything we could add to our kernels to shed some light
on 406040 errors? I am not certain if I have seen those on my
hardware, but, as you say, they might be rare enough that I would not
remember them.

Māris.
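
For readers following the rounddown vs ALIGN_DOWN question, here is a minimal userspace sketch of how the two styles of rounding differ. The helpers `align_down_mask()` and `round_down_mod()` are hypothetical stand-ins written for this note, not the kernel macros themselves; they only mirror the usual mask-based and remainder-based approaches. The remainder form needs a 64-bit modulo, which is a plausible (but here unconfirmed) source of the 32-bit build problems mentioned above. Whether `page_block` is a power of two on the hardware in this thread is not stated here, so the non-power-of-two case is purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a mask-based align-down: clears the low bits,
 * so the result is only a multiple of `a` when `a` is a power of two. */
static inline uint64_t align_down_mask(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Hypothetical stand-in for a remainder-based round-down: correct for any
 * non-zero `a`, at the cost of a 64-bit modulo operation. */
static inline uint64_t round_down_mod(uint64_t x, uint64_t a)
{
	return x - (x % a);
}

/* Returns non-zero when `a` is a power of two. */
static inline int is_pow2(uint64_t a)
{
	return a != 0 && (a & (a - 1)) == 0;
}

/* Small self-check: the helpers agree for a power-of-two alignment and
 * disagree for an alignment that is not a power of two. */
static inline void self_check(void)
{
	assert(is_pow2(0x1000));
	assert(align_down_mask(0x12345, 0x1000) == round_down_mod(0x12345, 0x1000));

	assert(!is_pow2(0x3000));
	assert(align_down_mask(0x12345, 0x3000) != round_down_mod(0x12345, 0x3000));
}
```

If both forms give the same value for the alignment in use, swapping one for the other should not change behaviour by itself, which would point back at which variable is being aligned rather than at the rounding helper.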