Hans de Goede
2015-Dec-15 14:15 UTC
[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?
Hi all,

As part of my compute work I'm trying to get some TGSI compute code to work. The code from mesa/src/gallium/tests/trivial.c works.

So now I'm trying to get a "native" TGSI kernel to run via clover. I'm using Francisco's nbody.c example for this:

https://fedorapeople.org/~jwrdegoede/nbody.c

This does not work. At first I thought there was an issue with the setup of the input / output buffers, but that seems to work fine. Moreover, I finally got the smart idea to look in dmesg, which says:

[ 9920.802435] nouveau 0000:01:00.0: gr: TRAP ch 6 [007f7fa000 nbody[31881]]
[ 9920.802449] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 10009 [INVALID_OPCODE]
[ 9920.802456] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 20009 [INVALID_OPCODE]

This repeats for every "step" in the nbody simulation. This is on a gk107 card.

So that seems to be the real problem. Since the error says "INVALID_OPCODE", I've put the TGSI code from nbody.c through "nouveau_compiler -a e4" and then run "nvdisasm -b SM30" on it, but the output looks ok. There is an 8-byte sequence every 64 bytes which does not get decoded, but AFAIK that is the scheduling info, so that should be fine.

One thing which does stand out is that this:

0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
1: ld u32 %r222 c0[0x4] (0)
2: ld u64 { %r225 %r228 } c0[0x8] (0)
3: ld u32 %r234 c0[0x10] (0)

gets translated into (nvdisasm output):

/*0008*/ LDC R4, c[0x0][0x0];     /* 0x1400000003f11c86 */
/*0010*/ MOV R2, c[0x0][0x4];     /* 0x2800400010009de4 */
/*0018*/ LDC.64 R0, c[0x0][0x8];  /* 0x1400000023f01ca6 */
/*0020*/ MOV R3, c[0x0][0x10];    /* 0x280040004000dde4 */

where I would expect four LDC instructions. Could that be the problem?

If that is not the problem, then hints on how to debug this further would be greatly appreciated.

Regards,

Hans
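The layout mentioned in the mail (an 8-byte scheduling word that the disassembler skips in every 64 bytes) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the helper name split_groups is made up, the scheduling word is assumed to sit at the start of each 64-byte group (consistent with the disassembly starting at offset 0008), words are assumed to be little-endian, and the scheduling word in the example is a placeholder of 0 because the mail does not show the real one.

```python
import struct

WORD_BYTES = 8    # every SM30 instruction word in the mail is 64 bits
GROUP_WORDS = 8   # assumed: 1 scheduling word + 7 instructions = 64 bytes


def split_groups(code: bytes):
    """Yield (group_offset, sched_word, [(insn_offset, insn_word), ...])."""
    if len(code) % WORD_BYTES:
        raise ValueError("code length must be a multiple of 8 bytes")
    count = len(code) // WORD_BYTES
    words = struct.unpack("<%dQ" % count, code)  # little-endian assumed
    for first in range(0, count, GROUP_WORDS):
        group = words[first:first + GROUP_WORDS]
        base = first * WORD_BYTES
        # The first word of the group is scheduling info, the rest are code.
        insns = [(base + WORD_BYTES * (i + 1), w)
                 for i, w in enumerate(group[1:])]
        yield base, group[0], insns


# Encodings copied from the nvdisasm output in the mail; the leading
# scheduling word is a placeholder (0), not the real value.
example = struct.pack("<5Q", 0,
                      0x1400000003f11c86, 0x2800400010009de4,
                      0x1400000023f01ca6, 0x280040004000dde4)

for base, sched, insns in split_groups(example):
    print("sched @%04x: %016x" % (base, sched))
    for offset, word in insns:
        print("  insn @%04x: %016x" % (offset, word))
```

Running this over the real compiler output would show whether the undecoded words fall only at offsets 0x00, 0x40, 0x80 and so on, which is what the scheduling-info explanation predicts.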
Ilia Mirkin
2015-Dec-15 17:00 UTC
[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?
A few things that stand out:

0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)

wtf is that 0x0000000000000 thing doing there? Was it a %rX which got constant-folded into 0? That indirection should then have been removed... That said, the final encoding looks fine.

I believe that Kepler has this launch descriptor thing too... is that being set correctly? Please generate an mmt trace, and we can see if anything stands out compared to a blob trace that also does compute.

Cheers,

-ilia
Ilia Mirkin
2015-Dec-15 19:04 UTC
[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?
Also, where's the exit op? Perhaps what's happening is that you don't have an exit and it just goes off executing into the ether?
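One quick way to check for a missing exit op is to scan the disassembly text for an EXIT mnemonic. The Python sketch below is only a rough aid: the function names are hypothetical, the line format is assumed from the nvdisasm excerpt quoted in this thread, and the "EXIT;" line in the usage example is made up for illustration rather than taken from the real kernel.

```python
import re

# Matches instruction lines such as "/*0008*/ LDC R4, c[0x0][0x0];" and
# captures the offset and mnemonic. An optional predicate like "@P0" or
# "@!P0" in front of the mnemonic is skipped. Encoding comments such as
# "/* 0x1400000003f11c86 */" do not match because of the space and "0x".
INSN_RE = re.compile(
    r"/\*([0-9a-fA-F]{4,})\*/\s+(?:@!?P\w+\s+)?([A-Z][A-Z0-9_.]*)")


def mnemonics(disasm: str):
    """Return a list of (offset, mnemonic) pairs found in the text."""
    return [(int(m.group(1), 16), m.group(2))
            for m in INSN_RE.finditer(disasm)]


def has_exit(disasm: str) -> bool:
    """True if any instruction's base mnemonic is EXIT."""
    return any(op.split(".")[0] == "EXIT" for _, op in mnemonics(disasm))


# Excerpt from the thread; it contains no EXIT.
sample = """
/*0008*/ LDC R4, c[0x0][0x0]; /* 0x1400000003f11c86 */
/*0010*/ MOV R2, c[0x0][0x4]; /* 0x2800400010009de4 */
/*0018*/ LDC.64 R0, c[0x0][0x8]; /* 0x1400000023f01ca6 */
/*0020*/ MOV R3, c[0x0][0x10]; /* 0x280040004000dde4 */
"""

print(has_exit(sample))
# Hypothetical trailing line, added only to show the positive case.
print(has_exit(sample + "/*0028*/ EXIT;\n"))
```

A False result on the full disassembly would support the theory that execution runs past the end of the kernel; a True result would rule it out and point back at the launch setup.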