thr3ads.net - Nouveau - [Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP

Ilia Mirkin

2015-Dec-15 19:04 UTC

[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?

Also, where's the exit op? Perhaps what's happening is that you don't
have an exit and it just goes off executing into the ether?

On Tue, Dec 15, 2015 at 12:00 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> A few things that stand out:
>
>   0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
>
> wtf is that 0x0000000000000 thing doing there? Was it a %rX which got
> constant-folded into 0? That indirectness should have then been
> removed... that said, the final encoding looks fine.
>
> I believe that kepler has this launch descriptor thing too... is that
> being set correctly? Please generate a mmt trace, and we can see if
> anything stands out compared to a blob trace that also does compute.
>
> Cheers,
>
>   -ilia
>
> On Tue, Dec 15, 2015 at 9:15 AM, Hans de Goede <hdegoede at redhat.com> wrote:
>> Hi all,
>>
>> As part of my compute work I'm trying to get some TGSI compute
>> code to work. The code from mesa/src/gallium/tests/trivial.c
>> works.
>>
>> So now I'm trying to get a "native" TGSI kernel to run via
>> clover. I'm using Francisco's nbody.c example for this:
>>
>> https://fedorapeople.org/~jwrdegoede/nbody.c
>>
>> Which does not work, at first I thought there was an issue
>> with the setup of the input / output buffers, but that seems to
>> work fine, and moreover I finally got the smart idea to look
>> in dmesg, which says:
>>
>> [ 9920.802435] nouveau 0000:01:00.0: gr: TRAP ch 6 [007f7fa000 nbody[31881]]
>> [ 9920.802449] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 10009 [INVALID_OPCODE]
>> [ 9920.802456] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 20009 [INVALID_OPCODE]
>>
>> and repeats that for every "step" in the nbody simulation. This is on a
>> gk107 card.
>>
>> So that seems to be the real problem. Since the error says
>> "INVALID_OPCODE", I've put the TGSI code from nbody.c through
>> "nouveau_compiler -a e4" and then run "nvdisasm -b SM30" on it,
>> but the output looks ok. There is an 8-byte sequence every 64 bytes
>> which does not get decoded, but AFAIK that is the scheduling info,
>> so that should be fine.
>>
>> One thing which does stand out is that this:
>>
>>   0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
>>   1: ld u32 %r222 c0[0x4] (0)
>>   2: ld u64 { %r225 %r228 } c0[0x8] (0)
>>   3: ld u32 %r234 c0[0x10] (0)
>>
>> Gets translated into (nvdisasm output) :
>>
>>         /*0008*/    LDC R4, c[0x0][0x0];       /* 0x1400000003f11c86 */
>>         /*0010*/    MOV R2, c[0x0][0x4];       /* 0x2800400010009de4 */
>>         /*0018*/    LDC.64 R0, c[0x0][0x8];    /* 0x1400000023f01ca6 */
>>         /*0020*/    MOV R3, c[0x0][0x10];      /* 0x280040004000dde4 */
>>
>> Where I would expect four LDC instructions. Could that be the problem?
>>
>> If that is not the problem, then hints on how to debug this further
>> would be greatly appreciated.
>>
>> Regards,
>>
>> Hans
>> _______________________________________________
>> Nouveau mailing list
>> Nouveau at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/nouveau

Hans de Goede

2015-Dec-16 17:06 UTC

[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?

Hi,

On 15-12-15 20:04, Ilia Mirkin wrote:
> Also, where's the exit op? Perhaps what's happening is that you don't
> have an exit and it just goes off executing into the ether?

Sorry, I only included a small bit of the program in my original mail
because I found the use of "MOV" instructions to load constants
suspicious. Is that normal?

I've put a log with NV50_PROG_DEBUG=1 output here:

https://fedorapeople.org/~jwrdegoede/nbody.log

nvdisasm -b SM30 for the generated binary code is here:

https://fedorapeople.org/~jwrdegoede/nbody.disasm

There are already .tgsi, .hex and .bin files there if
you find those easier to use than the
NV50_PROG_DEBUG=1 output.

>
> On Tue, Dec 15, 2015 at 12:00 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>> A few things that stand out:
>>
>>    0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
>>
>> wtf is that 0x0000000000000 thing doing there? Was it a %rX which got
>> constant-folded into 0? That indirectness should have then been
>> removed... that said, the final encoding looks fine.
I don't know, maybe there is a hint in the log file?

Regards,

Hans
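
[Editor's sketch] The original mail observes that one 8-byte word in every 64 bytes is not decoded by nvdisasm and is probably scheduling info. A minimal way to check that against the raw .bin file is to walk it in 64-byte groups and separate the first word of each group from the other seven. The layout used here (scheduling word first, little-endian 64-bit words) is an assumption: it is consistent with the first decoded instruction sitting at offset 0x0008 in the excerpt, but it is not confirmed anywhere in this thread. The function name is hypothetical.

```python
import struct

GROUP_BYTES = 64  # assumed group size, from the "every 64 bytes" observation
WORD_BYTES = 8    # each instruction or scheduling word is 8 bytes


def split_groups(code):
    """Split raw shader bytes into (offset, sched_word, instructions) tuples.

    Assumes the first 8-byte word of each 64-byte group is the scheduling
    word and that words are stored little-endian. Trailing bytes that do
    not fill a whole 8-byte word are ignored.
    """
    usable = len(code) - len(code) % WORD_BYTES
    groups = []
    for base in range(0, usable, GROUP_BYTES):
        chunk = code[base:min(base + GROUP_BYTES, usable)]
        words = struct.unpack("<%dQ" % (len(chunk) // WORD_BYTES), chunk)
        groups.append((base, words[0], list(words[1:])))
    return groups
```

Calling `split_groups(open(path, "rb").read())` on the binary (with `path` pointing at wherever the .bin was saved) and printing each word as `"%016x"` should let the instruction words be lined up with the encodings in the nvdisasm comments, assuming the layout guess holds.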


Ilia Mirkin

2015-Dec-16 17:24 UTC

[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?

I believe that your problem is this:

        /*01a0*/    LD R8, [R8];    /* 0x8000000000821c85 */

That needs to be LD.E (and your ST's need to be ST.E). You're using a
32-bit gmem address, but you need to be using a 64-bit one. I believe
the 32-bit ones work on fermi, but afaik not on Kepler.

Cheers,

  -ilia
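
[Editor's sketch] The diagnosis above (plain LD/ST where LD.E/ST.E is needed) can be turned into a quick scan of the nvdisasm output. The snippet below is only an illustration: it assumes the listing format seen in the excerpts in this thread (a `/*offset*/` marker followed by the mnemonic), the function name is hypothetical, and a hit only flags a candidate instruction. It does not prove that the instruction is what triggers the trap.

```python
import re

# Matches "/*01a0*/   LD R8, [R8];" style lines, with an optional predicate
# such as "@P0" before the mnemonic. LDC, LDL, LDS, STL, STS do not match
# because the mnemonic must be followed by "." or whitespace.
_LINE = re.compile(
    r"/\*([0-9a-fA-F]{4,})\*/\s+(?:@!?\w+\s+)?(LD|ST)((?:\.\w+)*)\s"
)


def find_32bit_global_accesses(disasm_text):
    """Return (offset, mnemonic) for each LD/ST that lacks the .E modifier."""
    hits = []
    for line in disasm_text.splitlines():
        m = _LINE.search(line)
        if m is None:
            continue
        offset, op, mods = m.groups()
        # mods looks like "", ".E" or ".E.64"; split it into modifier names
        if "E" not in mods.split("."):
            hits.append((int(offset, 16), op + mods))
    return hits
```

Run over the saved disassembly text, this should list the LD at 0x01a0 quoted above along with any stores that have the same issue.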



