Ilia Mirkin
2015-May-26 23:34 UTC
[Nouveau] Tessellation shaders get MEM_OUT_OF_BOUNDS errors / missing triangles
One additional observation that I just made is that on GK208, the blob
apparently doesn't use the result of S2R Rx, SR_INVOCATION_ID
wholesale in TCS. It either passes it through a I2I.S32.S32 Rx, |Rx|
(i.e. absolute value), or even more paradoxically, shl 2; shr 2; which
removes the top *2* bits, rather than just the top 1. However I see no
such behaviour on GF108.

I'm going to test out tomorrow whether this is the cause of my GK208 woes.

On Fri, May 22, 2015 at 5:10 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> On Mon, May 18, 2015 at 4:48 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>> Hello,
>>
>> I've been debugging a few different tessellation shader issues with
>> nouveau, but let's start small. I see this issue on my GK208 with high
>> frequency, and I *think* I've seen it once or twice on my GF108, but
>> it's exceedingly rare, if it does happen. I don't have a GK10x to test
>> on, unfortunately, but I assume it'll have the same issue as the
>> GK208.
>>
>> The issue is this -- a bunch of triangles that should come out of the
>> tessellator end up black. I also see a GPC0/TPC1/MP trap:
>> MEM_OUT_OF_BOUNDS error produced by nouveau -- this is output in
>> response to a interrupt and MP trap generated by the hardware, read
>> out with nv_rd32(priv, TPC_UNIT(gpc, tpc, 0x648)); (see
>> gf100_gr_trap_mp). I assume some of the tessellation evaluation
>> invocations get killed, but I have no proof of this.
>>
>> I also see this: TRAP ch 5 [0x003facf000 shader_runner[19044]]
>>
>> I would imagine that's some floating point number ending up in the
>> register instead of an address, but the fp32 value of it
>> (1.35107421875) does not seem familiar.
>
> Ben pointed out that the 0x3facf000 is a channel address, not a value
> from the shader. Oops. So that theory completely doesn't hold water.
> Perhaps some buffer isn't big enough? This ends up using 9 output
> vertices per patch, with 2 vec4's each. I've tried playing with the
> per-warp stack size to no avail, but I didn't *entirely* know what I
> was doing either though.
>
>>
>> Even when all the triangles show up, I still see the error on the
>> GK208, so I'm not sure if they're the same issue or not.
>>
>> Now, here's the fun part -- this is completely non-deterministic.
>> Sometimes everything shows up on the GK208, other times I see holes,
>> in varying locations. I'm fairly sure that the actual shader code is
>> correct... so I'm doing something funny wrong. (And yeah, tons of
>> missed optimization opportunities in this code, but let's not dwell on
>> that.)
>>
>> This is the piglit test:
>>
>> http://cgit.freedesktop.org/piglit/tree/tests/spec/arb_tessellation_shader/execution/quads.shader_test
>>
>> It should be noted that other piglit tests don't exhibit this error,
>> however they also tend to be simpler. One key difference is that they
>> don't change the patch size in TCS. I'm including a link to a text
>> file with the tessellation control and evaluation shaders (decoded
>> with nvdisasm which you're hopefully more familiar with), along with
>> the shader headers that we generate.
>>
>> FTR, this is how I feed the raw shader opcode bytes into nvdisasm:
>>
>> perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM35 tt
>>
>> (for some reason it doesn't want to read from a pipe or even a fd).
>>
>> http://people.freedesktop.org/~imirkin/tess_shaders_quads.txt
>>
>> My suspicion is that we're doing something wrong with the sched codes.
>> We have an elaborate calculator, but... perhaps not elaborate enough?
>> You can see it here:
>>
>> http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_emit_nvc0.cpp#n2574
>>
>> The reason I think it's an error in sched codes is due to the TRAP
>> memory location that I see -- could well be some "stale" value in the
>> register and the value from S2R or VILD doesn't make it in there in
>> time before the ALD reads it.
>>
>> If you should like to try this yourself, you can use
>> https://github.com/imirkin/mesa/commits/gl4-integration-2 . This
>> branch is good enough to run Unigine Heaven, but still has a lot of
>> known shortcomings. (Both at the core and the nouveau levels.)
>>
>> Any advice or suggestions for debugging this would be greatly
>> appreciated. And let me know if you'd like me to generate additional
>> info on this. For example I can supply a full command trace that can
>> be piped to demmt, if that's helpful.
>>
>> Thanks in advance,
>>
>> -ilia
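For anyone who would rather not use the perl one-liner quoted above, here is a rough Python sketch of the same idea: take whitespace-separated hex words and pack each one as a 32-bit unsigned integer. It assumes a little-endian host (perl's `pack "I"` uses the native byte order, so the two should match on x86), and the function name `hex_words_to_bytes` is made up for illustration.

```python
import struct


def hex_words_to_bytes(text: str) -> bytes:
    """Pack whitespace-separated hex words as 32-bit unsigned integers.

    Assumption: little-endian output, which matches what perl's native
    `pack "I"` produces on an x86 host.
    """
    out = bytearray()
    for token in text.split():
        # int(..., 16) accepts tokens with or without a leading "0x".
        out += struct.pack("<I", int(token, 16))
    return bytes(out)
```

Writing the returned bytes to a file (the original mail uses one called `tt`) and then running `nvdisasm -b SM35 tt` on it should behave the same way as the one-liner, since nvdisasm is reported above as not wanting to read from a pipe.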
Ilia Mirkin
2015-Jul-23 06:36 UTC
[Nouveau] Tessellation shaders get MEM_OUT_OF_BOUNDS errors / missing triangles
I think I figured out what was going on. Will re-check on the GK208,
but on a GF108 the random blue splotches in Unigine Heaven are gone
now. Turns out that with an instruction like

    /*00d0*/ ALD.128 R0, a[0x70], R0;
    /* 0x7ecc0000381ffc02 */

The hardware will internally split it up into roughly

    ALD R0, a[0x70], R0
    ALD R1, a[0x74], R0
    ALD R2, a[0x78], R0
    ALD R3, a[0x7c], R0

Of course the first one of those overwrites R0, which makes the
subsequent loads be full of fail. Adding a hazard in our RA for the
indirect argument resolves the issue.

-ilia
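The ALD.128 problem described in this message can be modelled with a toy register file. The sketch below is only an illustration of the reported behaviour, not nouveau or hardware code: the function names are hypothetical, attribute space is a plain dict keyed by byte offset, and the effective address is assumed to be the immediate offset plus the value of the indirect register.

```python
def ald_vector(regs, attrs, dst, base, idx, width=4):
    """Model a vector attribute load split into `width` scalar loads.

    Each scalar load re-reads the indirect register `idx` from the
    register file, as the mail describes for ALD.128.
    """
    for i in range(width):
        addr = base + 4 * i + regs[idx]  # assumed addressing: offset + index
        regs[dst + i] = attrs[addr]


def clobbers_index(dst, idx, width=4):
    """Conservative hazard check: does the destination range cover idx?"""
    return dst <= idx < dst + width


def demo():
    # Each attribute slot holds its own byte offset, so a loaded value
    # shows directly which address was read.
    attrs = {off: off for off in range(0, 0x400, 4)}

    bad = [0] * 8
    bad[0] = 0x10  # indirect offset lives in R0, which is also the destination
    ald_vector(bad, attrs, dst=0, base=0x70, idx=0)

    good = [0] * 8
    good[4] = 0x10  # same offset, but kept in R4, outside R0..R3
    ald_vector(good, attrs, dst=0, base=0x70, idx=4)

    return bad[:4], good[:4]
```

In this model the overlapping case reads 0x80, then 0xf4, 0xf8 and 0xfc, because the first scalar load has already replaced the index in R0. The non-overlapping case reads the intended 0x80, 0x84, 0x88 and 0x8c. A register allocator could use a check along the lines of `clobbers_index` to keep the indirect argument out of the destination range.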
Ilia Mirkin
2015-Jul-24 16:34 UTC
[Nouveau] Tessellation shaders get MEM_OUT_OF_BOUNDS errors / missing triangles
Indeed, this fixed the original issue on the GK208. Additionally it
seems like starting with GK104 the mechanism for indirect offsets for
ALD/AST changed and an AL2P instruction must now be used to determine
the "indirect" or "physical" offset. Once nouveau was adjusted to do
this, all MEM_OUT_OF_BOUNDS errors with tessellation shaders are gone.