Hello,

A user on an NVC3 card (GF106) is running into data errors on m2mf
(class 0x9039) that we haven't seen before:

http://people.freedesktop.org/~imirkin/nvc0-comparison/nvc3-2014-03-17-agashlin/glean/fbo.html
http://people.freedesktop.org/~imirkin/nvc0-comparison/nvc3-2014-03-17-agashlin/spec/!OpenGL%201.1/copyteximage%201D.html

Specifically the data errors 0x51 and 0x53, when running method 0x300
("EXEC"). Any chance you could let us know what those errors are? (Or,
even better, provide the full table so that we'll have a better idea
in future cases as well.)

Here are a few that we know about, so you know exactly what table I'm
talking about (our full list at
https://github.com/envytools/envytools/blob/master/rnndb/nv50_defs.xml#L192):

0x04: INVALID_VALUE
0x05: INVALID_ENUM
0x08: INVALID_OBJECT
0x0c: INVALID_BITFIELD
0x3f: PRIMITIVE_ID_NEEDS_GP

We read this data error value from mmio reg 0x400110.

Furthermore, if you could provide any insight as to why we would see
those errors on GF106 but not any other Fermi/Kepler that we've tested
(which should all run exactly the same code paths), that would be
extremely helpful as well. You can see the Fermi piglit runs we have
on file at http://people.freedesktop.org/~imirkin/nvc0-comparison/problems.html

Thanks,

-ilia

Sorry for the very slow response to this, Ilia.

For the specific error you mentioned: the error code
0x51 is "ErrorSrcLineExceedsPitch", and error code 0x53 is
"ErrorDstLineExceedsPitch". It looks like class 0x9039 will generate
those errors under the following conditions:

    if ((NV9039_LAUNCH_DMA_SRC_MEMORY_LAYOUT == PITCH) &&
        (NV9039_LAUNCH_DMA_SRC_INLINE == FALSE) &&
        (NV9039_LINE_COUNT_VALUE > 1) &&
        (NV9039_PITCH_IN_VALUE >= 0) &&
        (NV9039_LINE_LENGTH_IN_VALUE > NV9039_PITCH_IN_VALUE)) {
        return ErrorSrcLineExceedsPitch;
    }

    if ((NV9039_LAUNCH_DMA_DST_MEMORY_LAYOUT == PITCH) &&
        (NV9039_LINE_COUNT_VALUE > 1) &&
        (NV9039_PITCH_OUT_VALUE >= 0) &&
        (NV9039_LINE_LENGTH_IN_VALUE > NV9039_PITCH_OUT_VALUE)) {
        return ErrorDstLineExceedsPitch;
    }

Where those NV9039_* method values are defined as:

    #define NV9039_LAUNCH_DMA                                0x0300
    #define NV9039_LAUNCH_DMA_SRC_INLINE                     0:0
    #define NV9039_LAUNCH_DMA_SRC_INLINE_FALSE               0x00000000
    #define NV9039_LAUNCH_DMA_SRC_INLINE_TRUE                0x00000001
    #define NV9039_LAUNCH_DMA_SRC_MEMORY_LAYOUT              4:4
    #define NV9039_LAUNCH_DMA_SRC_MEMORY_LAYOUT_BLOCKLINEAR  0x00000000
    #define NV9039_LAUNCH_DMA_SRC_MEMORY_LAYOUT_PITCH        0x00000001
    #define NV9039_LAUNCH_DMA_DST_MEMORY_LAYOUT              8:8
    #define NV9039_LAUNCH_DMA_DST_MEMORY_LAYOUT_BLOCKLINEAR  0x00000000
    #define NV9039_LAUNCH_DMA_DST_MEMORY_LAYOUT_PITCH        0x00000001

    #define NV9039_PITCH_IN                                  0x0314
    #define NV9039_PITCH_IN_VALUE                            31:0

    #define NV9039_PITCH_OUT                                 0x0318
    #define NV9039_PITCH_OUT_VALUE                           31:0

    #define NV9039_LINE_LENGTH_IN                            0x031c
    #define NV9039_LINE_LENGTH_IN_VALUE                      31:0

    #define NV9039_LINE_COUNT                                0x0320
    #define NV9039_LINE_COUNT_VALUE                          31:0

As far as I can tell, these checks are not GF106-specific, so I'm not
sure why the problem is only showing up there. Maybe there is something
else unique about the GF106 user's configuration that causes this to
be triggered?

Thanks,
- Andy

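For anyone checking a command stream dump against the two conditions Andy describes, they could be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct m2mf_state`, `m2mf_check_launch`, and the `M2MF_OK` value of 0 are hypothetical names invented here, not part of any driver or of the class headers. The pitch values are treated as signed 32-bit so that the `>= 0` test from the description stays meaningful, which is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Data error codes named in this thread. M2MF_OK is a hypothetical
 * "no error" marker used only by this sketch. */
enum m2mf_data_error {
    M2MF_OK                           = 0x00,
    M2MF_ERROR_SRC_LINE_EXCEEDS_PITCH = 0x51,
    M2MF_ERROR_DST_LINE_EXCEEDS_PITCH = 0x53,
};

/* Hypothetical snapshot of the method values latched before LAUNCH_DMA
 * (method 0x0300) executes. */
struct m2mf_state {
    uint32_t launch_dma;     /* value written to method 0x0300 */
    int32_t  pitch_in;       /* method 0x0314; signedness is assumed */
    int32_t  pitch_out;      /* method 0x0318; signedness is assumed */
    uint32_t line_length_in; /* method 0x031c */
    uint32_t line_count;     /* method 0x0320 */
};

static enum m2mf_data_error m2mf_check_launch(const struct m2mf_state *s)
{
    /* Single-bit fields of LAUNCH_DMA: SRC_INLINE is bit 0,
     * SRC_MEMORY_LAYOUT is bit 4, DST_MEMORY_LAYOUT is bit 8.
     * A layout bit of 1 means PITCH, 0 means BLOCKLINEAR. */
    bool src_inline = (s->launch_dma >> 0) & 1u;
    bool src_pitch  = (s->launch_dma >> 4) & 1u;
    bool dst_pitch  = (s->launch_dma >> 8) & 1u;

    /* Source check: skipped for inline data and single-line copies. */
    if (src_pitch && !src_inline &&
        s->line_count > 1 &&
        s->pitch_in >= 0 &&
        s->line_length_in > (uint32_t)s->pitch_in)
        return M2MF_ERROR_SRC_LINE_EXCEEDS_PITCH;

    /* Destination check: also compares against LINE_LENGTH_IN. */
    if (dst_pitch &&
        s->line_count > 1 &&
        s->pitch_out >= 0 &&
        s->line_length_in > (uint32_t)s->pitch_out)
        return M2MF_ERROR_DST_LINE_EXCEEDS_PITCH;

    return M2MF_OK;
}
```

Feeding it the LAUNCH_DMA, PITCH_IN, PITCH_OUT, LINE_LENGTH_IN and LINE_COUNT values seen just before the failing EXEC should show which of the two conditions a given copy trips. Note that a single-line copy (LINE_COUNT of 1) never trips either check, whatever the pitch.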
On Wed, Apr 30, 2014 at 11:54 AM, Andy Ritger <aritger at nvidia.com> wrote:
> Sorry for the very slow response to this, Ilia.
>
> For the specific error you mentioned: the error code
> 0x51 is "ErrorSrcLineExceedsPitch", and error code 0x53 is
> "ErrorDstLineExceedsPitch".

Very helpful info, thanks! That should help narrow the source of the problem.

> As far as I can tell, these checks are not GF106-specific, so I'm not
> sure why the problem is only showing up there. Maybe there is something
> else unique about the GF106 user's configuration that causes this to
> be triggered?

Perhaps. I've also observed that different GPUs are differently
sensitive to invalid values. For example, we had a bug that manifested
itself in G80-G94 yelling at us about out-of-bounds X/Y coordinates,
while G96+ happily took the illegal values (and probably did nasty
things with them, like overwriting memory it wasn't supposed to touch).
It is odd that _only_ GF106 would have that logic, but... whatever. I'm
also missing GF104, GF110, and GF117 results, so who knows, perhaps they
would have also reported the issue.

I guess another possibility I hadn't previously considered is that this
user's GF106 could just be somehow busted; his is the only one I know
of, so I couldn't cross-check with a different one. But the problem is
sufficiently restricted that it seems unlikely to be a bad part, and
more likely a driver bug.

Anyways, now that we know what to look for, it should be much easier to
identify in a command stream dump.

Thanks again,

-ilia
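
With the two new codes, the data error values mentioned in this thread could be collected into a small lookup table along these lines. This is a rough sketch: `data_error_name` and the table layout are hypothetical and do not reflect how envytools or nouveau actually store these names, and the table holds only the codes quoted in this thread, not the full list.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per data error code quoted in this thread. */
struct data_error_entry {
    uint32_t    code;
    const char *name;
};

static const struct data_error_entry data_errors[] = {
    { 0x04, "INVALID_VALUE" },
    { 0x05, "INVALID_ENUM" },
    { 0x08, "INVALID_OBJECT" },
    { 0x0c, "INVALID_BITFIELD" },
    { 0x3f, "PRIMITIVE_ID_NEEDS_GP" },
    { 0x51, "ErrorSrcLineExceedsPitch" },
    { 0x53, "ErrorDstLineExceedsPitch" },
};

/* Returns the name for a data error code, or NULL if the code is not
 * in the table above. */
static const char *data_error_name(uint32_t code)
{
    size_t n = sizeof(data_errors) / sizeof(data_errors[0]);

    for (size_t i = 0; i < n; i++) {
        if (data_errors[i].code == code)
            return data_errors[i].name;
    }
    return NULL;
}
```

A caller would pass in the value read from mmio reg 0x400110 and fall back to printing the raw hex code when the function returns NULL.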