Mar 1 18:21:23 madman kernel: [ 1697.116256] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
Mar 1 18:21:23 madman kernel: [ 1697.116275] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
Mar 1 18:21:23 madman kernel: [ 1697.116283] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE FAULT
Mar 1 18:21:23 madman kernel: [ 1697.116299] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE 00000080 00000000 00000000 00000000 00000000 00000004 00000000
Mar 1 18:21:23 madman kernel: [ 1697.116306] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
Mar 1 18:21:23 madman kernel: [ 1697.116318] [drm] nouveau 0000:01:00.0: PGRAPH - ch 3 (0x00018f3000) subc 7 class 0x8297 mthd 0x15e0 data 0x00000000
Mar 1 18:21:23 madman kernel: [ 1697.116330] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
Mar 1 18:21:23 madman kernel: [ 1697.116342] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
Mar 1 18:21:23 madman kernel: [ 1697.116349] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE FAULT
Mar 1 18:21:23 madman kernel: [ 1697.116363] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE 00000080 00000000 00000000 00000000 00000000 00000004 00080000
Mar 1 18:21:23 madman kernel: [ 1697.116371] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
Mar 1 18:21:23 madman kernel: [ 1697.116380] [drm] nouveau 0000:01:00.0: PGRAPH - ch 3 (0x00018f3000) subc 7 class 0x8297 mthd 0x1084 data 0x219d6fff
Mar 1 18:21:23 madman kernel: [ 1697.116392] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH FAULT
Mar 1 18:21:23 madman kernel: [ 1697.116404] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_VFETCH 00f00000 0000fe0c 00000000 00000000
Mar 1 18:21:23 madman kernel: [ 1697.116410] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
Mar 1 18:21:23 madman kernel: [ 1697.116420] [drm] nouveau 0000:01:00.0: PGRAPH - ch 3 (0x00018f3000) subc 7 class 0x8297 mthd 0x15e0 data 0x00000000
Mar 1 18:21:29 madman kernel: [ 1703.981014] [drm] nouveau 0000:01:00.0: Failed to idle channel 3.
Mar 1 18:21:31 madman kernel: [ 1705.601034] [drm] nouveau 0000:01:00.0: Ctxprog is still running

Those come after 15-30 minutes of running warzone2100. I haven't played any games for a while, so I have no idea how long this has been going on.
I also got a TRAP_CCACHE on channel 2 a little while ago; it takes much longer to trigger (a few hours). I'm using today's "nouveau kernel" git.

I'm guessing something is being unmapped too early or without reason, or some cache is stale, but it isn't obvious what exactly it is.

Because I don't remember having these lockups before, I'm inclined to guess that this commit is involved:
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=6330d8f5ecc4a19fd2ad3c7fa128b2f4c2ce3360

Any ideas?

Maarten.

--
Far away from the primal instinct, the song seems to fade away, the
river get wider between your thoughts and the things we do and say.

On Tue, Mar 1, 2011 at 9:08 PM, Maarten Maathuis <madman2003 at gmail.com> wrote:
> I also got a TRAP_CCACHE on channel 2 a little while ago, it takes
> much longer to trigger (a few hours).

This is the "DDX" trap:

Mar 1 22:19:48 madman kernel: [ 1499.376769] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE FAULT
Mar 1 22:19:48 madman kernel: [ 1499.376782] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP_CCACHE 00000080 00000000 00000000 00000000 00000000 00000004 00000000
Mar 1 22:19:48 madman kernel: [ 1499.376785] [drm] nouveau 0000:01:00.0: PGRAPH - TRAP
Mar 1 22:19:48 madman kernel: [ 1499.376790] [drm] nouveau 0000:01:00.0: PGRAPH - ch 2 (0x0000840000) subc 5 class 0x8297 mthd 0x0f04 data 0x00000000

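For anyone comparing the channel 3 (game) traps with the channel 2 ("DDX") trap, a small script can pull the `ch` / `subc` / `class` / `mthd` / `data` fields out of the log. This is only a rough sketch: the regex is written against the lines quoted in this thread, the function and field names (`parse_traps`, `addr`, and so on) are made up for illustration, and the meaning of the value in parentheses is not decoded here.

```python
import re
from collections import Counter

# Matches the "ch N (0x...) subc N class 0x... mthd 0x... data 0x..." lines
# exactly as they appear in the logs quoted in this thread. The group name
# "addr" is a placeholder; the thread does not say what that value means.
TRAP_RE = re.compile(
    r"PGRAPH - ch (?P<ch>\d+) \((?P<addr>0x[0-9a-f]+)\) "
    r"subc (?P<subc>\d+) class (?P<cls>0x[0-9a-f]+) "
    r"mthd (?P<mthd>0x[0-9a-f]+) data (?P<data>0x[0-9a-f]+)"
)


def parse_traps(log_text):
    """Return one dict per PGRAPH channel line found in log_text."""
    traps = []
    for m in TRAP_RE.finditer(log_text):
        traps.append({
            "ch": int(m.group("ch")),
            "subc": int(m.group("subc")),
            "class": int(m.group("cls"), 16),
            "mthd": int(m.group("mthd"), 16),
            "data": int(m.group("data"), 16),
        })
    return traps


# Three channel lines copied from the messages above.
SAMPLE = """\
[ 1697.116318] [drm] nouveau 0000:01:00.0: PGRAPH - ch 3 (0x00018f3000) subc 7 class 0x8297 mthd 0x15e0 data 0x00000000
[ 1697.116380] [drm] nouveau 0000:01:00.0: PGRAPH - ch 3 (0x00018f3000) subc 7 class 0x8297 mthd 0x1084 data 0x219d6fff
[ 1499.376790] [drm] nouveau 0000:01:00.0: PGRAPH - ch 2 (0x0000840000) subc 5 class 0x8297 mthd 0x0f04 data 0x00000000
"""

traps = parse_traps(SAMPLE)
# Count traps per channel to see which client is hitting them.
per_channel = Counter(t["ch"] for t in traps)
for ch, count in sorted(per_channel.items()):
    print(f"channel {ch}: {count} trap(s)")
```

Feeding it a full `dmesg` dump instead of `SAMPLE` should work the same way, as long as the log lines keep this format.
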
On Tue, 2011-03-01 at 21:08 +0000, Maarten Maathuis wrote:
> Those come after 15-30 minutes of running warzone2100, i haven't
> played any games for a while, so no idea how long this has been going
> on.
> I also got a TRAP_CCACHE on channel 2 a little while ago, it takes
> much longer to trigger (a few hours). I'm using todays "nouveau
> kernel" git.

You're not the first person to have reported this, fwiw. Personally, I haven't seen it yet.

> I'm guessing something is being unmapped too early or without reason,
> or some cache is stale. But it isn't obvious what exactly it is.
>
> Because i don't remember having these lockups before I'm inclined to
> guess that this commit is involved
> http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=6330d8f5ecc4a19fd2ad3c7fa128b2f4c2ce3360
>
> Any ideas?

Not really. If this commit *is* the cause, the problem is still somewhere else. That commit just makes sure PTEs are marked invalid, so if it's causing your faults, then previously the GPU would still have been reading/writing invalid data.

Plus, I expect you should probably have seen a VM fault.

Ben.

On Sun, Mar 6, 2011 at 2:24 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
> On 07/03/2011, at 0:03, Maarten Maathuis <madman2003 at gmail.com> wrote:
>> On Sun, Mar 6, 2011 at 1:44 PM, Ben Skeggs <skeggsb at gmail.com> wrote:
>>> Sorry for the top posting, it's late and typing from my phone in bed lol.
>>>
>>> Just wanted to see if you had an update? And, this is NV86 I guess?
>>>
>>> Ben.
>>>
>>> On 02/03/2011, at 8:20, Maarten Maathuis <madman2003 at gmail.com> wrote:
>>>> On Tue, Mar 1, 2011 at 9:51 PM, Ben Skeggs <bskeggs at redhat.com> wrote:
>>>>> On Tue, 2011-03-01 at 21:08 +0000, Maarten Maathuis wrote:
>>>>>> Those come after 15-30 minutes of running warzone2100, i haven't
>>>>>> played any games for a while, so no idea how long this has been going
>>>>>> on.
>>>>>> I also got a TRAP_CCACHE on channel 2 a little while ago, it takes
>>>>>> much longer to trigger (a few hours). I'm using todays "nouveau
>>>>>> kernel" git.
>>>>> You're not the first person to have reported this fwiw, personally, I
>>>>> haven't seen it yet..
>>>>>
>>>>>> I'm guessing something is being unmapped too early or without reason,
>>>>>> or some cache is stale. But it isn't obvious what exactly it is.
>>>>>>
>>>>>> Because i don't remember having these lockups before I'm inclined to
>>>>>> guess that this commit is involved
>>>>>> http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=6330d8f5ecc4a19fd2ad3c7fa128b2f4c2ce3360
>>>>>>
>>>>>> Any ideas?
>>>>> Not really. If this commit *is* the cause, the problem is still
>>>>> somewhere else. That commit just makes sure PTEs are marked invalid, so
>>>>> if it's causing your faults, then previously the GPU would still have
>>>>> been reading/writing invalid data.
>>>>>
>>>>> Plus, I expect you should probably have seen a VM fault..
>>>>
>>>> So these faults are just generic errors? Unrelated to page faults?
>>
>> No this is NV96. The revert definitely helps, but no luck so far in
>> finding a plausible cause for the problem.
>
> Hey,
>
> Ok. Hmm. I thought you had NV86 for some reason! It's a long shot and
> I'm not entirely convinced it'll help at all, but can you switch
> graph.tlb_flush pointer to the nv86 version and see if anything changes?

I used to have an NV86, but it died more than a year ago in the typical way for that generation of card, due to thermal issues I guess (it was a passively cooled card). I haven't tried using the nv86 tlb flush. Out of curiosity, is this something nvidia does (a lot) on nv86?

> The *other* possible thing is that the ttm delayed delete queue is
> causing multiple tlb flushes to happen at the same time. I'll add
> locking for that in the morning, that was a complete oversight.

I've had no lockups since you added the spinlocks, so maybe that was it. Time will tell.

> Ben.
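
To illustrate the kind of problem Ben describes (several code paths triggering a TLB flush at the same time) and why a lock addresses it, here is a userspace analogy. It is not the nouveau code and not the actual fix: `FakeTLB`, `flush` and `worker` are invented names, and a `threading.Lock` stands in for the kernel spinlock mentioned above.

```python
import threading


class FakeTLB:
    """Hypothetical stand-in for hardware that must not see overlapping flushes."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the spinlock
        self._in_flush = 0             # flushes currently in progress
        self.max_concurrent = 0        # highest overlap ever observed
        self.flushes = 0               # total completed flushes

    def flush(self):
        # Only one caller at a time may run the flush sequence.
        with self._lock:
            self._in_flush += 1
            self.max_concurrent = max(self.max_concurrent, self._in_flush)
            # ... the real flush sequence would go here ...
            self.flushes += 1
            self._in_flush -= 1


def worker(tlb, count):
    # Stand-in for any path that frees buffers and then flushes,
    # such as a delayed-delete queue running alongside normal unmaps.
    for _ in range(count):
        tlb.flush()


tlb = FakeTLB()
threads = [threading.Thread(target=worker, args=(tlb, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("flushes:", tlb.flushes, "max concurrent:", tlb.max_concurrent)
```

With the lock in place, `max_concurrent` cannot go above 1 however the threads interleave; without it, two callers could be inside the flush sequence at once, which is the situation the added locking is meant to rule out.
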