Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 0/5] Improve Robust Channel (RC) recovery for Turing
This is an initial series of patches to improve channel recovery on Turing GPUs
with the goal of improving reliability enough to eventually enable SVM for
Turing. It's likely follow-up patches will be required to fully address problems
with less trivial workloads than what I have been able to test thus far.

This series primarily addresses a number of hardware changes to interrupt layout
and channel recovery for Turing, and for simple cases improves the handling and
reliability of recovery.

I have been testing trivial OpenCL workloads and with this series have been able
to recover from while(1) style GPU loops and bad pointer dereferences on a
Turing GPU. However, if there are less trivial tests available that have been
known to cause problems with channel recovery in the past let me know and I'll
start testing those as well.

Alistair Popple (5):
  drm/nouveau: Fix MMU fault interrupts on Turing
  drm/nouveau: Remove Turing interrupt hack
  drm/nouveau: Move Turing specific FIFO functions
  drm/nouveau: FIFO interrupt fixes for Turing
  drm/nouveau: Turing channel preemption fix

 .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.c  |  46 +--
 .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.h  |  32 ++
 .../gpu/drm/nouveau/nvkm/engine/fifo/tu102.c  | 364 +++++++++++++++++-
 .../gpu/drm/nouveau/nvkm/subdev/fault/tu102.c |  21 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c |   3 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h |   1 -
 .../gpu/drm/nouveau/nvkm/subdev/mc/tu102.c    | 113 +++++-
 7 files changed, 529 insertions(+), 51 deletions(-)

-- 
2.20.1
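For concreteness, the "trivial OpenCL workloads" described above amount to
kernels along the following lines. This is a minimal, hypothetical OpenCL C
sketch for illustration only -- it is not part of the series, and the kernel
names and the bogus address are made up:

/* Hangs the engine: the kernel never terminates on its own, so channel
 * recovery has to preempt and kill the channel.
 */
__kernel void spin_forever(__global volatile int *flag)
{
        while (!*flag)
                ;       /* the host never sets *flag */
}

/* Faults the MMU: dereferences an address no allocation backs, so
 * recovery has to kill the channel once the fault is reported.
 */
__kernel void bad_deref(__global int *out)
{
        *out = *(__global volatile int *)0xdead0000;    /* bogus address */
}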
Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 1/5] drm/nouveau: Fix MMU fault interrupts on Turing
Turing reports MMU fault interrupts via new top level interrupt registers.
The old PMC MMU interrupt vector is not used by the HW. This means we can
remap the new top-level MMU interrupt to the existing PMC MMU bit, which
simplifies the implementation until all interrupts are moved over to using
the new top level registers.

Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
 .../gpu/drm/nouveau/nvkm/subdev/fault/tu102.c |  21 +++-
 .../gpu/drm/nouveau/nvkm/subdev/mc/tu102.c    | 107 +++++++++++++++++-
 2 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/fault/tu102.c b/drivers/gpu/drm/nouveau/nvkm/subdev/fault/tu102.c
index 45a6a68b9f48..f080051b0c65 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/fault/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/fault/tu102.c
@@ -22,6 +22,7 @@
 #include "priv.h"
 
 #include <core/memory.h>
+#include <subdev/mc.h>
 #include <subdev/mmu.h>
 #include <engine/fifo.h>
 
@@ -34,6 +35,9 @@ tu102_fault_buffer_intr(struct nvkm_fault_buffer *buffer, bool enable)
 	 * which don't appear to actually work anymore, but newer
 	 * versions of RM don't appear to touch anything at all..
 	 */
+	struct nvkm_device *device = buffer->fault->subdev.device;
+
+	nvkm_mc_intr_mask(device, NVKM_SUBDEV_FAULT, enable);
 }
 
 static void
@@ -41,6 +45,11 @@ tu102_fault_buffer_fini(struct nvkm_fault_buffer *buffer)
 {
 	struct nvkm_device *device = buffer->fault->subdev.device;
 	const u32 foff = buffer->id * 0x20;
+
+	/* Disable the fault interrupts */
+	nvkm_wr32(device, 0xb81408, 0x1);
+	nvkm_wr32(device, 0xb81410, 0x10);
+
 	nvkm_mask(device, 0xb83010 + foff, 0x80000000, 0x00000000);
 }
 
@@ -50,6 +59,10 @@ tu102_fault_buffer_init(struct nvkm_fault_buffer *buffer)
 	struct nvkm_device *device = buffer->fault->subdev.device;
 	const u32 foff = buffer->id * 0x20;
 
+	/* Enable the fault interrupts */
+	nvkm_wr32(device, 0xb81208, 0x1);
+	nvkm_wr32(device, 0xb81210, 0x10);
+
 	nvkm_mask(device, 0xb83010 + foff, 0xc0000000, 0x40000000);
 	nvkm_wr32(device, 0xb83004 + foff, upper_32_bits(buffer->addr));
 	nvkm_wr32(device, 0xb83000 + foff, lower_32_bits(buffer->addr));
@@ -109,14 +122,20 @@ tu102_fault_intr(struct nvkm_fault *fault)
 	}
 
 	if (stat & 0x00000200) {
+		/* Clear the associated interrupt flag */
+		nvkm_wr32(device, 0xb81010, 0x10);
+
 		if (fault->buffer[0]) {
 			nvkm_event_send(&fault->event, 1, 0, NULL, 0);
 			stat &= ~0x00000200;
 		}
 	}
 
-	/*XXX: guess, can't confirm until we get fw... */
+	/* Replayable MMU fault */
 	if (stat & 0x00000100) {
+		/* Clear the associated interrupt flag */
+		nvkm_wr32(device, 0xb81008, 0x1);
+
 		if (fault->buffer[1]) {
 			nvkm_event_send(&fault->event, 1, 1, NULL, 0);
 			stat &= ~0x00000100;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
index d098c44a4fcb..cda924d56a2a 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
@@ -19,13 +19,93 @@
  * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
  * OTHER DEALINGS IN THE SOFTWARE.
  */
+#define tu102_mc(p) container_of((p), struct tu102_mc, base)
 #include "priv.h"
 
+struct tu102_mc {
+	struct nvkm_mc base;
+	spinlock_t lock;
+	bool intr;
+	u32 mask;
+};
+
+static void
+tu102_mc_intr_update(struct tu102_mc *mc)
+{
+	struct nvkm_device *device = mc->base.subdev.device;
+	u32 mask = mc->intr ? mc->mask : 0, i;
+
+	for (i = 0; i < 2; i++) {
+		nvkm_wr32(device, 0x000180 + (i * 0x04), ~mask);
+		nvkm_wr32(device, 0x000160 + (i * 0x04), mask);
+	}
+
+	if (mask & 0x00000200)
+		nvkm_wr32(device, 0xb81608, 0x6);
+	else
+		nvkm_wr32(device, 0xb81610, 0x6);
+}
+
+void
+tu102_mc_intr_unarm(struct nvkm_mc *base)
+{
+	struct tu102_mc *mc = tu102_mc(base);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mc->lock, flags);
+	mc->intr = false;
+	tu102_mc_intr_update(mc);
+	spin_unlock_irqrestore(&mc->lock, flags);
+}
+
+void
+tu102_mc_intr_rearm(struct nvkm_mc *base)
+{
+	struct tu102_mc *mc = tu102_mc(base);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mc->lock, flags);
+	mc->intr = true;
+	tu102_mc_intr_update(mc);
+	spin_unlock_irqrestore(&mc->lock, flags);
+}
+
+void
+tu102_mc_intr_mask(struct nvkm_mc *base, u32 mask, u32 intr)
+{
+	struct tu102_mc *mc = tu102_mc(base);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mc->lock, flags);
+	mc->mask = (mc->mask & ~mask) | intr;
+	tu102_mc_intr_update(mc);
+	spin_unlock_irqrestore(&mc->lock, flags);
+}
+
+static u32
+tu102_mc_intr_stat(struct nvkm_mc *mc)
+{
+	struct nvkm_device *device = mc->subdev.device;
+	u32 intr0 = nvkm_rd32(device, 0x000100);
+	u32 intr1 = nvkm_rd32(device, 0x000104);
+	u32 intr_top = nvkm_rd32(device, 0xb81600);
+
+	/* Turing and above route the MMU fault interrupts via a different
+	 * interrupt tree with different control registers. For the moment
+	 * remap them back to the old PMC vector.
+	 */
+	if (intr_top & 0x00000006)
+		intr0 |= 0x00000200;
+
+	return intr0 | intr1;
+}
+
 static void
 tu102_mc_intr_hack(struct nvkm_mc *mc, bool *handled)
 {
 	struct nvkm_device *device = mc->subdev.device;
 	u32 stat = nvkm_rd32(device, 0xb81010);
+
 	if (stat & 0x00000050) {
 		struct nvkm_subdev *subdev =
 			nvkm_device_subdev(device, NVKM_SUBDEV_FAULT);
@@ -40,16 +120,33 @@ static const struct nvkm_mc_func
 tu102_mc = {
 	.init = nv50_mc_init,
 	.intr = gp100_mc_intr,
-	.intr_unarm = gp100_mc_intr_unarm,
-	.intr_rearm = gp100_mc_intr_rearm,
-	.intr_mask = gp100_mc_intr_mask,
-	.intr_stat = gf100_mc_intr_stat,
+	.intr_unarm = tu102_mc_intr_unarm,
+	.intr_rearm = tu102_mc_intr_rearm,
+	.intr_mask = tu102_mc_intr_mask,
+	.intr_stat = tu102_mc_intr_stat,
 	.intr_hack = tu102_mc_intr_hack,
 	.reset = gk104_mc_reset,
 };
 
+int
+tu102_mc_new_(const struct nvkm_mc_func *func, struct nvkm_device *device,
+	      int index, struct nvkm_mc **pmc)
+{
+	struct tu102_mc *mc;
+
+	if (!(mc = kzalloc(sizeof(*mc), GFP_KERNEL)))
+		return -ENOMEM;
+	nvkm_mc_ctor(func, device, index, &mc->base);
+	*pmc = &mc->base;
+
+	spin_lock_init(&mc->lock);
+	mc->intr = false;
+	mc->mask = 0x7fffffff;
+	return 0;
+}
+
 int
 tu102_mc_new(struct nvkm_device *device, int index, struct nvkm_mc **pmc)
 {
-	return gp100_mc_new_(&tu102_mc, device, index, pmc);
+	return tu102_mc_new_(&tu102_mc, device, index, pmc);
 }
-- 
2.20.1
Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 2/5] drm/nouveau: Remove Turing interrupt hack
This is no longer needed now that tu102_mc_intr_stat has been updated to
look at the correct top-level interrupt bits.

Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
 drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c  |  3 ---
 drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h  |  1 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c | 16 ----------------
 3 files changed, 20 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c
index 0e57ab2a709f..09f669ac6630 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c
@@ -108,9 +108,6 @@ nvkm_mc_intr(struct nvkm_device *device, bool *handled)
 	if (stat)
 		nvkm_error(&mc->subdev, "intr %08x\n", stat);
 	*handled = intr != 0;
-
-	if (mc->func->intr_hack)
-		mc->func->intr_hack(mc, handled);
 }
 
 static u32
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h
index 4aab753a6040..0d01b2c419ff 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h
@@ -26,7 +26,6 @@ struct nvkm_mc_func {
 	void (*intr_mask)(struct nvkm_mc *, u32 mask, u32 stat);
 	/* retrieve pending interrupt mask (NV_PMC_INTR) */
 	u32 (*intr_stat)(struct nvkm_mc *);
-	void (*intr_hack)(struct nvkm_mc *, bool *handled);
 	const struct nvkm_mc_map *reset;
 	void (*unk260)(struct nvkm_mc *, u32);
 };
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
index cda924d56a2a..af0afd1ad6ee 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mc/tu102.c
@@ -100,21 +100,6 @@ tu102_mc_intr_stat(struct nvkm_mc *mc)
 	return intr0 | intr1;
 }
 
-static void
-tu102_mc_intr_hack(struct nvkm_mc *mc, bool *handled)
-{
-	struct nvkm_device *device = mc->subdev.device;
-	u32 stat = nvkm_rd32(device, 0xb81010);
-
-	if (stat & 0x00000050) {
-		struct nvkm_subdev *subdev =
-			nvkm_device_subdev(device, NVKM_SUBDEV_FAULT);
-		nvkm_wr32(device, 0xb81010, stat & 0x00000050);
-		if (subdev)
-			nvkm_subdev_intr(subdev);
-		*handled = true;
-	}
-}
 
 static const struct nvkm_mc_func
 tu102_mc = {
@@ -124,7 +109,6 @@ tu102_mc = {
 	.intr_rearm = tu102_mc_intr_rearm,
 	.intr_mask = tu102_mc_intr_mask,
 	.intr_stat = tu102_mc_intr_stat,
-	.intr_hack = tu102_mc_intr_hack,
 	.reset = gk104_mc_reset,
 };
 
-- 
2.20.1
Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 3/5] drm/nouveau: Move Turing specific FIFO functions
Turing requires some changes to FIFO interrupt handling due to changes in
HW register layout. It also requires some changes to implement robust
channel (RC) recovery. This preparatory patch moves the functions requiring
changes into nvkm/engine/fifo/tu102.c so they can be altered without
affecting gk104 and other users. It should not introduce any functional
changes.

Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
 .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.c |  46 +-
 .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.h |  32 ++
 .../gpu/drm/nouveau/nvkm/engine/fifo/tu102.c | 463 +++++++++++++++++-
 3 files changed, 511 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
index 5d4b695cab8e..c73b7eab776e 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
@@ -36,19 +36,7 @@
 #include <nvif/class.h>
 #include <nvif/cl0080.h>
 
-struct gk104_fifo_engine_status {
-	bool busy;
-	bool faulted;
-	bool chsw;
-	bool save;
-	bool load;
-	struct {
-		bool tsg;
-		u32 id;
-	} prev, next, *chan;
-};
-
-static void
+void
 gk104_fifo_engine_status(struct gk104_fifo *fifo, int engn,
 			 struct gk104_fifo_engine_status *status)
 {
@@ -95,7 +83,7 @@ gk104_fifo_engine_status(struct gk104_fifo *fifo, int engn,
 		  status->chan == &status->next ? "*" : " ");
 }
 
-static int
+int
 gk104_fifo_class_new(struct nvkm_fifo *base, const struct nvkm_oclass *oclass,
 		     void *argv, u32 argc, struct nvkm_object **pobject)
 {
@@ -112,7 +100,7 @@ gk104_fifo_class_new(struct nvkm_fifo *base, const struct nvkm_oclass *oclass,
 	return -EINVAL;
 }
 
-static int
+int
 gk104_fifo_class_get(struct nvkm_fifo *base, int index,
 		     struct nvkm_oclass *oclass)
 {
@@ -134,14 +122,14 @@ gk104_fifo_class_get(struct nvkm_fifo *base, int index,
 	return c;
 }
 
-static void
+void
 gk104_fifo_uevent_fini(struct nvkm_fifo *fifo)
 {
 	struct nvkm_device *device = fifo->engine.subdev.device;
 	nvkm_mask(device, 0x002140, 0x80000000, 0x00000000);
 }
 
-static void
+void
 gk104_fifo_uevent_init(struct nvkm_fifo *fifo)
 {
 	struct nvkm_device *device = fifo->engine.subdev.device;
@@ -556,7 +544,7 @@ gk104_fifo_bind_reason[] = {
 	{}
 };
 
-static void
+void
 gk104_fifo_intr_bind(struct gk104_fifo *fifo)
 {
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
@@ -627,7 +615,7 @@ gk104_fifo_intr_sched(struct gk104_fifo *fifo)
 	}
 }
 
-static void
+void
 gk104_fifo_intr_chsw(struct gk104_fifo *fifo)
 {
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
@@ -637,7 +625,7 @@ gk104_fifo_intr_chsw(struct gk104_fifo *fifo)
 	nvkm_wr32(device, 0x00256c, stat);
 }
 
-static void
+void
 gk104_fifo_intr_dropped_fault(struct gk104_fifo *fifo)
 {
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
@@ -680,7 +668,7 @@ static const struct nvkm_bitfield gk104_fifo_pbdma_intr_0[] = {
 	{}
 };
 
-static void
+void
 gk104_fifo_intr_pbdma_0(struct gk104_fifo *fifo, int unit)
 {
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
@@ -729,7 +717,7 @@ static const struct nvkm_bitfield gk104_fifo_pbdma_intr_1[] = {
 	{}
 };
 
-static void
+void
 gk104_fifo_intr_pbdma_1(struct gk104_fifo *fifo, int unit)
 {
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
@@ -750,7 +738,7 @@ gk104_fifo_intr_pbdma_1(struct gk104_fifo *fifo, int unit)
 	nvkm_wr32(device, 0x040148 + (unit * 0x2000), stat);
 }
 
-static void
+void
 gk104_fifo_intr_runlist(struct gk104_fifo *fifo)
 {
 	struct nvkm_device *device = fifo->base.engine.subdev.device;
@@ -763,7 +751,7 @@ gk104_fifo_intr_runlist(struct gk104_fifo *fifo)
 	}
 }
 
-static void
+void
 gk104_fifo_intr_engine(struct gk104_fifo *fifo)
 {
 	nvkm_fifo_uevent(&fifo->base);
@@ -861,7 +849,7 @@ gk104_fifo_intr(struct nvkm_fifo *base)
 	}
 }
 
-static void
+void
 gk104_fifo_fini(struct nvkm_fifo *base)
 {
 	struct gk104_fifo *fifo = gk104_fifo(base);
@@ -871,7 +859,7 @@ gk104_fifo_fini(struct nvkm_fifo *base)
 	nvkm_mask(device, 0x002140, 0x10000000, 0x10000000);
 }
 
-static int
+int
 gk104_fifo_info(struct nvkm_fifo *base, u64 mthd, u64 *data)
 {
 	struct gk104_fifo *fifo = gk104_fifo(base);
@@ -899,7 +887,7 @@ gk104_fifo_info(struct nvkm_fifo *base, u64 mthd, u64 *data)
 	}
 }
 
-static int
+int
 gk104_fifo_oneinit(struct nvkm_fifo *base)
 {
 	struct gk104_fifo *fifo = gk104_fifo(base);
@@ -974,7 +962,7 @@ gk104_fifo_oneinit(struct nvkm_fifo *base)
 	return nvkm_memory_map(fifo->user.mem, 0, bar, fifo->user.bar, NULL, 0);
 }
 
-static void
+void
 gk104_fifo_init(struct nvkm_fifo *base)
 {
 	struct gk104_fifo *fifo = gk104_fifo(base);
@@ -1006,7 +994,7 @@ gk104_fifo_init(struct nvkm_fifo *base)
 	nvkm_wr32(device, 0x002140, 0x7fffffff);
 }
 
-static void *
+void *
 gk104_fifo_dtor(struct nvkm_fifo *base)
 {
 	struct gk104_fifo *fifo = gk104_fifo(base);
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h
index 6407a4a174cf..4398b340e514 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h
@@ -87,11 +87,43 @@ struct gk104_fifo_func {
 	bool cgrp_force;
 };
 
+struct gk104_fifo_engine_status {
+	bool busy;
+	bool faulted;
+	bool chsw;
+	bool save;
+	bool load;
+	struct {
+		bool tsg;
+		u32 id;
+	} prev, next, *chan;
+};
+
 int gk104_fifo_new_(const struct gk104_fifo_func *, struct nvkm_device *,
 		    int index, int nr, struct nvkm_fifo **);
 void gk104_fifo_runlist_insert(struct gk104_fifo *, struct gk104_fifo_chan *);
 void gk104_fifo_runlist_remove(struct gk104_fifo *, struct gk104_fifo_chan *);
 void gk104_fifo_runlist_update(struct gk104_fifo *, int runl);
+void gk104_fifo_engine_status(struct gk104_fifo *fifo, int engn,
+			      struct gk104_fifo_engine_status *status);
+void gk104_fifo_intr_bind(struct gk104_fifo *fifo);
+void gk104_fifo_intr_chsw(struct gk104_fifo *fifo);
+void gk104_fifo_intr_dropped_fault(struct gk104_fifo *fifo);
+void gk104_fifo_intr_pbdma_0(struct gk104_fifo *fifo, int unit);
+void gk104_fifo_intr_pbdma_1(struct gk104_fifo *fifo, int unit);
+void gk104_fifo_intr_runlist(struct gk104_fifo *fifo);
+void gk104_fifo_intr_engine(struct gk104_fifo *fifo);
+void *gk104_fifo_dtor(struct nvkm_fifo *base);
+int gk104_fifo_oneinit(struct nvkm_fifo *base);
+int gk104_fifo_info(struct nvkm_fifo *base, u64 mthd, u64 *data);
+void gk104_fifo_init(struct nvkm_fifo *base);
+void gk104_fifo_fini(struct nvkm_fifo *base);
+int gk104_fifo_class_new(struct nvkm_fifo *base, const struct nvkm_oclass *oclass,
+			 void *argv, u32 argc, struct nvkm_object **pobject);
+int gk104_fifo_class_get(struct nvkm_fifo *base, int index,
+			 struct nvkm_oclass *oclass);
+void gk104_fifo_uevent_fini(struct nvkm_fifo *fifo);
+void gk104_fifo_uevent_init(struct nvkm_fifo *fifo);
 
 extern const struct gk104_fifo_pbdma_func gk104_fifo_pbdma;
 int gk104_fifo_pbdma_nr(struct gk104_fifo *);
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
index 005f3e1729b9..2924381a6b3c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
@@ -24,7 +24,13 @@
 #include "changk104.h"
 #include "user.h"
 
+#include <core/client.h>
 #include <core/gpuobj.h>
+#include <subdev/bar.h>
+#include <subdev/fault.h>
+#include <subdev/top.h>
+#include <subdev/timer.h>
+#include <engine/sw.h>
 
 #include <nvif/class.h>
 
@@ -109,8 +115,463 @@ tu102_fifo = {
 	.cgrp_force = true,
 };
 
+static void
+tu102_fifo_recover_work(struct work_struct *w)
+{
+	struct gk104_fifo *fifo = container_of(w, typeof(*fifo), recover.work);
+	struct nvkm_device *device = fifo->base.engine.subdev.device;
+	struct nvkm_engine *engine;
+	unsigned long flags;
+	u32 engm, runm, todo;
+	int engn, runl;
+
+	spin_lock_irqsave(&fifo->base.lock, flags);
+	runm = fifo->recover.runm;
+	engm = fifo->recover.engm;
+	fifo->recover.engm = 0;
+	fifo->recover.runm = 0;
+	spin_unlock_irqrestore(&fifo->base.lock, flags);
+
+	nvkm_mask(device, 0x002630, runm, runm);
+
+	for (todo = engm; engn = __ffs(todo), todo; todo &= ~BIT(engn)) {
+		if ((engine = fifo->engine[engn].engine)) {
+			nvkm_subdev_fini(&engine->subdev, false);
+			WARN_ON(nvkm_subdev_init(&engine->subdev));
+		}
+	}
+
+	for (todo = runm; runl = __ffs(todo), todo; todo &= ~BIT(runl))
+		gk104_fifo_runlist_update(fifo, runl);
+
+	nvkm_wr32(device, 0x00262c, runm);
+	nvkm_mask(device, 0x002630, runm, 0x00000000);
+}
+
+static void tu102_fifo_recover_engn(struct gk104_fifo *fifo, int engn);
+
+static void
+tu102_fifo_recover_runl(struct gk104_fifo *fifo, int runl)
+{
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	const u32 runm = BIT(runl);
+
+	assert_spin_locked(&fifo->base.lock);
+	if (fifo->recover.runm & runm)
+		return;
+	fifo->recover.runm |= runm;
+
+	/* Block runlist to prevent channel assignment(s) from changing. */
+	nvkm_mask(device, 0x002630, runm, runm);
+
+	/* Schedule recovery. */
+	nvkm_warn(subdev, "runlist %d: scheduled for recovery\n", runl);
+	schedule_work(&fifo->recover.work);
+}
+
+static struct gk104_fifo_chan *
+tu102_fifo_recover_chid(struct gk104_fifo *fifo, int runl, int chid)
+{
+	struct gk104_fifo_chan *chan;
+	struct nvkm_fifo_cgrp *cgrp;
+
+	list_for_each_entry(chan, &fifo->runlist[runl].chan, head) {
+		if (chan->base.chid == chid) {
+			list_del_init(&chan->head);
+			return chan;
+		}
+	}
+
+	list_for_each_entry(cgrp, &fifo->runlist[runl].cgrp, head) {
+		if (cgrp->id == chid) {
+			chan = list_first_entry(&cgrp->chan, typeof(*chan), head);
+			list_del_init(&chan->head);
+			if (!--cgrp->chan_nr)
+				list_del_init(&cgrp->head);
+			return chan;
+		}
+	}
+
+	return NULL;
+}
+
+static void
+tu102_fifo_recover_chan(struct nvkm_fifo *base, int chid)
+{
+	struct gk104_fifo *fifo = gk104_fifo(base);
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	const u32 stat = nvkm_rd32(device, 0x800004 + (chid * 0x08));
+	const u32 runl = (stat & 0x000f0000) >> 16;
+	const bool used = (stat & 0x00000001);
+	unsigned long engn, engm = fifo->runlist[runl].engm;
+	struct gk104_fifo_chan *chan;
+
+	assert_spin_locked(&fifo->base.lock);
+	if (!used)
+		return;
+
+	/* Lookup SW state for channel, and mark it as dead. */
+	chan = tu102_fifo_recover_chid(fifo, runl, chid);
+	if (chan) {
+		chan->killed = true;
+		nvkm_fifo_kevent(&fifo->base, chid);
+	}
+
+	/* Disable channel. */
+	nvkm_wr32(device, 0x800004 + (chid * 0x08), stat | 0x00000800);
+	nvkm_warn(subdev, "channel %d: killed\n", chid);
+
+	/* Block channel assignments from changing during recovery. */
+	tu102_fifo_recover_runl(fifo, runl);
+
+	/* Schedule recovery for any engines the channel is on. */
+	for_each_set_bit(engn, &engm, fifo->engine_nr) {
+		struct gk104_fifo_engine_status status;
+
+		gk104_fifo_engine_status(fifo, engn, &status);
+		if (!status.chan || status.chan->id != chid)
+			continue;
+		tu102_fifo_recover_engn(fifo, engn);
+	}
+}
+
+static void
+tu102_fifo_recover_engn(struct gk104_fifo *fifo, int engn)
+{
+	struct nvkm_engine *engine = fifo->engine[engn].engine;
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	const u32 runl = fifo->engine[engn].runl;
+	const u32 engm = BIT(engn);
+	struct gk104_fifo_engine_status status;
+	int mmui = -1;
+
+	assert_spin_locked(&fifo->base.lock);
+	if (fifo->recover.engm & engm)
+		return;
+	fifo->recover.engm |= engm;
+
+	/* Block channel assignments from changing during recovery. */
+	tu102_fifo_recover_runl(fifo, runl);
+
+	/* Determine which channel (if any) is currently on the engine. */
+	gk104_fifo_engine_status(fifo, engn, &status);
+	if (status.chan) {
+		/* The channel is no longer viable, kill it. */
+		tu102_fifo_recover_chan(&fifo->base, status.chan->id);
+	}
+
+	/* Determine MMU fault ID for the engine, if we're not being
+	 * called from the fault handler already.
+	 */
+	if (!status.faulted && engine) {
+		mmui = nvkm_top_fault_id(device, engine->subdev.index);
+		if (mmui < 0) {
+			const struct nvkm_enum *en = fifo->func->fault.engine;
+
+			for (; en && en->name; en++) {
+				if (en->data2 == engine->subdev.index) {
+					mmui = en->value;
+					break;
+				}
+			}
+		}
+		WARN_ON(mmui < 0);
+	}
+
+	/* Trigger a MMU fault for the engine.
+	 *
+	 * No good idea why this is needed, but nvgpu does something similar,
+	 * and it makes recovery from CTXSW_TIMEOUT a lot more reliable.
+	 */
+	if (mmui >= 0) {
+		nvkm_wr32(device, 0x002a30 + (engn * 0x04), 0x00000100 | mmui);
+
+		/* Wait for fault to trigger. */
+		nvkm_msec(device, 2000,
+			gk104_fifo_engine_status(fifo, engn, &status);
+			if (status.faulted)
+				break;
+		);
+
+		/* Release MMU fault trigger, and ACK the fault. */
+		nvkm_wr32(device, 0x002a30 + (engn * 0x04), 0x00000000);
+		nvkm_wr32(device, 0x00259c, BIT(mmui));
+		nvkm_wr32(device, 0x002100, 0x10000000);
+	}
+
+	/* Schedule recovery. */
+	nvkm_warn(subdev, "engine %d: scheduled for recovery\n", engn);
+	schedule_work(&fifo->recover.work);
+}
+
+static void
+tu102_fifo_fault(struct nvkm_fifo *base, struct nvkm_fault_data *info)
+{
+	struct gk104_fifo *fifo = gk104_fifo(base);
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	const struct nvkm_enum *er, *ee, *ec, *ea;
+	struct nvkm_engine *engine = NULL;
+	struct nvkm_fifo_chan *chan;
+	unsigned long flags;
+	char ct[8] = "HUB/", en[16] = "";
+	int engn;
+
+	er = nvkm_enum_find(fifo->func->fault.reason, info->reason);
+	ee = nvkm_enum_find(fifo->func->fault.engine, info->engine);
+	if (info->hub) {
+		ec = nvkm_enum_find(fifo->func->fault.hubclient, info->client);
+	} else {
+		ec = nvkm_enum_find(fifo->func->fault.gpcclient, info->client);
+		snprintf(ct, sizeof(ct), "GPC%d/", info->gpc);
+	}
+	ea = nvkm_enum_find(fifo->func->fault.access, info->access);
+
+	if (ee && ee->data2) {
+		switch (ee->data2) {
+		case NVKM_SUBDEV_BAR:
+			nvkm_bar_bar1_reset(device);
+			break;
+		case NVKM_SUBDEV_INSTMEM:
+			nvkm_bar_bar2_reset(device);
+			break;
+		case NVKM_ENGINE_IFB:
+			nvkm_mask(device, 0x001718, 0x00000000, 0x00000000);
+			break;
+		default:
+			engine = nvkm_device_engine(device, ee->data2);
+			break;
+		}
+	}
+
+	if (ee == NULL) {
+		enum nvkm_devidx engidx = nvkm_top_fault(device, info->engine);
+
+		if (engidx < NVKM_SUBDEV_NR) {
+			const char *src = nvkm_subdev_name[engidx];
+			char *dst = en;
+
+			do {
+				*dst++ = toupper(*src++);
+			} while (*src);
+			engine = nvkm_device_engine(device, engidx);
+		}
+	} else {
+		snprintf(en, sizeof(en), "%s", ee->name);
+	}
+
+	spin_lock_irqsave(&fifo->base.lock, flags);
+	chan = nvkm_fifo_chan_inst_locked(&fifo->base, info->inst);
+
+	nvkm_error(subdev,
+		   "fault %02x [%s] at %016llx engine %02x [%s] client %02x "
+		   "[%s%s] reason %02x [%s] on channel %d [%010llx %s]\n",
+		   info->access, ea ? ea->name : "", info->addr,
+		   info->engine, ee ? ee->name : en,
+		   info->client, ct, ec ? ec->name : "",
+		   info->reason, er ? er->name : "", chan ? chan->chid : -1,
+		   info->inst, chan ? chan->object.client->name : "unknown");
+
+	/* Kill the channel that caused the fault. */
+	if (chan)
+		tu102_fifo_recover_chan(&fifo->base, chan->chid);
+
+	/* Channel recovery will probably have already done this for the
+	 * correct engine(s), but just in case we can't find the channel
+	 * information...
+	 */
+	for (engn = 0; engn < fifo->engine_nr && engine; engn++) {
+		if (fifo->engine[engn].engine == engine) {
+			tu102_fifo_recover_engn(fifo, engn);
+			break;
+		}
+	}
+
+	spin_unlock_irqrestore(&fifo->base.lock, flags);
+}
+
+static const struct nvkm_enum
+tu102_fifo_sched_reason[] = {
+	{ 0x0a, "CTXSW_TIMEOUT" },
+	{}
+};
+
+static void
+tu102_fifo_intr_sched_ctxsw(struct gk104_fifo *fifo)
+{
+	struct nvkm_device *device = fifo->base.engine.subdev.device;
+	unsigned long flags, engm = 0;
+	u32 engn;
+
+	/* We need to ACK the SCHED_ERROR here, and prevent it reasserting,
+	 * as MMU_FAULT cannot be triggered while it's pending.
+	 */
+	spin_lock_irqsave(&fifo->base.lock, flags);
+	nvkm_mask(device, 0x002140, 0x00000100, 0x00000000);
+	nvkm_wr32(device, 0x002100, 0x00000100);
+
+	for (engn = 0; engn < fifo->engine_nr; engn++) {
+		struct gk104_fifo_engine_status status;
+
+		gk104_fifo_engine_status(fifo, engn, &status);
+		if (!status.busy || !status.chsw)
+			continue;
+
+		engm |= BIT(engn);
+	}
+
+	for_each_set_bit(engn, &engm, fifo->engine_nr)
+		tu102_fifo_recover_engn(fifo, engn);
+
+	nvkm_mask(device, 0x002140, 0x00000100, 0x00000100);
+	spin_unlock_irqrestore(&fifo->base.lock, flags);
+}
+
+static void
+tu102_fifo_intr_sched(struct gk104_fifo *fifo)
+{
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	u32 intr = nvkm_rd32(device, 0x00254c);
+	u32 code = intr & 0x000000ff;
+	const struct nvkm_enum *en =
+		nvkm_enum_find(tu102_fifo_sched_reason, code);
+
+	nvkm_error(subdev, "SCHED_ERROR %02x [%s]\n", code, en ? en->name : "");
+
+	switch (code) {
+	case 0x0a:
+		tu102_fifo_intr_sched_ctxsw(fifo);
+		break;
+	default:
+		break;
+	}
+}
+
+static void
+tu102_fifo_intr(struct nvkm_fifo *base)
+{
+	struct gk104_fifo *fifo = gk104_fifo(base);
+	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
+	struct nvkm_device *device = subdev->device;
+	u32 mask = nvkm_rd32(device, 0x002140);
+	u32 stat = nvkm_rd32(device, 0x002100) & mask;
+
+	if (stat & 0x00000001) {
+		gk104_fifo_intr_bind(fifo);
+		nvkm_wr32(device, 0x002100, 0x00000001);
+		stat &= ~0x00000001;
+	}
+
+	if (stat & 0x00000010) {
+		nvkm_error(subdev, "PIO_ERROR\n");
+		nvkm_wr32(device, 0x002100, 0x00000010);
+		stat &= ~0x00000010;
+	}
+
+	if (stat & 0x00000100) {
+		tu102_fifo_intr_sched(fifo);
+		nvkm_wr32(device, 0x002100, 0x00000100);
+		stat &= ~0x00000100;
+	}
+
+	if (stat & 0x00010000) {
+		gk104_fifo_intr_chsw(fifo);
+		nvkm_wr32(device, 0x002100, 0x00010000);
+		stat &= ~0x00010000;
+	}
+
+	if (stat & 0x00800000) {
+		nvkm_error(subdev, "FB_FLUSH_TIMEOUT\n");
+		nvkm_wr32(device, 0x002100, 0x00800000);
+		stat &= ~0x00800000;
+	}
+
+	if (stat & 0x01000000) {
+		nvkm_error(subdev, "LB_ERROR\n");
+		nvkm_wr32(device, 0x002100, 0x01000000);
+		stat &= ~0x01000000;
+	}
+
+	if (stat & 0x08000000) {
+		gk104_fifo_intr_dropped_fault(fifo);
+		nvkm_wr32(device, 0x002100, 0x08000000);
+		stat &= ~0x08000000;
+	}
+
+	if (stat & 0x10000000) {
+		u32 mask = nvkm_rd32(device, 0x00259c);
+
+		while (mask) {
+			u32 unit = __ffs(mask);
+			fifo->func->intr.fault(&fifo->base, unit);
+			nvkm_wr32(device, 0x00259c, (1 << unit));
+			mask &= ~(1 << unit);
+		}
+		stat &= ~0x10000000;
+	}
+
+	if (stat & 0x20000000) {
+		u32 mask = nvkm_rd32(device, 0x0025a0);
+
+		while (mask) {
+			u32 unit = __ffs(mask);
+
+			gk104_fifo_intr_pbdma_0(fifo, unit);
+			gk104_fifo_intr_pbdma_1(fifo, unit);
+			nvkm_wr32(device, 0x0025a0, (1 << unit));
+			mask &= ~(1 << unit);
+		}
+		stat &= ~0x20000000;
+	}
+
+	if (stat & 0x40000000) {
+		gk104_fifo_intr_runlist(fifo);
+		stat &= ~0x40000000;
+	}
+
+	if (stat & 0x80000000) {
+		nvkm_wr32(device, 0x002100, 0x80000000);
+		gk104_fifo_intr_engine(fifo);
+		stat &= ~0x80000000;
+	}
+
+	if (stat) {
+		nvkm_error(subdev, "INTR %08x\n", stat);
+		nvkm_mask(device, 0x002140, stat, 0x00000000);
+		nvkm_wr32(device, 0x002100, stat);
+	}
+}
+
+static const struct nvkm_fifo_func
+tu102_fifo_ = {
+	.dtor = gk104_fifo_dtor,
+	.oneinit = gk104_fifo_oneinit,
+	.info = gk104_fifo_info,
+	.init = gk104_fifo_init,
+	.fini = gk104_fifo_fini,
+	.intr = tu102_fifo_intr,
+	.fault = tu102_fifo_fault,
+	.uevent_init = gk104_fifo_uevent_init,
+	.uevent_fini = gk104_fifo_uevent_fini,
+	.recover_chan = tu102_fifo_recover_chan,
+	.class_get = gk104_fifo_class_get,
+	.class_new = gk104_fifo_class_new,
+};
+
 int
 tu102_fifo_new(struct nvkm_device *device, int index, struct nvkm_fifo **pfifo)
 {
-	return gk104_fifo_new_(&tu102_fifo, device, index, 4096, pfifo);
+	struct gk104_fifo *fifo;
+
+	if (!(fifo = kzalloc(sizeof(*fifo), GFP_KERNEL)))
+		return -ENOMEM;
+	fifo->func = &tu102_fifo;
+	INIT_WORK(&fifo->recover.work, tu102_fifo_recover_work);
+	*pfifo = &fifo->base;
+
+	return nvkm_fifo_ctor(&tu102_fifo_, device, index, 4096, &fifo->base);
}
-- 
2.20.1
Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 4/5] drm/nouveau: FIFO interrupt fixes for Turing
Some of the low level FIFO interrupt status bits have changed for Turing.
Update the handling of these to match the hardware.

Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
 .../gpu/drm/nouveau/nvkm/engine/fifo/tu102.c | 78 +++----------------
 1 file changed, 9 insertions(+), 69 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
index 2924381a6b3c..f2f20a25182f 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
@@ -393,40 +393,21 @@ tu102_fifo_fault(struct nvkm_fifo *base, struct nvkm_fault_data *info)
 	spin_unlock_irqrestore(&fifo->base.lock, flags);
 }
 
-static const struct nvkm_enum
-tu102_fifo_sched_reason[] = {
-	{ 0x0a, "CTXSW_TIMEOUT" },
-	{}
-};
-
 static void
-tu102_fifo_intr_sched_ctxsw(struct gk104_fifo *fifo)
+tu102_fifo_intr_ctxsw_timeout(struct gk104_fifo *fifo)
 {
 	struct nvkm_device *device = fifo->base.engine.subdev.device;
-	unsigned long flags, engm = 0;
+	unsigned long flags, engm;
 	u32 engn;
 
-	/* We need to ACK the SCHED_ERROR here, and prevent it reasserting,
-	 * as MMU_FAULT cannot be triggered while it's pending.
-	 */
 	spin_lock_irqsave(&fifo->base.lock, flags);
-	nvkm_mask(device, 0x002140, 0x00000100, 0x00000000);
-	nvkm_wr32(device, 0x002100, 0x00000100);
 
-	for (engn = 0; engn < fifo->engine_nr; engn++) {
-		struct gk104_fifo_engine_status status;
+	engm = nvkm_rd32(device, 0x2a30);
+	nvkm_wr32(device, 0x2a30, engm);
 
-		gk104_fifo_engine_status(fifo, engn, &status);
-		if (!status.busy || !status.chsw)
-			continue;
-
-		engm |= BIT(engn);
-	}
-
-	for_each_set_bit(engn, &engm, fifo->engine_nr)
+	for_each_set_bit(engn, &engm, 32)
 		tu102_fifo_recover_engn(fifo, engn);
 
-	nvkm_mask(device, 0x002140, 0x00000100, 0x00000100);
 	spin_unlock_irqrestore(&fifo->base.lock, flags);
 }
 
@@ -437,18 +418,8 @@ tu102_fifo_intr_sched(struct gk104_fifo *fifo)
 	struct nvkm_device *device = subdev->device;
 	u32 intr = nvkm_rd32(device, 0x00254c);
 	u32 code = intr & 0x000000ff;
-	const struct nvkm_enum *en =
-		nvkm_enum_find(tu102_fifo_sched_reason, code);
-
-	nvkm_error(subdev, "SCHED_ERROR %02x [%s]\n", code, en ? en->name : "");
-
-	switch (code) {
-	case 0x0a:
-		tu102_fifo_intr_sched_ctxsw(fifo);
-		break;
-	default:
-		break;
-	}
+
+	nvkm_error(subdev, "SCHED_ERROR %02x\n", code);
 }
 
 static void
@@ -466,10 +437,9 @@ tu102_fifo_intr(struct nvkm_fifo *base)
 		stat &= ~0x00000001;
 	}
 
-	if (stat & 0x00000010) {
-		nvkm_error(subdev, "PIO_ERROR\n");
-		nvkm_wr32(device, 0x002100, 0x00000010);
-		stat &= ~0x00000010;
+	if (stat & 0x00000002) {
+		tu102_fifo_intr_ctxsw_timeout(fifo);
+		stat &= ~0x00000002;
 	}
 
 	if (stat & 0x00000100) {
@@ -484,36 +454,6 @@ tu102_fifo_intr(struct nvkm_fifo *base)
 		stat &= ~0x00010000;
 	}
 
-	if (stat & 0x00800000) {
-		nvkm_error(subdev, "FB_FLUSH_TIMEOUT\n");
-		nvkm_wr32(device, 0x002100, 0x00800000);
-		stat &= ~0x00800000;
-	}
-
-	if (stat & 0x01000000) {
-		nvkm_error(subdev, "LB_ERROR\n");
-		nvkm_wr32(device, 0x002100, 0x01000000);
-		stat &= ~0x01000000;
-	}
-
-	if (stat & 0x08000000) {
-		gk104_fifo_intr_dropped_fault(fifo);
-		nvkm_wr32(device, 0x002100, 0x08000000);
-		stat &= ~0x08000000;
-	}
-
-	if (stat & 0x10000000) {
-		u32 mask = nvkm_rd32(device, 0x00259c);
-
-		while (mask) {
-			u32 unit = __ffs(mask);
-			fifo->func->intr.fault(&fifo->base, unit);
-			nvkm_wr32(device, 0x00259c, (1 << unit));
-			mask &= ~(1 << unit);
-		}
-		stat &= ~0x10000000;
-	}
-
 	if (stat & 0x20000000) {
 		u32 mask = nvkm_rd32(device, 0x0025a0);
 
-- 
2.20.1
Alistair Popple
2020-Oct-30 02:36 UTC
[Nouveau] [PATCH 5/5] drm/nouveau: Turing channel preemption fix
Previous hardware allowed a MMU fault to be generated by software to
trigger a context switch for engine recovery. Turing has the capability to
preempt all work from a specific runlist processor, and has removed the
registers previously used for triggering MMU faults. Attempting to access
these non-existent registers results in further errors, so use the runlist
preemption register instead.

Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
 .../gpu/drm/nouveau/nvkm/engine/fifo/tu102.c | 43 +------------------
 1 file changed, 2 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
index f2f20a25182f..14e5b70e0255 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/tu102.c
@@ -144,7 +144,6 @@ tu102_fifo_recover_work(struct work_struct *w)
 	for (todo = runm; runl = __ffs(todo), todo; todo &= ~BIT(runl))
 		gk104_fifo_runlist_update(fifo, runl);
 
-	nvkm_wr32(device, 0x00262c, runm);
 	nvkm_mask(device, 0x002630, runm, 0x00000000);
 }
 
@@ -240,13 +239,11 @@ tu102_fifo_recover_chan(struct nvkm_fifo *base, int chid)
 static void
 tu102_fifo_recover_engn(struct gk104_fifo *fifo, int engn)
 {
-	struct nvkm_engine *engine = fifo->engine[engn].engine;
 	struct nvkm_subdev *subdev = &fifo->base.engine.subdev;
 	struct nvkm_device *device = subdev->device;
 	const u32 runl = fifo->engine[engn].runl;
 	const u32 engm = BIT(engn);
 	struct gk104_fifo_engine_status status;
-	int mmui = -1;
 
 	assert_spin_locked(&fifo->base.lock);
 	if (fifo->recover.engm & engm)
@@ -263,44 +260,8 @@ tu102_fifo_recover_engn(struct gk104_fifo *fifo, int engn)
 		tu102_fifo_recover_chan(&fifo->base, status.chan->id);
 	}
 
-	/* Determine MMU fault ID for the engine, if we're not being
-	 * called from the fault handler already.
-	 */
-	if (!status.faulted && engine) {
-		mmui = nvkm_top_fault_id(device, engine->subdev.index);
-		if (mmui < 0) {
-			const struct nvkm_enum *en = fifo->func->fault.engine;
-
-			for (; en && en->name; en++) {
-				if (en->data2 == engine->subdev.index) {
-					mmui = en->value;
-					break;
-				}
-			}
-		}
-		WARN_ON(mmui < 0);
-	}
-
-	/* Trigger a MMU fault for the engine.
-	 *
-	 * No good idea why this is needed, but nvgpu does something similar,
-	 * and it makes recovery from CTXSW_TIMEOUT a lot more reliable.
-	 */
-	if (mmui >= 0) {
-		nvkm_wr32(device, 0x002a30 + (engn * 0x04), 0x00000100 | mmui);
-
-		/* Wait for fault to trigger. */
-		nvkm_msec(device, 2000,
-			gk104_fifo_engine_status(fifo, engn, &status);
-			if (status.faulted)
-				break;
-		);
-
-		/* Release MMU fault trigger, and ACK the fault. */
-		nvkm_wr32(device, 0x002a30 + (engn * 0x04), 0x00000000);
-		nvkm_wr32(device, 0x00259c, BIT(mmui));
-		nvkm_wr32(device, 0x002100, 0x10000000);
-	}
+	/* Preempt the runlist */
+	nvkm_wr32(device, 0x2638, BIT(runl));
 
 	/* Schedule recovery. */
 	nvkm_warn(subdev, "engine %d: scheduled for recovery\n", engn);
-- 
2.20.1
Karol Herbst
2020-Oct-30 12:49 UTC
[Nouveau] [PATCH 0/5] Improve Robust Channel (RC) recovery for Turing
On Fri, Oct 30, 2020 at 3:37 AM Alistair Popple <apopple at nvidia.com> wrote:
>
> This is an initial series of patches to improve channel recovery on Turing GPUs
> with the goal of improving reliability enough to eventually enable SVM for
> Turing. It's likely follow-up patches will be required to fully address problems
> with less trivial workloads than what I have been able to test thus far.
>
> This series primarily addresses a number of hardware changes to interrupt layout
> and channel recovery for Turing, and for simple cases improves the handling and
> reliability of recovery.
>
> I have been testing trivial OpenCL workloads and with this series have been able
> to recover from while(1) style GPU loops and bad pointer dereferences on a
> Turing GPU. However, if there are less trivial tests available that have been
> known to cause problems with channel recovery in the past let me know and I'll
> start testing those as well.
>

Thanks for working on this!

I occasionally hit fatal errors when working on OpenCL with the official
CTS, but that's on Pascal. I could give your patches a go once I move my
main development machine over to Turing and report if I still trigger
problems nouveau isn't able to recover from. But yeah, generally the CTS
is able to cause bigger issues for me at least.

> Alistair Popple (5):
>   drm/nouveau: Fix MMU fault interrupts on Turing
>   drm/nouveau: Remove Turing interrupt hack
>   drm/nouveau: Move Turing specific FIFO functions
>   drm/nouveau: FIFO interrupt fixes for Turing
>   drm/nouveau: Turing channel preemption fix
>
>  .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.c  |  46 +--
>  .../gpu/drm/nouveau/nvkm/engine/fifo/gk104.h  |  32 ++
>  .../gpu/drm/nouveau/nvkm/engine/fifo/tu102.c  | 364 +++++++++++++++++-
>  .../gpu/drm/nouveau/nvkm/subdev/fault/tu102.c |  21 +-
>  drivers/gpu/drm/nouveau/nvkm/subdev/mc/base.c |   3 -
>  drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h |   1 -
>  .../gpu/drm/nouveau/nvkm/subdev/mc/tu102.c    | 113 +++++-
>  7 files changed, 529 insertions(+), 51 deletions(-)
>
> --
> 2.20.1
>
> _______________________________________________
> Nouveau mailing list
> Nouveau at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau
>