Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
Hi there,

Resending these now that they've had some more polish and testing, and I heard that Ben's vacation is over :-)

These patches work as a starting point for more explicit error mechanisms and better robustness. At the moment, when a job hangs or faults, nouveau doesn't quite know how to handle the situation, which often results in a hang. Recovering from some of these situations would require either completely resetting the gpu, or a complex path for recovering only the broken channel.

To start, I worked on support for letting userspace know what exactly happened; proper recovery would come later. The "error notifier" in the first patch is a simple shared buffer between kernel and userspace. Its error codes match nvgpu's. Alternatively, the status could be queried with an ioctl, but that would be considerably more heavyweight. I'd like to know whether the event mechanism is meant for these kinds of events at all (engines notify errors upwards to the drm layer). Another alternative would probably be to register the same buffer to all necessary engines separately in nvif method calls? Or register it to just one (e.g., fifo) and look that engine up when errors happen in others (e.g., gr)? And drm handles probably wouldn't fit there? Please comment on this; I wrote it before understanding the mthd mechanism.

Additionally, priority and timeout management for separate channels in flight on the gpu is added in two patches. Neither does exactly what its name says, but the effect is the same, and this is what nvgpu currently does. Those two patches call the fifo channel object's methods directly from userspace, so a hack is added in the nvif path to accept that. The objects are NEW'd from kernel space, so calling them from userspace isn't allowed, as it appears. How should this be managed in a clean way?

Also, since nouveau often hangs on errors, userspace hangs too (waiting on a fence). The final patch attempts to fix this in a couple of specific error paths by forcibly updating all fences to be finished. I'd like to hear how that should be handled properly - consider the patch just a proof of concept and a sample of what would be necessary.

I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate any high-level comments on whether I've understood things correctly, especially the event and nvif/method mechanisms (I use the latter from userspace with a hack constructed from the perfmon branch seen here earlier into nvidia's internal libdrm-equivalent). The fence-forcing thing is something that is necessary with the error notifiers (at least with our userspace, which waits really long or infinitely on fences). I'm working specifically on Tegra and don't know much about the desktop's userspace details, so I may be biased in some areas.

I'd be happy to write sample tests on e.g. libdrm for the new methods once the kernel patches get to a good shape, if that's required for accepting new features. I tested these to work as a proof of concept on Jetson TK1, and the code is adapted from the latest nvgpu.

The patches can also be found in http://github.com/sooda/nouveau and are based on a version of gnurou/staging.

Thanks!
Konsta (sooda in IRC)

Konsta Hölttä (5):
  notify channel errors to userspace
  don't verify route == owner in nvkm ioctl
  gk104: channel priority/timeslice support
  gk104: channel timeout detection
  HACK force fences updated on error

 drm/nouveau/include/nvif/class.h       |  20 ++++
 drm/nouveau/include/nvif/event.h       |  12 +++
 drm/nouveau/include/nvkm/engine/fifo.h |   5 +-
 drm/nouveau/nouveau_chan.c             |  95 +++++++++++++++++++
 drm/nouveau/nouveau_chan.h             |  10 ++
 drm/nouveau/nouveau_drm.c              |   1 +
 drm/nouveau/nouveau_fence.c            |  13 ++-
 drm/nouveau/nouveau_gem.c              |  29 ++++++
 drm/nouveau/nouveau_gem.h              |   2 +
 drm/nouveau/nvkm/core/ioctl.c          |   9 +-
 drm/nouveau/nvkm/engine/fifo/base.c    |  56 ++++++++++-
 drm/nouveau/nvkm/engine/fifo/gf100.c   |   2 +-
 drm/nouveau/nvkm/engine/fifo/gk104.c   | 166 ++++++++++++++++++++++++++++++---
 drm/nouveau/nvkm/engine/fifo/nv04.c    |   2 +-
 drm/nouveau/nvkm/engine/gr/gf100.c     |   5 +
 drm/nouveau/uapi/drm/nouveau_drm.h     |  13 +++
 16 files changed, 415 insertions(+), 25 deletions(-)

--
2.1.4
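As a rough sketch (not part of the series) of how userspace might consume the shared error notifier described above: the 16-byte record layout below follows the four nouveau_bo_wr32() writes in patch 1 (timestamp, error code, status word); how the application maps the buffer and obtains the "notifier" pointer is assumed to be its own setup.

    #include <stdint.h>

    struct chan_error_notifier {
        uint32_t timestamp_lo; /* ktime nanoseconds, low word */
        uint32_t timestamp_hi; /* ktime nanoseconds, high word */
        uint32_t error;        /* NOUVEAU_GEM_CHANNEL_* error code */
        uint32_t status;       /* 0xffffffff once an error was written */
    };

    /* lightweight check, e.g. after a submit; returns nonzero if broken */
    static int chan_check_error(volatile struct chan_error_notifier *notifier,
                                uint32_t *error)
    {
        if (notifier->status != 0xffffffff)
            return 0; /* no error recorded */
        *error = notifier->error;
        return 1;
    }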
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 1/5] notify channel errors to userspace
Let userspace know about some specific types of errors via a shared buffer registered with a new ioctl, NOUVEAU_GEM_SET_ERROR_NOTIFIER. Once set, the notifier buffer is used until the channel is destroyed.

The exceptional per-channel error conditions are signaled upwards from the nvkm layer via the event/notify mechanism and written to the buffer, which holds the latest error that occurred, readable from userspace. Some error situations render a channel completely stuck, which may require discarding and reinitializing it, or even the gpu, completely. Once a specific type of error happens, no others are expected until recovery.

Passing the error to userspace in a shared buffer enables the graphics driver to periodically check, in a lightweight way, whether anything went wrong (typically after a submit); e.g., the GL robustness extension requires detecting that the context has died.

The notifier currently signals idle timeout, sw notify, mmu fault and illegal pbdma error conditions.

Signed-off-by: Konsta Hölttä <kholtta at nvidia.com>
---
 drm/nouveau/include/nvif/class.h       |  2 +
 drm/nouveau/include/nvif/event.h       | 11 ++++
 drm/nouveau/include/nvkm/engine/fifo.h |  3 ++
 drm/nouveau/nouveau_chan.c             | 95 ++++++++++++++++++++++++++++++++++
 drm/nouveau/nouveau_chan.h             | 10 ++++
 drm/nouveau/nouveau_drm.c              |  1 +
 drm/nouveau/nouveau_gem.c              | 29 +++++++++++
 drm/nouveau/nouveau_gem.h              |  2 +
 drm/nouveau/nvkm/engine/fifo/base.c    | 53 +++++++++++++++++++
 drm/nouveau/nvkm/engine/fifo/gk104.c   | 13 +++++
 drm/nouveau/nvkm/engine/gr/gf100.c     |  5 ++
 drm/nouveau/uapi/drm/nouveau_drm.h     | 13 +++++
 12 files changed, 237 insertions(+)

diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h
index d90e207..28176e6 100644
--- a/drm/nouveau/include/nvif/class.h
+++ b/drm/nouveau/include/nvif/class.h
@@ -410,16 +410,18 @@ struct kepler_channel_gpfifo_a_v0 {
 	__u8  engine;
 	__u16 chid;
 	__u8  pad04[4];
 	__u32 pushbuf;
 	__u32 ilength;
 	__u64 ioffset;
 };
 
+#define CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT 0x01
+
 /*******************************************************************************
  * legacy display
  ******************************************************************************/
 
 #define NV04_DISP_NTFY_VBLANK 0x00
 #define NV04_DISP_NTFY_CONN   0x01
 
 struct nv04_disp_mthd_v0 {
diff --git a/drm/nouveau/include/nvif/event.h b/drm/nouveau/include/nvif/event.h
index 2176449..d148b85 100644
--- a/drm/nouveau/include/nvif/event.h
+++ b/drm/nouveau/include/nvif/event.h
@@ -54,9 +54,20 @@ struct nvif_notify_conn_rep_v0 {
 struct nvif_notify_uevent_req {
 	/* nvif_notify_req ... */
 };
 
 struct nvif_notify_uevent_rep {
 	/* nvif_notify_rep ... */
 };
 
+struct nvif_notify_eevent_req {
+	/* nvif_notify_req ... */
+	u32 chid;
+};
+
+struct nvif_notify_eevent_rep {
+	/* nvif_notify_rep ... */
+	u32 error;
+	u32 chid;
+};
+
 #endif
diff --git a/drm/nouveau/include/nvkm/engine/fifo.h b/drm/nouveau/include/nvkm/engine/fifo.h
index 9100b80..cbca477 100644
--- a/drm/nouveau/include/nvkm/engine/fifo.h
+++ b/drm/nouveau/include/nvkm/engine/fifo.h
@@ -67,16 +67,17 @@ struct nvkm_fifo_base {
 #include <core/engine.h>
 #include <core/event.h>
 
 struct nvkm_fifo {
 	struct nvkm_engine base;
 
 	struct nvkm_event cevent; /* channel creation event */
 	struct nvkm_event uevent; /* async user trigger */
+	struct nvkm_event eevent; /* error notifier */
 
 	struct nvkm_object **channel;
 	spinlock_t lock;
 	u16 min;
 	u16 max;
 
 	int  (*chid)(struct nvkm_fifo *, struct nvkm_object *);
 	void (*pause)(struct nvkm_fifo *, unsigned long *);
@@ -118,11 +119,13 @@ extern struct nvkm_oclass *gk20a_fifo_oclass;
 extern struct nvkm_oclass *gk208_fifo_oclass;
 extern struct nvkm_oclass *gm204_fifo_oclass;
 extern struct nvkm_oclass *gm20b_fifo_oclass;
 
 int  nvkm_fifo_uevent_ctor(struct nvkm_object *, void *, u32,
			    struct nvkm_notify *);
 void nvkm_fifo_uevent(struct nvkm_fifo *);
+void nvkm_fifo_eevent(struct nvkm_fifo *, u32 chid, u32 error);
+
 void nv04_fifo_intr(struct nvkm_subdev *);
 int  nv04_fifo_context_attach(struct nvkm_object *, struct nvkm_object *);
 
 #endif
diff --git a/drm/nouveau/nouveau_chan.c b/drm/nouveau/nouveau_chan.c
index 0589bab..ff37f5f 100644
--- a/drm/nouveau/nouveau_chan.c
+++ b/drm/nouveau/nouveau_chan.c
@@ -19,26 +19,29 @@
  * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
  * OTHER DEALINGS IN THE SOFTWARE.
  *
  * Authors: Ben Skeggs
  */
 
 #include <nvif/os.h>
 #include <nvif/class.h>
+#include <nvif/event.h>
+#include <nvif/notify.h>
 
 /*XXX*/
 #include <core/client.h>
 
 #include "nouveau_drm.h"
 #include "nouveau_dma.h"
 #include "nouveau_bo.h"
 #include "nouveau_chan.h"
 #include "nouveau_fence.h"
 #include "nouveau_abi16.h"
+#include "nouveau_gem.h"
 
 MODULE_PARM_DESC(vram_pushbuf, "Create DMA push buffers in VRAM");
 int nouveau_vram_pushbuf;
 module_param_named(vram_pushbuf, nouveau_vram_pushbuf, int, 0400);
 
 int
 nouveau_channel_idle(struct nouveau_channel *chan)
 {
@@ -72,16 +75,22 @@ nouveau_channel_del(struct nouveau_channel **pchan)
 		nvif_object_fini(&chan->vram);
 		nvif_object_ref(NULL, &chan->object);
 		nvif_object_fini(&chan->push.ctxdma);
 		nouveau_bo_vma_del(chan->push.buffer, &chan->push.vma);
 		nouveau_bo_unmap(chan->push.buffer);
 		if (chan->push.buffer && chan->push.buffer->pin_refcnt)
 			nouveau_bo_unpin(chan->push.buffer);
 		nouveau_bo_ref(NULL, &chan->push.buffer);
+		if (chan->error_notifier.buffer) {
+			nouveau_bo_unmap(chan->error_notifier.buffer);
+			drm_gem_object_unreference_unlocked(
+					&chan->error_notifier.buffer->gem);
+		}
+		nvif_notify_fini(&chan->error_notifier.notify);
 		nvif_device_ref(NULL, &chan->device);
 		kfree(chan);
 	}
 	*pchan = NULL;
 }
 
 static int
 nouveau_channel_prep(struct nouveau_drm *drm, struct nvif_device *device,
@@ -272,16 +281,89 @@ nouveau_channel_dma(struct nouveau_drm *drm, struct nvif_device *device,
 			return ret;
 		}
 	} while (ret && *oclass);
 
 	nouveau_channel_del(pchan);
 	return ret;
 }
 
+static void
+nouveau_channel_set_error_notifier(struct nouveau_channel *chan, u32 error)
+{
+	struct nouveau_cli *cli = (void *)nvif_client(chan->object);
+	struct nouveau_bo *nvbo = chan->error_notifier.buffer;
+	u32 off = chan->error_notifier.offset / sizeof(u32);
+	u64 nsec;
+
+	if (!nvbo)
+		return;
+
+	nsec = ktime_to_ns(ktime_get_real());
+
+	nouveau_bo_wr32(nvbo, off + 0, nsec);
+	nouveau_bo_wr32(nvbo, off + 1, nsec >> 32);
+	nouveau_bo_wr32(nvbo, off + 2, error);
+	nouveau_bo_wr32(nvbo, off + 3, 0xffffffff);
+
+	NV_PRINTK(error, cli, "error notifier set to %d for ch %d\n",
+		  error, chan->chid);
+}
+
+int
+nouveau_channel_init_error_notifier(struct nouveau_channel *chan,
+				    struct drm_file *file, u32 handle, u32 offset)
+{
+	struct drm_gem_object *gem;
+	struct nouveau_bo *nvbo;
+	int ret;
+
+	gem = drm_gem_object_lookup(chan->drm->dev, file, handle);
+	if (!gem)
+		return -ENOENT;
+
+	nvbo = nouveau_gem_object(gem);
+
+	ret = nouveau_bo_map(nvbo);
+	if (ret) {
+		drm_gem_object_unreference_unlocked(gem);
+		return ret;
+	}
+
+	/* drop old ref on reassign */
+	if (chan->error_notifier.buffer) {
+		nouveau_bo_unmap(chan->error_notifier.buffer);
+		drm_gem_object_unreference_unlocked(
+				&chan->error_notifier.buffer->gem);
+	}
+
+	/* the gem reference held until channel deletion keeps this alive */
+	chan->error_notifier.buffer = nvbo;
+	chan->error_notifier.offset = offset;
+
+	/* initialize to a no-error status */
+	nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 0, 0);
+	nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 1, 0);
+	nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 2, 0);
+	nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 3, 0);
+
+	return 0;
+}
+
+static int
+nouveau_chan_eevent_handler(struct nvif_notify *notify)
+{
+	struct nouveau_channel *chan =
+		container_of(notify, typeof(*chan), error_notifier.notify);
+	const struct nvif_notify_eevent_rep *rep = notify->data;
+
+	WARN_ON(rep->chid != chan->chid);
+	nouveau_channel_set_error_notifier(chan, rep->error);
+	return NVIF_NOTIFY_KEEP;
+}
+
 static int
 nouveau_channel_init(struct nouveau_channel *chan, u32 vram, u32 gart)
 {
 	struct nvif_device *device = chan->device;
 	struct nouveau_cli *cli = (void *)nvif_client(&device->base);
 	struct nvkm_mmu *mmu = nvxx_mmu(device);
 	struct nvkm_sw_chan *swch;
 	struct nv_dma_v0 args = {};
@@ -381,16 +463,29 @@ nouveau_channel_init(struct nouveau_channel *chan, u32 vram, u32 gart)
 		if (ret)
 			return ret;
 
 		BEGIN_NV04(chan, NvSubSw, 0x0000, 1);
 		OUT_RING  (chan, chan->nvsw.handle);
 		FIRE_RING (chan);
 	}
 
+	/* error code events on why we're stuck/broken */
+	ret = nvif_notify_init(chan->object, NULL,
+			nouveau_chan_eevent_handler, false,
+			CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT,
+			&(struct nvif_notify_eevent_req) { .chid = chan->chid },
+			sizeof(struct nvif_notify_eevent_req),
+			sizeof(struct nvif_notify_eevent_rep),
+			&chan->error_notifier.notify);
+	if (WARN_ON(ret))
+		return ret;
+	/* fini() does one put() */
+	nvif_notify_get(&chan->error_notifier.notify);
+
 	/* initialise synchronisation */
 	return nouveau_fence(chan->drm)->context_new(chan);
 }
 
 int
 nouveau_channel_new(struct nouveau_drm *drm, struct nvif_device *device,
		    u32 handle, u32 arg0, u32 arg1,
		    struct nouveau_channel **pchan)
diff --git a/drm/nouveau/nouveau_chan.h b/drm/nouveau/nouveau_chan.h
index 8b3640f..730ddba 100644
--- a/drm/nouveau/nouveau_chan.h
+++ b/drm/nouveau/nouveau_chan.h
@@ -33,20 +33,30 @@ struct nouveau_channel {
 		int ib_free;
 		int ib_put;
 	} dma;
 	u32 user_get_hi;
 	u32 user_get;
 	u32 user_put;
 
 	struct nvif_object *object;
+
+	struct {
+		struct nouveau_bo *buffer;
+		u32 offset;
+		struct nvif_notify notify;
+	} error_notifier;
 };
 
 int  nouveau_channel_new(struct nouveau_drm *, struct nvif_device *,
			 u32 handle, u32 arg0, u32 arg1,
			 struct nouveau_channel **);
 void nouveau_channel_del(struct nouveau_channel **);
 int  nouveau_channel_idle(struct nouveau_channel *);
+struct drm_file;
+int  nouveau_channel_init_error_notifier(struct nouveau_channel *chan,
+					 struct drm_file *file,
+					 u32 handle, u32 offset);
 
 extern int nouveau_vram_pushbuf;
 
 #endif
diff --git a/drm/nouveau/nouveau_drm.c b/drm/nouveau/nouveau_drm.c
index 7d1297e..e3e6441 100644
--- a/drm/nouveau/nouveau_drm.c
+++ b/drm/nouveau/nouveau_drm.c
@@ -898,16 +898,17 @@ nouveau_ioctls[] = {
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_NEW, nouveau_gem_ioctl_new, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF, nouveau_gem_ioctl_pushbuf, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_PREP, nouveau_gem_ioctl_cpu_prep, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_FINI, nouveau_gem_ioctl_cpu_fini, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_INFO, nouveau_gem_ioctl_info, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	/* Staging ioctls */
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_SET_TILING, nouveau_gem_ioctl_set_tiling, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF_2, nouveau_gem_ioctl_pushbuf_2, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
+	DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_SET_ERROR_NOTIFIER, nouveau_gem_ioctl_set_error_notifier, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW),
 };
 
 long
 nouveau_drm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct drm_file *filp = file->private_data;
 	struct drm_device *dev = filp->minor->dev;
 	long ret;
diff --git a/drm/nouveau/nouveau_gem.c b/drm/nouveau/nouveau_gem.c
index 884ab83..f8373cb 100644
--- a/drm/nouveau/nouveau_gem.c
+++ b/drm/nouveau/nouveau_gem.c
@@ -1032,16 +1032,45 @@ out_next:
			(chan->push.vma.offset + ((chan->dma.cur + 2) << 2));
 		req->suffix1 = 0x00000000;
 	}
 
 	return nouveau_abi16_put(abi16, ret);
 }
 
 int
+nouveau_gem_ioctl_set_error_notifier(struct drm_device *dev, void *data,
+				     struct drm_file *file_priv)
+{
+	struct nouveau_abi16 *abi16 = nouveau_abi16_get(file_priv, dev);
+	struct drm_nouveau_gem_set_error_notifier *req = data;
+	struct nouveau_abi16_chan *abi16_ch;
+	struct nouveau_channel *chan = NULL;
+	int ret = 0;
+
+	if (unlikely(!abi16))
+		return -ENOMEM;
+
+	list_for_each_entry(abi16_ch, &abi16->channels, head) {
+		if (abi16_ch->chan->object->handle
+				== (NVDRM_CHAN | req->channel)) {
+			chan = abi16_ch->chan;
+			break;
+		}
+	}
+	if (!chan)
+		return nouveau_abi16_put(abi16, -ENOENT);
+
+	ret = nouveau_channel_init_error_notifier(chan, file_priv,
+						  req->buffer, req->offset);
+
+	return nouveau_abi16_put(abi16, ret);
+}
+
+int
 nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
			   struct drm_file *file_priv)
 {
 	struct drm_nouveau_gem_cpu_prep *req = data;
 	struct drm_gem_object *gem;
 	struct nouveau_bo *nvbo;
 	bool no_wait = !!(req->flags & NOUVEAU_GEM_CPU_PREP_NOWAIT);
 	bool write = !!(req->flags & NOUVEAU_GEM_CPU_PREP_WRITE);
diff --git a/drm/nouveau/nouveau_gem.h b/drm/nouveau/nouveau_gem.h
index 9e4323f..423ea5a 100644
--- a/drm/nouveau/nouveau_gem.h
+++ b/drm/nouveau/nouveau_gem.h
@@ -26,16 +26,18 @@ extern void nouveau_gem_object_close(struct drm_gem_object *,
 extern int nouveau_gem_ioctl_set_tiling(struct drm_device *, void *,
					struct drm_file *);
 extern int nouveau_gem_ioctl_new(struct drm_device *, void *,
				 struct drm_file *);
 extern int nouveau_gem_ioctl_pushbuf(struct drm_device *, void *,
				     struct drm_file *);
 extern int nouveau_gem_ioctl_pushbuf_2(struct drm_device *, void *,
				       struct drm_file *);
+extern int nouveau_gem_ioctl_set_error_notifier(struct drm_device *, void *,
+						struct drm_file *);
 extern int nouveau_gem_ioctl_cpu_prep(struct drm_device *, void *,
				      struct drm_file *);
 extern int nouveau_gem_ioctl_cpu_fini(struct drm_device *, void *,
				      struct drm_file *);
 extern int nouveau_gem_ioctl_info(struct drm_device *, void *,
				  struct drm_file *);
 
 extern int nouveau_gem_prime_pin(struct drm_gem_object *);
diff --git a/drm/nouveau/nvkm/engine/fifo/base.c b/drm/nouveau/nvkm/engine/fifo/base.c
index fa223f8..a5dc6c9 100644
--- a/drm/nouveau/nvkm/engine/fifo/base.c
+++ b/drm/nouveau/nvkm/engine/fifo/base.c
@@ -191,28 +191,76 @@ nvkm_fifo_uevent_ctor(struct nvkm_object *object, void *data, u32 size,
 
 void
 nvkm_fifo_uevent(struct nvkm_fifo *fifo)
 {
 	struct nvif_notify_uevent_rep rep = {
 	};
 	nvkm_event_send(&fifo->uevent, 1, 0, &rep, sizeof(rep));
 }
 
+static int
+nvkm_fifo_eevent_ctor(struct nvkm_object *object, void *data, u32 size,
+		      struct nvkm_notify *notify)
+{
+	union {
+		struct nvif_notify_eevent_req req;
+	} *req = data;
+	int ret;
+
+	if (nvif_unvers(req->req)) {
+		notify->size = sizeof(struct nvif_notify_eevent_rep);
+		notify->types = 1;
+		notify->index = req->req.chid;
+	}
+
+	return ret;
+}
+
+static void
+gk104_fifo_eevent_init(struct nvkm_event *event, int type, int index)
+{
+}
+
+static void
+gk104_fifo_eevent_fini(struct nvkm_event *event, int type, int index)
+{
+}
+
+static const struct nvkm_event_func
+nvkm_fifo_eevent_func = {
+	.ctor = nvkm_fifo_eevent_ctor,
+	.init = gk104_fifo_eevent_init,
+	.fini = gk104_fifo_eevent_fini,
+};
+
+void
+nvkm_fifo_eevent(struct nvkm_fifo *fifo, u32 chid, u32 error)
+{
+	struct nvif_notify_eevent_rep rep = {
+		.error = error,
+		.chid = chid
+	};
+	nvkm_event_send(&fifo->eevent, 1, chid, &rep, sizeof(rep));
+}
+
 int
 _nvkm_fifo_channel_ntfy(struct nvkm_object *object, u32 type,
			struct nvkm_event **event)
 {
 	struct nvkm_fifo *fifo = (void *)object->engine;
 	switch (type) {
 	case G82_CHANNEL_DMA_V0_NTFY_UEVENT:
 		if (nv_mclass(object) >= G82_CHANNEL_DMA) {
 			*event = &fifo->uevent;
 			return 0;
 		}
 		break;
+	case CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT:
+		*event = &fifo->eevent;
+		return 0;
 	default:
 		break;
 	}
 	return -EINVAL;
 }
 
 static int
 nvkm_fifo_chid(struct nvkm_fifo *priv, struct nvkm_object *object)
@@ -242,16 +290,17 @@ nvkm_client_name_for_fifo_chid(struct nvkm_fifo *fifo, u32 chid)
 	return nvkm_client_name(chan);
 }
 
 void
 nvkm_fifo_destroy(struct nvkm_fifo *priv)
 {
 	kfree(priv->channel);
+	nvkm_event_fini(&priv->eevent);
 	nvkm_event_fini(&priv->uevent);
 	nvkm_event_fini(&priv->cevent);
 	nvkm_engine_destroy(&priv->base);
 }
 
 int
 nvkm_fifo_create_(struct nvkm_object *parent,
		  struct nvkm_object *engine,
		  struct nvkm_oclass *oclass,
@@ -271,12 +320,16 @@ nvkm_fifo_create_(struct nvkm_object *parent, struct nvkm_object *engine,
 	priv->channel = kzalloc(sizeof(*priv->channel) * (max + 1), GFP_KERNEL);
 	if (!priv->channel)
 		return -ENOMEM;
 
 	ret = nvkm_event_init(&nvkm_fifo_event_func, 1, 1, &priv->cevent);
 	if (ret)
 		return ret;
 
+	ret = nvkm_event_init(&nvkm_fifo_eevent_func, 1, max + 1, &priv->eevent);
+	if (ret)
+		return ret;
+
 	priv->chid = nvkm_fifo_chid;
 	spin_lock_init(&priv->lock);
 	return 0;
 }
diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c
index 52c22b0..659c05f 100644
--- a/drm/nouveau/nvkm/engine/fifo/gk104.c
+++ b/drm/nouveau/nvkm/engine/fifo/gk104.c
@@ -30,16 +30,18 @@
 #include <subdev/bar.h>
 #include <subdev/fb.h>
 #include <subdev/mmu.h>
 #include <subdev/timer.h>
 
 #include <nvif/class.h>
 #include <nvif/unpack.h>
 
+#include "nouveau_drm.h"
+
 #define _(a,b) { (a), ((1ULL << (a)) | (b)) }
 static const struct {
 	u64 subdev;
 	u64 mask;
 } fifo_engine[] = {
 	_(NVDEV_ENGINE_GR      , (1ULL << NVDEV_ENGINE_SW) |
				 (1ULL << NVDEV_ENGINE_CE2)),
 	_(NVDEV_ENGINE_MSPDEC  , 0),
@@ -567,16 +569,20 @@ gk104_fifo_intr_sched_ctxsw(struct gk104_fifo_priv *priv)
 		u32 chid = load ? next : prev;
 		(void)save;
 
 		if (busy && chsw) {
 			if (!(chan = (void *)priv->base.channel[chid]))
 				continue;
 			if (!(engine = gk104_fifo_engine(priv, engn)))
 				continue;
+
+			nvkm_fifo_eevent(&priv->base, chid,
+				NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT);
+
 			gk104_fifo_recover(priv, engine, chan);
 		}
 	}
 }
 
 static void
 gk104_fifo_intr_sched(struct gk104_fifo_priv *priv)
 {
@@ -783,16 +789,19 @@ gk104_fifo_intr_fault(struct gk104_fifo_priv *priv, int unit)
		 ec ? ec->name : ecunk, (u64)inst << 12,
		 nvkm_client_name(engctx));
 
 	object = engctx;
 	while (object) {
 		switch (nv_mclass(object)) {
 		case KEPLER_CHANNEL_GPFIFO_A:
 		case MAXWELL_CHANNEL_GPFIFO_A:
+			nvkm_fifo_eevent(&priv->base,
+				((struct nvkm_fifo_chan*)object)->chid,
+				NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT);
 			gk104_fifo_recover(priv, engine, (void *)object);
 			break;
 		}
 		object = object->parent;
 	}
 
 	nvkm_engctx_put(engctx);
 }
@@ -853,16 +862,18 @@ gk104_fifo_intr_pbdma_0(struct gk104_fifo_priv *priv, int unit)
 		nv_error(priv, "PBDMA%d:", unit);
 		nvkm_bitfield_print(gk104_fifo_pbdma_intr_0, show);
 		pr_cont("\n");
 		nv_error(priv,
			 "PBDMA%d: ch %d [%s] subc %d mthd 0x%04x data 0x%08x\n",
			 unit, chid,
			 nvkm_client_name_for_fifo_chid(&priv->base, chid),
			 subc, mthd, data);
+		nvkm_fifo_eevent(&priv->base, chid,
+				NOUVEAU_GEM_CHANNEL_PBDMA_ERROR);
 	}
 
 	nv_wr32(priv, 0x040108 + (unit * 0x2000), stat);
 }
 
 static const struct nvkm_bitfield gk104_fifo_pbdma_intr_1[] = {
 	{ 0x00000001, "HCE_RE_ILLEGAL_OP" },
 	{ 0x00000002, "HCE_RE_ALIGNB" },
@@ -881,16 +892,18 @@ gk104_fifo_intr_pbdma_1(struct gk104_fifo_priv *priv, int unit)
 	if (stat) {
 		nv_error(priv, "PBDMA%d:", unit);
 		nvkm_bitfield_print(gk104_fifo_pbdma_intr_1, stat);
 		pr_cont("\n");
 		nv_error(priv, "PBDMA%d: ch %d %08x %08x\n", unit, chid,
			 nv_rd32(priv, 0x040150 + (unit * 0x2000)),
			 nv_rd32(priv, 0x040154 + (unit * 0x2000)));
+		nvkm_fifo_eevent(&priv->base, chid,
+				NOUVEAU_GEM_CHANNEL_PBDMA_ERROR);
 	}
 
 	nv_wr32(priv, 0x040148 + (unit * 0x2000), stat);
 }
 
 static void
 gk104_fifo_intr_runlist(struct gk104_fifo_priv *priv)
 {
diff --git a/drm/nouveau/nvkm/engine/gr/gf100.c b/drm/nouveau/nvkm/engine/gr/gf100.c
index e7c3e9e..8482078 100644
--- a/drm/nouveau/nvkm/engine/gr/gf100.c
+++ b/drm/nouveau/nvkm/engine/gr/gf100.c
@@ -32,16 +32,18 @@
 #include <engine/fifo.h>
 
 #include <subdev/fb.h>
 #include <subdev/mc.h>
 #include <subdev/timer.h>
 
 #include <nvif/class.h>
 #include <nvif/unpack.h>
 
+#include "nouveau_drm.h"
+
 /*******************************************************************************
  * Zero Bandwidth Clear
  ******************************************************************************/
 
 static void
 gf100_gr_zbc_clear_color(struct gf100_gr_priv *priv, int zbc)
 {
 	if (priv->zbc_color[zbc].format) {
@@ -1169,34 +1171,37 @@ gf100_gr_intr(struct nvkm_subdev *subdev)
 	if (stat & 0x00000020) {
 		nv_error(priv,
			 "ILLEGAL_CLASS ch %d [0x%010llx %s] subc %d class 0x%04x mthd 0x%04x data 0x%08x\n",
			 chid, inst << 12, nvkm_client_name(engctx), subc,
			 class, mthd, data);
 		nv_wr32(priv, 0x400100, 0x00000020);
 		stat &= ~0x00000020;
+		nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY);
 	}
 
 	if (stat & 0x00100000) {
 		nv_error(priv, "DATA_ERROR [");
 		nvkm_enum_print(nv50_data_error_names, code);
 		pr_cont("] ch %d [0x%010llx %s] subc %d class 0x%04x mthd 0x%04x data 0x%08x\n",
			chid, inst << 12, nvkm_client_name(engctx), subc,
			class, mthd, data);
 		nv_wr32(priv, 0x400100, 0x00100000);
 		stat &= ~0x00100000;
+		nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY);
 	}
 
 	if (stat & 0x00200000) {
 		nv_error(priv, "TRAP ch %d [0x%010llx %s]\n", chid, inst << 12,
			 nvkm_client_name(engctx));
 		gf100_gr_trap_intr(priv);
 		nv_wr32(priv, 0x400100, 0x00200000);
 		stat &= ~0x00200000;
+		nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY);
 	}
 
 	if (stat & 0x00080000) {
 		gf100_gr_ctxctl_isr(priv);
 		nv_wr32(priv, 0x400100, 0x00080000);
 		stat &= ~0x00080000;
 	}
diff --git a/drm/nouveau/uapi/drm/nouveau_drm.h b/drm/nouveau/uapi/drm/nouveau_drm.h
index 1331c28..c9c296a 100644
--- a/drm/nouveau/uapi/drm/nouveau_drm.h
+++ b/drm/nouveau/uapi/drm/nouveau_drm.h
@@ -143,16 +143,27 @@ struct drm_nouveau_gem_cpu_prep {
 	uint32_t handle;
 	uint32_t flags;
 };
 
 struct drm_nouveau_gem_cpu_fini {
 	uint32_t handle;
 };
 
+#define NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT 8
+#define NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY      13
+#define NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT  31
+#define NOUVEAU_GEM_CHANNEL_PBDMA_ERROR             32
+struct drm_nouveau_gem_set_error_notifier {
+	uint32_t channel;
+	uint32_t buffer;
+	uint32_t offset; /* bytes, u32-aligned */
+};
+
+
 #define DRM_NOUVEAU_GETPARAM           0x00 /* deprecated */
 #define DRM_NOUVEAU_SETPARAM           0x01 /* deprecated */
 #define DRM_NOUVEAU_CHANNEL_ALLOC      0x02 /* deprecated */
 #define DRM_NOUVEAU_CHANNEL_FREE       0x03 /* deprecated */
 #define DRM_NOUVEAU_GROBJ_ALLOC        0x04 /* deprecated */
 #define DRM_NOUVEAU_NOTIFIEROBJ_ALLOC  0x05 /* deprecated */
 #define DRM_NOUVEAU_GPUOBJ_FREE        0x06 /* deprecated */
 #define DRM_NOUVEAU_NVIF               0x07
@@ -160,19 +171,21 @@ struct drm_nouveau_gem_cpu_fini {
 #define DRM_NOUVEAU_GEM_PUSHBUF        0x41
 #define DRM_NOUVEAU_GEM_CPU_PREP       0x42
 #define DRM_NOUVEAU_GEM_CPU_FINI       0x43
 #define DRM_NOUVEAU_GEM_INFO           0x44
 
 /* range 0x98..DRM_COMMAND_END (8 entries) is reserved for staging, unstable
  * ioctls */
 #define DRM_NOUVEAU_STAGING_IOCTL      0x58
 #define DRM_NOUVEAU_GEM_SET_TILING     (DRM_NOUVEAU_STAGING_IOCTL + 0x0)
 #define DRM_NOUVEAU_GEM_PUSHBUF_2      (DRM_NOUVEAU_STAGING_IOCTL + 0x1)
+#define DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER (DRM_NOUVEAU_STAGING_IOCTL + 0x2)
 
 #define DRM_IOCTL_NOUVEAU_GEM_NEW      DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_NEW, struct drm_nouveau_gem_new)
 #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF  DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF, struct drm_nouveau_gem_pushbuf)
 #define DRM_IOCTL_NOUVEAU_GEM_CPU_PREP DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_PREP, struct drm_nouveau_gem_cpu_prep)
 #define DRM_IOCTL_NOUVEAU_GEM_CPU_FINI DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_FINI, struct drm_nouveau_gem_cpu_fini)
 #define DRM_IOCTL_NOUVEAU_GEM_INFO     DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_INFO, struct drm_nouveau_gem_info)
 /* staging ioctls */
 #define DRM_IOCTL_NOUVEAU_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_SET_TILING, struct drm_nouveau_gem_set_tiling)
 #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF_2 DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF_2, struct drm_nouveau_gem_pushbuf_2)
+#define DRM_IOCTL_NOUVEAU_GEM_SET_ERROR_NOTIFIER DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER, struct drm_nouveau_gem_set_error_notifier)
 
 #endif /* __NOUVEAU_DRM_H__ */
--
2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 2/5] don't verify route == owner in nvkm ioctl
HACK: Some objects that we need to access from userspace are first created in the kernel, and are thus owned by the kernel layer. Bypass a safety mechanism that allows only userspace-created objects to be used from userspace. The objects do not span across processes but are still owned by the same client.

This will need to be fixed in some proper way. Will these objects be managed by userspace at some point?

Signed-off-by: Konsta Hölttä <kholtta at nvidia.com>
---
 drm/nouveau/nvkm/core/ioctl.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drm/nouveau/nvkm/core/ioctl.c b/drm/nouveau/nvkm/core/ioctl.c
index 4459ff5..30164c7 100644
--- a/drm/nouveau/nvkm/core/ioctl.c
+++ b/drm/nouveau/nvkm/core/ioctl.c
@@ -474,18 +474,23 @@ nvkm_ioctl_path(struct nvkm_handle *parent, u32 type, u32 nr, u32 *path,
			nv_debug(object, "handle 0x%08x not found\n", path[nr]);
 			return -ENOENT;
 		}
 		nvkm_namedb_put(handle);
 		parent = handle;
 	}
 
 	if (owner != NVIF_IOCTL_V0_OWNER_ANY && owner != handle->route) {
-		nv_ioctl(object, "object route != owner\n");
-		return -EACCES;
+		nv_ioctl(object, "object route != owner: route = 0x%x, owner = 0x%x\n",
+			 handle->route, owner);
+		/* return -EACCES; */
+
+		/* continue anyway - this is required for calling objects
+		 * created in the kernel for this client from userspace, such
+		 * as the channel fifo object or its gr obj. */
 	}
 	*route = handle->route;
 	*token = handle->token;
 
 	if (ret = -EINVAL, type < ARRAY_SIZE(nvkm_ioctl_v0)) {
 		if (nvkm_ioctl_v0[type].version == 0)
 			ret = nvkm_ioctl_v0[type].func(handle, data, size);
 	}
--
2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 3/5] gk104: channel priority/timeslice support
Change channel "priorities" by modifying the runlist timeslice value, thus giving more time for more important jobs before scheduling a context switch. A new KEPLER_SET_CHANNEL_PRIORITY mthd on fifo channel objects sets the priority to low, medium or high. Disable the channel and preempt it out, change the timeslice, and then enable the channel again. The shift bit in the timeslice register is left untouched (i.e., 3) as is the enable bit. Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/include/nvif/class.h | 10 ++++++ drm/nouveau/nvkm/engine/fifo/gk104.c | 60 ++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h index 28176e6..72c3b37 100644 --- a/drm/nouveau/include/nvif/class.h +++ b/drm/nouveau/include/nvif/class.h @@ -619,9 +619,19 @@ struct fermi_a_zbc_depth_v0 { #define FERMI_A_ZBC_DEPTH_V0_FMT_FP32 0x01 __u8 format; __u8 index; __u8 pad03[5]; __u32 ds; __u32 l2; }; +#define KEPLER_SET_CHANNEL_PRIORITY 0x00 +struct kepler_set_channel_priority_v0 { + __u8 version; +#define KEPLER_SET_CHANNEL_PRIORITY_LOW 0x00 +#define KEPLER_SET_CHANNEL_PRIORITY_MEDIUM 0x01 +#define KEPLER_SET_CHANNEL_PRIORITY_HIGH 0x02 + __u8 priority; + __u8 pad03[6]; +}; + #endif diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c index 659c05f..fda726d 100644 --- a/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -333,22 +333,82 @@ gk104_fifo_chan_fini(struct nvkm_object *object, bool suspend) gk104_fifo_runlist_update(priv, chan->engine); } gk104_fifo_chan_kick(chan); nv_wr32(priv, 0x800000 + (chid * 8), 0x00000000); return nvkm_fifo_channel_fini(&chan->base, suspend); } +static int +gk104_fifo_set_runlist_timeslice(struct gk104_fifo_priv *priv, + struct gk104_fifo_chan *chan, u8 slice) +{ + struct nvkm_gpuobj *base = nv_gpuobj(nv_object(chan)->parent); + u32 chid = chan->base.chid; + + nv_mask(priv, 0x800004 + (chid * 8), 0x00000800, 0x00000800); + WARN_ON(gk104_fifo_chan_kick(chan)); + nv_wo32(base, 0xf8, slice | 0x10003000); + nv_mask(priv, 0x800004 + (chid * 8), 0x00000400, 0x00000400); + nv_info(chan, "timeslice set to %d for %d\n", slice, chid); + + return 0; +} + +int +gk104_fifo_chan_set_priority(struct nvkm_object *object, void *data, u32 size) +{ + struct gk104_fifo_priv *priv = (void *)object->engine; + struct gk104_fifo_chan *chan = (void *)object; + union { + struct kepler_set_channel_priority_v0 v0; + } *args = data; + int ret; + u8 slice; + + if (nvif_unpack(args->v0, 0, 0, false)) { + switch (args->v0.priority) { + case KEPLER_SET_CHANNEL_PRIORITY_LOW: + slice = 64; /* << 3 == 512 us */ + break; + case KEPLER_SET_CHANNEL_PRIORITY_MEDIUM: + slice = 128; /* 1 ms */ + break; + case KEPLER_SET_CHANNEL_PRIORITY_HIGH: + slice = 255; /* 2 ms */ + break; + default: + return -EINVAL; + } + return gk104_fifo_set_runlist_timeslice(priv, chan, slice); + } + + return ret; +} + +int +gk104_fifo_chan_mthd(struct nvkm_object *object, u32 mthd, void *data, u32 size) +{ + switch (mthd) { + case KEPLER_SET_CHANNEL_PRIORITY: + return gk104_fifo_chan_set_priority(object, data, size); + default: + break; + } + return -EINVAL; +} + struct nvkm_ofuncs gk104_fifo_chan_ofuncs = { .ctor = gk104_fifo_chan_ctor, .dtor = _nvkm_fifo_channel_dtor, .init = gk104_fifo_chan_init, .fini = gk104_fifo_chan_fini, + .mthd = gk104_fifo_chan_mthd, .map = _nvkm_fifo_channel_map, .rd32 = _nvkm_fifo_channel_rd32, .wr32 = _nvkm_fifo_channel_wr32, .ntfy = 
_nvkm_fifo_channel_ntfy }; static struct nvkm_oclass gk104_fifo_sclass[] = { -- 2.1.4
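A sketch of how the new method might be invoked through nvif (not part of the patch): "chan" is assumed to be the channel's struct nvif_object, which userspace can only reach via the route/owner hack in patch 2 since the object is created kernel-side.

    #include <nvif/class.h>
    #include <nvif/object.h>

    static int chan_set_priority_high(struct nvif_object *chan)
    {
        struct kepler_set_channel_priority_v0 args = {
            .version  = 0,
            .priority = KEPLER_SET_CHANNEL_PRIORITY_HIGH,
        };

        /* dispatched to gk104_fifo_chan_mthd() via the nvif ioctl path */
        return nvif_object_mthd(chan, KEPLER_SET_CHANNEL_PRIORITY,
                                &args, sizeof(args));
    }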
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 4/5] gk104: channel timeout detection
Enable the scheduling timeout error interrupt and set it to a low value so that it fires periodically, since in certain conditions it can be missed in HW. In the error handler, increment a channel-specific counter in software if the current channel hasn't advanced, and abort the channel once the timeout limit is hit (at the periodic granularity). The error notifier is set to NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT when this occurs.

A new KEPLER_SET_CHANNEL_TIMEOUT mthd sets the timeout limit, in milliseconds. The interrupt granularity is set to 100 ms.

Some status-bit mangling in the sched error handler is cleaned up too.

Signed-off-by: Konsta Hölttä <kholtta at nvidia.com>
---
 drm/nouveau/include/nvif/class.h     |  8 ++++
 drm/nouveau/nvkm/engine/fifo/gk104.c | 92 +++++++++++++++++++++++++++++-------
 2 files changed, 84 insertions(+), 16 deletions(-)

diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h
index 72c3b37..9b568dc 100644
--- a/drm/nouveau/include/nvif/class.h
+++ b/drm/nouveau/include/nvif/class.h
@@ -620,18 +620,26 @@ struct fermi_a_zbc_depth_v0 {
 	__u8  format;
 	__u8  index;
 	__u8  pad03[5];
 	__u32 ds;
 	__u32 l2;
 };
 
 #define KEPLER_SET_CHANNEL_PRIORITY 0x00
+#define KEPLER_SET_CHANNEL_TIMEOUT  0x01
+
 struct kepler_set_channel_priority_v0 {
 	__u8  version;
 #define KEPLER_SET_CHANNEL_PRIORITY_LOW    0x00
 #define KEPLER_SET_CHANNEL_PRIORITY_MEDIUM 0x01
 #define KEPLER_SET_CHANNEL_PRIORITY_HIGH   0x02
 	__u8  priority;
 	__u8  pad03[6];
 };
 
+struct kepler_set_channel_timeout_v0 {
+	__u8  version;
+	__u8  pad03[3];
+	__u32 timeout_ms;
+};
+
 #endif
diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c
index fda726d..53a464d 100644
--- a/drm/nouveau/nvkm/engine/fifo/gk104.c
+++ b/drm/nouveau/nvkm/engine/fifo/gk104.c
@@ -49,16 +49,20 @@ static const struct {
 	_(NVDEV_ENGINE_MSVLD   , 0),
 	_(NVDEV_ENGINE_CE0     , 0),
 	_(NVDEV_ENGINE_CE1     , 0),
 	_(NVDEV_ENGINE_MSENC   , 0),
 };
 #undef _
 #define FIFO_ENGINE_NR ARRAY_SIZE(fifo_engine)
 
+#define CTXSW_STATUS_LOAD   5
+#define CTXSW_STATUS_SAVE   6
+#define CTXSW_STATUS_SWITCH 7
+
 struct gk104_fifo_engn {
 	struct nvkm_gpuobj *runlist[2];
 	int cur_runlist;
 	wait_queue_head_t wait;
 };
 
 struct gk104_fifo_priv {
 	struct nvkm_fifo base;
@@ -83,18 +87,25 @@ struct gk104_fifo_base {
 struct gk104_fifo_chan {
 	struct nvkm_fifo_chan base;
 	u32 engine;
 	enum {
 		STOPPED,
 		RUNNING,
 		KILLED
 	} state;
+	struct {
+		u32 sum_ms;
+		u32 limit_ms;
+		u32 gpfifo_get;
+	} timeout;
 };
 
+#define GRFIFO_TIMEOUT_CHECK_PERIOD_MS 100
+
 /*******************************************************************************
  * FIFO channel objects
  ******************************************************************************/
 
 static void
 gk104_fifo_runlist_update(struct gk104_fifo_priv *priv, u32 engine)
 {
 	struct nvkm_bar *bar = nvkm_bar(priv);
@@ -288,16 +299,21 @@ gk104_fifo_chan_ctor(struct nvkm_object *parent, struct nvkm_object *engine,
 	nv_wo32(base, 0x94, 0x30000001);
 	nv_wo32(base, 0x9c, 0x00000100);
 	nv_wo32(base, 0xac, 0x0000001f);
 	nv_wo32(base, 0xe8, chan->base.chid);
 	nv_wo32(base, 0xb8, 0xf8000000);
 	nv_wo32(base, 0xf8, 0x10003080); /* 0x002310 */
 	nv_wo32(base, 0xfc, 0x10000010); /* 0x002350 */
 	bar->flush(bar);
+
+	chan->timeout.sum_ms = 0;
+	chan->timeout.limit_ms = -1;
+	chan->timeout.gpfifo_get = 0;
+
 	return 0;
 }
 
 static int
 gk104_fifo_chan_init(struct nvkm_object *object)
 {
 	struct nvkm_gpuobj *base = nv_gpuobj(object->parent);
 	struct gk104_fifo_priv *priv = (void *)object->engine;
@@ -381,21 +397,39 @@ gk104_fifo_chan_set_priority(struct nvkm_object *object, void *data, u32 size)
 		}
 		return gk104_fifo_set_runlist_timeslice(priv, chan, slice);
 	}
 
 	return ret;
 }
 
 int
+gk104_fifo_chan_set_timeout(struct nvkm_object *object, void *data, u32 size)
+{
+	struct gk104_fifo_chan *chan = (void *)object;
+	union {
+		struct kepler_set_channel_timeout_v0 v0;
+	} *args = data;
+	int ret;
+
+	if (nvif_unpack(args->v0, 0, 0, false)) {
+		chan->timeout.limit_ms = args->v0.timeout_ms;
+	}
+
+	return ret;
+}
+
+int
 gk104_fifo_chan_mthd(struct nvkm_object *object, u32 mthd, void *data, u32 size)
 {
 	switch (mthd) {
 	case KEPLER_SET_CHANNEL_PRIORITY:
 		return gk104_fifo_chan_set_priority(object, data, size);
+	case KEPLER_SET_CHANNEL_TIMEOUT:
+		return gk104_fifo_chan_set_timeout(object, data, size);
 	default:
 		break;
 	}
 	return -EINVAL;
 }
 
 struct nvkm_ofuncs
 gk104_fifo_chan_ofuncs = {
@@ -606,61 +640,83 @@ gk104_fifo_intr_bind(struct gk104_fifo_priv *priv)
 }
 
 static const struct nvkm_enum
 gk104_fifo_sched_reason[] = {
 	{ 0x0a, "CTXSW_TIMEOUT" },
 	{}
 };
 
+static bool
+gk104_fifo_update_timeout(struct gk104_fifo_priv *priv,
+			  struct gk104_fifo_chan *chan, u32 dt)
+{
+	u32 gpfifo_get = nv_rd32(priv, 0x88);
+	/* advancing, but slowly; reset counting */
+	if (gpfifo_get != chan->timeout.gpfifo_get)
+		chan->timeout.sum_ms = 0;
+
+	chan->timeout.sum_ms += dt;
+	chan->timeout.gpfifo_get = gpfifo_get;
+
+	return chan->timeout.sum_ms > chan->timeout.limit_ms;
+}
+
 static void
 gk104_fifo_intr_sched_ctxsw(struct gk104_fifo_priv *priv)
 {
 	struct nvkm_engine *engine;
 	struct gk104_fifo_chan *chan;
 	u32 engn;
 
 	for (engn = 0; engn < ARRAY_SIZE(fifo_engine); engn++) {
-		u32 stat = nv_rd32(priv, 0x002640 + (engn * 0x04));
-		u32 busy = (stat & 0x80000000);
-		u32 next = (stat & 0x07ff0000) >> 16;
-		u32 chsw = (stat & 0x00008000);
-		u32 save = (stat & 0x00004000);
-		u32 load = (stat & 0x00002000);
-		u32 prev = (stat & 0x000007ff);
-		u32 chid = load ? next : prev;
-		(void)save;
-
-		if (busy && chsw) {
+		u32 engstat = nv_rd32(priv, 0x002640 + (engn * 0x04));
+		u32 busy = (engstat & 0x80000000);
+		u32 next = (engstat & 0x07ff0000) >> 16;
+		u32 ctxstat = (engstat & 0x0000e000) >> 13;
+		u32 prev = (engstat & 0x000007ff);
+
+		u32 chid = ctxstat == CTXSW_STATUS_LOAD ? next : prev;
+		u32 ctxsw_active = ctxstat == CTXSW_STATUS_LOAD ||
+				   ctxstat == CTXSW_STATUS_SAVE ||
+				   ctxstat == CTXSW_STATUS_SWITCH;
+
+		if (busy && ctxsw_active) {
 			if (!(chan = (void *)priv->base.channel[chid]))
 				continue;
 			if (!(engine = gk104_fifo_engine(priv, engn)))
 				continue;
 
-			nvkm_fifo_eevent(&priv->base, chid,
-				NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT);
-
-			gk104_fifo_recover(priv, engine, chan);
+			if (gk104_fifo_update_timeout(priv, chan,
+					GRFIFO_TIMEOUT_CHECK_PERIOD_MS)) {
+				nvkm_fifo_eevent(&priv->base, chid,
+					NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT);
+				gk104_fifo_recover(priv, engine, chan);
+			} else {
+				nv_debug(priv, "fifo waiting for ctxsw %d ms on ch %d\n",
+					 chan->timeout.sum_ms, chid);
+			}
 		}
 	}
 }
 
 static void
 gk104_fifo_intr_sched(struct gk104_fifo_priv *priv)
 {
 	u32 intr = nv_rd32(priv, 0x00254c);
 	u32 code = intr & 0x000000ff;
 	const struct nvkm_enum *en;
 	char enunk[6] = "";
 
 	en = nvkm_enum_find(gk104_fifo_sched_reason, code);
 	if (!en)
 		snprintf(enunk, sizeof(enunk), "UNK%02x", code);
 
-	nv_error(priv, "SCHED_ERROR [ %s ]\n", en ? en->name : enunk);
+	/* this is a normal situation, not so loud */
+	nv_debug(priv, "SCHED_ERROR [ %s ]\n", en ? en->name : enunk);
 
 	switch (code) {
 	case 0x0a:
 		gk104_fifo_intr_sched_ctxsw(priv);
 		break;
 	default:
 		break;
 	}
@@ -1133,18 +1189,22 @@ gk104_fifo_init(struct nvkm_object *object)
 	/* PBDMA[n].HCE */
 	for (i = 0; i < priv->spoon_nr; i++) {
 		nv_wr32(priv, 0x040148 + (i * 0x2000), 0xffffffff); /* INTR */
 		nv_wr32(priv, 0x04014c + (i * 0x2000), 0xffffffff); /* INTREN */
 	}
 
 	nv_wr32(priv, 0x002254, 0x10000000 | priv->user.bar.offset >> 12);
 
+	/* enable interrupts */
 	nv_wr32(priv, 0x002100, 0xffffffff);
 	nv_wr32(priv, 0x002140, 0x7fffffff);
+
+	/* engine context switch timeout */
+	nv_wr32(priv, 0x002a0c, 0x80000000 | (1000 * GRFIFO_TIMEOUT_CHECK_PERIOD_MS));
 	return 0;
 }
 
 void
 gk104_fifo_dtor(struct nvkm_object *object)
 {
 	struct gk104_fifo_priv *priv = (void *)object;
 	int i;
--
2.1.4
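A companion sketch to the priority example in patch 3 (again not part of the patch, and subject to the same patch-2 permissions hack): setting the per-channel timeout limit in milliseconds through the same mthd path.

    #include <nvif/class.h>
    #include <nvif/object.h>

    static int chan_set_timeout(struct nvif_object *chan, u32 timeout_ms)
    {
        struct kepler_set_channel_timeout_v0 args = {
            .version    = 0,
            .timeout_ms = timeout_ms, /* checked at 100 ms granularity */
        };

        return nvif_object_mthd(chan, KEPLER_SET_CHANNEL_TIMEOUT,
                                &args, sizeof(args));
    }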
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 5/5] HACK force fences updated on error
Some error conditions just stop a channel, and fences get stuck, so they either need to be kicked ready by overwriting hw seq numbers (as nvgpu does) or faked with a sw flag like this. This is just a hack, as an example of what would be needed.

Here, a channel id whose fences should be force-updated is passed upwards with the uevent response. Normally this is -1, which matches no channel id, but some error paths fake an update event with an explicit channel id.

Note: if userspace has some meaningful timeouts on the fences, then they do finish, but without any notification that the channel is now broken (how do you distinguish a too-long gpu job from a stuck one?). In many cases, a channel needs to be shut down completely when it breaks (e.g., mmu fault).

Signed-off-by: Konsta Hölttä <kholtta at nvidia.com>
---
 drm/nouveau/include/nvif/event.h       |  1 +
 drm/nouveau/include/nvkm/engine/fifo.h |  2 +-
 drm/nouveau/nouveau_fence.c            | 13 ++++++++-----
 drm/nouveau/nvkm/engine/fifo/base.c    |  3 ++-
 drm/nouveau/nvkm/engine/fifo/gf100.c   |  2 +-
 drm/nouveau/nvkm/engine/fifo/gk104.c   |  7 ++++++-
 drm/nouveau/nvkm/engine/fifo/nv04.c    |  2 +-
 7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drm/nouveau/include/nvif/event.h b/drm/nouveau/include/nvif/event.h
index d148b85..a9ff4ee 100644
--- a/drm/nouveau/include/nvif/event.h
+++ b/drm/nouveau/include/nvif/event.h
@@ -52,16 +52,17 @@ struct nvif_notify_conn_rep_v0 {
 };
 
 struct nvif_notify_uevent_req {
 	/* nvif_notify_req ... */
 };
 
 struct nvif_notify_uevent_rep {
 	/* nvif_notify_rep ... */
+	__u32 force_chid;
 };
 
 struct nvif_notify_eevent_req {
 	/* nvif_notify_req ... */
 	u32 chid;
 };
 
 struct nvif_notify_eevent_rep {
diff --git a/drm/nouveau/include/nvkm/engine/fifo.h b/drm/nouveau/include/nvkm/engine/fifo.h
index cbca477..946eb68 100644
--- a/drm/nouveau/include/nvkm/engine/fifo.h
+++ b/drm/nouveau/include/nvkm/engine/fifo.h
@@ -117,15 +117,15 @@ extern struct nvkm_oclass *gf100_fifo_oclass;
 extern struct nvkm_oclass *gk104_fifo_oclass;
 extern struct nvkm_oclass *gk20a_fifo_oclass;
 extern struct nvkm_oclass *gk208_fifo_oclass;
 extern struct nvkm_oclass *gm204_fifo_oclass;
 extern struct nvkm_oclass *gm20b_fifo_oclass;
 
 int  nvkm_fifo_uevent_ctor(struct nvkm_object *, void *, u32,
			    struct nvkm_notify *);
-void nvkm_fifo_uevent(struct nvkm_fifo *);
+void nvkm_fifo_uevent(struct nvkm_fifo *, u32 force_chid);
 void nvkm_fifo_eevent(struct nvkm_fifo *, u32 chid, u32 error);
 
 void nv04_fifo_intr(struct nvkm_subdev *);
 int  nv04_fifo_context_attach(struct nvkm_object *, struct nvkm_object *);
 
 #endif
diff --git a/drm/nouveau/nouveau_fence.c b/drm/nouveau/nouveau_fence.c
index 38bccb0..b7d9987 100644
--- a/drm/nouveau/nouveau_fence.c
+++ b/drm/nouveau/nouveau_fence.c
@@ -123,50 +123,53 @@ nouveau_fence_context_put(struct kref *fence_ref)
 void
 nouveau_fence_context_free(struct nouveau_fence_chan *fctx)
 {
 	kref_put(&fctx->fence_ref, nouveau_fence_context_put);
 }
 
 static int
-nouveau_fence_update(struct nouveau_channel *chan, struct nouveau_fence_chan *fctx)
+nouveau_fence_update(struct nouveau_channel *chan,
+		     struct nouveau_fence_chan *fctx, u32 force_chid)
 {
 	struct nouveau_fence *fence;
 	int drop = 0;
 	u32 seq = fctx->read(chan);
+	bool force = force_chid == chan->chid;
 
 	while (!list_empty(&fctx->pending)) {
 		fence = list_entry(fctx->pending.next, typeof(*fence), head);
 
-		if ((int)(seq - fence->base.seqno) < 0)
+		if ((int)(seq - fence->base.seqno) < 0 && !force)
 			break;
 
 		drop |= nouveau_fence_signal(fence);
 	}
 
 	return drop;
 }
 
 static int
 nouveau_fence_wait_uevent_handler(struct nvif_notify *notify)
 {
 	struct nouveau_fence_chan *fctx =
		container_of(notify, typeof(*fctx), notify);
+	const struct nvif_notify_uevent_rep *rep = notify->data;
 	unsigned long flags;
 	int ret = NVIF_NOTIFY_KEEP;
 
 	spin_lock_irqsave(&fctx->lock, flags);
 	if (!list_empty(&fctx->pending)) {
 		struct nouveau_fence *fence;
 		struct nouveau_channel *chan;
 
 		fence = list_entry(fctx->pending.next, typeof(*fence), head);
 		chan = rcu_dereference_protected(fence->channel,
						 lockdep_is_held(&fctx->lock));
-		if (nouveau_fence_update(fence->channel, fctx))
+		if (nouveau_fence_update(fence->channel, fctx, rep->force_chid))
 			ret = NVIF_NOTIFY_DROP;
 	}
 	spin_unlock_irqrestore(&fctx->lock, flags);
 
 	return ret;
 }
 
 void
@@ -278,17 +281,17 @@ nouveau_fence_emit(struct nouveau_fence *fence, struct nouveau_channel *chan)
 	kref_get(&fctx->fence_ref);
 
 	trace_fence_emit(&fence->base);
 	ret = fctx->emit(fence);
 	if (!ret) {
 		fence_get(&fence->base);
 		spin_lock_irq(&fctx->lock);
-		if (nouveau_fence_update(chan, fctx))
+		if (nouveau_fence_update(chan, fctx, -1))
 			nvif_notify_put(&fctx->notify);
 		list_add_tail(&fence->head, &fctx->pending);
 		spin_unlock_irq(&fctx->lock);
 	}
 
 	return ret;
 }
 
@@ -302,17 +305,17 @@ nouveau_fence_done(struct nouveau_fence *fence)
 		struct nouveau_channel *chan;
 		unsigned long flags;
 
 		if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
 			return true;
 
 		spin_lock_irqsave(&fctx->lock, flags);
 		chan = rcu_dereference_protected(fence->channel,
						 lockdep_is_held(&fctx->lock));
-		if (chan && nouveau_fence_update(chan, fctx))
+		if (chan && nouveau_fence_update(chan, fctx, -1))
 			nvif_notify_put(&fctx->notify);
 		spin_unlock_irqrestore(&fctx->lock, flags);
 	}
 	return fence_is_signaled(&fence->base);
 }
 
 static long
 nouveau_fence_wait_legacy(struct fence *f, bool intr, long wait)
diff --git a/drm/nouveau/nvkm/engine/fifo/base.c b/drm/nouveau/nvkm/engine/fifo/base.c
index a5dc6c9..e35d711 100644
--- a/drm/nouveau/nvkm/engine/fifo/base.c
+++ b/drm/nouveau/nvkm/engine/fifo/base.c
@@ -184,19 +184,20 @@ nvkm_fifo_uevent_ctor(struct nvkm_object *object, void *data, u32 size,
 		notify->types = 1;
 		notify->index = 0;
 	}
 
 	return ret;
 }
 
 void
-nvkm_fifo_uevent(struct nvkm_fifo *fifo)
+nvkm_fifo_uevent(struct nvkm_fifo *fifo, u32 force_chid)
 {
 	struct nvif_notify_uevent_rep rep = {
+		.force_chid = force_chid
 	};
 	nvkm_event_send(&fifo->uevent, 1, 0, &rep, sizeof(rep));
 }
 
 static int
 nvkm_fifo_eevent_ctor(struct nvkm_object *object, void *data, u32 size,
		      struct nvkm_notify *notify)
 {
diff --git a/drm/nouveau/nvkm/engine/fifo/gf100.c b/drm/nouveau/nvkm/engine/fifo/gf100.c
index b745252..ca86dfe 100644
--- a/drm/nouveau/nvkm/engine/fifo/gf100.c
+++ b/drm/nouveau/nvkm/engine/fifo/gf100.c
@@ -732,17 +732,17 @@ gf100_fifo_intr_engine_unit(struct gf100_fifo_priv *priv, int engn)
 	u32 inte = nv_rd32(priv, 0x002628);
 	u32 unkn;
 
 	nv_wr32(priv, 0x0025a8 + (engn * 0x04), intr);
 
 	for (unkn = 0; unkn < 8; unkn++) {
 		u32 ints = (intr >> (unkn * 0x04)) & inte;
 		if (ints & 0x1) {
-			nvkm_fifo_uevent(&priv->base);
+			nvkm_fifo_uevent(&priv->base, -1);
 			ints &= ~1;
 		}
 		if (ints) {
 			nv_error(priv, "ENGINE %d %d %01x", engn, unkn, ints);
 			nv_mask(priv, 0x002628, ints, 0);
 		}
 	}
 }
diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c
index 53a464d..caecef1 100644
--- a/drm/nouveau/nvkm/engine/fifo/gk104.c
+++ b/drm/nouveau/nvkm/engine/fifo/gk104.c
@@ -908,16 +908,18 @@ gk104_fifo_intr_fault(struct gk104_fifo_priv *priv, int unit)
 	object = engctx;
 	while (object) {
 		switch (nv_mclass(object)) {
 		case KEPLER_CHANNEL_GPFIFO_A:
 		case MAXWELL_CHANNEL_GPFIFO_A:
 			nvkm_fifo_eevent(&priv->base,
				((struct nvkm_fifo_chan*)object)->chid,
				NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT);
+			nvkm_fifo_uevent(&priv->base,
+				((struct nvkm_fifo_chan*)object)->chid);
 			gk104_fifo_recover(priv, engine, (void *)object);
 			break;
 		}
 		object = object->parent;
 	}
 
 	nvkm_engctx_put(engctx);
 }
@@ -978,18 +980,21 @@ gk104_fifo_intr_pbdma_0(struct gk104_fifo_priv *priv, int unit)
 		nv_error(priv, "PBDMA%d:", unit);
 		nvkm_bitfield_print(gk104_fifo_pbdma_intr_0, show);
 		pr_cont("\n");
 		nv_error(priv,
			 "PBDMA%d: ch %d [%s] subc %d mthd 0x%04x data 0x%08x\n",
			 unit, chid,
			 nvkm_client_name_for_fifo_chid(&priv->base, chid),
			 subc, mthd, data);
+
 		nvkm_fifo_eevent(&priv->base, chid,
				NOUVEAU_GEM_CHANNEL_PBDMA_ERROR);
+
+		nvkm_fifo_uevent(&priv->base, chid);
 	}
 
 	nv_wr32(priv, 0x040108 + (unit * 0x2000), stat);
 }
 
 static const struct nvkm_bitfield gk104_fifo_pbdma_intr_1[] = {
 	{ 0x00000001, "HCE_RE_ILLEGAL_OP" },
 	{ 0x00000002, "HCE_RE_ALIGNB" },
@@ -1030,17 +1035,17 @@ gk104_fifo_intr_runlist(struct gk104_fifo_priv *priv)
 		nv_wr32(priv, 0x002a00, 1 << engn);
 		mask &= ~(1 << engn);
 	}
 }
 
 static void
 gk104_fifo_intr_engine(struct gk104_fifo_priv *priv)
 {
-	nvkm_fifo_uevent(&priv->base);
+	nvkm_fifo_uevent(&priv->base, -1);
 }
 
 static void
 gk104_fifo_intr(struct nvkm_subdev *subdev)
 {
 	struct gk104_fifo_priv *priv = (void *)subdev;
 	u32 mask = nv_rd32(priv, 0x002140);
 	u32 stat = nv_rd32(priv, 0x002100) & mask;
diff --git a/drm/nouveau/nvkm/engine/fifo/nv04.c b/drm/nouveau/nvkm/engine/fifo/nv04.c
index 043e429..1749614 100644
--- a/drm/nouveau/nvkm/engine/fifo/nv04.c
+++ b/drm/nouveau/nvkm/engine/fifo/nv04.c
@@ -536,17 +536,17 @@ nv04_fifo_intr(struct nvkm_subdev *subdev)
 	if (device->card_type == NV_50) {
 		if (stat & 0x00000010) {
 			stat &= ~0x00000010;
 			nv_wr32(priv, 0x002100, 0x00000010);
 		}
 
 		if (stat & 0x40000000) {
 			nv_wr32(priv, 0x002100, 0x40000000);
-			nvkm_fifo_uevent(&priv->base);
+			nvkm_fifo_uevent(&priv->base, -1);
 			stat &= ~0x40000000;
 		}
 	}
 
 	if (stat) {
 		nv_warn(priv, "unknown intr 0x%08x\n", stat);
 		nv_mask(priv, NV03_PFIFO_INTR_EN_0, stat, 0x00000000);
 		nv_wr32(priv, NV03_PFIFO_INTR_0, stat);
--
2.1.4
Ben Skeggs
2015-Sep-01 13:26 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote:
> Hi there,
>
> Resending these now that they've had some more polish and testing, and I heard
> that Ben's vacation is over :-)
>
> These patches work as a starting point for more explicit error mechanisms and
> better robustness. At the moment, when a job hangs or faults, it seems that
> nouveau doesn't quite know how to handle the situation and often results in a
> hang. Some of these situations would require either completely resetting the
> gpu, and/or a complex path for only recovering the broken channel.

Yes, this has been an issue that's been on my "really nice to implement" list
for quite a long time now, and I've only gotten around to getting it partially
done (as you can see). Thanks for looking at this!

> To start, I worked on support for letting userspace know what exactly happened.
> Proper recovery would come later. The "error notifier" in the first patch is a
> simple shared buffer between kernel and userspace. Its error codes match
> nvgpu's. Alternatively, the status could be queried with an ioctl, but that
> would be considerably more heavyweight. I'd like to know if the event mechanism
> is meant for these kinds of events at all (engines notify errors upwards to the
> drm layer). Another alternative would probably be to register the same buffer
> to all necessary engines separately in nvif method calls? Or register it to
> just one (e.g., fifo) and get that engine when errors happen in others (e.g.,
> gr)? And drm handles probably wouldn't fit there? Please comment on this; I
> wrote this before understanding the mthd mechanism.

For the moment, I've just got a couple of high-level comments/questions on
these patches, before we worry about nit-picking the details:

Is there any real need to have a full-blown notifier-style object to pass this
info back to userspace? The DRM has an event mechanism where userspace could
poll on its fd to get events, which is what I'd assume you'd be doing inside
of fence waits with the error notifier? NVIF events are actually hooked up to
this mechanism already, so userspace could directly request them and cut quite
a lot out of that first patch. But the NVKM-level passing of these events back
over NVIF is pretty much where I was going with the design too, so it's good
to see you've done that :)

> Additionally, priority and timeout management for separate channels in flight
> on the gpu is added in two patches. Neither is exactly what the name says, but
> the effect is the same, and this is what nvgpu does currently. Those two
> patches call the fifo channel object's methods directly from userspace, so a
> hack is added in the nvif path to accept that. The objects are NEW'd from
> kernel space, so calling from userspace isn't allowed, as it appears. How
> should this be managed in a clean way?

Is there any requirement to be able to adjust the priority of in-flight
channels? Or would it be satisfactory to have these as arguments to the
channel creation functions? Either way is fine; the latter is preferred if
there's no use case for adjusting it afterwards, though.

I'm attempting to get some better channel handling in place in time for 4.4
(you might have noticed the current stuff the kernel uses is a bit fragile;
it's evolved over a long time and several generations of channel classes),
part of which will likely involve giving userspace better control over the
arguments that get passed at channel creation time.

As for the permissions problem, that can be resolved fairly easily I think;
I'll play with a few ideas and see what I come up with.

Thanks,
Ben.

> Also, since nouveau often hangs on errors, the userspace hangs too (waiting on
> a fence). The final patch attempts to fix this in a couple of specific error
> paths to forcibly update all fences to be finished. I'd like to hear how that
> would be handled properly - consider the patch just a proof-of-concept and
> sample of what would be necessary.
>
> I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate
> any high-level comments on if I've understood anything, especially the event
> and nvif/method mechanisms (I use the latter from userspace with a hack
> constructed from the perfmon branch seen here earlier into nvidia's internal
> libdrm-equivalent). The fence-forcing thing is something that is necessary with
> the error notifiers (at least with our userspace that waits really long or
> infinitely on fences). I'm working specifically on Tegra and don't know much
> about the desktop's userspace details, so I may be biased in some areas.
>
> I'd be happy to write sample tests on e.g. libdrm for the new methods once the
> kernel patches would get to a good shape, if that's required for accepting new
> features. I tested these to work as a proof-of-concept on Jetson TK1, and the
> code is adapted from the latest nvgpu.
>
> The patches can also be found in http://github.com/sooda/nouveau and are based
> on a version of gnurou/staging.
>
> Thanks!
> Konsta (sooda in IRC)
>
> Konsta Hölttä (5):
>   notify channel errors to userspace
>   don't verify route == owner in nvkm ioctl
>   gk104: channel priority/timeslice support
>   gk104: channel timeout detection
>   HACK force fences updated on error
>
>  drm/nouveau/include/nvif/class.h       |  20 ++++
>  drm/nouveau/include/nvif/event.h       |  12 +++
>  drm/nouveau/include/nvkm/engine/fifo.h |   5 +-
>  drm/nouveau/nouveau_chan.c             |  95 +++++++++++++++++++
>  drm/nouveau/nouveau_chan.h             |  10 ++
>  drm/nouveau/nouveau_drm.c              |   1 +
>  drm/nouveau/nouveau_fence.c            |  13 ++-
>  drm/nouveau/nouveau_gem.c              |  29 ++++++
>  drm/nouveau/nouveau_gem.h              |   2 +
>  drm/nouveau/nvkm/core/ioctl.c          |   9 +-
>  drm/nouveau/nvkm/engine/fifo/base.c    |  56 ++++++++++-
>  drm/nouveau/nvkm/engine/fifo/gf100.c   |   2 +-
>  drm/nouveau/nvkm/engine/fifo/gk104.c   | 166 ++++++++++++++++++++++++++++++---
>  drm/nouveau/nvkm/engine/fifo/nv04.c    |   2 +-
>  drm/nouveau/nvkm/engine/gr/gf100.c     |   5 +
>  drm/nouveau/uapi/drm/nouveau_drm.h     |  13 +++
>  16 files changed, 415 insertions(+), 25 deletions(-)
>
> --
> 2.1.4
>
> _______________________________________________
> Nouveau mailing list
> Nouveau at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau
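For context on the DRM event mechanism Ben mentions: userspace polls the DRM fd and reads struct drm_event records from it. A rough sketch, with the dispatch on the event type left open since an error-notifier event type is not defined anywhere yet:

    #include <poll.h>
    #include <unistd.h>
    #include <drm/drm.h>

    static void drain_drm_events(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[1024];

        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            ssize_t len = read(fd, buf, sizeof(buf));
            ssize_t off = 0;

            /* the fd delivers a stream of variable-length events */
            while (off + (ssize_t)sizeof(struct drm_event) <= len) {
                struct drm_event *e = (void *)(buf + off);
                /* dispatch on e->type here */
                off += e->length;
            }
        }
    }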
Konsta Hölttä
2015-Sep-02 10:01 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
On 01/09/15 16:26, Ben Skeggs wrote:
> Is there any real need to have a full-blown notifier-style object to pass this
> info back to userspace? The DRM has an event mechanism where userspace can
> poll on its fd to get events, which is what I assume you'd be doing inside of
> fence waits with the error notifier. NVIF events are actually hooked up to
> this mechanism already, so userspace could request them directly and cut quite
> a lot out of that first patch. But the NVKM-level passing of these events back
> over NVIF is pretty much where I was going with the design too, so it's good
> to see you've done that :)

I hadn't heard about that event mechanism yet - do you mean just drm_poll()
and the ->event_wait that is woken up in usif_notify()? That doesn't carry any
values that could be passed to userland - is there already a way to do that
too?

The motivation for the current approach is that it's lightweight, requiring no
kernel calls at all (and it's compatible with the nvidia gl drivers...). I
don't actually know all the details of how the higher-level parts of the
driver use the buffer :( just assuming things atm. I'd guess that the buffer
would be read after each submit (after a fence wait has finished, which is a
separate thing) if checking the status is required. My understanding is that
at least the desktop driver has (or used to have) a shared buffer for lots of
common stuff, and this would be part of it (hence the offset field), so doing
this differently would complicate the whole driver stack. In Tegra's
libdrm-equivalent this is initialized on channel creation, so it could in fact
be part of the channel init args in nouveau?
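To make the lightweight part concrete, the post-submit check amounts to a
plain memory read. A rough sketch, where the struct layout is an assumption
(only the error codes are known to match nvgpu's) and the pointer into the
mmap'd buffer is hypothetical:

#include <linux/types.h>

struct chan_error_notifier {    /* assumed per-channel layout */
        __u64 timestamp;        /* when the error was recorded */
        __u32 info;             /* error code, matching nvgpu's */
        __u16 info16;
        __u16 status;           /* non-zero once an error happened */
};

/* 'n' points into the mmap'd shared buffer, at the offset that was
 * registered for this channel; returns the raw error code, 0 if none */
static int check_channel_error(volatile struct chan_error_notifier *n)
{
        return n->status ? (int)n->info : 0;
}

That is, after each submit - once the fence wait has finished - the driver
looks at a few bytes of shared memory instead of making a system call.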
I also considered that the buffer could be just one page and allocated in the
kernel, but as said, our current drivers want to be able to pass a separate
buffer there.

> Is there any requirement to be able to adjust the priority of in-flight
> channels, or would it be satisfactory to have these as arguments to the
> channel creation functions? Either way is fine, though the latter is
> preferred if there's no use case for adjusting it afterwards.

Maybe not exactly, but the nvidia userspace driver imposes a technical
requirement here: it doesn't restrict this to creation time, but allows it at
any time. I looked at that option too when starting to write these patches,
and yes, specifying these at init time would have simplified things. It's
certainly possible that the timeout/priority is set only later, after
userspace finds out that this is e.g. a webgl process that needs to finish
faster. That's one plausible use case.

> I'm attempting to get some better channel handling in place in time for 4.4
> (you might have noticed that the current stuff the kernel uses is a bit
> fragile; it's evolved over a long time and several generations of channel
> classes), part of which will likely involve giving userspace better control
> over the arguments that get passed at channel creation time.

The current stuff is at least not trivial to grasp initially :) I'd also love
to help write documentation for lots of things if time permits - I've been
taking some notes for myself, but a general nvkm glossary and some general
reasoning about why things are done the way they are would definitely help new
folks get to know the codebase (it wasn't easy for me). The problem is that I
could only write about how things are done, not why; getting to know all that
would require lots of discussion (and time).

> As for the permissions problem, that can be resolved fairly easily, I think;
> I'll play with a few ideas and see what I come up with.

Good to hear, thanks! A somewhat related question: what exactly determines the
level of control that the kernel vs. userspace has over the channels?
Management rights for nouveau_bo's? It looks like some of the stuff in
nouveau_chan.c could live in userspace too.

Cheers,
Konsta
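The "call the fifo channel object's methods directly" part would look roughly
like this through nvif; the method number and argument struct here are
hypothetical stand-ins for what the gk104 patch actually defines:

#include <nvif/object.h>

#define KEPLER_CHANNEL_SET_PRIORITY 0x00000010  /* hypothetical method */

struct kepler_channel_priority_args {           /* hypothetical args */
        u32 priority;           /* maps to a timeslice value in HW */
};

static int chan_set_priority(struct nvif_object *chan, u32 prio)
{
        struct kepler_channel_priority_args args = { .priority = prio };

        /* nvif_mthd() marshals this into an NVIF ioctl against the
         * channel object; the route == owner check that patch 2 relaxes
         * is what would otherwise reject this when the call originates
         * in userspace. */
        return nvif_mthd(chan, KEPLER_CHANNEL_SET_PRIORITY,
                         &args, sizeof(args));
}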
Alexandre Courbot
2015-Sep-03 09:01 UTC
[Nouveau] [RFC PATCH v2 1/5] notify channel errors to userspace
On Mon, Aug 31, 2015 at 8:38 PM, Konsta Hölttä <kholtta at nvidia.com> wrote:
> Let userspace know about some specific types of errors via a shared
> buffer registered with a new ioctl, NOUVEAU_GEM_SET_ERROR_NOTIFIER.
> Once set, the notifier buffer is used until the channel is destroyed.
> The exceptional per-channel error conditions are signaled upwards from
> the nvkm layer via the event/notify mechanism and written to the buffer,
> which holds the latest error that occurred, readable from userspace.
>
> Some error situations render a channel completely stuck, which may even
> require discarding and reinitializing it, or the gpu, completely.
> Once a specific type of error happens, no others are expected until
> recovery. Passing the error to userspace in a shared buffer enables the
> graphics driver to periodically check, in a light-weight way, whether
> anything went wrong (typically after a submit); e.g., the GL robustness
> extension requires detecting that the context has died.
>
> The notifier currently signals idle timeout, sw notify, mmu fault and
> illegal pbdma error conditions.

Hi Konsta,

Before this series can be considered for upstream, I suppose it will need to
be ported to the latest codebase. This will likely require a rewrite due to
the heavy changes that occurred, but hopefully it will also allow you to
address some of the hacks more cleanly.
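From userspace, registering the notifier would presumably be a one-time call
at channel setup. As a sketch - only the ioctl's name comes from the series
itself; the command index and request layout below are guesses:

#include <xf86drm.h>
#include <linux/types.h>

/* index and layout are guesses; the name comes from the patch */
#define DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER 0x45

struct drm_nouveau_gem_set_error_notifier {
        __u32 channel;  /* channel id */
        __u32 buffer;   /* GEM handle of the shared buffer */
        __u32 offset;   /* byte offset of the notifier inside it */
        __u32 pad;
};

static int set_error_notifier(int fd, __u32 channel,
                              __u32 handle, __u32 offset)
{
        struct drm_nouveau_gem_set_error_notifier req = {
                .channel = channel,
                .buffer  = handle,
                .offset  = offset,
        };

        /* drmCommandWrite() builds the full ioctl number from the
         * driver-specific index and issues it on the device fd */
        return drmCommandWrite(fd, DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER,
                               &req, sizeof(req));
}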