Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
Hi there, Resending these now that they've had some more polish and testing, and I heard that Ben's vacation is over :-) These patches work as a starting point for more explicit error mechanisms and better robustness. At the moment, when a job hangs or faults, it seems that nouveau doesn't quite know how to handle the situation and often results in a hang. Some of these situations would require either completely resetting the gpu, and/or a complex path for only recovering the broken channel. To start, I worked on support for letting userspace know what exactly happened. Proper recovery would come later. The "error notifier" in the first patch is a simple shared buffer between kernel and userspace. Its error codes match nvgpu's. Alternatively, the status could be queried with an ioctl, but that would be considerably more heavyweight. I'd like to know if the event mechanism is meant for these kinds of events at all (engines notify errors upwards to the drm layer). Another alternative would probably be to register the same buffer to all necessary engines separately in nvif method calls? Or register it to just one (e.g., fifo) and get that engine when errors happen in others (e.g., gr)? And drm handles probably wouldn't fit there? Please comment on this; I wrote this before understanding the mthd mechanism. Additionally, priority and timeout management for separate channels in flight on the gpu is added in two patches. Neither is exactly what the name says, but the effect is the same, and this is what nvgpu does currently. Those two patches call the fifo channel object's methods directly from userspace, so a hack is added in the nvif path to accept that. The objects are NEW'd from kernel space, so calling from userspace isn't allowed, as it appears. How should this be managed in a clean way? Also, since nouveau often hangs on errors, the userspace hangs too (waiting on a fence). The final patch attempts to fix this in a couple of specific error paths to forcibly update all fences to be finished. I'd like to hear how that would be handled properly - consider the patch just a proof-of-concept and sample of what would be necessary. I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate any high-level comments on if I've understood anything, especially the event and nvif/method mechanisms (I use the latter from userspace with a hack constructed from the perfmon branch seen here earlier into nvidia's internal libdrm-equivalent). The fence-forcing thing is something that is necessary with the error notifiers (at least with our userspace that waits really long or infinitely on fences). I'm working specifically on Tegra and don't know much about the desktop's userspace details, so I may be biased in some areas. I'd be happy to write sample tests on e.g. libdrm for the new methods once the kernel patches would get to a good shape, if that's required for accepting new features. I tested these to work as a proof-of-concept on Jetson TK1, and the code is adapted from the latest nvgpu. The patches can also be found in http://github.com/sooda/nouveau and are based on a version of gnurou/staging. Thanks! Konsta (sooda in IRC) Konsta Hölttä (5): notify channel errors to userspace don't verify route == owner in nvkm ioctl gk104: channel priority/timeslice support gk104: channel timeout detection HACK force fences updated on error drm/nouveau/include/nvif/class.h | 20 ++++ drm/nouveau/include/nvif/event.h | 12 +++ drm/nouveau/include/nvkm/engine/fifo.h | 5 +- drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++ drm/nouveau/nouveau_chan.h | 10 ++ drm/nouveau/nouveau_drm.c | 1 + drm/nouveau/nouveau_fence.c | 13 ++- drm/nouveau/nouveau_gem.c | 29 ++++++ drm/nouveau/nouveau_gem.h | 2 + drm/nouveau/nvkm/core/ioctl.c | 9 +- drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++- drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +- drm/nouveau/nvkm/engine/fifo/gk104.c | 166 ++++++++++++++++++++++++++++++--- drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +- drm/nouveau/nvkm/engine/gr/gf100.c | 5 + drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++ 16 files changed, 415 insertions(+), 25 deletions(-) -- 2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 1/5] notify channel errors to userspace
Let userspace know about some specific types of errors via a shared buffer registered with a new ioctl, NOUVEAU_GEM_SET_ERROR_NOTIFIER. Once set, the notifier buffer is used until the channel is destroyed. The exceptional per-channel error conditions are signaled upwards from the nvkm layer via the event/notify mechanism and written to the buffer which holds the latest error occurred, readable from userspace. Some error situations render a channel completely stuck, which may require even discarding and reinitializing it or the gpu completely. Once a specific type of error happens, no others are expected until recovery. Passing the error to userspace in a shared buffer enables the graphics driver to periodically check in a light-weight way if anything went wrong (typically, after a submit); e.g., the GL robustness extension requires to detect that the context has dead. The notifier currently signals idle timeout, sw notify, mmu fault and illegal pbdma error conditions. Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/include/nvif/class.h | 2 + drm/nouveau/include/nvif/event.h | 11 ++++ drm/nouveau/include/nvkm/engine/fifo.h | 3 ++ drm/nouveau/nouveau_chan.c | 95 ++++++++++++++++++++++++++++++++++ drm/nouveau/nouveau_chan.h | 10 ++++ drm/nouveau/nouveau_drm.c | 1 + drm/nouveau/nouveau_gem.c | 29 +++++++++++ drm/nouveau/nouveau_gem.h | 2 + drm/nouveau/nvkm/engine/fifo/base.c | 53 +++++++++++++++++++ drm/nouveau/nvkm/engine/fifo/gk104.c | 13 +++++ drm/nouveau/nvkm/engine/gr/gf100.c | 5 ++ drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++++ 12 files changed, 237 insertions(+) diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h index d90e207..28176e6 100644 --- a/drm/nouveau/include/nvif/class.h +++ b/drm/nouveau/include/nvif/class.h @@ -410,16 +410,18 @@ struct kepler_channel_gpfifo_a_v0 { __u8 engine; __u16 chid; __u8 pad04[4]; __u32 pushbuf; __u32 ilength; __u64 ioffset; }; +#define CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT 0x01 + /******************************************************************************* * legacy display ******************************************************************************/ #define NV04_DISP_NTFY_VBLANK 0x00 #define NV04_DISP_NTFY_CONN 0x01 struct nv04_disp_mthd_v0 { diff --git a/drm/nouveau/include/nvif/event.h b/drm/nouveau/include/nvif/event.h index 2176449..d148b85 100644 --- a/drm/nouveau/include/nvif/event.h +++ b/drm/nouveau/include/nvif/event.h @@ -54,9 +54,20 @@ struct nvif_notify_conn_rep_v0 { struct nvif_notify_uevent_req { /* nvif_notify_req ... */ }; struct nvif_notify_uevent_rep { /* nvif_notify_rep ... */ }; +struct nvif_notify_eevent_req { + /* nvif_notify_req ... */ + u32 chid; +}; + +struct nvif_notify_eevent_rep { + /* nvif_notify_rep ... */ + u32 error; + u32 chid; +}; + #endif diff --git a/drm/nouveau/include/nvkm/engine/fifo.h b/drm/nouveau/include/nvkm/engine/fifo.h index 9100b80..cbca477 100644 --- a/drm/nouveau/include/nvkm/engine/fifo.h +++ b/drm/nouveau/include/nvkm/engine/fifo.h @@ -67,16 +67,17 @@ struct nvkm_fifo_base { #include <core/engine.h> #include <core/event.h> struct nvkm_fifo { struct nvkm_engine base; struct nvkm_event cevent; /* channel creation event */ struct nvkm_event uevent; /* async user trigger */ + struct nvkm_event eevent; /* error notifier */ struct nvkm_object **channel; spinlock_t lock; u16 min; u16 max; int (*chid)(struct nvkm_fifo *, struct nvkm_object *); void (*pause)(struct nvkm_fifo *, unsigned long *); @@ -118,11 +119,13 @@ extern struct nvkm_oclass *gk20a_fifo_oclass; extern struct nvkm_oclass *gk208_fifo_oclass; extern struct nvkm_oclass *gm204_fifo_oclass; extern struct nvkm_oclass *gm20b_fifo_oclass; int nvkm_fifo_uevent_ctor(struct nvkm_object *, void *, u32, struct nvkm_notify *); void nvkm_fifo_uevent(struct nvkm_fifo *); +void nvkm_fifo_eevent(struct nvkm_fifo *, u32 chid, u32 error); + void nv04_fifo_intr(struct nvkm_subdev *); int nv04_fifo_context_attach(struct nvkm_object *, struct nvkm_object *); #endif diff --git a/drm/nouveau/nouveau_chan.c b/drm/nouveau/nouveau_chan.c index 0589bab..ff37f5f 100644 --- a/drm/nouveau/nouveau_chan.c +++ b/drm/nouveau/nouveau_chan.c @@ -19,26 +19,29 @@ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR * OTHER DEALINGS IN THE SOFTWARE. * * Authors: Ben Skeggs */ #include <nvif/os.h> #include <nvif/class.h> +#include <nvif/event.h> +#include <nvif/notify.h> /*XXX*/ #include <core/client.h> #include "nouveau_drm.h" #include "nouveau_dma.h" #include "nouveau_bo.h" #include "nouveau_chan.h" #include "nouveau_fence.h" #include "nouveau_abi16.h" +#include "nouveau_gem.h" MODULE_PARM_DESC(vram_pushbuf, "Create DMA push buffers in VRAM"); int nouveau_vram_pushbuf; module_param_named(vram_pushbuf, nouveau_vram_pushbuf, int, 0400); int nouveau_channel_idle(struct nouveau_channel *chan) { @@ -72,16 +75,22 @@ nouveau_channel_del(struct nouveau_channel **pchan) nvif_object_fini(&chan->vram); nvif_object_ref(NULL, &chan->object); nvif_object_fini(&chan->push.ctxdma); nouveau_bo_vma_del(chan->push.buffer, &chan->push.vma); nouveau_bo_unmap(chan->push.buffer); if (chan->push.buffer && chan->push.buffer->pin_refcnt) nouveau_bo_unpin(chan->push.buffer); nouveau_bo_ref(NULL, &chan->push.buffer); + if (chan->error_notifier.buffer) { + nouveau_bo_unmap(chan->error_notifier.buffer); + drm_gem_object_unreference_unlocked( + &chan->error_notifier.buffer->gem); + } + nvif_notify_fini(&chan->error_notifier.notify); nvif_device_ref(NULL, &chan->device); kfree(chan); } *pchan = NULL; } static int nouveau_channel_prep(struct nouveau_drm *drm, struct nvif_device *device, @@ -272,16 +281,89 @@ nouveau_channel_dma(struct nouveau_drm *drm, struct nvif_device *device, return ret; } } while (ret && *oclass); nouveau_channel_del(pchan); return ret; } +static void +nouveau_channel_set_error_notifier(struct nouveau_channel *chan, u32 error) +{ + struct nouveau_cli *cli = (void *)nvif_client(chan->object); + struct nouveau_bo *nvbo = chan->error_notifier.buffer; + u32 off = chan->error_notifier.offset / sizeof(u32); + u64 nsec; + + if (!nvbo) + return; + + nsec = ktime_to_ns(ktime_get_real()); + + nouveau_bo_wr32(nvbo, off + 0, nsec); + nouveau_bo_wr32(nvbo, off + 1, nsec >> 32); + nouveau_bo_wr32(nvbo, off + 2, error); + nouveau_bo_wr32(nvbo, off + 3, 0xffffffff); + + NV_PRINTK(error, cli, "error notifier set to %d for ch %d\n", + error, chan->chid); +} + +int +nouveau_channel_init_error_notifier(struct nouveau_channel *chan, + struct drm_file *file, u32 handle, u32 offset) +{ + struct drm_gem_object *gem; + struct nouveau_bo *nvbo; + int ret; + + gem = drm_gem_object_lookup(chan->drm->dev, file, handle); + if (!gem) + return -ENOENT; + + nvbo = nouveau_gem_object(gem); + + ret = nouveau_bo_map(nvbo); + if (ret) { + drm_gem_object_unreference_unlocked(gem); + return ret; + } + + /* drop old ref on reassign */ + if (chan->error_notifier.buffer) { + nouveau_bo_unmap(chan->error_notifier.buffer); + drm_gem_object_unreference_unlocked( + &chan->error_notifier.buffer->gem); + } + + /* the gem reference held until channel deletion keeps this alive */ + chan->error_notifier.buffer = nvbo; + chan->error_notifier.offset = offset; + + /* initialize to a no-error status */ + nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 0, 0); + nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 1, 0); + nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 2, 0); + nouveau_bo_wr32(nvbo, offset / sizeof(u32) + 3, 0); + + return 0; +} + +static int +nouveau_chan_eevent_handler(struct nvif_notify *notify) +{ + struct nouveau_channel *chan + container_of(notify, typeof(*chan), error_notifier.notify); + const struct nvif_notify_eevent_rep *rep = notify->data; + + WARN_ON(rep->chid != chan->chid); + nouveau_channel_set_error_notifier(chan, rep->error); + return NVIF_NOTIFY_KEEP; +} static int nouveau_channel_init(struct nouveau_channel *chan, u32 vram, u32 gart) { struct nvif_device *device = chan->device; struct nouveau_cli *cli = (void *)nvif_client(&device->base); struct nvkm_mmu *mmu = nvxx_mmu(device); struct nvkm_sw_chan *swch; struct nv_dma_v0 args = {}; @@ -381,16 +463,29 @@ nouveau_channel_init(struct nouveau_channel *chan, u32 vram, u32 gart) if (ret) return ret; BEGIN_NV04(chan, NvSubSw, 0x0000, 1); OUT_RING (chan, chan->nvsw.handle); FIRE_RING (chan); } + /* error code events on why we're stuck/broken */ + ret = nvif_notify_init(chan->object, NULL, + nouveau_chan_eevent_handler, false, + CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT, + &(struct nvif_notify_eevent_req) { .chid = chan->chid }, + sizeof(struct nvif_notify_eevent_req), + sizeof(struct nvif_notify_eevent_rep), + &chan->error_notifier.notify); + if (WARN_ON(ret)) + return ret; + /* fini() does one put() */ + nvif_notify_get(&chan->error_notifier.notify); + /* initialise synchronisation */ return nouveau_fence(chan->drm)->context_new(chan); } int nouveau_channel_new(struct nouveau_drm *drm, struct nvif_device *device, u32 handle, u32 arg0, u32 arg1, struct nouveau_channel **pchan) diff --git a/drm/nouveau/nouveau_chan.h b/drm/nouveau/nouveau_chan.h index 8b3640f..730ddba 100644 --- a/drm/nouveau/nouveau_chan.h +++ b/drm/nouveau/nouveau_chan.h @@ -33,20 +33,30 @@ struct nouveau_channel { int ib_free; int ib_put; } dma; u32 user_get_hi; u32 user_get; u32 user_put; struct nvif_object *object; + + struct { + struct nouveau_bo *buffer; + u32 offset; + struct nvif_notify notify; + } error_notifier; }; int nouveau_channel_new(struct nouveau_drm *, struct nvif_device *, u32 handle, u32 arg0, u32 arg1, struct nouveau_channel **); void nouveau_channel_del(struct nouveau_channel **); int nouveau_channel_idle(struct nouveau_channel *); +struct drm_file; +int nouveau_channel_init_error_notifier(struct nouveau_channel *chan, + struct drm_file *file, + u32 handle, u32 offset); extern int nouveau_vram_pushbuf; #endif diff --git a/drm/nouveau/nouveau_drm.c b/drm/nouveau/nouveau_drm.c index 7d1297e..e3e6441 100644 --- a/drm/nouveau/nouveau_drm.c +++ b/drm/nouveau/nouveau_drm.c @@ -898,16 +898,17 @@ nouveau_ioctls[] = { DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_NEW, nouveau_gem_ioctl_new, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF, nouveau_gem_ioctl_pushbuf, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_PREP, nouveau_gem_ioctl_cpu_prep, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_FINI, nouveau_gem_ioctl_cpu_fini, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_INFO, nouveau_gem_ioctl_info, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), /* Staging ioctls */ DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_SET_TILING, nouveau_gem_ioctl_set_tiling, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF_2, nouveau_gem_ioctl_pushbuf_2, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), + DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_SET_ERROR_NOTIFIER, nouveau_gem_ioctl_set_error_notifier, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), }; long nouveau_drm_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct drm_file *filp = file->private_data; struct drm_device *dev = filp->minor->dev; long ret; diff --git a/drm/nouveau/nouveau_gem.c b/drm/nouveau/nouveau_gem.c index 884ab83..f8373cb 100644 --- a/drm/nouveau/nouveau_gem.c +++ b/drm/nouveau/nouveau_gem.c @@ -1032,16 +1032,45 @@ out_next: (chan->push.vma.offset + ((chan->dma.cur + 2) << 2)); req->suffix1 = 0x00000000; } return nouveau_abi16_put(abi16, ret); } int +nouveau_gem_ioctl_set_error_notifier(struct drm_device *dev, void *data, + struct drm_file *file_priv) +{ + struct nouveau_abi16 *abi16 = nouveau_abi16_get(file_priv, dev); + struct drm_nouveau_gem_set_error_notifier *req = data; + struct nouveau_abi16_chan *abi16_ch; + struct nouveau_channel *chan = NULL; + int ret = 0; + + if (unlikely(!abi16)) + return -ENOMEM; + + list_for_each_entry(abi16_ch, &abi16->channels, head) { + if (abi16_ch->chan->object->handle + == (NVDRM_CHAN | req->channel)) { + chan = abi16_ch->chan; + break; + } + } + if (!chan) + return nouveau_abi16_put(abi16, -ENOENT); + + ret = nouveau_channel_init_error_notifier(chan, file_priv, + req->buffer, req->offset); + + return nouveau_abi16_put(abi16, ret); +} + +int nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data, struct drm_file *file_priv) { struct drm_nouveau_gem_cpu_prep *req = data; struct drm_gem_object *gem; struct nouveau_bo *nvbo; bool no_wait = !!(req->flags & NOUVEAU_GEM_CPU_PREP_NOWAIT); bool write = !!(req->flags & NOUVEAU_GEM_CPU_PREP_WRITE); diff --git a/drm/nouveau/nouveau_gem.h b/drm/nouveau/nouveau_gem.h index 9e4323f..423ea5a 100644 --- a/drm/nouveau/nouveau_gem.h +++ b/drm/nouveau/nouveau_gem.h @@ -26,16 +26,18 @@ extern void nouveau_gem_object_close(struct drm_gem_object *, extern int nouveau_gem_ioctl_set_tiling(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_new(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_pushbuf(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_pushbuf_2(struct drm_device *, void *, struct drm_file *); +extern int nouveau_gem_ioctl_set_error_notifier(struct drm_device *, void *, + struct drm_file *); extern int nouveau_gem_ioctl_cpu_prep(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_cpu_fini(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_info(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_prime_pin(struct drm_gem_object *); diff --git a/drm/nouveau/nvkm/engine/fifo/base.c b/drm/nouveau/nvkm/engine/fifo/base.c index fa223f8..a5dc6c9 100644 --- a/drm/nouveau/nvkm/engine/fifo/base.c +++ b/drm/nouveau/nvkm/engine/fifo/base.c @@ -191,28 +191,76 @@ nvkm_fifo_uevent_ctor(struct nvkm_object *object, void *data, u32 size, void nvkm_fifo_uevent(struct nvkm_fifo *fifo) { struct nvif_notify_uevent_rep rep = { }; nvkm_event_send(&fifo->uevent, 1, 0, &rep, sizeof(rep)); } +static int +nvkm_fifo_eevent_ctor(struct nvkm_object *object, void *data, u32 size, + struct nvkm_notify *notify) +{ + union { + struct nvif_notify_eevent_req req; + } *req = data; + int ret; + + if (nvif_unvers(req->req)) { + notify->size = sizeof(struct nvif_notify_eevent_rep); + notify->types = 1; + notify->index = req->req.chid; + } + + return ret; +} + +static void +gk104_fifo_eevent_init(struct nvkm_event *event, int type, int index) +{ +} + +static void +gk104_fifo_eevent_fini(struct nvkm_event *event, int type, int index) +{ +} + +static const struct nvkm_event_func +nvkm_fifo_eevent_func = { + .ctor = nvkm_fifo_eevent_ctor, + .init = gk104_fifo_eevent_init, + .fini = gk104_fifo_eevent_fini, +}; + +void +nvkm_fifo_eevent(struct nvkm_fifo *fifo, u32 chid, u32 error) +{ + struct nvif_notify_eevent_rep rep = { + .error = error, + .chid = chid + }; + nvkm_event_send(&fifo->eevent, 1, chid, &rep, sizeof(rep)); +} + int _nvkm_fifo_channel_ntfy(struct nvkm_object *object, u32 type, struct nvkm_event **event) { struct nvkm_fifo *fifo = (void *)object->engine; switch (type) { case G82_CHANNEL_DMA_V0_NTFY_UEVENT: if (nv_mclass(object) >= G82_CHANNEL_DMA) { *event = &fifo->uevent; return 0; } break; + case CHANNEL_GPFIFO_ERROR_NOTIFIER_EEVENT: + *event = &fifo->eevent; + return 0; default: break; } return -EINVAL; } static int nvkm_fifo_chid(struct nvkm_fifo *priv, struct nvkm_object *object) @@ -242,16 +290,17 @@ nvkm_client_name_for_fifo_chid(struct nvkm_fifo *fifo, u32 chid) return nvkm_client_name(chan); } void nvkm_fifo_destroy(struct nvkm_fifo *priv) { kfree(priv->channel); + nvkm_event_fini(&priv->eevent); nvkm_event_fini(&priv->uevent); nvkm_event_fini(&priv->cevent); nvkm_engine_destroy(&priv->base); } int nvkm_fifo_create_(struct nvkm_object *parent, struct nvkm_object *engine, struct nvkm_oclass *oclass, @@ -271,12 +320,16 @@ nvkm_fifo_create_(struct nvkm_object *parent, struct nvkm_object *engine, priv->channel = kzalloc(sizeof(*priv->channel) * (max + 1), GFP_KERNEL); if (!priv->channel) return -ENOMEM; ret = nvkm_event_init(&nvkm_fifo_event_func, 1, 1, &priv->cevent); if (ret) return ret; + ret = nvkm_event_init(&nvkm_fifo_eevent_func, 1, max + 1, &priv->eevent); + if (ret) + return ret; + priv->chid = nvkm_fifo_chid; spin_lock_init(&priv->lock); return 0; } diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c index 52c22b0..659c05f 100644 --- a/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -30,16 +30,18 @@ #include <subdev/bar.h> #include <subdev/fb.h> #include <subdev/mmu.h> #include <subdev/timer.h> #include <nvif/class.h> #include <nvif/unpack.h> +#include "nouveau_drm.h" + #define _(a,b) { (a), ((1ULL << (a)) | (b)) } static const struct { u64 subdev; u64 mask; } fifo_engine[] = { _(NVDEV_ENGINE_GR , (1ULL << NVDEV_ENGINE_SW) | (1ULL << NVDEV_ENGINE_CE2)), _(NVDEV_ENGINE_MSPDEC , 0), @@ -567,16 +569,20 @@ gk104_fifo_intr_sched_ctxsw(struct gk104_fifo_priv *priv) u32 chid = load ? next : prev; (void)save; if (busy && chsw) { if (!(chan = (void *)priv->base.channel[chid])) continue; if (!(engine = gk104_fifo_engine(priv, engn))) continue; + + nvkm_fifo_eevent(&priv->base, chid, + NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT); + gk104_fifo_recover(priv, engine, chan); } } } static void gk104_fifo_intr_sched(struct gk104_fifo_priv *priv) { @@ -783,16 +789,19 @@ gk104_fifo_intr_fault(struct gk104_fifo_priv *priv, int unit) ec ? ec->name : ecunk, (u64)inst << 12, nvkm_client_name(engctx)); object = engctx; while (object) { switch (nv_mclass(object)) { case KEPLER_CHANNEL_GPFIFO_A: case MAXWELL_CHANNEL_GPFIFO_A: + nvkm_fifo_eevent(&priv->base, + ((struct nvkm_fifo_chan*)object)->chid, + NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT); gk104_fifo_recover(priv, engine, (void *)object); break; } object = object->parent; } nvkm_engctx_put(engctx); } @@ -853,16 +862,18 @@ gk104_fifo_intr_pbdma_0(struct gk104_fifo_priv *priv, int unit) nv_error(priv, "PBDMA%d:", unit); nvkm_bitfield_print(gk104_fifo_pbdma_intr_0, show); pr_cont("\n"); nv_error(priv, "PBDMA%d: ch %d [%s] subc %d mthd 0x%04x data 0x%08x\n", unit, chid, nvkm_client_name_for_fifo_chid(&priv->base, chid), subc, mthd, data); + nvkm_fifo_eevent(&priv->base, chid, + NOUVEAU_GEM_CHANNEL_PBDMA_ERROR); } nv_wr32(priv, 0x040108 + (unit * 0x2000), stat); } static const struct nvkm_bitfield gk104_fifo_pbdma_intr_1[] = { { 0x00000001, "HCE_RE_ILLEGAL_OP" }, { 0x00000002, "HCE_RE_ALIGNB" }, @@ -881,16 +892,18 @@ gk104_fifo_intr_pbdma_1(struct gk104_fifo_priv *priv, int unit) if (stat) { nv_error(priv, "PBDMA%d:", unit); nvkm_bitfield_print(gk104_fifo_pbdma_intr_1, stat); pr_cont("\n"); nv_error(priv, "PBDMA%d: ch %d %08x %08x\n", unit, chid, nv_rd32(priv, 0x040150 + (unit * 0x2000)), nv_rd32(priv, 0x040154 + (unit * 0x2000))); + nvkm_fifo_eevent(&priv->base, chid, + NOUVEAU_GEM_CHANNEL_PBDMA_ERROR); } nv_wr32(priv, 0x040148 + (unit * 0x2000), stat); } static void gk104_fifo_intr_runlist(struct gk104_fifo_priv *priv) { diff --git a/drm/nouveau/nvkm/engine/gr/gf100.c b/drm/nouveau/nvkm/engine/gr/gf100.c index e7c3e9e..8482078 100644 --- a/drm/nouveau/nvkm/engine/gr/gf100.c +++ b/drm/nouveau/nvkm/engine/gr/gf100.c @@ -32,16 +32,18 @@ #include <engine/fifo.h> #include <subdev/fb.h> #include <subdev/mc.h> #include <subdev/timer.h> #include <nvif/class.h> #include <nvif/unpack.h> +#include "nouveau_drm.h" + /******************************************************************************* * Zero Bandwidth Clear ******************************************************************************/ static void gf100_gr_zbc_clear_color(struct gf100_gr_priv *priv, int zbc) { if (priv->zbc_color[zbc].format) { @@ -1169,34 +1171,37 @@ gf100_gr_intr(struct nvkm_subdev *subdev) if (stat & 0x00000020) { nv_error(priv, "ILLEGAL_CLASS ch %d [0x%010llx %s] subc %d class 0x%04x mthd 0x%04x data 0x%08x\n", chid, inst << 12, nvkm_client_name(engctx), subc, class, mthd, data); nv_wr32(priv, 0x400100, 0x00000020); stat &= ~0x00000020; + nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY); } if (stat & 0x00100000) { nv_error(priv, "DATA_ERROR ["); nvkm_enum_print(nv50_data_error_names, code); pr_cont("] ch %d [0x%010llx %s] subc %d class 0x%04x mthd 0x%04x data 0x%08x\n", chid, inst << 12, nvkm_client_name(engctx), subc, class, mthd, data); nv_wr32(priv, 0x400100, 0x00100000); stat &= ~0x00100000; + nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY); } if (stat & 0x00200000) { nv_error(priv, "TRAP ch %d [0x%010llx %s]\n", chid, inst << 12, nvkm_client_name(engctx)); gf100_gr_trap_intr(priv); nv_wr32(priv, 0x400100, 0x00200000); stat &= ~0x00200000; + nvkm_fifo_eevent(pfifo, chid, NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY); } if (stat & 0x00080000) { gf100_gr_ctxctl_isr(priv); nv_wr32(priv, 0x400100, 0x00080000); stat &= ~0x00080000; } diff --git a/drm/nouveau/uapi/drm/nouveau_drm.h b/drm/nouveau/uapi/drm/nouveau_drm.h index 1331c28..c9c296a 100644 --- a/drm/nouveau/uapi/drm/nouveau_drm.h +++ b/drm/nouveau/uapi/drm/nouveau_drm.h @@ -143,16 +143,27 @@ struct drm_nouveau_gem_cpu_prep { uint32_t handle; uint32_t flags; }; struct drm_nouveau_gem_cpu_fini { uint32_t handle; }; +#define NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT 8 +#define NOUVEAU_GEM_CHANNEL_GR_ERROR_SW_NOTIFY 13 +#define NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT 31 +#define NOUVEAU_GEM_CHANNEL_PBDMA_ERROR 32 +struct drm_nouveau_gem_set_error_notifier { + uint32_t channel; + uint32_t buffer; + uint32_t offset; /* bytes, u32-aligned */ +}; + + #define DRM_NOUVEAU_GETPARAM 0x00 /* deprecated */ #define DRM_NOUVEAU_SETPARAM 0x01 /* deprecated */ #define DRM_NOUVEAU_CHANNEL_ALLOC 0x02 /* deprecated */ #define DRM_NOUVEAU_CHANNEL_FREE 0x03 /* deprecated */ #define DRM_NOUVEAU_GROBJ_ALLOC 0x04 /* deprecated */ #define DRM_NOUVEAU_NOTIFIEROBJ_ALLOC 0x05 /* deprecated */ #define DRM_NOUVEAU_GPUOBJ_FREE 0x06 /* deprecated */ #define DRM_NOUVEAU_NVIF 0x07 @@ -160,19 +171,21 @@ struct drm_nouveau_gem_cpu_fini { #define DRM_NOUVEAU_GEM_PUSHBUF 0x41 #define DRM_NOUVEAU_GEM_CPU_PREP 0x42 #define DRM_NOUVEAU_GEM_CPU_FINI 0x43 #define DRM_NOUVEAU_GEM_INFO 0x44 /* range 0x98..DRM_COMMAND_END (8 entries) is reserved for staging, unstable ioctls */ #define DRM_NOUVEAU_STAGING_IOCTL 0x58 #define DRM_NOUVEAU_GEM_SET_TILING (DRM_NOUVEAU_STAGING_IOCTL + 0x0) #define DRM_NOUVEAU_GEM_PUSHBUF_2 (DRM_NOUVEAU_STAGING_IOCTL + 0x1) +#define DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER (DRM_NOUVEAU_STAGING_IOCTL + 0x2) #define DRM_IOCTL_NOUVEAU_GEM_NEW DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_NEW, struct drm_nouveau_gem_new) #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF, struct drm_nouveau_gem_pushbuf) #define DRM_IOCTL_NOUVEAU_GEM_CPU_PREP DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_PREP, struct drm_nouveau_gem_cpu_prep) #define DRM_IOCTL_NOUVEAU_GEM_CPU_FINI DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_FINI, struct drm_nouveau_gem_cpu_fini) #define DRM_IOCTL_NOUVEAU_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_INFO, struct drm_nouveau_gem_info) /* staging ioctls */ #define DRM_IOCTL_NOUVEAU_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_SET_TILING, struct drm_nouveau_gem_set_tiling) #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF_2 DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF_2, struct drm_nouveau_gem_pushbuf_2) +#define DRM_IOCTL_NOUVEAU_GEM_SET_ERROR_NOTIFIER DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_SET_ERROR_NOTIFIER, struct drm_nouveau_gem_set_error_notifier) #endif /* __NOUVEAU_DRM_H__ */ -- 2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 2/5] don't verify route == owner in nvkm ioctl
HACK: Some objects we need to access from userspace are first created in kernel, and thus owned by the kernel layer. Bypass a safety mechanism that allows to only use userspace-created objects and create new ones (for userspace). The objects do not span across processes but are still owned by the same client. This will need to be fixed in some proper way. Will these objects be managed by userspace at some point? Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/nvkm/core/ioctl.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drm/nouveau/nvkm/core/ioctl.c b/drm/nouveau/nvkm/core/ioctl.c index 4459ff5..30164c7 100644 --- a/drm/nouveau/nvkm/core/ioctl.c +++ b/drm/nouveau/nvkm/core/ioctl.c @@ -474,18 +474,23 @@ nvkm_ioctl_path(struct nvkm_handle *parent, u32 type, u32 nr, u32 *path, nv_debug(object, "handle 0x%08x not found\n", path[nr]); return -ENOENT; } nvkm_namedb_put(handle); parent = handle; } if (owner != NVIF_IOCTL_V0_OWNER_ANY && owner != handle->route) { - nv_ioctl(object, "object route != owner\n"); - return -EACCES; + nv_ioctl(object, "object route != owner: route = 0x%x, owner = 0x$%x\n", + handle->route, owner); + /* return -EACCES; */ + + /* continue anyway - this is required for calling objects + * created in the kernel for this client from userspace, such + * as the channel fifo object or its gr obj. */ } *route = handle->route; *token = handle->token; if (ret = -EINVAL, type < ARRAY_SIZE(nvkm_ioctl_v0)) { if (nvkm_ioctl_v0[type].version == 0) ret = nvkm_ioctl_v0[type].func(handle, data, size); } -- 2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 3/5] gk104: channel priority/timeslice support
Change channel "priorities" by modifying the runlist timeslice value, thus giving more time for more important jobs before scheduling a context switch. A new KEPLER_SET_CHANNEL_PRIORITY mthd on fifo channel objects sets the priority to low, medium or high. Disable the channel and preempt it out, change the timeslice, and then enable the channel again. The shift bit in the timeslice register is left untouched (i.e., 3) as is the enable bit. Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/include/nvif/class.h | 10 ++++++ drm/nouveau/nvkm/engine/fifo/gk104.c | 60 ++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h index 28176e6..72c3b37 100644 --- a/drm/nouveau/include/nvif/class.h +++ b/drm/nouveau/include/nvif/class.h @@ -619,9 +619,19 @@ struct fermi_a_zbc_depth_v0 { #define FERMI_A_ZBC_DEPTH_V0_FMT_FP32 0x01 __u8 format; __u8 index; __u8 pad03[5]; __u32 ds; __u32 l2; }; +#define KEPLER_SET_CHANNEL_PRIORITY 0x00 +struct kepler_set_channel_priority_v0 { + __u8 version; +#define KEPLER_SET_CHANNEL_PRIORITY_LOW 0x00 +#define KEPLER_SET_CHANNEL_PRIORITY_MEDIUM 0x01 +#define KEPLER_SET_CHANNEL_PRIORITY_HIGH 0x02 + __u8 priority; + __u8 pad03[6]; +}; + #endif diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c index 659c05f..fda726d 100644 --- a/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -333,22 +333,82 @@ gk104_fifo_chan_fini(struct nvkm_object *object, bool suspend) gk104_fifo_runlist_update(priv, chan->engine); } gk104_fifo_chan_kick(chan); nv_wr32(priv, 0x800000 + (chid * 8), 0x00000000); return nvkm_fifo_channel_fini(&chan->base, suspend); } +static int +gk104_fifo_set_runlist_timeslice(struct gk104_fifo_priv *priv, + struct gk104_fifo_chan *chan, u8 slice) +{ + struct nvkm_gpuobj *base = nv_gpuobj(nv_object(chan)->parent); + u32 chid = chan->base.chid; + + nv_mask(priv, 0x800004 + (chid * 8), 0x00000800, 0x00000800); + WARN_ON(gk104_fifo_chan_kick(chan)); + nv_wo32(base, 0xf8, slice | 0x10003000); + nv_mask(priv, 0x800004 + (chid * 8), 0x00000400, 0x00000400); + nv_info(chan, "timeslice set to %d for %d\n", slice, chid); + + return 0; +} + +int +gk104_fifo_chan_set_priority(struct nvkm_object *object, void *data, u32 size) +{ + struct gk104_fifo_priv *priv = (void *)object->engine; + struct gk104_fifo_chan *chan = (void *)object; + union { + struct kepler_set_channel_priority_v0 v0; + } *args = data; + int ret; + u8 slice; + + if (nvif_unpack(args->v0, 0, 0, false)) { + switch (args->v0.priority) { + case KEPLER_SET_CHANNEL_PRIORITY_LOW: + slice = 64; /* << 3 == 512 us */ + break; + case KEPLER_SET_CHANNEL_PRIORITY_MEDIUM: + slice = 128; /* 1 ms */ + break; + case KEPLER_SET_CHANNEL_PRIORITY_HIGH: + slice = 255; /* 2 ms */ + break; + default: + return -EINVAL; + } + return gk104_fifo_set_runlist_timeslice(priv, chan, slice); + } + + return ret; +} + +int +gk104_fifo_chan_mthd(struct nvkm_object *object, u32 mthd, void *data, u32 size) +{ + switch (mthd) { + case KEPLER_SET_CHANNEL_PRIORITY: + return gk104_fifo_chan_set_priority(object, data, size); + default: + break; + } + return -EINVAL; +} + struct nvkm_ofuncs gk104_fifo_chan_ofuncs = { .ctor = gk104_fifo_chan_ctor, .dtor = _nvkm_fifo_channel_dtor, .init = gk104_fifo_chan_init, .fini = gk104_fifo_chan_fini, + .mthd = gk104_fifo_chan_mthd, .map = _nvkm_fifo_channel_map, .rd32 = _nvkm_fifo_channel_rd32, .wr32 = _nvkm_fifo_channel_wr32, .ntfy = _nvkm_fifo_channel_ntfy }; static struct nvkm_oclass gk104_fifo_sclass[] = { -- 2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 4/5] gk104: channel timeout detection
Enable the scheduling timeout error interrupt and set it to a low value to happen periodically, since it can be missed in HW in certain conditions. Increment a channel-specific counter in software in the error handler if the current channel hasn't advanced. Abort the channel once the timeout limit is hit (with the periodic granularity). The error notifier is set to NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT when this occurs. A new KEPLER_SET_CHANNEL_TIMEOUT mthd sets the timeout limit, in milliseconds. The interrupt granularity is set to 100 ms. Some status bit mangling in the sched error handler is cleaned up too. Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/include/nvif/class.h | 8 ++++ drm/nouveau/nvkm/engine/fifo/gk104.c | 92 +++++++++++++++++++++++++++++------- 2 files changed, 84 insertions(+), 16 deletions(-) diff --git a/drm/nouveau/include/nvif/class.h b/drm/nouveau/include/nvif/class.h index 72c3b37..9b568dc 100644 --- a/drm/nouveau/include/nvif/class.h +++ b/drm/nouveau/include/nvif/class.h @@ -620,18 +620,26 @@ struct fermi_a_zbc_depth_v0 { __u8 format; __u8 index; __u8 pad03[5]; __u32 ds; __u32 l2; }; #define KEPLER_SET_CHANNEL_PRIORITY 0x00 +#define KEPLER_SET_CHANNEL_TIMEOUT 0x01 + struct kepler_set_channel_priority_v0 { __u8 version; #define KEPLER_SET_CHANNEL_PRIORITY_LOW 0x00 #define KEPLER_SET_CHANNEL_PRIORITY_MEDIUM 0x01 #define KEPLER_SET_CHANNEL_PRIORITY_HIGH 0x02 __u8 priority; __u8 pad03[6]; }; +struct kepler_set_channel_timeout_v0 { + __u8 version; + __u8 pad03[3]; + __u32 timeout_ms; +}; + #endif diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c index fda726d..53a464d 100644 --- a/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -49,16 +49,20 @@ static const struct { _(NVDEV_ENGINE_MSVLD , 0), _(NVDEV_ENGINE_CE0 , 0), _(NVDEV_ENGINE_CE1 , 0), _(NVDEV_ENGINE_MSENC , 0), }; #undef _ #define FIFO_ENGINE_NR ARRAY_SIZE(fifo_engine) +#define CTXSW_STATUS_LOAD 5 +#define CTXSW_STATUS_SAVE 6 +#define CTXSW_STATUS_SWITCH 7 + struct gk104_fifo_engn { struct nvkm_gpuobj *runlist[2]; int cur_runlist; wait_queue_head_t wait; }; struct gk104_fifo_priv { struct nvkm_fifo base; @@ -83,18 +87,25 @@ struct gk104_fifo_base { struct gk104_fifo_chan { struct nvkm_fifo_chan base; u32 engine; enum { STOPPED, RUNNING, KILLED } state; + struct { + u32 sum_ms; + u32 limit_ms; + u32 gpfifo_get; + } timeout; }; +#define GRFIFO_TIMEOUT_CHECK_PERIOD_MS 100 + /******************************************************************************* * FIFO channel objects ******************************************************************************/ static void gk104_fifo_runlist_update(struct gk104_fifo_priv *priv, u32 engine) { struct nvkm_bar *bar = nvkm_bar(priv); @@ -288,16 +299,21 @@ gk104_fifo_chan_ctor(struct nvkm_object *parent, struct nvkm_object *engine, nv_wo32(base, 0x94, 0x30000001); nv_wo32(base, 0x9c, 0x00000100); nv_wo32(base, 0xac, 0x0000001f); nv_wo32(base, 0xe8, chan->base.chid); nv_wo32(base, 0xb8, 0xf8000000); nv_wo32(base, 0xf8, 0x10003080); /* 0x002310 */ nv_wo32(base, 0xfc, 0x10000010); /* 0x002350 */ bar->flush(bar); + + chan->timeout.sum_ms = 0; + chan->timeout.limit_ms = -1; + chan->timeout.gpfifo_get = 0; + return 0; } static int gk104_fifo_chan_init(struct nvkm_object *object) { struct nvkm_gpuobj *base = nv_gpuobj(object->parent); struct gk104_fifo_priv *priv = (void *)object->engine; @@ -381,21 +397,39 @@ gk104_fifo_chan_set_priority(struct nvkm_object *object, void *data, u32 size) } return gk104_fifo_set_runlist_timeslice(priv, chan, slice); } return ret; } int +gk104_fifo_chan_set_timeout(struct nvkm_object *object, void *data, u32 size) +{ + struct gk104_fifo_chan *chan = (void *)object; + union { + struct kepler_set_channel_timeout_v0 v0; + } *args = data; + int ret; + + if (nvif_unpack(args->v0, 0, 0, false)) { + chan->timeout.limit_ms = args->v0.timeout_ms; + } + + return ret; +} + +int gk104_fifo_chan_mthd(struct nvkm_object *object, u32 mthd, void *data, u32 size) { switch (mthd) { case KEPLER_SET_CHANNEL_PRIORITY: return gk104_fifo_chan_set_priority(object, data, size); + case KEPLER_SET_CHANNEL_TIMEOUT: + return gk104_fifo_chan_set_timeout(object, data, size); default: break; } return -EINVAL; } struct nvkm_ofuncs gk104_fifo_chan_ofuncs = { @@ -606,61 +640,83 @@ gk104_fifo_intr_bind(struct gk104_fifo_priv *priv) } static const struct nvkm_enum gk104_fifo_sched_reason[] = { { 0x0a, "CTXSW_TIMEOUT" }, {} }; +static bool +gk104_fifo_update_timeout(struct gk104_fifo_priv *priv, + struct gk104_fifo_chan *chan, u32 dt) +{ + u32 gpfifo_get = nv_rd32(priv, 0x88); + /* advancing, but slowly; reset counting */ + if (gpfifo_get != chan->timeout.gpfifo_get) + chan->timeout.sum_ms = 0; + + chan->timeout.sum_ms += dt; + chan->timeout.gpfifo_get = gpfifo_get; + + return chan->timeout.sum_ms > chan->timeout.limit_ms; +} + static void gk104_fifo_intr_sched_ctxsw(struct gk104_fifo_priv *priv) { struct nvkm_engine *engine; struct gk104_fifo_chan *chan; u32 engn; for (engn = 0; engn < ARRAY_SIZE(fifo_engine); engn++) { - u32 stat = nv_rd32(priv, 0x002640 + (engn * 0x04)); - u32 busy = (stat & 0x80000000); - u32 next = (stat & 0x07ff0000) >> 16; - u32 chsw = (stat & 0x00008000); - u32 save = (stat & 0x00004000); - u32 load = (stat & 0x00002000); - u32 prev = (stat & 0x000007ff); - u32 chid = load ? next : prev; - (void)save; - - if (busy && chsw) { + u32 engstat = nv_rd32(priv, 0x002640 + (engn * 0x04)); + u32 busy = (engstat & 0x80000000); + u32 next = (engstat & 0x07ff0000) >> 16; + u32 ctxstat = (engstat & 0x0000e000) >> 13; + u32 prev = (engstat & 0x000007ff); + + u32 chid = ctxstat == CTXSW_STATUS_LOAD ? next : prev; + u32 ctxsw_active = ctxstat == CTXSW_STATUS_LOAD || + ctxstat == CTXSW_STATUS_SAVE || + ctxstat == CTXSW_STATUS_SWITCH; + + if (busy && ctxsw_active) { if (!(chan = (void *)priv->base.channel[chid])) continue; if (!(engine = gk104_fifo_engine(priv, engn))) continue; - nvkm_fifo_eevent(&priv->base, chid, - NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT); - - gk104_fifo_recover(priv, engine, chan); + if (gk104_fifo_update_timeout(priv, chan, + GRFIFO_TIMEOUT_CHECK_PERIOD_MS)) { + nvkm_fifo_eevent(&priv->base, chid, + NOUVEAU_GEM_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT); + gk104_fifo_recover(priv, engine, chan); + } else { + nv_debug(priv, "fifo waiting for ctxsw %d ms on ch %d\n", + chan->timeout.sum_ms, chid); + } } } } static void gk104_fifo_intr_sched(struct gk104_fifo_priv *priv) { u32 intr = nv_rd32(priv, 0x00254c); u32 code = intr & 0x000000ff; const struct nvkm_enum *en; char enunk[6] = ""; en = nvkm_enum_find(gk104_fifo_sched_reason, code); if (!en) snprintf(enunk, sizeof(enunk), "UNK%02x", code); - nv_error(priv, "SCHED_ERROR [ %s ]\n", en ? en->name : enunk); + /* this is a normal situation, not so loud */ + nv_debug(priv, "SCHED_ERROR [ %s ]\n", en ? en->name : enunk); switch (code) { case 0x0a: gk104_fifo_intr_sched_ctxsw(priv); break; default: break; } @@ -1133,18 +1189,22 @@ gk104_fifo_init(struct nvkm_object *object) /* PBDMA[n].HCE */ for (i = 0; i < priv->spoon_nr; i++) { nv_wr32(priv, 0x040148 + (i * 0x2000), 0xffffffff); /* INTR */ nv_wr32(priv, 0x04014c + (i * 0x2000), 0xffffffff); /* INTREN */ } nv_wr32(priv, 0x002254, 0x10000000 | priv->user.bar.offset >> 12); + /* enable interrupts */ nv_wr32(priv, 0x002100, 0xffffffff); nv_wr32(priv, 0x002140, 0x7fffffff); + + /* engine context switch timeout */ + nv_wr32(priv, 0x002a0c, 0x80000000 | (1000 * GRFIFO_TIMEOUT_CHECK_PERIOD_MS)); return 0; } void gk104_fifo_dtor(struct nvkm_object *object) { struct gk104_fifo_priv *priv = (void *)object; int i; -- 2.1.4
Konsta Hölttä
2015-Aug-31 11:38 UTC
[Nouveau] [RFC PATCH v2 5/5] HACK force fences updated on error
Some error conditions just stop a channel and fences get stuck, so they either need to be kicked ready in overwriting hw seq numbers (as nvgpu does) or faked with a sw flag like this. This is just a hack as an example of what would be needed. Here, a channel id whose fences should be forced updated is passed upwards with the uevent response. Normally, this is -1 to match no channel id, but some error paths fake an update event with an explicit channel id. Note: if userspace has some meaningful timeouts on the fences, then they do finish but without any notification that the channel is broken now (how do you distinguish a too long gpu job from a stuck one?). In many cases, a channel needs to be shut down completely when it breaks (e.g., mmu fault). Signed-off-by: Konsta Hölttä <kholtta at nvidia.com> --- drm/nouveau/include/nvif/event.h | 1 + drm/nouveau/include/nvkm/engine/fifo.h | 2 +- drm/nouveau/nouveau_fence.c | 13 ++++++++----- drm/nouveau/nvkm/engine/fifo/base.c | 3 ++- drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +- drm/nouveau/nvkm/engine/fifo/gk104.c | 7 ++++++- drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +- 7 files changed, 20 insertions(+), 10 deletions(-) diff --git a/drm/nouveau/include/nvif/event.h b/drm/nouveau/include/nvif/event.h index d148b85..a9ff4ee 100644 --- a/drm/nouveau/include/nvif/event.h +++ b/drm/nouveau/include/nvif/event.h @@ -52,16 +52,17 @@ struct nvif_notify_conn_rep_v0 { }; struct nvif_notify_uevent_req { /* nvif_notify_req ... */ }; struct nvif_notify_uevent_rep { /* nvif_notify_rep ... */ + __u32 force_chid; }; struct nvif_notify_eevent_req { /* nvif_notify_req ... */ u32 chid; }; struct nvif_notify_eevent_rep { diff --git a/drm/nouveau/include/nvkm/engine/fifo.h b/drm/nouveau/include/nvkm/engine/fifo.h index cbca477..946eb68 100644 --- a/drm/nouveau/include/nvkm/engine/fifo.h +++ b/drm/nouveau/include/nvkm/engine/fifo.h @@ -117,15 +117,15 @@ extern struct nvkm_oclass *gf100_fifo_oclass; extern struct nvkm_oclass *gk104_fifo_oclass; extern struct nvkm_oclass *gk20a_fifo_oclass; extern struct nvkm_oclass *gk208_fifo_oclass; extern struct nvkm_oclass *gm204_fifo_oclass; extern struct nvkm_oclass *gm20b_fifo_oclass; int nvkm_fifo_uevent_ctor(struct nvkm_object *, void *, u32, struct nvkm_notify *); -void nvkm_fifo_uevent(struct nvkm_fifo *); +void nvkm_fifo_uevent(struct nvkm_fifo *, u32 force_chid); void nvkm_fifo_eevent(struct nvkm_fifo *, u32 chid, u32 error); void nv04_fifo_intr(struct nvkm_subdev *); int nv04_fifo_context_attach(struct nvkm_object *, struct nvkm_object *); #endif diff --git a/drm/nouveau/nouveau_fence.c b/drm/nouveau/nouveau_fence.c index 38bccb0..b7d9987 100644 --- a/drm/nouveau/nouveau_fence.c +++ b/drm/nouveau/nouveau_fence.c @@ -123,50 +123,53 @@ nouveau_fence_context_put(struct kref *fence_ref) void nouveau_fence_context_free(struct nouveau_fence_chan *fctx) { kref_put(&fctx->fence_ref, nouveau_fence_context_put); } static int -nouveau_fence_update(struct nouveau_channel *chan, struct nouveau_fence_chan *fctx) +nouveau_fence_update(struct nouveau_channel *chan, + struct nouveau_fence_chan *fctx, u32 force_chid) { struct nouveau_fence *fence; int drop = 0; u32 seq = fctx->read(chan); + bool force = force_chid == chan->chid; while (!list_empty(&fctx->pending)) { fence = list_entry(fctx->pending.next, typeof(*fence), head); - if ((int)(seq - fence->base.seqno) < 0) + if ((int)(seq - fence->base.seqno) < 0 && !force) break; drop |= nouveau_fence_signal(fence); } return drop; } static int nouveau_fence_wait_uevent_handler(struct nvif_notify *notify) { struct nouveau_fence_chan *fctx container_of(notify, typeof(*fctx), notify); + const struct nvif_notify_uevent_rep *rep = notify->data; unsigned long flags; int ret = NVIF_NOTIFY_KEEP; spin_lock_irqsave(&fctx->lock, flags); if (!list_empty(&fctx->pending)) { struct nouveau_fence *fence; struct nouveau_channel *chan; fence = list_entry(fctx->pending.next, typeof(*fence), head); chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock)); - if (nouveau_fence_update(fence->channel, fctx)) + if (nouveau_fence_update(fence->channel, fctx, rep->force_chid)) ret = NVIF_NOTIFY_DROP; } spin_unlock_irqrestore(&fctx->lock, flags); return ret; } void @@ -278,17 +281,17 @@ nouveau_fence_emit(struct nouveau_fence *fence, struct nouveau_channel *chan) kref_get(&fctx->fence_ref); trace_fence_emit(&fence->base); ret = fctx->emit(fence); if (!ret) { fence_get(&fence->base); spin_lock_irq(&fctx->lock); - if (nouveau_fence_update(chan, fctx)) + if (nouveau_fence_update(chan, fctx, -1)) nvif_notify_put(&fctx->notify); list_add_tail(&fence->head, &fctx->pending); spin_unlock_irq(&fctx->lock); } return ret; } @@ -302,17 +305,17 @@ nouveau_fence_done(struct nouveau_fence *fence) struct nouveau_channel *chan; unsigned long flags; if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->base.flags)) return true; spin_lock_irqsave(&fctx->lock, flags); chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock)); - if (chan && nouveau_fence_update(chan, fctx)) + if (chan && nouveau_fence_update(chan, fctx, -1)) nvif_notify_put(&fctx->notify); spin_unlock_irqrestore(&fctx->lock, flags); } return fence_is_signaled(&fence->base); } static long nouveau_fence_wait_legacy(struct fence *f, bool intr, long wait) diff --git a/drm/nouveau/nvkm/engine/fifo/base.c b/drm/nouveau/nvkm/engine/fifo/base.c index a5dc6c9..e35d711 100644 --- a/drm/nouveau/nvkm/engine/fifo/base.c +++ b/drm/nouveau/nvkm/engine/fifo/base.c @@ -184,19 +184,20 @@ nvkm_fifo_uevent_ctor(struct nvkm_object *object, void *data, u32 size, notify->types = 1; notify->index = 0; } return ret; } void -nvkm_fifo_uevent(struct nvkm_fifo *fifo) +nvkm_fifo_uevent(struct nvkm_fifo *fifo, u32 force_chid) { struct nvif_notify_uevent_rep rep = { + .force_chid = force_chid }; nvkm_event_send(&fifo->uevent, 1, 0, &rep, sizeof(rep)); } static int nvkm_fifo_eevent_ctor(struct nvkm_object *object, void *data, u32 size, struct nvkm_notify *notify) { diff --git a/drm/nouveau/nvkm/engine/fifo/gf100.c b/drm/nouveau/nvkm/engine/fifo/gf100.c index b745252..ca86dfe 100644 --- a/drm/nouveau/nvkm/engine/fifo/gf100.c +++ b/drm/nouveau/nvkm/engine/fifo/gf100.c @@ -732,17 +732,17 @@ gf100_fifo_intr_engine_unit(struct gf100_fifo_priv *priv, int engn) u32 inte = nv_rd32(priv, 0x002628); u32 unkn; nv_wr32(priv, 0x0025a8 + (engn * 0x04), intr); for (unkn = 0; unkn < 8; unkn++) { u32 ints = (intr >> (unkn * 0x04)) & inte; if (ints & 0x1) { - nvkm_fifo_uevent(&priv->base); + nvkm_fifo_uevent(&priv->base, -1); ints &= ~1; } if (ints) { nv_error(priv, "ENGINE %d %d %01x", engn, unkn, ints); nv_mask(priv, 0x002628, ints, 0); } } } diff --git a/drm/nouveau/nvkm/engine/fifo/gk104.c b/drm/nouveau/nvkm/engine/fifo/gk104.c index 53a464d..caecef1 100644 --- a/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -908,16 +908,18 @@ gk104_fifo_intr_fault(struct gk104_fifo_priv *priv, int unit) object = engctx; while (object) { switch (nv_mclass(object)) { case KEPLER_CHANNEL_GPFIFO_A: case MAXWELL_CHANNEL_GPFIFO_A: nvkm_fifo_eevent(&priv->base, ((struct nvkm_fifo_chan*)object)->chid, NOUVEAU_GEM_CHANNEL_FIFO_ERROR_MMU_ERR_FLT); + nvkm_fifo_uevent(&priv->base, + ((struct nvkm_fifo_chan*)object)->chid); gk104_fifo_recover(priv, engine, (void *)object); break; } object = object->parent; } nvkm_engctx_put(engctx); } @@ -978,18 +980,21 @@ gk104_fifo_intr_pbdma_0(struct gk104_fifo_priv *priv, int unit) nv_error(priv, "PBDMA%d:", unit); nvkm_bitfield_print(gk104_fifo_pbdma_intr_0, show); pr_cont("\n"); nv_error(priv, "PBDMA%d: ch %d [%s] subc %d mthd 0x%04x data 0x%08x\n", unit, chid, nvkm_client_name_for_fifo_chid(&priv->base, chid), subc, mthd, data); + nvkm_fifo_eevent(&priv->base, chid, NOUVEAU_GEM_CHANNEL_PBDMA_ERROR); + + nvkm_fifo_uevent(&priv->base, chid); } nv_wr32(priv, 0x040108 + (unit * 0x2000), stat); } static const struct nvkm_bitfield gk104_fifo_pbdma_intr_1[] = { { 0x00000001, "HCE_RE_ILLEGAL_OP" }, { 0x00000002, "HCE_RE_ALIGNB" }, @@ -1030,17 +1035,17 @@ gk104_fifo_intr_runlist(struct gk104_fifo_priv *priv) nv_wr32(priv, 0x002a00, 1 << engn); mask &= ~(1 << engn); } } static void gk104_fifo_intr_engine(struct gk104_fifo_priv *priv) { - nvkm_fifo_uevent(&priv->base); + nvkm_fifo_uevent(&priv->base, -1); } static void gk104_fifo_intr(struct nvkm_subdev *subdev) { struct gk104_fifo_priv *priv = (void *)subdev; u32 mask = nv_rd32(priv, 0x002140); u32 stat = nv_rd32(priv, 0x002100) & mask; diff --git a/drm/nouveau/nvkm/engine/fifo/nv04.c b/drm/nouveau/nvkm/engine/fifo/nv04.c index 043e429..1749614 100644 --- a/drm/nouveau/nvkm/engine/fifo/nv04.c +++ b/drm/nouveau/nvkm/engine/fifo/nv04.c @@ -536,17 +536,17 @@ nv04_fifo_intr(struct nvkm_subdev *subdev) if (device->card_type == NV_50) { if (stat & 0x00000010) { stat &= ~0x00000010; nv_wr32(priv, 0x002100, 0x00000010); } if (stat & 0x40000000) { nv_wr32(priv, 0x002100, 0x40000000); - nvkm_fifo_uevent(&priv->base); + nvkm_fifo_uevent(&priv->base, -1); stat &= ~0x40000000; } } if (stat) { nv_warn(priv, "unknown intr 0x%08x\n", stat); nv_mask(priv, NV03_PFIFO_INTR_EN_0, stat, 0x00000000); nv_wr32(priv, NV03_PFIFO_INTR_0, stat); -- 2.1.4
Ben Skeggs
2015-Sep-01 13:26 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote:> Hi there, > > Resending these now that they've had some more polish and testing, and I heard > that Ben's vacation is over :-) > > These patches work as a starting point for more explicit error mechanisms and > better robustness. At the moment, when a job hangs or faults, it seems that > nouveau doesn't quite know how to handle the situation and often results in a > hang. Some of these situations would require either completely resetting the > gpu, and/or a complex path for only recovering the broken channel.Yes, this has been an issue that's been on my "really nice to implement" list for quite a long time now, and only gotten around to getting it partially done (as you can see). Thanks for looking at this!> > To start, I worked on support for letting userspace know what exactly happened. > Proper recovery would come later. The "error notifier" in the first patch is a > simple shared buffer between kernel and userspace. Its error codes match > nvgpu's. Alternatively, the status could be queried with an ioctl, but that > would be considerably more heavyweight. I'd like to know if the event mechanism > is meant for these kinds of events at all (engines notify errors upwards to the > drm layer). Another alternative would probably be to register the same buffer > to all necessary engines separately in nvif method calls? Or register it to > just one (e.g., fifo) and get that engine when errors happen in others (e.g., > gr)? And drm handles probably wouldn't fit there? Please comment on this; I > wrote this before understanding the mthd mechanism.For the moment, I've just got a couple of high-level comments/questions on these patches so far before we worry about nit-picking the details: Is there any real need to have a full-blown notifier-style object to pass this info back to userspace? The DRM has an event mechanism where userspace could poll on its fd to get events, which is what I'd assume you'd be doing inside of fence waits with the error notifier? NVIF events are actually hooked up to this mechanism already, so userspace could directly request them and cut quite a lot out of that first patch. But, the NVKM-level passing of these events back over NVIF is pretty much where I was going with the design too. So, good to see you've done that :)> > Additionally, priority and timeout management for separate channels in flight > on the gpu is added in two patches. Neither is exactly what the name says, but > the effect is the same, and this is what nvgpu does currently. Those two > patches call the fifo channel object's methods directly from userspace, so a > hack is added in the nvif path to accept that. The objects are NEW'd from > kernel space, so calling from userspace isn't allowed, as it appears. How > should this be managed in a clean way?Is there any requirement to be able to adjust the priority of in-flight channels? Or would it be satisfactory to have these as arguments to the channel creation functions? Either way is fine, the latter is preferred if there's no use case for adjusting it afterwards though. I'm attempting to get some better channel handling (you might have noticed the current stuff the kernel uses is a bit fragile, it's evolved over a long time and several generations of channel classes) in place in time for 4.4, part of which will likely involve giving userspace better control over the arguments that get passed at channel creation time. As for the permissions problem, that can be resolved fairly easily I think, I'll play with a few ideas and see what I come up with. Thanks, Ben.> > Also, since nouveau often hangs on errors, the userspace hangs too (waiting on > a fence). The final patch attempts to fix this in a couple of specific error > paths to forcibly update all fences to be finished. I'd like to hear how that > would be handled properly - consider the patch just a proof-of-concept and > sample of what would be necessary. > > I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate > any high-level comments on if I've understood anything, especially the event > and nvif/method mechanisms (I use the latter from userspace with a hack > constructed from the perfmon branch seen here earlier into nvidia's internal > libdrm-equivalent). The fence-forcing thing is something that is necessary with > the error notifiers (at least with our userspace that waits really long or > infinitely on fences). I'm working specifically on Tegra and don't know much > about the desktop's userspace details, so I may be biased in some areas. > > I'd be happy to write sample tests on e.g. libdrm for the new methods once the > kernel patches would get to a good shape, if that's required for accepting new > features. I tested these to work as a proof-of-concept on Jetson TK1, and the > code is adapted from the latest nvgpu. > > The patches can also be found in http://github.com/sooda/nouveau and are based > on a version of gnurou/staging. > > Thanks! > Konsta (sooda in IRC) > > Konsta Hölttä (5): > notify channel errors to userspace > don't verify route == owner in nvkm ioctl > gk104: channel priority/timeslice support > gk104: channel timeout detection > HACK force fences updated on error > > drm/nouveau/include/nvif/class.h | 20 ++++ > drm/nouveau/include/nvif/event.h | 12 +++ > drm/nouveau/include/nvkm/engine/fifo.h | 5 +- > drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++ > drm/nouveau/nouveau_chan.h | 10 ++ > drm/nouveau/nouveau_drm.c | 1 + > drm/nouveau/nouveau_fence.c | 13 ++- > drm/nouveau/nouveau_gem.c | 29 ++++++ > drm/nouveau/nouveau_gem.h | 2 + > drm/nouveau/nvkm/core/ioctl.c | 9 +- > drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++- > drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +- > drm/nouveau/nvkm/engine/fifo/gk104.c | 166 ++++++++++++++++++++++++++++++--- > drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +- > drm/nouveau/nvkm/engine/gr/gf100.c | 5 + > drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++ > 16 files changed, 415 insertions(+), 25 deletions(-) > > -- > 2.1.4 > > _______________________________________________ > Nouveau mailing list > Nouveau at lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/nouveau
Konsta Hölttä
2015-Sep-02 10:01 UTC
[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
On 01/09/15 16:26, Ben Skeggs wrote:> On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote: >> Hi there, >> >> Resending these now that they've had some more polish and testing, and I heard >> that Ben's vacation is over :-) >> >> These patches work as a starting point for more explicit error mechanisms and >> better robustness. At the moment, when a job hangs or faults, it seems that >> nouveau doesn't quite know how to handle the situation and often results in a >> hang. Some of these situations would require either completely resetting the >> gpu, and/or a complex path for only recovering the broken channel. > Yes, this has been an issue that's been on my "really nice to > implement" list for quite a long time now, and only gotten around to > getting it partially done (as you can see). Thanks for looking at > this! > >> To start, I worked on support for letting userspace know what exactly happened. >> Proper recovery would come later. The "error notifier" in the first patch is a >> simple shared buffer between kernel and userspace. Its error codes match >> nvgpu's. Alternatively, the status could be queried with an ioctl, but that >> would be considerably more heavyweight. I'd like to know if the event mechanism >> is meant for these kinds of events at all (engines notify errors upwards to the >> drm layer). Another alternative would probably be to register the same buffer >> to all necessary engines separately in nvif method calls? Or register it to >> just one (e.g., fifo) and get that engine when errors happen in others (e.g., >> gr)? And drm handles probably wouldn't fit there? Please comment on this; I >> wrote this before understanding the mthd mechanism. > For the moment, I've just got a couple of high-level > comments/questions on these patches so far before we worry about > nit-picking the details: > > Is there any real need to have a full-blown notifier-style object to > pass this info back to userspace? The DRM has an event mechanism > where userspace could poll on its fd to get events, which is what I'd > assume you'd be doing inside of fence waits with the error notifier? > NVIF events are actually hooked up to this mechanism already, so > userspace could directly request them and cut quite a lot out of that > first patch. But, the NVKM-level passing of these events back over > NVIF is pretty much where I was going with the design too. So, good > to see you've done that :)I hadn't heard about that event mechanism yet - do you mean just drm_poll() and ->event_wait that is woken up in usif_notify()? That doesn't have any values that could be passed to userland. Is there a way for that too already? The motivation for the current approach is that it's lightweight, not requiring any kernel calls (and is compatible with nvidia gl drivers...) I don't actually know all the details how higher-level parts of the driver use the buffer :( just assuming things atm. I'd guess that the buffer would be read after each submit (after a fence wait has finished, which is a separate thing) if checking the status is required. My understanding is that at least the desktop driver has (or used to have) a shared buffer for lots of common stuff and this would be part of it (hence the offset field), which would complicate the whole driver stack if this were done differently. In Tegra's libdrm-equivalent this is initialized on channel creation, so could be in fact part of channel init args in nouveau? I considered also that the buffer could be just one page and allocated in the kernel, but as said, our current drivers want to be able to pass a separate buffer there.> >> Additionally, priority and timeout management for separate channels in flight >> on the gpu is added in two patches. Neither is exactly what the name says, but >> the effect is the same, and this is what nvgpu does currently. Those two >> patches call the fifo channel object's methods directly from userspace, so a >> hack is added in the nvif path to accept that. The objects are NEW'd from >> kernel space, so calling from userspace isn't allowed, as it appears. How >> should this be managed in a clean way? > Is there any requirement to be able to adjust the priority of > in-flight channels? Or would it be satisfactory to have these as > arguments to the channel creation functions? Either way is fine, the > latter is preferred if there's no use case for adjusting it afterwards > though.Maybe not exactly, but there is a technical requirement forced by nvidia userspace driver that doesn't enforce this to be done at creation time, but allows it at any time. I looked at that option too when starting to write these patches and yes, that (specifying at init time) would've simplified stuff. It's certainly possible that the timeout/priority is set only later in userspace after finding out that this is e.g. a webgl process requiring to finish faster. That's one plausible usecase.> > I'm attempting to get some better channel handling (you might have > noticed the current stuff the kernel uses is a bit fragile, it's > evolved over a long time and several generations of channel classes) > in place in time for 4.4, part of which will likely involve giving > userspace better control over the arguments that get passed at channel > creation time.The current stuff is at least not trivial to grasp initially :) I'd love to also help to write documentation for lots of things if time permits - I've been taking some notes for myself, but a general nvkm glossary and general reasoning about how things are done the way they are would definitely help new folks to get to know the codebase (wasn't easy for me). The problem is that I only could write about how things are done, not why. Getting to know all that would require lots of discussion (and time).> > As for the permissions problem, that can be resolved fairly easily I > think, I'll play with a few ideas and see what I come up with.Good to hear, thanks! A bit related question to this: what exactly determines the level of control kernel vs userspace have for the channels? Management rights for nouveau_bo's? Looks like some stuff of nouveau_chan.c could live in userspace too. Cheers, Konsta> > Thanks, > Ben. > >> Also, since nouveau often hangs on errors, the userspace hangs too (waiting on >> a fence). The final patch attempts to fix this in a couple of specific error >> paths to forcibly update all fences to be finished. I'd like to hear how that >> would be handled properly - consider the patch just a proof-of-concept and >> sample of what would be necessary. >> >> I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate >> any high-level comments on if I've understood anything, especially the event >> and nvif/method mechanisms (I use the latter from userspace with a hack >> constructed from the perfmon branch seen here earlier into nvidia's internal >> libdrm-equivalent). The fence-forcing thing is something that is necessary with >> the error notifiers (at least with our userspace that waits really long or >> infinitely on fences). I'm working specifically on Tegra and don't know much >> about the desktop's userspace details, so I may be biased in some areas. >> >> I'd be happy to write sample tests on e.g. libdrm for the new methods once the >> kernel patches would get to a good shape, if that's required for accepting new >> features. I tested these to work as a proof-of-concept on Jetson TK1, and the >> code is adapted from the latest nvgpu. >> >> The patches can also be found in http://github.com/sooda/nouveau and are based >> on a version of gnurou/staging. >> >> Thanks! >> Konsta (sooda in IRC) >> >> Konsta Hölttä (5): >> notify channel errors to userspace >> don't verify route == owner in nvkm ioctl >> gk104: channel priority/timeslice support >> gk104: channel timeout detection >> HACK force fences updated on error >> >> drm/nouveau/include/nvif/class.h | 20 ++++ >> drm/nouveau/include/nvif/event.h | 12 +++ >> drm/nouveau/include/nvkm/engine/fifo.h | 5 +- >> drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++ >> drm/nouveau/nouveau_chan.h | 10 ++ >> drm/nouveau/nouveau_drm.c | 1 + >> drm/nouveau/nouveau_fence.c | 13 ++- >> drm/nouveau/nouveau_gem.c | 29 ++++++ >> drm/nouveau/nouveau_gem.h | 2 + >> drm/nouveau/nvkm/core/ioctl.c | 9 +- >> drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++- >> drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +- >> drm/nouveau/nvkm/engine/fifo/gk104.c | 166 ++++++++++++++++++++++++++++++--- >> drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +- >> drm/nouveau/nvkm/engine/gr/gf100.c | 5 + >> drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++ >> 16 files changed, 415 insertions(+), 25 deletions(-) >> >> -- >> 2.1.4 >> >> _______________________________________________ >> Nouveau mailing list >> Nouveau at lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/nouveau > -- > To unsubscribe from this list: send the line "unsubscribe linux-tegra" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Alexandre Courbot
2015-Sep-03 09:01 UTC
[Nouveau] [RFC PATCH v2 1/5] notify channel errors to userspace
On Mon, Aug 31, 2015 at 8:38 PM, Konsta Hölttä <kholtta at nvidia.com> wrote:> Let userspace know about some specific types of errors via a shared > buffer registered with a new ioctl, NOUVEAU_GEM_SET_ERROR_NOTIFIER. > Once set, the notifier buffer is used until the channel is destroyed. > The exceptional per-channel error conditions are signaled upwards from > the nvkm layer via the event/notify mechanism and written to the buffer > which holds the latest error occurred, readable from userspace. > > Some error situations render a channel completely stuck, which may > require even discarding and reinitializing it or the gpu completely. > Once a specific type of error happens, no others are expected until > recovery. Passing the error to userspace in a shared buffer enables the > graphics driver to periodically check in a light-weight way if anything > went wrong (typically, after a submit); e.g., the GL robustness > extension requires to detect that the context has dead. > > The notifier currently signals idle timeout, sw notify, mmu fault and > illegal pbdma error conditions.Hi Konsta, Before this series can be considered for upstream, I suppose it will need to be ported to the latest codebase. This will likely require a rewrite due to the heavy changes that occured, but hopefully it will also allow you to address some of the hacks more cleanly.
Possibly Parallel Threads
- [PATCH 0/5][GIT PULL] Miscellaneous fixes for GF100+
- [PATCH 0/5] Improve Robust Channel (RC) recovery for Turing
- [PATCH] drm/nv84+: fix fence context seqno's
- [PATCH RFC 05/20] pm: reorganize the nvif interface
- [PATCH RFC 00/20] expose global performance counters