Timur Tabi
2024-Jul-30 21:49 UTC
[PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
On Tue, 2024-07-30 at 02:21 +0200, Danilo Krummrich wrote:> Hi Timur, > > > This method has the advantage that it allows the buffers to be parsed > > even when the logging ELF file is not available to the user. However, > > it has the disadvantage the debubfs entries need to remain until the > > debubfs -> debugfsOk> > driver is unloaded. > > > > The buffers are exposed via debugfs. If GSP-RM fails to initialize, > > then Nouveau immediately shuts down the GSP interface. This would > > normally also deallocate the logging buffers, thereby preventing the > > user from capturing the debug logs. > > > > To avoid this, introduce the keep-gsp-logging command line parameter. > > If specified, and if at least one logging buffer has content, then > > Nouveau will migrate these buffers into new debugfs entries that are > > retained until the driver unloads. > > > > An end-user can capture the logs using the following commands: > > > > cp /sys/kernel/debug/dri/<path>/loginit loginit > > cp /sys/kernel/debug/dri/<path>/logrm logrm > > cp /sys/kernel/debug/dri/<path>/logintr logintr > > cp /sys/kernel/debug/dri/<path>/logpmu logpmu > > > > where <path> is the PCI ID of the GPU (e.g. 0000:65:00.0). > > I think this depends on the actual device type. Could also be a platform or even > aux device. > > No need to add more examples though, I think it's enough if we just change > this to: > > "where (for a PCI device) <path> is the PCI ID of the GPU (e.g. 0000:65:00.0)."I'll make the change, but AFAIK non-PCI versions of Turing+ GPUs do not exist. Even the GH100, which uses an nvlink data connection between GPU to the ARM host, still uses PCI for control.> > > > +/** > > + * nvif_log - structure for tracking logging buffers > > + * @entry: an entry in a list of struct nvif_logs > > + * @shutdown: pointer to function to call to clean up > > + * > > + * Structure used to track logging buffers so that they can be cleaned > > up > > + * when the driver exits. > > "Driver exit" is a bit of an undefined term, the closest thing is probably > "driver detach" (from a device). In this case I think you actually mean > "module > exit" though.Yes, I use driver and module interchangeably, but I guess that's not accurate.> > + * > > + * The @shutdown function is called when the driver exits. It should > > free all > > Same here.I replaced 'driver' with 'module'.> > > + * backing resources, such as logging buffers. > > + */ > > +struct nvif_log { > > + struct list_head entry; > > + void (*shutdown)(struct nvif_log *log); > > +}; > > + > > +#define NVIF_LOGS_DECLARE(_log) \ > > + struct nvif_log _log = { LIST_HEAD_INIT(_log.entry) } > > If you declare this as > > #define NVIF_LOGS_DECLARE(_log) \ > struct nvif_log _log = { LIST_HEAD_INIT(_log.entry), nvif_log_shutdown } > > and change the signature of nvif_log_shutdown() to > > static inline void nvif_log_shutdown(struct nvif_log *logs) > > you can just call > > gsp_logs.shutdown(&gsp_logs); > > in nouveau_drm_exit(). > > Admittedly, maybe a bit too sneaky though. :)gsp_logs.shutdown(&gsp_logs) -- are you sure you want this? ?This is some weird C++ wanna-be code, IMHO. ?I don't think this is an improvement. I'd rather keep it as-is.> > +/* > > + * GSP-RM uses a pseudo-class mechanism to define of a variety of per-"engine" > > + * data structures, and each engine has a "class ID" genererated by a > > + * pre-processor. This is the class ID for the PMU. > > + */ > > +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722 > > + > > +/** > > + * rpc_ucode_libos_print_v1E_08 - RPC payload for libos print buffers > > Is this structure versioned? If so, does it relate to a specific GSP-RM version?Yes, the structure is versioned, but it's a little weird. When a new RPC is added, the developer can choose to use any version number that matches the "current" VGPU version or earlier. src/nvidia/generated/g_rpc-structures.h is generated from a special "def" file, and the developer just chooses whatever number he wants. Typically, the developer will choose whatever version that VGPU is at when he starts working on the code. But since this is the first and only version of rpc_ucode_libos_print, he could have chosen v01_00. So "previous RPC version (if it exists)" < "new RPC version" <= "current VGPU version" at the time the RPC was created/updated.> > +static void > > +r535_gsp_libos_debugfs_init(struct nvkm_gsp *gsp) > > +{ > > + struct dentry *dir, *dir_init; > > + struct dentry *dir_intr = NULL, *dir_rm = NULL, *dir_pmu = NULL; > > + struct device *dev = gsp->subdev.device->dev; > > + > > + /* Each GPU has a subdir based on its device name, so find it */ > > + struct drm_device *drm_dev = dev_get_drvdata(dev); > > + > > + if (!drm_dev || !drm_dev->debugfs_root) { > > I don't think we need any of those checks, please remove them.Ok> > + nvkm_debug(&gsp->subdev, "created debugfs GSP-RM logging entries\n"); > > + > > + if (keep_gsp_logging) { > > + nvkm_warn(&gsp->subdev, > > nvkm_info() is probably enough.Ok.> > +static void r535_gsp_retain_logging(struct nvkm_gsp *gsp) > > +{ > > + struct device *dev = gsp->subdev.device->dev; > > + struct dentry *root, *dir; > > + struct r535_gsp_log *log; > > + int ret; > > + > > + /* If we were told not to keep logs, then don't. */ > > + if (!keep_gsp_logging) > > + return; > > + > > + /* Check to make sure at least one buffer has data. */ > > + if (is_empty(&gsp->blob_init) && is_empty(&gsp->blob_intr) && > > + is_empty(&gsp->blob_rm) && is_empty(&gsp->blob_rm)) { > > + nvkm_warn(&gsp->subdev, "all logging buffers are empty\n"); > > + return; > > + } > > + > > + /* Find the 'dri' root debugfs entry. Every GPU has a dentry under it */ > > + root = debugfs_lookup("dri", NULL); > > + if (IS_ERR(root)) { > > I don't think this can ever happen. This entry is created in drm_core_init().So first, the check is wrong. It should be "if (!root)". However, I think it can happen. drm_core_init() does not fail if debugfs_create_dir() fails: drm_debugfs_root = debugfs_create_dir("dri", NULL); It doesn't even print a warning message. It just keeps going. So I think there should be some error checking somewhere. I tested this, and if drm_core_init() fails to create the dentry, then r535_gsp_retain_logging() will just keep going trying to create debugfs entries, because a root of NULL is actually valid, and the entries are created in /sys/kernel/debug/0000:65:00.0/ instead of /sys/kernel/debug/dri/0000:65:00.0/ In fact, I think I found a small bug in dput(): void dput(struct dentry *dentry) { if (!dentry) return; This should probably be "if (IS_ERR_OR_NULL(dentry))".
Danilo Krummrich
2024-Jul-31 14:54 UTC
[PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
On Tue, Jul 30, 2024 at 09:49:15PM +0000, Timur Tabi wrote:> > "Driver exit" is a bit of an undefined term, the closest thing is probably > > "driver detach" (from a device). In this case I think you actually mean > > "module > > exit" though. > > Yes, I use driver and module interchangeably, but I guess that's not > accurate.Yes, it's not. They're entirely different things. Technically you don't even need a module to register a driver. And within a module you can register N drivers, including zero. It's just that in the typical application, you register a driver in module_init() and unregister it in module_exit().> > > + * backing resources, such as logging buffers. > > > + */ > > > +struct nvif_log { > > > + struct list_head entry; > > > + void (*shutdown)(struct nvif_log *log); > > > +}; > > > + > > > +#define NVIF_LOGS_DECLARE(_log) \ > > > + struct nvif_log _log = { LIST_HEAD_INIT(_log.entry) } > > > > If you declare this as > > > > #define NVIF_LOGS_DECLARE(_log) \ > > struct nvif_log _log = { LIST_HEAD_INIT(_log.entry), nvif_log_shutdown } > > > > and change the signature of nvif_log_shutdown() to > > > > static inline void nvif_log_shutdown(struct nvif_log *logs) > > > > you can just call > > > > gsp_logs.shutdown(&gsp_logs); > > > > in nouveau_drm_exit(). > > > > Admittedly, maybe a bit too sneaky though. :) > > gsp_logs.shutdown(&gsp_logs) -- are you sure you want this? ?This is some > weird C++ wanna-be code, IMHO. ?I don't think this is an improvement. I'd > rather keep it as-is.That's why I wrote "maybe a bit too sneaky". :) I think what I asked for originally in one of the last versions of this patch is having both, struct nvif_log (exactly the way you have it) and a separate struct nvif_logs: struct nvif_logs { struct list_head head; }; Then you use this in NVIF_LOGS_DECLARE() and nvif_log_shutdown() static inline void nvif_log_shutdown(struct nvif_logs *logs) and in nouveau_drm_exit() you just pass &gsp_logs. nvif_log_shutdown(&gsp_logs); This way things are more type safe, i.e. nvif_log_shutdown() can't be called with a random list_head and struct nvif_log::shutdown can't be called with the "head instance" of struct struct nvif_log.> > > > +/* > > > + * GSP-RM uses a pseudo-class mechanism to define of a variety of per-"engine" > > > + * data structures, and each engine has a "class ID" genererated by a > > > + * pre-processor. This is the class ID for the PMU. > > > + */ > > > +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722 > > > + > > > +/** > > > + * rpc_ucode_libos_print_v1E_08 - RPC payload for libos print buffers > > > > Is this structure versioned? If so, does it relate to a specific GSP-RM version? > > Yes, the structure is versioned, but it's a little weird. When a new RPC is > added, the developer can choose to use any version number that matches the > "current" VGPU version or earlier. src/nvidia/generated/g_rpc-structures.h > is generated from a special "def" file, and the developer just chooses > whatever number he wants. Typically, the developer will choose whatever > version that VGPU is at when he starts working on the code. But since this > is the first and only version of rpc_ucode_libos_print, he could have chosen > v01_00. > > So "previous RPC version (if it exists)" < "new RPC version" <= "current > VGPU version" at the time the RPC was created/updated.Pretty interesting, thanks for clarifying! (No need to document it though.)> > > +static void r535_gsp_retain_logging(struct nvkm_gsp *gsp) > > > +{ > > > + struct device *dev = gsp->subdev.device->dev; > > > + struct dentry *root, *dir; > > > + struct r535_gsp_log *log; > > > + int ret; > > > + > > > + /* If we were told not to keep logs, then don't. */ > > > + if (!keep_gsp_logging) > > > + return; > > > + > > > + /* Check to make sure at least one buffer has data. */ > > > + if (is_empty(&gsp->blob_init) && is_empty(&gsp->blob_intr) && > > > + is_empty(&gsp->blob_rm) && is_empty(&gsp->blob_rm)) { > > > + nvkm_warn(&gsp->subdev, "all logging buffers are empty\n"); > > > + return; > > > + } > > > + > > > + /* Find the 'dri' root debugfs entry. Every GPU has a dentry under it */ > > > + root = debugfs_lookup("dri", NULL); > > > + if (IS_ERR(root)) { > > > > I don't think this can ever happen. This entry is created in drm_core_init().Oh, I see how this reads. I think I didn't quite finish the reply. Sorry for the confusion.> > So first, the check is wrong. It should be "if (!root)".That's what I meant, it can't be an ERR_PTR.> > However, I think it can happen. drm_core_init() does not fail if > debugfs_create_dir() fails: > > drm_debugfs_root = debugfs_create_dir("dri", NULL); > > It doesn't even print a warning message. It just keeps going. So I think > there should be some error checking somewhere.For DRM probably not, since the ERR_PTR is honored by other debugfs functions as described in [1].> > I tested this, and if drm_core_init() fails to create the dentry, then > r535_gsp_retain_logging() will just keep going trying to create debugfs > entries, because a root of NULL is actually valid, and the entries are > created in /sys/kernel/debug/0000:65:00.0/ instead of > /sys/kernel/debug/dri/0000:65:00.0/This is because debugfs_lookup() doesn't return an ERR_PTR, but NULL. It'd probably better go along with what is documented in [1] if debugfs_lookup() would return ERR_PTR(-ENOENT) if no entry was found. (This is where I was heading to in my previous reply.)> > In fact, I think I found a small bug in dput(): > > void dput(struct dentry *dentry) > { > if (!dentry) > return; > > This should probably be "if (IS_ERR_OR_NULL(dentry))".Yes, I agree, good catch. [1] https://elixir.bootlin.com/linux/v6.10/source/fs/debugfs/inode.c#L586
Possibly Parallel Threads
- [PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
- [PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
- [PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
- [PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
- [PATCH 2/2] [v5] drm/nouveau: expose GSP-RM logging buffers via debugfs