Timur Tabi
2024-Jul-30 21:49 UTC
[PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
On Tue, 2024-07-30 at 02:21 +0200, Danilo Krummrich wrote:
> Hi Timur,
>
> > This method has the advantage that it allows the buffers to be parsed
> > even when the logging ELF file is not available to the user. However,
> > it has the disadvantage the debubfs entries need to remain until the
>
> debubfs -> debugfs

Ok

> > driver is unloaded.
> >
> > The buffers are exposed via debugfs. If GSP-RM fails to initialize,
> > then Nouveau immediately shuts down the GSP interface. This would
> > normally also deallocate the logging buffers, thereby preventing the
> > user from capturing the debug logs.
> >
> > To avoid this, introduce the keep-gsp-logging command line parameter.
> > If specified, and if at least one logging buffer has content, then
> > Nouveau will migrate these buffers into new debugfs entries that are
> > retained until the driver unloads.
> >
> > An end-user can capture the logs using the following commands:
> >
> >     cp /sys/kernel/debug/dri/<path>/loginit loginit
> >     cp /sys/kernel/debug/dri/<path>/logrm logrm
> >     cp /sys/kernel/debug/dri/<path>/logintr logintr
> >     cp /sys/kernel/debug/dri/<path>/logpmu logpmu
> >
> > where <path> is the PCI ID of the GPU (e.g. 0000:65:00.0).
>
> I think this depends on the actual device type. Could also be a platform or even
> aux device.
>
> No need to add more examples though, I think it's enough if we just change
> this to:
>
> "where (for a PCI device) <path> is the PCI ID of the GPU (e.g. 0000:65:00.0)."

I'll make the change, but AFAIK non-PCI versions of Turing+ GPUs do not exist.
Even the GH100, which uses an nvlink data connection between the GPU and the
ARM host, still uses PCI for control.

> > +/**
> > + * nvif_log - structure for tracking logging buffers
> > + * @entry: an entry in a list of struct nvif_logs
> > + * @shutdown: pointer to function to call to clean up
> > + *
> > + * Structure used to track logging buffers so that they can be cleaned up
> > + * when the driver exits.
>
> "Driver exit" is a bit of an undefined term, the closest thing is probably
> "driver detach" (from a device). In this case I think you actually mean
> "module exit" though.

Yes, I use driver and module interchangeably, but I guess that's not accurate.

> > + *
> > + * The @shutdown function is called when the driver exits. It should free all
>
> Same here.

I replaced 'driver' with 'module'.

> > + * backing resources, such as logging buffers.
> > + */
> > +struct nvif_log {
> > +	struct list_head entry;
> > +	void (*shutdown)(struct nvif_log *log);
> > +};
> > +
> > +#define NVIF_LOGS_DECLARE(_log) \
> > +	struct nvif_log _log = { LIST_HEAD_INIT(_log.entry) }
>
> If you declare this as
>
> #define NVIF_LOGS_DECLARE(_log) \
> 	struct nvif_log _log = { LIST_HEAD_INIT(_log.entry), nvif_log_shutdown }
>
> and change the signature of nvif_log_shutdown() to
>
> static inline void nvif_log_shutdown(struct nvif_log *logs)
>
> you can just call
>
> gsp_logs.shutdown(&gsp_logs);
>
> in nouveau_drm_exit().
>
> Admittedly, maybe a bit too sneaky though. :)

gsp_logs.shutdown(&gsp_logs) -- are you sure you want this? This is some
weird C++ wanna-be code, IMHO. I don't think this is an improvement. I'd
rather keep it as-is.

> > +/*
> > + * GSP-RM uses a pseudo-class mechanism to define of a variety of per-"engine"
> > + * data structures, and each engine has a "class ID" genererated by a
> > + * pre-processor. This is the class ID for the PMU.
> > + */
> > +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722
> > +
> > +/**
> > + * rpc_ucode_libos_print_v1E_08 - RPC payload for libos print buffers
>
> Is this structure versioned? If so, does it relate to a specific GSP-RM version?

Yes, the structure is versioned, but it's a little weird. When a new RPC is
added, the developer can choose to use any version number that matches the
"current" VGPU version or earlier. src/nvidia/generated/g_rpc-structures.h
is generated from a special "def" file, and the developer just chooses
whatever number he wants. Typically, the developer will choose whatever
version that VGPU is at when he starts working on the code. But since this
is the first and only version of rpc_ucode_libos_print, he could have chosen
v01_00.

So "previous RPC version (if it exists)" < "new RPC version" <= "current
VGPU version" at the time the RPC was created/updated.

> > +static void
> > +r535_gsp_libos_debugfs_init(struct nvkm_gsp *gsp)
> > +{
> > +	struct dentry *dir, *dir_init;
> > +	struct dentry *dir_intr = NULL, *dir_rm = NULL, *dir_pmu = NULL;
> > +	struct device *dev = gsp->subdev.device->dev;
> > +
> > +	/* Each GPU has a subdir based on its device name, so find it */
> > +	struct drm_device *drm_dev = dev_get_drvdata(dev);
> > +
> > +	if (!drm_dev || !drm_dev->debugfs_root) {
>
> I don't think we need any of those checks, please remove them.

Ok

> > +	nvkm_debug(&gsp->subdev, "created debugfs GSP-RM logging entries\n");
> > +
> > +	if (keep_gsp_logging) {
> > +		nvkm_warn(&gsp->subdev,
>
> nvkm_info() is probably enough.

Ok.

> > +static void r535_gsp_retain_logging(struct nvkm_gsp *gsp)
> > +{
> > +	struct device *dev = gsp->subdev.device->dev;
> > +	struct dentry *root, *dir;
> > +	struct r535_gsp_log *log;
> > +	int ret;
> > +
> > +	/* If we were told not to keep logs, then don't. */
> > +	if (!keep_gsp_logging)
> > +		return;
> > +
> > +	/* Check to make sure at least one buffer has data. */
> > +	if (is_empty(&gsp->blob_init) && is_empty(&gsp->blob_intr) &&
> > +	    is_empty(&gsp->blob_rm) && is_empty(&gsp->blob_rm)) {
> > +		nvkm_warn(&gsp->subdev, "all logging buffers are empty\n");
> > +		return;
> > +	}
> > +
> > +	/* Find the 'dri' root debugfs entry. Every GPU has a dentry under it */
> > +	root = debugfs_lookup("dri", NULL);
> > +	if (IS_ERR(root)) {
>
> I don't think this can ever happen. This entry is created in drm_core_init().

So first, the check is wrong. It should be "if (!root)".

However, I think it can happen. drm_core_init() does not fail if
debugfs_create_dir() fails:

	drm_debugfs_root = debugfs_create_dir("dri", NULL);

It doesn't even print a warning message. It just keeps going. So I think
there should be some error checking somewhere.

I tested this, and if drm_core_init() fails to create the dentry, then
r535_gsp_retain_logging() will just keep going trying to create debugfs
entries, because a root of NULL is actually valid, and the entries are
created in /sys/kernel/debug/0000:65:00.0/ instead of
/sys/kernel/debug/dri/0000:65:00.0/.

In fact, I think I found a small bug in dput():

	void dput(struct dentry *dentry)
	{
		if (!dentry)
			return;

This should probably be "if (IS_ERR_OR_NULL(dentry))".
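[For illustration, a minimal sketch of the corrected lookup check described
above might look like the following; the surrounding context, the use of
dev_name(), and the dput() placement are assumptions for the example, not the
actual patch.]

	/* Find the 'dri' root debugfs entry. Every GPU has a dentry under it. */
	root = debugfs_lookup("dri", NULL);
	if (!root) {
		/*
		 * debugfs_lookup() returns NULL (not an ERR_PTR) when the entry
		 * does not exist, e.g. if drm_core_init() failed to create the
		 * "dri" directory.  Bail out instead of silently creating
		 * entries directly under /sys/kernel/debug/.
		 */
		nvkm_warn(&gsp->subdev, "debugfs 'dri' directory not found\n");
		return;
	}

	dir = debugfs_create_dir(dev_name(dev), root);
	dput(root);	/* debugfs_lookup() takes a reference on the dentry */
	if (IS_ERR(dir))
		return;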
Danilo Krummrich
2024-Jul-31 14:54 UTC
[PATCH 2/2] [v6] drm/nouveau: expose GSP-RM logging buffers via debugfs
On Tue, Jul 30, 2024 at 09:49:15PM +0000, Timur Tabi wrote:
> > "Driver exit" is a bit of an undefined term, the closest thing is probably
> > "driver detach" (from a device). In this case I think you actually mean
> > "module exit" though.
>
> Yes, I use driver and module interchangeably, but I guess that's not
> accurate.

Yes, it's not. They're entirely different things. Technically you don't even
need a module to register a driver. And within a module you can register N
drivers, including zero. It's just that in the typical application, you
register a driver in module_init() and unregister it in module_exit().

> > > + * backing resources, such as logging buffers.
> > > + */
> > > +struct nvif_log {
> > > +	struct list_head entry;
> > > +	void (*shutdown)(struct nvif_log *log);
> > > +};
> > > +
> > > +#define NVIF_LOGS_DECLARE(_log) \
> > > +	struct nvif_log _log = { LIST_HEAD_INIT(_log.entry) }
> >
> > If you declare this as
> >
> > #define NVIF_LOGS_DECLARE(_log) \
> > 	struct nvif_log _log = { LIST_HEAD_INIT(_log.entry), nvif_log_shutdown }
> >
> > and change the signature of nvif_log_shutdown() to
> >
> > static inline void nvif_log_shutdown(struct nvif_log *logs)
> >
> > you can just call
> >
> > gsp_logs.shutdown(&gsp_logs);
> >
> > in nouveau_drm_exit().
> >
> > Admittedly, maybe a bit too sneaky though. :)
>
> gsp_logs.shutdown(&gsp_logs) -- are you sure you want this? This is some
> weird C++ wanna-be code, IMHO. I don't think this is an improvement. I'd
> rather keep it as-is.

That's why I wrote "maybe a bit too sneaky". :)

I think what I asked for originally in one of the last versions of this patch
is having both, struct nvif_log (exactly the way you have it) and a separate
struct nvif_logs:

	struct nvif_logs {
		struct list_head head;
	};

Then you use this in NVIF_LOGS_DECLARE() and nvif_log_shutdown()

	static inline void nvif_log_shutdown(struct nvif_logs *logs)

and in nouveau_drm_exit() you just pass &gsp_logs.

	nvif_log_shutdown(&gsp_logs);

This way things are more type safe, i.e. nvif_log_shutdown() can't be called
with a random list_head and struct nvif_log::shutdown can't be called with
the "head instance" of struct nvif_log.

> > > +/*
> > > + * GSP-RM uses a pseudo-class mechanism to define of a variety of per-"engine"
> > > + * data structures, and each engine has a "class ID" genererated by a
> > > + * pre-processor. This is the class ID for the PMU.
> > > + */
> > > +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722
> > > +
> > > +/**
> > > + * rpc_ucode_libos_print_v1E_08 - RPC payload for libos print buffers
> >
> > Is this structure versioned? If so, does it relate to a specific GSP-RM version?
>
> Yes, the structure is versioned, but it's a little weird. When a new RPC is
> added, the developer can choose to use any version number that matches the
> "current" VGPU version or earlier. src/nvidia/generated/g_rpc-structures.h
> is generated from a special "def" file, and the developer just chooses
> whatever number he wants. Typically, the developer will choose whatever
> version that VGPU is at when he starts working on the code. But since this
> is the first and only version of rpc_ucode_libos_print, he could have chosen
> v01_00.
>
> So "previous RPC version (if it exists)" < "new RPC version" <= "current
> VGPU version" at the time the RPC was created/updated.

Pretty interesting, thanks for clarifying!
(No need to document it though.)

> > > +static void r535_gsp_retain_logging(struct nvkm_gsp *gsp)
> > > +{
> > > +	struct device *dev = gsp->subdev.device->dev;
> > > +	struct dentry *root, *dir;
> > > +	struct r535_gsp_log *log;
> > > +	int ret;
> > > +
> > > +	/* If we were told not to keep logs, then don't. */
> > > +	if (!keep_gsp_logging)
> > > +		return;
> > > +
> > > +	/* Check to make sure at least one buffer has data. */
> > > +	if (is_empty(&gsp->blob_init) && is_empty(&gsp->blob_intr) &&
> > > +	    is_empty(&gsp->blob_rm) && is_empty(&gsp->blob_rm)) {
> > > +		nvkm_warn(&gsp->subdev, "all logging buffers are empty\n");
> > > +		return;
> > > +	}
> > > +
> > > +	/* Find the 'dri' root debugfs entry. Every GPU has a dentry under it */
> > > +	root = debugfs_lookup("dri", NULL);
> > > +	if (IS_ERR(root)) {
> >
> > I don't think this can ever happen. This entry is created in drm_core_init().

Oh, I see how this reads. I think I didn't quite finish the reply. Sorry for
the confusion.

> So first, the check is wrong. It should be "if (!root)".

That's what I meant, it can't be an ERR_PTR.

> However, I think it can happen. drm_core_init() does not fail if
> debugfs_create_dir() fails:
>
> 	drm_debugfs_root = debugfs_create_dir("dri", NULL);
>
> It doesn't even print a warning message. It just keeps going. So I think
> there should be some error checking somewhere.

For DRM probably not, since the ERR_PTR is honored by other debugfs functions
as described in [1].

> I tested this, and if drm_core_init() fails to create the dentry, then
> r535_gsp_retain_logging() will just keep going trying to create debugfs
> entries, because a root of NULL is actually valid, and the entries are
> created in /sys/kernel/debug/0000:65:00.0/ instead of
> /sys/kernel/debug/dri/0000:65:00.0/.

This is because debugfs_lookup() doesn't return an ERR_PTR, but NULL. It'd
probably be better to go along with what is documented in [1] if
debugfs_lookup() returned ERR_PTR(-ENOENT) when no entry was found. (This is
where I was heading to in my previous reply.)

> In fact, I think I found a small bug in dput():
>
> 	void dput(struct dentry *dentry)
> 	{
> 		if (!dentry)
> 			return;
>
> This should probably be "if (IS_ERR_OR_NULL(dentry))".

Yes, I agree, good catch.

[1] https://elixir.bootlin.com/linux/v6.10/source/fs/debugfs/inode.c#L586
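[For illustration, the type-safe split suggested above could be put together
roughly as follows; struct nvif_log, struct nvif_logs, nvif_log_shutdown(),
and the nouveau_drm_exit() call are taken from the discussion, while the loop
body and the NVIF_LOGS_DECLARE() wording are assumptions for the sketch.]

	struct nvif_log {
		struct list_head entry;
		void (*shutdown)(struct nvif_log *log);
	};

	struct nvif_logs {
		struct list_head head;
	};

	#define NVIF_LOGS_DECLARE(_logs) \
		struct nvif_logs _logs = { LIST_HEAD_INIT(_logs.head) }

	/* Walk the registered logs and let each one free its backing buffers. */
	static inline void nvif_log_shutdown(struct nvif_logs *logs)
	{
		struct nvif_log *log, *n;

		list_for_each_entry_safe(log, n, &logs->head, entry)
			log->shutdown(log);
	}

	/* In nouveau_drm_exit(): */
	nvif_log_shutdown(&gsp_logs);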