Hello everyone, To summarize the issue I'm trying to address here: Processes can allocate resources through a file descriptor without being held responsible for it. Especially for the DRM graphics driver subsystem this is rather problematic. Modern games tend to allocate huge amounts of system memory through the DRM drivers to make it accessible to GPU rendering. But even outside of the DRM subsystem this problem exists and it is trivial to exploit. See the following simple example of using memfd_create(): fd = memfd_create("test", 0); while (1) write(fd, page, 4096); Compile this and you can bring down any standard desktop system within seconds. The background is that the OOM killer will kill every processes in the system, but just not the one which holds the only reference to the memory allocated by the memfd. Those problems where brought up on the mailing list multiple times now [1][2][3], but without any final conclusion how to address them. Since file descriptors are considered shared the process can not directly held accountable for allocations made through them. Additional to that file descriptors can also easily move between processes as well. So what this patch set does is to instead of trying to account the allocated memory to a specific process it adds a callback to struct file_operations which the OOM killer can use to query the specific OOM badness of this file reference. This badness is then divided by the file_count, so that every process using a shmem file, DMA-buf or DRM driver will get it's equal amount of OOM badness. Callbacks are then implemented for the two core users (memfd and DMA-buf) as well as 72 DRM based graphics drivers. The result is that the OOM killer can now much better judge if a process is worth killing to free up memory. Resulting a quite a bit better system stability in OOM situations, especially while running games. The only other possibility I can see would be to change the accounting of resources whenever references to the file structure change, but this would mean quite some additional overhead for a rather common operation. Additionally I think trying to limit device driver allocations using cgroups is orthogonal to this effort. While cgroups is very useful, it works on per process limits and tries to enforce a collaborative model on memory management while the OOM killer enforces a competitive model. Please comment and/or review, we have that problem flying around for years now and are not at a point where we finally need to find a solution for this. Regards, Christian. [1] https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html [2] https://lkml.org/lkml/2018/1/18/543 [3] https://lkml.org/lkml/2021/2/4/799
Christian König
2022-May-31 09:59 UTC
[Nouveau] [PATCH 01/13] fs: add OOM badness callback to file_operatrations struct
From: Andrey Grodzovsky <andrey.grodzovsky at amd.com> This allows file_operation implementations to specify an additional badness for the OOM killer when they allocate memory on behalf of userspace. This badness is per file because file descriptor and therefor the reference to the allocated memory can migrate between processes. For easy debugging this also adds printing of the per file oom badness to fdinfo inside procfs. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- fs/proc/fd.c | 4 ++++ include/linux/fs.h | 1 + 2 files changed, 5 insertions(+) diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 172c86270b31..d1905c05cb3a 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -59,6 +59,10 @@ static int seq_show(struct seq_file *m, void *v) real_mount(file->f_path.mnt)->mnt_id, file_inode(file)->i_ino); + if (file->f_op->oom_badness) + seq_printf(m, "oom_badness:\t%lu\n", + file->f_op->oom_badness(file)); + /* show_fd_locks() never deferences files so a stale value is safe */ show_fd_locks(m, file, files); if (seq_has_overflowed(m)) diff --git a/include/linux/fs.h b/include/linux/fs.h index bbde95387a23..d5222543aeb0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1995,6 +1995,7 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + long (*oom_badness)(struct file *); } __randomize_layout; struct inode_operations { -- 2.25.1
Christian König
2022-May-31 09:59 UTC
[Nouveau] [PATCH 02/13] oom: take per file badness into account
From: Andrey Grodzovsky <andrey.grodzovsky at amd.com> Try to make better decisions which process to kill based on per file OOM badness. For this the per file oom badness is queried from every file which supports that and divided by the number of references to that file structure. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> --- mm/oom_kill.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 49d7df39b02d..8a4d05e9568b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -52,6 +52,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/oom.h> +#include <linux/fdtable.h> + int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; @@ -189,6 +191,19 @@ static bool should_dump_unreclaim_slab(void) return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru); } +/* Sumup how much resources are bound by files opened. */ +static int oom_file_badness(const void *points, struct file *file, unsigned n) +{ + long badness; + + if (!file->f_op->oom_badness) + return 0; + + badness = file->f_op->oom_badness(file); + *((long *)points) += DIV_ROUND_UP(badness, file_count(file)); + return 0; +} + /** * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate @@ -229,6 +244,12 @@ long oom_badness(struct task_struct *p, unsigned long totalpages) */ points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + mm_pgtables_bytes(p->mm) / PAGE_SIZE; + + /* + * Add how much memory a task uses in opened files, e.g. device drivers. + */ + iterate_fd(p->files, 0, oom_file_badness, &points); + task_unlock(p); /* Normalize to oom_score_adj units */ -- 2.25.1
Christian König
2022-May-31 09:59 UTC
[Nouveau] [PATCH 03/13] mm: shmem: provide oom badness for shmem files
This gives the OOM killer an additional hint which processes are referencing shmem files with potentially no other accounting for them. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- mm/shmem.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/shmem.c b/mm/shmem.c index 4b2fea33158e..a4ad92a16968 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2179,6 +2179,11 @@ unsigned long shmem_get_unmapped_area(struct file *file, return inflated_addr; } +static long shmem_oom_badness(struct file *file) +{ + return i_size_read(file_inode(file)) >> PAGE_SHIFT; +} + #ifdef CONFIG_NUMA static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) { @@ -3780,6 +3785,7 @@ EXPORT_SYMBOL(shmem_aops); static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, .get_unmapped_area = shmem_get_unmapped_area, + .oom_badness = shmem_oom_badness, #ifdef CONFIG_TMPFS .llseek = shmem_file_llseek, .read_iter = shmem_file_read_iter, -- 2.25.1
Christian König
2022-May-31 09:59 UTC
[Nouveau] [PATCH 04/13] dma-buf: provide oom badness for DMA-buf files
For now just return the size of the DMA-buf in pages as badness in the OOM situation. That should probably be extended to be in control of the exporter in the future. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/dma-buf/dma-buf.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index a2f9a1815e38..bdd4e8767cd3 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -388,6 +388,12 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file) spin_unlock(&dmabuf->name_lock); } +static long dma_buf_oom_badness(struct file *file) +{ + /* TODO: This should probably be controlled by a flag */ + return i_size_read(file_inode(file)) >> PAGE_SHIFT; +} + static const struct file_operations dma_buf_fops = { .release = dma_buf_file_release, .mmap = dma_buf_mmap_internal, @@ -396,6 +402,7 @@ static const struct file_operations dma_buf_fops = { .unlocked_ioctl = dma_buf_ioctl, .compat_ioctl = compat_ptr_ioctl, .show_fdinfo = dma_buf_show_fdinfo, + .oom_badness = dma_buf_oom_badness, }; /* -- 2.25.1
Christian König
2022-May-31 09:59 UTC
[Nouveau] [PATCH 05/13] drm/gem: adjust per file OOM badness on handling buffers
From: Andrey Grodzovsky <andrey.grodzovsky at amd.com> Large amounts of VRAM are usually not CPU accessible, so they are not mapped into the processes address space. But since the device drivers usually support swapping buffers from VRAM to system memory we can still run into an out of memory situation when userspace starts to allocate to much. This patch gives the OOM killer another hint which process is holding references to memory resources. A GEM helper is provided and automatically used for all drivers using the DEFINE_DRM_GEM_FOPS() and DEFINE_DRM_GEM_CMA_FOPS() macros. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> --- drivers/gpu/drm/drm_file.c | 19 +++++++++++++++++++ drivers/gpu/drm/drm_gem.c | 5 +++++ include/drm/drm_file.h | 9 +++++++++ include/drm/drm_gem.h | 1 + include/drm/drm_gem_cma_helper.h | 1 + 5 files changed, 35 insertions(+) diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c index ed25168619fc..1959a5b7029e 100644 --- a/drivers/gpu/drm/drm_file.c +++ b/drivers/gpu/drm/drm_file.c @@ -1049,3 +1049,22 @@ unsigned long drm_get_unmapped_area(struct file *file, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ EXPORT_SYMBOL_GPL(drm_get_unmapped_area); #endif /* CONFIG_MMU */ + + +/** + * drm_oom_badness() - get oom badness for struct drm_file + * @f: struct drm_file to get the badness from + * + * Return how many pages are allocated for this client. + */ +long drm_oom_badness(struct file *f) +{ + + struct drm_file *file_priv = f->private_data; + + if (file_priv) + return atomic_long_read(&file_priv->f_oom_badness); + + return 0; +} +EXPORT_SYMBOL(drm_oom_badness); diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index eb0c2d041f13..768b28b198cd 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -256,6 +256,7 @@ drm_gem_object_release_handle(int id, void *ptr, void *data) drm_gem_remove_prime_handles(obj, file_priv); drm_vma_node_revoke(&obj->vma_node, file_priv); + atomic_long_sub(obj->size >> PAGE_SHIFT, &file_priv->f_oom_badness); drm_gem_object_handle_put_unlocked(obj); return 0; @@ -291,6 +292,8 @@ drm_gem_handle_delete(struct drm_file *filp, u32 handle) idr_remove(&filp->object_idr, handle); spin_unlock(&filp->table_lock); + atomic_long_sub(obj->size >> PAGE_SHIFT, &filp->f_oom_badness); + return 0; } EXPORT_SYMBOL(drm_gem_handle_delete); @@ -399,6 +402,8 @@ drm_gem_handle_create_tail(struct drm_file *file_priv, } *handlep = handle; + + atomic_long_add(obj->size >> PAGE_SHIFT, &file_priv->f_oom_badness); return 0; err_revoke: diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h index e0a73a1e2df7..5926766d79f0 100644 --- a/include/drm/drm_file.h +++ b/include/drm/drm_file.h @@ -366,6 +366,13 @@ struct drm_file { #if IS_ENABLED(CONFIG_DRM_LEGACY) unsigned long lock_count; /* DRI1 legacy lock count */ #endif + + /** + * @f_oom_badness: + * + * How many pages are allocated through this driver connection. + */ + atomic_long_t f_oom_badness; }; /** @@ -430,4 +437,6 @@ unsigned long drm_get_unmapped_area(struct file *file, #endif /* CONFIG_MMU */ +long drm_oom_badness(struct file *f); + #endif /* _DRM_FILE_H_ */ diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index 9d7c61a122dc..0adf8c2f62e8 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -338,6 +338,7 @@ struct drm_gem_object { .read = drm_read,\ .llseek = noop_llseek,\ .mmap = drm_gem_mmap,\ + .oom_badness = drm_oom_badness,\ } void drm_gem_object_release(struct drm_gem_object *obj); diff --git a/include/drm/drm_gem_cma_helper.h b/include/drm/drm_gem_cma_helper.h index fbda4ce5d5fb..455ce1aa6d2c 100644 --- a/include/drm/drm_gem_cma_helper.h +++ b/include/drm/drm_gem_cma_helper.h @@ -273,6 +273,7 @@ unsigned long drm_gem_cma_get_unmapped_area(struct file *filp, .read = drm_read,\ .llseek = noop_llseek,\ .mmap = drm_gem_mmap,\ + .oom_badness = drm_oom_badness,\ DRM_GEM_CMA_UNMAPPED_AREA_FOPS \ } -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 06/13] drm/gma500: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/gma500/psb_drv.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/gma500/psb_drv.c b/drivers/gpu/drm/gma500/psb_drv.c index 1d8744f3e702..d5ab4e081b53 100644 --- a/drivers/gpu/drm/gma500/psb_drv.c +++ b/drivers/gpu/drm/gma500/psb_drv.c @@ -513,6 +513,7 @@ static const struct file_operations psb_gem_fops = { .mmap = drm_gem_mmap, .poll = drm_poll, .read = drm_read, + .oom_badness = drm_oom_badness, }; static const struct drm_driver driver = { -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 07/13] drm/amdgpu: Use drm_oom_badness for amdgpu
From: Andrey Grodzovsky <andrey.grodzovsky at amd.com> This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index ebd37fb19cdb..9d6e57c93d3e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2651,8 +2651,9 @@ static const struct file_operations amdgpu_driver_kms_fops = { .compat_ioctl = amdgpu_kms_compat_ioctl, #endif #ifdef CONFIG_PROC_FS - .show_fdinfo = amdgpu_show_fdinfo + .show_fdinfo = amdgpu_show_fdinfo, #endif + .oom_badness = drm_oom_badness, }; int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv) -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 08/13] drm/radeon: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/radeon/radeon_drv.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index 956c72b5aa33..7e7308c096d8 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -550,6 +550,7 @@ static const struct file_operations radeon_driver_kms_fops = { #ifdef CONFIG_COMPAT .compat_ioctl = radeon_kms_compat_ioctl, #endif + .oom_badness = drm_oom_badness, }; static const struct drm_ioctl_desc radeon_ioctls_kms[] = { -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 09/13] drm/i915: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/i915/i915_driver.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c index 3ffb617d75c9..f9676a5b8aeb 100644 --- a/drivers/gpu/drm/i915/i915_driver.c +++ b/drivers/gpu/drm/i915/i915_driver.c @@ -1748,6 +1748,7 @@ static const struct file_operations i915_driver_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = i915_drm_client_fdinfo, #endif + .oom_badness = drm_oom_badness, }; static int -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 10/13] drm/nouveau: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/nouveau/nouveau_drm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index 561309d447e0..5439b6938455 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -1218,6 +1218,7 @@ nouveau_driver_fops = { .compat_ioctl = nouveau_compat_ioctl, #endif .llseek = noop_llseek, + .oom_badness = drm_oom_badness, }; static struct drm_driver -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 11/13] drm/omap: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/omapdrm/omap_drv.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/omapdrm/omap_drv.c b/drivers/gpu/drm/omapdrm/omap_drv.c index eaf67b9e5f12..ca2c484f48ca 100644 --- a/drivers/gpu/drm/omapdrm/omap_drv.c +++ b/drivers/gpu/drm/omapdrm/omap_drv.c @@ -684,6 +684,7 @@ static const struct file_operations omapdriver_fops = { .poll = drm_poll, .read = drm_read, .llseek = noop_llseek, + .oom_badness = drm_oom_badness, }; static const struct drm_driver omap_drm_driver = { -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 12/13] drm/vmwgfx: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c index 01a5b47e95f9..e447e8ae29be 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c @@ -1577,6 +1577,7 @@ static const struct file_operations vmwgfx_driver_fops = { #endif .llseek = noop_llseek, .get_unmapped_area = vmw_get_unmapped_area, + .oom_badness = drm_oom_badness, }; static const struct drm_driver driver = { -- 2.25.1
Christian König
2022-May-31 10:00 UTC
[Nouveau] [PATCH 13/13] drm/tegra: use drm_oom_badness
This allows the OOM killer to make a better decision which process to reap. Signed-off-by: Christian K?nig <christian.koenig at amd.com> --- drivers/gpu/drm/tegra/drm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 9464f522e257..89ea4f658815 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -803,6 +803,7 @@ static const struct file_operations tegra_drm_fops = { .read = drm_read, .compat_ioctl = drm_compat_ioctl, .llseek = noop_llseek, + .oom_badness = drm_oom_badness, }; static int tegra_drm_context_cleanup(int id, void *p, void *data) -- 2.25.1
+ dri-devel On Tue, May 31, 2022 at 6:00 AM Christian K?nig <ckoenig.leichtzumerken at gmail.com> wrote:> > Hello everyone, > > To summarize the issue I'm trying to address here: Processes can allocate > resources through a file descriptor without being held responsible for it. > > Especially for the DRM graphics driver subsystem this is rather > problematic. Modern games tend to allocate huge amounts of system memory > through the DRM drivers to make it accessible to GPU rendering. > > But even outside of the DRM subsystem this problem exists and it is > trivial to exploit. See the following simple example of > using memfd_create(): > > fd = memfd_create("test", 0); > while (1) > write(fd, page, 4096); > > Compile this and you can bring down any standard desktop system within > seconds. > > The background is that the OOM killer will kill every processes in the > system, but just not the one which holds the only reference to the memory > allocated by the memfd. > > Those problems where brought up on the mailing list multiple times now > [1][2][3], but without any final conclusion how to address them. Since > file descriptors are considered shared the process can not directly held > accountable for allocations made through them. Additional to that file > descriptors can also easily move between processes as well. > > So what this patch set does is to instead of trying to account the > allocated memory to a specific process it adds a callback to struct > file_operations which the OOM killer can use to query the specific OOM > badness of this file reference. This badness is then divided by the > file_count, so that every process using a shmem file, DMA-buf or DRM > driver will get it's equal amount of OOM badness. > > Callbacks are then implemented for the two core users (memfd and DMA-buf) > as well as 72 DRM based graphics drivers. > > The result is that the OOM killer can now much better judge if a process > is worth killing to free up memory. Resulting a quite a bit better system > stability in OOM situations, especially while running games. > > The only other possibility I can see would be to change the accounting of > resources whenever references to the file structure change, but this would > mean quite some additional overhead for a rather common operation. > > Additionally I think trying to limit device driver allocations using > cgroups is orthogonal to this effort. While cgroups is very useful, it > works on per process limits and tries to enforce a collaborative model on > memory management while the OOM killer enforces a competitive model. > > Please comment and/or review, we have that problem flying around for years > now and are not at a point where we finally need to find a solution for > this. > > Regards, > Christian. > > [1] https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html > [2] https://lkml.org/lkml/2018/1/18/543 > [3] https://lkml.org/lkml/2021/2/4/799 > >