Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [RFC] Per file OOM-badness / RSS once more
Hello everyone,

To summarize the issue I'm trying to address here: Processes can allocate
resources through a file descriptor without being held responsible for them.

I'm not explaining all the details again. See here for a more detailed
description of the problem:
https://lwn.net/ml/linux-kernel/20220531100007.174649-1-christian.koenig@amd.com/

With this iteration I'm trying to address a bunch of the comments Michal
Hocko gave (thanks a lot for that) as well as adding some new ideas.

Changes made so far:

1. Renamed the callback to file_rss(). This is at least a start towards
   better describing what this is all about. I've been going back and forth
   over the naming here; if you have a better idea please speak up.

2. Cleanups, e.g. now providing a helper function in the fs layer to sum up
   all the pages allocated by the files in a file descriptor table.

3. Using the actual number of allocated pages for the shmem implementation
   instead of just the size. I also tried to ignore shmem files which are
   part of tmpfs, because tmpfs has a separate accounting/limitation
   approach.

4. The OOM killer now prints the memory of the killed process including the
   per-file pages, which makes the whole thing much more comprehensible.

5. I've added the per-file pages to the different RSS reports in procfs.
   This has the interesting effect that tools like top suddenly give a much
   more accurate overview of memory use as well. Of course this increases
   the overhead of gathering that information quite a bit, and I'm not sure
   how feasible that is for upstreaming. On the other hand it once more
   clearly shows that we need to do something about this issue.

Another rather interesting observation is that multiple subsystems (shmem,
tmpfs, ttm) came up with the same workaround of limiting the memory which
can be allocated through them to 50% of the whole system memory.
Unfortunately it isn't the same 50% and it doesn't apply everywhere, so you
can still easily crash the box.

Ideas and/or comments are really welcome.

Thanks,
Christian.
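P.S.: For readers not following the individual patches, the contract a
driver opts into is deliberately tiny. A rough sketch of a hypothetical
implementation (the mydrv_* names and struct are made up for illustration
and not part of this series):

	/* Return the number of pages (not bytes) pinned on behalf of this fd. */
	static long mydrv_file_rss(struct file *file)
	{
		struct mydrv_client *client = file->private_data;

		return atomic_long_read(&client->allocated_pages);
	}

	static const struct file_operations mydrv_fops = {
		.owner		= THIS_MODULE,
		.open		= mydrv_open,
		.release	= mydrv_release,
		.file_rss	= mydrv_file_rss,
	};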
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Some files allocate large amounts of memory on behalf of userspace without
any on-disk backing store. This memory isn't necessarily mapped into the
address space, but it should still count towards the RSS of a process,
just like mapped shared pages do.

That information can then be used by the OOM killer to make better
decisions about which process to reap.

For easy debugging this also adds printing of the per-file RSS to fdinfo.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 fs/file.c               | 23 +++++++++++++++++++++++
 fs/proc/fd.c            |  3 +++
 include/linux/fdtable.h |  1 +
 include/linux/fs.h      |  1 +
 4 files changed, 28 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..b58730a513be 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1307,3 +1307,26 @@ int iterate_fd(struct files_struct *files, unsigned n,
 	return res;
 }
 EXPORT_SYMBOL(iterate_fd);
+
+static int sumup_file_rss(const void *sum, struct file *file, unsigned n)
+{
+	if (!file->f_op->file_rss)
+		return 0;
+
+	*((unsigned long *)sum) += file->f_op->file_rss(file);
+	return 0;
+}
+
+/**
+ * files_rss - how many resources are bound by opened files
+ * @files: opened files
+ *
+ * Returns the sum of all resources bound by files not accounted in file systems.
+ */
+unsigned long files_rss(struct files_struct *files)
+{
+	unsigned long sum = 0;
+
+	iterate_fd(files, 0, sumup_file_rss, &sum);
+	return sum;
+}
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..9943bfca74f7 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -59,6 +59,9 @@ static int seq_show(struct seq_file *m, void *v)
 			   real_mount(file->f_path.mnt)->mnt_id,
 			   file_inode(file)->i_ino);
 
+	if (file->f_op->file_rss)
+		seq_printf(m, "rss:\t%lu\n", file->f_op->file_rss(file));
+
 	/* show_fd_locks() never dereferences files so a stale value is safe */
 	show_fd_locks(m, file, files);
 	if (seq_has_overflowed(m))
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..101770266f38 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -122,6 +122,7 @@ void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
 		int (*)(const void *, struct file *, unsigned),
 		const void *);
+unsigned long files_rss(struct files_struct *files);
 
 extern int close_fd(unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..edacbdce5e4c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2003,6 +2003,7 @@ struct file_operations {
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+	long (*file_rss)(struct file *);
 } __randomize_layout;
 
 struct inode_operations {
-- 
2.25.1
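For illustration, /proc/<pid>/fdinfo/<fd> of a file whose f_op implements
file_rss() would then gain an "rss:" line with the page count. The
surrounding fields are the usual fdinfo output; the numbers here are made
up:

	pos:	0
	flags:	02
	mnt_id:	14
	ino:	1038
	rss:	262144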
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 02/14] oom: take per file RSS into account
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Try to make better decisions about which process to kill based on the
per-file RSS.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 mm/oom_kill.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c6cf9e3cd66..76a5ea73eb6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
  *  kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/fdtable.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -228,7 +229,8 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	 * task's rss, pagetable and swap space use.
 	 */
 	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+		files_rss(p->files) + mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -401,8 +403,8 @@ static int dump_task(struct task_struct *p, void *arg)
 	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
-		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
-		mm_pgtables_bytes(task->mm),
+		task->tgid, task->mm->total_vm, get_mm_rss(task->mm) +
+		files_rss(task->files), mm_pgtables_bytes(task->mm),
 		get_mm_counter(task->mm, MM_SWAPENTS),
 		task->signal->oom_score_adj, task->comm);
 	task_unlock(task);
@@ -594,7 +596,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
+			K(get_mm_counter(mm, MM_FILEPAGES) +
+			  files_rss(tsk->files)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 out_finish:
 	trace_finish_task_reaping(tsk->pid);
@@ -950,7 +953,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
 		message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
 		K(get_mm_counter(mm, MM_ANONPAGES)),
-		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES) + files_rss(victim->files)),
 		K(get_mm_counter(mm, MM_SHMEMPAGES)),
 		from_kuid(&init_user_ns, task_uid(victim)),
 		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
-- 
2.25.1
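To make the effect concrete with made-up numbers (assuming 4 KiB pages): a
process with 512 MiB of anonymous RSS, nothing swapped out, 4 MiB of page
tables and an open 1 GiB DMA-buf previously scored

	131072 + 0 + 1024 = 132096 points,

while with this patch it scores

	131072 + 0 + 262144 + 1024 = 394240 points,

so the holder of the buffer becomes the preferred OOM victim instead of an
innocent bystander.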
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 03/14] Add the per file RSS to the memory management accounting
Add the per-file RSS to the memory management accounting. This allows
seeing the per-file RSS in tools like top as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 fs/proc/array.c      | 7 +++++--
 fs/proc/internal.h   | 3 ++-
 fs/proc/task_mmu.c   | 6 ++++--
 fs/proc/task_nommu.c | 3 ++-
 4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index eb815759842c..a3aabe4ac7c8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -437,7 +437,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_state(m, ns, pid, task);
 
 	if (mm) {
-		task_mem(m, mm);
+		task_mem(m, mm, task->files);
 		task_core_dumping(m, task);
 		task_thp_status(m, mm);
 		mmput(mm);
@@ -589,7 +589,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	seq_put_decimal_ull(m, " ", 0);
 	seq_put_decimal_ull(m, " ", start_time);
 	seq_put_decimal_ull(m, " ", vsize);
-	seq_put_decimal_ull(m, " ", mm ? get_mm_rss(mm) : 0);
+	seq_put_decimal_ull(m, " ", (mm ? get_mm_rss(mm) : 0) +
+				    files_rss(task->files));
 	seq_put_decimal_ull(m, " ", rsslim);
 	seq_put_decimal_ull(m, " ", mm ? (permitted ? mm->start_code : 1) : 0);
 	seq_put_decimal_ull(m, " ", mm ? (permitted ? mm->end_code : 1) : 0);
@@ -673,6 +674,8 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 		size = task_statm(mm, &shared, &text, &data, &resident);
 		mmput(mm);
 
+	shared += files_rss(task->files);
+
 	/*
 	 * For quick read, open code by putting numbers directly
 	 * expected format is
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 06a80f78433d..1b123893f7ae 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -305,7 +305,8 @@ extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
 				unsigned long *, unsigned long *,
 				unsigned long *, unsigned long *);
-extern void task_mem(struct seq_file *, struct mm_struct *);
+extern void task_mem(struct seq_file *, struct mm_struct *,
+		     struct files_struct *);
 
 extern const struct dentry_operations proc_net_dentry_ops;
 static inline void pde_force_lookup(struct proc_dir_entry *pde)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d04e3470d4c..a4adc6fc13d3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -20,6 +20,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/fdtable.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -28,13 +29,14 @@
 #define SEQ_PUT_DEC(str, val) \
 		seq_put_decimal_ull_width(m, str, (val) << (PAGE_SHIFT-10), 8)
-void task_mem(struct seq_file *m, struct mm_struct *mm)
+void task_mem(struct seq_file *m, struct mm_struct *mm,
+	      struct files_struct *files)
 {
 	unsigned long text, lib, swap, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
 	anon = get_mm_counter(mm, MM_ANONPAGES);
-	file = get_mm_counter(mm, MM_FILEPAGES);
+	file = get_mm_counter(mm, MM_FILEPAGES) + files_rss(files);
 	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
 
 	/*
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a6d21fc0033c..5b6b9c5ed9ec 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -18,7 +18,8 @@
  * each process that owns it. Non-shared memory is counted
  * accurately.
  */
-void task_mem(struct seq_file *m, struct mm_struct *mm)
+void task_mem(struct seq_file *m, struct mm_struct *mm,
+	      struct files_struct *files)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
-- 
2.25.1
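As an illustration (made-up numbers, 4 KiB pages): since task_mem() now
folds files_rss() into the file pages, a process holding a 4 GiB DMA-buf
would see that reflected in /proc/<pid>/status roughly like this:

	VmRSS:	 4719616 kB
	RssAnon:	  524288 kB
	RssFile:	 4194308 kB
	RssShmem:	    1020 kB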
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 04/14] mm: shmem: provide RSS for shmem files
This gives the OOM killer an additional hint about which processes are
referencing shmem files that potentially have no other accounting.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 mm/shmem.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..b068ac5ba4bf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2209,6 +2209,21 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	return inflated_addr;
 }
 
+static long shmem_file_rss(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	unsigned long nrpages;
+
+	/* Only account shmem files which aren't part of any fs */
+	if (atomic_read(&inode->i_count) > 1)
+		return 0;
+
+	xa_lock(&file->f_mapping->i_pages);
+	nrpages = file->f_mapping->nrpages;
+	xa_unlock(&file->f_mapping->i_pages);
+	return nrpages;
+}
+
 #ifdef CONFIG_NUMA
 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
 {
@@ -3811,6 +3826,7 @@ EXPORT_SYMBOL(shmem_aops);
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,
+	.file_rss	= shmem_file_rss,
 #ifdef CONFIG_TMPFS
 	.llseek		= shmem_file_llseek,
 	.read_iter	= shmem_file_read_iter,
-- 
2.25.1
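As a sketch of the case this targets: an unlinked shmem file created with
memfd_create() can hold gigabytes of pages without ever being mmap()ed, so
today none of it shows up in any task's RSS. A minimal hypothetical
userspace reproducer (not part of this series; memfd_create() needs
_GNU_SOURCE and glibc >= 2.27):

	#define _GNU_SOURCE
	#include <stddef.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = memfd_create("hog", 0);	/* shmem file, no fs link */

		if (fd < 0)
			return 1;

		memset(buf, 0xab, sizeof(buf));
		/* write() allocates ~1 GiB of shmem pages, charged to nobody */
		for (size_t i = 0; i < (1UL << 30) / sizeof(buf); ++i)
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				return 1;

		pause();	/* keep the fd open and hold the pages */
		return 0;
	}

With this patch the memfd's i_count is 1, so shmem_file_rss() reports its
pages against the holder of the fd.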
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 05/14] dma-buf: provide file RSS for DMA-buf files
Just return the size of the DMA-buf in pages, since pages allocated or
mapped through DMA-bufs are usually not accounted anywhere else.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 630133284e2b..16162ec3538c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -494,6 +494,11 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 	spin_unlock(&dmabuf->name_lock);
 }
 
+static long dma_buf_file_rss(struct file *file)
+{
+	return i_size_read(file_inode(file)) >> PAGE_SHIFT;
+}
+
 static const struct file_operations dma_buf_fops = {
 	.release	= dma_buf_file_release,
 	.mmap		= dma_buf_mmap_internal,
@@ -502,6 +507,7 @@ static const struct file_operations dma_buf_fops = {
 	.unlocked_ioctl	= dma_buf_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
 	.show_fdinfo	= dma_buf_show_fdinfo,
+	.file_rss	= dma_buf_file_rss,
 };
 
 /*
-- 
2.25.1
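Illustrative arithmetic: a 64 MiB DMA-buf with the usual 4 KiB page size
reports (64 MiB) >> PAGE_SHIFT = 16384 pages, regardless of how many of
those pages happen to be resident in system memory at the moment.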
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 06/14] drm/gem: adjust per file RSS on handling buffers
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Large amounts of VRAM are usually not CPU accessible, so they are not
mapped into the process' address space. But since the device drivers
usually support swapping buffers from VRAM to system memory, we can still
run into an out-of-memory situation when userspace starts to allocate too
much.

This patch gives the OOM killer another hint about which process is
holding references to memory resources. A GEM helper is provided and
automatically used for all drivers using the DEFINE_DRM_GEM_FOPS() and
DEFINE_DRM_GEM_CMA_FOPS() macros.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/drm_file.c       | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/drm_gem.c        |  5 +++++
 include/drm/drm_file.h           |  9 +++++++++
 include/drm/drm_gem.h            |  1 +
 include/drm/drm_gem_cma_helper.h |  1 +
 5 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index ed25168619fc..b60795c5067c 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -1049,3 +1049,27 @@ unsigned long drm_get_unmapped_area(struct file *file,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 EXPORT_SYMBOL_GPL(drm_get_unmapped_area);
 #endif /* CONFIG_MMU */
+
+
+/**
+ * drm_file_rss() - get number of pages held by struct drm_file
+ * @f: struct drm_file to get the number of pages for
+ *
+ * Return how many pages are allocated for this client.
+ */
+long drm_file_rss(struct file *f)
+{
+
+	struct drm_file *file_priv = f->private_data;
+
+	if (!file_priv)
+		return 0;
+
+	/*
+	 * Since DRM file descriptors are often dup()ed, divide by the file
+	 * reference count so that each descriptor gets an equal share.
+	 */
+	return DIV_ROUND_UP(atomic_long_read(&file_priv->f_rss),
+			    file_count(f));
+}
+EXPORT_SYMBOL(drm_file_rss);
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index eb0c2d041f13..69b3e93db816 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -256,6 +256,7 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
 		drm_gem_remove_prime_handles(obj, file_priv);
 	drm_vma_node_revoke(&obj->vma_node, file_priv);
 
+	atomic_long_sub(obj->size >> PAGE_SHIFT, &file_priv->f_rss);
 	drm_gem_object_handle_put_unlocked(obj);
 
 	return 0;
@@ -291,6 +292,8 @@ drm_gem_handle_delete(struct drm_file *filp, u32 handle)
 	idr_remove(&filp->object_idr, handle);
 	spin_unlock(&filp->table_lock);
 
+	atomic_long_sub(obj->size >> PAGE_SHIFT, &filp->f_rss);
+
 	return 0;
 }
 EXPORT_SYMBOL(drm_gem_handle_delete);
@@ -399,6 +402,8 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
 	}
 
 	*handlep = handle;
+
+	atomic_long_add(obj->size >> PAGE_SHIFT, &file_priv->f_rss);
 	return 0;
 
 err_revoke:
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index e0a73a1e2df7..7c6ca13d8549 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -366,6 +366,13 @@ struct drm_file {
 #if IS_ENABLED(CONFIG_DRM_LEGACY)
 	unsigned long lock_count; /* DRI1 legacy lock count */
 #endif
+
+	/**
+	 * @f_rss:
+	 *
+	 * How many pages are allocated through this driver connection.
+	 */
+	atomic_long_t f_rss;
 };
 
 /**
@@ -430,4 +437,6 @@ unsigned long drm_get_unmapped_area(struct file *file,
 #endif /* CONFIG_MMU */
 
+long drm_file_rss(struct file *f);
+
 #endif /* _DRM_FILE_H_ */
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 9d7c61a122dc..b64cad26e9e9 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -338,6 +338,7 @@ struct drm_gem_object {
 		.read		= drm_read,\
 		.llseek		= noop_llseek,\
 		.mmap		= drm_gem_mmap,\
+		.file_rss	= drm_file_rss,\
 	}
 
 void drm_gem_object_release(struct drm_gem_object *obj);
diff --git a/include/drm/drm_gem_cma_helper.h b/include/drm/drm_gem_cma_helper.h
index fbda4ce5d5fb..8c56cbc8d10f 100644
--- a/include/drm/drm_gem_cma_helper.h
+++ b/include/drm/drm_gem_cma_helper.h
@@ -273,6 +273,7 @@ unsigned long drm_gem_cma_get_unmapped_area(struct file *filp,
 		.read		= drm_read,\
 		.llseek		= noop_llseek,\
 		.mmap		= drm_gem_mmap,\
+		.file_rss	= drm_file_rss,\
 		DRM_GEM_CMA_UNMAPPED_AREA_FOPS \
 	}
-- 
2.25.1
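Worked example with made-up numbers: a client holding 8192 pages worth of
GEM handles whose DRM fd has been dup()ed once has file_count(f) == 2, so
each of the two descriptors reports DIV_ROUND_UP(8192, 2) = 4096 pages;
files_rss() then sums both entries in the fd table and recovers the full
8192 pages for the process.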
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 07/14] drm/gma500: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/gma500/psb_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/gma500/psb_drv.c b/drivers/gpu/drm/gma500/psb_drv.c
index 1d8744f3e702..92c005aa6e9e 100644
--- a/drivers/gpu/drm/gma500/psb_drv.c
+++ b/drivers/gpu/drm/gma500/psb_drv.c
@@ -513,6 +513,7 @@ static const struct file_operations psb_gem_fops = {
 	.mmap = drm_gem_mmap,
 	.poll = drm_poll,
 	.read = drm_read,
+	.file_rss = drm_file_rss,
 };
 
 static const struct drm_driver driver = {
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 08/14] drm/amdgpu: use drm_file_rss
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 8890300766a5..4508791fe80c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2635,8 +2635,9 @@ static const struct file_operations amdgpu_driver_kms_fops = {
 	.compat_ioctl = amdgpu_kms_compat_ioctl,
 #endif
 #ifdef CONFIG_PROC_FS
-	.show_fdinfo = amdgpu_show_fdinfo
+	.show_fdinfo = amdgpu_show_fdinfo,
 #endif
+	.file_rss = drm_file_rss,
 };
 
 int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv)
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 09/14] drm/radeon: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/radeon/radeon_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 956c72b5aa33..11d310cdd2e8 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -550,6 +550,7 @@ static const struct file_operations radeon_driver_kms_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = radeon_kms_compat_ioctl,
 #endif
+	.file_rss = drm_file_rss,
 };
 
 static const struct drm_ioctl_desc radeon_ioctls_kms[] = {
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 10/14] drm/i915: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/i915_driver.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 90b0ce5051af..fc269055a07c 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1741,6 +1741,7 @@ static const struct file_operations i915_driver_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = i915_drm_client_fdinfo,
 #endif
+	.file_rss = drm_file_rss,
 };
 
 static int
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 11/14] drm/nouveau: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 561309d447e0..cc0ac7b059fe 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -1218,6 +1218,7 @@ nouveau_driver_fops = {
 	.compat_ioctl = nouveau_compat_ioctl,
 #endif
 	.llseek = noop_llseek,
+	.file_rss = drm_file_rss,
 };
 
 static struct drm_driver
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 12/14] drm/omapdrm: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/omapdrm/omap_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/omapdrm/omap_drv.c b/drivers/gpu/drm/omapdrm/omap_drv.c
index eaf67b9e5f12..dff637de00a3 100644
--- a/drivers/gpu/drm/omapdrm/omap_drv.c
+++ b/drivers/gpu/drm/omapdrm/omap_drv.c
@@ -684,6 +684,7 @@ static const struct file_operations omapdriver_fops = {
 	.poll = drm_poll,
 	.read = drm_read,
 	.llseek = noop_llseek,
+	.file_rss = drm_file_rss,
 };
 
 static const struct drm_driver omap_drm_driver = {
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 13/14] drm/vmwgfx: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
index 01a5b47e95f9..99bf405d31b9 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c
@@ -1577,6 +1577,7 @@ static const struct file_operations vmwgfx_driver_fops = {
 #endif
 	.llseek = noop_llseek,
 	.get_unmapped_area = vmw_get_unmapped_area,
+	.file_rss = drm_file_rss,
 };
 
 static const struct drm_driver driver = {
-- 
2.25.1
Christian König
2022-Jun-24 08:04 UTC
[Nouveau] [PATCH 14/14] drm/tegra: use drm_file_rss
This allows the OOM killer to make a better decision about which process
to reap.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/tegra/drm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 4cdc8faf798f..cc0c2fc57250 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -804,6 +804,7 @@ static const struct file_operations tegra_drm_fops = {
 	.read = drm_read,
 	.compat_ioctl = drm_compat_ioctl,
 	.llseek = noop_llseek,
+	.file_rss = drm_file_rss,
 };
 
 static int tegra_drm_context_cleanup(int id, void *p, void *data)
-- 
2.25.1
On Fri 24-06-22 10:04:30, Christian König wrote:
> Hello everyone,
>
> To summarize the issue I'm trying to address here: Processes can
> allocate resources through a file descriptor without being held
> responsible for them.
>
> I'm not explaining all the details again. See here for a more detailed
> description of the problem:
> https://lwn.net/ml/linux-kernel/20220531100007.174649-1-christian.koenig@amd.com/
>
> With this iteration I'm trying to address a bunch of the comments
> Michal Hocko gave (thanks a lot for that) as well as adding some new
> ideas.
>
> Changes made so far:
>
> 1. Renamed the callback to file_rss(). This is at least a start towards
>    better describing what this is all about. I've been going back and
>    forth over the naming here; if you have a better idea please speak up.
>
> 2. Cleanups, e.g. now providing a helper function in the fs layer to
>    sum up all the pages allocated by the files in a file descriptor
>    table.
>
> 3. Using the actual number of allocated pages for the shmem
>    implementation instead of just the size. I also tried to ignore
>    shmem files which are part of tmpfs, because tmpfs has a separate
>    accounting/limitation approach.

OK, this is better than the original approach. There are still holes
there, though, I am afraid. I am not sure your i_count hack is correct,
but that would be mostly an implementation detail.

The scheme will over-account memory-mapped files (including memfd). How
much that matters will really differ. For global OOM situations it is very
likely that there will be barely any disk-based page cache, as it would be
reclaimed by the time the OOM killer is invoked, so this should be OK.
Swap-backed page cache (shmem and its users) is more tricky. It is swap
bound, and processes which map it will get "charged" in the form of swap
entries, while those which rely on read/write will just escape the sight
of the OOM killer no matter how much memory they own via their shmem-backed
fd. This sounds rather serious to me, and I hope I haven't missed anything
subtle here that would keep those pages somehow visible. Anyway, something
to very carefully document.

For the memcg OOM this gets even more tricky. Files can be shared among
tasks across memcgs. That is not really straightforward from the userspace
POV, because it is not strictly deterministic: first-one-first-charged
logic is applied, so a lot might depend on timing. This could also easily
mean that a large part of the in-memory state of the file is outside of
the reclaim, and therefore OOM, scope of the memcg which is hitting its
hard limit. This could result in tasks being killed just because they
(co)operate on a large file outside of their memcg domain. To be honest,
I am not sure how big a problem this would be in practice, and the
existing behavior has its own cons, so to me it sounds like trading one
set of deficiencies for another.

As we have discussed previously, there is unlikely to be a great solution,
but you a) need to document the most prominent downsides so that people
can at least see this is understood and documented behavior, and b) think
about the runaway situation wrt non-mapped shmem mentioned above and see
whether there is something we can do about that.
-- 
Michal Hocko
SUSE Labs