David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
This is (obviously) for v5.13++; no need to rush with review, but I decided to send it around right away. Looking for places where the kernel might unconditionally read PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore needs some more love to not touch some other pages we really don't want to read -- i.e., hwpoisoned. Examples for PageOffline() pages are pages inflated in a balloon, memory unplugged via virtio-mem, and partially-present sections in memory added by the Hyper-V balloon. When reading pages inflated in a balloon, we essentially produce unnecessary load in the hypervisor; holes in partially present sections in case of Hyper-V are not accessible and already were a problem for /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In the future, virtio-mem might disallow reading unplugged memory -- marked as PageOffline() -- in some environments, resulting in undefined behavior when accessed; therefore, I'm trying to identify and rework all these (corner) cases. With this series, there is really only access via /dev/mem, /proc/vmcore and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case use case -- we won't care for now if someone explicitly tries to do nasty things by reading from/writing to physical addresses we better not touch. /dev/mem is a use case we won't support for virtio-mem, at least for now, so we'll simply disallow mapping any virtio-mem memory via /dev/mem next. /proc/vmcore is really only a problem when dumping the old kernel via something that's not makedumpfile (read: basically never), however, we'll try sanitizing that as well in the second kernel in the future. Tested via kcore_dump: https://github.com/schlafwandler/kcore_dump Cc: Andrew Morton <akpm at linux-foundation.org> Cc: "Michael S. Tsirkin" <mst at redhat.com> Cc: Jason Wang <jasowang at redhat.com> Cc: Alexey Dobriyan <adobriyan at gmail.com> Cc: Mike Rapoport <rppt at kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy at infradead.org> Cc: Oscar Salvador <osalvador at suse.de> Cc: Michal Hocko <mhocko at suse.com> Cc: Roman Gushchin <guro at fb.com> Cc: Alex Shi <alex.shi at linux.alibaba.com> Cc: Steven Price <steven.price at arm.com> Cc: Mike Kravetz <mike.kravetz at oracle.com> Cc: Aili Yao <yaoaili at kingsoft.com> Cc: Jiri Bohac <jbohac at suse.cz> Cc: "K. Y. Srinivasan" <kys at microsoft.com> Cc: Haiyang Zhang <haiyangz at microsoft.com> Cc: Stephen Hemminger <sthemmin at microsoft.com> Cc: Wei Liu <wei.liu at kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi at nec.com> Cc: linux-hyperv at vger.kernel.org Cc: virtualization at lists.linux-foundation.org Cc: linux-fsdevel at vger.kernel.org Cc: linux-mm at kvack.org David Hildenbrand (7): fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM mm: rename and move page_is_poisoned() fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline() virtio-mem: use page_offline_(start|end) when setting PageOffline() fs/proc/kcore: use page_offline_(freeze|unfreeze) drivers/virtio/virtio_mem.c | 2 ++ fs/proc/kcore.c | 69 ++++++++++++++++++++++++++++++------- include/linux/kcore.h | 3 -- include/linux/page-flags.h | 12 +++++++ mm/gup.c | 6 +++- mm/internal.h | 20 ----------- mm/util.c | 40 +++++++++++++++++++++ 7 files changed, 115 insertions(+), 37 deletions(-) base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717 -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()") removed the last user of KCORE_REMAP. Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page") removed the last user of KCORE_OTHER. Let's drop both types. While at it, also drop vaddr in "struct kcore_list", used by KCORE_REMAP only. Signed-off-by: David Hildenbrand <david at redhat.com> --- fs/proc/kcore.c | 7 ++----- include/linux/kcore.h | 3 --- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 4d2e64e9016c..09f77d3c6e15 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) phdr->p_type = PT_LOAD; phdr->p_flags = PF_R | PF_W | PF_X; phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset; - if (m->type == KCORE_REMAP) - phdr->p_vaddr = (size_t)m->vaddr; - else - phdr->p_vaddr = (size_t)m->addr; - if (m->type == KCORE_RAM || m->type == KCORE_REMAP) + phdr->p_vaddr = (size_t)m->addr; + if (m->type == KCORE_RAM) phdr->p_paddr = __pa(m->addr); else if (m->type == KCORE_TEXT) phdr->p_paddr = __pa_symbol(m->addr); diff --git a/include/linux/kcore.h b/include/linux/kcore.h index da676cdbd727..86c0f1d18998 100644 --- a/include/linux/kcore.h +++ b/include/linux/kcore.h @@ -11,14 +11,11 @@ enum kcore_type { KCORE_RAM, KCORE_VMEMMAP, KCORE_USER, - KCORE_OTHER, - KCORE_REMAP, }; struct kcore_list { struct list_head list; unsigned long addr; - unsigned long vaddr; size_t size; int type; }; -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
Let's resturcture the code, using switch-case, and checking pfn_is_ram() only when we are dealing with KCORE_RAM. Signed-off-by: David Hildenbrand <david at redhat.com> --- fs/proc/kcore.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 09f77d3c6e15..ed6fbb3bd50c 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) goto out; } m = NULL; /* skip the list anchor */ - } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { - if (clear_user(buffer, tsz)) { - ret = -EFAULT; - goto out; - } - } else if (m->type == KCORE_VMALLOC) { + goto skip; + } + + switch (m->type) { + case KCORE_VMALLOC: vread(buf, (char *)start, tsz); /* we have to zero-fill user buffer even if no read */ if (copy_to_user(buffer, buf, tsz)) { ret = -EFAULT; goto out; } - } else if (m->type == KCORE_USER) { + break; + case KCORE_USER: /* User page is handled prior to normal kernel page: */ if (copy_to_user(buffer, (char *)start, tsz)) { ret = -EFAULT; goto out; } - } else { + break; + case KCORE_RAM: + if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { + if (clear_user(buffer, tsz)) { + ret = -EFAULT; + goto out; + } + break; + } + fallthrough; + case KCORE_VMEMMAP: + case KCORE_TEXT: if (kern_addr_valid(start)) { /* * Using bounce buffer to bypass the @@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) goto out; } } + break; + default: + pr_warn_once("Unhandled KCORE type: %d\n", m->type); + if (clear_user(buffer, tsz)) { + ret = -EFAULT; + goto out; + } } +skip: buflen -= tsz; *fpos += tsz; buffer += tsz; -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 3/7] mm: rename and move page_is_poisoned()
Commit d3378e86d182 ("mm/gup: check page posion status for coredump.") introduced page_is_poisoned(), however, v5 [1] of the patch used "page_is_hwpoison()" and something went wrong while upstreaming. Rename the function and move it to page-flags.h, from where it can be used in other -- kcore -- context. Move the comment to the place where it belongs and simplify. [1] https://lkml.kernel.org/r/20210322193318.377c9ce9 at alex-virtual-machine Signed-off-by: David Hildenbrand <david at redhat.com> --- include/linux/page-flags.h | 7 +++++++ mm/gup.c | 6 +++++- mm/internal.h | 20 -------------------- 3 files changed, 12 insertions(+), 21 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 04a34c08e0a6..b8c56672a588 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -694,6 +694,13 @@ PAGEFLAG_FALSE(DoubleMap) TESTSCFLAG_FALSE(DoubleMap) #endif +static inline bool is_page_hwpoison(struct page *page) +{ + if (PageHWPoison(page)) + return true; + return PageHuge(page) && PageHWPoison(compound_head(page)); +} + /* * For pages that are never mapped to userspace (and aren't PageSlab), * page_type may be used. Because it is initialised to -1, we invert the diff --git a/mm/gup.c b/mm/gup.c index ef7d2da9f03f..000f3303e7f2 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1536,7 +1536,11 @@ struct page *get_dump_page(unsigned long addr) if (locked) mmap_read_unlock(mm); - if (ret == 1 && is_page_poisoned(page)) + /* + * We might have hwpoisoned pages still mapped into user space. Don't + * read these pages when creating a coredump, access could be fatal. + */ + if (ret == 1 && is_page_hwpoison(page)) return NULL; return (ret == 1) ? page : NULL; diff --git a/mm/internal.h b/mm/internal.h index cb3c5e0a7799..1432feec62df 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -97,26 +97,6 @@ static inline void set_page_refcounted(struct page *page) set_page_count(page, 1); } -/* - * When kernel touch the user page, the user page may be have been marked - * poison but still mapped in user space, if without this page, the kernel - * can guarantee the data integrity and operation success, the kernel is - * better to check the posion status and avoid touching it, be good not to - * panic, coredump for process fatal signal is a sample case matching this - * scenario. Or if kernel can't guarantee the data integrity, it's better - * not to call this function, let kernel touch the poison page and get to - * panic. - */ -static inline bool is_page_poisoned(struct page *page) -{ - if (PageHWPoison(page)) - return true; - else if (PageHuge(page) && PageHWPoison(compound_head(page))) - return true; - - return false; -} - extern unsigned long highest_memmap_pfn; /* -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
Let's avoid reading: 1) Offline memory sections: the content of offline memory sections is stale as the memory is effectively unused by the kernel. On s390x with standby memory, offline memory sections (belonging to offline storage increments) are not accessible. With virtio-mem and the hyper-v balloon, we can have unavailable memory chunks that should not be accessed inside offline memory sections. Last but not least, offline memory sections might contain hwpoisoned pages which we can no longer identify because the memmap is stale. 2) PG_offline pages: logically offline pages that are documented as "The content of these pages is effectively stale. Such pages should not be touched (read/write/dump/save) except by their owner.". Examples include pages inflated in a balloon or unavailble memory ranges inside hotplugged memory sections with virtio-mem or the hyper-v balloon. 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. As documented: "Accessing is not safe since it may cause another machine check. Don't touch!" Reading /proc/kcore now performs similar checks as when reading /proc/vmcore for kdump via makedumpfile: problematic pages are exclude. It's also similar to hibernation code, however, we don't skip hwpoisoned pages when processing pages in kernel/power/snapshot.c:saveable_page() yet. Note 1: we can race against memory offlining code, especially memory going offline and getting unplugged: however, we will properly tear down the identity mapping and handle faults gracefully when accessing this memory from kcore code. Note 2: we can race against drivers setting PageOffline() and turning memory inaccessible in the hypervisor. We'll handle this in a follow-up patch. Signed-off-by: David Hildenbrand <david at redhat.com> --- fs/proc/kcore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index ed6fbb3bd50c..92ff1e4436cb 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) m = NULL; while (buflen) { + struct page *page; + unsigned long pfn; + /* * If this is the first iteration or the address is not within * the previous entry, search for a matching entry. @@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) } break; case KCORE_RAM: - if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { + pfn = __pa(start) >> PAGE_SHIFT; + page = pfn_to_online_page(pfn); + + /* + * Don't read offline sections, logically offline pages + * (e.g., inflated in a balloon), hwpoisoned pages, + * and explicitly excluded physical ranges. + */ + if (!page || PageOffline(page) || + is_page_hwpoison(page) || !pfn_is_ram(pfn)) { if (clear_user(buffer, tsz)) { ret = -EFAULT; goto out; -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()
A driver might set a page logically offline -- PageOffline() -- and turn the page inaccessible in the hypervisor; after that, access to page content can be fatal. One example is virtio-mem; while unplugged memory -- marked as PageOffline() can currently be read in the hypervisor, this will no longer be the case in the future; for example, when having a virtio-mem device backed by huge pages in the hypervisor. Some special PFN walkers -- i.e., /proc/kcore -- read content of random pages after checking PageOffline(); however, these PFN walkers can race with drivers that set PageOffline(). Let's introduce page_offline_(begin|end|freeze|unfreeze) for synchronizing. page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to synchronize with such drivers, achieving that a page cannot be set PageOffline() while frozen. page_offline_begin()/page_offline_end() is used by drivers that care about such races when setting a page PageOffline(). For simplicity, use a rwsem for now; neither drivers nor users are performance sensitive. Signed-off-by: David Hildenbrand <david at redhat.com> --- include/linux/page-flags.h | 5 +++++ mm/util.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b8c56672a588..e3d00c72f459 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy) */ PAGE_TYPE_OPS(Offline, offline) +extern void page_offline_freeze(void); +extern void page_offline_unfreeze(void); +extern void page_offline_begin(void); +extern void page_offline_end(void); + /* * Marks pages in use as page tables. */ diff --git a/mm/util.c b/mm/util.c index 54870226cea6..95395d4e4209 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object) } pr_cont(" non-slab/vmalloc memory.\n"); } + +/* + * A driver might set a page logically offline -- PageOffline() -- and + * turn the page inaccessible in the hypervisor; after that, access to page + * content can be fatal. + * + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random + * pages after checking PageOffline(); however, these PFN walkers can race + * with drivers that set PageOffline(). + * + * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to + * synchronize with such drivers, achieving that a page cannot be set + * PageOffline() while frozen. + * + * page_offline_begin()/page_offline_end() is used by drivers that care about + * such races when setting a page PageOffline(). + */ +static DECLARE_RWSEM(page_offline_rwsem); + +void page_offline_freeze(void) +{ + down_read(&page_offline_rwsem); +} + +void page_offline_unfreeze(void) +{ + up_read(&page_offline_rwsem); +} + +void page_offline_begin(void) +{ + down_write(&page_offline_rwsem); +} + +void page_offline_end(void) +{ + up_write(&page_offline_rwsem); +} -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
Let's properly use page_offline_(start|end) to synchronize setting PageOffline(), so we won't have valid page access to unplugged memory regions from /proc/kcore. Signed-off-by: David Hildenbrand <david at redhat.com> --- drivers/virtio/virtio_mem.c | 2 ++ mm/util.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 10ec60d81e84..dc2a2e2b2ff8 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, static void virtio_mem_set_fake_offline(unsigned long pfn, unsigned long nr_pages, bool onlined) { + page_offline_begin(); for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn, ClearPageReserved(page); } } + page_offline_end(); } /* diff --git a/mm/util.c b/mm/util.c index 95395d4e4209..d0e357bd65e6 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1046,8 +1046,10 @@ void page_offline_begin(void) { down_write(&page_offline_rwsem); } +EXPORT_SYMBOL(page_offline_begin); void page_offline_end(void) { up_write(&page_offline_rwsem); } +EXPORT_SYMBOL(page_offline_end); -- 2.30.2
David Hildenbrand
2021-Apr-29 12:25 UTC
[PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
Let's properly synchronize with drivers that set PageOffline(). Unfreeze every now and then, so drivers that want to set PageOffline() can make progress. Signed-off-by: David Hildenbrand <david at redhat.com> --- fs/proc/kcore.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 92ff1e4436cb..3d7531f47389 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name, static ssize_t read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) { + size_t page_offline_frozen = 0; char *buf = file->private_data; size_t phdrs_offset, notes_offset, data_offset; size_t phdrs_len, notes_len; @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) pfn = __pa(start) >> PAGE_SHIFT; page = pfn_to_online_page(pfn); + /* + * Don't race against drivers that set PageOffline() + * and expect no further page access. + */ + if (page_offline_frozen == MAX_ORDER_NR_PAGES) { + page_offline_unfreeze(); + page_offline_frozen = 0; + cond_resched(); + } + if (!page_offline_frozen++) + page_offline_freeze(); + /* * Don't read offline sections, logically offline pages * (e.g., inflated in a balloon), hwpoisoned pages, @@ -565,6 +578,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) } out: + if (page_offline_frozen) + page_offline_unfreeze(); up_read(&kclist_lock); if (ret) return ret; -- 2.30.2
Michael S. Tsirkin
2021-May-03 08:23 UTC
[PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()
On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:> Let's properly use page_offline_(start|end) to synchronize setting > PageOffline(), so we won't have valid page access to unplugged memory > regions from /proc/kcore. > > Signed-off-by: David Hildenbrand <david at redhat.com>the patch looks good to me as such Acked-by: Michael S. Tsirkin <mst at redhat.com> Feel free to merge with rest of patcgset - it seems to mostly live in the fs/mm space. IF you respin, maybe add the explanation you sent in response to Mike's comments in the commit log.> --- > drivers/virtio/virtio_mem.c | 2 ++ > mm/util.c | 2 ++ > 2 files changed, 4 insertions(+) > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 10ec60d81e84..dc2a2e2b2ff8 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, > static void virtio_mem_set_fake_offline(unsigned long pfn, > unsigned long nr_pages, bool onlined) > { > + page_offline_begin(); > for (; nr_pages--; pfn++) { > struct page *page = pfn_to_page(pfn); > > @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn, > ClearPageReserved(page); > } > } > + page_offline_end(); > } > > /* > diff --git a/mm/util.c b/mm/util.c > index 95395d4e4209..d0e357bd65e6 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -1046,8 +1046,10 @@ void page_offline_begin(void) > { > down_write(&page_offline_rwsem); > } > +EXPORT_SYMBOL(page_offline_begin); > > void page_offline_end(void) > { > up_write(&page_offline_rwsem); > } > +EXPORT_SYMBOL(page_offline_end); > -- > 2.30.2