David Hildenbrand
2021-May-03 11:35 UTC
[PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
On 03.05.21 13:33, Mike Rapoport wrote:> On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote: >> On 03.05.21 11:28, Mike Rapoport wrote: >>> On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote: >>>> On 02.05.21 08:34, Mike Rapoport wrote: >>>>> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote: >>>>>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze >>>>>> every now and then, so drivers that want to set PageOffline() can make >>>>>> progress. >>>>>> >>>>>> Signed-off-by: David Hildenbrand <david at redhat.com> >>>>>> --- >>>>>> fs/proc/kcore.c | 15 +++++++++++++++ >>>>>> 1 file changed, 15 insertions(+) >>>>>> >>>>>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c >>>>>> index 92ff1e4436cb..3d7531f47389 100644 >>>>>> --- a/fs/proc/kcore.c >>>>>> +++ b/fs/proc/kcore.c >>>>>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name, >>>>>> static ssize_t >>>>>> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) >>>>>> { >>>>>> + size_t page_offline_frozen = 0; >>>>>> char *buf = file->private_data; >>>>>> size_t phdrs_offset, notes_offset, data_offset; >>>>>> size_t phdrs_len, notes_len; >>>>>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) >>>>>> pfn = __pa(start) >> PAGE_SHIFT; >>>>>> page = pfn_to_online_page(pfn); >>>>> >>>>> Can't this race with page offlining for the first time we get here? >>>> >>>> >>>> To clarify, we have three types of offline pages in the kernel ... >>>> >>>> a) Pages part of an offline memory section; the memap is stale and not >>>> trustworthy. pfn_to_online_page() checks that. We *can* protect against >>>> memory offlining using get_online_mems()/put_online_mems(), but usually >>>> avoid doing so as the race window is very small (and a problem all over the >>>> kernel we basically never hit) and locking is rather expensive. In the >>>> future, we might switch to rcu to handle that more efficiently and avoiding >>>> these possible races. >>>> >>>> b) PageOffline(): logically offline pages contained in an online memory >>>> section with a sane memmap. virtio-mem calls these pages "fake offline"; >>>> something like a "temporary" memory hole. The new mechanism I propose will >>>> be used to handle synchronization as races can be more severe, e.g., when >>>> reading actual page content here. >>>> >>>> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet, >>>> but could become harmful in the future. So we better try to remove the page >>>> from the page allcoator and try to migrate away existing users. >>>> >>>> >>>> So page_offline_* handle "b) PageOffline()" only. There is a tiny race >>>> between pfn_to_online_page(pfn) and looking at the memmap as we have in many >>>> cases already throughout the kernel, to be tackled in the future. >>> >>> Right, but here you anyway add locking, so why exclude the first iteration? >> >> What we're protecting is PageOffline() below. If I didn't mess up, we should >> always be calling page_offline_freeze() before calling PageOffline(). Or am >> I missing something? > > Somehow I was under impression we are protecting both pfn_to_online_page() > and PageOffline(). > >>> BTW, did you consider something like >> >> Yes, I played with something like that. We'd have to handle the first >> page_offline_freeze() freeze differently, though, and that's where things >> got a bit ugly in my attempts. >> >>> >>> if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) { >>> page_offline_unfreeze(); >>> cond_resched(); >>> page_offline_freeze(); >>> } >>> >>> We don't seem to care about page_offline_frozen overflows here, do we? >> >> No, the buffer size is also size_t and gets incremented on a per-byte basis. >> The variant I have right now looked the cleanest to me. Happy to hear >> simpler alternatives. > > Well, locking for the first time before the while() loop and doing > resched-relock outside switch() would be definitely nicer, and it makes the > last unlock unconditional. > > The cost of prevention of memory offline during reads of !KCORE_RAM parts > does not seem that significant to me, but I may be missing something.Also true, I'll have a look if I can just simplify that. -- Thanks, David / dhildenb