zhenwei pi
2022-May-20 07:06 UTC
[PATCH 0/3] recover hardware corrupted page by virtio balloon
Hi,

I'm trying to recover hardware corrupted pages by virtio balloon; the
workflow of this feature looks like this:

        Guest        5.MF -> 6.RVQ FE        10.Unpoison page
                        /           \               /
 -------------------+-------------+----------+-----------
                     |             |          |
                   4.MCE       7.RVQ BE   9.RVQ Event
        QEMU        /              \         /
                3.SIGBUS        8.Remap
                   /
 ----------------+------------------------------------
                  |
                  +--2.MF
        Host     /
              1.HW error

1, Hardware page error occurs randomly.
2, The host side handles the corrupted page by the Memory Failure
   mechanism and sends SIGBUS to the user process if early-kill is
   enabled.
3, QEMU handles the SIGBUS; if the address belongs to guest RAM, then:
4, QEMU tries to inject an MCE into the guest.
5, The guest handles the memory failure again.

1-5 have already been supported for a long time; the next steps are
supported by this patch set (together with the related QEMU patches):

6, The guest balloon driver is notified of the corrupted PFN and sends a
   request to the host side via the Recover VQ FrontEnd.
7, QEMU handles the request from the Recover VQ BackEnd, then:
8, QEMU remaps the corrupted HVA to fix the memory failure, then:
9, QEMU acks the result to the guest side via the Recover VQ.
10, The guest unpoisons the page if the corrupted page was recovered
    successfully.

Test:
This patch set can be tested with QEMU (also under development):
https://github.com/pizhenwei/qemu/tree/balloon-recover

Emulate an MCE with QEMU (normal guest RAM pages only, hugepages are not
supported):
virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c

The guest works fine (on Intel Platinum 8260):
 mce: [Hardware Error]: Machine check events logged
 Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
 virtio_balloon virtio5: recovered pfn 0x61646
 Unpoison: Unpoisoned page 0x61646 by virtio-balloon
 MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010

And 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.

About the protocol of the virtio balloon recover VQ: it is still
undefined and under development:
- 'struct virtio_balloon_recover' defines the structure used to exchange
  messages between guest and host.
- '__le32 corrupted_pages' in struct virtio_balloon_config will be used
  in the next step:
  1, a VM uses RAM backed by 2M huge pages; once an MCE occurs, the whole
     2M becomes inaccessible. Reporting 512 * 4K 'corrupted_pages' to the
     guest gives the guest a chance to isolate the 512 pages ahead of
     time.
  2, after migrating to another host, the corrupted pages are actually
     recovered; once the guest reads 'corrupted_pages' as 0, it can
     unpoison all the poisoned pages recorded in the balloon driver.

zhenwei pi (3):
  memory-failure: Introduce memory failure notifier
  mm/memory-failure.c: support reset PTE during unpoison
  virtio_balloon: Introduce memory recover

 drivers/virtio/virtio_balloon.c     | 243 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |   4 +-
 include/uapi/linux/virtio_balloon.h |  16 ++
 mm/hwpoison-inject.c                |   2 +-
 mm/memory-failure.c                 |  59 ++++++-
 5 files changed, 315 insertions(+), 9 deletions(-)

-- 
2.20.1
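[Editorial note on step 8, since the QEMU side is explicitly still under
development: one minimal way for a VMM to replace a poisoned page backing
anonymous guest RAM is to mmap() a fresh page over the same HVA with
MAP_FIXED. The sketch below is illustrative only, it is not the code from
the balloon-recover branch, and it assumes anonymous, page-aligned guest
memory; the function name is hypothetical.]

    /*
     * Illustrative sketch, not the actual QEMU implementation: replace
     * the poisoned host page backing a guest RAM HVA with a fresh
     * anonymous page. MAP_FIXED atomically drops the old mapping, so a
     * later access faults in a clean zero-filled page instead of the
     * hwpoisoned one.
     */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int remap_corrupted_hva(void *hva)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            void *aligned = (void *)((uintptr_t)hva & ~((uintptr_t)pagesize - 1));
            void *p;

            p = mmap(aligned, pagesize, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            return p == MAP_FAILED ? -1 : 0;
    }

[Shared or file-backed guest memory would need a different mechanism;
this only covers the simple anonymous-RAM case.]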
zhenwei pi
2022-May-20 07:06 UTC
[PATCH 1/3] memory-failure: Introduce memory failure notifier
Introduce a memory failure notifier: once a hardware memory failure
occurs and the kernel handles the corrupted page successfully, anyone
registered on this chain gets notified of the corrupted PFN.

Signed-off-by: zhenwei pi <pizhenwei at bytedance.com>
---
 include/linux/mm.h  |  2 ++
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f44254af8ce..665873c2788c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3197,6 +3197,8 @@ extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
+extern int register_memory_failure_notifier(struct notifier_block *nb);
+extern int unregister_memory_failure_notifier(struct notifier_block *nb);
 #ifdef CONFIG_MEMORY_FAILURE
 extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
 #else
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2d590cba412c..95c218bb0a37 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -68,6 +68,35 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+static BLOCKING_NOTIFIER_HEAD(mf_notifier_list);
+
+/**
+ * register_memory_failure_notifier - Register function to be called if a
+ *                                    corrupted page gets handled successfully
+ * @nb: Info about notifier function to be called
+ *
+ * Currently always returns zero, as blocking_notifier_chain_register()
+ * always returns zero.
+ */
+int register_memory_failure_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&mf_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_memory_failure_notifier);
+
+/**
+ * unregister_memory_failure_notifier - Unregister previously registered
+ *                                      memory failure notifier
+ * @nb: Hook to be unregistered
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_memory_failure_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&mf_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_memory_failure_notifier);
+
 static bool __page_handle_poison(struct page *page)
 {
 	int ret;
@@ -1136,6 +1165,10 @@ static void action_result(unsigned long pfn, enum mf_action_page_type type,
 	num_poisoned_pages_inc();
 	pr_err("Memory failure: %#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
+
+	/* only notify the chain if the page was handled successfully */
+	if (result == MF_RECOVERED)
+		blocking_notifier_call_chain(&mf_notifier_list, pfn, NULL);
 }
 
 static int page_action(struct page_state *ps, struct page *p,
-- 
2.20.1
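[Editorial note: to make the intended usage of the new chain concrete,
here is a minimal consumer sketch. The module and handler names are
hypothetical; the real user is registered by the virtio balloon driver in
patch 3. Only the register/unregister calls come from this patch.]

    /* Hypothetical consumer of the memory failure notifier chain above. */
    #include <linux/module.h>
    #include <linux/mm.h>
    #include <linux/notifier.h>

    static int mf_demo_handler(struct notifier_block *nb,
                               unsigned long pfn, void *unused)
    {
            pr_info("mf_demo: pfn %#lx was poisoned and handled\n", pfn);
            return NOTIFY_OK;
    }

    static struct notifier_block mf_demo_nb = {
            .notifier_call = mf_demo_handler,
    };

    static int __init mf_demo_init(void)
    {
            return register_memory_failure_notifier(&mf_demo_nb);
    }

    static void __exit mf_demo_exit(void)
    {
            unregister_memory_failure_notifier(&mf_demo_nb);
    }

    module_init(mf_demo_init);
    module_exit(mf_demo_exit);
    MODULE_LICENSE("GPL");

[Because this is a blocking notifier chain, handlers run in process
context and may sleep, which is what lets the balloon driver allocate and
queue a virtqueue request from its callback.]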
zhenwei pi
2022-May-20 07:06 UTC
[PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison
Originally, unpoison_memory() is only used by hwpoison-inject, and it
unpoisons a page which was also poisoned by hwpoison-inject. The kernel
PTE entry does not change during software poison/unpoison.

On a virtualization platform, it's possible for the hypervisor to fix a
hardware corrupted page, typically by remapping the failing HVA (host
virtual address). So add a new parameter 'const char *reason' to record
who requested the unpoison.

Once the corrupted page gets fixed, the guest kernel needs to put the
page back to the buddy allocator. Reusing the page then hits the
following issue (Intel Platinum 8260):

 BUG: unable to handle page fault for address: ffff888061646000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
 Oops: 0002 [#1] PREEMPT SMP NOPTI
 CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0-rc6.bm.1-amd64 #6
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
 RIP: 0010:clear_page_erms+0x7/0x10

The kernel PTE entry of the fixed page has not been corrected yet, so the
kernel hits a page fault during prep_new_page. Add 'bool reset_kpte' to
get a chance to fix the PTE entry if the page was fixed by the
hypervisor.

Signed-off-by: zhenwei pi <pizhenwei at bytedance.com>
---
 include/linux/mm.h   |  2 +-
 mm/hwpoison-inject.c |  2 +-
 mm/memory-failure.c  | 26 +++++++++++++++++++-------
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 665873c2788c..7ba210e86401 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3191,7 +3191,7 @@ enum mf_flags {
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
-extern int unpoison_memory(unsigned long pfn);
+extern int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 5c0cddd81505..0dd17ba98ade 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -57,7 +57,7 @@ static int hwpoison_unpoison(void *data, u64 val)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	return unpoison_memory(val);
+	return unpoison_memory(val, false, "hwpoison-inject");
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95c218bb0a37..a46de3be1dd7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2132,21 +2132,26 @@ core_initcall(memory_failure_init);
 /**
  * unpoison_memory - Unpoison a previously poisoned page
  * @pfn: Page number of the to be unpoisoned page
+ * @reset_kpte: Reset the PTE entry for kmap
+ * @reason: The caller tells why it unpoisons the page
  *
- * Software-unpoison a page that has been poisoned by
- * memory_failure() earlier.
+ * Unpoison a page that has been poisoned by memory_failure() earlier.
  *
- * This is only done on the software-level, so it only works
- * for linux injected failures, not real hardware failures
+ * For linux injected failures, there is no need to reset the PTE entry.
+ * It's possible to fix a hardware memory failure on a virtualization
+ * platform; once the hypervisor fixes the failure, the guest needs to put
+ * the page back to the buddy allocator and reset the PTE entry in the kernel.
  *
  * Returns 0 for success, otherwise -errno.
  */
-int unpoison_memory(unsigned long pfn)
+int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason)
 {
 	struct page *page;
 	struct page *p;
 	int ret = -EBUSY;
 	int freeit = 0;
+	pte_t *kpte;
+	unsigned long addr;
 	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
 					DEFAULT_RATELIMIT_BURST);
 
@@ -2208,8 +2213,15 @@ int unpoison_memory(unsigned long pfn)
 	mutex_unlock(&mf_mutex);
 	if (!ret || freeit) {
 		num_poisoned_pages_dec();
-		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
-				 page_to_pfn(p), &unpoison_rs);
+		pr_info("Unpoison: Unpoisoned page %#lx by %s\n",
+			page_to_pfn(p), reason);
+		if (reset_kpte) {
+			preempt_disable();
+			addr = (unsigned long)page_to_virt(p);
+			kpte = virt_to_kpte(addr);
+			set_pte_at(&init_mm, addr, kpte,
+				   pfn_pte(pfn, PAGE_KERNEL));
+			preempt_enable();
+		}
 	}
 	return ret;
 }
-- 
2.20.1
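[Editorial note: for reference, the two call sites of the new signature
in this series, the existing hwpoison-inject caller changed above and the
virtio balloon caller added in the next patch, look like this:]

    /* hwpoison-inject: software-only poison, the kernel PTE never changed */
    unpoison_memory(pfn, false, "hwpoison-inject");

    /* virtio-balloon: the hypervisor remapped the HVA, also repair the kernel PTE */
    unpoison_memory(page_to_pfn(page), true, "virtio-balloon");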
zhenwei pi
2022-May-20 07:06 UTC
[PATCH 3/3] virtio_balloon: Introduce memory recover
Introduce a new queue, the 'recover VQ'. This allows the guest to recover
a hardware corrupted page:

        Guest        5.MF -> 6.RVQ FE        10.Unpoison page
                        /           \               /
 -------------------+-------------+----------+-----------
                     |             |          |
                   4.MCE       7.RVQ BE   9.RVQ Event
        QEMU        /              \         /
                3.SIGBUS        8.Remap
                   /
 ----------------+------------------------------------
                  |
                  +--2.MF
        Host     /
              1.HW error

The workflow:
1, Hardware page error occurs randomly.
2, The host side handles the corrupted page by the Memory Failure
   mechanism and sends SIGBUS to the user process if early-kill is
   enabled.
3, QEMU handles the SIGBUS; if the address belongs to guest RAM, then:
4, QEMU tries to inject an MCE into the guest.
5, The guest handles the memory failure again.

1-5 have already been supported for a long time; the next steps are
supported by this patch (together with the related QEMU patches):

6, The guest balloon driver is notified of the corrupted PFN and sends a
   request to the host side via the Recover VQ FrontEnd.
7, QEMU handles the request from the Recover VQ BackEnd, then:
8, QEMU remaps the corrupted HVA to fix the memory failure, then:
9, QEMU acks the result to the guest side via the Recover VQ.
10, The guest unpoisons the page if the corrupted page was recovered
    successfully.

Then the guest fixes the HW page error dynamically without rebooting.

Emulating an MCE with QEMU, the guest works fine:
 mce: [Hardware Error]: Machine check events logged
 Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
 virtio_balloon virtio5: recovered pfn 0x61646
 Unpoison: Unpoisoned page 0x61646 by virtio-balloon
 MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010

The 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.

Signed-off-by: zhenwei pi <pizhenwei at bytedance.com>
---
 drivers/virtio/virtio_balloon.c     | 243 ++++++++++++++++++++++++++++
 include/uapi/linux/virtio_balloon.h |  16 ++
 2 files changed, 259 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f4c34a2a6b8e..f9d95d1d8a4d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -52,6 +52,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
 	VIRTIO_BALLOON_VQ_REPORTING,
+	VIRTIO_BALLOON_VQ_RECOVER,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
 	VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
 };
 
+/* the request body to communicate with the host side */
+struct __virtio_balloon_recover {
+	struct virtio_balloon_recover vbr;
+	__virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
@@ -126,6 +133,16 @@ struct virtio_balloon {
 	/* Free page reporting device */
 	struct virtqueue *reporting_vq;
 	struct page_reporting_dev_info pr_dev_info;
+
+	/* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
+	struct virtqueue *recover_vq;
+	spinlock_t recover_vq_lock;
+	struct notifier_block memory_failure_nb;
+	struct list_head corrupted_page_list;
+	struct list_head recovered_page_list;
+	spinlock_t recover_page_list_lock;
+	struct __virtio_balloon_recover in_vbr;
+	struct work_struct unpoison_memory_work;
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+/*
+ * virtballoon_memory_failure - notified by memory failure, try to fix the
+ *                              corrupted page.
+ * The memory failure notifier is designed to call back only when the kernel
+ * handled the page successfully; WARN_ON_ONCE on the unlikely conditions to
+ * find out any error (memory error handling is a best effort, not 100%
+ * covered).
+ */
+static int virtballoon_memory_failure(struct notifier_block *notifier,
+				      unsigned long pfn, void *parm)
+{
+	struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
+						 memory_failure_nb);
+	struct page *page;
+	struct __virtio_balloon_recover *out_vbr;
+	struct scatterlist sg;
+	unsigned long flags;
+	int err;
+
+	page = pfn_to_online_page(pfn);
+	if (WARN_ON_ONCE(!page))
+		return NOTIFY_DONE;
+
+	if (PageHuge(page))
+		return NOTIFY_DONE;
+
+	if (WARN_ON_ONCE(!PageHWPoison(page)))
+		return NOTIFY_DONE;
+
+	if (WARN_ON_ONCE(page_count(page) != 1))
+		return NOTIFY_DONE;
+
+	get_page(page); /* balloon reference */
+
+	out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
+	if (WARN_ON_ONCE(!out_vbr))
+		return NOTIFY_BAD;
+
+	spin_lock(&vb->recover_page_list_lock);
+	balloon_page_push(&vb->corrupted_page_list, page);
+	spin_unlock(&vb->recover_page_list_lock);
+
+	out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;
+	set_page_pfns(vb, out_vbr->pfns, page);
+	sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
+
+	spin_lock_irqsave(&vb->recover_vq_lock, flags);
+	err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
+	if (unlikely(err)) {
+		spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+		return NOTIFY_DONE;
+	}
+	virtqueue_kick(vb->recover_vq);
+	spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+
+	return NOTIFY_OK;
+}
+
+static int recover_vq_get_response(struct virtio_balloon *vb)
+{
+	struct __virtio_balloon_recover *in_vbr;
+	struct scatterlist sg;
+	unsigned long flags;
+	int err;
+
+	spin_lock_irqsave(&vb->recover_vq_lock, flags);
+	in_vbr = &vb->in_vbr;
+	memset(in_vbr, 0x00, sizeof(*in_vbr));
+	sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
+	err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
+	if (unlikely(err)) {
+		spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+		return err;
+	}
+
+	virtqueue_kick(vb->recover_vq);
+	spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+
+	return 0;
+}
+
+static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned int len)
+{
+	struct __virtio_balloon_recover *in_vbr;
+	struct virtio_balloon_recover *vbr;
+	struct page *page;
+	unsigned int pfns;
+	u32 pfn0, pfn1;
+	__u8 status;
+
+	/* the response is not expected */
+	if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
+		return;
+
+	in_vbr = &vb->in_vbr;
+	vbr = &in_vbr->vbr;
+	if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
+		return;
+
+	/* make sure the balloon PFNs are contiguous */
+	for (pfns = 1; pfns < VIRTIO_BALLOON_PAGES_PER_PAGE; pfns++) {
+		pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns - 1]);
+		pfn1 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns]);
+		if (pfn1 - pfn0 != 1)
+			return;
+	}
+
+	pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[0]);
+	if (!pfn_valid(pfn0))
+		return;
+
+	pfn1 = -1;
+	spin_lock(&vb->recover_page_list_lock);
+	list_for_each_entry(page, &vb->corrupted_page_list, lru) {
+		pfn1 = page_to_pfn(page);
+		if (pfn1 == pfn0)
+			break;
+	}
+	spin_unlock(&vb->recover_page_list_lock);
+
+	status = vbr->status;
+	switch (status) {
+	case VIRTIO_BALLOON_R_STATUS_RECOVERED:
+		if (pfn1 == pfn0) {
+			spin_lock(&vb->recover_page_list_lock);
+			list_del(&page->lru);
+			balloon_page_push(&vb->recovered_page_list, page);
+			spin_unlock(&vb->recover_page_list_lock);
+			queue_work(system_freezable_wq, &vb->unpoison_memory_work);
+			dev_info_ratelimited(&vb->vdev->dev, "recovered pfn 0x%x", pfn0);
+		}
+		break;
+	case VIRTIO_BALLOON_R_STATUS_FAILED:
+		/* the hypervisor can't fix this corrupted page, balloon puts page */
+		if (pfn1 == pfn0) {
+			spin_lock(&vb->recover_page_list_lock);
+			list_del(&page->lru);
+			spin_unlock(&vb->recover_page_list_lock);
+			put_page(page);
+			dev_info_ratelimited(&vb->vdev->dev, "failed to recover pfn 0x%x", pfn0);
+		}
+	default:
+		break;
+	}
+
+	/* continue to get responses from the host side once this response gets handled successfully */
+	recover_vq_get_response(vb);
+}
+
+static void unpoison_memory_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+	struct page *page;
+
+	vb = container_of(work, struct virtio_balloon, unpoison_memory_work);
+
+	do {
+		spin_lock(&vb->recover_page_list_lock);
+		page = list_first_entry_or_null(&vb->recovered_page_list,
+						struct page, lru);
+		if (page)
+			list_del(&page->lru);
+		spin_unlock(&vb->recover_page_list_lock);
+
+		if (page) {
+			put_page(page);
+			unpoison_memory(page_to_pfn(page), true, "virtio-balloon");
+		}
+	} while (page);
+}
+
+static void recover_vq_cb(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct __virtio_balloon_recover *vbr;
+	unsigned long flags;
+	unsigned int len;
+
+	spin_lock_irqsave(&vb->recover_vq_lock, flags);
+	do {
+		virtqueue_disable_cb(vq);
+		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
+			spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+			if (vbr == &vb->in_vbr)
+				recover_vq_handle_response(vb, len);
+			else
+				kfree(vbr); /* just free the memory for the out vbr request */
+			spin_lock_irqsave(&vb->recover_vq_lock, flags);
+		}
+	} while (!virtqueue_enable_cb(vq));
+	spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
 	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
@@ -515,6 +724,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
+	names[VIRTIO_BALLOON_VQ_RECOVER] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -531,6 +741,11 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER)) {
+		names[VIRTIO_BALLOON_VQ_RECOVER] = "recover_vq";
+		callbacks[VIRTIO_BALLOON_VQ_RECOVER] = recover_vq_cb;
+	}
+
 	err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
 			      callbacks, names, NULL);
 	if (err)
@@ -566,6 +781,9 @@ static int init_vqs(struct virtio_balloon *vb)
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
 		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER))
+		vb->recover_vq = vqs[VIRTIO_BALLOON_VQ_RECOVER];
+
 	return 0;
 }
 
@@ -1015,12 +1233,31 @@ static int virtballoon_probe(struct virtio_device *vdev)
 			goto out_unregister_oom;
 	}
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
+		err = recover_vq_get_response(vb);
+		if (err)
+			goto out_unregister_reporting;
+
+		vb->memory_failure_nb.notifier_call = virtballoon_memory_failure;
+		spin_lock_init(&vb->recover_page_list_lock);
+		spin_lock_init(&vb->recover_vq_lock);
+		INIT_LIST_HEAD(&vb->corrupted_page_list);
+		INIT_LIST_HEAD(&vb->recovered_page_list);
+		INIT_WORK(&vb->unpoison_memory_work, unpoison_memory_func);
+		err = register_memory_failure_notifier(&vb->memory_failure_nb);
+		if (err)
+			goto out_unregister_reporting;
+	}
+
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
 
+out_unregister_reporting:
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		page_reporting_unregister(&vb->pr_dev_info);
 out_unregister_oom:
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		unregister_oom_notifier(&vb->oom_nb);
@@ -1082,6 +1319,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
 		destroy_workqueue(vb->balloon_wq);
 	}
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
+		unregister_memory_failure_notifier(&vb->memory_failure_nb);
+		cancel_work_sync(&vb->unpoison_memory_work);
+	}
+
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -1147,6 +1389,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
 	VIRTIO_BALLOON_F_REPORTING,
+	VIRTIO_BALLOON_F_RECOVER,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..41d0ffa2fb54 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -37,6 +37,7 @@
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 #define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
+#define VIRTIO_BALLOON_F_RECOVER	6 /* Memory recover virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -59,6 +60,8 @@ struct virtio_balloon_config {
 	};
 	/* Stores PAGE_POISON if page poisoning is in use */
 	__le32 poison_val;
+	/* Number of hardware corrupted pages, guest read only */
+	__le32 corrupted_pages;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
@@ -116,4 +119,17 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+#define VIRTIO_BALLOON_R_CMD_RECOVER		0
+#define VIRTIO_BALLOON_R_CMD_RESPONSE		0x80
+
+#define VIRTIO_BALLOON_R_STATUS_CORRUPTED	0
+#define VIRTIO_BALLOON_R_STATUS_RECOVERED	1
+#define VIRTIO_BALLOON_R_STATUS_FAILED		2
+
+struct virtio_balloon_recover {
+	__u8 cmd;
+	__u8 status;
+	__u8 padding[6];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.20.1
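[Editorial note: the host side of this protocol is explicitly still under
development, so the following is only a conceptual sketch of how a VMM
could answer a VIRTIO_BALLOON_R_CMD_RECOVER request using the uapi
structures above. It is NOT the QEMU implementation from the
balloon-recover branch; gpa_to_hva() and send_response() are placeholders
for whatever the VMM actually provides, and anonymous guest RAM is
assumed.]

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    struct virtio_balloon_recover {
            uint8_t cmd;
            uint8_t status;
            uint8_t padding[6];
    };

    #define VIRTIO_BALLOON_R_CMD_RECOVER      0
    #define VIRTIO_BALLOON_R_CMD_RESPONSE     0x80
    #define VIRTIO_BALLOON_R_STATUS_RECOVERED 1
    #define VIRTIO_BALLOON_R_STATUS_FAILED    2

    /* placeholders for VMM-provided helpers */
    extern void *gpa_to_hva(uint64_t gpa);
    extern void send_response(struct virtio_balloon_recover *resp,
                              uint32_t *pfns, int npfns);

    static void handle_recover_request(struct virtio_balloon_recover *req,
                                       uint32_t *pfns, int npfns)
    {
            struct virtio_balloon_recover resp = {
                    .cmd = VIRTIO_BALLOON_R_CMD_RESPONSE,
            };
            uint64_t gpa = (uint64_t)pfns[0] << 12; /* VIRTIO_BALLOON_PFN_SHIFT */
            size_t len = (size_t)npfns << 12;
            void *hva = gpa_to_hva(gpa);

            if (req->cmd != VIRTIO_BALLOON_R_CMD_RECOVER || !hva)
                    return;

            /* replace the poisoned pages with fresh anonymous memory */
            if (mmap(hva, len, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) != MAP_FAILED)
                    resp.status = VIRTIO_BALLOON_R_STATUS_RECOVERED;
            else
                    resp.status = VIRTIO_BALLOON_R_STATUS_FAILED;

            /* echo the PFNs so the guest can match the response to its request */
            send_response(&resp, pfns, npfns);
    }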
David Hildenbrand
2022-May-24 18:59 UTC
[PATCH 0/3] recover hardware corrupted page by virtio balloon
On 20.05.22 09:06, zhenwei pi wrote:
> Hi,
>
> I'm trying to recover hardware corrupted pages by virtio balloon, the
> workflow of this feature looks like this:
>
> [full cover letter quoted - snipped]

Hi,

I'm still on vacation this week, I'll try to have a look when I'm back
(and flushed out my overflowing inbox :D).

-- 
Thanks,

David / dhildenb
zhenwei pi
2022-May-27 03:47 UTC
[PATCH 0/3] recover hardware corrupted page by virtio balloon
Hi, Andrew & Naoya,

I would appreciate it if you could give me any hint about the
mm/memory-failure changes!

On 5/20/22 15:06, zhenwei pi wrote:
> Hi,
>
> I'm trying to recover hardware corrupted pages by virtio balloon, the
> workflow of this feature looks like this:
>
> [full cover letter quoted - snipped]

-- 
zhenwei pi