thr3ads.net - Virtualization - [PATCH 0/3] recover hardware corrupted page by virtio balloon [Jun 2022]

zhenwei pi

2022-Jun-01 02:17 UTC

[PATCH 0/3] recover hardware corrupted page by virtio balloon

On 5/31/22 12:08, Jue Wang wrote:
> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx at redhat.com> wrote:
>>
>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>> except GPAy). This is the worse case, so I want to add
>>>   '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>
>> It sounds hackish to teach a virtio device to assume one page will always
>> be poisoned in huge page granule.  That's only a limitation to host kernel
>> not virtio itself.
>>
>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>> be ready the assumption can go away, and that does sound like a better
>> approach towards this problem.
> 
> +1.
> 
> A hypervisor should always strive to minimize the guest memory loss.
> 
> The HugeTLB double mapping enlightened memory poisoning behavior (only
> poison 4K out of a 2MB huge page and 4K in guest) is a much better
> solution here. To be completely transparent, it's not _strictly_
> required to poison the page (whatever the granularity it is) on the
> host side, as long as the following are true:
> 
> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
> or in some other way) and prevented from being reused by other
> processes.
> 
> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
> poisoning is a good solution.
> 
>>
>>>
>>>>
>>>> I assume when talking about "the performance memory drops a lot", you
>>>> imply that this patch set can mitigate that performance drop?
>>>>
>>>> But why do you see a performance drop? Because we might lose some
>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>> does holes? I assume you'll see a performance drop simply because
>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>
>>>> If you have some numbers to share, especially before/after this change,
>>>> that would be great.
>>>>
>>>
>>> The CE storm leads 2 problems I have even seen:
>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>> instruction of CPU increases a lot.
>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>> lot time to handle IRQ.
>>
>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>> whether it's necessary to handle the interrupts that frequently.  When I
>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>
>> /*
>>   * The interrupt handler. This is called on every event.
>>   * Just call the poller directly to log any events.
>>   * This could in theory increase the threshold under high load,
>>   * but doesn't for now.
>>   */
>> static void intel_threshold_interrupt(void)
>>
>> I think that matches with what I was thinking..  I mean for 2) not sure
>> whether it can be seen as a CMCI problem and potentially can be optimized
>> by adjust the cmci threshold dynamically.
> 
> The CE storm caused performance drop is caused by the extra cycles
> spent by the ECC steps in memory controller, not in CMCI handling.
> This is observed in the Google fleet as well. A good solution is to
> monitor the CE rate closely in user space via /dev/mcelog and migrate
> all VMs to another host once the CE rate exceeds some threshold.
> 
> CMCI is a _background_ interrupt that is not handled in the process
> execution context and its handler is setup to switch to poll (1 / 5
> min) mode if there are more than ~ a dozen CEs reported via CMCI per
> second.
>>
>> --
>> Peter Xu
>>
Hi, Andrew, David, Naoya

According to the suggestions, I'd give up the improvement of memory 
failure on huge page in this series.

Is it worth recovering corrupted pages for the guest kernel? I'd follow 
your decision.

-- 
zhenwei pi
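
[Editor's sketch] The 2M/4K arithmetic quoted at the top of this message can be made concrete with a few lines of C. This is only an illustration, not code from the patch series: the helper names (huge_range_start, huge_range_end, sibling_pages) are hypothetical, and it assumes the guest-physical range [GPAx, GPAz) is 2M-aligned in the same way as the host huge page backing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_PAGE_SHIFT 12u                       /* 4K guest page */
#define HUGE_PAGE_SHIFT  21u                       /* 2M host huge page */
#define SMALL_PAGE_SIZE  ((uint64_t)1 << SMALL_PAGE_SHIFT)
#define HUGE_PAGE_SIZE   ((uint64_t)1 << HUGE_PAGE_SHIFT)
#define PAGES_PER_HUGE   ((size_t)(HUGE_PAGE_SIZE / SMALL_PAGE_SIZE)) /* 512 */

/* GPAx: start of the 2M-aligned range that contains gpa (assumed alignment). */
static uint64_t huge_range_start(uint64_t gpa)
{
    return gpa & ~(HUGE_PAGE_SIZE - 1);
}

/* GPAz: exclusive end of that range. */
static uint64_t huge_range_end(uint64_t gpa)
{
    return huge_range_start(gpa) + HUGE_PAGE_SIZE;
}

/*
 * Hypothetical helper: write the 4K page addresses in [GPAx, GPAz) other
 * than the page holding gpa_y into out[]. These are the "other 511 pages"
 * the guest could isolate ahead of time. Returns the number written.
 */
static size_t sibling_pages(uint64_t gpa_y, uint64_t *out, size_t cap)
{
    uint64_t poisoned = gpa_y & ~(SMALL_PAGE_SIZE - 1);
    uint64_t start = huge_range_start(gpa_y);
    size_t n = 0;

    assert(out != NULL && cap >= PAGES_PER_HUGE - 1);
    for (uint64_t p = start; p < start + HUGE_PAGE_SIZE; p += SMALL_PAGE_SIZE) {
        if (p != poisoned)
            out[n++] = p;
    }
    return n;
}
```

With gpa_y = 0x40201234, the range is [0x40200000, 0x40400000) and sibling_pages() yields 511 addresses, skipping only 0x40201000. That matches the "512 * 4K" figure: one page already poisoned by the guest plus 511 siblings.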

David Hildenbrand

2022-Jun-01 07:59 UTC

[PATCH 0/3] recover hardware corrupted page by virtio balloon

On 01.06.22 04:17, zhenwei pi wrote:
> On 5/31/22 12:08, Jue Wang wrote:
>> On Mon, May 30, 2022 at 8:49 AM Peter Xu <peterx at redhat.com> wrote:
>>>
>>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>>> except GPAy). This is the worse case, so I want to add
>>>>   '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>>
>>> It sounds hackish to teach a virtio device to assume one page will always
>>> be poisoned in huge page granule.  That's only a limitation to host kernel
>>> not virtio itself.
>>>
>>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>>> pages so hugetlb pages can be mapped in 4k with it.  It provides potential
>>> possibility to do page poisoning with huge pages in 4k too.  When that'll
>>> be ready the assumption can go away, and that does sound like a better
>>> approach towards this problem.
>>
>> +1.
>>
>> A hypervisor should always strive to minimize the guest memory loss.
>>
>> The HugeTLB double mapping enlightened memory poisoning behavior (only
>> poison 4K out of a 2MB huge page and 4K in guest) is a much better
>> solution here. To be completely transparent, it's not _strictly_
>> required to poison the page (whatever the granularity it is) on the
>> host side, as long as the following are true:
>>
>> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
>> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
>> or in some other way) and prevented from being reused by other
>> processes.
>>
>> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
>> poisoning is a good solution.
>>
>>>
>>>>
>>>>>
>>>>> I assume when talking about "the performance memory drops a lot", you
>>>>> imply that this patch set can mitigate that performance drop?
>>>>>
>>>>> But why do you see a performance drop? Because we might lose some
>>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>>> does holes? I assume you'll see a performance drop simply because
>>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>>
>>>>> If you have some numbers to share, especially before/after this change,
>>>>> that would be great.
>>>>>
>>>>
>>>> The CE storm leads 2 problems I have even seen:
>>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>>> instruction of CPU increases a lot.
>>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>>> lot time to handle IRQ.
>>>
>>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>>> whether it's necessary to handle the interrupts that frequently.  When I
>>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>>
>>> /*
>>>   * The interrupt handler. This is called on every event.
>>>   * Just call the poller directly to log any events.
>>>   * This could in theory increase the threshold under high load,
>>>   * but doesn't for now.
>>>   */
>>> static void intel_threshold_interrupt(void)
>>>
>>> I think that matches with what I was thinking..  I mean for 2) not sure
>>> whether it can be seen as a CMCI problem and potentially can be optimized
>>> by adjust the cmci threshold dynamically.
>>
>> The CE storm caused performance drop is caused by the extra cycles
>> spent by the ECC steps in memory controller, not in CMCI handling.
>> This is observed in the Google fleet as well. A good solution is to
>> monitor the CE rate closely in user space via /dev/mcelog and migrate
>> all VMs to another host once the CE rate exceeds some threshold.
>>
>> CMCI is a _background_ interrupt that is not handled in the process
>> execution context and its handler is setup to switch to poll (1 / 5
>> min) mode if there are more than ~ a dozen CEs reported via CMCI per
>> second.
>>>
>>> --
>>> Peter Xu
>>>
> 
> Hi, Andrew, David, Naoya
> 
> According to the suggestions, I'd give up the improvement of memory 
> failure on huge page in this series.
> 
> Is it worth recovering corrupted pages for the guest kernel? I'd follow
> your decision.
Well, as I said, I am not sure if we really need/want this for a handful
of 4k poisoned pages in a VM. As I suspected, doing so might primarily
be interesting for some sort of de-fragmentation (allow again a higher
order page to be placed at the affected PFNs), not because of the slight
reduction of available memory. A simple VM reboot would get the job
similarly done.

As the poisoning refcount code is already a bit shaky as I learned
recently in the context of memory offlining, I do wonder if we really
want to expose the unpoisoning code outside of debugfs (hwpoison) usage.

Interestingly, unpoison_memory() documents: "This is only done on the
software-level, so it only works for linux injected failures, not real
hardware failures" -- ehm?

-- 
Thanks,

David / dhildenb
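
[Editor's sketch] Jue Wang's suggestion quoted above (watch the CE rate from user space and migrate VMs once it crosses a threshold) amounts to a sliding-window rate check. The C below is one possible shape for that check, under stated assumptions: it does not read or parse /dev/mcelog, the struct and function names are hypothetical, the 60-second window is arbitrary, and timestamps are assumed to be monotonic seconds supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CE_WINDOW_SLOTS 60u   /* hypothetical window: one slot per second */

struct ce_rate_monitor {
    uint32_t slots[CE_WINDOW_SLOTS]; /* CE count per second, ring buffer */
    uint64_t last_sec;               /* most recent second seen */
    uint32_t threshold;              /* CEs per window that trigger action */
    bool started;
};

static void ce_monitor_init(struct ce_rate_monitor *m, uint32_t threshold)
{
    assert(m != NULL);
    *m = (struct ce_rate_monitor){ .threshold = threshold };
}

/* Sum of CEs recorded within the current window. */
static uint64_t ce_monitor_total(const struct ce_rate_monitor *m)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < CE_WINDOW_SLOTS; i++)
        sum += m->slots[i];
    return sum;
}

/*
 * Record `count` corrected errors observed at `now_sec` and report whether
 * the windowed total now exceeds the threshold (i.e. whether the caller
 * should consider migrating VMs off this host).
 */
static bool ce_monitor_record(struct ce_rate_monitor *m, uint64_t now_sec,
                              uint32_t count)
{
    if (!m->started) {
        m->started = true;
        m->last_sec = now_sec;
    } else if (now_sec > m->last_sec) {
        uint64_t gap = now_sec - m->last_sec;

        /* Expire every slot that has fallen out of the window. */
        if (gap > CE_WINDOW_SLOTS)
            gap = CE_WINDOW_SLOTS;
        for (uint64_t i = 1; i <= gap; i++)
            m->slots[(m->last_sec + i) % CE_WINDOW_SLOTS] = 0;
        m->last_sec = now_sec;
    }
    m->slots[now_sec % CE_WINDOW_SLOTS] += count;
    return ce_monitor_total(m) > m->threshold;
}
```

A caller would feed ce_monitor_record() from whatever source reports corrected errors and start migration when it returns true. The actual threshold is a policy decision that the thread leaves open ("some threshold").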
