thr3ads.net - Linux Virtualization - [PATCH 0/3] recover hardware corrupted page by virtio balloon [May 2022]

David Hildenbrand

2022-May-30 07:41 UTC

[PATCH 0/3] recover hardware corrupted page by virtio balloon

On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here.  The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
> 
> I'm maintaining 1,000,000+ virtual machines, from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case 
> in the past years. The memory also has no obvious performance drop after 
> hitting UE.
> 
> I hit several CE storm case, the performance memory drops a lot. But I 
> can't find obvious relationship between UE and CE.
> 
> So from the point of my view, to fix the corrupted page for VM seems 
> good enough. And yes, unpoisoning several pages does not help 
> significantly, but it is still a chance to make the virtualization better.
> 
I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?

I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

-- 
Thanks,

David / dhildenb

zhenwei pi

2022-May-30 11:33 UTC

[PATCH 0/3] recover hardware corrupted page by virtio balloon

On 5/30/22 15:41, David Hildenbrand wrote:
> On 27.05.22 08:32, zhenwei pi wrote:
>> On 5/27/22 02:37, Peter Xu wrote:
>>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>>> The hypervisor _must_ emulate poisons identified in guest physical
>>>> address space (could be transported from the source VM), this is to
>>>> prevent silent data corruption in the guest. With a paravirtual
>>>> approach like this patch series, the hypervisor can clear some of the
>>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>>> poisoned page. I wonder how much value it provides to the guest if the
>>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>>> worth of memory.
>>>
>>> I'm curious the same on how unpoisoning could help here.  The reasoning
>>> behind would be great material to be mentioned in the next cover letter.
>>>
>>> Shouldn't we consider migrating serious workloads off the host already
>>> where there's a sign of more severe hardware issues, instead?
>>>
>>> Thanks,
>>>
>>
>> I'm maintaining 1,000,000+ virtual machines, from my experience:
>> UE is quite unusual and occurs randomly, and I did not hit UE storm case
>> in the past years. The memory also has no obvious performance drop after
>> hitting UE.
>>
>> I hit several CE storm case, the performance memory drops a lot. But I
>> can't find obvious relationship between UE and CE.
>>
>> So from the point of my view, to fix the corrupted page for VM seems
>> good enough. And yes, unpoisoning several pages does not help
>> significantly, but it is still a chance to make the virtualization better.
>>
> 
> I'm curious why we should care about resurrecting a handful of poisoned
> pages in a VM. The cover letter doesn't touch on that.
> 
> IOW, I'm missing the motivation why we should add additional
> code+complexity to unpoison pages at all.
> 
> If we're talking about individual 4k pages, it's certainly sub-optimal,
> but does it matter in practice? I could understand if we're losing
> megabytes of memory. But then, I assume the workload might be seriously
> harmed either way already?
> 
Yes, resurrecting a handful of poisoned pages does not help 
significantly. And, in some ways, it seems nice to have. :D

A VM's RAM is backed by 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz))
occurs, the whole 2M range ([HVAx,HVAz)) on the hypervisor becomes
inaccessible, but the guest poisons only 4K (@GPAy in [GPAx,GPAz)), so it
may hit another 511 MCEs ([GPAx,GPAz) except GPAy). This is the worst case,
so I want to add '__le32 corrupted_pages' to struct virtio_balloon_config.
It would be used in the next step: by reporting 512 * 4K 'corrupted_pages'
to the guest, the guest has a chance to isolate the other 511 pages ahead
of time. And since the guest actually loses 2M, fixing 512*4K seems to help
significantly.
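
As a rough illustration of the worst case described above (this is not code
from the patch series), the sketch below computes the 2M range that contains
a poisoned address and lists the other 511 4K pages in it. All names here
(huge_range_start, small_page_index, sibling_pages) are hypothetical, and the
sketch assumes the guest-physical range is 2M-aligned in the same way as the
host huge page, which need not hold for every memory layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_PAGE_SIZE 4096ULL                             /* 4K page */
#define HUGE_PAGE_SIZE  (2ULL * 1024 * 1024)                /* 2M huge page */
#define PAGES_PER_HUGE  (HUGE_PAGE_SIZE / SMALL_PAGE_SIZE)  /* 512 */

/* Start of the 2M-aligned range [x, z) that contains addr. */
static uint64_t huge_range_start(uint64_t addr)
{
    return addr & ~(HUGE_PAGE_SIZE - 1);
}

/* Index (0..511) of the 4K page containing addr inside its 2M range. */
static unsigned int small_page_index(uint64_t addr)
{
    return (unsigned int)((addr & (HUGE_PAGE_SIZE - 1)) / SMALL_PAGE_SIZE);
}

/*
 * Write the addresses of the other 4K pages in the same 2M range as
 * poisoned_addr into out[], stopping at cap entries.
 * Returns the number of entries written (511 if cap is large enough).
 */
static size_t sibling_pages(uint64_t poisoned_addr, uint64_t *out, size_t cap)
{
    uint64_t start = huge_range_start(poisoned_addr);
    unsigned int skip = small_page_index(poisoned_addr);
    size_t n = 0;

    assert(out != NULL);

    for (unsigned int i = 0; i < PAGES_PER_HUGE && n < cap; i++) {
        if (i == skip)
            continue;   /* the page the guest has already poisoned */
        out[n++] = start + (uint64_t)i * SMALL_PAGE_SIZE;
    }
    return n;
}
```

With a poisoned address of 0x40201000, huge_range_start() gives 0x40200000
and sibling_pages() returns the 511 remaining 4K pages, i.e. the pages the
guest could isolate ahead of time if it were told that the whole 2M is gone.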
> 
> I assume when talking about "the performance memory drops a lot", you
> imply that this patch set can mitigate that performance drop?
> 
> But why do you see a performance drop? Because we might lose some
> possible THP candidates (in the host or the guest) and you want to plug
> those holes? I assume you'll see a performance drop simply because
> poisoning memory is expensive, including migrating pages around on CE.
> 
> If you have some numbers to share, especially before/after this change,
> that would be great.
> 
The CE storm leads to 2 problems that I have seen:
1, the memory bandwidth drops to 10%~20%, and the CPU's cycles per
instruction increase a lot.
2, the THR interrupt (see /proc/interrupts) fires frequently, and the CPU
has to spend a lot of time handling IRQs.

But no corrupted page occurs. Migrating the VM to another healthy host seems
to be a good choice. This patch does not handle the CE storm case.
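
One way to watch for the THR symptom mentioned above is to sum the per-CPU
counters on the "THR:" line of /proc/interrupts and compare two samples taken
some seconds apart. The helper below is only a sketch: thr_line_total is a
hypothetical name, and it assumes the usual x86 layout of that line (a "THR:"
label, one count per CPU, then a text description).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counts on a "THR:" line taken from /proc/interrupts.
 * Returns -1 if the line is not the THR line.
 */
static long long thr_line_total(const char *line)
{
    long long total = 0;

    /* Skip the leading padding in front of the label. */
    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, "THR:", 4) != 0)
        return -1;
    line += 4;

    /* Consume numbers until the trailing description text is reached. */
    for (;;) {
        char *end;
        unsigned long long v = strtoull(line, &end, 10);

        if (end == line)
            break;      /* no more digits: description starts here */
        total += (long long)v;
        line = end;
    }
    return total;
}
```

Reading /proc/interrupts line by line with fgets(), passing each line to
thr_line_total(), and diffing the result between two samples gives a rough
THR interrupt rate for the host.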

-- 
zhenwei pi
