David Hildenbrand
2022-May-30 07:41 UTC
[PATCH 0/3] recover hardware corrupted page by virtio balloon
On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here. The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
>
> I'm maintaining 1000,000+ virtual machines, from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case
> in the past years. The memory also has no obvious performance drop after
> hitting UE.
>
> I hit several CE storm case, the performance memory drops a lot. But I
> can't find obvious relationship between UE and CE.
>
> So from the point of my view, to fix the corrupted page for VM seems
> good enough. And yes, unpoisoning several pages does not help
> significantly, but it is still a chance to make the virtualization better.
>

I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?

I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

-- 
Thanks,

David / dhildenb
zhenwei pi
2022-May-30 11:33 UTC
[PATCH 0/3] recover hardware corrupted page by virtio balloon
On 5/30/22 15:41, David Hildenbrand wrote:
> On 27.05.22 08:32, zhenwei pi wrote:
>> On 5/27/22 02:37, Peter Xu wrote:
>>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>>> The hypervisor _must_ emulate poisons identified in guest physical
>>>> address space (could be transported from the source VM), this is to
>>>> prevent silent data corruption in the guest. With a paravirtual
>>>> approach like this patch series, the hypervisor can clear some of the
>>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>>> poisoned page. I wonder how much value it provides to the guest if the
>>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>>> worth of memory.
>>>
>>> I'm curious the same on how unpoisoning could help here. The reasoning
>>> behind would be great material to be mentioned in the next cover letter.
>>>
>>> Shouldn't we consider migrating serious workloads off the host already
>>> where there's a sign of more severe hardware issues, instead?
>>>
>>> Thanks,
>>>
>>
>> I'm maintaining 1000,000+ virtual machines, from my experience:
>> UE is quite unusual and occurs randomly, and I did not hit UE storm case
>> in the past years. The memory also has no obvious performance drop after
>> hitting UE.
>>
>> I hit several CE storm case, the performance memory drops a lot. But I
>> can't find obvious relationship between UE and CE.
>>
>> So from the point of my view, to fix the corrupted page for VM seems
>> good enough. And yes, unpoisoning several pages does not help
>> significantly, but it is still a chance to make the virtualization better.
>>
>
> I'm curious why we should care about resurrecting a handful of poisoned
> pages in a VM. The cover letter doesn't touch on that.
>
> IOW, I'm missing the motivation why we should add additional
> code+complexity to unpoison pages at all.
>
> If we're talking about individual 4k pages, it's certainly sub-optimal,
> but does it matter in practice? I could understand if we're losing
> megabytes of memory. But then, I assume the workload might be seriously
> harmed either way already?
>

Yes, resurrecting a handful of poisoned pages does not help
significantly. And, in some ways, it seems nice to have. :D

Consider a VM whose RAM is backed by 2M huge pages. Once an MCE (@HVAy in
[HVAx,HVAz)) occurs, the whole 2M range ([HVAx,HVAz)) becomes inaccessible
on the hypervisor side, but the guest poisons only the 4K page (@GPAy in
[GPAx, GPAz)). The guest may then hit another 511 MCEs (on [GPAx, GPAz)
except GPAy). This is the worst case, so I want to add
'__le32 corrupted_pages' to struct virtio_balloon_config. It is used in
the next step: by reporting 512 * 4K 'corrupted_pages' to the guest, the
guest has a chance to isolate the other 511 pages ahead of time. And
since the guest actually loses 2M, fixing 512*4K seems to help
significantly.

>
> I assume when talking about "the performance memory drops a lot", you
> imply that this patch set can mitigate that performance drop?
>
> But why do you see a performance drop? Because we might lose some
> possible THP candidates (in the host or the guest) and you want to plug
> those holes? I assume you'll see a performance drop simply because
> poisoning memory is expensive, including migrating pages around on CE.
>
> If you have some numbers to share, especially before/after this change,
> that would be great.
>

A CE storm leads to 2 problems that I have seen:
1, the memory bandwidth drops to 10%~20%, and the cycles per instruction
of the CPU increase a lot.
2, the THR interrupt (see /proc/interrupts) fires frequently, and the CPU
has to spend a lot of time handling IRQs.

But no corrupted page occurs. Migrating the VM to another healthy host
seems a good choice. This patch does not handle the CE storm case.

-- 
zhenwei pi
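
For readers following the 2M example, the arithmetic can be sketched in C.
This is only an illustration under stated assumptions, not code from the
patch series: it assumes 4K base pages, a 2M-aligned huge page, and that
the guest-physical range [GPAx, GPAz) is laid out the same way as the
host range. The helper names huge_page_start() and sibling_pages() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the example: 4K base pages inside a 2M huge page. */
#define BASE_PAGE_SIZE  UINT64_C(4096)
#define HUGE_PAGE_SIZE  (UINT64_C(2) * 1024 * 1024)
#define PAGES_PER_HUGE  (HUGE_PAGE_SIZE / BASE_PAGE_SIZE)   /* 512 */

/* Hypothetical helper: start of the 2M range containing addr (GPAx). */
static uint64_t huge_page_start(uint64_t addr)
{
    return addr & ~(HUGE_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: list every 4K page in the same 2M range as
 * 'poisoned', skipping the already-poisoned page itself (GPAy).
 * Writes the page addresses to out[] and returns how many were written.
 * The caller must provide room for PAGES_PER_HUGE - 1 entries.
 */
static size_t sibling_pages(uint64_t poisoned, uint64_t *out, size_t cap)
{
    uint64_t start = huge_page_start(poisoned);
    uint64_t bad = poisoned & ~(BASE_PAGE_SIZE - 1);
    size_t n = 0;

    for (uint64_t i = 0; i < PAGES_PER_HUGE; i++) {
        uint64_t page = start + i * BASE_PAGE_SIZE;

        if (page == bad)
            continue;           /* the guest has poisoned this one already */
        assert(n < cap);        /* guard against an undersized buffer */
        out[n++] = page;
    }
    return n;
}
```

With a 2M range, sibling_pages() yields the 511 remaining pages that the
guest could isolate ahead of time once it learns that the whole range is
gone, which is the 512 * 4K figure mentioned above.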