On 05.02.20 10:49, Wang, Wei W wrote:> On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote: >>> >>> Not sure how TCG tracks the dirty bits. But In whatever >>> implementation, the hypervisor should have >> >> There is only a single bitmap for that purpose. (well, the one where KVM >> syncs to) >> >>> already dealt with the race between he current round and the previous >> round dirty recording. >>> (the race isn't brought by this feature essentially) >> >> It is guaranteed to work reliably without this feature as you only clear what >> *has been migrated*, > > Not "clear what has been migrated" (that skips nothing..) > Anyway, it's a hint used for optimization.Yes, an optimization that might easily lead to data corruption when the two bitmaps are either not in place or don't play along in that specific way (and I suspect this is the case under TCG). -- Thanks, David / dhildenb
On Wed, Feb 05, 2020 at 10:58:14AM +0100, David Hildenbrand wrote:> On 05.02.20 10:49, Wang, Wei W wrote: > > On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote: > >>> > >>> Not sure how TCG tracks the dirty bits. But In whatever > >>> implementation, the hypervisor should have > >> > >> There is only a single bitmap for that purpose. (well, the one where KVM > >> syncs to) > >> > >>> already dealt with the race between he current round and the previous > >> round dirty recording. > >>> (the race isn't brought by this feature essentially) > >> > >> It is guaranteed to work reliably without this feature as you only clear what > >> *has been migrated*, > > > > Not "clear what has been migrated" (that skips nothing..) > > Anyway, it's a hint used for optimization. > > Yes, an optimization that might easily lead to data corruption when the > two bitmaps are either not in place or don't play along in that specific > way (and I suspect this is the case under TCG).So I checked and TCG has two copies too. Each block has bmap used for migration and also dirty_memory where pages are marked dirty. See cpu_physical_memory_sync_dirty_bitmap. So from QEMU POV, there is a callback that tells balloon when it's safe to request hints. As that affects the bitmap, that must not happen in parallel with dirty bitmap handling. Sounds like a reasonable limitation. The hint can be useful outside migration, but in its current form needs to then be non-destructive. E.g. I can imaging userspace calling MADV_SOFT_OFFLINE on the hinted memory. Again a flag that tells guest it should wait until used could be a reasonable expension. If we stick to the shrinker it's actually implementable easily. With an OOM notifier - I'm not so sure ... And a big part of the problem is that after all this time the page hinting interfaces are still undocumented. Quite sad really :(> -- > Thanks, > > David / dhildenb
On 05.02.20 11:25, Michael S. Tsirkin wrote:> On Wed, Feb 05, 2020 at 10:58:14AM +0100, David Hildenbrand wrote: >> On 05.02.20 10:49, Wang, Wei W wrote: >>> On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote: >>>>> >>>>> Not sure how TCG tracks the dirty bits. But In whatever >>>>> implementation, the hypervisor should have >>>> >>>> There is only a single bitmap for that purpose. (well, the one where KVM >>>> syncs to) >>>> >>>>> already dealt with the race between he current round and the previous >>>> round dirty recording. >>>>> (the race isn't brought by this feature essentially) >>>> >>>> It is guaranteed to work reliably without this feature as you only clear what >>>> *has been migrated*, >>> >>> Not "clear what has been migrated" (that skips nothing..) >>> Anyway, it's a hint used for optimization. >> >> Yes, an optimization that might easily lead to data corruption when the >> two bitmaps are either not in place or don't play along in that specific >> way (and I suspect this is the case under TCG). > > So I checked and TCG has two copies too. > Each block has bmap used for migration and also dirty_memory > where pages are marked dirty. See cpu_physical_memory_sync_dirty_bitmap.qemu_guest_free_page_hint() works on block->bmap. cpu_physical_memory_set_dirty_range() works on ram_list.dirty_memory[i]. So you are right - sorry for the false alarm and thanks for verifying :) [...]> > Again a flag that tells guest it should wait until used > could be a reasonable expension. If we stick to the shrinker > it's actually implementable easily. With an OOM notifier - I'm not so > sure ...See my other mail. I think we should keep handling just as is and not overcomplicate things (especially in our implementation as you noted) Instead, maybe abstract the reporting feature.> > And a big part of the problem is that after all this time the page > hinting interfaces are still undocumented. Quite sad really :(Yes, that was the source of my confusion ... the double-bitmap thingy is non-obvious. And anybody who wants to implement that interface in a hypervisor has to be aware that the race I explained has to be avoided using e.g., two bitmaps and the sync. -- Thanks, David / dhildenb