>> 1. Guest allocates a page and sends it to the host.
>> 2. Shrinker gets active and releases that page again.
>> 3. Some user in the guest allocates and modifies that page. The dirty bit is
>> set in the hypervisor.
>
> The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.
>
>> 4. The host processes the request and clears the bit in the dirty bitmap.
>
> This clears the bit from the QEMU bitmap, and this page will not be sent in this round.
>
>> 5. The guest is stopped and the last set of dirty pages is migrated. The
>> modified page is not being migrated (because not marked dirty).
>
> When QEMU starts the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
> Then the modified page gets sent.

So, if you run a TCG guest and use it with free page reporting, the race
is possible?

So the correctness depends on two dirty bitmaps in the hypervisor and how
they interact. wow this is fragile.

Thanks for the info :)

--
Thanks,

David / dhildenb

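The five steps and the inline answers above can be replayed in a toy model. This is only a sketch under stated assumptions: page numbers are kept in Python sets instead of real bitmaps, and `TwoLevelDirtyLog`, `replay_steps` and all method names are hypothetical, not QEMU or KVM interfaces. It shows only the interaction described in the quoted replies: a guest write lands in the KVM-side bitmap, the hint clears the QEMU-side bitmap, and the sync at the start of the last round brings the bit back.

```python
class TwoLevelDirtyLog:
    """Toy model of the two bitmaps described above (all names are hypothetical)."""

    def __init__(self, initially_dirty):
        self.kvm = set()                  # stands in for KVM's bitmap
        self.qemu = set(initially_dirty)  # stands in for QEMU's bitmap

    def guest_write(self, page):
        # A guest write is recorded in the KVM-side bitmap only.
        self.kvm.add(page)

    def begin_round(self):
        # Start of a round: sync KVM's bits into QEMU's bitmap, then reset KVM's.
        self.qemu |= self.kvm
        self.kvm.clear()

    def process_free_page_hint(self, page):
        # The host clears the hinted page from the QEMU-side bitmap only.
        self.qemu.discard(page)

    def end_round(self):
        # Everything still marked in QEMU's bitmap is sent, then cleared.
        sent = sorted(self.qemu)
        self.qemu.clear()
        return sent


def replay_steps(page=7):
    log = TwoLevelDirtyLog(initially_dirty={page})
    log.begin_round()                 # some round is in progress
    # Steps 1-2: the guest reports the page as free, then gets it back.
    log.guest_write(page)             # step 3: page is reused and modified
    log.process_free_page_hint(page)  # step 4: host handles the stale hint
    this_round = log.end_round()
    log.begin_round()                 # step 5: last round syncs from KVM first
    last_round = log.end_round()
    return this_round, last_round
```

In this model `replay_steps()` skips the page in the round where the hint is processed and sends it in the last round, which matches the quoted explanation. It says nothing about setups that do not have a second bitmap.
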
On Wednesday, February 5, 2020 5:23 PM, David Hildenbrand wrote:
> So, if you run a TCG guest and use it with free page reporting, the race is
> possible? So the correctness depends on two dirty bitmaps in the hypervisor
> and how they interact. wow this is fragile.
>

Not sure how TCG tracks the dirty bits. But in whatever implementation, the
hypervisor should have already dealt with the race between the current round
and the previous round dirty recording.
(the race isn't brought by this feature essentially)

Best,
Wei

On Wed, Feb 05, 2020 at 10:22:34AM +0100, David Hildenbrand wrote:
> >> 1. Guest allocates a page and sends it to the host.
> >> 2. Shrinker gets active and releases that page again.
> >> 3. Some user in the guest allocates and modifies that page. The dirty bit is
> >> set in the hypervisor.
> >
> > The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.
> >
> >> 4. The host processes the request and clears the bit in the dirty bitmap.
> >
> > This clears the bit from the QEMU bitmap, and this page will not be sent in this round.
> >
> >> 5. The guest is stopped and the last set of dirty pages is migrated. The
> >> modified page is not being migrated (because not marked dirty).
> >
> > When QEMU starts the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
> > Then the modified page gets sent.
>
> So, if you run a TCG guest and use it with free page reporting, the race
> is possible?

I'd have to look at the implementation, but the basic idea is not KVM
specific. The idea is that the hypervisor can detect that 3 happened
after 1 by creating a copy of the dirty bitmap when the request is sent
to the guest.

> So the correctness depends on two dirty bitmaps in the
> hypervisor and how they interact. wow this is fragile.
>
> Thanks for the info :)

It's pretty fragile, and the annoying part is we do not actually benefit
from this at all, since it all only triggers in the shrinker corner case.

The original idea was that we can send any hint to the hypervisor and
reuse the page immediately without waiting for the hint to be seen. That
seemed worth having, as a means to minimize the impact of hinting.

Then we dropped that and switched to OOM, and there not having to wait
also seemed like a worthwhile thing.

In the end we switched to the shrinker, where we can wait if we like,
and many guests never even hit the shrinker, so we have sacrificed
robustness for nothing.

If we go back to OOM then at least it's justified ..

> --
> Thanks,
>
> David / dhildenb

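One possible reading of the "copy of the dirty bitmap" idea in the message above, written out as a sketch. This is an assumption about the general scheme, not a description of any actual hypervisor code: `pages_to_send` and its parameter names are hypothetical, and page numbers are held in Python sets rather than bitmaps.

```python
def pages_to_send(snapshot, hinted_free, dirtied_since_request):
    """Return the pages that still need migrating (all arguments are sets).

    snapshot: copy of the dirty bitmap taken when the request went to the guest
    hinted_free: pages the guest reported as free in response
    dirtied_since_request: pages written after the snapshot was taken
    """
    # Hints may only cancel dirtiness that predates the request ...
    skippable = snapshot & hinted_free
    # ... while anything written afterwards is sent regardless of the hint.
    return (snapshot - skippable) | dirtied_since_request
```

For example, with `snapshot={1, 2, 7}` and `hinted_free={7}`, page 7 is skipped if nothing wrote to it after the request, but is sent again if it also appears in `dirtied_since_request`.
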
On 05.02.20 10:35, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 5:23 PM, David Hildenbrand wrote:
>> So, if you run a TCG guest and use it with free page reporting, the race is
>> possible? So the correctness depends on two dirty bitmaps in the hypervisor
>> and how they interact. wow this is fragile.
>>
>
> Not sure how TCG tracks the dirty bits. But in whatever implementation, the hypervisor should have

There is only a single bitmap for that purpose. (well, the one where KVM
syncs to)

> already dealt with the race between the current round and the previous round dirty recording.
> (the race isn't brought by this feature essentially)

It is guaranteed to work reliably without this feature, as you only clear
what *has been migrated*, not what your guest thinks should not be
migrated at one point and decides differently at another point. The race
is brought about by this feature.

--
Thanks,

David / dhildenb
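The concern about a single bitmap can be illustrated with a small sketch. It is a toy model under stated assumptions, not a claim about how TCG or QEMU actually track dirty pages: `final_round_single_bitmap` and its `apply_hint` flag are hypothetical names, and the bitmap is a Python set of page numbers.

```python
def final_round_single_bitmap(page=7, apply_hint=True):
    """Replay steps 1-5 against one dirty bitmap and return what the last round sends."""
    dirty = set()            # the only dirty bitmap in this toy model
    # Steps 1-2: page reported free, then released back to the guest.
    dirty.add(page)          # step 3: guest reuses and modifies the page
    if apply_hint:
        dirty.discard(page)  # step 4: host processes the stale hint
    return sorted(dirty)     # step 5: pages the last round would send
```

With `apply_hint=True` the write from step 3 is lost and the last round sends nothing for that page. With `apply_hint=False`, where bits are only cleared for pages that have actually been migrated, the page is still marked dirty and gets sent.
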