thr3ads.net - Linux Virtualization - DANGER WILL ROBINSON, DANGER [Oct 2019]

Paolo Bonzini

2019-Oct-02 20:10 UTC

DANGER WILL ROBINSON, DANGER

On 02/10/19 19:04, Jerome Glisse wrote:
> On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
>>>> If the mapping of the source VMA changes, mirroring can update the
>>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>>>> dismantles its own existing page tables (so that they can be recreated
>>>> with the new mapping from the source VMA)?
>>
>> The KVM inspector process is also (or can be) a QEMU that will have to
>> create its own KVM guest page table.  So if a page in the source VMA is
>> unmapped we want:
>>
>> - the source KVM to invalidate its guest page table (done by the KVM MMU
>> notifier)
>>
>> - the target VMA to be invalidated (easy using mirroring)
>>
>> - the target KVM to invalidate its guest page table, as a result of
>> invalidation of the target VMA
>
> You can do the target KVM invalidation inside the mirroring invalidation
> code.

Why should the source and target KVMs behave differently?  If the source
invalidates its guest page table via MMU notifiers, so should the target.

The KVM MMU notifier exists so that nothing (including mirroring) needs
to know that there is KVM on the other side.  Any interaction between
KVM page tables and VMAs must be mediated by MMU notifiers, anything
else is unacceptable.

If it is possible to invoke the MMU notifiers around the calls to
insert_pfn, that of course would be perfect.
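
As a rough userspace illustration of the point above (this is not kernel
code, and every name in it is hypothetical), a notifier chain lets the code
that changes a mapping invalidate a range without knowing who is subscribed.
KVM would then be just one subscriber among others:

```c
#include <stddef.h>

/* Hypothetical userspace model of a notifier chain. This is NOT the
 * kernel's struct mmu_notifier API, only the shape of the idea. */
struct range_notifier {
    void (*invalidate)(struct range_notifier *self,
                       unsigned long start, unsigned long end);
    struct range_notifier *next;
};

struct address_space_model {
    struct range_notifier *subscribers;
};

/* Add a subscriber to the front of the chain. */
static void notifier_register(struct address_space_model *as,
                              struct range_notifier *n)
{
    n->next = as->subscribers;
    as->subscribers = n;
}

/* Called by whoever changes the mapping (for example, mirroring code).
 * It does not know or care which subscribers exist. */
static void invalidate_range(struct address_space_model *as,
                             unsigned long start, unsigned long end)
{
    for (struct range_notifier *n = as->subscribers; n != NULL; n = n->next)
        n->invalidate(n, start, end);
}

/* One subscriber: stands in for KVM dropping its guest page tables. */
struct kvm_model {
    struct range_notifier notifier;   /* must stay the first member */
    unsigned long zapped_start, zapped_end;
    int zap_count;
};

static void kvm_model_invalidate(struct range_notifier *self,
                                 unsigned long start, unsigned long end)
{
    /* Valid because notifier is the first member of struct kvm_model. */
    struct kvm_model *kvm = (struct kvm_model *)self;

    kvm->zapped_start = start;
    kvm->zapped_end = end;
    kvm->zap_count++;
}
```

In this model the caller of invalidate_range() only ever touches the chain,
so a source-side and a target-side subscriber are handled by the same path.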

Thanks,

Paolo

Jerome Glisse

2019-Oct-03 15:42 UTC

On Wed, Oct 02, 2019 at 10:10:18PM +0200, Paolo Bonzini wrote:
> On 02/10/19 19:04, Jerome Glisse wrote:
> > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> >>>> If the mapping of the source VMA changes, mirroring can update the
> >>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >>>> dismantles its own existing page tables (so that they can be recreated
> >>>> with the new mapping from the source VMA)?
> >>
> >> The KVM inspector process is also (or can be) a QEMU that will have to
> >> create its own KVM guest page table.  So if a page in the source VMA is
> >> unmapped we want:
> >>
> >> - the source KVM to invalidate its guest page table (done by the KVM MMU
> >> notifier)
> >>
> >> - the target VMA to be invalidated (easy using mirroring)
> >>
> >> - the target KVM to invalidate its guest page table, as a result of
> >> invalidation of the target VMA
> >
> > You can do the target KVM invalidation inside the mirroring invalidation
> > code.
>
> Why should the source and target KVMs behave differently?  If the source
> invalidates its guest page table via MMU notifiers, so should the target.
>
> The KVM MMU notifier exists so that nothing (including mirroring) needs
> to know that there is KVM on the other side.  Any interaction between
> KVM page tables and VMAs must be mediated by MMU notifiers, anything
> else is unacceptable.
>
> If it is possible to invoke the MMU notifiers around the calls to
> insert_pfn, that of course would be perfect.

OK, and yes, you can do exactly that, i.e. inside the mmu notifier callback
from the target. For instance it is as easy as:
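    /* Pseudocode: [start, end) is the range being invalidated. */
    target_mirror_notifier_start_callback(start, end) {
        struct kvm_mirror_struct *kvmms = from_mmun(...);
        unsigned long target_foff, size;

        size = end - start;
        /* translate the address into an offset in the mirror file */
        target_foff = kvmms_convert_mirror_address(start);
        /* keep mirror faults out while the range is being zapped */
        take_lock(kvmms->mirror_fault_exclusion_lock);
        /* last argument is even_cows = 1: also zap private COW copies */
        unmap_mapping_range(kvmms->address_space, target_foff, size, 1);
        drop_lock(kvmms->mirror_fault_exclusion_lock);
    }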

All that is needed is to make sure that vm_normal_page() will see those
ptes (inside the process that is mirroring the other process) as special,
which is the case either because insert_pfn() marks the pte as special,
or because the kvm device driver, which controls the
vm_operations_struct, sets a find_special_page() callback that always
returns NULL, or because the vma has either VM_PFNMAP or VM_MIXEDMAP set
(which is the case with insert_pfn).

So you can keep the existing kvm code unmodified.

Cheers,
Jérôme

Paolo Bonzini

2019-Oct-03 15:50 UTC

On 03/10/19 17:42, Jerome Glisse wrote:
> All that is needed is to make sure that vm_normal_page() will see those
> ptes (inside the process that is mirroring the other process) as special,
> which is the case either because insert_pfn() marks the pte as special,
> or because the kvm device driver, which controls the
> vm_operations_struct, sets a find_special_page() callback that always
> returns NULL, or because the vma has either VM_PFNMAP or VM_MIXEDMAP set
> (which is the case with insert_pfn).
>
> So you can keep the existing kvm code unmodified.

Great, thanks.  And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP,
so that should work.

Paolo

Jerome Glisse

2019-Oct-03 18:31 UTC

On Thu, Oct 03, 2019 at 04:42:20PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On 03/10/19 17:42, Jerome Glisse wrote:
> > > All that is needed is to make sure that vm_normal_page() will see
> > > those ptes (inside the process that is mirroring the other process)
> > > as special, which is the case either because insert_pfn() marks the
> > > pte as special, or because the kvm device driver, which controls the
> > > vm_operations_struct, sets a find_special_page() callback that
> > > always returns NULL, or because the vma has either VM_PFNMAP or
> > > VM_MIXEDMAP set (which is the case with insert_pfn).
> > >
> > > So you can keep the existing kvm code unmodified.
> >
> > Great, thanks.  And KVM is already able to handle
> > VM_PFNMAP/VM_MIXEDMAP, so that should work.
>
> This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the
> VM's system RAM.
> Will it have any side effects?

You do not set it on the anonymous vma but on the mmap of the kvm
device file; the resulting vma is under the control of the kvm device
file and is not an anonymous vma but a "device" special vma.

So in summary, the source qemu process has an anonymous vma (regular
libc malloc for instance). The introspector qemu process, which
mirrors the source qemu, uses mmap on /dev/kvm (assuming you can
reuse the kvm device file for this; otherwise you can introduce a
new kvm device file). The resulting mmap inside the introspector
qemu process is a vma which has vma->vm_file pointing to the kvm
device file and has VM_PFNMAP or VM_MIXEDMAP set (I think you want the
former). On architectures with ARCH_HAS_PTE_SPECIAL the pte will be
marked as special when using insert_pfn(); on other architectures you
can either rely on the VM_PFNMAP/VM_MIXEDMAP flag or set a specific
find_special_page() callback in vm_ops.

I am at a conference right now, but I will post an example of what
I mean next week.

Cheers,
Jérôme

Paolo Bonzini

2019-Oct-03 19:38 UTC

On 03/10/19 20:31, Jerome Glisse wrote:
> So in summary, the source qemu process has an anonymous vma (regular
> libc malloc for instance). The introspector qemu process, which
> mirrors the source qemu, uses mmap on /dev/kvm (assuming you can
> reuse the kvm device file for this; otherwise you can introduce a
> new kvm device file).

It should be a new device, something like /dev/kvmmem.  BitDefender's
RFC patches already have the right userspace API; that was not an issue.

Paolo

Paolo Bonzini

2019-Oct-04 11:46 UTC

On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote:
> I get it so far. I have a patch that does mirroring in a separate VMA.
> We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the
> source VMA in the other QEMU and is refreshed by the device MMU notifier.

So for example on the host you'd have a new ioctl on the kvm file
descriptor.  You pass a size and you get back a file descriptor for that
guest's physical memory, which is mmap-able up to the size you specified
in the ioctl.

In turn, the file descriptor would have ioctls to map/unmap ranges of
the guest memory into its mmap-able range.  Accessing an unmapped range
produces a SIGSEGV.
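
To make the intended semantics concrete, here is a minimal userspace analogy,
assuming a POSIX system. It is not the proposed KVM interface: an unlinked
temporary file stands in for guest memory, and plain mmap() calls stand in
for the hypothetical map/unmap ioctls. The window is reserved with PROT_NONE,
so touching a part that has not been mapped raises SIGSEGV, as described
above. All function names are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for guest memory: an unlinked temporary file of `size` bytes. */
static int backing_create(size_t size)
{
    char path[] = "/tmp/guestmem-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* the file lives on through fd */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Reserve the mmap-able window; every access faults until a range is mapped. */
static void *window_reserve(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* "map": expose backing bytes [src_off, src_off + len) at window + win_off.
 * All offsets and lengths must be multiples of the page size. */
static int window_map(void *window, size_t win_off, int fd, off_t src_off,
                      size_t len)
{
    void *p = mmap((char *)window + win_off, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, src_off);

    return p == MAP_FAILED ? -1 : 0;
}

/* "unmap": return the range to PROT_NONE so later accesses get SIGSEGV. */
static int window_unmap(void *window, size_t win_off, size_t len)
{
    void *p = mmap((char *)window + win_off, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}
```

A caller would reserve the window once, then map and unmap page-aligned
sections of it as needed; a store through a mapped section is visible in the
backing file because the mapping is MAP_SHARED.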

When asked via the QEMU monitor, QEMU will create the file descriptor
and pass it back via SCM_RIGHTS.  The management application can then
use it to hotplug memory into the destination...

> Create a new memslot based on the mirror VMA, hotplug it into the guest as
> new memory device (is this possible?) and have a guest-side driver allocate
> pages from that area.

... using the existing ivshmem device, whose BAR can be accessed and
mmap-ed from the guest via sysfs.  In other words, the hotplugging will
use the file descriptor returned by QEMU when creating the ivshmem device.
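
For reference, passing a file descriptor with SCM_RIGHTS over a Unix domain
socket looks roughly like the sketch below. These are the standard
sendmsg()/recvmsg() ancillary-data calls; the helper names send_fd and
recv_fd are made up here, and nothing in the sketch is specific to QEMU or
to the proposed device.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `fd` over the connected Unix socket `sock`. Returns 0 on success. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F';   /* one byte of ordinary data, as unix(7) advises */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces correct alignment */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from `sock`. Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file
as the sender's, which is what lets a management application mmap memory
that another process created.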

We then need an additional mechanism to invoke the map/unmap ioctls from
the guest.  Without writing a guest-side driver it is possible to:

- pass a socket into the "create guest physical memory view" ioctl
above.  KVM will then associate that KVMI socket with the newly created
file descriptor.

- use KVMI messages to that socket to map/unmap sections of memory

> Redirect (some) GFN->HVA translations into the new VMA based on a table
> of addresses required by the introspector process.

That would be tricky because there are multiple paths (gfn_to_page,
gfn_to_pfn, etc.).

There is some complication in this because the new device has to be
plumbed at multiple levels (KVM, QEMU, libvirt).  But it seems like a
very easily separated piece of code (except for the KVMI socket part,
which can be added later), so I suggest that you contribute the KVM
parts first.

Paolo
