Liran Rotenberg
2020-Jun-09 13:00 UTC
Re: Permission to disk set wrong when restoring from memory snapshot?
On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:
> On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> > Hi all,
> > Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> > - Wake up from hibernation failed: internal error: unable to execute QEMU
> > command 'cont': Failed to get "write" lock.
> >
> > In Ovirt/RHV there is a specific flow that prevents the VM from starting on
> > the first host.
> > The result is:
> > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> > Is another process using the image
> > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid
> > 177876 (<unknown process>)
>
> This error comes from qemu's internal file locking. It usually means
> that there is another qemu or qemu-img which has the given image open.
>
> Is there anything which would access the image at that specific time or
> slightly around?

I don't think so. The volumes are created and added to the volume chain on
the VM metadata domxml (two snapshots created). Then the user restores the
latest snapshot and deletes them (while the VM is down) - they are removed.
The VM is set with the volume and going up, restoring the memory. The
mounting place (in /rhev/data-center) points to the same disk and volume.
On the first run I see the new place /rhev/data-center/<uuid>... that I
can't tell why or where it comes from. It is set with 'rw', while the
normal destination to the shared NFS is only with 'r'.

> It might be a race condition from something trying to modify the image
> perhaps combined with propagation of locks via NFS.

I can't see any race from the RHV point of view. The strangest thing is
why it works on the second host?
In RHV after the VM is up we remove the memory disk, but that doesn't
happen in this case, since the VM wasn't up.
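For context on the error being discussed: qemu guards its disk images with advisory byte-range locks on the image file, so 'Failed to get "write" lock' means some other file descriptor already holds a conflicting lock. A minimal sketch for probing that condition from the failing host, assuming Python is available there and with the image path left as a placeholder to be replaced by the path from the error message (taking and immediately releasing a whole-file POSIX lock does not disturb locks held by other processes):

import fcntl
import os

# Placeholder: substitute the full image path from the qemu error message.
IMAGE = "/rhev/data-center/.../7f553f44-db08-480e-8c86-cbdeccedfafe"

def write_lock_available(path):
    """Try to take (and immediately release) an exclusive POSIX lock.

    Returns False if another local process holds a conflicting lock on the
    file, which is roughly the condition qemu reports as
    'Failed to get "write" lock'.
    """
    fd = os.open(path, os.O_RDWR)   # a write lock needs a writable fd
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False            # conflicting lock held elsewhere
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)

print(write_lock_available(IMAGE))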
Peter Krempa
2020-Jun-09 13:14 UTC
Re: Permission to disk set wrong when restoring from memory snapshot?
On Tue, Jun 09, 2020 at 16:00:02 +0300, Liran Rotenberg wrote:
> On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:
>
> > On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> > > Hi all,
> > > Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> > > - Wake up from hibernation failed: internal error: unable to execute QEMU
> > > command 'cont': Failed to get "write" lock.
> > >
> > > In Ovirt/RHV there is a specific flow that prevents the VM from starting on
> > > the first host.
> > > The result is:
> > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> > > Is another process using the image
> > > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid
> > > 177876 (<unknown process>)
> >
> > This error comes from qemu's internal file locking. It usually means
> > that there is another qemu or qemu-img which has the given image open.
> >
> > Is there anything which would access the image at that specific time or
> > slightly around?
>
> I don't think so. The volumes are created and added to the volume chain on
> the VM metadata domxml (two snapshots created). Then the user restores the
> latest snapshot and deletes them (while the VM is down) - they are removed.
> The VM is set with the volume and going up, restoring the memory. The
> mounting place (in /rhev/data-center) points to the same disk and volume.
> On the first run I see the new place /rhev/data-center/<uuid>... that I
> can't tell why or where it comes from. It is set with 'rw', while the
> normal destination to the shared NFS is only with 'r'.

That's definitely something not related to the locking itself.

Please attach the XML document used to start the VM in both places
(working and non-working). I can't tell if there's a difference without
seeing those.

The difference might very well be in the XMLs.

> > It might be a race condition from something trying to modify the image
> > perhaps combined with propagation of locks via NFS.
>
> I can't see any race from the RHV point of view. The strangest thing is why
> it works on the second host?
> In RHV after the VM is up we remove the memory disk, but that doesn't
> happen in this case, since the VM wasn't up.

I'd speculate that it's timing. There's more time for the locks to
propagate or an offending process to terminate. The error message
definitely does not look like a permission issue, and
qemuSetupImagePathCgroup:75 is in the end a no-op on non-device
paths (makes sense only for LVM logical volumes, partitions etc.).
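One quick way to do the comparison Peter asks for is to pull each disk's source path and its readonly/shareable flags out of the two domain XML dumps. A rough sketch, assuming the working and non-working definitions have been saved under the hypothetical names domain-host1.xml and domain-host2.xml:

import xml.etree.ElementTree as ET

def disk_summary(xml_path):
    """Return (source path, readonly?, shareable?) for every <disk> element."""
    root = ET.parse(xml_path).getroot()
    rows = []
    for disk in root.findall("./devices/disk"):
        src = disk.find("source")
        path = None
        if src is not None:
            path = src.get("file") or src.get("dev")
        rows.append((path,
                     disk.find("readonly") is not None,
                     disk.find("shareable") is not None))
    return rows

# Hypothetical file names for the two dumps being compared.
for name in ("domain-host1.xml", "domain-host2.xml"):
    print(name)
    for row in disk_summary(name):
        print("   ", row)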
Liran Rotenberg
2020-Jun-09 13:41 UTC
Re: Permission to disk set wrong when restoring from memory snapshot?
On Tue, Jun 9, 2020 at 4:14 PM Peter Krempa <pkrempa@redhat.com> wrote:
> On Tue, Jun 09, 2020 at 16:00:02 +0300, Liran Rotenberg wrote:
> > On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:
> >
> > > On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> > > > Hi all,
> > > > Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> > > > - Wake up from hibernation failed: internal error: unable to execute QEMU
> > > > command 'cont': Failed to get "write" lock.
> > > >
> > > > In Ovirt/RHV there is a specific flow that prevents the VM from starting on
> > > > the first host.
> > > > The result is:
> > > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> > > > Is another process using the image
> > > > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> > > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid
> > > > 177876 (<unknown process>)
> > >
> > > This error comes from qemu's internal file locking. It usually means
> > > that there is another qemu or qemu-img which has the given image open.
> > >
> > > Is there anything which would access the image at that specific time or
> > > slightly around?
> >
> > I don't think so. The volumes are created and added to the volume chain on
> > the VM metadata domxml (two snapshots created). Then the user restores the
> > latest snapshot and deletes them (while the VM is down) - they are removed.
> > The VM is set with the volume and going up, restoring the memory. The
> > mounting place (in /rhev/data-center) points to the same disk and volume.
> > On the first run I see the new place /rhev/data-center/<uuid>... that I
> > can't tell why or where it comes from. It is set with 'rw', while the
> > normal destination to the shared NFS is only with 'r'.
>
> That's definitely something not related to the locking itself.
>
> Please attach the XML document used to start the VM in both places
> (working and non-working). I can't tell if there's a difference without
> seeing those.
>
> The difference might very well be in the XMLs.

I added a snippet from the engine log. The first two domxmls are from the
first attempt on 'host_mixed_1', resulting in:
"unable to execute QEMU command 'cont': Failed to get "write" lock."
From this point, it's the second attempt on 'host_mixed_2' (2020-06-09
15:13:00,308+03 in log time).

> > > It might be a race condition from something trying to modify the image
> > > perhaps combined with propagation of locks via NFS.
> >
> > I can't see any race from the RHV point of view. The strangest thing is why
> > it works on the second host?
> > In RHV after the VM is up we remove the memory disk, but that doesn't
> > happen in this case, since the VM wasn't up.
>
> I'd speculate that it's timing. There's more time for the locks to
> propagate or an offending process to terminate. The error message
> definitely does not look like a permission issue, and
> qemuSetupImagePathCgroup:75 is in the end a no-op on non-device
> paths (makes sense only for LVM logical volumes, partitions etc.).
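Given the timing theory above (a lingering process, or a lock that has not yet propagated over NFS), the local side of it can be checked by walking /proc/<pid>/fd on the host where the start failed. A small sketch with a placeholder image path; it only sees processes on the local host, not lock holders on other NFS clients, and it needs enough privilege to read other processes' fd directories:

import os

def local_openers(path):
    """List (pid, comm) of local processes that have `path` open."""
    target = os.path.realpath(path)
    found = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            for fd in os.listdir(fd_dir):
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    with open("/proc/%s/comm" % pid) as comm:
                        found.append((pid, comm.read().strip()))
                    break
        except OSError:
            continue  # process went away or we lack permission
    return found

# Placeholder: substitute the full image path from the error message.
print(local_openers("/rhev/data-center/.../7f553f44-db08-480e-8c86-cbdeccedfafe"))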