Liran Rotenberg
2020-Jun-09 12:38 UTC
Permission to disk set wrong when restoring from memory snapshot?
Hi all, Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609> - Wake up from hibernation failed:internal error: unable to execute QEMU command 'cont': Failed to get "write" lock. In Ovirt/RHV there is a specific flow that prevents the VM from starting on the first host. The result is: 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock Is another process using the image [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]? 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid 177876 (<unknown process>) We have a rerun mechanism. Therefore the VM starts again on another host if available. On the second try the VM does succeed to run. After looking in the debug logs (attached in the bug), I have seen this on the first host: 2020-06-09 12:12:46.288+0000: 177879: debug : qemuSetupImageCgroupInternal:139 : Not updating cgroups for disk path '<null>', type: file 2020-06-09 12:12:46.288+0000: 177879: debug : qemuSetupImagePathCgroup:75 : Allow path /rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191 384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe, perms: rw *2020-06-09 12:12:46.288+0000: 177879: debug : qemuSetupImagePathCgroup:75 : Allow path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe, perms: r* And immediately after retry on the second host: 2020-06-09 12:13:01.839+0000: 15781: debug : qemuSetupImagePathCgroup:75 : Allow path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/4fd23357-60 47-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe, perms: rw Is it intended and might cause the QEMU error? Regards, Liran.
Peter Krempa
2020-Jun-09 12:46 UTC
Re: Permission to disk set wrong when restoring from memory snapshot?
On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:> Hi all, > Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609> > - Wake up from hibernation failed:internal error: unable to execute QEMU > command 'cont': Failed to get "write" lock. > > In Ovirt/RHV there is a specific flow that prevents the VM from starting on > the first host. > The result is: > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock > Is another process using the image > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]? > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid > 177876 (<unknown process>)This error comes from qemu's internal file locking. It usually means that there is another qemu or qemu-img which has the given image open. Is there anything which would access the image at that specific time or slightly around? It might be a race condition from something trying to modify the image perhaps combined with propagation of locks via NFS.
Liran Rotenberg
2020-Jun-09 13:00 UTC
Re: Permission to disk set wrong when restoring from memory snapshot?
On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:> On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote: > > Hi all, > > Passing on Bug 1840609 < > https://bugzilla.redhat.com/show_bug.cgi?id=1840609> > > - Wake up from hibernation failed:internal error: unable to execute QEMU > > command 'cont': Failed to get "write" lock. > > > > In Ovirt/RHV there is a specific flow that prevents the VM from starting > on > > the first host. > > The result is: > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock > > Is another process using the image > > > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]? > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid > > 177876 (<unknown process>) > > This error comes from qemu's internal file locking. It usually means > that there is another qemu or qemu-img which has the given image open. > > Is there anything which would access the image at that specific time or > slightly around? >I don't think so. The volumes are created and added to the volume chain on the VM metadata domxml(two snapshots created). Then the user restores the latest snapshot and deletes them (while the VM is down) - they are removed. The VM is set with the volume and going up, restoring the memory. The mounting place (in /rhev/data-center) points to the same disk and volume. On the first run I see the new place /rhev/data-center/<uuid>... that I can't tell why or where it comes from. It is set with 'rw', while the normal destination to the shared NFS is only with 'r'.> > It might be a race condition from something trying to modify the image > perhaps combined with propagation of locks via NFS. >I can't see any race from the RHV point of view. The strangest thing is why it works on the second host? In RHV after the VM is up we remove the memory disk, but that doesn't happen in this case, since the VM wasn't up.