thr3ads.net - libvirt users - [libvirt-users] Intermittent live migration hang with ceph RBD attached volume [Jun 2019]

Scott Sullivan
2019-Jun-21 15:16 UTC
[libvirt-users] Intermittent live migration hang with ceph RBD attached volume

Software in use:

*Source hypervisor:*
*Qemu:* stable-2.12 branch
*Libvirt:* v3.2-maint branch
*OS:* CentOS 6

*Destination hypervisor:*
*Qemu:* stable-2.12 branch
*Libvirt:* v4.9-maint branch
*OS:* CentOS 7

I'm experiencing an intermittent live migration hang of a virtual machine
(KVM) with a ceph RBD volume attached.

At a high level, when this happens the virtual machine is left in a paused
state (per virsh list) on both the source and destination hypervisors
indefinitely.

Here's the virsh command I am running on the source (where 10.30.76.66 is
the destination hypervisor):

virsh migrate --live --copy-storage-all --verbose --xml /root/live_migration.cfg test_vm qemu+ssh://10.30.76.66/system tcp://10.30.76.66

Here it is in "ps faux" while it's in the hung state:

root     10997  0.3  0.0 380632  6156 ?        Sl   12:24   0:26  \_ virsh migrate --live --copy-storage-all --verbose --xml /root/live_migration.cfg test_vm qemu+ssh://10.30.76.66/sys
root     10999  0.0  0.0  60024  4044 ?        S    12:24   0:00      \_ ssh 10.30.76.66 sh -c 'if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U

The only reason I'm using the `--xml` arg is so the auth information can be
updated for the new hypervisor (I set up a cephx user for each hypervisor).
Below is a diff between my normal XML config and the one I passed to the
`--xml` arg, to illustrate:

60,61c60,61
<       <auth username="source">
<               <secret type="ceph" uuid="d4a47178-ab90-404e-8f25-058148da8446"/>
---
>       <auth username="destination">
>               <secret type="ceph" uuid="72e9373d-7101-4a93-a7d2-6cce5ec1e6f1"/>
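
For anyone trying to reproduce this, one way a file like that could be generated is by rewriting the auth element in the dumped domain XML. This is only a minimal sketch, not necessarily how the file above was produced: the function name `rewrite_rbd_auth` is hypothetical, and it assumes the `<auth>` and `<secret>` elements are laid out exactly as in the diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of libvirt): rewrite the cephx user name and
# the libvirt secret UUID in domain XML read from stdin.
# Usage: rewrite_rbd_auth OLD_USER NEW_USER OLD_UUID NEW_UUID
rewrite_rbd_auth() {
    sed -e "s/<auth username=\"$1\">/<auth username=\"$2\">/" \
        -e "s/uuid=\"$3\"/uuid=\"$4\"/"
}

# Assumed usage on the source hypervisor (not run here):
#   virsh dumpxml --migratable test_vm \
#     | rewrite_rbd_auth source destination \
#         d4a47178-ab90-404e-8f25-058148da8446 \
#         72e9373d-7101-4a93-a7d2-6cce5ec1e6f1 \
#     > /root/live_migration.cfg
```

A plain text substitution is fragile if the XML formatting differs, so diffing the result against the original, as above, is a sensible check before passing it to `--xml`.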

The libvirt secret shown above is properly set up with good credentials
on both the source and destination hypervisors.

When this happens, I don't see anything logged on the destination
hypervisor in the libvirt log. However, in the source hypervisor's log, I do
see this:

2019-06-21 12:38:21.004+0000: 28400: warning : qemuDomainObjEnterMonitorInternal:3764 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous

Nothing else is logged in the libvirt log on either the source or the
destination. The actual `virsh migrate --live` command pasted above keeps
running while stuck in this state, and it just outputs "Migration: [100 %]"
over and over. If I strace the qemu process on the source, I see this
repeatedly:

ppoll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN}, {fd=4, events=POLLIN}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=35, events=0}, {fd=35, events=POLLIN}], 9, {0, 14960491}, NULL, 8) = 0 (Timeout)

Here are those fds:

[root@source ~]# ll /proc/31804/fd/{8,4,6,15,18,19,35}
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/15 -> socket:[931291]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/18 -> socket:[931295]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/19 -> socket:[931297]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/35 -> socket:[931306]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/4 -> [signalfd]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/6 -> [eventfd]
lrwx------ 1 qemu qemu 64 Jun 21 13:18 /proc/31804/fd/8 -> [eventfd]
[root@source ~]#
[root@source ~]# grep -E '(931291|931295|931297|931306)' /proc/net/tcp
   3: 00000000:170C 00000000:0000 0A 00000000:00000000 00:00000000 00000000   107        0 931295 1 ffff88043a27f840 99 0 0 10 -1
   4: 00000000:170D 00000000:0000 0A 00000000:00000000 00:00000000 00000000   107        0 931297 1 ffff88043a27f140 99 0 0 10 -1
[root@source ~]#
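
As a side note on reading that output: the port in a /proc/net/tcp address is hexadecimal, and the fourth column is the kernel's TCP state code, where 0A means LISTEN. A small sketch for decoding the two matching lines follows; the function name `decode_tcp_line` is made up for illustration and only the two state codes relevant here are mapped.

```shell
#!/bin/sh
# Hypothetical helper: decode the local-address and state columns of a
# /proc/net/tcp line.
# $1 = local address field (e.g. 00000000:170C), $2 = state field (e.g. 0A)
decode_tcp_line() {
    port_hex=${1##*:}                      # keep only the part after the colon
    port=$(printf '%d' "0x$port_hex")      # hex -> decimal
    case $2 in
        01) state=ESTABLISHED ;;
        0A) state=LISTEN ;;
        *)  state="state-$2" ;;            # other codes left undecoded
    esac
    echo "port=$port state=$state"
}

decode_tcp_line 00000000:170C 0A   # prints "port=5900 state=LISTEN"
decode_tcp_line 00000000:170D 0A   # prints "port=5901 state=LISTEN"
```

So fds 18 and 19 are listening sockets bound to 0.0.0.0 on ports 5900 and 5901, not established data connections. The other two inodes (931291 and 931306) did not match anything in /proc/net/tcp; they may be Unix-domain or IPv6 sockets, in which case they would appear in /proc/net/unix or /proc/net/tcp6 instead.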

Further, on the source, if I query the blockjobs status, it says no
blockjob is running:

[root@source ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 11    test_vm                         paused
[root@source ~]# virsh blockjob 11 vda
No current block job for vda
[root@source ~]#

The nc/ssh connection is also still OK in the hung state:

[root@source ~]# netstat -tuapn|grep \.66
tcp        0      0 10.30.76.48:48876           10.30.76.66:22              ESTABLISHED 10999/ssh
[root@source ~]#
root     10999  0.0  0.0  60024  4044 ?        S    12:24   0:00      \_ ssh 10.30.76.66 sh -c 'if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U /var/run/libvirt/libvirt-sock'

Here's the state of the migration on the source while it's stuck like this:

[root@source ~]# virsh qemu-monitor-command 11 '{"execute":"query-migrate"}'
{"return":{"status":"completed","setup-time":2,"downtime":2451,"total-time":3753,"ram":{"total":2114785280,"postcopy-requests":0,"dirty-sync-count":3,"page-size":4096,"remaining":0,"mbps":898.199209,"transferred":421345514,"duplicate":414940,"dirty-pages-rate":0,"skipped":0,"normal-bytes":416796672,"normal":101757}},"id":"libvirt-317"}
[root@source ~]#
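
The fields of interest in that reply are "status" (completed) plus "downtime" and "total-time", which QEMU reports in milliseconds. A rough sketch for pulling single fields out of the one-line reply when re-checking it repeatedly is below. The helper name `qmp_field` is hypothetical, and the sed pattern is a deliberately crude stand-in for a real JSON parser such as jq; it only handles simple string or number values on a single line.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a simple string/number field from
# single-line QMP JSON on stdin. If the key occurs more than once, the last
# occurrence wins. Not a general JSON parser.
# Usage: ... | qmp_field KEY
qmp_field() {
    sed -n "s/.*\"$1\":\"\{0,1\}\([^,\"}]*\).*/\1/p"
}

# Abbreviated copy of the reply shown above.
reply='{"return":{"status":"completed","setup-time":2,"downtime":2451,"total-time":3753},"id":"libvirt-317"}'

echo "$reply" | qmp_field status       # prints "completed"
echo "$reply" | qmp_field downtime     # prints "2451"
echo "$reply" | qmp_field total-time   # prints "3753"
```

Against a live domain the input would come from the same `virsh qemu-monitor-command 11 '{"execute":"query-migrate"}'` call shown above, piped into `qmp_field`.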

I'm unable to run the above command on the destination while it's in this
state, however, and get a lock error (which could perhaps be expected, since
it never cut over to the destination):

[root@destination ~]# virsh list
 Id   Name     State
-----------------------
 4    test_vm   paused
[root@destination ~]# virsh qemu-monitor-command 4 '{"execute":"query-migrate"}'
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
[root@destination ~]#


Does anyone have any pointers on other things I should check? Or was/is this
perhaps a known bug in the old v3.2-maint branch?

I haven't seen this when migrating between hosts with libvirt 4.9 on both
source and destination. However, the ones I have with the older 3.2 are
CentOS 6 based and aren't as easily upgraded to 4.9. If anyone has ideas
for patches I could look at porting to 3.2 to mitigate this, that would
also be welcome. I would also be interested in forcing the cutover in this
state if possible, though I suspect that isn't safe since the block job
isn't running while in this bad state.

Thanks in advance