thr3ads.net - libvirt users - [libvirt-users] Vm in state "in shutdown" [Aug 2019]

If this information is useful, please help other people find it:
Share via:

马昊骢 Ianmalcolm Ma

2019-Aug-05 11:25 UTC

[libvirt-users] Vm in state "in shutdown"

Description of problem:
libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel 3.10.0-693.21.1.el7.x86_64)
on Qemu version 2.10.0

I’m currently facing a strange situation. Sometimes my vm is shown by ‘virsh
list’ as in state “in shutdown” but there is no qemu-kvm process linked to it.

Libvirt log when “in shutdown” state occur is as follows:
“d470c3b284425b9bacb34d3b5f3845fe” is vm’s name, remoteDispatchDomainMemoryStats
API is called by ‘collectd’, which is used to collect some vm running states and
host information once every 30’s.

2019-07-25 14:23:58.706+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235
: type: POWERDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.413 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235
: type: SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.202 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235
: type: STOP, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235
: type: SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.629+0000: 15818: error : qemuMonitorIORead:597 : Unable to
read from monitor: Connection reset by peer
2019-07-25 14:23:59.629+0000: 121081: warning : qemuProcessEventHandler:4840 :
vm: d470c3b284425b9bacb34d3b5f3845fe, event: 6 locked
2019-07-25 14:23:59.629+0000: 15822: error : qemuMonitorJSONCommandWithFd:364 :
internal error: Missing monitor reply object
2019-07-25 14:24:29.483+0000: 15821: warning : qemuGetProcessInfo:1468 : cannot
parse process status data
2019-07-25 14:24:29.829+0000: 15823: warning :
qemuDomainObjBeginJobInternal:4391 : Cannot start job (modify, none) for domain
d470c3b284425b9bacb34d3b5f3845fe; current job is (query, none) owned by (15822
remoteDispatchDomainMemoryStats, 0 <null>) for (30s, 0s)
2019-07-25 14:24:29.829+0000: 15823: error : qemuDomainObjBeginJobInternal:4403
: Timed out during operation: cannot acquire state change lock (held by
remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning :
qemuDomainObjBeginJobInternal:4391 : Cannot start job (destroy, none) for domain
d470c3b284425b9bacb34d3b5f3845fe; current job is (query, none) owned by (15822
remoteDispatchDomainMemoryStats, 0 <null>) for (30s, 0s)
2019-07-25 14:24:29.829+0000: 121081: error : qemuDomainObjBeginJobInternal:4403
: Timed out during operation: cannot acquire state change lock (held by
remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning : qemuProcessEventHandler:4875 :
vm: d470c3b284425b9bacb34d3b5f3845fe, event: 6, cost 31.459 secs

I’ve tried to find out how did this problem happened. I analyzed the execution
process of the job and speculated that the problem occurred as follows:
step one: libvirt send command 'system_powerdown' to qemu.
step two: libvirt receive qemu monitor close event, and then deal EOF event.
step three: remoteDispatchDomainMemoryStats job start on the same vm.
step four: work thread dealing stop job is waited on job.cond, the timeout is
30s.

It seems that the remoteDispatchDomainMemoryStats job is too slow to let the
stop job wait timeout.

Then I tried to reproduce this process. The step is as follows:
First step:  add a sleep on ‘qemuProcessEventHandler' by using
pthread_cond_timedwait. So can execute 'virsh dommemstat active’ command at
this time interval.
Second step: start a vm to test
Third step: execute 'virsh shutdown active’ to shutdown the vm
Fourth step: execute 'virsh dommemstat active’ when stop job is sleep.

Then it works. The test vm state became to 'in shutdown’, and the libvirt
log is as follows:
“active” is my test vm’s name.

2019-08-05 08:39:57.001+0000: 25889: warning :
qemuDomainObjBeginJobInternal:4308 : Starting job: modify (vm=0x7f7bbc145fe0
name=active, current job=none async=none)
2019-08-05 08:39:57.003+0000: 25889: warning : qemuDomainObjEndJob:4522 :
Stopping job: modify (async=none vm=0x7f7bbc145fe0 name=active)
2019-08-05 08:39:57.003+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235
: type: POWERDOWN, vm: active, cost 0.008 secs
2019-08-05 08:39:57.854+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235
: type: SHUTDOWN, vm: active, cost 1.709 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235
: type: STOP, vm: active, cost 1.751 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235
: type: SHUTDOWN, vm: active, cost 1.751 secs
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:756 : Error on
monitor <null>
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:777 : Triggering
EOF callback
2019-08-05 08:39:57.915+0000: 26915: warning : qemuProcessEventHandler:4822 :
usleep 20s
2019-08-05 08:40:01.004+0000: 25886: warning :
qemuDomainObjBeginJobInternal:4308 : Starting job: query (vm=0x7f7bbc145fe0
name=active, current job=none async=none)
^@2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4845 :
vm=0x7f7bbc145fe0, event=6
2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4851 :
vm: active, event: 6 locked
2019-08-05 08:40:17.915+0000: 26915: warning :
qemuDomainObjBeginJobInternal:4308 : Starting job: destroy (vm=0x7f7bbc145fe0
name=active, current job=query async=none)
2019-08-05 08:40:17.915+0000: 26915: warning :
qemuDomainObjBeginJobInternal:4344 : Waiting for job (vm=0x7f7bbc145fe0
name=active), job 26915 qemuProcessEventHandler, owned by 25886
remoteDispatchDomainMemoryStats
^@^@^@2019-08-05 08:40:47.915+0000: 26915: warning :
qemuDomainObjBeginJobInternal:4405 : Cannot start job (destroy, none) for domain
active; current job is (query, none) owned by (25886
remoteDispatchDomainMemoryStats, 0 <null>) for (46s, 0s)
2019-08-05 08:40:47.915+0000: 26915: error : qemuDomainObjBeginJobInternal:4417
: 操作超时：cannot acquire state change lock (held by
remoteDispatchDomainMemoryStats)
2019-08-05 08:40:47.915+0000: 26915: warning : qemuProcessEventHandler:4886 :
vm: active, event: 6, cost 31.830 secs

The threads backtraces is as follows:
[Switching to thread 17 (Thread 0x7f12fddf5700 (LWP 32580))]
#2  0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0,
    msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
1075            if (virCondWait(&mon->notify, &mon->parent.lock)
< 0) {
(gdb) l
1070        PROBE(QEMU_MONITOR_SEND_MSG,
1071              "mon=%p msg=%s fd=%d",
1072              mon, mon->msg->txBuffer, mon->msg->txFD);
1073
1074        while (mon && mon->msg &&
!mon->msg->finished) {
1075            if (virCondWait(&mon->notify, &mon->parent.lock)
< 0) {
1076                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1077                               _("Unable to wait on monitor
condition"));
1078                goto cleanup;
1079            }
(gdb) p *msg
$2 = {txFD = -1,
  txBuffer = 0x7f12b8000d50
"{\"execute\":\"qom-list\",\"arguments\":{\"path\":\"/machine/peripheral\"},\"id\":\"libvirt-17\"}\r\n",
txOffset = 0,
  txLength = 85, rxBuffer = 0x0, rxLength = 0, rxObject = 0x0,
  finished = false, passwordHandler = 0x0, passwordOpaque = 0x0}
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f130d738316 in virCondWait (c=c@entry=0x7f12d0013b28,
    m=m@entry=0x7f12d0013b00) at util/virthread.c:154
#2  0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0,
    msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
#3  0x00007f12eb283b40 in qemuMonitorJSONCommandWithFd (
    mon=mon@entry=0x7f12d0013af0, cmd=cmd@entry=0x7f12b80008c0,
    scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f12fddf4780)
    at qemu/qemu_monitor_json.c:355
#4  0x00007f12eb2904b9 in qemuMonitorJSONCommand (reply=0x7f12fddf4780,
    cmd=0x7f12b80008c0, mon=0x7f12d0013af0) at qemu/qemu_monitor_json.c:385
...

I found that on function ‘qemuMonitorUpdateWatch’, if  mon->watch is zero,
then event would’t be update. But 'qemuMonitorSend'
would still wait on mon->notify. So the remoteDispatchDomainMemoryStats job
is blocked, and then the stop job blocked too.
618 static void
619 qemuMonitorUpdateWatch(qemuMonitorPtr mon)
620 {
621     int events 622         VIR_EVENT_HANDLE_HANGUP |
623         VIR_EVENT_HANDLE_ERROR;
624
625     if (!mon->watch)
626         return;
627
628     if (mon->lastError.code == VIR_ERR_OK) {
629         events |= VIR_EVENT_HANDLE_READABLE;
630
631         if ((mon->msg && mon->msg->txOffset <
mon->msg->txLength) &&
632             !mon->waitGreeting)
633             events |= VIR_EVENT_HANDLE_WRITABLE;
634     }
635
636     virEventUpdateHandle(mon->watch, events);
637 }

I try to fix this bugs by add a judgment before qemuMonitorUpdateWatch, and it
seem works.
1046 int
1047 qemuMonitorSend(qemuMonitorPtr mon,
1048                 qemuMonitorMessagePtr msg)
1049 {
1050     int ret = -1;
1051
1052     /* Check whether qemu quit unexpectedly */
1053     if (mon->lastError.code != VIR_ERR_OK) {
1054         VIR_DEBUG("Attempt to send command while error is set
%s",
1055                   NULLSTR(mon->lastError.message));
1056         virSetError(&mon->lastError);
1057         return -1;
1058     }
1059
1060     if (!mon->watch) {
1061         VIR_WARN("Attempt to send command while mon->watch is
zero");
1062         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1063                        _("attempt to send command when the monitor is
closing"));
1064         return -1;
1065     }
1066
1067     mon->msg = msg;
1068     qemuMonitorUpdateWatch(mon);
1069
1070     PROBE(QEMU_MONITOR_SEND_MSG,
1071           "mon=%p msg=%s fd=%d",
1072           mon, mon->msg->txBuffer, mon->msg->txFD);
1073
1074     while (mon && mon->msg && !mon->msg->finished)
{
1075         if (virCondWait(&mon->notify, &mon->parent.lock) <
0) {
1076             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1077                            _("Unable to wait on monitor
condition"));
1078             goto cleanup;
1079         }
1080     }
1081
1082     if (mon->lastError.code != VIR_ERR_OK) {
1083         VIR_DEBUG("Send command resulted in error %s",
1084                   NULLSTR(mon->lastError.message));
1085         virSetError(&mon->lastError);
1086         goto cleanup;
1087     }
1088
1089     ret = 0;
1090
1091  cleanup:
1092     mon->msg = NULL;
1093     qemuMonitorUpdateWatch(mon);
1094
1095     return ret;
1096 }

I’m not really sure that this change will affect other process.

Thanks!

Ma haocong

Michal Prívozník

2019-Aug-07 13:11 UTC

head link

Re: [libvirt-users] Vm in state "in shutdown"

On 8/5/19 1:25 PM, 马昊骢 Ianmalcolm Ma wrote:> Description of problem:
> libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel
3.10.0-693.21.1.el7.x86_64) on Qemu version 2.10.0
I vaguely recall a bug like this, but I don't know any more details.
Then again, libvirt-3.9.0 is 2 years old. So I suggest try to upgrade
libvirt and see if it helps. Alternatively, you may try to run git
bisect and find the commit that fixed the problem.

Michal

马昊骢 Ianmalcolm Ma

2019-Aug-09 12:19 UTC

head link

Re: [libvirt-users] Vm in state "in shutdown"

在 2019/8/7 下午9:11，“Michal Prívozník”<mprivozn@redhat.com> 写入:

   > On 8/5/19 1:25 PM, 马昊骢 Ianmalcolm Ma wrote:
   >> Description of problem:
   >> libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel
3.10.0-693.21.1.el7.x86_64) on Qemu version 2.10.0
    
   > I vaguely recall a bug like this, but I don't know any more details.
   > Then again, libvirt-3.9.0 is 2 years old. So I suggest try to upgrade
   > libvirt and see if it helps. Alternatively, you may try to run git
   > bisect and find the commit that fixed the problem.
    
   > Michal
   Thank you for your help!
   I have already found the report on Bugzilla named ‘Bug 1536461 - Libvirtd
hangs when shutting down domains’,
   which is the same as my problem. This bug have already fixed on patch named
‘Revert "qemu: monitor: do not
   report error on shutdown"’. I test it in my problem scenario and it
works fine.
   
   Thanks,
   Ma haocong

Reasonably Related Threads

Search for more possibly parallel threads

libvirt users - Aug 2019 - Vm in state "in shutdown"

[libvirt-users] Vm in state "in shutdown"

Re: [libvirt-users] Vm in state "in shutdown"

Re: [libvirt-users] Vm in state "in shutdown"

Reasonably Related Threads