Lentes, Bernd
2018-Dec-04 11:51 UTC
[libvirt-users] concurrent migration of several domains rarely fails
Hi,

I have a two-node cluster with several domains as resources. During testing I
tried several times to migrate some domains concurrently. Usually it succeeded,
but occasionally it failed. I found one clue in the log:

Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout

The domains are configured similarly:

primitive vm_geneious VirtualDomain \
        params config="/mnt/san/share/config.xml" \
        params hypervisor="qemu:///system" \
        params migration_transport=ssh \
        op start interval=0 timeout=120 trace_ra=1 \
        op stop interval=0 timeout=130 trace_ra=1 \
        op monitor interval=30 timeout=25 trace_ra=1 \
        op migrate_from interval=0 timeout=300 trace_ra=1 \
        op migrate_to interval=0 timeout=300 trace_ra=1 \
        meta allow-migrate=true target-role=Started is-managed=true \
        utilization cpu=2 hv_memory=8000

What is the algorithm for discovering the port used for live migration?
I have the impression that "params migration_transport=ssh" is worthless; port 22
isn't involved in the live migration. My experience is that TCP ports > 49151 are
used for the migration, but the exact procedure isn't clear to me.
Does live migration use TCP port 49152 for the first domain and one port higher
for each following domain, e.g. 49152, 49153 and 49154 for the concurrent live
migration of three domains?

Why does live migration of three domains usually succeed, although only ports
49152 and 49153 are open on both hosts? Is the migration not really concurrent,
but sometimes sequential?


Bernd

--
Bernd Lentes
Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
bernd.lentes@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

Those who make mistakes can learn something; those who do nothing learn nothing.
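[Editorial note on the port question, as a sketch based on the defaults of
libvirt's qemu driver rather than anything stated in this thread: the
destination libvirtd allocates the migration listen port from a configurable
range in /etc/libvirt/qemu.conf, handing out the lowest currently free port,
and releases it again as soon as that migration finishes. So partly
overlapping migrations can get by with fewer open ports than domains. A
minimal way to check this, assuming the stock configuration on both hosts:]

    # Show the configured migration port range; the libvirt defaults are
    # migration_port_min = 49152 and migration_port_max = 49215, and the
    # range can be widened if many domains migrate at the same time.
    grep -E '^#?migration_port_(min|max)' /etc/libvirt/qemu.conf

    # While migrations are in flight, the incoming qemu processes on the
    # destination host should show up listening on ports from that range
    # (run as root so -p can resolve process names):
    ss -tlnp | grep -i qemu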
Lentes, Bernd
2018-Dec-06 17:12 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
> Hi,
>
> I have a two-node cluster with several domains as resources. During testing I
> tried several times to migrate some domains concurrently. Usually it
> succeeded, but occasionally it failed. I found one clue in the log:
>
> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
> to keepalive timeout
>
> The domains are configured similarly:
>
> primitive vm_geneious VirtualDomain \
>         params config="/mnt/san/share/config.xml" \
>         params hypervisor="qemu:///system" \
>         params migration_transport=ssh \
>         op start interval=0 timeout=120 trace_ra=1 \
>         op stop interval=0 timeout=130 trace_ra=1 \
>         op monitor interval=30 timeout=25 trace_ra=1 \
>         op migrate_from interval=0 timeout=300 trace_ra=1 \
>         op migrate_to interval=0 timeout=300 trace_ra=1 \
>         meta allow-migrate=true target-role=Started is-managed=true \
>         utilization cpu=2 hv_memory=8000
>
> What is the algorithm for discovering the port used for live migration?
> I have the impression that "params migration_transport=ssh" is worthless;
> port 22 isn't involved in the live migration. My experience is that TCP
> ports > 49151 are used for the migration, but the exact procedure isn't
> clear to me.
> Does live migration use TCP port 49152 for the first domain and one port
> higher for each following domain, e.g. 49152, 49153 and 49154 for the
> concurrent live migration of three domains?
>
> Why does live migration of three domains usually succeed, although only
> ports 49152 and 49153 are open on both hosts? Is the migration not really
> concurrent, but sometimes sequential?
>
> Bernd

Hi,

I tried to narrow down the problem. My first assumption was that something with
the network between the hosts is not OK. I opened ports 49152 - 49172 in the
firewall - the problem persisted. So I deactivated the firewall on both nodes -
the problem persisted.

Then I wanted to exclude the HA cluster software (pacemaker). I unmanaged the
VirtualDomain resources in pacemaker and migrated them with virsh - the problem
persists.

I wrote a script to migrate three domains sequentially from host A to host B
and vice versa via virsh. I raised the log level of libvirtd and found
something in the log which may be the culprit.

This is the output of my script:

Thu Dec  6 17:02:53 CET 2018
migrate sim
Migration: [100 %]
Thu Dec  6 17:03:07 CET 2018
migrate geneious
Migration: [100 %]
Thu Dec  6 17:03:16 CET 2018
migrate mausdb
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed   <===== error !
Thu Dec  6 17:05:32 CET 2018   <======== time of error
Guests on ha-idg-1:
 Id    Name                           State
----------------------------------------------------
 1     sim                            running
 2     geneious                       running
 -     mausdb                         shut off

migrate to ha-idg-2
Thu Dec  6 17:05:32 CET 2018

This is what journalctl told:

Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740

Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=1 idle=25
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=2 idle=20
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=3 idle=15
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=4 idle=10
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

There seems to be a kind of a countdown. From googling I found that this may be
related to libvirtd.conf:

# Keepalive settings for the admin interface
#admin_keepalive_interval = 5
#admin_keepalive_count = 5

What is meant by the "admin interface"? virsh?
What is meant by "client" in libvirtd.conf? virsh?
Why do I have regular timeouts although my two hosts are very performant?
128 GB RAM, 16 cores, two 1 GBit/s network adapters on each host in bonding.
During migration I don't see much load and almost no I/O wait.
Should I set admin_keepalive_interval to -1?


Bernd
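[The migration script itself was not posted. A minimal sketch of such a
sequential migration loop, assuming the domain names sim, geneious and mausdb
and the hosts ha-idg-1/ha-idg-2 from the output above; the hypothetical
--verbose flag usage is what would print the "Migration: [xx %]" progress:]

    #!/bin/bash
    # Sketch only: migrate the three test domains one after the other
    # to the peer node and then list what is left on this host.
    DOMAINS="sim geneious mausdb"
    TARGET="ha-idg-2"

    for dom in $DOMAINS; do
        date
        echo "migrate $dom"
        # --live keeps the guest running, --verbose prints progress
        virsh migrate --live --verbose "$dom" "qemu+ssh://$TARGET/system"
    done

    date
    echo "Guests on $(hostname):"
    virsh list --all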
Lentes, Bernd
2018-Dec-06 17:20 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
Hi,

sorry, I forgot my setup:

SLES 12 SP3 64bit

ha-idg-1:~ # rpm -qa | grep -i libvirt
libvirt-daemon-driver-secret-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-mpath-3.3.0-5.22.1.x86_64
libvirt-glib-1_0-0-0.2.1-1.2.x86_64
typelib-1_0-LibvirtGLib-1_0-0.2.1-1.2.x86_64
libvirt-daemon-qemu-3.3.0-5.22.1.x86_64
libvirt-daemon-config-network-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-scsi-3.3.0-5.22.1.x86_64
libvirt-client-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-network-3.3.0-5.22.1.x86_64
libvirt-libs-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-disk-3.3.0-5.22.1.x86_64
libvirt-daemon-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-interface-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-qemu-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-nwfilter-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-logical-3.3.0-5.22.1.x86_64
libvirt-python-3.3.0-1.38.x86_64
libvirt-daemon-driver-nodedev-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-iscsi-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-rbd-3.3.0-5.22.1.x86_64
libvirt-daemon-driver-storage-core-3.3.0-5.22.1.x86_64


Bernd
Jim Fehlig
2018-Dec-07 18:34 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
On 12/6/18 10:12 AM, Lentes, Bernd wrote:
> [...]
> I wrote a script to migrate three domains sequentially from host A to host B
> and vice versa via virsh. I raised the log level of libvirtd and found
> something in the log which may be the culprit.
> [...]
> There seems to be a kind of a countdown. From googling I found that this may
> be related to libvirtd.conf:
>
> # Keepalive settings for the admin interface
> #admin_keepalive_interval = 5
> #admin_keepalive_count = 5
>
> What is meant by the "admin interface"? virsh?

virt-admin, which you can use to change some admin settings of libvirtd, e.g.
log_level. You are interested in the keepalive settings above those in
libvirtd.conf, specifically

#keepalive_interval = 5
#keepalive_count = 5

> What is meant by "client" in libvirtd.conf? virsh?

Yes, virsh is a client, as is virt-manager or any application connecting to
libvirtd.

> Why do I have regular timeouts although my two hosts are very performant?
> 128 GB RAM, 16 cores, two 1 GBit/s network adapters on each host in bonding.
> During migration I don't see much load and almost no I/O wait.

I'd think concurrently migrating 3 VMs on a 1G network might cause some
congestion :-).

> Should I set admin_keepalive_interval to -1?

You should try 'keepalive_interval = -1'. You can also avoid sending keepalive
messages from virsh with the '-k' option, e.g. 'virsh -k 0 migrate ...'.

If this doesn't help, are you in a position to test a newer libvirt, preferably
master or the recent 4.10.0 release?

Regards,
Jim
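[For concreteness, a minimal sketch of the two workarounds Jim suggests,
assuming the stock file locations and a systemd-managed libvirtd as on
SLES 12; the domain and host names are the ones from this thread. The
30-second countdown in the log matches the defaults keepalive_interval = 5
and keepalive_count = 5, i.e. five unanswered probes sent 5 seconds apart.]

    # 1) Disable libvirtd's keepalive probing towards its clients by
    #    setting, in /etc/libvirt/libvirtd.conf:
    #        keepalive_interval = -1
    #    then restart the daemon so the change takes effect:
    systemctl restart libvirtd

    # 2) Alternatively, leave the daemon config alone and disable
    #    keepalive messages only for the migrating virsh client;
    #    '-k 0' turns them off for this connection.
    virsh -k 0 migrate --live --verbose mausdb qemu+ssh://ha-idg-2/system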