Jim Fehlig
2018-Dec-07 18:34 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
On 12/6/18 10:12 AM, Lentes, Bernd wrote:
>> Hi,
>>
>> i have a two-node cluster with several domains as resources. During testing i
>> tried several times to migrate some domains concurrently.
>> Usually it succeeded, but rarely it failed. I found one clue in the log:
>>
>> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
>> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
>> to keepalive timeout
>>
>> The domains are configured similarly:
>> primitive vm_geneious VirtualDomain \
>>     params config="/mnt/san/share/config.xml" \
>>     params hypervisor="qemu:///system" \
>>     params migration_transport=ssh \
>>     op start interval=0 timeout=120 trace_ra=1 \
>>     op stop interval=0 timeout=130 trace_ra=1 \
>>     op monitor interval=30 timeout=25 trace_ra=1 \
>>     op migrate_from interval=0 timeout=300 trace_ra=1 \
>>     op migrate_to interval=0 timeout=300 trace_ra=1 \
>>     meta allow-migrate=true target-role=Started is-managed=true \
>>     utilization cpu=2 hv_memory=8000
>>
>> What is the algorithm to discover the port used for live migration?
>> I have the impression that "params migration_transport=ssh" is worthless, port
>> 22 isn't involved in live migration.
>> My experience is that for the migration tcp ports > 49151 are used. But the
>> exact procedure isn't clear to me.
>> Does live migration use tcp port 49152 first and, for each following domain, one
>> port higher?
>> E.g. for the concurrent live migration of three domains 49152, 49153 and 49154.
>>
>> Why does live migration of three domains usually succeed, although on both
>> hosts only 49152 and 49153 are open?
>> Is the migration not really concurrent, but sometimes sequential?
>>
>> Bernd
>>
> Hi,
>
> i tried to narrow down the problem.
> My first assumption was that something with the network between the hosts is not ok.
> I opened ports 49152 - 49172 in the firewall - problem persisted.
> So i deactivated the firewall on both nodes - problem persisted.
>
> Then i wanted to exclude the HA-Cluster software (pacemaker).
> I unmanaged the VirtualDomains in pacemaker and migrated them with virsh - problem persists.
>
> I wrote a script to migrate three domains sequentially from host A to host B and vice versa via virsh.
> I raised the log level of libvirtd and found something in the log which may be the culprit.
>
> This is the output of my script:
>
> Thu Dec 6 17:02:53 CET 2018
> migrate sim
> Migration: [100 %]
> Thu Dec 6 17:03:07 CET 2018
> migrate geneious
> Migration: [100 %]
> Thu Dec 6 17:03:16 CET 2018
> migrate mausdb
> Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed   <===== error !
>
> Thu Dec 6 17:05:32 CET 2018   <======== time of error
> Guests on ha-idg-1:
>  Id    Name                           State
> ----------------------------------------------------
>  1     sim                            running
>  2     geneious                       running
>  -     mausdb                         shut off
>
> migrate to ha-idg-2
> Thu Dec 6 17:05:32 CET 2018
>
> This is what journalctl told:
>
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740
>
> Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=1 idle=25
> Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
>
> Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=2 idle=20
> Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
>
> Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=3 idle=15
> Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
>
> Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=4 idle=10
> Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
>
> Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5
> Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
>
> There seems to be a kind of countdown. From googling i found that this may be related to libvirtd.conf:
>
> # Keepalive settings for the admin interface
> #admin_keepalive_interval = 5
> #admin_keepalive_count = 5
>
> What is meant by the "admin interface"? virsh?

virt-admin, which you can use to change some admin settings of libvirtd, e.g.
log_level. You are interested in the keepalive settings above those ones in
libvirtd.conf, specifically

#keepalive_interval = 5
#keepalive_count = 5

> What is meant by "client" in libvirtd.conf? virsh?

Yes, virsh is a client, as is virt-manager or any application connecting to
libvirtd.

> Why do i have regular timeouts although my two hosts are very performant? 128GB
> RAM, 16 cores, two 1 GBit/s network adapters on each host in bonding.
> During migration i don't see much load, and nearly no waiting for IO.

I'd think concurrently migrating 3 VMs on a 1G network might cause some
congestion :-).

> Should i set admin_keepalive_interval to -1?

You should try 'keepalive_interval = -1'. You can also avoid sending keepalive
messages from virsh with the '-k' option, e.g. 'virsh -k 0 migrate ...'.

If this doesn't help, are you in a position to test a newer libvirt, preferably
master or the recent 4.10.0 release?

Regards,
Jim
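For reference, the countdown in the journal above corresponds to the client keepalive settings in /etc/libvirt/libvirtd.conf (not the admin_* variants). A minimal sketch of the change Jim suggests, assuming the stock configuration file layout; the domain and host names are the ones used in this thread, the restart command depends on the distribution, and none of this is a tested recommendation:

    # /etc/libvirt/libvirtd.conf on both hosts
    # libvirtd sends a probe every keepalive_interval seconds and drops a
    # connection after keepalive_count unanswered probes (5 x 5 matches the
    # roughly 30 s countdown in the journal above); -1 stops libvirtd from
    # sending probes at all
    keepalive_interval = -1
    #keepalive_count = 5

    # restart libvirtd afterwards, e.g.
    systemctl restart libvirtd    # or: service libvirtd restart

    # alternatively, leave the server defaults alone and disable keepalive
    # handling in the client for a long-running migration only:
    virsh -k 0 --connect=qemu:///system migrate --verbose --live domain qemu+ssh://ha-idg-1/system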
Lentes, Bernd
2018-Dec-10 12:46 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
Jim wrote:

>> What is meant by the "admin interface"? virsh?
>
> virt-admin, which you can use to change some admin settings of libvirtd, e.g.
> log_level. You are interested in the keepalive settings above those ones in
> libvirtd.conf, specifically
>
> #keepalive_interval = 5
> #keepalive_count = 5
>
>> What is meant by "client" in libvirtd.conf? virsh?
>
> Yes, virsh is a client, as is virt-manager or any application connecting to
> libvirtd.
>
>> Why do i have regular timeouts although my two hosts are very performant? 128GB
>> RAM, 16 cores, two 1 GBit/s network adapters on each host in bonding.
>> During migration i don't see much load, and nearly no waiting for IO.
>
> I'd think concurrently migrating 3 VMs on a 1G network might cause some
> congestion :-).
>
>> Should i set admin_keepalive_interval to -1?
>
> You should try 'keepalive_interval = -1'. You can also avoid sending keepalive
> messages from virsh with the '-k' option, e.g. 'virsh -k 0 migrate ...'.
>
> If this doesn't help, are you in a position to test a newer libvirt, preferably
> master or the recent 4.10.0 release?

Hi Jim,

Unfortunately not.

I have some more questions, maybe you can help me a bit.
I found http://epic-alfa.kavli.tudelft.nl/share/doc/libvirt-devel-0.10.2/migration.html , which is quite interesting.

When i migrate with virsh, i use:
virsh --connect=qemu:///system migrate --verbose --live domain qemu+ssh://ha-idg-1/system

When pacemaker migrates, it creates this sequence:
virsh --connect=qemu:///system --quiet migrate --live domain qemu+ssh://ha-idg-1/system
which is quite the same.
Do i understand the webpage correctly, is this a "Native migration, client to two libvirtd servers"?

Furthermore the document says:
"To force migration over an alternate network interface the optional hypervisor specific URI must be provided".

I have both hosts also connected directly to each other with a bonding device using round-robin, and an internal ip (192.168.100.xx).
When i want to use this device, which is maybe a bit faster and more secure (directly connected), how do i have to specify that?

virsh --connect=qemu:///system --quiet migrate --live domain qemu+ssh://ha-idg-1/system tcp://192.168.100.xx

Does it have to be the ip of the source or the destination? Does the source then automatically also use its device with 192.168.100.xx?

Thanks.

Bernd


Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDirig.in Petra Steiner-Hoffmann
Stellv. Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter
Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, Dr. rer. nat. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
Chenli Hu
2018-Dec-21 09:50 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
Hi, Bernd

If you want to operate many guests concurrently (e.g. migrate 60), you need to edit the related settings and then restart the libvirtd service.

1) Edit /etc/libvirt/libvirtd.conf:
max_clients = 5000
max_queued_clients = 1000
min_workers = 500
max_workers = 1000
max_client_requests = 1000
keepalive_interval = -1

2) Edit /etc/libvirt/qemu.conf:
lock_manager = "lockd"
max_processes = 65535
max_files = 65535
keepalive_interval = -1

3) # service libvirtd restart

Regards,
Chenli Hu

----- Original Message -----
From: "Bernd Lentes" <bernd.lentes@helmholtz-muenchen.de>
To: "libvirt-ML" <libvirt-users@redhat.com>
Sent: Monday, December 10, 2018 8:46:22 PM
Subject: Re: [libvirt-users] concurrent migration of several domains rarely fails
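The settings above are aimed at letting many client connections and migration jobs run at the same time. As a rough sketch only (host and domain names are the ones from this thread, and this is not a tested recipe), such a concurrent migration test could be driven from the shell like this:

    #!/bin/bash
    # start one live migration per domain in the background and wait for all
    # of them; -k 0 disables virsh's own keepalive checks, as suggested
    # earlier in the thread
    DEST=qemu+ssh://ha-idg-2/system

    for dom in sim geneious mausdb; do
        virsh -k 0 --connect=qemu:///system migrate --live --verbose "$dom" "$DEST" &
    done
    wait    # returns once every background migration has exited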
Jim Fehlig
2019-Feb-06 16:38 UTC
Re: [libvirt-users] concurrent migration of several domains rarely fails
On 12/10/18 5:46 AM, Lentes, Bernd wrote:
> Hi Jim,

Hi Bernd,

I was doing some mail housekeeping and realized I never responded to this message. Apologies for the long delay.

> Unfortunately not.
>
> I have some more questions, maybe you can help me a bit.
> I found http://epic-alfa.kavli.tudelft.nl/share/doc/libvirt-devel-0.10.2/migration.html , which is quite interesting.

The canonical location of that doc is the libvirt website :-)

https://libvirt.org/migration.html

> When i migrate with virsh, i use:
> virsh --connect=qemu:///system migrate --verbose --live domain qemu+ssh://ha-idg-1/system
>
> When pacemaker migrates, it creates this sequence:
> virsh --connect=qemu:///system --quiet migrate --live domain qemu+ssh://ha-idg-1/system
> which is quite the same.
> Do i understand the webpage correctly, is this a "Native migration, client to two libvirtd servers"?

Yes, that is correct.

> Furthermore the document says:
> "To force migration over an alternate network interface the optional hypervisor specific URI must be provided".
>
> I have both hosts also connected directly to each other with a bonding device using round-robin, and an internal ip (192.168.100.xx).
> When i want to use this device, which is maybe a bit faster and more secure (directly connected), how do i have to specify that?

By using the 'migrateuri' parameter. See the 'migrate' section of the virsh man page for more details.

> virsh --connect=qemu:///system --quiet migrate --live domain qemu+ssh://ha-idg-1/system tcp://192.168.100.xx

This example looks correct, where the 'tcp://192.168.100.xx' part is the migrateuri.

> Does it have to be the ip of the source or the destination? Does the source then automatically also use its device with 192.168.100.xx?

It is the address of the destination. The destination will select a port if not specified, open a socket for the incoming migration, and provide the address/port to the source so it can connect and send the migration stream.

Note that /etc/libvirt/qemu.conf contains some knobs for controlling default migration settings in the qemu driver.

Regards,
Jim
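Putting the answer together, a sketch using the names from this thread ('192.168.100.xx' stands for the destination's address on the direct link). The qemu.conf entries shown are the migration-related knobs Jim mentions; the values are, as far as I know, the shipped defaults and are left commented out as in the stock file:

    # control connection goes via qemu+ssh to the destination libvirtd, while
    # the migration data stream is forced onto the back-to-back bond by the
    # trailing migrateuri
    virsh --connect=qemu:///system migrate --live --verbose domain \
          qemu+ssh://ha-idg-1/system tcp://192.168.100.xx

    # /etc/libvirt/qemu.conf on the destination host
    #migration_address  = "0.0.0.0"   # listen address when no migrateuri is given
    #migration_port_min = 49152       # first port tried for an incoming migration
    #migration_port_max = 49215       # each concurrent migration gets its own port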