On Tue, Apr 19, 2022 at 15:51:32 +0200, Valentijn Sessink wrote:
> Hi Peter,
>
> Thanks.
>
> On 19-04-2022 13:22, Peter Krempa wrote:
> > It would be helpful if you provide the VM XML file to see how your
> > disks are configured and the debug log file when the bug reproduces:
>
> I created a random VM to show the effect. XML file attached.
>
> > Without that my only hunch would be that you ran out of disk space on
> > the destination which caused the I/O error.
>
> ... it's an LVM2 volume with exactly the same size as the source
> machine, so that would be rather odd ;-)
Oh, you are using raw disks backed by block volumes. That was not
obvious before ;)
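
In the domain XML such a disk typically looks roughly like this (the
volume path and target here are placeholders, not taken from your
attachment):

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/vg0/testvm-disk2'/>
    <target dev='vdc' bus='virtio'/>
  </disk>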
>
> I'm guessing that it's this weird message at the destination machine:
>
> 2022-04-19 13:31:09.394+0000: 1412559: error : virKeepAliveTimerInternal:137 : internal error: connection closed due to keepalive timeout
That certainly could be a hint ...
>
> Source machine says:
> 2022-04-19 13:31:09.432+0000: 2641309: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
> 2022-04-19 13:31:09.432+0000: 2641309: debug : virJSONValueFromString:1822 : string={"timestamp": {"seconds": 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
The migration of non-shared storage works as follows (a rough sketch of
driving this through the API follows the list):
1) libvirt sets up everything
2) libvirt asks destination qemu to open an NBD server exporting the
disk backends
3) source libvirt instructs qemu to copy the disks to the NBD server via
a block-copy job
4) when the block jobs converge, source qemu is instructed to migrate
memory
5) when memory migrates, source qemu is killed and destination is
resumed
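
To make this concrete, here is a minimal Python sketch of kicking off
such a migration through the libvirt API; the host URIs and VM name are
made up, and 'virsh migrate --copy-storage-all' maps to the same flag:

  import libvirt

  # All names/URIs below are placeholders for illustration only.
  src = libvirt.open("qemu:///system")
  dst = libvirt.open("qemu+ssh://destination.example.org/system")

  dom = src.lookupByName("testvm")

  # VIR_MIGRATE_NON_SHARED_DISK requests the full storage copy described
  # above: destination qemu exports the disks via NBD (step 2) and source
  # qemu mirrors them with a block-copy job (step 3) before memory moves.
  flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_NON_SHARED_DISK

  try:
      dom.migrate(dst, flags, None, None, 0)
  except libvirt.libvirtError as err:
      # A broken NBD/network connection during the copy phase surfaces
      # here as an aborted migration; the BLOCK_JOB_ERROR details land
      # in the debug log as quoted above.
      print("migration failed:", err)
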
Now, judging from the keepalive failure on the destination, it seems
that the network connection broke, at least between the migration
controller and the destination libvirt. That might also cause the NBD
connection to break, in which case the block job gets an I/O error.
So the I/O error really comes from the network connection and not from
any storage issue.
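
The limits for those keepalive probes are set on the daemon side in
/etc/libvirt/libvirtd.conf (shown below with what I believe are the
defaults); raising them may hide a flaky link a bit longer, but it does
not fix the underlying network problem:

  # /etc/libvirt/libvirtd.conf on the destination host
  keepalive_interval = 5   # seconds between keepalive probes
  keepalive_count = 5      # unanswered probes before the connection is closed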
So at this point I suspect that something in the network broke and the
migration was aborted in the storage copy phase, but it could just as
well have been aborted in any other phase.