Dominik Psenner
2017-Aug-14 06:42 UTC
[libvirt-users] virsh blockcommit fails regularily (was: virtual drive performance)
Hi,

a small update on this. We have migrated the virtualized host to use the
virtio drivers and the drive performance has improved so that we now see a
constant transfer rate. Before, it used to reach the same rate but
regularly dropped to a few bytes/sec for a few seconds before becoming
fast again.

However, we still observe that the following fails regularly:

$ virsh snapshot-create-as --domain domain --name backup --no-metadata --atomic --disk-only --diskspec hda,snapshot=external
$ virsh blockcommit domain hda --active --pivot
error: failed to pivot job for disk hda
error: block copy still active: disk 'hda' not ready for pivot yet
Could not merge changes for disk hda of domain. VM may be in invalid state.

Then running the following in the morning succeeds and pivots the snapshot
into the base image while the VM is live:

$ virsh blockjob domain hda --abort
$ virsh blockcommit domain hda --active --pivot
Successfully pivoted

We run the backup process once a day and it failed on the following days:

2017-07-07
2017-07-20
2017-07-27
2017-08-12
2017-08-14

Looking at this, it happens roughly once a week, and from then on the
guest writes into the snapshot overlay. That overlay file grows by about
8 GB every day, so the issue always needs immediate attention.

Any ideas what could cause this? Is this a bug (race condition) in `virsh
blockcommit` that sometimes fails because it is invoked at the wrong time?
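For context, the daily cronjob boils down to roughly the following sketch.
The base-image path is taken from the disk configuration quoted further
below; the overlay name (image.backup, derived by libvirt from the
snapshot name) and the /backup destination are assumptions:

#!/bin/bash
# Sketch of the daily backup flow; paths and overlay name are assumptions.
set -e
base=/var/data/virtuals/machines/windows-server-2016-x64/image.qcow2
overlay=/var/data/virtuals/machines/windows-server-2016-x64/image.backup

# 1) freeze the base image by redirecting guest writes into an overlay
virsh snapshot-create-as --domain domain --name backup --no-metadata \
    --atomic --disk-only --diskspec hda,snapshot=external

# 2) copy the now-quiescent base image away
cp "$base" /backup/

# 3) merge the overlay back and make the guest write to the base again;
#    this is the step that fails intermittently
virsh blockcommit domain hda --active --pivot

# 4) verify the disk points at the base image again, then drop the overlay
virsh dumpxml domain | grep "source file"
rm -f "$overlay"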
Cheers,
Dominik

2017-07-07 9:21 GMT+02:00 Dominik Psenner <dpsenner@gmail.com>:

> Of course the cronjob fails when trying to virsh blockcommit and not when
> creating the snapshot, sorry for the noise.
>
> 2017-07-07 9:15 GMT+02:00 Dominik Psenner <dpsenner@gmail.com>:
>
>> Hi,
>>
>> different day, same issue. The cronjob runs and fails:
>>
>> $ virsh snapshot-create-as --domain domain --name backup --no-metadata --atomic --disk-only --diskspec hda,snapshot=external
>> error: failed to pivot job for disk hda
>> error: block copy still active: disk 'hda' not ready for pivot yet
>> Could not merge changes for disk hda of domain. VM may be in invalid state.
>>
>> Then running the following in the morning succeeds and pivots the
>> snapshot into the base image while the VM is live:
>>
>> $ virsh blockjob domain hda --abort
>> $ virsh blockcommit domain hda --active --pivot
>> Successfully pivoted
>>
>> This need for manual intervention is becoming tiring.
>>
>> Is someone else seeing the same issue, or does anyone have an idea what
>> the cause could be? Can I trust the output, and is the base image really
>> up to the latest state?
>>
>> Cheers
>>
>> 2017-07-02 10:30 GMT+02:00 Dominik Psenner <dpsenner@gmail.com>:
>>
>>> Just a little catch-up. This time I was able to resolve the issue by
>>> doing:
>>>
>>> virsh blockjob domain hda --abort
>>> virsh blockcommit domain hda --active --pivot
>>>
>>> Last time I had to shut down the virtual machine and do this while it
>>> was offline.
>>>
>>> Thanks Wang for your valuable input. As far as the memory goes, there's
>>> plenty of head room:
>>>
>>> $ free -h
>>>               total        used        free      shared  buff/cache   available
>>> Mem:           7.8G        1.8G        407M        9.7M        5.5G        5.5G
>>> Swap:          8.0G        619M        7.4G
>>>
>>> 2017-07-02 10:26 GMT+02:00 王李明 <wanglm@certusnet.com.cn>:
>>>
>>>> Maybe this is because your physical host's memory is small; that could
>>>> cause instability in the virtual machine. But I'm just guessing. You
>>>> can try to increase your memory.
>>>>
>>>> Wang Liming
>>>>
>>>> From: libvirt-users-bounces@redhat.com [mailto:libvirt-users-bounces@redhat.com] On behalf of Dominik Psenner
>>>> Sent: July 2, 2017 16:22
>>>> To: libvirt-users@redhat.com
>>>> Subject: Re: [libvirt-users] virtual drive performance
>>>>
>>>> Hi again,
>>>>
>>>> just today an issue I thought had been resolved popped up again. We
>>>> back up the machine by doing:
>>>>
>>>> virsh snapshot-create-as --domain domain --name backup --no-metadata --atomic --disk-only --diskspec hda,snapshot=external
>>>>
>>>> # backup hda.qcow2
>>>>
>>>> virsh blockcommit domain hda --active --pivot
>>>>
>>>> Every now and then this process fails with the following error message:
>>>>
>>>> error: failed to pivot job for disk hda
>>>> error: block copy still active: disk 'hda' not ready for pivot yet
>>>> Could not merge changes for disk hda of domain. VM may be in invalid state.
>>>>
>>>> I expect live backups to be a great asset, and they should work. Is
>>>> this a bug that may also relate to the virtual drive performance issues
>>>> we observe?
>>>>
>>>> Cheers
>>>>
>>>> 2017-07-02 10:10 GMT+02:00 Dominik Psenner <dpsenner@gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>> a small update on this. I just migrated the VM from the site to my
>>>> laptop and fired it up. The exact same XML configuration (except file
>>>> paths and such) starts up and bursts at 50 MB/s to 115 MB/s in the
>>>> guest. This allows only one reasonable conclusion: the CPU in my laptop
>>>> is somehow better suited to emulating I/O than the CPU built into the
>>>> host on site. The host there is an HP ProLiant MicroServer Gen8 with a
>>>> Xeon processor, but that processor is also never capped at 100% while
>>>> the guest copies files.
>>>>
>>>> I just ran another test by copying a 3 GB file on the guest. What I can
>>>> observe on my computer is that the copy does not proceed at a constant
>>>> rate but rather starts at 90 MB/s, drops to 30 MB/s, goes up to
>>>> 70 MB/s, drops to 1 MB/s, goes up to 75 MB/s, drops to 1 MB/s, goes up
>>>> to 55 MB/s, and the pattern continues. Please note that the drive is
>>>> still configured as:
>>>>
>>>> <driver name='qemu' type='qcow2' cache='none' io='threads'/>
>>>>
>>>> and I would expect a constant rate that is either high or low, since
>>>> there is no caching involved and the underlying drive is a Samsung SSD
>>>> 850 EVO.
>>>> To give an idea of how fast that drive is on my laptop:
>>>>
>>>> $ dd if=/dev/zero of=testfile bs=1M count=1000 oflag=direct
>>>> 1000+0 records in
>>>> 1000+0 records out
>>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.47301 s, 424 MB/s
>>>>
>>>> I can further observe that the smaller the written chunks are, the
>>>> slower the overall performance is:
>>>>
>>>> $ dd if=/dev/zero of=testfile bs=512K count=1000 oflag=direct
>>>> 1000+0 records in
>>>> 1000+0 records out
>>>> 524288000 bytes (524 MB, 500 MiB) copied, 1.34874 s, 389 MB/s
>>>>
>>>> $ dd if=/dev/zero of=testfile bs=5K count=1000 oflag=direct
>>>> 1000+0 records in
>>>> 1000+0 records out
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.105109 s, 48.7 MB/s
>>>>
>>>> $ dd if=/dev/zero of=testfile bs=1K count=10000 oflag=direct
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 10240000 bytes (10 MB, 9.8 MiB) copied, 0.668438 s, 15.3 MB/s
>>>>
>>>> $ dd if=/dev/zero of=testfile bs=512 count=20000 oflag=direct
>>>> 20000+0 records in
>>>> 20000+0 records out
>>>> 10240000 bytes (10 MB, 9.8 MiB) copied, 1.10964 s, 9.2 MB/s
>>>>
>>>> Could this be a limiting factor? Does qemu/kvm do many, many writes of
>>>> just a few bytes?
>>>>
>>>> Ideas, anyone?
>>>>
>>>> Cheers
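The quoted measurements can be reproduced for a range of block sizes with
a small loop like this sketch (it writes testfile into the current
directory):

for bs in 1M 512K 5K 1K 512; do
    echo "block size $bs:"
    # dd prints its summary on stderr; keep only the throughput line
    dd if=/dev/zero of=testfile bs="$bs" count=1000 oflag=direct 2>&1 | tail -n 1
    rm -f testfile
done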
>>>> 2017-06-21 20:46 GMT+02:00 Dan <srwx4096@gmail.com>:
>>>>
>>>> On Tue, Jun 20, 2017 at 04:24:32PM +0200, Gianluca Cecchi wrote:
>>>> > On Tue, Jun 20, 2017 at 3:38 PM, Dominik Psenner <dpsenner@gmail.com> wrote:
>>>> >
>>>> > > to the following:
>>>> > >
>>>> > > <disk type='file' device='disk'>
>>>> > >   <driver name='qemu' type='qcow2' cache='none'/>
>>>> > >   <source file='/var/data/virtuals/machines/windows-server-2016-x64/image.qcow2'/>
>>>> > >   <backingStore/>
>>>> > >   <target dev='hda' bus='scsi'/>
>>>> > >   <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>>>> > > </disk>
>>>> > >
>>>> > > Do you see any gotchas in this configuration that could prevent the
>>>> > > virtualized guest from powering on and booting up?
>>>> >
>>>> > When I configure it like this, from a Linux guest point of view I get
>>>> > this Symbios Logic SCSI controller:
>>>> >
>>>> > 00:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895a
>>>> >
>>>> > But this is true only if you add the SCSI controller too, not only
>>>> > the disk definition. In my case:
>>>> >
>>>> > <controller type='scsi' index='0'>
>>>> >   <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
>>>> > </controller>
>>>> >
>>>> > Note the slot='0x08' that is reflected in the first field of lspci
>>>> > inside my Linux guest. So among your controllers you have to add the
>>>> > SCSI one.
>>>> >
>>>> > In my case (Fedora 25 with virt-manager-1.4.1-2.fc25.noarch,
>>>> > qemu-kvm-2.7.1-6.fc25.x86_64, libvirt-2.2.1-2.fc25.x86_64), with
>>>> > "Disk bus" set to SCSI in virt-manager, the XML definition for the
>>>> > guest is automatically updated with the controller if it does not
>>>> > exist yet. And the disk definition section looks like this:
>>>> >
>>>> > <disk type='file' device='disk'>
>>>> >   <driver name='qemu' type='qcow2'/>
>>>> >   <source file='/var/lib/libvirt/images/slaxsmall.qcow2'/>
>>>> >   <target dev='sda' bus='scsi'/>
>>>> >   <boot order='1'/>
>>>> >   <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>>>> > </disk>
>>>> >
>>>> > So I think you should set dev='sda' and not 'hda' in your XML for it.
>>>>
>>>> I am actually very curious to know whether that would make a
>>>> difference. I don't have such a Windows VM image ready to test at
>>>> present.
>>>>
>>>> Dan
>>>>
>>>> > I don't know if w2016 contains the Symbios Logic drivers already
>>>> > installed, so that a "simple" reboot could imply an automatic
>>>> > reconfiguration of the guest. Note also that in Windows, when the
>>>> > hardware configuration is considered heavily changed, you could be
>>>> > asked to register again (I don't think that the IDE --> SCSI change
>>>> > should imply it...).
>>>> >
>>>> > Gianluca

--
Dominik Psenner
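A quick way to confirm that a guest definition already contains the SCSI
controller discussed in the quoted exchange above (a sketch; "domain" is a
placeholder for the guest name):

$ virsh dumpxml domain | grep -A 2 "controller type='scsi'"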
Peter Krempa
2017-Aug-14 15:05 UTC
Re: [libvirt-users] virsh blockcommit fails regularily (was: virtual drive performance)
On Mon, Aug 14, 2017 at 08:42:24 +0200, Dominik Psenner wrote:
> Hi,

Hi,

> a small update on this. We have migrated the virtualized host to use the
> virtio drivers and the drive performance has improved so that we now see
> a constant transfer rate. Before, it used to reach the same rate but
> regularly dropped to a few bytes/sec for a few seconds before becoming
> fast again.
>
> However, we still observe that the following fails regularly:
>
> $ virsh snapshot-create-as --domain domain --name backup --no-metadata --atomic --disk-only --diskspec hda,snapshot=external
> $ virsh blockcommit domain hda --active --pivot
> error: failed to pivot job for disk hda
> error: block copy still active: disk 'hda' not ready for pivot yet
> Could not merge changes for disk hda of domain. VM may be in invalid state.

since this thread was renamed, please re-state the version of libvirt you
are using. I don't really want to dig through the old thread.

> Then running the following in the morning succeeds and pivots the
> snapshot into the base image while the VM is live:
>
> $ virsh blockjob domain hda --abort
> $ virsh blockcommit domain hda --active --pivot
> Successfully pivoted
>
> We run the backup process once a day and it failed on the following days:
>
> 2017-07-07
> 2017-07-20
> 2017-07-27
> 2017-08-12
> 2017-08-14
>
> Looking at this, it happens roughly once a week, and from then on the
> guest writes into the snapshot overlay. That overlay file grows by about
> 8 GB every day, so the issue always needs immediate attention.
>
> Any ideas what could cause this? Is this a bug (race condition) in
> `virsh blockcommit` that sometimes fails because it is invoked at the
> wrong time?

So the 'virsh blockcommit domain hda --active --pivot' operation consists
of 3 parts:

1) virsh blockcommit domain hda --active
2) waiting until the block job finishes
3) virsh blockjob --pivot domain hda

The problem is that sometimes 2) finishes too soon and then operation 3)
fails. This should not happen any more, since there is code in virsh [1]
which waits for the completion event from libvirtd, which is fired only
when the job is actually ready to be pivoted.

This code has a lot of fallback options for cases where libvirtd is old.

At any rate, pivoting manually later should help, as should updating to a
more recent version.

In case you are using a fairly recent version, it's possible that there
are still bugs, though.

Peter

[1]:

commit 7408403560f7d054da75acaab855a95c51a92e2b
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Mon Jul 13 17:04:49 2015 +0200

    virsh: Refactor block job waiting in cmdBlockCommit

    Reuse the vshBlockJobWait infrastructure to refactor cmdBlockCommit to
    use the common code. This additionally fixes a bug when working with
    new qemus, where when doing an active commit with --pivot the pivoting
    would fail, since qemu reaches 100% completion but the job doesn't
    switch to synchronized phase right away.

$ git describe --contains 7408403560f7d054da75acaab855a95c51a92e2b
v1.2.18-rc1~33
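The three steps Peter lists above can be reproduced by hand with a sketch
like the following. The readiness handling is an assumption: because the
percentage can reach 100% before qemu switches the job to the synchronized
phase, as described above, it simply retries the pivot until it succeeds
(domain and disk names as used in the thread; add a timeout in real use):

# 1) start the active commit without pivoting
virsh blockcommit domain hda --active

# 2) wait until the block job reports 100 %
until virsh blockjob domain hda --info | grep -q '100 %'; do
    sleep 1
done

# 3) pivot, retrying while the job is not yet ready
until virsh blockjob domain hda --pivot; do
    sleep 2
done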
Dominik Psenner
2017-Aug-14 15:46 UTC
Re: [libvirt-users] virsh blockcommit fails regularily (was: virtual drive performance)
Thanks Peter for your feedback. Interestingly, the version of virsh is
newer than 1.2.18 and thus should contain the fix:

$ virsh --version
1.3.1
$ uname -a
Linux agsserver 4.4.0-91-generic #114-Ubuntu SMP Tue Aug 8 11:56:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:        16.04
Codename:       xenial

But we are still seeing the issue. Is there anything else you can think
of? Feel free to query me for more information; I'm willing to help
wherever I can, because this bugs us quite regularly. We could probably
improve our daily backup cronjob to retry blockcommit after a blockjob
abort, but that feels so hacky that I would do it only as a last resort,
along the lines of the sketch below.
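A sketch of that last-resort retry (domain and disk names as elsewhere in
the thread):

if ! virsh blockcommit domain hda --active --pivot; then
    # the pivot failed: abort the stuck job and try the commit once more
    virsh blockjob domain hda --abort
    sleep 10
    virsh blockcommit domain hda --active --pivot
fi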
2017-08-14 17:05 GMT+02:00 Peter Krempa <pkrempa@redhat.com>:

> [...]
>
> At any rate, pivoting manually later should help, as should updating to
> a more recent version.
>
> In case you are using a fairly recent version, it's possible that there
> are still bugs, though.
>
> Peter

--
Dominik Psenner