Richard Landsman - Rimote
2017-Apr-08 14:49 UTC
[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes
Hello,

I would really appreciate some help/guidance with this problem. First of all, sorry for the long message. I would file a bug, but I do not know whether it is my fault, dm-cache, qemu, or (probably) a combination of both. And I can imagine some of you have this setup up and running without problems (or maybe you think it works, just like I did, but it does not):

PROBLEM
LVM cache in writeback mode stops working as expected after a while with a qemu-kvm VM. A 100% working setup would be the holy grail in my opinion... and the performance of KVM/qemu is great in the beginning, I must say.

DESCRIPTION

When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and creating a cached LV out of them, the VM initially performs great (at least 40,000 IOPS on 4k random read/write)! But after a while (and a lot of random IO, ca. 10 - 20 GB) it effectively turns into a writethrough cache, although there is plenty of space left on the cachedlv.

When working as expected, all writes on the KVM host go to the SSDs:

iostat -x -m 2

Device:  rrqm/s  wrqm/s    r/s      w/s     rMB/s  wMB/s   avgrq-sz avgqu-sz await  r_await w_await svctm %util
sda      0.00    324.50    0.00     22.00   0.00   14.94   1390.57  1.90     86.39  0.00    86.39   5.32  11.70
sdb      0.00    324.50    0.00     22.00   0.00   14.94   1390.57  2.03     92.45  0.00    92.45   5.48  12.05
sdc      0.00    3932.00   0.00     2191.50 0.00   270.07  252.39   37.83    17.55  0.00    17.55   0.36  78.05
sdd      0.00    3932.00   0.00     2197.50 0.00   271.01  252.57   38.96    18.14  0.00    18.14   0.36  78.95

When not working as expected, all writes on the KVM host go through the SSD on to the HDDs (effectively disabling writeback, so it behaves like writethrough):

Device:  rrqm/s  wrqm/s    r/s      w/s     rMB/s  wMB/s   avgrq-sz avgqu-sz await  r_await w_await svctm %util
sda      0.00    7.00      234.50   173.50  0.92   1.95    14.38    29.27    71.27  111.89  16.37   2.45  100.00
sdb      0.00    3.50      212.00   177.50  0.83   1.95    14.60    35.58    91.24  143.00  29.42   2.57  100.10
sdc      2.50    0.00      566.00   199.00  2.69   0.78    9.28     0.08     0.11   0.13    0.04    0.10  7.70
sdd      1.50    0.00      76.00    199.00  0.65   0.78    10.66    0.02     0.07   0.16    0.04    0.07  1.85

Stuff I've checked/tried:

- The data in the cached LV has not even exceeded half of the available space at that point, so this should not happen. It even happens when only 20% of cachedata is used.
- It seems to be triggered most of the time when the Cpy%Sync column of `lvs -a` is at about 30%. But this is not always the case!
- Changing the cache policy to cleaner, waiting (check with lvs -a that the flush is done) and then switching back to smq seems to help sometimes, but not always (see the sketch after this list):

  lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
  lvs -a
  lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv

- When mounting the LV inside the host, this does not seem to happen!! So it looks like a qemu-kvm / dm-cache combination issue. The only difference is that inside the host I run mkfs directly instead of LVM inside the VM (so it could also be a problem of LVM inside the VM on top of LVM on the KVM host? Small chance, probably, because the first 10 - 20 GB work great!)
- Tried disabling SELinux, upgrading to the newest kernels (elrepo ml and lt), played around with the dirty-cache tunables like /proc/sys/vm/dirty_writeback_centisecs, /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio, the migration threshold of dmsetup, and other probably unimportant knobs like vm.dirty_bytes.
- When in the "slow state", the system's kworkers are excessively doing IO (10 - 20 MB per kworker process). This seems to be the writeback process (Cpy%Sync) because the cache wants to flush to HDD. But the strange thing is that after a complete sync (0% left), the disk may become slow again after a few MB of data. A reboot sometimes helps.
- Have tried iothreads, virtio-scsi, the vcpu driver setting on the virtio-scsi controller, cache settings, disk schedulers etc. Nothing helped.
- The new Samsung 950 PRO SSDs have HPA enabled (30%!!). I have an AMD FX(tm)-8350 and 16 GB RAM.
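For reference, the same workaround can be scripted roughly like this (a minimal sketch, assuming the VG/LV names from my setup, cl/cachedlv, and assuming, as described above, that Cpy%Sync -- the copy_percent field in lvs -- dropping to 0 means the flush is done):

  #!/bin/bash
  # Sketch of the "force flush" workaround: switch the cache to the cleaner
  # policy, wait until lvs reports 0% in Cpy%Sync, then restore smq.
  # Assumes VG "cl" and cached LV "cachedlv" -- adjust to your setup.

  LV=cl/cachedlv

  lvchange --cachepolicy cleaner "$LV"

  # Poll the same Cpy%Sync value the commands above check by hand.
  until [ "$(lvs --noheadings -o copy_percent "$LV" | tr -d ' ' | cut -d. -f1)" = "0" ]; do
      lvs -a "$LV"    # show flush progress
      sleep 10
  done

  lvchange --cachepolicy smq "$LV"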
It feels like the LVM cache has a threshold (about 20 GB of dirty data), after which it stops allowing the qemu-kvm process to use writeback caching (use from within the host itself does not seem to have this limitation). It starts flushing, but only up to a certain point. After a few more MB of data it is right back in the slow state. The only solution is waiting a long time (independent of Cpy%Sync) or sometimes changing the cache policy and forcing a flush. This prevents me from using this system in production. But it is so promising, so I hope somebody can help.

Desired state: running the fio test (described in the REPRODUCE section) repeatedly should stay fast until the cachedlv is more or less full. If syncing back to disk causes this degradation, it should actually flush fully within a reasonable time and then allow fast writes again up to a given threshold. Right now it behaves like a one-time-use cache that only uses a fraction of the SSD and is useless/very unstable afterwards.
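To see which state the cache is in while testing, the host can be watched with something like the following (a sketch; the lvs cache reporting fields and the --cachesettings syntax are what my LVM version documents, and the migration_threshold value is just an example):

  # Cache occupancy and dirty blocks for the cached LV (assumes VG "cl"):
  lvs -a -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,copy_percent cl/cachedlv

  # The raw dm-cache status line also shows used/total cache blocks, the dirty
  # count, the active policy and the migration_threshold core argument:
  dmsetup status cl-cachedlv

  # The migration threshold mentioned above can also be set per LV:
  lvchange --cachesettings 'migration_threshold=2048' cl/cachedlv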
REPRODUCE

1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM. Keep a lot of space free for the LVM cache (no /home)! So make the VG as large as possible during anaconda partitioning.

2. Once installed and booted into the system, install qemu-kvm:

yum install -y centos-release-qemu-ev
yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
# disable ksm (probably not important / needed)
systemctl disable ksm
systemctl disable ksmtuned

3. Create the LVM cache:

# set some variables and create a raid1 array with the two SSDs
VGBASE= && ssddevice1=/dev/sdX1 && ssddevice2=/dev/sdX1 && hddraiddevice=/dev/mdXXX && ssdraiddevice=/dev/mdXXX && mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}

# create PV and extend VG
pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}

# create the slow LV on the HDDs (use the maximum space left if you want)
pvdisplay ${hddraiddevice}
lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}

# create the meta and data LVs: for testing purposes I keep about 20G of the SSD
# for an uncached LV, to rule out that the SSD itself is the problem
lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}

# the rest can be used as cache data/metadata
pvdisplay ${ssdraiddevice}
# about 1/1000 of the space you have left on the SSD for the metadata (minimum of 4)
lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
# the rest can be used as cachedata
lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}

# convert/combine the pools so cachedlv is actually cached
lvconvert --type cache-pool --cachemode writeback --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv

# my system now looks like this (the VG is called cl, the installer default)
[root at localhost ~]# lvs -a
  LV                VG Attr       LSize   Pool         Origin
  [cachedata]       cl Cwi---C---  97.66g
  [cachedata_cdata] cl Cwi-ao----  97.66g
  [cachedata_cmeta] cl ewi-ao---- 100.00m
  cachedlv          cl Cwi-aoC---   1.75t [cachedata]  [cachedlv_corig]
  [cachedlv_corig]  cl owi-aoC---   1.75t
  [lvol0_pmspare]   cl ewi------- 100.00m
  root              cl -wi-ao----  46.56g
  swap              cl -wi-ao----  14.96g
  testssd           cl -wi-a-----  45.47g

[root at localhost ~]# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdd                        8:48   0   163G  0 disk
└─sdd1                     8:49   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sdb                        8:16   0   1.8T  0 disk
├─sdb2                     8:18   0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sdb1                     8:17   0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot
sdc                        8:32   0   163G  0 disk
└─sdc1                     8:33   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sda                        8:0    0   1.8T  0 disk
├─sda2                     8:2    0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sda1                     8:1    0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot

4. Create the VM:

wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
DISK=/dev/mapper/XXXX-cachedlv

# watch out: my network setup uses a custom bridge/network in the following command. Please replace it with what you normally use.
virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 --disk path=${DISK},cache=none,bus=virtio --network bridge=pubbr,model=virtio --cdrom /home/CentOS-6.9-x86_64-minimal.iso --graphics vnc,port=5998,listen=0.0.0.0 --cpu host

# now connect from a client PC to qemu
virt-viewer --connect=qemu+ssh://root at 192.168.0.XXX/system --name CentOS1

Install everything on the single vda disk with LVM (I use the defaults in anaconda, but remove the large /home to prevent the SSD from being overused). After installation and reboot, log in to the VM and run

yum install epel-release -y && yum install screen fio htop -y

and then run the disk test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Then keep repeating the test, but change the --filename attribute each time so it does not use the same blocks over and over again (a loop like the sketch below can automate this).
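To repeat the test against fresh blocks each time, something like this loop can be run inside the VM (a sketch; the run count and file names are arbitrary, and the filesystem needs enough free space for the new files):

  # Re-run the same fio job against a new file each time so new blocks are
  # written instead of re-hitting blocks that are already in the cache.
  for i in $(seq 1 8); do
      fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
          --name=test${i} --filename=test${i} --bs=4k --iodepth=64 \
          --size=4G --readwrite=randrw --rwmixread=75
  done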
In the beginning the performance is great!! Wow, very impressive: 150 MB/s of 4k random r/w (close to bare metal, about a 20% - 30% loss). But after a few runs (usually about 4 or 5, always changing the filename and not overfilling the FS), it drops to about 10 MB/s.

Normal / in the beginning:

  read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
  write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec

But then:

  read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
  write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec

Or even worse, up to the point that it is actually the HDD that is written to (about 500 IOPS).

P.S. When a test is/was slow, that means the file is on the HDDs. So even after fixing the problem (sometimes just by waiting), that specific file will keep being slow when redoing the test until it is promoted to the LVM cache (which takes a lot of reads, I think). And once on the SSD it sometimes keeps being fast, even though a new test file will be slow. So I really recommend changing the test file every time when trying to see whether the speed has changed.

--
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)

24/7 in case of outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service notices/security updates)
Sandro Bonazzola
2017-Apr-10 08:08 UTC
[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes
Adding Paolo and Miroslav.

On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <richard at rimote.nl> wrote:

> [...]
--
SANDRO BONAZZOLA
ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig>
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
Richard Landsman - Rimote
2017-Apr-20 10:32 UTC
[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes
Hello everyone,

Has anybody had the chance to test out this setup and reproduce the problem? I assume it is a setup that is used quite often these days, so a solution would benefit a lot of users. If I can be of any assistance, please contact me.

--
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)

24/7 in case of outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service notices/security updates)

On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:
> Adding Paolo and Miroslav.
>
> [...]