Ramesh Nachimuthu
2017-Mar-06 04:26 UTC
[Gluster-users] [ovirt-users] Replicated Glusterfs on top of ZFS
+gluster-users

Regards,
Ramesh

----- Original Message -----
> From: "Arman Khalatyan" <arm2arm at gmail.com>
> To: "Juan Pablo" <pablo.localhost at gmail.com>
> Cc: "users" <users at ovirt.org>, "FERNANDO FREDIANI" <fernando.frediani at upx.com>
> Sent: Friday, March 3, 2017 8:32:31 PM
> Subject: Re: [ovirt-users] Replicated Glusterfs on top of ZFS
>
> The problem itself is not the streaming data performance, and dd from
> /dev/zero also does not tell much on a production ZFS running with
> compression. The main problem comes when Gluster starts to do something
> with the data: it uses xattrs, and accessing extended attributes inside
> ZFS is probably slower than on XFS. Even a simple find or ls -l in the
> .glusterfs folders takes ages.
>
> Now I can see that the arbiter host has almost 100% cache misses during
> the rebuild, which is natural while it is always reading new datasets:
>
> [root at clei26 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 15:57:31    29    29    100    29  100     0    0    29  100   685M   31G
> 15:57:32   530   476     89   476   89     0    0   457   89   685M   31G
> 15:57:33   480   467     97   467   97     0    0   463   97   685M   31G
> 15:57:34   452   443     98   443   98     0    0   435   97   685M   31G
> 15:57:35   582   547     93   547   93     0    0   536   94   685M   31G
> 15:57:36   439   417     94   417   94     0    0   393   94   685M   31G
> 15:57:38   435   392     90   392   90     0    0   374   89   685M   31G
> 15:57:39   364   352     96   352   96     0    0   352   96   685M   31G
> 15:57:40   408   375     91   375   91     0    0   360   91   685M   31G
> 15:57:41   552   539     97   539   97     0    0   539   97   685M   31G
>
> It looks like we cannot have both performance and reliability in the same
> system :( The simple final conclusion is that with a single disk + SSD,
> even ZFS does not help to speed up the GlusterFS healing. I will stop
> here :)
>
> On Fri, Mar 3, 2017 at 3:35 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
>
> > cd to inside the pool path,
> > then dd if=/dev/zero of=test.tt bs=1M
> > leave it running 5/10 minutes.
> > do ctrl+c, paste the result here, etc.
>
> 2017-03-03 11:30 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
>
> > No, I have one pool made of one disk, with the SSD as cache and log
> > device. I have 3 GlusterFS bricks on 3 separate hosts: volume type
> > Replicate (Arbiter), replica 2+1! That is as much as you can push into
> > the compute nodes (they have only 3 disk slots).
>
> On Fri, Mar 3, 2017 at 3:19 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
>
> > ok, you have 3 pools, zclei22, logs and cache, that's wrong. you should
> > have 1 pool, with zlog+cache, if you are looking for performance.
> > also, don't mix drives.
> > what's the performance issue you are facing?
> >
> > regards,
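For reference, a minimal sketch of the streaming-write test Juan Pablo describes above; the mountpoint /zclei22/01 is taken from the zfs output further down, so substitute your own pool path, and note (as Arman points out) that a stream of zeros says little once compression is enabled:

  cd /zclei22/01                      # cd to inside the pool path
  dd if=/dev/zero of=test.tt bs=1M    # sequential write; interrupt with Ctrl+C after 5-10 minutes
  rm -f test.tt                       # clean up the test file afterwards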
> 2017-03-03 11:00 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
>
> > This is CentOS 7.3, ZoL version 0.6.5.9-1.
> >
> > [root at clei22 ~]# lsscsi
> > [2:0:0:0]  disk  ATA  INTEL SSDSC2CW24  400i  /dev/sda
> > [3:0:0:0]  disk  ATA  HGST HUS724040AL  AA70  /dev/sdb
> > [4:0:0:0]  disk  ATA  WDC WD2002FYPS-0  1G01  /dev/sdc
> >
> > [root at clei22 ~]# pvs; vgs; lvs
> >   PV                                                  VG            Fmt  Attr PSize   PFree
> >   /dev/mapper/INTEL_SSDSC2CW240A3_CVCV306302RP240CGN  vg_cache      lvm2 a--  223.57g      0
> >   /dev/sdc2                                           centos_clei22 lvm2 a--    1.82t 64.00m
> >   VG            #PV #LV #SN Attr   VSize   VFree
> >   centos_clei22   1   3   0 wz--n-   1.82t 64.00m
> >   vg_cache        1   2   0 wz--n- 223.57g      0
> >   LV       VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
> >   home     centos_clei22 -wi-ao----   1.74t
> >   root     centos_clei22 -wi-ao----  50.00g
> >   swap     centos_clei22 -wi-ao----  31.44g
> >   lv_cache vg_cache      -wi-ao---- 213.57g
> >   lv_slog  vg_cache      -wi-ao----  10.00g
> >
> > [root at clei22 ~]# zpool status -v
> >   pool: zclei22
> >  state: ONLINE
> >   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
> > config:
> >
> >         NAME                                    STATE     READ WRITE CKSUM
> >         zclei22                                 ONLINE       0     0     0
> >           HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
> >         logs
> >           lv_slog                               ONLINE       0     0     0
> >         cache
> >           lv_cache                              ONLINE       0     0     0
> >
> > errors: No known data errors
> >
> > ZFS config:
> >
> > [root at clei22 ~]# zfs get all zclei22/01
> > NAME        PROPERTY              VALUE                  SOURCE
> > zclei22/01  type                  filesystem             -
> > zclei22/01  creation              Tue Feb 28 14:06 2017  -
> > zclei22/01  used                  389G                   -
> > zclei22/01  available             3.13T                  -
> > zclei22/01  referenced            389G                   -
> > zclei22/01  compressratio         1.01x                  -
> > zclei22/01  mounted               yes                    -
> > zclei22/01  quota                 none                   default
> > zclei22/01  reservation           none                   default
> > zclei22/01  recordsize            128K                   local
> > zclei22/01  mountpoint            /zclei22/01            default
> > zclei22/01  sharenfs              off                    default
> > zclei22/01  checksum              on                     default
> > zclei22/01  compression           off                    local
> > zclei22/01  atime                 on                     default
> > zclei22/01  devices               on                     default
> > zclei22/01  exec                  on                     default
> > zclei22/01  setuid                on                     default
> > zclei22/01  readonly              off                    default
> > zclei22/01  zoned                 off                    default
> > zclei22/01  snapdir               hidden                 default
> > zclei22/01  aclinherit            restricted             default
> > zclei22/01  canmount              on                     default
> > zclei22/01  xattr                 sa                     local
> > zclei22/01  copies                1                      default
> > zclei22/01  version               5                      -
> > zclei22/01  utf8only              off                    -
> > zclei22/01  normalization         none                   -
> > zclei22/01  casesensitivity       sensitive              -
> > zclei22/01  vscan                 off                    default
> > zclei22/01  nbmand                off                    default
> > zclei22/01  sharesmb              off                    default
> > zclei22/01  refquota              none                   default
> > zclei22/01  refreservation        none                   default
> > zclei22/01  primarycache          metadata               local
> > zclei22/01  secondarycache        metadata               local
> > zclei22/01  usedbysnapshots       0                      -
> > zclei22/01  usedbydataset         389G                   -
> > zclei22/01  usedbychildren        0                      -
> > zclei22/01  usedbyrefreservation  0                      -
> > zclei22/01  logbias               latency                default
> > zclei22/01  dedup                 off                    default
> > zclei22/01  mlslabel              none                   default
> > zclei22/01  sync                  disabled               local
> > zclei22/01  refcompressratio      1.01x                  -
> > zclei22/01  written               389G                   -
> > zclei22/01  logicalused           396G                   -
> > zclei22/01  logicalreferenced     396G                   -
> > zclei22/01  filesystem_limit      none                   default
> > zclei22/01  snapshot_limit        none                   default
> > zclei22/01  filesystem_count      none                   default
> > zclei22/01  snapshot_count        none                   default
> > zclei22/01  snapdev               hidden                 default
> > zclei22/01  acltype               off                    default
> > zclei22/01  context               none                   default
> > zclei22/01  fscontext             none                   default
> > zclei22/01  defcontext            none                   default
> > zclei22/01  rootcontext           none                   default
> > zclei22/01  relatime              off                    default
> > zclei22/01  redundant_metadata    all                    default
> > zclei22/01  overlay               off                    default
>
> On Fri, Mar 3, 2017 at 2:52 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
>
> > Which operating system version are you using for your zfs storage?
> > do: zfs get all your-pool-name
> > use arc_summary.py from the freenas git repo if you wish.
>
> 2017-03-03 10:33 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
>
> > Pool load:
> > [root at clei21 ~]# zpool iostat -v 1
> >                                           capacity     operations    bandwidth
> > pool                                    alloc   free   read  write   read  write
> > --------------------------------------  -----  -----  -----  -----  -----  -----
> > zclei21                                 10.1G  3.62T      0    112    823  8.82M
> >   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0     46    626  4.40M
> > logs                                        -      -      -      -      -      -
> >   lv_slog                                225M  9.72G      0     66    198  4.45M
> > cache                                       -      -      -      -      -      -
> >   lv_cache                              9.81G   204G      0     46     56  4.13M
> > --------------------------------------  -----  -----  -----  -----  -----  -----
> >
> >                                           capacity     operations    bandwidth
> > pool                                    alloc   free   read  write   read  write
> > --------------------------------------  -----  -----  -----  -----  -----  -----
> > zclei21                                 10.1G  3.62T      0    191      0  12.8M
> >   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> > logs                                        -      -      -      -      -      -
> >   lv_slog                                225M  9.72G      0    191      0  12.8M
> > cache                                       -      -      -      -      -      -
> >   lv_cache                              9.83G   204G      0    218      0  20.0M
> > --------------------------------------  -----  -----  -----  -----  -----  -----
> >
> >                                           capacity     operations    bandwidth
> > pool                                    alloc   free   read  write   read  write
> > --------------------------------------  -----  -----  -----  -----  -----  -----
> > zclei21                                 10.1G  3.62T      0    191      0  12.7M
> >   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> > logs                                        -      -      -      -      -      -
> >   lv_slog                                225M  9.72G      0    191      0  12.7M
> > cache                                       -      -      -      -      -      -
> >   lv_cache                              9.83G   204G      0     72      0  7.68M
> > --------------------------------------  -----  -----  -----  -----  -----  -----
>
> On Fri, Mar 3, 2017 at 2:32 PM, Arman Khalatyan <arm2arm at gmail.com> wrote:
>
> > Glusterfs is now in healing mode:
> >
> > Receiver:
> > [root at clei21 ~]# arcstat.py 1
> >     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> > 13:24:49     0     0      0     0    0     0    0     0    0   4.6G   31G
> > 13:24:50   154    80     51    80   51     0    0    80   51   4.6G   31G
> > 13:24:51   179    62     34    62   34     0    0    62   42   4.6G   31G
> > 13:24:52   148    68     45    68   45     0    0    68   45   4.6G   31G
> > 13:24:53   140    64     45    64   45     0    0    64   45   4.6G   31G
> > 13:24:54   124    48     38    48   38     0    0    48   38   4.6G   31G
> > 13:24:55   157    80     50    80   50     0    0    80   50   4.7G   31G
> > 13:24:56   202    68     33    68   33     0    0    68   41   4.7G   31G
> > 13:24:57   127    54     42    54   42     0    0    54   42   4.7G   31G
> > 13:24:58   126    50     39    50   39     0    0    50   39   4.7G   31G
> > 13:24:59   116    40     34    40   34     0    0    40   34   4.7G   31G
> >
> > Sender:
> > [root at clei22 ~]# arcstat.py 1
> >     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> > 13:28:37     8     2     25     2   25     0    0     2   25   468M   31G
> > 13:28:38  1.2K   727     62   727   62     0    0   525   54   469M   31G
> > 13:28:39   815   508     62   508   62     0    0   376   55   469M   31G
> > 13:28:40   994   624     62   624   62     0    0   450   54   469M   31G
> > 13:28:41   783   456     58   456   58     0    0   338   50   470M   31G
> > 13:28:42   916   541     59   541   59     0    0   390   50   470M   31G
> > 13:28:43   768   437     56   437   57     0    0   313   48   471M   31G
> > 13:28:44   877   534     60   534   60     0    0   393   53   470M   31G
> > 13:28:45   957   630     65   630   65     0    0   450   57   470M   31G
> > 13:28:46   819   479     58   479   58     0    0   357   51   471M   31G
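To see what the heal traffic behind these arcstat samples corresponds to on the Gluster side, one could watch the pending-heal counters. This is just an illustrative sketch, assuming the volume name GluReplica shown later in the thread:

  # files/gfids still queued for self-heal, listed per brick
  gluster volume heal GluReplica info
  # a more compact running count of pending entries
  gluster volume heal GluReplica statistics heal-count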
> On Thu, Mar 2, 2017 at 7:18 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
>
> > hey, what are you using for zfs? get an arc status and show it please.
>
> 2017-03-02 9:57 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
>
> > No, ZFS itself is not on top of LVM. Only the SSD was split by LVM into
> > slog (10G) and cache (the rest). But in any case the SSD does not help
> > much under the glusterfs/ovirt load; it has almost 100% cache
> > misses... :( (terrible performance compared with NFS)
>
> On Thu, Mar 2, 2017 at 1:47 PM, FERNANDO FREDIANI <fernando.frediani at upx.com> wrote:
>
> > Am I understanding correctly that you have Gluster on top of ZFS, which
> > is on top of LVM? If so, why was the usage of LVM necessary? I have ZFS
> > without any need of LVM.
> >
> > Fernando
>
> On 02/03/2017 06:19, Arman Khalatyan wrote:
>
> > Hi,
> > I use 3 nodes with zfs and glusterfs.
> > Are there any suggestions to optimize it?
> >
> > Host zfs config, 4TB-HDD + 250GB-SSD:
> > [root at clei22 ~]# zpool status
> >   pool: zclei22
> >  state: ONLINE
> >   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
> > config:
> >
> >         NAME                                    STATE     READ WRITE CKSUM
> >         zclei22                                 ONLINE       0     0     0
> >           HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
> >         logs
> >           lv_slog                               ONLINE       0     0     0
> >         cache
> >           lv_cache                              ONLINE       0     0     0
> >
> > errors: No known data errors
> >
> > Name:                     GluReplica
> > Volume ID:                ee686dfe-203a-4caa-a691-26353460cc48
> > Volume Type:              Replicate (Arbiter)
> > Replica Count:            2 + 1
> > Number of Bricks:         3
> > Transport Types:          TCP, RDMA
> > Maximum no of snapshots:  256
> > Capacity:                 3.51 TiB total, 190.56 GiB used, 3.33 TiB free
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
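Arman's remark about xattr access being the slow path can be checked directly on a brick, since Gluster keeps its metadata in trusted.* extended attributes on every file and under .glusterfs. A rough sketch, using a hypothetical brick path (the dataset above is mounted at /zclei22/01; the actual brick directory is not shown in the thread):

  # dump the gluster xattrs of one file stored on the brick
  getfattr -d -m . -e hex /zclei22/01/brick/some-vm-image.img
  # time a metadata walk of the .glusterfs tree, the operation reported to take ages
  time find /zclei22/01/brick/.glusterfs -type f > /dev/null

Note that the zfs get output earlier already shows xattr=sa, which stores the xattrs in the dnode rather than in hidden xattr directories and is usually recommended for Gluster bricks on ZFS.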
Arman Khalatyan
2017-Mar-06 09:51 UTC
[Gluster-users] [ovirt-users] Replicated Glusterfs on top of ZFS
On Fri, Mar 3, 2017 at 7:00 PM, Darrell Budic <budic at onholyground.com> wrote:

> Why are you using an arbitrator if all your HW configs are identical? I'd
> use a true replica 3 in this case.

This was just the GUI suggestion when I was creating the cluster: it was asking for 3 hosts. I did not even know that an arbiter does not keep the data. I am not so sure whether I can change the GlusterFS volume type to a full replica 3 on the running system; probably I would need to destroy the whole cluster.

> Also, in my experience with gluster and vm hosting, the ZIL/slog degrades
> write performance unless it's a truly dedicated disk. But I have 8 spinners
> backing my ZFS volumes, so trying to share a sata disk wasn't a good zil.
> If yours is dedicated SAS, keep it; if it's SATA, try testing without it.

We also have several huge systems running with ZFS quite successfully over the years. The idea here was to use ZFS + GlusterFS for the HA solution.

> You don't have compression enabled on your zfs volume, and I'd recommend
> enabling relatime on it. Depending on the amount of RAM in these boxes, you
> probably want to limit your zfs arc size to 8G or so (1/4 total ram or
> less). Gluster just works volumes hard during a rebuild; what's the problem
> you're seeing? If it's affecting your VMs, using sharding and tuning client
> & server threads can help avoid interruptions to your VMs while repairs are
> running. If you really need to limit it, you can use cgroups to keep it
> from hogging all the CPU, but it takes longer to heal, of course. There are
> a couple older posts and blogs about it, if you go back a while.

Yes, I saw that GlusterFS is CPU/RAM hungry!!! 99% of all 16 cores were used just for healing 500 GB of VM disks. It was taking almost forever compared with NFS storage (single disk + ZFS SSD cache; for sure one gets a penalty for the HA :) ).
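A rough sketch of the tuning Darrell suggests, assuming the dataset name zclei22/01 from earlier in the thread and hosts with 32 GB of RAM; the exact values are illustrative, not taken from the original exchange:

  # enable lz4 compression and relatime on the brick dataset
  zfs set compression=lz4 zclei22/01
  zfs set relatime=on zclei22/01

  # cap the ARC at ~8 GiB (1/4 of 32 GB); applied at module load,
  # so reboot or reload the zfs module afterwards
  echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf

  # optional: use cgroups to keep heal traffic from hogging the CPU
  # (libcgroup-tools on CentOS 7; process matching may differ per gluster version)
  cgcreate -g cpu:/glusterheal
  cgset -r cpu.cfs_quota_us=400000 glusterheal    # ~4 cores with the default 100 ms period
  cgclassify -g cpu:/glusterheal $(pgrep -f glustershd) $(pgrep -f glusterfsd)

As Darrell notes, throttling the heal this way trades CPU load for a longer heal time.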