Vijay Bellur
2015-Oct-16 16:51 UTC
[Gluster-users] Unnecessary healing in 3-node replication setup on reboot
On Friday 16 October 2015 08:11 PM, Lindsay Mathieson wrote:
> On 17 October 2015 at 00:26, Udo Giacomozzi <udo.giacomozzi at indunet.it
> <mailto:udo.giacomozzi at indunet.it>> wrote:
>
>> To me this sounds like Gluster is not really suited for big files,
>> like as the main storage for VMs - since they are being modified
>> constantly.
>
> Depends :)
>
> Any replicated storage will have to heal its copies if they are written
> to while a node is down. So long as the files can still be read/written
> while being healed, and the resource usage (CPU/network) is not too
> high, healing should be transparent - that's the whole point of a
> replicated filesystem.
>
> I'm guessing that, like me, you are running your gluster storage on
> your VM hosts and that, like me, you are a chronic tweaker, so you tend
> to reboot the hosts more often than you should. In that case you might
> want to consider moving your gluster storage to separate dedicated
> nodes that you can leave up.
>
>> Or am I missing something? Perhaps Gluster can be configured to heal
>> only modified parts of the files?
>
> Not that I know of.

Self-healing in gluster by default syncs only the modified parts of the
files from a source node. Gluster computes a rolling checksum of a file
needing self-heal to identify the regions of the file which need to be
synced over the network. This rolling checksum computation can sometimes
be expensive, and there are plans for a lighter self-healing in 3.8 with
more granular changelogs that can do away with the need for a rolling
checksum.

You may also want to check out sharding (currently in beta with 3.7),
where large files are chunked into smaller fragments. With this scheme,
self-healing (and thereby rolling checksum computation) happens only on
those fragments that undergo changes while one of the nodes in a
replicated set is offline. This has shown nice improvements in gluster's
resource utilization during self-healing.

Regards,
Vijay
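[Editor's note: for anyone wanting to try this, sharding is a per-volume
option. A minimal sketch, assuming Gluster 3.7 and a placeholder volume
name "testvol" - the option names below belong to the 3.7 shard
translator, whose features.shard-block-size defaults to 4MB:

    # enable sharding on an existing test volume (placeholder name);
    # files written after this are chunked into fixed-size fragments
    gluster volume set testvol features.shard on

    # pick a larger fragment size for VM images, e.g. 64MB
    gluster volume set testvol features.shard-block-size 64MB

Note that the setting only applies to files created after it is turned
on, which is consistent with the suggestion below that a fresh volume is
the cleanest way to test it.]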
Lindsay Mathieson
2015-Oct-16 22:17 UTC
[Gluster-users] Unnecessary healing in 3-node replication setup on reboot
On 17 October 2015 at 02:51, Vijay Bellur <vbellur at redhat.com> wrote:
> You may also want to check out sharding (currently in beta with 3.7),
> where large files are chunked into smaller fragments. With this scheme,
> self-healing (and thereby rolling checksum computation) happens only on
> those fragments that undergo changes while one of the nodes in a
> replicated set is offline. This has shown nice improvements in
> gluster's resource utilization during self-healing.

Very interesting. I presume you'd have to create a new volume to test
it. Also, you'd lose the ability to access the file on the host
filesystem in emergencies, wouldn't you?

--
Lindsay
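[Editor's note: on brick-level access - with sharding, only the first
fragment stays at the file's original path on the brick; the remaining
fragments land in a hidden .shard directory, named by the file's GFID
plus a block index. A sketch of what emergency recovery would look like,
with hypothetical paths and <gfid> as a placeholder:

    # the first block keeps the original brick path
    ls -l /data/gluster/systems/images/vm-100-disk-1.qcow2

    # read the file's GFID from its extended attributes (on the brick)
    getfattr -n trusted.gfid -e hex \
        /data/gluster/systems/images/vm-100-disk-1.qcow2

    # remaining blocks live under the brick's hidden .shard directory
    ls /data/gluster/systems/.shard/ | grep <gfid>

    # recovery = concatenating base file + shards in index order
    cd /data/gluster/systems
    cat images/vm-100-disk-1.qcow2 .shard/<gfid>.1 .shard/<gfid>.2 \
        > /tmp/recovered.img

So the data is still reachable on the brick, but no longer as one plain
file you can simply copy off.]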
Lindsay Mathieson
2015-Oct-16 22:45 UTC
[Gluster-users] Unnecessary healing in 3-node replication setup on reboot
On 17 October 2015 at 02:51, Vijay Bellur <vbellur at redhat.com> wrote:
> You may also want to check out sharding (currently in beta with 3.7),
> where large files are chunked into smaller fragments. With this scheme,
> self-healing (and thereby rolling checksum computation) happens only on
> those fragments that undergo changes while one of the nodes in a
> replicated set is offline. This has shown nice improvements in
> gluster's resource utilization during self-healing.

Does it affect read speed and random I/O? I guess that would depend on
the method used to calculate the shard location for a given block.

It could be quite interesting on top of ZFS - I'd love to test it.

--
Lindsay
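[Editor's note: assuming placement is a plain fixed-size split - i.e.
fragment N holds bytes [N*size, (N+1)*size) - then locating the fragment
for a given offset is constant-time integer arithmetic, so it should add
little overhead to reads or random I/O. A sketch with hypothetical
numbers:

    # which 64MB shard holds byte offset 123456789?
    OFFSET=123456789
    SHARD_SIZE=$((64 * 1024 * 1024))
    echo "shard index:     $((OFFSET / SHARD_SIZE))"    # -> 1
    echo "offset in shard: $((OFFSET % SHARD_SIZE))"    # -> 56347925

The fixed-split layout is an inference from the description above, not
something confirmed in this thread.]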
Udo Giacomozzi
2015-Oct-17 15:38 UTC
[Gluster-users] Unnecessary healing in 3-node replication setup on reboot
On 16.10.2015 at 18:51, Vijay Bellur wrote:
> Self-healing in gluster by default syncs only the modified parts of
> the files from a source node. Gluster computes a rolling checksum of a
> file needing self-heal to identify the regions of the file which need
> to be synced over the network. This rolling checksum computation can
> sometimes be expensive, and there are plans for a lighter self-healing
> in 3.8 with more granular changelogs that can do away with the need
> for a rolling checksum.

I did some tests (see below) - could you please check this and tell me
whether it is normal?

For example, I have a 200GB VM disk image in my volume (the biggest
file). About 75% of that disk is currently unused space and writes are
only about 50 kbytes/sec. Yet that 200GB disk image /always/ "heals" for
a very long time (at least 30 minutes) - even though I'm pretty sure
only a few blocks can have changed.

Anyway, I just rebooted a node (about 2-3 minutes downtime) to collect
some information:

* in total I have about 790GB* of files in that Gluster volume
* about 411GB* belong to active VM HDD images; the rest are
  backup/template files
* only VM HDD images are being healed (max 15 files)
* while healing, glusterfsd shows varying CPU usage between 70% and 650%
  (it's a 16-core server); 106 minutes of total CPU time once healing
  completed
* once healing completed, the machine had received a total of 7.0 GB and
  sent 3.6 GB over the internal network (so, yes, you're right that not
  all contents are transferred)
* *total heal time: a whopping 58 minutes*

(* these are summed-up file sizes; "du" and "df" show smaller usage)

Node details (all 3 nodes are identical):

* DELL PowerEdge R730
* Intel Xeon E5-2600 @ 2.4GHz
* 64 GB DDR4 RAM
* the server can gzip-compress about 1 GB of data per second (all cores
  together)
* 3 TB HW-RAID10 HDD (2.7TB reserved for Gluster); at least 500 MB/s
  write speed, 350 MB/s read speed
* redundant 1GBit/s internal network
* Debian 7 Wheezy / Proxmox 3.4, Kernel 2.6.32, Gluster 3.5.2

Volume setup:

# gluster volume info systems

Volume Name: systems
Type: Replicate
Volume ID: b2d72784-4b0e-4f7b-b858-4ec59979a064
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: metal1:/data/gluster/systems
Brick2: metal2:/data/gluster/systems
Brick3: metal3:/data/gluster/systems
Options Reconfigured:
cluster.server-quorum-ratio: 51%

Note that `gluster volume heal "systems" info` takes 3-10 seconds to
complete during a heal - I hope that doesn't slow down healing, since I
tend to run that command frequently.

Would you expect these results, or is something wrong? Would upgrading
to Gluster 3.6 or 3.7 improve healing performance?

Thanks,
Udo
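[Editor's note: a lighter-weight way to watch heal progress than
repeatedly running the full `heal info` scan is sketched below. The
`statistics heal-count` subcommand exists in later 3.x releases but may
not be available on 3.5.2, so treat this as an assumption to verify on
your version:

    # per-brick count of entries still pending heal (cheaper than the
    # full "heal info" listing)
    gluster volume heal systems statistics heal-count

    # poll once a minute rather than hammering the self-heal daemon
    watch -n 60 'gluster volume heal systems statistics heal-count'

    # CPU time accumulated so far by the brick processes
    ps -o etime,cputime,cmd -C glusterfsd
]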