On 26/03/2016 12:14 AM, Ravishankar N wrote:
> I think you need the exact no. of files and size of files that need
> healing to make any meaningful comparison of self-heal performance
> across versions.
> VM workloads with sharding might not be the ideal 'reproducer' since
> you really don't know how many shards get modified when a replica is
> down and I/O on the VMs happen. I suppose you could try testing the
> heal performance of a specific no. of files on a sharded volume and
> compare results.

Maybe my subject description was poor - while heal progress is not the
best, it's the I/O stalls that *really* concern me. If I reboot a node
(or it crashes, etc.), any VM running on the cluster at that moment
freezes on I/O access once the heal kicks in, and stays frozen until the
heal finishes, which can take over an hour.

I see similar behaviour noted in the thread "GlusterFS cluster stalls if
one server from the cluster goes down and then comes back up".

I tried setting "cluster.data-self-heal" off as suggested on that thread
and it seems to have improved things. I'm in the middle of maintenance
right now and will test it more later.

thanks,
--
Lindsay Mathieson
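For reference, that workaround is applied with the standard volume-set
command. A minimal sketch, assuming a hypothetical volume named
"datavol" (substitute your own volume name):

    # turn off data self-heal triggered from the client mounts
    gluster volume set datavol cluster.data-self-heal off

    # revert once the underlying bug is confirmed fixed
    gluster volume set datavol cluster.data-self-heal on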
Pranith Kumar Karampuri
2016-Mar-26 13:32 UTC
[Gluster-users] Very poor heal behaviour in 3.7.9
On 03/26/2016 06:55 AM, Lindsay Mathieson wrote:
> On 26/03/2016 12:14 AM, Ravishankar N wrote:
>> I think you need the exact no. of files and size of files that need
>> healing to make any meaningful comparison of self-heal performance
>> across versions.
>> VM workloads with sharding might not be the ideal 'reproducer' since
>> you really don't know how many shards get modified when a replica is
>> down and I/O on the VMs happen. I suppose you could try testing the
>> heal performance of a specific no. of files on a sharded volume and
>> compare results.
>
> Maybe my subject description was poor - while heal progress is not the
> best, it's the I/O stalls that *really* concern me. If I reboot a node
> (or it crashes, etc.), any VM running on the cluster at that moment
> freezes on I/O access once the heal kicks in, and stays frozen until
> the heal finishes, which can take over an hour.
>
> I see similar behaviour noted in the thread "GlusterFS cluster stalls
> if one server from the cluster goes down and then comes back up".
>
> I tried setting "cluster.data-self-heal" off as suggested on that
> thread and it seems to have improved things. I'm in the middle of
> maintenance right now and will test it more later.

Yes, this is a bug we are addressing for 3.7.10. The patch is already
merged: http://review.gluster.org/13564

Pranith

> thanks,
>