After following the instructions to convert a distributed volume to a distributed-replicated volume running 3.3, we found we were getting a very large number of "possible split-brain" errors in many of our Gluster directories. Somewhere in this process, some files were actually lost. The error logs are long, and finding exactly what happened is difficult, but this was on a 70 TB, 3-brick (or 6-brick, if you count replication) cluster. I am beginning to feel that rsync within an InfiniBand-based network is the safest way to replicate (I'm not so worried about bricks going offline), and I had been quite happy with the Gluster performance before I decided to be "responsible" and replicate.

All of that being said, can anyone answer this question? I changed the Gluster mount to NFS-based instead of native-client-based, but I don't know exactly what to do to *force* Gluster to leave itself alone -- mount all of the underlying bricks read-only on their home systems? *Is there any orderly way to tell a Gluster volume: I want you to stop replicating and revert to distributed only?* Then I could let rsync do its job while still allowing read-only access to the distributed cluster.

In the meantime, I am seeing self-heal errors on the client machine that look like this:

[2013-05-23 08:08:57.240567] E [afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1: Unable to self-heal permissions/ownership of '/cfce1/data' (possible split-brain). Please fix the file on all backend volumes
[2013-05-23 08:08:57.240959] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-gf1-replicate-1: background meta-data entry self-heal failed on /cfce1/data
[2013-05-23 08:09:16.900705] E [afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1: Unable to self-heal permissions/ownership of '/cfce1/data' (possible split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:16.901100] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-gf1-replicate-1: background meta-data entry self-heal failed on /cfce1/data
[2013-05-23 08:09:20.654338] E [afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1: Unable to self-heal permissions/ownership of '/cfce1/data' (possible split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:20.654695] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-gf1-replicate-1: background meta-data entry self-heal failed on /cfce1/data
[2013-05-23 08:09:25.829579] E [afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1: Unable to self-heal permissions/ownership of '/cfce1/data' (possible split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:25.829973] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 0-gf1-replicate-1: background meta-data entry self-heal failed on /cfce1/data

"data" is a great huge directory, but at least it's replaceable. After actually losing files, though, I'm nervous. This happened just after a discussion with one of the frequent contributors here, who told me, "I don't do replication with Gluster because replication is REALLY difficult."

Can anyone help?

Matt Temple

------
Matt Temple
Director, Research Computing
Dana-Farber Cancer Institute.
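P.S. From poking at the CLI docs, I *think* the orderly way to drop back to distributed-only might be a remove-brick with a new replica count, something like the sketch below. The server/brick names are made up just to show the shape of the command, and I'm not certain the 'replica' keyword on remove-brick behaves this way in 3.3, so please treat this as a guess rather than a recipe:

    # See how the volume is currently laid out
    gluster volume info gf1

    # List the files the self-heal daemon has flagged as split-brain
    gluster volume heal gf1 info split-brain

    # Tentative: drop the replica count from 2 back to 1 by removing the
    # replica bricks (hostnames/paths below are placeholders, not my real bricks)
    gluster volume remove-brick gf1 replica 1 \
        serverB1:/export/brick1 serverB2:/export/brick2 serverB3:/export/brick3 force

If that is the right idea, I would then mount the remaining distributed volume read-only and let rsync handle the copying to the old replica bricks. Corrections welcome.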