On 03/20/2017 06:31 PM, Bernhard D?bi wrote:
> Hi Ravi,
>
> Thank you very much for looking into this.
> The gluster volumes are used by CommVault Simpana to store backup 
> data. Nothing/Nobody should access the underlying infrastructure.
>
> While looking at the xattrs of the files, I noticed that the only
> difference was the bit-rot.version. So I assume that something went
> wrong in the synchronization of the bit-rot data, and that differing
> bit-rot.versions are treated like a split-brain situation, with access
> denied because there is no guarantee of correctness. This is just a
> wild guess.
Hi Bernhard,
The bit-rot version can differ between bricks of a replica when I/O
succeeded on only one brick while the other brick was down (AFR
self-heal will later heal the contents, but it does not modify the
bitrot xattrs). So that is not a problem.
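
To make the xattr comparison concrete, here is a minimal sketch (not part
of gluster, just an illustration). It assumes you run it as root on the
brick hosts, since `trusted.*` xattrs are not readable otherwise, and the
helper names are made up for this example. It lists which xattr names
differ between two copies and whether the bad-file marker is present:

```python
import os

# xattr name as quoted in this thread
BAD_FILE_XATTR = "trusted.bit-rot.bad-file"


def read_xattrs(path):
    """Return {name: value-bytes} for every xattr on path (Linux only)."""
    return {name: os.getxattr(path, name) for name in os.listxattr(path)}


def diff_xattrs(a, b):
    """Sorted xattr names whose values differ or that exist on one side only."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def is_marked_bad(xattrs):
    """True if the bitrot bad-file marker is present in the xattr dict."""
    return BAD_FILE_XATTR in xattrs
```

You would call `read_xattrs()` on the same file under each brick directory
and pass the two results to `diff_xattrs()`. A difference only in
trusted.bit-rot.version is the harmless case described above.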
>
> Over the weekend I identified hundreds of files with input/output
> errors. I compared the sha256sum on both bricks; they were always the
> same. I then deleted the affected files from gluster and recreated
> them. This should have fixed the issue. Verification is still running.
>
> If you're interested in the root cause, I can send you more log files
> and the xattrs of some files.
If, as you said, you did not access the underlying bricks directly, then
it could be a bitrot bug. If you don't mind, please raise a BZ under the
bitrot component and the appropriate gluster version, with all client
and brick logs attached.
Also, if you have some kind of reproducer, that would help a lot.
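
If it helps when putting the report together, the brick-side checksum
comparison you described could be scripted roughly like this. It is only
a sketch: the function names are invented for the example, and the paths
you pass in would be the copies of one file under each brick directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup containers need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """Return (all_equal, {path: hexdigest}) for the given brick copies."""
    digests = {p: sha256_of(p) for p in paths}
    return len(set(digests.values())) <= 1, digests
```

Attaching the per-brick digests along with the xattrs of a few affected
files would show directly that the contents agree while both copies are
flagged.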
-Ravi
>
>
> Best Regards
> Bernhard
>
>
> 2017-03-20 12:57 GMT+01:00 Ravishankar N <ravishankar at redhat.com>:
>
>     SFILE_CONTAINER_080 is the one which seems to be in split-brain.
>     SFILE_CONTAINER_046, for which you have provided the getfattr
>     output, hard links etc., doesn't seem to be in split-brain.  We do
>     see that the fops on SFILE_CONTAINER_046 are failing on the client
>     translator itself due to EIO:
>
>     [2017-03-17 19:49:56.088867] E [MSGID: 114031]
>     [client-rpc-fops.c:444:client3_3_open_cbk]
>     0-Server_Legal_01-client-0: remote operation failed. Path:
>     /Server_Legal/CV_MAGNETIC/V_944453/CHUNK_9291168/SFILE_CONTAINER_046
>     (bfdfe21a-1af3-474b-a6a4-bc0e17edb529) [Input/output error]
>
>     [2017-03-17 19:49:56.089012] E [MSGID: 114031]
>     [client-rpc-fops.c:444:client3_3_open_cbk]
>     0-Server_Legal_01-client-1: remote operation failed. Path:
>     /Server_Legal/CV_MAGNETIC/V_944453/CHUNK_9291168/SFILE_CONTAINER_046
>     (bfdfe21a-1af3-474b-a6a4-bc0e17edb529) [Input/output error]
>
>     which is why the sha256sum on the mount gave EIO. That in turn is
>     because the file appears to be marked corrupt on both bricks: the
>     'trusted.bit-rot.bad-file' xattr is set.
>
>     Did you write to the files directly on the backend? What is
>     interesting is that the sha256sum is the same on both bricks
>     even though both copies are marked as bad by bitrot.
>
>     -Ravi
>
>
>     On 03/18/2017 03:20 AM, Bernhard D?bi wrote:
>>     Hi,
>>
>>     I have a situation
>>
>>     The volume logfile reports a possible split-brain, but when I try
>>     to heal, it fails because the file is not in split-brain. Any ideas?
>>
>>
>>
>>
>>     Regards
>>
>>     Bernhard
>>
>>
>>