thr3ads.net - Gluster users - [Gluster-users] heaps split-brains during back-transfert [Aug 2015]


Geoffrey Letessier

2015-Aug-04 14:53 UTC

[Gluster-users] heaps split-brains during back-transfert

OK, after a couple of hours (my last message was held up, but it was sent around
9:30 AM French time), here is the result:
# du -sk /home/sterpone_team/
10583360073     /home/sterpone_team/

In other words: ~9.86TB
So, as you can read below, the result is essentially the same as before (with
'du -sh': ~9.9TB) and quite different from the gluster volume quota output.
The gap is roughly 0.5TB (500-600GB), which is a very big difference:
# gluster volume quota vol_home list /sterpone_team
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------
/sterpone_team 
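The unit conversion above can be double-checked with a few lines of Python. This is only a sketch: it assumes that `du -sk` counts 1 KiB blocks and that the "TB" shown by `gluster volume quota ... list` is a binary unit (TiB). The 9.3 figure is the "Used" value for /sterpone_team from the quota listing quoted further down in this thread.

```python
# Sketch: convert the `du -sk` result to TiB and compare it with the quota figure.
# Assumption: both tools use binary units (1 TiB = 1024**3 KiB).
KIB_PER_TIB = 1024 ** 3


def kib_to_tib(kib: int) -> float:
    """Convert a size in KiB (as printed by `du -sk`) to TiB."""
    return kib / KIB_PER_TIB


du_kib = 10583360073      # from `du -sk /home/sterpone_team/`
quota_used_tib = 9.3      # "Used" column of the quota listing quoted below

du_tib = kib_to_tib(du_kib)
gap_gib = (du_tib - quota_used_tib) * 1024

print(f"du reports : {du_tib:.2f} TiB")
print(f"quota shows: {quota_used_tib:.2f} TiB")
print(f"difference : {gap_gib:.0f} GiB")
```

Under these assumptions the difference lands in the 500-600GB range mentioned above.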

Thanks,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr

On 4 August 2015 at 10:26, Geoffrey Letessier <geoffrey.letessier at cnrs.fr> wrote:
> Hi Vijay,
> 
>> du command can round-off the values, could you check the values with 'du -sk'?
> It's ongoing. I'll let you know the new value ASAP.
> 
>> We will investigate on this issue and update you soon on the same.
> FYI it mainly concerns one brick per replica pair (and thus its replica partner: brick1 on storage1 and brick1 on storage2).
> To avoid filling up my /var partition, yesterday I set the brick-log-level parameter to CRITICAL, but now I know there is a big problem on several bricks.
> 
> Thanks in advance for the help and fix
> Geoffrey
> 
> 
> On 4 August 2015 at 05:54, Vijaikumar M <vmallika at redhat.com> wrote:
> 
>> Adding Raghavendra.G for RDMA issue...
>> 
>> 
>> Hi Geoffrey,
>> 
>> Please find my comments in-line..
>> 
>> Thanks,
>> Vijay
>> 
>> 
>> On Monday 03 August 2015 09:15 PM, Geoffrey Letessier wrote:
>>> Hi Vijay,
>>> 
>>> Yes of course, I sent my email after running some tests and checks, and the result was still wrong (even a couple of hours to a day after forcing every brick to start), until I decided to run "du" on every quota path. Now everything seems roughly OK, as you can read below:
>>> # gluster volume quota vol_home list
>>>                   Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
>>> ---------------------------------------------------------------------------------------------------------------------------
>>> /simlab_team                               5.0TB       80%       1.2TB   3.8TB              No                   No
>>> /amyloid_team                              7.0TB       80%       4.9TB   2.1TB              No                   No
>>> /amyloid_team/nguyen                       3.5TB       80%       2.0TB   1.5TB              No                   No
>>> /sacquin_team                             10.0TB       80%      55.3GB   9.9TB              No                   No
>>> /baaden_team                              20.0TB       80%      11.5TB   8.5TB              No                   No
>>> /derreumaux_team                           5.0TB       80%       2.2TB   2.8TB              No                   No
>>> /sterpone_team                            14.0TB       80%       9.3TB   4.7TB              No                   No
>>> /admin_team                                1.0TB       80%      15.8GB 1008.2GB              No                   No
>>> # for path in $(gluster volume quota vol_home list|awk 'NR>2 {print $1}'); do pdsh -w storage[1,3] "du -sh /export/brick_home/brick{1,2}/data$path"; done
>>> storage1: 219G /export/brick_home/brick1/data/simlab_team
>>> storage3: 334G /export/brick_home/brick1/data/simlab_team
>>> storage1: 307G /export/brick_home/brick2/data/simlab_team
>>> storage3: 327G /export/brick_home/brick2/data/simlab_team
>>> storage1: 1,2T /export/brick_home/brick1/data/amyloid_team
>>> storage3: 1,2T /export/brick_home/brick1/data/amyloid_team
>>> storage1: 1,2T /export/brick_home/brick2/data/amyloid_team
>>> storage3: 1,2T /export/brick_home/brick2/data/amyloid_team
>>> storage1: 505G /export/brick_home/brick1/data/amyloid_team/nguyen
>>> storage1: 483G /export/brick_home/brick2/data/amyloid_team/nguyen
>>> storage3: 508G /export/brick_home/brick1/data/amyloid_team/nguyen
>>> storage3: 503G /export/brick_home/brick2/data/amyloid_team/nguyen
>>> storage3: 16G /export/brick_home/brick1/data/sacquin_team
>>> storage1: 14G /export/brick_home/brick1/data/sacquin_team
>>> storage3: 13G /export/brick_home/brick2/data/sacquin_team
>>> storage1: 13G /export/brick_home/brick2/data/sacquin_team
>>> storage1: 3,2T /export/brick_home/brick1/data/baaden_team
>>> storage1: 2,8T /export/brick_home/brick2/data/baaden_team
>>> storage3: 2,9T /export/brick_home/brick1/data/baaden_team
>>> storage3: 2,7T /export/brick_home/brick2/data/baaden_team
>>> storage3: 588G /export/brick_home/brick1/data/derreumaux_team
>>> storage1: 566G /export/brick_home/brick1/data/derreumaux_team
>>> storage1: 563G /export/brick_home/brick2/data/derreumaux_team
>>> storage3: 610G /export/brick_home/brick2/data/derreumaux_team
>>> storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
>>> storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
>>> storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
>>> storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
>>> storage3: 519M /export/brick_home/brick1/data/admin_team
>>> storage1: 11G /export/brick_home/brick1/data/admin_team
>>> storage3: 974M /export/brick_home/brick2/data/admin_team
>>> storage1: 4,0G /export/brick_home/brick2/data/admin_team
>>> 
>>> In short:
>>>  simlab_team: ~1.2TB
>>>  amyloid_team: ~4.8TB
>>>  amyloid_team/nguyen: ~2TB
>>>  sacquin_team: ~56GB
>>>  baaden_team: ~11.6TB
>>>  derreumaux_team: 2.3TB
>>>  sterpone_team: ~10TB
>>>  admin_team: ~16.5GB
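The per-team totals in this summary can be reproduced by adding up the four per-brick `du -sh` values for each path. A minimal sketch follows, assuming binary units and the French-locale decimal comma seen in the output above; the helper names (`du_to_gib`, `total_gib`) are made up for illustration and are not part of any Gluster tool.

```python
# Sketch: sum human-readable `du -sh` sizes printed with a decimal comma.
# Assumption: du uses binary multiples (K, M, G, T = powers of 1024).
UNIT_TO_GIB = {"K": 1 / 1024 ** 2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def du_to_gib(size: str) -> float:
    """Convert a string such as '2,5T' or '519M' to GiB."""
    number, unit = size[:-1], size[-1]
    return float(number.replace(",", ".")) * UNIT_TO_GIB[unit]


def total_gib(sizes) -> float:
    """Add up several du sizes and return the total in GiB."""
    return sum(du_to_gib(s) for s in sizes)


# Per-brick values copied from the pdsh output above.
sterpone = ["2,5T", "2,7T", "2,4T", "2,4T"]
admin = ["519M", "11G", "974M", "4,0G"]

print(f"sterpone_team: {total_gib(sterpone) / 1024:.1f} TiB")
print(f"admin_team   : {total_gib(admin):.1f} GiB")
```

Because `du -sh` already rounds each value, the totals are only approximate, which is consistent with the "~" figures in the summary.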
>>> 
>>> There are still some differences, but overall it is quite correct (except for the sterpone_team quota).
>>> 
>>> But I also noticed something strange: here are the results of every "du" I ran (on the glusterfs mount point) to force the quota size to be recomputed:
>>> # du -sh /home/simlab_team/
>>> 1,2T    /home/simlab_team/
>>> # du -sh /home/amyloid_team/
>>> 4,7T    /home/amyloid_team/
>>> # du -sh /home/sacquin_team/
>>> 56G     /home/sacquin_team/
>>> # du -sh /home/baaden_team/
>>> 12T     /home/baaden_team/
>>> # du -sh /home/derreumaux_team/
>>> 2,3T    /home/derreumaux_team/
>>> # du -sh /home/sterpone_team/
>>> 9,9T    /home/sterpone_team/
>>> 
>>> As you can see above, I don't understand why the quota size computed by the quota daemon differs from the "du" result, especially for /sterpone_team.
>>> 
>> du command can round-off the values, could you check the values with 'du -sk'?
>>  
>> 
>> 
>>> Now, concerning all the hangs I encountered, can you tell me the brand of your InfiniBand interconnect? On our side, we use QLogic; maybe the problem originates there (Intel/QLogic and Mellanox are quite different).
>>> 
>>> 
>>> Concerning the brick logs, I just noticed a lot of errors in one of my brick logs, and the file takes around 5GB. Here is an extract:
>>> # tail -30l /var/log/glusterfs/bricks/export-brick_home-brick1-data.log
>>> [2015-08-03 15:32:37.408204] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.410017] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.410689] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.410860] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.412638] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.413435] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.413640] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.415325] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.416102] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.416308] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.418025] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.418799] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.419001] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.420681] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.421416] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.421607] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.423208] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.423882] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.424089] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.425863] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.426581] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.426790] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.428438] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.429133] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> [2015-08-03 15:32:37.429325] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>> The message "W [MSGID: 120003] [quota.c:759:quota_build_ancestry_cbk] 0-vol_home-quota: parent is NULL [Argument invalide]" repeated 9016 times between [2015-08-03 15:31:55.379522] and [2015-08-03 15:32:00.997113]
>>> [2015-08-03 15:32:37.442244] I [MSGID: 115036] [server.c:545:server_rpc_notify] 0-vol_home-server: disconnecting connection from lucifer.lbt.ibpc.fr-21153-2015/08/03-15:31:23:33181-vol_home-client-0-0-0
>>> [2015-08-03 15:32:37.442286] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-vol_home-server: Shutting down connection lucifer.lbt.ibpc.fr-21153-2015/08/03-15:31:23:33181-vol_home-client-0-0-0
>>> The message "E [MSGID: 113104] [posix-handle.c:154:posix_make_ancestryfromgfid] 0-vol_home-posix: could not read the link from the gfid handle /export/brick_home/brick1/data/.glusterfs/19/b6/19b67130-b409-4666-9237-2661241a8847 [Aucun fichier ou dossier de ce type]" repeated 755 times between [2015-08-03 15:31:25.553801] and [2015-08-03 15:31:43.528305]
>>> The message "E [MSGID: 113104] [posix-handle.c:154:posix_make_ancestryfromgfid] 0-vol_home-posix: could not read the link from the gfid handle /export/brick_home/brick1/data/.glusterfs/81/5a/815acde3-7f47-410b-9131-e8d75c71a5bd [Aucun fichier ou dossier de ce type]" repeated 8147 times between [2015-08-03 15:31:25.521255] and [2015-08-03 15:31:53.593932]
>>> Do you have any idea where this issue comes from and what I have to do to fix it?
>> We will investigate this issue and update you soon.
>> 
>> 
>> 
>> 
>>> 
>>> # grep -rc "\] E \[" /var/log/glusterfs/bricks/export-brick_home-brick{1,2}-data.log
>>> /var/log/glusterfs/bricks/export-brick_home-brick1-data.log:11038933
>>> /var/log/glusterfs/bricks/export-brick_home-brick2-data.log:243
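With eleven million error lines, a plain count says little about which code paths are responsible. A rough Python sketch like the one below could tally entries by severity and reporting function. It assumes each log entry sits on a single line in the `[timestamp] LEVEL [file:line:function]` layout visible in the extracts above, with an optional `[MSGID: n]` field; the regular expression is derived from those extracts only, not from a GlusterFS specification.

```python
# Sketch: tally GlusterFS brick-log entries by (severity, function).
# Assumption: one entry per line, formatted like the extracts in this thread.
import re
from collections import Counter

LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: \d+\] )?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
)


def tally(lines) -> Counter:
    """Count entries per (level, function); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match["level"], match["func"])] += 1
    return counts


# Two sample entries shortened from the extracts above.
sample = [
    "[2015-08-03 15:32:37.408204] E [dict.c:1418:dict_copy_with_ref] 0-dict: invalid argument: dict",
    "[2015-08-03 15:32:37.442244] I [MSGID: 115036] [server.c:545:server_rpc_notify] 0-vol_home-server: disconnecting connection",
]

for (level, func), count in tally(sample).most_common():
    print(level, func, count)
```

For a real log, the same function can be fed an open file object, for example `tally(open(path, errors="replace"))`. Summary lines of the form `The message "..." repeated N times` do not match the pattern and are ignored by this sketch.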
>>> 
>>> FYI I updated GlusterFS to the latest version (v3.7.3) 2 days ago.
>>> 
>>> Thanks in advance for your next answers, and thanks to the whole support team for all your help.
>>> Best,
>>> Geoffrey
>>> 
>>> 
>>> On 3 August 2015 at 08:51, Vijaikumar M <vmallika at redhat.com> wrote:
>>> 
>>>> Hi Geoffrey,
>>>> 
>>>> Please find my comments in-line.
>>>> 
>>>> 
>>>> On Saturday 01 August 2015 04:10 AM, Geoffrey Letessier wrote:
>>>>> Hello,
>>>>> 
>>>>> As Krutika said, I successfully resolved all the split-brains (more than 3450) that appeared after the first data transfer from a backup server to my brand-new volume, but...
>>>>> 
>>>>> The next step in validating my new volume was to enable quota on it; now, more than a day after enabling it, all the results are still completely wrong:
>>>>> Example:
>>>>> # df -h /home/sterpone_team
>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>> ib-storage1:vol_home.tcp
>>>>>                        14T  3,3T   11T  24% /home
>>>>> # pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data/sterpone_team
>>>>> storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
>>>>> storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>>> storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
>>>>> storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>>> As you can read, the data for this account totals around 10TB, yet quota reports only 3.3TB used.
>>>>> 
>>>>> Worse:
>>>>> # pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data/baaden_team
>>>>> storage3: 2,9T /export/brick_home/brick1/data/baaden_team
>>>>> storage3: 2,7T /export/brick_home/brick2/data/baaden_team
>>>>> storage1: 3,2T /export/brick_home/brick1/data/baaden_team
>>>>> storage1: 2,8T /export/brick_home/brick2/data/baaden_team
>>>>> # df -h /home/baaden_team/
>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>> ib-storage1:vol_home.tcp
>>>>>                        20T  786G   20T   4% /home
>>>>> # gluster volume quota vol_home list /baaden_team
>>>>>                   Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
>>>>> ---------------------------------------------------------------------------------------------------------------------------
>>>>> /baaden_team                              20.0TB       80%     785.6GB  19.2TB              No                   No
>>>>> This account holds around 11.6TB, yet quota detects only 786GB used.
>>>>> 
>>>> As you mentioned below, some of the bricks were down. 'quota list' only shows the aggregated value of the online bricks. Could you please check 'quota list' again when all the bricks are up and running?
>>>> I suspect quota initialization might not have completed because a brick was down.
>>>> 
>>>>> Can someone help me fix it, knowing that I previously upgraded GlusterFS from 3.5.3 to 3.7.2 precisely to solve a similar problem?
>>>>> 
>>>>> For information, in quotad log file:
>>>>> [2015-07-31 22:13:00.574361] I [MSGID: 114047] [client-handshake.c:1225:client_setvolume_cbk] 0-vol_home-client-7: Server and Client lk-version numbers are not same, reopening the fds
>>>>> [2015-07-31 22:13:00.574507] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol_home-client-7: Server lk version = 1
>>>>> 
>>>>> Is there any causal connection (a client/server version conflict)?
>>>>> 
>>>>> Here is what I noticed in my /var/log/glusterfs/quota-mount-vol_home.log file:
>>>>> ... <same kind of lines>
>>>>> [2015-07-31 21:26:15.247269] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_home-client-5: changing port to 49162 (from 0)
>>>>> [2015-07-31 21:26:15.250272] E [socket.c:2332:socket_connect_finish] 0-vol_home-client-5: connection to 10.0.4.2:49162 failed (Connexion refusée)
>>>>> [2015-07-31 21:26:19.250545] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_home-client-5: changing port to 49162 (from 0)
>>>>> [2015-07-31 21:26:19.253643] E [socket.c:2332:socket_connect_finish] 0-vol_home-client-5: connection to 10.0.4.2:49162 failed (Connexion refusée)
>>>>> ... <same kind of lines>
>>>>> 
>>>> The connection is refused because the brick is down.
>>>> 
>>>>> <A few minutes later:> OK, this was due to one brick that was down. It's strange: since I upgraded GlusterFS to 3.7.x I have noticed a lot of bricks going down, sometimes a few moments after starting the volume, sometimes after a couple of days or weeks. This never happened with GlusterFS versions 3.3.1 and 3.5.3.
>>>>> 
>>>> Could you please provide the brick log? We will check the log for this issue; once it is fixed, we can initiate quota healing again.
>>>> 
>>>> 
>>>>> Now I need to stop and start the volume because I am again seeing hangs with "gluster volume quota ...", "df", etc. Once again, I never noticed this kind of hang with the previous versions of GlusterFS I used; is it "expected"?
>>>> 
>>>> Following your previous mail, we tried to reproduce the hang problem, but we could not reproduce it.
>>>> 
>>>> 
>>>> 
>>>>> Once again, thank you very much in advance.
>>>>> Geoffrey
>>>>> 
>>>>> 
>>>>> On 31 July 2015 at 11:26, Niels de Vos <ndevos at redhat.com> wrote:
>>>>> 
>>>>>> On Wed, Jul 29, 2015 at 12:44:38AM +0200, Geoffrey Letessier wrote:
>>>>>>> OK, thank you Niels for this explanation. Now it makes sense.
>>>>>>> 
>>>>>>> And concerning all the split-brains that appeared during the back-transfer, do you have any idea where they are coming from?
>>>>>> 
>>>>>> Sorry, no, I don't know how that is happening in your environment. I'll
>>>>>> try to find someone who understands more about it and can help you with
>>>>>> that.
>>>>>> 
>>>>>> Niels
>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> Geoffrey
>>>>>>> 
>>>>>>> On 29 July 2015 at 00:02, Niels de Vos <ndevos at redhat.com> wrote:
>>>>>>> 
>>>>>>>> On Tue, Jul 28, 2015 at 03:46:37PM +0200, Geoffrey Letessier wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> In addition to all the split-brains reported, is it normal to see
>>>>>>>>> thousands and thousands (tens or even hundreds of thousands) of
>>>>>>>>> broken symlinks when browsing the .glusterfs directory on each brick?
>>>>>>>> 
>>>>>>>> Yes, I think it is normal. A symlink points to a particular filename,
>>>>>>>> possibly in a different directory. If the target file is located on a
>>>>>>>> different brick, the symlink points to a non-local file.
>>>>>>>> 
>>>>>>>> Consider this example with two bricks in a distributed volume:
>>>>>>>> - file: README
>>>>>>>> - symlink: IMPORTANT -> README
>>>>>>>> 
>>>>>>>> When the distribution algorithm is done, README 'hashes' to brick-A. The
>>>>>>>> symlink 'hashes' to brick-B. This means that README will be located on
>>>>>>>> brick-A, and the symlink with name IMPORTANT will be located on
>>>>>>>> brick-B. Because README is not on the same brick as IMPORTANT, the
>>>>>>>> symlink points to the non-existing file README on brick-B.
>>>>>>>> 
>>>>>>>> However, when a Gluster client reads the target of symlink IMPORTANT,
>>>>>>>> the Gluster client calculates the location of README and will know that
>>>>>>>> README can be found on brick-A.
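The placement idea in this explanation can be illustrated with a toy model. The sketch below is not Gluster's actual distribution algorithm: it simply takes a CRC32 of the file name modulo the number of bricks, and the names `brick_for` and `resolves_locally` are invented for this illustration. It only shows why a symlink and its target, hashed independently by name, may land on different bricks, so that the link looks broken when a single brick is inspected directly.

```python
# Toy model of name-based placement; NOT the real Gluster hashing scheme.
import zlib
from typing import Sequence

BRICKS = ["brick-A", "brick-B"]


def brick_for(name: str, bricks: Sequence[str]) -> str:
    """Pick a brick from the file name alone (illustrative hash only)."""
    return bricks[zlib.crc32(name.encode("utf-8")) % len(bricks)]


def resolves_locally(link_name: str, target_name: str, bricks: Sequence[str]) -> bool:
    """True if the symlink and its target are placed on the same brick."""
    return brick_for(link_name, bricks) == brick_for(target_name, bricks)


for name in ("README", "IMPORTANT"):
    print(name, "->", brick_for(name, BRICKS))

if not resolves_locally("IMPORTANT", "README", BRICKS):
    print("On its own brick, IMPORTANT looks like a dangling symlink.")
```

A client that sees the whole volume would compute the target's location with the same function, which is why the link still resolves through the mount point even when it looks dangling on the brick.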
>>>>>>>> 
>>>>>>>> I hope that makes sense?
>>>>>>>> 
>>>>>>>> Niels
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> For the moment, I have only synchronized one remote directory (around 30TB
>>>>>>>>> and a few million files) into my new volume. No other operations have
>>>>>>>>> been performed on the files in this volume yet.
>>>>>>>>> How can I fix it? Can I delete these dead symlinks? How can I fix all
>>>>>>>>> my split-brains?
>>>>>>>>> 
>>>>>>>>> Here is an example of a ls:
>>>>>>>>> [root@cl-storage3 ~]# cd /export/brick_home/brick1/data/.glusterfs/7b/d2/
>>>>>>>>> [root@cl-storage3 d2]# ll
>>>>>>>>> total 8,7M
>>>>>>>>>      13706 drwx------   2 root      root            8,0K 26 juil. 17:22 .
>>>>>>>>> 2147483784 drwx------ 258 root      root            8,0K 20 juil. 23:07 ..
>>>>>>>>> 2148444137 -rwxrwxrwx   2 baaden    baaden_team     173K 22 mai    2008 7bd200dd-1774-4395-9065-605ae30ec18b
>>>>>>>>>    1559384 -rw-rw-r--   2 tarus     amyloid_team    4,3K 19 juin   2013 7bd2155c-7a05-4edc-ae77-35ed7e16afbc
>>>>>>>>>     287295 lrwxrwxrwx   1 root      root              58 20 juil. 23:38 7bd2370a-100b-411e-89a4-d184da9f0f88 -> ../../a7/59/a759de6f-cdf5-43dd-809a-baf81d103bf7/prop-base
>>>>>>>>> 2149090201 -rw-rw-r--   2 tarus     amyloid_team     76K  8 mars   2014 7bd2497f-d24b-4b19-a1c5-80a4956e56a1
>>>>>>>>> 2148561174 -rw-r--r--   2 tran      derreumaux_team  575 14 févr. 07:54 7bd25db0-67f5-43e5-a56a-52cf8c4c60dd
>>>>>>>>>    1303943 -rw-r--r--   2 tran      derreumaux_team  576 10 févr. 06:06 7bd25e97-18be-4faf-b122-5868582b4fd8
>>>>>>>>>    1308607 -rw-r--r--   2 tran      derreumaux_team 414K 16 juin  11:05 7bd2618f-950a-4365-a753-723597ef29f5
>>>>>>>>>      45745 -rw-r--r--   2 letessier admin_team       585  5 janv.  2012 7bd265c7-e204-4ee8-8717-e4a0c393fb0f
>>>>>>>>> 2148144918 -rw-rw-r--   2 tarus     amyloid_team    107K 28 févr.  2014 7bd26c5b-d48a-481a-9ca6-2dc27768b5ad
>>>>>>>>>      13705 -rw-rw-r--   2 tarus     amyloid_team     25K  4 juin   2014 7bd27e4c-46ba-4f21-a766-389bfa52fd78
>>>>>>>>>    1633627 -rw-rw-r--   2 tarus     amyloid_team     75K 12 mars   2014 7bd28631-90af-4c16-8ff0-c3d46d5026c6
>>>>>>>>>    1329165 -rw-r--r--   2 tran      derreumaux_team  175 15 juin  23:40 7bd2957e-a239-4110-b3d8-b4926c7f060b
>>>>>>>>>     797803 lrwxrwxrwx   2 baaden    baaden_team       26  2 avril  2007 7bd29933-1c80-4c6b-ae48-e64e4da874cb -> ../divided/a7/2a7o.pdb1.gz
>>>>>>>>>    1532463 -rw-rw-rw-   2 baaden    baaden_team     1,8M  2 nov.   2009 7bd29d70-aeb4-4eca-ac55-fae2d46ba911
>>>>>>>>>    1411112 -rw-r--r--   2 sterpone  sterpone_team   3,1K  2 mai    2012 7bd2a5eb-62a4-47fc-b149-31e10bd3c33d
>>>>>>>>> 2148865896 -rw-r--r--   2 tran      derreumaux_team 2,1M 15 juin  23:46 7bd2ae9c-18ca-471f-a54a-6e4aec5aea89
>>>>>>>>> 2148762578 -rw-rw-r--   2 tarus     amyloid_team    154K 11 mars   2014 7bd2b7d7-7745-4842-b7b4-400791c1d149
>>>>>>>>>     149216 -rw-r--r--   2 vamparys  sacquin_team    241K 17 mai    2013 7bd2ba98-6a42-40ea-87ea-acb607d73cb5
>>>>>>>>> 2148977923 -rwxr-xr-x   2 murail    baaden_team      23K 18 juin   2012 7bd2cf57-19e7-451c-885d-fd02fd988d43
>>>>>>>>>    1176623 -rw-rw-r--   2 tarus     amyloid_team    227K  8 mars   2014 7bd2d92c-7ec8-4af8-9043-49d1908a99dc
>>>>>>>>>    1172122 lrwxrwxrwx   2 sterpone  sterpone_team     61 17 avril 12:49 7bd2d96e-e925-45f0-a26a-56b95c084122 -> ../../../../../src/libs/ck-libs/ParFUM-Tops-Dev/ParFUM_TOPS.h
>>>>>>>>>    1385933 -rw-r--r--   2 tran      derreumaux_team 2,9M 16 juin  05:29 7bd2df54-17d2-4644-96b7-f8925a67ec1e
>>>>>>>>>     745899 lrwxrwxrwx   1 root      root              58 22 juil. 09:50 7bd2df83-ce58-4a17-aca8-a32b71e953d4 -> ../../5c/39/5c39010f-fa77-49df-8df6-8d72cf74fd64/model_009
>>>>>>>>> 2149100186 -rw-rw-r--   2 tarus     amyloid_team    494K 17 mars   2014 7bd2e865-a2f4-4d90-ab29-dccebe2e3440
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best.
>>>>>>>>> Geoffrey
>>>>>>>>> 
>>>>>>>>> On 27 July 2015 at 22:57, Geoffrey Letessier <geoffrey.letessier at cnrs.fr> wrote:
>>>>>>>>> 
>>>>>>>>>> Dear all,
>>>>>>>>>> 
>>>>>>>>>> For a couple of weeks (more than one
month), our computing production is stopped due to several -but amazing-
troubles with GlusterFS.
>>>>>>>>>> 
>>>>>>>>>> After noticing a serious problem with
incorrect quota sizes accounted for a great many files, I decided, under the guidance
of the Gluster support team, to upgrade my storage cluster from version 3.5.3 to the
latest (3.7.2-3), because these bugs are theoretically fixed in that branch. Since
the upgrade, everything has been a mess and I cannot restart
production.
>>>>>>>>>> Specifically:
>>>>>>>>>>  1 - The RDMA protocol is not working and
hangs my system and shell commands; only the TCP protocol (over InfiniBand) is more or
less operational. This is not a blocking point, but...
>>>>>>>>>>  2 - Read/write performance is relatively
low.
>>>>>>>>>>  3 - Thousands of split-brains have
appeared.
>>>>>>>>>> 
>>>>>>>>>> So, for the moment, I believe GlusterFS
3.7 is not really production-ready.
>>>>>>>>>> 
>>>>>>>>>> Concerning the third point: after
destroying all my volumes (RAID re-init, new partitions, GlusterFS volumes,
etc.) and recreating the main one, I tried to transfer my data back from the
archive/backup server into this new volume, and I noticed a lot of errors in my
mount log file, as you can see in this extract:
>>>>>>>>>> [2015-07-26 22:35:16.962815] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
>>>>>>>>>> [2015-07-26 22:35:16.965896] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>,
e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and
820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> [2015-07-26 22:35:16.975206] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
>>>>>>>>>> [2015-07-26 22:35:28.719935] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>,
951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and
5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> [2015-07-26 22:35:29.764891] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
>>>>>>>>>> [2015-07-26 22:35:29.768339] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>,
e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and
820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> [2015-07-26 22:35:29.775037] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
>>>>>>>>>> [2015-07-26 22:35:29.776857] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>,
951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and
5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> [2015-07-26 22:35:29.800535] W [MSGID:
108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check]
0-vol_home-replicate-0: GFID mismatch for
<gfid:29382d8d-c507-4d2e-b74d-dbdcb791ca65>/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt
951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and
5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0
>>>>>>>>>> 
>>>>>>>>>> And when I try to browse some folders
(still in mount log file):
>>>>>>>>>> [2015-07-27 09:00:19.005763] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
>>>>>>>>>> [2015-07-27 09:00:22.322316] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>,
9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and
1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> [2015-07-27 09:00:23.008771] I
[afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0:
performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
>>>>>>>>>> [2015-07-27 08:59:50.359187] W [MSGID:
108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check]
0-vol_home-replicate-0: GFID mismatch for
<gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012588.gro
9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and
1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0
>>>>>>>>>> [2015-07-27 09:00:02.500419] W [MSGID:
108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check]
0-vol_home-replicate-0: GFID mismatch for
<gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012590.gro
b22aec09-2be3-41ea-a976-7b8d0e6f61f0 on vol_home-client-1 and
ec100f9e-ec48-4b29-b75e-a50ec6245de6 on vol_home-client-0
>>>>>>>>>> [2015-07-27 09:00:02.506925] W [MSGID:
108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check]
0-vol_home-replicate-0: GFID mismatch for
<gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0009059.gro
0485c093-11ca-4829-b705-e259668ebd8c on vol_home-client-1 and
e83a492b-7f8c-4b32-a76e-343f984142fe on vol_home-client-0
>>>>>>>>>> [2015-07-27 09:00:23.001121] W [MSGID:
108008] [afr-read-txn.c:241:afr_read_txn] 0-vol_home-replicate-0: Unreadable
subvolume -1 found with event generation 2. (Possible split-brain)
>>>>>>>>>> [2015-07-27 09:00:26.231262] E
[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch]
0-vol_home-replicate-0: Gfid mismatch detected for
<2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>,
9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and
1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative
merge on the file.
>>>>>>>>>> 
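The "Gfid mismatch detected" entries in the mount log extracts above repeat for the same files on every self-heal attempt, so a deduplicated list is easier to work from. The following is a minimal Python sketch, not a Gluster tool: it assumes each log entry sits on a single line (as in the real log file, unlike the wrapped archive text), and the regular expression and the `unique_mismatches` helper are illustrative names based only on the message format shown here.

```python
import re

# Pattern modelled on the "Gfid mismatch detected" lines quoted above.
# Assumption: the message format is exactly the one shown in this thread.
MISMATCH = re.compile(
    r"Gfid mismatch detected for <(?P<parent>[0-9a-f-]{36})/(?P<name>[^>]+)>, "
    r"(?P<gfid_a>[0-9a-f-]{36}) on (?P<client_a>\S+) and "
    r"(?P<gfid_b>[0-9a-f-]{36}) on (?P<client_b>[^\s.]+)"
)

def unique_mismatches(lines):
    """Map (parent gfid, file name) to the gfid seen on each client."""
    seen = {}
    for line in lines:
        m = MISMATCH.search(line)
        if m:
            key = (m["parent"], m["name"])
            seen[key] = {m["client_a"]: m["gfid_a"], m["client_b"]: m["gfid_b"]}
    return seen

# Two entries copied from the extract above (unwrapped), plus one unrelated line.
sample = [
    "[2015-07-27 09:00:22.322316] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] "
    "0-vol_home-replicate-0: Gfid mismatch detected for <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, "
    "9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and "
    "1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative merge on the file.",
    "[2015-07-27 09:00:23.008771] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] "
    "0-vol_home-replicate-0: performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686",
    "[2015-07-27 09:00:26.231262] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] "
    "0-vol_home-replicate-0: Gfid mismatch detected for <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, "
    "9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and "
    "1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative merge on the file.",
]

found = unique_mismatches(sample)
for (parent, name), gfids in found.items():
    print(parent, name, gfids)
```

The repeated entry for md0012588.gro collapses into one record, which gives a per-directory list of files whose gfid differs between the two replica clients.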
>>>>>>>>>> And, above all, when browsing folders I get a
lot of input/output errors.
>>>>>>>>>> 
>>>>>>>>>> Currently I have 6.2M inodes and
roughly 30TB in my "new" volume.
>>>>>>>>>> 
>>>>>>>>>> For the moment, quota is disabled to
improve I/O performance during the transfer back...
>>>>>>>>>> 
>>>>>>>>>> You can also find attached:
>>>>>>>>>>  - an "ls" result
>>>>>>>>>>  - a split-brain search result
>>>>>>>>>>  - the volume information and status
>>>>>>>>>>  - a complete volume heal info
>>>>>>>>>> 
>>>>>>>>>> I hope this helps you to help me
fix all my problems and reopen computing production.
>>>>>>>>>> 
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Geoffrey
>>>>>>>>>> 
>>>>>>>>>> PS: "Erreur d'Entrée/Sortie" = "Input / Output Error"
>>>>>>>>>> ------------------------------------------------------
>>>>>>>>>> Geoffrey Letessier
>>>>>>>>>> Responsable informatique & ingénieur système
>>>>>>>>>> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
>>>>>>>>>> Institut de Biologie Physico-Chimique
>>>>>>>>>> 13, rue Pierre et Marie Curie - 75005 Paris
>>>>>>>>>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>>>>>>>>>> 
>>>>>>>>>> <ls_example.txt>
>>>>>>>>>> <split_brain__20150725.txt>
>>>>>>>>>> <vol_home_healinfo.txt>
>>>>>>>>>> <vol_home_info.txt>
>>>>>>>>>> <vol_home_status.txt>

Geoffrey Letessier

2015-Aug-04 20:40 UTC


[Gluster-users] heaps split-brains during back-transfert

It is faster now, but it is interesting to note a difference (a decrease) in
the "du -sk" output:
[root at lucifer ~]# du -sk /home/sterpone_team/
10583360073     /home/sterpone_team/
[root at lucifer ~]# time du -sk /home/sterpone_team/
10583360057     /home/sterpone_team/

real    21m21.068s
user    0m1.766s
sys     0m8.864s

Do you have an idea?
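
For reference, the two readings above can be compared numerically. This is a small Python sketch that only restates the arithmetic; it assumes GNU `du -k` semantics (1 unit = 1024 bytes), and the variable and function names are illustrative.

```python
# Values copied from the two `du -sk` runs above (units: KiB, assuming GNU du).
first_kib = 10583360073
second_kib = 10583360057

def kib_to_tib(kib: int) -> float:
    """Convert a size in KiB to TiB (1 TiB = 1024**3 KiB)."""
    return kib / 1024**3

delta_kib = first_kib - second_kib

print(f"first run:  {kib_to_tib(first_kib):.2f} TiB")
print(f"second run: {kib_to_tib(second_kib):.2f} TiB")
print(f"difference: {delta_kib} KiB")
```

The gap between the two runs is a few KiB on a tree of roughly 9.86 TiB, which is tiny next to the ~0.5TB gap against the quota output discussed in the quoted message below.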

Best,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr

On 4 August 2015 at 16:53, Geoffrey Letessier <geoffrey.letessier at cnrs.fr>
wrote:
> OK, after a couple of hours (my last message was blocked, but I sent it
around 9:30 AM French time), here is the result:
> # du -sk /home/sterpone_team/
> 10583360073     /home/sterpone_team/
> 
> In other words: ~9.86TB
> So, as you can see below, the result is broadly the same as before
(with 'du -sh': ~9.9TB) and quite different from the gluster volume quota
output (a very large difference of ~0.5TB, 500-600GB):
> # gluster volume quota vol_home list /sterpone_team
>                   Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
> ---------------------------------------------------------------------------------------------------------------------------
> /sterpone_team 
> 
> Thanks,
> Geoffrey
> ------------------------------------------------------
> Geoffrey Letessier
> Responsable informatique & ingénieur système
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
> 
> On 4 August 2015 at 10:26, Geoffrey Letessier <geoffrey.letessier at cnrs.fr> wrote:
> 
>> Hi Vijay,
>> 
>>> The du command can round off the values; could you check the values
with 'du -sk'?
>> It's running now. I'll let you know the new value ASAP.
>> 
>>> We will investigate this issue and update you soon.
>> FYI, it mainly concerns one brick per replica pair (and thus its
replica: brick1 on storage1 and brick1 on storage2).
>> To avoid filling up my /var partition, yesterday I set my
brick-log-level parameter to CRITICAL, but now I know there is a big problem on
several bricks.
>> 
>> Thanks in advance for the help and fix
>> Geoffrey
>> 
>> ------------------------------------------------------
>> Geoffrey Letessier
>> Responsable informatique & ingénieur système
>> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
>> Institut de Biologie Physico-Chimique
>> 13, rue Pierre et Marie Curie - 75005 Paris
>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>> 
>> On 4 August 2015 at 05:54, Vijaikumar M <vmallika at redhat.com> wrote:
>> 
>>> Adding Raghavendra.G for RDMA issue...
>>> 
>>> 
>>> Hi Geoffrey,
>>> 
>>> Please find my comments in-line..
>>> 
>>> Thanks,
>>> Vijay
>>> 
>>> 
>>> On Monday 03 August 2015 09:15 PM, Geoffrey Letessier wrote:
>>>> Hi Vijay,
>>>> 
>>>> Yes, of course. I sent my email after running some tests and
checks, and the result was still wrong (even a couple of hours to a day after
forcing every brick to start), until I decided to run "du" on
every quota path. Now everything seems roughly OK, as you can see below:
>>>> # gluster volume quota vol_home list
>>>>                   Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
>>>> ---------------------------------------------------------------------------------------------------------------------------
>>>> /simlab_team                               5.0TB       80%       1.2TB   3.8TB              No                   No
>>>> /amyloid_team                              7.0TB       80%       4.9TB   2.1TB              No                   No
>>>> /amyloid_team/nguyen                       3.5TB       80%       2.0TB   1.5TB              No                   No
>>>> /sacquin_team                             10.0TB       80%      55.3GB   9.9TB              No                   No
>>>> /baaden_team                              20.0TB       80%      11.5TB   8.5TB              No                   No
>>>> /derreumaux_team                           5.0TB       80%       2.2TB   2.8TB              No                   No
>>>> /sterpone_team                            14.0TB       80%       9.3TB   4.7TB              No                   No
>>>> /admin_team                                1.0TB       80%      15.8GB 1008.2GB              No                   No
>>>> # for path in $(gluster volume quota vol_home list|awk 'NR>2 {print $1}'); do pdsh -w storage[1,3] "du -sh /export/brick_home/brick{1,2}/data$path"; done
>>>> storage1: 219G /export/brick_home/brick1/data/simlab_team
>>>> storage3: 334G /export/brick_home/brick1/data/simlab_team
>>>> storage1: 307G /export/brick_home/brick2/data/simlab_team
>>>> storage3: 327G /export/brick_home/brick2/data/simlab_team
>>>> storage1: 1,2T /export/brick_home/brick1/data/amyloid_team
>>>> storage3: 1,2T /export/brick_home/brick1/data/amyloid_team
>>>> storage1: 1,2T /export/brick_home/brick2/data/amyloid_team
>>>> storage3: 1,2T /export/brick_home/brick2/data/amyloid_team
>>>> storage1: 505G /export/brick_home/brick1/data/amyloid_team/nguyen
>>>> storage1: 483G /export/brick_home/brick2/data/amyloid_team/nguyen
>>>> storage3: 508G /export/brick_home/brick1/data/amyloid_team/nguyen
>>>> storage3: 503G /export/brick_home/brick2/data/amyloid_team/nguyen
>>>> storage3: 16G /export/brick_home/brick1/data/sacquin_team
>>>> storage1: 14G /export/brick_home/brick1/data/sacquin_team
>>>> storage3: 13G /export/brick_home/brick2/data/sacquin_team
>>>> storage1: 13G /export/brick_home/brick2/data/sacquin_team
>>>> storage1: 3,2T /export/brick_home/brick1/data/baaden_team
>>>> storage1: 2,8T /export/brick_home/brick2/data/baaden_team
>>>> storage3: 2,9T /export/brick_home/brick1/data/baaden_team
>>>> storage3: 2,7T /export/brick_home/brick2/data/baaden_team
>>>> storage3: 588G /export/brick_home/brick1/data/derreumaux_team
>>>> storage1: 566G /export/brick_home/brick1/data/derreumaux_team
>>>> storage1: 563G /export/brick_home/brick2/data/derreumaux_team
>>>> storage3: 610G /export/brick_home/brick2/data/derreumaux_team
>>>> storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
>>>> storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
>>>> storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>> storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>> storage3: 519M /export/brick_home/brick1/data/admin_team
>>>> storage1: 11G /export/brick_home/brick1/data/admin_team
>>>> storage3: 974M /export/brick_home/brick2/data/admin_team
>>>> storage1: 4,0G /export/brick_home/brick2/data/admin_team
>>>> 
>>>> In short:
>>>>  simlab_team: ~1.2TB
>>>>  amyloid_team: ~4.8TB
>>>>  amyloid_team/nguyen: ~2TB
>>>>  sacquin_team: ~56GB
>>>>  baaden_team: ~11.6TB
>>>>  derreumaux_team: 2.3TB
>>>>  sterpone_team: ~10TB
>>>>  admin_team: ~16.5GB
>>>> 
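The per-team totals in the summary above come from adding up the per-brick `du -sh` figures. The sketch below shows one way to do that sum in Python under stated assumptions: sizes use binary units (as `du -h` does), a French locale prints a comma as the decimal separator, and the bricks on storage1 and storage3 hold distinct data so their sizes add up. `parse_du_size` is a hypothetical helper, and the sample is limited to the sterpone_team lines quoted above.

```python
import re
from collections import defaultdict

# Binary multipliers used by `du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_du_size(text: str) -> float:
    """Parse a `du -h` size such as '2,5T' or '519M' into bytes.

    Accepts ',' or '.' as the decimal separator."""
    m = re.fullmatch(r"(\d+(?:[.,]\d+)?)([KMGT])", text)
    if m is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(m.group(1).replace(",", ".")) * UNITS[m.group(2)]

# Lines copied from the pdsh output above (sterpone_team only).
sample = """\
storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
"""

totals = defaultdict(float)
for line in sample.splitlines():
    _host, size, path = line.split()
    team = path.split("/data/", 1)[1]  # quota path relative to the brick root
    totals[team] += parse_du_size(size)

for team, total in totals.items():
    print(f"{team}: {total / 1024**4:.1f} TiB")
```

Because `du -sh` rounds each brick to one decimal, a sum built this way can be off by a few hundred GB on multi-terabyte directories, which is one reason the reply further down asks for `du -sk` values.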
>>>> There are still some differences, but overall it is quite accurate
(except for the quota defined on sterpone_team).
>>>> 
>>>> But I also noticed something strange. Here are the results of
every "du" I ran to force the quota size to be recomputed (on the
glusterfs mount point):
>>>> # du -sh /home/simlab_team/
>>>> 1,2T    /home/simlab_team/
>>>> # du -sh /home/amyloid_team/
>>>> 4,7T    /home/amyloid_team/
>>>> # du -sh /home/sacquin_team/
>>>> 56G     /home/sacquin_team/
>>>> # du -sh /home/baaden_team/
>>>> 12T     /home/baaden_team/
>>>> # du -sh /home/derreumaux_team/
>>>> 2,3T    /home/derreumaux_team/
>>>> # du -sh /home/sterpone_team/
>>>> 9,9T    /home/sterpone_team/
>>>> 
>>>> As you can see above, I don't understand why the quota size computed
by the quota daemon differs from what "du" reports, especially for
/sterpone_team.
>>>> 
>>> du command can round-off the values, could you check the values
with 'du -sk'?
>>>  
>>> 
>>> 
>>>> Now, concerning all the hangs I encountered, can you tell me the brand
of your InfiniBand interconnect? On our side, we use QLogic; maybe the problem
originates there (Intel/QLogic and Mellanox are quite different).
>>>> 
>>>> 
>>>> Concerning the brick logs, I just noticed a lot of errors
in one of my brick logs, and the file is around 5GB. Here is an extract:
>>>> # tail -30l /var/log/glusterfs/bricks/export-brick_home-brick1-data.log
>>>> [2015-08-03 15:32:37.408204] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.410017] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.410689] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.410860] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.412638] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.413435] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.413640] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.415325] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.416102] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.416308] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.418025] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.418799] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.419001] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.420681] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.421416] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.421607] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.423208] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.423882] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.424089] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.425863] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.426581] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.426790] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.428438] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.429133] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> [2015-08-03 15:32:37.429325] E [dict.c:1418:dict_copy_with_ref]
(-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(server_resolve_inode+0x60)
[0x7f021c6f7410]
-->/usr/lib64/glusterfs/3.7.3/xlator/protocol/server.so(resolve_gfid+0x88)
[0x7f021c6f7188] -->/usr/lib64/libglusterfs.so.0(dict_copy_with_ref+0xa4)
[0x7f0229cba674] ) 0-dict: invalid argument: dict [Argument invalide]
>>>> The message "W [MSGID: 120003]
[quota.c:759:quota_build_ancestry_cbk] 0-vol_home-quota: parent is NULL
[Argument invalide]" repeated 9016 times between [2015-08-03
15:31:55.379522] and [2015-08-03 15:32:00.997113]
>>>> [2015-08-03 15:32:37.442244] I [MSGID: 115036]
[server.c:545:server_rpc_notify] 0-vol_home-server: disconnecting connection
from lucifer.lbt.ibpc.fr-21153-2015/08/03-15:31:23:33181-vol_home-client-0-0-0
>>>> [2015-08-03 15:32:37.442286] I [MSGID: 101055]
[client_t.c:419:gf_client_unref] 0-vol_home-server: Shutting down connection
lucifer.lbt.ibpc.fr-21153-2015/08/03-15:31:23:33181-vol_home-client-0-0-0
>>>> The message "E [MSGID: 113104]
[posix-handle.c:154:posix_make_ancestryfromgfid] 0-vol_home-posix: could not
read the link from the gfid handle
/export/brick_home/brick1/data/.glusterfs/19/b6/19b67130-b409-4666-9237-2661241a8847
[Aucun fichier ou dossier de ce type]" repeated 755 times between
[2015-08-03 15:31:25.553801] and [2015-08-03 15:31:43.528305]
>>>> The message "E [MSGID: 113104]
[posix-handle.c:154:posix_make_ancestryfromgfid] 0-vol_home-posix: could not
read the link from the gfid handle
/export/brick_home/brick1/data/.glusterfs/81/5a/815acde3-7f47-410b-9131-e8d75c71a5bd
[Aucun fichier ou dossier de ce type]" repeated 8147 times between
[2015-08-03 15:31:25.521255] and [2015-08-03 15:31:53.593932]
>>>> Do you have any idea where this issue comes from and what I have
to do to fix it?
>>> We will investigate this issue and update you soon.
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> # grep -rc "\] E \[" /var/log/glusterfs/bricks/export-brick_home-brick{1,2}-data.log
>>>> /var/log/glusterfs/bricks/export-brick_home-brick1-data.log:11038933
>>>> /var/log/glusterfs/bricks/export-brick_home-brick2-data.log:243
>>>> 
>>>> FYI, I updated GlusterFS to the latest version (v3.7.3) 2 days
ago.
>>>> 
>>>> Thanks in advance for your answers, and thanks to the whole support team for
all your help.
>>>> Best,
>>>> Geoffrey
>>>> 
>>>> ------------------------------------------------------
>>>> Geoffrey Letessier
>>>> Responsable informatique & ingénieur système
>>>> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
>>>> Institut de Biologie Physico-Chimique
>>>> 13, rue Pierre et Marie Curie - 75005 Paris
>>>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>>>> 
>>>> On 3 August 2015 at 08:51, Vijaikumar M <vmallika at redhat.com> wrote:
>>>> 
>>>>> Hi Geoffrey,
>>>>> 
>>>>> Please find my comments in-line.
>>>>> 
>>>>> 
>>>>> On Saturday 01 August 2015 04:10 AM, Geoffrey Letessier
wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> As Krutika said, I successfully resolved all the
split-brains (more than 3450) that appeared after the first data transfer from a
backup server to my fresh new volume, but...
>>>>>> 
>>>>>> The next step in validating my new volume was to
enable quota on it; and now, more than a day after enabling it, all
the results are still completely wrong:
>>>>>> Example:
>>>>>> # df -h /home/sterpone_team
>>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>>> ib-storage1:vol_home.tcp
>>>>>>                        14T  3,3T   11T  24% /home
>>>>>> # pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data/sterpone_team
>>>>>> storage3: 2,5T /export/brick_home/brick1/data/sterpone_team
>>>>>> storage3: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>>>> storage1: 2,7T /export/brick_home/brick1/data/sterpone_team
>>>>>> storage1: 2,4T /export/brick_home/brick2/data/sterpone_team
>>>>>> As you can see, the data for this account totals around
10TB, but quota reports only 3.3TB used.
>>>>>> 
>>>>>> Worse:
>>>>>> # pdsh -w storage[1,3] du -sh /export/brick_home/brick{1,2}/data/baaden_team
>>>>>> storage3: 2,9T /export/brick_home/brick1/data/baaden_team
>>>>>> storage3: 2,7T /export/brick_home/brick2/data/baaden_team
>>>>>> storage1: 3,2T /export/brick_home/brick1/data/baaden_team
>>>>>> storage1: 2,8T /export/brick_home/brick2/data/baaden_team
>>>>>> # df -h /home/baaden_team/
>>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>>> ib-storage1:vol_home.tcp
>>>>>>                        20T  786G   20T   4% /home
>>>>>> # gluster volume quota vol_home list /baaden_team
>>>>>>                   Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
>>>>>> ---------------------------------------------------------------------------------------------------------------------------
>>>>>> /baaden_team                              20.0TB       80%     785.6GB  19.2TB              No                   No
>>>>>> This account holds around 11.6TB, yet quota detects only
786GB used...
>>>>>> 
>>>>> As you mentioned below, some of the bricks were down.
'quota list' only shows the aggregated value of the online bricks. Could
you please check 'quota list' when all the bricks are up and
running?
>>>>> I suspect quota initialization might not have completed because
a brick was down.
>>>>> 
>>>>>> Can someone help me fix this? I previously
upgraded GlusterFS from 3.5.3 to 3.7.2 precisely to solve a
similar problem.
>>>>>> 
>>>>>> For information, in quotad log file:
>>>>>> [2015-07-31 22:13:00.574361] I [MSGID: 114047]
[client-handshake.c:1225:client_setvolume_cbk] 0-vol_home-client-7: Server and
Client lk-version numbers are not same, reopening the fds
>>>>>> [2015-07-31 22:13:00.574507] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-vol_home-client-7: Server
lk version = 1
>>>>>> 
>>>>>> Is there any causal connection (a client/server version conflict)?
>>>>>> 
>>>>>> Here is what I noticed in my /var/log/glusterfs/quota-mount-vol_home.log file:
>>>>>> … <same kind of lines>
>>>>>> [2015-07-31 21:26:15.247269] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_home-client-5: changing port to 49162 (from 0)
>>>>>> [2015-07-31 21:26:15.250272] E [socket.c:2332:socket_connect_finish] 0-vol_home-client-5: connection to 10.0.4.2:49162 failed (Connexion refusée)
>>>>>> [2015-07-31 21:26:19.250545] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_home-client-5: changing port to 49162 (from 0)
>>>>>> [2015-07-31 21:26:19.253643] E [socket.c:2332:socket_connect_finish] 0-vol_home-client-5: connection to 10.0.4.2:49162 failed (Connexion refusée)
>>>>>> … <same kind of lines>
>>>>>> 
>>>>> The connection is refused because the brick is down.
>>>>> 
>>>>>> <A few minutes later:> OK, this was due to one brick being down. It's strange: since I updated GlusterFS to 3.7.x, I have noticed a lot of bricks going down, sometimes a few moments after starting the volume, sometimes after a couple of days or weeks… This never happened with GlusterFS versions 3.3.1 and 3.5.3.
>>>>>> 
>>>>> Could you please provide the brick log? We will check the log for this issue; once it is fixed, we can initiate quota healing again.
>>>>> 
>>>>> 
>>>>>> Now I need to stop and start the volume, because I am again seeing hangs with "gluster volume quota …", "df", etc. Once again, I never saw this kind of hang with the previous versions of GlusterFS I used; is it "expected"?
>>>>> 
>>>>> Based on your previous mail, we tried to reproduce the hang, but we could not.
>>>>> 
>>>>> 
>>>>> 
>>>>>> Once again, thank you very much in advance.
>>>>>> Geoffrey
>>>>>> 
>>>>>> On 31 Jul 2015, at 11:26, Niels de Vos <ndevos at redhat.com> wrote:
>>>>>> 
>>>>>>> On Wed, Jul 29, 2015 at 12:44:38AM +0200, Geoffrey Letessier wrote:
>>>>>>>> OK, thank you Niels for this explanation. Now it makes sense.
>>>>>>>> 
>>>>>>>> And concerning all the split-brains that appeared during the back-transfer, do you have any idea where they are coming from?
>>>>>>> 
>>>>>>> Sorry, no, I don't know how that is happening in your environment. I'll
>>>>>>> try to find someone who understands more about it and can help you with
>>>>>>> that.
>>>>>>> 
>>>>>>> Niels
>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Geoffrey
>>>>>>>> 
>>>>>>>> On 29 Jul 2015, at 00:02, Niels de Vos <ndevos at redhat.com> wrote:
>>>>>>>> 
>>>>>>>>> On Tue, Jul 28, 2015 at 03:46:37PM +0200, Geoffrey Letessier wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> In addition to all the split-brains reported, is it normal to see
>>>>>>>>>> thousands and thousands (several tens, even hundreds, of thousands)
>>>>>>>>>> of broken symlinks when browsing the .glusterfs directory on each brick?
>>>>>>>>> 
>>>>>>>>> Yes, I think it is normal. A symlink points to a particular filename,
>>>>>>>>> possibly in a different directory. If the target file is located on a
>>>>>>>>> different brick, the symlink points to a non-local file.
>>>>>>>>> 
>>>>>>>>> Consider this example with two bricks in a distributed volume:
>>>>>>>>> - file: README
>>>>>>>>> - symlink: IMPORTANT -> README
>>>>>>>>> 
>>>>>>>>> When the distribution algorithm is applied, README 'hashes' to brick-A
>>>>>>>>> and the symlink 'hashes' to brick-B. This means that README will be
>>>>>>>>> located on brick-A, and the symlink named IMPORTANT will be located on
>>>>>>>>> brick-B. Because README is not on the same brick as IMPORTANT, the
>>>>>>>>> symlink on brick-B points to a README that does not exist on brick-B.
>>>>>>>>> 
>>>>>>>>> However, when a Gluster client reads the target of the symlink IMPORTANT,
>>>>>>>>> the client calculates the location of README and knows that
>>>>>>>>> README can be found on brick-A.
>>>>>>>>> 
>>>>>>>>> I hope that makes sense?
>>>>>>>>> 
>>>>>>>>> Niels
>>>>>>>>> 
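Niels's README/IMPORTANT example can be reproduced on any local filesystem. The sketch below is only an illustration under stated assumptions: two temporary directories stand in for brick-A and brick-B, and the placement is hard-coded to match the example rather than computed by Gluster's real distribution hash.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two plain directories standing in for the bricks (assumption).
    brick_a = os.path.join(root, "brick-A")
    brick_b = os.path.join(root, "brick-B")
    os.mkdir(brick_a)
    os.mkdir(brick_b)

    # Assume the distribution hash placed README on brick-A ...
    with open(os.path.join(brick_a, "README"), "w") as f:
        f.write("read me\n")

    # ... and the symlink IMPORTANT -> README on brick-B.
    link = os.path.join(brick_b, "IMPORTANT")
    os.symlink("README", link)

    # Seen from inside brick-B, the link exists but its target does not.
    link_present = os.path.lexists(link)
    resolves_locally = os.path.exists(link)

    # A client reads the link target and looks it up across bricks instead.
    target = os.readlink(link)
    found_on_a = os.path.exists(os.path.join(brick_a, target))

print(link_present, resolves_locally, target, found_on_a)
# prints: True False README True
```

Looked at brick by brick, the symlink appears broken; resolved through the volume, the target is found on the other brick.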
>>>>>>>>> 
>>>>>>>>>> For the moment, I have just synchronized one remote directory (around 30TB
>>>>>>>>>> and a few million files) into my new volume. No other operations on
>>>>>>>>>> files in this volume have been done yet.
>>>>>>>>>> How can I fix it? Can I delete these dead symlinks? How can I fix all
>>>>>>>>>> my split-brains?
>>>>>>>>>> 
>>>>>>>>>> Here is an example of an ls:
>>>>>>>>>> [root@cl-storage3 ~]# cd /export/brick_home/brick1/data/.glusterfs/7b/d2/
>>>>>>>>>> [root@cl-storage3 d2]# ll
>>>>>>>>>> total 8,7M
>>>>>>>>>>      13706 drwx------   2 root      root            8,0K 26 juil. 17:22 .
>>>>>>>>>> 2147483784 drwx------ 258 root      root            8,0K 20 juil. 23:07 ..
>>>>>>>>>> 2148444137 -rwxrwxrwx   2 baaden    baaden_team     173K 22 mai    2008 7bd200dd-1774-4395-9065-605ae30ec18b
>>>>>>>>>>    1559384 -rw-rw-r--   2 tarus     amyloid_team    4,3K 19 juin   2013 7bd2155c-7a05-4edc-ae77-35ed7e16afbc
>>>>>>>>>>     287295 lrwxrwxrwx   1 root      root              58 20 juil. 23:38 7bd2370a-100b-411e-89a4-d184da9f0f88 -> ../../a7/59/a759de6f-cdf5-43dd-809a-baf81d103bf7/prop-base
>>>>>>>>>> 2149090201 -rw-rw-r--   2 tarus     amyloid_team     76K  8 mars   2014 7bd2497f-d24b-4b19-a1c5-80a4956e56a1
>>>>>>>>>> 2148561174 -rw-r--r--   2 tran      derreumaux_team  575 14 févr. 07:54 7bd25db0-67f5-43e5-a56a-52cf8c4c60dd
>>>>>>>>>>    1303943 -rw-r--r--   2 tran      derreumaux_team  576 10 févr. 06:06 7bd25e97-18be-4faf-b122-5868582b4fd8
>>>>>>>>>>    1308607 -rw-r--r--   2 tran      derreumaux_team 414K 16 juin  11:05 7bd2618f-950a-4365-a753-723597ef29f5
>>>>>>>>>>      45745 -rw-r--r--   2 letessier admin_team       585  5 janv.  2012 7bd265c7-e204-4ee8-8717-e4a0c393fb0f
>>>>>>>>>> 2148144918 -rw-rw-r--   2 tarus     amyloid_team    107K 28 févr.  2014 7bd26c5b-d48a-481a-9ca6-2dc27768b5ad
>>>>>>>>>>      13705 -rw-rw-r--   2 tarus     amyloid_team     25K  4 juin   2014 7bd27e4c-46ba-4f21-a766-389bfa52fd78
>>>>>>>>>>    1633627 -rw-rw-r--   2 tarus     amyloid_team     75K 12 mars   2014 7bd28631-90af-4c16-8ff0-c3d46d5026c6
>>>>>>>>>>    1329165 -rw-r--r--   2 tran      derreumaux_team  175 15 juin  23:40 7bd2957e-a239-4110-b3d8-b4926c7f060b
>>>>>>>>>>     797803 lrwxrwxrwx   2 baaden    baaden_team       26  2 avril  2007 7bd29933-1c80-4c6b-ae48-e64e4da874cb -> ../divided/a7/2a7o.pdb1.gz
>>>>>>>>>>    1532463 -rw-rw-rw-   2 baaden    baaden_team     1,8M  2 nov.   2009 7bd29d70-aeb4-4eca-ac55-fae2d46ba911
>>>>>>>>>>    1411112 -rw-r--r--   2 sterpone  sterpone_team   3,1K  2 mai    2012 7bd2a5eb-62a4-47fc-b149-31e10bd3c33d
>>>>>>>>>> 2148865896 -rw-r--r--   2 tran      derreumaux_team 2,1M 15 juin  23:46 7bd2ae9c-18ca-471f-a54a-6e4aec5aea89
>>>>>>>>>> 2148762578 -rw-rw-r--   2 tarus     amyloid_team    154K 11 mars   2014 7bd2b7d7-7745-4842-b7b4-400791c1d149
>>>>>>>>>>     149216 -rw-r--r--   2 vamparys  sacquin_team    241K 17 mai    2013 7bd2ba98-6a42-40ea-87ea-acb607d73cb5
>>>>>>>>>> 2148977923 -rwxr-xr-x   2 murail    baaden_team      23K 18 juin   2012 7bd2cf57-19e7-451c-885d-fd02fd988d43
>>>>>>>>>>    1176623 -rw-rw-r--   2 tarus     amyloid_team    227K  8 mars   2014 7bd2d92c-7ec8-4af8-9043-49d1908a99dc
>>>>>>>>>>    1172122 lrwxrwxrwx   2 sterpone  sterpone_team     61 17 avril 12:49 7bd2d96e-e925-45f0-a26a-56b95c084122 -> ../../../../../src/libs/ck-libs/ParFUM-Tops-Dev/ParFUM_TOPS.h
>>>>>>>>>>    1385933 -rw-r--r--   2 tran      derreumaux_team 2,9M 16 juin  05:29 7bd2df54-17d2-4644-96b7-f8925a67ec1e
>>>>>>>>>>     745899 lrwxrwxrwx   1 root      root              58 22 juil. 09:50 7bd2df83-ce58-4a17-aca8-a32b71e953d4 -> ../../5c/39/5c39010f-fa77-49df-8df6-8d72cf74fd64/model_009
>>>>>>>>>> 2149100186 -rw-rw-r--   2 tarus     amyloid_team    494K 17 mars   2014 7bd2e865-a2f4-4d90-ab29-dccebe2e3440
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best.
>>>>>>>>>> Geoffrey
>>>>>>>>>> 
>>>>>>>>>> On 27 Jul 2015, at 22:57, Geoffrey Letessier <geoffrey.letessier at cnrs.fr> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Dear all,
>>>>>>>>>>> 
>>>>>>>>>>> For several weeks now (more than a month), our computing production has been stopped because of several astonishing problems with GlusterFS.
>>>>>>>>>>> 
>>>>>>>>>>> After noticing a big problem with incorrect quota sizes accounted for a great many files, I decided, under the guidance of the Gluster support team, to upgrade my storage cluster from version 3.5.3 to the latest one (3.7.2-3), because these bugs are theoretically fixed in that branch. Since this upgrade, everything is a mess and I cannot restart production.
>>>>>>>>>>> Indeed:
>>>>>>>>>>>  1 - the RDMA protocol is not working and hangs my system / shell commands; only the TCP protocol (over InfiniBand) is more or less operational. It's not a blocking point, but…
>>>>>>>>>>>  2 - read/write performance is relatively low
>>>>>>>>>>>  3 - thousands of split-brains have appeared.
>>>>>>>>>>> 
>>>>>>>>>>> So, for the moment, I believe GlusterFS 3.7 is not actually production-ready.
>>>>>>>>>>> 
>>>>>>>>>>> Concerning the third point: after destroying all my volumes (RAID re-init, new partitions, GlusterFS volumes, etc.) and recreating the main one, I tried to transfer my data back from the archive/backup server into this new volume, and I see a lot of errors in my mount log file, as you can read in this extract:
>>>>>>>>>>> [2015-07-26 22:35:16.962815] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
>>>>>>>>>>> [2015-07-26 22:35:16.965896] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>, e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and 820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> [2015-07-26 22:35:16.975206] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
>>>>>>>>>>> [2015-07-26 22:35:28.719935] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>, 951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> [2015-07-26 22:35:29.764891] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
>>>>>>>>>>> [2015-07-26 22:35:29.768339] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>, e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and 820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> [2015-07-26 22:35:29.775037] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
>>>>>>>>>>> [2015-07-26 22:35:29.776857] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>, 951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> [2015-07-26 22:35:29.800535] W [MSGID: 108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 0-vol_home-replicate-0: GFID mismatch for <gfid:29382d8d-c507-4d2e-b74d-dbdcb791ca65>/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt 951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0
>>>>>>>>>>> 
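To get a list of affected entries out of a log like the one above, the "Gfid mismatch detected" messages can be collected with a regular expression. This is a minimal sketch, assuming each log entry sits on a single line in the real mount log (the archive wraps them) and that the message wording matches the excerpt; `collect_mismatches` is a hypothetical helper, not part of GlusterFS.

```python
import re
from collections import defaultdict

# Pattern for the "Gfid mismatch detected" messages quoted above.
MISMATCH = re.compile(
    r"Gfid mismatch detected for <(?P<parent>[0-9a-f-]{36})/(?P<name>[^>]+)>, "
    r"(?P<gfid1>[0-9a-f-]{36}) on (?P<client1>\S+) and "
    r"(?P<gfid2>[0-9a-f-]{36}) on (?P<client2>[^\s.]+)"
)


def collect_mismatches(lines):
    """Map (parent gfid, file name) -> {client: gfid}, deduplicating repeats."""
    found = defaultdict(dict)
    for line in lines:
        m = MISMATCH.search(line)
        if m:
            key = (m["parent"], m["name"])
            found[key][m["client1"]] = m["gfid1"]
            found[key][m["client2"]] = m["gfid2"]
    return dict(found)


# Two entries taken from the excerpt above; the first one does not match.
SAMPLE = [
    "[2015-07-26 22:35:16.962815] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] "
    "0-vol_home-replicate-0: performing entry selfheal on "
    "865083fa-984e-44bd-aacf-b8195789d9e0",
    "[2015-07-26 22:35:16.965896] E "
    "[afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] "
    "0-vol_home-replicate-0: Gfid mismatch detected for "
    "<865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>, "
    "e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and "
    "820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. "
    "Skipping conservative merge on the file.",
]

mismatches = collect_mismatches(SAMPLE)
for (parent, name), gfids in mismatches.items():
    print(parent, name, gfids)
```

In practice the lines would come from the mount log file rather than from `SAMPLE`; because the same entry is logged repeatedly, keying on the parent GFID and file name keeps the result to one row per affected file.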
>>>>>>>>>>> And when I try to browse some folders (still in the mount log file):
>>>>>>>>>>> [2015-07-27 09:00:19.005763] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
>>>>>>>>>>> [2015-07-27 09:00:22.322316] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> [2015-07-27 09:00:23.008771] I [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
>>>>>>>>>>> [2015-07-27 08:59:50.359187] W [MSGID: 108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 0-vol_home-replicate-0: GFID mismatch for <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012588.gro 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0
>>>>>>>>>>> [2015-07-27 09:00:02.500419] W [MSGID: 108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 0-vol_home-replicate-0: GFID mismatch for <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012590.gro b22aec09-2be3-41ea-a976-7b8d0e6f61f0 on vol_home-client-1 and ec100f9e-ec48-4b29-b75e-a50ec6245de6 on vol_home-client-0
>>>>>>>>>>> [2015-07-27 09:00:02.506925] W [MSGID: 108008] [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 0-vol_home-replicate-0: GFID mismatch for <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0009059.gro 0485c093-11ca-4829-b705-e259668ebd8c on vol_home-client-1 and e83a492b-7f8c-4b32-a76e-343f984142fe on vol_home-client-0
>>>>>>>>>>> [2015-07-27 09:00:23.001121] W [MSGID: 108008] [afr-read-txn.c:241:afr_read_txn] 0-vol_home-replicate-0: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)
>>>>>>>>>>> [2015-07-27 09:00:26.231262] E [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol_home-replicate-0: Gfid mismatch detected for <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping conservative merge on the file.
>>>>>>>>>>> 
>>>>>>>>>>> And, above all, when browsing folders I get a lot of input/output errors.
>>>>>>>>>>> 
>>>>>>>>>>> Currently I have 6.2M inodes and roughly 30TB in my "new" volume.
>>>>>>>>>>> 
>>>>>>>>>>> For the moment, quota is disabled to improve I/O performance during the back-transfer…
>>>>>>>>>>> 
>>>>>>>>>>> You can also find attached:
>>>>>>>>>>>  - an "ls" result
>>>>>>>>>>>  - a split-brain search result
>>>>>>>>>>>  - the volume information and status
>>>>>>>>>>>  - a complete volume heal info
>>>>>>>>>>> 
>>>>>>>>>>> I hope this helps you to help me fix all my problems and resume computing production.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Geoffrey
>>>>>>>>>>> 
>>>>>>>>>>> PS: "Erreur d'Entrée/Sortie" = "Input/Output Error"
>>>>>>>>>>> 
>>>>>>>>>>> <ls_example.txt>
>>>>>>>>>>> <split_brain__20150725.txt>
>>>>>>>>>>> <vol_home_healinfo.txt>
>>>>>>>>>>> <vol_home_info.txt>
>>>>>>>>>>> <vol_home_status.txt>