Hi all
After some testing and debugging I was able to reproduce the problem in our lab.
It turns out this behaviour occurs when root-squashing is turned on; see the
details below. With root-squashing turned off, rebalancing completes just fine.
Volume Name: public
Type: Distributed-Replicate
Volume ID: 158bf6ae-a486-4164-bb39-ca089ecdf767
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gfs01a-dcg:/mnt/public/brick1
Brick2: gfs01b-dcg:/mnt/public/brick1
Brick3: gfs02a-dcg.intnet.be:/mnt/public/brick1
Brick4: gfs02b-dcg.intnet.be:/mnt/public/brick1
Options Reconfigured:
server.anongid: 33
server.anonuid: 33
server.root-squash: on
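
For completeness, this is roughly the sequence I used in the lab to confirm the
root-squash connection (volume name kept the same as ours here; output trimmed):

$ gluster volume set public server.root-squash on
$ gluster volume rebalance public start
$ gluster volume rebalance public status      # failures start counting up
$ gluster volume rebalance public stop
$ gluster volume set public server.root-squash off
$ gluster volume rebalance public start
$ gluster volume rebalance public status      # completes without failures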
Now only one question remains: what is the right way to get the cluster back
into a healthy state?
Any help would be really appreciated.
Kind regards
Davy
On 15 Sep 2015, at 17:04, Davy Croonen <davy.croonen at smartbit.be> wrote:
Hi all
After expanding our cluster we are facing failures while rebalancing. In my
opinion this doesn't look good, so can anybody explain how these failures could
arise, how they can be fixed, or what the consequences might be?
$ gluster volume rebalance public status
                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run time in secs
            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
            localhost                 0  0Bytes    49496     23464        0  in progress           3821.00
 gfs01b-dcg.intnet.be                 0  0Bytes    49496         0        0  in progress           3821.00
 gfs02a-dcg.intnet.be                 0  0Bytes    49497         0        0  in progress           3821.00
 gfs02b-dcg.intnet.be                 0  0Bytes    49495         0        0  in progress           3821.00
After looking in the public-rebalance.log, this is one block of entries that
shows up; the whole log is filled with these.
[2015-09-15 14:50:58.239554] I [dht-common.c:3309:dht_setxattr] 0-public-dht:
fixing the layout of /ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355
[2015-09-15 14:50:58.239730] I [dht-selfheal.c:960:dht_fix_layout_of_directory]
0-public-dht: subvolume 0 (public-replicate-0): 251980 chunks
[2015-09-15 14:50:58.239750] I [dht-selfheal.c:960:dht_fix_layout_of_directory]
0-public-dht: subvolume 1 (public-replicate-1): 251980 chunks
[2015-09-15 14:50:58.239759] I
[dht-selfheal.c:1065:dht_selfheal_layout_new_directory] 0-public-dht: chunk size
= 0xffffffff / 503960 = 0x214a
[2015-09-15 14:50:58.239784] I
[dht-selfheal.c:1103:dht_selfheal_layout_new_directory] 0-public-dht: assigning
range size 0x7ffe51f8 to public-replicate-0
[2015-09-15 14:50:58.239791] I
[dht-selfheal.c:1103:dht_selfheal_layout_new_directory] 0-public-dht: assigning
range size 0x7ffe51f8 to public-replicate-1
[2015-09-15 14:50:58.239816] I [MSGID: 109036]
[dht-common.c:6296:dht_log_new_layout_for_dir_selfheal] 0-public-dht: Setting
layout of /ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355 with [Subvol_name:
public-replicate-0, Err: -1 , Start: 0 , Stop: 2147373559 ], [Subvol_name:
public-replicate-1, Err: -1 , Start: 2147373560 , Stop: 4294967295 ],
[2015-09-15 14:50:58.306701] I [dht-rebalance.c:1405:gf_defrag_migrate_data]
0-public-dht: migrate data called on
/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355
[2015-09-15 14:50:58.346531] W [client-rpc-fops.c:1090:client3_3_getxattr_cbk]
0-public-client-2: remote operation failed: Permission denied. Path:
/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355/1.1 rationale getallen.pdf
(ba5220be-a462-4008-ac67-79abb16f4dd9). Key: trusted.glusterfs.pathinfo
[2015-09-15 14:50:58.354111] W [client-rpc-fops.c:1090:client3_3_getxattr_cbk]
0-public-client-3: remote operation failed: Permission denied. Path:
/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355/1.1 rationale getallen.pdf
(ba5220be-a462-4008-ac67-79abb16f4dd9). Key: trusted.glusterfs.pathinfo
[2015-09-15 14:50:58.354166] E [dht-rebalance.c:1576:gf_defrag_migrate_data]
0-public-dht: /ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355/1.1 rationale
getallen.pdf: failed to get trusted.distribute.linkinfo key - Permission denied
[2015-09-15 14:50:58.356191] I [dht-rebalance.c:1649:gf_defrag_migrate_data]
0-public-dht: Migration operation on dir
/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355 took 0.05 secs
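(As far as I can tell the chunk size and range lines above are just the normal
layout arithmetic: 0xffffffff / 503960 chunks = 0x214a (8522) per chunk, and
0x214a * 251980 chunks per subvolume = 0x7ffe51f8 (2147373560), which matches
the Start/Stop values in the "Setting layout" line. So those entries look
harmless; presumably it is the "Permission denied" getxattr calls that end up
counted as failures.)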
Now the file referenced in these log entries, 1.1 rationale getallen.pdf, exists
on the hosts behind 0-public-client-0 and 0-public-client-1, but not on the
hosts behind 0-public-client-2 and 0-public-client-3. So another question: what
is the system really trying to do here, and is this normal?
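
In case it is relevant, this is roughly how I checked which bricks hold that
file (brick path as in the volume info above; the client mount point below is
just an example):

$ ls -l "/mnt/public/brick1/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355/1.1 rationale getallen.pdf"   # on each brick node
$ getfattr -n trusted.glusterfs.pathinfo \
    "/mnt/glusterfs-public/ka1hasselt/Lqw9pnXKV8ojBzzzsqHyChSU914422947204355/1.1 rationale getallen.pdf"    # as root on a fuse mount of the volume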
I really hope somebody can give me a deeper understanding of what is going on
here.
Thanks in advance.
Kind regards
Davy