On 06/05/2015 12:05 AM, Branden Timm wrote:
> I should add that there are additional errors as well in the brick logs.
> I've posted them to a gist at
> https://gist.github.com/brandentimm/576432ddabd70184d257
As I mentioned earlier, the DHT team can answer all your questions on this
failure.
~Atin
>
> ________________________________
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Branden Timm <btimm at wisc.edu>
> Sent: Thursday, June 4, 2015 1:31 PM
> To: Atin Mukherjee
> Cc: gluster-users at gluster.org
> Subject: Re: [Gluster-users] One host won't rebalance
>
>
> I have stopped and restarted the rebalance several times, with no
> difference in results. I have restarted all gluster services several
> times, and completely rebooted the affected system.
>
>
> Yes, gluster volume status does show an active rebalance task for volume
> bigdata2.
>
>
> I just noticed something else in the brick logs. I am seeing tons of
> messages similar to these two:
>
>
> [2015-06-04 16:22:26.179797] E [posix-helpers.c:938:posix_handle_pair] 0-bigdata2-posix: /<redacted path>: key:glusterfs-internal-fop flags: 1 length:4 error:Operation not supported
> [2015-06-04 16:22:26.179874] E [posix.c:2325:posix_create] 0-bigdata2-posix: setting xattrs on /<path redacted> failed (Operation not supported)
>
>
> Note that both messages were referring to the same file. I have confirmed
> that xattr support is enabled on the underlying filesystem. Additionally,
> these messages are NOT appearing on the other cluster members that seem
> to be unaffected by whatever is going on.
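[Editor's note: a quick stand-in for the check described above — probe xattr support the same way the posix xlator does, by setting a user xattr on a scratch file and reading it back. The BRICK variable is a placeholder; point it at a real brick mount point (it defaults to a temp dir here purely for illustration).]

```shell
# Probe extended-attribute support on the brick filesystem: set a user
# xattr on a scratch file and read it back. BRICK is a stand-in for the
# real brick mount point; it defaults to a temp dir for illustration.
brick="${BRICK:-$(mktemp -d)}"
probe="$brick/.xattr-probe.$$"
touch "$probe"
if setfattr -n user.glusterfs.test -v works "$probe" 2>/dev/null &&
   getfattr -n user.glusterfs.test "$probe" >/dev/null 2>&1; then
  result="supported"
else
  result="not-supported"
fi
echo "user xattrs: $result"
rm -f "$probe"
```

If this reports "not-supported" on a brick, mount options (e.g. ext4/XFS without user_xattr) are a likely culprit for the "Operation not supported" errors in the log.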
>
>
> I found this bug, which seems to be similar, but it was theoretically
> closed for the 3.6.1 release: https://bugzilla.redhat.com/show_bug.cgi?id=1098794
>
>
> Thanks again for your help.
>
>
> ________________________________
> From: Atin Mukherjee <atin.mukherjee83 at gmail.com>
> Sent: Thursday, June 4, 2015 1:25 PM
> To: Branden Timm
> Cc: Shyamsundar Ranganathan; Susant Palai; gluster-users at gluster.org; Atin Mukherjee; Nithya Balachandran
> Subject: Re: [Gluster-users] One host won't rebalance
>
>
> Sent from Samsung Galaxy S4
> On 4 Jun 2015 22:18, "Branden Timm" <btimm at wisc.edu> wrote:
>>
>> Atin, thank you for the response. Indeed I have investigated the locks
>> on that file, and it is a glusterfs process with an exclusive read/write
>> lock on the entire file:
>>
>> lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>> COMMAND   PID   USER FD  TYPE DEVICE SIZE/OFF NODE     NAME
>> glusterfs 12776 root 6uW REG  253,1  6        15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>>
>> That process was invoked with the following options:
>>
>> ps -ef | grep 12776
>> root 12776 1 0 Jun03 ? 00:00:03 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log
> This means there is already a rebalance process alive. Could you help me
> with the following:
> 1. What does bigdata2-rebalance.log say? Don't you see a shutting-down
> log message somewhere?
> 2. Does the output of gluster volume status show bigdata2 as
> rebalancing?
>
> As a workaround, can you kill this process and start a fresh rebalance?
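[Editor's note: the suggested workaround would look roughly like the sketch below on the stuck node. The PID and volume name are taken from this thread; verify on your own node that the PID really belongs to the stale rebalance daemon before killing anything.]

```shell
# Sketch of the suggested workaround (PID and volume name from this
# thread; double-check both on your own node before running).
kill 12776                               # stale rebalance glusterfs daemon
gluster volume rebalance bigdata2 stop   # clear the old rebalance task
gluster volume rebalance bigdata2 start  # kick off a fresh rebalance
gluster volume rebalance bigdata2 status # confirm the new task is running
```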
>>
>> Not sure if this information is helpful, but thanks for your reply.
>>
>> ________________________________________
>> From: Atin Mukherjee <amukherj at redhat.com>
>> Sent: Thursday, June 4, 2015 9:24 AM
>> To: Branden Timm; gluster-users at gluster.org; Nithya Balachandran; Susant Palai; Shyamsundar Ranganathan
>> Subject: Re: [Gluster-users] One host won't rebalance
>>
>> On 06/04/2015 06:30 PM, Branden Timm wrote:
>>> I'm really hoping somebody can at least point me in the right
>>> direction on how to diagnose this. This morning, roughly 24 hours
>>> after initiating the rebalance, one host of three in the cluster still
>>> hasn't done anything:
>>>
>>>
>>> Node       Rebalanced-files  size    scanned  failures  skipped  status       run time in secs
>>> ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
>>> localhost  2543              14.2TB  11162    0         0        in progress  60946.00
>>> gluster-8  1358              6.7TB   9298     0         0        in progress  60946.00
>>> gluster-6  0                 0Bytes  0        0         0        in progress  0.00
>>>
>>>
>>> The only error showing up in the rebalance log is this:
>>>
>>>
>>> [2015-06-03 19:59:58.314100] E [MSGID: 100018] [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid lock failed [Resource temporarily unavailable]
>> This looks like acquiring the posix file lock failed, and it seems like
>> rebalance is *actually not* running. I would leave it to the DHT folks
>> to comment on it.
>>
>> ~Atin
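[Editor's note: the "Resource temporarily unavailable" (EAGAIN) on the pidfile is exactly what a non-blocking lock attempt returns while another process still holds the lock. glusterfs takes an fcntl() byte-range lock on the pidfile, but flock(1) on a scratch file shows the same failure mode; this demo is a stand-in, not the glusterfs code path.]

```shell
# Demonstrate the EAGAIN seen in the log: a non-blocking lock attempt
# fails while another process holds the lock. A scratch file stands in
# for the rebalance pidfile.
pidfile=$(mktemp)
(
  flock 9          # first process takes the lock...
  sleep 2          # ...and holds it, like the stale rebalance daemon
) 9>"$pidfile" &
holder=$!
sleep 0.5          # give the holder time to acquire the lock
if flock -n "$pidfile" -c true 2>/dev/null; then
  result="acquired"
else
  result="lock failed: Resource temporarily unavailable"
fi
echo "$result"
wait "$holder"
rm -f "$pidfile"
```

Since the lock holder here (PID 12776) is still alive, the new rebalance daemon's pidfile lock fails, which matches Atin's reading that the freshly started rebalance never actually ran.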
>>>
>>>
>>> Any help would be greatly appreciated!
>>>
>>>
>>>
>>> ________________________________
>>> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Branden Timm <btimm at wisc.edu>
>>> Sent: Wednesday, June 3, 2015 11:52 AM
>>> To: gluster-users at gluster.org
>>> Subject: [Gluster-users] One host won't rebalance
>>>
>>>
>>> Greetings Gluster Users,
>>>
>>> I started a rebalance operation on my distributed volume today
>>> (CentOS 6.6/GlusterFS 3.6.3), and one of the three hosts comprising
>>> the cluster is just sitting at 0.00 for 'run time in secs', and shows
>>> 0 files scanned, failed, or skipped.
>>>
>>>
>>> I've reviewed the rebalance log for the affected server, and I'm
>>> seeing these messages:
>>>
>>>
>>> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log)
>>> [2015-06-03 15:34:32.704217] E [MSGID: 100018] [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid lock failed [Resource temporarily unavailable]
>>>
>>>
>>> I initially investigated the first warning, readv on 127.0.0.1:24007
>>> failed. netstat shows that IP/port belongs to a glusterd process.
>>> Beyond that I wasn't able to tell why there would be a problem.
>>>
>>>
>>> Next, I checked out what was up with the lock file that reported
>>> resource temporarily unavailable. The file is present and contains the
>>> pid of a running glusterd process:
>>>
>>>
>>> root 12776 1 0 10:18 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid -l /var/log/glusterfs/bigdata2-rebalance.log
>>>
>>>
>>> Finally, one other thing I saw from running 'gluster volume status
>>> <volname> clients' is that the affected server is the only one of the
>>> three that lists a 127.0.0.1:<port> client for each of its bricks. I
>>> don't know why there would be a client coming from loopback on the
>>> server, but it seems strange. Additionally, it makes me wonder if the
>>> fact that I have auth.allow set to a single subnet (that doesn't
>>> include 127.0.0.1) is causing this problem for some reason, or if
>>> loopback is implicitly allowed to connect.
>>>
>>>
>>> Any tips or suggestions would be much appreciated. Thanks!
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>> --
>> ~Atin
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
--
~Atin