On 06/04/2015 06:30 PM, Branden Timm wrote:
> I'm really hoping somebody can at least point me in the right direction
> on how to diagnose this. This morning, roughly 24 hours after initiating
> the rebalance, one host of three in the cluster still hasn't done anything:
>
>
>    Node         Rebalanced-files   size      scanned   failures   skipped   status        run time in secs
>    ----------   ----------------   -------   -------   --------   -------   -----------   ----------------
>    localhost    2543               14.2TB    11162     0          0         in progress   60946.00
>    gluster-8    1358               6.7TB     9298      0          0         in progress   60946.00
>    gluster-6    0                  0Bytes    0         0          0         in progress   0.00
>
>
> The only error showing up in the rebalance log is this:
>
>
> [2015-06-03 19:59:58.314100] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]
This looks like the attempt to acquire the POSIX file lock on the pidfile
failed, which suggests rebalance is *actually not* running on that node. I
would leave it to the DHT folks to comment on it.
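
If it is any help, one way to check which process is actually holding that
pidfile lock (just a sketch, reusing the pidfile path from the log above) is
to look up the file's inode in /proc/locks and compare it against the
rebalance processes on that node:

    # pidfile path taken from the rebalance log above
    PIDFILE=/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
    INODE=$(stat -c %i "$PIDFILE")
    grep ":$INODE " /proc/locks           # fifth column is the PID of the lock holder
    ps -fp "$(cat "$PIDFILE")"            # PID recorded inside the pidfile
    ps -ef | grep '[r]ebalance/bigdata2'  # any other rebalance processes still around?

"Resource temporarily unavailable" is EAGAIN, so if that shows an older
rebalance process still holding the lock, it would at least explain the error.
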
~Atin
>
> Any help would be greatly appreciated!
>
>
>
> ________________________________
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Branden Timm <btimm at wisc.edu>
> Sent: Wednesday, June 3, 2015 11:52 AM
> To: gluster-users at gluster.org
> Subject: [Gluster-users] One host won't rebalance
>
>
> Greetings Gluster Users,
>
> I started a rebalance operation on my distributed volume today (CentOS
> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster is
> just sitting at 0.00 for 'run time in secs', and shows 0 files scanned,
> failed, or skipped.
>
>
> I've reviewed the rebalance log for the affected server, and I'm
> seeing these messages:
>
>
> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3 (args:
> /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2 --xlator-option
> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option
> *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off
> --xlator-option *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off
> --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1
> --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
> --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log)
> [2015-06-03 15:34:32.704217] E [MSGID: 100018]
> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> lock failed [Resource temporarily unavailable]
>
>
> I initially investigated the first warning, readv on 127.0.0.1:24007
> failed. netstat shows that IP/port belonging to a glusterd process. Beyond
> that, I wasn't able to tell why there would be a problem.
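
(Side note: 24007 is glusterd's management port. The following is just a
sketch with plain netstat flags, nothing Gluster-specific, for seeing the
listener and the established connections to it along with the owning PIDs:)

    netstat -tlnp | grep ':24007'   # which daemon is listening on 24007
    netstat -tnp  | grep ':24007'   # established connections to it, with PIDs
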
>
>
> Next, I checked out what was up with the lock file that reported "resource
> temporarily unavailable". The file is present and contains the PID of a
> running glusterfs process:
>
>
> root     12776     1  0 10:18 ?        00:00:00 /usr/sbin/glusterfs -s
> localhost --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes
> --xlator-option *dht.lookup-unhashed=yes --xlator-option
> *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off
> --xlator-option *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off
> --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1
> --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
> --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log
>
>
> Finally, one other thing I saw from running 'gluster volume status
> <volname> clients' is that the affected server is the only one of the
> three that lists a 127.0.0.1:<port> client for each of its bricks. I
> don't know why there would be a client coming from loopback on the server,
> but it seems strange. Additionally, it makes me wonder if the fact that I
> have auth.allow set to a single subnet (one that doesn't include 127.0.0.1)
> is causing this problem for some reason, or if loopback is implicitly
> allowed to connect.
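
(On the auth.allow question: an easy first check is to see what the volume
currently has set. Sketch only; the volume name bigdata2 is taken from the
paths above, and the subnet below is just a placeholder, not a recommendation:)

    gluster volume info bigdata2 | grep -i auth    # shows auth.allow/auth.reject if reconfigured
    # to rule it out, loopback could be added explicitly alongside the existing subnet, e.g.:
    # gluster volume set bigdata2 auth.allow "192.168.1.*,127.0.0.1"
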
>
>
> Any tips or suggestions would be much appreciated. Thanks!
>
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
--
~Atin