Sent from Samsung Galaxy S4

On 4 Jun 2015 22:18, "Branden Timm" <btimm at wisc.edu> wrote:
> Atin, thank you for the response. Indeed I have investigated the locks
> on that file, and it is a glusterfs process with an exclusive read/write
> lock on the entire file:
>
> lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF     NODE NAME
> glusterfs 12776 root  6uW  REG  253,1        6 15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
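>
> (Side note: "6uW" in lsof's FD column means descriptor 6, opened
> read/write, holding a write lock on the entire file. A quick sanity
> check -- sketch only -- that the pid recorded in the file matches the
> process holding it open:)
>
>     PIDFILE=/var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
>     cat "$PIDFILE"      # pid the rebalance daemon wrote at startup
>     lsof -t "$PIDFILE"  # -t prints only the pid(s) with the file open; both should say 12776
>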
> That process was invoked with the following options:
>
> ps -ef | grep 12776
> root 12776 1 0 Jun03 ? 00:00:03 /usr/sbin/glusterfs -s
> localhost --volfile-id rebalance/bigdata2 --xlator-option
> *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes
> --xlator-option *dht.assert-no-child-down=yes --xlator-option
> *replicate*.data-self-heal=off --xlator-option
> *replicate*.metadata-self-heal=off --xlator-option
> *replicate*.entry-self-heal=off --xlator-option
> *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> --xlator-option *dht.rebalance-cmd=1 --xlator-option
> *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> --pid-file
> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> -l /var/log/glusterfs/bigdata2-rebalance.log

This means there is already a rebalance process alive. Could you help me
with the following:
1. What does bigdata2-rebalance.log say? Do you see a shutdown message
logged anywhere?
2. Does the output of 'gluster volume status' show bigdata2 as rebalancing?
As a workaround, can you kill this process and start a fresh rebalance?
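
Something along these lines should do it (a sketch; volume name and pid
are taken from your output above, so adjust as needed):

    gluster volume status bigdata2
    gluster volume rebalance bigdata2 status
    kill 12776                               # the stale rebalance daemon holding the pidfile lock
    gluster volume rebalance bigdata2 start

>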
> Not sure if this information is helpful, but thanks for your reply.
>
> ________________________________________
> From: Atin Mukherjee <amukherj at redhat.com>
> Sent: Thursday, June 4, 2015 9:24 AM
> To: Branden Timm; gluster-users at gluster.org; Nithya Balachandran; Susant Palai; Shyamsundar Ranganathan
> Subject: Re: [Gluster-users] One host won't rebalance
>
> On 06/04/2015 06:30 PM, Branden Timm wrote:
> > I'm really hoping somebody can at least point me in the right direction
> > on how to diagnose this. This morning, roughly 24 hours after initiating
> > the rebalance, one host of three in the cluster still hasn't done
> > anything:
> >
> > Node        Rebalanced-files      size    scanned   failures    skipped        status   run time in secs
> > ---------        -----------   -------   --------   --------   --------   -----------   ----------------
> > localhost               2543    14.2TB      11162          0          0   in progress           60946.00
> > gluster-8               1358     6.7TB       9298          0          0   in progress           60946.00
> > gluster-6                  0    0Bytes          0          0          0   in progress               0.00
> >
> > The only error showing up in the rebalance log is this:
> >
> >
> > [2015-06-03 19:59:58.314100] E [MSGID: 100018]
> > [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > lock failed [Resource temporarily unavailable]
> This looks like acquiring the posix file lock failed, and it seems like
> rebalance is *actually not* running. I would leave it to the dht folks to
> comment on it.
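>
> (Illustrative aside only: that errno is what any try-lock returns while
> another process holds the lock. You can reproduce the pattern with
> flock(1), though note flock takes a BSD lock rather than the posix lock
> glusterfsd uses, so this is just an analogy:)
>
>     flock -n /tmp/demo.lock -c 'echo acquired' || echo 'already locked'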
>
> ~Atin
> >
> >
> > Any help would be greatly appreciated!
> >
> >
> >
> > ________________________________
> > From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Branden Timm <btimm at wisc.edu>
> > Sent: Wednesday, June 3, 2015 11:52 AM
> > To: gluster-users at gluster.org
> > Subject: [Gluster-users] One host won't rebalance
> >
> >
> > Greetings Gluster Users,
> >
> > I started a rebalance operation on my distributed volume today (CentOS
> > 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster
> > is just sitting at 0.00 for 'run time in secs', and shows 0 files
> > scanned, failed, or skipped.
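> >
> > (For reference, the rebalance was started, and is being polled, with
> > the standard CLI -- roughly:)
> >
> >     gluster volume rebalance bigdata2 start
> >     gluster volume rebalance bigdata2 status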
> >
> > I've reviewed the rebalance log for the affected server, and I'm seeing
> > these messages:
> >
> > [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
> > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
> > (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2
> > --xlator-option *dht.use-readdirp=yes --xlator-option
> > *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
> > --xlator-option *replicate*.data-self-heal=off --xlator-option
> > *replicate*.metadata-self-heal=off --xlator-option
> > *replicate*.entry-self-heal=off --xlator-option
> > *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> > --xlator-option *dht.rebalance-cmd=1 --xlator-option
> > *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> > /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> > --pid-file
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > -l /var/log/glusterfs/bigdata2-rebalance.log)
> > [2015-06-03 15:34:32.704217] E [MSGID: 100018]
> > [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > lock failed [Resource temporarily unavailable]
> >
> > I initially investigated the first warning, "readv on 127.0.0.1:24007
> > failed". netstat shows that the ip/port belongs to a glusterd process.
> > Beyond that I wasn't able to tell why there would be a problem.
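> >
> > (Roughly what I ran to identify the listener -- the flags may vary by
> > distro, and -p needs root:)
> >
> >     netstat -tlnp | grep 24007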
> >
> > Next, I checked out what was up with the lock file that reported
> > "resource temporarily unavailable". The file is present and contains
> > the pid of a running glusterfs process:
> >
> > root 12776 1 0 10:18 ? 00:00:00 /usr/sbin/glusterfs -s
> > localhost --volfile-id rebalance/bigdata2 --xlator-option
> > *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes
> > --xlator-option *dht.assert-no-child-down=yes --xlator-option
> > *replicate*.data-self-heal=off --xlator-option
> > *replicate*.metadata-self-heal=off --xlator-option
> > *replicate*.entry-self-heal=off --xlator-option
> > *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> > --xlator-option *dht.rebalance-cmd=1 --xlator-option
> > *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db --socket-file
> > /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> > --pid-file
> > /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > -l /var/log/glusterfs/bigdata2-rebalance.log
> >
> > Finally, one other thing I saw from running 'gluster volume status
> > <volname> clients' is that the affected server is the only one of the
> > three that lists a 127.0.0.1:<port> client for each of its bricks. I
> > don't know why there would be a client coming from loopback on the
> > server, but it seems strange. Additionally, it makes me wonder if the
> > fact that I have auth.allow set to a single subnet (which doesn't
> > include 127.0.0.1) is causing this problem for some reason, or if
> > loopback is implicitly allowed to connect.
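> >
> > (What I'd check, as a sketch -- the subnet below is a placeholder for
> > whatever auth.allow is actually set to on this volume:)
> >
> >     gluster volume info bigdata2 | grep auth.allow
> >     # if loopback has to be whitelisted explicitly:
> >     gluster volume set bigdata2 auth.allow "10.0.0.*,127.0.0.1"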
> >
> > Any tips or suggestions would be much appreciated. Thanks!
> >
> >
> >
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
> >
>
> --
> ~Atin
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users