On Wed, Jul 3, 2013 at 8:25 PM, Marcus Bointon
<marcus at synchromedia.co.uk> wrote:
> Back in March I posted about some gluster problems:
>
> http://gluster.org/pipermail/gluster-users/2013-March/035737.html
> http://gluster.org/pipermail/gluster-users/2013-March/035655.html
>
> I'm still in the same situation - a straightforward 2-node, 2-way AFR
> setup with each server mounting the single shared volume via NFS, using
> gluster 3.3.0 (can't use 3.3.1 due to its NFS issues) on 64-bit linux
> (ubuntu lucid). Gluster appears to be working, but won't mount on boot by
> any means I've tried, and it's still logging prodigious amounts of
> incomprehensible rubbish (to me!).
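On the boot-mount problem: on lucid the usual culprit is mountall running
before the network (and gluster's own NFS server) is up. A sketch of a
workaround - the mount point and server address here are assumptions,
adjust for your setup - is to tell mountall not to block the boot on it:

    # hypothetical /etc/fstab entry; gluster's NFS server only speaks
    # NFSv3 over TCP, hence vers=3,tcp; nolock avoids NLM problems
    localhost:/shared  /mnt/shared  nfs  _netdev,nobootwait,vers=3,tcp,nolock  0 0

With nobootwait, mountall keeps retrying in the background instead of
failing the boot while glusterd's NFS service is still coming up.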
>
> gluster says everything is ok:
>
> gluster volume status
> Status of volume: shared
> Gluster process                            Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.1.10:/var/shared             24009   Y       3097
> Brick 192.168.1.11:/var/shared             24009   Y       3020
> NFS Server on localhost                    38467   Y       3103
> Self-heal Daemon on localhost              N/A     Y       3109
> NFS Server on 192.168.1.11                 38467   Y       3057
> Self-heal Daemon on 192.168.1.11           N/A     Y       3096
>
> (other node says the same thing with IPs the other way around)
>
> Yet the logs tell a different story.
>
> In syslog, this happens every second:
>
> Jul 3 00:17:29 web1 init: glusterd main process (14958) terminated with
> status 255
> Jul 3 00:17:29 web1 init: glusterd main process ended, respawning
>
This looks like the init system trying to restart glusterd. Glusterd
daemonizes: the process init launches forks and exits, so init may be
treating that exit as glusterd dying and respawning it. But since a
glusterd is in fact already running, the respawned copy fails, which is
what the logs below show.
What packages are you using and what distro is this on?
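If this is Upstart (the default init on lucid), one way around the
respawn loop - a sketch only, not a tested job file - is to keep glusterd
in the foreground with its --no-daemon flag, so init tracks the right
pid:

    # /etc/init/glusterd.conf - hypothetical Upstart job
    description "GlusterFS management daemon"
    start on (local-filesystems and net-device-up IFACE!=lo)
    stop on runlevel [016]
    respawn
    exec /usr/sbin/glusterd -N

Alternatively, an "expect" stanza matching the number of times glusterd
forks would let it daemonize as usual.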
> In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log I have lots of this:
>
> [2013-07-03 14:24:08.350429] I [glusterfsd.c:1666:main]
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
> [2013-07-03 14:24:08.350592] E [glusterfsd.c:1296:glusterfs_pidfile_setup]
> 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource
> temporarily unavailable)
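That lock error fits the respawn theory: the pidfile lock is held by the
glusterd that's already up. You can check who holds it with standard
tools, e.g.:

    fuser -v /var/run/glusterd.pid
    cat /var/run/glusterd.pid

If the pid reported there belongs to a glusterd that has been running
since boot, the repeated start attempts are the real problem, not
glusterd itself.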
>
> In /var/log/glusterfs/glustershd.log, every minute I get hundreds of these:
>
> [2013-07-03 14:24:00.792751] I
> [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk]
> 0-shared-replicate-0: background meta-data self-heal completed on
> <gfid:16adce4d-1933-485f-8359-66c47c757cd3>
> [2013-07-03 14:24:00.794251] I [afr-common.c:1340:afr_launch_self_heal]
> 0-shared-replicate-0: background meta-data self-heal triggered. path:
> <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>, reason: lookup detected
> pending operations
> [2013-07-03 14:24:00.796411] I
> [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk]
> 0-shared-replicate-0: background meta-data self-heal completed on
> <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>
>
> 'gluster volume heal shared info' says:
>
> Heal operation on volume shared has been successful
>
> Brick 192.168.1.10:/var/shared
> Number of entries: 335
> ...
>
> I'm not clear whether this means it has 335 files still to fix, or
> whether it's done so already.
This means there are 335 files still to fix.
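You can break those 335 entries down with the other heal-info subcommands
available in 3.3:

    gluster volume heal shared info healed
    gluster volume heal shared info heal-failed
    gluster volume heal shared info split-brain

If the pending count never drops over time, the heal-failed and
split-brain lists are the place to look first.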
> Both servers are logging the same kind of stuff. I'm sure all these are
> related since they happen at about the same rate.
>
> The lock error looks the most interesting, but I've no idea why that
> should happen. As before, I've tried deleting all traces of gluster,
> reinstalling and reconfiguring and putting all the data back on, but
> nothing changes.
>
> Here's the command I used to create the volume:
>
> gluster volume create shared replica 2 transport tcp
> 192.168.1.10:/var/shared 192.168.1.11:/var/shared
>
> Here's the volume file it created:
>
>
> +------------------------------------------------------------------------------+
> 1: volume shared-posix
> 2: type storage/posix
> 3: option directory /var/shared
> 4: option volume-id 2600e26c-b6c4-448f-a6f6-ad27c14745a0
> 5: end-volume
> 6:
> 7: volume shared-access-control
> 8: type features/access-control
> 9: subvolumes shared-posix
> 10: end-volume
> 11:
> 12: volume shared-locks
> 13: type features/locks
> 14: subvolumes shared-access-control
> 15: end-volume
> 16:
> 17: volume shared-io-threads
> 18: type performance/io-threads
> 19: subvolumes shared-locks
> 20: end-volume
> 21:
> 22: volume shared-index
> 23: type features/index
> 24: option index-base /var/shared/.glusterfs/indices
> 25: subvolumes shared-io-threads
> 26: end-volume
> 27:
> 28: volume shared-marker
> 29: type features/marker
> 30: option volume-uuid 2600e26c-b6c4-448f-a6f6-ad27c14745a0
> 31: option timestamp-file /var/lib/glusterd/vols/shared/marker.tstamp
> 32: option xtime off
> 33: option quota off
> 34: subvolumes shared-index
> 35: end-volume
> 36:
> 37: volume /var/shared
> 38: type debug/io-stats
> 39: option latency-measurement off
> 40: option count-fop-hits off
> 41: subvolumes shared-marker
> 42: end-volume
> 43:
> 44: volume shared-server
> 45: type protocol/server
> 46: option transport-type tcp
> 47: option auth.login./var/shared.allow 94017411-d986-48e4-a7ac-47c1db14fba0
> 48: option auth.login.94017411-d986-48e4-a7ac-47c1db14fba0.password 3929acf9-fcf1-4684-b271-07927d375c9b
> 49: option auth.addr./var/shared.allow *
> 50: subvolumes /var/shared
> 51: end-volume
>
> Despite all this, I've not seen gluster do anything visibly wrong - if I
> create a file on the shared volume it appears on the other node, checksums
> match, clients can read etc, but I don't want to be running on luck!
> It's all very troubling, and it's making a right mess of my new
> distributed logging system...
>
> Any ideas?
>
> Marcus
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users