Osborne, Paul (paul.osborne@canterbury.ac.uk)
2015-Oct-15 15:40 UTC
[Gluster-users] 3.6.6 healing issues?
Hi,

I am seeing what I can best describe as an oddity: my monitoring is telling me that there is an issue (Nagios touches a file and then removes it, to check that read/write access is available on the client mount point), Gluster says that there is no issue, yet on the server mounting the file store there is allegedly a lack of space, and I am not certain where to turn.

# dpkg --list | grep gluster
ii  glusterfs-client  3.6.6-1  amd64  clustered file-system (client package)
ii  glusterfs-common  3.6.6-1  amd64  GlusterFS common libraries and translator modules
ii  glusterfs-server  3.6.6-1  amd64  clustered file-system (server package)

So these are packages straight out of the gluster.org repository.

# gluster volume status kerberos
Status of volume: kerberos
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick gfsi-rh-01:/srv/hod/kerberos/gfs                  49171   Y       12863
Brick gfsi-isr-01:/srv/hod/kerberos/gfs                 49169   Y       37115
Brick gfsi-cant-01:/srv/hod/kerberos/gfs                49163   Y       49057
NFS Server on localhost                                 2049    Y       49069
Self-heal Daemon on localhost                           N/A     Y       49076
NFS Server on gfsi-isr-01.core.canterbury.ac.uk         2049    Y       37127
Self-heal Daemon on gfsi-isr-01.core.canterbury.ac.uk   N/A     Y       37134
NFS Server on gfsi-rh-01.core.canterbury.ac.uk          2049    Y       12875
Self-heal Daemon on gfsi-rh-01.core.canterbury.ac.uk    N/A     Y       12882

Task Status of Volume kerberos
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info kerberos
Volume Name: kerberos
Type: Replicate
Volume ID: 89d63332-bb1e-4b47-8882-dfdb9af7f97d
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gfsi-rh-01:/srv/hod/kerberos/gfs
Brick2: gfsi-isr-01:/srv/hod/kerberos/gfs
Brick3: gfsi-cant-01:/srv/hod/kerberos/gfs
Options Reconfigured:
cluster.server-quorum-ratio: 51

# gluster volume heal kerberos statistics | grep No | grep -v 0

which to me looks good - 3 servers as a replica with quorum set. BUT:

# mount | grep kerberos
gfsi-cant-01:/kerberos on /var/gfs/kerberos type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=50,retrans=2,sec=sys,mountaddr=194.82.211.115,mountvers=3,mountport=38465,mountproto=tcp,local_lock=all,addr=194.82.211.115)

To be clear, we use autofs to failover-mount NFS to our 3 bricks with a .5 second timeout.

# df -h
Filesystem              Size  Used Avail Use% Mounted on
gfsi-cant-01:/kerberos   97M   31M   61M  34% /var/gfs/kerberos

It's a deliberately small volume as it only holds Kerberos keytabs that need to be in sync across web servers. Amusingly, there is nothing like that much data being used:

/var/gfs/kerberos# ls -la
total 8
drwxr-xr-x 7 root   root   1024 Oct 15 16:03 .
drwxr-xr-x 5 root   root      0 Oct 15 16:03 ..
drwxr-xr-x 2 root   root   1024 Sep 16 12:09 HTTP_blog-mgmnt
drwxr-xr-x 2 root   root   1024 Jul 31  2014 HTTP_kerbtest
drwxr-xr-x 2 root   root   1024 May 14 15:48 HTTP_wiki-dev
-rw-r--r--+ 1 root   root   2546 Sep 16 11:24 krb5-keytab-autoupdate
drwxr-xr-x 2 nagios nagios 1024 Oct 15 16:18 .nagios
-rw-r--r--  1 nagios nagios   47 Apr 23 16:08 .nagioscheck

Rlampint-rh-01:/var/gfs/kerberos# du -hs *
5.0K    HTTP_blog-mgmnt
2.5K    HTTP_kerbtest
2.5K    HTTP_wiki-dev
2.5K    krb5-keytab-autoupdate

Rlampint-rh-01:/var/gfs/kerberos# du -hs .nagios
1.0K    .nagios

So I can only assume that the rest of the data is masked as Gluster metadata.

Rlampint-rh-01:/var/gfs/kerberos# ls -la >file
bash: file: No space left on device

which suggests that the volume is full - it clearly isn't.
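[Editor's note: not part of the original post. Since df reports plenty of free blocks while writes still fail with "No space left on device", one diagnostic worth running on each brick server is whether the small brick filesystems have run out of inodes rather than blocks. A minimal sketch, assuming the brick path shown in the volume info above:

# Run on each of the three brick servers. ENOSPC while "df -h" still
# shows free space very often means the inode table is exhausted,
# which is easy to hit on a 100MB filesystem because the default
# inode count is small and Gluster keeps extra internal entries
# under .glusterfs on every brick.
df -h /srv/hod/kerberos/gfs
df -i /srv/hod/kerberos/gfs

If "df -i" shows IUse% at or near 100%, the ENOSPC errors are explained even though blocks remain free.]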
In the logs:

/var/log/glusterfs# less glfsheal-kerberos.log
[2015-10-15 15:29:31.407041] E [glfs-mgmt.c:520:mgmt_getspec_cbk] 0-gfapi: failed to get the 'volume file' from server
[2015-10-15 15:29:31.407090] E [glfs-mgmt.c:599:mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:kerberos)
<Repeatedly>

I have stopped and restarted the volume, which has made no difference that I can see. Other volumes configured and provisioned in a similar way on these 3 GFS servers are not reporting issues, and they are rather heavily loaded compared to this one.

The nfs.log shows:

[2015-10-15 15:35:15.222424] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 9b62284f: /.nagios/.nagiosrwtest.1444923315 => -1 (No space left on device)
[2015-10-15 15:35:28.081437] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-2: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082000] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-0: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082064] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-1: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.083019] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 4b32485d: /.nagios/.nagiosrwtest.1444923328 => -1 (No space left on device)

At this point I am rather confused and not entirely certain where to turn next, as I can see that there is space although allegedly there isn't. The three bricks are each 100MB volumes; in any case, testing has shown previously that Gluster will size for the smallest brick when using replicas.

Thoughts/comments are as always welcome.

Thanks

Paul
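[Editor's note: not part of the original post. For the glfsheal errors about failing to fetch the volume file, a couple of checks that might narrow things down; the glfsheal helper asks the local glusterd for the volume's volfile, and glusterd keeps its generated volfiles under the standard state directory:

# Confirm glusterd on this node knows about the volume and that all
# three peers are connected.
gluster volume list
gluster peer status

# glusterd stores per-volume configuration and volfiles here; the
# kerberos volume should have a directory on every server.
ls /var/lib/glusterd/vols/kerberos/

# Then retry the heal query and watch the heal log for fresh errors.
gluster volume heal kerberos info
tail -n 20 /var/log/glusterfs/glfsheal-kerberos.log

If the volfile directory is missing or empty on one node, that would explain both the "failed to fetch volume file" messages and the heal command failing.]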
Osborne, Paul (paul.osborne@canterbury.ac.uk)
2015-Oct-16 08:00 UTC
[Gluster-users] 3.6.6 healing issues?
Just to add to the report:

gluster> volume heal kerberos info
Volume kerberos does not exist
Volume heal failed

Which has me rather concerned, as it clearly does exist.

Anyway, since the reports I was getting indicated a lack of space rather than a lack of write permissions, I have extended the size of the volume from 100MB to 500MB and the volume does now appear to work. However, a volume heal gives the same response as before, which leaves me with two questions:

1. Why is volume heal reporting that the volume does not exist?
2. How much 'hidden' overhead does Gluster actually use?

Thanks

Paul
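[Editor's note: not part of the original post. On question 2, the overhead can be measured directly on a brick rather than through the client mount, since Gluster keeps its bookkeeping in a .glusterfs directory at the top of every brick. A rough sketch, assuming GNU du and the brick path from the volume info in the earlier mail; the numbers will not add up exactly because .glusterfs contains hard links to the data files:

# User-visible data on the brick, excluding Gluster's internal tree.
du -sh --exclude=.glusterfs /srv/hod/kerberos/gfs

# Gluster's own metadata (gfid hard links, indices used by self-heal).
du -sh /srv/hod/kerberos/gfs/.glusterfs

# Compare both against what the filesystem itself reports as used;
# whatever remains is ordinary filesystem overhead (journal, inode
# tables, reserved blocks) rather than Gluster.
df -h /srv/hod/kerberos/gfs
df -i /srv/hod/kerberos/gfs
]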