Osborne, Paul (paul.osborne@canterbury.ac.uk)
2015-Oct-15 15:40 UTC
[Gluster-users] 3.6.6 healing issues?
Hi,

I am seeing what I can best describe as an oddity: my monitoring is telling me that there is an issue (Nagios touches a file and then removes it, to check that read/write access is available on the client mount point), Gluster says that there is no issue, yet on the server mounting the file store there is allegedly a lack of space, and I am not certain where to turn.

# dpkg --list | grep gluster
ii  glusterfs-client  3.6.6-1  amd64  clustered file-system (client package)
ii  glusterfs-common  3.6.6-1  amd64  GlusterFS common libraries and translator modules
ii  glusterfs-server  3.6.6-1  amd64  clustered file-system (server package)

So these are packages straight out of the gluster.org repository.

# gluster volume status kerberos
Status of volume: kerberos
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick gfsi-rh-01:/srv/hod/kerberos/gfs                  49171   Y       12863
Brick gfsi-isr-01:/srv/hod/kerberos/gfs                 49169   Y       37115
Brick gfsi-cant-01:/srv/hod/kerberos/gfs                49163   Y       49057
NFS Server on localhost                                 2049    Y       49069
Self-heal Daemon on localhost                           N/A     Y       49076
NFS Server on gfsi-isr-01.core.canterbury.ac.uk         2049    Y       37127
Self-heal Daemon on gfsi-isr-01.core.canterbury.ac.uk   N/A     Y       37134
NFS Server on gfsi-rh-01.core.canterbury.ac.uk          2049    Y       12875
Self-heal Daemon on gfsi-rh-01.core.canterbury.ac.uk    N/A     Y       12882

Task Status of Volume kerberos
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info kerberos
Volume Name: kerberos
Type: Replicate
Volume ID: 89d63332-bb1e-4b47-8882-dfdb9af7f97d
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gfsi-rh-01:/srv/hod/kerberos/gfs
Brick2: gfsi-isr-01:/srv/hod/kerberos/gfs
Brick3: gfsi-cant-01:/srv/hod/kerberos/gfs
Options Reconfigured:
cluster.server-quorum-ratio: 51

# gluster volume heal kerberos statistics | grep No | grep -v 0

which to me looks good - 3 servers as a replica with quorum set. BUT:

# mount | grep kerberos
gfsi-cant-01:/kerberos on /var/gfs/kerberos type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=50,retrans=2,sec=sys,mountaddr=194.82.211.115,mountvers=3,mountport=38465,mountproto=tcp,local_lock=all,addr=194.82.211.115)

To be clear, we use autofs to failover-mount NFS to our 3 bricks with a .5 second timeout.

# df -h
Filesystem              Size  Used Avail Use% Mounted on
gfsi-cant-01:/kerberos   97M   31M   61M  34% /var/gfs/kerberos

It's a deliberately small volume as it only holds Kerberos keytabs that need to be in sync across web servers. Amusingly, there is nothing like that much data being used:

/var/gfs/kerberos# ls -la
total 8
drwxr-xr-x 7 root   root   1024 Oct 15 16:03 .
drwxr-xr-x 5 root   root      0 Oct 15 16:03 ..
drwxr-xr-x 2 root   root   1024 Sep 16 12:09 HTTP_blog-mgmnt
drwxr-xr-x 2 root   root   1024 Jul 31  2014 HTTP_kerbtest
drwxr-xr-x 2 root   root   1024 May 14 15:48 HTTP_wiki-dev
-rw-r--r--+ 1 root   root   2546 Sep 16 11:24 krb5-keytab-autoupdate
drwxr-xr-x 2 nagios nagios 1024 Oct 15 16:18 .nagios
-rw-r--r--  1 nagios nagios   47 Apr 23 16:08 .nagioscheck

Rlampint-rh-01:/var/gfs/kerberos# du -hs *
5.0K    HTTP_blog-mgmnt
2.5K    HTTP_kerbtest
2.5K    HTTP_wiki-dev
2.5K    krb5-keytab-autoupdate

Rlampint-rh-01:/var/gfs/kerberos# du -hs .nagios
1.0K    .nagios

So I can only assume that the rest of the data is masked as Gluster metadata.

Rlampint-rh-01:/var/gfs/kerberos# ls -la >file
bash: file: No space left on device

which suggests that the volume is full - it clearly isn't.
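[Editor's note: not part of the original post. Since df reports plenty of free blocks while writes still fail with "No space left on device", one diagnostic worth running on each brick server is whether the small brick filesystems have run out of inodes rather than blocks. A minimal sketch, assuming the brick path shown in the volume info above:

# Run on each of the three brick servers. ENOSPC while "df -h" still
# shows free space very often means the inode table is exhausted,
# which is easy to hit on a 100MB filesystem because the default
# inode count is small and Gluster keeps extra internal entries
# under .glusterfs on every brick.
df -h /srv/hod/kerberos/gfs
df -i /srv/hod/kerberos/gfs

If "df -i" shows IUse% at or near 100%, the ENOSPC errors are explained even though blocks remain free.]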
In the logs:

/var/log/glusterfs# less glfsheal-kerberos.log
[2015-10-15 15:29:31.407041] E [glfs-mgmt.c:520:mgmt_getspec_cbk] 0-gfapi: failed to get the 'volume file' from server
[2015-10-15 15:29:31.407090] E [glfs-mgmt.c:599:mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:kerberos)
<Repeatedly>

I have stopped and restarted the volume, which has made no difference that I can see. Other volumes configured and provisioned in a similar way on these 3 GFS servers are not reporting issues, and they are rather heavily loaded compared to this one.

The nfs.log shows:

[2015-10-15 15:35:15.222424] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 9b62284f: /.nagios/.nagiosrwtest.1444923315 => -1 (No space left on device)
[2015-10-15 15:35:28.081437] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-2: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082000] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-0: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.082064] W [client-rpc-fops.c:2220:client3_3_create_cbk] 0-kerberos-client-1: remote operation failed: No space left on device. Path: /.nagios/.nagiosrwtest.1444923328
[2015-10-15 15:35:28.083019] W [nfs3.c:2370:nfs3svc_create_cbk] 0-nfs: 4b32485d: /.nagios/.nagiosrwtest.1444923328 => -1 (No space left on device)

At this point I am rather confused and not entirely certain where to turn next, as I can see that there is space although allegedly there isn't. The three bricks are each 100MB volumes; in any case, testing has shown previously that Gluster will size for the smallest brick when using replicas.

Thoughts/comments are as always welcome.

Thanks

Paul
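[Editor's note: not part of the original post. For the glfsheal errors about failing to fetch the volume file, a couple of checks that might narrow things down; the glfsheal helper asks the local glusterd for the volume's volfile, and glusterd keeps its generated volfiles under the standard state directory:

# Confirm glusterd on this node knows about the volume and that all
# three peers are connected.
gluster volume list
gluster peer status

# glusterd stores per-volume configuration and volfiles here; the
# kerberos volume should have a directory on every server.
ls /var/lib/glusterd/vols/kerberos/

# Then retry the heal query and watch the heal log for fresh errors.
gluster volume heal kerberos info
tail -n 20 /var/log/glusterfs/glfsheal-kerberos.log

If the volfile directory is missing or empty on one node, that would explain both the "failed to fetch volume file" messages and the heal command failing.]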
Osborne, Paul (paul.osborne@canterbury.ac.uk)
2015-Oct-16 08:00 UTC
[Gluster-users] 3.6.6 healing issues?
Just to add to the report:

gluster> volume heal kerberos info
Volume kerberos does not exist
Volume heal failed

Which has me rather concerned, as it clearly does exist.

Anyway, since the reports I was getting indicated a lack of space rather than a lack of write permissions, I have extended the size of the volume from 100MB to 500MB and the volume does now appear to work. However, a volume heal gives the same response as before, which leaves me with two questions:

1. Why is volume heal reporting that the volume does not exist?
2. How much 'hidden' overhead does Gluster actually use?

Thanks

Paul
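[Editor's note: not part of the original post. On question 2, the overhead can be measured directly on a brick rather than through the client mount, since Gluster keeps its bookkeeping in a .glusterfs directory at the top of every brick. A rough sketch, assuming GNU du and the brick path from the volume info in the earlier mail; the numbers will not add up exactly because .glusterfs contains hard links to the data files:

# User-visible data on the brick, excluding Gluster's internal tree.
du -sh --exclude=.glusterfs /srv/hod/kerberos/gfs

# Gluster's own metadata (gfid hard links, indices used by self-heal).
du -sh /srv/hod/kerberos/gfs/.glusterfs

# Compare both against what the filesystem itself reports as used;
# whatever remains is ordinary filesystem overhead (journal, inode
# tables, reserved blocks) rather than Gluster.
df -h /srv/hod/kerberos/gfs
df -i /srv/hod/kerberos/gfs
]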