thr3ads.net - Gluster users - [Gluster-users] Gluster mounts becoming stale and never recovering [May 2019]

Jeff Bischoff
2019-May-16 02:18 UTC
[Gluster-users] Gluster mounts becoming stale and never recovering

Hi all,

 

We are having a sporadic issue with our Gluster mounts that is affecting several
of our Kubernetes environments. We are having trouble understanding what is
causing it, and we could use some guidance from the pros!

 

Scenario

We have an environment running single-node Kubernetes with Heketi, and several
pods using Gluster mounts. The environment runs fine and the mounts appear to be
healthy for up to several days. Then, suddenly, one or more (sometimes all) of
the Gluster mounts run into a problem and their bricks shut down. The affected
containers enter a crash loop that continues indefinitely until someone
intervenes. To work around the crash loop, a user needs to get the bricks
started again, whether by starting them manually, restarting the Gluster pod, or
restarting the entire node.

 

Diagnostics

The tell-tale sign is the following message, which appears when describing a pod
that is in a crash loop:

 

Message:      error while creating mount source path '/var/lib/kubelet/pods/4a2574bb-6fa4-11e9-a315-005056b83c80/volumes/kubernetes.io~glusterfs/db': mkdir /var/lib/kubelet/pods/4a2574bb-6fa4-11e9-a315-005056b83c80/volumes/kubernetes.io~glusterfs/db: file exists

 

We always see that "file exists" message when this error occurs.

 

The glusterd.log file had been silent for over a day. Then, at the time the
crash loop started, it logged the following:

 

[2019-05-08 13:49:04.733147] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a3cef78a5914a2808da0b5736e3daec7/brick
on port 49168

[2019-05-08 13:49:04.733374] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_7614e5014a0e402630a0e1fd776acf0a/brick
on port 49167

[2019-05-08 13:49:05.003848] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/fe4ac75011a4de0e.socket failed (No data available)

[2019-05-08 13:49:05.065420] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/85e9fb223aa121f2.socket failed (No data available)

[2019-05-08 13:49:05.066479] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/e2a66e8cd8f5f606.socket failed (No data available)

[2019-05-08 13:49:05.067444] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/a0625e5b78d69bb8.socket failed (No data available)

[2019-05-08 13:49:05.068471] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/770bc294526d0360.socket failed (No data available)

[2019-05-08 13:49:05.074278] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/adbd37fe3e1eed36.socket failed (No data available)

[2019-05-08 13:49:05.075497] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/17712138f3370e53.socket failed (No data available)

[2019-05-08 13:49:05.076545] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/a6cf1aca8b23f394.socket failed (No data available)

[2019-05-08 13:49:05.077511] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/d0f83b191213e877.socket failed (No data available)

[2019-05-08 13:49:05.078447] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/d5dd08945d4f7f6d.socket failed (No data available)

[2019-05-08 13:49:05.079424] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/c8d7b10108758e2f.socket failed (No data available)

[2019-05-08 13:49:14.778619] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_0ed4f7f941de388cda678fe273e9ceb4/brick
on port 49166

... (and more of the same)

 

Nothing further has been printed to the gluster log since. The bricks do not
come back on their own.
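
For anyone who wants to list which bricks were dropped and when, here is a minimal sketch that pulls the "removing brick" entries out of glusterd.log text. The names `PMAP_REMOVE` and `removed_bricks` are hypothetical helpers, not part of Gluster. It assumes the entries look like the excerpt above (MSGID 106143 from pmap_registry_remove) and tolerates the hard line-wrapping seen in this message.

```python
import re

# Hypothetical helper pattern (not a Gluster API). It matches entries of the form
#   [<timestamp>] I [MSGID: 106143] [glusterd-pmap.c:<n>:pmap_registry_remove]
#   0-pmap: removing brick <path> on port <port>
# \s+ between tokens lets it match both one-line and hard-wrapped entries.
PMAP_REMOVE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+I\s+"
    r"\[MSGID: 106143\]\s+"
    r"\[glusterd-pmap\.c:\d+:pmap_registry_remove\]\s+"
    r"0-pmap:\s+removing brick\s+(?P<brick>\S+)\s+on port\s+(?P<port>\d+)"
)


def removed_bricks(log_text):
    """Return a list of (timestamp, brick_path, port) for each removal entry."""
    return [
        (m.group("ts"), m.group("brick"), int(m.group("port")))
        for m in PMAP_REMOVE.finditer(log_text)
    ]


if __name__ == "__main__":
    # Sample copied from the log excerpt above, including its line wrapping.
    sample = (
        "[2019-05-08 13:49:04.733147] I [MSGID: 106143]\n"
        "[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick\n"
        "/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/"
        "brick_a3cef78a5914a2808da0b5736e3daec7/brick\n"
        "on port 49168\n"
    )
    for ts, brick, port in removed_bricks(sample):
        print(ts, port, brick)
```

Lines that do not match the pattern, such as the socket.c readv warnings, are simply skipped.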

The version of Gluster we are using (running in a container, using the
gluster/gluster-centos image from Docker Hub):

 

# rpm -qa | grep gluster

glusterfs-rdma-4.1.7-1.el7.x86_64

gluster-block-0.3-2.el7.x86_64

python2-gluster-4.1.7-1.el7.x86_64

centos-release-gluster41-1.0-3.el7.centos.noarch

glusterfs-4.1.7-1.el7.x86_64

glusterfs-api-4.1.7-1.el7.x86_64

glusterfs-cli-4.1.7-1.el7.x86_64

glusterfs-geo-replication-4.1.7-1.el7.x86_64

glusterfs-libs-4.1.7-1.el7.x86_64

glusterfs-client-xlators-4.1.7-1.el7.x86_64

glusterfs-fuse-4.1.7-1.el7.x86_64

glusterfs-server-4.1.7-1.el7.x86_64

 

The version of glusterfs running on our Kubernetes node (a CentOS system):

 

$ rpm -qa | grep gluster

glusterfs-libs-3.12.2-18.el7.x86_64

glusterfs-3.12.2-18.el7.x86_64

glusterfs-fuse-3.12.2-18.el7.x86_64

glusterfs-client-xlators-3.12.2-18.el7.x86_64
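
The two package listings above can be compared mechanically. The following is a minimal sketch, assuming `rpm -qa` lines in the usual name-version-release.arch form. The function names `package_versions` and `version_mismatches` are hypothetical. It only reports which packages differ between the container and the node; it says nothing about whether the difference is related to the problem.

```python
def package_versions(rpm_lines):
    """Map package name -> version for name-version-release.arch lines."""
    versions = {}
    for line in rpm_lines:
        line = line.strip()
        # A valid entry has at least two hyphens: name-version-release.arch
        if line.count("-") < 2:
            continue
        name, version, _release_arch = line.rsplit("-", 2)
        versions[name] = version
    return versions


def version_mismatches(server, client):
    """Return {name: (server_version, client_version)} for differing packages."""
    return {
        name: (server[name], client[name])
        for name in sorted(server.keys() & client.keys())
        if server[name] != client[name]
    }


if __name__ == "__main__":
    # Subsets of the two listings above (container first, then node).
    container = package_versions([
        "glusterfs-4.1.7-1.el7.x86_64",
        "glusterfs-fuse-4.1.7-1.el7.x86_64",
        "glusterfs-server-4.1.7-1.el7.x86_64",
    ])
    node = package_versions([
        "glusterfs-3.12.2-18.el7.x86_64",
        "glusterfs-fuse-3.12.2-18.el7.x86_64",
    ])
    for name, (srv, cli) in version_mismatches(container, node).items():
        print(f"{name}: container {srv}, node {cli}")
```

Packages present on only one side (for example glusterfs-server) are ignored, since there is nothing to compare them against.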

 

The Kubernetes version:

 

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"13",
GitVersion:"v1.13.5",
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2",
GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z",
GoVersion:"go1.11.5", Compiler:"gc",
Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"13",
GitVersion:"v1.13.5",
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2",
GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z",
GoVersion:"go1.11.5", Compiler:"gc",
Platform:"linux/amd64"}

 

Our gluster settings/volume options:

 

apiVersion: storage.k8s.io/v1

kind: StorageClass

metadata:

  name: gluster-heketi

  selfLink: /apis/storage.k8s.io/v1/storageclasses/gluster-heketi

parameters:

  gidMax: "50000"

  gidMin: "2000"

  resturl: http://10.233.35.158:8080

  restuser: "null"

  restuserkey: "null"

  volumetype: "none"

  volumeoptions: cluster.post-op-delay-secs 0, performance.client-io-threads
off, performance.open-behind off, performance.readdir-ahead off,
performance.read-ahead off, performance.stat-prefetch off,
performance.write-behind off, performance.io-cache off,
cluster.consistent-metadata on, performance.quick-read off,
performance.strict-o-direct on

provisioner: kubernetes.io/glusterfs

reclaimPolicy: Delete
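
The volumeoptions value above is a comma-separated list of "key value" pairs, which is hard to scan by eye. A minimal sketch for splitting it into a dictionary follows. `parse_volume_options` is a hypothetical helper, and the one-key-one-value assumption is taken from the string as pasted here.

```python
def parse_volume_options(raw):
    """Split 'key value, key value, ...' into a dict of option -> value."""
    options = {}
    for item in raw.split(","):
        parts = item.split()  # also absorbs line breaks from mail wrapping
        if not parts:
            continue
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {item.strip()!r}")
        key, value = parts
        options[key] = value
    return options


if __name__ == "__main__":
    # First few options from the StorageClass above, wrapped as in the mail.
    raw = (
        "cluster.post-op-delay-secs 0, performance.client-io-threads\n"
        "off, performance.open-behind off, performance.strict-o-direct on"
    )
    for key, value in sorted(parse_volume_options(raw).items()):
        print(f"{key} = {value}")
```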

 

Volume info for the heketi volume:

 
gluster> volume info heketidbstorage
 
Volume Name: heketidbstorage
Type: Distribute
Volume ID: 34b897d0-0953-4f8f-9c5c-54e043e55d92
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1:
10.10.168.25:/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a16f9f0374fe5db948a60a017a3f5e60/brick
Options Reconfigured:
user.heketi.id: 1d2400626dac780fce12e45a07494853
transport.address-family: inet
nfs.disable: on
 

Full Gluster logs are available if needed; just let me know how best to provide
them.

 

Thanks in advance for any help or suggestions on this!

 

Best,

 

Jeff Bischoff

Turbonomic

 

 

 

 

 
