Emma Hogbin Westby
2017-Feb-10 17:01 UTC
[Gluster-users] Troubleshooting an outage with version mismatch in 3.8.x
Hello,

I am trying to understand an outage that we had recently when adding a new GlusterFS brick to our pool. The three existing nodes were each running 3.8.5. The new node was running 3.8.8. We didn't have any reason to think a point-release difference would cause problems.

Within ten hours one of our sites experienced the following problems:

- nginx was unable to read files from GlusterFS
- the Docker container providing the nginx service became unresponsive to stop/start commands
- restarting the Docker service did not make it possible to stop/start the affected nginx containers
- ultimately a reboot of the host server was required

During the early part of the outage, the GlusterFS commands stopped working. As the outage proceeded, it was possible to navigate the files via Gluster, but not to serve them via nginx.

We experienced three outages in three days, all with similar symptoms.

- After the first outage we simply restarted the server to get the files to be delivered normally.
- After the second outage (22.5 hours later) we stopped the GlusterFS service on the new server. It was listed as "disconnected".
- After the third outage (11 hours later) we manually removed the volume for the affected (high-volume) site.

It was only after taking this last action that the outages stopped.

As best I can tell, the problem is the new brick, which is running 3.8.8 while the existing nodes in the pool are on 3.8.5. Is this the expected behaviour from a point release? I would have thought a patch release would be fine.

Regards,
Emma