Alex K
2018-Feb-15 19:34 UTC
[Gluster-users] Failover problems with gluster 3.8.8-1 (latest Debian stable)
Hi,

Have you checked for any file system errors on the brick mount point?  I
once ran into weird I/O errors and xfs_repair fixed the issue.

What about the heal?  Does it report any pending heals?

On Feb 15, 2018 14:20, "Dave Sherohman" <dave at sherohman.org> wrote:

> Well, it looks like I've stumped the list, so I did a bit of additional
> digging myself:
>
> azathoth replicates with yog-sothoth, so I compared their brick
> directories.  `ls -R /var/local/brick0/data | md5sum` gives the same
> result on both servers, so the filenames are identical in both bricks.
> However, `du -s /var/local/brick0/data` shows that azathoth has about 3G
> more data (445G vs 442G) than yog.
>
> This seems consistent with my assumption that the problem is on
> yog-sothoth (everything is fine with only azathoth; there are problems
> with only yog-sothoth) and I am reminded that a few weeks ago,
> yog-sothoth was offline for 4-5 days, although it should have been
> brought back up-to-date once it came back online.
>
> So, assuming that the issue is stale/missing data on yog-sothoth, is
> there a way to force gluster to do a full refresh of the data from
> azathoth's brick to yog-sothoth's brick?  I would have expected running
> heal and/or rebalance to do that sort of thing, but I've run them both
> (with and without fix-layout on the rebalance) and the problem persists.
>
> If there isn't a way to force a refresh, how risky would it be to kill
> gluster on yog-sothoth, wipe everything from /var/local/brick0, and then
> re-add it to the cluster as if I were replacing a physically failed
> disk?  Seems like that should work in principle, but it feels dangerous
> to wipe the partition and rebuild, regardless.
>
> On Tue, Feb 13, 2018 at 07:33:44AM -0600, Dave Sherohman wrote:
> > I'm using gluster for a virt-store with 3x2 distributed/replicated
> > servers for 16 qemu/kvm/libvirt virtual machines using image files
> > stored in gluster and accessed via libgfapi.  Eight of these disk images
> > are standalone, while the other eight are qcow2 images which all share a
> > single backing file.
> >
> > For the most part, this is all working very well.  However, one of the
> > gluster servers (azathoth) causes three of the standalone VMs and all 8
> > of the shared-backing-image VMs to fail if it goes down.  Any of the
> > other gluster servers can go down with no problems; only azathoth causes
> > issues.
> >
> > In addition, the kvm hosts have the gluster volume fuse mounted and one
> > of them (out of five) detects an error on the gluster volume and puts
> > the fuse mount into read-only mode if azathoth goes down.  libgfapi
> > connections to the VM images continue to work normally from this host
> > despite this and the other four kvm hosts are unaffected.
> >
> > It initially seemed relevant that I have the libgfapi URIs specified as
> > gluster://azathoth/..., but I've tried changing them to make the initial
> > connection via other gluster hosts and it had no effect on the problem.
> > Losing azathoth still took them out.
> >
> > In addition to changing the mount URI, I've also manually run a heal and
> > rebalance on the volume, enabled the bitrot daemons (then turned them
> > back off a week later, since they reported no activity in that time),
> > and copied one of the standalone images to a new file in case it was a
> > problem with the file itself.  As far as I can tell, none of these
> > attempts changed anything.
> >
> > So I'm at a loss.  Is this a known type of problem?  If so, how do I fix
> > it?  If not, what's the next step to troubleshoot it?
> >
> >
> > # gluster --version
> > glusterfs 3.8.8 built on Jan 11 2017 14:07:11
> > Repository revision: git://git.gluster.com/glusterfs.git
> >
> > # gluster volume status
> > Status of volume: palantir
> > Gluster process                              TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick saruman:/var/local/brick0/data         49154     0          Y       10690
> > Brick gandalf:/var/local/brick0/data         49155     0          Y       18732
> > Brick azathoth:/var/local/brick0/data        49155     0          Y       9507
> > Brick yog-sothoth:/var/local/brick0/data     49153     0          Y       39559
> > Brick cthulhu:/var/local/brick0/data         49152     0          Y       2682
> > Brick mordiggian:/var/local/brick0/data      49152     0          Y       39479
> > Self-heal Daemon on localhost                N/A       N/A        Y       9614
> > Self-heal Daemon on saruman.lub.lu.se        N/A       N/A        Y       15016
> > Self-heal Daemon on cthulhu.lub.lu.se        N/A       N/A        Y       9756
> > Self-heal Daemon on gandalf.lub.lu.se        N/A       N/A        Y       5962
> > Self-heal Daemon on mordiggian.lub.lu.se     N/A       N/A        Y       8295
> > Self-heal Daemon on yog-sothoth.lub.lu.se    N/A       N/A        Y       7588
> >
> > Task Status of Volume palantir
> > ------------------------------------------------------------------------------
> > Task                 : Rebalance
> > ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
> > Status               : completed
> >
> >
> > --
> > Dave Sherohman
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-users
>
>
> --
> Dave Sherohman
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
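To make the heal question concrete: the pending-heal state of a replica
pair can usually be listed with the heal-info commands (volume name
"palantir" taken from the status output quoted above; the exact output
format varies between gluster releases):

# gluster volume heal palantir info
# gluster volume heal palantir info split-brain
# gluster volume heal palantir statistics heal-count

If the heal indices look clean but the bricks still differ, a full sweep
that walks the entire brick rather than only the changelog indices can
be requested with:

# gluster volume heal palantir full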
Dave Sherohman
2018-Feb-16 11:44 UTC
[Gluster-users] Failover problems with gluster 3.8.8-1 (latest Debian stable)
On Thu, Feb 15, 2018 at 09:34:02PM +0200, Alex K wrote:
> Have you checked for any file system errors on the brick mount point?

I hadn't.  fsck reports no errors.

> What about the heal? Does it report any pending heals?

There are now.  It looks like taking the brick offline to fsck it was
enough to trigger gluster to recheck everything.  I'll check after it
finishes to see whether this ultimately resolves the issue.

--
Dave Sherohman
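P.S.  In case it's useful to anyone following along: heal progress can
be watched with the heal-count statistics, and individual image files
can be spot-checked on the bricks by dumping their afr changelog xattrs
(the file path below is only a made-up example, not one of my real
images):

# gluster volume heal palantir statistics heal-count
# getfattr -d -m . -e hex /var/local/brick0/data/images/example-vm.qcow2

Non-zero trusted.afr.* pending counters on one brick should indicate
writes that still need to be healed onto the other replica.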
Dave Sherohman
2018-Feb-20 09:59 UTC
[Gluster-users] Failover problems with gluster 3.8.8-1 (latest Debian stable)
On Fri, Feb 16, 2018 at 05:44:43AM -0600, Dave Sherohman wrote:
> On Thu, Feb 15, 2018 at 09:34:02PM +0200, Alex K wrote:
> > Have you checked for any file system errors on the brick mount point?
>
> I hadn't.  fsck reports no errors.
>
> > What about the heal? Does it report any pending heals?
>
> There are now.  It looks like taking the brick offline to fsck it was
> enough to trigger gluster to recheck everything.  I'll check after it
> finishes to see whether this ultimately resolves the issue.

Finally got a chance to test it and, nope, still having problems.
Although I'm sure the gluster volume is in better condition now than
before I got this housekeeping to run, it doesn't seem to have changed
what happens when azathoth goes down at all.

Anyone have a more thorough way of forcing a replica to refresh itself
from its twin?

--
Dave Sherohman
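P.S.  The brick-replacement route I floated earlier would presumably
look something like the following: point the replica at a fresh, empty
directory and let self-heal repopulate it from azathoth.  The
/var/local/brick1/data path is just a placeholder for wherever a new
brick would actually live, and I haven't tried this myself yet:

# gluster volume replace-brick palantir \
      yog-sothoth:/var/local/brick0/data \
      yog-sothoth:/var/local/brick1/data \
      commit force
# gluster volume heal palantir full
# gluster volume heal palantir info

I'd still rather understand why the normal heal isn't producing a
usable copy on yog-sothoth before resorting to that, though.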