thr3ads.net - Gluster users - [Gluster-users] Issues in AFR and self healing [Aug 2018]

Pablo Schandin

2018-Aug-10 17:55 UTC

[Gluster-users] Issues in AFR and self healing

Hello everyone!

I'm having some trouble with something, but I'm not quite sure with
what yet. I'm running GlusterFS 3.12.6 on Ubuntu 16.04. I have two
servers (nodes) in the cluster in replica mode. Each server has 2
bricks. As the servers are KVM hosts running several VMs, one brick
holds the VMs defined locally on that server and the second brick is
the replica of the other server's brick. It has data, but no actual
writing is being done on it except for the replication.

                  Server 1                                  Server 2
Volume 1 (gv1):   Brick 1 defined VMs (read/write)   ---->  Brick 1 replicated qcow2 files
Volume 2 (gv2):   Brick 2 replicated qcow2 files     <----  Brick 2 defined VMs (read/write)

So, the main issue arose when I got a nagios alarm that warned about a
file listed to be healed. And then it disappeared. I came to find out
that every 5 minutes, the self heal daemon triggers the healing and this
fixes it. But looking at the logs I have a lot of entries in the
glustershd.log file like this:

[2018-08-09 14:23:37.689403] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0:
Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c.
sources=[0]  sinks=1
[2018-08-09 14:44:37.933143] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0:
Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f.
sources=[0]  sinks=1

The qcow2 files are being healed several times a day (up to 30 on
occasion). As I understand it, this means that a data heal occurred on
the files with gfid 407b... and 7371..., from source to sink. Is that
local server to replica server? Is it OK for the shd to heal files in
the replicated brick that supposedly has no writing on it besides the
mirroring? How does that work?
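
To put a number on "several times a day", the heal entries can be
counted per file. A minimal Python sketch, assuming each log entry has
been joined back onto a single line (the entries above are wrapped by
the mail client) and that the log is in the usual
/var/log/glusterfs/glustershd.log location; the function name
count_heals is just for illustration:

```python
import re
from collections import Counter

# Matches "Completed <kind> selfheal on <gfid>" entries like the ones
# quoted above. Assumes one log entry per line.
HEAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\]"
    r".*?0-(?P<subvol>\S+):\s+"
    r"Completed (?P<kind>\w+) selfheal on (?P<gfid>[0-9a-f-]{36})"
)


def count_heals(lines):
    """Return a Counter keyed by (subvolume, gfid, heal kind)."""
    counts = Counter()
    for line in lines:
        m = HEAL_RE.search(line)
        if m:
            counts[(m["subvol"], m["gfid"], m["kind"])] += 1
    return counts


if __name__ == "__main__":
    # Path is an assumption; adjust to where your shd log lives.
    with open("/var/log/glusterfs/glustershd.log") as fh:
        for key, n in count_heals(fh).most_common():
            print(n, *key)
```

Files that show up with a high count are the ones worth comparing
against the client-side logs for the same time window.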

How does afr replication work? The file with gfid 7371... is the qcow2
root disk of an owncloud server with 17GB of data. It does not seem to
be that big to be a bottleneck of some sort, I think.

Also, I was investigating the directory tree in brick/.glusterfs/indices
and I noticed that in both xattrop and dirty I always have a file
named xattrop-xxxxxx and dirty-xxxxxx respectively. I read that the
xattrop file is like a parent file or handle, and that the other files
created there are hardlinks to it, named by gfid, for the shd to heal.
Is it the same for the ones in the dirty dir?
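
For anyone who wants to look at the same directories, here is a small
Python sketch that lists only the gfid-named entries and skips the
xattrop-xxxxxx / dirty-xxxxxx base file. The brick path in the usage
part is a placeholder, and the helper name pending_gfids is made up for
this example; it only assumes the layout described above (entries named
by gfid under brick/.glusterfs/indices/<index>):

```python
import os
import re

# A gfid is a UUID in the usual 8-4-4-4-12 lowercase hex form.
GFID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def pending_gfids(brick_path, index="xattrop"):
    """List gfid-named entries in <brick>/.glusterfs/indices/<index>."""
    index_dir = os.path.join(brick_path, ".glusterfs", "indices", index)
    return sorted(n for n in os.listdir(index_dir) if GFID_RE.match(n))


if __name__ == "__main__":
    brick = "/path/to/brick"  # placeholder, not a real path
    for index in ("xattrop", "dirty"):
        print(index, pending_gfids(brick, index))
```

Reading the brick directory this way is read-only; it does not trigger
or change any heal.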

Any help will be greatly appreciated. Thanks!

Pablo.

Ravishankar N

2018-Aug-11 03:19 UTC

On 08/10/2018 11:25 PM, Pablo Schandin wrote:
> Hello everyone!
>
> I'm having some trouble with something, but I'm not quite sure with
> what yet. I'm running GlusterFS 3.12.6 on Ubuntu 16.04. I have two
> servers (nodes) in the cluster in replica mode. Each server has 2
> bricks. As the servers are KVM hosts running several VMs, one brick
> holds the VMs defined locally on that server and the second brick is
> the replica of the other server's brick. It has data, but no actual
> writing is being done on it except for the replication.
>
>                   Server 1                                  Server 2
> Volume 1 (gv1):   Brick 1 defined VMs (read/write)   ---->  Brick 1 replicated qcow2 files
> Volume 2 (gv2):   Brick 2 replicated qcow2 files     <----  Brick 2 defined VMs (read/write)
>
> So, the main issue arose when I got a nagios alarm that warned about a 
> file listed to be healed. And then it disappeared. I came to find out 
> that every 5 minutes, the self heal daemon triggers the healing and 
> this fixes it. But looking at the logs I have a lot of entries in the 
> glustershd.log file like this:
>
> [2018-08-09 14:23:37.689403] I [MSGID: 108026] 
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: 
> Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. 
> sources=[0]  sinks=1
> [2018-08-09 14:44:37.933143] I [MSGID: 108026] 
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0: 
> Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f. 
> sources=[0]  sinks=1
>
> The qcow2 files are being healed several times a day (up to 30 on
> occasion). As I understand it, this means that a data heal occurred on
> the files with gfid 407b... and 7371..., from source to sink. Is that
> local server to replica server? Is it OK for the shd to heal files in
> the replicated brick that supposedly has no writing on it besides the
> mirroring? How does that work?
>

In AFR, for writes, there is no notion of a local/remote brick. No
matter which client you write to the volume from, the write gets sent
to both bricks, i.e. the replication is synchronous and real-time.

> How does afr replication work? The file with gfid 7371... is the qcow2 
> root disk of an owncloud server with 17GB of data. It does not seem to 
> be that big to be a bottleneck of some sort, I think.
>
> Also, I was investigating the directory tree in
> brick/.glusterfs/indices and I noticed that in both xattrop and dirty
> I always have a file named xattrop-xxxxxx and dirty-xxxxxx
> respectively. I read that the xattrop file is like a parent file or
> handle, and that the other files created there are hardlinks to it,
> named by gfid, for the shd to heal. Is it the same for the ones in
> the dirty dir?
>

Yes. Before the write, the gfid gets captured inside dirty on all
bricks. If the write is successful, it gets removed. In addition, if
the write fails on one brick, the other brick will capture the gfid
inside xattrop.

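The bookkeeping described above can be pictured with a toy model in
Python. This is only an illustration of the description in this mail
and not Gluster's actual code: the Brick class, replicated_write and the
reachable flag are all made up for the sketch, and clearing dirty only
on the bricks where the write landed is a simplifying assumption.

```python
class Brick:
    """Toy stand-in for one brick; not Gluster's real data structures."""

    def __init__(self, name):
        self.name = name
        self.reachable = True   # set to False to simulate a disconnect
        self.dirty = set()      # gfids with a write in flight
        self.xattrop = set()    # gfids another brick still needs healed
        self.data = {}

    def write(self, gfid, payload):
        if not self.reachable:
            return False
        self.data[gfid] = payload
        return True


def replicated_write(bricks, gfid, payload):
    """Send one write to every brick and do the index bookkeeping."""
    # Pre-op: record the gfid under "dirty" on every reachable brick.
    for b in bricks:
        if b.reachable:
            b.dirty.add(gfid)

    # The write goes to all bricks; there is no local/remote distinction.
    succeeded = [b for b in bricks if b.write(gfid, payload)]
    all_ok = len(succeeded) == len(bricks)

    # Post-op: clear "dirty" where the write landed. If any brick missed
    # the write, the good bricks remember the gfid under "xattrop".
    for b in succeeded:
        b.dirty.discard(gfid)
        if not all_ok:
            b.xattrop.add(gfid)
    return all_ok
```

In this model a write with both bricks reachable leaves dirty and
xattrop empty, while a write made with one brick unreachable leaves the
gfid in the other brick's xattrop set, which is the entry the shd later
picks up.
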
> Any help will be greatly appreciated. Thanks!
>

If frequent heals are triggered, it could mean there are frequent
network disconnects between the clients and the bricks while writes
happen. You can check the mount logs to see if that is the case and
investigate possible network issues.
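
One way to do that check is to pull the timestamps of disconnect
messages out of the mount log and compare them with the heal times in
glustershd.log. A rough Python sketch follows; it assumes the mount log
uses the same "[YYYY-MM-DD HH:MM:SS.ffffff]" prefix as the shd log, and
the search string "disconnected from" is an assumption about the
message wording, so it is left as a parameter to adjust to whatever
your logs actually contain:

```python
import re
from datetime import datetime

# Leading timestamp, in the same form as the glustershd.log entries.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")


def disconnect_times(lines, needle="disconnected from"):
    """Return timestamps of log lines that contain `needle`."""
    times = []
    for line in lines:
        if needle not in line:
            continue
        m = TS_RE.match(line)
        if m:
            times.append(
                datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
    return times
```

If the disconnect timestamps cluster shortly before the "Completed data
selfheal" entries, that points towards the network rather than AFR
itself.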

HTH,
Ravi

> Pablo.