That is probably the case as a lot of files were deleted some time ago. I'm on version 5.2 but was on 3.12 until about a week ago.

Here is the quorum info. I'm running a distributed replicated volume in a 2 x 3 = 6 configuration.

cluster.quorum-type          auto
cluster.quorum-count         (null)
cluster.server-quorum-type   off
cluster.server-quorum-ratio  0
cluster.quorum-reads         no

Where exactly do I remove the gfid entries from - the .glusterfs directory? Do I just delete all the directories and files under this directory?

Where do I put the cluster.heal-timeout option - which file?

I think you've hit on the cause of the issue. Thinking back, we've had some extended power outages, and due to a misconfiguration in the swap file device name a couple of the nodes did not come up and I didn't catch it for a while, so maybe the deletes occurred then.

Thank you.

On 12/31/18 2:58 AM, Davide Obbi wrote:
> If the long GFID does not correspond to any file it could mean the
> file has been deleted by the client mounting the volume. I think this
> happens when the delete was issued while the number of active bricks
> did not reach quorum majority, or when a second brick was taken down
> while another was down or had not finished the selfheal; the latter is
> more likely.
> It would be interesting to see:
> - what version of glusterfs you are running; it happened to me with 3.12
> - the volume quorum rules: "gluster volume get vol all | grep quorum"
>
> To clean it up, if I remember correctly, it should be possible to delete
> the gfid entries from the brick mounts on the glusterfs server nodes
> reporting the files to heal.
>
> As a side note, you might want to consider changing the selfheal
> timeout to a more aggressive schedule via the cluster.heal-timeout option.
Where exactly do I remove the gfid entries from - the .glusterfs directory? --> Yes. I can't remember exactly where, but try a find in the brick paths with the gfid; it should return something.

Where do I put the cluster.heal-timeout option - which file? --> It is not set in a file; it is a volume option: gluster volume set volumename option value.
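For example, something along these lines should do it (the volume name, brick path, and gfid below are only placeholders - substitute your own):

    # locate a reported gfid under a brick's .glusterfs tree
    # (gfid entries are normally stored as .glusterfs/<first 2 chars>/<next 2 chars>/<full gfid>)
    find /path/to/brick/.glusterfs -name '<gfid-from-heal-info>'

    # cluster.heal-timeout is a volume option (value in seconds, e.g. 120), set via the CLI
    gluster volume set <volname> cluster.heal-timeout 120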
Healing time set to 120 seconds for now.

Just to make sure I understand: I need to take the result of "gluster volume heal projects info" and put it in a file, then try to find each gfid listed in that file in the .glusterfs directory of each brick listed in the output as having unhealed files, and delete that file - if it exists. If it doesn't exist, don't worry about it.

So these bricks have unhealed entries listed:

/srv/gfs01/Projects/.glusterfs - 85 files
/srv/gfs05/Projects/.glusterfs - 58854 files
/srv/gfs06/Projects/.glusterfs - 58854 files

Script time!
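Roughly what I have in mind is something like this (untested sketch - the script and list file names are made up, and it assumes the heal output has already been reduced to one bare gfid per line):

    #!/bin/bash
    # Rough sketch: remove gfid entries reported by "gluster volume heal projects info"
    # from one brick's .glusterfs tree.
    # Usage (names are placeholders): ./clean-gfids.sh /srv/gfs01/Projects gfid-list.txt
    # gfid-list.txt is assumed to contain one bare gfid per line.

    brick="$1"
    list="$2"

    while read -r gfid; do
        [ -z "$gfid" ] && continue
        # gfid entries live two levels deep: .glusterfs/<first 2 chars>/<next 2 chars>/<gfid>
        entry="$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
        # directory gfids show up as symlinks here, so test for those as well
        if [ -e "$entry" ] || [ -L "$entry" ]; then
            echo "removing $entry"
            rm -f "$entry"
        fi
    done < "$list"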
I wrote a script that searches the output of "gluster volume heal projects info", picks the brick I give it, and then deletes any of the files listed that actually exist in .glusterfs/dir1/dir2. I did this on the first host, which had 85 pending, and that cleared them up, so I'll do it via ssh on the other two servers. Hopefully that will clear it up and glusterfs will be happy again.

Thanks everyone for the help.
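Once the script has run on all three servers I'll re-check the pending entries to make sure the counts drop to zero - something like:

    gluster volume heal projects info
    # a condensed per-brick count is also available on newer releases (I believe 5.2 has it)
    gluster volume heal projects info summary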