On March 27, 2014 11:08:03 PM PDT, Nicolas Ochem <nicolas.ochem at
gmail.com> wrote:
>Hi list,
>I would like to describe an issue I had today with Gluster and ask for
>your opinion:
>
>I have a replicated volume with 2 replicas. There is about 1 TB of
>production data in there, in around 100,000 files. The bricks sit on
>2x Supermicro x9dr3-ln4f machines with a RAID array of 18 TB each,
>64 GB of RAM and 2x Xeon CPUs, as recommended in the Red Hat hardware
>guidelines for storage servers. They have a 10 Gb link between each
>other. I am running Gluster 3.4.2 on CentOS 6.5.
>
>This storage is NFS-mounted on a lot of production servers. Only a
>very small part of this data is actually useful; the rest is legacy.
>
>Due to an unrelated issue with one of the Supermicro servers (faulty
>memory), I had to take one of the nodes offline for 3 days.
>
>When I brought it back up, some files and directories ended up in a
>heal-failed state (but no split-brain). Unfortunately those were the
>critical files that had been edited during those 3 days. On the NFS
>mounts, attempts to read these files resulted in I/O errors.
>
>I was able to fix a few of these files by manually removing them from
>each brick and then copying them to the mounted volume again. But I
>did not know what to do when entire directories were unreachable
>because of "heal failed".
>
>I later read that healing can take time and that heal-failed may be a
>transient state (is that correct?
>http://stackoverflow.com/questions/19257054/is-it-normal-to-get-a-lot-of-heal-failed-entries-in-a-gluster-mount),
>but at the time I thought the data was beyond recovery, so I proceeded
>to destroy the Gluster volume. Then, on one of the replicas, I moved
>the content of the brick to another directory, created another volume
>with the same name, and copied that content back to the mounted
>volume. This took around 2 hours. Then I had to reboot all my
>NFS-mounted machines, which were stuck in "stale NFS file handle"
>state.
>
>A few questions:
>- I realize that I cannot expect 1 TB of data to heal instantly, but
>is there any way for me to know whether the system would have
>recovered eventually despite being shown as "heal failed"?
>- If yes, how many files, and of what total size, would I have to
>clean up from my volume to bring this time under 10 minutes?
>- Would native Gluster mounts instead of NFS have helped here?
>- Would any other course of action have resulted in a faster recovery
>time?
>- Is there a way, in such a situation, to make one replica
>authoritative about the correct state of the filesystem?
>
>Thanks in advance for your replies.
>
>
Although the self-heal daemon can take time to heal all the files, accessing a
file that needs healing does trigger the heal to be performed immediately by
the client (the NFS server is the client in this case).
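
As an illustration (the volume name "myvol" and the mount point are made up
for the example), one way to force those client-side heals is simply to stat
the affected files through a mount, and to watch progress with the heal
commands available in 3.4:

    # trigger heals by accessing every file through a FUSE or NFS mount
    find /mnt/myvol -noleaf -print0 | xargs -0 stat > /dev/null 2>&1

    # ask the self-heal daemon to crawl the whole volume, then check status
    gluster volume heal myvol full
    gluster volume heal myvol info
    gluster volume heal myvol info heal-failed
    gluster volume heal myvol info split-brain
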
As with pretty much all errors in GlusterFS, you would have had to look in the
logs to find out why something as vague as "heal failed" happened.
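
For reference, on a stock CentOS install the relevant logs normally live
under /var/log/glusterfs/; the file names below assume that default layout:

    # self-heal daemon log (heals performed by glustershd)
    less /var/log/glusterfs/glustershd.log

    # built-in Gluster NFS server log (heals triggered through NFS clients)
    less /var/log/glusterfs/nfs.log

    # per-brick logs on each storage server
    ls /var/log/glusterfs/bricks/

    # quick way to pull out the failures
    grep -iE 'self-heal|split-brain|failed' /var/log/glusterfs/glustershd.log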