thr3ads.net - Gluster users - [Gluster-users] Gluster 3.7.17 distributed-replicated volume experiences almost regular Gluster internal NFS subprocess crash (CentOS 7.2) [Nov 2016]


Giuseppe Ragusa

2016-Nov-29 22:36 UTC

[Gluster-users] Gluster 3.7.17 distributed-replicated volume experiences almost regular Gluster internal NFS subprocess crash (CentOS 7.2)

Hi all,

I'm writing to kindly ask for help with the issue in the subject line above,
which is documented in:


https://bugzilla.redhat.com/show_bug.cgi?id=1381970


Brief recap:


A 3-node distributed-replicated cluster (replica with arbiter; the arbiter
bricks for all volumes are confined to the same dedicated node) experiences
regular NFS crashes on at least one non-arbiter node at a time (both
non-arbiter nodes crash if given enough time without applying the workaround
described below); there are no Gluster native clients, only NFS ones, all on a
dedicated network.


Simply restarting an NFS-enabled volume restarts the NFS services on all
(non-arbiter) nodes for all volumes, and all seems well until the next crash
(crashes happen many times a day under our normal workload).
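
For readers who want to script the workaround above, here is a minimal sketch
in Python. It assumes that "restarting a volume" means a plain stop followed by
a start through the gluster CLI; the helper names and the volume name are
hypothetical and are not taken from the original report. Note that stopping a
volume interrupts every client of that volume.

```python
import subprocess


def restart_volume_commands(volume):
    """Return the gluster CLI calls that stop and then start a volume."""
    return [
        # --mode=script suppresses the interactive confirmation on "stop"
        ["gluster", "--mode=script", "volume", "stop", volume],
        ["gluster", "volume", "start", volume],
    ]


def restart_volume(volume, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in restart_volume_commands(volume):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the sequence if a command fails
            subprocess.run(cmd, check=True)
```

Calling restart_volume("mirror-vol") (a hypothetical volume name) only prints
the two commands; passing dry_run=False runs them on a node where the gluster
CLI is available.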


An almost sure way of making NFS crash immediately is recreating the yum
metadata directory of a CentOS 7 OS mirror repo hosted on an NFS-enabled volume.
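
A possible way to script that reproduction step is sketched below. The post
does not say how the metadata is recreated, so this assumes the standard
createrepo tool and a repodata directory directly under the repo root; the
function name and the run parameter are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def recreate_repodata(repo_dir, run=subprocess.run):
    """Delete repo_dir/repodata, then regenerate it with createrepo."""
    repodata = Path(repo_dir) / "repodata"
    # Remove the old metadata so that it is rebuilt from scratch
    if repodata.exists():
        shutil.rmtree(repodata)
    cmd = ["createrepo", str(repo_dir)]
    # On an NFS client this generates the metadata I/O against the volume
    run(cmd, check=True)
    return cmd
```

The run parameter exists only so the command can be replaced by a stub when
trying the function away from a real mirror.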


Since it is a production cluster and we had to disable various cron jobs that
were regularly crashing Gluster's internal NFS server (no NFS-Ganesha in use
here), I am almost ready to accept even an upgrade to 3.8.x as a solution. I
dare say so because I've seen various fixes in Gerrit that were not backported
to 3.7; for one of them I even cloned the 3.8 bug in Bugzilla and kindly asked
for a backport, given that the patch applied cleanly. This raises the question:
is backporting of patches to 3.7 being phased out unless explicitly requested?

The only caveat could be that the cluster is a hyperconverged setup with oVirt
3.6.7 (but the oVirt part with its dedicated Gluster volumes is working
flawlessly and is absolutely not being used to manage Gluster, only to monitor
it), so I would need to check for 3.8 compatibility before upgrading.


Many thanks in advance to anyone who can offer any advice on this issue.


Best regards,

Giuseppe

Giuseppe Ragusa

2016-Dec-02 20:12 UTC

[Gluster-users] Gluster 3.7.17 distributed-replicated volume experiences almost regular Gluster internal NFS subprocess crash (CentOS 7.2)

Hi all,

I've added more info and full /var/log/gluster contents from cluster nodes
to https://bugzilla.redhat.com/show_bug.cgi?id=1381970

After many attempts at configuration tweaks, the crash still occurs and is
reliably reproducible within a few minutes by generating NFS load.

Any advice whatsoever would be very welcome!

(I started thinking about migrating to NFS-Ganesha, but we cannot safely
introduce a full Pacemaker/Corosync stack into a hyperconverged GlusterFS/oVirt
setup, I think)

Many thanks in advance.

Best regards,
Giuseppe Ragusa

