I'm running into an issue where the Gluster NFS service keeps dying on a new cluster I set up recently. We've been using Gluster on several other clusters for about a year now and I have never seen this issue before, nor have I been able to find anything remotely similar while searching online. I initially used the latest version in the Gluster Debian repository for Jessie, 3.9.0-1, and then tried the next one down, 3.8.7-1. Both behave the same for me.

What I was seeing was that, after a number of processes had run on the app server I had connected to the new NAS servers for testing, the NFS service on the NAS server would suddenly die. (We're upgrading the NAS servers for this cluster to newer hardware and expanded storage; the current production NAS servers use nfs-kernel-server with no clustering of the data.) I checked the logs, but all they showed was something that looked like a stack trace in nfs.log, while glustershd.log showed the NFS service disconnecting. I turned on debugging, but it didn't give me a whole lot more, and certainly nothing that helps me identify the source of the issue. It dies pretty consistently shortly after I mount the file system on the servers and start testing, usually within 15-30 minutes. But if nothing is using the file system, mounted or not, the service stays running for days. I tried mounting with the Gluster client, and that works fine, but I can't use it due to the performance penalty: it slows the websites down by a few seconds at a minimum.

Here is the output from the logs one of the times it died:

glustershd.log:

[2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs: readv on /var/run/gluster/a921bec34928e8380280358a30865cee.socket failed (No data available)
[2017-01-10 19:06:20.265964] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd.
nfs.log:

[2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family] 0-NLM-client: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: 10.20.5.13)
[2017-01-10 19:06:20.135531] D [MSGID: 0] [common-utils.c:335:gf_resolve_ip6] 0-resolver: returning ip-10.20.5.13 (port-48963) for hostname: 10.20.5.13 and port: 48963
[2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-01-10 19:06:20
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.9.0
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xac)[0x7f891f0846ac]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x324)[0x7f891f08dcc4]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f891db870e0]
/lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f891dbe3d8a]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3cc15)[0x7f8918684c15]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f891ee4e4da]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f891ee4a7e3]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x4b33)[0x7f8919eadb33]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x8f07)[0x7f8919eb1f07]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7e836)[0x7f891f0d9836]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f891e3010a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f891dc3a62d]

The IP showing in nfs.log is actually that of a web server I was also testing with, not the app server, but it doesn't appear to me that this would be the cause of the NFS service dying. I'm at a loss as to what is going on, and I need to get this fixed pretty quickly; I was hoping to have it in production last Friday. If anyone has any ideas I'd be very grateful.

--
Paul Allen
Inetz System Administrator
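Since the backtrace only gives module-relative offsets, a first triage step is to turn one of those frames into a function name. A minimal sketch (the frame string is copied from the crash above; addr2line will only print something better than "??" if the matching glusterfs debug symbols are installed):

```shell
# Pull the module path and the offset out of one backtrace frame.
frame='/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]'
module=${frame%%(*}                                   # text before the first '('
offset=$(printf '%s\n' "$frame" | sed 's/.*(+\(0x[0-9a-f]*\)).*/\1/')
echo "$module $offset"
# With debug symbols installed, resolve the offset to a function/line:
#   addr2line -f -C -e "$module" "$offset"
```

Running the same extraction over every frame in the trace gives addr2line-ready pairs for the whole crash.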
From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Paul Allen <paul at inetz.com>
Sent: Wednesday, January 11, 2017 19:58
To: gluster-users at gluster.org
Subject: [Gluster-users] NFS service dying

[original message quoted in full; trimmed]

-------------------------------------------------------------------------------------------------------------------------------------------

Hi Paul,

I've experienced almost the same symptoms, but I'm running Gluster 3.7.17.

I found a really similar bug and added my logs and information to it: https://bugzilla.redhat.com/show_bug.cgi?id=1381970 but it has shown no signs of activity (apart from my comments/logs) in almost three months. I also tried to catch the attention of the mailing list here: http://www.gluster.org/pipermail/gluster-users/2016-November/029333.html but got no response whatsoever.

As stated in the above Bugzilla entry and mailing list post, my setup is a hyperconverged one, and so, while investigating some minor "cosmetic" bugs (regarding oVirt/VDSM and GlusterFS interaction), I think I may have recently found something "strange" that I hope will explain/solve the mystery (see https://bugzilla.redhat.com/show_bug.cgi?id=1406569).
For the time being, I manage to keep the cluster up (it was already in production when the crashes were first noticed) by means of a workaround based on CTDB, which was already deployed and running to provide HA and load balancing for the NFS/SMB services: you can find it documented in my comment #3 in Bugzilla entry #1381970; feel free to ask for more details, scripts etc. if I can be of any help to you.

I hope that with another report of the same problem we can finally catch the developers' attention and complete the troubleshooting ;-)

Best regards,
Giuseppe

-------------------------------------------------------------------------------------------------------------------------------------------

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
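A CTDB-style workaround like the one Giuseppe describes boils down to a liveness check on the Gluster/NFS process plus a respawn when it dies. A minimal sketch of that core check, under assumptions: the pid-file path and volume name below are placeholders to verify on your system, and `gluster volume start <vol> force` is used here as the usual way to respawn a volume's dead service daemons:

```shell
#!/bin/sh
# nfs_alive PIDFILE -> exit 0 if the pid recorded in PIDFILE is a live process.
nfs_alive() {
    [ -r "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Hypothetical watchdog body; PIDFILE and VOLUME are assumed names.
PIDFILE=/var/run/gluster/nfs/nfs.pid
VOLUME=gv0
if ! nfs_alive "$PIDFILE" && command -v gluster >/dev/null; then
    gluster volume start "$VOLUME" force   # respawn the volume's daemons
fi
```

Run from cron or a CTDB event script, this only papers over the crash, but it keeps clients served until the underlying bug is fixed.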
On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> I'm running into an issue where the gluster nfs service keeps dying on a
> new cluster I have set up recently.
> [...]
> I tried mounting it using the gluster client, and it works fine, but I
> can't use that due to the performance penalty, it slows the websites
> down by a few seconds at a minimum.

This seems to be related to the NLM protocol that Gluster/NFS provides. Earlier this week one of our Red Hat quality engineers also reported this (or a very similar) problem.
https://bugzilla.redhat.com/show_bug.cgi?id=1411344

At the moment I suspect that this is related to re-connects of some kind, but I have not been able to identify the cause sufficiently to be sure. This definitely is a coding problem in Gluster/NFS, and the more I look at the NLM implementation, the more potential issues I see with it.

If the workload does not require locking operations, you may be able to work around the problem by mounting with "-o nolock". Depending on the application, this can be safe or can cause data corruption...

Another alternative is to use NFS-Ganesha instead of Gluster/NFS. Ganesha is more mature than Gluster/NFS and is more actively developed. Gluster/NFS is being deprecated in favour of NFS-Ganesha.

HTH,
Niels

> Here is the output from the logs one of the times it died:
> [...]
> The IP showing in the nfs.log is actually for a web server I was also
> testing with, not the app server, but it doesn't appear to me that would
> be the cause for the nfs service dying. I'm at a loss as to what is
> going on, and I need to try and get this fixed pretty quickly here, I
> was hoping to have this in production last Friday. If anyone has any
> ideas I'd be very grateful.
>
> --
> Paul Allen
> Inetz System Administrator
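Before reaching for "-o nolock", it is worth confirming whether the workload actually takes advisory locks at all, since nolock silently drops lock semantics. A minimal probe using util-linux's flock(1); the mktemp file is only so the sketch runs anywhere, and you would point $probe at a file on the real NFS mount to exercise the NLM path:

```shell
# Try to take an exclusive advisory lock; if this hangs or errors on the
# NFS mount, the clients are exercising the NLM code path that crashes.
probe=$(mktemp)                        # substitute a file on the NFS mount
if flock --timeout 5 "$probe" -c 'true'; then
    echo "locking ok"
else
    echo "locking failed"
fi
# If the applications never lock files, remounting with lock requests
# disabled bypasses NLM entirely (hypothetical export/mountpoint names):
#   mount -t nfs -o vers=3,nolock nas1:/gv0 /var/www
```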