I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very impressed; I think it is a big step forward. Unfortunately there is one "feature" that is causing me a big problem: the NFS process crashes every few hours when under load. I have pasted the relevant error messages from nfs.log at the end of this message. Incidentally, the rest of the log file is swamped with these messages:

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor] nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced, so this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load": we have a small HPC cluster that is used for running ocean models. A typical model run involves 20 processors, all needing to read simultaneously from the same input data files at regular intervals during the run. There are roughly 20 files, each ~1GB in size. While this is going on, several people are typically processing output from previous runs on this and other (much bigger) clusters, chugging through hundreds of GB and tens of thousands of files every few hours. I don't think the Gluster-NFS crashes are purely load-dependent, because they seem to occur at different load levels; that is what leads me to suspect something subtle related to the cluster's 20-processor model runs.

I would prefer to use the GlusterFS client on the cluster's compute nodes, but unfortunately their pre-FUSE Linux kernel has been customised in a way that has thwarted all my attempts to build a FUSE module that the kernel will accept (see http://gluster.org/pipermail/gluster-users/2010-April/004538.html).

The servers that are exporting NFS are all running CentOS 5.5 with GlusterFS installed from RPMs, and the GlusterFS volumes are distributed (not replicated). Two of the servers with GlusterFS bricks are actually running SuSE Enterprise 10; I don't know if this is relevant.
I used previous GlusterFS versions with SLES10 without any problems, but as RPMs are not provided for SuSE I presume it is not an officially supported distro. For that reason I am only using the CentOS machines as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that might help to prevent these NFS crashes.

-Dan.

nfs.log extract
------------------
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e) [0x2aaaab30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41) [0x2aaaab9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d) [0x2aaaab9b0bdd]))) : Assertion failed: fd->refcount
pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2aaaab9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2aaaab9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2aaaab30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38f2222a69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aaaaaeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2aaaaace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2aaaab521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aaaaaacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
---------

--
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building, 3 Earley Gate
University of Reading
Reading, RG6 6AL, UK
Tel. +44 118 378 5205
Fax: +44 118 378 6413
Thanks. I'll be looking into it. I've filed a bug at:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2061

You may add yourself to the CC list for notifications. It seems the crash is easily reproduced on your setup. Can you please post the log from the Gluster NFS process at TRACE log level to the bug?

Dan Bretherton wrote:
> [...]
> [2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor]
> nfsrpc: RPC program not available
>
> There are no apparent problems while these errors are being produced so
> this issue probably isn't relevant to the crashes.

Correct. That error is misleading and will be removed in 3.1.1.

Thanks
-Shehjar
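For capturing the TRACE-level log requested above: in GlusterFS 3.1 the NFS translator runs as its own glusterfs process started by glusterd, and the glusterfs binary accepts a --log-level option. One possible approach, sketched below, is to relaunch that process by hand; the volfile, pid, and log paths shown are assumptions for a stock CentOS 5.5 + 3.1 RPM install, so confirm the real ones from the running process's command line first.

```shell
# Find the NFS server process glusterd started and note its -f/-p/-l
# arguments (the paths below are assumptions -- use what ps reports).
ps -ef | grep '[g]lusterfs.*nfs'

# Stop it...
kill "$(cat /etc/glusterd/nfs/run/nfs.pid)"

# ...and relaunch it with the log level raised to TRACE.
/usr/sbin/glusterfs -f /etc/glusterd/nfs/nfs-server.vol \
                    -p /etc/glusterd/nfs/run/nfs.pid \
                    -l /var/log/glusterfs/nfs.log \
                    --log-level=TRACE
```

TRACE is extremely verbose, so watch the size of nfs.log and drop back to the default level once the crash has been captured.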