I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very impressed; I think it is a big step forward. Unfortunately there is one "feature" that is causing me a big problem: the NFS process crashes every few hours when under load. I have pasted the relevant error messages from nfs.log at the end of this message. Incidentally, the rest of the log file is swamped with these messages:

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor] nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced, so this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load": we have a small HPC cluster that is used for running ocean models. A typical model run involves 20 processors, all needing to read simultaneously from the same input data files at regular intervals during the run. There are roughly 20 files, each ~1GB in size. While this is going on, several people are typically processing output from previous runs on this and other (much bigger) clusters, chugging through hundreds of GB and tens of thousands of files every few hours. I don't think the Gluster-NFS crashes are purely load-dependent, because they seem to occur at different load levels; that is what leads me to suspect something subtle related to the cluster's 20-processor model runs.

I would prefer to use the GlusterFS client on the cluster's compute nodes, but unfortunately their pre-FUSE Linux kernel has been customised in a way that has thwarted all my attempts to build a FUSE module that the kernel will accept (see http://gluster.org/pipermail/gluster-users/2010-April/004538.html).

The servers that are exporting NFS are all running CentOS 5.5 with GlusterFS installed from RPMs, and the GlusterFS volumes are distributed (not replicated). Two of the servers with GlusterFS bricks are actually running SuSE Enterprise 10; I don't know if this is relevant.
I used previous GlusterFS versions with SLES10 without any problems, but as RPMs are not provided for SuSE I presume it is not an officially supported distro. For that reason I am only using the CentOS machines as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that might help to prevent these NFS crashes.

-Dan.

nfs.log extract
------------------
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e) [0x2aaaab30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41) [0x2aaaab9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d) [0x2aaaab9b0bdd]))) : Assertion failed: fd->refcount
pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2aaaab9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2aaaab9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2aaaab30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38f2222a69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aaaaaeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2aaaaace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2aaaab521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aaaaaacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
---------

--
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building, 3 Earley Gate
University of Reading
Reading, RG6 6AL, UK
Tel. +44 118 378 5205
Fax: +44 118 378 6413
Thanks. I'll be looking into it. I've filed a bug at:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2061

You may add yourself to the CC list for notifications. It seems the crash is easily reproduced on your setup. Can you please post the log from the Gluster NFS process at TRACE log level to the bug?

Dan Bretherton wrote:
> [...]
> [2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor]
> nfsrpc: RPC program not available
>
> There are no apparent problems while these errors are being produced so
> this issue probably isn't relevant to the crashes.

Correct. That error is misleading and will be removed in 3.1.1.

Thanks
-Shehjar
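For capturing the TRACE-level log requested above: in GlusterFS 3.1 the NFS translator runs as its own glusterfs process started by glusterd, and the glusterfs binary accepts a --log-level option. One possible approach, sketched below, is to relaunch that process by hand; the volfile, pid, and log paths shown are assumptions for a stock CentOS 5.5 + 3.1 RPM install, so confirm the real ones from the running process's command line first.

```shell
# Find the NFS server process glusterd started and note its -f/-p/-l
# arguments (the paths below are assumptions -- use what ps reports).
ps -ef | grep '[g]lusterfs.*nfs'

# Stop it...
kill "$(cat /etc/glusterd/nfs/run/nfs.pid)"

# ...and relaunch it with the log level raised to TRACE.
/usr/sbin/glusterfs -f /etc/glusterd/nfs/nfs-server.vol \
                    -p /etc/glusterd/nfs/run/nfs.pid \
                    -l /var/log/glusterfs/nfs.log \
                    --log-level=TRACE
```

TRACE is extremely verbose, so watch the size of nfs.log and drop back to the default level once the crash has been captured.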