I have been testing 3.1.2 over the last few days. My overall
impression is that it resolved several bugs from 3.1.1, but the latest
version is still prone to crashing under moderate to heavy loads.
I was running some stress tests on a two-server replicated setup today
with ~150 clients connected via RDMA. The glusterfsd process crashed
on one server. I waited about 30 minutes to see if the automatic
fail-over would work, but I continued to receive "Transport: endpoint
not connected" error messages on all the clients. I saw the following
error messages in the server log:
(I removed several hundred error messages from the following snippet)
[2011-01-21 15:10:13.804308] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x66540x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.supportdir-server)
[2011-01-21 15:10:13.804314] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x64658x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.supportdir-server)
[2011-01-21 15:10:13.804342] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 15:10:13.804365] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 15:10:13.804636] I [server.c:428:server_rpc_notify]
supportdir-server: disconnected connection from 192.168.50.7:1020
[2011-01-21 15:10:13.804702] I
[server-helpers.c:670:server_connection_destroy] supportdir-server:
destroyed connection of
n7-12719-2011/01/19-17:36:59:497983-supportdir-client-0
[2011-01-21 15:10:13.805028] I [server.c:428:server_rpc_notify]
supportdir-server: disconnected connection from 192.168.50.127:1020
[2011-01-21 15:10:13.805071] I
[server-helpers.c:670:server_connection_destroy] supportdir-server:
destroyed connection of
n127-12567-2011/01/19-17:43:17:468018-supportdir-client-0
pending frames:
patchset: v3.1.1-64-gf2a067c
signal received: 11
time of crash: 2011-01-21 15:10:13
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.2
/lib64/libc.so.6(+0x32a60)[0x7fc2a7f64a60]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/xlator/protocol/server.so(server_release+0x54)[0x7fc2a4f05454]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x26f)[0x7fc2a88d25ef]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpcsvc_notify+0x123)[0x7fc2a88d2c23]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpc_transport_notify+0x2d)[0x7fc2a88d6a9d]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(rdma_pollin_notify+0xd1)[0x7fc2a4ae68b1]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(rdma_process_recv+0x14b)[0x7fc2a4ae6e8b]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(+0xb226)[0x7fc2a4ae7226]
/lib64/libpthread.so.0(+0x6a4f)[0x7fc2a8298a4f]
/lib64/libc.so.6(clone+0x6d)[0x7fc2a800282d]
I think the crash is related to this bug:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197
I ran some smaller tests on a single-server setup. There were ~50
clients connected via RDMA. While the jobs were running, several of
them crashed with "File descriptor in bad state" or "Stale File
Descriptor" errors. Here are the error messages from the server log:
[2011-01-21 10:15:52.442908] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x16660x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.443012] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x20251x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.442949] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x77360x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.443351] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x26495832x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 40) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.445247] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x25199x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.445291] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x60907x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.maindir-server)
[2011-01-21 10:15:52.447572] I [server.c:428:server_rpc_notify]
maindir-server: disconnected connection from 192.168.50.116:1018
[2011-01-21 10:15:52.455116] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455227] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455325] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455436] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455896] I
[server-helpers.c:670:server_connection_destroy] maindir-server:
destroyed connection of
n116-14977-2011/01/20-12:43:18:128066-maindir-client-0
[2011-01-21 10:15:52.455610] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455659] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.455564] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 10:15:52.458581] I [server.c:428:server_rpc_notify]
maindir-server: disconnected connection from 192.168.50.19:1018
[2011-01-21 10:15:52.458677] I
[server-helpers.c:670:server_connection_destroy] maindir-server:
destroyed connection of
n19-15053-2011/01/20-12:38:13:243408-maindir-client-0
(I removed dozens of similar error messages)
The glusterfsd process did not crash in that instance.
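In case it helps anyone compare their own logs, here is a rough sketch of how entries like the ones above could be tallied by log level, function, and Proc number instead of trimmed by hand. It assumes the format shown in the snippets (a timestamp, a level letter, then [file:line:function]) and that an entry may be wrapped across several lines. The names join_entries and tally are my own and are not part of GlusterFS. The sample is copied from the first snippet. Note that non-log lines, such as the crash dump, would simply be appended to the preceding entry.

```python
import re
from collections import Counter

# An entry starts with "[YYYY-MM-DD HH:MM:SS.micro] <level letter>".
ENTRY_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} [\d:.]+\] [A-Z]\b")
# Header of a rejoined entry: [timestamp] LEVEL [file:line:function]
HEADER = re.compile(r"^\[([^\]]+)\] ([A-Z]) \[([^\]]+)\]")
PROC = re.compile(r"Proc: (\d+)")

SAMPLE = """\
[2011-01-21 15:10:13.804308] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x66540x, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport
(rdma.supportdir-server)
[2011-01-21 15:10:13.804342] E [server.c:137:server_submit_reply] :
Reply submission failed
[2011-01-21 15:10:13.804702] I
[server-helpers.c:670:server_connection_destroy] supportdir-server:
destroyed connection of
n7-12719-2011/01/19-17:36:59:497983-supportdir-client-0
"""


def join_entries(lines):
    """Rejoin log entries that were wrapped across several lines."""
    entries, current = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            if current is not None:
                entries.append(current)
            current = line
        elif current is not None:
            # Continuation of the previous entry.
            current += " " + line
    if current is not None:
        entries.append(current)
    return entries


def tally(entries):
    """Count entries by (level, function) and by Proc number."""
    by_func, by_proc = Counter(), Counter()
    for entry in entries:
        header = HEADER.match(entry)
        if not header:
            continue
        level = header.group(2)
        func = header.group(3).split(":")[-1]  # file:line:function
        by_func[(level, func)] += 1
        proc = PROC.search(entry)
        if proc:
            by_proc[int(proc.group(1))] += 1
    return by_func, by_proc


if __name__ == "__main__":
    by_func, by_proc = tally(join_entries(SAMPLE.splitlines()))
    for (level, func), count in by_func.most_common():
        print(level, func, count)
    print("by Proc:", dict(by_proc))
```

To run it against a real log, the SAMPLE lines would be replaced with the lines read from the server log file.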
Jeremy Stout
On Fri, Jan 21, 2011 at 6:49 AM, David Lloyd
<david.lloyd at v-consultants.co.uk> wrote:
> Hello,
>
> Haven't heard much feedback about installing glusterfs 3.1.2.
>
> Should I infer that it's all gone extremely very smoothly for everyone,
> or is everyone being as cowardly as me and waiting for others to do it first?
>
> Cheers
> David
>
> --
> David Lloyd
> V Consultants
> www.v-consultants.co.uk
> tel: +44 7983 816501
> skype: davidlloyd1243