Hi Lana,
I need some clarifications about test setup:
* Are you seeing the problem when there are more than 25 clients? If so, are
these clients on different physical nodes, or do multiple clients share the
same node? In other words, across how many physical nodes are the clients in
your test setup mounted? Also, are you running the command on each of these
clients simultaneously?
* Or is it that there are more than 25 concurrent invocations of the script?
If so, how many clients are present in your test setup, and across how many
physical nodes are they mounted?
Regards,
----- Original Message -----
From: "Lana Deere" <lana.deere at gmail.com>
To: gluster-users at gluster.org
Sent: Saturday, December 4, 2010 12:13:30 AM
Subject: [Gluster-users] 3.1.1 crashing under moderate load
I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
transport, native/fuse access.
I have a directory which is shared on the gluster. In fact, it is a clone
of /lib from one of the clients, shared so all can see it.
I have a script which does
find lib -type f -print0 | xargs -0 sum | md5sum
If I run this on my clients one at a time, they all yield the same md5sum:
for h in <<hosts>>; do ssh "$h" script; done
If I run this on my clients concurrently, up to roughly 25 at a time, they
still yield the same md5sum:
for h in <<hosts>>; do ssh "$h" script & done
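As an aside, one way to make the divergent runs easier to spot than eyeballing 25+ interleaved outputs is to capture each host's md5sum into its own file and count distinct values afterward. This is only a sketch under my own assumptions: the `compare_sums` helper, the `sums` directory, and the commented gathering loop are hypothetical, not part of the original script.

```shell
#!/bin/sh
# Sketch: gather each client's md5sum into one file per host, then count
# distinct checksums; more than one distinct value indicates a failure.
# The gathering loop below is illustrative only (hosts list elided):
#   mkdir -p sums
#   for h in <<hosts>>; do ssh "$h" script > "sums/$h" & done
#   wait

# compare_sums DIR: report whether every per-host result file in DIR agrees.
compare_sums() {
    # tr strips the leading spaces some wc implementations emit
    distinct=$(cat "$1"/* | sort -u | wc -l | tr -d ' ')
    if [ "$distinct" -eq 1 ]; then
        echo "all hosts agree"
    else
        echo "mismatch: $distinct distinct checksums"
    fi
}
```

Running `compare_sums sums` after the concurrent loop has finished would then show at a glance whether any client produced a different checksum.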
Beyond that, the gluster share often, but not always, fails. The errors vary:
- sometimes I get "sum: xxx.so not found"
- sometimes I get the wrong checksum without any error message
- sometimes the job simply hangs until I kill it
Some of the server logs show messages like these from the time of the
failures (other servers show nothing from around that time):
[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e82, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e83, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
Reply submission failed
On a client which had a failure I see messages like:
[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
(peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I
[client-handshake.c:993:select_server_supported_programs]
RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
Version (310)
[2010-12-03 10:03:06.27029] I
[client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I
[client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
fds open - Delaying child_up until they are re-opened
Has anyone else seen anything like this, and/or does anyone have suggestions
for options I can set to work around it?
.. Lana (lana.deere at gmail.com)
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users