I had a look at our Zabbix monitoring, and the high CPU usage is very
obvious.
In the logs you can see the GlusterFS client losing the connection:
[2012-09-25 12:33:45.589916] C
[client-handshake.c:126:rpc_client_ping_timer_expired] 0-vol0-client-3:
server 127.0.0.1:24009 has not responded in the last 42 seconds,
disconnecting.
[2012-09-25 12:33:45.671106] E [rpc-clnt.c:373:saved_frames_unwind]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f6909b175b0]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0)
[0x7f6909b17220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x7f6909b1714e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS
3.1) op(FINODELK(30)) called at 2012-09-25 12:33:01.396928 (xid=0x73201808x)
[2012-09-25 12:33:45.671134] W
[client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote
operation failed: Transport endpoint is not connected
[2012-09-25 12:33:45.671197] E [rpc-clnt.c:373:saved_frames_unwind]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f6909b175b0]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0)
[0x7f6909b17220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x7f6909b1714e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS
Handshake) op(PING(3)) called at 2012-09-25 12:33:03.587430
(xid=0x73201809x)
[2012-09-25 12:33:45.675973] W [client-handshake.c:275:client_ping_cbk]
0-vol0-client-3: timer must have expired
[2012-09-25 12:33:45.683852] I [client.c:2090:client_rpc_notify]
0-vol0-client-3: disconnected
[2012-09-25 12:33:45.691006] W [client3_1-fops.c:5267:client3_1_finodelk]
0-vol0-client-3: (a670c9bc-7d60-4319-99df-cccd1f4af368) remote_fd is -1.
EBADFD
[2012-09-25 12:35:24.766320] W [client3_1-fops.c:5267:client3_1_finodelk]
0-vol0-client-3: (f879cc43-5107-4937-9505-89752f06d8f3) remote_fd is -1.
EBADFD
[2012-09-25 13:04:06.762987] E [rpc-clnt.c:208:call_bail] 0-vol0-client-3:
bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x73201810x sent 2012-09-25
12:33:56.652483. timeout = 1800
[2012-09-25 13:04:06.763024] W
[client-handshake.c:1819:client_dump_version_cbk] 0-vol0-client-3: received
RPC status error
[2012-09-25 14:19:48.059956] E [rpc-clnt.c:208:call_bail] 0-vol0-client-1:
bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 0x68744748x sent
= 2012-09-25 13:49:40.407493. timeout = 1800
[2012-09-25 14:19:48.059995] W [client3_1-fops.c:2630:client3_1_lookup_cbk]
0-vol0-client-1: remote operation failed: Transport endpoint is not
connected. Path: /instances/instance-00000035/disk
(ea2f993e-7106-4f56-b362-974de56d33ef)
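
The 42 seconds in the first log line is the client ping timeout: the brick
stopped answering RPC pings for that long, so the client dropped the
connection. As a rough sketch (the volume name vol0 is taken from the
0-vol0-client-3 log prefix), the relevant option can be inspected and, if
needed, raised with the normal volume commands, although that only papers
over the underlying hang:

  # list the volume and any non-default options currently set
  gluster volume info vol0

  # raise the client ping timeout from the default 42 seconds (example value)
  gluster volume set vol0 network.ping-timeout 120
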
2012/9/25 Christian Wittwer <wittwerch at gmail.com>
> Hi everybody,
> We run a 4-brick Gluster cluster (replicate+distribute) on Ubuntu
> 12.04 with Gluster 3.3.0-1. The filesystem is ext4. It had been running
> fine since the release of Gluster 3.3.
> But during the last 3-4 weeks we have seen a strange problem occurring
> over and over again. Out of nowhere the Gluster daemon on a brick stops
> responding. The process is still there, but all Gluster clients lose the
> connection.
> If I look at the cmd "top", I see the daemon running at around 1200% CPU
> usage (16-core server). But the CPU column in "ps aux" shows around 0% CPU
> usage.
>
> I think we found a bug in Gluster (or at least I hope so). Is it a known
> bug?
> Can you advise what exactly you need for a bug report?
>
> Currently we solve the problem with a reboot of the whole server. A kill
> is not enough, as the process goes into the "defunct" state and is not
> killable at all.
>
> Cheers,
> Christian
>
>
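
Regarding the top vs. "ps aux" discrepancy and the question about what a bug
report needs, a per-thread view plus a state dump and backtrace of the stuck
brick process are usually the most useful pieces. A rough sketch, with <pid>
as a placeholder for the glusterfsd PID of the affected brick (capture this
while the process is still spinning, before trying to kill it):

  # per-thread CPU usage: top sums all threads of a process, while the
  # %CPU column of "ps aux" is averaged over the process lifetime, which
  # is why top shows ~1200% and ps shows ~0%
  top -H -p <pid>
  ps -L -o pid,lwp,pcpu,stat,wchan:30,comm -p <pid>

  # ask the brick process for a state dump (glusterfsd handles SIGUSR1;
  # on 3.3 the dump is typically written under /tmp)
  kill -USR1 <pid>

  # full thread backtrace for the bug report (needs gdb and debug symbols)
  gdb -p <pid> -batch -ex 'thread apply all bt full'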