Sorry for the delay. Here is what's installed:
# rpm -qa | grep gluster
glusterfs-geo-replication-3.7.4-2.el6.x86_64
glusterfs-client-xlators-3.7.4-2.el6.x86_64
glusterfs-3.7.4-2.el6.x86_64
glusterfs-libs-3.7.4-2.el6.x86_64
glusterfs-api-3.7.4-2.el6.x86_64
glusterfs-fuse-3.7.4-2.el6.x86_64
glusterfs-server-3.7.4-2.el6.x86_64
glusterfs-cli-3.7.4-2.el6.x86_64
The cmd_history.log file is attached.
In gluster.log I have filtered out a bunch of lines like the one below to
make it more readable. I had a node down for multiple days for maintenance,
and another one went down with a hardware failure during that time too.
[2015-10-01 00:16:09.643631] W [MSGID: 114031]
[client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-gv0-client-0: remote
operation failed. Path: <gfid:31f17f8c-6c96-4440-88c0-f813b3c8d364>
(31f17f8c-6c96-4440-88c0-f813b3c8d364) [No such file or directory]
I also filtered out a boatload of self-heal lines like these two:
[2015-10-01 15:14:14.851015] I [MSGID: 108026]
[afr-self-heal-metadata.c:56:__afr_selfheal_metadata_do] 0-gv0-replicate-0:
performing metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3
[2015-10-01 15:14:14.856392] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-gv0-replicate-0: Completed
metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3. source=0 sinks=1
[root@eapps-gluster01 glusterfs]# cat glustershd.log |grep -v 'remote operation failed' |grep -v 'self-heal'
[2015-09-27 08:46:56.893125] E [rpc-clnt.c:201:call_bail] 0-glusterfs:
bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x6 sent
2015-09-27 08:16:51.742731. timeout = 1800 for 127.0.0.1:24007
[2015-09-28 12:54:17.524924] W [socket.c:588:__socket_rwv] 0-glusterfs:
readv on 127.0.0.1:24007 failed (Connection reset by peer)
[2015-09-28 12:54:27.844374] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-28 12:57:03.485027] W [socket.c:588:__socket_rwv] 0-gv0-client-2:
readv on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-09-28 12:57:05.872973] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection
refused)
[2015-09-28 12:57:38.490578] W [socket.c:588:__socket_rwv] 0-glusterfs:
readv on 127.0.0.1:24007 failed (No data available)
[2015-09-28 12:57:49.054475] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-28 13:01:12.062960] W [glusterfsd.c:1219:cleanup_and_exit]
(-->/lib64/libpthread.so.0() [0x3c65e07a51]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received
signum (15), shutting down
[2015-09-28 13:01:12.981945] I [MSGID: 100030] [glusterfsd.c:2301:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4
(args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/lib/glusterd/glustershd/run/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option
*replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-09-28 13:01:13.009171] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2015-09-28 13:01:13.092483] I [graph.c:269:gf_add_cmdline_options]
0-gv0-replicate-0: adding option 'node-uuid' for volume 'gv0-replicate-0'
with value '416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-09-28 13:01:13.100856] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 2
[2015-09-28 13:01:13.103995] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-0: parent translators are ready, attempting connect on
transport
[2015-09-28 13:01:13.114745] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-1: parent translators are ready, attempting connect on
transport
[2015-09-28 13:01:13.115725] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-0: changing port to 49152 (from 0)
[2015-09-28 13:01:13.125619] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-2: parent translators are ready, attempting connect on
transport
[2015-09-28 13:01:13.132316] E [socket.c:2278:socket_connect_finish]
0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-09-28 13:01:13.132650] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0:
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-28 13:01:13.133322] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to
gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-09-28 13:01:13.133365] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and
Client lk-version numbers are not same, reopening the fds
[2015-09-28 13:01:13.133782] I [MSGID: 108005]
[afr-common.c:3998:afr_notify] 0-gv0-replicate-0: Subvolume 'gv0-client-0'
came back up; going online.
[2015-09-28 13:01:13.133863] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server
lk version = 1
Final graph:
+------------------------------------------------------------------------------+
1: volume gv0-client-0
2: type protocol/client
3: option clnt-lk-version 1
4: option volfile-checksum 0
5: option volfile-key gluster/glustershd
6: option client-version 3.7.4
7: option process-uuid eapps-gluster01-65147-2015/09/28-13:01:12:970131-gv0-client-0-0-0
8: option fops-version 1298437
9: option ping-timeout 42
10: option remote-host eapps-gluster01.uwg.westga.edu
11: option remote-subvolume /export/sdb1/gv0
12: option transport-type socket
13: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
14: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
15: end-volume
16:
17: volume gv0-client-1
18: type protocol/client
19: option ping-timeout 42
20: option remote-host eapps-gluster02.uwg.westga.edu
21: option remote-subvolume /export/sdb1/gv0
22: option transport-type socket
23: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
24: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
25: end-volume
26:
27: volume gv0-client-2
28: type protocol/client
29: option ping-timeout 42
30: option remote-host eapps-gluster03.uwg.westga.edu
31: option remote-subvolume /export/sdb1/gv0
32: option transport-type socket
33: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
34: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
35: end-volume
36:
37: volume gv0-replicate-0
38: type cluster/replicate
39: option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
46: subvolumes gv0-client-0 gv0-client-1 gv0-client-2
47: end-volume
48:
49: volume glustershd
50: type debug/io-stats
51: subvolumes gv0-replicate-0
52: end-volume
53:
+------------------------------------------------------------------------------+
[2015-09-28 13:01:13.154898] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed
to get the port number for remote subvolume. Please run 'gluster volume
status' on server to see if brick process is running.
[2015-09-28 13:01:13.155031] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-gv0-client-2: disconnected from
gv0-client-2. Client process will keep trying to connect to glusterd until
brick's port is available
[2015-09-28 13:01:13.155080] W [MSGID: 108001]
[afr-common.c:4081:afr_notify] 0-gv0-replicate-0: Client-quorum is not met
[2015-09-29 08:11:24.728797] I [MSGID: 100011]
[glusterfsd.c:1291:reincarnate] 0-glusterfsd: Fetching the volume file from
server...
[2015-09-29 08:11:24.763338] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-29 12:50:41.915254] E [rpc-clnt.c:201:call_bail] 0-gv0-client-2:
bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0xd91f sent = 2015-09-29
12:20:36.092734. timeout = 1800 for 160.10.31.227:24007
[2015-09-29 12:50:41.923550] W [MSGID: 114032]
[client-handshake.c:1623:client_dump_version_cbk] 0-gv0-client-2: received
RPC status error [Transport endpoint is not connected]
[2015-09-30 23:54:36.547979] W [socket.c:588:__socket_rwv] 0-glusterfs:
readv on 127.0.0.1:24007 failed (No data available)
[2015-09-30 23:54:46.812870] E [socket.c:2278:socket_connect_finish]
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-10-01 00:14:20.997081] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-10-01 00:15:36.770579] W [socket.c:588:__socket_rwv] 0-gv0-client-2:
readv on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-10-01 00:15:37.906708] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection
refused)
[2015-10-01 00:15:53.008130] W [glusterfsd.c:1219:cleanup_and_exit]
(-->/lib64/libpthread.so.0() [0x3b91807a51]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received
signum (15), shutting down
[2015-10-01 00:15:53.008697] I [timer.c:48:gf_timer_call_after]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x3e2) [0x3b9480f992]
-->/usr/lib64/libgfrpc.so.0(__save_frame+0x76) [0x3b9480f046]
-->/usr/lib64/libglusterfs.so.0(gf_timer_call_after+0x1b1) [0x3b93447881] )
0-timer: ctx cleanup started
[2015-10-01 00:15:53.994698] I [MSGID: 100030] [glusterfsd.c:2301:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4
(args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/lib/glusterd/glustershd/run/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option
*replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-10-01 00:15:54.020401] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2015-10-01 00:15:54.086777] I [graph.c:269:gf_add_cmdline_options]
0-gv0-replicate-0: adding option 'node-uuid' for volume 'gv0-replicate-0'
with value '416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-10-01 00:15:54.093004] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 2
[2015-10-01 00:15:54.098144] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-0: parent translators are ready, attempting connect on
transport
[2015-10-01 00:15:54.107432] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-1: parent translators are ready, attempting connect on
transport
[2015-10-01 00:15:54.115962] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-2: parent translators are ready, attempting connect on
transport
[2015-10-01 00:15:54.120474] E [socket.c:2278:socket_connect_finish]
0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-10-01 00:15:54.120639] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-0: changing port to 49152 (from 0)
Final graph:
+------------------------------------------------------------------------------+
1: volume gv0-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host eapps-gluster01.uwg.westga.edu
5: option remote-subvolume /export/sdb1/gv0
6: option transport-type socket
7: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
8: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
9: end-volume
10:
11: volume gv0-client-1
12: type protocol/client
13: option ping-timeout 42
14: option remote-host eapps-gluster02.uwg.westga.edu
15: option remote-subvolume /export/sdb1/gv0
16: option transport-type socket
17: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
18: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
19: end-volume
20:
21: volume gv0-client-2
22: type protocol/client
23: option ping-timeout 42
24: option remote-host eapps-gluster03.uwg.westga.edu
25: option remote-subvolume /export/sdb1/gv0
26: option transport-type socket
27: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
28: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
29: end-volume
30:
31: volume gv0-replicate-0
32: type cluster/replicate
33: option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
40: subvolumes gv0-client-0 gv0-client-1 gv0-client-2
41: end-volume
42:
43: volume glustershd
44: type debug/io-stats
45: subvolumes gv0-replicate-0
46: end-volume
47:
+------------------------------------------------------------------------------+
[2015-10-01 00:15:54.135650] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0:
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 00:15:54.136223] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to
gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 00:15:54.136262] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and
Client lk-version numbers are not same, reopening the fds
[2015-10-01 00:15:54.136410] I [MSGID: 108005]
[afr-common.c:3998:afr_notify] 0-gv0-replicate-0: Subvolume 'gv0-client-0'
came back up; going online.
[2015-10-01 00:15:54.136500] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server
lk version = 1
[2015-10-01 00:15:54.401702] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed
to get the port number for remote subvolume. Please run 'gluster volume
status' on server to see if brick process is running.
[2015-10-01 00:15:54.401834] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-gv0-client-2: disconnected from
gv0-client-2. Client process will keep trying to connect to glusterd until
brick's port is available
[2015-10-01 00:15:54.401878] W [MSGID: 108001]
[afr-common.c:4081:afr_notify] 0-gv0-replicate-0: Client-quorum is not met
[2015-10-01 03:57:52.755426] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection
refused)
[2015-10-01 13:50:49.000708] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection timed
out)
[2015-10-01 14:36:40.481673] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-1: failed
to get the port number for remote subvolume. Please run 'gluster volume
status' on server to see if brick process is running.
[2015-10-01 14:36:40.481833] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-gv0-client-1: disconnected from
gv0-client-1. Client process will keep trying to connect to glusterd until
brick's port is available
[2015-10-01 14:36:41.982037] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-1: changing port to 49152 (from 0)
[2015-10-01 14:36:41.993478] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-1:
Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 14:36:41.994568] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-1: Connected to
gv0-client-1, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 14:36:41.994647] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-1: Server and
Client lk-version numbers are not same, reopening the fds
[2015-10-01 14:36:41.994899] I [MSGID: 108002]
[afr-common.c:4077:afr_notify] 0-gv0-replicate-0: Client-quorum is met
[2015-10-01 14:36:42.002275] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-1: Server
lk version = 1
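
For anyone who wants to repeat the filtering above and get a quick overview
of the result, here is a minimal Python sketch. It assumes each record sits
on a single line in the real log file (the wrapping above comes from the
mail client) and that records start with "[timestamp] LEVEL" as in the
excerpts. The function names filter_lines and summarize are made up for
illustration; the two MSGIDs are the ones that accompany the "Client-quorum
is not met" / "Client-quorum is met" messages in the output above.

```python
import re
from collections import Counter

# Substrings dropped by the two "grep -v" calls in the command above.
EXCLUDE = ("remote operation failed", "self-heal")

# Record header as seen in the excerpts: "[YYYY-MM-DD HH:MM:SS.ffffff] L ".
HEADER = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) ")
MSGID = re.compile(r"\[MSGID: (\d+)\]")

# MSGIDs copied from the log above, mapped to the state they report.
QUORUM = {"108001": "not met", "108002": "met"}


def filter_lines(lines, exclude=EXCLUDE):
    """Yield only lines that contain none of the excluded substrings."""
    for line in lines:
        if not any(text in line for text in exclude):
            yield line


def summarize(lines):
    """Return (records per severity letter, [(timestamp, quorum state), ...])."""
    levels = Counter()
    quorum_events = []
    for line in lines:
        header = HEADER.match(line)
        if header is None:
            continue  # not a record header (graph dump, blank line, ...)
        timestamp, level = header.groups()
        levels[level] += 1
        msgid = MSGID.search(line)
        if msgid and msgid.group(1) in QUORUM:
            quorum_events.append((timestamp, QUORUM[msgid.group(1)]))
    return levels, quorum_events
```

To use it, open /var/log/glusterfs/glustershd.log (the path from the -l
argument above) and call summarize(filter_lines(handle)) on the open file.
The quorum list makes it easier to line up the windows where client-quorum
was lost with the times nodes were down.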
Thanks,
*Gene Liverman*
Systems Integration Architect
Information Technology Services
University of West Georgia
gliverma at westga.edu
ITS: Making Technology Work for You!
On Wed, Sep 30, 2015 at 10:54 PM, Gaurav Garg <ggarg at redhat.com> wrote:
> Hi Gene,
>
> Could you paste or attach core file/glusterd log file/cmd history to find
> out actual RCA of the crash. What steps you performed for this crash.
>
> >> How can I troubleshoot this?
>
> If you want to troubleshoot this then you can look into the glusterd log
> file, core file.
>
> Thank you..
>
> Regards,
> Gaurav
>
> ----- Original Message -----
> From: "Gene Liverman" <gliverma at westga.edu>
> To: gluster-users at gluster.org
> Sent: Thursday, October 1, 2015 7:59:47 AM
> Subject: [Gluster-users] glusterd crashing
>
> In the last few days I've started having issues with my glusterd service
> crashing. When it goes down it seems to do so on all nodes in my replicated
> volume. How can I troubleshoot this? I'm on a mix of CentOS 6 and RHEL 6.
> Thanks!
>
>
>
> Gene Liverman
> Systems Integration Architect
> Information Technology Services
> University of West Georgia
> gliverma at westga.edu
>
>
> Sent from Outlook on my iPhone
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmd_history.log
Type: application/octet-stream
Size: 147929 bytes
Desc: not available
URL:
<http://www.gluster.org/pipermail/gluster-users/attachments/20151001/138e2b91/attachment.obj>