Hi Gene,
You have pasted the glustershd log, but we asked for the glusterd log; glusterd and
glustershd are different processes, so this information does not tell us why your
glusterd crashed. Could you paste the *glusterd* logs
(/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log*) to pastebin (not
into this mail thread) and share the pastebin link here? Could you also attach the
core file, or paste a backtrace from that core dump?
It would also be great if you could send us an sos report from the node where the crash happened.
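In case it helps, here is a minimal sketch of gathering the backtrace and the sos
report on CentOS/RHEL 6; the core file path below is illustrative and depends on
your kernel.core_pattern setting:

# install debugging symbols so the backtrace has function names
# (needs yum-utils and a repository that carries glusterfs-debuginfo)
debuginfo-install -y glusterfs

# dump a full backtrace of every thread from the core dump;
# replace /path/to/core.12345 with the actual core file on your node
gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterd /path/to/core.12345 > glusterd-backtrace.txt

# generate the sos report for this node (provided by the sos package)
sosreport

You can then paste glusterd-backtrace.txt along with the logs.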
Thanx,
~Gaurav
----- Original Message -----
From: "Gene Liverman" <gliverma at westga.edu>
To: "gluster-users" <gluster-users at gluster.org>
Sent: Friday, October 2, 2015 4:47:00 AM
Subject: Re: [Gluster-users] glusterd crashing
Sorry for the delay. Here is what's installed:
# rpm -qa | grep gluster
glusterfs-geo-replication-3.7.4-2.el6.x86_64
glusterfs-client-xlators-3.7.4-2.el6.x86_64
glusterfs-3.7.4-2.el6.x86_64
glusterfs-libs-3.7.4-2.el6.x86_64
glusterfs-api-3.7.4-2.el6.x86_64
glusterfs-fuse-3.7.4-2.el6.x86_64
glusterfs-server-3.7.4-2.el6.x86_64
glusterfs-cli-3.7.4-2.el6.x86_64
The cmd_history.log file is attached.
In glustershd.log I have filtered out a bunch of lines like the one below to make it
more readable. I had a node down for multiple days for maintenance, and another one
went down due to a hardware failure during that time as well.
[2015-10-01 00:16:09.643631] W [MSGID: 114031]
[client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-gv0-client-0: remote operation
failed. Path: <gfid:31f17f8c-6c96-4440-88c0-f813b3c8d364>
(31f17f8c-6c96-4440-88c0-f813b3c8d364) [No such file or directory]
I also filtered out a boatload of self-heal lines like these two:
[2015-10-01 15:14:14.851015] I [MSGID: 108026]
[afr-self-heal-metadata.c:56:__afr_selfheal_metadata_do] 0-gv0-replicate-0:
performing metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3
[2015-10-01 15:14:14.856392] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-gv0-replicate-0: Completed
metadata selfheal on f78a47db-a359-430d-a655-1d217eb848c3. source=0 sinks=1
[root@eapps-gluster01 glusterfs]# cat glustershd.log | grep -v 'remote operation failed' | grep -v 'self-heal'
[2015-09-27 08:46:56.893125] E [rpc-clnt.c:201:call_bail] 0-glusterfs: bailing
out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x6 sent = 2015-09-27
08:16:51.742731. timeout = 1800 for 127.0.0.1:24007
[2015-09-28 12:54:17.524924] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on
127.0.0.1:24007 failed (Connection reset by peer)
[2015-09-28 12:54:27.844374] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-28 12:57:03.485027] W [socket.c:588:__socket_rwv] 0-gv0-client-2: readv
on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-09-28 12:57:05.872973] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-09-28 12:57:38.490578] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on
127.0.0.1:24007 failed (No data available)
[2015-09-28 12:57:49.054475] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-28 13:01:12.062960] W [glusterfsd.c:1219:cleanup_and_exit]
(-->/lib64/libpthread.so.0() [0x3c65e07a51]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received
signum (15), shutting down
[2015-09-28 13:01:12.981945] I [MSGID: 100030] [glusterfsd.c:2301:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4 (args:
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/lib/glusterd/glustershd/run/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option
*replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-09-28 13:01:13.009171] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2015-09-28 13:01:13.092483] I [graph.c:269:gf_add_cmdline_options]
0-gv0-replicate-0: adding option 'node-uuid' for volume
'gv0-replicate-0' with value
'416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-09-28 13:01:13.100856] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 2
[2015-09-28 13:01:13.103995] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-0: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.114745] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-1: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.115725] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-0: changing port to 49152 (from 0)
[2015-09-28 13:01:13.125619] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-2: parent translators are ready, attempting connect on transport
[2015-09-28 13:01:13.132316] E [socket.c:2278:socket_connect_finish]
0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-09-28 13:01:13.132650] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0: Using
Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-28 13:01:13.133322] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to
gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-09-28 13:01:13.133365] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and Client
lk-version numbers are not same, reopening the fds
[2015-09-28 13:01:13.133782] I [MSGID: 108005] [afr-common.c:3998:afr_notify]
0-gv0-replicate-0: Subvolume 'gv0-client-0' came back up; going online.
[2015-09-28 13:01:13.133863] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server lk
version = 1
Final graph:
+------------------------------------------------------------------------------+
1: volume gv0-client-0
2: type protocol/client
3: option clnt-lk-version 1
4: option volfile-checksum 0
5: option volfile-key gluster/glustershd
6: option client-version 3.7.4
7: option process-uuid
eapps-gluster01-65147-2015/09/28-13:01:12:970131-gv0-client-0-0-0
8: option fops-version 1298437
9: option ping-timeout 42
10: option remote-host eapps-gluster01.uwg.westga.edu
11: option remote-subvolume /export/sdb1/gv0
12: option transport-type socket
13: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
14: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
15: end-volume
16:
17: volume gv0-client-1
18: type protocol/client
19: option ping-timeout 42
20: option remote-host eapps-gluster02.uwg.westga.edu
21: option remote-subvolume /export/sdb1/gv0
22: option transport-type socket
23: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
24: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
25: end-volume
26:
27: volume gv0-client-2
28: type protocol/client
29: option ping-timeout 42
30: option remote-host eapps-gluster03.uwg.westga.edu
31: option remote-subvolume /export/sdb1/gv0
32: option transport-type socket
33: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
34: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
35: end-volume
36:
37: volume gv0-replicate-0
38: type cluster/replicate
39: option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
46: subvolumes gv0-client-0 gv0-client-1 gv0-client-2
47: end-volume
48:
49: volume glustershd
50: type debug/io-stats
51: subvolumes gv0-replicate-0
52: end-volume
53:
+------------------------------------------------------------------------------+
[2015-09-28 13:01:13.154898] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed to get
the port number for remote subvolume. Please run 'gluster volume status'
on server to see if brick process is running.
[2015-09-28 13:01:13.155031] I [MSGID: 114018] [client.c:2042:client_rpc_notify]
0-gv0-client-2: disconnected from gv0-client-2. Client process will keep trying
to connect to glusterd until brick's port is available
[2015-09-28 13:01:13.155080] W [MSGID: 108001] [afr-common.c:4081:afr_notify]
0-gv0-replicate-0: Client-quorum is not met
[2015-09-29 08:11:24.728797] I [MSGID: 100011] [glusterfsd.c:1291:reincarnate]
0-glusterfsd: Fetching the volume file from server...
[2015-09-29 08:11:24.763338] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-09-29 12:50:41.915254] E [rpc-clnt.c:201:call_bail] 0-gv0-client-2:
bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0xd91f sent = 2015-09-29
12:20:36.092734. timeout = 1800 for 160.10.31.227:24007
[2015-09-29 12:50:41.923550] W [MSGID: 114032]
[client-handshake.c:1623:client_dump_version_cbk] 0-gv0-client-2: received RPC
status error [Transport endpoint is not connected]
[2015-09-30 23:54:36.547979] W [socket.c:588:__socket_rwv] 0-glusterfs: readv on
127.0.0.1:24007 failed (No data available)
[2015-09-30 23:54:46.812870] E [socket.c:2278:socket_connect_finish]
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-10-01 00:14:20.997081] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2015-10-01 00:15:36.770579] W [socket.c:588:__socket_rwv] 0-gv0-client-2: readv
on 160.10.31.227:24007 failed (Connection reset by peer)
[2015-10-01 00:15:37.906708] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-10-01 00:15:53.008130] W [glusterfsd.c:1219:cleanup_and_exit]
(-->/lib64/libpthread.so.0() [0x3b91807a51]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received
signum (15), shutting down
[2015-10-01 00:15:53.008697] I [timer.c:48:gf_timer_call_after]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x3e2) [0x3b9480f992]
-->/usr/lib64/libgfrpc.so.0(__save_frame+0x76) [0x3b9480f046]
-->/usr/lib64/libglusterfs.so.0(gf_timer_call_after+0x1b1) [0x3b93447881] )
0-timer: ctx cleanup started
[2015-10-01 00:15:53.994698] I [MSGID: 100030] [glusterfsd.c:2301:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.4 (args:
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/lib/glusterd/glustershd/run/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/9a9819e90404187e84e67b01614bbe10.socket --xlator-option
*replicate*.node-uuid=416d712a-06fc-4b3c-a92f-8c82145626ff)
[2015-10-01 00:15:54.020401] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2015-10-01 00:15:54.086777] I [graph.c:269:gf_add_cmdline_options]
0-gv0-replicate-0: adding option 'node-uuid' for volume
'gv0-replicate-0' with value
'416d712a-06fc-4b3c-a92f-8c82145626ff'
[2015-10-01 00:15:54.093004] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 2
[2015-10-01 00:15:54.098144] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-0: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.107432] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-1: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.115962] I [MSGID: 114020] [client.c:2118:notify]
0-gv0-client-2: parent translators are ready, attempting connect on transport
[2015-10-01 00:15:54.120474] E [socket.c:2278:socket_connect_finish]
0-gv0-client-1: connection to 160.10.31.64:24007 failed (Connection refused)
[2015-10-01 00:15:54.120639] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-0: changing port to 49152 (from 0)
Final graph:
+------------------------------------------------------------------------------+
1: volume gv0-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host eapps-gluster01.uwg.westga.edu
5: option remote-subvolume /export/sdb1/gv0
6: option transport-type socket
7: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
8: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
9: end-volume
10:
11: volume gv0-client-1
12: type protocol/client
13: option ping-timeout 42
14: option remote-host eapps-gluster02.uwg.westga.edu
15: option remote-subvolume /export/sdb1/gv0
16: option transport-type socket
17: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
18: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
19: end-volume
20:
21: volume gv0-client-2
22: type protocol/client
23: option ping-timeout 42
24: option remote-host eapps-gluster03.uwg.westga.edu
25: option remote-subvolume /export/sdb1/gv0
26: option transport-type socket
27: option username 0005f8fa-107a-4cc8-ac38-bb821c014c14
28: option password 379bae9a-6529-4564-a6f5-f5a9f7424d01
29: end-volume
30:
31: volume gv0-replicate-0
32: type cluster/replicate
33: option node-uuid 416d712a-06fc-4b3c-a92f-8c82145626ff
40: subvolumes gv0-client-0 gv0-client-1 gv0-client-2
41: end-volume
42:
43: volume glustershd
44: type debug/io-stats
45: subvolumes gv0-replicate-0
46: end-volume
47:
+------------------------------------------------------------------------------+
[2015-10-01 00:15:54.135650] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-0: Using
Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 00:15:54.136223] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-0: Connected to
gv0-client-0, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 00:15:54.136262] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-0: Server and Client
lk-version numbers are not same, reopening the fds
[2015-10-01 00:15:54.136410] I [MSGID: 108005] [afr-common.c:3998:afr_notify]
0-gv0-replicate-0: Subvolume 'gv0-client-0' came back up; going online.
[2015-10-01 00:15:54.136500] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-0: Server lk
version = 1
[2015-10-01 00:15:54.401702] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-2: failed to get
the port number for remote subvolume. Please run 'gluster volume status'
on server to see if brick process is running.
[2015-10-01 00:15:54.401834] I [MSGID: 114018] [client.c:2042:client_rpc_notify]
0-gv0-client-2: disconnected from gv0-client-2. Client process will keep trying
to connect to glusterd until brick's port is available
[2015-10-01 00:15:54.401878] W [MSGID: 108001] [afr-common.c:4081:afr_notify]
0-gv0-replicate-0: Client-quorum is not met
[2015-10-01 03:57:52.755426] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection refused)
[2015-10-01 13:50:49.000708] E [socket.c:2278:socket_connect_finish]
0-gv0-client-2: connection to 160.10.31.227:24007 failed (Connection timed out)
[2015-10-01 14:36:40.481673] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk] 0-gv0-client-1: failed to get
the port number for remote subvolume. Please run 'gluster volume status'
on server to see if brick process is running.
[2015-10-01 14:36:40.481833] I [MSGID: 114018] [client.c:2042:client_rpc_notify]
0-gv0-client-1: disconnected from gv0-client-1. Client process will keep trying
to connect to glusterd until brick's port is available
[2015-10-01 14:36:41.982037] I [rpc-clnt.c:1851:rpc_clnt_reconfig]
0-gv0-client-1: changing port to 49152 (from 0)
[2015-10-01 14:36:41.993478] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs] 0-gv0-client-1: Using
Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-10-01 14:36:41.994568] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-gv0-client-1: Connected to
gv0-client-1, attached to remote volume '/export/sdb1/gv0'.
[2015-10-01 14:36:41.994647] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-gv0-client-1: Server and Client
lk-version numbers are not same, reopening the fds
[2015-10-01 14:36:41.994899] I [MSGID: 108002] [afr-common.c:4077:afr_notify]
0-gv0-replicate-0: Client-quorum is met
[2015-10-01 14:36:42.002275] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-gv0-client-1: Server lk
version = 1
Thanks,
Gene Liverman
Systems Integration Architect
Information Technology Services
University of West Georgia
gliverma at westga.edu
ITS: Making Technology Work for You!
On Wed, Sep 30, 2015 at 10:54 PM, Gaurav Garg < ggarg at redhat.com >
wrote:
Hi Gene,
Could you paste or attach the core file, glusterd log file, and cmd history so we can
find the actual root cause (RCA) of the crash? What steps did you perform before the crash occurred?
>> How can I troubleshoot this?
If you want to troubleshoot this yourself, you can start by looking into the glusterd
log file and the core file.
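For example, something like this (the log name is given as a glob because it depends
on how glusterd was started) shows whether glusterd recorded a crash and where its
core file would land:

# glusterd normally writes a crash report beginning with "signal received"
grep -A 20 'signal received' /var/log/glusterfs/*glusterd.vol.log

# core file destination; by default cores land in the daemon's working directory
sysctl kernel.core_pattern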
Thank you.
Regards,
Gaurav
----- Original Message -----
From: "Gene Liverman" < gliverma at westga.edu >
To: gluster-users at gluster.org
Sent: Thursday, October 1, 2015 7:59:47 AM
Subject: [Gluster-users] glusterd crashing
In the last few days I've started having issues with my glusterd service
crashing. When it goes down it seems to do so on all nodes in my replicated
volume. How can I troubleshoot this? I'm on a mix of CentOS 6 and RHEL 6.
Thanks!
Gene Liverman
Systems Integration Architect
Information Technology Services
University of West Georgia
gliverma at westga.edu
Sent from Outlook on my iPhone
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmd_history.log
Type: text/x-log
Size: 147929 bytes
Desc: not available
URL:
<http://www.gluster.org/pipermail/gluster-users/attachments/20151002/ca2443eb/attachment-0001.bin>