Hi all,

I have an issue where our volume will not start from any node. When attempting to start the volume it will eventually return:

Error: Request timed out

For some time after that, the volume is locked and we either have to wait or restart Gluster services. In the glusterd.log, it shows the following:

[2017-12-15 18:00:12.423478] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b1/gv0
[2017-12-15 18:03:12.673885] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In gd_mgmt_v3_unlock_timer_cbk
[2017-12-15 18:06:34.304868] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: Received status volume req for volume gv0
[2017-12-15 18:06:34.306603] E [MSGID: 106301] [glusterd-syncop.c:1353:gd_stage_op_phase] 0-management: Staging of operation 'Volume Status' failed on localhost : Volume gv0 is not started
[2017-12-15 18:11:39.412700] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b2/gv0
[2017-12-15 18:11:42.405966] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b2/gv0 on port 49153
[2017-12-15 18:11:42.406415] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-12-15 18:11:42.406669] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b3/gv0
[2017-12-15 18:14:39.737192] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In gd_mgmt_v3_unlock_timer_cbk
[2017-12-15 18:35:20.856849] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b1/gv0 on port 49152
[2017-12-15 18:35:20.857508] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-12-15 18:35:20.858277] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b4/gv0
[2017-12-15 18:46:07.953995] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b3/gv0 on port 49154
[2017-12-15 18:46:07.954432] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-12-15 18:46:07.971355] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2017-12-15 18:46:07.989392] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2017-12-15 18:46:07.989543] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2017-12-15 18:46:07.989562] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped
[2017-12-15 18:46:07.989575] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2017-12-15 18:46:07.989601] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2017-12-15 18:46:08.003011] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2017-12-15 18:46:08.003039] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is stopped
[2017-12-15 18:46:08.003079] I [MSGID: 106567] [glusterd-svc-mgmt.c:197:glusterd_svc_start] 0-management: Starting glustershd service
[2017-12-15 18:46:09.005173] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2017-12-15 18:46:09.005569] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2017-12-15 18:46:09.005673] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2017-12-15 18:46:09.005689] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is stopped
[2017-12-15 18:46:09.005712] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2017-12-15 18:46:09.005892] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2017-12-15 18:46:09.005912] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
[2017-12-15 18:46:09.026559] I [socket.c:3672:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2017-12-15 18:46:09.026568] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2017-12-15 18:46:09.026582] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2017-12-15 18:56:17.962251] E [rpc-clnt.c:185:call_bail] 0-management: bailing out frame type(glusterd mgmt v3) op(--(4)) xid = 0x14 sent = 2017-12-15 18:46:09.005976. timeout = 600 for 10.17.100.208:24007
[2017-12-15 18:56:17.962324] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Commit failed on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.
[2017-12-15 18:56:17.962408] E [MSGID: 106123] [glusterd-mgmt.c:1677:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2017-12-15 18:56:17.962656] E [MSGID: 106123] [glusterd-mgmt.c:2209:glusterd_mgmt_v3_initiate_all_phases] 0-management: Commit Op Failed
[2017-12-15 18:56:17.964004] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.
[2017-12-15 18:56:17.965184] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on tpc-arbiter1-100617. Please check log file for details.
[2017-12-15 18:56:17.965277] E [MSGID: 106118] [glusterd-mgmt.c:2087:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2017-12-15 18:56:17.965372] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe5631) [0x7f48e44a1631] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe543e) [0x7f48e44a143e] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe4625) [0x7f48e44a0625] ) 0-management: Lock for vol gv0 not held
[2017-12-15 18:56:17.965394] E [MSGID: 106118] [glusterd-locks.c:356:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for vol gv0 on behalf of 711ffb0c-57b7-46ec-ba8d-185de969e6cc.
[2017-12-15 18:56:17.965409] E [MSGID: 106147] [glusterd-locks.c:483:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
[2017-12-15 18:56:17.965424] E [MSGID: 106118] [glusterd-mgmt.c:2240:glusterd_mgmt_v3_initiate_all_phases] 0-management: Failed to release mgmt_v3 locks on localhost
[2017-12-15 18:56:17.965469] I [socket.c:3672:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2017-12-15 18:56:17.965474] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
[2017-12-15 18:56:17.965486] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed

This issue started after a gluster volume stop followed by a reboot of all nodes. We also updated to the latest available in the CentOS repo and are at version 3.12.3. I'm not sure where to look, as the log doesn't seem to show me anything other than it just not working.

gluster peer status shows all peers connected across all nodes, and the firewall has all ports opened and was disabled for troubleshooting. The volume is distributed-replicated with arbiter, for a total of 3 nodes.

The volume is a production volume with over 120TB of data, so I'd really like to not have to start over with the volume. Anyone have any suggestions on where else to look?

Thank you!
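P.S. For reference, the commands involved are just the standard gluster CLI (volume name gv0 as in the logs above); the start is the one that eventually returns the timeout, the other two are what I checked afterwards:

# gluster volume start gv0
# gluster volume status gv0
# gluster peer status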
----- Original Message -----
> From: "Matt Waymack" <mwaymack at nsgdv.com>
> To: "gluster-users" <Gluster-users at gluster.org>
> Sent: Friday, December 15, 2017 2:15:48 PM
> Subject: [Gluster-users] Production Volume will not start
>
> Hi all,
>
> I have an issue where our volume will not start from any node. When
> attempting to start the volume it will eventually return:
> Error: Request timed out
>
> For some time after that, the volume is locked and we either have to wait
> or restart Gluster services.
> This issue started after a gluster volume stop followed by a reboot of all
> nodes. We also updated to the latest available in the CentOS repo and are at
> version 3.12.3. I'm not sure where to look as the log doesn't seem to show
> me anything other than it just not working.

Can you paste your glusterFS package versions and check your rpm DB for problems:

# rpm -qa | grep gluster
# yum check

> gluster peer status shows all peers connected across all nodes, firewall has
> all ports opened and was disabled for troubleshooting. The volume is a
> distributed-replicated with arbiter for a total of 3 nodes.

Check for:

-Selinux errors - is selinux enabled? If so, are you seeing any errors / avc denials? Maybe try disabling it if it's enabled, as a troubleshooting step.
-Brick mounts - are your bricks mounted? Can you cd to the mount and see your data? Is the .glusterfs directory on your bricks? Check your fstab and make sure everything is mounting properly. (A quick sketch of this check is at the end of this message.)
-NW errors - are your hostnames in /etc/hosts? If not, add them. Make sure there aren't any DNS errors. After the update, did you have any NIC name changes? Check for any NW differences.

If all these check out, try the following. On all nodes run:

# service glusterd stop
# killall glusterfs
# killall glusterfsd

Once all nodes have gluster stopped, the bricks are mounted, and the NW is good, restart glusterd on all nodes:

# service glusterd start
# service glusterd status

Make sure that glusterd is OK on all your nodes. Then:

# gluster v info
# gluster v status

Are you using any sort of quorum? You may want to try:

# gluster v start <your vol> force

If it still won't work, can you provide the brick logs from a failed volume start?

> The volume is a production volume with over 120TB of data so I'd really like
> to not have to start over with the volume. Anyone have any suggestions on
> where else to look?

Your data is still there on your bricks if you need to access it. Don't write to your bricks unless it's through a gluster mount, but in a pinch you can read / copy the back-end data. Don't panic though, I'm sure we can get your volume back up.

Some other things to check:

-Look in /var/lib/glusterd/vols and make sure your vol files are there.
-Check your messages files for any clues.
-Look for crashes in abrt or any corefiles in /tmp or wherever you have app cores dumped to.

Let me know what you find after running through this, it's gotta be something with the update.

-b
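P.S. A minimal sketch of the brick-mount check above, assuming /exp/b1 through /exp/b4 are the mount points behind the brick paths seen in your log (adjust to whatever your fstab actually uses):

# df -h /exp/b1 /exp/b2 /exp/b3 /exp/b4
# ls -ld /exp/b1/gv0/.glusterfs /exp/b2/gv0/.glusterfs /exp/b3/gv0/.glusterfs /exp/b4/gv0/.glusterfs

If a filesystem is missing from the df output, or any of the .glusterfs directories are absent, fix the mounts before restarting glusterd.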
On Sat, Dec 16, 2017 at 12:45 AM, Matt Waymack <mwaymack at nsgdv.com> wrote:

> Hi all,
>
> I have an issue where our volume will not start from any node. When
> attempting to start the volume it will eventually return:
> Error: Request timed out
> [2017-12-15 18:56:17.962251] E [rpc-clnt.c:185:call_bail] 0-management:
> bailing out frame type(glusterd mgmt v3) op(--(4)) xid = 0x14 sent =
> 2017-12-15 18:46:09.005976. timeout = 600 for 10.17.100.208:24007

There's a call bail here, which means glusterd was never able to get a cbk response back from nsgtpcfs02.corp.nsgdv.com. I am guessing you have ended up with a duplicate peerinfo entry for nsgtpcfs02.corp.nsgdv.com in the /var/lib/glusterd/peers folder on the node where the CLI failed. Can you please share the output of gluster peer status along with the content of "cat /var/lib/glusterd/peers/*" from all the nodes?
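To make that duplicate check easy to eyeball, a quick sketch run on each node (assuming the usual key=value peer files under /var/lib/glusterd/peers/, where the hostname lines normally look like hostname1=...):

# grep -H . /var/lib/glusterd/peers/*
# grep -h '^hostname' /var/lib/glusterd/peers/* | sort | uniq -c

Any hostname showing a count greater than 1, or two peer files carrying the same hostname, would point to the duplicate entry.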
Hi, thank you for the reply. Ultimately the volume did eventually start, about 1.5 hours after the volume start command was issued. Could it have something to do with the number of files on the volume?
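In case a rough per-brick file count is useful for answering that, something like this on each node (brick path as in the logs above, skipping the .glusterfs metadata tree) should give a ballpark figure:

# find /exp/b1/gv0 -path '*/.glusterfs' -prune -o -type f -print | wc -l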