Avra Sengupta
2017-Feb-21 06:22 UTC
[Gluster-users] Failed snapshot clone leaving undeletable orphaned volume on a single peer
Hi D,

We tried reproducing the issue with a similar setup but were unable to do so. We are still investigating it.

I have another follow-up question. You said that the repo exists only on s0? If that were the case, then bringing glusterd down on s0 only, deleting the repo and starting glusterd once again would have removed it. The fact that the repo is restored as soon as glusterd restarts on s0 means that some other node(s) in the cluster also has that repo and is passing that information to the glusterd on s0 during the handshake. Could you please confirm whether any node apart from s0 has that particular repo (/var/lib/glusterd/vols/data-teste) or not. Thanks.

Regards,
Avra

On 02/20/2017 06:51 PM, Gambit15 wrote:
> Hi Avra,
>
> On 20 February 2017 at 02:51, Avra Sengupta <asengupt at redhat.com> wrote:
>
> Hi D,
>
> It seems you tried to take a clone of a snapshot, when that
> snapshot was not activated.
>
>
> Correct. As per my commands, I then noticed the issue, checked the
> snapshot's status & activated it. I included this in my command
> history just to clear up any doubts from the logs.
>
> However in this scenario, the cloned volume should not be in an
> inconsistent state. I will try to reproduce this and see if it's a
> bug. Meanwhile could you please answer the following queries:
> 1. How many nodes were in the cluster?
>
>
> There are 4 nodes in a (2+1)x2 setup.
> s0 replicates to s1, with an arbiter on s2, and s2 replicates to s3,
> with an arbiter on s0.
>
> 2. How many bricks does the snapshot
> data-bck_GMT-2017.02.09-14.15.43 have?
>
> 6 bricks, including the 2 arbiters.
>
> 3. Was the snapshot clone command issued from a node which did not
> have any bricks for the snapshot data-bck_GMT-2017.02.09-14.15.43?
>
>
> All commands were issued from s0. All volumes have bricks on every
> node in the cluster.
>
> 4. I see you tried to delete the new cloned volume. Did the new
> cloned volume land in this state after failure to create the clone
> or failure to delete the clone?
>
>
> I noticed there was something wrong as soon as I created the clone.
> The clone command completed, however I was then unable to do anything
> with it because the clone didn't exist on s1-s3.
>
>
> If you want to remove the half-baked volume from the cluster,
> please proceed with the following steps.
> 1. Bring down glusterd on all nodes by running the following
> command on all nodes:
> $ systemctl stop glusterd
> Verify that glusterd is down on all nodes by running the
> following command on all nodes:
> $ systemctl status glusterd
> 2. Delete the following repo from all the nodes (whichever nodes
> it exists on):
> /var/lib/glusterd/vols/data-teste
>
>
> The repo only exists on s0, but stopping glusterd on only s0 &
> deleting the directory didn't work; the directory was restored as soon
> as glusterd was restarted. I haven't yet tried stopping glusterd on
> *all* nodes before doing this, although I'll need to plan for that, as
> it'll take the entire cluster off the air.
>
> Thanks for the reply,
> Doug
>
>
> Regards,
> Avra
>
>
> On 02/16/2017 08:01 PM, Gambit15 wrote:
>> Hey guys,
>> I tried to create a new volume from a cloned snapshot yesterday,
>> however something went wrong during the process & I'm now stuck
>> with the new volume being created on the server I ran the
>> commands on (s0), but not on the rest of the peers. I'm unable to
>> delete this new volume from the server, as it doesn't exist on
>> the peers.
>>
>> What do I do?
>> Any insights into what may have gone wrong?
>>
>> CentOS 7.3.1611
>> Gluster 3.8.8
>>
>> The command history & extract from etc-glusterfs-glusterd.vol.log
>> are included below.
>>
>> gluster volume list
>> gluster snapshot list
>> gluster snapshot clone data-teste data-bck_GMT-2017.02.09-14.15.43
>> gluster volume status data-teste
>> gluster volume delete data-teste
>> gluster snapshot create teste data
>> gluster snapshot clone data-teste teste_GMT-2017.02.15-12.44.04
>> gluster snapshot status
>> gluster snapshot activate teste_GMT-2017.02.15-12.44.04
>> gluster snapshot clone data-teste teste_GMT-2017.02.15-12.44.04
>>
>>
>> [2017-02-15 12:43:21.667403] I [MSGID: 106499] [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management: Received status volume req for volume data-teste
>> [2017-02-15 12:43:21.682530] E [MSGID: 106301] [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of operation 'Volume Status' failed on localhost : Volume data-teste is not started
>> [2017-02-15 12:43:43.633031] I [MSGID: 106495] [glusterd-handler.c:3128:__glusterd_handle_getwd] 0-glusterd: Received getwd req
>> [2017-02-15 12:43:43.640597] I [run.c:191:runner_log] (-->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xcc4b2) [0x7ffb396a14b2] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xcbf65) [0x7ffb396a0f65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7ffb44ec31c5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/delete/post/S57glusterfind-delete-post --volname=data-teste
>> [2017-02-15 13:05:20.103423] E [MSGID: 106122] [glusterd-snapshot.c:2397:glusterd_snapshot_clone_prevalidate] 0-management: Failed to pre validate
>> [2017-02-15 13:05:20.103464] E [MSGID: 106443] [glusterd-snapshot.c:2413:glusterd_snapshot_clone_prevalidate] 0-management: One or more bricks are not running. Please run snapshot status command to see brick status. Please start the stopped brick and then issue snapshot clone command
>> [2017-02-15 13:05:20.103481] W [MSGID: 106443] [glusterd-snapshot.c:8563:glusterd_snapshot_prevalidate] 0-management: Snapshot clone pre-validation failed
>> [2017-02-15 13:05:20.103492] W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
>> [2017-02-15 13:05:20.103503] E [MSGID: 106122] [glusterd-mgmt.c:884:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Snapshot on local node
>> [2017-02-15 13:05:20.103514] E [MSGID: 106122] [glusterd-mgmt.c:2243:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed
>> [2017-02-15 13:05:20.103531] E [MSGID: 106027] [glusterd-snapshot.c:8118:glusterd_snapshot_clone_postvalidate] 0-management: unable to find clone data-teste volinfo
>> [2017-02-15 13:05:20.103542] W [MSGID: 106444] [glusterd-snapshot.c:9063:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
>> [2017-02-15 13:05:20.103561] W [MSGID: 106121] [glusterd-mgmt.c:351:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
>> [2017-02-15 13:05:20.103572] E [MSGID: 106121] [glusterd-mgmt.c:1660:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
>> [2017-02-15 13:05:20.103582] E [MSGID: 106122] [glusterd-mgmt.c:2363:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
>> [2017-02-15 13:11:15.862858] W [MSGID: 106057] [glusterd-snapshot-utils.c:410:glusterd_snap_volinfo_find] 0-management: Snap volume c3ceae3889484e96ab8bed69593cf6d3.s0.run-gluster-snaps-c3ceae3889484e96ab8bed69593cf6d3-brick1-data-brick not found [Argumento inválido]
>> [2017-02-15 13:11:16.314759] I [MSGID: 106143] [glusterd-pmap.c:250:pmap_registry_bind] 0-pmap: adding brick /run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick1/data/brick on port 49452
>> [2017-02-15 13:11:16.316090] I [rpc-clnt.c:1046:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>> [2017-02-15 13:11:16.348867] W [MSGID: 106057] [glusterd-snapshot-utils.c:410:glusterd_snap_volinfo_find] 0-management: Snap volume c3ceae3889484e96ab8bed69593cf6d3.s0.run-gluster-snaps-c3ceae3889484e96ab8bed69593cf6d3-brick6-data-arbiter not found [Argumento inválido]
>> [2017-02-15 13:11:16.558878] I [MSGID: 106143] [glusterd-pmap.c:250:pmap_registry_bind] 0-pmap: adding brick /run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick6/data/arbiter on port 49453
>> [2017-02-15 13:11:16.559883] I [rpc-clnt.c:1046:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>> [2017-02-15 13:11:23.279721] E [MSGID: 106030] [glusterd-snapshot.c:4736:glusterd_take_lvm_snapshot] 0-management: taking snapshot of the brick (/run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick1/data/brick) of device /dev/mapper/v0.dc0.cte--g0-c3ceae3889484e96ab8bed69593cf6d3_0 failed
>> [2017-02-15 13:11:23.279790] E [MSGID: 106030] [glusterd-snapshot.c:5135:glusterd_take_brick_snapshot] 0-management: Failed to take snapshot of brick s0:/run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick1/data/brick
>> [2017-02-15 13:11:23.279806] E [MSGID: 106030] [glusterd-snapshot.c:6484:glusterd_take_brick_snapshot_task] 0-management: Failed to take backend snapshot for brick s0:/run/gluster/snaps/data-teste/brick1/data/brick volume(data-teste)
>> [2017-02-15 13:11:23.286678] E [MSGID: 106030] [glusterd-snapshot.c:4736:glusterd_take_lvm_snapshot] 0-management: taking snapshot of the brick (/run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick6/data/arbiter) of device /dev/mapper/v0.dc0.cte--g0-c3ceae3889484e96ab8bed69593cf6d3_1 failed
>> [2017-02-15 13:11:23.286735] E [MSGID: 106030] [glusterd-snapshot.c:5135:glusterd_take_brick_snapshot] 0-management: Failed to take snapshot of brick s0:/run/gluster/snaps/c3ceae3889484e96ab8bed69593cf6d3/brick6/data/arbiter
>> [2017-02-15 13:11:23.286749] E [MSGID: 106030] [glusterd-snapshot.c:6484:glusterd_take_brick_snapshot_task] 0-management: Failed to take backend snapshot for brick s0:/run/gluster/snaps/data-teste/brick6/data/arbiter volume(data-teste)
>> [2017-02-15 13:11:23.286793] E [MSGID: 106030] [glusterd-snapshot.c:6626:glusterd_schedule_brick_snapshot] 0-management: Failed to create snapshot
>> [2017-02-15 13:11:23.286813] E [MSGID: 106441] [glusterd-snapshot.c:6796:glusterd_snapshot_clone_commit] 0-management: Failed to take backend snapshot data-teste
>> [2017-02-15 13:11:25.530666] E [MSGID: 106442] [glusterd-snapshot.c:8308:glusterd_snapshot] 0-management: Failed to clone snapshot
>> [2017-02-15 13:11:25.530721] W [MSGID: 106123] [glusterd-mgmt.c:272:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
>> [2017-02-15 13:11:25.530735] E [MSGID: 106123] [glusterd-mgmt.c:1427:glusterd_mgmt_v3_commit] 0-management: Commit failed for operation Snapshot on local node
>> [2017-02-15 13:11:25.530749] E [MSGID: 106123] [glusterd-mgmt.c:2304:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed
>> [2017-02-15 13:11:25.532312] E [MSGID: 106027] [glusterd-snapshot.c:8118:glusterd_snapshot_clone_postvalidate] 0-management: unable to find clone data-teste volinfo
>> [2017-02-15 13:11:25.532339] W [MSGID: 106444] [glusterd-snapshot.c:9063:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
>> [2017-02-15 13:11:25.532353] W [MSGID: 106121] [glusterd-mgmt.c:351:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
>> [2017-02-15 13:11:25.532367] E [MSGID: 106121] [glusterd-mgmt.c:1660:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
>> [2017-02-15 13:11:25.532381] E [MSGID: 106122] [glusterd-mgmt.c:2363:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
>> [2017-02-15 13:29:53.779020] E [MSGID: 106062] [glusterd-snapshot-utils.c:2391:glusterd_snap_create_use_rsp_dict] 0-management: failed to get snap UUID
>> [2017-02-15 13:29:53.779073] E [MSGID: 106099] [glusterd-snapshot-utils.c:2507:glusterd_snap_use_rsp_dict] 0-glusterd: Unable to use rsp dict
>> [2017-02-15 13:29:53.779096] E [MSGID: 106108] [glusterd-mgmt.c:1305:gd_mgmt_v3_commit_cbk_fn] 0-management: Failed to aggregate response from node/brick
>> [2017-02-15 13:29:53.779136] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Commit failed on s3. Please check log file for details.
>> [2017-02-15 13:29:54.136196] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Commit failed on s1. Please check log file for details.
>> The message "E [MSGID: 106108] [glusterd-mgmt.c:1305:gd_mgmt_v3_commit_cbk_fn] 0-management: Failed to aggregate response from node/brick" repeated 2 times between [2017-02-15 13:29:53.779096] and [2017-02-15 13:29:54.535080]
>> [2017-02-15 13:29:54.535098] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Commit failed on s2. Please check log file for details.
>> [2017-02-15 13:29:54.535320] E [MSGID: 106123] [glusterd-mgmt.c:1490:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
>> [2017-02-15 13:29:54.535370] E [MSGID: 106123] [glusterd-mgmt.c:2304:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed
>> [2017-02-15 13:29:54.539708] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Post Validation failed on s1. Please check log file for details.
>> [2017-02-15 13:29:54.539797] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Post Validation failed on s3. Please check log file for details.
>> [2017-02-15 13:29:54.539856] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Post Validation failed on s2. Please check log file for details.
>> [2017-02-15 13:29:54.540224] E [MSGID: 106121] [glusterd-mgmt.c:1713:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed on peers
>> [2017-02-15 13:29:54.540256] E [MSGID: 106122] [glusterd-mgmt.c:2363:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
>> The message "E [MSGID: 106062] [glusterd-snapshot-utils.c:2391:glusterd_snap_create_use_rsp_dict] 0-management: failed to get snap UUID" repeated 2 times between [2017-02-15 13:29:53.779020] and [2017-02-15 13:29:54.535075]
>> The message "E [MSGID: 106099] [glusterd-snapshot-utils.c:2507:glusterd_snap_use_rsp_dict] 0-glusterd: Unable to use rsp dict" repeated 2 times between [2017-02-15 13:29:53.779073] and [2017-02-15 13:29:54.535078]
>> [2017-02-15 13:31:14.285666] I [MSGID: 106488] [glusterd-handler.c:1537:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
>> [2017-02-15 13:32:17.827422] E [MSGID: 106027] [glusterd-handler.c:4670:glusterd_get_volume_opts] 0-management: Volume cluster.locking-scheme does not exist
>> [2017-02-15 13:34:02.635762] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on s1. Volume data-teste does not exist
>> [2017-02-15 13:34:02.635838] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on s2. Volume data-teste does not exist
>> [2017-02-15 13:34:02.635889] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on s3. Volume data-teste does not exist
>> [2017-02-15 13:34:02.636092] E [MSGID: 106122] [glusterd-mgmt.c:947:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed on peers
>> [2017-02-15 13:34:02.636132] E [MSGID: 106122] [glusterd-mgmt.c:2009:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
>> [2017-02-15 13:34:20.313228] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s2. Error: Volume data-teste does not exist
>> [2017-02-15 13:34:20.313320] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s1. Error: Volume data-teste does not exist
>> [2017-02-15 13:34:20.313377] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s3. Error: Volume data-teste does not exist
>> [2017-02-15 13:34:36.796455] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s1. Error: Volume data-teste does not exist
>> [2017-02-15 13:34:36.796830] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s3. Error: Volume data-teste does not exist
>> [2017-02-15 13:34:36.796896] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on s2. Error: Volume data-teste does not exist
>>
>> Many thanks!
>> D
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
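[Editor's note: the removal procedure Avra outlines above (glusterd down on every node, orphaned repo deleted everywhere, glusterd restarted) can be sketched as a small shell script. The node names s0-s3 and passwordless ssh between them are assumptions; by default the script only prints the commands it would run, and executes them only when APPLY=1 is set.]

```shell
#!/bin/sh
# Sketch: remove an orphaned volume repo while glusterd is down on
# EVERY node, so no peer can hand the definition back on handshake.
# Dry run by default: commands are printed, not executed.
NODES="s0 s1 s2 s3"                        # assumption: your peer names
VOLDIR="/var/lib/glusterd/vols/data-teste"

run() {
    echo "+ $*"                            # show the command
    if [ "${APPLY:-0}" = 1 ]; then "$@"; fi    # execute only if APPLY=1
}

# 1. Stop glusterd on all nodes, then verify it is really down.
for n in $NODES; do
    run ssh "$n" systemctl stop glusterd
    run ssh "$n" systemctl status glusterd
done

# 2. Delete the repo from whichever nodes it exists on.
for n in $NODES; do
    run ssh "$n" rm -rf "$VOLDIR"
done

# 3. Start glusterd again everywhere.
for n in $NODES; do
    run ssh "$n" systemctl start glusterd
done
```

Running it without APPLY=1 just prints the plan, which is a cheap way to review the node list before taking the whole cluster off the air.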
Gambit15
2017-Feb-21 15:32 UTC
[Gluster-users] Failed snapshot clone leaving undeletable orphaned volume on a single peer
Hi Avra,

On 21 February 2017 at 03:22, Avra Sengupta <asengupt at redhat.com> wrote:
> Hi D,
>
> We tried reproducing the issue with a similar setup but were unable to
> do so. We are still investigating it.
>
> I have another follow-up question. You said that the repo exists only
> on s0? If that were the case, then bringing glusterd down on s0 only,
> deleting the repo and starting glusterd once again would have removed
> it. The fact that the repo is restored as soon as glusterd restarts on
> s0 means that some other node(s) in the cluster also has that repo and
> is passing that information to the glusterd on s0 during the handshake.
> Could you please confirm whether any node apart from s0 has that
> particular repo (/var/lib/glusterd/vols/data-teste) or not. Thanks.

I'll point out that this isn't a recurring issue. It's the first time this has happened, and it's not happened since. If it wasn't for the orphaned volume, I wouldn't even have requested support.

Huh, so, I've just rescanned all of the nodes, and the volume is now appearing on all. That's very odd, as the volume was "created" on Weds the 15th & until the end of the 17th it was still only appearing on s0 (both in the volume list & in the vols directory).

Grepping the etc-glusterfs-glusterd.vol logs, the first mention of the volume after the failures I posted previously is the following...

[2017-02-17 15:46:17.199193] W [rpcsvc.c:265:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.123.123.102:49008
[2017-02-17 15:46:17.199216] E [rpcsvc.c:560:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-02-17 22:20:58.525036] I [MSGID: 106004] [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer <s3> (<978c228a-86f8-48dc-89c1-c63914eaa9a4>), in state <Peer in Cluster>, has disconnected from glusterd.
[2017-02-17 22:20:58.525128] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x1deac) [0x7f2a85517eac] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x27a58) [0x7f2a85521a58] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xd09da) [0x7f2a855ca9da] ) 0-management: Lock for vol data not held
[2017-02-17 22:20:58.525144] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for data
[2017-02-17 22:20:58.525171] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x1deac) [0x7f2a85517eac] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x27a58) [0x7f2a85521a58] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xd09da) [0x7f2a855ca9da] ) 0-management: Lock for vol data-novo not held
[2017-02-17 22:20:58.525182] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for data-novo
[2017-02-17 22:20:58.525205] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x1deac) [0x7f2a85517eac] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x27a58) [0x7f2a85521a58] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xd09da) [0x7f2a855ca9da] ) 0-management: Lock for vol data-teste not held
[2017-02-17 22:20:58.525235] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for data-teste
[2017-02-17 22:20:58.525261] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x1deac) [0x7f2a85517eac] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0x27a58) [0x7f2a85521a58] -->/usr/lib64/glusterfs/3.8.8/xlator/mgmt/glusterd.so(+0xd09da) [0x7f2a855ca9da] ) 0-management: Lock for vol data-teste2 not held
[2017-02-17 22:20:58.525272] W [MSGID: 106118] [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock not released for data-teste2

That's 58 hours between the volume's failed creation & its first sign of life...??

At the time when it was only appearing on s0, I tried stopping glusterd on multiple occasions & deleting the volume's directory within vols, but it always returned as soon as I restarted glusterd. I did this with the help of Joe on IRC at the time, and he was also stumped (he suggested that the data was possibly still being held in memory somewhere), so I'm quite sure this wasn't simply an oversight on my part.

Anyway, many thanks for the help, and I'd be happy to provide any logs if desired, however whilst knowing what happened & why might be useful, all now seems to have resolved itself.

Cheers,
Doug

> [Earlier messages quoted in full above; trimmed.]
Volume data-teste does not exist >> [2017-02-15 13:34:02.636092] E [MSGID: 106122] >> [glusterd-mgmt.c:947:glusterd_mgmt_v3_pre_validate] 0-management: Pre >> Validation failed on peers >> [2017-02-15 13:34:02.636132] E [MSGID: 106122] >> [glusterd-mgmt.c:2009:glusterd_mgmt_v3_initiate_all_phases] >> 0-management: Pre Validation Failed >> [2017-02-15 13:34:20.313228] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s2. Error: Volume data-teste does not exist >> [2017-02-15 13:34:20.313320] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s1. Error: Volume data-teste does not exist >> [2017-02-15 13:34:20.313377] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s3. Error: Volume data-teste does not exist >> [2017-02-15 13:34:36.796455] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s1. Error: Volume data-teste does not exist >> [2017-02-15 13:34:36.796830] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s3. Error: Volume data-teste does not exist >> [2017-02-15 13:34:36.796896] E [MSGID: 106153] >> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Staging failed on >> s2. Error: Volume data-teste does not exist >> >> Many thanks! >> D >> >> >> _______________________________________________ >> Gluster-users mailing listGluster-users at gluster.orghttp://lists.gluster.org/mailman/listinfo/gluster-users >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170221/228f6bb3/attachment.html>
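Since glusterd restores volume definitions from its peers during handshake, the stale repo has to be gone from every node before a restart, not just s0. A minimal sketch for confirming which peers still carry the directory, assuming the hostnames s0-s3 and the path /var/lib/glusterd/vols/data-teste from this thread, with passwordless SSH between nodes; adjust for your own cluster:

```shell
#!/bin/sh
# Report, per node, whether the orphaned volume definition still exists.
# Hostnames and path are taken from this thread; change as needed.
for node in s0 s1 s2 s3; do
  printf '%s: ' "$node"
  ssh "$node" 'test -d /var/lib/glusterd/vols/data-teste \
    && echo present || echo absent'
done
```

Any node that reports "present" would need glusterd stopped and the directory removed there as well before restarting the daemons.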