Atin Mukherjee
2019-Jan-25 08:26 UTC
[Gluster-users] glusterfs 4.1.6 error in starting glusterd service
Amudhan,

So here's the issue:

In node3, 'cat /var/lib/glusterd/peers/*' doesn't show node2's details, and
that's why glusterd wasn't able to resolve the brick(s) hosted on node2.

Can you please pick up the 0083ec0c-40bf-472a-a128-458924e56c96 file from
/var/lib/glusterd/peers/ on node 4, place it in the same location on node 3,
and then restart the glusterd service on node 3?

On Thu, Jan 24, 2019 at 11:57 AM Amudhan P <amudhan83 at gmail.com> wrote:

> Atin,
>
> Sorry, I missed sending the entire `glusterd` folder. The attached zip now
> contains the `glusterd` folder from all nodes.
>
> The problem node is node3, IP 10.1.2.3; its `glusterd` log file is inside
> the node3 folder.
>
> regards
> Amudhan
>
> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee <amukherj at redhat.com> wrote:
>
>> Amudhan,
>>
>> I see that you have provided the content of the configuration of the
>> volume gfs-tst, whereas the request was to share the dump of
>> /var/lib/glusterd/*. I cannot debug this further until you share the
>> correct dump.
>>
>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee <amukherj at redhat.com> wrote:
>>
>>> Can you please run 'glusterd -LDEBUG' and share back the glusterd.log?
>>> Instead of doing too much back and forth, I suggest you share the
>>> content of /var/lib/glusterd from all the nodes. Also mention on which
>>> particular node the glusterd service is unable to come up.
>>>
>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P <amudhan83 at gmail.com> wrote:
>>>
>>>> I have created the folder in the path as said, but the service still
>>>> failed to start; below is the error msg in glusterd.log
>>>>
>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
>>>> [2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore*
>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again*
>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
>>>>
>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee <amukherj at redhat.com> wrote:
>>>>
>>>>> If gluster volume info/status shows the brick to be /media/disk4/brick4,
>>>>> then you'd need to mount the same path, and hence you'd need to create
>>>>> the brick4 directory explicitly. I fail to understand the rationale for
>>>>> how only /media/disk4 can be used as the mount path for the brick.
>>>>>
>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P <amudhan83 at gmail.com> wrote:
>>>>>
>>>>>> Yes, I did mount the bricks, but the folder 'brick4' was still not
>>>>>> created inside the brick.
>>>>>> Do I need to create this folder? When I run replace-brick it will
>>>>>> create the folder inside the brick; I have seen this behavior before
>>>>>> when running replace-brick or when heal begins.
>>>>>>
>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee <amukherj at redhat.com> wrote:
>>>>>>
>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P <amudhan83 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Atin,
>>>>>>>> I have copied the content of 'gfs-tst' from the vol folder on another
>>>>>>>> node. When starting the service, it again fails with the error msg
>>>>>>>> below in the glusterd.log file.
>>>>>>>> >>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>> /var/run/glusterd.pid) >>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] >>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>> set to 65536 >>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] >>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>> directory >>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] >>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>> working directory >>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] >>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>> channel creation failed [No such device] >>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>> [2019-01-15 20:16:59.521562] W >>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>> initialization failed >>>>>>>> [2019-01-15 20:16:59.521629] W >>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>> listener, initing the transport failed >>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] >>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>> continuing with succeeded transport >>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] >>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>> op-version: 40100 >>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] >>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425] >>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed >>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or >>>>>>>> directory] >>>>>>>> >>>>>>> >>>>>>> This means that underlying brick /media/disk4/brick4 doesn't exist. >>>>>>> You already mentioned that you had replaced the faulty disk, but have you >>>>>>> not mounted it yet? 
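For anyone hitting the same "failed to get statfs()" message: the check Atin is pointing at amounts to confirming that the replaced disk is actually mounted at the brick mount point and that the brick directory exists under it. A minimal sketch for node 3 follows; the device name /dev/sdX1 is only a placeholder for whatever the replaced disk is on your system:

df -h /media/disk4                  # is a filesystem mounted here at all?
sudo mount /dev/sdX1 /media/disk4   # mount the replaced disk (placeholder device name)
sudo mkdir -p /media/disk4/brick4   # recreate the brick directory the volume expects
ls -ld /media/disk4/brick4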
>>>>>>> >>>>>>> >>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] >>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: >>>>>>>> connect returned 0 >>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] >>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: >>>>>>>> Failed to get tcp-user-timeout >>>>>>>> [2019-01-15 20:17:00.691331] I >>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting >>>>>>>> frame-timeout to 600 >>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] >>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve >>>>>>>> brick failed in restore >>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] >>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>> 'management' failed, review your volfile again >>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] >>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>> failed >>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] >>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] >>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] >>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: >>>>>>>> received signum (-1), shutting down >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee <amukherj at redhat.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> This is a case of partial write of a transaction and as the host >>>>>>>>> ran out of space for the root partition where all the glusterd related >>>>>>>>> configurations are persisted, the transaction couldn't be written and hence >>>>>>>>> the new (replaced) brick's information wasn't persisted in the >>>>>>>>> configuration. The workaround for this is to copy the content of >>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the trusted >>>>>>>>> storage pool to the node where glusterd service fails to come up and post >>>>>>>>> that restarting the glusterd service should be able to make peer status >>>>>>>>> reporting all nodes healthy and connected. >>>>>>>>> >>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P <amudhan83 at gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> In short, when I started glusterd service I am getting following >>>>>>>>>> error msg in the glusterd.log file in one server. >>>>>>>>>> what needs to be done? 
>>>>>>>>>> >>>>>>>>>> error logged in glusterd.log >>>>>>>>>> >>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such >>>>>>>>>> file or directory] >>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] >>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: >>>>>>>>>> Unable to restore volume: gfs-tst >>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] >>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume >>>>>>>>>> 'management' failed, review your volfile again >>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] >>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator >>>>>>>>>> failed >>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] >>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed >>>>>>>>>> [2019-01-15 17:50:15.047171] W >>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit] >>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> In long, I am trying to simulate a situation. where volume stoped >>>>>>>>>> abnormally and >>>>>>>>>> entire cluster restarted with some missing disks. >>>>>>>>>> >>>>>>>>>> My test cluster is set up with 3 nodes and each has four disks, I >>>>>>>>>> have setup a volume with disperse 4+2. >>>>>>>>>> In Node-3 2 disks have failed, to replace I have shutdown all >>>>>>>>>> system >>>>>>>>>> >>>>>>>>>> below are the steps done. 
>>>>>>>>>> >>>>>>>>>> 1. umount from client machine >>>>>>>>>> 2. shutdown all system by running `shutdown -h now` command ( >>>>>>>>>> without stopping volume and stop service) >>>>>>>>>> 3. replace faulty disk in Node-3 >>>>>>>>>> 4. powered ON all system >>>>>>>>>> 5. format replaced drives, and mount all drives >>>>>>>>>> 6. start glusterd service in all node (success) >>>>>>>>>> 7. Now running `voulume status` command from node-3 >>>>>>>>>> output : [2019-01-15 16:52:17.718422] : v status : FAILED : >>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log >>>>>>>>>> file for details. >>>>>>>>>> 8. running `voulume start gfs-tst` command from node-3 >>>>>>>>>> output : [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED >>>>>>>>>> : Volume gfs-tst already started >>>>>>>>>> >>>>>>>>>> 9. running `gluster v status` in other node. showing all brick >>>>>>>>>> available but 'self-heal daemon' not running >>>>>>>>>> @gfstst-node2:~$ sudo gluster v status >>>>>>>>>> Status of volume: gfs-tst >>>>>>>>>> Gluster process TCP Port RDMA Port >>>>>>>>>> Online Pid >>>>>>>>>> >>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>> Brick IP.2:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1517 >>>>>>>>>> Brick IP.4:/media/disk1/brick1 49152 0 Y >>>>>>>>>> 1668 >>>>>>>>>> Brick IP.2:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1522 >>>>>>>>>> Brick IP.4:/media/disk2/brick2 49153 0 Y >>>>>>>>>> 1678 >>>>>>>>>> Brick IP.2:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1527 >>>>>>>>>> Brick IP.4:/media/disk3/brick3 49154 0 Y >>>>>>>>>> 1677 >>>>>>>>>> Brick IP.2:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1541 >>>>>>>>>> Brick IP.4:/media/disk4/brick4 49155 0 Y >>>>>>>>>> 1683 >>>>>>>>>> Self-heal Daemon on localhost N/A N/A >>>>>>>>>> Y 2662 >>>>>>>>>> Self-heal Daemon on IP.4 N/A N/A Y >>>>>>>>>> 2786 >>>>>>>>>> >>>>>>>>>> 10. in the above output 'volume already started'. so, running >>>>>>>>>> `reset-brick` command >>>>>>>>>> v reset-brick gfs-tst IP.3:/media/disk3/brick3 >>>>>>>>>> IP.3:/media/disk3/brick3 commit force >>>>>>>>>> >>>>>>>>>> output : [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst >>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : >>>>>>>>>> /media/disk3/brick3 is already part of a volume >>>>>>>>>> >>>>>>>>>> 11. reset-brick command was not working, so, tried stopping >>>>>>>>>> volume and start with force command >>>>>>>>>> output : [2019-01-15 17:01:04.570794] : v start gfs-tst force : >>>>>>>>>> FAILED : Pre-validation failed on localhost. Please check log file for >>>>>>>>>> details >>>>>>>>>> >>>>>>>>>> 12. now stopped service in all node and tried starting again. >>>>>>>>>> except node-3 other nodes service started successfully without any issues. >>>>>>>>>> >>>>>>>>>> in node-3 receiving following message. >>>>>>>>>> >>>>>>>>>> sudo service glusterd start >>>>>>>>>> * Starting glusterd service glusterd >>>>>>>>>> >>>>>>>>>> [fail] >>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f' >>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more information. >>>>>>>>>> >>>>>>>>>> 13. checking glusterd log file found that OS drive was running >>>>>>>>>> out of space >>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] >>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. 
[No space >>>>>>>>>> left on device] >>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190] >>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: >>>>>>>>>> Unable to write volume values for gfs-tst >>>>>>>>>> >>>>>>>>>> 14. cleared some space in OS drive but still, service is not >>>>>>>>>> running. below is the error logged in glusterd.log >>>>>>>>>> >>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running >>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p >>>>>>>>>> /var/run/glusterd.pid) >>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] >>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors >>>>>>>>>> set to 65536 >>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working >>>>>>>>>> directory >>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] >>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file >>>>>>>>>> working directory >>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071] >>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event >>>>>>>>>> channel creation failed [No such device] >>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] >>>>>>>>>> 0-rdma.management: Failed to initialize IB Device >>>>>>>>>> [2019-01-15 17:50:13.964491] W >>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' >>>>>>>>>> initialization failed >>>>>>>>>> [2019-01-15 17:50:13.964560] W >>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create >>>>>>>>>> listener, initing the transport failed >>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] >>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, >>>>>>>>>> continuing with succeeded transport >>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513] >>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved >>>>>>>>>> op-version: 40100 >>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544] >>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: >>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d >>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032] >>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to >>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. 
[No such file or directory]
>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
>>>>>>>>>>
>>>>>>>>>> 15. On the other nodes, running `gluster v status` still shows the node3 bricks as online,
>>>>>>>>>> but `peer status` shows node-3 as disconnected
>>>>>>>>>>
>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status
>>>>>>>>>> Status of volume: gfs-tst
>>>>>>>>>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> Brick IP.2:/media/disk1/brick1             49152     0          Y       1517
>>>>>>>>>> Brick IP.4:/media/disk1/brick1             49152     0          Y       1668
>>>>>>>>>> Brick IP.2:/media/disk2/brick2             49153     0          Y       1522
>>>>>>>>>> Brick IP.4:/media/disk2/brick2             49153     0          Y       1678
>>>>>>>>>> Brick IP.2:/media/disk3/brick3             49154     0          Y       1527
>>>>>>>>>> Brick IP.4:/media/disk3/brick3             49154     0          Y       1677
>>>>>>>>>> Brick IP.2:/media/disk4/brick4             49155     0          Y       1541
>>>>>>>>>> Brick IP.4:/media/disk4/brick4             49155     0          Y       1683
>>>>>>>>>> Self-heal Daemon on localhost              N/A       N/A        Y       2662
>>>>>>>>>> Self-heal Daemon on IP.4                   N/A       N/A        Y       2786
>>>>>>>>>>
>>>>>>>>>> Task Status of Volume gfs-tst
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> There are no active volume tasks
>>>>>>>>>>
>>>>>>>>>> root@gfstst-node2:~$ sudo gluster pool list
>>>>>>>>>> UUID                                    Hostname        State
>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3            Disconnected
>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4            Connected
>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected
>>>>>>>>>>
>>>>>>>>>> root@gfstst-node2:~$ sudo gluster peer status
>>>>>>>>>> Number of Peers: 2
>>>>>>>>>>
>>>>>>>>>> Hostname: IP.3
>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>> State: Peer in Cluster (Disconnected)
>>>>>>>>>>
>>>>>>>>>> Hostname: IP.4
>>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>>>>>>>>>> State: Peer in Cluster (Connected)
>>>>>>>>>>
>>>>>>>>>> regards
>>>>>>>>>> Amudhan
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
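For reference, the fix suggested at the top of this message comes down to copying the missing peer file (node 2's UUID file) from a healthy node to node 3 and restarting glusterd there. A rough sketch using the node-3 IP mentioned in this thread; the scp call, the root user, and the service manager are illustrative, and any copy mechanism will do:

# on node 4: copy node 2's peer file to node 3
scp /var/lib/glusterd/peers/0083ec0c-40bf-472a-a128-458924e56c96 root@10.1.2.3:/var/lib/glusterd/peers/

# on node 3: restart glusterd and confirm the pool is healthy again
sudo systemctl restart glusterd     # or: sudo service glusterd restart
sudo gluster peer status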
Amudhan P
2019-Jan-30 13:55 UTC
[Gluster-users] glusterfs 4.1.6 error in starting glusterd service
Hi Atin,

Yes, it worked out. Thank you.

What would be the cause of this issue?

On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee <amukherj at redhat.com> wrote:

> Amudhan,
>
> So here's the issue:
>
> In node3, 'cat /var/lib/glusterd/peers/*' doesn't show node2's details,
> and that's why glusterd wasn't able to resolve the brick(s) hosted on
> node2.
>
> Can you please pick up the 0083ec0c-40bf-472a-a128-458924e56c96 file from
> /var/lib/glusterd/peers/ on node 4, place it in the same location on
> node 3, and then restart the glusterd service on node 3?
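As for the cause, it was identified earlier in the thread: the root filesystem on node 3 ran out of space while glusterd was persisting its configuration, so the store under /var/lib/glusterd was only partially written, which is consistent with node 2's peer file never making it to disk there. A quick way to spot this kind of partial write on the affected node (the glusterd.log path depends on the install prefix, so treat it as an example):

df -h /var/lib/glusterd
ls -l /var/lib/glusterd/peers/        # compare the file count with a healthy node
grep "No space left on device" /var/log/glusterfs/glusterd.log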