1) One peer, out of four, got separated from the network, from the rest of the cluster.
2) While it was unavailable, that peer got detached with the "gluster peer detach" command, which succeeded, so the cluster now comprises three peers.
3) The self-heal daemon (for some reason) does not start, even after restarting glusterd, on the peer which had probed that fourth peer.
4) The fourth, unavailable peer is still up & running but is inaccessible to the other peers because the network is disconnected/segmented. That peer's gluster status shows it as still in the cluster.
5) So the fourth peer's gluster stack (and its other processes) did not fail or crash; only its network connection is down.
6) gluster peer status shows OK & connected for the current three peers.

This is the third time this has happened to me, in exactly the same way: each time the net-disjointed peer was brought back online, statistics & details worked again.

Can you try to reproduce it?

$ gluster vol info QEMU-VMs

Volume Name: QEMU-VMs
Type: Replicate
Volume ID: 8709782a-daa5-4434-a816-c4e0aef8fef2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.5.6.32:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER-QEMU-VMs
Brick2: 10.5.6.49:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER-QEMU-VMs
Brick3: 10.5.6.100:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER-QEMU-VMs
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
storage.owner-gid: 107
storage.owner-uid: 107
performance.readdir-ahead: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

$ gluster vol status QEMU-VMs
Status of volume: QEMU-VMs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.5.6.32:/__.aLocalStorages/0/0-GLUS
TERs/0GLUSTER-QEMU-VMs                      49156     0          Y       9302
Brick 10.5.6.49:/__.aLocalStorages/0/0-GLUS
TERs/0GLUSTER-QEMU-VMs                      49156     0          Y       7610
Brick 10.5.6.100:/__.aLocalStorages/0/0-GLU
STERs/0GLUSTER-QEMU-VMs                     49156     0          Y       11013
Self-heal Daemon on localhost               N/A       N/A        Y       3069276
Self-heal Daemon on 10.5.6.32               N/A       N/A        Y       3315870
Self-heal Daemon on 10.5.6.49               N/A       N/A        N       N/A    <--- HERE
Self-heal Daemon on 10.5.6.17               N/A       N/A        Y       5163

Task Status of Volume QEMU-VMs
------------------------------------------------------------------------------
There are no active volume tasks

$ gluster vol heal QEMU-VMs statistics heal-count
Gathering count of entries to be healed on volume QEMU-VMs has been unsuccessful on bricks that are down. Please check if all brick processes are running.


On 04/09/17 11:47, Atin Mukherjee wrote:
> Please provide the output of gluster volume info, gluster
> volume status and gluster peer status.
>
> On Mon, Sep 4, 2017 at 4:07 PM, lejeczek
> <peljasz at yahoo.co.uk> wrote:
>
>     hi all
>
>     this:
>     $ gluster vol heal $_vol info
>     outputs OK and the exit code is 0.
>     But if I want to see statistics:
>     $ gluster vol heal $_vol statistics
>     Gathering crawl statistics on volume GROUP-WORK has
>     been unsuccessful on bricks that are down. Please
>     check if all brick processes are running.
>
>     I suspect a gluster inability to cope with a situation
>     where one peer (which is not even a brick for a single
>     vol on the cluster!) is inaccessible to the rest of the
>     cluster.
>     I have not played with any other variations of this
>     case, e.g. more than one peer going down, etc.
>     But I hope someone could try to replicate this simple
>     test case.
>
>     Cluster and vols, when something like this happens,
>     seem accessible and as such "all" works, except when
>     you want more details.
>     This also fails:
>     $ gluster vol status $_vol detail
>     Error : Request timed out
>
>     My gluster (3.10.5-1.el7.x86_64) exhibits these
>     symptoms every time at least one peer goes out of
>     the rest's reach.
>
>     maybe @devel can comment?
>
>     many thanks, L.
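A minimal reproduction sketch of the sequence described above; the peer name "peer4" and the use of iptables to simulate the network split are assumptions, not details from the report:

# on the peer to be isolated: simulate the split by blocking glusterd's
# management port (24007) in both directions
iptables -A INPUT  -p tcp --dport 24007 -j DROP
iptables -A OUTPUT -p tcp --dport 24007 -j DROP

# on a surviving peer: detach the now-unreachable peer; a plain detach
# may refuse while the peer is down, hence force
gluster peer detach peer4 force

# the commands that then fail until the isolated peer is back online:
gluster vol heal QEMU-VMs statistics heal-count
gluster vol status QEMU-VMs detail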
Atin Mukherjee
2017-Sep-04 14:05 UTC
[Gluster-users] heal info OK but statistics not working
Ravi/Karthick,

If one of the self-heal processes is down, will the statistics heal-count command work?
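A quick way to confirm which peers' self-heal daemons are actually running, independent of the gluster CLI; the host list below is this thread's peers, and root ssh access to each is assumed:

for h in 10.5.6.32 10.5.6.49 10.5.6.100 10.5.6.17; do
    echo "== $h =="
    # the shd process command line contains "glustershd", so pgrep -f matches it
    ssh root@$h 'pgrep -af glustershd' || echo "shd NOT running on $h"
done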
Ravishankar N
2017-Sep-05 07:40 UTC
[Gluster-users] heal info OK but statistics not working
On 09/04/2017 07:35 PM, Atin Mukherjee wrote:
> Ravi/Karthick,
>
> If one of the self-heal processes is down, will the statistics
> heal-count command work?

No, it doesn't seem to: the glusterd stage-op phase fails because shd was down on that node, and we error out. FWIW, the error message "Gathering crawl statistics on volume GROUP-WORK has been unsuccessful on bricks that are down. Please check if all brick processes are running." is incorrect, and once https://review.gluster.org/#/c/15724/ gets merged, you will get the correct error message, like so:

[root@vm2 glusterfs]# gluster v heal testvol statistics
Gathering crawl statistics on volume testvol has been unsuccessful:
Staging failed on vm1. Error: Self-heal daemon is not running. Check self-heal daemon log file.

-Ravi
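Until that fix lands, a practical workaround for the state shown above is a forced volume start, which respawns any missing brick and self-heal daemon processes; this sketch assumes glusterd itself is healthy on the affected node:

# run on any peer; does not disturb processes that are already running
gluster volume start QEMU-VMs force

# confirm the self-heal daemon came back
gluster volume status QEMU-VMs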