>> I'm still trying to figure out why the self-heal-daemon doesn't seem to be
>> working, and what "unable to get index-dir" means. Any advice on what to
>> look at would be appreciated. Thanks!

> At any point did you have one node with 3.7.6 and another in 3.7.8 version?

Yes. I upgraded each server in turn by stopping all gluster server and
client processes on a machine, installing the update, and restarting
everything on that machine. Then I waited for "heal info" on all volumes to
say it had nothing to work on before proceeding to the next server.

> I couldn't find information about the setup (number of nodes, vol info etc).

I wasn't sure how much to spam the list :-)

There are 3 servers in the gluster pool.

amillar at chunxy:~$ sudo gluster pool list
UUID					Hostname	State
c1531143-229e-44a8-9fbb-089769ec999d	pve4		Connected
e82b0dc0-a490-47e4-bd11-6dcfcd36fd62	pve3		Connected
0422e779-7219-4392-ad8d-6263af4372fa	localhost	Connected

Example volume:

amillar at chunxy:~$ sudo gluster vol info public

Volume Name: public
Type: Replicate
Volume ID: f75bbc7b-5db7-4497-b14c-c0433a84bcd9
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: chunxy:/data/glfs23/public
Brick2: pve4:/data/glfs240/public
Brick3: pve3:/data/glfs830/public
Options Reconfigured:
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.entry-self-heal: on
cluster.self-heal-daemon: enable
performance.readdir-ahead: on

amillar at chunxy:~$ sudo gluster vol status public
Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick chunxy:/data/glfs23/public            49184     0          Y       13119
Brick pve4:/data/glfs240/public             49168     0          Y       837
Brick pve3:/data/glfs830/public             49167     0          Y       21074
NFS Server on localhost                     2049      0          Y       10237
Self-heal Daemon on localhost               N/A       N/A        Y       10243
NFS Server on pve3                          2049      0          Y       30177
Self-heal Daemon on pve3                    N/A       N/A        Y       30183
NFS Server on pve4                          2049      0          Y       13070
Self-heal Daemon on pve4                    N/A       N/A        Y       13069

Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks

amillar at chunxy:~$ sudo gluster vol heal public info
Brick chunxy:/data/glfs23/public
Number of entries: 0

Brick pve4:/data/glfs240/public
Number of entries: 0

Brick pve3:/data/glfs830/public
Number of entries: 0

amillar at chunxy:~$ sudo grep public /var/log/glusterfs/glustershd.log | tail -10
[2016-03-01 18:38:41.006543] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 18:48:42.001580] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 18:58:43.001631] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:08:43.002149] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:18:44.001444] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:28:44.001444] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:38:44.001445] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:48:44.001555] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 19:58:44.001405] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1
[2016-03-01 20:08:44.001444] W [MSGID: 108034] [afr-self-heald.c:445:afr_shd_index_sweep] 0-public-replicate-0: unable to get index-dir on public-client-1

in glustershd.log:
39: volume public-client-1
40:     type protocol/client
41:     option clnt-lk-version 1
42:     option volfile-checksum 0
43:     option volfile-key gluster/glustershd
44:     option client-version 3.7.8
45:     option process-uuid chunxy-10236-2016/03/01-15:38:17:614674-public-client-1-0-0
46:     option fops-version 1298437
47:     option ping-timeout 42
48:     option remote-host chunxy
49:     option remote-subvolume /data/glfs23/public
50:     option transport-type socket
51:     option username 1aafe2d7-79ea-46b8-9d3e-XXXXXXXXXXXX
52:     option password 7aaa0ce0-2a56-45e1-bc0e-YYYYYYYYYYYY
53: end-volume

What other information would help? Thanks!
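[Editorial note: one quick way to narrow down an "unable to get index-dir" warning is to look for the self-heal index directories directly on each brick. The sketch below is an assumption: the `.glusterfs/indices/...` paths follow the typical 3.7.x brick layout and may differ on your build, and the `check_index_dirs` helper is hypothetical, not a gluster tool.]

```shell
# Check whether a brick carries the index directories the self-heal
# daemon sweeps. NOTE: the exact paths are an assumption based on the
# 3.7.x brick layout; the "dirty" index was only added in 3.7.7, so it
# is expected to be absent on bricks created by older versions.
check_index_dirs() {
  brick="$1"
  for d in "$brick/.glusterfs/indices/xattrop" "$brick/.glusterfs/indices/dirty"; do
    if [ -d "$d" ]; then
      echo "present: $d"
    else
      echo "MISSING: $d"
    fi
  done
}

# Example (run on the server that hosts the brick):
#   check_index_dirs /data/glfs23/public
```

If an index turns out to be missing or stale, `gluster volume heal <vol> full` walks the bricks instead of relying on the index, and `gluster volume heal <vol> info` shows what is still pending.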
Anuradha Talur
2016-Mar-02 09:17 UTC
[Gluster-users] Broken after 3.7.8 upgrade from 3.7.6
----- Original Message -----
> From: "Alan Millar" <grunthos503 at yahoo.com>
> To: "Anuradha Talur" <atalur at redhat.com>
> Cc: gluster-users at gluster.org
> Sent: Wednesday, March 2, 2016 2:00:49 AM
> Subject: Re: [Gluster-users] Broken after 3.7.8 upgrade from 3.7.6
>
> >> I'm still trying to figure out why the self-heal-daemon doesn't seem to be
> >> working, and what "unable to get index-dir" means. Any advice on what to
> >> look at would be appreciated. Thanks!
> >
> > At any point did you have one node with 3.7.6 and another in 3.7.8 version?
>
> Yes. I upgraded each server in turn by stopping all gluster server and
> client processes on a machine, installing the update, and restarting
> everything on that machine. Then I waited for "heal info" on all volumes to
> say it had nothing to work on before proceeding to the next server.
>
Thanks for the information. The "unable to get index-dir on .." messages you
saw in the log are not harmful in this scenario. A simple explanation: when
you have 1 new node and 2 old nodes, the self-heal-daemon and heal commands
run on the new node expect the index-dir "<brickpath>/.glusterfs/xattrop/dirty"
to exist so they can process entries from it. But that directory doesn't exist
on the old bricks (it was introduced as part of 3.7.7). As a result you see
these logs.

But this doesn't explain why heal didn't happen on replacing the brick. :-/
After replacing, did you check volume heal info to know the status of heal?

> > I couldn't find information about the setup (number of nodes, vol info etc).
>
> I wasn't sure how much to spam the list :-)
>
> There are 3 servers in the gluster pool.
> [...]

-- 
Thanks,
Anuradha.
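[Editorial note: since the same warning repeats every ten minutes, it can help to summarize glustershd.log per client subvolume and then match a name like public-client-1 against the "remote-host"/"remote-subvolume" options in the volfile dump shown earlier in the thread. The `summarize_index_warnings` helper below is a hypothetical sketch, not part of gluster.]

```shell
# Count "unable to get index-dir" warnings per client subvolume in a
# glustershd.log. The client name in the last field (e.g. public-client-1)
# can then be mapped to a brick via the remote-host/remote-subvolume
# options in the glustershd volfile dump.
summarize_index_warnings() {
  grep 'unable to get index-dir on' "$1" \
    | sed 's/.*unable to get index-dir on //' \
    | sort | uniq -c | sort -rn
}

# Example:
#   summarize_index_warnings /var/log/glusterfs/glustershd.log
```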