Tried this.
With me, only 'fake2' gets healed after I bring the 'empty' brick back up,
and it stops there unless I do a 'heal-full'.
Is that what you're seeing as well?
-Krutika
On Wed, Aug 31, 2016 at 4:43 AM, David Gossage <dgossage at
carouselchecks.com>
wrote:
> Same issue. Brought up glusterd on the problem node; heal count is still
> stuck at 6330.
>
> Ran gluster v heal GLUSTER1 full
>
> glustershd on the problem node shows a sweep starting and finishing in
> seconds. The other 2 nodes show no activity in their logs. They should
> start a sweep too, shouldn't they?
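(One way to check which node's self-heal daemon actually ran a crawl is to
grep its glustershd log on each node, assuming the default log location:

    grep "full sweep" /var/log/glusterfs/glustershd.log

The "starting full sweep" / "finished full sweep" pairs show which subvolumes
that daemon crawled and when.)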
>
> Tried starting from scratch
>
> kill -15 brickpid
> rm -Rf /brick
> mkdir -p /brick
> mkdir /gsmount/fake2
> setfattr -n "user.some-name" -v "some-value" /gsmount/fake2
>
> Heals visible dirs instantly then stops.
>
> gluster v heal GLUSTER1 full
>
> see sweep start on problem node and end almost instantly. No files added to
> heal list, no files healed, no more logging.
>
> [2016-08-30 23:11:31.544331] I [MSGID: 108026]
> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
> starting full sweep on subvol GLUSTER1-client-1
> [2016-08-30 23:11:33.776235] I [MSGID: 108026]
> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
> finished full sweep on subvol GLUSTER1-client-1
>
> Same results no matter which node you run the command on. Still stuck with
> 6330 files showing as needing heal out of 19k. Logs still show no heals are
> occurring.
>
> Is there a way to forcibly reset any prior heal data? Could it be stuck
> on some past failed heal start?
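(To watch the pending count directly without triggering a new crawl, the
heal-count statistics can be polled; volume name as above, assuming the
subcommand is available in this release:

    gluster volume heal GLUSTER1 statistics heal-count

This prints the number of entries pending heal per brick.)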
>
>
>
>
> *David Gossage*
> *Carousel Checks Inc. | System Administrator*
> *Office* 708.613.2284
>
> On Tue, Aug 30, 2016 at 10:03 AM, David Gossage <
> dgossage at carouselchecks.com> wrote:
>
>> On Tue, Aug 30, 2016 at 10:02 AM, David Gossage <
>> dgossage at carouselchecks.com> wrote:
>>
>>> updated test server to 3.8.3
>>>
>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>> Options Reconfigured:
>>> cluster.granular-entry-heal: on
>>> performance.readdir-ahead: on
>>> performance.read-ahead: off
>>> nfs.disable: on
>>> nfs.addr-namelookup: off
>>> nfs.enable-ino32: off
>>> cluster.background-self-heal-count: 16
>>> cluster.self-heal-window-size: 1024
>>> performance.quick-read: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: on
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> server.allow-insecure: on
>>> features.shard: on
>>> features.shard-block-size: 64MB
>>> performance.strict-o-direct: off
>>> cluster.locking-scheme: granular
>>>
>>> kill -15 brickpid
>>> rm -Rf /gluster2/brick3
>>> mkdir -p /gluster2/brick3/1
>>> mkdir /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/fake2
>>> setfattr -n "user.some-name" -v "some-value" /rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/fake2
>>> gluster v start glustershard force
>>>
>>> At this point the brick process starts and all visible files, including
>>> the new dir, are made on the brick. A handful of shards are still in heal
>>> statistics, but no .shard directory is created and there is no increase
>>> in shard count.
>>>
>>> gluster v heal glustershard
>>>
>>> At this point there is still no increase in count, no dir made, and no
>>> additional healing activity generated in the logs. Waited a few minutes
>>> tailing logs to check if anything kicked in.
>>>
>>> gluster v heal glustershard full
>>>
>>> Shards get added to the list and heal commences. Logs show a full sweep
>>> starting on all 3 nodes, though this time it only shows as finishing on
>>> one, which looks to be the one that had its brick deleted.
>>>
>>> [2016-08-30 14:45:33.098589] I [MSGID: 108026]
>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>> starting full sweep on subvol glustershard-client-0
>>> [2016-08-30 14:45:33.099492] I [MSGID: 108026]
>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>> starting full sweep on subvol glustershard-client-1
>>> [2016-08-30 14:45:33.100093] I [MSGID: 108026]
>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>> starting full sweep on subvol glustershard-client-2
>>> [2016-08-30 14:52:29.760213] I [MSGID: 108026]
>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0:
>>> finished full sweep on subvol glustershard-client-2
>>>
>>
>> Just realized it's still healing, so that may be why the sweeps on the 2
>> other bricks haven't reported as finished.
>>
>>>
>>>
>>> My hope is that later tonight a full heal will work on production. Is it
>>> possible the self-heal daemon can get stale or stop listening but still
>>> show as active? Would stopping and starting the self-heal daemon from the
>>> gluster cli before doing these heals be helpful?
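(One way to bounce only the self-heal daemon from the CLI, without touching
the brick processes, is to toggle the cluster.self-heal-daemon option; a
sketch using the production volume name, assuming the usual behavior where
disabling the option stops the shd for that volume and re-enabling it spawns
a fresh one:

    gluster volume set GLUSTER1 cluster.self-heal-daemon off
    gluster volume set GLUSTER1 cluster.self-heal-daemon on
    gluster volume status GLUSTER1   # Self-heal Daemon rows should show new PIDs

)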
>>>
>>>
>>> On Tue, Aug 30, 2016 at 9:29 AM, David Gossage <
>>> dgossage at carouselchecks.com> wrote:
>>>
>>>> On Tue, Aug 30, 2016 at 8:52 AM, David Gossage <
>>>> dgossage at carouselchecks.com> wrote:
>>>>
>>>>> On Tue, Aug 30, 2016 at 8:01 AM, Krutika Dhananjay <
>>>>> kdhananj at redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 30, 2016 at 6:20 PM, Krutika Dhananjay <
>>>>>> kdhananj at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 30, 2016 at 6:07 PM, David Gossage <
>>>>>>> dgossage at carouselchecks.com> wrote:
>>>>>>>
>>>>>>>> On Tue, Aug 30, 2016 at 7:18 AM, Krutika Dhananjay <
>>>>>>>> kdhananj at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Could you also share the glustershd logs?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'll get them when I get to work, sure.
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I tried the same steps that you mentioned multiple times, but
>>>>>>>>> heal is running to completion without any issues.
>>>>>>>>>
>>>>>>>>> It must be said that 'heal full' traverses the files and
>>>>>>>>> directories in a depth-first order and does heals also in the
>>>>>>>>> same order. But if it gets interrupted in the middle (say
>>>>>>>>> because self-heal-daemon was either intentionally or
>>>>>>>>> unintentionally brought offline and then brought back up),
>>>>>>>>> self-heal will only pick up the entries that are so far marked
>>>>>>>>> as new entries that need heal, which it will find in the
>>>>>>>>> indices/xattrop directory. What this means is that those files
>>>>>>>>> and directories that were not visited during the crawl will
>>>>>>>>> remain untouched and unhealed in this second iteration of heal,
>>>>>>>>> unless you execute a 'heal-full' again.
>>>>>>>>>
>>>>>>>>
>>>>>>>> So should it start healing shards as it crawls, or not until after
>>>>>>>> it crawls the entire .shard directory? At the pace it was going,
>>>>>>>> that could be a week with one node appearing in the cluster but with
>>>>>>>> no shard files if anything tries to access a file on that node. From
>>>>>>>> my experience the other day, telling it to heal full again did
>>>>>>>> nothing regardless of the node used.
>>>>>>>>
>>>>>>>
>>>>>> Crawl is started from '/' of the volume. Whenever self-heal detects
>>>>>> during the crawl that a file or directory is present in some brick(s)
>>>>>> and absent in others, it creates the file on the bricks where it is
>>>>>> absent and marks the fact that the file or directory might need
>>>>>> data/entry and metadata heal too (this also means that an index is
>>>>>> created under .glusterfs/indices/xattrop of the src bricks). And the
>>>>>> data/entry and metadata heal are picked up and done in the background
>>>>>> with the help of these indices.
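(To see how many entries are queued this way on a given brick, the index
directory can be listed directly on the brick; a sketch using one of the
production brick paths from later in this thread, assuming the usual layout
where each pending entry is a gfid-named hard link alongside a base
xattrop-* file:

    ls /gluster1/BRICK1/1/.glusterfs/indices/xattrop | grep -v '^xattrop-' | wc -l

The count should roughly track what 'gluster volume heal GLUSTER1 info'
reports for that brick.)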
>>>>>>
>>>>>
>>>>> Looking at my 3rd node as an example, I find nearly the exact same
>>>>> number of files in the xattrop dir as reported by the heal count at the
>>>>> time I brought down node2 to try and alleviate read io errors, which I
>>>>> was guessing came from attempts to use the node with no shards for
>>>>> reads.
>>>>>
>>>>> Also attached are the glustershd logs from the 3 nodes, along with the
>>>>> test node I tried yesterday with the same results.
>>>>>
>>>>
>>>> Looking at my own logs, I notice that a full sweep was only ever
>>>> recorded in glustershd.log on the 2nd node with the missing directory. I
>>>> believe I should have found a sweep begun on every node, correct?
>>>>
>>>> On my test dev when it did work I do see that
>>>>
>>>> [2016-08-30 13:56:25.223333] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> starting full sweep on subvol glustershard-client-0
>>>> [2016-08-30 13:56:25.223522] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> starting full sweep on subvol glustershard-client-1
>>>> [2016-08-30 13:56:25.224616] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> starting full sweep on subvol glustershard-client-2
>>>> [2016-08-30 14:18:48.333740] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> finished full sweep on subvol glustershard-client-2
>>>> [2016-08-30 14:18:48.356008] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> finished full sweep on subvol glustershard-client-1
>>>> [2016-08-30 14:18:49.637811] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-glustershard-replicate-0:
>>>> finished full sweep on subvol glustershard-client-0
>>>>
>>>> While looking at the past few days on the 3 prod nodes, I only found
>>>> the following, and only on my 2nd node:
>>>> [2016-08-27 01:26:42.638772] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> starting full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 11:37:01.732366] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> finished full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 12:58:34.597228] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> starting full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 12:59:28.041173] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> finished full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 20:03:42.560188] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> starting full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 20:03:44.278274] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> finished full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 21:00:42.603315] I [MSGID: 108026]
>>>> [afr-self-heald.c:646:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> starting full sweep on subvol GLUSTER1-client-1
>>>> [2016-08-27 21:00:46.148674] I [MSGID: 108026]
>>>> [afr-self-heald.c:656:afr_shd_full_healer] 0-GLUSTER1-replicate-0:
>>>> finished full sweep on subvol GLUSTER1-client-1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> My suspicion is that this is what happened on your setup. Could
>>>>>>>>> you confirm if that was the case?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The brick was brought online with force start, then a full heal was
>>>>>>>> launched. Hours later, after it became evident that it was not
>>>>>>>> adding new files to heal, I did try restarting the self-heal daemon
>>>>>>>> and relaunching the full heal again. But this was after the heal had
>>>>>>>> basically already failed to work as intended.
>>>>>>>>
>>>>>>>
>>>>>>> OK. How did you figure it was not adding any new files? I need to
>>>>>>> know what places you were monitoring to come to this conclusion.
>>>>>>>
>>>>>>> -Krutika
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> As for those logs, I did manage to do something that caused the
>>>>>>>>> warning messages you shared earlier to appear in my client and
>>>>>>>>> server logs. Although these logs are annoying and a bit scary too,
>>>>>>>>> they didn't do any harm to the data in my volume. Why they appear
>>>>>>>>> just after a brick is replaced and under no other circumstances is
>>>>>>>>> something I'm still investigating.
>>>>>>>>>
>>>>>>>>> But for the future, it would be good to follow the steps Anuradha
>>>>>>>>> gave, as that would allow self-heal to at least detect that it has
>>>>>>>>> some repairing to do whenever it is restarted, whether
>>>>>>>>> intentionally or otherwise.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I followed those steps as described on my test box and ended up
>>>>>>>> with the exact same outcome: shards added at an agonizingly slow
>>>>>>>> pace, and no creation of the .shard directory or heals on the shard
>>>>>>>> directory. Directories visible from the mount healed quickly. This
>>>>>>>> was with one VM, so it has only 800 shards as well. After hours at
>>>>>>>> work it had added a total of 33 shards to be healed. I sent those
>>>>>>>> logs yesterday as well, though not the glustershd logs.
>>>>>>>>
>>>>>>>> Does the replace-brick command copy files in the same manner? For
>>>>>>>> these purposes I am contemplating just skipping the heal route.
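(For reference, the replace-brick form supported here is 'commit force' (the
data-migration variants were removed in recent releases); it swaps in the new
empty brick and then relies on the same self-heal mechanism to repopulate it,
so it would not copy the shards any differently. A sketch with a hypothetical
new brick path:

    gluster volume replace-brick GLUSTER1 ccgl2.gl.local:/gluster1/BRICK1/1 ccgl2.gl.local:/gluster1/BRICK1/new commit force

)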
>>>>>>>>
>>>>>>>>
>>>>>>>>> -Krutika
>>>>>>>>>
>>>>>>>>> On Tue, Aug 30, 2016 at 2:22 AM, David Gossage <
>>>>>>>>> dgossage at carouselchecks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Attached brick and client logs from the test machine where the
>>>>>>>>>> same behavior occurred; not sure if anything new is there. It's
>>>>>>>>>> still on 3.8.2.
>>>>>>>>>>
>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: 192.168.71.10:/gluster2/brick1/1
>>>>>>>>>> Brick2: 192.168.71.11:/gluster2/brick2/1
>>>>>>>>>> Brick3: 192.168.71.12:/gluster2/brick3/1
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> cluster.locking-scheme: granular
>>>>>>>>>> performance.strict-o-direct: off
>>>>>>>>>> features.shard-block-size: 64MB
>>>>>>>>>> features.shard: on
>>>>>>>>>> server.allow-insecure: on
>>>>>>>>>> storage.owner-uid: 36
>>>>>>>>>> storage.owner-gid: 36
>>>>>>>>>> cluster.server-quorum-type: server
>>>>>>>>>> cluster.quorum-type: auto
>>>>>>>>>> network.remote-dio: on
>>>>>>>>>> cluster.eager-lock: enable
>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>> performance.io-cache: off
>>>>>>>>>> performance.quick-read: off
>>>>>>>>>> cluster.self-heal-window-size: 1024
>>>>>>>>>> cluster.background-self-heal-count: 16
>>>>>>>>>> nfs.enable-ino32: off
>>>>>>>>>> nfs.addr-namelookup: off
>>>>>>>>>> nfs.disable: on
>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>> cluster.granular-entry-heal: on
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 29, 2016 at 2:20 PM, David Gossage <
>>>>>>>>>> dgossage at carouselchecks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur <
>>>>>>>>>>> atalur at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> > From: "David Gossage" <dgossage at carouselchecks.com>
>>>>>>>>>>>> > To: "Anuradha Talur" <atalur at redhat.com>
>>>>>>>>>>>> > Cc: "gluster-users at gluster.org List" <Gluster-users at gluster.org>, "Krutika Dhananjay" <kdhananj at redhat.com>
>>>>>>>>>>>> > Sent: Monday, August 29, 2016 5:12:42 PM
>>>>>>>>>>>> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <
>>>>>>>>>>>> > atalur at redhat.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > > Response inline.
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > ----- Original Message -----
>>>>>>>>>>>> > > > From: "Krutika Dhananjay" <kdhananj at redhat.com>
>>>>>>>>>>>> > > > To: "David Gossage" <dgossage at carouselchecks.com>
>>>>>>>>>>>> > > > Cc: "gluster-users at gluster.org List" <Gluster-users at gluster.org>
>>>>>>>>>>>> > > > Sent: Monday, August 29, 2016 3:55:04 PM
>>>>>>>>>>>> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Could you attach both client and brick logs? Meanwhile I
>>>>>>>>>>>> > > > will try these steps out on my machines and see if it is
>>>>>>>>>>>> > > > easily recreatable.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > -Krutika
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>>>>>>>>>>> > > > dgossage at carouselchecks.com> wrote:
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Centos 7 Gluster 3.8.3
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>>>>>>>> > > > Options Reconfigured:
>>>>>>>>>>>> > > > cluster.data-self-heal-algorithm: full
>>>>>>>>>>>> > > > cluster.self-heal-daemon: on
>>>>>>>>>>>> > > > cluster.locking-scheme: granular
>>>>>>>>>>>> > > > features.shard-block-size: 64MB
>>>>>>>>>>>> > > > features.shard: on
>>>>>>>>>>>> > > > performance.readdir-ahead: on
>>>>>>>>>>>> > > > storage.owner-uid: 36
>>>>>>>>>>>> > > > storage.owner-gid: 36
>>>>>>>>>>>> > > > performance.quick-read: off
>>>>>>>>>>>> > > > performance.read-ahead: off
>>>>>>>>>>>> > > > performance.io-cache: off
>>>>>>>>>>>> > > > performance.stat-prefetch: on
>>>>>>>>>>>> > > > cluster.eager-lock: enable
>>>>>>>>>>>> > > > network.remote-dio: enable
>>>>>>>>>>>> > > > cluster.quorum-type: auto
>>>>>>>>>>>> > > > cluster.server-quorum-type: server
>>>>>>>>>>>> > > > server.allow-insecure: on
>>>>>>>>>>>> > > > cluster.self-heal-window-size: 1024
>>>>>>>>>>>> > > > cluster.background-self-heal-count: 16
>>>>>>>>>>>> > > > performance.strict-write-ordering: off
>>>>>>>>>>>> > > > nfs.disable: on
>>>>>>>>>>>> > > > nfs.addr-namelookup: off
>>>>>>>>>>>> > > > nfs.enable-ino32: off
>>>>>>>>>>>> > > > cluster.granular-entry-heal: on
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>>>>>>>>>>>> > > > Following steps detailed in previous recommendations, began
>>>>>>>>>>>> > > > the process of replacing and healing bricks one node at a
>>>>>>>>>>>> > > > time.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > 1) kill pid of brick
>>>>>>>>>>>> > > > 2) reconfigure brick from raid6 to raid10
>>>>>>>>>>>> > > > 3) recreate directory of brick
>>>>>>>>>>>> > > > 4) gluster volume start <> force
>>>>>>>>>>>> > > > 5) gluster volume heal <> full
>>>>>>>>>>>> > > Hi,
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > I'd suggest that full heal is not used. There are a few
>>>>>>>>>>>> > > bugs in full heal. Better safe than sorry ;)
>>>>>>>>>>>> > > Instead I'd suggest the following steps:
>>>>>>>>>>>> > >
>>>>>>>>>>>> > Currently I brought the node down by systemctl stop glusterd,
>>>>>>>>>>>> > as I was getting sporadic io issues and a few VMs paused, so
>>>>>>>>>>>> > hoping that will help. I may wait to do this till around 4PM
>>>>>>>>>>>> > when most work is done, in case it shoots load up.
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > > 1) kill pid of brick
>>>>>>>>>>>> > > 2) do the reconfiguring of the brick that you need
>>>>>>>>>>>> > > 3) recreate brick dir
>>>>>>>>>>>> > > 4) while the brick is still down, from the mount point:
>>>>>>>>>>>> > >    a) create a dummy non-existent dir under / of mount.
>>>>>>>>>>>> > >
>>>>>>>>>>>> >
>>>>>>>>>>>> > So if node 2 is the down brick, pick another node, for example
>>>>>>>>>>>> > 3, and make a test dir under its brick directory that doesn't
>>>>>>>>>>>> > exist on 2, or should I be doing this over a gluster mount?
>>>>>>>>>>>> You should be doing this over the gluster mount.
>>>>>>>>>>>> >
>>>>>>>>>>>> > >    b) set a non-existent extended attribute on / of mount.
>>>>>>>>>>>> > >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Could you give me an example of an attribute to set? I've read
>>>>>>>>>>>> > a tad on this, and looked up attributes, but haven't set any
>>>>>>>>>>>> > yet myself.
>>>>>>>>>>>> >
>>>>>>>>>>>> Sure. setfattr -n "user.some-name" -v "some-value" <path-to-mount>
>>>>>>>>>>>> > > Doing these steps will ensure that heal happens only from
>>>>>>>>>>>> > > the updated brick to the down brick.
>>>>>>>>>>>> > > 5) gluster v start <> force
>>>>>>>>>>>> > > 6) gluster v heal <>
>>>>>>>>>>>> > >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Will it matter if somewhere in gluster the full heal command
>>>>>>>>>>>> > was run the other day? Not sure if it eventually stops or
>>>>>>>>>>>> > times out.
>>>>>>>>>>>> >
>>>>>>>>>>>> Full heal will stop once the crawl is done. So if you want to
>>>>>>>>>>>> trigger heal again, run gluster v heal <>. Actually, even brick
>>>>>>>>>>>> up or volume start force should trigger the heal.
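(For reference, the two triggers being discussed, with the production volume
name from this thread:

    gluster volume heal GLUSTER1        # index heal: only entries recorded under .glusterfs/indices/xattrop
    gluster volume heal GLUSTER1 full   # full crawl starting from / of the volume

)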
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Did this on the test bed today. It's one server with 3 bricks on
>>>>>>>>>>> the same machine, so take that for what it's worth. Also it still
>>>>>>>>>>> runs 3.8.2. Maybe I'll update and re-run the test.
>>>>>>>>>>>
>>>>>>>>>>> killed brick
>>>>>>>>>>> deleted brick dir
>>>>>>>>>>> recreated brick dir
>>>>>>>>>>> created fake dir on gluster mount
>>>>>>>>>>> set suggested fake attribute on it
>>>>>>>>>>> ran volume start <> force
>>>>>>>>>>>
>>>>>>>>>>> Looked at files it said needed healing and it was just 8 shards
>>>>>>>>>>> that were modified during the few minutes I ran through the steps.
>>>>>>>>>>>
>>>>>>>>>>> Gave it a few minutes and it stayed the same.
>>>>>>>>>>> Ran gluster volume <> heal.
>>>>>>>>>>>
>>>>>>>>>>> It healed all the directories and files you can see over the
>>>>>>>>>>> mount, including fakedir.
>>>>>>>>>>>
>>>>>>>>>>> Same issue for shards though. It adds more shards to heal at a
>>>>>>>>>>> glacier pace. Slight jump in speed if I stat every file and dir
>>>>>>>>>>> in the running VM, but not all shards.
>>>>>>>>>>>
>>>>>>>>>>> It started with 8 shards to heal and is now only at 33 out of
>>>>>>>>>>> 800, and probably won't finish adding for a few days at the rate
>>>>>>>>>>> it goes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > > 1st node worked as expected, took 12 hours to heal 1TB of
>>>>>>>>>>>> > > > data. Load was a little heavy but nothing shocking.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > About an hour after node 1 finished I began the same
>>>>>>>>>>>> > > > process on node2. The heal process kicked in as before and
>>>>>>>>>>>> > > > the files in directories visible from the mount and
>>>>>>>>>>>> > > > .glusterfs healed in a short time. Then it began the crawl
>>>>>>>>>>>> > > > of .shard, adding those files to the heal count, at which
>>>>>>>>>>>> > > > point the entire process basically ground to a halt. After
>>>>>>>>>>>> > > > 48 hours, out of 19k shards it has added 5900 to the heal
>>>>>>>>>>>> > > > list. Load on all 3 machines is negligible. It was
>>>>>>>>>>>> > > > suggested to change cluster.data-self-heal-algorithm to
>>>>>>>>>>>> > > > full and restart the volume, which I did. No effect. Tried
>>>>>>>>>>>> > > > relaunching heal, no effect, regardless of the node picked.
>>>>>>>>>>>> > > > I started each VM and performed a stat of all files from
>>>>>>>>>>>> > > > within it, or a full virus scan, and that seemed to cause
>>>>>>>>>>>> > > > short small spikes in shards added, but not by much. Logs
>>>>>>>>>>>> > > > are showing no real messages indicating anything is going
>>>>>>>>>>>> > > > on. I get hits in the brick log on occasion of null
>>>>>>>>>>>> > > > lookups, making me think it's not really crawling the
>>>>>>>>>>>> > > > shards directory but waiting for a shard lookup to add it.
>>>>>>>>>>>> > > > I'll get the following in the brick log, but not constant,
>>>>>>>>>>>> > > > and sometimes multiple for the same shard.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>>>>>>>>>> > > > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server:
>>>>>>>>>>>> > > > no resolution type for (null) (LOOKUP)
>>>>>>>>>>>> > > > [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>>>>>>>>>> > > > [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
>>>>>>>>>>>> > > > 12591783: LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
>>>>>>>>>>>> > > > ==> (Invalid argument) [Invalid argument]
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > This one repeated about 30 times in a row, then nothing
>>>>>>>>>>>> > > > for 10 minutes, then one hit for one different shard by
>>>>>>>>>>>> > > > itself.
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > How can I determine if heal is actually running? How can
>>>>>>>>>>>> > > > I kill it or force a restart? Does the node I start it
>>>>>>>>>>>> > > > from determine which directory gets crawled to determine
>>>>>>>>>>>> > > > heals?
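(A few ways to check whether the self-heal daemon is alive and actually
working, assuming default log locations:

    gluster volume status GLUSTER1              # lists a Self-heal Daemon entry with a PID per node
    gluster volume heal GLUSTER1 info           # entries currently pending heal
    tail -f /var/log/glusterfs/glustershd.log   # sweep and heal messages appear here while a crawl runs

)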
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > > > David Gossage
>>>>>>>>>>>> > > > Carousel Checks Inc. | System Administrator
>>>>>>>>>>>> > > > Office 708.613.2284
>>>>>>>>>>>> > > >
>>>>>>>>>>>> > >
>>>>>>>>>>>> > > --
>>>>>>>>>>>> > > Thanks,
>>>>>>>>>>>> > > Anuradha.
>>>>>>>>>>>> > >
>>>>>>>>>>>> >
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anuradha.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>