I noticed that my new brick (replacement disk) did not have a .shard directory
created on the brick, if that helps.
I removed the affected brick from the volume and then wiped the disk, did an
add-brick, and everything healed right up. I didn't try to set any attrs or
anything else, just removed and added the brick as new.
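For anyone following along, a minimal sketch of that remove/wipe/re-add sequence on a replica 3 volume; the volume name (myvol), host (node2), device (/dev/sdX), and brick path below are placeholders, so adjust the replica counts and paths to your layout:

  # drop the bad brick out of the replica set (replica count shrinks by one)
  gluster volume remove-brick myvol replica 2 node2:/gluster1/BRICK1/1 force

  # wipe and recreate the brick filesystem and directory
  mkfs.xfs -f /dev/sdX
  mount /dev/sdX /gluster1/BRICK1
  mkdir -p /gluster1/BRICK1/1

  # add it back as a fresh brick and let self-heal repopulate it
  gluster volume add-brick myvol replica 3 node2:/gluster1/BRICK1/1
  gluster volume heal myvol full
  gluster volume heal myvol info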
> On Aug 29, 2016, at 9:49 AM, Darrell Budic <budic at onholyground.com> wrote:
>
> Just to let you know I'm seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly, but now all the shards are queued up in a heal
> list and nothing is healing. I got brick errors logged similar to the ones David
> was getting, on the brick that isn't healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP (null) (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP (null) (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I'll
> try the suggested method here on it shortly.
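For reference, "setting the volume's fattr on the brick dir" usually means copying the trusted.glusterfs.volume-id xattr from a healthy brick onto the new, empty brick directory so the brick process will start on it. A rough sketch, assuming the volume is named gv0-rep as the log prefix above suggests, and using a placeholder brick path:

  # on a node with a healthy brick: read the volume-id
  getfattr -n trusted.glusterfs.volume-id -e hex /path/to/brick

  # on the node with the new brick dir: apply the same volume-id
  setfattr -n trusted.glusterfs.volume-id -v 0x<value-from-above> /path/to/brick

  # restart the brick process and trigger the heal
  gluster volume start gv0-rep force
  gluster volume heal gv0-rep full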
>
> -Darrell
>
>
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <kdhananj at redhat.com> wrote:
>>
>> Got it. Thanks.
>>
>> I tried the same test and shd crashed with SIGABRT (well, that's
>> because I compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding
>> further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <dgossage at carouselchecks.com> wrote:
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <dgossage at carouselchecks.com> wrote:
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhananj at redhat.com> wrote:
>> Could you attach both client and brick logs? Meanwhile I will try these
>> steps out on my machines and see if it is easily reproducible.
>>
>>
>> Hoping 7z files are accepted by the mail server.
>>
>> Looks like the zip file is awaiting approval due to size.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <dgossage at carouselchecks.com> wrote:
>> CentOS 7, Gluster 3.8.3
>>
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
>>
>> Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
>> Following the steps detailed in previous recommendations, I began the process of
>> replacing and healing bricks one node at a time (a sketch of these commands
>> follows the list below):
>>
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
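A minimal sketch of those five steps as shell commands; GLUSTER1 is taken from the brick-log prefix further down, the brick path from the volume info above, and the RAID rebuild itself is controller-specific and only shown as a placeholder:

  # 1) find and kill the brick process on this node
  gluster volume status GLUSTER1
  kill <brick-pid>

  # 2) rebuild the underlying storage (raid6 -> raid10) and remake the filesystem
  mkfs.xfs -f /dev/<new-raid10-device>
  mount /dev/<new-raid10-device> /gluster1/BRICK1

  # 3) recreate the brick directory
  mkdir -p /gluster1/BRICK1/1

  # 4) restart the brick processes, 5) kick off a full heal
  gluster volume start GLUSTER1 force
  gluster volume heal GLUSTER1 full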
>>
>> The 1st node worked as expected and took 12 hours to heal 1TB of data. Load was a
>> little heavy but nothing shocking.
>>
>> About an hour after node 1 finished I began the same process on node 2. The
>> heal process kicked in as before, and the files in directories visible from the
>> mount and in .glusterfs healed in short order. Then it began the crawl of .shard,
>> adding those files to the heal count, at which point the entire process basically
>> ground to a halt. After 48 hours, out of 19k shards it has added 5900 to the heal
>> list. Load on all 3 machines is negligible. It was suggested to set
>> cluster.data-self-heal-algorithm to full and restart the volume, which I did. No
>> effect. Tried relaunching the heal, no effect, regardless of which node it was
>> started from. I started each VM and performed a stat of all files from within it,
>> or a full virus scan, and that seemed to cause short small spikes in shards added,
>> but not by much. The logs show no real messages indicating anything is going on.
>> I get occasional hits in the brick log of null lookups, making me think it's not
>> really crawling the shards directory but waiting for a shard lookup to add it.
>> I'll get the following in the brick log, but not constantly, and sometimes
>> multiple entries for the same shard.
>>
>> [2016-08-29 08:31:57.478125] W [MSGID: 115009] [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type for (null) (LOOKUP)
>> [2016-08-29 08:31:57.478170] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid argument) [Invalid argument]
>>
>> This one repeated about 30 times in a row, then nothing for 10 minutes, then
>> one hit for one different shard by itself (a rough way to tally these is
>> sketched below).
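One rough way to tally those shard hits (assuming the brick log lives at the default path derived from the brick directory) and to watch whether the pending-heal counters are moving at all:

  # pending-heal counts per brick, refreshed every minute
  watch -n 60 'gluster volume heal GLUSTER1 statistics heal-count'

  # count the null-lookup errors per shard in the brick log
  grep server_lookup_cbk /var/log/glusterfs/bricks/gluster1-BRICK1-1.log \
    | grep -o '[0-9a-f-]\{36\}\.[0-9]\+' | sort | uniq -c | sort -rn | head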
>>
>> How can I determine if the heal is actually running? How can I kill it or
>> force a restart? Does the node I start it from determine which directory gets
>> crawled to determine heals?
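The usual knobs for checking on and restarting the self-heal daemon look roughly like this, again assuming the volume name is GLUSTER1; toggling cluster.self-heal-daemon is one way to force the shd processes to restart:

  # is the self-heal daemon up, and what is still queued?
  gluster volume status GLUSTER1
  gluster volume heal GLUSTER1 info
  gluster volume heal GLUSTER1 statistics

  # restart the self-heal daemons by toggling them, then re-trigger the crawl
  gluster volume set GLUSTER1 cluster.self-heal-daemon off
  gluster volume set GLUSTER1 cluster.self-heal-daemon on
  gluster volume heal GLUSTER1 full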
>>
>> David Gossage
>> Carousel Checks Inc. | System Administrator
>> Office 708.613.2284