Artem Russakovskii
2020-May-15 22:14 UTC
[Gluster-users] gluster 5.11 issue: Endless performing entry selfheal / Completed entry selfheal on the same pending heal files after server reboot + very high load
Hi Hari,

Thanks for the analysis.

1. I understand why the rebooted node would have 0 heal state files whereas the counts on the other nodes would be going up. The problem with 5.11 was that a bug prevented the heal from completing, which, as I mentioned, was fixed in 5.13.

2. If the number of files to heal is known, why are operations like md5sum needed to perform the heal at all? Why doesn't the auto heal agent just go through the list one file at a time, perform the heal, and then mark the files as healed? From what I've seen, even requesting a manual heal didn't do anything until those files were touched in some way (like md5sum).

3. cscope? I am unable to find what you're talking about - http://cscope.sourceforge.net/ seems to be a code search tool?

4. What's the best way to analyze and present the data about what's raising the load to 100+ on all nodes after reboots? If there's some monitoring tool I could run, reproduce the issue, then save the output and send it here, that'd be great (the commands I have in mind are listed after the strace log below).

5. Based on what I've seen when I straced Apache processes, they would all hang for a long time when they ran across some of the gluster data. Specifically, one of the directories (with only several files, nothing special), which may be queried in some way by a lot of page loads (for context, we run 2 busy WordPress sites), would come up a lot in the straces and hang. I even tried moving this directory out of gluster and adding a symlink, but I'm not sure that helped. I wonder whether several conditions combine to make frequent read access to a certain resource in gluster spike out of control and go haywire.

Here, the gluster root directory is SITE/public/wp-content/uploads, and the dirs I saw the most are symlinked as follows:

lrwxrwxrwx 1 archon810 users 73 Apr 26 15:47 wp-security-audit-log -> SITE/public/wp-content/wp-security-audit-log/
(it belongs to https://wordpress.org/plugins/wp-security-audit-log/, but the Pro version)

A sample strace log of requests to this dir:

[2020-05-15 15:10:15 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[2020-05-15 15:10:15 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:15 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:15 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:18 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
[2020-05-15 15:10:18 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:18 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:18 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:19 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[2020-05-15 15:10:19 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:19 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:19 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:21 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
[2020-05-15 15:10:21 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:21 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:21 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:23 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[2020-05-15 15:10:23 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:23 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:23 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:25 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[2020-05-15 15:10:25 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:25 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:25 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:27 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
[2020-05-15 15:10:27 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
[2020-05-15 15:10:27 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
[2020-05-15 15:10:27 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
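For reference, here is roughly what I'm planning to run on each node and capture the next time I reboot a server, unless you suggest something better. This is just a sketch: it assumes the sysstat tools (pidstat, iostat) are installed, the output file names and the one-minute window are placeholders, and profiling is only turned on temporarily.

# load average and per-process CPU, in batch mode so it can be saved and sent
uptime > load-$(hostname).txt
top -b -d 5 -n 12 >> load-$(hostname).txt

# per-process CPU and disk I/O every 5 seconds for a minute (sysstat)
pidstat -u -d 5 12 > pidstat-$(hostname).txt

# per-device utilization/latency, to see whether the attached block devices are saturated
iostat -xz 5 12 > iostat-$(hostname).txt

# gluster's own per-brick FOP statistics (profiling adds some overhead, so stop it afterwards)
gluster volume profile SNIP_data1 start
sleep 60
gluster volume profile SNIP_data1 info > gluster-profile-$(hostname).txt
gluster volume profile SNIP_data1 stop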
The load spikes and everything hanging are seriously stressing me out, because at any point they can cause an outage for us.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>

On Thu, May 7, 2020 at 11:32 PM Hari Gowtham <hgowtham at redhat.com> wrote:

> The heal info is working fine. The explanation of what's happening:
> When a node goes down, changes can't be written to it. The other nodes, which were up, receive the changes and keep track of the fact that these files were changed (note: the changes have not been replicated to the node that was down). Once the down node comes back up, it doesn't know what happened while it was down, but the other nodes know that a few changes didn't make it to the rebooted node, so the rebooted node is blamed by the other nodes. This is what heal info shows: the node that was down has no record of the changes made while it was offline, so it reports 0 files to be healed, while the other nodes, which have the data, report which files need to heal.
> This is the expected behaviour, so as far as the rebooted node is concerned, heal info is working fine.
>
> About healing the file itself:
> Doing an operation on a file triggers client-side heal by design; that's the reason these files get corrected after the md5sum (I hope this is done from the client side, not on the backend itself). So this is expected.
> About the heals not happening for a long time, there can be some issue there.
> @Karthik Subrahmanya <ksubrahm at redhat.com> is the better person to help you with this.
>
> About the CPU usage going higher:
> We need info about what is consuming more CPU.
> Glusterd needs to do a bit of handshaking and reconnecting after a reboot, and a little data is transferred during this as well. A higher number of nodes can contribute to the hike. Similarly, if a heal is happening, it can increase the usage.
> So we need info about what is consuming the CPU to know whether it's expected or not.
> If this hike is expected, you can try using cscope to restrict the cpu usage by that particular process.
>
> On Tue, Apr 28, 2020 at 3:02 AM Artem Russakovskii <archon810 at gmail.com> wrote:
>
>> Good news: after upgrading to 5.13 and running this scenario again, the self heal actually succeeded without my intervention following a server reboot.
>>
>> The load was still high during this process, but at least the endless heal issue is resolved.
>>
>> I'd still love to hear from the team on managing heal load spikes.
>>
>> Sincerely,
>> Artem
>>
>> --
>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>
>> On Sun, Apr 26, 2020 at 3:13 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I've been observing this problem for a long time now and it's time to finally figure out what's going on.
>>>
>>> We're running gluster 5.11 and have a 10TB 1 x 4 = 4 replicate volume. I'll include its slightly redacted config below.
>>>
>>> When I reboot one of the servers and it goes offline for a bit, heal info tells me, once it comes back, that there are some files and dirs that are "heal pending". "split-brain" and "possibly healing" are 0 - only "heal pending" is >0.
>>>
>>> 1. For some reason, the server that was rebooted shows "heal pending" 0. All other servers show "heal pending" with some number, say 65.
>>> 2. We have cluster.self-heal-daemon enabled.
>>> 3. The logs are full of "performing entry selfheal" and "completed entry selfheal" messages that continue to print endlessly.
>>> 4. This "heal pending" number never goes down by itself, but it does if I run some operation on the files, like md5sum.
>>> 5. When the server goes down for reboot, and especially when it comes back, the load on ALL servers shoots through the roof (load of 100+) and ends up bringing everything down, including apache and nginx. My theory is that self-heal kicks in so hard that it kills IO on these attached Linode block devices. However, after some time - say 10 minutes - the load subsides, but the "heal pending" entries remain and the gluster logs continue to output "performing entry selfheal" and "completed entry selfheal" messages. This load spike has become a huge issue for us because it brings down the whole site for entire minutes.
>>> 6. At this point in my investigation, I noticed that the selfheal messages actually repeat for the same gfids over and over.
>>>
>>> [2020-04-26 21:32:29.877987] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>> [2020-04-26 21:32:29.901246] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources sinks=0 1 2
>>> [2020-04-26 21:32:32.171959] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>> [2020-04-26 21:32:32.225828] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources sinks=0 1 2
>>> [2020-04-26 21:32:33.346990] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>> [2020-04-26 21:32:33.374413] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources sinks=0 1 2
>>>
>>> 7. I used gfid-resolver.sh from https://gist.github.com/4392640.git to resolve this gfid to the real location, and yup - it was one of the files (a dir actually) listed as "heal pending" in heal info. As soon as I ran md5sum on the file inside (which was also listed in "heal pending"), the log messages stopped repeating for this entry and it disappeared from the "heal pending" heal info. These were the final log lines:
>>>
>>> [2020-04-26 21:32:35.642662] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-SNIP_data1-replicate-0: performing metadata selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>> [2020-04-26 21:32:35.658714] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed metadata selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=0 [1] 2 sinks=3
>>> [2020-04-26 21:32:35.686509] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>> [2020-04-26 21:32:35.720387] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=0 [1] 2 sinks=3
>>>
>>> I have to repeat this song and dance every time I reboot servers and run md5sum on each "heal pending" file, or else the messages continue, presumably indefinitely. In the meantime, the files seem to be fine when accessed.
>>>
>>> What I don't understand is:
>>>
>>> 1. Why doesn't gluster just heal them properly instead of getting stuck? Or maybe this was fixed in v6 or v7, which I haven't upgraded to because I'm waiting for another unrelated issue to be fixed?
>>> 2. Why does heal info show 0 "heal pending" files on the server that was rebooted, but all other servers show the same number of "heal pending" entries >0?
>>> 3. Why are there these insane load spikes upon going down and especially upon coming back online? Is it related to the issue here? I'm pretty sure it didn't happen in previous versions of gluster, when this issue didn't manifest - I could easily bring down one of the servers without it creating havoc when it came back online.
>>>
>>> Here's the volume info:
>>>
>>> Volume Name: SNIP_data1
>>> Type: Replicate
>>> Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 4 = 4
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: SNIP:/mnt/SNIP_block1/SNIP_data1
>>> Brick2: SNIP:/mnt/SNIP_block1/SNIP_data1
>>> Brick3: SNIP:/mnt/SNIP_block1/SNIP_data1
>>> Brick4: SNIP:/mnt/SNIP_block1/SNIP_data1
>>> Options Reconfigured:
>>> performance.client-io-threads: on
>>> nfs.disable: on
>>> transport.address-family: inet
>>> cluster.self-heal-daemon: enable
>>> performance.cache-size: 1GB
>>> cluster.lookup-optimize: on
>>> performance.read-ahead: off
>>> client.event-threads: 4
>>> server.event-threads: 4
>>> performance.io-thread-count: 32
>>> cluster.readdir-optimize: on
>>> features.cache-invalidation: on
>>> features.cache-invalidation-timeout: 600
>>> performance.stat-prefetch: on
>>> performance.cache-invalidation: on
>>> performance.md-cache-timeout: 600
>>> network.inode-lru-limit: 500000
>>> performance.parallel-readdir: on
>>> performance.readdir-ahead: on
>>> performance.rda-cache-limit: 256MB
>>> network.remote-dio: enable
>>> network.ping-timeout: 5
>>> cluster.quorum-type: fixed
>>> cluster.quorum-count: 1
>>> cluster.granular-entry-heal: enable
>>> cluster.data-self-heal-algorithm: full
>>>
>>> Appreciate any insight. Thank you.
>>>
>>> Sincerely,
>>> Artem
>>>
>>> --
>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>
>> ________
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://bluejeans.com/441850968
>>
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> Regards,
> Hari Gowtham.
Strahil Nikolov
2020-May-16 06:20 UTC
[Gluster-users] gluster 5.11 issue: Endless performing entry selfheal / Completed entry selfheal on the same pending heal files after server reboot + very high load
Hi Hari,

I can confirm that I have observed what Artem has described, on v7.X. When one of my Gluster nodes was down for some time (so there was more data to heal) and was powered back up, the node was barely responsive over ssh (on a separate network from the gluster one) and the system stayed heavily loaded until the heal was over. I'm using the defaults for the healing options. Does the healing process require some checksumming of the blamed entries?

@Artem, Gluster has 2 kinds of healing:

A) FUSE clients can trigger a heal of a file which is not the same on all bricks. This is why md5sum causes a heal. Usually even a simple 'stat' will trigger it, but I have noticed that with sharding, Gluster might require reading the file at an offset that falls within the affected shard for this type of heal to work.
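So instead of running md5sum on everything, something along these lines, run from a FUSE mount of the volume, should be enough to touch each pending entry. This is only a rough sketch: the mount point path is an example, the volume name is taken from your volume info, and entries that heal info reports only as <gfid:...> (not yet resolved to a path) won't be matched by it.

# take the blamed paths from heal info and stat each one through the FUSE mount
gluster volume heal SNIP_data1 info | grep '^/' | sort -u | while read -r entry; do
    stat "/mnt/SNIP_data1${entry}" > /dev/null
done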
B) There is a heal daemon that runs every 15 minutes or so, which crawls over the blamed entries and triggers healing.

Also, as far as I know, each file that is being healed is locked for the duration of the heal. That is the reason oVirt uses sharding: instead of the whole disk image being locked, only a small piece (a shard) is locked until it is healed.

@Artem, what is the average size of the files for your Apache servers?

Best Regards,
Strahil Nikolov