Pranith Kumar Karampuri
2018-Aug-21 04:43 UTC
[Gluster-users] Gluster 3.12.12: performance during heal and in general
On Mon, Aug 20, 2018 at 3:20 PM Hu Bert <revirii at googlemail.com> wrote:

> Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
> Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
> RAID Controller; operating system running on a raid1, then 4 disks
> (JBOD) as bricks.
>
> Ok, i ran perf for a few seconds.
>
> ------------------------
> perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
> ^C[ perf record: Woken up 378 times to write data ]
> Warning:
> Processed 83690 events and lost 96 chunks!
>
> Check IO/CPU overload!
>
> [ perf record: Captured and wrote 423.087 MB
>   /tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
> ------------------------
>
> I copied a couple of lines:
>
> +   8.10%  0.00%  glusteriotwr22  [unknown]          [k] 0xffffffffffffffff
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] iterate_dir
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] sys_getdents
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] filldir
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] do_syscall_64
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
> +   8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] 0xffff80c60db8ef2b
> +   8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] readdir64
> +   8.10%  0.00%  glusteriotwr22  index.so           [.] 0xffff80c6192a1888
> +   8.10%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_htree_fill_tree
> +   8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_readdir
> +   7.95%  0.12%  glusteriotwr22  [kernel.kallsyms]  [k] htree_dirblock_to_tree
> +   5.78%  0.96%  glusteriotwr22  [kernel.kallsyms]  [k] __ext4_read_dirblock
> +   4.80%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_bread
> +   4.78%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_getblk
> +   4.72%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] __getblk_gfp
> +   4.57%  0.00%  glusteriotwr3   [unknown]          [k] 0xffffffffffffffff
> +   4.55%  0.00%  glusteriotwr3   [kernel.kallsyms]  [k] do_syscall_64
>
> Do you need different or additional information?

This looks like there are a lot of readdirs going on, which is different
from what we observed earlier. For how many seconds did you run perf
record? Would it be possible for you to run it for a bit longer, maybe a
minute? I just want to be sure that the data actually represents what we
are observing.
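If a longer run is easier to script, something along these lines should
work (just a sketch; substitute the actual brick PID, and note that a
one-minute recording with --call-graph=dwarf can produce a fairly large
output file):

timeout -s INT 60 perf record --call-graph=dwarf -p <brick-pid> -o /tmp/perf.${HOSTNAME}.bricksdd1.out
perf report -i /tmp/perf.${HOSTNAME}.bricksdd1.out > /tmp/perf.${HOSTNAME}.bricksdd1.report.txt

timeout sends the same interrupt as Ctrl-C after 60 seconds, so the impact
on the brick stays bounded; the second command writes the report to a text
file that can be attached to this thread.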
> 2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>> Even the brick which doesn't have high CPU seems to have same number of
>> lookups, so that's not it.
>> Is there any difference at all between the machines which have high CPU
>> vs low CPU?
>> I think the only other thing I would do is to install perf tools and try
>> to figure out the call-graph which is leading to so much CPU
>>
>> This affects performance of the brick I think, so you may have to do it
>> quickly and for less time.
>>
>> perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
>> then
>> perf report -i </path/to/output/given/in/the/previous/command>
>>
>> On Mon, Aug 20, 2018 at 2:40 PM Hu Bert <revirii at googlemail.com> wrote:
>>>
>>> gluster volume heal shared info | grep -i number
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>> Number of entries: 0
>>>
>>> Looks good to me.
>>>
>>> 2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>> There are a lot of Lookup operations in the system. But I am not able
>>>> to find why. Could you check the output of
>>>>
>>>> # gluster volume heal <volname> info | grep -i number
>>>>
>>>> it should print all zeros.
>>>>
>>>> On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <revirii at googlemail.com> wrote:
>>>>>
>>>>> I don't know what you exactly mean with workload, but the main
>>>>> function of the volume is storing (incl. writing, reading) images
>>>>> (from hundreds of bytes up to 30 MBs, overall ~7TB). The work is done
>>>>> by apache tomcat servers writing to / reading from the volume. Besides
>>>>> images there are some text files and binaries that are stored on the
>>>>> volume and get updated regularly (every x hours); we'll try to migrate
>>>>> the latter ones to local storage asap.
>>>>>
>>>>> Interestingly it's only one process (and its threads) of the same
>>>>> brick on 2 of the gluster servers that consumes the CPU.
>>>>>
>>>>> gluster11: bricksdd1; not healed; full CPU
>>>>> gluster12: bricksdd1; got healed; normal CPU
>>>>> gluster13: bricksdd1; got healed; full CPU
>>>>>
>>>>> Besides: performance during heal (e.g. gluster12, bricksdd1) was way
>>>>> better than it is now. I've attached 2 pngs showing the differing cpu
>>>>> usage of last week before/after heal.
>>>>>
>>>>> 2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>> There seems to be too many lookup operations compared to any other
>>>>>> operations. What is the workload on the volume?
>>>>>>
>>>>>> On Fri, Aug 17, 2018 at 12:47 PM Hu Bert <revirii at googlemail.com> wrote:
>>>>>>>
>>>>>>> i hope i did get it right.
>>>>>>>
>>>>>>> gluster volume profile shared start
>>>>>>> wait 10 minutes
>>>>>>> gluster volume profile shared info
>>>>>>> gluster volume profile shared stop
>>>>>>>
>>>>>>> If that's ok, i've attached the output of the info command.
>>>>>>>
>>>>>>> 2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>> Please do volume profile also for around 10 minutes when CPU% is
>>>>>>>> high.
>>>>>>>>
>>>>>>>> On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
>>>>>>>> <pkarampu at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> As per the output, all io-threads are using a lot of CPU. It is
>>>>>>>>> better to check what the volume profile is to see what is leading
>>>>>>>>> to so much work for io-threads. Please follow the documentation at
>>>>>>>>> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
>>>>>>>>> section: "Running GlusterFS Volume Profile Command"
>>>>>>>>> and attach output of "gluster volume profile info",
>>>>>>>>>
>>>>>>>>> On Fri, Aug 17, 2018 at 11:24 AM Hu Bert <revirii at googlemail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Good morning,
>>>>>>>>>>
>>>>>>>>>> i ran the command during 100% CPU usage and attached the file.
>>>>>>>>>> Hopefully it helps.
>>>>>>>>>>
>>>>>>>>>> 2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>>>>>>>>> Could you do the following on one of the nodes where you are
>>>>>>>>>>> observing high CPU usage and attach that file to this thread?
>>>>>>>>>>> We can find what threads/processes are leading to high usage.
>>>>>>>>>>> Do this for say 10 minutes when you see the ~100% CPU.
>>>>>>>>>>>
>>>>>>>>>>> top -bHd 5 > /tmp/top.${HOSTNAME}.txt
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 15, 2018 at 2:37 PM Hu Bert <revirii at googlemail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello again :-)
>>>>>>>>>>>>
>>>>>>>>>>>> The self heal must have finished as there are no log entries
>>>>>>>>>>>> in glustershd.log files anymore. According to munin disk
>>>>>>>>>>>> latency (average io wait) has gone down to 100 ms, and disk
>>>>>>>>>>>> utilization has gone down to ~60% - both on all servers and
>>>>>>>>>>>> hard disks.
>>>>>>>>>>>>
>>>>>>>>>>>> But now system load on 2 servers (which were in the good
>>>>>>>>>>>> state) fluctuates between 60 and 100; the server with the
>>>>>>>>>>>> formerly failed disk has a load of 20-30. I've uploaded some
>>>>>>>>>>>> munin graphics of the cpu usage:
>>>>>>>>>>>>
>>>>>>>>>>>> https://abload.de/img/gluster11_cpu31d3a.png
>>>>>>>>>>>> https://abload.de/img/gluster12_cpu8sem7.png
>>>>>>>>>>>> https://abload.de/img/gluster13_cpud7eni.png
>>>>>>>>>>>>
>>>>>>>>>>>> This can't be normal. 2 of the servers under heavy load and
>>>>>>>>>>>> one not that much. Does anyone have an explanation of this
>>>>>>>>>>>> strange behaviour?
>>>>>>>>>>>>
>>>>>>>>>>>> Thx :-)
>>>>>>>>>>>>
>>>>>>>>>>>> 2018-08-14 9:37 GMT+02:00 Hu Bert <revirii at googlemail.com>:
>>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>>
>>>>>>>>>>>>> well, it seems the heal has finally finished. Couldn't
>>>>>>>>>>>>> see/find any related log message; is there such a message in
>>>>>>>>>>>>> a specific log file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> But i see the same behaviour when the last heal finished:
>>>>>>>>>>>>> all CPU cores are consumed by brick processes; not only by
>>>>>>>>>>>>> the formerly failed bricksdd1, but by all 4 brick processes
>>>>>>>>>>>>> (and their threads). Load goes up to > 100 on the 2 servers
>>>>>>>>>>>>> with the not-failed brick, and glustershd.log gets filled
>>>>>>>>>>>>> with a lot of entries. Load on the server with the then
>>>>>>>>>>>>> failed brick not that high, but still ~60.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this behaviour normal? Is there some post-heal after a
>>>>>>>>>>>>> heal has finished?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thx in advance :-)

--
Pranith
Pranith Kumar Karampuri
2018-Aug-21 05:13 UTC
[Gluster-users] Gluster 3.12.12: performance during heal and in general
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
<pkarampu at redhat.com> wrote:

> This looks like there are a lot of readdirs going on, which is different
> from what we observed earlier. For how many seconds did you run perf
> record? Would it be possible for you to run it for a bit longer, maybe a
> minute? I just want to be sure that the data actually represents what we
> are observing.

I found one code path that does readdirs on lookup. Could you give me the
output of

ls -l <brick-path>/.glusterfs/indices/xattrop

on all three bricks? It can probably give a correlation to see whether it
is indeed the same issue or not.
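If it is easier, something like this on each of the three servers should
be enough (just a sketch; replace <brick-path> with the actual brick
directory, e.g. the bricksdd1 brick root):

ls -l <brick-path>/.glusterfs/indices/xattrop > /tmp/xattrop.${HOSTNAME}.txt
ls <brick-path>/.glusterfs/indices/xattrop | wc -l

The first command saves the listing that can be attached here; the second
just prints how many entries are currently in that index directory.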
--
Pranith