Hu Bert
2018-Jul-27 05:41 UTC
[Gluster-users] Gluster 3.12.12: performance during heal and in general
Good Morning :-) on server gluster11 about 1.25 million and on gluster13 about 1.35 million log entries in glustershd.log file. About 70 GB got healed, overall ~700GB of 2.0TB. Doesn't seem to run faster. I'm calling 'find...' whenever i notice that it has finished. Hmm... is it possible and reasonable to run 2 finds in parallel, maybe on different subdirectories? E.g. running one one $volume/public/ and on one $volume/private/ ? 2018-07-26 11:29 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:> > > On Thu, Jul 26, 2018 at 2:41 PM, Hu Bert <revirii at googlemail.com> wrote: >> >> > Sorry, bad copy/paste :-(. >> >> np :-) >> >> The question regarding version 4.1 was meant more generally: does >> gluster v4.0 etc. have a better performance than version 3.12 etc.? >> Just curious :-) Sooner or later we have to upgrade anyway. > > > You can check what changed @ > https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance > https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance > >> >> >> btw.: gluster12 was the node with the failed brick, and i started the >> full heal on this node (has the biggest uuid as well). Is it normal >> that the glustershd.log on this node is rather empty (some hundred >> entries), but the glustershd.log files on the 2 other nodes have >> hundreds of thousands of entries? > > > heals happen on the good bricks, so this is expected. > >> >> (sry, mail twice, didn't go to the list, but maybe others are >> interested... :-) ) >> >> 2018-07-26 10:17 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>: >> > >> > >> > On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert <revirii at googlemail.com> >> > wrote: >> >> >> >> Hi Pranith, >> >> >> >> thanks a lot for your efforts and for tracking "my" problem with an >> >> issue. >> >> :-) >> >> >> >> >> >> I've set this params on the gluster volume and will start the >> >> 'find...' command within a short time. I'll probably add another >> >> answer to the list to document the progress. >> >> >> >> btw. - you had some typos: >> >> gluster volume set <volname> cluster.cluster.heal-wait-queue-length >> >> 10000 => cluster is doubled >> >> gluster volume set <volname> cluster.data-self-heal-window-size 16 => >> >> it's actually cluster.self-heal-window-size >> >> >> >> but actually no problem :-) >> > >> > >> > Sorry, bad copy/paste :-(. >> > >> >> >> >> >> >> Just curious: would gluster 4.1 improve the performance for healing >> >> and in general for "my" scenario? >> > >> > >> > No, this issue is present in all the existing releases. But it is >> > solvable. >> > You can follow that issue to see progress and when it is fixed etc. >> > >> >> >> >> >> >> 2018-07-26 8:56 GMT+02:00 Pranith Kumar Karampuri >> >> <pkarampu at redhat.com>: >> >> > Thanks a lot for detailed write-up, this helps find the bottlenecks >> >> > easily. >> >> > On a high level, to handle this directory hierarchy i.e. lots of >> >> > directories >> >> > with files, we need to improve healing >> >> > algorithms. Based on the data you provided, we need to make the >> >> > following >> >> > enhancements: >> >> > >> >> > 1) At the moment directories are healed one at a time, but files can >> >> > be >> >> > healed upto 64 in parallel per replica subvolume. >> >> > So if you have nX2 or nX3 distributed subvolumes, it can heal 64n >> >> > number >> >> > of >> >> > files in parallel. >> >> > >> >> > I raised https://github.com/gluster/glusterfs/issues/477 to track >> >> > this. 
>> >> > In >> >> > the mean-while you can use the following work-around: >> >> > a) Increase background heals on the mount: >> >> > gluster volume set <volname> cluster.background-self-heal-count 256 >> >> > gluster volume set <volname> cluster.cluster.heal-wait-queue-length >> >> > 10000 >> >> > find <mnt> -type d | xargs stat >> >> > >> >> > one 'find' will trigger 10256 directories. So you may have to do this >> >> > periodically until all directories are healed. >> >> > >> >> > 2) Self-heal heals a file 128KB at a >> >> > time(data-self-heal-window-size). I >> >> > think for your environment bumping upto MBs is better. Say 2MB i.e. >> >> > 16*128KB? >> >> > >> >> > Command to do that is: >> >> > gluster volume set <volname> cluster.data-self-heal-window-size 16 >> >> > >> >> > >> >> > On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <revirii at googlemail.com> >> >> > wrote: >> >> >> >> >> >> Hi Pranith, >> >> >> >> >> >> Sry, it took a while to count the directories. I'll try to answer >> >> >> your >> >> >> questions as good as possible. >> >> >> >> >> >> > What kind of data do you have? >> >> >> > How many directories in the filesystem? >> >> >> > On average how many files per directory? >> >> >> > What is the depth of your directory hierarchy on average? >> >> >> > What is average filesize? >> >> >> >> >> >> We have mostly images (more than 95% of disk usage, 90% of file >> >> >> count), some text files (like css, jsp, gpx etc.) and some binaries. >> >> >> >> >> >> There are about 190.000 directories in the file system; maybe there >> >> >> are some more because we're hit by bug 1512371 (parallel-readdir >> >> >> TRUE prevents directories listing). But the number of directories >> >> >> could/will rise in the future (maybe millions). >> >> >> >> >> >> files per directory: ranges from 0 to 100, on average it should be >> >> >> 20 >> >> >> files per directory (well, at least in the deepest dirs, see >> >> >> explanation below). >> >> >> >> >> >> Average filesize: ranges from a few hundred bytes up to 30 MB, on >> >> >> average it should be 2-3 MB. >> >> >> >> >> >> Directory hierarchy: maximum depth as seen from within the volume is >> >> >> 6, the average should be 3. >> >> >> >> >> >> volume name: shared >> >> >> mount point on clients: /data/repository/shared/ >> >> >> below /shared/ there are 2 directories: >> >> >> - public/: mainly calculated images (file sizes from a few KB up to >> >> >> max 1 MB) and some resouces (small PNGs with a size of a few hundred >> >> >> bytes). >> >> >> - private/: mainly source images; file sizes from 50 KB up to 30MB >> >> >> >> >> >> We migrated from a NFS server (SPOF) to glusterfs and simply copied >> >> >> our files. The images (which have an ID) are stored in the deepest >> >> >> directories of the dir tree. I'll better explain it :-) >> >> >> >> >> >> directory structure for the images (i'll omit some other >> >> >> miscellaneous >> >> >> stuff, but it looks quite similar): >> >> >> - ID of an image has 7 or 8 digits >> >> >> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of >> >> >> ID)/$ID.jpg >> >> >> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of >> >> >> ID)/$ID/$misc_formats.jpg >> >> >> >> >> >> That's why we have that many (sub-)directories. Files are only >> >> >> stored >> >> >> in the lowest directory hierarchy. I hope i could make our structure >> >> >> at least a bit more transparent. >> >> >> >> >> >> i hope there's something we can do to raise performance a bit. 
thx >> >> >> in >> >> >> advance :-) >> >> >> >> >> >> >> >> >> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri >> >> >> <pkarampu at redhat.com>: >> >> >> > >> >> >> > >> >> >> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <revirii at googlemail.com> >> >> >> > wrote: >> >> >> >> >> >> >> >> Well, over the weekend about 200GB were copied, so now there are >> >> >> >> ~400GB copied to the brick. That's far beyond a speed of 10GB per >> >> >> >> hour. If I copied the 1.6 TB directly, that would be done within >> >> >> >> max >> >> >> >> 2 >> >> >> >> days. But with the self heal this will take at least 20 days >> >> >> >> minimum. >> >> >> >> >> >> >> >> Why is the performance that bad? No chance of speeding this up? >> >> >> > >> >> >> > >> >> >> > What kind of data do you have? >> >> >> > How many directories in the filesystem? >> >> >> > On average how many files per directory? >> >> >> > What is the depth of your directory hierarchy on average? >> >> >> > What is average filesize? >> >> >> > >> >> >> > Based on this data we can see if anything can be improved. Or if >> >> >> > there >> >> >> > are >> >> >> > some >> >> >> > enhancements that need to be implemented in gluster to address >> >> >> > this >> >> >> > kind >> >> >> > of >> >> >> > data layout >> >> >> >> >> >> >> >> >> >> >> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii at googlemail.com>: >> >> >> >> > hmm... no one any idea? >> >> >> >> > >> >> >> >> > Additional question: the hdd on server gluster12 was changed, >> >> >> >> > so >> >> >> >> > far >> >> >> >> > ~220 GB were copied. On the other 2 servers i see a lot of >> >> >> >> > entries >> >> >> >> > in >> >> >> >> > glustershd.log, about 312.000 respectively 336.000 entries >> >> >> >> > there >> >> >> >> > yesterday, most of them (current log output) looking like this: >> >> >> >> > >> >> >> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] >> >> >> >> > 0-shared-replicate-3: >> >> >> >> > Completed data selfheal on >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6. >> >> >> >> > sources=0 [2] sinks=1 >> >> >> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >> >> >> > 0-shared-replicate-3: performing metadata selfheal on >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6 >> >> >> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] >> >> >> >> > 0-shared-replicate-3: >> >> >> >> > Completed metadata selfheal on >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6. >> >> >> >> > sources=0 [2] sinks=1 >> >> >> >> > >> >> >> >> > or like this: >> >> >> >> > >> >> >> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >> >> >> > 0-shared-replicate-3: performing metadata selfheal on >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba >> >> >> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] >> >> >> >> > 0-shared-replicate-3: >> >> >> >> > Completed metadata selfheal on >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba. 
>> >> >> >> > sources=[0] 2 sinks=1 >> >> >> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026] >> >> >> >> > [afr-self-heal-entry.c:887:afr_selfheal_entry_do] >> >> >> >> > 0-shared-replicate-3: performing entry selfheal on >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba >> >> >> >> > >> >> >> >> > is this behaviour normal? I'd expect these messages on the >> >> >> >> > server >> >> >> >> > with >> >> >> >> > the failed brick, not on the other ones. >> >> >> >> > >> >> >> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii at googlemail.com>: >> >> >> >> >> Hi there, >> >> >> >> >> >> >> >> >> >> sent this mail yesterday, but somehow it didn't work? Wasn't >> >> >> >> >> archived, >> >> >> >> >> so please be indulgent it you receive this mail again :-) >> >> >> >> >> >> >> >> >> >> We are currently running a replicate setup and are >> >> >> >> >> experiencing a >> >> >> >> >> quite poor performance. It got even worse when within a couple >> >> >> >> >> of >> >> >> >> >> weeks 2 bricks (disks) crashed. Maybe some general information >> >> >> >> >> of >> >> >> >> >> our >> >> >> >> >> setup: >> >> >> >> >> >> >> >> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, >> >> >> >> >> OS >> >> >> >> >> on >> >> >> >> >> separate disks); each server has 4 10TB disks -> each is a >> >> >> >> >> brick; >> >> >> >> >> replica 3 setup (see gluster volume status below). Debian >> >> >> >> >> stretch, >> >> >> >> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients are >> >> >> >> >> connected via 10 GBit ethernet. >> >> >> >> >> >> >> >> >> >> About a month ago and 2 days ago a disk died (on different >> >> >> >> >> servers); >> >> >> >> >> disk were replaced, were brought back into the volume and full >> >> >> >> >> self >> >> >> >> >> heal started. But the speed for this is quite... >> >> >> >> >> disappointing. >> >> >> >> >> Each >> >> >> >> >> brick has ~1.6TB of data on it (mostly the infamous small >> >> >> >> >> files). >> >> >> >> >> The >> >> >> >> >> full heal i started yesterday copied only ~50GB within 24 >> >> >> >> >> hours >> >> >> >> >> (48 >> >> >> >> >> hours: about 100GB) - with >> >> >> >> >> this rate it would take weeks until the self heal finishes. >> >> >> >> >> >> >> >> >> >> After the first heal (started on gluster13 about a month ago, >> >> >> >> >> took >> >> >> >> >> about 3 weeks) finished we had a terrible performance; CPU on >> >> >> >> >> one >> >> >> >> >> or >> >> >> >> >> two of the nodes (gluster11, gluster12) was up to 1200%, >> >> >> >> >> consumed >> >> >> >> >> by >> >> >> >> >> the brick process of the former crashed brick (bricksdd1), >> >> >> >> >> interestingly not on the server with the failed this, but on >> >> >> >> >> the >> >> >> >> >> other >> >> >> >> >> 2 ones... >> >> >> >> >> >> >> >> >> >> Well... am i doing something wrong? Some options wrongly >> >> >> >> >> configured? >> >> >> >> >> Terrible setup? Anyone got an idea? Any additional information >> >> >> >> >> needed? 
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Thx in advance :-) >> >> >> >> >> >> >> >> >> >> gluster volume status >> >> >> >> >> >> >> >> >> >> Volume Name: shared >> >> >> >> >> Type: Distributed-Replicate >> >> >> >> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36 >> >> >> >> >> Status: Started >> >> >> >> >> Snapshot Count: 0 >> >> >> >> >> Number of Bricks: 4 x 3 = 12 >> >> >> >> >> Transport-type: tcp >> >> >> >> >> Bricks: >> >> >> >> >> Brick1: gluster11:/gluster/bricksda1/shared >> >> >> >> >> Brick2: gluster12:/gluster/bricksda1/shared >> >> >> >> >> Brick3: gluster13:/gluster/bricksda1/shared >> >> >> >> >> Brick4: gluster11:/gluster/bricksdb1/shared >> >> >> >> >> Brick5: gluster12:/gluster/bricksdb1/shared >> >> >> >> >> Brick6: gluster13:/gluster/bricksdb1/shared >> >> >> >> >> Brick7: gluster11:/gluster/bricksdc1/shared >> >> >> >> >> Brick8: gluster12:/gluster/bricksdc1/shared >> >> >> >> >> Brick9: gluster13:/gluster/bricksdc1/shared >> >> >> >> >> Brick10: gluster11:/gluster/bricksdd1/shared >> >> >> >> >> Brick11: gluster12:/gluster/bricksdd1_new/shared >> >> >> >> >> Brick12: gluster13:/gluster/bricksdd1_new/shared >> >> >> >> >> Options Reconfigured: >> >> >> >> >> cluster.shd-max-threads: 4 >> >> >> >> >> performance.md-cache-timeout: 60 >> >> >> >> >> cluster.lookup-optimize: on >> >> >> >> >> cluster.readdir-optimize: on >> >> >> >> >> performance.cache-refresh-timeout: 4 >> >> >> >> >> performance.parallel-readdir: on >> >> >> >> >> server.event-threads: 8 >> >> >> >> >> client.event-threads: 8 >> >> >> >> >> performance.cache-max-file-size: 128MB >> >> >> >> >> performance.write-behind-window-size: 16MB >> >> >> >> >> performance.io-thread-count: 64 >> >> >> >> >> cluster.min-free-disk: 1% >> >> >> >> >> performance.cache-size: 24GB >> >> >> >> >> nfs.disable: on >> >> >> >> >> transport.address-family: inet >> >> >> >> >> performance.high-prio-threads: 32 >> >> >> >> >> performance.normal-prio-threads: 32 >> >> >> >> >> performance.low-prio-threads: 32 >> >> >> >> >> performance.least-prio-threads: 8 >> >> >> >> >> performance.io-cache: on >> >> >> >> >> server.allow-insecure: on >> >> >> >> >> performance.strict-o-direct: off >> >> >> >> >> transport.listen-backlog: 100 >> >> >> >> >> server.outstanding-rpc-limit: 128 >> >> >> >> _______________________________________________ >> >> >> >> Gluster-users mailing list >> >> >> >> Gluster-users at gluster.org >> >> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > -- >> >> >> > Pranith >> >> > >> >> > >> >> > >> >> > >> >> > -- >> >> > Pranith >> > >> > >> > >> > >> > -- >> > Pranith > > > > > -- > Pranith
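Put together, the workaround discussed in the thread, with the option names corrected as noted above (cluster.heal-wait-queue-length and cluster.self-heal-window-size), amounts to roughly the following. The volume name (shared) and the client mount point (/data/repository/shared) are taken from the thread; treat this as a sketch rather than verified syntax for every 3.12.x option:

  # allow more client-triggered background heals to be queued
  gluster volume set shared cluster.background-self-heal-count 256
  gluster volume set shared cluster.heal-wait-queue-length 10000

  # heal data in 16 * 128KB = 2MB chunks instead of the 128KB default
  gluster volume set shared cluster.self-heal-window-size 16

  # walk all directories on the FUSE mount to queue heals;
  # repeat once a run has finished, until nothing is pending
  find /data/repository/shared -type d | xargs stat > /dev/null

The parallel variant asked about above would simply split the walk per top-level directory; whether it actually speeds healing up is the open question in the thread, since the triggered heals are still throttled by the background-self-heal-count / heal-wait-queue-length limits set above:

  find /data/repository/shared/public  -type d | xargs stat > /dev/null &
  find /data/repository/shared/private -type d | xargs stat > /dev/null &
  wait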
Pranith Kumar Karampuri
2018-Jul-27 05:55 UTC
[Gluster-users] Gluster 3.12.12: performance during heal and in general
On Fri, Jul 27, 2018 at 11:11 AM, Hu Bert <revirii at googlemail.com> wrote:> Good Morning :-) > > on server gluster11 about 1.25 million and on gluster13 about 1.35 > million log entries in glustershd.log file. About 70 GB got healed, > overall ~700GB of 2.0TB. Doesn't seem to run faster. I'm calling > 'find...' whenever i notice that it has finished. Hmm... is it > possible and reasonable to run 2 finds in parallel, maybe on different > subdirectories? E.g. running one one $volume/public/ and on one > $volume/private/ ? >Do you already have all the 190000 directories already created? If not could you find out which of the paths need it and do a stat directly instead of find?> > 2018-07-26 11:29 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>: > > > > > > On Thu, Jul 26, 2018 at 2:41 PM, Hu Bert <revirii at googlemail.com> wrote: > >> > >> > Sorry, bad copy/paste :-(. > >> > >> np :-) > >> > >> The question regarding version 4.1 was meant more generally: does > >> gluster v4.0 etc. have a better performance than version 3.12 etc.? > >> Just curious :-) Sooner or later we have to upgrade anyway. > > > > > > You can check what changed @ > > https://github.com/gluster/glusterfs/blob/release-4.0/ > doc/release-notes/4.0.0.md#performance > > https://github.com/gluster/glusterfs/blob/release-4.1/ > doc/release-notes/4.1.0.md#performance > > > >> > >> > >> btw.: gluster12 was the node with the failed brick, and i started the > >> full heal on this node (has the biggest uuid as well). Is it normal > >> that the glustershd.log on this node is rather empty (some hundred > >> entries), but the glustershd.log files on the 2 other nodes have > >> hundreds of thousands of entries? > > > > > > heals happen on the good bricks, so this is expected. > > > >> > >> (sry, mail twice, didn't go to the list, but maybe others are > >> interested... :-) ) > >> > >> 2018-07-26 10:17 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com > >: > >> > > >> > > >> > On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert <revirii at googlemail.com> > >> > wrote: > >> >> > >> >> Hi Pranith, > >> >> > >> >> thanks a lot for your efforts and for tracking "my" problem with an > >> >> issue. > >> >> :-) > >> >> > >> >> > >> >> I've set this params on the gluster volume and will start the > >> >> 'find...' command within a short time. I'll probably add another > >> >> answer to the list to document the progress. > >> >> > >> >> btw. - you had some typos: > >> >> gluster volume set <volname> cluster.cluster.heal-wait-queue-length > >> >> 10000 => cluster is doubled > >> >> gluster volume set <volname> cluster.data-self-heal-window-size 16 > => > >> >> it's actually cluster.self-heal-window-size > >> >> > >> >> but actually no problem :-) > >> > > >> > > >> > Sorry, bad copy/paste :-(. > >> > > >> >> > >> >> > >> >> Just curious: would gluster 4.1 improve the performance for healing > >> >> and in general for "my" scenario? > >> > > >> > > >> > No, this issue is present in all the existing releases. But it is > >> > solvable. > >> > You can follow that issue to see progress and when it is fixed etc. > >> > > >> >> > >> >> > >> >> 2018-07-26 8:56 GMT+02:00 Pranith Kumar Karampuri > >> >> <pkarampu at redhat.com>: > >> >> > Thanks a lot for detailed write-up, this helps find the bottlenecks > >> >> > easily. > >> >> > On a high level, to handle this directory hierarchy i.e. lots of > >> >> > directories > >> >> > with files, we need to improve healing > >> >> > algorithms. 
Based on the data you provided, we need to make the > >> >> > following > >> >> > enhancements: > >> >> > > >> >> > 1) At the moment directories are healed one at a time, but files > can > >> >> > be > >> >> > healed upto 64 in parallel per replica subvolume. > >> >> > So if you have nX2 or nX3 distributed subvolumes, it can heal 64n > >> >> > number > >> >> > of > >> >> > files in parallel. > >> >> > > >> >> > I raised https://github.com/gluster/glusterfs/issues/477 to track > >> >> > this. > >> >> > In > >> >> > the mean-while you can use the following work-around: > >> >> > a) Increase background heals on the mount: > >> >> > gluster volume set <volname> cluster.background-self-heal-count > 256 > >> >> > gluster volume set <volname> cluster.cluster.heal-wait- > queue-length > >> >> > 10000 > >> >> > find <mnt> -type d | xargs stat > >> >> > > >> >> > one 'find' will trigger 10256 directories. So you may have to do > this > >> >> > periodically until all directories are healed. > >> >> > > >> >> > 2) Self-heal heals a file 128KB at a > >> >> > time(data-self-heal-window-size). I > >> >> > think for your environment bumping upto MBs is better. Say 2MB i.e. > >> >> > 16*128KB? > >> >> > > >> >> > Command to do that is: > >> >> > gluster volume set <volname> cluster.data-self-heal-window-size 16 > >> >> > > >> >> > > >> >> > On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <revirii at googlemail.com> > >> >> > wrote: > >> >> >> > >> >> >> Hi Pranith, > >> >> >> > >> >> >> Sry, it took a while to count the directories. I'll try to answer > >> >> >> your > >> >> >> questions as good as possible. > >> >> >> > >> >> >> > What kind of data do you have? > >> >> >> > How many directories in the filesystem? > >> >> >> > On average how many files per directory? > >> >> >> > What is the depth of your directory hierarchy on average? > >> >> >> > What is average filesize? > >> >> >> > >> >> >> We have mostly images (more than 95% of disk usage, 90% of file > >> >> >> count), some text files (like css, jsp, gpx etc.) and some > binaries. > >> >> >> > >> >> >> There are about 190.000 directories in the file system; maybe > there > >> >> >> are some more because we're hit by bug 1512371 (parallel-readdir > >> >> >> TRUE prevents directories listing). But the number of directories > >> >> >> could/will rise in the future (maybe millions). > >> >> >> > >> >> >> files per directory: ranges from 0 to 100, on average it should be > >> >> >> 20 > >> >> >> files per directory (well, at least in the deepest dirs, see > >> >> >> explanation below). > >> >> >> > >> >> >> Average filesize: ranges from a few hundred bytes up to 30 MB, on > >> >> >> average it should be 2-3 MB. > >> >> >> > >> >> >> Directory hierarchy: maximum depth as seen from within the volume > is > >> >> >> 6, the average should be 3. > >> >> >> > >> >> >> volume name: shared > >> >> >> mount point on clients: /data/repository/shared/ > >> >> >> below /shared/ there are 2 directories: > >> >> >> - public/: mainly calculated images (file sizes from a few KB up > to > >> >> >> max 1 MB) and some resouces (small PNGs with a size of a few > hundred > >> >> >> bytes). > >> >> >> - private/: mainly source images; file sizes from 50 KB up to 30MB > >> >> >> > >> >> >> We migrated from a NFS server (SPOF) to glusterfs and simply > copied > >> >> >> our files. The images (which have an ID) are stored in the deepest > >> >> >> directories of the dir tree. 
I'll better explain it :-) > >> >> >> > >> >> >> directory structure for the images (i'll omit some other > >> >> >> miscellaneous > >> >> >> stuff, but it looks quite similar): > >> >> >> - ID of an image has 7 or 8 digits > >> >> >> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of > >> >> >> ID)/$ID.jpg > >> >> >> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of > >> >> >> ID)/$ID/$misc_formats.jpg > >> >> >> > >> >> >> That's why we have that many (sub-)directories. Files are only > >> >> >> stored > >> >> >> in the lowest directory hierarchy. I hope i could make our > structure > >> >> >> at least a bit more transparent. > >> >> >> > >> >> >> i hope there's something we can do to raise performance a bit. thx > >> >> >> in > >> >> >> advance :-) > >> >> >> > >> >> >> > >> >> >> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri > >> >> >> <pkarampu at redhat.com>: > >> >> >> > > >> >> >> > > >> >> >> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert < > revirii at googlemail.com> > >> >> >> > wrote: > >> >> >> >> > >> >> >> >> Well, over the weekend about 200GB were copied, so now there > are > >> >> >> >> ~400GB copied to the brick. That's far beyond a speed of 10GB > per > >> >> >> >> hour. If I copied the 1.6 TB directly, that would be done > within > >> >> >> >> max > >> >> >> >> 2 > >> >> >> >> days. But with the self heal this will take at least 20 days > >> >> >> >> minimum. > >> >> >> >> > >> >> >> >> Why is the performance that bad? No chance of speeding this up? > >> >> >> > > >> >> >> > > >> >> >> > What kind of data do you have? > >> >> >> > How many directories in the filesystem? > >> >> >> > On average how many files per directory? > >> >> >> > What is the depth of your directory hierarchy on average? > >> >> >> > What is average filesize? > >> >> >> > > >> >> >> > Based on this data we can see if anything can be improved. Or if > >> >> >> > there > >> >> >> > are > >> >> >> > some > >> >> >> > enhancements that need to be implemented in gluster to address > >> >> >> > this > >> >> >> > kind > >> >> >> > of > >> >> >> > data layout > >> >> >> >> > >> >> >> >> > >> >> >> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii at googlemail.com>: > >> >> >> >> > hmm... no one any idea? > >> >> >> >> > > >> >> >> >> > Additional question: the hdd on server gluster12 was changed, > >> >> >> >> > so > >> >> >> >> > far > >> >> >> >> > ~220 GB were copied. On the other 2 servers i see a lot of > >> >> >> >> > entries > >> >> >> >> > in > >> >> >> >> > glustershd.log, about 312.000 respectively 336.000 entries > >> >> >> >> > there > >> >> >> >> > yesterday, most of them (current log output) looking like > this: > >> >> >> >> > > >> >> >> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] > >> >> >> >> > 0-shared-replicate-3: > >> >> >> >> > Completed data selfheal on > >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6. > >> >> >> >> > sources=0 [2] sinks=1 > >> >> >> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >> >> >> >> > 0-shared-replicate-3: performing metadata selfheal on > >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6 > >> >> >> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] > >> >> >> >> > 0-shared-replicate-3: > >> >> >> >> > Completed metadata selfheal on > >> >> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6. 
> >> >> >> >> > sources=0 [2] sinks=1 > >> >> >> >> > > >> >> >> >> > or like this: > >> >> >> >> > > >> >> >> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >> >> >> >> > 0-shared-replicate-3: performing metadata selfheal on > >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba > >> >> >> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] > >> >> >> >> > 0-shared-replicate-3: > >> >> >> >> > Completed metadata selfheal on > >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba. > >> >> >> >> > sources=[0] 2 sinks=1 > >> >> >> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026] > >> >> >> >> > [afr-self-heal-entry.c:887:afr_selfheal_entry_do] > >> >> >> >> > 0-shared-replicate-3: performing entry selfheal on > >> >> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba > >> >> >> >> > > >> >> >> >> > is this behaviour normal? I'd expect these messages on the > >> >> >> >> > server > >> >> >> >> > with > >> >> >> >> > the failed brick, not on the other ones. > >> >> >> >> > > >> >> >> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii at googlemail.com>: > >> >> >> >> >> Hi there, > >> >> >> >> >> > >> >> >> >> >> sent this mail yesterday, but somehow it didn't work? Wasn't > >> >> >> >> >> archived, > >> >> >> >> >> so please be indulgent it you receive this mail again :-) > >> >> >> >> >> > >> >> >> >> >> We are currently running a replicate setup and are > >> >> >> >> >> experiencing a > >> >> >> >> >> quite poor performance. It got even worse when within a > couple > >> >> >> >> >> of > >> >> >> >> >> weeks 2 bricks (disks) crashed. Maybe some general > information > >> >> >> >> >> of > >> >> >> >> >> our > >> >> >> >> >> setup: > >> >> >> >> >> > >> >> >> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB > DDR4, > >> >> >> >> >> OS > >> >> >> >> >> on > >> >> >> >> >> separate disks); each server has 4 10TB disks -> each is a > >> >> >> >> >> brick; > >> >> >> >> >> replica 3 setup (see gluster volume status below). Debian > >> >> >> >> >> stretch, > >> >> >> >> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients > are > >> >> >> >> >> connected via 10 GBit ethernet. > >> >> >> >> >> > >> >> >> >> >> About a month ago and 2 days ago a disk died (on different > >> >> >> >> >> servers); > >> >> >> >> >> disk were replaced, were brought back into the volume and > full > >> >> >> >> >> self > >> >> >> >> >> heal started. But the speed for this is quite... > >> >> >> >> >> disappointing. > >> >> >> >> >> Each > >> >> >> >> >> brick has ~1.6TB of data on it (mostly the infamous small > >> >> >> >> >> files). > >> >> >> >> >> The > >> >> >> >> >> full heal i started yesterday copied only ~50GB within 24 > >> >> >> >> >> hours > >> >> >> >> >> (48 > >> >> >> >> >> hours: about 100GB) - with > >> >> >> >> >> this rate it would take weeks until the self heal finishes. > >> >> >> >> >> > >> >> >> >> >> After the first heal (started on gluster13 about a month > ago, > >> >> >> >> >> took > >> >> >> >> >> about 3 weeks) finished we had a terrible performance; CPU > on > >> >> >> >> >> one > >> >> >> >> >> or > >> >> >> >> >> two of the nodes (gluster11, gluster12) was up to 1200%, > >> >> >> >> >> consumed > >> >> >> >> >> by > >> >> >> >> >> the brick process of the former crashed brick (bricksdd1), > >> >> >> >> >> interestingly not on the server with the failed this, but on > >> >> >> >> >> the > >> >> >> >> >> other > >> >> >> >> >> 2 ones... 
> >> >> >> >> >> > >> >> >> >> >> Well... am i doing something wrong? Some options wrongly > >> >> >> >> >> configured? > >> >> >> >> >> Terrible setup? Anyone got an idea? Any additional > information > >> >> >> >> >> needed? > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> Thx in advance :-) > >> >> >> >> >> > >> >> >> >> >> gluster volume status > >> >> >> >> >> > >> >> >> >> >> Volume Name: shared > >> >> >> >> >> Type: Distributed-Replicate > >> >> >> >> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36 > >> >> >> >> >> Status: Started > >> >> >> >> >> Snapshot Count: 0 > >> >> >> >> >> Number of Bricks: 4 x 3 = 12 > >> >> >> >> >> Transport-type: tcp > >> >> >> >> >> Bricks: > >> >> >> >> >> Brick1: gluster11:/gluster/bricksda1/shared > >> >> >> >> >> Brick2: gluster12:/gluster/bricksda1/shared > >> >> >> >> >> Brick3: gluster13:/gluster/bricksda1/shared > >> >> >> >> >> Brick4: gluster11:/gluster/bricksdb1/shared > >> >> >> >> >> Brick5: gluster12:/gluster/bricksdb1/shared > >> >> >> >> >> Brick6: gluster13:/gluster/bricksdb1/shared > >> >> >> >> >> Brick7: gluster11:/gluster/bricksdc1/shared > >> >> >> >> >> Brick8: gluster12:/gluster/bricksdc1/shared > >> >> >> >> >> Brick9: gluster13:/gluster/bricksdc1/shared > >> >> >> >> >> Brick10: gluster11:/gluster/bricksdd1/shared > >> >> >> >> >> Brick11: gluster12:/gluster/bricksdd1_new/shared > >> >> >> >> >> Brick12: gluster13:/gluster/bricksdd1_new/shared > >> >> >> >> >> Options Reconfigured: > >> >> >> >> >> cluster.shd-max-threads: 4 > >> >> >> >> >> performance.md-cache-timeout: 60 > >> >> >> >> >> cluster.lookup-optimize: on > >> >> >> >> >> cluster.readdir-optimize: on > >> >> >> >> >> performance.cache-refresh-timeout: 4 > >> >> >> >> >> performance.parallel-readdir: on > >> >> >> >> >> server.event-threads: 8 > >> >> >> >> >> client.event-threads: 8 > >> >> >> >> >> performance.cache-max-file-size: 128MB > >> >> >> >> >> performance.write-behind-window-size: 16MB > >> >> >> >> >> performance.io-thread-count: 64 > >> >> >> >> >> cluster.min-free-disk: 1% > >> >> >> >> >> performance.cache-size: 24GB > >> >> >> >> >> nfs.disable: on > >> >> >> >> >> transport.address-family: inet > >> >> >> >> >> performance.high-prio-threads: 32 > >> >> >> >> >> performance.normal-prio-threads: 32 > >> >> >> >> >> performance.low-prio-threads: 32 > >> >> >> >> >> performance.least-prio-threads: 8 > >> >> >> >> >> performance.io-cache: on > >> >> >> >> >> server.allow-insecure: on > >> >> >> >> >> performance.strict-o-direct: off > >> >> >> >> >> transport.listen-backlog: 100 > >> >> >> >> >> server.outstanding-rpc-limit: 128 > >> >> >> >> _______________________________________________ > >> >> >> >> Gluster-users mailing list > >> >> >> >> Gluster-users at gluster.org > >> >> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > -- > >> >> >> > Pranith > >> >> > > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > Pranith > >> > > >> > > >> > > >> > > >> > -- > >> > Pranith > > > > > > > > > > -- > > Pranith >-- Pranith -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180727/d2cc2e9b/attachment.html>
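Regarding the suggestion to stat only the paths that still need healing instead of walking everything with find: one possible (untested) way to get that list is 'gluster volume heal shared info', which prints the entries still pending per brick. Entries reported as plain paths can be stat'ed directly on the client mount; entries reported only as <gfid:...> cannot be resolved this way. A rough sketch, again assuming the volume name shared and the mount point /data/repository/shared from the thread:

  # collect unique pending entries that are reported as paths (lines starting with "/")
  gluster volume heal shared info | grep '^/' | sort -u > /tmp/pending-paths

  # stat them on the client mount to queue heals for exactly those entries
  while read -r p; do
      stat "/data/repository/shared${p}" > /dev/null
  done < /tmp/pending-paths

This avoids repeatedly crawling all ~190,000 directories once most of them have already been healed, at the cost of re-running the heal info query periodically.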