Pranith Kumar Karampuri
2018-Jul-27 05:55 UTC
[Gluster-users] Gluter 3.12.12: performance during heal and in general
On Fri, Jul 27, 2018 at 11:11 AM, Hu Bert <revirii at googlemail.com> wrote:

> Good Morning :-)
>
> on server gluster11 about 1.25 million and on gluster13 about 1.35
> million log entries in glustershd.log file. About 70 GB got healed,
> overall ~700GB of 2.0TB. Doesn't seem to run faster. I'm calling
> 'find...' whenever i notice that it has finished. Hmm... is it
> possible and reasonable to run 2 finds in parallel, maybe on different
> subdirectories? E.g. running one on $volume/public/ and one on
> $volume/private/ ?

Do you already have all the 190000 directories created? If not, could you find out which of the paths need it and do a stat directly instead of find?

> 2018-07-26 11:29 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >
> > On Thu, Jul 26, 2018 at 2:41 PM, Hu Bert <revirii at googlemail.com> wrote:
> >>
> >> > Sorry, bad copy/paste :-(.
> >>
> >> np :-)
> >>
> >> The question regarding version 4.1 was meant more generally: does
> >> gluster v4.0 etc. have a better performance than version 3.12 etc.?
> >> Just curious :-) Sooner or later we have to upgrade anyway.
> >
> > You can check what changed @
> > https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance
> > https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance
> >
> >> btw.: gluster12 was the node with the failed brick, and i started the
> >> full heal on this node (has the biggest uuid as well). Is it normal
> >> that the glustershd.log on this node is rather empty (some hundred
> >> entries), but the glustershd.log files on the 2 other nodes have
> >> hundreds of thousands of entries?
> >
> > heals happen on the good bricks, so this is expected.
> >
> >> (sry, mail twice, didn't go to the list, but maybe others are
> >> interested...
> >> :-) )
> >>
> >> 2018-07-26 10:17 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >> >
> >> > On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert <revirii at googlemail.com> wrote:
> >> >>
> >> >> Hi Pranith,
> >> >>
> >> >> thanks a lot for your efforts and for tracking "my" problem with an
> >> >> issue. :-)
> >> >>
> >> >> I've set this params on the gluster volume and will start the
> >> >> 'find...' command within a short time. I'll probably add another
> >> >> answer to the list to document the progress.
> >> >>
> >> >> btw. - you had some typos:
> >> >> gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
> >> >>   => cluster is doubled
> >> >> gluster volume set <volname> cluster.data-self-heal-window-size 16
> >> >>   => it's actually cluster.self-heal-window-size
> >> >>
> >> >> but actually no problem :-)
> >> >
> >> > Sorry, bad copy/paste :-(.
> >> >
> >> >> Just curious: would gluster 4.1 improve the performance for healing
> >> >> and in general for "my" scenario?
> >> >
> >> > No, this issue is present in all the existing releases. But it is
> >> > solvable. You can follow that issue to see progress and when it is
> >> > fixed etc.
> >> >
> >> >> 2018-07-26 8:56 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >> >> > Thanks a lot for detailed write-up, this helps find the bottlenecks
> >> >> > easily. On a high level, to handle this directory hierarchy i.e. lots
> >> >> > of directories with files, we need to improve healing algorithms.
> >> >> > Based on the data you provided, we need to make the following
> >> >> > enhancements:
> >> >> >
> >> >> > 1) At the moment directories are healed one at a time, but files can
> >> >> > be healed upto 64 in parallel per replica subvolume.
> >> >> > So if you have nX2 or nX3 distributed subvolumes, it can heal 64n
> >> >> > number of files in parallel.
> >> >> >
> >> >> > I raised https://github.com/gluster/glusterfs/issues/477 to track
> >> >> > this. In the mean-while you can use the following work-around:
> >> >> > a) Increase background heals on the mount:
> >> >> > gluster volume set <volname> cluster.background-self-heal-count 256
> >> >> > gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
> >> >> > find <mnt> -type d | xargs stat
> >> >> >
> >> >> > one 'find' will trigger 10256 directories. So you may have to do this
> >> >> > periodically until all directories are healed.
> >> >> >
> >> >> > 2) Self-heal heals a file 128KB at a time(data-self-heal-window-size).
> >> >> > I think for your environment bumping upto MBs is better. Say 2MB i.e.
> >> >> > 16*128KB?
> >> >> >
> >> >> > Command to do that is:
> >> >> > gluster volume set <volname> cluster.data-self-heal-window-size 16
> >> >> >
> >> >> > On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <revirii at googlemail.com> wrote:
> >> >> >>
> >> >> >> Hi Pranith,
> >> >> >>
> >> >> >> Sry, it took a while to count the directories. I'll try to answer
> >> >> >> your questions as good as possible.
> >> >> >>
> >> >> >> > What kind of data do you have?
> >> >> >> > How many directories in the filesystem?
> >> >> >> > On average how many files per directory?
> >> >> >> > What is the depth of your directory hierarchy on average?
> >> >> >> > What is average filesize?
> >> >> >>
> >> >> >> We have mostly images (more than 95% of disk usage, 90% of file
> >> >> >> count), some text files (like css, jsp, gpx etc.) and some binaries.
> >> >> >>
> >> >> >> There are about 190.000 directories in the file system; maybe there
> >> >> >> are some more because we're hit by bug 1512371 (parallel-readdir
> >> >> >> TRUE prevents directories listing). But the number of directories
> >> >> >> could/will rise in the future (maybe millions).
> >> >> >>
> >> >> >> files per directory: ranges from 0 to 100, on average it should be
> >> >> >> 20 files per directory (well, at least in the deepest dirs, see
> >> >> >> explanation below).
> >> >> >>
> >> >> >> Average filesize: ranges from a few hundred bytes up to 30 MB, on
> >> >> >> average it should be 2-3 MB.
> >> >> >>
> >> >> >> Directory hierarchy: maximum depth as seen from within the volume is
> >> >> >> 6, the average should be 3.
> >> >> >>
> >> >> >> volume name: shared
> >> >> >> mount point on clients: /data/repository/shared/
> >> >> >> below /shared/ there are 2 directories:
> >> >> >> - public/: mainly calculated images (file sizes from a few KB up to
> >> >> >>   max 1 MB) and some resources (small PNGs with a size of a few
> >> >> >>   hundred bytes).
> >> >> >> - private/: mainly source images; file sizes from 50 KB up to 30MB
> >> >> >>
> >> >> >> We migrated from a NFS server (SPOF) to glusterfs and simply copied
> >> >> >> our files. The images (which have an ID) are stored in the deepest
> >> >> >> directories of the dir tree. I'll better explain it :-)
> >> >> >>
> >> >> >> directory structure for the images (i'll omit some other
> >> >> >> miscellaneous stuff, but it looks quite similar):
> >> >> >> - ID of an image has 7 or 8 digits
> >> >> >> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
> >> >> >> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID/$misc_formats.jpg
> >> >> >>
> >> >> >> That's why we have that many (sub-)directories. Files are only
> >> >> >> stored in the lowest directory hierarchy.
> >> >> >> I hope i could make our structure at least a bit more transparent.
> >> >> >>
> >> >> >> i hope there's something we can do to raise performance a bit. thx
> >> >> >> in advance :-)
> >> >> >>
> >> >> >> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >> >> >> >
> >> >> >> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <revirii at googlemail.com> wrote:
> >> >> >> >>
> >> >> >> >> Well, over the weekend about 200GB were copied, so now there are
> >> >> >> >> ~400GB copied to the brick. That's far beyond a speed of 10GB per
> >> >> >> >> hour. If I copied the 1.6 TB directly, that would be done within
> >> >> >> >> max 2 days. But with the self heal this will take at least 20 days
> >> >> >> >> minimum.
> >> >> >> >>
> >> >> >> >> Why is the performance that bad? No chance of speeding this up?
> >> >> >> >
> >> >> >> > What kind of data do you have?
> >> >> >> > How many directories in the filesystem?
> >> >> >> > On average how many files per directory?
> >> >> >> > What is the depth of your directory hierarchy on average?
> >> >> >> > What is average filesize?
> >> >> >> >
> >> >> >> > Based on this data we can see if anything can be improved. Or if
> >> >> >> > there are some enhancements that need to be implemented in gluster
> >> >> >> > to address this kind of data layout
> >> >> >> >>
> >> >> >> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> >> >> >> > hmm... no one any idea?
> >> >> >> >> >
> >> >> >> >> > Additional question: the hdd on server gluster12 was changed, so
> >> >> >> >> > far ~220 GB were copied.
> >> >> >> >> > On the other 2 servers i see a lot of entries in
> >> >> >> >> > glustershd.log, about 312.000 respectively 336.000 entries there
> >> >> >> >> > yesterday, most of them (current log output) looking like this:
> >> >> >> >> >
> >> >> >> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026] [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3: Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6. sources=0 [2] sinks=1
> >> >> >> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-shared-replicate-3: performing metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6
> >> >> >> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026] [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3: Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6. sources=0 [2] sinks=1
> >> >> >> >> >
> >> >> >> >> > or like this:
> >> >> >> >> >
> >> >> >> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-shared-replicate-3: performing metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> >> >> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026] [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3: Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
> >> >> >> >> > sources=[0] 2 sinks=1
> >> >> >> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-shared-replicate-3: performing entry selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> >> >> >> >
> >> >> >> >> > is this behaviour normal? I'd expect these messages on the server
> >> >> >> >> > with the failed brick, not on the other ones.
> >> >> >> >> >
> >> >> >> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> >> >> >> >> Hi there,
> >> >> >> >> >>
> >> >> >> >> >> sent this mail yesterday, but somehow it didn't work? Wasn't
> >> >> >> >> >> archived, so please be indulgent if you receive this mail again :-)
> >> >> >> >> >>
> >> >> >> >> >> We are currently running a replicate setup and are experiencing a
> >> >> >> >> >> quite poor performance. It got even worse when within a couple of
> >> >> >> >> >> weeks 2 bricks (disks) crashed. Maybe some general information of
> >> >> >> >> >> our setup:
> >> >> >> >> >>
> >> >> >> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS
> >> >> >> >> >> on separate disks); each server has 4 10TB disks -> each is a
> >> >> >> >> >> brick; replica 3 setup (see gluster volume status below). Debian
> >> >> >> >> >> stretch, kernel 4.9.0, gluster version 3.12.12. Servers and clients
> >> >> >> >> >> are connected via 10 GBit ethernet.
> >> >> >> >> >>
> >> >> >> >> >> About a month ago and 2 days ago a disk died (on different
> >> >> >> >> >> servers); disk were replaced, were brought back into the volume and
> >> >> >> >> >> full self heal started. But the speed for this is quite...
> >> >> >> >> >> disappointing.
> >> >> >> >> >> Each brick has ~1.6TB of data on it (mostly the infamous small
> >> >> >> >> >> files). The full heal i started yesterday copied only ~50GB within
> >> >> >> >> >> 24 hours (48 hours: about 100GB) - with this rate it would take
> >> >> >> >> >> weeks until the self heal finishes.
> >> >> >> >> >>
> >> >> >> >> >> After the first heal (started on gluster13 about a month ago, took
> >> >> >> >> >> about 3 weeks) finished we had a terrible performance; CPU on one
> >> >> >> >> >> or two of the nodes (gluster11, gluster12) was up to 1200%,
> >> >> >> >> >> consumed by the brick process of the former crashed brick
> >> >> >> >> >> (bricksdd1), interestingly not on the server with the failed disk,
> >> >> >> >> >> but on the other 2 ones...
> >> >> >> >> >>
> >> >> >> >> >> Well... am i doing something wrong? Some options wrongly
> >> >> >> >> >> configured? Terrible setup? Anyone got an idea? Any additional
> >> >> >> >> >> information needed?
> >> >> >> >> >>
> >> >> >> >> >> Thx in advance :-)
> >> >> >> >> >>
> >> >> >> >> >> gluster volume status
> >> >> >> >> >>
> >> >> >> >> >> Volume Name: shared
> >> >> >> >> >> Type: Distributed-Replicate
> >> >> >> >> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
> >> >> >> >> >> Status: Started
> >> >> >> >> >> Snapshot Count: 0
> >> >> >> >> >> Number of Bricks: 4 x 3 = 12
> >> >> >> >> >> Transport-type: tcp
> >> >> >> >> >> Bricks:
> >> >> >> >> >> Brick1: gluster11:/gluster/bricksda1/shared
> >> >> >> >> >> Brick2: gluster12:/gluster/bricksda1/shared
> >> >> >> >> >> Brick3: gluster13:/gluster/bricksda1/shared
> >> >> >> >> >> Brick4: gluster11:/gluster/bricksdb1/shared
> >> >> >> >> >> Brick5: gluster12:/gluster/bricksdb1/shared
> >> >> >> >> >> Brick6: gluster13:/gluster/bricksdb1/shared
> >> >> >> >> >> Brick7: gluster11:/gluster/bricksdc1/shared
> >> >> >> >> >> Brick8: gluster12:/gluster/bricksdc1/shared
> >> >> >> >> >> Brick9: gluster13:/gluster/bricksdc1/shared
> >> >> >> >> >> Brick10: gluster11:/gluster/bricksdd1/shared
> >> >> >> >> >> Brick11: gluster12:/gluster/bricksdd1_new/shared
> >> >> >> >> >> Brick12: gluster13:/gluster/bricksdd1_new/shared
> >> >> >> >> >> Options Reconfigured:
> >> >> >> >> >> cluster.shd-max-threads: 4
> >> >> >> >> >> performance.md-cache-timeout: 60
> >> >> >> >> >> cluster.lookup-optimize: on
> >> >> >> >> >> cluster.readdir-optimize: on
> >> >> >> >> >> performance.cache-refresh-timeout: 4
> >> >> >> >> >> performance.parallel-readdir: on
> >> >> >> >> >> server.event-threads: 8
> >> >> >> >> >> client.event-threads: 8
> >> >> >> >> >> performance.cache-max-file-size: 128MB
> >> >> >> >> >> performance.write-behind-window-size: 16MB
> >> >> >> >> >> performance.io-thread-count: 64
> >> >> >> >> >> cluster.min-free-disk: 1%
> >> >> >> >> >> performance.cache-size: 24GB
> >> >> >> >> >> nfs.disable: on
> >> >> >> >> >> transport.address-family: inet
> >> >> >> >> >> performance.high-prio-threads: 32
> >> >> >> >> >> performance.normal-prio-threads: 32
> >> >> >> >> >> performance.low-prio-threads: 32
> >> >> >> >> >> performance.least-prio-threads: 8
> >> >> >> >> >> performance.io-cache: on
> >> >> >> >> >> server.allow-insecure: on
> >> >> >> >> >> performance.strict-o-direct: off
> >> >> >> >> >> transport.listen-backlog: 100
> >> >> >> >> >> server.outstanding-rpc-limit: 128
> >> >> >> >> _______________________________________________
> >> >> >> >> Gluster-users mailing list
> >> >> >> >> Gluster-users at gluster.org
> >> >> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users

--
Pranith
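
The work-around quoted in this thread (raise the background heal limits, then crawl the mount so that every directory gets a lookup) could be scripted roughly as follows. This is only a sketch: the option names are the ones from the thread with the doubled "cluster." removed, while the function names and the volume/mount arguments are placeholders and not part of any gluster tooling. It also assumes GNU stat (for -c).

```shell
#!/bin/sh
# Sketch of the work-around from the thread; function names are hypothetical.

# Raise the background heal limits on a volume (values taken from the thread).
apply_heal_tuning() {
    vol=$1
    gluster volume set "$vol" cluster.background-self-heal-count 256
    gluster volume set "$vol" cluster.heal-wait-queue-length 10000
}

# Stat every directory below a client mount and print how many were touched.
# "-exec ... {} +" batches paths like xargs does, but is safe for names
# containing spaces.
crawl_dirs() {
    mnt=$1
    find "$mnt" -type d -exec stat -c '%n' {} + | wc -l
}
```

Since, according to the thread, a single crawl only triggers a limited number of directory heals, something like `crawl_dirs /data/repository/shared` would have to be repeated periodically until all directories are healed.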
Hu Bert
2018-Jul-27 06:23 UTC
[Gluster-users] Gluter 3.12.12: performance during heal and in general
> Do you already have all the 190000 directories already created? If not could you find out which of the paths need it and do a stat directly instead of find?

Quite probably not all of them have been created (but counting how many would take very long...). Hm, maybe running stat in a double loop (thx to our directory structure) would help. Something like this (may not be 100% correct):

    for a in {100..999}; do
      for b in {100..999}; do
        stat /$a/$b/
      done
    done

That should run stat on all directories. I think i'll give this a try.
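
A slightly more defensive variant of that double loop might look like the sketch below. It rests on a few assumptions that are not stated in the thread: the function name `stat_id_dirs` is made up, the root argument stands for a directory on the client mount (e.g. the private/ subtree), and the second path component is taken to range over 000..999 because the middle digits of an ID can presumably start with 0. Paths that do not exist are skipped silently.

```shell
#!/bin/sh
# Sketch: stat every <root>/<3 digits>/<3 digits> directory that exists.
# The function body runs in a subshell, so its variables stay local.
stat_id_dirs() (
    root=$1
    hits=0
    a=100
    while [ "$a" -le 999 ]; do
        # Skip the whole subtree if the first-level directory is missing.
        if [ -d "$root/$a" ]; then
            i=0
            while [ "$i" -le 999 ]; do
                b=$(printf '%03d' "$i")   # zero-pad: 7 -> 007
                if [ -d "$root/$a/$b" ]; then
                    stat "$root/$a/$b" > /dev/null && hits=$((hits + 1))
                fi
                i=$((i + 1))
            done
        fi
        a=$((a + 1))
    done
    # Print the number of directories that were found and stat'ed.
    echo "$hits"
)
```

Called as, for example, `stat_id_dirs /data/repository/shared/private`, it prints how many second-level directories it touched, which gives a rough progress indicator between runs.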