Pranith Kumar Karampuri
2018-Jul-26 06:56 UTC
[Gluster-users] Gluter 3.12.12: performance during heal and in general
Thanks a lot for the detailed write-up; it makes the bottlenecks easy to find.
At a high level, to handle this kind of directory hierarchy (lots of
directories with files), we need to improve the healing algorithms. Based on
the data you provided, we need to make the following enhancements:

1) At the moment directories are healed one at a time, but up to 64 files can
be healed in parallel per replica subvolume. So if you have nX2 or nX3
distributed subvolumes, it can heal 64n files in parallel.

I raised https://github.com/gluster/glusterfs/issues/477 to track this. In the
meanwhile you can use the following work-around:
a) Increase background heals on the mount:
gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
find <mnt> -type d | xargs stat

One 'find' will trigger 10256 directories, so you may have to do this
periodically until all directories are healed.

2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I
think for your environment bumping this up to MBs is better. Say 2MB, i.e.
16*128KB?

The command to do that is:
gluster volume set <volname> cluster.data-self-heal-window-size 16

On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <revirii at googlemail.com> wrote:
> Hi Pranith,
>
> Sry, it took a while to count the directories. I'll try to answer your
> questions as good as possible.
>
> > What kind of data do you have?
> > How many directories in the filesystem?
> > On average how many files per directory?
> > What is the depth of your directory hierarchy on average?
> > What is average filesize?
>
> We have mostly images (more than 95% of disk usage, 90% of file
> count), some text files (like css, jsp, gpx etc.) and some binaries.
>
> There are about 190.000 directories in the file system; maybe there
> are some more because we're hit by bug 1512371 (parallel-readdir
> TRUE prevents directories listing).
> But the number of directories could/will rise in the future (maybe
> millions).
>
> files per directory: ranges from 0 to 100, on average it should be 20
> files per directory (well, at least in the deepest dirs, see
> explanation below).
>
> Average filesize: ranges from a few hundred bytes up to 30 MB, on
> average it should be 2-3 MB.
>
> Directory hierarchy: maximum depth as seen from within the volume is
> 6, the average should be 3.
>
> volume name: shared
> mount point on clients: /data/repository/shared/
> below /shared/ there are 2 directories:
> - public/: mainly calculated images (file sizes from a few KB up to
>   max 1 MB) and some resources (small PNGs with a size of a few hundred
>   bytes).
> - private/: mainly source images; file sizes from 50 KB up to 30MB
>
> We migrated from a NFS server (SPOF) to glusterfs and simply copied
> our files. The images (which have an ID) are stored in the deepest
> directories of the dir tree. I'll better explain it :-)
>
> directory structure for the images (i'll omit some other miscellaneous
> stuff, but it looks quite similar):
> - ID of an image has 7 or 8 digits
> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of
>   ID)/$ID/$misc_formats.jpg
>
> That's why we have that many (sub-)directories. Files are only stored
> in the lowest directory hierarchy. I hope i could make our structure
> at least a bit more transparent.
>
> i hope there's something we can do to raise performance a bit. thx in
> advance :-)
>
>
> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >
> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <revirii at googlemail.com> wrote:
> >>
> >> Well, over the weekend about 200GB were copied, so now there are
> >> ~400GB copied to the brick. That's far below a speed of 10GB per
> >> hour. If I copied the 1.6 TB directly, that would be done within max 2
> >> days.
> >> But with the self heal this will take at least 20 days.
> >>
> >> Why is the performance that bad? No chance of speeding this up?
> >
> > What kind of data do you have?
> > How many directories in the filesystem?
> > On average how many files per directory?
> > What is the depth of your directory hierarchy on average?
> > What is average filesize?
> >
> > Based on this data we can see if anything can be improved. Or if there
> > are some enhancements that need to be implemented in gluster to address
> > this kind of data layout
> >>
> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> > hmm... no one any idea?
> >> >
> >> > Additional question: the hdd on server gluster12 was changed, so far
> >> > ~220 GB were copied. On the other 2 servers i see a lot of entries in
> >> > glustershd.log, about 312.000 respectively 336.000 entries there
> >> > yesterday, most of them (current log output) looking like this:
> >> >
> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026]
> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> > Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
> >> > sources=0 [2] sinks=1
> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026]
> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> >> > 0-shared-replicate-3: performing metadata selfheal on
> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6
> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026]
> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> > Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
> >> > sources=0 [2] sinks=1
> >> >
> >> > or like this:
> >> >
> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026]
> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> >> > 0-shared-replicate-3: performing metadata selfheal on
> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026]
> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> > Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
> >> > sources=[0] 2 sinks=1
> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026]
> >> > [afr-self-heal-entry.c:887:afr_selfheal_entry_do]
> >> > 0-shared-replicate-3: performing entry selfheal on
> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> >
> >> > is this behaviour normal? I'd expect these messages on the server with
> >> > the failed brick, not on the other ones.
> >> >
> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> >> Hi there,
> >> >>
> >> >> sent this mail yesterday, but somehow it didn't work? Wasn't archived,
> >> >> so please be indulgent if you receive this mail again :-)
> >> >>
> >> >> We are currently running a replicate setup and are experiencing a
> >> >> quite poor performance. It got even worse when within a couple of
> >> >> weeks 2 bricks (disks) crashed. Maybe some general information of our
> >> >> setup:
> >> >>
> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on
> >> >> separate disks); each server has 4 10TB disks -> each is a brick;
> >> >> replica 3 setup (see gluster volume status below). Debian stretch,
> >> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients are
> >> >> connected via 10 GBit ethernet.
> >> >>
> >> >> About a month ago and 2 days ago a disk died (on different servers);
> >> >> disks were replaced, were brought back into the volume and full self
> >> >> heal started. But the speed for this is quite... disappointing.
> >> >> Each brick has ~1.6TB of data on it (mostly the infamous small files).
> >> >> The full heal i started yesterday copied only ~50GB within 24 hours (48
> >> >> hours: about 100GB) - with this rate it would take weeks until the self
> >> >> heal finishes.
> >> >>
> >> >> After the first heal (started on gluster13 about a month ago, took
> >> >> about 3 weeks) finished we had a terrible performance; CPU on one or
> >> >> two of the nodes (gluster11, gluster12) was up to 1200%, consumed by
> >> >> the brick process of the former crashed brick (bricksdd1),
> >> >> interestingly not on the server with the failed disk, but on the other
> >> >> 2 ones...
> >> >>
> >> >> Well... am i doing something wrong? Some options wrongly configured?
> >> >> Terrible setup? Anyone got an idea? Any additional information needed?
> >> >>
> >> >>
> >> >> Thx in advance :-)
> >> >>
> >> >> gluster volume status
> >> >>
> >> >> Volume Name: shared
> >> >> Type: Distributed-Replicate
> >> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
> >> >> Status: Started
> >> >> Snapshot Count: 0
> >> >> Number of Bricks: 4 x 3 = 12
> >> >> Transport-type: tcp
> >> >> Bricks:
> >> >> Brick1: gluster11:/gluster/bricksda1/shared
> >> >> Brick2: gluster12:/gluster/bricksda1/shared
> >> >> Brick3: gluster13:/gluster/bricksda1/shared
> >> >> Brick4: gluster11:/gluster/bricksdb1/shared
> >> >> Brick5: gluster12:/gluster/bricksdb1/shared
> >> >> Brick6: gluster13:/gluster/bricksdb1/shared
> >> >> Brick7: gluster11:/gluster/bricksdc1/shared
> >> >> Brick8: gluster12:/gluster/bricksdc1/shared
> >> >> Brick9: gluster13:/gluster/bricksdc1/shared
> >> >> Brick10: gluster11:/gluster/bricksdd1/shared
> >> >> Brick11: gluster12:/gluster/bricksdd1_new/shared
> >> >> Brick12: gluster13:/gluster/bricksdd1_new/shared
> >> >> Options Reconfigured:
> >> >> cluster.shd-max-threads: 4
> >> >> performance.md-cache-timeout: 60
> >> >> cluster.lookup-optimize: on
> >> >> cluster.readdir-optimize: on
> >> >> performance.cache-refresh-timeout: 4
> >> >> performance.parallel-readdir: on
> >> >> server.event-threads: 8
> >> >> client.event-threads: 8
> >> >> performance.cache-max-file-size: 128MB
> >> >> performance.write-behind-window-size: 16MB
> >> >> performance.io-thread-count: 64
> >> >> cluster.min-free-disk: 1%
> >> >> performance.cache-size: 24GB
> >> >> nfs.disable: on
> >> >> transport.address-family: inet
> >> >> performance.high-prio-threads: 32
> >> >> performance.normal-prio-threads: 32
> >> >> performance.low-prio-threads: 32
> >> >> performance.least-prio-threads: 8
> >> >> performance.io-cache: on
> >> >> server.allow-insecure: on
> >> >> performance.strict-o-direct: off
> >> >> transport.listen-backlog: 100
> >> >> server.outstanding-rpc-limit: 128
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users
> >
> > --
> > Pranith

--
Pranith
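The figures mentioned in this message can be sanity-checked with a few lines of shell arithmetic. This is only a sketch: the function names are made up for illustration, the inputs (4 replica subvolumes from the "4 x 3 = 12" layout, about 1600 GB per brick, about 50 GB healed per day, about 190000 directories) are the numbers quoted in the thread, and the pass count assumes that every directory still needs healing.

```shell
#!/bin/sh
# Illustrative helpers only; none of these names come from gluster itself.

# Files healed in parallel: 64 per replica subvolume, n subvolumes.
parallel_file_heals() { echo $(( 64 * $1 )); }

# Data healed per step in KB: window size (in blocks) times 128 KB.
heal_window_kb() { echo $(( $1 * 128 )); }

# Directories one 'find' pass can trigger: background heals + wait queue.
dirs_per_pass() { echo $(( $1 + $2 )); }

# Days to heal one brick: total GB divided by GB healed per day (integer).
heal_days() { echo $(( $1 / $2 )); }

# Minimum number of 'find' passes: directories / dirs per pass, rounded up.
min_find_passes() { echo $(( ($1 + $2 - 1) / $2 )); }

parallel_file_heals 4          # 4 x 3 volume     -> 256
heal_window_kb 16              # 16 * 128 KB      -> 2048 (2 MB)
dirs_per_pass 256 10000        #                  -> 10256
heal_days 1600 50              # 1.6 TB at 50GB/d -> 32
min_find_passes 190000 10256   #                  -> 19
```

With these inputs the observed rate works out to roughly a month per brick, which matches the "weeks" estimate in the original report, and a single 'find' pass covers only a small fraction of the directory tree.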
Hu Bert
2018-Jul-26 07:29 UTC
[Gluster-users] Gluter 3.12.12: performance during heal and in general
Hi Pranith,

thanks a lot for your efforts and for tracking "my" problem with an issue. :-)

I've set these params on the gluster volume and will start the 'find...'
command shortly. I'll probably add another answer to the list to document the
progress.

btw. - you had some typos:

gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
=> cluster is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16
=> it's actually cluster.self-heal-window-size

but actually no problem :-)

Just curious: would gluster 4.1 improve the performance for healing and in
general for "my" scenario?
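Putting the suggested work-around together with the corrected option names from this reply, a minimal sketch could look like the following. The volume name and mount point are the ones given in the thread; the `run` wrapper, the `DRY_RUN` switch and the function names are assumptions added for illustration, and `-print0` / `-0` (GNU and BSD extensions, not POSIX) are used instead of the plain `find | xargs` so that directory names containing spaces are handled.

```shell
#!/bin/sh
# Sketch only. VOLNAME and MNT default to the values mentioned in the thread.
# DRY_RUN=1 (the default here) prints the gluster commands instead of
# running them; set DRY_RUN=0 to execute them.
VOLNAME="${VOLNAME:-shared}"
MNT="${MNT:-/data/repository/shared}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set the three options discussed above, with the corrected names.
heal_workaround() {
    run gluster volume set "$VOLNAME" cluster.background-self-heal-count 256
    run gluster volume set "$VOLNAME" cluster.heal-wait-queue-length 10000
    run gluster volume set "$VOLNAME" cluster.self-heal-window-size 16
}

# stat every directory below the given client mount point to trigger heals.
trigger_directory_heals() {
    find "$1" -type d -print0 | xargs -0 stat > /dev/null
}
```

After reviewing the printed commands, something like `DRY_RUN=0 heal_workaround` followed by `trigger_directory_heals "$MNT"` would apply the settings and start one pass; as noted in the thread, the pass may have to be repeated periodically until all directories are healed.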