Martin Bähr
2020-Sep-11 03:27 UTC
[Gluster-users] gluster heal performance (was: Fwd: New GlusterFS deployment, doubts on 1 brick per host vs 1 brick per drive.)
Excerpts from Gionatan Danti's message of 2020-09-11 00:35:52 +0200:
> The main point was the potentially long heal time

could you (or anyone else) please elaborate on what long heal times are
to be expected?

we have a 3-node replica cluster running version 3.12.9 (we are building
a new cluster now) with 32TiB of space. each node has a single brick on
top of a 7-disk raid5 (linux softraid).

at one point we had one node unavailable for one month (gluster failed
to start up properly on that node and we didn't have monitoring in place
to notice), and the accumulated changes of that one month of operation
took 4 months to heal. i would have expected this to take 2 weeks or
less ideally, one month at the worst (ie faster than, or at least as
fast as, it took to create the data, but not slower, and especially not
4 times slower).

the initial heal count was about 6 million files for one node and
5.4 million for the other. the healing speed was not constant: at first
the heal count increased, that is, healing was seemingly slower than the
rate at which new files were added. then it started to speed up; the
first million files on each node took about 46 days to heal, while the
last million took 4 days. i logged the output of
"gluster volume heal gluster-volume statistics heal-count" every hour to
monitor the healing process.

what makes healing so slow? almost all files are newly added, not
changed, so they were simply missing on the node that was offline. the
files are backups of user devices, so almost all of them are written
once and rarely, if ever, read.

we do have a few huge directories with 250000, 88000, 60000 and 29000
subdirectories each. in total it is 26TiB of small files, but no more
than a few thousand per directory. (it's user data, some have more, some
have less.) could those huge directories be responsible for the slow
healing? the filesystem is ext4 on top of a 7-disk raid5.

after this ordeal was over we discovered the readdir-ahead setting,
which was on. we turned it off based on other discussions that suggested
a performance improvement from this change, but we haven't had the
opportunity to do a large healing test since, so we can't tell if it
makes a difference for us.

any insights would be appreciated.

greetings, martin.
--
general manager      realss.com
student mentor       fossasia.org
community mentor     blug.sh beijinglug.club
pike programmer      pike.lysator.liu.se caudium.net societyserver.org

Martin Bähr    working in china    http://societyserver.org/mbaehr/
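For reference, a minimal sketch of the hourly heal-count logging described
above, assuming the gluster CLI is available on one of the nodes and the
volume is literally named gluster-volume as in the post; the log path and
the loop-based approach (rather than cron) are illustrative assumptions:

    #!/bin/sh
    # hypothetical monitoring loop: record a heal-count snapshot once per hour
    VOL=gluster-volume                      # volume name as used in the post
    LOG=/var/log/gluster-heal-count.log     # illustrative log location

    while true; do
        date >> "$LOG"
        gluster volume heal "$VOL" statistics heal-count >> "$LOG" 2>&1
        sleep 3600
    done

Computing per-hour deltas from such a log is what makes the change in heal
rate (from roughly 46 days per million files down to 4 days) visible.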
On 2020-09-11 05:27, Martin Bähr wrote:
> Excerpts from Gionatan Danti's message of 2020-09-11 00:35:52 +0200:
>> The main point was the potentially long heal time
>
> could you (or anyone else) please elaborate on what long heal times are
> to be expected?

Hi, there are multiple factors at work here:
- healing goes over the network (gluster) vs internal bus data transfer
  (RAID rebuild);
- gluster is a user-space application, which commands a significant CPU
  load;
- healing proceeds per-file and not in LBA order (ie: it has to traverse
  all the affected files/dirs, which means scattered random IO for the
  most part);
- other things which I am surely missing.

> we have a 3-node replica cluster running version 3.12.9 (we are
> building a new cluster now) with 32TiB of space. each node has a
> single brick on top of a 7-disk raid5 (linux softraid)

3.12.9, while being the official RHEL 7 release, is very old now.

> at one point we had one node unavailable for one month (gluster failed
> to start up properly on that node and we didn't have monitoring in
> place to notice), and the accumulated changes of that one month of
> operation took 4 months to heal. i would have expected this to take
> 2 weeks or less ideally, one month at the worst (ie faster than, or at
> least as fast as, it took to create the data, but not slower, and
> especially not 4 times slower)

Wow, 4 months is a lot... but at least you had internal redundancy
(RAID5 bricks). The OP was asking about running with *no* internal
redundancy, and this is the reason I advise against it: losing a disk
and then needing weeks to heal is not good.

> the initial heal count was about 6 million files for one node and
> 5.4 million for the other.
> ...
> we do have a few huge directories with 250000, 88000, 60000 and 29000
> subdirectories each. in total it is 26TiB of small files, but no more
> than a few thousand per directory. (it's user data, some have more,
> some have less.)
>
> could those huge directories be responsible for the slow healing?

The very high number of to-be-healed files surely has a negative impact
on your heal speed.

Regards.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
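For completeness, the volume options touched on in this thread can be
inspected and changed with the standard gluster CLI. The commands below are
a hedged sketch against the gluster-volume name used above; the
cluster.shd-max-threads line is an additional, commonly available
heal-parallelism option that is not mentioned in the thread, and the value
4 is purely illustrative:

    # check the current settings on the volume
    gluster volume get gluster-volume performance.readdir-ahead
    gluster volume get gluster-volume cluster.shd-max-threads

    # disable readdir-ahead, as the original poster eventually did
    gluster volume set gluster-volume performance.readdir-ahead off

    # optional: allow more parallel self-heal threads per brick
    # (not discussed in the thread; the value is an example only)
    gluster volume set gluster-volume cluster.shd-max-threads 4

Whether either change helps depends on the workload and the gluster
version in use; the original poster noted they had not yet been able to
re-test a large heal after changing readdir-ahead.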