Strahil Nikolov
2023-Mar-24 21:11 UTC
[Gluster-users] hardware issues and new server advice
Actually, a pure NVMe-based volume would be a waste of money. Gluster excels when you have more servers and clients to consume that data.

I would choose LVM cache (NVMes) + HW RAID10 of SAS 15K disks to cope with the load. At least if you decide to go with more disks for the raids, use several controllers (not the built-in ones).

@Martin,
in order to get a more reliable setup, you will have to either get more servers and switch to distributed-replicated volume(s) or consider getting server hardware. Dispersed volumes require a lot of CPU computation and the Ryzens won't cope with the load.

Best Regards,
Strahil Nikolov

On Thu, Mar 23, 2023 at 12:16, Hu Bert <revirii at googlemail.com> wrote:

Hi,

On Tue, 21 Mar 2023 at 23:36, Martin Bähr <mbaehr+gluster at realss.com> wrote:
> the primary data is photos. we get an average of 50000 new files per
> day, with a peak of 7 to 8 times as much during christmas.
>
> gluster has always been able to keep up with that, only when a raid resync
> or check happens does the server load sometimes increase enough to cause issues.

Interesting, we have a similar workload: hundreds of millions of images,
small files, and especially on weekends with high traffic the load+iowait
is really heavy. Or if a hdd fails, or during a raid check.

Our hardware: 10x 10TB hdds -> 5x raid1, each raid1 is a brick, replica 3
setup. About 40TB of data. Well, the bricks are bigger than recommended...
Sooner or later we will have to migrate that stuff and use nvme for it,
either 3.5TB or bigger ones. Those should be faster... *fingers crossed*

regards,
Hubert
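For anyone who wants to try the layout Strahil describes, a minimal sketch of an LVM-cached brick, assuming /dev/md0 is the HW RAID10 of SAS disks and /dev/nvme0n1 the NVMe; device, VG and LV names and the cache size are placeholders, not taken from this thread:

  # put the RAID10 array and the NVMe into one volume group
  pvcreate /dev/md0 /dev/nvme0n1
  vgcreate vg_brick /dev/md0 /dev/nvme0n1

  # data LV on the spinning RAID10, cache pool on the NVMe
  lvcreate -l 100%PVS -n brick1 vg_brick /dev/md0
  lvcreate --type cache-pool -L 800G -n brick1cache vg_brick /dev/nvme0n1

  # attach the cache pool to the data LV and format it as a gluster brick
  lvconvert --type cache --cachepool vg_brick/brick1cache vg_brick/brick1
  mkfs.xfs -i size=512 /dev/vg_brick/brick1

The default cache mode is writethrough; writeback is faster for a small-file write load but loses data if the cache device dies, so it only makes sense with redundant NVMes.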
Excerpts from Strahil Nikolov's message of 2023-03-24 21:11:28 +0000:
> Gluster excels when you have more servers and clients to consume that data.

you mean multiple smaller servers are better than one large server?

> LVM cache (NVMEs)

we only have a few clients. gluster is for us effectively only a
scalable large-file storage for one application. new files are written
once and then access to files is rather random (users accessing their
albums), so i don't see a benefit in using a cache. (we also have a
webcache which covers most of the repeated access from clients)

> @Martin,
> in order to get a more reliable setup, you will have to either get
> more servers and switch to distributed-replicated volume(s) or

that is the plan. we are not considering dispersed volumes; with the
small file sizes that doesn't seem worth it. besides, with regular
volumes the files remain accessible even if gluster itself fails
(which is the case now: since healing causes our raid to fail, we
decided to turn off gluster on the old servers and simply copy the raw
files from the gluster storage to the new gluster once that is set up).

greetings, martin.
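For reference, the distributed-replicated setup Martin plans for could be created roughly like this; hostnames and brick paths are placeholders, and six bricks with replica 3 give a 2 x 3 distributed-replicate volume:

  # bricks are grouped in listing order: srv1-3 form one replica set, srv4-6 the other
  gluster volume create newvol replica 3 \
    srv1:/gluster/brick1/newvol srv2:/gluster/brick1/newvol srv3:/gluster/brick1/newvol \
    srv4:/gluster/brick1/newvol srv5:/gluster/brick1/newvol srv6:/gluster/brick1/newvol
  gluster volume start newvol

Files are then distributed across the two replica sets, so any single server can fail while all data stays accessible.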
Hi,

sry if i hijack this, but maybe it's helpful for other gluster users...

> A pure NVMe-based volume would be a waste of money. Gluster excels when
> you have more servers and clients to consume that data.
> I would choose LVM cache (NVMes) + HW RAID10 of SAS 15K disks to cope
> with the load. At least if you decide to go with more disks for the
> raids, use several controllers (not the built-in ones).

Well, we have to take what our provider (hetzner) offers - SATA hdds or
sata|nvme ssds.

Volume Name: workdata
Type: Distributed-Replicate
Number of Bricks: 5 x 3 = 15
Bricks:
Brick1: gls1:/gluster/md3/workdata
Brick2: gls2:/gluster/md3/workdata
Brick3: gls3:/gluster/md3/workdata
Brick4: gls1:/gluster/md4/workdata
Brick5: gls2:/gluster/md4/workdata
Brick6: gls3:/gluster/md4/workdata
etc.

Below are the volume settings. Each brick is a sw raid1 (made out of 10TB
hdds). File access to the backends is pretty slow, even with low system
load (which reaches >100 on the servers on high-traffic days); even a
simple 'ls' on a directory with ~1000 sub-directories takes a couple of
seconds.

Some images:
https://abload.de/img/gls-diskutilfti5d.png
https://abload.de/img/gls-io6cfgp.png
https://abload.de/img/gls-throughput3oicf.png

As you mentioned it: is a raid10 better than x*raid1? Anything
misconfigured?

Thx a lot & best regards,
Hubert

Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.read-ahead: off
performance.io-cache: off
performance.quick-read: on
cluster.self-heal-window-size: 16
cluster.heal-wait-queue-length: 10000
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
network.inode-lru-limit: 200000
cluster.shd-max-threads: 8
server.outstanding-rpc-limit: 128
transport.listen-backlog: 100
performance.least-prio-threads: 8
performance.cache-size: 6GB
cluster.min-free-disk: 1%
performance.io-thread-count: 32
performance.write-behind-window-size: 16MB
performance.cache-max-file-size: 128MB
client.event-threads: 8
server.event-threads: 8
performance.parallel-readdir: on
performance.cache-refresh-timeout: 4
cluster.readdir-optimize: off
performance.md-cache-timeout: 600
performance.nl-cache: off
cluster.lookup-unhashed: on
cluster.shd-wait-qlength: 10000
performance.readdir-ahead: on
storage.build-pgfid: off
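Regarding the raid10 vs. x*raid1 question and the load during raid checks, a rough sketch with mdadm; device names and the resync speed limit are only examples, not a tested recommendation:

  # one RAID10 across all ten disks instead of five separate RAID1 pairs
  mdadm --create /dev/md3 --level=10 --raid-devices=10 /dev/sd[a-j]

  # throttle background resync/check so it competes less with brick I/O (kB/s)
  sysctl -w dev.raid.speed_limit_max=50000

A single RAID10 means one big brick per server and only one array to check or resync, while several RAID1 pairs keep a rebuild confined to two disks; which behaves better under this photo workload depends mostly on how evenly gluster spreads the I/O across the bricks.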