Hu Bert
2023-Mar-30 09:26 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Hello there,

as Strahil suggested, a separate thread might be better.

Current state:
- servers with 10TB hdds
- 2 hdds build up a sw raid1
- each raid1 is a brick
- so 5 bricks per server
- Volume info (complete below):

Volume Name: workdata
Type: Distributed-Replicate
Number of Bricks: 5 x 3 = 15
Bricks:
Brick1: gls1:/gluster/md3/workdata
Brick2: gls2:/gluster/md3/workdata
Brick3: gls3:/gluster/md3/workdata
Brick4: gls1:/gluster/md4/workdata
Brick5: gls2:/gluster/md4/workdata
Brick6: gls3:/gluster/md4/workdata
etc.

- workload: the (un)famous "lots of small files" setting
- currently 70% of the volume is used: ~32TB
- file size: a few KB up to 1MB
- so there are hundreds of millions of files (and millions of directories)
- each image has an ID
- under the base dir the IDs are split into 3 digits
- dir structure: /basedir/(000-999)/(000-999)/ID/[lotsoffileshere]
- example for ID 123456789: /basedir/123/456/123456789/default.jpg
  (a short sketch of this mapping follows below)
- maybe this structure isn't good and e.g. this would be better:
  /basedir/IDs/[here the files] - so millions of ID-dirs directly under /basedir/
- frequent access to the files by webservers (nginx, tomcat): lookup if file exists, read/write images etc.
- Strahil mentioned: "Keep in mind that negative searches (searches of non-existing/deleted objects) has highest penalty." <--- that happens very often...
- server load on high traffic days: > 100 (mostly iowait)
- bad are server reboots (read filesystem info etc.)
- really bad is a sw raid rebuild/resync

Some images:
https://abload.de/img/gls-diskutilfti5d.png
https://abload.de/img/gls-io6cfgp.png
https://abload.de/img/gls-throughput3oicf.png

Our conclusion: the hardware is too slow, the disks are too big. For a future setup we need to improve the performance (or switch to a different solution). A HW RAID controller might be an option, but SAS disks are not available.

Options:
- scale broader: more servers with smaller disks
- faster disks: nvme

Both are costly. Any suggestions, recommendations, ideas?

Just an observation: is there a performance difference between a sw raid10 (10 disks -> one brick) and 5x raid1 (each raid1 a brick) with the same disks (10TB hdd)? The heal processes in the 5x raid1 scenario seem faster. Just out of curiosity...
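Just to make that layout concrete, a quick bash sketch of how an ID ends up in its directory (illustrative only, not our actual application code):

  id=123456789
  d1=${id:0:3}                  # first three digits  -> 123
  d2=${id:3:3}                  # next three digits   -> 456
  dir="/basedir/$d1/$d2/$id"    # -> /basedir/123/456/123456789
  echo "$dir/default.jpg"       # -> /basedir/123/456/123456789/default.jpg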
whoa, lots of text - thx for reading if you reached this point :-)

Best regards

Hubert

Volume Name: workdata
Type: Distributed-Replicate
Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: glusterpub1:/gluster/md3/workdata
Brick2: glusterpub2:/gluster/md3/workdata
Brick3: glusterpub3:/gluster/md3/workdata
Brick4: glusterpub1:/gluster/md4/workdata
Brick5: glusterpub2:/gluster/md4/workdata
Brick6: glusterpub3:/gluster/md4/workdata
Brick7: glusterpub1:/gluster/md5/workdata
Brick8: glusterpub2:/gluster/md5/workdata
Brick9: glusterpub3:/gluster/md5/workdata
Brick10: glusterpub1:/gluster/md6/workdata
Brick11: glusterpub2:/gluster/md6/workdata
Brick12: glusterpub3:/gluster/md6/workdata
Brick13: glusterpub1:/gluster/md7/workdata
Brick14: glusterpub2:/gluster/md7/workdata
Brick15: glusterpub3:/gluster/md7/workdata
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.read-ahead: off
performance.io-cache: off
performance.quick-read: on
cluster.self-heal-window-size: 16
cluster.heal-wait-queue-length: 10000
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
network.inode-lru-limit: 200000
cluster.shd-max-threads: 8
server.outstanding-rpc-limit: 128
transport.listen-backlog: 100
performance.least-prio-threads: 8
performance.cache-size: 6GB
cluster.min-free-disk: 1%
performance.io-thread-count: 32
performance.write-behind-window-size: 16MB
performance.cache-max-file-size: 128MB
client.event-threads: 8
server.event-threads: 8
performance.parallel-readdir: on
performance.cache-refresh-timeout: 4
cluster.readdir-optimize: off
performance.md-cache-timeout: 600
performance.nl-cache: off
cluster.lookup-unhashed: on
cluster.shd-wait-qlength: 10000
performance.readdir-ahead: on
storage.build-pgfid: off
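PS: since the negative lookups were mentioned as the biggest penalty and performance.nl-cache is currently off, one thing that might be worth testing is Gluster's negative-lookup cache. A sketch of the commands (the timeout value is just a starting guess, not a recommendation):

  # enable caching of negative (ENOENT) lookups on the workdata volume
  gluster volume set workdata performance.nl-cache on
  # how long cached negative entries stay valid, in seconds (guessed starting value)
  gluster volume set workdata performance.nl-cache-timeout 600

Whether this actually helps here would of course need testing under real traffic.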
Diego Zuccato
2023-Mar-30 10:00 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Well, you have *way* more files than we do... :)

On 30/03/2023 11:26, Hu Bert wrote:
> Just an observation: is there a performance difference between a sw
> raid10 (10 disks -> one brick) or 5x raid1 (each raid1 a brick)

Err... RAID10 is not 10 disks unless you stripe 5 mirrors of 2 disks.

> with the same disks (10TB hdd)? The heal processes on the 5xraid1-scenario
> seems faster. Just out of curiosity...

It should be, since the bricks are smaller. But given you're using replica 3, I don't understand why you're also using RAID1: for each 10TB of user-facing capacity you're keeping 60TB of data on disks. I'd ditch the local RAIDs to double the available space, unless you desperately need the extra read performance.

> Options Reconfigured:

I'll have a look at the options you use. Maybe something can be useful in our case. Tks :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
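For reference, the raw-vs-usable arithmetic behind the 60TB figure above, per 10 TB of user-facing data (ignoring filesystem overhead):

  10 TB usable
  x 3  (replica 3)       = 30 TB across the bricks
  x 2  (RAID1 mirrors)   = 60 TB of raw disk

Dropping the local RAID1 and keeping only replica 3 would halve that to 30 TB of raw disk per 10 TB usable.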
gluster-users at jahu.sk
2023-Apr-03 17:00 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Hello,

you can read files from the underlying brick filesystem first (ext4, xfs, ...), for example /srv/glusterfs/wwww/brick. As a fallback you can check the mounted glusterfs path, which also heals entries missing on the local node, e.g. /mnt/shared/www/... You only need to write through the mount.glusterfs mount point.

On 3/30/2023 11:26 AM, Hu Bert wrote:
> - workload: the (un)famous "lots of small files" setting
> - currently 70% of the volume is used: ~32TB
> - file size: a few KB up to 1MB
> - so there are hundreds of millions of files (and millions of directories)
> - each image has an ID
> - under the base dir the IDs are split into 3 digits
> - dir structure: /basedir/(000-999)/(000-999)/ID/[lotsoffileshere]
> - example for ID 123456789: /basedir/123/456/123456789/default.jpg
> - maybe this structure isn't good and e.g. this would be better:
>   /basedir/IDs/[here the files] - so millions of ID-dirs directly under /basedir/
> - frequent access to the files by webservers (nginx, tomcat): lookup
>   if file exists, read/write images etc.
> - Strahil mentioned: "Keep in mind that negative searches (searches of
>   non-existing/deleted objects) has highest penalty." <--- that happens very often...
> - server load on high traffic days: > 100 (mostly iowait)
> - bad are server reboots (read filesystem info etc.)
> - really bad is a sw raid rebuild/resync

--
S pozdravom / Yours sincerely
Ing. Jan Hudoba
http://www.jahu.sk
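A minimal bash sketch of that read path (the brick and mount paths are placeholders; on a distributed volume the file may live on another node's brick, which is exactly what the fallback covers):

  #!/bin/bash
  # Try the local brick first, fall back to the glusterfs mount.
  BRICK=/srv/glusterfs/www/brick     # placeholder: local brick path
  MOUNT=/mnt/shared/www              # placeholder: glusterfs mount point
  REL="$1"                           # relative path of the requested file

  if [ -f "$BRICK/$REL" ]; then
      cat "$BRICK/$REL"              # fast path: plain local filesystem read
  else
      cat "$MOUNT/$REL"              # fallback: go through glusterfs (triggers lookup/heal)
  fi

  # Writes must always go through the glusterfs mount point, never the brick:
  # cp /tmp/new.jpg "$MOUNT/123/456/123456789/new.jpg"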