Hu Bert
2023-Mar-30 09:26 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Hello there,

as Strahil suggested, a separate thread might be better.

Current state:
- servers with 10TB hdds
- 2 hdds build up a sw raid1
- each raid1 is a brick
- so 5 bricks per server
- Volume info (complete below):

Volume Name: workdata
Type: Distributed-Replicate
Number of Bricks: 5 x 3 = 15
Bricks:
Brick1: gls1:/gluster/md3/workdata
Brick2: gls2:/gluster/md3/workdata
Brick3: gls3:/gluster/md3/workdata
Brick4: gls1:/gluster/md4/workdata
Brick5: gls2:/gluster/md4/workdata
Brick6: gls3:/gluster/md4/workdata
etc.

- workload: the (un)famous "lots of small files" setting
- currently 70% of the volume is used: ~32TB
- file size: a few KB up to 1MB
- so there are hundreds of millions of files (and millions of directories)
- each image has an ID
- under the base dir the IDs are split into 3 digits
- dir structure: /basedir/(000-999)/(000-999)/ID/[lotsoffileshere]
- example for ID 123456789: /basedir/123/456/123456789/default.jpg
  (a short sketch of this mapping follows below)
- maybe this structure isn't good and e.g. this would be better:
  /basedir/IDs/[here the files] - so millions of ID-dirs directly under /basedir/
- frequent access to the files by webservers (nginx, tomcat): lookup if file exists, read/write images etc.
- Strahil mentioned: "Keep in mind that negative searches (searches of non-existing/deleted objects) has highest penalty." <--- that happens very often...
- server load on high traffic days: > 100 (mostly iowait)
- bad are server reboots (read filesystem info etc.)
- really bad is a sw raid rebuild/resync

Some images:
https://abload.de/img/gls-diskutilfti5d.png
https://abload.de/img/gls-io6cfgp.png
https://abload.de/img/gls-throughput3oicf.png

Our conclusion: the hardware is too slow, the disks are too big. For a future setup we need to improve the performance (or switch to a different solution). A HW RAID controller might be an option, but SAS disks are not available.

Options:
- scale broader: more servers with smaller disks
- faster disks: nvme

Both are costly. Any suggestions, recommendations, ideas?

Just an observation: is there a performance difference between a sw raid10 (10 disks -> one brick) and 5x raid1 (each raid1 a brick) with the same disks (10TB hdd)? The heal processes in the 5x raid1 scenario seem faster. Just out of curiosity...
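Just to make that layout concrete, a quick bash sketch of how an ID ends up in its directory (illustrative only, not our actual application code):

  id=123456789
  d1=${id:0:3}                  # first three digits  -> 123
  d2=${id:3:3}                  # next three digits   -> 456
  dir="/basedir/$d1/$d2/$id"    # -> /basedir/123/456/123456789
  echo "$dir/default.jpg"       # -> /basedir/123/456/123456789/default.jpg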
whoa, lots of text - thx for reading if you reached this point :-)

Best regards

Hubert

Volume Name: workdata
Type: Distributed-Replicate
Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: glusterpub1:/gluster/md3/workdata
Brick2: glusterpub2:/gluster/md3/workdata
Brick3: glusterpub3:/gluster/md3/workdata
Brick4: glusterpub1:/gluster/md4/workdata
Brick5: glusterpub2:/gluster/md4/workdata
Brick6: glusterpub3:/gluster/md4/workdata
Brick7: glusterpub1:/gluster/md5/workdata
Brick8: glusterpub2:/gluster/md5/workdata
Brick9: glusterpub3:/gluster/md5/workdata
Brick10: glusterpub1:/gluster/md6/workdata
Brick11: glusterpub2:/gluster/md6/workdata
Brick12: glusterpub3:/gluster/md6/workdata
Brick13: glusterpub1:/gluster/md7/workdata
Brick14: glusterpub2:/gluster/md7/workdata
Brick15: glusterpub3:/gluster/md7/workdata
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.read-ahead: off
performance.io-cache: off
performance.quick-read: on
cluster.self-heal-window-size: 16
cluster.heal-wait-queue-length: 10000
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
network.inode-lru-limit: 200000
cluster.shd-max-threads: 8
server.outstanding-rpc-limit: 128
transport.listen-backlog: 100
performance.least-prio-threads: 8
performance.cache-size: 6GB
cluster.min-free-disk: 1%
performance.io-thread-count: 32
performance.write-behind-window-size: 16MB
performance.cache-max-file-size: 128MB
client.event-threads: 8
server.event-threads: 8
performance.parallel-readdir: on
performance.cache-refresh-timeout: 4
cluster.readdir-optimize: off
performance.md-cache-timeout: 600
performance.nl-cache: off
cluster.lookup-unhashed: on
cluster.shd-wait-qlength: 10000
performance.readdir-ahead: on
storage.build-pgfid: off
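PS: since the negative lookups were mentioned as the biggest penalty and performance.nl-cache is currently off, one thing that might be worth testing is Gluster's negative-lookup cache. A sketch of the commands (the timeout value is just a starting guess, not a recommendation):

  # enable caching of negative (ENOENT) lookups on the workdata volume
  gluster volume set workdata performance.nl-cache on
  # how long cached negative entries stay valid, in seconds (guessed starting value)
  gluster volume set workdata performance.nl-cache-timeout 600

Whether this actually helps here would of course need testing under real traffic.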
Diego Zuccato
2023-Mar-30 10:00 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Well, you have *way* more files than we do... :)

On 30/03/2023 11:26, Hu Bert wrote:
> Just an observation: is there a performance difference between a sw
> raid10 (10 disks -> one brick) or 5x raid1 (each raid1 a brick)

Err... RAID10 is not 10 disks unless you stripe 5 mirrors of 2 disks.

> with the same disks (10TB hdd)? The heal processes on the 5xraid1-scenario
> seems faster. Just out of curiosity...

It should be, since the bricks are smaller. But given you're using replica 3, I don't understand why you're also using RAID1: for each 10TB of user-facing capacity you're keeping 60TB of data on disks. I'd ditch the local RAIDs to double the available space, unless you desperately need the extra read performance.

> Options Reconfigured:

I'll have a look at the options you use. Maybe something can be useful in our case. Tks :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
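For reference, the raw-vs-usable arithmetic behind the 60TB figure above, per 10 TB of user-facing data (ignoring filesystem overhead):

  10 TB usable
  x 3  (replica 3)       = 30 TB across the bricks
  x 2  (RAID1 mirrors)   = 60 TB of raw disk

Dropping the local RAID1 and keeping only replica 3 would halve that to 30 TB of raw disk per 10 TB usable.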
gluster-users at jahu.sk
2023-Apr-03 17:00 UTC
[Gluster-users] Performance: lots of small files, hdd, nvme etc.
Hello,

you can read files from the underlying brick filesystem first (ext4, xfs, ...), for example /srv/glusterfs/wwww/brick. As a fallback you can check the mounted glusterfs path, which also heals entries missing on the local node, e.g. /mnt/shared/www/... You only need to write through the mount.glusterfs mount point.

On 3/30/2023 11:26 AM, Hu Bert wrote:
> - workload: the (un)famous "lots of small files" setting
> - currently 70% of the volume is used: ~32TB
> - file size: a few KB up to 1MB
> - so there are hundreds of millions of files (and millions of directories)
> - each image has an ID
> - under the base dir the IDs are split into 3 digits
> - dir structure: /basedir/(000-999)/(000-999)/ID/[lotsoffileshere]
> - example for ID 123456789: /basedir/123/456/123456789/default.jpg
> - maybe this structure isn't good and e.g. this would be better:
>   /basedir/IDs/[here the files] - so millions of ID-dirs directly under /basedir/
> - frequent access to the files by webservers (nginx, tomcat): lookup
>   if file exists, read/write images etc.
> - Strahil mentioned: "Keep in mind that negative searches (searches of
>   non-existing/deleted objects) has highest penalty." <--- that happens very often...
> - server load on high traffic days: > 100 (mostly iowait)
> - bad are server reboots (read filesystem info etc.)
> - really bad is a sw raid rebuild/resync

--
S pozdravom / Yours sincerely
Ing. Jan Hudoba
http://www.jahu.sk
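A minimal bash sketch of that read path (the brick and mount paths are placeholders; on a distributed volume the file may live on another node's brick, which is exactly what the fallback covers):

  #!/bin/bash
  # Try the local brick first, fall back to the glusterfs mount.
  BRICK=/srv/glusterfs/www/brick     # placeholder: local brick path
  MOUNT=/mnt/shared/www              # placeholder: glusterfs mount point
  REL="$1"                           # relative path of the requested file

  if [ -f "$BRICK/$REL" ]; then
      cat "$BRICK/$REL"              # fast path: plain local filesystem read
  else
      cat "$MOUNT/$REL"              # fallback: go through glusterfs (triggers lookup/heal)
  fi

  # Writes must always go through the glusterfs mount point, never the brick:
  # cp /tmp/new.jpg "$MOUNT/123/456/123456789/new.jpg"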