Hu Bert
2018-Jul-19 06:24 UTC
[Gluster-users] Gluster 3.12.12: performance during heal and in general
Hi there,

I sent this mail yesterday, but somehow it didn't work? It wasn't archived, so please be indulgent if you receive this mail twice :-)

We are currently running a replicate setup and are experiencing quite poor performance. It got even worse when 2 bricks (disks) crashed within a couple of weeks.

Some general information about our setup: 3x Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on separate disks); each server has 4 10TB disks -> each disk is a brick; replica 3 setup (see gluster volume info below). Debian stretch, kernel 4.9.0, gluster version 3.12.12. Servers and clients are connected via 10 GBit ethernet.

About a month ago, and again 2 days ago, a disk died (on different servers); the disks were replaced, brought back into the volume, and a full self heal was started. But the speed for this is quite... disappointing. Each brick holds ~1.6 TB of data (mostly the infamous small files). The full heal I started yesterday has copied only ~50 GB within 24 hours (48 hours: about 100 GB) - at this rate it would take weeks until the self heal finishes (rough math at the end of this mail).

After the first heal finished (started on gluster13 about a month ago, it took about 3 weeks) we had terrible performance; CPU usage on one or two of the nodes (gluster11, gluster12) was up to 1200%, consumed by the brick process of the formerly crashed brick (bricksdd1) - interestingly not on the server with the failed disk, but on the other 2...

Well... am I doing something wrong? Some options wrongly configured? Terrible setup? Anyone got an idea? Any additional information needed? (The commands I'm using to watch the heal, and the tuning I have in mind, are also at the end of this mail.)

Thx in advance :-)

gluster volume info:

Volume Name: shared
Type: Distributed-Replicate
Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: gluster11:/gluster/bricksda1/shared
Brick2: gluster12:/gluster/bricksda1/shared
Brick3: gluster13:/gluster/bricksda1/shared
Brick4: gluster11:/gluster/bricksdb1/shared
Brick5: gluster12:/gluster/bricksdb1/shared
Brick6: gluster13:/gluster/bricksdb1/shared
Brick7: gluster11:/gluster/bricksdc1/shared
Brick8: gluster12:/gluster/bricksdc1/shared
Brick9: gluster13:/gluster/bricksdc1/shared
Brick10: gluster11:/gluster/bricksdd1/shared
Brick11: gluster12:/gluster/bricksdd1_new/shared
Brick12: gluster13:/gluster/bricksdd1_new/shared
Options Reconfigured:
cluster.shd-max-threads: 4
performance.md-cache-timeout: 60
cluster.lookup-optimize: on
cluster.readdir-optimize: on
performance.cache-refresh-timeout: 4
performance.parallel-readdir: on
server.event-threads: 8
client.event-threads: 8
performance.cache-max-file-size: 128MB
performance.write-behind-window-size: 16MB
performance.io-thread-count: 64
cluster.min-free-disk: 1%
performance.cache-size: 24GB
nfs.disable: on
transport.address-family: inet
performance.high-prio-threads: 32
performance.normal-prio-threads: 32
performance.low-prio-threads: 32
performance.least-prio-threads: 8
performance.io-cache: on
server.allow-insecure: on
performance.strict-o-direct: off
transport.listen-backlog: 100
server.outstanding-rpc-limit: 128
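
For reference, the back-of-the-envelope math on the heal rate mentioned above (assuming the observed ~50 GB/day stays constant, which may even be optimistic for small files):

    ~1.6 TB per brick / ~50 GB per day  =  ~32 days per healed brick

so roughly a month for each of the two replaced bricks.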
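
In case it helps with diagnosis, these are the commands I'm using to watch the heal (volume name "shared" as in the info output above); as far as I know both are available in 3.12:

    # number of entries still pending heal, per brick
    gluster volume heal shared statistics heal-count

    # list the actual files/gfids still pending
    gluster volume heal shared info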
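
And a sketch of the tuning I'd try next, if someone can confirm it is sane for a small-file replica 3 volume - both options exist in 3.12, but the values are just guesses on my part:

    # allow more parallel self-heal threads per shd (currently 4, see above)
    gluster volume set shared cluster.shd-max-threads 8

    # full-file copy can beat the diff algorithm for lots of small files
    gluster volume set shared cluster.data-self-heal-algorithm full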