Dear Strahil,
thank you very much for your response. After some testing, the brick
performance turned out to be much lower than expected; the benchmarking
we did at the very beginning did not reveal this. With our real-life
data, even for a modest number of directories and files, a simple ls -l
on the **brick** itself takes up to 16 seconds (around 4 seconds in
general, 16 seconds being the worst case we observed).
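For reference, this was measured directly on the brick with a plain
listing along these lines (the subdirectory is only a placeholder):

time ls -l /gluster/vg00/dispersed_fuse/brick/<some-directory>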
We decided to reconfigure the bricks from raid6 to raid10
using the reset-brick command, see section 11.9.5 as reference:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-migrating_volumes#sect-Migrating_Volumes-Reconfigure_Brick
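For one of our bricks the sequence follows the pattern from that section,
roughly (the same steps were repeated for every brick):

gluster volume reset-brick dispersed_fuse mimas:/gluster/vg00/dispersed_fuse/brick start
(rebuild the underlying array as raid10, recreate the file system and brick directory)
gluster volume reset-brick dispersed_fuse mimas:/gluster/vg00/dispersed_fuse/brick \
    mimas:/gluster/vg00/dispersed_fuse/brick commit force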
This reconfiguration worked perfectly and self-healing is almost done now
(by the way, is a self-heal rate of about 2 TB/day normal?).
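The remaining heal backlog can be followed with, for example:

gluster volume heal dispersed_fuse info summary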
While the bricks now show acceptable raid10 performance (at least for our
current workload), the performance on the FUSE mount point of the
dispersed volume (4+2) is lower than the brick performance. Maybe I am
wrong, but since reads and writes are spread across several bricks, I
would have expected at least slightly higher performance than that of a
single brick.
Another example:
Just for testing purposes, I created a distributed volume (sharding
enabled) on the same storage nodes, consisting of 2 NVMe bricks per node
(i.e. 6 in total).
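The test volume was created roughly along these lines (the volume name and
NVMe brick paths here are placeholders, not our exact layout):

gluster volume create dist_nvme transport tcp \
    mimas:/gluster/nvme0/brick mimas:/gluster/nvme1/brick \
    enceladus:/gluster/nvme0/brick enceladus:/gluster/nvme1/brick \
    tethys:/gluster/nvme0/brick tethys:/gluster/nvme1/brick
gluster volume set dist_nvme features.shard on
gluster volume start dist_nvme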
A simple rsync of a single large file achieves 222.25 MB/s on a FUSE
mount point of this distributed volume. The same command, executed
directly on the brick, achieves 265.19 MB/s. As before, the volume
performance is lower than that of the individual brick. Is this normal
behavior, or what can be done to improve the single-client performance
in a case like this? (The test is sketched below.)
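Roughly, the comparison looked like this (file name, mount point and brick
path are placeholders):

rsync --progress /data/large-testfile.img /mnt/dist_nvme/          # via the FUSE mount
rsync --progress /data/large-testfile.img /gluster/nvme0/brick/    # directly on the brick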
Regards,
Felix
On 20/02/2020 22:26, Strahil Nikolov wrote:
> On February 20, 2020 10:39:47 PM GMT+02:00, "Felix Kölzow"
> <felix.koelzow at gmx.de> wrote:
>> Dear Gluster-Experts,
>>
>>
>> we created a three-node setup with two bricks each and a dispersed volume
>> that can be accessed via the native client (glusterfs --version = 6.0).
>>
>>
>> The nodes are connected via 10Gbps (cat6, bonding mode 6).
>>
>>
>> If we run a performance test using the smallfile benchmark tool,
>>
>> ./smallfile_cli.py --top ~/test6/ --file-size 16384 --threads 8 --files
>> 1000 --response-times y --operation create
>>
>> we observe the behavior shown in the attached figure. For a short
>> time, we can almost saturate the 10Gbps link, but then network traffic
>> immediately drops almost to zero for about 5 to 25 seconds.
>>
>> Actually, we don't have any clue what causes this strange behavior.
>> We had reconfigured a lot of volume options and now we are starting
>> from scratch, i.e.
>>
>>
>> gluster volume reset dispersed_fuse all
>>
>>
>> and the current options look like this:
>>
>> Volume Name: dispersed_fuse
>> Type: Disperse
>> Volume ID: 45d3c7c9-526f-45ea-9930-ee7a2274a220
>> Status: Started
>> Snapshot Count: 1
>> Number of Bricks: 1 x (4 + 2) = 6
>> Transport-type: tcp
>> Bricks:
>> Brick1: mimas:/gluster/vg00/dispersed_fuse/brick
>> Brick2: enceladus:/gluster/vg00/dispersed_fuse/brick
>> Brick3: tethys:/gluster/vg00/dispersed_fuse/brick
>> Brick4: mimas:/gluster/vg01/dispersed_fuse/brick
>> Brick5: enceladus:/gluster/vg01/dispersed_fuse/brick
>> Brick6: tethys:/gluster/vg01/dispersed_fuse/brick
>> Options Reconfigured:
>> client.event-threads: 16
>> server.event-threads: 16
>> performance.io-thread-count: 16
>> geo-replication.ignore-pid-check: on
>> geo-replication.indexing: on
>> features.quota: off
>> features.inode-quota: off
>> nfs.disable: on
>> storage.fips-mode-rchecksum: on
>> server.outstanding-rpc-limit: 512
>> server.root-squash: off
>> cluster.server-quorum-ratio: 51%
>> cluster.enable-shared-storage: enable
>>
>>
>> Any hints for root cause analysis are appreciated.
>>
>> Please let me know if you need more information.
>>
>>
>> Kind Regards,
>>
>> Felix
> Hi Felix,
>
> Usually, for performance issues the devs need a little bit more detail.
>
> In your case you can:
> 1. Test with a real workload - as this is what you are going to do with
> Gluster. If that is not possible, you can run a synthetic benchmark
> focusing on the workload that is closest to the real one.
>
> 2. Check the groups of options in /var/lib/glusterd/groups. There is a
> 'profile' group for DB, virtualization, small-file workloads, etc.
> Play with those settings and adjust as necessary.
>
> 3. Obtain a gluster profile and use the 'top' command to gain more
> information about the pool's status.
> For details, check
> https://docs.gluster.org/en/latest/Administrator%20Guide/Monitoring%20Workload/
>
> This information will allow more experienced administrators and the
> developers to identify any pattern that could cause the symptoms.
>
> Tuning Gluster is one of the hardest topics, so you should prepare
> yourself for a lot of testing until you reach the optimal settings for
> your volumes.
>
> Best Regards,
> Strahil Nikolov