thr3ads.net - Gluster users - [Gluster-users] Setup recommendations [Oct 2020]

Nico van Royen

2020-Oct-16 07:05 UTC

[Gluster-users] Setup recommendations

Gluster community: 

Due to some recent performance issues at one of our (internal) clients, which
uses several GlusterFS setups, this is mainly an open question about possible
generic improvements (apart from the fact that the users themselves do some
'less smart' things on it...).

The setup used: a 3-node (replica=3) RHEL7 Gluster cluster (with RHGS, so
currently that is GlusterFS 6.0-37.1.el7rhgs). Each cluster has 1 volume,
exported through NFS-Ganesha.
Each node also has a virtual IP, managed by pacemaker/corosync following the
standard Red Hat setup documentation.
Size is not that big: 600GB of space with around half of that actually used.
GlusterFS servers themselves each have 4 cores and 12GB memory. It might also be
important to note that these are VMware hosted nodes that make use of SAN
storage for the datastores.

Connected to that NFS (ganesha) exported share are just over 100 clients, all
RHEL6 and RHEL7, some spanning 10 network hops away. All of those clients are
(currently) using the same virtual-IP, so all end up on the same server.
(we did already advise them to spread that across the three servers). 
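
One way to spread clients over the three virtual IPs without keeping a manual
list is to derive the choice from each client's hostname. This is only a
sketch: the VIP names below are placeholders (the real ones are not given
here) and `vip_for_client` is a made-up helper, not part of any Gluster or
Ganesha tooling.

```python
import hashlib

# Hypothetical names for the three virtual IPs; substitute the real ones.
VIPS = ["gluster-vip1", "gluster-vip2", "gluster-vip3"]


def vip_for_client(hostname: str, vips=VIPS) -> str:
    """Pick a virtual IP for a client; the result is stable across runs."""
    # hashlib gives the same digest on every host, unlike the built-in hash().
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return vips[int.from_bytes(digest[:8], "big") % len(vips)]


if __name__ == "__main__":
    for name in ("client001", "client002", "client003"):
        print(name, "->", vip_for_client(name))
```

The selected name would then go into the client's fstab or automount entry in
place of the single shared virtual IP.
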
Certain subfolders of the share hold (at times) large numbers of (small) files
that *should* peak at around 50.000 files, into a single, unhashed, directory
(this will of course make simple ls and find commands via NFS quite slow).
Note that I mentioned 'should', since at times it had anywhere between
250.000 and 1 million files in it (which of course is not advised). Using some
kind of hashing (subfolders spread per day/hour etc) was also already advised.
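
The day/hour bucketing mentioned above could look roughly like the following.
It is a minimal sketch, assuming files can be placed according to a timestamp;
the helper name `bucketed_path`, the `/BLAH/incoming` root and the
`YYYY/MM/DD/HH` layout are illustrative choices, not what the affected
application does today.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def bucketed_path(root: str, filename: str, when: datetime) -> PurePosixPath:
    """Return root/YYYY/MM/DD/HH/filename for the given timestamp.

    Hypothetical helper: spreads files over per-hour subdirectories so that
    no single directory has to hold the whole backlog.
    """
    utc = when.astimezone(timezone.utc)
    return PurePosixPath(root) / utc.strftime("%Y/%m/%d/%H") / filename


if __name__ == "__main__":
    ts = datetime(2020, 10, 16, 7, 5, tzinfo=timezone.utc)
    # "/BLAH/incoming" is a made-up location on the exported share.
    print(bucketed_path("/BLAH/incoming", "report-0001.dat", ts))
```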

Problems that are often seen: 
- Any kind of operation on VMware such as a vMotion, creating a VM snapshot etc.
on the node that has these 100+ clients connected causes such a temporary pause
that pacemaker decides to switch the resources (causing a failover of the
virtual IP address, thus connected clients suffer a delay). One would expect
this to last just shy of a minute, after which clients would happily continue.
However, connected clients are stuck with a non-working mountpoint (commands
such as df, ls, find etc. simply hang; they go into an uninterruptible sleep).
The mounts are 'hard' mounts to ensure guaranteed writes.
- Once the number of files is over the 100.000 mark (again in a single,
unhashed, folder) any operation on that share becomes very sluggish (even a df,
on a client, would take 20/30 seconds, a find command would take minutes to
complete).
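
Part of the cost of a plain ls on such a directory is that it reads and sorts
every entry before printing anything, and with -l or coloured output it also
stats each one. To check how many entries a directory really holds without
that overhead, the entries can be streamed instead. A minimal sketch, assuming
Python is available on a client; `count_entries` is a made-up name:

```python
import os
import sys


def count_entries(directory: str) -> int:
    """Count directory entries by streaming them, without sorting or stat()."""
    total = 0
    with os.scandir(directory) as entries:
        for _ in entries:
            total += 1
    return total


if __name__ == "__main__":
    # Pass the directory as the first argument; defaults to the current one.
    print(count_entries(sys.argv[1] if len(sys.argv) > 1 else "."))
```

This does not make the directory itself any faster over NFS; it only avoids
the extra work the usual tools add on top of reading it.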

Can anyone spot any ideas for improvement?

Some config info (below is from a sandbox setup using the same values as the
affected gluster):
For Ganesha: 
/etc/ganesha/ganesha.conf: 
# BEGIN ANSIBLE MANAGED BLOCK 
NFSv4 {
    minor_versions = 0;
}
# END ANSIBLE MANAGED BLOCK
%include "/var/run/gluster/shared_storage/nfs-ganesha/exports/export.BLAH.conf"

/var/run/gluster/shared_storage/nfs-ganesha/exports/export.BLAH.conf: 
EXPORT {
    Export_Id = 2;
    Path = "/BLAH";
    FSAL {
        name = GLUSTER;
        hostname = "localhost";
        volume = "BLAH";
    }
    Access_type = RW;
    Disable_ACL = true;
    Squash = "No_root_squash";
    Pseudo = "/BLAH";
    Protocols = "3", "4";
    Transports = "UDP", "TCP";
    SecType = "sys";
}

# gluster v info BLAH 
Volume Name: BLAH 
Type: Replicate 
Volume ID: 6fee713c-4258-44d8-a849-f8d6b2991631 
Status: Started 
Snapshot Count: 0 
Number of Bricks: 1 x 3 = 3 
Transport-type: tcp 
Bricks: 
Brick1: tlrvrhgluster03:/gluster/BLAH/export 
Brick2: tlrvrhgluster02:/gluster/BLAH/export 
Brick3: tlrvrhgluster01:/gluster/BLAH/export 
Options Reconfigured: 
diagnostics.count-fop-hits: on 
diagnostics.latency-measurement: on 
ganesha.enable: on 
features.cache-invalidation: on 
performance.client-io-threads: off 
nfs.disable: on 
storage.fips-mode-rchecksum: on 
transport.address-family: inet 
cluster.server-quorum-type: server 
cluster.quorum-count: 2 
performance.cache-refresh-timeout: 10 
cluster.quorum-type: fixed 
cluster.enable-shared-storage: enable 
nfs-ganesha: enable 

# lvs -a 
LV              VG       Attr       LSize   Pool    Origin Data%  Meta%  Move Log Cpy%Sync Convert
BLAH            gfsvg    Vwi-aotz--  10.00g gfspool        36.45
gfspool         gfsvg    twi-aotz--  48.00g                7.59   1.76
[gfspool_tdata] gfsvg    Twi-ao----  48.00g
[gfspool_tmeta] gfsvg    ewi-ao----   1.00g
[lvol0_pmspare] gfsvg    ewi-------  48.00m
auditlv         systemvg -wi-ao---- 252.00m
homelv          systemvg -wi-ao----   1.00g
rootlv          systemvg -wi-ao----  16.00g
swaplv          systemvg -wi-ao----   2.00g
tmplv           systemvg -wi-ao----   2.00g
varcorelv       systemvg -wi-ao----   1.00g
varloglv        systemvg -wi-ao----   6.00g
varlv           systemvg -wi-ao----   6.00g

Strahil Nikolov

2020-Oct-19 03:56 UTC

> Size is not that big, 600GB space with around half of that actually used.
> GlusterFS servers themselves each have 4 cores and 12GB memory. It might also
> be important to note that these are VMware hosted nodes that make use of SAN
> storage for the datastores.

4 cores is quite low, especially when healing.

> Connected to that NFS (ganesha) exported share are just over 100 clients,
> all RHEL6 and RHEL7, some spanning 10 network hops away. All of those clients
> are (currently) using the same virtual-IP, so all end up on the same server.

Why not FUSE? Ganesha is suitable for UNIX and BSD systems that do not support
FUSE.

> Note that I mentioned 'should', since at times it had anywhere
> between 250.000 and 1 million files in it (which of course is not advised).
> Using some kind of hashing (subfolders spread per day/hour etc) was also already
> advised.

If you have multiple subvolumes (going from replicate to
distributed-replicate), you can also spread the load - yet 'find' won't be
faster :)


> Problems that are often seen:
> - Any kind of operation on VMware such as a vMotion, creating a VM snapshot
> etc. on the node that has these 100+ clients connected causes such a temporary
> pause that pacemaker decides to switch the resources (causing a failover of the
> virtual IP address, thus clients connected suffer delay).

RH corosync defaults are not suitable for VMs. I prefer SUSE's defaults.
Consider increasing 'token' and 'consensus' to more meaningful
values -> start with a 10s token, for example.
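
As a rough illustration of the token/consensus suggestion: in corosync.conf
both values belong to the totem section and are given in milliseconds, and
consensus must be at least 1.2 times token (corosync derives it that way when
it is left unset). The helper below only renders the two settings; its name is
made up, and 10s is just the suggested starting point, not a tested value.

```python
def totem_fragment(token_ms: int, consensus_ms=None) -> str:
    """Render the token/consensus lines of a corosync.conf totem section.

    Hypothetical helper for illustration; it does not touch any cluster.
    """
    minimum = (token_ms * 12 + 9) // 10  # ceil(1.2 * token) in integer math
    if consensus_ms is None:
        consensus_ms = minimum
    if consensus_ms < minimum:
        raise ValueError("consensus must be at least 1.2 * token")
    return (
        "totem {\n"
        f"    token: {token_ms}\n"
        f"    consensus: {consensus_ms}\n"
        "}\n"
    )


if __name__ == "__main__":
    print(totem_fragment(10_000))
```

The two lines would be merged into the existing totem section rather than
replacing it, since that section carries other settings as well.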

> One would expect this to last just shy under a minute, then clients would
> happily continue. However connected clients are stuck with a non-working
> mountpoint (commands as df, ls, find etc simply hang.. they go into an
> uninterruptible sleep).

In regular HA NFS, there is a "notify" resource that notifies the
clients about the failover. The stale mount happens because your IP is brought
up before the NFS export is ready. As you haven't provided HA details, I can't
help much there.

> Mount are 'hard' mounts to insure guaranteed writes.

That's good. It is also needed for the HA to work properly.

> - Once the number of files are over the 100.000 mark (again into a single,
> unhashed, folder) any operation on that share becomes very sluggish (even a df,
> on a client, would take 20/30 seconds, a find command would take minutes to
> complete).

I think it's expected...

> If anyone can spot any ideas for improvement ?

I would first try switching to 'replica 3 arbiter 1', as the current setup is
wasting storage, and next switch the clients to FUSE.
For performance improvements, I would add some SSDs to the mix (tier 1+
storage) and use the SSD-based LUNs for LVM caching.

Best Regards,
Strahil Nikolov
