thr3ads.net - Gluster users - [Gluster-users] How to configure? [Mar 2023]

Diego Zuccato

2023-Mar-16 13:28 UTC

[Gluster-users] How to configure?

In Debian, stopping glusterd does not stop the brick processes: to stop
everything (and free the memory) I have to run
   systemctl stop glusterd
   killall glusterfs{,d}
   killall glfsheal
   systemctl start glusterd
[this behaviour hangs a simple reboot of a machine running glusterd...
not nice]
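
The manual sequence above could be wrapped in a small script. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and the commands are the ones listed in this message. With DRY_RUN=1 (the default here) it only prints what it would run.

```shell
#!/bin/sh
# Sketch of the manual stop/start sequence from this message.
# DRY_RUN=1 (default) prints the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

stop_all_gluster() {
    run systemctl stop glusterd
    # killall exits non-zero when no process matched; that is not an error here.
    run killall glusterfs glusterfsd || true
    run killall glfsheal || true
}

restart_all_gluster() {
    stop_all_gluster
    run systemctl start glusterd
}
```

Calling `restart_all_gluster` with DRY_RUN=0 as root would execute the sequence. Keep in mind that `glusterfs` is also the process name used by FUSE mounts and the self-heal daemon, so this takes those down on the node as well.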

For now I just restarted glusterd w/o killing the bricks:

root@str957-clustor00:~# ps aux|grep glfsheal|wc -l ; systemctl restart glusterd ; ps aux|grep glfsheal|wc -l
618
618

No change in either the number of glfsheal processes or free memory :(
Should I "killall glfsheal" before OOM kicks in?
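
The before/after counts above come from `ps aux|grep glfsheal|wc -l`, which also counts the grep process itself. A minimal sketch of a stricter count, assuming input in the form printed by `ps -eo comm=` (one command name per line); the helper name `count_exact` is made up for illustration:

```shell
# count_exact NAME: read one command name per line on stdin and print how
# many lines are exactly NAME, so a "grep glfsheal" process is never counted.
count_exact() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Typical use on a live node (not executed here):
#   ps -eo comm= | count_exact glfsheal
```

On systems with procps, `pgrep -c -x glfsheal` should report the same number.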

Diego

On 16/03/2023 12:37, Strahil Nikolov wrote:
> Can you restart the glusterd service (first check that it was not modified
> to kill the bricks)?
> 
> Best Regards,
> Strahil Nikolov
> 
>     On Thu, Mar 16, 2023 at 8:26, Diego Zuccato
>     <diego.zuccato at unibo.it> wrote:
>     OOM is just a matter of time.
> 
>     Today mem use is up to 177G/187 and:
>     # ps aux|grep glfsheal|wc -l
>     551
> 
>     (well, one is actually the grep process, so "only" 550 glfsheal
>     processes).
> 
>     I'll take the last 5:
>     root     3266352  0.5  0.0 600292 93044 ?        Sl   06:55   0:07
>     /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
>     root     3267220  0.7  0.0 600292 91964 ?        Sl   07:00   0:07
>     /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
>     root     3268076  1.0  0.0 600160 88216 ?        Sl   07:05   0:08
>     /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
>     root     3269492  1.6  0.0 600292 91248 ?        Sl   07:10   0:07
>     /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
>     root     3270354  4.4  0.0 600292 93260 ?        Sl   07:15   0:07
>     /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
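
For a sense of scale, the resident set size of these processes (sixth column of `ps aux`, in KiB) can be added up. The following is a sketch that assumes unwrapped `ps aux` lines on stdin; the helper name `sum_rss_kib` is made up for illustration.

```shell
# sum_rss_kib NAME: read `ps aux` style lines on stdin and print the total
# RSS (column 6, KiB) of the lines whose command basename (column 11) is NAME.
sum_rss_kib() {
    awk -v name="$1" '
        { cmd = $11; sub(/.*\//, "", cmd) }   # strip the directory part
        cmd == name { total += $6 }
        END { print total + 0 }'
}

# Typical use on a live node (not executed here):
#   ps aux | sum_rss_kib glfsheal
```

For the five processes listed above this gives 457732 KiB, about 89 MiB each. If all 550 glfsheal processes were of similar size (an assumption, and RSS can include shared pages), that would be on the order of 48 GiB.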
> 
>     -8<--
>     root@str957-clustor00:~# ps -o ppid= 3266352
>     3266345
>     root@str957-clustor00:~# ps -o ppid= 3267220
>     3267213
>     root@str957-clustor00:~# ps -o ppid= 3268076
>     3268069
>     root@str957-clustor00:~# ps -o ppid= 3269492
>     3269485
>     root@str957-clustor00:~# ps -o ppid= 3270354
>     3270347
>     root@str957-clustor00:~# ps aux|grep 3266345
>     root     3266345  0.0  0.0 430536 10764 ?        Sl   06:55   0:00
>     gluster volume heal cluster_data info summary --xml
>     root     3271532  0.0  0.0   6260  2500 pts/1    S+   07:21   0:00 grep
>     3266345
>     root@str957-clustor00:~# ps aux|grep 3267213
>     root     3267213  0.0  0.0 430536 10644 ?        Sl   07:00   0:00
>     gluster volume heal cluster_data info summary --xml
>     root     3271599  0.0  0.0   6260  2480 pts/1    S+   07:22   0:00 grep
>     3267213
>     root@str957-clustor00:~# ps aux|grep 3268069
>     root     3268069  0.0  0.0 430536 10704 ?        Sl   07:05   0:00
>     gluster volume heal cluster_data info summary --xml
>     root     3271626  0.0  0.0   6260  2516 pts/1    S+   07:22   0:00 grep
>     3268069
>     root@str957-clustor00:~# ps aux|grep 3269485
>     root     3269485  0.0  0.0 430536 10756 ?        Sl   07:10   0:00
>     gluster volume heal cluster_data info summary --xml
>     root     3271647  0.0  0.0   6260  2480 pts/1    S+   07:22   0:00 grep
>     3269485
>     root@str957-clustor00:~# ps aux|grep 3270347
>     root     3270347  0.0  0.0 430536 10672 ?        Sl   07:15   0:00
>     gluster volume heal cluster_data info summary --xml
>     root     3271666  0.0  0.0   6260  2568 pts/1    S+   07:22   0:00 grep
>     3270347
>     -8<--
> 
>     Seems glfsheal is spawning more processes.
>     I can't rule out a metadata corruption (or at least a desync), but it
>     shouldn't happen...
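
The per-PID parent lookups above can be done for all glfsheal processes at once. This is a sketch assuming input in the form printed by `ps -eo pid=,ppid=,comm=`; the helper name `parents_of` is made up for illustration. In the listing above each glfsheal has its own `gluster volume heal ... info summary --xml` parent, started five minutes apart, which would be consistent with something invoking that command periodically (the thread does not confirm what).

```shell
# parents_of NAME: read "PID PPID COMM" lines on stdin and print, for every
# process named NAME, the command name of its parent together with a count.
# Parents that do not appear in the input are reported as "unknown".
parents_of() {
    awk -v name="$1" '
        { comm[$1] = $3; parent[$1] = $2 }
        END {
            for (pid in comm)
                if (comm[pid] == name) {
                    p = parent[pid]
                    tally[(p in comm) ? comm[p] : "unknown"]++
                }
            for (c in tally) print tally[c], c
        }' | sort -k1,1nr -k2
}

# Typical use on a live node (not executed here):
#   ps -eo pid=,ppid=,comm= | parents_of glfsheal
```

A single dominant parent command in the output points at whatever keeps launching the heal-info queries.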
> 
>     Diego
> 
>     On 15/03/2023 20:11, Strahil Nikolov wrote:
>      > If you don't experience any OOM, you can focus on the heals.
>      >
>      > 284 glfsheal processes seems odd.
>      >
>      > Can you check the ppid of 2-3 randomly picked ones?
>      > ps -o ppid= <pid>
>      >
>      > Best Regards,
>      > Strahil Nikolov
>      >
>      >    On Wed, Mar 15, 2023 at 9:54, Diego Zuccato
>      >    <diego.zuccato at unibo.it> wrote:
>      >    I enabled it yesterday and that greatly reduced memory pressure.
>      >    Current volume info:
>      >    -8<--
>      >    Volume Name: cluster_data
>      >    Type: Distributed-Replicate
>      >    Volume ID: a8caaa90-d161-45bb-a68c-278263a8531a
>      >    Status: Started
>      >    Snapshot Count: 0
>      >    Number of Bricks: 45 x (2 + 1) = 135
>      >    Transport-type: tcp
>      >    Bricks:
>      >    Brick1: clustor00:/srv/bricks/00/d
>      >    Brick2: clustor01:/srv/bricks/00/d
>      >    Brick3: clustor02:/srv/bricks/00/q (arbiter)
>      >    [...]
>      >    Brick133: clustor01:/srv/bricks/29/d
>      >    Brick134: clustor02:/srv/bricks/29/d
>      >    Brick135: clustor00:/srv/bricks/14/q (arbiter)
>      >    Options Reconfigured:
>      >    performance.quick-read: off
>      >    cluster.entry-self-heal: on
>      >    cluster.data-self-heal-algorithm: full
>      >    cluster.metadata-self-heal: on
>      >    cluster.shd-max-threads: 2
>      >    network.inode-lru-limit: 500000
>      >    performance.md-cache-timeout: 600
>      >    performance.cache-invalidation: on
>      >    features.cache-invalidation-timeout: 600
>      >    features.cache-invalidation: on
>      >    features.quota-deem-statfs: on
>      >    performance.readdir-ahead: on
>      >    cluster.granular-entry-heal: enable
>      >    features.scrub: Active
>      >    features.bitrot: on
>      >    cluster.lookup-optimize: on
>      >    performance.stat-prefetch: on
>      >    performance.cache-refresh-timeout: 60
>      >    performance.parallel-readdir: on
>      >    performance.write-behind-window-size: 128MB
>      >    cluster.self-heal-daemon: enable
>      >    features.inode-quota: on
>      >    features.quota: on
>      >    transport.address-family: inet
>      >    nfs.disable: on
>      >    performance.client-io-threads: off
>      >    client.event-threads: 1
>      >    features.scrub-throttle: normal
>      >    diagnostics.brick-log-level: ERROR
>      >    diagnostics.client-log-level: ERROR
>      >    config.brick-threads: 0
>      >    cluster.lookup-unhashed: on
>      >    config.client-threads: 1
>      >    cluster.use-anonymous-inode: off
>      >    diagnostics.brick-sys-log-level: CRITICAL
>      >    features.scrub-freq: monthly
>      >    cluster.data-self-heal: on
>      >    cluster.brick-multiplex: on
>      >    cluster.daemon-log-level: ERROR
>      >    -8<--
>      >
>      >    htop reports that memory usage is up to 143G, there are 602 tasks and
>      >    5232 threads (~20 running) on clustor00, 117G/49 tasks/1565 threads on
>      >    clustor01 and 126G/45 tasks/1574 threads on clustor02.
>      >    I see quite a lot (284!) of glfsheal processes running on clustor00 (a
>      >    "gluster v heal cluster_data info summary" has been running on clustor02
>      >    since yesterday, still no output). Shouldn't there be just one per brick?
>      >
>      >    Diego
>      >
>      >    On 15/03/2023 08:30, Strahil Nikolov wrote:
>      >      > Do you use brick multiplexing?
>      >      >
>      >      > Best Regards,
>      >      > Strahil Nikolov
>      >      >
>      >      >    On Tue, Mar 14, 2023 at 16:44, Diego Zuccato
>      >      >    <diego.zuccato at unibo.it> wrote:
>      >      >    Hello all.
>      >      >
>      >      >    Our Gluster 9.6 cluster is showing increasing problems.
>      >      >    Currently it's composed of 3 servers (2x Intel Xeon 4210 [20 cores
>      >      >    dual thread, total 40 threads], 192GB RAM, 30x HGST HUH721212AL5200
>      >      >    [12TB]), configured in replica 3 arbiter 1. Using Debian packages from
>      >      >    the Gluster 9.x latest repository.
>      >      >
>      >      >    It seems 192G of RAM is not enough to handle 30 data bricks + 15
>      >      >    arbiters, and I often had to reload glusterfsd because glusterfs
>      >      >    processes got killed by the OOM killer.
>      >      >    On top of that, performance has been quite bad, especially when we
>      >      >    reached about 20M files. Moreover, one of the servers has had
>      >      >    mobo issues that resulted in memory errors that corrupted the
>      >      >    filesystem of some bricks (XFS, it required "xfs_repair -L" to fix).
>      >      >    Now I'm getting lots of "stale file handle" errors and other errors
>      >      >    (like directories that seem empty from the client but still contain
>      >      >    files in some bricks) and auto healing seems unable to complete.
>      >      >
>      >      >    Since I can't keep up with manually fixing all the issues, I'm
>      >      >    thinking about a backup+destroy+recreate strategy.
>      >      >
>      >      >    I think that if I reduce the number of bricks per server to just 5
>      >      >    (RAID1 of 6x12TB disks) I might resolve the RAM issues, at the cost of
>      >      >    longer heal times in case a disk fails. Am I right, or is it useless?
>      >      >    Other recommendations?
>      >      >    Servers have space for another 6 disks. Maybe those could be used for
>      >      >    some SSDs to speed up access?
>      >      >
>      >      >    TIA.
>      >      >
>      >      >    --
>      >      >    Diego Zuccato
>      >      >    DIFA - Dip. di Fisica e Astronomia
>      >      >    Servizi Informatici
>      >      >    Alma Mater Studiorum - Università di Bologna
>      >      >    V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>      >      >    tel.: +39 051 20 95786
>      >      >    ________
>      >      >
>      >      >    Community Meeting Calendar:
>      >      >
>      >      >    Schedule -
>      >      >    Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>      >      >    Bridge: https://meet.google.com/cpu-eiue-hvk
>      >      >    Gluster-users mailing list
>      >      >    Gluster-users at gluster.org
>      >      >    https://lists.gluster.org/mailman/listinfo/gluster-users
-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Strahil Nikolov

2023-Mar-21 00:21 UTC


[Gluster-users] How to configure?

Theoretically it might help. If possible, try to resolve any pending heals.

Best Regards,
Strahil Nikolov