My volume is replica 3 arbiter 1, maybe that makes a difference?
Brick processes tend to die quite often (I have to restart glusterd at
least once a day because "gluster v info | grep ' N '" reports at least
one missing brick; sometimes, even if all bricks are reported up, I have
to kill all glusterfs[d] processes and restart glusterd).
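In practice the daily check boils down to something like this; a rough
sketch only, assuming the volume name cluster_data used later in this
thread, systemd, and the per-brick "Online" column of "gluster volume
status":

   # restart glusterd if any brick of cluster_data shows Online = N
   if gluster volume status cluster_data | grep -q ' N ' ; then
       systemctl restart glusterd
   fi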
The 3 servers have 192GB RAM (that should be way more than enough!), 30 
data bricks and 15 arbiters (the arbiters share a single SSD).
And I noticed that some "stale file handle" errors are not reported by
heal info:
root at str957-cluster:/# ls -l /scratch/extra/m******/PNG/PNGQuijote/ModGrav/fNL40/
ls: cannot access '/scratch/extra/m******/PNG/PNGQuijote/ModGrav/fNL40/output_21': Stale file handle
total 40
d?????????  ? ?            ?               ?            ? output_21
...
but "gluster v heal cluster_data info |grep output_21" returns
nothing. :(
Seems the other stale handles either got corrected by subsequent 'stat's
or became I/O errors.
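One way to cross-check whether such a stale entry still has a gfid known
to heal info is something like this; a rough sketch, run directly on one
of the brick nodes, with an illustrative brick-side path that has to be
adapted to the real layout:

   # read the gfid of the stale entry on one brick and turn it into the
   # dashed form used by heal info
   B=/srv/bricks/00/d   # illustrative brick root
   G=$(getfattr -n trusted.gfid -e hex \
           "$B/extra/m******/PNG/PNGQuijote/ModGrav/fNL40/output_21" \
       | awk -F= '/trusted.gfid/ {print $2}' \
       | sed -E 's/^0x//; s/(.{8})(.{4})(.{4})(.{4})(.{12})/\1-\2-\3-\4-\5/')
   # then check whether heal info knows about that gfid at all
   gluster v heal cluster_data info | grep -i "$G"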
Diego.
On 12/02/2023 21:34, Strahil Nikolov wrote:
> The second error indicates conflicts between the nodes. The only way that
> could happen on replica 3 is a gfid conflict (file/dir was renamed or
> recreated).
> 
> Are you sure that all bricks are online? Usually 'Transport endpoint is
> not connected' indicates a brick down situation.
> 
> First start with all stale file handles:
> check md5sum on all bricks. If it differs somewhere, delete the gfid and 
> move the file away from the brick and check in FUSE. If it's fine,
> touch it and the FUSE client will "heal" it.
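A minimal sketch of that md5sum comparison, assuming ssh access between
the nodes, the clustor0* hostnames and the /srv/bricks/*/d brick layout
mentioned elsewhere in this thread, and a purely illustrative relative
path REL/PATH/TO/FILE:

   # compare the checksum of one file (path relative to the brick root)
   # on every node
   F=REL/PATH/TO/FILE
   for H in clustor00 clustor01 clustor02; do
       echo "== $H"
       ssh "$H" "md5sum /srv/bricks/*/d/$F 2>/dev/null"
   done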
> 
> Best Regards,
> Strahil Nikolov
> 
> 
> 
>     On Tue, Feb 7, 2023 at 16:33, Diego Zuccato
>     <diego.zuccato at unibo.it> wrote:
>     The contents do not match exactly, but the only difference is the
>     "option shared-brick-count" line that sometimes is 0 and sometimes 1.
> 
>     The command you gave could be useful for the files that still need
>     healing with the source still present, but the files related to the
>     stale gfids have been deleted, so "find -samefile" won't find anything.
> 
>     For the other files reported by heal info, I saved the output to
>     'healinfo', then:
>       for T in $(grep '^/' healinfo | sort | uniq); do stat /mnt/scratch$T > /dev/null; done
> 
>     but I still see a lot of 'Transport endpoint is not connected' and
>     'Stale file handle' errors :( And many 'No such file or directory'...
> 
>     I don't understand the first two errors, since /mnt/scratch has been
>     freshly mounted after enabling client healing, and gluster v info does
>     not highlight unconnected/down bricks.
> 
>     Diego
> 
>     On 06/02/2023 22:46, Strahil Nikolov wrote:
>      > I'm not sure if the md5sum has to match, but at least the content
>      > should do.
>      > In modern versions of GlusterFS the client-side healing is disabled,
>      > but it's worth trying.
>      > You will need to enable cluster.metadata-self-heal,
>      > cluster.data-self-heal and cluster.entry-self-heal and then create a
>      > small one-liner that identifies the names of the files/dirs from the
>      > volume heal info, so you can stat them through the FUSE mount.
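A minimal sketch of enabling those three options, assuming the volume is
named cluster_data as elsewhere in this thread:

   gluster volume set cluster_data cluster.metadata-self-heal on
   gluster volume set cluster_data cluster.data-self-heal on
   gluster volume set cluster_data cluster.entry-self-heal on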
>      >
>      > Something like this:
>      >
>      > for i in $(gluster volume heal <VOL> info | awk -F '<gfid:|>' '/gfid:/ {print $2}'); do find /PATH/TO/BRICK/ -samefile /PATH/TO/BRICK/.glusterfs/${i:0:2}/${i:2:2}/$i | awk '!/.glusterfs/ {gsub("/PATH/TO/BRICK", "stat /MY/FUSE/MOUNTPOINT", $0); print $0}' ; done
>      >
>      > Then just copy-paste the output and you will trigger the client-side
>      > heal only on the affected gfids.
>      >
>      > Best Regards,
>      > Strahil Nikolov
>      > On Monday, 6 February 2023 at 10:19:02 GMT+2, Diego Zuccato
>      > <diego.zuccato at unibo.it> wrote:
>      >
>      >
>      > Oops... Re-including the list that got excluded in my previous
>      > answer :(
>      >
>      > I generated md5sums of all files in vols/ on clustor02 and compared
>      > them to the other nodes (clustor00 and clustor01).
>      > There are differences in the volfiles' shared-brick-count (shouldn't
>      > it always be 1, since every data brick is on its own fs? Quorum
>      > bricks, OTOH, share a single partition on SSD, so theirs should
>      > always be 15, but in both cases it's sometimes 0).
>      >
>      > I nearly got a stroke when I saw diff output for the 'info' files,
>      > but once I sorted 'em their contents matched. Phew!
>      >
>      > Diego
>      >
>      > On 03/02/2023 19:01, Strahil Nikolov wrote:
>      > > This one doesn't look good:
>      > >
>      > > [2023-02-03 07:45:46.896924 +0000] E [MSGID: 114079]
>      > > [client-handshake.c:1253:client_query_portmap] 0-cluster_data-client-48:
>      > > remote-subvolume not set in volfile []
>      > >
>      > > Can you compare all vol files in /var/lib/glusterd/vols/ between the
>      > > nodes?
>      > > I have the suspicion that there is a vol file mismatch (maybe
>      > > /var/lib/glusterd/vols/<VOLUME_NAME>/*-shd.vol).
>      > >
>      > > Best Regards,
>      > > Strahil Nikolov
>      > >
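A rough sketch of one way to run that comparison, assuming ssh access and
the clustor0* hostnames used in this thread:

   # collect checksums of everything under vols/ on each node, then diff
   for H in clustor00 clustor01 clustor02; do
       ssh "$H" 'cd /var/lib/glusterd/vols && find . -type f -exec md5sum {} +' \
           | sort -k2 > "/tmp/vols.$H"
   done
   diff /tmp/vols.clustor00 /tmp/vols.clustor02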
>      > >   On Fri, Feb 3, 2023 at 12:20, Diego Zuccato
>      > >   <diego.zuccato at unibo.it> wrote:
>      > > Can't see anything relevant in glfsheal log, just messages related
>      > > to the crash of one of the nodes (the one that had the mobo
>      > > replaced... I fear some on-disk structures could have been silently
>      > > damaged by RAM errors and that makes gluster processes crash, or
>      > > it's just an issue with enabling brick-multiplex).
>      > > -8<--
>      > > [2023-02-03 07:45:46.896924 +0000] E [MSGID: 114079]
>      > > [client-handshake.c:1253:client_query_portmap] 0-cluster_data-client-48:
>      > > remote-subvolume not set in volfile []
>      > > [2023-02-03 07:45:46.897282 +0000] E [rpc-clnt.c:331:saved_frames_unwind]
>      > > (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7fce0c867b95]
>      > > (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7fce0c0ca2fc]
>      > > (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7fce0c0d2419]
>      > > (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x10308)[0x7fce0c0d3308]
>      > > (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7fce0c0ce7e6]
>      > > ))))) 0-cluster_data-client-48: forced unwinding frame type(GF-DUMP)
>      > > op(NULL(2)) called at 2023-02-03 07:45:46.891054 +0000 (xid=0x13)
>      > > -8<--
>      > >
>      > > Well, actually I *KNOW* the files outside .glusterfs have been
>      > > deleted (by me :) ). That's why I call those 'stale' gfids.
>      > > Affected entries under .glusterfs usually have link count 1 =>
>      > > nothing 'find' can find.
>      > > Since I already recovered those files (before deleting from the
>      > > bricks), can .glusterfs entries be deleted too, or should I check
>      > > something else?
>      > > Maybe I should create a script that finds all files/dirs (not
>      > > symlinks, IIUC) in .glusterfs on all bricks/arbiters and moves 'em
>      > > to a temp dir?
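A rough sketch of what such a script could look like, assuming the
/srv/bricks/*/d brick layout used in this thread; it only lists
candidates instead of moving them, and the exclusion list for gluster's
internal files is probably incomplete, so the output needs manual review:

   for B in /srv/bricks/*/d; do
       # regular files whose only hard link is the .glusterfs entry itself,
       # skipping gluster's internal bookkeeping under indices/ and changelogs/
       find "$B/.glusterfs" -type f -links 1 \
            ! -path '*/indices/*' ! -path '*/changelogs/*'
   done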
>      > >
>      > > Diego
>      > >
>      > > On 02/02/2023 23:35, Strahil Nikolov wrote:
>      > > > Any issues reported in /var/log/glusterfs/glfsheal-*.log ?
>      > > >
>      > > > The easiest way to identify the affected entries is to run:
>      > > > find /FULL/PATH/TO/BRICK/ -samefile /FULL/PATH/TO/BRICK/.glusterfs/57/e4/57e428c7-6bed-4eb3-b9bd-02ca4c46657a
>      > > >
>      > > > Best Regards,
>      > > > Strahil Nikolov
>      > > >
>      > > > On Tuesday, 31 January 2023 at 11:58:24 GMT+2, Diego Zuccato
>      > > > <diego.zuccato at unibo.it> wrote:
>      > > >
>      > > > Hello all.
>      > > >
>      > > > I've had one of the 3 nodes serving a "replica 3 arbiter 1" volume
>      > > > down for some days (apparently RAM issues, but actually a failing
>      > > > mobo).
>      > > > The other nodes have had some issues (RAM exhaustion, an old
>      > > > problem already ticketed but still without a solution) and some
>      > > > brick processes coredumped. Restarting the processes allowed the
>      > > > cluster to continue working. Mostly.
>      > > >
>      > > > After the third server got fixed I started a heal, but files
>      > > > didn't get healed and the count (by "ls -l
>      > > > /srv/bricks/*/d/.glusterfs/indices/xattrop/|grep ^-|wc -l") did
>      > > > not decrease over 2 days. So, to recover, I copied files from the
>      > > > bricks to temp storage (keeping both copies of conflicting files
>      > > > with different contents), removed the files on bricks and
>      > > > arbiters, and finally copied back from temp storage to the volume.
>      > > >
>      > > > Now the files are accessible, but I still see lots of entries like
>      > > > <gfid:57e428c7-6bed-4eb3-b9bd-02ca4c46657a>
>      > > >
>      > > > IIUC that's due to a mismatch between .glusterfs/ contents and the
>      > > > normal hierarchy. Is there some tool to speed up the cleanup?
>      > > >
>      > > > Tks.
> 
-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786