Hi,
not sure what you mean by "clients" - do you mean the clients that
mount the volume?

gluster volume status workdata clients
----------------------------------------------
Brick : glusterpub2:/gluster/md3/workdata
Clients connected : 20
Hostname                 BytesRead    BytesWritten    OpVersion
--------                 ---------    ------------    ---------
192.168.0.222:49140       43698212        41152108       110000
[...shortened...]
192.168.0.126:49123     8362352021     16445401205       110000
----------------------------------------------
Brick : glusterpub3:/gluster/md3/workdata
Clients connected : 3
Hostname                 BytesRead    BytesWritten    OpVersion
--------                 ---------    ------------    ---------
192.168.0.44:49150      5855740279     63649538575       110000
192.168.0.44:49137       308958200       319216608       110000
192.168.0.126:49120     7524915770     15489813449       110000

192.168.0.44 (glusterpub3) is the "bad" server.

Not sure what you mean by "old" - probably not the age of the server,
but rather the gluster version. op-version is 110000 on all
servers+clients, upgraded from 10.4 -> 11.1.

"Have you checked if a client is not allowed to update all 3 copies?"
-> are there special log messages for that?

"If it's only 1 system, you can remove the brick, reinitialize it and
then bring it back for a full sync."
-> https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
-> "Replacing bricks in Replicate/Distributed Replicate volumes"

This part, right? Well, I can't do this right now, as there are ~33TB
of data (many small files) to copy; that would slow down the servers /
the volume. But once the replacement is running I could do it
afterwards, just to see what happens.


Hubert

On Mon, 29 Jan 2024 at 08:21, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
> 2800 is too much. Most probably you are affected by a bug. How old are the clients? Is only 1 server affected?
> Have you checked if a client is not allowed to update all 3 copies?
>
> If it's only 1 system, you can remove the brick, reinitialize it and then bring it back for a full sync.
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 8:44, Hu Bert
> <revirii at googlemail.com> wrote:
> Morning,
> a few bad apples - but which ones? Checked glustershd.log on the "bad"
> server and counted today's "gfid mismatch" entries (2800 in total):
>
>    44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/212>,
>    44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/174>,
>    44 <gfid:d5c6d7b9-f217-4cc9-a664-448d034e74c2>/94037803>,
>    44 <gfid:d263ecc2-9c21-455c-9ba9-5a999c03adce>/94066216>,
>    44 <gfid:cbfd5d46-d580-4845-a544-e46fd82c1758>/249771609>,
>    44 <gfid:aecf217a-0797-43d1-9481-422a8ac8a5d0>/64235523>,
>    44 <gfid:a701d47b-b3fb-4e7e-bbfb-bc3e19632867>/185>,
>
> etc. But as I said, these are pretty new and didn't appear when the
> volume/servers started misbehaving. Are there scripts/snippets
> available for handling this?
>
> Healing would be very painful for the running system (still connected,
> but not for much longer), as there surely are 4-5 million entries to
> be healed. I can't do this now - maybe, when the replacement is in
> productive state, one could give it a try.
>
> Thx,
> Hubert
>
> On Sun, 28 Jan 2024 at 23:12, Strahil Nikolov
> <hunter86_bg at yahoo.com> wrote:
> >
> > For a gfid mismatch a manual effort is needed, but you can script it.
> > I think that a few bad "apples" can break the healing, and if you fix them the healing might recover.
> >
> > Also, check why the client is not updating all copies. Most probably you have a client that is not able to connect to a brick.
> > > > gluster volume status VOLUME_NAME clients > > > > Best Regards, > > Strahil Nikolov > > > > On Sun, Jan 28, 2024 at 20:55, Hu Bert > > <revirii at googlemail.com> wrote: > > Hi Strahil, > > there's no arbiter: 3 servers with 5 bricks each. > > > > Volume Name: workdata > > Type: Distributed-Replicate > > Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959 > > Status: Started > > Snapshot Count: 0 > > Number of Bricks: 5 x 3 = 15 > > > > The "problem" is: the number of files/entries to-be-healed has > > continuously grown since the beginning, and now we're talking about > > way too many files to do this manually. Last time i checked: 700K per > > brick, should be >900K at the moment. The command 'gluster volume heal > > workdata statistics heal-count' is unable to finish. Doesn't look that > > good :D > > > > Interesting, the glustershd.log on the "bad" server now shows errors like these: > > > > [2024-01-28 18:48:33.734053 +0000] E [MSGID: 108008] > > [afr-self-heal-common.c:399:afr_gfid_split_brain_source] > > 0-workdata-replicate-3: Gfid mismatch detected for > > <gfid:70ab3d57-bd82-4932-86bf-d613db32c1ab>/803620716>, > > 82d7939a-8919-40ea- > > 9459-7b8af23d3b72 on workdata-client-11 and > > bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9 > > > > Shouldn't the heals happen on the 2 "good" servers? > > > > Anyway... we're currently preparing a different solution for our data > > and we'll throw away this gluster volume - no critical data will be > > lost, as these are derived from source data (on a different volume on > > different servers). Will be a hard time (calculating tons of data), > > but the chosen solution should have a way better performance. > > > > Well... thx to all for your efforts, really appreciate that :-) > > > > > > Hubert > > > > Am So., 28. Jan. 2024 um 08:35 Uhr schrieb Strahil Nikolov > > <hunter86_bg at yahoo.com>: > > > > > > What about the arbiter node ? > > > Actually, check on all nodes and script it - you might need it in the future. > > > > > > Simplest way to resolve is to make the file didappear (rename to something else and then rename it back). Another easy trick is to read thr whole file: dd if=file of=/dev/null status=progress > > > > > > Best Regards, > > > Strahil Nikolov > > > > > > On Sat, Jan 27, 2024 at 8:24, Hu Bert > > > <revirii at googlemail.com> wrote: > > > Morning, > > > > > > gfid1: > > > getfattr -d -e hex -m. > > > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb > > > > > > glusterpub1 (good one): > > > getfattr: Removing leading '/' from absolute path names > > > # file: gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb > > > trusted.afr.dirty=0x000000000000000000000000 > > > trusted.afr.workdata-client-11=0x000000020000000100000000 > > > trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb > > > trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067 > > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecff000000002695ebb70000000065aaecff000000002695ebb70000000065aaecff000000002533f110 > > > > > > glusterpub3 (bad one): > > > getfattr: /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb: > > > No such file or directory > > > > > > gfid 2: > > > getfattr -d -e hex -m. 
> > > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642 > > > > > > glusterpub1 (good one): > > > getfattr: Removing leading '/' from absolute path names > > > # file: gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642 > > > trusted.afr.dirty=0x000000000000000000000000 > > > trusted.afr.workdata-client-8=0x000000020000000100000000 > > > trusted.gfid=0x604657235dc04ebeaced9f2c12e52642 > > > trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067 > > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecfe000000000c5403bd0000000065aaecfe000000000c5403bd0000000065aaecfe000000000ad61ee4 > > > > > > glusterpub3 (bad one): > > > getfattr: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642: > > > No such file or directory > > > > > > thx, > > > Hubert > > > > > > Am Sa., 27. Jan. 2024 um 06:13 Uhr schrieb Strahil Nikolov > > > <hunter86_bg at yahoo.com>: > > > > > > > > You don't need to mount it. > > > > Like this : > > > > # getfattr -d -e hex -m. /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e > > > > # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e > > > > trusted.gfid=0x00462be83e6149318bdadae1645c639e > > > > trusted.gfid2path.05fcbdafdeea18ab=0x30326333373930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079 > > > > trusted.glusterfs.mdata=0x010000000000000000000000006170340c0000000025b6a745000000006170340c0000000020efb577000000006170340c0000000020d42b07 > > > > trusted.glusterfs.shard.block-size=0x0000000004000000 > > > > trusted.glusterfs.shard.file-size=0x00000000000000cd000000000000000000000000000000010000000000000000 > > > > > > > > > > > > Best Regards, > > > > Strahil Nikolov > > > > > > > > > > > > > > > > ? ?????????, 25 ?????? 2024 ?. ? 09:42:46 ?. ???????+2, Hu Bert <revirii at googlemail.com> ??????: > > > > > > > > > > > > > > > > > > > > > > > > Good morning, > > > > > > > > hope i got it right... 
using: > > > > https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02 > > > > > > > > mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata > > > > > > > > gfid 1: > > > > getfattr -n trusted.glusterfs.pathinfo -e text > > > > /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb > > > > getfattr: Removing leading '/' from absolute path names > > > > # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb > > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht> > > > > (<REPLICATE:workdata-replicate-3> > > > > <POSIX(/gluster/md6/workdata):glusterpub1:/gluster/md6/workdata/images/133/283/13328349/128x128s.jpg> > > > > <POSIX(/gluster/md6/workdata):glusterpub2:/gl > > > > uster/md6/workdata/images/133/283/13328349/128x128s.jpg>))" > > > > > > > > gfid 2: > > > > getfattr -n trusted.glusterfs.pathinfo -e text > > > > /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642 > > > > getfattr: Removing leading '/' from absolute path names > > > > # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642 > > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht> > > > > (<REPLICATE:workdata-replicate-2> > > > > <POSIX(/gluster/md5/workdata):glusterpub2:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642> > > > > <POSIX(/gluster/md5/workdata > > > > ):glusterpub1:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>))" > > > > > > > > glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the > > > > misbehaving (not healing) one. > > > > > > > > The file with gfid 1 is available under > > > > /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2 > > > > bricks, but missing on glusterpub3 brick. > > > > > > > > gfid 2: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642 > > > > is present on glusterpub1+2, but not on glusterpub3. > > > > > > > > > > > > Thx, > > > > Hubert > > > > > > > > Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov > > > > <hunter86_bg at yahoo.com>: > > > > > > > > > > > > > > Hi, > > > > > > > > > > Can you find and check the files with gfids: > > > > > 60465723-5dc0-4ebe-aced-9f2c12e52642 > > > > > faf59566-10f5-4ddd-8b0c-a87bc6a334fb > > > > > > > > > > Use 'getfattr -d -e hex -m. ' command from https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output . > > > > > > > > > > Best Regards, > > > > > Strahil Nikolov > > > > > > > > > > On Sat, Jan 20, 2024 at 9:44, Hu Bert > > > > > <revirii at googlemail.com> wrote: > > > > > Good morning, > > > > > > > > > > thx Gilberto, did the first three (set to WARNING), but the last one > > > > > doesn't work. Anyway, with setting these three some new messages > > > > > appear: > > > > > > > > > > [2024-01-20 07:23:58.561106 +0000] W [MSGID: 114061] > > > > > [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd > > > > > is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, > > > > > {errno=77}, {error=File descriptor in bad state}] > > > > > [2024-01-20 07:23:58.561177 +0000] E [MSGID: 108028] > > > > > [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3: > > > > > Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor > > > > > in bad state] > > > > > [2024-01-20 07:23:58.562151 +0000] W [MSGID: 114031] > > > > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11: > > > > > remote operation failed. 
> > > > > [{path=<gfid:faf59566-10f5-4ddd-8b0c-a87bc6a334fb>}, > > > > > {gfid=faf59566-10f5-4ddd-8b0c-a87b > > > > > c6a334fb}, {errno=2}, {error=No such file or directory}] > > > > > [2024-01-20 07:23:58.562296 +0000] W [MSGID: 114061] > > > > > [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11: > > > > > remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, > > > > > {errno=77}, {error=File descriptor in bad state}] > > > > > [2024-01-20 07:23:58.860552 +0000] W [MSGID: 114061] > > > > > [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd > > > > > is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, > > > > > {errno=77}, {error=File descriptor in bad state}] > > > > > [2024-01-20 07:23:58.860608 +0000] E [MSGID: 108028] > > > > > [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2: > > > > > Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor > > > > > in bad state] > > > > > [2024-01-20 07:23:58.861520 +0000] W [MSGID: 114031] > > > > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8: > > > > > remote operation failed. > > > > > [{path=<gfid:60465723-5dc0-4ebe-aced-9f2c12e52642>}, > > > > > {gfid=60465723-5dc0-4ebe-aced-9f2c1 > > > > > 2e52642}, {errno=2}, {error=No such file or directory}] > > > > > [2024-01-20 07:23:58.861640 +0000] W [MSGID: 114061] > > > > > [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8: > > > > > remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, > > > > > {errno=77}, {error=File descriptor in bad state}] > > > > > > > > > > Not many log entries appear, only a few. Has someone seen error > > > > > messages like these? Setting diagnostics.brick-sys-log-level to DEBUG > > > > > shows way more log entries, uploaded it to: > > > > > https://file.io/spLhlcbMCzr8 - not sure if that helps. > > > > > > > > > > > > > > > Thx, > > > > > Hubert > > > > > > > > > > Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira > > > > > <gilberto.nunes32 at gmail.com>: > > > > > > > > > > > > > > > > > gluster volume set testvol diagnostics.brick-log-level WARNING > > > > > > gluster volume set testvol diagnostics.brick-sys-log-level WARNING > > > > > > gluster volume set testvol diagnostics.client-log-level ERROR > > > > > > gluster --log-level=ERROR volume status > > > > > > > > > > > > --- > > > > > > Gilberto Nunes Ferreira > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Em sex., 19 de jan. de 2024 ?s 05:49, Hu Bert <revirii at googlemail.com> escreveu: > > > > > >> > > > > > >> Hi Strahil, > > > > > >> hm, don't get me wrong, it may sound a bit stupid, but... where do i > > > > > >> set the log level? Using debian... > > > > > >> > > > > > >> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level > > > > > >> > > > > > >> ls /etc/glusterfs/ > > > > > >> eventsconfig.json glusterfs-georep-logrotate > > > > > >> gluster-rsyslog-5.8.conf group-db-workload group-gluster-block > > > > > >> group-nl-cache group-virt.example logger.conf.example > > > > > >> glusterd.vol glusterfs-logrotate > > > > > >> gluster-rsyslog-7.2.conf group-distributed-virt group-metadata-cache > > > > > >> group-samba gsyncd.conf thin-arbiter.vol > > > > > >> > > > > > >> checked: /etc/glusterfs/logger.conf.example > > > > > >> > > > > > >> # To enable enhanced logging capabilities, > > > > > >> # > > > > > >> # 1. 
rename this file to /etc/glusterfs/logger.conf > > > > > >> # > > > > > >> # 2. rename /etc/rsyslog.d/gluster.conf.example to > > > > > >> # /etc/rsyslog.d/gluster.conf > > > > > >> # > > > > > >> # This change requires restart of all gluster services/volumes and > > > > > >> # rsyslog. > > > > > >> > > > > > >> tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' " > > > > > >> > > > > > >> restart glusterd on that node, but this doesn't work, log-level stays > > > > > >> on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably > > > > > >> /etc/rsyslog.conf on debian. But first it would be better to know > > > > > >> where to set the log-level for glusterd. > > > > > >> > > > > > >> Depending on how much the DEBUG log-level talks ;-) i could assign up > > > > > >> to 100G to /var > > > > > >> > > > > > >> > > > > > >> Thx & best regards, > > > > > >> Hubert > > > > > >> > > > > > >> > > > > > >> Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov > > > > > >> <hunter86_bg at yahoo.com>: > > > > > >> > > > > > > >> > Are you able to set the logs to debug level ? > > > > > >> > It might provide a clue what it is going on. > > > > > >> > > > > > > >> > Best Regards, > > > > > >> > Strahil Nikolov > > > > > >> > > > > > > >> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato > > > > > >> > <diego.zuccato at unibo.it> wrote: > > > > > >> > That's the same kind of errors I keep seeing on my 2 clusters, > > > > > >> > regenerated some months ago. Seems a pseudo-split-brain that should be > > > > > >> > impossible on a replica 3 cluster but keeps happening. > > > > > >> > Sadly going to ditch Gluster ASAP. > > > > > >> > > > > > > >> > Diego > > > > > >> > > > > > > >> > Il 18/01/2024 07:11, Hu Bert ha scritto: > > > > > >> > > Good morning, > > > > > >> > > heal still not running. Pending heals now sum up to 60K per brick. > > > > > >> > > Heal was starting instantly e.g. after server reboot with version > > > > > >> > > 10.4, but doesn't with version 11. What could be wrong? > > > > > >> > > > > > > > >> > > I only see these errors on one of the "good" servers in glustershd.log: > > > > > >> > > > > > > > >> > > [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031] > > > > > >> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0: > > > > > >> > > remote operation failed. > > > > > >> > > [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>}, > > > > > >> > > {gfid=cb39a1e4-2a4c-4727-861d-3ed9e > > > > > >> > > f00681b}, {errno=2}, {error=No such file or directory}] > > > > > >> > > [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031] > > > > > >> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1: > > > > > >> > > remote operation failed. > > > > > >> > > [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>}, > > > > > >> > > {gfid=3e9b178c-ae1f-4d85-ae47-fc539 > > > > > >> > > d94dd11}, {errno=2}, {error=No such file or directory}] > > > > > >> > > > > > > > >> > > About 7K today. Any ideas? Someone? > > > > > >> > > > > > > > >> > > > > > > > >> > > Best regards, > > > > > >> > > Hubert > > > > > >> > > > > > > > >> > > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert <revirii at googlemail.com>: > > > > > >> > >> > > > > > >> > >> ok, finally managed to get all servers, volumes etc runnung, but took > > > > > >> > >> a couple of restarts, cksum checks etc. > > > > > >> > >> > > > > > >> > >> One problem: a volume doesn't heal automatically or doesn't heal at all. 
> > > > > >> > >> > > > > > >> > >> gluster volume status > > > > > >> > >> Status of volume: workdata > > > > > >> > >> Gluster process TCP Port RDMA Port Online Pid > > > > > >> > >> ------------------------------------------------------------------------------ > > > > > >> > >> Brick glusterpub1:/gluster/md3/workdata 58832 0 Y 3436 > > > > > >> > >> Brick glusterpub2:/gluster/md3/workdata 59315 0 Y 1526 > > > > > >> > >> Brick glusterpub3:/gluster/md3/workdata 56917 0 Y 1952 > > > > > >> > >> Brick glusterpub1:/gluster/md4/workdata 59688 0 Y 3755 > > > > > >> > >> Brick glusterpub2:/gluster/md4/workdata 60271 0 Y 2271 > > > > > >> > >> Brick glusterpub3:/gluster/md4/workdata 49461 0 Y 2399 > > > > > >> > >> Brick glusterpub1:/gluster/md5/workdata 54651 0 Y 4208 > > > > > >> > >> Brick glusterpub2:/gluster/md5/workdata 49685 0 Y 2751 > > > > > >> > >> Brick glusterpub3:/gluster/md5/workdata 59202 0 Y 2803 > > > > > >> > >> Brick glusterpub1:/gluster/md6/workdata 55829 0 Y 4583 > > > > > >> > >> Brick glusterpub2:/gluster/md6/workdata 50455 0 Y 3296 > > > > > >> > >> Brick glusterpub3:/gluster/md6/workdata 50262 0 Y 3237 > > > > > >> > >> Brick glusterpub1:/gluster/md7/workdata 52238 0 Y 5014 > > > > > >> > >> Brick glusterpub2:/gluster/md7/workdata 52474 0 Y 3673 > > > > > >> > >> Brick glusterpub3:/gluster/md7/workdata 57966 0 Y 3653 > > > > > >> > >> Self-heal Daemon on localhost N/A N/A Y 4141 > > > > > >> > >> Self-heal Daemon on glusterpub1 N/A N/A Y 5570 > > > > > >> > >> Self-heal Daemon on glusterpub2 N/A N/A Y 4139 > > > > > >> > >> > > > > > >> > >> "gluster volume heal workdata info" lists a lot of files per brick. > > > > > >> > >> "gluster volume heal workdata statistics heal-count" shows thousands > > > > > >> > >> of files per brick. > > > > > >> > >> "gluster volume heal workdata enable" has no effect. > > > > > >> > >> > > > > > >> > >> gluster volume heal workdata full > > > > > >> > >> Launching heal operation to perform full self heal on volume workdata > > > > > >> > >> has been successful > > > > > >> > >> Use heal info commands to check status. > > > > > >> > >> > > > > > >> > >> -> not doing anything at all. And nothing happening on the 2 "good" > > > > > >> > >> servers in e.g. glustershd.log. Heal was working as expected on > > > > > >> > >> version 10.4, but here... silence. Someone has an idea? > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> Best regards, > > > > > >> > >> Hubert > > > > > >> > >> > > > > > >> > >> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira > > > > > >> > >> <gilberto.nunes32 at gmail.com>: > > > > > >> > >>> > > > > > >> > >>> Ah! Indeed! You need to perform an upgrade in the clients as well. > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> > > > > > >> > >>> Em ter., 16 de jan. de 2024 ?s 03:12, Hu Bert <revirii at googlemail.com> escreveu: > > > > > >> > >>>> > > > > > >> > >>>> morning to those still reading :-) > > > > > >> > >>>> > > > > > >> > >>>> i found this: https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them > > > > > >> > >>>> > > > > > >> > >>>> there's a paragraph about "peer rejected" with the same error message, > > > > > >> > >>>> telling me: "Update the cluster.op-version" - i had only updated the > > > > > >> > >>>> server nodes, but not the clients. So upgrading the cluster.op-version > > > > > >> > >>>> wasn't possible at this time. 
So... upgrading the clients to version > > > > > >> > >>>> 11.1 and then the op-version should solve the problem? > > > > > >> > >>>> > > > > > >> > >>>> > > > > > >> > >>>> Thx, > > > > > >> > >>>> Hubert > > > > > >> > >>>> > > > > > >> > >>>> Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert <revirii at googlemail.com>: > > > > > >> > >>>>> > > > > > >> > >>>>> Hi, > > > > > >> > >>>>> just upgraded some gluster servers from version 10.4 to version 11.1. > > > > > >> > >>>>> Debian bullseye & bookworm. When only installing the packages: good, > > > > > >> > >>>>> servers, volumes etc. work as expected. > > > > > >> > >>>>> > > > > > >> > >>>>> But one needs to test if the systems work after a daemon and/or server > > > > > >> > >>>>> restart. Well, did a reboot, and after that the rebooted/restarted > > > > > >> > >>>>> system is "out". Log message from working node: > > > > > >> > >>>>> > > > > > >> > >>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163] > > > > > >> > >>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack] > > > > > >> > >>>>> 0-management: using the op-version 100000 > > > > > >> > >>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490] > > > > > >> > >>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req] > > > > > >> > >>>>> 0-glusterd: Received probe from uuid: > > > > > >> > >>>>> b71401c3-512a-47cb-ac18-473c4ba7776e > > > > > >> > >>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010] > > > > > >> > >>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management: > > > > > >> > >>>>> Version of Cksums sourceimages differ. local cksum = 2204642525, > > > > > >> > >>>>> remote cksum = 1931483801 on peer gluster190 > > > > > >> > >>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493] > > > > > >> > >>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd: > > > > > >> > >>>>> Responded to gluster190 (0), ret: 0, op_ret: -1 > > > > > >> > >>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493] > > > > > >> > >>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd: > > > > > >> > >>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host: > > > > > >> > >>>>> gluster190, port: 0 > > > > > >> > >>>>> > > > > > >> > >>>>> peer status from rebooted node: > > > > > >> > >>>>> > > > > > >> > >>>>> root at gluster190 ~ # gluster peer status > > > > > >> > >>>>> Number of Peers: 2 > > > > > >> > >>>>> > > > > > >> > >>>>> Hostname: gluster189 > > > > > >> > >>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7 > > > > > >> > >>>>> State: Peer Rejected (Connected) > > > > > >> > >>>>> > > > > > >> > >>>>> Hostname: gluster188 > > > > > >> > >>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d > > > > > >> > >>>>> State: Peer Rejected (Connected) > > > > > >> > >>>>> > > > > > >> > >>>>> So the rebooted gluster190 is not accepted anymore. And thus does not > > > > > >> > >>>>> appear in "gluster volume status". I then followed this guide: > > > > > >> > >>>>> > > > > > >> > >>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/ > > > > > >> > >>>>> > > > > > >> > >>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and > > > > > >> > >>>>> restart glusterd service etc. Data get copied from other nodes, > > > > > >> > >>>>> 'gluster peer status' is ok again - but the volume info is missing, > > > > > >> > >>>>> /var/lib/glusterd/vols is empty. 
When syncing this dir from another > > > > > >> > >>>>> node, the volume then is available again, heals start etc. > > > > > >> > >>>>> > > > > > >> > >>>>> Well, and just to be sure that everything's working as it should, > > > > > >> > >>>>> rebooted that node again - the rebooted node is kicked out again, and > > > > > >> > >>>>> you have to restart bringing it back again. > > > > > >> > >>>>> > > > > > >> > >>>>> Sry, but did i miss anything? Has someone experienced similar > > > > > >> > >>>>> problems? I'll probably downgrade to 10.4 again, that version was > > > > > >> > >>>>> working... > > > > > >> > >>>>> > > > > > >> > >>>>> > > > > > >> > >>>>> Thx, > > > > > >> > >>>>> Hubert > > > > > >> > >>>> ________ > > > > > >> > >>>> > > > > > >> > >>>> > > > > > >> > >>>> > > > > > >> > >>>> Community Meeting Calendar: > > > > > >> > >>>> > > > > > >> > >>>> Schedule - > > > > > >> > >>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > >> > >>>> Bridge: https://meet.google.com/cpu-eiue-hvk > > > > > >> > >>>> Gluster-users mailing list > > > > > >> > >>>> Gluster-users at gluster.org > > > > > >> > >>>> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > >> > > ________ > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > Community Meeting Calendar: > > > > > >> > > > > > > > >> > > Schedule - > > > > > >> > > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > >> > > Bridge: https://meet.google.com/cpu-eiue-hvk > > > > > >> > > Gluster-users mailing list > > > > > >> > > Gluster-users at gluster.org > > > > > >> > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > >> > > > > > > >> > -- > > > > > >> > Diego Zuccato > > > > > >> > DIFA - Dip. di Fisica e Astronomia > > > > > >> > Servizi Informatici > > > > > >> > Alma Mater Studiorum - Universit? di Bologna > > > > > >> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > > > > > >> > tel.: +39 051 20 95786 > > > > > >> > > > > > > >> > ________ > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > Community Meeting Calendar: > > > > > >> > > > > > > >> > Schedule - > > > > > >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > >> > Bridge: https://meet.google.com/cpu-eiue-hvk > > > > > >> > Gluster-users mailing list > > > > > >> > Gluster-users at gluster.org > > > > > >> > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > >> > > > > > > >> > ________ > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > Community Meeting Calendar: > > > > > >> > > > > > > >> > Schedule - > > > > > >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > >> > Bridge: https://meet.google.com/cpu-eiue-hvk > > > > > >> > Gluster-users mailing list > > > > > >> > Gluster-users at gluster.org > > > > > >> > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > >> ________ > > > > > >> > > > > > >> > > > > > >> > > > > > >> Community Meeting Calendar: > > > > > >> > > > > > >> Schedule - > > > > > >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > >> Bridge: https://meet.google.com/cpu-eiue-hvk > > > > > >> Gluster-users mailing list > > > > > >> Gluster-users at gluster.org > > > > > >> https://lists.gluster.org/mailman/listinfo/gluster-users
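For reference, the "remove the brick, reinitialize it and then bring it back for a full sync" approach discussed in the thread above roughly maps onto Gluster's reset-brick procedure. This is only a sketch using the volume and brick names from this thread; availability and exact syntax of reset-brick should be verified against the installed version (check "gluster volume help") before running anything:

# Take the affected brick on the "bad" server out of service:
gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata start

# Re-create an empty brick directory/filesystem on glusterpub3, then bring
# the same brick path back; "commit force" triggers a full resync from the
# remaining replicas:
gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata \
    glusterpub3:/gluster/md3/workdata commit force

# Watch the resync progress:
gluster volume heal workdata statistics heal-count

As noted above, with ~33TB of small files per server this resync would put significant load on the volume, so it is something to schedule rather than run ad hoc.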
Strahil Nikolov
2024-Jan-30 06:14 UTC
[Gluster-users] Upgrade 10.4 -> 11.1 making problems
This is your problem: the "bad" server has only 3 clients connected. I remember there is another gluster volume command that lists the IPs of the clients. Find it and run it to see which clients are actually OK (those 3) and which of the remaining 17 are not. Then try to remount those 17 clients, and if the situation persists, work with your network team to identify why those 17 clients can't reach the brick.

Do you have self-heal enabled?
cluster.data-self-heal
cluster.entry-self-heal
cluster.metadata-self-heal

Best Regards,
Strahil Nikolov
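A quick way to check the points raised here, sketched against the volume name used in this thread (workdata); the per-brick client listing is already shown earlier in the thread, and "client-list" is assumed to be the other listing command Strahil refers to - confirm the subcommand names on the installed version (11.1) with "gluster volume help":

# Check whether the client-side self-heal options and the self-heal daemon are on:
gluster volume get workdata cluster.data-self-heal
gluster volume get workdata cluster.entry-self-heal
gluster volume get workdata cluster.metadata-self-heal
gluster volume get workdata cluster.self-heal-daemon

# Example: enable one of them if it turns out to be off:
gluster volume set workdata cluster.data-self-heal on

# Per-brick client connections (as posted above) and the client summary:
gluster volume status workdata clients
gluster volume status workdata client-list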
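As a starting point for scripting the gfid-mismatch inspection discussed earlier in the thread ("you can script it"), something along these lines summarizes which entries show up most often in the self-heal daemon log. The log path is the usual default and is an assumption here; adjust it for the local setup:

# Count "Gfid mismatch detected" entries per affected <gfid>/name in
# glustershd.log and show the most frequent ones:
grep 'Gfid mismatch detected' /var/log/glusterfs/glustershd.log \
  | grep -o '<gfid:[^>]*>[^,]*' \
  | sort | uniq -c | sort -rn | head -20

The resulting list can then be fed into whatever per-file resolution is chosen (e.g. the split-brain analysis steps from the resolving-splitbrain document linked earlier in the thread).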