Strahil Nikolov
2024-Jan-18 21:58 UTC
[Gluster-users] Upgrade 10.4 -> 11.1 making problems
Are you able to set the logs to debug level? It might provide a clue what is going on.

Best Regards,
Strahil Nikolov

On Thu, Jan 18, 2024 at 13:08, Diego Zuccato <diego.zuccato at unibo.it> wrote:

That's the same kind of errors I keep seeing on my 2 clusters,
regenerated some months ago. Seems a pseudo-split-brain that should be
impossible on a replica 3 cluster but keeps happening.
Sadly going to ditch Gluster ASAP.

Diego

On 18/01/2024 07:11, Hu Bert wrote:
> Good morning,
> heal still not running. Pending heals now sum up to 60K per brick.
> Heal was starting instantly e.g. after server reboot with version
> 10.4, but doesn't with version 11. What could be wrong?
>
> I only see these errors on one of the "good" servers in glustershd.log:
>
> [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> remote operation failed.
> [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>},
> {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b},
> {errno=2}, {error=No such file or directory}]
> [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> remote operation failed.
> [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>},
> {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11},
> {errno=2}, {error=No such file or directory}]
>
> About 7K today. Any ideas? Someone?
>
>
> Best regards,
> Hubert
>
> On Wed, 17 Jan 2024 at 11:24, Hu Bert <revirii at googlemail.com> wrote:
>>
>> ok, finally managed to get all servers, volumes etc running, but took
>> a couple of restarts, cksum checks etc.
>>
>> One problem: a volume doesn't heal automatically or doesn't heal at all.
>>
>> gluster volume status
>> Status of volume: workdata
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick glusterpub1:/gluster/md3/workdata    58832     0          Y       3436
>> Brick glusterpub2:/gluster/md3/workdata    59315     0          Y       1526
>> Brick glusterpub3:/gluster/md3/workdata    56917     0          Y       1952
>> Brick glusterpub1:/gluster/md4/workdata    59688     0          Y       3755
>> Brick glusterpub2:/gluster/md4/workdata    60271     0          Y       2271
>> Brick glusterpub3:/gluster/md4/workdata    49461     0          Y       2399
>> Brick glusterpub1:/gluster/md5/workdata    54651     0          Y       4208
>> Brick glusterpub2:/gluster/md5/workdata    49685     0          Y       2751
>> Brick glusterpub3:/gluster/md5/workdata    59202     0          Y       2803
>> Brick glusterpub1:/gluster/md6/workdata    55829     0          Y       4583
>> Brick glusterpub2:/gluster/md6/workdata    50455     0          Y       3296
>> Brick glusterpub3:/gluster/md6/workdata    50262     0          Y       3237
>> Brick glusterpub1:/gluster/md7/workdata    52238     0          Y       5014
>> Brick glusterpub2:/gluster/md7/workdata    52474     0          Y       3673
>> Brick glusterpub3:/gluster/md7/workdata    57966     0          Y       3653
>> Self-heal Daemon on localhost              N/A       N/A        Y       4141
>> Self-heal Daemon on glusterpub1            N/A       N/A        Y       5570
>> Self-heal Daemon on glusterpub2            N/A       N/A        Y       4139
>>
>> "gluster volume heal workdata info" lists a lot of files per brick.
>> "gluster volume heal workdata statistics heal-count" shows thousands
>> of files per brick.
>> "gluster volume heal workdata enable" has no effect.
>>
>> gluster volume heal workdata full
>> Launching heal operation to perform full self heal on volume workdata
>> has been successful
>> Use heal info commands to check status.
>>
>> -> not doing anything at all. And nothing happening on the 2 "good"
>> servers in e.g. glustershd.log. Heal was working as expected on
>> version 10.4, but here... silence. Someone has an idea?
>>
>>
>> Best regards,
>> Hubert
>>
>> On Tue, 16 Jan 2024 at 13:44, Gilberto Ferreira
>> <gilberto.nunes32 at gmail.com> wrote:
>>>
>>> Ah! Indeed! You need to perform an upgrade on the clients as well.
>>>
>>> On Tue, 16 Jan 2024 at 03:12, Hu Bert <revirii at googlemail.com> wrote:
>>>>
>>>> morning to those still reading :-)
>>>>
>>>> i found this: https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>>>>
>>>> there's a paragraph about "peer rejected" with the same error message,
>>>> telling me: "Update the cluster.op-version" - i had only updated the
>>>> server nodes, but not the clients. So upgrading the cluster.op-version
>>>> wasn't possible at this time. So... upgrading the clients to version
>>>> 11.1 and then the op-version should solve the problem?
>>>>
>>>>
>>>> Thx,
>>>> Hubert
>>>>
>>>> On Mon, 15 Jan 2024 at 09:16, Hu Bert <revirii at googlemail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> just upgraded some gluster servers from version 10.4 to version 11.1.
>>>>> Debian bullseye & bookworm. When only installing the packages: good,
>>>>> servers, volumes etc. work as expected.
>>>>>
>>>>> But one needs to test if the systems work after a daemon and/or server
>>>>> restart. Well, did a reboot, and after that the rebooted/restarted
>>>>> system is "out". Log message from a working node:
>>>>>
>>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
>>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
>>>>> 0-management: using the op-version 100000
>>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
>>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
>>>>> 0-glusterd: Received probe from uuid:
>>>>> b71401c3-512a-47cb-ac18-473c4ba7776e
>>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
>>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
>>>>> Version of Cksums sourceimages differ. local cksum = 2204642525,
>>>>> remote cksum = 1931483801 on peer gluster190
>>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
>>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
>>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
>>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
>>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
>>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
>>>>> gluster190, port: 0
>>>>>
>>>>> peer status from the rebooted node:
>>>>>
>>>>> root at gluster190 ~ # gluster peer status
>>>>> Number of Peers: 2
>>>>>
>>>>> Hostname: gluster189
>>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> Hostname: gluster188
>>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> So the rebooted gluster190 is not accepted anymore, and thus does not
>>>>> appear in "gluster volume status". I then followed this guide:
>>>>>
>>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>>>>>
>>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
>>>>> restart the glusterd service etc. Data gets copied from the other nodes,
>>>>> 'gluster peer status' is ok again - but the volume info is missing,
>>>>> /var/lib/glusterd/vols is empty. When syncing this dir from another
>>>>> node, the volume is available again, heals start etc.
>>>>>
>>>>> Well, and just to be sure that everything's working as it should,
>>>>> rebooted that node again - the rebooted node is kicked out again, and
>>>>> you have to go through bringing it back again.
>>>>>
>>>>> Sorry, but did I miss anything? Has someone experienced similar
>>>>> problems? I'll probably downgrade to 10.4 again, that version was
>>>>> working...
>>>>>
>>>>>
>>>>> Thx,
>>>>> Hubert

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
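For reference, when "heal ... full" stays silent like this, the usual first checks are whether the self-heal daemon is enabled and actually running for the volume. A minimal sketch using the volume name from this thread; these are standard gluster CLI commands, nothing specific to 11.1:

# is the self-heal daemon enabled for the volume?
gluster volume get workdata cluster.self-heal-daemon

# compact per-brick view of pending and split-brain entries
gluster volume heal workdata info summary

# "start ... force" respawns brick and self-heal daemon processes that are not running
gluster volume start workdata force
gluster volume heal workdata full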
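The "Update the cluster.op-version" step referenced above is normally a short affair once every server and client runs 11.1. A sketch; the target value should be read from cluster.max-op-version rather than guessed:

# what the cluster currently runs at, and the maximum the installed binaries support
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

# only after ALL servers and clients have been upgraded:
gluster volume set all cluster.op-version <value reported by cluster.max-op-version>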
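And the "Resolving Peer Rejected" procedure Hubert describes boils down to roughly the following. A sketch assuming systemd and gluster190 as the rejected node; the final rsync reflects the manual vols sync he mentions and is not part of the documented procedure:

# on the rejected node
systemctl stop glusterd
cd /var/lib/glusterd
# keep glusterd.info, wipe the rest of the local glusterd state
find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
systemctl start glusterd

# re-probe a healthy peer so the configuration is fetched again, then restart once more
gluster peer probe gluster189
systemctl restart glusterd

# if /var/lib/glusterd/vols stays empty, sync it from a good node and restart again
rsync -a gluster189:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/
systemctl restart glusterd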
I don't want to hijack the thread. And in my case setting logs to debug
would fill my /var partitions in no time. Maybe the OP can.

Diego

On 18/01/2024 22:58, Strahil Nikolov wrote:
> Are you able to set the logs to debug level?
> It might provide a clue what is going on.
>
> Best Regards,
> Strahil Nikolov
>
> On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
> <diego.zuccato at unibo.it> wrote:
> That's the same kind of errors I keep seeing on my 2 clusters,
> regenerated some months ago. Seems a pseudo-split-brain that should be
> impossible on a replica 3 cluster but keeps happening.
> Sadly going to ditch Gluster ASAP.
>
> Diego
> [...]

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
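If the concern is debug logs filling /var, the growth can at least be capped with an ordinary logrotate policy on top of what Gluster already ships (a glusterfs-logrotate file is visible in Hubert's /etc/glusterfs listing below). A sketch; the drop-in file name and the size/rotation values are illustrative, not a shipped config:

# /etc/logrotate.d/glusterfs-debug (hypothetical drop-in)
/var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log {
    # rotate as soon as a log exceeds 100 MB, keep four old copies
    size 100M
    rotate 4
    compress
    missingok
    notifempty
    # truncate in place so the daemons keep writing to the same file
    copytruncate
    sharedscripts
}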
Hi Strahil,
hm, don't get me wrong, it may sound a bit stupid, but... where do I set
the log level? Using Debian...

https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level

ls /etc/glusterfs/
eventsconfig.json         glusterfs-georep-logrotate  gluster-rsyslog-5.8.conf  group-db-workload
group-gluster-block       group-nl-cache              group-virt.example        logger.conf.example
glusterd.vol              glusterfs-logrotate         gluster-rsyslog-7.2.conf  group-distributed-virt
group-metadata-cache      group-samba                 gsyncd.conf               thin-arbiter.vol

checked /etc/glusterfs/logger.conf.example:

# To enable enhanced logging capabilities,
#
# 1. rename this file to /etc/glusterfs/logger.conf
#
# 2. rename /etc/rsyslog.d/gluster.conf.example to
#    /etc/rsyslog.d/gluster.conf
#
# This change requires restart of all gluster services/volumes and
# rsyslog.

Tried (to test) /etc/glusterfs/logger.conf with "LOG_LEVEL='WARNING'" and
restarted glusterd on that node, but this doesn't work, the log level
stays on INFO.

/etc/rsyslog.d/gluster.conf.example does not exist. Probably
/etc/rsyslog.conf on Debian. But first it would be better to know where
to set the log level for glusterd.

Depending on how much the DEBUG log level talks ;-) I could assign up to
100G to /var.

Thx & best regards,
Hubert

On Thu, 18 Jan 2024 at 22:58, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
> Are you able to set the logs to debug level?
> It might provide a clue what is going on.
>
> Best Regards,
> Strahil Nikolov
> [...]
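To Hubert's actual question: the per-volume log levels are plain volume options, and glusterd itself takes a log level on its command line. A sketch; the systemd drop-in path and ExecStart line are illustrative for Debian and should be checked against the shipped unit, and the client option is usually the one that matters for glustershd.log since the self-heal daemon is a client-side process:

# brick-side and client-side (incl. self-heal daemon) log level for the volume,
# set back to INFO once enough has been captured
gluster volume set workdata diagnostics.brick-log-level DEBUG
gluster volume set workdata diagnostics.client-log-level DEBUG

# glusterd itself accepts -L/--log-level; on a systemd system a drop-in such as
# /etc/systemd/system/glusterd.service.d/loglevel.conf could pass it:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG
systemctl daemon-reload
systemctl restart glusterd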