That's the same kind of error I keep seeing on my two clusters,
regenerated some months ago. It seems like a pseudo-split-brain that
should be impossible on a replica 3 cluster, but it keeps happening.
Sadly I'm going to ditch Gluster ASAP.
Diego
On 18/01/2024 07:11, Hu Bert wrote:
> Good morning,
> the heal is still not running. Pending heals now sum up to 60K per brick.
> The heal started instantly (e.g. after a server reboot) with version
> 10.4, but doesn't with version 11. What could be wrong?
>
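> A first sanity check (just the standard CLI, as a sketch - "workdata"
> being the affected volume): whether glustershd is up on each server, and
> what the condensed per-brick pending counts look like:
>
> gluster volume heal workdata info summary
> ps aux | grep glustershd
>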
> I only see these errors on one of the "good" servers in glustershd.log:
>
> [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> remote operation failed.
> [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>},
> {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2},
> {error=No such file or directory}]
> [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> remote operation failed.
> [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>},
> {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2},
> {error=No such file or directory}]
>
> About 7K today. Any ideas? Someone?
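>
> To check whether those gfids still resolve to real files, one can use an
> aux-gfid mount as described in the Gluster docs (a sketch - server name,
> mountpoint and the gfid below are just taken from my setup and the log):
>
> mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/gfid
> getfattr -n trusted.glusterfs.pathinfo -e text \
>     /mnt/gfid/.gfid/cb39a1e4-2a4c-4727-861d-3ed9ef00681b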
>
>
> Best regards,
> Hubert
>
> On Wed, 17 Jan 2024 at 11:24, Hu Bert <revirii at googlemail.com> wrote:
>>
>> ok, finally managed to get all servers, volumes etc. running, but it took
>> a couple of restarts, cksum checks etc.
>>
>> One problem: a volume doesn't heal automatically, or doesn't heal at all.
>>
>> gluster volume status
>> Status of volume: workdata
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick glusterpub1:/gluster/md3/workdata    58832     0          Y       3436
>> Brick glusterpub2:/gluster/md3/workdata    59315     0          Y       1526
>> Brick glusterpub3:/gluster/md3/workdata    56917     0          Y       1952
>> Brick glusterpub1:/gluster/md4/workdata    59688     0          Y       3755
>> Brick glusterpub2:/gluster/md4/workdata    60271     0          Y       2271
>> Brick glusterpub3:/gluster/md4/workdata    49461     0          Y       2399
>> Brick glusterpub1:/gluster/md5/workdata    54651     0          Y       4208
>> Brick glusterpub2:/gluster/md5/workdata    49685     0          Y       2751
>> Brick glusterpub3:/gluster/md5/workdata    59202     0          Y       2803
>> Brick glusterpub1:/gluster/md6/workdata    55829     0          Y       4583
>> Brick glusterpub2:/gluster/md6/workdata    50455     0          Y       3296
>> Brick glusterpub3:/gluster/md6/workdata    50262     0          Y       3237
>> Brick glusterpub1:/gluster/md7/workdata    52238     0          Y       5014
>> Brick glusterpub2:/gluster/md7/workdata    52474     0          Y       3673
>> Brick glusterpub3:/gluster/md7/workdata    57966     0          Y       3653
>> Self-heal Daemon on localhost              N/A       N/A        Y       4141
>> Self-heal Daemon on glusterpub1            N/A       N/A        Y       5570
>> Self-heal Daemon on glusterpub2            N/A       N/A        Y       4139
>>
>> "gluster volume heal workdata info" lists a lot of files per
brick.
>> "gluster volume heal workdata statistics heal-count" shows
thousands
>> of files per brick.
>> "gluster volume heal workdata enable" has no effect.
>>
>> gluster volume heal workdata full
>> Launching heal operation to perform full self heal on volume workdata
>> has been successful
>> Use heal info commands to check status.
>>
>> -> not doing anything at all. And nothing is happening on the two "good"
>> servers in e.g. glustershd.log. Heal was working as expected on
>> version 10.4, but here... silence. Does someone have an idea?
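>>
>> Next things I plan to try (a sketch; "start force" reportedly also
>> respawns the self-heal daemon without touching the running bricks):
>>
>> gluster volume get workdata cluster.self-heal-daemon
>> gluster volume start workdata force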
>>
>>
>> Best regards,
>> Hubert
>>
>> On Tue, 16 Jan 2024 at 13:44, Gilberto Ferreira
>> <gilberto.nunes32 at gmail.com> wrote:
>>>
>>> Ah! Indeed! You need to perform an upgrade in the clients as well.
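>>>
>>> Roughly like this (a sketch - check max-op-version first; 110000 below
>>> is just the value matching release 11, not something I've verified on
>>> your cluster):
>>>
>>> gluster volume get all cluster.op-version
>>> gluster volume get all cluster.max-op-version
>>> gluster volume set all cluster.op-version 110000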
>>>
>>> On Tue, 16 Jan 2024 at 03:12, Hu Bert <revirii at googlemail.com> wrote:
>>>>
>>>> morning to those still reading :-)
>>>>
>>>> I found this:
>>>> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>>>>
>>>> there's a paragraph about "peer rejected" with the same error message,
>>>> telling me: "Update the cluster.op-version" - I had only updated the
>>>> server nodes, but not the clients, so upgrading the cluster.op-version
>>>> wasn't possible at that time. So... upgrading the clients to version
>>>> 11.1 and then the op-version should solve the problem?
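>>>>
>>>> In the meantime, to see which clients are still connecting with the
>>>> old version (a sketch - I'm not sure every release prints the
>>>> op-version in this output):
>>>>
>>>> gluster volume status workdata clients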
>>>>
>>>>
>>>> Thx,
>>>> Hubert
>>>>
>>>> On Mon, 15 Jan 2024 at 09:16, Hu Bert <revirii at googlemail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> just upgraded some Gluster servers from version 10.4 to version 11.1
>>>>> (Debian bullseye & bookworm). When only installing the packages: good,
>>>>> servers, volumes etc. work as expected.
>>>>>
>>>>> But one needs to test whether the systems still work after a daemon
>>>>> and/or server restart. Well, I did a reboot, and after that the
>>>>> rebooted/restarted system is "out". Log message from a working node:
>>>>>
>>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
>>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
>>>>> 0-management: using the op-version 100000
>>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
>>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
>>>>> 0-glusterd: Received probe from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e
>>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
>>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
>>>>> Version of Cksums sourceimages differ. local cksum = 2204642525,
>>>>> remote cksum = 1931483801 on peer gluster190
>>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
>>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
>>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
>>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
>>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
>>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
>>>>> gluster190, port: 0
>>>>>
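>>>>> The cksum that glusterd compares there is kept on disk, so it can be
>>>>> checked by hand (assuming the usual /var/lib/glusterd layout; run on
>>>>> every node and compare - the values must match):
>>>>>
>>>>> cat /var/lib/glusterd/vols/sourceimages/cksum
>>>>>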
>>>>> peer status from rebooted node:
>>>>>
>>>>> root at gluster190 ~ # gluster peer status
>>>>> Number of Peers: 2
>>>>>
>>>>> Hostname: gluster189
>>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> Hostname: gluster188
>>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> So the rebooted gluster190 is not accepted anymore, and thus does not
>>>>> appear in "gluster volume status". I then followed this guide:
>>>>>
>>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>>>>>
>>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
>>>>> restart the glusterd service etc. Data gets copied from the other nodes,
>>>>> 'gluster peer status' is ok again - but the volume info is missing:
>>>>> /var/lib/glusterd/vols is empty. When syncing this dir from another
>>>>> node, the volume is available again and heals start etc.
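>>>>>
>>>>> Condensed, the steps were roughly (a sketch of what I did, following
>>>>> the guide above - gluster189 being one of the healthy peers):
>>>>>
>>>>> systemctl stop glusterd
>>>>> # wipe the local state but keep the node's identity
>>>>> find /var/lib/glusterd -mindepth 1 -maxdepth 1 \
>>>>>     ! -name glusterd.info -exec rm -rf {} +
>>>>> systemctl start glusterd   # peers sync, 'gluster peer status' ok again
>>>>> # vols/ stayed empty, so copy it from a good node and restart:
>>>>> rsync -a gluster189:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/
>>>>> systemctl restart glusterd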
>>>>>
>>>>> Well, and just to be sure that everything's working as it should, I
>>>>> rebooted that node again - the rebooted node is kicked out again, and
>>>>> you have to go through bringing it back all over again.
>>>>>
>>>>> Sorry, but did I miss anything? Has someone experienced similar
>>>>> problems? I'll probably downgrade to 10.4 again; that version was
>>>>> working...
>>>>>
>>>>>
>>>>> Thx,
>>>>> Hubert
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786